
This project forked from daremon/urlclustering


Package to facilitate URL clustering

License: MIT License


urlclustering

This package facilitates the clustering of similar URLs of a website.

Live demo: http://urlclustering.com

General information

You give a (preferably long and complete) list of URLs as input, e.g.:

urls = [
    'http://example.com',
    'http://example.com/about',
    'http://example.com/contact',

    'http://example.com/cat/sports',
    'http://example.com/cat/tech',
    'http://example.com/cat/life',
    'http://example.com/cat/politics',

    'http://example.com/tag/623/tag1',
    'http://example.com/tag/335/tag2',
    'http://example.com/tag/671/tag3',

    'http://example.com/article/?id=1',
    'http://example.com/article/?id=2',
    'http://example.com/article/?id=3',
]

You get a list of clusters as a result. For each cluster you get:

  • a REGEX that matches all cluster URLs
  • a HUMAN readable string representing the cluster
  • a list with all matched cluster URLs

So for our example the result is:

REGEX: http://example.com/cat/([^/]+)
HUMAN: http://example.com/cat/[...]
URLS:
    http://example.com/cat/sports
    http://example.com/cat/tech
    http://example.com/cat/life
    http://example.com/cat/politics

REGEX: http://example.com/tag/(\d+)/([^/]+)
HUMAN: http://example.com/tag/[NUMBER]/[...]
URLS:
    http://example.com/tag/623/tag1
    http://example.com/tag/335/tag2
    http://example.com/tag/671/tag3

REGEX: http://example.com/article/?\?id=(\d+)
HUMAN: http://example.com/article?id=[NUMBER]
URLS:
    http://example.com/article/?id=1
    http://example.com/article/?id=2
    http://example.com/article/?id=3

UNCLUSTERED URLS:
    http://example.com
    http://example.com/about
    http://example.com/contact
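
The relationship between each REGEX and its URLs can be checked with Python's standard `re` module. The snippet below is only a sketch: it copies the regexes and URLs from the result above, and the helper name `matches_all` is made up for illustration and is not part of this package.

```python
import re

# Regexes and URLs copied from the example result above.
clusters = {
    r"http://example.com/cat/([^/]+)": [
        "http://example.com/cat/sports",
        "http://example.com/cat/tech",
        "http://example.com/cat/life",
        "http://example.com/cat/politics",
    ],
    r"http://example.com/tag/(\d+)/([^/]+)": [
        "http://example.com/tag/623/tag1",
        "http://example.com/tag/335/tag2",
        "http://example.com/tag/671/tag3",
    ],
    r"http://example.com/article/?\?id=(\d+)": [
        "http://example.com/article/?id=1",
        "http://example.com/article/?id=2",
        "http://example.com/article/?id=3",
    ],
}
unclustered = [
    "http://example.com",
    "http://example.com/about",
    "http://example.com/contact",
]


def matches_all(regex, urls):
    """Return True if the regex matches every URL in full (hypothetical helper)."""
    pattern = re.compile(regex)
    return all(pattern.fullmatch(url) for url in urls)


for regex, urls in clusters.items():
    print(regex, matches_all(regex, urls))
```

The capture groups also give access to the variable parts of a URL, for example the tag number and tag name in the second cluster.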

When to use

This is most useful for website analysis tools that report findings to the user. For example, a service that crawls your website and reports page loading times may find that 10,000 pages take more than 2 seconds to load. Instead of listing all 10,000 URLs, it is better to cluster them, so that the end user sees something like:

Slow pages (>2 secs):
- http://example.com/                             (1 URL)
- http://example.com/sitemap                      (1 URL)
- http://example.com/search?q=[...]               (578 URLs)
- http://example.com/tags?tag1=[...]&tag2=[...]   (409 URLs)
- http://example.com/article?id=[NUMBER]          (7209 URLs)
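
A report like the one above could be produced with a few lines of Python. This is a minimal sketch under assumptions: the function name `summarize` and the input shape (a dict mapping each HUMAN string to its list of URLs) are hypothetical and are not this package's API.

```python
def summarize(title, clusters):
    """Render one line per cluster: the human-readable string and its URL count.

    `clusters` is assumed to map a HUMAN string to the list of URLs it covers.
    """
    lines = [title]
    for human, urls in clusters.items():
        count = len(urls)
        noun = "URL" if count == 1 else "URLs"
        lines.append(f"- {human} ({count} {noun})")
    return "\n".join(lines)


report = summarize(
    "Slow pages (>2 secs):",
    {
        "http://example.com/": ["http://example.com/"],
        "http://example.com/article?id=[NUMBER]": [
            "http://example.com/article?id=1",
            "http://example.com/article?id=2",
        ],
    },
)
print(report)
```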

How it works

URLs are grouped by domain. Only same domain URLs are clustered.
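
Grouping by domain can be sketched with the standard library as shown below. The function name `group_by_domain` is hypothetical, and treating the lowercased network location (host and optional port) as the "domain" is an assumption; the package may normalize domains differently.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Group URLs by their lowercased network location (assumed domain key)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc.lower()].append(url)
    return dict(groups)


groups = group_by_domain([
    "http://example.com/about",
    "http://other.org/about",
    "http://example.com/contact",
])
print(groups)
```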

URLs are then grouped by a signature, which is the number of path elements and the number of query string parameters and values the URL has.

Examples:
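
One way to compute such a signature is sketched below. The exact counting rules are an assumption: here the signature is a pair (number of non-empty path elements, number of query string name/value pairs), and a bare query such as `?123` counts as one pair. The function name `signature` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlparse


def signature(url):
    """Return (path element count, query string pair count) for a URL."""
    parsed = urlparse(url)
    # Ignore empty segments so that a trailing slash does not add an element.
    path_elements = [part for part in parsed.path.split("/") if part]
    # keep_blank_values=True keeps value-less parameters such as "?123".
    query_pairs = parse_qsl(parsed.query, keep_blank_values=True)
    return len(path_elements), len(query_pairs)


by_signature = defaultdict(list)
for url in [
    "http://example.com/about",
    "http://example.com/cat/sports",
    "http://example.com/cat/tech",
    "http://example.com/tag/623/tag1",
    "http://example.com/article/?id=1",
]:
    by_signature[signature(url)].append(url)

for sig, urls in sorted(by_signature.items()):
    print(sig, urls)
```

Under these assumed rules, the two `/cat/...` URLs share a signature and are therefore candidates for the same cluster, while `/about` and `/tag/623/tag1` fall into different groups.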

URLs with the same signature are inserted in a tree structure. For each part (path element or QS parameter or QS value) two nodes are created:

  • One with the verbatim part.
  • One with the reduced part i.e. a regex that could replace the part.

Leaf nodes hold the number of URLs that match and the number of reductions.

E.g. inserting URL http://ex.com/article?123 will create 2 top nodes:

root 1: `article`
root 2: `[^/]+`

And each top node will have two children:

child 1: `123`
child 2: `\d+`

Inserting 3 URLs of the form /article/[0-9]+ would lead to a tree like this:

       `article`                        `[^/]+`
  /    /      \     \             /    /      \     \
`123`  `456`  `789`  `\d+`      `123`  `456`  `789`  `\d+`
1 URL  1 URL  1 URL  3 URLs     1 URL  1 URL  1 URL  3 URLs
0 re   0 re   0 re   1 re       1 re   1 re   1 re   2  re
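
The insertion step can be sketched in Python as follows. This is an illustration, not the package's code: the `Node` class, the `insert` and `leaves` functions, and the reduction rule (digits become `\d+`, anything else becomes `[^/]+`) are assumptions chosen to reproduce the tree drawn above.

```python
import re


class Node:
    """A tree node; leaves carry the URL count and the number of reductions."""

    def __init__(self):
        self.children = {}   # (text, is_reduced) -> Node
        self.urls = 0        # number of URLs that pass through this node
        self.reductions = 0  # reduced parts on the path from the root


def reduce_part(part):
    # Assumed rule: all-digit parts reduce to \d+, anything else to [^/]+.
    return r"\d+" if re.fullmatch(r"\d+", part) else r"[^/]+"


def insert(node, parts, reductions=0):
    """Insert one URL, given as a list of parts, below `node`."""
    node.urls += 1
    node.reductions = reductions
    if not parts:
        return
    head, rest = parts[0], parts[1:]
    # Two children per part: the verbatim part and its reduced form.
    for key, cost in (((head, False), 0), ((reduce_part(head), True), 1)):
        child = node.children.setdefault(key, Node())
        insert(child, rest, reductions + cost)


def leaves(node, path=()):
    """Yield (path, url_count, reductions) for every leaf below `node`."""
    if not node.children:
        yield path, node.urls, node.reductions
    for (text, _is_reduced), child in node.children.items():
        yield from leaves(child, path + (text,))


root = Node()
for number in ("123", "456", "789"):
    insert(root, ["article", number])

result = {path: (urls, red) for path, urls, red in leaves(root)}
for path, (urls, red) in result.items():
    print(" -> ".join(path), f"{urls} URL(s), {red} reduction(s)")
```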

The final step is to choose the best leaves. In this case article -> \d+ is best because it matches all 3 URLs with only 1 reduction, so the cluster returned is http://ex.com/article/[NUMBER]
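
A simplified version of this selection is sketched below, assuming that "best" means the leaf that covers the most URLs, with ties broken by the fewest reductions. The package may apply further criteria (for example, returning several clusters), so treat the names `best_leaf` and `human` and the scoring rule as hypothetical.

```python
# Leaves from the example tree: (path, number of URLs, number of reductions).
example_leaves = [
    (("article", "123"), 1, 0),
    (("article", "456"), 1, 0),
    (("article", "789"), 1, 0),
    (("article", r"\d+"), 3, 1),
    ((r"[^/]+", "123"), 1, 1),
    ((r"[^/]+", "456"), 1, 1),
    ((r"[^/]+", "789"), 1, 1),
    ((r"[^/]+", r"\d+"), 3, 2),
]

# Human-readable placeholders, as used in the HUMAN strings of the results.
PLACEHOLDERS = {r"\d+": "[NUMBER]", r"[^/]+": "[...]"}


def best_leaf(candidates):
    """Pick the leaf with the most URLs, then the fewest reductions (assumed rule)."""
    return max(candidates, key=lambda leaf: (leaf[1], -leaf[2]))


def human(base, path):
    """Build the human-readable cluster string for a leaf path."""
    return base + "/" + "/".join(PLACEHOLDERS.get(part, part) for part in path)


path, url_count, reductions = best_leaf(example_leaves)
print(human("http://ex.com", path), url_count, reductions)
```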

License

Copyright (c) 2015 Dimitris Giannitsaros.

Licensed under the MIT License.
