
centipede's People

Contributors

kix, msvrtan, ninir, pedrotroller, umpirsky

centipede's Issues

Crash when parsing a link to a phone number

When parsing my website with centipede run http://auxptitsplaisirs.lo/, I get the following exception:

[InvalidArgumentException]                                                     
  Unable to parse malformed url: http://auxptitsplaisirs.lotel:+41 21 907 26 26

This happens because it fails to parse the following link:

<a href="tel:+41 21 907 26 26">+41 21 907 26 26</a>
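
A hedged sketch of a guard that would avoid this crash: skip hrefs whose scheme is not crawlable before resolving them against the base URL. The helper name is hypothetical, not centipede's actual API.

<?php
// Hypothetical guard: only resolve hrefs that are relative or use an
// http(s) scheme; tel:, mailto:, javascript: etc. are skipped.
function isCrawlable($href)
{
    $scheme = parse_url($href, PHP_URL_SCHEME);

    if (false === $scheme) {
        return false; // unparseable, skip it
    }

    // Relative links have no scheme and are fine to resolve.
    return null === $scheme || in_array($scheme, ['http', 'https'], true);
}

var_dump(isCrawlable('tel:+41 21 907 26 26')); // bool(false)
var_dump(isCrawlable('/contact'));             // bool(true)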

ca_FR not in French

Hi,
The ca_FR locale is not in French but a mix of Italian (or something else) and other languages.

Configuration through yml config file

centipede.yml:

rules:
    foo:
        url: https://github.com/foo
        status: 200
        text: Foo
    bar:
        url: https://github.com/bar
        status: 200
        text: Bar

ignore:
    - https://github.com/ignore
    - https://github.com/ignored
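
A sketch of how such a file could be consumed, assuming the symfony/yaml component (not necessarily a centipede dependency); the keys follow the example above.

<?php
use Symfony\Component\Yaml\Yaml;

// Parse the proposed centipede.yml into a plain array.
$config = Yaml::parse(file_get_contents('centipede.yml'));

// Each rule asserts a status code and an expected text on a URL.
foreach ($config['rules'] as $name => $rule) {
    printf(
        "rule %s: expect %d and text \"%s\" at %s\n",
        $name,
        $rule['status'],
        $rule['text'],
        $rule['url']
    );
}

// URLs listed under "ignore" would simply be skipped by the crawler.
$ignored = $config['ignore'];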

HTTP errors while using `depth` argument

Currently, HTTP errors are supported, but when they occur at depth they are always raised as exceptions, breaking the script. It seems these should be caught and reported like regular failed crawls.

Here is an example of use:

$ centipede run http://example.com/
200 http://example.com/page
404 http://example.com/non_existing

If I use the depth argument, once the crawler reaches a page at a depth different from the root one, errors are no longer handled.

In the above example, suppose there is a link on the /page page pointing to /non_existing; the result will be similar to this:

$ centipede run http://example.com/ 1
200 http://example.com/page

  [GuzzleHttp\Exception\ClientException]
  Client error response [url] http://example.com/non_existing [status code] 404 [reason phrase] Not Found

This breaks the script, and the crawler obviously cannot continue navigating.
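
One way to get the behaviour described here, sketched around Guzzle's exception hierarchy (the crawl loop itself is hypothetical): catch RequestException and report the status instead of letting it bubble up.

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

foreach ($urls as $url) { // $urls: whatever the crawler discovered
    try {
        $status = $client->get($url)->getStatusCode();
    } catch (RequestException $e) {
        // A 404 raises ClientException, a RequestException subclass;
        // recover the response instead of letting the script die.
        $response = $e->getResponse();
        $status   = $response ? $response->getStatusCode() : 0;
    }

    printf("%d %s\n", $status, $url);
}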

Crawler not catching all links?

For example, the script won't find any child links here:

./vendor/bin/centipede run http://www.altevents.darbai.webas.lt/
200 http://www.altevents.darbai.webas.lt/

Why is that? There are more links on that page, but it only checks the main page.

I have checked the UrlExtractor class: the extract() method receives valid HTML,
but DOMDocument does not extract the links. Maybe we should use symfony/dom-crawler?

Thanks!
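
For what it's worth, a minimal stand-in for what UrlExtractor::extract() might do with symfony/dom-crawler, as suggested above; this is a sketch, not the actual class.

<?php
use Symfony\Component\DomCrawler\Crawler;

// Minimal link extraction via symfony/dom-crawler.
function extractHrefs($html)
{
    $crawler = new Crawler($html);

    // filterXPath() avoids the extra symfony/css-selector dependency;
    // extract() pulls the given attribute from every matched node.
    $hrefs = $crawler->filterXPath('//a')->extract(['href']);

    return array_values(array_unique(array_filter($hrefs)));
}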

[RFC] Scenario plugin system

Hi.

Just a thought: why not provide a system that allows you to describe scenarios to execute before crawling the URLs?

For example:

// MyProject/centipede.php
$scenarios = new Centipede\Scenario\Collection();

$scenarios->add('customer_account', function ($browser) {
    $browser->visit('/login');
    $browser->fillIn('login', '[email protected]');
    $browser->fillIn('password', 'strong_password');
    $browser->press('Connection');
    $browser->visit('/account');
});

$scenarios->add('backoffice', function ($browser) {
    $browser->visit('/admin/login');
    $browser->fillIn('seller[uuid]', '1122334455');
    $browser->fillIn('seller[password]', 'strong_password');
    $browser->press('Access to dashboard');
    $browser->visit('/admin/dashboard');
});

$scenarios->add('public', function ($browser) {
    $browser->visit('/');
});

return $scenarios;

The crawler would then execute the scenarios one by one, crawling the URLs after each one.
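
The Collection class referenced above does not exist yet; a minimal sketch of what it might look like:

<?php
namespace Centipede\Scenario;

// Hypothetical sketch of the proposed Collection; nothing like this
// exists in centipede today.
class Collection implements \IteratorAggregate
{
    /** @var callable[] indexed by scenario name */
    private $scenarios = [];

    public function add($name, callable $scenario)
    {
        $this->scenarios[$name] = $scenario;
    }

    public function getIterator(): \ArrayIterator
    {
        return new \ArrayIterator($this->scenarios);
    }
}

// The runner would then do something like:
//
//     foreach ($scenarios as $name => $scenario) {
//         $scenario($browser); // prepare session state
//         // ...then crawl as usual...
//     }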

It should be installable.

$ composer global require umpirsky/centipede:0.1.*@dev

fails with the following error:

Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - umpirsky/centipede-crawler 0.1.0 requires react/promise dev-master@dev -> no matching package found.
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - umpirsky/centipede-crawler 0.1.1 requires guzzlehttp/guzzle ~5.3 -> satisfiable by guzzlehttp/guzzle[5.3.0].
    - umpirsky/centipede 0.1.x-dev requires umpirsky/centipede-crawler ~0.1 -> satisfiable by umpirsky/centipede-crawler[0.1.0, 0.1.1].
    - Installation request for umpirsky/centipede 0.1.*@dev -> satisfiable by umpirsky/centipede[0.1.x-dev].
    - Installation request for guzzlehttp/guzzle == 4.1.3.0 -> satisfiable by guzzlehttp/guzzle[4.1.3].

Although this is related to centipede-crawler, I decided to open the issue here since the whole project is not installable.

Should support relative link targets

Centipede works fine when running on a root path like http://example.com/; however, it does not correctly resolve relative href targets when running on a sub-path like http://example.com/demo.

Some test cases (incomplete):

/demo/ + a => /demo/a
/demo/ + /a => /a
/demo/ + ../a => /a
/demo + a => /a
/demo + /a => /a
/demo + ../a => /a
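
These cases follow RFC 3986 reference resolution. A quick check of them, assuming the guzzlehttp/psr7 package (just one option, not necessarily what centipede should depend on):

<?php
use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

$cases = [
    // base      relative  expected
    ['/demo/',   'a',      '/demo/a'],
    ['/demo/',   '/a',     '/a'],
    ['/demo/',   '../a',   '/a'],
    ['/demo',    'a',      '/a'],
    ['/demo',    '/a',     '/a'],
    ['/demo',    '../a',   '/a'],
];

foreach ($cases as [$base, $rel, $expected]) {
    $resolved = UriResolver::resolve(
        new Uri('http://example.com' . $base),
        new Uri($rel)
    );
    printf("%-7s + %-4s => %-7s (expected %s)\n",
        $base, $rel, $resolved->getPath(), $expected);
}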

Phar support?

Do you plan to provide a phar archive? It could be interesting to use it globally instead of requiring centipede in each project ;)

Is there a way to see which page was the parent of a URL we are crawling?

Hello,

For example, I do:
$urls = (new Centipede\Crawler('domain.com'))->crawl();

and then try to fetch all the pages with Guzzle to check their statuses, etc.

But then I see, for example, that the URL '/en/about-us' is not accessible (gives a 404). Is there a way to check on which parent URLs this '/en/about-us' was found? The link could be buried inside a single article or similar, so it would be hard to hunt down.

Thank you!
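
A workaround sketch while that feature doesn't exist: crawl manually and record a parent map. Here, fetch() and extractHrefs() are placeholders (not centipede API), and the sketch assumes absolute hrefs.

<?php
// Record which page each URL was first discovered on.
$queue   = [['http://domain.com/', null]]; // [url, parent] pairs
$parents = [];

while ($queue) {
    [$url, $parent] = array_shift($queue);

    if (array_key_exists($url, $parents)) {
        continue; // already seen, keep the first parent
    }
    $parents[$url] = $parent;

    // fetch() and extractHrefs() are placeholders, not centipede API.
    foreach (extractHrefs(fetch($url)) as $href) {
        $queue[] = [$href, $url];
    }
}

// Later: where was the broken page linked from?
echo $parents['http://domain.com/en/about-us'];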

Support for ignore urls

A simple URL ignore list in the centipede.yml config:

ignore:
    - example.com/foo
    - example.com/bar
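
A sketch of the check itself, assuming entries are scheme-less exact matches as in the example (the matching semantics would still need deciding):

<?php
// Hypothetical ignore check: exact match on host + path, ignoring
// the scheme and any trailing slash. Prefix/wildcard matching would
// be a separate decision.
function isIgnored($url, array $ignoreList)
{
    $normalized = preg_replace('#^https?://#', '', rtrim($url, '/'));

    return in_array($normalized, $ignoreList, true);
}

$ignore = ['example.com/foo', 'example.com/bar'];

var_dump(isIgnored('http://example.com/foo', $ignore)); // bool(true)
var_dump(isIgnored('http://example.com/baz', $ignore)); // bool(false)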

Hookable/eventable crawler behavior

Maybe this should be opened against the crawler component, but this looks like an application-wide feature.

We could trigger application-wide events during the crawler's lifecycle: when a request is initiated, when a URL is discovered, when a response is retrieved, and so on.
This would allow us to implement:

  • Clever and clean output handling (pretty much in @everzet's Behat way, though I'm not suggesting using decorators)
  • Passing request/response data to pluggable event listeners. Imagine we're getting a 500 from a Symfony2 app and it is caught by an event handler that sends an email to the developer, with a profiler link attached.
  • Passing the whole list of URLs discovered during a run to some kind of listener, to make sure we don't return 404s for pages that are linked from Google search

And I'm sure there are more use cases.
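
A minimal sketch of those hooks with a hand-rolled emitter, to stay neutral on which event library to use; the event names and payloads are made up:

<?php
// Minimal listener registry; a real implementation might use an
// existing event dispatcher instead.
class CrawlEvents
{
    /** @var array<string, callable[]> */
    private $listeners = [];

    public function on($event, callable $listener)
    {
        $this->listeners[$event][] = $listener;
    }

    public function emit($event, array $payload = [])
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

$events = new CrawlEvents();

// e.g. notify the developer on a 500, as suggested above.
$events->on('response.received', function (array $p) {
    if ($p['status'] >= 500) {
        // mail($developer, 'Crawl hit a 500', $p['url']);
    }
});

// The crawler would then call, at the appropriate lifecycle points:
// $events->emit('response.received', ['url' => $url, 'status' => $status]);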

How to install correctly?

Hello,

What would be the correct way to install it?
Somehow a simple composer require umpirsky/centipede:dev-master did not work.

I needed to specify the full repository paths. composer.json example:

{
    "require": {
        "react/promise": "dev-master",
        "umpirsky/centipede": "dev-master",
        "umpirsky/centipede-crawler": "dev-master"
    },

    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/umpirsky/centipede"
        },
        {
            "type": "vcs",
            "url": "https://github.com/umpirsky/centipede-crawler"
        }
    ]
}

Better support for depth argument

Currently, when a website contains a link back to its root URL (the "brand link", e.g. the site logo), crawling with the depth argument re-crawls that link every time it is encountered.

I'd recommend adding something that improves the readability of the explored URLs:

  1. First, add something to the crawler that stores all crawled URLs. A simple array of url => true pairs and a few isset() checks would do (see the sketch at the end of this issue).
  2. Add some kind of "level" information to the logs.

For example, here is the output of this command:

$ centipede run https://github.com 3
200 https://github.com
200 https://github.com/join
200 https://github.com/join
200 https://github.com/join
200 https://github.com/login?return_to=%2Fjoin
200 https://github.com/explore
200 https://github.com/features
200 https://github.com/blog
200 https://github.com/about
200 https://github.com/site/terms
200 https://github.com/site/privacy
200 https://github.com/security
200 https://github.com/contact
200 https://github.com/login
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/password_reset
200 https://github.com/plans
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/signup
200 https://github.com/early_access/large_file_storage?utm_source=github_site&utm_medium=pricing_signup_link&utm_campaign=gitlfs
200 https://github.com/login?return_to=%2Fearly_access%2Fgit-lfs
200 https://github.com/join?return_to=%2Fearly_access%2Fgit-lfs
200 mailto:[email protected]
200 https://github.com/plans
200 https://github.com/integrations

Here, I don't know where the different URLs were "clicked from". Maybe I could guess, maybe not. And /join is seen many times, which is not quite correct...

Edit (for this part): In fact, a structure I'd like to see might be something like this:

$ centipede run http://example.com/
200 http://example.com/about
200 > http://example.com/about/team
200 > http://example.com/about/commercial
200 http://example.com/products
200 > http://example.com/products/specials
200 > > http://example.com/products/exclusive

Etc.
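
A sketch covering both points from the list above: a url => true visited map, plus a "> " prefix per depth level as in the desired output; fetchStatus() and extractHrefs() are placeholders, not centipede internals.

<?php
// Sketch of points 1 and 2 above.
function crawl($url, $maxDepth, array &$visited = [], $depth = 0)
{
    if ($depth > $maxDepth || isset($visited[$url])) {
        return; // the url => true map from point 1
    }
    $visited[$url] = true;

    // fetchStatus() is a placeholder for the actual HTTP request.
    printf("%d %s%s\n", fetchStatus($url), str_repeat('> ', $depth), $url);

    foreach (extractHrefs($url) as $href) { // placeholder as well
        crawl($href, $maxDepth, $visited, $depth + 1);
    }
}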

Fails with "PHP Catchable fatal error"

The current master (bfe44f6) does not work and quits with the following error. I didn't dig into this any further, but it looks like one of the dependencies received an incompatible update:

$ php bin/centipede run http://10.52.18.222/
PHP Catchable fatal error:  Argument 2 passed to Centipede\Console\Command\Run::Centipede\Console\Command\{closure}() must be an instance of GuzzleHttp\Message\FutureResponse, instance of Symfony\Component\BrowserKit\Response given, called in /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php on line 64 and defined in /tmp/phar-composer8/src/Centipede/Console/Command/Run.php on line 30
PHP Stack trace:
PHP   1. {main}() /tmp/phar-composer8/bin/centipede:0
PHP   2. Symfony\Component\Console\Application->run() /tmp/phar-composer8/bin/centipede:19
PHP   3. Symfony\Component\Console\Application->doRun() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:126
PHP   4. Symfony\Component\Console\Application->doRunCommand() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:195
PHP   5. Symfony\Component\Console\Command\Command->run() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:874
PHP   6. Centipede\Console\Command\Run->execute() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Command/Command.php:252
PHP   7. Centipede\Crawler->crawl() /tmp/phar-composer8/src/Centipede/Console/Command/Run.php:44
PHP   8. Centipede\Crawler->request() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:27
PHP   9. Centipede\Console\Command\Run->Centipede\Console\Command\{closure}() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:64
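
The trace suggests the closure in Run.php type-hints Guzzle's FutureResponse while the crawler now passes a BrowserKit Response. A guess at a fix, since the real closure body isn't shown in the trace: loosen the hint and duck-type the status call.

// Hedged sketch of a change in src/Centipede/Console/Command/Run.php;
// the actual closure body is not visible here, so its shape is a guess.
$callback = function ($url, $response) use ($output) {
    // BrowserKit's Response and Guzzle's FutureResponse expose the
    // status code through different methods, so branch on what exists.
    $status = method_exists($response, 'getStatusCode')
        ? $response->getStatusCode()
        : $response->getStatus();

    $output->writeln(sprintf('%d %s', $status, $url));
};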
