
centipede's People

Contributors

kix, msvrtan, ninir, pedrotroller, umpirsky

centipede's Issues

Crash when parsing a link to a phone number

When parsing my website with centipede run http://auxptitsplaisirs.lo/, I get the following exception:

[InvalidArgumentException]                                                     
  Unable to parse malformed url: http://auxptitsplaisirs.lotel:+41 21 907 26 26

This happens because it fails to parse the following link:

<a href="tel:+41 21 907 26 26">+41 21 907 26 26</a>
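
A hedged sketch of a guard that would avoid this crash: skip hrefs whose scheme is not crawlable before resolving them against the base URL. The helper name is hypothetical, not centipede's actual API.

<?php
// Hypothetical guard: only resolve hrefs that are relative or use an
// http(s) scheme; tel:, mailto:, javascript: etc. are skipped.
function isCrawlable($href)
{
    $scheme = parse_url($href, PHP_URL_SCHEME);

    if (false === $scheme) {
        return false; // unparseable, skip it
    }

    // Relative links have no scheme and are fine to resolve.
    return null === $scheme || in_array($scheme, ['http', 'https'], true);
}

var_dump(isCrawlable('tel:+41 21 907 26 26')); // bool(false)
var_dump(isCrawlable('/contact'));             // bool(true)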

ca_FR not in French

Hi,
The ca_FR locale is not in French but a mix of Italian (or something else) and other languages.

Configuration through yml config file

centipede.yml:

rules:
    foo:
        url: https://github.com/foo
        status: 200
        text: Foo
    bar:
        url: https://github.com/bar
        status: 200
        text: Bar

ignore:
    - https://github.com/ignore
    - https://github.com/ignored
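
A sketch of how such a file could be consumed, assuming the symfony/yaml component (not necessarily a centipede dependency); the keys follow the example above.

<?php
use Symfony\Component\Yaml\Yaml;

// Parse the proposed centipede.yml into a plain array.
$config = Yaml::parse(file_get_contents('centipede.yml'));

// Each rule asserts a status code and an expected text on a URL.
foreach ($config['rules'] as $name => $rule) {
    printf(
        "rule %s: expect %d and text \"%s\" at %s\n",
        $name,
        $rule['status'],
        $rule['text'],
        $rule['url']
    );
}

// URLs listed under "ignore" would simply be skipped by the crawler.
$ignored = $config['ignore'];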

HTTP errors while using `depth` argument

Currently, HTTP errors are supported, but when they occur at depth they are always raised as exceptions, breaking the script. It seems these should be caught and reported like regular failed crawls.

Here is an example of use:

$ centipede run http://example.com/
200 http://example.com/page
404 http://example.com/non_existing

If I use the depth argument, once the crawler reaches a page at a depth different from the root one, errors are no longer handled.

In the above example, suppose there is a link on the /page page pointing to /non_existing; the result will be similar to this:

$ centipede run http://example.com/ 1
200 http://example.com/page

  [GuzzleHttp\Exception\ClientException]
  Client error response [url] http://example.com/non_existing [status code] 404 [reason phrase] Not Found

This breaks the script, and the crawler obviously cannot continue navigating.
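
One way to get the behaviour described here, sketched around Guzzle's exception hierarchy (the crawl loop itself is hypothetical): catch RequestException and report the status instead of letting it bubble up.

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

foreach ($urls as $url) { // $urls: whatever the crawler discovered
    try {
        $status = $client->get($url)->getStatusCode();
    } catch (RequestException $e) {
        // A 404 raises ClientException, a RequestException subclass;
        // recover the response instead of letting the script die.
        $response = $e->getResponse();
        $status   = $response ? $response->getStatusCode() : 0;
    }

    printf("%d %s\n", $status, $url);
}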

Crawler not catching all links?

For example, the script won't find any child links here:

./vendor/bin/centipede run http://www.altevents.darbai.webas.lt/
200 http://www.altevents.darbai.webas.lt/

Why is that? There are more links on that page, but it only checks the main page.

I have checked the UrlExtractor class: the extract() method receives valid HTML,
but DOMDocument does not extract the links. Maybe we should use symfony/dom-crawler?

Thanks!
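
For what it's worth, a minimal stand-in for what UrlExtractor::extract() might do with symfony/dom-crawler, as suggested above; this is a sketch, not the actual class.

<?php
use Symfony\Component\DomCrawler\Crawler;

// Minimal link extraction via symfony/dom-crawler.
function extractHrefs($html)
{
    $crawler = new Crawler($html);

    // filterXPath() avoids the extra symfony/css-selector dependency;
    // extract() pulls the given attribute from every matched node.
    $hrefs = $crawler->filterXPath('//a')->extract(['href']);

    return array_values(array_unique(array_filter($hrefs)));
}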

[RFC] Scenario plugin system

Hi.

Just a thought: why not provide a system that allows you to describe scenarios to execute before crawling the URLs?

For example:

// MyProject/centipede.php
$scenarios = new Centipede\Scenario\Collection();

$scenarios->add('customer_account', function ($browser) {
    $browser->visit('/login');
    $browser->fillIn('login', '[email protected]');
    $browser->fillIn('password', 'strong_password');
    $browser->press('Connection');
    $browser->visit('/account');
});

$scenarios->add('backoffice', function ($browser) {
    $browser->visit('/admin/login');
    $browser->fillIn('seller[uuid]', '1122334455');
    $browser->fillIn('seller[password]', 'strong_password');
    $browser->press('Access to dashboard');
    $browser->visit('/admin/dashboard');
});

$scenarios->add('public', function ($browser) {
    $browser->visit('/');
});

return $scenarios;

The crawler would then execute the scenarios one by one, crawling the URLs after each one.
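
The Collection class referenced above does not exist yet; a minimal sketch of what it might look like:

<?php
namespace Centipede\Scenario;

// Hypothetical sketch of the proposed Collection; nothing like this
// exists in centipede today.
class Collection implements \IteratorAggregate
{
    /** @var callable[] indexed by scenario name */
    private $scenarios = [];

    public function add($name, callable $scenario)
    {
        $this->scenarios[$name] = $scenario;
    }

    public function getIterator(): \ArrayIterator
    {
        return new \ArrayIterator($this->scenarios);
    }
}

// The runner would then do something like:
//
//     foreach ($scenarios as $name => $scenario) {
//         $scenario($browser); // prepare session state
//         // ...then crawl as usual...
//     }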

It should be installable.

$ composer global require umpirsky/centipede:0.1.*@dev

fails with the following error:

Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - umpirsky/centipede-crawler 0.1.0 requires react/promise dev-master@dev -> no matching package found.
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
    - umpirsky/centipede-crawler 0.1.1 requires guzzlehttp/guzzle ~5.3 -> satisfiable by guzzlehttp/guzzle[5.3.0].
    - umpirsky/centipede 0.1.x-dev requires umpirsky/centipede-crawler ~0.1 -> satisfiable by umpirsky/centipede-crawler[0.1.0, 0.1.1].
    - Installation request for umpirsky/centipede 0.1.*@dev -> satisfiable by umpirsky/centipede[0.1.x-dev].
    - Installation request for guzzlehttp/guzzle == 4.1.3.0 -> satisfiable by guzzlehttp/guzzle[4.1.3].

Although this is related to centipede-crawler, I decided to open the issue here since the whole project is not installable.

Should support relative link targets

Centipede works fine when running on a root path like http://example.com/; however, it does not correctly resolve relative href targets when running on a sub-path like http://example.com/demo.

Some test cases (incomplete):

/demo/ + a => /demo/a
/demo/ + /a => /a
/demo/ + ../a => /a
/demo + a => /a
/demo + /a => /a
/demo + ../a => /a
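
These cases follow RFC 3986 reference resolution. A quick check of them, assuming the guzzlehttp/psr7 package (just one option, not necessarily what centipede should depend on):

<?php
use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

$cases = [
    // base      relative  expected
    ['/demo/',   'a',      '/demo/a'],
    ['/demo/',   '/a',     '/a'],
    ['/demo/',   '../a',   '/a'],
    ['/demo',    'a',      '/a'],
    ['/demo',    '/a',     '/a'],
    ['/demo',    '../a',   '/a'],
];

foreach ($cases as [$base, $rel, $expected]) {
    $resolved = UriResolver::resolve(
        new Uri('http://example.com' . $base),
        new Uri($rel)
    );
    printf("%-7s + %-4s => %-7s (expected %s)\n",
        $base, $rel, $resolved->getPath(), $expected);
}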

Phar support?

Do you plan to provide a phar archive? It could be interesting to use it globally instead of requiring centipede in each project ;)

Is there a way to see which page was the parent of a URL we are crawling?

Hello,

For example, I do:
$urls = (new Centipede\Crawler('domain.com'))->crawl();

and then try to fetch all the pages with Guzzle to check their statuses, etc.

But then I see, for example, that the URL '/en/about-us' is not accessible (gives a 404). Is there a way to check on which parent URLs this '/en/about-us' was found? The link could be buried inside a single article or similar, so it would be hard to hunt down.

Thank you!
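
A workaround sketch while that feature doesn't exist: crawl manually and record a parent map. Here, fetch() and extractHrefs() are placeholders (not centipede API), and the sketch assumes absolute hrefs.

<?php
// Record which page each URL was first discovered on.
$queue   = [['http://domain.com/', null]]; // [url, parent] pairs
$parents = [];

while ($queue) {
    [$url, $parent] = array_shift($queue);

    if (array_key_exists($url, $parents)) {
        continue; // already seen, keep the first parent
    }
    $parents[$url] = $parent;

    // fetch() and extractHrefs() are placeholders, not centipede API.
    foreach (extractHrefs(fetch($url)) as $href) {
        $queue[] = [$href, $url];
    }
}

// Later: where was the broken page linked from?
echo $parents['http://domain.com/en/about-us'];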

Support for ignore urls

A simple URL ignore list in the centipede.yml config:

ignore:
    - example.com/foo
    - example.com/bar
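
A sketch of the check itself, assuming entries are scheme-less exact matches as in the example (the matching semantics would still need deciding):

<?php
// Hypothetical ignore check: exact match on host + path, ignoring
// the scheme and any trailing slash. Prefix/wildcard matching would
// be a separate decision.
function isIgnored($url, array $ignoreList)
{
    $normalized = preg_replace('#^https?://#', '', rtrim($url, '/'));

    return in_array($normalized, $ignoreList, true);
}

$ignore = ['example.com/foo', 'example.com/bar'];

var_dump(isIgnored('http://example.com/foo', $ignore)); // bool(true)
var_dump(isIgnored('http://example.com/baz', $ignore)); // bool(false)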

Hookable/eventable crawler behavior

Maybe this should be opened against the crawler component, but this looks like an application-wide feature.

We could trigger application-wide events during the crawler's lifecycle: when a request is initiated, when a URL is discovered, when a response is retrieved, and so on.
This would allow us to implement:

  • Clever and clean output handling (pretty much in @everzet's Behat way, though I'm not suggesting using decorators)
  • Passing request/response data to pluggable event listeners. Imagine we're getting a 500 from a Symfony2 app and it is caught by an event handler that sends an email to the developer, with a profiler link attached.
  • Passing the whole list of URLs discovered during a run to some kind of listener, to make sure we don't return 404s for pages that are linked from Google search

And I'm sure there are more use cases.
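
A minimal sketch of those hooks with a hand-rolled emitter, to stay neutral on which event library to use; the event names and payloads are made up:

<?php
// Minimal listener registry; a real implementation might use an
// existing event dispatcher instead.
class CrawlEvents
{
    /** @var array<string, callable[]> */
    private $listeners = [];

    public function on($event, callable $listener)
    {
        $this->listeners[$event][] = $listener;
    }

    public function emit($event, array $payload = [])
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

$events = new CrawlEvents();

// e.g. notify the developer on a 500, as suggested above.
$events->on('response.received', function (array $p) {
    if ($p['status'] >= 500) {
        // mail($developer, 'Crawl hit a 500', $p['url']);
    }
});

// The crawler would then call, at the appropriate lifecycle points:
// $events->emit('response.received', ['url' => $url, 'status' => $status]);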

How to install correctly?

Hello,

What would be the correct way to install it?
Somehow a simple composer require umpirsky/centipede:dev-master did not work.

I needed to specify the full repository paths. composer.json example:

{
    "require": {
        "react/promise": "dev-master",
        "umpirsky/centipede": "dev-master",
        "umpirsky/centipede-crawler": "dev-master"
    },

    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/umpirsky/centipede"
        },
        {
            "type": "vcs",
            "url": "https://github.com/umpirsky/centipede-crawler"
        }
    ]
}

Better support for depth argument

Currently, when a website contains a link back to its root URL (the "brand link", e.g. the site logo), crawling with the depth argument re-crawls that link every time it is encountered.

I'd recommend adding something that improves the readability of the explored URLs:

  1. First, add something to the crawler that stores all crawled URLs. A simple array of url => true pairs and a few isset() checks would do (see the sketch at the end of this issue).
  2. Add some kind of "level" information to the logs.

For example, here is the output of this command:

$ centipede run https://github.com 3
200 https://github.com
200 https://github.com/join
200 https://github.com/join
200 https://github.com/join
200 https://github.com/login?return_to=%2Fjoin
200 https://github.com/explore
200 https://github.com/features
200 https://github.com/blog
200 https://github.com/about
200 https://github.com/site/terms
200 https://github.com/site/privacy
200 https://github.com/security
200 https://github.com/contact
200 https://github.com/login
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/password_reset
200 https://github.com/plans
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/signup
200 https://github.com/early_access/large_file_storage?utm_source=github_site&utm_medium=pricing_signup_link&utm_campaign=gitlfs
200 https://github.com/login?return_to=%2Fearly_access%2Fgit-lfs
200 https://github.com/join?return_to=%2Fearly_access%2Fgit-lfs
200 mailto:[email protected]
200 https://github.com/plans
200 https://github.com/integrations

Here, I don't know where the different URLs were "clicked from". Maybe I could guess, maybe not. And /join is seen many times, which is not quite correct...

Edit (for this part): In fact, a structure I'd like to see might be something like this:

$ centipede run http://example.com/
200 http://example.com/about
200 > http://example.com/about/team
200 > http://example.com/about/commercial
200 http://example.com/products
200 > http://example.com/products/specials
200 > > http://example.com/products/exclusive

Etc.
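
A sketch covering both points from the list above: a url => true visited map, plus a "> " prefix per depth level as in the desired output; fetchStatus() and extractHrefs() are placeholders, not centipede internals.

<?php
// Sketch of points 1 and 2 above.
function crawl($url, $maxDepth, array &$visited = [], $depth = 0)
{
    if ($depth > $maxDepth || isset($visited[$url])) {
        return; // the url => true map from point 1
    }
    $visited[$url] = true;

    // fetchStatus() is a placeholder for the actual HTTP request.
    printf("%d %s%s\n", fetchStatus($url), str_repeat('> ', $depth), $url);

    foreach (extractHrefs($url) as $href) { // placeholder as well
        crawl($href, $maxDepth, $visited, $depth + 1);
    }
}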

Fails with "PHP Catchable fatal error"

The current master (bfe44f6) does not work and quits with the following error. I didn't dig into this any further, but it looks like one of the dependencies received an incompatible update:

$ php bin/centipede run http://10.52.18.222/
PHP Catchable fatal error:  Argument 2 passed to Centipede\Console\Command\Run::Centipede\Console\Command\{closure}() must be an instance of GuzzleHttp\Message\FutureResponse, instance of Symfony\Component\BrowserKit\Response given, called in /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php on line 64 and defined in /tmp/phar-composer8/src/Centipede/Console/Command/Run.php on line 30
PHP Stack trace:
PHP   1. {main}() /tmp/phar-composer8/bin/centipede:0
PHP   2. Symfony\Component\Console\Application->run() /tmp/phar-composer8/bin/centipede:19
PHP   3. Symfony\Component\Console\Application->doRun() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:126
PHP   4. Symfony\Component\Console\Application->doRunCommand() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:195
PHP   5. Symfony\Component\Console\Command\Command->run() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:874
PHP   6. Centipede\Console\Command\Run->execute() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Command/Command.php:252
PHP   7. Centipede\Crawler->crawl() /tmp/phar-composer8/src/Centipede/Console/Command/Run.php:44
PHP   8. Centipede\Crawler->request() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:27
PHP   9. Centipede\Console\Command\Run->Centipede\Console\Command\{closure}() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:64
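
The trace suggests the closure in Run.php type-hints Guzzle's FutureResponse while the crawler now passes a BrowserKit Response. A guess at a fix, since the real closure body isn't shown in the trace: loosen the hint and duck-type the status call.

// Hedged sketch of a change in src/Centipede/Console/Command/Run.php;
// the actual closure body is not visible here, so its shape is a guess.
$callback = function ($url, $response) use ($output) {
    // BrowserKit's Response and Guzzle's FutureResponse expose the
    // status code through different methods, so branch on what exists.
    $status = method_exists($response, 'getStatusCode')
        ? $response->getStatusCode()
        : $response->getStatus();

    $output->writeln(sprintf('%d %s', $status, $url));
};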
