umpirsky / centipede
:sparkler: The Simplest automated testing tool on Earth.
License: MIT License
Maybe support multiple output formatters.
Hello,
For example I do:
$urls = (new Centipede\Crawler('domain.com'))->crawl();
and then try to Guzzle get all the pages to check for statuses, etc.
But, for example, I see that the URL '/en/about-us' is not accessible (it gives a 404). Is there a way to check on which parent URLs this '/en/about-us' was found? The link could be inside one article or so, which would make it hard to hunt down.
Thank you!
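In case it helps, here is a rough sketch of how parent tracking could be bolted on outside of Centipede. Everything here is hypothetical (the crawler does not expose link provenance as far as I can tell); the $links map stands in for the link extraction a real crawl would perform:

```php
<?php
// Sketch: record the parent page each URL was discovered on, so a 404
// can later be traced back to the page(s) linking to it. The crawl is
// simulated with a static link map; in practice you would fetch each
// page and extract its links yourself.
$links = [
    '/'           => ['/en/about-us', '/en/contact'],
    '/en/contact' => ['/en/about-us'],
];

$foundOn = [];
foreach ($links as $parent => $children) {
    foreach ($children as $child) {
        $foundOn[$child][] = $parent;
    }
}

// If /en/about-us turns out to be a 404, we now know where it was linked:
print_r($foundOn['/en/about-us']); // ['/', '/en/contact']
```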
Hi.
Just a little reflection: why not provide a system that allows you to describe some scenarios to execute before crawling the URLs?
For example:
// MyProject/centipede.php
$scenarios = new Centipede\Scenario\Collection();
$scenarios->add('customer_account', function ($browser) {
    $browser->visit('/login');
    $browser->fillIn('login', '[email protected]');
    $browser->fillIn('password', 'strong_password');
    $browser->press('Connection');
    $browser->visit('/account');
});
$scenarios->add('backoffice', function ($browser) {
    $browser->visit('/admin/login');
    $browser->fillIn('seller[uuid]', '1122334455');
    $browser->fillIn('seller[password]', 'strong_password');
    $browser->press('Access to dashboard');
    $browser->visit('/admin/dashboard');
});
$scenarios->add('public', function ($browser) {
    $browser->visit('/');
});
return $scenarios;
And then the crawler would execute the scenarios one by one and crawl the URLs each time.
Actually, HTTP errors are supported, but when they occur at depth, they're always handled as exceptions, breaking the script. It seems they should be caught and rendered as classic errored crawls instead...
Here is an example of use:
$ centipede run http://example.com/
200 http://example.com/page
404 http://example.com/non_existing
If I'm using the depth argument, once the crawler reaches a page whose "current depth" differs from the root one, errors are no longer handled.
In the above example, suppose there is a link on the /page page pointing to /non_existing; the result will be similar to this one:
$ centipede run http://example.com/ 1
200 http://example.com/page
[GuzzleHttp\Exception\ClientException]
Client error response [url] http://example.com/non_existing [status code] 404 [reason phrase] Not Found
This breaks the script, and the crawler is obviously no longer able to navigate.
Actually, when a website has a link back to the root URL, if you use the depth argument to crawl it, this "brand link" will be crawled each time.
I'd recommend adding some kind of system that enhances readability of the explored URLs.
A url => true pair and some isset checks would permit this kind of use. For example, here is the output of this command:
$ centipede run https://github.com 3
200 https://github.com
200 https://github.com/join
200 https://github.com/join
200 https://github.com/join
200 https://github.com/login?return_to=%2Fjoin
200 https://github.com/explore
200 https://github.com/features
200 https://github.com/blog
200 https://github.com/about
200 https://github.com/site/terms
200 https://github.com/site/privacy
200 https://github.com/security
200 https://github.com/contact
200 https://github.com/login
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/login?return_to=%2Flogin
200 https://github.com/password_reset
200 https://github.com/plans
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/login?return_to=%2Fpricing
200 https://github.com/signup
200 https://github.com/early_access/large_file_storage?utm_source=github_site&utm_medium=pricing_signup_link&utm_campaign=gitlfs
200 https://github.com/login?return_to=%2Fearly_access%2Fgit-lfs
200 https://github.com/join?return_to=%2Fearly_access%2Fgit-lfs
200 mailto:[email protected]
200 https://github.com/plans
200 https://github.com/integrations
Here, I don't know where all these different URLs were "clicked from". Maybe I could guess, maybe not. The /join page is seen many times, which is not quite correct...
Edit (for this part): In fact, a structure I'd like to see might be something like this:
$ centipede run http://example.com/
200 http://example.com/about
200 > http://example.com/about/team
200 > http://example.com/about/commercial
200 http://example.com/products
200 > http://example.com/products/specials
200 > > http://example.com/products/exclusive
Etc.
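To make the idea concrete, here is a minimal, self-contained sketch of the url => true visited set combined with the "> " depth indentation. The $graph link map is invented for illustration; a real crawler would discover links by fetching each page:

```php
<?php
// Sketch of the "url => true" visited set plus depth-indented output.
$graph = [
    '/'                  => ['/about', '/products'],
    '/about'             => ['/about/team', '/'],
    '/products'          => ['/products/specials', '/about'],
    '/about/team'        => [],
    '/products/specials' => [],
];

$visited = [];
function crawl(string $url, int $depth, array $graph, array &$visited): void
{
    if (isset($visited[$url])) {
        return; // already crawled: skip duplicates like the /join case above
    }
    $visited[$url] = true;
    echo str_repeat('> ', $depth) . $url . PHP_EOL;
    foreach ($graph[$url] ?? [] as $child) {
        crawl($child, $depth + 1, $graph, $visited);
    }
}

crawl('/', 0, $graph, $visited);
```

With the duplicate check, each URL is printed exactly once, at the depth where it was first discovered.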
Support for custom status code per URL in the centipede.yml config:
status:
  example.com/foo: 403
Do you plan to provide a phar archive? It could be interesting to use centipede globally instead of requiring it in the project ;)
For example, the script won't find any children here:
./vendor/bin/centipede run http://www.altevents.darbai.webas.lt/
200 http://www.altevents.darbai.webas.lt/
Why is that? There are more links on that page, but it only checks the main page.
I have checked the UrlExtractor class; the extract() method receives valid HTML, but DOMDocument does not extract the links. Maybe we should use symfony/dom-crawler?
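For reference, plain DOMDocument can usually extract anchors as long as libxml errors are suppressed for non-well-formed markup. Here is a minimal sketch (I have not checked how this compares to what UrlExtractor currently does):

```php
<?php
// Minimal link extraction with ext-dom only. Real-world HTML is often
// not well-formed, so libxml errors must be suppressed or DOMDocument
// emits warnings before the <a> elements are parsed.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<html><body><a href="/foo">Foo</a><a href="/bar/">Bar</a></body></html>';
print_r(extractLinks($html)); // ['/foo', '/bar/']
```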
Thanks!
centipede.yml:
rules:
  foo:
    url: https://github.com/foo
    status: 200
    text: Foo
  bar:
    url: https://github.com/bar
    status: 200
    text: Bar
ignore:
  - https://github.com/ignore
  - https://github.com/ignored
Hi,
The ca_FR translation is not in French but a mix of Italian (or something else) and other languages.
The parser treats this link
<a href="https://site.com/path/">XXXX</a>
as https://site.com/path, while the correct one is https://site.com/path/.
The current master (bfe44f6) does not work and quits with the following error. I didn't dig into this any further, but it looks like one of the dependencies received an incompatible update:
$ php bin/centipede run http://10.52.18.222/
PHP Catchable fatal error: Argument 2 passed to Centipede\Console\Command\Run::Centipede\Console\Command\{closure}() must be an instance of GuzzleHttp\Message\FutureResponse, instance of Symfony\Component\BrowserKit\Response given, called in /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php on line 64 and defined in /tmp/phar-composer8/src/Centipede/Console/Command/Run.php on line 30
PHP Stack trace:
PHP 1. {main}() /tmp/phar-composer8/bin/centipede:0
PHP 2. Symfony\Component\Console\Application->run() /tmp/phar-composer8/bin/centipede:19
PHP 3. Symfony\Component\Console\Application->doRun() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:126
PHP 4. Symfony\Component\Console\Application->doRunCommand() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:195
PHP 5. Symfony\Component\Console\Command\Command->run() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Application.php:874
PHP 6. Centipede\Console\Command\Run->execute() /tmp/phar-composer8/vendor/symfony/console/Symfony/Component/Console/Command/Command.php:252
PHP 7. Centipede\Crawler->crawl() /tmp/phar-composer8/src/Centipede/Console/Command/Run.php:44
PHP 8. Centipede\Crawler->request() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:27
PHP 9. Centipede\Console\Command\Run->Centipede\Console\Command\{closure}() /tmp/phar-composer8/vendor/umpirsky/centipede-crawler/src/Centipede/Crawler.php:64
Maybe this should be opened against the crawler component, but this looks like an application-wide feature.
We could trigger application-wide events during the crawler's lifecycle: when a request is initiated, when a URL is discovered, when a response is retrieved, and so on.
This would allow us to implement:
And I'm sure there are more use cases.
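As a sketch of the idea, a minimal listener registry could look like the following. The Dispatcher class and the event names are invented for illustration; in practice symfony/event-dispatcher would be the natural fit for this Symfony-based project:

```php
<?php
// Hypothetical sketch of lifecycle events for the crawler.
class Dispatcher
{
    private array $listeners = [];

    public function on(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    public function emit(string $event, array $payload = []): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

$dispatcher = new Dispatcher();
$log = [];
$dispatcher->on('url.discovered', function (array $p) use (&$log) {
    $log[] = 'discovered ' . $p['url'];
});
$dispatcher->on('response.received', function (array $p) use (&$log) {
    $log[] = $p['status'] . ' ' . $p['url'];
});

// The crawler would emit these at the corresponding lifecycle points:
$dispatcher->emit('url.discovered', ['url' => '/about']);
$dispatcher->emit('response.received', ['url' => '/about', 'status' => 200]);
```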
When parsing my website with centipede run http://auxptitsplaisirs.lo/, I get the following exception:
[InvalidArgumentException]
Unable to parse malformed url: http://auxptitsplaisirs.lotel:+41 21 907 26 26
Because it fails to parse the following link :
<a href="tel:+41 21 907 26 26">+41 21 907 26 26</a>
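A possible workaround until this is fixed: filter out hrefs with non-HTTP schemes before they reach the HTTP client. This is only a sketch, and isCrawlable is a hypothetical helper, not an existing Centipede function:

```php
<?php
// Sketch: skip hrefs whose scheme is not crawlable (tel:, mailto:,
// javascript:, ...) before handing them to the HTTP client.
function isCrawlable(string $href): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // unparseable, skip it
    }
    $scheme = $parts['scheme'] ?? null;
    // Relative links have no scheme and are fine; otherwise only http(s).
    return $scheme === null || in_array(strtolower($scheme), ['http', 'https'], true);
}

var_dump(isCrawlable('/en/about-us'));           // true
var_dump(isCrawlable('https://site.com/path/')); // true
var_dump(isCrawlable('tel:+41 21 907 26 26'));   // false
var_dump(isCrawlable('mailto:[email protected]')); // false
```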
Hello,
what would be the 'correct way' to install it?
Somehow a simple composer require umpirsky/centipede:dev-master did not work...
I needed to specify the full repository paths. composer.json example:
{
"require": {
"react/promise": "dev-master",
"umpirsky/centipede": "dev-master",
"umpirsky/centipede-crawler": "dev-master"
},
"repositories": [
{
"type": "vcs",
"url": "https://github.com/umpirsky/centipede"
},
{
"type": "vcs",
"url": "https://github.com/umpirsky/centipede-crawler"
}
]
}
$ composer global require umpirsky/centipede:0.1.*@dev
ends with the following error
Your requirements could not be resolved to an installable set of packages.
Problem 1
- umpirsky/centipede-crawler 0.1.0 requires react/promise dev-master@dev -> no matching package found.
- Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
- Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
- Can only install one of: guzzlehttp/guzzle[5.3.0, 4.1.3].
- umpirsky/centipede-crawler 0.1.1 requires guzzlehttp/guzzle ~5.3 -> satisfiable by guzzlehttp/guzzle[5.3.0].
- umpirsky/centipede 0.1.x-dev requires umpirsky/centipede-crawler ~0.1 -> satisfiable by umpirsky/centipede-crawler[0.1.0, 0.1.1].
- Installation request for umpirsky/centipede 0.1.*@dev -> satisfiable by umpirsky/centipede[0.1.x-dev].
- Installation request for guzzlehttp/guzzle == 4.1.3.0 -> satisfiable by guzzlehttp/guzzle[4.1.3].
Although it is related to centipede-crawler, I decided to open it here, as the whole project is not installable.
Support for custom text check per URL in the centipede.yml config:
text:
  example.com/foo: 'This is foo page!'
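Sketching how such per-URL rules might be applied once the config is parsed. The $rules and $responses structures below are stand-ins, not the actual config parser or HTTP client:

```php
<?php
// Sketch: apply per-URL expectations (status code, text) from config.
$rules = [
    'example.com/foo' => ['status' => 200, 'text' => 'This is foo page!'],
];

// Stub of what an HTTP client would have returned for each URL.
$responses = [
    'example.com/foo' => ['status' => 200, 'body' => '<h1>This is foo page!</h1>'],
];

$failures = [];
foreach ($rules as $url => $rule) {
    $response = $responses[$url];
    if (isset($rule['status']) && $response['status'] !== $rule['status']) {
        $failures[] = "$url: expected status {$rule['status']}, got {$response['status']}";
    }
    if (isset($rule['text']) && strpos($response['body'], $rule['text']) === false) {
        $failures[] = "$url: text '{$rule['text']}' not found";
    }
}

var_dump($failures); // empty array: all rules pass
```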
I am thinking of switching this project to JavaScript.
I am considering PhantomJS and a simple plugin system (similar to #26) with Cucumber.js or CodeceptJS.
Centipede works fine while running on the root path like http://example.com/; however, it does not correctly resolve relative href targets when running on a sub-path like http://example.com/demo.
Some test cases (incomplete):
/demo/ + a => /demo/a
/demo/ + /a => /a
/demo/ + ../a => /a
/demo + a => /a
/demo + /a => /a
/demo + ../a => /a
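For illustration, here is a rough, path-only sketch of RFC 3986-style merging and dot-segment removal that covers the cases above. It ignores schemes, hosts, queries, and fragments, and is not proposed as the actual fix:

```php
<?php
// Rough sketch of RFC 3986 section 5 relative resolution, path-only.
function resolvePath(string $basePath, string $rel): string
{
    if ($rel !== '' && $rel[0] === '/') {
        return removeDotSegments($rel);          // absolute path reference
    }
    // Merge: drop everything after the last '/' of the base path.
    $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
    return removeDotSegments($dir . $rel);
}

function removeDotSegments(string $path): string
{
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
            if ($out === []) {
                $out = [''];                     // never climb above root
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return implode('/', $out);
}

// The test cases from above:
echo resolvePath('/demo/', 'a'), PHP_EOL;    // /demo/a
echo resolvePath('/demo/', '/a'), PHP_EOL;   // /a
echo resolvePath('/demo/', '../a'), PHP_EOL; // /a
echo resolvePath('/demo', 'a'), PHP_EOL;     // /a
echo resolvePath('/demo', '/a'), PHP_EOL;    // /a
echo resolvePath('/demo', '../a'), PHP_EOL;  // /a
```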
Simple URL ignore list in the centipede.yml config:
ignore:
  - example.com/foo
  - example.com/bar