spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

License: MIT License

Languages: PHP 89.07%, JavaScript 10.61%, Shell 0.32%
Topics: php, crawler, guzzle, concurrency

crawler's Introduction

🕸 Crawl the web using PHP 🕷


This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best in class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /*
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /*
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /*
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
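
For example, a minimal observer that just logs every result could look like this (the class name and the echo calls are purely illustrative):

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class LoggingCrawlObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Record the status code of every successfully crawled page.
        echo sprintf("Crawled %s (%d)\n", $url, $response->getStatusCode());
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Record failed requests together with the reason.
        echo sprintf("Failed %s: %s\n", $url, $requestException->getMessage());
    }
}

You can then pass an instance to the crawler: Crawler::create()->setCrawlObserver(new LoggingCrawlObserver())->startCrawling($url);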

Using multiple observers

You can set multiple observers with setCrawlObservers:

Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
     ])
    ->startCrawling($url);

Alternatively you can set multiple observers one by one with addCrawlObserver:

Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

Crawler::create()
    ->executeJavaScript()
    ...

To make it possible to get the body HTML after the JavaScript has been executed, this package depends on our Browsershot package. That package uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the crawler will instantiate a new Browsershot instance. If needed, you can pass a custom instance using the setBrowsershot(Browsershot $browsershot) method.

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile function. That function expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile:

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;

This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs pointing to external sites.
  • CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
  • CrawlSubdomains: this profile will only crawl the internal URLs of a host and its subdomains.
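
If none of the built-in profiles fit your needs, you can write your own by extending CrawlProfile and implementing shouldCrawl. A minimal sketch that only crawls pages in a documentation section of a single host (the class name and host are just examples):

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class CrawlDocsPages extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        // Only crawl URLs on example.com whose path starts with /docs.
        return $url->getHost() === 'example.com'
            && str_starts_with($url->getPath(), '/docs');
    }
}

Pass an instance of it to the crawler via setCrawlProfile(new CrawlDocsPages()).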

Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler.

Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...

By default, the LinkUrlParser is used. This parser extracts all links from the href attribute of anchor tags.

There is also a built-in SitemapUrlParser that will extract and crawl all links from a sitemap. It supports sitemap index files as well.

Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

Crawler::create()
    ->ignoreRobots()
    ...

Robots data can come from either a robots.txt file, meta tags or response headers. More information on the spec can be found here: http://www.robotstxt.org/.

Parsing robots data is done by our package spatie/robots-txt.

Accept links with rel="nofollow" attribute

By default, the crawler will reject all links that carry a rel="nofollow" attribute. It is possible to disable these checks like so:

Crawler::create()
    ->acceptNofollowLinks()
    ...

Using a custom User Agent

To have the crawler respect robots.txt rules for a specific user agent, you can set your own custom user agent.

Crawler::create()
    ->setUserAgent('my-agent')

You can add your specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent'.

// Disallow crawling for my-agent
User-agent: my-agent
Disallow: /

Setting the number of concurrent requests

To improve the speed of the crawl, the package concurrently crawls 10 URLs by default. If you want to change that number, you can use the setConcurrency method.

Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one

Defining Crawl Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in a constrained environment, such as a serverless environment.

The crawl behavior can be controlled with the following two options:

  • Total Crawl Limit (setTotalCrawlLimit): This limit defines the maximum number of URLs to crawl.
  • Current Crawl Limit (setCurrentCrawlLimit): This defines how many URLs are processed during the current crawl.

Let's take a look at some examples to clarify the difference between these two methods.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit on the number of pages to crawl.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 3: Combining the total and current crawl limit

Both limits can be combined to control the crawler:

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 4: Crawling across requests

You can use setCurrentCrawlLimit to break up long-running crawls. The following example demonstrates a (simplified) approach: an initial request and any number of follow-up requests that continue the crawl.

Initial Request

To start crawling across different requests, you will need to create a new queue with your selected queue driver. Start by passing the queue instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue after the crawler has finished (using the current crawl limit).

// Create a queue using your queue-driver.
$queue = <your selection/implementation of a queue>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:

// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

The behavior is based on the information in the queue. The limits only work as described when the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls, even for the same website, won't apply.

An example with more details can be found here.

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the setMaximumDepth method.

Crawler::create()
    ->setMaximumDepth(2)

Setting the maximum response size

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses that are smaller than 2 MB. If a response becomes larger than 2 MB while it is being streamed, the crawler will stop streaming it and assume an empty response body.

You can change the maximum response size:

// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)

Add a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds.

Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms

Limiting which content-types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types.

Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])

This will prevent downloading the body of pages with other mime types, such as binary files and audio/video, which are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large, you may want to store that queue elsewhere, for example in a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.

Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)
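
For reference, explicitly passing the built-in queue is equivalent to the default behavior (assuming, as the namespaces above suggest, that ArrayCrawlQueue lives next to the CrawlQueue interface):

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

Crawler::create()
    ->setCrawlQueue(new ArrayCrawlQueue())
    ->startCrawling($url);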

Change the default base url scheme

By default, the crawler will set the base URL scheme to http if none is set. You can change that with setDefaultScheme.

Crawler::create()
    ->setDefaultScheme('https')

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail.

npm install puppeteer

To run the tests, you'll have to start the included node-based server first in a separate terminal window.

cd tests/server
npm install
node server.js

With the server running, you can start testing.

composer test

Security

If you've found a bug regarding security please mail [email protected] instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

License

The MIT License (MIT). Please see License File for more information.

crawler's People

Contributors

adamtomat, adrianmrn, akalongman, akoepcke, alexvanderbist, andrzejkupczyk, barocode, benmorel, brendt, brentrobert, buismaarten, denvers, dvdty, freekmurze, juukie, localheinz, mansoorkhan96, mattiasgeniar, mbardelmeijer, michielkempen, nicolasmica, nielsvanpach, pascalbaljet, patinthehat, redominus, rubenvanassche, sebastiandedeyne, spekulatius, systream, tvke


crawler's Issues

[question] How do concurrent requests work in depth?

Hi,

I am wondering how your concurrent requests work in depth. Do you use pthreads for it, or just the Guzzle-based promises?

If the latter applies, could you please provide a second version of the project using the pthreads extension? I think that would really speed up your crawler.

Thanks in advance,

alpham8

Crawler stops when encountering a 404 page.

When the crawler encounters a 404 page it calls the crawlFailed method as mentioned in the documentation, but it also just exits (stops crawling). I am sure that it still has a lot of unvisited links in its pool, so it seems it should keep going (just skip that page/step) and continue working. But it stops.

maximumDepth not checked while respecting robots

The maximumDepth isn't checked while crawling with respect to robots. This might lead to issues when crawling large sites. The only way for me to get it to work is to use the 'ignoreRobots' function on the crawler, but then I won't be respecting the guidelines of the crawled site.

Avoid executing useless code. addtoDepthTree

When maximumDepth is null, addtoDepthTree shouldn't be executed along with shouldCrawl. This way the crawler avoids a foreach loop with a recursive call inside, and also avoids two calls to the GuzzleHttp\Psr7\Url __toString method.

Loop

Hi,
I think I experienced the crawler going into a loop.

Is there an easy way to prevent crawling pages that have already been crawled?

Better `CrawlProfile` interface or abstract `CrawlProfile`

Different crawl profiles need different parameters. For example, the CrawlSubdomains constructor expects a $baseUrl, whilst the Spatie\Sitemap\Crawler\Profile constructor expects a callback.

We should probably define a better CrawlProfile interface in the next major version of spatie/crawler or even an abstract CrawlProfile that can be extended and includes a way to set a base url and custom shouldCrawl callback on all crawl profiles.

This way we can avoid issues like https://github.com/spatie/laravel-sitemap/issues/103
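
A rough sketch of what such an abstract profile could look like, purely hypothetical and not part of the package:

use Psr\Http\Message\UriInterface;

abstract class BaseCrawlProfile
{
    protected ?string $baseUrl = null;

    /** @var callable|null */
    protected $shouldCrawlCallback = null;

    public function setBaseUrl(string $baseUrl): static
    {
        $this->baseUrl = $baseUrl;

        return $this;
    }

    public function setShouldCrawlCallback(callable $callback): static
    {
        $this->shouldCrawlCallback = $callback;

        return $this;
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Defer to a custom callback when one is set ...
        if ($this->shouldCrawlCallback !== null) {
            return (bool) call_user_func($this->shouldCrawlCallback, $url);
        }

        // ... otherwise fall back to the profile's own rule.
        return $this->defaultShouldCrawl($url);
    }

    abstract protected function defaultShouldCrawl(UriInterface $url): bool;
}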

Authentication

Hi, can't find any HTTP authentication options. Is it possible?

Thanks

Depth support?

Based on previous issues, I know you said you don't plan on implementing support for depth, but could you give any pointers on implementing this?

Crawler scans only one page with js execution enabled

Crawler::create()
    ->setCrawlObserver(new Logger())
    ->executeJavaScript()
    ->startCrawling($url);

Logger is an empty class without any logic, extended from CrawlObserver.
Disabling JavaScript execution makes the crawler scan a myriad of pages normally.
Can the crawler handle external code such as <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script>?

proxy support

Is it possible to use multiple proxies and something like rotating between IPs for crawling?

Possible recrawl bugs in v2.1

I initially thought this might be a duplicate of issue #3. I have a simple site I am scraping (just for test purposes, it's my own site) and I am seeing some unusual behaviour in terms of URLs being scraped more than once. I am expecting a successful crawl to record about 10 pages, with one crawl per page.

In the above-mentioned issue, attention was drawn to the method hasAlreadyCrawled(), which does not seem to be present in the 2.x releases. I don't know if that is relevant.

  • If I use 2.1 mostly as-is, the system will call hasBeenCrawled in the CrawlObserver many times for each URL.
  • If I use 2.1 but keep a track of what has already been crawled, and reject duplicates using shouldCrawl(), then the system seems to run the CPU at 100% for many minutes, and needs cancelling via ^C.

I think 1.3 is mostly OK, with one surprising gotcha:

  • If I use 1.3 as-is, I get the results I expect - 13 URLs scraped in 34 sec.
  • If I use 1.3 but keep a track of what has already been crawled, shouldCrawl() is called for all 13, but hasBeenCrawled() is only called for the root page. I guess this is not a bug, and that I should simply not be trying to keep track of dups on my end.

Here is the script I am using with 1.3, you can see I've commented out things that are for the 2.x branch. This should run without any modifications being necessary. I've copied CrawlInternalUrls into the script, as this is not included in that release.

Here is the script I am using with 2.1, and this is what the output looks like (from hasBeenCrawled()):

Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/improving-the-installer
Crawled URL: /en/tutorial/make-your-own-blog/commenting-form
Crawled URL: /en/tutorial/make-your-own-blog/adding-a-login-system
Crawled URL: /en/tutorial/make-your-own-blog/tidy-up
Crawled URL: /en/tutorial/make-your-own-blog/new-post-creation
Crawled URL: /en/tutorial/make-your-own-blog/post-editing
Crawled URL: /en/tutorial/make-your-own-blog/all-posts-page
Crawled URL: /en/tutorial/make-your-own-blog/comment-admin
Crawled URL: /en/tutorial/make-your-own-blog/adding-some-polish
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/introduction
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/using-real-data
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
Crawled URL: /en/tutorial/make-your-own-blog/adding-more-features
^C

As you can see there's a lot of dups in there. That's version 2.1.2, on Ubuntu, using PHP 7.1.x.

I will fall back to 1.3.x for the time being, but do let me know if you want me to do any testing for you - happy to help if I can.

Provide Browsershot object or bodyHtml to CrawlObserver

Currently, Browsershot executes Javascript on a page after that page has been provided to a CrawlObserver. It would be useful if instead that execution could happen first, and either the Browsershot object or the bodyHtml from that object could be provided in some way to the observer.

Guzzle Config Options not honored fully

Whoops, sorry for the initially empty issue report. Here's what I meant to include......


I've started playing with this library recently and noticed a few odd things happening with Guzzle options. From what I'm seeing in Crawler.php there are 2 bugs causing this odd behaviour. I could be wrong on these, but from my debugging the current behaviour is errant.

1. Issue with create method logic

Check out this line and line 40 (right above).

Since the function call initialises an empty array no matter what, the null coalesce will trigger as false. The default code provided will never be triggered from what I can see; effectively dead code unless I'm mistaken.

If I'm correct that the default options should still be possible then I suggest this:

        $client = new Client($clientOptions ?? [
                RequestOptions::COOKIES => true,
            ]);

Be changed to:

        $client = new Client( (bool) count($clientOptions) ?  $clientOptions : [
                RequestOptions::COOKIES => true,
            ]);

2. With the update to use Pools none of the client options will ever be used

Since all crawls now use the pool system for Guzzle this means that on line 122 we override any user input for options.

Unless there's a reason for this to be done, I would suggest this be updated to use:

$this->client->getOptions()

...to populate pool options. This way the library will honour either the default settings or user input all the way through its usage of Guzzle.

3. (Optional) If both are to be addressed...

...then the client creation may need to look like:

        $client = new Client( (bool) count($clientOptions) ?  $clientOptions : [
                RequestOptions::CONNECT_TIMEOUT => 10,
                RequestOptions::TIMEOUT => 10,
                RequestOptions::ALLOW_REDIRECTS => false,
                RequestOptions::COOKIES => true,
            ]);

So I'm looking forward to some feedback on this; I would have submitted a PR but I'm not 100% sure what is intended here. Once it's been discussed I'd be happy to submit anything as a PR.

"RequestOptions::COOKIES => true" can lead to PHP Warnings

Hey guys,

I was playing around with your crawler. After using it on several domains, I noticed these warnings in my terminal: "PHP Warning: strpos(): Empty needle in /var/www/scanner/vendor/guzzlehttp/guzzle/src/Cookie/SetCookie.php on line 315"
Didn't have the time to dig further into this, but setting "RequestOptions::COOKIES => false" solved the warnings.
Unfortunately I cannot tell you on which domain this appeared exactly.
Why should you need cookies anyway on a crawler?

Kind regards
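
If anyone else hits this: assuming Crawler::create() accepts a Guzzle client options array (as the create() code discussed in the previous issue suggests), you should be able to turn the cookie jar off yourself. A sketch, where $observer stands for your own CrawlObserver instance:

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

// Disable Guzzle's cookie jar to avoid the strpos() warning described above.
Crawler::create([
    RequestOptions::COOKIES => false,
])
    ->setCrawlObserver($observer)
    ->startCrawling($url);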

[question] How to filter by Content-Type?

Hello, I intend to create a sitemap using this library, and I would like to remove duplicates and filter out content that is not HTML (filter out images, scripts, CSS and other stuff).

At this time I don't worry about the dupes, since I am queueing URLs in a database.

By looking at the header I will know if the page is HTML content and not an image, so I would like to ask whether the shouldCrawl method has already looked ahead at the URL to get that header. If so, I don't need to request the page again just to check the headers.

Best regards

Handlers extendability

The new handlers for request fullfillment and failure are a really cool idea. The problem is that they are not easy to extend without also extending the Crawler class because they can not be passed as constructed objects.

Is there any reason why they should be instantiated every time a Pool is created?

Thanks for your amazing work. 😄

Crawler parses body for all status codes

The addAllLinksToCrawlQueue method is run on all HTTP responses including response codes like 301 and 302 for redirects. This isn't a noticeable issue when not executing Javascript, as the bodies are usually empty. When executing Javascript it becomes a problem as Puppeteer follows the redirects, resulting in links on the destination page being parsed even if that page is excluded by the CrawlProfile.

It would be nice if response codes could be whitelisted (maybe 200 or 2XX) or blacklisted (maybe 301, 302, 307 and 308) before Javascript is executed and the body parsed.

inlineJavascript

Hello,

I'm using your fine code and I've noticed that it doesn't seem to handle inline javascript.
Try these two lines in artisan tinker:

use \Spatie\Crawler\Url;
(new Url("javascript:linkTo_UnCryptMailto('iwehpk6cbq:bqWckr:oe');"))

Possible fix in url.php:
/**
 * Determine if this is an inline javascript
 *
 * @return bool
 */
public function isJavascript()
{
    return $this->scheme === 'javascript';
}

..and then use this method to filter out these links.

Can't get crawler to work

I don't do a lot of object-oriented PHP, so I apologize in advance, but I'm sure others who want to test out this repo may have the same question.

I have installed the composer library and below is my code

The spatie library is at vendor/crawler/spatie relative to where this code is. I keep getting the following error.

Fatal error: Uncaught Error: Class 'Crawler' not found in C:\xampp\htdocs\crawler2\index.php:7 Stack trace: #0 {main} thrown in C:\xampp\htdocs\crawler2\index.php on line 7

Any chance you could help?

setCrawlObserver('\Spatie\Crawler\CrawlObserver')
    ->setConcurrency(1) // now all urls will be crawled one by one
    ->startCrawling($url);

how to combine CrawlProfiles?

I've written a new CrawlProfile for Spatie\Crawler and would like to add it as a Command.

$crawlProfile = $input->getOption('dont-crawl-external-links')
? new CrawlInternalUrls($baseUrl) : $input->getOption('ignore-querystrings')
? new IgnoreQueryStrings($baseUrl) : new CrawlAllUrls();

This does not give errors, but does not combine the Profiles. Any hints?
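
One way to combine profiles (not something the package ships with) is a small composite profile that only allows a URL when every wrapped profile allows it. A sketch, with a hypothetical class name:

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class CombinedCrawlProfile extends CrawlProfile
{
    /** @param CrawlProfile[] $profiles */
    public function __construct(protected array $profiles)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // A URL is crawled only if every wrapped profile agrees.
        foreach ($this->profiles as $profile) {
            if (! $profile->shouldCrawl($url)) {
                return false;
            }
        }

        return true;
    }
}

You could then pass new CombinedCrawlProfile([new CrawlInternalUrls($baseUrl), new IgnoreQueryStrings($baseUrl)]) to setCrawlProfile (adjust the CrawlProfile namespace to whatever your installed version uses).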

Documentation outdated.

Looks like the documentation is outdated.

For instance, the docs say Spatie\Crawler\CrawlObserver is an abstract class, but it looks like this is now an interface.

Class 'Spatie\Crawler' not found in

When I try to use your package, I found errors:

Warning: Class 'Tightenco\Collect\Support\Debug\Dumper' not found in vendor\tightenco\collect\src\Collect\Support\alias.php on line 18

Warning: Class 'Tightenco\Collect\Support\Debug\HtmlDumper' not found in vendor\tightenco\collect\src\Collect\Support\alias.php on line 18

I use windows 10 x64, php 7.1

Crawler can't render using JavaScript

$a = new \Spatie\Crawler\CrawlerTest();
$a->it_can_crawl_all_links_rendered_by_javascript();

PHP Fatal error: Uncaught Symfony\Component\Process\Exception\ProcessFailedException: The command "PATH=$PATH:/usr/local/bin NODE_PATH=npm root -g node '/public_html/crawl/Crawler/vendor/spatie/browsershot/src/../bin/browser.js' '{"url":"https://www.example.com/","action":"content","options":{"args":[],"viewport":{"width":800,"height":600}}}'" failed.
`

sh: npm: command not found
sh: node: command not found
in /public_html/crawl/Crawler/vendor/spatie/browsershot/src/Browsershot.php:586

Used on shared hosting

How would one go about implementing crawl limits per URL format?

Hello,

I'm trying to implement this throttling feature, but I seem to get stuck.
How do I force the crawler to ignore URLs by format if it has crawled beyond a set limit?

E.g. if the URL is like /param1/param2, how do I make it ignore crawling the same URLs again?
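
One possible approach (not built into the package) is a crawl profile that reduces every URL to a coarse pattern and stops allowing that pattern once it has been seen a given number of times. A rough sketch, with hypothetical class name and pattern logic:

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class LimitPerUrlPattern extends CrawlProfile
{
    /** @var array<string, int> */
    protected array $seen = [];

    public function __construct(protected int $limit = 1)
    {
    }

    public function shouldCrawl(UriInterface $url): bool
    {
        // Treat /param1/param2 style URLs as the same pattern by only
        // looking at the host and the number of path segments. Replace
        // this with whatever "format" rule fits your URLs.
        $segments = array_filter(explode('/', $url->getPath()));
        $pattern = $url->getHost() . '#' . count($segments);

        $this->seen[$pattern] = ($this->seen[$pattern] ?? 0) + 1;

        return $this->seen[$pattern] <= $this->limit;
    }
}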

Setting custom user-agent?

For one of my projects, I would like to be able to set the User-Agent header to a specific value. Is there a way to set it in the crawler's Guzzle client instance?

How to install?

Hello, I am having difficulties installing this package. Can you explain what steps to follow once you have downloaded this package to make it work?

How to load CrawlQueue saved in the database?

Hi,

I have problems loading old data from the database into a custom CrawlQueue.

In the Crawler I set the setMaximumCrawlCount variable to 200 (this takes around 30s on my test site to grab links). When the CrawlObserver calls the finishedCrawling method, I save the information from my CrawlQueue->urls implementation to the database. Then on the next check I load this data into CrawlQueue->urls and CrawlQueue->pendingUrls, but this often gives me a timeout (more than 60s). So I decided to load only the last 200 URLs from the database into CrawlQueue->urls and CrawlQueue->pendingUrls; the rest of the old data from the database is checked in CrawlProfile->shouldCrawl, which returns false if the URL exists in the database. But with this method I can only grab a small number of links with each check (around 10).

Can you help me with this?

Does it support 5.5.9

Hi, I'm not able to install it via Composer because of the PHP version dependency. My PHP version is 5.5.9 and it is really impossible to find a server that runs on PHP 5.6. Is it compatible with PHP 5.5.*?

Thanks

How can I monitor the crawler queue

Hi,

I want to write a CLI crawler and I want a way of showing links in the queue vs. links scanned, or links remaining.

Is there a way to do this?

Thanks.
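
There is no built-in progress API, but you can count processed links yourself in an observer and print a progress line from there. A minimal sketch:

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class ProgressObserver extends CrawlObserver
{
    public int $crawled = 0;
    public int $failed = 0;

    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        $this->crawled++;
        // Print a simple progress line for CLI usage.
        echo "[{$this->crawled} ok / {$this->failed} failed] {$url}\n";
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        $this->failed++;
    }
}

Links remaining in the queue are not exposed through the observer callbacks, so for that you would have to pass in your own crawl queue via setCrawlQueue and inspect it yourself.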

Capitalized URLs turned to lowercase

Related to the Spatie http-status-check tool. I ran the http-status-check scanner and got an awful lot of 404s. It turns out all the used URLs are lowercase, while the assets in my CMS use capitalized folders.

  • This is the actual url for my asset (works, 200): http://www.domain.be/src/Frontend/Files/userfiles/images/test.jpg
  • This url is used in the crawler/http-status-code scanner and does not work (error 404): http://www.domain.be/src/frontend/files/userfiles/images/test.jpg

Any idea why all the used URLs are lowercase? Otherwise a great tool 👍

support for depth

Hello,

Any plans for it to go deeper into the base_url links?

Thinking about running post-launch site tests, and maybe testing specific clients on a weekly basis, so we know that everything works on their side and so on.

Thanks

relative URL bugs!

The normalizeUrl() function in Crawler.php has not been implemented properly, for the following reason:

A relative URL in a page does not have a single absolute representation; the relative-to-absolute transformation according to RFC 1808 varies with the URL of the current page.

For example: the current URL is http://www.example.com/a/b.html, and there is an anchor href="c.html". According to the current normalizeUrl function, the anchor will be normalized to the URL http://www.example.com/c.html. But that's wrong; it should be:
http://www.example.com/a/c.html
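
For reference, the expected RFC 3986 resolution (which obsoletes RFC 1808) can be reproduced with guzzlehttp/psr7, which ships with Guzzle:

use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

$base = new Uri('http://www.example.com/a/b.html');
$relative = new Uri('c.html');

// Prints http://www.example.com/a/c.html, as described above.
echo UriResolver::resolve($base, $relative);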

Check respectRobots before rejecting links

->reject(function (Link $link) {
return $link->getNode()->getAttribute('rel') === 'nofollow';
})

Link rejection should check the respectRobots flag before rejecting a nofollow link.
Adding this code should be enough:

return $this->crawler->mustRespectRobots() && $link->getNode()->getAttribute('rel') === 'nofollow';

CrawlAllUrls does not cover sub-domains

Hi,
Thanks for this great package. I would like to request this feature, or learn whether it is already built in: when I use the CrawlAllUrls profile, it does not cover subdomains even though it covers all other external URLs.

301 redirects - error message?

Thanks very much for providing the package and code!

Just quick feedback:

I just tested it a bit, using a start URL which was redirected (301) because of https. There was no error message, so it was kind of confusing why the crawler stopped immediately.

Maybe an error message or a hint in the readme would be nice.

Best regards,
Matthias

I get an error while crawling

Type error: Argument 1 passed to Spatie\Crawler\Crawler::shouldCrawl() must be an instance of Tree\Node\Node, null given

The problem is in this snippet:

                $node = $this->addtoDepthTree($this->depthTree, $url, $foundOnUrl);

                if (! $this->shouldCrawl($node)) {
                    return;
                }

                if ($this->maximumCrawlCountReached()) {
                    return;
                }

                $crawlUrl = CrawlUrl::create($url, $foundOnUrl);

                $this->addToCrawlQueue($crawlUrl);

addtoDepthTree can return a null object.

And then ! $this->shouldCrawl($node) fails because of:

protected function shouldCrawl(Node $node): bool not allowing null.

how to get all URLs of a website

Hi, I want to get all links of a website. How can I do it with this?

I tried the snippet below, but it didn't crawl all links;
it just gives me 43 links, but I know there are around 64 links on that website.

Crawler::create()
    ->setConcurrency(1)
    ->setCrawlObserver(new CrawlLogger())
    ->startCrawling('http://pardiis.org');

Impossible to work with a persistent queue

The issue is that it's impossible to pass the Crawler a queue which works with persistent storage, because the Crawler saves state in memory in the depthTree attribute.

If you set a queue which already has pending URLs, then after calling startCrawling the Crawler fails with a fatal error, because depthTree is empty and new URLs can't be attached to the tree.

scrape a website

Is it possible to scrape a website via this package?

I need to get the comments of all links that start with siteexample.com/category/?????

Actually, I couldn't find an example of using a DOM parser in this package.

Thank you

Fix Spatie\Crawler\Url for getting Query

Hello, I have modified the Spatie\Crawler\Url file as below. Now it is able to get the URL with query values.

<?php

namespace Spatie\Crawler;

class Url
{
    /**
     * @var null|string
     */
    public $scheme;

    /**
     * @var null|string
     */
    public $host;

    /**
     * @var null|string
     */
    public $path;

    /**
     * @var null|string
     */

    public $query;

    /**
     * @param $url
     *
     * @return static
     */
    public static function create($url)
    {
        return new static($url);
    }

    /**
     * Url constructor.
     *
     * @param $url
     */
    public function __construct($url)
    {
        $urlProperties = parse_url($url);

        foreach (['scheme', 'host', 'path','query'] as $property) {
            if (isset($urlProperties[$property])) {
                $this->$property = $urlProperties[$property];
            }
        }
    }

    /**
     * Determine if the url is relative.
     *
     * @return bool
     */
    public function isRelative()
    {
        return is_null($this->host);
    }

    /**
     * Determine if the url is protocol independent.
     *
     * @return bool
     */
    public function isProtocolIndependent()
    {
        return is_null($this->scheme);
    }

    /**
     * Determine if this is a mailto-link.
     *
     * @return bool
     */
    public function isEmailUrl()
    {
        return $this->scheme === 'mailto';
    }

    /**
     * Set the scheme.
     *
     * @param string $scheme
     *
     * @return $this
     */
    public function setScheme($scheme)
    {
        $this->scheme = $scheme;

        return $this;
    }

    /**
     * Set the host.
     *
     * @param string $host
     *
     * @return $this
     */
    public function setHost($host)
    {
        $this->host = $host;

        return $this;
    }

    /**
     * Remove the fragment.
     *
     * @return $this
     */
    public function removeFragment()
    {
        $this->path = explode('#', $this->path)[0];

        return $this;
    }

    /**
     * Convert the url to string.
     *
     * @return string
     */
    public function __toString()
    {
        $path = starts_with($this->path, '/') ? substr($this->path, 1) : $this->path;

        return "{$this->scheme}://{$this->host}/{$path}".(($this->query!='')?("?".$this->query):'');
    }
}

Handling 301 redirects

How do you handle 301 redirects with this crawler? I have a product where people typically enter the base domain to crawl, and if they enter site.com and it 301s to https://site.com, the crawler stops there cold. How can I prevent this?
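
One thing that may help, assuming Crawler::create() accepts a Guzzle client options array (as discussed in the "Guzzle Config Options not honored fully" issue above), is enabling redirects on the underlying client. A sketch, where $observer is your own CrawlObserver instance:

use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create([
    RequestOptions::ALLOW_REDIRECTS => true,
])
    ->setCrawlObserver($observer)
    ->startCrawling('http://site.com');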

Crawler should respect robots.txt

Right now the crawler checks all links regardless of what's in the robots.txt file. Ideally the crawler should have a respectRobotsTxt method that parses the robots.txt file (or header) and only crawl the allowed links.

I'd accept a PR (with tests) that adds this to the package.
