
Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites

License: MIT License

Topics: php, scraping, crawler, seo

arachnid's Introduction

Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords and canonicalLink.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.



How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}

Then run composer install.

Getting Started

Basic Usage:

Here's a quick demo to crawl a website:

    <?php
    require 'vendor/autoload.php';

    $url = 'http://www.example.com';
    $linkDepth = 3;
    // Initiate the crawl; by default it uses the HTTP client adapter (GoutteClient)
    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->traverse();

    // Get link data
    $links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
    print_r($links);
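
Each entry in the returned array is an associative array of extracted data; the dumps in the issues further down show keys such as absolute_url, status_code, depth and title. Continuing from the example above, here is a minimal, hedged sketch of consuming that output (it assumes those key names and guards against entries where some of them are missing):

    <?php
    // Hedged example: iterate the crawl results and print a short report.
    // Key names follow the array dumps shown in the issues below; not every
    // key is present on every entry, so each access is guarded.
    foreach ($crawler->getLinksArray() as $uri => $info) {
        $status = $info['status_code'] ?? 'n/a';
        $title  = $info['title'] ?? '';
        printf("%s [%s] %s\n", $uri, $status, $title);
    }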

Enabling Headless Browser mode:

Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.

The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:

    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->enableHeadlessBrowserMode()
            ->traverse()
            ->getLinksArray();

To use this, you need chromedriver installed on your machine. You can use dbrekelmans/browser-driver-installer to install it locally:

composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers

Advanced Usage:

Additional options can be set on the underlying HTTP client, either by passing an array of options to the constructor or by creating an HTTP client scraper with the desired options:

    <?php
        use \Arachnid\Adapters\CrawlingFactory;
        // the third parameter is the array of options used to configure the HTTP client
        $clientOptions = ['auth_basic' => array('username', 'password')];
        $crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);
           
        // or by creating the scraping client explicitly and setting it on the crawler
        $options = array(
            'verify_host' => false,
            'verify_peer' => false,
            'timeout' => 30,
        );
                        
        $scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
        $crawler->setScrapClient($scrapperClient);

You can inject a PSR-3 compliant logger object (such as Monolog) to monitor crawler activity:

    <?php    
    $crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler   

    //set logger for crawler activity (compatible with PSR-3)
    $logger = new \Monolog\Logger('crawler logger');
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
    $crawler->setLogger($logger);
    ?>

You can restrict the crawler to pages matching specific criteria by passing a callback closure to the filterLinks method:

    <?php
    //filter links according to specific callback as closure
    $links = $crawler->filterLinks(function($link) {
                        //crawling only links with /blog/ prefix
                        return (bool)preg_match('/.*\/blog.*$/u', $link); 
                    })
                    ->traverse()
                    ->getLinks();
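
A related need that comes up in the issues below is restricting the crawl to a single host. Here is a minimal sketch using the same filterLinks callback shown above; the host check itself is plain PHP (parse_url) rather than anything provided by the library, and 'www.example.com' is a placeholder:

    <?php
    // Hedged example: follow only links that are relative (no host component)
    // or that point at the placeholder host below; everything else is skipped.
    $links = $crawler->filterLinks(function ($link) {
                        $host = parse_url($link, PHP_URL_HOST);
                        return $host === null || $host === 'www.example.com';
                    })
                    ->traverse()
                    ->getLinksArray();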

You can use the LinksCollection class to get simple statistics about the links, as follows:

    <?php
    $links = $crawler->traverse()
                     ->getLinks();
    $collection = new LinksCollection($links);

    //getting broken links
    $brokenLinks = $collection->getBrokenLinks();
   
    //getting links for specific depth
    $depth2Links = $collection->getByDepth(2);

    //getting external links inside site
    $externalLinks = $collection->getExternalLinks();
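
The package depends on illuminate/support (see the "Adding a release" issue below), so it is reasonable, though not guaranteed, to treat LinksCollection like a standard Laravel collection. A hedged sketch of a quick summary built on the helpers above, under that assumption:

    <?php
    // Hedged example: assumes LinksCollection inherits the usual
    // Illuminate\Support\Collection helpers (count(), keys(), toArray()).
    printf("%d broken links, %d external links\n", count($brokenLinks), count($externalLinks));
    print_r($brokenLinks->keys()->toArray()); // URIs of the broken links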

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit tests
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 7.2.0+

Authors

License

MIT License

arachnid's People

Contributors

flangofas, howtomakeaturn, msjyoo, noplanman, onema, spekulatius, zrashwani


arachnid's Issues

Parent > Children > Grand Children

How do I get the URL/page title of the parent page of a crawled link?

Suppose I have a URL "www.example.com" and it is the parent; it has a child "www.example.com/pageone.html", and the grandchild is "www.example.com/pageone/pagetwo.html".

After traversing pagetwo.html, how do I get the page title/URL of the other two URLs?

Crawl the whole site, page inside another page.

Hi.

Thank for the script.

I can't find how to scan the site deeper. I mean, there is a front page like https://example.com, and on that page there are links to other pages, which in turn contain further pages with links. With the code below, the crawler visits only the pages linked from the front page, not the links inside those pages.

E.g. on the front page there is a link to https://example.com/links, and on that page there are a few more links, but the script doesn't visit the links on that page.

<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

set_time_limit(6000);

$linkDepth = 500;
// Initiate crawl    

$crawler = new \Arachnid\Crawler("https://example,com", $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();

It's possible to modify the code above, but if there is an out-of-the-box solution, that would be better.

Thx

Response 401 - Authentication

Hi, I need authentication against LDAP via HTTP auth, and it gives me a 401 status code.

How can I do this using 'CookieJar'? Like in the comment:
http://zrashwani.com/simple-web-spider-php-goutte/#comment-92

It gives me:

Array
(
    [http://somehost] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )
            [absolute_url] => http://somehost
            [frequency] => 1
            [visited] => 
            [external_link] => 
            [original_urls] => Array
                (
                    [0] => http://somehost
                )
            [status_code] => 401
        )
)

Adding a release

Hi, can you please add a release for commit 244180c?

We are not able to use HEAD anymore as we have Laravel 5.3 and your package now requires Illuminate/Support 5.4.

Thank you,
Anthony

Does not support js-rendered sites

Sites like https://taxibambino.com, or other sites whose HTML is rendered by JavaScript, are not supported by this crawler, and I understand that by design (being a back-end based crawler) this is not fixable. I am afraid that, lacking support for JS-only sites, this crawler becomes obsolete.

Catchable fatal error

I'm getting this error while running the code:

Catchable fatal error: Argument 1 passed to Front\simpleCrawler::extractTitleInfo() must be an instance of Front\DomCrawler, instance of Symfony\Component\DomCrawler\Crawler given, called in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 336 and defined in C:\xampp\htdocs\webAN\src\Front\FrontController.php on line 453

404 error is hardcoded

Hello, the error_code is hardcoded to always return a 404, but in real life we are often dealing with a 403, a 500, etc. It would be nice to see a bit more info; I know this is not difficult to check. :)

For example, the method could look something like this:

function check_http_code($a)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $a);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);
    return $headers['http_code'];
}

Update Goutte dependency version

Hi,

I'm trying to use Goutte and Arachnid together to crawl and then scrape content from a website. I've installed Goutte, which currently sits at version 3.1. I'm unable to install Arachnid alongside this version of Goutte because it requires Goutte version ~1.

Is there any chance we can get the composer.json requirements either updated, or loosened to just accept any version of Goutte? Or is there a reason for this library to require that version of Goutte?

Arachnid's composer.json requirements:

"require": {
    "php": ">=5.4.0",
    "fabpot/goutte": "~1"
}

My composer.json requirements, using latest stable version of Goutte:

"require": {
    "fabpot/goutte": "^3.1"
}

Thanks

Absolute links and the actual URLs are in some cases rendered wrongly.

For example, the page http://toastytech.com/evil/ with $linkDepth = 2 gives a lot of incorrect URLs. You may say that this webpage is very old and no one writes relative URLs like "../yourUrlPath" anymore, but I think this should still be fixed :)

"/evil/../links/index.html" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://toastytech.com/evil/../links/index.html"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://toastytech.com/evil/"
    "depth" => 1
    "status_code" => 200
    "title" => "Nathan's Links"
    "meta_keywords" => ""
    "meta_description" => ""
    "h1_count" => 1
    "h1_contents" => array:1 [ …1]

problems with some of my websites

This is the error it gives me:

Array
(
    [http://www.*********.com] => Array
        (
            [links_text] => Array
                (
                    [0] => BASE_URL
                )

            [absolute_url] => http://www.*********.com
            [frequency] => 1
            [visited] => 
            [external_link] => 
            [original_urls] => Array
                (
                    [0] => http://www.***********.com
                )

            [status_code] => 404
            [error_code] => 0
            [error_message] => The current node list is empty.
        )

)

Other errors:

Warning: array_replace(): Argument #2 is not an array in /var/www/test/includes/vendor/symfony/browser-kit/CookieJar.php on line 200

Thanks!!

Named array keys

What is the rationale for using named $nodeUrl and $nodeText array keys?

$childLinks[$hash]['original_urls'][$nodeUrl] = $nodeUrl;
$childLinks[$hash]['links_text'][$nodeText] = $nodeText;

Crawler.php Line 363 & 364

Would it not be more consistent and easier to parse if we changed to numerical keys?

$childLinks[$hash]['original_urls'][] = $nodeUrl;
$childLinks[$hash]['links_text'][] = $nodeText;

The use of `$this` inside closures

Your package supports PHP version 5.3, but there are several closures that use $this, yet $this cannot be used in anonymous functions before PHP 5.4.

https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L181
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L193
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L200
https://github.com/codeguy/arachnid/blob/master/src/Arachnid/Crawler.php#L235

A simple solution would be to require version 5.4. This would also have a nice side effect, as it would use the latest version of Goutte.

Sites like LinkedIn should probably be excluded

Some social giants should be excluded from the results if they require a login to be accessed, in order to show only the really broken links. In the current situation we get some false positives with sites like LinkedIn.

filterLinks does not work

Hi, filterLinks doesn't work in this example:

$url = "http://uk.louisvuitton.com/eng-gb/men/men-s-bags/fashion-shows/_/N-54s1t";
$crawler = new Crawler($url, 2); 
$links = $crawler
				->filterLinks(function($link){
                    
                    return (bool) preg_match('/\/eng-gb\/products\/(.*)/',$link); 
                })
                ->traverse()
                ->getLinks();

What is wrong?

filterLinks Issue

I have written this code to traverse only links that have google as the domain name, but it returns an empty array. What am I missing?

$links = $crawler
    ->filterLinks(function($link){
        return (bool) preg_match('/\/google\/(.*)/', $link);
    })
    ->traverse()
    ->getLinksArray();
print_r($links);

Improvement suggestions

We at sulu-cmf want to use your crawler to create an HTTP cache warmer and website information extractor. I will start today using your class in a new Symfony bundle.

For this reason, I would like to ask whether you have time to keep contributing to this class.

I will create a PR to include some improvements we need:

  • Extract metadata
  • Get status_code of external links to check broken links
  • Perhaps the possibility to add a "progress bar"

I hope you will be able to merge this PR. Thanks for your good work so far (= it saves me a lot of time.

With best regards
sulu-cmf

How to find out from which URL a URL was crawled?

So let's say I am crawling a website http://website.com and it has a broken link http://website.com/dir/subdir/red located on http://website.com/dir/subdir. Is there a way that, along with all the other data, there would also be a key "source" => "http://website.com/dir/subdir"?

Also, is there a way to force all these keys on all of the crawled URLs, not just a fraction of them as is currently the case?

"original_urls" => 
    "links_text" =>
    "absolute_url" => 
    "external_link" => 
    "visited" => 
    "frequency" => 
    "depth" => 
    "status_code" => 
    "error_code" => 
    "error_message" =>

Timeout configuration for Goutte client

It's not currently possible to configure a timeout for the Guzzle client which is used to make HTTP requests when spidering a site. Without a default, Guzzle defaults to 0 timeout – i.e. it'll wait indefinitely until a response is received. (Which arguably isn't a sensible default anyway.)

I'm trying to spider a site which contains a link to a dead server. Requests to the URL never time out, meaning the spider process gets stuck on this URL and never proceeds.

The timeout is configured when constructing a new Guzzle client, which is currently done in Arachnid\Crawler::getScrapClient():

protected function getScrapClient()
{
    $client = new GoutteClient();
    $client->followRedirects();

    $guzzleClient = new \GuzzleHttp\Client(array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
    ));
    $client->setClient($guzzleClient);

    return $client;
}

It would be really helpful if a timeout was configured here. To do that, all we need to do is change the configuration array which is passed to the Guzzle client constructor method:

$guzzleClient = new \GuzzleHttp\Client(array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
));

I think a sensible default would be a 30 second timeout, but it would be great to have that configurable. That could either be an additional parameter in the constructor method, or alternatively an object property which can be changed.

In fact – it might make sense to allow us to add anything to the Guzzle constructor configuration. Perhaps again by means of a class property or constructor parameter whereby we can pass in an array of configuration options. This could be useful when configuring other client options, for example HTTP authentication:

$crawler = new Crawler($url, 3, array(
    'timeout' => 5,
    'connect_timeout' => 5,
    'auth' => array('username', 'password'),
));

Thoughts? I'd be happy to put together a PR for this, provided we can get some agreement on how this should be configured (class constructor, public property, static property, getter/setter, etc.)

tel: links get crawled

Despite being blacklisted in checkIfCrawlable, tel: links get crawled.

Tested via:

$crawler = new \Arachnid\Crawler('https://www.handyflash.de/', 3);
$crawler->traverse();

In the Apache access log, a hit like

www.handyflash.de:443 213.XXX.YYY.ZZZ - - [10/Jun/2016:12:24:42 +0200] "GET /tel:+4923199778877 HTTP/1.1" 404 37312 "-" "Symfony2 BrowserKit" 0

is recorded.

Abandoned?

I tried to pull this through Composer, but it was flagged as abandoned. Just checking if this was intentional :)

'Don't crawl external links' option

It would be great to have the option to disable the crawling of external links.

I'd like to crawl an entire website, but am not interested in external links. By setting the 'depth' option to something suitably high in order to capture the entire website, I also end up doing a deep crawl of external websites.

Undefined index: external_link

I'm getting an Undefined index: external_link error in LinksCollection.php (line 51). I suspect it may be because the site I'm indexing has some "javascript:void(0)" links on buttons and so forth that are tied to jQuery events, etc. Wondering if you might have any insight or ideas. Any help would be greatly appreciated. Thanks.

Images treated as 404 - false positive

Hello, in this case images are reported as 404, while in reality they have valid URLs. This should be fixed.

array:8 [▼
  "/images/2017-putsschema-1.png" => array:9 [▼
    "original_urls" => array:1 [▼
      "/images/2017-putsschema-1.png" => "/images/2017-putsschema-1.png"
    ]
    "links_text" => array:1 [▼
      "PUTSSCHEMA 1" => "PUTSSCHEMA 1"
    ]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-1.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-2.png" => array:9 [▼
    "original_urls" => array:1 [▶]
    "links_text" => array:1 [▶]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-2.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-3.png" => array:9 [▼
    "original_urls" => array:1 [▶]
    "links_text" => array:1 [▶]
    "absolute_url" => "https://ssfonsterputs.se/images/2017-putsschema-3.png"
    "external_link" => false
    "visited" => false
    "frequency" => 1
    "source_link" => "https://ssfonsterputs.se/putsschema/"
    "depth" => 2
    "status_code" => 404
  ]
  "/images/2017-putsschema-4.png" => array:9 [▶]
  "/images/2017-putsschema-5.png" => array:9 [▶]
  "/images/2017-putsschema-6.png" => array:9 [▶]
  "/images/2017-putsschema-7.png" => array:9 [▶]
  "/images/2017-putsschema-8.png" => array:9 [▶]
]
