
Arachnid Web Crawler

This library will crawl all unique internal links found on a given website, up to a specified maximum page depth, and extract SEO-related information from each page. JavaScript-based sites are supported via a headless browser mode.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

    {
        "require": {
            "zrashwani/arachnid": "dev-master"
        }
    }

Then run composer install.
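
Alternatively, a single Composer command can add the same requirement and install it (using the package name from the manifest above):

    composer require zrashwani/arachnid:dev-master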

Getting Started

Here's a quick demo to crawl a website:

    <?php
    require 'vendor/autoload.php';

    $url = 'http://www.example.com';
    $linkDepth = 3;
    // Initiate the crawl
    // By default it will use the Goutte client;
    // if headless browser mode is enabled, it will use the Chrome engine in the background,
    // which is useful for getting JavaScript-based content
    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->enableHeadlessBrowserMode()
            ->traverse();

    // Get link data
    $links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
    print_r($links);
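
For instance, here is a minimal sketch of consuming the result, assuming `getLinksArray()` returns an associative array keyed by URL with the extracted page data for each entry:

    <?php
    // a minimal sketch: walk the crawl results, assuming each entry is keyed by the crawled URL
    foreach ($links as $uri => $info) {
        echo $uri . PHP_EOL; // the crawled URL
        print_r($info);      // the SEO-related data collected for that page
    }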

Advanced Usage

There are other options you can set on the crawler:

You can pass additional options to the underlying Guzzle client, either by supplying an array of options to the constructor or by creating a Goutte scraper client with the desired options:

    <?php
    // the third constructor parameter is an array of options used to configure the underlying Guzzle client
    $crawler = new \Arachnid\Crawler('http://github.com', 2,
                                     ['auth' => array('username', 'password')]);

    // or create a scraper client with the desired options and attach it to the crawler
    $options = array(
        'curl' => array(
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_SSL_VERIFYPEER => false,
        ),
        'timeout' => 30,
        'connect_timeout' => 30,
    );

    $scrapperClient = \Arachnid\Adapters\CrawlingFactory::create(\Arachnid\Adapters\CrawlingFactory::TYPE_GOUTTE, $options);
    $crawler->setScrapClient($scrapperClient);
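
The same factory can be used to build the headless-browser client behind `enableHeadlessBrowserMode()`; the sketch below assumes it exposes a `TYPE_HEADLESS_BROWSER` constant analogous to `TYPE_GOUTTE`:

    <?php
    // a minimal sketch, assuming the factory also accepts a TYPE_HEADLESS_BROWSER type
    // for the Chrome-based client used when headless browser mode is enabled
    $headlessClient = \Arachnid\Adapters\CrawlingFactory::create(
        \Arachnid\Adapters\CrawlingFactory::TYPE_HEADLESS_BROWSER,
        [] // any client options would go here
    );
    $crawler->setScrapClient($headlessClient);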

You can inject a PSR-3 compliant logger object to monitor crawler activity (like Monolog):

    <?php    
    $crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler   

    //set logger for crawler activity (compatible with PSR-3)
    $logger = new \Monolog\Logger('crawler logger');
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
    $crawler->setLogger($logger);

You can set the crawler to visit only pages that match specific criteria by passing a callback closure to the `filterLinks` method:

    <?php
    //filter links according to specific callback as closure
    $links = $crawler->filterLinks(function($link){
                        //crawling only blog links
                        return (bool)preg_match('/.*\/blog.*$/u',$link); 
                    })
                    ->traverse()
                    ->getLinks();

You can use the `LinksCollection` class to get simple statistics about the links, as follows:

    <?php
    $links = $crawler->traverse()
                     ->getLinks();
    $collection = new \Arachnid\LinksCollection($links);

    //getting broken links
    $brokenLinks = $collection->getBrokenLinks();
   
    //getting links for specific depth
    $depth2Links = $collection->getByDepth(2);
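
A minimal sketch of consuming these results, assuming the collections above can be iterated like ordinary arrays:

    <?php
    // a minimal sketch, assuming the collections returned above are iterable and keyed by URL
    foreach ($brokenLinks as $uri => $info) {
        echo 'broken: ' . $uri . PHP_EOL;
    }
    foreach ($depth2Links as $uri => $info) {
        echo 'depth 2: ' . $uri . PHP_EOL;
    }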

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with the corresponding unit tests
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 7.1.0+

Authors

License

MIT Public License
