GithubHelp home page GithubHelp logo

gauthiernpn / crawler-2 Goto Github PK

View Code? Open in Web Editor NEW

This project forked from spatie/crawler

0.0 0.0 0.0 83 KB

Crawl all links found on a website

Home Page: https://murze.be/2015/11/building-a-crawler-in-php/

License: MIT License

PHP 96.32% JavaScript 2.78% Shell 0.90%

crawler-2's Introduction

Crawl links on a website

Latest Version on Packagist Software License Build Status SensioLabsInsight Quality Score StyleCI Total Downloads

This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Postcardware

You're free to use this package (it's MIT-licensed), but if it makes it to your production environment you are required to send us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

The best postcards will get published on the open source page on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url $url
 * @param \Psr\Http\Message\ResponseInterface $response
 * @param \Spatie\Crawler\Url $foundOn
 */
public function hasBeenCrawled(Url $url, ResponseInterface $response, Url $foundOn);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();

Filtering certain urls

You can tell the crawler not to visit certain urls by passing using the setCrawlProfile-function. That function expects an objects that implements the Spatie\Crawler\CrawlProfile-interface:

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(Url $url): bool;

This package comes with two CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all urls on all pages including urls to an external site.
  • CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.

Setting the number of concurrent requests

To improve the speed of the crawl the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method.

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->setConcurrency(1) //now all urls will be crawled one by one
    ->startCrawling($url);

Changelog

Please see CHANGELOG for more information what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

To run the tests you'll have to start the included node based server first in a separate terminal window.

cd tests/server
./start_server.sh

With the server running, you can start testing.

vendor/bin/phpunit

Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

Credits

About Spatie

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

License

The MIT License (MIT). Please see License File for more information.

crawler-2's People

Contributors

blueclock avatar freekmurze avatar ilazaridis avatar juukie avatar mallardduck avatar sebastiandedeyne avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.