
oxylabs / web-scraping-php


A tutorial and code samples of web scraping with PHP



Web Scraping With PHP


PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most widely used content management system for building websites, is written in PHP.

PHP provides the building blocks needed to write a web scraper, but wiring them together by hand quickly becomes complicated. Conveniently, many open-source libraries make web scraping with PHP more accessible.

This article guides you step by step through writing PHP web scraping routines that extract public data from static and dynamic web pages.

For a detailed explanation, see our blog post.

Installing Prerequisites

# Windows
choco install php
choco install composer

or

# macOS
brew install php
brew install composer

Making an HTTP GET request

<?php
$html = file_get_contents('https://books.toscrape.com/');
echo $html;
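
file_get_contents() returns false when the request fails, so a scraper usually needs to check the result before parsing it. The following is a minimal sketch and not part of the original samples: the helper name fetchHtml, the timeout value, and the User-Agent string are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: fetches a URL and returns its body, or null on failure.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for PHP's built-in HTTP stream wrapper
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'php-scraper-example/1.0', // assumed value
        ],
    ]);

    // @ suppresses the warning PHP emits on failure; the false return is checked instead
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Calling fetchHtml('https://books.toscrape.com/') would then yield either the page source or null, which the caller can test before going any further.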

Web scraping in PHP with Goutte

composer init --no-interaction --require="php >=7.1"
composer require fabpot/goutte
composer update

<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');
echo $crawler->html();

Locating HTML elements via CSS Selectors and XPath

echo $crawler->filter('title')->text(); //CSS
echo $crawler->filterXPath('//title')->text(); //XPath
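
To try the same XPath expression without installing any Composer packages, PHP's bundled DOM extension can stand in for Goutte. This is only a sketch: it swaps the Goutte crawler for DOMDocument and DOMXPath, and the inline HTML string is made up for the example.

```php
<?php
// Sample markup invented for this example
$html = '<html><head><title>Sample page</title></head><body><h1>Hello</h1></body></html>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Run the same XPath query used with filterXPath() above
$xpath = new DOMXPath($dom);
$titleNodes = $xpath->query('//title');

// query() returns a node list; guard against an empty result
$title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : null;

echo $title . PHP_EOL;
```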

Extracting the elements

function scrapePage($url, $client)
{
    $crawler = $client->request('GET', $url);
    // Each book on the page sits in an element with the class "product_pod"
    $crawler->filter('.product_pod')->each(function ($node) {
        // The title is read from the cover image's alt attribute
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        echo $title . "-" . $price . PHP_EOL;
    });
}
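
The price is extracted as display text, currency symbol included. If a numeric value is needed, for example to sort or sum prices, the text can be normalized first. The helper below is a hypothetical sketch: the name parsePrice and the sample value are assumptions, and it only handles prices that use a dot as the decimal separator.

```php
<?php
// Hypothetical helper: keeps digits and the decimal point, drops everything else
// (currency symbols, whitespace, thousands separators written as commas).
function parsePrice(string $priceText): float
{
    $digits = preg_replace('/[^0-9.]/', '', $priceText);

    return (float) $digits;
}

// Sample value for illustration
echo parsePrice('£51.77') . PHP_EOL;
```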

Handling pagination

function scrapePage($url, $client, $file)
{
    //...
    // Handling pagination
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}
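
The hard-coded prefix works because the href of the "next" link is relative to the catalogue directory. A slightly more general approach resolves the href against the URL of the page it was found on. The function below is a simplified sketch under stated assumptions: the name resolveNextUrl is hypothetical, and it ignores ports, query strings, and "../" segments.

```php
<?php
// Hypothetical helper: resolves a link's href against the URL of the current page.
function resolveNextUrl(string $currentUrl, string $href): string
{
    // Already absolute: use it unchanged
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($currentUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Root-relative href: append to scheme and host
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Document-relative href: append to the directory of the current page
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}

echo resolveNextUrl('https://books.toscrape.com/catalogue/page-1.html', 'page-2.html') . PHP_EOL;
```

With this helper, the same code would also work when the crawl starts from the site root, where the href already includes the catalogue/ segment.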

Writing Data to CSV

function scrapePage($url, $client, $file)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) use ($file) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        fputcsv($file, [$title, $price]);
    });
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}
$client = new Client();
$file = fopen("books.csv", "a");
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";
while ($nextUrl) {
    echo "<h2>" . $nextUrl . "</h2>" . PHP_EOL;
    $nextUrl = scrapePage($nextUrl, $client, $file);
}
fclose($file);
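
Because the file is opened in append mode, running the script again adds rows to the existing file, and the output has no header row. One possible refinement, sketched below, writes a header only when the file is new or empty. The helper name openCsvWithHeader and the column names are assumptions made for the example; the separator, enclosure, and escape arguments are passed to fputcsv() explicitly so the call does not depend on default values.

```php
<?php
// Hypothetical helper: opens a CSV file for appending and writes the header
// row only if the file does not exist yet or is empty.
function openCsvWithHeader(string $path, array $header)
{
    // filesize() results are cached per path, so clear the cache first
    clearstatcache(true, $path);
    $isNew = !file_exists($path) || filesize($path) === 0;

    $file = fopen($path, 'a');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    if ($isNew) {
        fputcsv($file, $header, ',', '"', '\\');
    }

    return $file;
}

// Usage with a temporary file and a made-up row
$path = tempnam(sys_get_temp_dir(), 'books');
$file = openCsvWithHeader($path, ['title', 'price']);
fputcsv($file, ['Sample title', '51.77'], ',', '"', '\\');
fclose($file);
echo file_get_contents($path);
unlink($path);
```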

Web scraping with Symfony Panther

composer init --no-interaction --require="php >=7.1" 
composer require symfony/panther
composer update
brew install chromedriver

Sending HTTP requests with Panther

<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$client->get('https://quotes.toscrape.com/js/');

Locating HTML elements via CSS Selectors

$crawler = $client->waitFor('.quote');
$crawler->filter('.quote')->each(function ($node) {
    $author = $node->filter('.author')->text();
    $quote = $node->filter('.text')->text();
    echo $author . " - " . $quote . PHP_EOL;
});

Handling pagination

while (true) {
    $crawler = $client->waitFor('.quote');
…
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}
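
A while (true) loop that stops only when an exception is thrown can run for a very long time on a large site or if the page never changes. One way to bound it is to cap the number of pages. The sketch below shows the idea in isolation, without Panther: the function name paginate, the callback contract (return true while another page exists), and the default cap are all assumptions made for the example.

```php
<?php
// Hypothetical helper: calls $scrapeAndAdvance for each page until it returns
// false or $maxPages pages have been processed. Returns the number of pages processed.
function paginate(callable $scrapeAndAdvance, int $maxPages = 100): int
{
    $pages = 0;
    while ($pages < $maxPages) {
        $pages++;
        if (!$scrapeAndAdvance($pages)) {
            break; // the callback reported that there is no next page
        }
    }

    return $pages;
}

// Stand-in for a real page handler: pretends the site has 3 pages.
$visited = paginate(function (int $pageNumber): bool {
    return $pageNumber < 3; // true while a "Next" link would exist
});

echo $visited . PHP_EOL;
```

In the Panther loop above, the callback would scrape the current page, call clickLink('Next') inside the try block, and return false from the catch branch.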

Writing data to a CSV file

$file = fopen("quotes.csv", "a");
while (true) {
    $crawler = $client->waitFor('.quote');
    $crawler->filter('.quote')->each(function ($node) use ($file) {
        $author = $node->filter('.author')->text();
        $quote = $node->filter('.text')->text();
        fputcsv($file, [$author, $quote]);
    });
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}
fclose($file);

If you wish to find out more about web scraping with PHP, see our blog post.

Contributors

augustoxy, oxyjohan, oxyjowyd, oxylabsorg
