Web Scraping With PHP

Installing Prerequisites
Making an HTTP GET request
Web scraping in PHP with Goutte
Web scraping with Symfony Panther

PHP is a general-purpose scripting language and one of the most popular options for web development. For example, WordPress, the most widely used content management system for building websites, is built with PHP.

PHP provides the building blocks needed to write a web scraper, although doing so from scratch can quickly become complicated. Conveniently, many open-source libraries make web scraping with PHP more accessible.

This article guides you step by step through writing PHP web scraping routines that extract public data from static and dynamic web pages.

For a detailed explanation, see our blog post.

Installing Prerequisites

# Windows
choco install php
choco install composer

# macOS
brew install php
brew install composer

Making an HTTP GET request

<?php
$html = file_get_contents('https://books.toscrape.com/');
echo $html;
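
The snippet above echoes whatever `file_get_contents()` returns, and that function returns `false` when the request fails. The following is a minimal sketch of a more defensive fetch, assuming a hypothetical helper name `fetchHtml` and illustrative timeout and User-Agent values that are not part of the original example:

```php
<?php
// Hypothetical helper: fetch a URL and fail loudly instead of echoing false.
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    // Stream context options for the http(s) wrapper; values are illustrative.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'example-scraper/0.1',
        ],
    ]);

    // file_get_contents() returns false on failure, so check explicitly.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    return $html;
}
```

With this helper, the request above could be written as `echo fetchHtml('https://books.toscrape.com/');`.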

Web scraping in PHP with Goutte

composer init --no-interaction --require="php >=7.1"
composer require fabpot/goutte
composer update

<?php
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'https://books.toscrape.com');
echo $crawler->html();

Locating HTML elements via CSS Selectors

echo $crawler->filter('title')->text(); //CSS
echo $crawler->filterXPath('//title')->text(); //XPath
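
To see how a CSS selector maps to XPath without a network request or Goutte, the sketch below uses PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) as a stand-in. The HTML string is made up for illustration, and the class query is one common way to express `.price_color` in XPath, not necessarily the expression Goutte generates internally:

```php
<?php
// Illustrative HTML; not taken from the real site.
$html = '<html><head><title>Demo shop</title></head><body>'
      . '<p class="price_color big">10.00</p>'
      . '<p class="price_color_old">12.00</p>'
      . '</body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally; real pages rarely validate cleanly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector `title`.
$title = $xpath->query('//title')->item(0)->textContent;

// XPath counterpart of the CSS class selector `.price_color`:
// match the class as a whole word, not as a substring.
$classQuery = "//*[contains(concat(' ', normalize-space(@class), ' '), ' price_color ')]";
$matches = $xpath->query($classQuery);

echo $title . PHP_EOL;            // prints "Demo shop"
echo $matches->length . PHP_EOL;  // prints 1 (price_color_old is not matched)
```

The whole-word match matters because a plain `contains(@class, 'price_color')` would also select the `price_color_old` element.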

Extracting the elements

function scrapePage($url, $client)
{
    $crawler = $client->request('GET', $url);
    // Each book on the page sits in an element with the class "product_pod"
    $crawler->filter('.product_pod')->each(function ($node) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        echo $title . "-" . $price . PHP_EOL;
    });
}

Handling pagination

function scrapePage($url, $client, $file)
{
    //...
    // Handling pagination
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}
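
The hard-coded prefix above works only while every page lives under `/catalogue/`. As a hedged alternative, the sketch below resolves the `href` against the directory of the page it was found on. The helper name `resolveNextUrl` is hypothetical, and it handles only simple relative paths such as `page-2.html` or `catalogue/page-2.html`; links with `../`, a leading `/`, or a non-default port are outside its scope:

```php
<?php
// Hypothetical helper: resolve a simple relative href against the page it came from.
// Limitations: no support for "../", leading "/", ports, or protocol-relative links.
function resolveNextUrl(string $currentUrl, string $href): string
{
    // Already absolute: return unchanged.
    if (preg_match('#^https?://#i', $href) === 1) {
        return $href;
    }
    $parts = parse_url($currentUrl);
    $path  = $parts['path'] ?? '/';
    // Keep everything up to and including the last slash (the "directory").
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $parts['scheme'] . '://' . $parts['host'] . $dir . $href;
}
```

Under these assumptions, `scrapePage()` could return `resolveNextUrl($url, $next_page)` instead of concatenating a fixed prefix, so the same function would also work if the crawl started from a page outside `/catalogue/`.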

Writing Data to CSV

function scrapePage($url, $client, $file)
{
    $crawler = $client->request('GET', $url);
    $crawler->filter('.product_pod')->each(function ($node) use ($file) {
        $title = $node->filter('.image_container img')->attr('alt');
        $price = $node->filter('.price_color')->text();
        fputcsv($file, [$title, $price]);
    });
    try {
        $next_page = $crawler->filter('.next > a')->attr('href');
    } catch (InvalidArgumentException $e) { // Next page not found
        return null;
    }
    return "https://books.toscrape.com/catalogue/" . $next_page;
}

$client = new Client();
$file = fopen("books.csv", "a");
$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";

while ($nextUrl) {
    echo "<h2>" . $nextUrl . "</h2>" . PHP_EOL;
    $nextUrl = scrapePage($nextUrl, $client, $file);
}
fclose($file);
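
Because the file above is opened in `"a"` (append) mode, running the script twice adds every row a second time. The sketch below shows one possible variation that overwrites the file, writes a header row, and reads the result back to check the quoting. The helper name `writeRowsToCsv`, the header names, and the sample rows are assumptions made for illustration:

```php
<?php
// Hypothetical helper: write a header plus rows, overwriting any previous file.
function writeRowsToCsv(string $path, array $header, array $rows): void
{
    // "w" truncates the file, so repeated runs do not accumulate duplicates.
    $file = fopen($path, 'w');
    if ($file === false) {
        throw new RuntimeException("Could not open {$path} for writing");
    }
    // Delimiter, enclosure, and escape character are passed explicitly.
    fputcsv($file, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($file, $row, ',', '"', '\\');
    }
    fclose($file);
}

// Made-up sample data standing in for scraped titles and prices.
$path = tempnam(sys_get_temp_dir(), 'books');
writeRowsToCsv($path, ['title', 'price'], [
    ['A Light in the Attic', '51.77'],
    ['Title, with a comma', '10.00'],
]);

// Read the file back to confirm that values containing commas round-trip.
$file = fopen($path, 'r');
$readBack = [];
while (($row = fgetcsv($file, 0, ',', '"', '\\')) !== false) {
    $readBack[] = $row;
}
fclose($file);
unlink($path);
```

`fputcsv()` quotes the second title automatically, so the embedded comma does not split it into two columns when the file is read again.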

Web scraping with Symfony Panther

composer init --no-interaction --require="php >=7.1" 
composer require symfony/panther
composer update

brew install chromedriver

Sending HTTP requests with Panther

<?php
require 'vendor/autoload.php';
use \Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$client->get('https://quotes.toscrape.com/js/');

Locating HTML elements via CSS Selectors

$crawler = $client->waitFor('.quote');
$crawler->filter('.quote')->each(function ($node) {
    $author = $node->filter('.author')->text();
    $quote = $node->filter('.text')->text();
    echo $author . " - " . $quote . PHP_EOL;
});

Handling pagination

while (true) {
    $crawler = $client->waitFor('.quote');
…
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}

Writing data to a CSV file

$file = fopen("quotes.csv", "a");
while (true) {
    $crawler = $client->waitFor('.quote');
    $crawler->filter('.quote')->each(function ($node) use ($file) {
        $author = $node->filter('.author')->text();
        $quote = $node->filter('.text')->text();
        fputcsv($file, [$author, $quote]);
    });
    try {
        $client->clickLink('Next');
    } catch (Exception $e) { // No "Next" link: last page reached
        break;
    }
}
fclose($file);

If you wish to find out more about web scraping with PHP, see our blog post.
