WebScraper.io PHP API client

API client for cloud.webscraper.io. The cloud-based scraper is a managed scraper for the free Web Scraper Chrome extension. Visit https://cloud.webscraper.io/api to acquire an API key.

Installation

Install the API client with Composer.

composer require webscraperio/api-client-php

You might also need a CSV parser library. Visit http://csv.thephpleague.com/ for more information.

composer require league/csv

Usage

Initialize client

use WebScraper\ApiClient\Client;

$client = new Client([
    'token' => 'paste api token here',
]);
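
To keep the token out of source control, it can be read from the environment before the client is created. The following is a minimal sketch: the helper name loadApiToken() and the variable name WEBSCRAPER_API_TOKEN are assumptions made for illustration and are not part of the API client.

```php
<?php
// Hypothetical helper (not part of the API client): reads the API token
// from an environment variable. The variable name is an assumption.
function loadApiToken(string $envVar = 'WEBSCRAPER_API_TOKEN'): string
{
    $token = getenv($envVar);
    // getenv() returns false when the variable is not set
    if ($token === false || trim($token) === '') {
        throw new RuntimeException("Environment variable {$envVar} is not set.");
    }
    return trim($token);
}

// Usage (requires the API client installed with Composer):
// $client = new WebScraper\ApiClient\Client(['token' => loadApiToken()]);
```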

Create Sitemap

$sitemapJSON = '
{
  "_id": "webscraper-io-landing",
  "startUrl": [
    "http://webscraper.io/"
  ],
  "selectors": [
    {
      "parentSelectors": [
        "_root"
      ],
      "type": "SelectorText",
      "multiple": false,
      "id": "title",
      "selector": "h1",
      "regex": "",
      "delay": ""
    }
  ]
}
';

$sitemap = json_decode($sitemapJSON, true);
$response = $client->createSitemap($sitemap);

Output:

['id' => 123]
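
json_decode() returns null on malformed input, so it can help to validate the sitemap before sending it. The sketch below uses a hypothetical helper, decodeSitemap(), which is not part of the API client. The list of required keys is an assumption taken from the example sitemap above, not a documented schema.

```php
<?php
// Hypothetical helper: decodes sitemap JSON and fails early on bad input.
function decodeSitemap(string $json): array
{
    // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException on malformed JSON
    $sitemap = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($sitemap)) {
        throw new InvalidArgumentException('Sitemap JSON must decode to an object.');
    }
    // Assumed required keys, based on the example sitemap only
    foreach (['_id', 'startUrl', 'selectors'] as $key) {
        if (!array_key_exists($key, $sitemap)) {
            throw new InvalidArgumentException("Sitemap is missing the \"{$key}\" key.");
        }
    }
    return $sitemap;
}

// Usage:
// $response = $client->createSitemap(decodeSitemap($sitemapJSON));
```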

Get Sitemap

$sitemap = $client->getSitemap($sitemapId);

Output:

[
    'id' => 123,
    'name' => 'webscraper-io-landing',
    'sitemap' => '{
        "_id": "webscraper-io-landing",
        "startUrl": [
          "http://webscraper.io/"
        ],
        "selectors": [
          {
            "parentSelectors": [
              "_root"
            ],
            "type": "SelectorText",
            "multiple": false,
            "id": "title",
            "selector": "h1",
            "regex": "",
            "delay": ""
          }
        ]
    }', // note: the sitemap is returned as a JSON string and won't be pretty printed
]
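
Because the 'sitemap' field in the output above is a JSON string rather than an array, it has to be decoded before its selectors can be inspected. A minimal sketch, assuming the response has the shape shown above; selectorIdsFromResponse() is a hypothetical helper and not part of the API client.

```php
<?php
// Hypothetical helper: lists the selector ids of a sitemap returned by getSitemap().
function selectorIdsFromResponse(array $response): array
{
    // The 'sitemap' field holds a JSON string, so decode it first
    $sitemap = json_decode($response['sitemap'], true, 512, JSON_THROW_ON_ERROR);
    // Collect the "id" of every selector; empty list if there are none
    return array_column($sitemap['selectors'] ?? [], 'id');
}

// Usage:
// $ids = selectorIdsFromResponse($client->getSitemap($sitemapId));
```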

Get Sitemaps

$sitemaps = $client->getSitemaps();

Output (Iterator):

[
    [
        'id' => 123,
        'name' => 'webscraper-io-landing',
    ],
    [
        'id' => 124,
        'name' => 'webscraper-io-landing2',
    ],
]

// iterate through all sitemaps
$sitemaps = $client->getSitemaps();
foreach($sitemaps as $sitemap) {
    var_dump($sitemap);
}

// iterate through all sitemaps while manually handling pagination
$iterator = $client->getSitemaps();
$page = 1;
do {
    $sitemaps = $iterator->getPageData($page);
    foreach($sitemaps as $sitemap) {
        var_dump($sitemap);
    }
    $page++;
} while($page <= $iterator->getLastPage());
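
A common use of the iterator is looking up a sitemap id by name. The sketch below assumes each item has the 'id' and 'name' keys shown in the output above; findSitemapIdByName() is a hypothetical helper and not part of the API client. It accepts any iterable, so it works with the returned iterator or with a plain array.

```php
<?php
// Hypothetical helper: returns the id of the first sitemap with the given name.
function findSitemapIdByName(iterable $sitemaps, string $name): ?int
{
    foreach ($sitemaps as $sitemap) {
        if ($sitemap['name'] === $name) {
            return (int) $sitemap['id'];
        }
    }
    // No sitemap with that name was found
    return null;
}

// Usage:
// $sitemapId = findSitemapIdByName($client->getSitemaps(), 'webscraper-io-landing');
```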

Delete Sitemap

$client->deleteSitemap(123);

Output:

"ok"

Create Scraping Job

$client->createScrapingJob([
    'sitemap_id' => 123,
    'driver' => 'fast', // 'fast' or 'fulljs'
    'page_load_delay' => 2000,
    'request_interval' => 2000,
]);

Output:

['id' => 500]
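
The job parameters can be checked locally before the request is made. A minimal sketch: buildScrapingJobParams() is a hypothetical helper, not part of the API client. The allowed driver values come from the comment in the example above; rejecting negative delays is an assumption.

```php
<?php
// Hypothetical helper: builds the parameter array for createScrapingJob().
function buildScrapingJobParams(
    int $sitemapId,
    string $driver = 'fast',
    int $pageLoadDelay = 2000,
    int $requestInterval = 2000
): array {
    // 'fast' and 'fulljs' are the two drivers named in the example above
    if (!in_array($driver, ['fast', 'fulljs'], true)) {
        throw new InvalidArgumentException("Unknown driver: {$driver}");
    }
    // Assumption: negative delays are never meaningful
    if ($pageLoadDelay < 0 || $requestInterval < 0) {
        throw new InvalidArgumentException('Delays must not be negative.');
    }
    return [
        'sitemap_id' => $sitemapId,
        'driver' => $driver,
        'page_load_delay' => $pageLoadDelay,
        'request_interval' => $requestInterval,
    ];
}

// Usage:
// $response = $client->createScrapingJob(buildScrapingJobParams(123, 'fulljs'));
```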

Get Scraping Job

Note: You can also receive a push notification when a scraping job has finished. Repeatedly polling the API to wait for a scraping job to finish is not the recommended approach.

$client->getScrapingJob(500);

Output:

[
    'id' => 500,
    'sitemap_name' => 'webscraper-io-landing',
    'status' => 'scheduling',
    'sitemap_id' => 123,
    'test_run' => 0,
    'jobs_scheduled' => 0,
    'jobs_executed' => 0,
    'jobs_failed' => 0,
    'jobs_empty' => 0,
    'stored_record_count' => 0,
    'request_interval' => 2000,
    'page_load_delay' => 2000,
    'driver' => 'fast',
    'scheduled' => 0, // scraping job was started by scheduler
    'time_created' => '1493370624', // unix timestamp
]
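
The counters and the timestamp in this response can be condensed into a single log line. A minimal sketch, assuming the response contains exactly the keys shown above; summarizeScrapingJob() is a hypothetical helper and not part of the API client.

```php
<?php
// Hypothetical helper: formats a getScrapingJob() response as one log line.
function summarizeScrapingJob(array $job): string
{
    return sprintf(
        'Job %d [%s]: %d/%d executed, %d failed, %d empty, %d records stored, created %s UTC',
        $job['id'],
        $job['status'],
        $job['jobs_executed'],
        $job['jobs_scheduled'],
        $job['jobs_failed'],
        $job['jobs_empty'],
        $job['stored_record_count'],
        // time_created is a unix timestamp delivered as a string
        gmdate('Y-m-d H:i:s', (int) $job['time_created'])
    );
}

// Usage:
// echo summarizeScrapingJob($client->getScrapingJob(500)), "\n";
```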

Get Scraping Jobs

$client->getScrapingJobs($sitemapId = null);

Output (Iterator):

[
    [
        'id' => 500,
        'sitemap_name' => 'webscraper-io-landing',
        ...
    ],
    [
        'id' => 501,
        'sitemap_name' => 'webscraper-io-landing',
        ...
    ],
]

// iterate through all scraping jobs
$scrapingJobs = $client->getScrapingJobs();
foreach($scrapingJobs as $scrapingJob) {
    var_dump($scrapingJob);
}

// iterate through all scraping jobs while manually handling pagination
$iterator = $client->getScrapingJobs();
$page = 1;
do {
    $scrapingJobs = $iterator->getPageData($page);
    foreach($scrapingJobs as $scrapingJob) {
        var_dump($scrapingJob);
    }
    $page++;
} while($page <= $iterator->getLastPage());
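
The iterator can also be filtered on the client side, for example by status. The sketch below assumes each listed job carries a 'status' key like the single-job response does; the list output above is abbreviated, so this is not confirmed. filterJobsByStatus() is a hypothetical helper and not part of the API client.

```php
<?php
// Hypothetical helper: keeps only the scraping jobs with the given status.
function filterJobsByStatus(iterable $jobs, string $status): array
{
    $matches = [];
    foreach ($jobs as $job) {
        // Jobs without a 'status' key are skipped rather than treated as errors
        if (($job['status'] ?? null) === $status) {
            $matches[] = $job;
        }
    }
    return $matches;
}

// Usage:
// $pending = filterJobsByStatus($client->getScrapingJobs(), 'scheduling');
```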

Download Scraping Job JSON

Note: It is good practice to move the download/import task to a queue job. For an example of a queue system, see https://laravel.com/docs/5.8/queues

require "../vendor/autoload.php";

use WebScraper\ApiClient\Client;
use WebScraper\ApiClient\Reader\JsonReader;

$apiToken = "API token here";
$scrapingJobId = 500; // scraping job id here

// initialize API client
$client = new Client([
    'token' => $apiToken,
]);

// download file locally
$outputFile = "/tmp/scrapingjob{$scrapingJobId}.json";
$client->downloadScrapingJobJSON($scrapingJobId, $outputFile);

// read data from file with built in JSON reader
$reader = new JsonReader($outputFile);
$rows = $reader->fetchRows();
foreach($rows as $row) {
    echo "ROW: ".json_encode($row)."\n";
}

// remove temporary file
unlink($outputFile);

// delete scraping job because you probably don't need it
$client->deleteScrapingJob($scrapingJobId);
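
The script above removes the temporary file only if every step succeeds. In a queue job, where an import may fail halfway, a try/finally block makes the cleanup unconditional. A minimal sketch: withTempFile() is a hypothetical helper and not part of the API client.

```php
<?php
// Hypothetical helper: runs $work with a temporary file path and always
// deletes the file afterwards, even if $work throws.
function withTempFile(string $prefix, callable $work)
{
    $path = tempnam(sys_get_temp_dir(), $prefix);
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }
    try {
        return $work($path);
    } finally {
        if (is_file($path)) {
            unlink($path);
        }
    }
}

// Usage with the client and reader from the example above:
// withTempFile('scrapingjob', function (string $path) use ($client, $scrapingJobId) {
//     $client->downloadScrapingJobJSON($scrapingJobId, $path);
//     foreach ((new JsonReader($path))->fetchRows() as $row) {
//         // import $row here
//     }
// });
```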

Delete Scraping Job

$client->deleteScrapingJob(500);

Output:

"ok"

Get Account information

$client->getAccountInfo();

Output:

[
    'email' => 'user@example.com',
    'firstname' => 'John',
    'lastname' => 'Deere',
    'page_credits' => 500,
]
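
The 'page_credits' value can be checked before a scraping job is started. The sketch below assumes that one scraped page consumes one page credit, which the output above does not confirm; hasEnoughPageCredits() is a hypothetical helper and not part of the API client.

```php
<?php
// Hypothetical helper: checks the account's remaining page credits.
// Assumption: one scraped page costs one page credit.
function hasEnoughPageCredits(array $accountInfo, int $pagesNeeded): bool
{
    // Treat a missing 'page_credits' key as zero credits
    return (int) ($accountInfo['page_credits'] ?? 0) >= $pagesNeeded;
}

// Usage:
// if (!hasEnoughPageCredits($client->getAccountInfo(), 1000)) {
//     // skip or postpone the scraping job
// }
```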

Changelog

v0.2.0

  • getScrapingJobs() and getSitemaps() now return iterators
  • getScrapingJobs($sitemapId) can filter by sitemap
