optionalg / py-parsehub

This project forked from hronecviktor/py-parsehub

Python API for parsehub.com web scraping service

License: BSD 2-Clause "Simplified" License

Python 100.00%

py-parsehub

Python3 ParseHub module


What is ParseHub?

ParseHub is a service that provides automated web scraping. You design a template using their Mozilla plugin and then access it through a REST API. Results can be retrieved in JSON and CSV (coming soon) formats. The templates are flexible and use machine learning to grasp complex hierarchies.

You can extract data from anywhere. ParseHub works with single-page apps, multi-page apps and just about any other modern web technology. ParseHub can handle JavaScript, AJAX, cookies, sessions and redirects. You can easily fill in forms, loop through dropdowns, log in to websites, click on interactive maps and even deal with infinite scrolling.

See the ParseHub homepage.


###Prerequisites

  • Python 3 (tested with 3.2.3)
  • urllib3 ($ pip3 install urllib3)

Usage

Initialize

>>> from ph2 import ParseHub  
>>> ph = ParseHub('<redacted API-key>')  

Get all projects

>>> print(ph.projects)
[<PhProject 'Project1' token 'tCWOS3cB-ZM8qtShXw6j8tyOHZ84hLik'>, <PhProject 'Project2' token 'tfs9Gv10cixnCtrk0iz0-u62r7lSdNt8'>]

Run a given project

>>> p1 = ph.projects[0]
>>> r1 = p1.run()

Is data available for download?

>>> r1.check_available()
1

Get data if available

>>> r1.get_data()
[{'link': 'http://www.123greetings.com/', 'title': '123Greetings'}, {'link': 'https://webmail.123-reg.co.uk/', 'title': 'Welcome to 123-reg Webmail | Webmail log in | 123-reg'}.....]

A blocking request for data

>>> r2 = p1.run()
>>> r2.get_data_sync()
[{'link': 'http://www.123greetings.com/', 'title': '123Greetings'}, {'link': 'https://webmail.123-reg.co.uk/', 'title': 'Welcome to 123-reg Webmail | Webmail log in | 123-reg'}.....]
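
The relationship between `check_available()` and a blocking fetch can be sketched as a polling loop. This is only an illustration of the idea, not how `get_data_sync()` is implemented in this module: `wait_for_data`, its `poll_interval`, `timeout` and `sleep` parameters, and the `FakeRun` stand-in are all hypothetical names introduced here so the example runs without network access.

```python
import time


def wait_for_data(run, poll_interval=5.0, timeout=300.0, sleep=time.sleep):
    """Poll run.check_available() until it is truthy, then return run.get_data().

    Hypothetical helper, not part of py-parsehub. Raises TimeoutError if the
    data is still not ready after roughly `timeout` seconds of waiting.
    """
    waited = 0.0
    while not run.check_available():
        if waited >= timeout:
            raise TimeoutError("run did not finish within %.0f seconds" % timeout)
        sleep(poll_interval)
        waited += poll_interval
    return run.get_data()


class FakeRun:
    """Stand-in for a run object so the sketch works offline."""

    def __init__(self, ready_after):
        # Number of checks that report "not ready" before data appears.
        self.checks_left = ready_after

    def check_available(self):
        self.checks_left -= 1
        return 1 if self.checks_left < 0 else 0

    def get_data(self):
        return [{"title": "example"}]


# The fake reports ready on its third check; sleeping is stubbed out.
result = wait_for_data(FakeRun(ready_after=2), sleep=lambda seconds: None)
print(result)
```

With a real run object you would drop the `sleep` override and pick a `poll_interval` that does not hammer the API.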

Cancel a running job

>>> r3 = p1.run()
>>> r3.cancel()

Or delete it altogether

>>> r3.delete()

Get a list of runs of a project

>>> p1.get_runs()
[<PhRun object token:tbcBSs9i7WHWtx3nqXW7vwp9>, ....]

You can specify an offset to leave out the last x runs

>>> p1.get_runs(5)
[<PhRun object token:tbcBSs9i7WHWtx3nqXW7vwp9>, ....]

Projects hold a reference to their last completed run...

>>> p1.last_ready_run
<PhRun object token:tCNPbuLm7wd-Aqmb9WHHZMV0>

...and the last run no matter what its status is

>>> p1.last_run.status
'running'

Runs can be compared based on their md5sum to detect changes between runs

>>> p1.get_runs()[0] == p1.get_runs()[1]
True
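
A minimal sketch of how equality by digest might work. `RunSketch` is a hypothetical stand-in, not the module's `PhRun` class, and it assumes the digest is computed locally from the scraped data; the real module may simply compare the `md5sum` value that comes back with each run.

```python
import hashlib
import json


class RunSketch:
    """Hypothetical stand-in for a run; only the comparison logic is shown."""

    def __init__(self, run_token, data):
        self.run_token = run_token
        # Assumption: hash a canonical JSON dump of the data. The real module
        # may take the md5sum from the API response instead.
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        self.md5sum = hashlib.md5(payload).hexdigest()

    def __eq__(self, other):
        if not isinstance(other, RunSketch):
            return NotImplemented
        # Two runs are "equal" when their data digests match,
        # even though their run tokens differ.
        return self.md5sum == other.md5sum

    def __hash__(self):
        return hash(self.md5sum)


first = RunSketch("run-a", [{"title": "123Greetings"}])
second = RunSketch("run-b", [{"title": "123Greetings"}])
changed = RunSketch("run-c", [{"title": "Something else"}])

print(first == second)   # True: same data, different tokens
print(first == changed)  # False: the scraped data changed
```

Comparing digests rather than full payloads keeps the check cheap when runs return large result sets.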

Both runs and projects can have their attributes easily printed for debugging

>>> p1.last_run.pprint()
data : [...]
data_ready : 1
end_time : 2015-04-13T15:30:10
md5sum : 51b246040a0ee389dd5eb6bb46e1b06b
pages : 1
ph : <ParseHub object API key:'<redacted>'>
project_token : tCWOS3cB-ZM8qtShXw6j8tyOHZ84hLik
run_token : tCNPbuLm7wd-Aqmb9WHHZMV0
start_time : 2015-04-13T15:30:03
start_url : https://www.google.com/search?q=...
start_value : {}
status : complete
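
Output in this `name : value` shape, sorted by attribute name, can be produced with `vars()`. The sketch below is an assumption about the approach, not the module's actual `pprint()` code; `format_attributes` and `RunRecord` are hypothetical names.

```python
def format_attributes(obj):
    """Return 'name : value' lines for an object's attributes, sorted by name."""
    return ["%s : %s" % (name, value) for name, value in sorted(vars(obj).items())]


class RunRecord:
    """Hypothetical object holding a few of the fields shown above."""

    def __init__(self):
        self.status = "complete"
        self.pages = 1
        self.data_ready = 1


lines = format_attributes(RunRecord())
print("\n".join(lines))
```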

###Todo

  • package
  • refactor
  • SSL
  • CSV

Contributors

hronecviktor
