ScraperWiki Python library for scraping and saving data

Home Page: https://scraperwiki.com

License: BSD 2-Clause "Simplified" License


ScraperWiki Python library

This is a Python library for scraping web pages and saving data. It is the easiest way to save data on the ScraperWiki platform, and it can also be used locally or on your own servers.

Installing

pip install scraperwiki

Scraping

scraperwiki.scrape(url[, params][, user_agent])

Returns the downloaded page content from the given url as a string.

If params is set, the request is sent as a POST with params as the body. user_agent sets the User-Agent header if provided.
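
A minimal sketch of both call styles; the URLs, parameters, and user-agent string below are placeholders:

    import scraperwiki

    # Simple GET request; returns the page body as a string.
    html = scraperwiki.scrape("https://example.com/")

    # Passing params sends the request as a POST instead, and
    # user_agent sets the User-Agent header.
    result = scraperwiki.scrape(
        "https://example.com/search",
        params={"q": "open data"},
        user_agent="my-scraper/1.0",
    )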

Saving data

Helper functions for saving and querying an SQL database. Updates the schema automatically according to the data you save.

Currently only supports SQLite. It will make a local SQLite database. It is based on [SQLAlchemy](https://pypi.python.org/pypi/SQLAlchemy). You should expect it to support other SQL databases at a later date.

scraperwiki.sql.save(unique_keys, data[, table_name="swdata"])

Saves a data record to the datastore, in the table given by table_name.

data is a dict object with field names as keys; unique_keys is a subset of data.keys() which determines when a record is overwritten. For large numbers of records data can be a list of dicts.

scraperwiki.sql.save may buffer an arbitrary number of rows until the next read via the ScraperWiki API, until an exception is raised, or until the process exits; an effort is made to flush periodically. Records can be lost if the process hard-crashes, loses power, or is killed with SIGKILL (for example, for using too much memory during an out-of-memory condition). The buffer can be flushed manually with scraperwiki.sql.flush().
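
A short sketch of typical use; the records and the default swdata table are illustrative:

    import scraperwiki

    # "id" is the unique key: saving another record with the same id
    # replaces the earlier row instead of inserting a duplicate.
    scraperwiki.sql.save(unique_keys=["id"], data={"id": 1, "name": "Alice"})
    scraperwiki.sql.save(unique_keys=["id"], data={"id": 1, "name": "Alicia"})

    # A list of dicts saves several records in one call.
    scraperwiki.sql.save(["id"], [{"id": 2, "name": "Bob"},
                                  {"id": 3, "name": "Carol"}])

    # Write any buffered rows to the database immediately.
    scraperwiki.sql.flush()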

scraperwiki.sql.execute(sql[, vars])

Executes an arbitrary SQL statement, for example CREATE, DELETE, INSERT or DROP.

vars is an optional list of parameters, inserted when the SQL command contains ‘?’s. For example:

scraperwiki.sql.execute("INSERT INTO swdata VALUES (?,?,?)", [a,b,c])

The ‘?’ convention is like "paramstyle qmark" from Python's DB API 2.0 (but note that the API to the datastore is nothing like Python's DB API). In particular the ‘?’ does not itself need quoting, and can in general only be used where a literal would appear.
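
A fuller sketch with an invented table; note that execute does not auto-commit, so the changes are persisted with scraperwiki.sql.commit(), described below:

    import scraperwiki

    scraperwiki.sql.execute(
        "CREATE TABLE IF NOT EXISTS prices (item TEXT, price REAL)")

    # Each '?' is filled in from the corresponding entry of vars.
    scraperwiki.sql.execute(
        "INSERT INTO prices VALUES (?, ?)", ["apple", 0.40])

    # Persist the changes (sql.save commits automatically; execute does not).
    scraperwiki.sql.commit()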

scraperwiki.sql.select(sqlfrag[, vars])

Executes a select command on the datastore. For example:

scraperwiki.sql.select("* FROM swdata LIMIT 10")

Returns the selected rows as a list of dicts.

vars is an optional list of parameters, inserted when the select command contains ‘?’s. This is like the feature in the .execute command, above.
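
For example, a parameterised select against the default swdata table (the column names and value are invented); each returned row is a dict keyed by column name:

    import scraperwiki

    rows = scraperwiki.sql.select(
        "* FROM swdata WHERE name = ? LIMIT 10", ["Alice"])
    for row in rows:
        print(row["id"], row["name"])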

scraperwiki.sql.commit()

Commits to the database file after a series of execute commands. (scraperwiki.sql.save auto-commits after every action.)

scraperwiki.sql.show_tables([dbname])

Returns a list of the tables in the current database, with their schemas.

scraperwiki.sql.table_info(name)

Returns a list of attributes for each column of the named table.

scraperwiki.sql.save_var(key, value)

Saves an arbitrary single value into a table called swvariables. Intended to store scraper state so that a scraper can continue after an interruption; see the sketch after this list.

scraperwiki.sql.get_var(key[, default])

Retrieves a single value that was saved by save_var. Only works for string, float, or int types; for anything else, use the pickle library to turn it into a string first.
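
A sketch of resumable scraper state together with the introspection helpers; the key and values are invented:

    import scraperwiki

    # Remember how far the scraper got, so a rerun can pick up from there.
    scraperwiki.sql.save_var("last_page", 42)
    last_page = scraperwiki.sql.get_var("last_page", 1)  # 1 if never saved

    # Inspect what has been created so far.
    print(scraperwiki.sql.show_tables())             # includes swvariables
    print(scraperwiki.sql.table_info("swvariables"))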

Miscellaneous

scraperwiki.status(type, message=None)

If run on the ScraperWiki platform (the new one, not Classic), updates the visible status of the dataset; if not on the platform, does nothing. type can be 'ok' or 'error'. If no message is given, the status shows the time since the last update. See the dataset status API in the documentation for details, and the sketch below for an example.

scraperwiki.pdftoxml(pdfdata)

Converts a byte string containing a PDF file into XML describing the coordinates and font of each text string (see the pdftohtml documentation for details).
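
A combined sketch; the status message and PDF URL are placeholders:

    import scraperwiki

    # On the ScraperWiki platform this updates the dataset's visible
    # status; anywhere else the call does nothing.
    scraperwiki.status("ok")
    scraperwiki.status("error", "Upstream site returned HTTP 500")

    # Turn a downloaded PDF into XML describing each text string.
    pdfdata = scraperwiki.scrape("https://example.com/report.pdf")
    xml = scraperwiki.pdftoxml(pdfdata)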

Environment Variables

SCRAPERWIKI_DATABASE_NAME

default: scraperwiki.sqlite - the name of the database file

SCRAPERWIKI_DATABASE_TIMEOUT

default: 300 - the number of seconds the database will wait for a lock
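
For example, a sketch that overrides both settings from within Python. This assumes the variables are read when the library first opens the database, so they are set before any scraperwiki.sql call; the file name and timeout are invented:

    import os

    # Illustrative values; set these before the first database access.
    os.environ["SCRAPERWIKI_DATABASE_NAME"] = "mydata.sqlite"
    os.environ["SCRAPERWIKI_DATABASE_TIMEOUT"] = "60"

    import scraperwiki

    scraperwiki.sql.save(["id"], {"id": 1, "value": "hello"})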
