pip install arche
Arche (pronounced "Arkey") helps verify data using a set of defined rules, for example:
- Validation with JSON schema
- Coverage
- Duplicates
- Garbage symbols
- Comparison of two jobs
We use it at Scrapinghub to ensure the quality of scraped data.
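To make two of the rules above concrete, here is a minimal sketch of what a coverage check and a duplicates check compute. It is plain Python, not Arche's API: the `items` sample and the `field_coverage` and `find_duplicates` helpers are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical sample of scraped items; not real job data.
items = [
    {"name": "Widget", "price": "9.99", "url": "https://example.com/1"},
    {"name": "Gadget", "price": None, "url": "https://example.com/2"},
    {"name": "Widget", "price": "9.99", "url": "https://example.com/1"},
]


def field_coverage(items):
    """Return, per field, the fraction of items holding a non-empty value."""
    if not items:
        return {}
    counts = Counter()
    for item in items:
        for field, value in item.items():
            # Treat None and empty containers/strings as "not covered".
            if value not in (None, "", [], {}):
                counts[field] += 1
    fields = {field for item in items for field in item}
    return {field: counts[field] / len(items) for field in sorted(fields)}


def find_duplicates(items, keys):
    """Group item indexes by the values of `keys`; keep groups of two or more."""
    groups = {}
    for index, item in enumerate(items):
        groups.setdefault(tuple(item.get(k) for k in keys), []).append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


coverage = field_coverage(items)
duplicates = find_duplicates(items, keys=["name", "url"])
```

Here `coverage` maps each field to a value between 0 and 1, and `duplicates` lists the indexes of items that share the same `name` and `url`. Which fields count as a duplicate key is a per-dataset decision, so it is passed in explicitly.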
Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.
For JupyterLab, you will need to install the plotly extensions first.
Then just run pip install arche.
- You need to check the quality of data from Scrapy Cloud jobs continuously.
  Say, you scraped some website and have the data ready in the cloud. A typical approach would be:
  - Create a JSON schema and validate the data with it
  - Use the created schema in Spidermon Validation
- You want to use it in your application to verify Scrapy Cloud data
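The first step of the typical approach above, creating a JSON schema and validating items against it, can be sketched as follows. This is a deliberately tiny stand-in that understands only the `required` and `type` keywords, written with the standard library so it runs anywhere; it is neither Arche's nor Spidermon's implementation, and a real project would use a full JSON Schema validator. The `schema`, `scraped_items`, and `validate_item` names are hypothetical.

```python
# Hypothetical schema describing one scraped item.
schema = {
    "type": "object",
    "required": ["name", "url"],
    "properties": {
        "name": {"type": "string"},
        "url": {"type": "string"},
        "price": {"type": "number"},
    },
}

# JSON Schema type names mapped to Python types (only those this sketch handles).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_item(item, schema):
    """Return a list of error messages for `item`; an empty list means it passed."""
    errors = []
    for field in schema.get("required", []):
        if field not in item:
            errors.append(f"'{field}' is a required property")
    for field, rules in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(rules.get("type"))
        if field not in item or expected is None:
            continue
        value = item[field]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        is_bool_as_number = rules["type"] == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            errors.append(f"'{field}' is not of type '{rules['type']}'")
    return errors


# Hypothetical scraped items: the second one is missing `url` and has a bad `price`.
scraped_items = [
    {"name": "Widget", "url": "https://example.com/1", "price": 9.99},
    {"name": "Gadget", "price": "cheap"},
]

# Keep only the items that failed, keyed by their position in the job.
report = {}
for index, item in enumerate(scraped_items):
    item_errors = validate_item(item, schema)
    if item_errors:
        report[index] = item_errors
```

After this runs, `report` contains an entry only for the second item, listing its missing `url` and its non-numeric `price`. Keying errors by item position makes it easy to trace a failure back to the item that produced it.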
pipenv install --dev
pipenv shell
tox
Any contributions are welcome!
- Fork or create a new branch
- Make desired changes
- Open a pull request