scrapinghub / arche

Analyze scraped data

Home Page: https://arche.readthedocs.io/

License: MIT License

Python 96.75% HTML 3.25%
data data-visualization data-analysis python3 pandas scrapy jupyter

arche's Introduction

Arche


pip install arche

Arche (pronounced Arkey) helps to verify scraped data using a set of defined rules, for example:

  • Validation with JSON schema
  • Coverage (items, fields, categorical data, including booleans and enums)
  • Duplicates
  • Garbage symbols
  • Comparison of two jobs

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.

Installation

Arche requires a Jupyter environment; both JupyterLab and the classic Notebook UI are supported.

For JupyterLab, you will need to properly install the plotly extensions.

Then just pip install arche

Why

To check the quality of scraped data continuously. For example, if you scraped a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon to monitor future runs.
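
A minimal sketch of that flow, assuming the Arche entry point described in the docs (the job key and schema path below are placeholders):

from arche import Arche

# Source can be a Scrapy Cloud job key or local items; the schema describes the expected fields.
arche = Arche(source="000001/1/1", schema="schemas/example.json")
arche.report_all()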

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.

arche's People

Contributors

andersonberg, gitter-badger, manycoding, oik741, simoess, tcurvelo, victor-torres


arche's Issues

Find duplicates of items by chosen fields

Currently there's no convenient way to find duplicates by chosen fields, only by one field or by certain tags.
Make or update the rule which consumes fields and outputs items with equal fields. E.g. I want to find duplicates by key and url, so having this data:

[{"key":` 0, "name": "bob", "url": "example.com"}, {"key": 0, "name": "john", "url": "example.com"}]

I want to see 2 duplicates.
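
A minimal pandas sketch of such a rule (find_by is an illustrative name, and the DataFrame is built from the sample data above):

import pandas as pd

def find_by(df: pd.DataFrame, fields: list) -> pd.DataFrame:
    # Keep every row that shares values in the chosen fields with at least one other row.
    return df[df.duplicated(subset=fields, keep=False)]

items = pd.DataFrame([
    {"key": 0, "name": "bob", "url": "example.com"},
    {"key": 0, "name": "john", "url": "example.com"},
])
print(find_by(items, ["key", "url"]))  # both rows are reported, i.e. 2 duplicates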

Tags

Check Tags - specify possible tags along with specified tags:

Tags:
    used: category, product_price_field, unique, product_url_field, name_field
    not used tags: unique
    'text' field(s) was not found in source, but specified in schema
    'text' field(s) was not found in target, but specified in schema
    Skipping tag rules

Tags Rules - skip them

Include field names in jsonschema Additional Properties error

Currently one has to guess which ones:

1626 items affected - Additional properties are not allowed: 1032 609 895 1064 13

Should be like

1626 items affected - Additional properties are not allowed - "SOMETHING", "SHOULDN'T", "BE HERE": 1032 609 895 1064 13
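
One way to recover the offending names is to diff the item's keys against the schema's properties when jsonschema reports an additionalProperties error; a sketch (the sample schema and item are illustrative):

import jsonschema

schema = {"properties": {"_key": {"type": "string"}}, "additionalProperties": False}
item = {"_key": "0", "SOMETHING": 1, "BE HERE": 2}

for error in jsonschema.Draft4Validator(schema).iter_errors(item):
    if error.validator == "additionalProperties":
        # Offending names are the item's keys that the schema does not declare.
        extra = sorted(set(error.instance) - set(error.schema.get("properties", {})))
        print(f"Additional properties are not allowed - {extra}")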

Reading schemas from private repos

Some schemas live inside repos, and maybe they belong there along with the code.
Assuming so, it would be most convenient to fetch those schemas directly.

Both GitHub and Bitbucket provide access tokens, so it's just a matter of specifying the raw URL.
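
For GitHub, a sketch could look like this, using a personal access token with the contents API (repo, path and token are placeholders):

import json
import requests

url = "https://api.github.com/repos/some-org/some-repo/contents/schemas/item.json"
headers = {
    "Authorization": "token <personal-access-token>",
    "Accept": "application/vnd.github.v3.raw",  # return the raw file instead of base64 metadata
}
schema = json.loads(requests.get(url, headers=headers).text)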

Refactor rules which output dataframe or series to plot them

  1. dataframe/series should be in message.stats
  2. Report plots message.stats as series barh, so this should be updated accordingly to support dataframe (and see if barh plot actually makes sense in all cases for series)
  3. Don't forget to actually ask people, perhaps they prefer text in some cases

Replace ' with "

basic_json_schema generates schemas with single quotes and then tox complains with an error: "E Expecting property name enclosed in double quotes: line 1 column 2 (char 1)"
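
A likely fix is to serialize the generated schema with json.dumps rather than relying on the dict's repr, e.g.:

import json

schema = {"type": "object", "required": ["_key"]}
str(schema)                   # "{'type': 'object', 'required': ['_key']}" - single quotes, not valid JSON
json.dumps(schema, indent=4)  # valid JSON with double quotes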

Read data from dataframe

Support at least a dataframe, which will allow reading the data locally from whatever source (CSV, JSON, be it remote or local).

Currently the library relies on having _key to report items by. So the implementation could look like:

  1. Figure out a simple API (fastai datablock-style?), like:

items = Items.from_csv(...)  # or Items.from_job(...)
schema = Schema.get_schema(schema)
items.report_all(schema)

# And to keep it granular enough so it can be used in Spidermon
arche.rules.duplicates.find_by(items.df, ["name", "title"])
  2. Add a _key column. Maybe it's easier to make _key the index if it's present and report the index (see the sketch below).
  3. _type. So far nobody has really needed _type since we can use filters.
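
A sketch for point 2, assuming a plain pandas read (items.csv is a placeholder path and the items wrapper itself is omitted):

import pandas as pd

df = pd.read_csv("items.csv")
if "_key" in df.columns:
    # Report items by _key: simplest is to make it the index.
    df = df.set_index("_key")
else:
    # Fall back to the positional index as the key.
    df.index.name = "_key"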

Use `pool` to download items with `filter`

The current logic of dividing into batches by start_index and count doesn't account for filter.
When using filter, the returned items' _key values don't correspond to the actual index, so the data repeats.

Figure out trusted notebooks

JS from plotly/cufflinks is blocked by jupyter as not trusted.
A workaround is to make it trusted, which requires an additional action.

See what can be done (i.e. why is plotly not trusted?).

Consuming items data to df creates inconsistencies with jsonschema

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, if the items data is:
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN

Yet the JSON Schema null type means None, and a missing field is validated by not putting it in required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", availability],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", null]},
    },
    "additionalProperties": False,
}

Last but not least, the inconsistencies between JSON schema and data persist when we feed a dataframe directly (unless a user manages it himself).
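
One possible mitigation (a sketch, not what the library does today): convert each row back to a dict and drop NaNs before validating, so a NaN produced by pandas reads as a missing field rather than a wrong type. Note it does not undo the int to float cast that pandas applies to columns containing NaN.

import pandas as pd

df = pd.DataFrame([{"availability": 1, "_key": "0"}, {"_key": "1"}])

def row_to_item(row: pd.Series) -> dict:
    # Drop NaN values so jsonschema sees a missing field, not a float NaN.
    return {field: value for field, value in row.items() if pd.notna(value)}

items = [row_to_item(row) for _, row in df.iterrows()]
# items == [{"availability": 1.0, "_key": "0"}, {"_key": "1"}]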

Bitbucket auth to fetch files from private repos

The idea is to set an id/password (or other credentials) on the JupyterHub cluster so everyone can access raw schemas from there without any additional actions.

#65 is not the best solution since 1. it requires real user credentials, meaning each user has to set their id/password, and 2. most, if not all, Bitbucket ids use Google SSO, which doesn't work in that case.

After talking with @tcurvelo, we found there are 2 options:

OAuth2, manual steps:

  1. Create a 'client' app on some account (which account?) that has access to all repos
  2. Copy its client_id and secret and set them up in the JupyterHub configuration

In the code:

  3. Before requesting a resource, we need to authenticate and receive a temporary token
  4. Using the token, we can request the api.bitbucket.org endpoint

API
The idea is that we can authenticate with the API using an app password. The app password belongs to a user group which has all the access.

  1. Create an app password
  2. Use the password in API requests to bitbucket
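
A sketch of the second option, authenticating with an app password against Bitbucket's 2.0 src endpoint (workspace, repo, ref, path and credentials are placeholders):

import requests

url = "https://api.bitbucket.org/2.0/repositories/some-workspace/some-repo/src/master/schemas/item.json"
# Basic auth with the service user's name and its app password returns the raw file.
schema = requests.get(url, auth=("service-user", "<app-password>")).json()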

Sort by coverage in scraped fields

  1. Separate output by a newline - missing fields and new fields
  2. Sort by field coverage

spec_Joints_Qty - coverage - 0% - 2 items
New Fields
spec_Dimensions_with_stand_H_x_W_x_D - coverage - 0% - 2 items
g = Gatf(source="307140/6/75",
         schema='schemas/Dell/99595_dell_us.json',
         target="307141/21/36",
         expand=False)
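
The coverage sort itself is cheap once the items are in a flattened DataFrame; a sketch (the sample df is illustrative):

import pandas as pd

df = pd.DataFrame([{"name": "a", "spec_Joints_Qty": None}, {"name": "b", "spec_Joints_Qty": None}])

# Share of non-null values per field, lowest coverage first.
coverage = df.notnull().mean().sort_values()
print(coverage.map("{:.0%}".format))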

Documentation as ipynb

Github supports ipynb, so why not.

It is as simple as:

  1. Execute everything from documentation (quickstart, etc)
  2. Add ipynb with links to the repo

Sort by difference in Fields Counts

The resulting df is sorted by field name; it makes more sense to sort it by difference instead.

                                                    Difference, %
_validation                                                   100
_validation/price_now                                         100
_validation/price_was                                         100
configurations/description                                     65
configurations/name                                            65
configurations/options                                         65
delivery_options/date_range                                    13
delivery_options/name                                          13
delivery_options/price                                         13
spec_10th_Hard_Drive                                          100
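
The change itself is likely a one-liner on the resulting df; a sketch with the column name shown above:

import pandas as pd

diffs = pd.Series(
    {"_validation": 100, "configurations/name": 65, "delivery_options/price": 13},
    name="Difference, %",
)
print(diffs.sort_values(ascending=False))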

Field counts doesn't trigger error for jobs with lots of NaN

ghat = Gatf(source='364692/1/17', 
            schema='schemas/Global Strategies/amazon_product.json',
            target='364692/1/14',
#             schema=schema
           )

The price fields here differ by 15% and yet there is no error in the log. Perhaps we don't really want to see an improved coverage?

flatten_df is too slow

  1. Can it be rewritten to not use recursion? (See the sketch after this list.)

  2. If not, profile and see how to improve. To test, use jobs with nested fields and a considerable amount of items.

  3. Add tests to check the speed
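
For point 1, pandas' own json_normalize flattens nested dicts without custom recursion and might replace flatten_df outright (pd.json_normalize in pandas >= 1.0, pandas.io.json.json_normalize in older versions); a sketch, though whether it matches flatten_df's exact output, e.g. for nested lists, would need checking:

import pandas as pd

items = [{"_key": "0", "specs": {"color": "black", "dims": {"h": 10, "w": 20}}}]
df = pd.json_normalize(items, sep="/")
# Columns: _key, specs/color, specs/dims/h, specs/dims/w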

Make 0 items outcome more visible

Currently, if a filtered job returns 0 items, the first test simply fails. While there are some hints pointing to the number of returned items (the 0it counter), it's not visible enough.

g = Gatf(source='2235/1276/18', 
         schema='schemas/Netflix-FTE/netflix-show.json',
         target='2235/1276/18',
         filters=[("META_TEMPLATE_NAME", "=", ["Title"])],
         expand=False)

Do not limit schema errors by default

E.g.

2028 items affected - Additional properties are not allowed: 1122, 151, 1365, 1257, 1799
43 items affected - part_number is not of type 'string': 560, 910, 820, 1809, 369
43 items affected - name is not of type 'string': 1376, 1684, 1691, 343, 135
43 items affected - availability is not of type 'integer': 1999, 265, 1852, 820, 1849
43 items affected - availability is not one of [0, 1, 2, 3, 4, 5, 6]: 1321, 1691, 1153, 1774, 963
Should print all 9 messages.

Switch to another plot theme

Red is the error colour, and the current ggplot theme has too much of it.
Red normally signals something bad, an error, so I want to reserve that colour for errors.

Is modin worth it

https://github.com/modin-project/modin
They claim a lot, let's see what we get with the actual data.

I feel like the only thing which really makes the difference (100x) is numpy and CPython. That's covered in fastai ML 2018, perhaps in fastai Computational Linear Algebra too.

Configurable thresholds

Right now it's hardcoded in booleans, category, coverage comparison.
It makes sense to make it configurable so it can be changed globally.

    err_diffs = difs[difs > 10]
    if not err_diffs.empty:

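A sketch of one way to make it configurable: a module-level settings object that rules read instead of hardcoded literals (the names here are illustrative, not an existing API):

from dataclasses import dataclass

@dataclass
class Thresholds:
    # Relative difference (in %) above which a comparison is reported as an error.
    err_diff: float = 10.0
    warn_diff: float = 5.0

THRESHOLDS = Thresholds()

# In a rule, instead of a hardcoded 10:
# err_diffs = difs[difs > THRESHOLDS.err_diff]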

Rule - enum statistics

Get any enums and return a df which shows the corresponding value counts.

E.g. if the schema has a field with "enum": ["black", "white"],
it should return a similar df (perhaps value_counts() suits this purpose):

value name      percentage    count
total values    80%           40/50
black           50%           25/50
white           30%           15/50
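
A sketch with value_counts, assuming a flattened df and an enum field named color; it reproduces the table above:

import pandas as pd

df = pd.DataFrame({"color": ["black"] * 25 + ["white"] * 15 + [None] * 10})

total = len(df)
filled = df["color"].notnull().sum()
print(f"total values {filled / total:.0%} {filled}/{total}")
for value, count in df["color"].value_counts().items():
    print(f"{value} {count / total:.0%} {count}/{total}")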

JSON schema summary error

errors_count of errors for items_count of items
For example

JSON Schema Validation:
        551 errors for 277 of items

Currently, it's

JSON Schema Validation:
	500 items were checked, 9 error(s)

Output keys for compare_fields_counts

It's inspired by:

Do you have any tool to find missing fields compared to previous crawls?
I would like to know which items have lost these fields to see what happened here:

At the moment this rule outputs just the field name and the difference; perhaps it would be more helpful to also include the particular keys.
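
A sketch of what that could look like with two flattened DataFrames indexed by _key (the source/target data below are made up):

import pandas as pd

source = pd.DataFrame({"price": [9.99, None, None]}, index=["0", "1", "2"])
target = pd.DataFrame({"price": [9.99, 3.50, 5.00]}, index=["0", "1", "2"])

# Keys that have the field in the target job but lost it in the source job.
lost_keys = source.index[source["price"].isnull() & target["price"].notnull()]
print(f"price: lost in {len(lost_keys)} items - keys: {list(lost_keys)}")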
