scrapinghub / arche

Analyze scraped data

Home Page: https://arche.readthedocs.io/

License: MIT License

Python 96.75% HTML 3.25%
data data-visualization data-analysis python3 pandas scrapy jupyter

arche's Introduction

Arche


pip install arche

Arche (pronounced Arkey) helps to verify scraped data using a set of defined rules, for example:

  • Validation with JSON schema
  • Coverage (items, fields, categorical data, including booleans and enums)
  • Duplicates
  • Garbage symbols
  • Comparison of two jobs

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.

Installation

Arche requires a Jupyter environment; both JupyterLab and the classic Notebook UI are supported.

For JupyterLab, you will need to properly install the plotly extensions.

Then just pip install arche

Why

To check the quality of scraped data continuously. For example, if you scraped a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon to monitor future runs.
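
A minimal sketch of that flow, assuming the Arche entry point described in the docs (the job key and schema path below are placeholders):

from arche import Arche

# Source can be a Scrapy Cloud job key or local items; the schema describes the expected fields.
arche = Arche(source="000001/1/1", schema="schemas/example.json")
arche.report_all()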

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.

arche's People

Contributors

andersonberg, gitter-badger, manycoding, oik741, simoess, tcurvelo, victor-torres


arche's Issues

Find duplicates of items by chosen fields

Currently there's no convenient way to find duplicates by chosen fields, only by one field or by certain tags.
Make or update the rule which consumes fields and outputs items with equal fields. E.g. I want to find duplicates by key and url, so having this data:

[{"key":` 0, "name": "bob", "url": "example.com"}, {"key": 0, "name": "john", "url": "example.com"}]

I want to see 2 duplicates.
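
A minimal pandas sketch of such a rule (find_by is an illustrative name, and the DataFrame is built from the sample data above):

import pandas as pd

def find_by(df: pd.DataFrame, fields: list) -> pd.DataFrame:
    # Keep every row that shares values in the chosen fields with at least one other row.
    return df[df.duplicated(subset=fields, keep=False)]

items = pd.DataFrame([
    {"key": 0, "name": "bob", "url": "example.com"},
    {"key": 0, "name": "john", "url": "example.com"},
])
print(find_by(items, ["key", "url"]))  # both rows are reported, i.e. 2 duplicates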

Tags

Check Tags - specify possible tags along with specified tags:

Tags:
    used: category, product_price_field, unique, product_url_field, name_field
    not used tags: unique
    'text' field(s) was not found in source, but specified in schema
    'text' field(s) was not found in target, but specified in schema
    Skipping tag rules

Tags Rules - skip them

Include field names in jsonschema Additional Properties error

Currently one has to guess which ones:

1626 items affected - Additional properties are not allowed: 1032 609 895 1064 13

Should be like

1626 items affected - Additional properties are not allowed - "SOMETHING", "SHOULDN'T", "BE HERE": 1032 609 895 1064 13
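
One way to recover the offending names is to diff the item's keys against the schema's properties when jsonschema reports an additionalProperties error; a sketch (the sample schema and item are illustrative):

import jsonschema

schema = {"properties": {"_key": {"type": "string"}}, "additionalProperties": False}
item = {"_key": "0", "SOMETHING": 1, "BE HERE": 2}

for error in jsonschema.Draft4Validator(schema).iter_errors(item):
    if error.validator == "additionalProperties":
        # Offending names are the item's keys that the schema does not declare.
        extra = sorted(set(error.instance) - set(error.schema.get("properties", {})))
        print(f"Additional properties are not allowed - {extra}")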

Reading schemas from private repos

Some schemas live inside repos, and maybe they belong there along with the code.
Assuming so, it would be most convenient to fetch those schemas directly.

Both GitHub and Bitbucket provide access tokens, so it's just a matter of specifying the raw URL.
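
For GitHub, a sketch could look like this, using a personal access token with the contents API (repo, path and token are placeholders):

import json
import requests

url = "https://api.github.com/repos/some-org/some-repo/contents/schemas/item.json"
headers = {
    "Authorization": "token <personal-access-token>",
    "Accept": "application/vnd.github.v3.raw",  # return the raw file instead of base64 metadata
}
schema = json.loads(requests.get(url, headers=headers).text)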

Refactor rules which output dataframe or series to plot them

  1. dataframe/series should be in message.stats
  2. Report plots message.stats as series barh, so this should be updated accordingly to support dataframe (and see if barh plot actually makes sense in all cases for series)
  3. Don't forget to actually ask people, perhaps they prefer text in some cases

Replace ' with "

basic_json_schema generates schemas with single quotes and then tox complains with an error: "E Expecting property name enclosed in double quotes: line 1 column 2 (char 1)"
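
A likely fix is to serialize the generated schema with json.dumps rather than relying on the dict's repr, e.g.:

import json

schema = {"type": "object", "required": ["_key"]}
str(schema)                   # "{'type': 'object', 'required': ['_key']}" - single quotes, not valid JSON
json.dumps(schema, indent=4)  # valid JSON with double quotes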

Read data from dataframe

Support at least a dataframe, which will allow reading the data locally from whatever source (CSV, JSON, be it remote or local).

Currently the library relies on having _key to report items by. So the implementation could look like:

  1. Figure out a simple API (fastai datablock-style?), like:

items = Items.from_csv(...)  # or Items.from_job(...)
schema = Schema.get_schema(schema)
items.report_all(schema)

# And to keep it granular enough so it can be used in Spidermon
arche.rules.duplicates.find_by(items.df, ["name", "title"])
  2. Add a _key column. Maybe it's easier to make _key the index if it's present and report the index (see the sketch below).
  3. _type. So far nobody has really needed _type since we can use filters.
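
A sketch for point 2, assuming a plain pandas read (items.csv is a placeholder path and the items wrapper itself is omitted):

import pandas as pd

df = pd.read_csv("items.csv")
if "_key" in df.columns:
    # Report items by _key: simplest is to make it the index.
    df = df.set_index("_key")
else:
    # Fall back to the positional index as the key.
    df.index.name = "_key"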

Use `pool` to download items with `filter`

The current logic of dividing into batches by start_index and count doesn't account for filter.
When using filter, the returned items' _key values don't correspond to the actual index, so the data repeats.

Figure out trusted notebooks

JS from plotly/cufflinks is blocked by jupyter as not trusted.
A workaround is to make it trusted, which requires an additional action.

See what can be done (i.e. why is plotly not trusted?).

Consuming items data to df creates inconsistencies with jsonschema

Caused by #75

Pandas makes its own casts, which are incompatible with jsonschema dict validation by default.
For example, if the items data is:
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
or
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
the DataFrame in both cases would be

  _key  availability
0    0           1.0
1    1           NaN

Yet the JSON Schema null type means None, and a missing field is validated by not putting it in required. So we have:
Missing field (on purpose)

{
    "required": ["_key"],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": "integer"},
    },
    "additionalProperties": False,
}

None field (on purpose)

{
    "required": ["_key", availability],
    "properties": {
        "_key": {"type": "string"},
        "availability": {"type": ["integer", null]},
    },
    "additionalProperties": False,
}

Last but not least, the inconsistencies between JSON schema and data persist when we feed a dataframe directly (unless a user manages it himself).
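
One possible mitigation (a sketch, not what the library does today): convert each row back to a dict and drop NaNs before validating, so a NaN produced by pandas reads as a missing field rather than a wrong type. Note it does not undo the int to float cast that pandas applies to columns containing NaN.

import pandas as pd

df = pd.DataFrame([{"availability": 1, "_key": "0"}, {"_key": "1"}])

def row_to_item(row: pd.Series) -> dict:
    # Drop NaN values so jsonschema sees a missing field, not a float NaN.
    return {field: value for field, value in row.items() if pd.notna(value)}

items = [row_to_item(row) for _, row in df.iterrows()]
# items == [{"availability": 1.0, "_key": "0"}, {"_key": "1"}]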

Bitbucket auth to fetch files from private repos

The idea is to set an id/password (or other credentials) on the JupyterHub cluster so everyone can access raw schemas from there without any additional actions.

#65 is not the best solution since 1. it requires real user credentials, meaning each user has to set their id/password, and 2. most, if not all, Bitbucket ids use Google SSO, which doesn't work in that case.

After talking with @tcurvelo, we found there are 2 options:

OAuth2, manual steps:

  1. Create a 'client' app on some account (which account?) that has access to all repos
  2. Copy its client_id and secret and set them up in the JupyterHub configuration

In the code:

  3. Before requesting a resource, we need to authenticate and receive a temporary token
  4. Using the token, we can request the api.bitbucket.org endpoint

API
The idea is that we can authenticate with the API using an app password. The app password belongs to a user group which has all the access.

  1. Create an app password
  2. Use the password in API requests to bitbucket
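
A sketch of the second option, authenticating with an app password against Bitbucket's 2.0 src endpoint (workspace, repo, ref, path and credentials are placeholders):

import requests

url = "https://api.bitbucket.org/2.0/repositories/some-workspace/some-repo/src/master/schemas/item.json"
# Basic auth with the service user's name and its app password returns the raw file.
schema = requests.get(url, auth=("service-user", "<app-password>")).json()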

Sort by coverage in scraped fields

  1. Separate output by a newline - missing fields and new fields
  2. Sort by field coverage

spec_Joints_Qty - coverage - 0% - 2 items
New Fields
spec_Dimensions_with_stand_H_x_W_x_D - coverage - 0% - 2 items
g = Gatf(source="307140/6/75",
         schema='schemas/Dell/99595_dell_us.json',
         target="307141/21/36",
         expand=False)
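
The coverage sort itself is cheap once the items are in a flattened DataFrame; a sketch (the sample df is illustrative):

import pandas as pd

df = pd.DataFrame([{"name": "a", "spec_Joints_Qty": None}, {"name": "b", "spec_Joints_Qty": None}])

# Share of non-null values per field, lowest coverage first.
coverage = df.notnull().mean().sort_values()
print(coverage.map("{:.0%}".format))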

Documentation as ipynb

Github supports ipynb, so why not.

It is as simple as:

  1. Execute everything from documentation (quickstart, etc)
  2. Add ipynb with links to the repo

Sort by difference in Fields Counts

The resulting df is sorted by field name; it makes more sense to sort it by difference instead.

                                                    Difference, %
_validation                                                   100
_validation/price_now                                         100
_validation/price_was                                         100
configurations/description                                     65
configurations/name                                            65
configurations/options                                         65
delivery_options/date_range                                    13
delivery_options/name                                          13
delivery_options/price                                         13
spec_10th_Hard_Drive                                          100
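
The change itself is likely a one-liner on the resulting df; a sketch with the column name shown above:

import pandas as pd

diffs = pd.Series(
    {"_validation": 100, "configurations/name": 65, "delivery_options/price": 13},
    name="Difference, %",
)
print(diffs.sort_values(ascending=False))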

Field counts doesn't trigger error for jobs with lots of NaN

ghat = Gatf(source='364692/1/17', 
            schema='schemas/Global Strategies/amazon_product.json',
            target='364692/1/14',
#             schema=schema
           )

The price fields here differ by 15% and yet there is no error in the log. Perhaps we don't really want to see an improved coverage?

flatten_df is too slow

  1. Can it be rewritten to not use recursion? (See the sketch after this list.)

  2. If not, profile and see how to improve. To test, use jobs with nested fields and a considerable amount of items.

  3. Add tests to check the speed
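
For point 1, pandas' own json_normalize flattens nested dicts without custom recursion and might replace flatten_df outright (pd.json_normalize in pandas >= 1.0, pandas.io.json.json_normalize in older versions); a sketch, though whether it matches flatten_df's exact output, e.g. for nested lists, would need checking:

import pandas as pd

items = [{"_key": "0", "specs": {"color": "black", "dims": {"h": 10, "w": 20}}}]
df = pd.json_normalize(items, sep="/")
# Columns: _key, specs/color, specs/dims/h, specs/dims/w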

Make 0 items outcome more visible

Currently, if a filtered job returns 0 items, the first test simply fails. While there are some hints pointing to the number of returned items (the 0it counter), it's not visible enough.

g = Gatf(source='2235/1276/18', 
         schema='schemas/Netflix-FTE/netflix-show.json',
         target='2235/1276/18',
         filters=[("META_TEMPLATE_NAME", "=", ["Title"])],
         expand=False)

Do not limit schema errors by default

E.g.

2028 items affected - Additional properties are not allowed: 1122, 151, 1365, 1257, 1799
43 items affected - part_number is not of type 'string': 560, 910, 820, 1809, 369
43 items affected - name is not of type 'string': 1376, 1684, 1691, 343, 135
43 items affected - availability is not of type 'integer': 1999, 265, 1852, 820, 1849
43 items affected - availability is not one of [0, 1, 2, 3, 4, 5, 6]: 1321, 1691, 1153, 1774, 963
Should print all 9 messages.

Switch to another plot theme

Red is the error colour, and the current ggplot theme has too much of it.
Red normally signals something bad, an error, so I want to reserve that colour for errors.

Is modin worth it

https://github.com/modin-project/modin
They claim a lot, let's see what we get with the actual data.

I feel like the only thing which really makes the difference (100x) is numpy and CPython. That's covered in fastai ML 2018, perhaps in fastai Computational Linear Algebra too.

Configurable thresholds

Right now it's hardcoded in booleans, category, coverage comparison.
It makes sense to make it configurable so it can be changed globally.

    err_diffs = difs[difs > 10]
    if not err_diffs.empty:

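A sketch of one way to make it configurable: a module-level settings object that rules read instead of hardcoded literals (the names here are illustrative, not an existing API):

from dataclasses import dataclass

@dataclass
class Thresholds:
    # Relative difference (in %) above which a comparison is reported as an error.
    err_diff: float = 10.0
    warn_diff: float = 5.0

THRESHOLDS = Thresholds()

# In a rule, instead of a hardcoded 10:
# err_diffs = difs[difs > THRESHOLDS.err_diff]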

Rule - enum statistics

Get any enums and return a df which shows the corresponding value counts.

E.g. if the schema has a field with "enum": ["black", "white"],
it should return a similar df (perhaps value_counts() suits this purpose):

value name      percentage    count
total values    80%           40/50
black           50%           25/50
white           30%           15/50
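
A sketch with value_counts, assuming a flattened df and an enum field named color; it reproduces the table above:

import pandas as pd

df = pd.DataFrame({"color": ["black"] * 25 + ["white"] * 15 + [None] * 10})

total = len(df)
filled = df["color"].notnull().sum()
print(f"total values {filled / total:.0%} {filled}/{total}")
for value, count in df["color"].value_counts().items():
    print(f"{value} {count / total:.0%} {count}/{total}")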

JSON schema summary error

errors_count of errors for items_count of items
For example

JSON Schema Validation:
        551 errors for 277 of items

Currently, it's

JSON Schema Validation:
	500 items were checked, 9 error(s)

Output keys for compare_fields_counts

It's inspired by:

Do you have any tool to find missing fields compared to previous crawls?
I would like to know which items have lost these fields to see what happened here:

At the moment this rule outputs just the field name and the difference; perhaps it would be more helpful to also include the particular keys.
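
A sketch of what that could look like with two flattened DataFrames indexed by _key (the source/target data below are made up):

import pandas as pd

source = pd.DataFrame({"price": [9.99, None, None]}, index=["0", "1", "2"])
target = pd.DataFrame({"price": [9.99, 3.50, 5.00]}, index=["0", "1", "2"])

# Keys that have the field in the target job but lost it in the source job.
lost_keys = source.index[source["price"].isnull() & target["price"].notnull()]
print(f"price: lost in {len(lost_keys)} items - keys: {list(lost_keys)}")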
