
frictionlessdata / frictionless-py


Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data

Home Page: https://framework.frictionlessdata.io

License: MIT License

Python 78.47% HTML 21.50% Dockerfile 0.03%

frictionless-py's Introduction

frictionless-py


Migrating from an older version? Please read the **[v5](blog/2022/08-22-frictionless-framework-v5.html)** announcement and migration guide.

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data (DEVT Framework). It supports a wide range of data sources and formats and provides integrations with popular platforms. The framework is powered by the lightweight yet comprehensive Frictionless Standards.

Purpose

  • Describe your data: You can infer, edit, and save metadata of your data tables. It's the first step toward ensuring data quality and usability. Frictionless metadata includes general information about your data, such as a textual description, as well as field types and other tabular details.
  • Extract your data: You can read your data using a unified tabular interface. Data quality and consistency are guaranteed by a schema. Frictionless supports various file schemes like HTTP, FTP, and S3, and data formats like CSV, XLS, JSON, SQL, and others.
  • Validate your data: You can validate data tables, resources, and datasets. Frictionless generates a unified validation report and supports many options to customize the validation process.
  • Transform your data: You can clean, reshape, and transfer your data tables and datasets. Frictionless provides a pipeline capability and a lower-level interface for working with the data.

Features

  • Open Source (MIT)
  • Powerful Python framework
  • Convenient command-line interface
  • Low memory consumption for data of any size
  • Reasonable performance on big data
  • Support for compressed files
  • Custom checks and formats
  • Fully pluggable architecture
  • Built-in API server
  • More than 1000 tests

Installation

$ pip install frictionless

Example

$ frictionless validate data/invalid.csv
[invalid] data/invalid.csv

  row    field  code              message
-----  -------  ----------------  --------------------------------------------
             3  blank-header      Header in field at position "3" is blank
             4  duplicate-header  Header "name" in field "4" is duplicated
    2        3  missing-cell      Row "2" has a missing cell in field "field3"
    2        4  missing-cell      Row "2" has a missing cell in field "name2"
    3        3  missing-cell      Row "3" has a missing cell in field "field3"
    3        4  missing-cell      Row "3" has a missing cell in field "name2"
    4           blank-row         Row "4" is completely blank
    5        5  extra-cell        Row "5" has an extra value in field  "5"
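
The same validation can also be run from Python. A minimal sketch using the data/invalid.csv file above (the `validate` function and the report's `valid` flag are part of the framework's Python API):

```python
from frictionless import validate

# Validate the same file from Python; validate() returns a Report object
report = validate("data/invalid.csv")
print(report.valid)  # False, since the file above contains structural errors
```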

Documentation

Please visit our documentation portal: https://framework.frictionlessdata.io

frictionless-py's People

Contributors

aborruso, aivuk, akariv, antoineaugusti, areleu, augusto-herrmann, chris48s, dependabot[bot], didiez, georgiana-b, hbruch, jen-thomas, jze, l-vincent-l, lwinfree, mirianbr, n0rdlicht, pedrovgp, peterdesmet, pierredittgen, pwalsh, rgieseke, roll, sapetti9, serahkiburu, shashigharti, thill-odi, thomasrockhu-codecov, trickvi, vitorbaptista


frictionless-py's Issues

Handle obvious formatting errors/discrepancies

  • Dates - could have data that is parsable as a date string but is not ISO8601 - be lenient and try to get a date (messytables has code for this; python-dateutil has a parser exactly for this)
  • Numbers - decide how to handle "," and ";" when we cast strings to Decimal or float (i.e. we want to allow certain common patterns, like "28,000.95"); a sketch of both ideas follows below
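
A minimal sketch of both ideas; the helper names are hypothetical, and python-dateutil provides the lenient parser mentioned above:

```python
from decimal import Decimal, InvalidOperation

from dateutil import parser as date_parser  # python-dateutil


def lenient_parse_date(value):
    """Try to parse a non-ISO8601 date string leniently (hypothetical helper)."""
    try:
        return date_parser.parse(value).date()
    except (ValueError, OverflowError):
        return None


def lenient_parse_number(value):
    """Accept common thousands-separator patterns like '28,000.95' (hypothetical helper)."""
    try:
        return Decimal(value.replace(",", "").replace(" ", ""))
    except InvalidOperation:
        return None


print(lenient_parse_date("12 March 2015"))  # 2015-03-12
print(lenient_parse_number("28,000.95"))    # 28000.95
```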

Compat for Python 2.7, 3.3, 3.4

Code currently runs on 3.3 and 3.4, and will run on 2.7 with minimal changes.

Tasks

  • Setup CI server integration to run tests on multiple runtimes
  • Configure test runner to persist coverage reports
  • Add compat.py for handling py2/3 differences (a minimal sketch follows below)
    • I'm avoiding six and the like if possible, and this should be fine as long as py2.7 is the oldest runtime we support
    • I'm using this as my primary reference
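
A minimal compat.py sketch without six, assuming py2.7 is the oldest supported runtime (the exact shims needed are an assumption):

```python
# compat.py - minimal py2/py3 shims without six
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode            # noqa: F821 - only defined on py2
    string_types = (str, unicode)  # noqa: F821
    from StringIO import StringIO  # noqa: F401
else:
    text_type = str
    string_types = (str,)
    from io import StringIO  # noqa: F401
```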

Raise informative errors

Specifically, errors that state what actually went wrong.

Cases

  • Resource not found
  • Resource is HTML
  • Decoding error while iterating over stream

The first two require user stories from @rgrp

Pass stream to validator for reporting

Allow calling code that creates a pipeline to pass in a writable stream for reporting (as an option).

If such a stream is passed in, the reporting code will write report data into it.

@rgrp (cc @tryggvib) question:

In such a case, what should be the format of the report?

The current default behaviour of the report module is to maintain a stream over a YAML file (YAML objects are append friendly). The reporter also supports an SQLite backend via https://github.com/pudo/dataset

In the case of this passed-in stream, I'm thinking line-delimited JSON objects (as YAML would require additional dependencies for the calling code) might be the most useful?
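
If line-delimited JSON is the answer, a minimal sketch of what the reporting side could do with a passed-in writable stream (the helper name is hypothetical):

```python
import io
import json


def write_report_entry(stream, entry):
    """Append a single report entry to a writable text stream as one JSON line
    (hypothetical helper illustrating the line-delimited JSON idea)."""
    stream.write(json.dumps(entry) + "\n")


# usage sketch: calling code passes in any writable text stream
report_stream = io.StringIO()
write_report_entry(report_stream, {"row": 2, "column": 3, "message": "missing value"})
print(report_stream.getvalue())
```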

coveralls config problem

Something is up with the coveralls config (possibly due to the name change): it is not collecting coverage data.

Validate Data Packages

Currently, we support validating a data source (e.g. a CSV file).

The pipeline constructor takes a data package spec, but does not use it or know what to do with it.

Need to spec out how to work with Data Packages in a validation pipeline.

Implementing a DataTable interface

@rgrp @tryggvib (and of course anyone else)

In my work so far on API design for the validation pipeline, an important issue has revolved around (a) how to parse a stream into some type of object-oriented interface, and (b) which backend to choose for this (and why). There are many ways to go about this of course - we're not going to go over them all here. Some discussion around it took place here

Anyway, the bottom line is that I decided to depend on Pandas to read data (CSV, and also JSON), and I currently work with that data in a Pandas DataFrame.

However, I do not want to expose the full DataFrame API directly, for the following reasons:

  • Pandas is really just a backend here: end users and clients mostly just need to know about headers, rows and columns
  • Future changes to this backend should have minimal impact on public APIs
  • Pandas provides different interfaces for parsing CSV and JSON; I would like to provide a single interface and pass off to the appropriate parser as required

So, locally I'm now working with a DataTable class that wraps a Pandas DataFrame (here is a WIP of the code), and only exposes properties we need for the validation pipeline (all DataFrame properties can still be accessed via self._frame).
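
A sketch of the pattern (not the actual WIP code, which lives in the linked branch):

```python
import io

import pandas as pd


class DataTable(object):
    """Thin wrapper over a pandas DataFrame, exposing only what the validation
    pipeline needs (a sketch of the pattern described above)."""

    def __init__(self, frame):
        self._frame = frame  # the full DataFrame API stays reachable via self._frame

    @property
    def headers(self):
        return list(self._frame.columns)

    def values(self):
        # yield rows as plain tuples for the validation pipeline
        for row in self._frame.itertuples(index=False, name=None):
            yield row


# usage sketch with inline CSV data
table = DataTable(pd.read_csv(io.StringIO("id,name\n1,english\n2,chinese\n")))
print(table.headers)         # ['id', 'name']
print(list(table.values()))  # [(1, 'english'), (2, 'chinese')]
```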

Each validator package will depend on this DataTable class. Hence, this will become another package in the suite.

Any thoughts/comments before I commit to this pattern?

One reason I'm even writing this up as an issue is because of the many "DataTable" like interfaces around in Python (csv.DictReader, Tablib, Dataset, Pandas' DataFrame), so I'm a bit wary of my own need to package up another one.

Update documentation

Ensure the docs are informative and up to date, in preparation for an alpha public release.

Accept CSV Dialect spec

Whenever we read a CSV file, we should also accept a CSV dialect spec.

Currently, the pipeline constructor accepts a CSV dialect spec and checks that it is well formed, but it does not use it when reading a CSV file.
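
A minimal sketch of applying a CSV Dialect spec when reading; only the spec's delimiter and quoteChar keys are handled here, and the helper name is hypothetical:

```python
import csv
import io


def read_with_dialect(stream, dialect_spec):
    """Read CSV rows applying (part of) a CSV Dialect spec (hypothetical helper)."""
    reader = csv.reader(
        stream,
        delimiter=dialect_spec.get("delimiter", ","),
        quotechar=dialect_spec.get("quoteChar", '"'),
    )
    for row in reader:
        yield row


# usage sketch
print(list(read_with_dialect(io.StringIO("a;b\n1;2\n"), {"delimiter": ";"})))
```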

Handling incorrect encoding detection

For detecting the encoding of a stream we are using chardet on a small sample of the stream data.

This mostly works fine, but there are cases of incorrect detection. This is a hard problem.

In some examples, such as when detecting the encoding of this file, ascii is detected from the sample, but if we pass chardet the whole file, it correctly guesses ISO-8859-2. However, passing the entire stream contents to the detector by default is very expensive in time.

We need to do something here: investigate different ways to detect encoding etc.

I've also added the ability to pass the encoding string to the pipeline/validator constructors, which is potentially useful for large-scale programmatic use; yet not so much for the average user.
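
One cheap middle ground is to make the sample size configurable rather than passing the whole stream. A sketch, assuming chardet and a binary stream (the helper name and default size are assumptions):

```python
import chardet


def detect_encoding(stream, sample_size=64 * 1024):
    """Guess the encoding from a configurable sample of a binary stream; a larger
    sample trades speed for accuracy, as discussed above."""
    sample = stream.read(sample_size)
    result = chardet.detect(sample)
    return result.get("encoding") or "utf-8"
```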

Spend publishing results aggregator

The spend publishing results aggregator is a (Python) script that writes results to a results.csv file used in a spend publishing dashboard data set (example dataset implementation).

It needs to be invoked, per data pipeline, as part of a batch processing job on a list of data sources.

How it works

  • A developer runs the validation over a set of data sources at regular intervals
  • Said data is managed in a git repository (eg), which the developer has locally, along with Tabular Validator code
  • The developer runs a pipeline.Batch. This gets all data/schema URLs out of the data/sources.csv, and runs each through a validation pipeline (pipeline.Pipeline)
  • At the end of each pipeline run, a post-processing hook loads the aggregator callable, which is responsible for writing the data/results.csv (each pipeline run appends a new record to the results.csv)
  • Once all data in the data/sources.csv has been run through a validation pipeline, the batcher instance also calls its own post-processor; this post-processing function commits the new changes in the data git repo and pushes them up to the central repo
  • At this point, the updated results are live, and the associated Spend Publishing Dashboard instance is working in front of the new data

Requirements

  • pipeline.Batch class, which takes a CSV of sources, knows how to get the data and schema URLs out of it, and then lines them up for running through a pipeline.Pipeline (see the sketch after this list)
    • pipeline.Batch also has a post-processing hook to call a callable with an instance of the batcher as its argument
  • pipeline.Pipeline needs to support a post-processing hook to call a callable with an instance of the pipeline as its argument
  • result_aggregator that will be called as the post-processing function of pipeline.Pipeline
  • data_deployer that will be called as the post-processing function of pipeline.Batch
  • Tests for it all
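
A minimal sketch of the two hooks described above; class and argument names follow the requirements list, but the shapes (and the sources.csv column names) are assumptions, not the actual code:

```python
import csv


class Pipeline(object):
    """Sketch of a validation pipeline with a post-processing hook."""

    def __init__(self, data_url, schema_url, post_task=None):
        self.data_url = data_url
        self.schema_url = schema_url
        self.post_task = post_task  # e.g. result_aggregator

    def run(self):
        # ... run the validation of data_url against schema_url here ...
        if self.post_task:
            self.post_task(self)  # called with the pipeline instance


class Batch(object):
    """Reads data/schema URLs from a sources CSV and runs each pair through a Pipeline."""

    def __init__(self, sources_csv, pipeline_post_task=None, post_task=None):
        self.sources = self._read_sources(sources_csv)
        self.pipeline_post_task = pipeline_post_task  # e.g. result_aggregator
        self.post_task = post_task                    # e.g. data_deployer

    @staticmethod
    def _read_sources(path):
        # the "data" and "schema" column names are an assumption about data/sources.csv
        with open(path) as stream:
            return [(row["data"], row["schema"]) for row in csv.DictReader(stream)]

    def run(self):
        for data_url, schema_url in self.sources:
            Pipeline(data_url, schema_url, post_task=self.pipeline_post_task).run()
        if self.post_task:
            self.post_task(self)  # called with the batcher instance
```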

Structure validator

The StructureValidator checks data for correct structure; a couple of the checks are sketched after the task list below.

  • Implement shared validator API
  • Check for blank rows
  • Check for headless columns
  • Check for defective rows
  • Check for duplicate rows
  • Check for duplicate column headings
  • Implement run method
  • Write tests as stand alone (via self.run)
  • Write tests as part of pipeline (via PipelineValidator.run)
  • Check for blank columns
    • Own issue #12
  • Transform stream for subsequent validators (More detail below)
  • Limit sample (eg: 1000 rows)
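
A sketch of two of these checks written as generators (names and result shapes are assumptions, not the actual validator code):

```python
def check_blank_rows(rows):
    """Yield a result for every completely blank row."""
    for index, row in enumerate(rows):
        if all(cell in ("", None) for cell in row):
            yield {"category": "row", "type": "blank-row", "row_index": index}


def check_duplicate_headers(headers):
    """Yield a result for every repeated column heading."""
    seen = set()
    for index, header in enumerate(headers):
        if header in seen:
            yield {"category": "header", "type": "duplicate-header", "column_index": index}
        seen.add(header)


# usage sketch
print(list(check_blank_rows([["1", "a"], ["", ""], ["2", "b"]])))
print(list(check_duplicate_headers(["id", "name", "name"])))
```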

Register/upload to PyPI

As discussed.

I can upload from my account, of course with all license rights to OKFN, but what we want is to upload from a common OKFN account for centralised management.

The improbability of being able to validate a date/time/datetime format if the source is Excel

When the source is Excel, we use xlrd to turn Excel's number formatting for dates into Python dates. At the time of converting the Excel source into a text stream for use in goodtables, we don't know if there is a schema, or what the formatting of a date should be according to the schema. So, the date is forced to isoformat.

That means that later on, the only way a date/time/datetime field from Excel will pass its schema validation is if the format is set to any.
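
For reference, a minimal sketch of the conversion xlrd makes possible (the helper name is hypothetical; datemode comes from the workbook, e.g. book.datemode):

```python
import datetime

import xlrd


def excel_cell_to_iso(cell_value, datemode):
    """Convert an Excel date serial number to an ISO8601 string."""
    year, month, day, hour, minute, second = xlrd.xldate_as_tuple(cell_value, datemode)
    return datetime.datetime(year, month, day, hour, minute, second).isoformat()
```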

Just putting this here for future.

License?

@pwalsh same question as for validator: what would be the license of this repo? thanks!

Provide additional test coverage

Add and/or ensure complete test coverage for the following. If it is an option, ensure tests for all states.

  • Col name order
  • client report stream
  • other validator options
  • Formatting of dates and numbers (";" and "," in numbers, various formats in dates)
  • Run some tests on a sample of GLA data
  • Run some tests on a sample of messytables data

Travis false positives

@tryggvib @rgrp

A question:

Seeing as OK is using Travis regularly, are false positives a common occurrence?

eg: https://travis-ci.org/okfn/tabular-validator/builds/52039531

I'm seeing a false positive test failure every 5-10 pushes, always due to network issues (connection to http resources like CSVs in tests, and dependencies via pip).

As you know, I'm also running the same tests on Shippable, and I just don't have this happen, ever.

It is not a big problem, but I'd like to know if it is something you see as Travis users, and/or, if there is some way I can get travis to retry if it has network errors.

Check data is valid according to foreignKeys descriptor

When validating data against a JSON Table Schema, there may be foreign keys to other files.

This will need to be implemented in a way that supports:

  • Processing of a raw data source + a table schema
  • Processing of a DataPackage
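
A simplified sketch of what such a check could look like for a single in-memory reference; the real implementation would need to resolve the referenced resource from the schema's foreignKeys descriptor, and the names and shapes here are illustrative assumptions:

```python
def check_foreign_key(rows, field, reference_values):
    """Yield an error for every value of `field` missing from the referenced resource."""
    reference = set(reference_values)
    for index, row in enumerate(rows):
        if row.get(field) not in reference:
            yield {"row_index": index, "field": field, "type": "foreign-key"}


# usage sketch
rows = [{"country": "fr"}, {"country": "xx"}]
print(list(check_foreign_key(rows, "country", ["fr", "de", "es"])))
```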

Change the name?

Tabular Validator is not a great name - it is not even that descriptive, seeing as we are actually working towards validation (or linting) and optional transformation (or cleaning) of data.

I don't have any suggestions right now, just putting it out there.

General pipeline refactoring

There are a few things I want to change, but as long as tests are passing I'm not focussing on them now. Rather, this issue is a placeholder for notes on changes I want to make while working on other things.

  • Get rid of all the run_* stuff in the pipeline calling code per validator - it is enough to call run on each validator
  • Tidy up all the stuff around dry_run and transform
  • Refactor all distinct validation checks to generators
  • Fix workspace and let it support both an s3 directory and a local directory (use https://github.com/pudo/barn for that - it does everything I need already)
  • We don't need the UTF-8 text stream until we start reading the data, so we only need to do the conversion as we iterate over the data
  • Remove all the help and help_edit keys on the result dicts

Refactor report output structure

Based on notes in user story 3A.

So, what we have is a summary object with:

  • Total row count
  • Bad row count
  • Total column count
  • Bad column count
  • Columns:
    • name
    • index
    • incorrect_type_percent

The summary object is calculated from the report results (both shapes are sketched as Python dicts below). Each report result object has:

  • result category (e.g.: row, column, header)
  • result level (e.g.: info, warning, error)
  • result message (the message describing the result)
  • result type (e.g: "Invalid Header")
  • row index (of this result)
  • row name (of this result, if row has id or _id field, else None)
  • column index (of this result, None if not column error)
  • column name (of this result, '' if not column error)
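
A sketch of both shapes as Python dicts; the field names follow the notes above, and the values are purely illustrative:

```python
# proposed summary object (illustrative values)
summary = {
    "total_row_count": 100,
    "bad_row_count": 3,
    "total_column_count": 5,
    "bad_column_count": 1,
    "columns": [
        {"name": "amount", "index": 3, "incorrect_type_percent": 2},
    ],
}

# proposed report result object (illustrative values)
result = {
    "result_category": "row",       # row, column or header
    "result_level": "error",        # info, warning or error
    "result_message": "Row 4 is completely blank",
    "result_type": "Blank Row",
    "row_index": 4,
    "row_name": None,               # id/_id value if present, else None
    "column_index": None,           # None if not a column error
    "column_name": "",              # '' if not a column error
}
```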

Run spec validations before data pipeline

Spec validation is exposed as a pipeline validator like anything else. However, in the main use case, we really want to validate the spec files (schema, CSV dialect, data package) before we even open the data.

Create system for help docs

Users should be able to contribute to creating help docs that correspond with errors we throw in the pipeline.

These docs can be shown on the web via the contextual help in goodtables-web. This has been somewhat stubbed out there.

Actions:

  • Create a new repo for content only (goodtables-handbook, goodtables-help, etc.)
  • Have a markdown doc that corresponds with the ID of each error in goodtables
  • Pull the markdown content in as the help content for the errors
  • Take this chance to tidy up the report result types a bit - e.g. use sequential names "schema_001", "structure_014", etc.

"Expected values" validator example

Flow

Receive data

  • Client provides one of either:
    • A file object
    • A stream (known to be CSV)
  • Client provides a configuration object

Prepare for analysis

  • If received configuration object, configure accordingly
  • If received probes that do not match the CSV, throw and say why

Make analysis

  • Probe values based on config
    • This could be extended considerably. I'm suggesting a very simple set of rules for first pass (see config)

Return report

  • Respond with a report, which is a JSON object, like:
    • name (of file, or stream, or whatever)
    • config: the passed on config for this run
    • results:
      • per row: array of rows that have any deviations, with the deviation shown
      • per dataset: array of probe rules, and results of each probe
      • description: (eg: the CSV is valid according to probe because...)

Implementation

Code

Configuration

  • Exit early (first value outside of probe conditions)
  • A rules object with probing conditions (see the sketch after this list). Ideas from discussion on the labs list with Friedrich:
    • Each rule applies to a column
      • Check if values in column are not NULL
      • Check that values in column are within X range (e.g.: 10,000 <-> 25,000)
      • For each probe, set a detection strategy:
        • true/false (each row should comply)
        • weighted: e.g.: warn of more than 5% of data outside condition
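
A minimal sketch of such a probe runner; the rule keys and the weighted strategy follow the ideas above, while the function and result shapes are assumptions:

```python
def run_probes(rows, rules, warn_threshold=0.05):
    """Run simple column probes over rows of dicts."""
    results = []
    for rule in rules:  # e.g. {"column": "amount", "not_null": True, "range": (10000, 25000)}
        column, deviations, total = rule["column"], 0, 0
        for row in rows:
            total += 1
            value = row.get(column)
            if rule.get("not_null") and value in (None, ""):
                deviations += 1
                continue
            if "range" in rule and isinstance(value, (int, float)):
                low, high = rule["range"]
                if not low <= value <= high:
                    deviations += 1
        if rule.get("strategy") == "weighted":
            # weighted: warn only if more than warn_threshold of the data deviates
            passed = total == 0 or deviations / float(total) <= warn_threshold
        else:
            # strict true/false: every row should comply
            passed = deviations == 0
        results.append({"column": column, "passed": passed, "deviations": deviations})
    return results


# usage sketch
rows = [{"amount": 12000}, {"amount": 90000}, {"amount": None}]
print(run_probes(rows, [{"column": "amount", "not_null": True, "range": (10000, 25000)}]))
```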

Complete CLI for validation pipeline

There is a CLI, but it doesn't yet support all pipeline options (only takes data_source).

Ensure the options it takes are consistent with the Web API options.

Schema validator

The SchemaValidator checks that data conforms to a JSON Table Schema.

  • Implement shared validator API
  • Create a better reference spec for JTS itself (see: https://github.com/dataprotocols/schemas)
  • Implement standalone run method
  • Check headers are valid according to schema
  • Check data is valid according to schema
    • This can be very deep, so discussed with @rgrp to minimally start with date and number validation, and build out from there in iterations. Will create separate issues when this issue closes.
  • Write tests as stand alone (via self.run)
  • Write tests as part of pipeline (via PipelineValidator.run)
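
A simplified sketch of the header check; the schema shape follows JSON Table Schema, and the helper itself is hypothetical:

```python
def check_headers(headers, schema):
    """Compare table headers against field names declared in a JSON Table Schema
    ({"fields": [{"name": ..., "type": ...}, ...]})."""
    expected = [field["name"] for field in schema.get("fields", [])]
    missing = [name for name in expected if name not in headers]
    extra = [name for name in headers if name not in expected]
    return missing, extra


# usage sketch
schema = {"fields": [{"name": "id", "type": "integer"}, {"name": "name", "type": "string"}]}
print(check_headers(["id", "title"], schema))  # (['name'], ['title'])
```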

Rabbit hole

Stuff that is beyond scope of this first pass, but that defines the larger scope of where we'd like to get.

  • Generate schema from the data, if we do not have a schema #15
  • foreignKeys: #17
  • constraints.minLength, constraints.maxLength, constraints.minimum, constraints.maximum needs discussion frictionlessdata/specs#161
  • Some issues around type and format. Would like to see this resolved before implementing deeper support of spec frictionlessdata/specs#159
