di / vladiate

A strict validation tool for CSV files

Home Page: https://pypi.org/project/vladiate

License: MIT License

Python 98.95% Makefile 1.05%

vladiate's Introduction

Vladiate

Description

Vladiate helps you write explicit assertions for every field of your CSV file.

Features

Write validation schemas in plain-old Python
No UI, no XML, no JSON, just code.
Write your own validators
Vladiate comes with a few by default, but there's no reason you can't write your own.
Validate multiple files at once
Either with the same schema, or different ones.

Documentation

Installation

Installing:

$ pip install vladiate

Quickstart

Below is an example of a vladfile.py

from vladiate import Vlad
from vladiate.validators import UniqueValidator, SetValidator
from vladiate.inputs import LocalFile

class YourFirstValidator(Vlad):
    source = LocalFile('vampires.csv')
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }

Here we define a number of validators for a local file vampires.csv, which would look like this:

Column A,Column B
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire

We then run vladiate in the same directory as your .csv file:

$ vladiate

And get the following output:

Validating YourFirstValidator(source=LocalFile('vampires.csv'))
Passed! :)

Handling Changes

Let's imagine that you've gotten a new CSV file, potential_vampires.csv, that looks like this:

Column A,Column B
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire
Ronald Reagan,Maybe A Vampire

If we were to update our first validator to use this file as follows:

- class YourFirstValidator(Vlad):
-     source = LocalFile('vampires.csv')
+ class YourFirstFailingValidator(Vlad):
+     source = LocalFile('potential_vampires.csv')

we would get the following error:

Validating YourFirstFailingValidator(source=LocalFile('potential_vampires.csv'))
Failed :(
  SetValidator failed 1 time(s) (25.0%) on field: 'Column B'
    Invalid fields: ['Maybe A Vampire']

And we would know that we'd either need to sanitize this field, or add it to the SetValidator.

Starting from scratch

To make writing a new vladfile.py easy, Vladiate will give meaningful error messages.

Given the following as real_vampires.csv:

Column A,Column B,Column C
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire
Ronald Reagan,Maybe A Vampire

We could write a bare-bones validator as follows:

class YourFirstEmptyValidator(Vlad):
    source = LocalFile('real_vampires.csv')
    validators = {}

Running this with vladiate would give the following error:

Validating YourFirstEmptyValidator(source=LocalFile('real_vampires.csv'))
Missing...
  Missing validators for:
    'Column A': [],
    'Column B': [],
    'Column C': [],

Vladiate expects something to be specified for every column, even if it is an empty list (more on this later). We can easily copy and paste from the error into our vladfile.py to make it:

class YourFirstEmptyValidator(Vlad):
    source = LocalFile('real_vampires.csv')
    validators = {
        'Column A': [],
        'Column B': [],
        'Column C': [],
    }

When we run this with vladiate, we get:

Validating YourFirstEmptyValidator(source=LocalFile('real_vampires.csv'))
Failed :(
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column A'
    Invalid fields: ['Dracula', 'Vlad the Impaler', 'Count Chocula', 'Ronald Reagan']
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column B'
    Invalid fields: ['Maybe A Vampire', 'Not A Vampire', 'Vampire']
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column C'
    Invalid fields: ['Real', 'Not Real']

This is because Vladiate interprets an empty list of validators for a field as an EmptyValidator, which expects an empty string in every field. This helps us make meaningful decisions when adding validators to our vladfile.py. It also ensures that we are not forgetting about a column or field which is not empty.

Built-in Validators

Vladiate comes with a few common validators built-in:

class Validator

Generic validator. Should be subclassed by any custom validators. Not to be used directly.
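
A custom validator might look roughly like the sketch below. To keep it runnable without vladiate installed, Validator and ValidationException are local stand-ins, and the validate(field, row) / bad / fail_count shape is assumed from the NotEmptyValidator proposal quoted in the issues further down. EvenIntValidator and count_failures are hypothetical names used only for illustration.

```python
class ValidationException(Exception):
    """Stand-in for vladiate's exception type (assumption)."""


class Validator:
    """Stand-in base class; the real one lives in vladiate.validators."""

    def __init__(self):
        self.fail_count = 0

    def validate(self, field, row=None):
        raise NotImplementedError

    @property
    def bad(self):
        raise NotImplementedError


class EvenIntValidator(Validator):
    """Hypothetical validator: the field must be an even integer."""

    def __init__(self):
        super().__init__()
        self.invalid = set()

    def validate(self, field, row=None):
        try:
            if int(field) % 2 != 0:
                raise ValueError
        except ValueError:
            # Remember the offending value so it can be reported later.
            self.invalid.add(field)
            raise ValidationException(
                "'{}' is not an even integer".format(field)
            )

    @property
    def bad(self):
        return self.invalid


def count_failures(validator, fields):
    """Mimic the caller's loop: catch failures and bump fail_count."""
    for field in fields:
        try:
            validator.validate(field)
        except ValidationException:
            validator.fail_count += 1
    return validator.fail_count
```

With the real library, the class would presumably subclass vladiate's own Validator and be listed under a column name in the validators dict.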

class CastValidator

Generic "can-be-cast-to-x" validator. Should be subclassed by any cast-test validator. Not to be used directly.

class IntValidator

Validates whether a field can be cast to an int type or not.

empty_ok=False: Specify whether a field which is an empty string should be ignored.

class FloatValidator

Validates whether a field can be cast to a float type or not.

empty_ok=False: Specify whether a field which is an empty string should be ignored.

class SetValidator

Validates whether a field is in the specified set of possible fields.

valid_set=[]: List of valid possible fields.
empty_ok=False: Implicitly adds the empty string to the specified set.
ignore_case=False: Ignore the case between values in the column and the valid set.

class UniqueValidator

Ensures that a given field is not repeated in any other row. Can optionally determine "uniqueness" with other fields in the row as well via unique_with.

unique_with=[]: List of field names to make the primary field unique with.
empty_ok=False: Specify whether a field which is an empty string should be ignored.
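
The idea behind unique_with can be sketched as uniqueness over a tuple of fields rather than a single value. This is a stdlib-only illustration, not vladiate's implementation; find_duplicates and the sample rows are hypothetical.

```python
def find_duplicates(rows, field, unique_with=()):
    """Return the key tuples that occur more than once across rows."""
    seen = set()
    duplicates = set()
    for row in rows:
        # The key is the primary field plus any companion fields.
        key = tuple(row[name] for name in (field, *unique_with))
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


rows = [
    {"name": "Dracula", "city": "Bran"},
    {"name": "Dracula", "city": "London"},
    {"name": "Dracula", "city": "Bran"},
]
```

On its own, "name" is repeated in every row. Combined with "city", only the first and third rows collide.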

class RegexValidator

Validates whether a field matches the given regex using re.match().

pattern=r'di^': The regex pattern. Fails for all fields by default.
full=False: Specify whether we should use a fullmatch() or match().
empty_ok=False: Specify whether a field which is an empty string should be ignored.
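
The difference between match() and fullmatch(), and why the default pattern rejects everything, can be seen with the standard re module alone. The sample pattern and strings below are illustrative only.

```python
import re

pattern = r"\d{3}"

# match() only anchors at the start, so trailing characters are accepted.
prefix_ok = re.match(pattern, "123abc") is not None

# fullmatch() requires the whole field to match the pattern.
whole_ok = re.fullmatch(pattern, "123abc") is not None
exact_ok = re.fullmatch(pattern, "123") is not None

# r'di^' asks for "di" followed by the start of the string, which can
# never occur without re.MULTILINE, so nothing matches.
default_never_matches = re.match(r"di^", "di") is None
```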

class RangeValidator

Validates whether a field falls within a given range (inclusive). Can handle integers or floats.

low: The low value of the range.
high: The high value of the range.
empty_ok=False: Specify whether a field which is an empty string should be ignored.

class EmptyValidator

Ensure that a field is always empty. Essentially the same as an empty SetValidator. This is used by default when a field has no validators.

class NotEmptyValidator

The opposite of an EmptyValidator. Ensure that a field is never empty.

class Ignore

Always passes validation. Used to explicitly ignore a given column.

class RowValidator

Generic row validator. Should be subclassed by any custom validators. Not to be used directly.

class RowLengthValidator

Validates that each row has the expected number of fields. The expected number of fields is inferred from the CSV header row read by csv.DictReader.
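
To see why a row-length check is useful, the sketch below shows how csv.DictReader handles rows that are shorter or longer than the header. It uses only the standard library. row_length is a hypothetical helper, and how RowLengthValidator detects these cases internally is not shown here.

```python
import csv
import io

data = "a,b,c\n1,2\n1,2,3,4\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Short row: the missing field is filled with restval (None by default).
short_row = rows[0]

# Long row: extra values are collected in a list under restkey
# (None by default).
long_row = rows[1]


def row_length(row):
    """Count the fields actually present in a DictReader row."""
    extras = row.get(None) or []
    present = [k for k, v in row.items() if k is not None and v is not None]
    return len(present) + len(extras)
```

Neither malformed row raises an error while being read, so a mismatch goes unnoticed unless something checks for it explicitly.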

Built-in Input Types

Vladiate comes with the following input types:

class VladInput

Generic input. Should be subclassed by any custom inputs. Not to be used directly.

class LocalFile

Read from a file local to the filesystem.

filename: Path to a local CSV file.

class S3File

Read from a file in S3. Optionally can specify either a full path, or a bucket/key pair.

Requires the boto library, which should be installed via pip install vladiate[s3].

path=None: A full S3 filepath (e.g., s3://foo.bar/path/to/file.csv)
bucket=None: S3 bucket. Must be specified with a key.
key=None: S3 key. Must be specified with a bucket.

class String

Read CSV from a string. Can take either a str or a StringIO.

string_input=None: Regular Python string input.
string_io=None: StringIO input.

Running Vlads Programmatically

class Vlad

Initialize a Vlad programmatically.

source: Required. Any VladInput.
validators={}: Dict mapping each field name to a list of validators. Optional, defaults to the class variable validators if set, otherwise uses EmptyValidator for all fields.
row_validators=[]: List of row-level validators. Validators provided here operate on entire rows and can be used to define constraints that involve more than one field. Optional, defaults to the class variable row_validators if set, otherwise [], which does not perform any row-level validation.
delimiter=',': The delimiter used within your CSV source. Optional, defaults to a comma.
ignore_missing_validators=False: Whether to fail validation if there are fields in the file for which the Vlad does not have validators. Optional, defaults to False.
quiet=False: Whether to disable log output generated by validations. Optional, defaults to False.
file_validation_failure_threshold=None: Stops validating the file after this failure threshold is reached. Accepts a value between 0.0 and 1.0; 1.0 (100%) validates the entire file. Optional, defaults to None.

For example:

from vladiate import Vlad
from vladiate.inputs import LocalFile
Vlad(source=LocalFile('path/to/local/file.csv')).validate()

Testing

To run the tests:

make test

To run the linter:

make lint

Command Line Arguments

Usage: vladiate [options] [VladClass [VladClass2 ... ]]

Options:
  -h, --help            show this help message and exit
  -f VLADFILE, --vladfile=VLADFILE
                        Python module file to import, e.g. '../other.py'.
                        Default: vladfile
  -l, --list            Show list of possible vladiate classes and exit
  -V, --version         show version number and exit
  -p PROCESSES, --processes=PROCESSES
                        attempt to use this number of processes, Default: 1
  -q, --quiet           disable console log output generated by validations

License

Open source MIT license.

vladiate's People

Contributors

boblannon, csojinb, di, dp247, haritha-ravi, jonafato, maleix, mwang87, qugu, santilytics, sgpeter1, simo97, syxolk

vladiate's Issues

Validators defined as class attributes keep previous state around

When extending the Vlad class, validators defined as class attributes will keep state from a previous validation around:

class YourFirstValidator(Vlad):
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }
>>> vlad = YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad.validate()
Validating YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
>>> vlad2 = YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad2.validate()
Validating YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Failed :(
  UniqueValidator failed 3 time(s) (100.0%) on field: 'Column A'
    Invalid fields: ['('Count Chocula',)', '('Dracula',)', '('Vlad the Impaler',)']

For now, this can be worked around by making validators an instance variable:

class YourSecondValidator(Vlad):
    def __init__(self, *args, **kwargs):
        self.validators = {
            'Column A': [
                UniqueValidator()
            ],
            'Column B': [
                SetValidator(['Vampire', 'Not A Vampire'])
            ]
        }
        super(YourSecondValidator, self).__init__(*args, **kwargs)
>>> vlad = YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad.validate()
Validating YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
>>> vlad2 = YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad2.validate()
Validating YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
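
The underlying Python behaviour can be reproduced without vladiate. In the sketch below, Tracker is a hypothetical stand-in for a stateful validator such as UniqueValidator. The point is only that a class attribute is built once and shared, while an attribute assigned in __init__ is rebuilt per instance.

```python
class Tracker:
    """Stand-in for a stateful validator such as UniqueValidator."""

    def __init__(self):
        self.seen = set()


class SharedState:
    # Evaluated once, at class-definition time: every instance shares it.
    validators = {"Column A": [Tracker()]}


class PerInstanceState:
    def __init__(self):
        # Evaluated on every instantiation: each instance gets fresh objects.
        self.validators = {"Column A": [Tracker()]}


first, second = SharedState(), SharedState()
first.validators["Column A"][0].seen.add("Dracula")
shared_leak = "Dracula" in second.validators["Column A"][0].seen

third, fourth = PerInstanceState(), PerInstanceState()
third.validators["Column A"][0].seen.add("Dracula")
instance_leak = "Dracula" in fourth.validators["Column A"][0].seen
```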

Row-level validation

Is there a recommended way to do row-level validation? I have a work in progress / proof of concept branch to add the kind of logic I'm looking to use at https://github.com/jonafato/vladiate/tree/row-level-validators. My primary use cases here are the ability to validate the length of a row (csv.DictReader is very permissive here and reads rows of variable length) and to validate pairs of fields together. This could be conceptually similar to other validation libraries like Django's form validation.

If this feature seems useful to you, I'm happy to open a PR and fill in the rest of the implementation.

setup.py fails due to boto dependency

This package appears to depend on boto but the documentation does not mention it is required. Don't want or need to reference S3.

Processing dependencies for vladiate==0.0.18
Searching for boto
Reading https://pypi.python.org/simple/boto/
^Cinterrupted

Please mark boto as a required dependency in documentation or make class S3File optional when boto is available.

NotEmptyValidator minimal logging

NotEmptyValidator always returns an empty set for its bad property, regardless of failures. As a result, the Vlad class returns False and logs the general "Failed :(" message without indicating which field failed or how many times it failed. The solution proposed below avoids repeating empty strings in the logger output while providing logging on par with the other validators.

class NotEmptyValidator(Validator):
    ''' Validates that a field is not empty '''

    def __init__(self):
        self.fail_count = 0
        self.empty = set([])

    def validate(self, field, row={}):
        if field == '':
            self.empty.add(field)
            raise ValidationException("Row has empty field in column")

    @property
    def bad(self):
        return self.empty

CSV without header

Is it possible to validate files without header?

I expected something like this but it seems it's not possible.

validators = {
        0: [
            UniqueValidator(),
            IntValidator(),
        ],
        1: [
            FloatValidator(),
        ],
        2: [
            SetValidator(["USD", "EUR", "GBP"]),
        ],
    }
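
For reference, the standard library itself can read header-less data when the field names are supplied explicitly. The sketch below shows only that csv.DictReader behaviour. Whether Vlad exposes a way to pass field names through is not established here, and the column names are made up for the example.

```python
import csv
import io

data = "1,9.99,USD\n2,4.50,EUR\n"

# Supplying fieldnames tells DictReader that the first line is data,
# not a header.
reader = csv.DictReader(
    io.StringIO(data), fieldnames=["id", "price", "currency"]
)
rows = list(reader)
```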

Stop validating a file when the failures reach a certain threshold

I intend to add an enhancement to vladiate which would optionally stop validating a file when the failures reach a certain threshold.

To add more context, this idea addresses an issue we are facing. We do not want to process a file further if vladiate reports a fail_count of more than 10%. Currently, vladiate has no way to stop validating a file early, so we have to validate the rest of the file even though we could stop. This is especially troublesome when validating very large CSV files.

I have created a PR to introduce this: #89
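
The early-stop idea can be sketched as follows. This is not the code from the PR. validate_with_threshold is a hypothetical function, and measuring the threshold against a row count known up front is an assumption made to keep the example small.

```python
def validate_with_threshold(fields, is_valid, threshold=None):
    """Validate fields, stopping once failures exceed threshold * total.

    Returns (failures, rows_checked). threshold=None checks everything.
    """
    total = len(fields)
    failures = 0
    for checked, field in enumerate(fields, start=1):
        if not is_valid(field):
            failures += 1
        if threshold is not None and failures > threshold * total:
            return failures, checked  # stop early
    return failures, total


# Five bad values followed by five good ones.
sample = ["x"] * 5 + ["1"] * 5
early = validate_with_threshold(sample, str.isdigit, threshold=0.2)
full = validate_with_threshold(sample, str.isdigit)
```

With a threshold of 0.2 over ten rows, the loop stops at the third failure instead of reading the remaining seven rows.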

Incompatibility with windows: module 'os' has no attribute 'EX_NOINPUT'

Hi there,
I'm trying to run this on windows and get this error on the commands:
vladiate
vladiate vlads
vladiate mymain.py

C:path\to\project> vladiate
Could not find any vladfile! Ensure file ends in '.py' and see --help for available options.
Traceback (most recent call last):
  File "c:\users\scollier\appdata\local\programs\python\python36-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\scollier\appdata\local\programs\python\python36-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\scollier\Envs\virtualenv\Scripts\vladiate.exe\__main__.py", line 9, in <module>
  File "c:\users\scollier\envs\virtualenv\lib\site-packages\vladiate\main.py", line 173, in main
    return os.EX_NOINPUT
AttributeError: module 'os' has no attribute 'EX_NOINPUT'

This seems to all stem from: https://github.com/di/vladiate/blob/master/vladiate/main.py:main()
All the os.EX_* values are UNIX only.

def main():
    arguments = parse_args()
    logger = logs.logger

    if arguments.show_version:
        print("Vladiate %s" % (get_distribution('vladiate').version, ))
        return os.EX_OK

    vladfile = find_vladfile(arguments.vladfile)
    if not vladfile:
        logger.error(
            "Could not find any vladfile! Ensure file ends in '.py' and see "
            "--help for available options."
        )
        return os.EX_NOINPUT

    docstring, vlads = load_vladfile(vladfile)

    if arguments.list_commands:
        logger.info("Available vlads:")
        for name in vlads:
            logger.info("    " + name)
        return os.EX_OK

    if not vlads:
        logger.error("No vlad class found!")
        return os.EX_NOINPUT

    # make sure specified vlad exists
    if arguments.vlads:
        missing = set(arguments.vlads) - set(vlads.keys())
        if missing:
            logger.error("Unknown vlad(s): %s\n" % (", ".join(missing)))
            return os.EX_UNAVAILABLE
        else:
            names = set(arguments.vlads) & set(vlads.keys())
            vlad_classes = [vlads[n] for n in names]
    else:
        vlad_classes = vlads.values()

    # validate all the vlads, and collect the validations for a good exit
    # return code
    if arguments.processes == 1:
        for vlad in vlad_classes:
            vlad(source=vlad.source).validate()

    else:
        proc_pool = Pool(
            arguments.processes
            if arguments.processes <= len(vlad_classes)
            else len(vlad_classes)
        )
        proc_pool.map(_vladiate, vlad_classes)
        try:
            if not result_queue.get_nowait():
                return os.EX_DATAERR
        except Empty:
            pass
        return os.EX_OK

Would you be able to change this to a more windows-friendly return value?
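
One portable option is to fall back to the numeric sysexits.h values when the os.EX_* constants are missing. This is a sketch of that approach, not necessarily how the project resolved the issue.

```python
import os

# sysexits.h values, used as fallbacks where os.EX_* is not defined
# (for example on Windows).
EX_OK = getattr(os, "EX_OK", 0)
EX_DATAERR = getattr(os, "EX_DATAERR", 65)
EX_NOINPUT = getattr(os, "EX_NOINPUT", 66)
EX_UNAVAILABLE = getattr(os, "EX_UNAVAILABLE", 69)
```

main() could then return these module-level names instead of reading the attributes from os directly.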

shell exitcode should return 1 if vladiate fails (instead of 0)

I am calling vladiate to validate a valid data.csv file, the exitcode returned is 0 for success:

$ vladiate
Validating YourFirstValidator(source=LocalFile('data.csv'))
Passed! :)
$ echo $?
0

When I mod data.csv to make vladiate fail, the exit code is still 0 (success):

$ vladiate
Validating YourFirstValidator(source=LocalFile('data.csv'))
Failed :(
  SetValidator failed 1 time(s) (1.5%) on field: 'provisionstate'
    Invalid fields: ['notprovisioned2']
$ echo $?
0

Would it be possible to modify vladiate to return an exitcode of failure (1)?

best,

Suppress output if result set is large

For large lists of violations, the output can be overwhelming. This is very noticeable when you're starting to work on a new vladfile for a large dataset. All of the empty validators will print every single value for a column, and if the dataset has 500,000 rows, oof.

It would be nice if the output was truncated to the first 100 or so.

add a quiet option (-q)

Hi,

Since vladiate 0.0.20, I can use the shell exit code to tell whether validation succeeded.

Now I would like to add a quiet option (-q) so that vladiate validates silently and only signals an error when validation fails.

Best

Add option to check header order

Some use cases require a specified field order (e.g., an MSSQL BULK INSERT statement). It would be great if the Vlad class had an option to check whether the source file's fields exactly match those specified in the Validator class, including their order. The proposed changes below add this functionality via an ignore_field_order parameter.

class Vlad:
    def __init__(self, source, validators={}, default_validator=EmptyValidator,
                 delimiter=None, ignore_missing_validators=False, ignore_field_order=True):
        self.logger = logs.logger
        self.failures = defaultdict(lambda: defaultdict(list))
        self.missing_validators = None
        self.missing_fields = None
        self.source = source
        self.validators = validators or getattr(self, 'validators', {})
        self.delimiter = delimiter or getattr(self, 'delimiter', ',')
        self.line_count = 0
        self.ignore_missing_validators = ignore_missing_validators
        self.ignore_field_order = ignore_field_order
        self.validators.update({
            field: [default_validator()]
            for field, value in self.validators.items() if not value
        })

    def _log_debug_failures(self):
        for field_name, field_failure in self.failures.items():
            self.logger.debug("\nFailure on field: \"{}\":".format(field_name))
            for i, (row, errors) in enumerate(field_failure.items()):
                self.logger.debug("  {}:{}".format(self.source, row))
                for error in errors:
                    self.logger.debug("    {}".format(error))

    def _log_validator_failures(self):
        for field_name, validators_list in self.validators.items():
            for validator in validators_list:
                if validator.bad:
                    self.logger.error(
                        "  {} failed {} time(s) ({:.1%}) on field: '{}'".format(
                            validator.__class__.__name__, validator.fail_count,
                            validator.fail_count / self.line_count, field_name))

                    try:
                        # If self.bad is iterable, it contains the fields which
                        # caused it to fail
                        invalid = list(validator.bad)
                        shown = [
                            "'{}'".format(field) for field in invalid[:99]
                        ]
                        hidden = [
                            "'{}'".format(field)
                            for field in invalid[99:]
                        ]
                        self.logger.error(
                            "    Invalid fields: [{}]".format(", ".join(shown)))
                        if hidden:
                            self.logger.error(
                                "    ({} more suppressed)".format(len(hidden)))
                    except TypeError:
                        pass

    def _log_missing_validators(self):
        self.logger.error("  Missing validators for:")
        self._log_missing(self.missing_validators)

    def _log_missing_fields(self):
        self.logger.error("  Missing expected fields:")
        self._log_missing(self.missing_fields)

    def _log_missing(self, missing_items):
        self.logger.error(
            "{}".format("\n".join([
                "    '{}': [],".format(field)
                for field in sorted(missing_items)])))

    def validate(self):
        self.logger.info("\nValidating {}(source={})".format(
            self.__class__.__name__, self.source))
        reader = csv.DictReader(self.source.open(), delimiter=self.delimiter)
        if not reader.fieldnames:
            self.logger.info(
                "\033[1;33m" + "Source file has no field names" + "\033[0m"
            )
            return False

        self.missing_validators = set(reader.fieldnames) - set(self.validators)
        if self.missing_validators:
            self.logger.info("\033[1;33m" + "Missing..." + "\033[0m")
            self._log_missing_validators()
            if not self.ignore_missing_validators:
                return False

        self.missing_fields = set(self.validators) - set(reader.fieldnames)
        if self.missing_fields:
            self.logger.info("\033[1;33m" + "Missing..." + "\033[0m")
            self._log_missing_fields()
            return False

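        if not self.ignore_field_order:
            # Compare against the validator keys in order; a list never
            # equals the dict itself. Relies on dicts preserving insertion
            # order (Python 3.7+).
            if reader.fieldnames != list(self.validators):
                self.logger.info("Source file field names do not exactly "
                                 "match supplied validator fields. Order matters.")
                return False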

        for line, row in enumerate(reader):
            self.line_count += 1
            for field_name, field in row.items():
                if field_name in self.validators:
                    for validator in self.validators[field_name]:
                        try:
                            validator.validate(field, row=row)
                        except ValidationException as e:
                            self.failures[field_name][line].append(e)
                            validator.fail_count += 1

        if self.failures:
            self.logger.info("\033[0;31m" + "Failed :(" + "\033[0m")
            self._log_debug_failures()
            self._log_validator_failures()
            return False
        else:
            self.logger.info("\033[0;32m" + "Passed! :)" + "\033[0m")
            return True

A new release

I found useful functionality in this library, but there is no released package containing it.
When is a new release planned?

No great way to override logger

There's currently no way to turn off logging when using Vlads programmatically (i.e. not from the commandline) and by default it logs a lot. Workaround for now:

class MyVlad(Vlad):
    def __init__(self, *args, **kwargs):

        super().__init__(*args, **kwargs)

        class DummyLogger():
            def __init__(self):
                pass

            def __getattr__(self, name):
                return (lambda *x: None)

        self.logger = DummyLogger()

RangeValidator has no empty_ok parameter

RangeValidator has no option to allow empty records in a field, so it fails if the field contains an empty record "". I propose adding an empty_ok parameter to the RangeValidator class, similar to other validators, as shown below:

class RangeValidator(Validator):
    def __init__(self, low, high, empty_ok=False):
        self.fail_count = 0
        self.low = low
        self.high = high
        self.empty_ok = empty_ok
        self.outside = set()

    def validate(self, field, row={}):
        if field == '' and self.empty_ok:
            pass
        else:
            try:
                value = float(field)
                if not self.low <= value <= self.high:
                    raise ValueError
            except ValueError:
                self.outside.add(field)
                raise ValidationException(
                    "'{}' is not in range {} to {}".format(
                        field, self.low, self.high
                    )
                )

    @property
    def bad(self):
        return self.outside

No great way to inherit from another Vlad

For example, if we have:

class YourFirstValidator(Vlad):
    source = LocalFile('vampires.csv')
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }

And we wanted to create a new class that slightly modifies the existing Vlad, we'd have to do:

class YourSecondValidator(YourFirstValidator):
    validators = YourFirstValidator.validators
    validators['Column C'] = [SetValidator(['Real', 'Not Real'])]

(this one's for you, @dmcclory)
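
Note that assigning the parent's dict and then adding a key also changes the parent, because both names refer to the same object. A minimal stdlib sketch of a copy-based alternative is below. Base and Child are hypothetical, and plain strings stand in for validator instances.

```python
class Base:
    # Strings stand in for validator instances in this sketch.
    validators = {
        "Column A": ["unique"],
        "Column B": ["set"],
    }


class Child(Base):
    # Build a new dict so that Base.validators is left untouched.
    validators = {**Base.validators, "Column C": ["set"]}
```

The copy is shallow, so the lists (and, with real validators, the validator objects) are still shared between the two classes.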

pass file name as an argument to Vlad during initiation

We were running SubClassOfVlad.validate() in our own code, instead of using the vladiate binary. Since that file wasn't in the working directory, we had to do something like this:

vlad = SubclassedValidator()
vlad.source = LocalFile(csv_file)
vlad.validate()

Not too bad, but it would be nice to have an API that's more like this:

vlad = SubclassedValidator(csv_file=csv_file)
vlad.validate()

avro schema validator

Wondering if anyone has considered a validator that checks against a data schema.
I need to define a schema (maybe avro schema) and would like to validate that a csv passes the schema.
I could hand code the schema translation to a vlad file but thinking automatic would be better.

Load CSV through S3File() Doesn't Work in Python 3

Thanks to everyone who contributed to this package. It helps.

When I test it using LocalFile(filename), it works.
But when I test it using S3File as below, it doesn't work. I'm using Python 3.
test = YourFirstValidator(S3File(path=None, bucket='buckets', key='key/file.csv'))

I installed the package via "pip install vladiate",
but I found that the code I downloaded differs from the code on GitHub.
I downloaded this version: "vladiate-0.0.20.dist-info".

I investigated the failures and found that inputs.py in this version needs an update.
I believe the issue is that LocalFile returns a list, but S3File returns a str. So I added the one line marked below to make the code work.

under class S3File(VladInput):
...
ret = obj.get()['Body'].read().decode('utf-8')
# Convert str to list (added line)
ret = ret.splitlines()
...

I'm not an expert programmer, and I'm not 100% sure this fixes all potential issues, but it does resolve mine. I hope this helps other people who face the same issue.
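
The str-versus-lines difference described above can be reproduced with the csv module alone. The sketch below uses a made-up body string and does not touch S3 or vladiate.

```python
import csv
import io

body = "a,b\n1,2\n"

# Iterating a str yields single characters, so csv treats each one
# as a separate line.
from_str = list(csv.reader(body))

# Splitting into lines (or wrapping in StringIO) gives csv whole
# lines instead.
from_lines = list(csv.reader(body.splitlines()))
from_stringio = list(csv.reader(io.StringIO(body)))
```

Either splitlines() or io.StringIO gives the reader something that iterates line by line, which is what the added line in the report achieves.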

Improve coverage

This might include testing main.py as well, which is currently at 0% coverage.

Support Python 3

Add any (preferably all) to .travis.yml:

  - "3.2"
  - "3.3"
  - "3.4"
  - "3.5"
  - "3.5-dev" # 3.5 development branch
  - "nightly" # currently points to 3.6-dev

Vladiate enhancements

In this PR, I extend Vladiate to support our CSV file validation requirements.

The following enhancements have been added:

  1. Ability to override or accept custom field names instead of inferring them from the CSV header.
  2. Ability to pass S3 credentials to S3File inputs.
  3. Gzip-compressed CSV file support for LocalFile and S3File inputs.
  4. Validators for the most commonly used CSV data types.
  5. Ability to turn off console logging.

High memory usage with too many invalid fields.

I have a CSV with roughly 140k lines that should be validated with vladiate. The validation code looks like this:

Vlad(source=LocalFile("largefile.csv"), validators={
        "id": [UniqueValidator()],
        "parent_id": [SetValidator(lots_of_ids)],
    }).validate()

The number of items in lots_of_ids is also around 140k.

For some reason I had a case where the ids in lots_of_ids had no intersection with the values from the parent_id column in largefile.csv. In this case Vladiate will (correctly) collect all invalid rows. However, this resulted in my PC's RAM usage climbing to at least 8GB.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?
