di / vladiate Goto Github PK

A strict validation tool for CSV files

Home Page: https://pypi.org/project/vladiate

License: MIT License

Python 98.95% Makefile 1.05%

vladiate's Introduction

Vladiate

Description

Vladiate helps you write explicit assertions for every field of your CSV file.

Features

Write validation schemas in plain-old Python: No UI, no XML, no JSON, just code.
Write your own validators: Vladiate comes with a few by default, but there's no reason you can't write your own.
Validate multiple files at once: Either with the same schema, or different ones.

Documentation

Installation

Installing:

$ pip install vladiate

Quickstart

Below is an example of a vladfile.py

from vladiate import Vlad
from vladiate.validators import UniqueValidator, SetValidator
from vladiate.inputs import LocalFile

class YourFirstValidator(Vlad):
    source = LocalFile('vampires.csv')
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }

Here we define a number of validators for a local file vampires.csv, which would look like this:

Column A,Column B
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire

We then run vladiate in the same directory as your .csv file:

$ vladiate

And get the following output:

Validating YourFirstValidator(source=LocalFile('vampires.csv'))
Passed! :)

Handling Changes

Let's imagine that you've gotten a new CSV file, potential_vampires.csv, that looks like this:

Column A,Column B
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire
Ronald Reagan,Maybe A Vampire

If we were to update our first validator to use this file as follows:

- class YourFirstValidator(Vlad):
-     source = LocalFile('vampires.csv')
+ class YourFirstFailingValidator(Vlad):
+     source = LocalFile('potential_vampires.csv')

we would get the following error:

Validating YourFirstFailingValidator(source=LocalFile('potential_vampires.csv'))
Failed :(
  SetValidator failed 1 time(s) (25.0%) on field: 'Column B'
    Invalid fields: ['Maybe A Vampire']

And we would know that we'd either need to sanitize this field, or add it to the SetValidator.

Starting from scratch

To make writing a new vladfile.py easy, Vladiate will give meaningful error messages.

Given the following as real_vampires.csv:

Column A,Column B,Column C
Vlad the Impaler,Not A Vampire
Dracula,Vampire
Count Chocula,Vampire
Ronald Reagan,Maybe A Vampire

We could write a bare-bones validator as follows:

class YourFirstEmptyValidator(Vlad):
    source = LocalFile('real_vampires.csv')
    validators = {}

Running this with vladiate would give the following error:

Validating YourFirstEmptyValidator(source=LocalFile('real_vampires.csv'))
Missing...
  Missing validators for:
    'Column A': [],
    'Column B': [],
    'Column C': [],

Vladiate expects something to be specified for every column, even if it is an empty list (more on this later). We can easily copy and paste from the error into our vladfile.py to make it:

class YourFirstEmptyValidator(Vlad):
    source = LocalFile('real_vampires.csv')
    validators = {
        'Column A': [],
        'Column B': [],
        'Column C': [],
    }

When we run this with vladiate, we get:

Validating YourFirstEmptyValidator(source=LocalFile('real_vampires.csv'))
Failed :(
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column A'
    Invalid fields: ['Dracula', 'Vlad the Impaler', 'Count Chocula', 'Ronald Reagan']
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column B'
    Invalid fields: ['Maybe A Vampire', 'Not A Vampire', 'Vampire']
  EmptyValidator failed 4 time(s) (100.0%) on field: 'Column C'
    Invalid fields: ['Real', 'Not Real']

This is because Vladiate interprets an empty list of validators for a field as an EmptyValidator, which expects an empty string in every field. This helps us make meaningful decisions when adding validators to our vladfile.py. It also ensures that we are not forgetting about a column or field which is not empty.
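The substitution described above can be pictured with a small stand-alone sketch. It does not import Vladiate: `AlwaysEmpty` and `fill_defaults` are hypothetical names used only for illustration, and the library itself may implement this differently.

```python
class AlwaysEmpty:
    """Hypothetical stand-in for a validator that accepts only empty strings."""

    def is_valid(self, field):
        return field == ""


def fill_defaults(validators, default=AlwaysEmpty):
    """Return a copy in which every empty validator list gets one default validator."""
    return {
        name: (checks if checks else [default()])
        for name, checks in validators.items()
    }


schema = fill_defaults({"Column A": [], "Column B": []})
rows = [{"Column A": "Dracula", "Column B": ""}]

# Collect every (field name, value) pair that a validator rejects.
failures = [
    (name, row[name])
    for row in rows
    for name, checks in schema.items()
    for check in checks
    if not check.is_valid(row[name])
]
print(failures)
```

Under this assumption, a column you have not thought about yet fails loudly as soon as it contains any data, which is the behaviour the output above shows.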

Built-in Validators

Vladiate comes with a few common validators built-in:

class Validator

Generic validator. Should be subclassed by any custom validators. Not to be used directly.

class CastValidator

Generic "can-be-cast-to-x" validator. Should be subclassed by any cast-test validator. Not to be used directly.

class IntValidator

Validates whether a field can be cast to an int type or not.

empty_ok=False: Specify whether a field which is an empty string should be ignored.

class FloatValidator

Validates whether a field can be cast to a float type or not.

empty_ok=False: Specify whether a field which is an empty string should be ignored.

class SetValidator

Validates whether a field is in the specified set of possible fields.

valid_set=[]: List of valid possible fields

empty_ok=False: Implicitly adds the empty string to the specified set.

ignore_case=False: Ignore the case between values in the column and valid set

class UniqueValidator

Ensures that a given field's value is not repeated in any other row. Can optionally determine "uniqueness" with other fields in the row as well via unique_with.

unique_with=[]: List of field names to make the primary field unique with.

empty_ok=False: Specify whether a field which is an empty string should be ignored.
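To make the effect of unique_with concrete, here is a minimal stand-alone sketch of compound-key uniqueness. It does not use Vladiate: `find_duplicates` is a hypothetical helper, and the way the real validator builds its keys is an assumption.

```python
def find_duplicates(rows, field, unique_with=()):
    """Return the key tuples that were already seen in an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        # The key is the primary field plus any companion fields.
        key = tuple(row[name] for name in (field, *unique_with))
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


rows = [
    {"name": "Dracula", "country": "Romania"},
    {"name": "Dracula", "country": "England"},
    {"name": "Dracula", "country": "Romania"},
]

alone = find_duplicates(rows, "name")
paired = find_duplicates(rows, "name", unique_with=["country"])
print(alone, paired)
```

With the name alone, the second and third rows both count as repeats. With the country added to the key, only the third row repeats an earlier combination.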

class RegexValidator

Validates whether a field matches the given regex using re.match().

pattern=r'di^': The regex pattern. Fails for all fields by default.

full=False: Specify whether we should use a fullmatch() or match().

empty_ok=False: Specify whether a field which is an empty string should be ignored.
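The difference between match() and fullmatch(), and the reason the default pattern fails for every field, can be checked with the standard re module alone. This is only a sketch of the regex behaviour, not of the validator class itself.

```python
import re

pattern = re.compile(r"\d{3}")

# match() only anchors at the start, so trailing characters are accepted.
prefix_ok = pattern.match("123abc") is not None
# fullmatch() requires the whole field to match.
whole_ok = pattern.fullmatch("123abc") is not None

# The default pattern asks for "di" followed by the start of the string,
# which no input can satisfy.
never = re.compile(r"di^")
default_hits = [s for s in ["", "di", "di^", "vladiate"] if never.match(s)]

print(prefix_ok, whole_ok, default_hits)
```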

class RangeValidator

Validates whether a field falls within a given range (inclusive). Can handle integers or floats.

low: The low value of the range.

high: The high value of the range.

empty_ok=False: Specify whether a field which is an empty string should be ignored.
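As a rough illustration of an inclusive range check on CSV text, the sketch below casts each field to float before comparing. `in_range` is a hypothetical helper, not part of Vladiate, and treating a non-numeric field as a failure is an assumption.

```python
def in_range(field, low, high, empty_ok=False):
    """Return True if the text field is a number within [low, high]."""
    if field == "":
        return empty_ok
    try:
        value = float(field)
    except ValueError:
        # Text that is not a number can never be inside the range.
        return False
    return low <= value <= high


results = [
    in_range("10", 0, 10),               # upper bound is included
    in_range("10.5", 0, 10),
    in_range("abc", 0, 10),
    in_range("", 0, 10),
    in_range("", 0, 10, empty_ok=True),
]
print(results)
```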

class EmptyValidator

Ensure that a field is always empty. Essentially the same as an empty SetValidator. This is used by default when a field has no validators.

class NotEmptyValidator

The opposite of an EmptyValidator. Ensure that a field is never empty.

class Ignore

Always passes validation. Used to explicitly ignore a given column.

class RowValidator

Generic row validator. Should be subclassed by any custom validators. Not to be used directly.

class RowLengthValidator

Validates that each row has the expected number of fields. The expected number of fields is inferred from the CSV header row read by csv.DictReader.
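The reason a row-length check is useful is that csv.DictReader accepts short and long rows without complaint. The stand-alone sketch below shows one way to detect them using only the standard library. It is not the RowLengthValidator implementation.

```python
import csv
import io

data = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
reader = csv.DictReader(io.StringIO(data))

bad_lines = []
for row in reader:
    too_short = None in row.values()  # missing fields are filled with None
    too_long = None in row            # extra fields are stored under the key None
    if too_short or too_long:
        bad_lines.append(reader.line_num)

print(bad_lines)
```

This relies on the DictReader defaults restkey=None and restval=None. If either is overridden, the check would need to look for the chosen values instead.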

Built-in Input Types

Vladiate comes with the following input types:

class VladInput

Generic input. Should be subclassed by any custom inputs. Not to be used directly.

class LocalFile

Read from a file local to the filesystem.

filename: Path to a local CSV file.

class S3File

Read from a file in S3. Optionally can specify either a full path, or a bucket/key pair.

Requires the boto library, which should be installed via pip install vladiate[s3].

path=None: A full S3 filepath (e.g., s3://foo.bar/path/to/file.csv)

bucket=None: S3 bucket. Must be specified with a key.

key=None: S3 key. Must be specified with a bucket.

class String

Read CSV from a string. Can take either a str or a StringIO.

string_input=None: Regular Python string input.

string_io=None: StringIO input.

Running Vlads Programmatically

class Vlad

Initialize a Vlad programmatically

source: Required. Any VladInput.

validators={}: Dictionary mapping each field name to a list of validators. Optional, defaults to the class variable validators if set, otherwise uses EmptyValidator for all fields.

row_validators=[]: List of row-level validators. Validators provided here operate on entire rows and can be used to define constraints that involve more than one field. Optional, defaults to the class variable row_validators if set, otherwise [], which does not perform any row-level validation.

delimiter=',': The delimiter used within your csv source. Optional, defaults to ,.

ignore_missing_validators=False: Whether to fail validation if there are fields in the file for which the Vlad does not have validators. Optional, defaults to False.

quiet=False: Whether to disable log output generated by validations. Optional, defaults to False.

file_validation_failure_threshold=None: Stops validating the file after this failure threshold is reached. Input a value between 0.0 and 1.0. 1.0 (100%) validates the entire file. Optional, defaults to None.

For example:

from vladiate import Vlad
from vladiate.inputs import LocalFile
Vlad(source=LocalFile('path/to/local/file.csv')).validate()

Testing

To run the tests:

make test

To run the linter:

make lint

Command Line Arguments

Usage: vladiate [options] [VladClass [VladClass2 ... ]]

Options:
  -h, --help            show this help message and exit
  -f VLADFILE, --vladfile=VLADFILE
                        Python module file to import, e.g. '../other.py'.
                        Default: vladfile
  -l, --list            Show list of possible vladiate classes and exit
  -V, --version         show version number and exit
  -p PROCESSES, --processes=PROCESSES
                        attempt to use this number of processes, Default: 1
  -q, --quiet           disable console log output generated by validations

Contributors

License

Open source MIT license.

vladiate's Issues

Validators defined as class attributes keep previous state around

When extending the Vlad class, validators defined as class attributes will keep state from a previous validation around:

class YourFirstValidator(Vlad):
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }

>>> vlad = YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad.validate()
Validating YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
>>> vlad2 = YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad2.validate()
Validating YourFirstValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Failed :(
  UniqueValidator failed 3 time(s) (100.0%) on field: 'Column A'
    Invalid fields: ['('Count Chocula',)', '('Dracula',)', '('Vlad the Impaler',)']

For now, this can be worked around by making validators an instance variable:

class YourSecondValidator(Vlad):
    def __init__(self, *args, **kwargs):
        self.validators = {
            'Column A': [
                UniqueValidator()
            ],
            'Column B': [
                SetValidator(['Vampire', 'Not A Vampire'])
            ]
        }
        super(YourSecondValidator, self).__init__(*args, **kwargs)

>>> vlad = YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad.validate()
Validating YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
>>> vlad2 = YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
>>> vlad2.validate()
Validating YourSecondValidator(source=LocalFile('vladiate/examples/vampires.csv'))
Passed! :)
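The behaviour reported here follows from how Python treats class attributes, and can be reproduced without Vladiate. In the sketch below, `Unique`, `SharedSchema` and `FreshSchema` are hypothetical classes that only mimic a stateful validator and the two ways of attaching it.

```python
class Unique:
    """Hypothetical stateful check that remembers every value it has seen."""

    def __init__(self):
        self.seen = set()

    def is_new(self, value):
        fresh = value not in self.seen
        self.seen.add(value)
        return fresh


class SharedSchema:
    checks = [Unique()]  # created once, when the class body runs

    def run(self, values):
        return all([check.is_new(v) for check in self.checks for v in values])


class FreshSchema(SharedSchema):
    def __init__(self):
        self.checks = [Unique()]  # created again for every instance


names = ["Dracula", "Count Chocula"]
shared_results = [SharedSchema().run(names), SharedSchema().run(names)]
fresh_results = [FreshSchema().run(names), FreshSchema().run(names)]
print(shared_results, fresh_results)
```

The second `SharedSchema` run fails because both instances reach the same `Unique` object through the class, which already holds the names from the first run.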

Add release workflow

This should make a release to PyPI when a GitHub release is made, via https://github.com/marketplace/actions/pypi-publish

Raise warning when not all validators are used

This would imply that a column which was expected to be validated was missing.

Migrate CI to GitHub Actions

The Travis integration no longer works.

Row-level validation

Is there a recommended way to do row-level validation? I have a work in progress / proof of concept branch to add the kind of logic I'm looking to use at https://github.com/jonafato/vladiate/tree/row-level-validators. My primary use cases here are the ability to validate the length of a row (csv.DictReader is very permissive here and reads rows of variable length) and to validate pairs of fields together. This could be conceptually similar to other validation libraries like Django's form validation.

If this feature seems useful to you, I'm happy to open a PR and fill in the rest of the implementation.

setup.py fails due to boto dependency

This package appears to depend on boto but the documentation does not mention it is required. Don't want or need to reference S3.

Processing dependencies for vladiate==0.0.18
Searching for boto
Reading https://pypi.python.org/simple/boto/
^Cinterrupted

Please mark boto as a required dependency in documentation or make class S3File optional when boto is available.

Local files are not guaranteed to be closed

NotEmptyValidator minimal logging

NotEmptyValidator always returns an empty set for its bad property regardless of failure. This causes the Vlad class to return False and log the general "Failed :(" message without any indication of which field failed or how many times it failed. The solution proposed below avoids repeating empty strings in the logger output while providing logging on par with the other validators.

class NotEmptyValidator(Validator):
    ''' Validates that a field is not empty '''

    def __init__(self):
        self.fail_count = 0
        self.empty = set([])

    def validate(self, field, row={}):
        if field == '':
            self.empty.add(field)
            raise ValidationException("Row has empty field in column")

    @property
    def bad(self):
        return self.empty

CSV without header

Is it possible to validate files without header?

I expected something like this but it seems it's not possible.

validators = {
        0: [
            UniqueValidator(),
            IntValidator(),
        ],
        1: [
            FloatValidator(),
        ],
        2: [
            SetValidator(["USD", "EUR", "GBP"]),
        ],
    }
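For reference, this is roughly what index-keyed validation of a headerless file looks like with the standard csv module alone. It is a sketch of the idea in the question, not something Vladiate is known to support, and the `checks` mapping and `is_float` helper are hypothetical.

```python
import csv
import io

data = "1,9.99,USD\n2,abc,EUR\n"


def is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Column index -> list of predicates that each value must satisfy.
checks = {
    0: [str.isdigit],
    1: [is_float],
    2: [{"USD", "EUR", "GBP"}.__contains__],
}

failures = []
for line_no, row in enumerate(csv.reader(io.StringIO(data)), start=1):
    for index, value in enumerate(row):
        for check in checks.get(index, []):
            if not check(value):
                failures.append((line_no, index, value))

print(failures)
```

csv.DictReader also accepts a fieldnames argument, in which case the first line is treated as data rather than as a header. That is another way to give names to the columns of a headerless file.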

Vladiate throws exception if source file is missing headers

Or if it's empty. It should give a debug message instead and fail the validation.

Check if `unique_with` keys are in `validators.keys()`

Currently raises a KeyError if it is not.

Stop validating a file when the failures reach a certain threshold

I intend to add an enhancement to vladiate which would optionally stop validating a file when the failures reach a certain threshold.

To add more context here, this is an idea to overcome an issue we are facing. We do not want to process a file any further if vladiate returns a fail_count of more than 10%. But currently there is no feature in vladiate to stop validating a file early, so we have to validate the rest of the file even though we could stop. This proves especially troublesome when validating really large CSV files.

I have created a PR to introduce this: #89
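A minimal sketch of the early-stop idea, independent of the PR: it assumes the total row count is known up front and that the threshold is a fraction of that total. `validate_until` is a hypothetical function, and the actual implementation may count failures differently.

```python
def validate_until(rows, is_valid, total_rows, threshold):
    """Check rows in order and stop once failures exceed threshold * total_rows."""
    failures = 0
    checked = 0
    for row in rows:
        checked += 1
        if not is_valid(row):
            failures += 1
            if failures / total_rows > threshold:
                break  # no point reading the rest of the file
    return checked, failures


rows = ["ok", "bad", "bad", "ok", "ok", "ok", "ok", "ok", "ok", "ok"]
checked, failures = validate_until(
    iter(rows), lambda r: r == "ok", total_rows=len(rows), threshold=0.10
)
print(checked, failures)
```

Here the second failure pushes the failure rate to 20%, so only the first three rows are read.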

Incompatibility with windows: module 'os' has no attribute 'EX_NOINPUT'

Hi there,
I'm trying to run this on windows and get this error on the commands:
vladiate
vladiate vlads
vladiate mymain.py

C:path\to\project> vladiate
Could not find any vladfile! Ensure file ends in '.py' and see --help for available options.
Traceback (most recent call last):
  File "c:\users\scollier\appdata\local\programs\python\python36-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\scollier\appdata\local\programs\python\python36-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\scollier\Envs\virtualenv\Scripts\vladiate.exe\__main__.py", line 9, in <module>
  File "c:\users\scollier\envs\virtualenv\lib\site-packages\vladiate\main.py", line 173, in main
    return os.EX_NOINPUT
AttributeError: module 'os' has no attribute 'EX_NOINPUT'

This seems to all stem from: https://github.com/di/vladiate/blob/master/vladiate/main.py:main()
All the os.EX_* values are UNIX only.

def main():
    arguments = parse_args()
    logger = logs.logger

    if arguments.show_version:
        print("Vladiate %s" % (get_distribution('vladiate').version, ))
        return os.EX_OK

    vladfile = find_vladfile(arguments.vladfile)
    if not vladfile:
        logger.error(
            "Could not find any vladfile! Ensure file ends in '.py' and see "
            "--help for available options."
        )
        return os.EX_NOINPUT

    docstring, vlads = load_vladfile(vladfile)

    if arguments.list_commands:
        logger.info("Available vlads:")
        for name in vlads:
            logger.info("    " + name)
        return os.EX_OK

    if not vlads:
        logger.error("No vlad class found!")
        return os.EX_NOINPUT

    # make sure specified vlad exists
    if arguments.vlads:
        missing = set(arguments.vlads) - set(vlads.keys())
        if missing:
            logger.error("Unknown vlad(s): %s\n" % (", ".join(missing)))
            return os.EX_UNAVAILABLE
        else:
            names = set(arguments.vlads) & set(vlads.keys())
            vlad_classes = [vlads[n] for n in names]
    else:
        vlad_classes = vlads.values()

    # validate all the vlads, and collect the validations for a good exit
    # return code
    if arguments.processes == 1:
        for vlad in vlad_classes:
            vlad(source=vlad.source).validate()

    else:
        proc_pool = Pool(
            arguments.processes
            if arguments.processes <= len(vlad_classes)
            else len(vlad_classes)
        )
        proc_pool.map(_vladiate, vlad_classes)
        try:
            if not result_queue.get_nowait():
                return os.EX_DATAERR
        except Empty:
            pass
        return os.EX_OK

Would you be able to change this to a more windows-friendly return value?
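One portable option, sketched here as an assumption rather than as the fix that was adopted, is to fall back to the numeric values from BSD sysexits.h when the os module does not define the constants. `exit_code` is a hypothetical helper.

```python
import os

# Values from BSD sysexits.h, used as fallbacks where os lacks the constants.
EX_OK = getattr(os, "EX_OK", 0)
EX_DATAERR = getattr(os, "EX_DATAERR", 65)
EX_NOINPUT = getattr(os, "EX_NOINPUT", 66)
EX_UNAVAILABLE = getattr(os, "EX_UNAVAILABLE", 69)


def exit_code(found_vladfile, all_passed):
    """Map the outcome of a run to a process exit status."""
    if not found_vladfile:
        return EX_NOINPUT
    return EX_OK if all_passed else EX_DATAERR


print(exit_code(True, True), exit_code(True, False), exit_code(False, True))
```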

Extract the row number(s) where validation failed?

How to extract the row number(s) where validation failed

Document commandline args

Add linting

Code coverage with coveralls

shell exitcode should return 1 if vladiate fails (instead of 0)

I am calling vladiate to validate a valid data.csv file, the exitcode returned is 0 for success:

$ vladiate
Validating YourFirstValidator(source=LocalFile('data.csv'))
Passed! :)
$ echo $?
0

When I mod data.csv to make vladiate fail, the exit code is still 0 (success):

$ vladiate
Validating YourFirstValidator(source=LocalFile('data.csv'))
Failed :(
  SetValidator failed 1 time(s) (1.5%) on field: 'provisionstate'
    Invalid fields: ['notprovisioned2']
$ echo $?
0

Would it be possible to modify vladiate to return an exitcode of failure (1)?

best,

StringIO as VladInput

It'd be helpful to have a VladInput that will accept a StringIO

Problem with logging

Importing vladiate (from vladiate import Vlad) sets the logging level for the root logger to INFO here: https://github.com/di/vladiate/blob/master/vladiate/logs.py#L3. This causes problems on Spark, as it prints all INFO logs to stdout. Or could it be a Spark problem?
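For comparison, the usual convention for a library is to log through its own named logger and leave the root logger alone. The sketch below shows that pattern with the standard logging module. The logger name `vladiate_example` and the `enable_console_output` helper are hypothetical and are not taken from vladiate's logs.py.

```python
import logging

# A library-style logger: named, silent by default, root logger untouched.
logger = logging.getLogger("vladiate_example")
logger.addHandler(logging.NullHandler())

root_level_before = logging.getLogger().level
logger.info("Validating ...")  # not shown unless the application opts in
root_level_after = logging.getLogger().level


def enable_console_output(level=logging.INFO):
    """Called by an application that does want to see the library's output."""
    handler = logging.StreamHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler
```

With this arrangement, a host such as Spark keeps whatever root configuration it already had, and only code that calls `enable_console_output()` sees the INFO messages.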

Validators hold state if you vlad multiple times in the same context

It would be nice to be able to run validation from within python scripts or a shell repeatedly, without holding over state from the previous validation.

Running vladiate programmatically doesn't create a handler for the `vlad_logger`

This will need to be done when a Vlad instance is created instead.

Tab auto-complete Vlads on command line

Helpful if there are several to validate.

Suppress output if result set is large

For large lists of violations, the output can be overwhelming. This is very noticeable when you're starting to work on a new vladfile for a large dataset. All of the empty validators will print every single value for a column, and if the dataset has 500,000 rows, oof.

It would be nice if the output was truncated to the first 100 or so.

add a quiet option (-q)

Hi,

After the vladiate 0.0.20, I could get the shell exitcode for success or not.

Now I would like to add a quiet option (-q) so that when I call vladiate, it silently validates, if not it will throw an error.

Best

Add option to check header order

Some use cases require a specified field order (e.g., the MSSQL BULK INSERT statement). It would be great if the Vlad class had an option to check whether the source file's fields exactly match those specified in the Validator class, including their order. The proposed changes below add this functionality via an ignore_field_order parameter.

class Vlad:
    def __init__(self, source, validators={}, default_validator=EmptyValidator,
                 delimiter=None, ignore_missing_validators=False, ignore_field_order=True):
        self.logger = logs.logger
        self.failures = defaultdict(lambda: defaultdict(list))
        self.missing_validators = None
        self.missing_fields = None
        self.source = source
        self.validators = validators or getattr(self, 'validators', {})
        self.delimiter = delimiter or getattr(self, 'delimiter', ',')
        self.line_count = 0
        self.ignore_missing_validators = ignore_missing_validators
        self.ignore_field_order = ignore_field_order
        self.validators.update({
            field: [default_validator()]
            for field, value in self.validators.items() if not value
        })

    def _log_debug_failures(self):
        for field_name, field_failure in self.failures.items():
            self.logger.debug("\nFailure on field: \"{}\":".format(field_name))
            for i, (row, errors) in enumerate(field_failure.items()):
                self.logger.debug("  {}:{}".format(self.source, row))
                for error in errors:
                    self.logger.debug("    {}".format(error))

    def _log_validator_failures(self):
        for field_name, validators_list in self.validators.items():
            for validator in validators_list:
                if validator.bad:
                    self.logger.error(
                        "  {} failed {} time(s) ({:.1%}) on field: '{}'".format(
                            validator.__class__.__name__, validator.fail_count,
                            validator.fail_count / self.line_count, field_name))

                    try:
                        # If self.bad is iterable, it contains the fields which
                        # caused it to fail
                        invalid = list(validator.bad)
                        shown = [
                            "'{}'".format(field) for field in invalid[:99]
                        ]
                        hidden = [
                            "'{}'".format(field)
                            for field in invalid[99:]
                        ]
                        self.logger.error(
                            "    Invalid fields: [{}]".format(", ".join(shown)))
                        if hidden:
                            self.logger.error(
                                "    ({} more suppressed)".format(len(hidden)))
                    except TypeError:
                        pass

    def _log_missing_validators(self):
        self.logger.error("  Missing validators for:")
        self._log_missing(self.missing_validators)

    def _log_missing_fields(self):
        self.logger.error("  Missing expected fields:")
        self._log_missing(self.missing_fields)

    def _log_missing(self, missing_items):
        self.logger.error(
            "{}".format("\n".join([
                "    '{}': [],".format(field)
                for field in sorted(missing_items)])))

    def validate(self):
        self.logger.info("\nValidating {}(source={})".format(
            self.__class__.__name__, self.source))
        reader = csv.DictReader(self.source.open(), delimiter=self.delimiter)
        if not reader.fieldnames:
            self.logger.info(
                "\033[1;33m" + "Source file has no field names" + "\033[0m"
            )
            return False

        self.missing_validators = set(reader.fieldnames) - set(self.validators)
        if self.missing_validators:
            self.logger.info("\033[1;33m" + "Missing..." + "\033[0m")
            self._log_missing_validators()
            if not self.ignore_missing_validators:
                return False

        self.missing_fields = set(self.validators) - set(reader.fieldnames)
        if self.missing_fields:
            self.logger.info("\033[1;33m" + "Missing..." + "\033[0m")
            self._log_missing_fields()
            return False

        if not self.ignore_field_order:
            # Compare the header order with the validators' key order
            if reader.fieldnames != list(self.validators):
                self.logger.info("Source file field names do not exactly "
                                 "match supplied validator fields. Order matters.")
                return False

        for line, row in enumerate(reader):
            self.line_count += 1
            for field_name, field in row.items():
                if field_name in self.validators:
                    for validator in self.validators[field_name]:
                        try:
                            validator.validate(field, row=row)
                        except ValidationException as e:
                            self.failures[field_name][line].append(e)
                            validator.fail_count += 1

        if self.failures:
            self.logger.info("\033[0;31m" + "Failed :(" + "\033[0m")
            self._log_debug_failures()
            self._log_validator_failures()
            return False
        else:
            self.logger.info("\033[0;32m" + "Passed! :)" + "\033[0m")
            return True

`RegexValidator` doesn't allow for `empty_ok=True`

This can be avoided by adding |^$ to the regex, but it'd be nice if it was not necessary.
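The workaround can be verified with the standard re module. This sketch only demonstrates the pattern itself and assumes the validator applies it with re.match().

```python
import re

digits = re.compile(r"\d+")
digits_or_empty = re.compile(r"\d+|^$")

# Without the extra alternative, an empty field does not match.
strict = [bool(digits.match(s)) for s in ["42", ""]]
# With "|^$" appended, the empty string matches but other text still fails.
lenient = [bool(digits_or_empty.match(s)) for s in ["42", "", "abc"]]

print(strict, lenient)
```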

A new release

I found useful functionality in this library, but there is no released package containing it.
When is a new release planned?

No great way to override logger

There's currently no way to turn off logging when using Vlads programmatically (i.e. not from the commandline) and by default it logs a lot. Workaround for now:

class MyVlad(Vlad):
    def __init__(self, *args, **kwargs):

        super().__init__(*args, **kwargs)

        class DummyLogger():
            def __init__(self):
                pass

            def __getattr__(self, name):
                return (lambda *x: None)

        self.logger = DummyLogger()

RangeValidator has no empty_ok parameter

RangeValidator has no option to allow empty records in a field. RangeValidator fails if the field contains an empty record "". Propose to add an empty_ok parameter to the RangeValidator class similar to other validators as shown below:

class RangeValidator(Validator):
    def __init__(self, low, high, empty_ok=False):
        self.fail_count = 0
        self.low = low
        self.high = high
        self.empty_ok = empty_ok
        self.outside = set()

    def validate(self, field, row={}):
        if field == '' and self.empty_ok:
            pass
        else:
            try:
                value = float(field)
                if not self.low <= value <= self.high:
                    raise ValueError
            except ValueError:
                self.outside.add(field)
                raise ValidationException(
                    "'{}' is not in range {} to {}".format(
                        field, self.low, self.high
                    )
                )

    @property
    def bad(self):
        return self.outside

Percentage for failure counts

It might be useful to have a percentage for the failure counts, to add context.

No great way to inherit from another Vlad

For example, if we have:

class YourFirstValidator(Vlad):
    source = LocalFile('vampires.csv')
    validators = {
        'Column A': [
            UniqueValidator()
        ],
        'Column B': [
            SetValidator(['Vampire', 'Not A Vampire'])
        ]
    }

And we wanted to create a new class that slightly modifies the existing Vlad, we'd have to do:

class YourSecondValidator(YourFirstValidator):
    validators = YourFirstValidator.validators
    validators['Column C'] = SetValidator(['Real', 'Not Real'])

(this one's for you, @dmcclory)
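One thing to watch for in the snippet above is that the subclass and the parent end up sharing a single dict, so the new column is added to the parent as well. The stand-alone sketch below shows the aliasing and a copy-based alternative. The classes and the placeholder strings standing in for validator instances are hypothetical.

```python
class First:
    validators = {"Column A": ["unique-check"], "Column B": ["set-check"]}


class Second(First):
    # Build a new dict so that First.validators is left untouched.
    validators = {**First.validators, "Column C": ["another-set-check"]}


class Aliased(First):
    validators = First.validators  # same dict object as the parent


print(sorted(First.validators), sorted(Second.validators))
```

The copy is shallow: the lists for the inherited columns, and any validator objects inside them, are still shared between the two classes.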

pass file name as an argument to Vlad during initiation

We were running SubClassOfVlad.validate() in our own code, instead of using the vladiate binary. Since that file wasn't in the working directory, we had to do something like this:

vlad = SubclassedValidator()
vlad.source = LocalFile(csv_file)
vlad.validate()

Not too bad, but it would be nice to have an API that's more like this:

vlad = SubclassedValidator(csv_file=csv_file)
vlad.validate()

avro schema validator

Wondering if anyone has considered a validator that checks against a data schema.
I need to define a schema (maybe avro schema) and would like to validate that a csv passes the schema.
I could hand code the schema translation to a vlad file but thinking automatic would be better.

Don't use multiprocessing for single process

If the process argument is not specified, the default number of processes is 1 and there's no need to use multiprocessing.

Load CSV through S3File() Doesn't Work in Python 3

Thanks to everyone who contributed to this package. It helps.

When I test it using LocalFile(filename), it works.
But when I test it using S3File as below, it doesn't work. I'm using Python 3.
test = YourFirstValidator(S3File(path = None, bucket='buckets', key='key/file.csv'))

I installed the package via "pip install vladiate",
but I found that the code I downloaded is different from the code on GitHub.
I downloaded this version: "vladiate-0.0.20.dist-info".

I investigated the failures I faced and found that inputs.py in this version needs an update.
I believe the issue is that LocalFile returns a list, but S3File returns a str. So I added the one line marked with a comment below to make the code work.

under class S3File(VladInput):
...
ret=obj.get()['Body'].read().decode('utf-8')
# Convert str to list
ret = ret.splitlines()
...

I'm not an expert programmer, and I'm not 100% sure this fixes all potential issues, but it does resolve mine. I hope this helps other people who face the same issue.
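The underlying behaviour can be seen with the csv module on its own: a reader iterates over whatever it is given, and iterating over a str yields single characters rather than lines. The sketch below uses a literal string in place of the S3 response body, which is an assumption made to keep it self-contained.

```python
import csv
import io

body = "Column A,Column B\nDracula,Vampire\n"

# Iterating a str yields single characters, so each character is parsed as a "line".
wrong = list(csv.reader(body))

# Either of these gives the reader real lines to iterate over.
from_lines = list(csv.DictReader(body.splitlines()))
from_stream = list(csv.DictReader(io.StringIO(body)))

print(wrong[0], from_lines)
```

Wrapping the text in io.StringIO is the more conservative of the two, since splitlines() would also break apart a quoted field that contains an embedded newline.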

Improve coverage

This might include testing main.py as well, which is currently at 0% coverage.

Support Python 3

Add any (preferably all) to .travis.yml:

  - "3.2"
  - "3.3"
  - "3.4"
  - "3.5"
  - "3.5-dev" # 3.5 development branch
  - "nightly" # currently points to 3.6-dev

_log_debug_failures calls VladInput.filename

This does not necessarily exist for S3 file input types. Should use __repr__ instead.

`String` input type is not documented in README

Vladiate enhancements

In this PR, I extend the current Vladiate to support our CSV file validation requirements.

The following enhancements have been added:

Ability to override or accept custom field names instead of inferring them from the CSV header.
Ability to pass S3 credentials to S3File inputs.
Gzip-compressed CSV file support for LocalFile and S3File inputs.
Validators for the most commonly used CSV data types.
Ability to turn off the console log.

High memory usage with too many invalid fields.

I have a CSV with roughly 140k lines that should be validated with vladiate. The validation code looks like this:

Vlad(source=LocalFile("largefile.csv"), validators={
        "id": [UniqueValidator()],
        "parent_id": [SetValidator(lots_of_ids)],
    }).validate()

The number of items in lots_of_ids is also around 140k.

For some reason I had a case where the ids in lots_of_ids had no intersection with the values from the parent_id column in largefile.csv. In this case Vladiate will (correctly) collect all invalid rows. However, this resulted in my PC's RAM usage growing to at least 8GB.

Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?
