Comments (3)
Hi @syxolk! Thanks for the report, and sorry that vladiate was eating up your memory.
Is there any way I can stop the validation early? Can Vladiate detect if there were too many wrong fields and stop validating?
This is currently not possible, but it could be a valuable optional feature.
I'm a little more interested in figuring out why so much memory is getting consumed and if the footprint can be reduced (I'm guessing it can, as I haven't really tested the upper bounds of this tool).
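For reference, an optional early stop could look roughly like the sketch below. This is not vladiate's API: `validate_rows`, `max_failures`, and the validator callables are hypothetical names used only for illustration.

```python
import csv
import io


def validate_rows(rows, validators, max_failures=None):
    """Validate rows, stopping early once max_failures is reached.

    validators maps a field name to a callable that raises ValueError
    on bad input. All names here are hypothetical, not vladiate's API.
    """
    failure_count = 0
    for line_number, row in enumerate(rows, start=1):
        for field, check in validators.items():
            try:
                check(row[field])
            except ValueError:
                failure_count += 1
                if max_failures is not None and failure_count >= max_failures:
                    # Stop reading: report the count and the row we stopped at.
                    return failure_count, line_number
    return failure_count, None


def must_be_int(value):
    int(value)  # raises ValueError for non-numeric input


data = io.StringIO("id\nx\ny\nz\n1\n")
count, stopped_at = validate_rows(
    csv.DictReader(data), {"id": must_be_int}, max_failures=2
)
```

Because only a counter is kept, memory use stays flat no matter how many rows fail before the limit is hit.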
from vladiate.
Hey, I found two issues that lead to high memory usage in my particular case:
First, validate() collects all exceptions in the failures dictionary. In my case, all rows were failures, so every row's exception was collected in there.
Second, the SetValidator raises a ValidationException that stringifies all elements of valid_set. In my case, valid_set contained 140k UUIDs. That resulted in a pretty huge string.
If we can fix at least one of the two issues the memory problem should be gone:
- The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?
- Don't stringify the entire valid_set. Instead cut it at 100 elements, similarly to what is done in _log_validator_failures
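A minimal sketch of the truncation idea, assuming the message is built from a plain Python set. The helper name `describe_valid_set` and the `limit` parameter are hypothetical and not part of vladiate.

```python
from itertools import islice


def describe_valid_set(valid_set, limit=100):
    """Return a short description of valid_set, showing at most `limit` items."""
    # islice avoids materializing or formatting the whole set.
    shown = list(islice(valid_set, limit))
    hidden = len(valid_set) - len(shown)
    text = ", ".join(repr(item) for item in shown)
    if hidden > 0:
        text += ", ... ({} more)".format(hidden)
    return "{" + text + "}"


# Illustrative data only; the report above involved about 140k UUIDs.
big_set = set(range(1000))
message = "'x' is not in " + describe_valid_set(big_set)
```

With this, the size of the exception message is bounded by `limit` rather than by the size of the set.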
PS: I'm happy to contribute!
from vladiate.
The exceptions are printed as debug messages and the logger defaults to the info level, so no exceptions are printed by default. Can we disable collecting all the exception objects?
I think there are two things we could do here:
- We could only collect the failure exceptions if debugging is turned on. We'll need to add a --verbose flag or something, and I think we'll need to add another variable to determine if there were failures when not in debug mode (since we can't just check for a non-empty failures).
- We could collect something other than the entire exception (which is probably huge) into failures, like maybe just the exception's message?
I'm leaning towards the first one, since it seems like less work, and still preserves the entire exception for debugging, but I could be convinced otherwise.
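The first option might be sketched as follows. This is not vladiate's actual validate(): the function signature, the `keep_exceptions` parameter, and the logger name are assumptions made for illustration. A separate boolean records whether anything failed, so the result does not depend on failures being non-empty.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("vlad_sketch")  # hypothetical logger name


def validate(rows, validators, keep_exceptions=None):
    """Return (passed, failures); exceptions are kept only when debugging."""
    if keep_exceptions is None:
        # Default: keep exception objects only if they would be logged.
        keep_exceptions = logger.isEnabledFor(logging.DEBUG)
    failures = defaultdict(list)
    failed = False  # tracks failure even when nothing is collected
    for row in rows:
        for field, check in validators.items():
            try:
                check(row[field])
            except ValueError as exc:
                failed = True
                if keep_exceptions:
                    failures[field].append(exc)
    return (not failed), dict(failures)


def positive(value):
    if int(value) <= 0:
        raise ValueError("{!r} is not positive".format(value))


rows = [{"n": "1"}, {"n": "-1"}, {"n": "-2"}]
passed, failures = validate(rows, {"n": positive}, keep_exceptions=False)
```

A --verbose flag would then only need to set the logger to the debug level (or pass `keep_exceptions=True`).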
Don't stringify the entire valid_set. Instead cut it at 100 elements, similarly to what is done in _log_validator_failures
I think this is a great idea.
PS: I'm happy to contribute!
PRs are welcome!
from vladiate.