Comments (14)
Hi, I'm the author of the library and found this issue on Google. Take care when building a benchmark using small files. By default, univocity-parsers executes with concurrent input reading enabled. If you run the benchmark with a small input, multiple times, all you are testing is the time the parser waits for the input thread to get ready.
If this is the case, disable the extra thread by calling setReadInputOnSeparateThread(false) on the parser settings object.
By default, it also allocates a 1 MB buffer at startup, which may be overkill if you are testing inputs with just a few dozen records. You can tweak this using the setInputBufferSize() method.
Lastly, keep in mind that the parser supports many different configurations, and depending on the configuration it will perform initialization steps such as automatic detection of line endings. If you run a benchmark with small inputs, multiple times, you will effectively be testing the performance of the initialization process and not the parsing itself.
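Putting these suggestions together, a benchmark-friendly configuration might look like the sketch below. This is a fragment against the univocity-parsers 2.x API (it needs the univocity-parsers dependency to compile), and the 16 KB buffer size is an arbitrary example value, not a recommendation:

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class BenchmarkSettings {
    public static CsvParser newParser() {
        CsvParserSettings settings = new CsvParserSettings();
        // Don't read the input on a separate thread: with small files the
        // benchmark would mostly measure thread start-up, not parsing.
        settings.setReadInputOnSeparateThread(false);
        // Shrink the default 1 MB input buffer for tiny inputs
        // (16 KB here is an arbitrary example value).
        settings.setInputBufferSize(16 * 1024);
        return new CsvParser(settings);
    }
}
```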
from kantan.csv.
Thanks for taking the time to post this. You're right that univocity's default settings are bad for what my benchmark is testing: after disabling the extra thread and setting the buffer size to something a bit more conservative, performance has improved drastically - still not the best of the lot (that would be Jackson), but at least part of the competition now.
I don't intend to disable automatic line-ending detection - that wouldn't be fair to the other parsers I'm using, as they also have that feature enabled. I'll look into the other settings, though - if there's something like automatic quote character or column separator detection enabled by default, that's certainly unfair to univocity when all the other parsers know to expect " and ,.
A few other things that univocity does by default and other parsers don't:
- leading/trailing whitespace removal on each parsed value
- comment skipping
- blank line skipping
- unescaped quote handling
- line-ending detection that works by analysing the input, not by using the operating system's default line separator.
Also, the parser is designed with the RowProcessor in mind. This is a callback interface used to delegate the processing of each parsed row to all sorts of custom requirements. It is generally slower to use the parser.parseNext method, as it triggers a few calls to the RowProcessor under the hood.
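As an illustration, the callback style described above might look like the sketch below. This is a fragment against the univocity-parsers 2.x API (settings.setProcessor; older versions used setRowProcessor), so it needs the univocity-parsers dependency to compile:

```java
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.Reader;

public class CallbackExample {
    // Counts rows through the callback API instead of pulling them
    // one by one with parser.parseNext().
    public static long countRows(Reader input) {
        final long[] count = {0};
        CsvParserSettings settings = new CsvParserSettings();
        settings.setProcessor(new RowProcessor() {
            public void processStarted(ParsingContext context) {}
            public void rowProcessed(String[] row, ParsingContext context) {
                count[0]++; // each parsed row is delivered here
            }
            public void processEnded(ParsingContext context) {}
        });
        new CsvParser(settings).parse(input); // drives the callbacks
        return count[0];
    }
}
```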
The other libraries I'm currently testing are:
- my own, tabulate.
- opencsv
- jackson-csv
- apache commons-csv
- product-collections, which I believe uses opencsv as the underlying parser.
leading/trailing whitespace removal on each parsed value
Technically, that's not RFC compliant (spaces are considered part of a field and should not be ignored).
Is it something that I can disable?
comment skipping
You're right that I'd want to disable that, as it's not something that I'm testing against. Come to think of it, this is also something I should disable for the other libraries - Jackson also does it by default, I think, as well as commons-csv. Not sure about opencsv, I'll need to check.
blank line skipping
I think all libraries I'm testing do that. I'll need to make sure, but I think I even have an explicit test against it.
unescaped quote handling
This I know for a fact is handled by default by all libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.
the line ending detection works by analysing the input. Not by using the operating system default line separator.
Is it anything more fancy than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about that if you have links / source code you can share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.
Also the parser is designed with the use of the RowProcessor
Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code, though, it looks like in my case the parser is using an instance of NoopRowProcessor, and I'm assuming the cost is minimal - if not nil, I'd have thought this is the kind of dead code that a JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the interesting ones).
leading/trailing whitespace removal on each parsed value
Technically, that's not RFC compliant (spaces are considered part of a field and should not be ignored).
Note that the RFC is just a proposal, not a standard. It's rarely followed, and there are many sorts of non-conforming CSV input out there.
Is it something that I can disable?
settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);
unescaped quote handling
This I know for a fact is handled by default by all libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.
No parser except univocity handles this. Try this:
something,"a quoted value "with unescaped quotes" can be parsed", something
univocity will parse 3 values instead of blowing up:
- something
- a quoted value "with unescaped quotes" can be parsed
- something
To disable this behavior and get an exception instead, use: settings.setParseUnescapedQuotes(false);
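The heuristic behind this behaviour can be sketched in a few lines of self-contained code: treat a quote inside a quoted field as the closing quote only when the next character is the column separator (or the end of the line), and as a literal character otherwise. This is an illustration of the behaviour described above, not univocity's actual implementation, and it skips standard "" quote escaping for brevity:

```java
import java.util.ArrayList;
import java.util.List;

public class UnescapedQuotes {
    // Splits one CSV line, tolerating unescaped quotes inside quoted
    // fields. A quote closes a field only when followed by the
    // separator or the end of the line; otherwise it is kept as-is.
    public static List<String> parseLine(String line, char sep, char quote) {
        List<String> fields = new ArrayList<>();
        int i = 0, n = line.length();
        while (true) {
            StringBuilder sb = new StringBuilder();
            while (i < n && line.charAt(i) == ' ') i++;        // skip leading whitespace
            if (i < n && line.charAt(i) == quote) {
                i++;                                            // consume opening quote
                while (i < n) {
                    char c = line.charAt(i++);
                    if (c == quote && (i >= n || line.charAt(i) == sep)) break;
                    sb.append(c);                               // literal char, possibly an unescaped quote
                }
            } else {
                while (i < n && line.charAt(i) != sep) sb.append(line.charAt(i++));
            }
            fields.add(sb.toString().trim());                   // mimic default whitespace trimming
            if (i < n && line.charAt(i) == sep) { i++; continue; }
            return fields;
        }
    }

    public static void main(String[] args) {
        String line = "something,\"a quoted value \"with unescaped quotes\" can be parsed\", something";
        // prints [something, a quoted value "with unescaped quotes" can be parsed, something]
        System.out.println(parseLine(line, ',', '"'));
    }
}
```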
the line ending detection works by analysing the input. Not by using the operating system default line separator.
Is it anything more fancy than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about that if you have links / source code you can share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.
It just analyses the first loaded input buffer and tries to identify which line ending (CRLF, LF or CR) is present in the input. To make sure it doesn't run, use settings.setLineSeparatorDetectionEnabled(false).
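The detection idea can be sketched as a small scan of the first buffer: find the first CR or LF and check whether a CR is immediately followed by an LF. This is only an illustration of the approach described above, not univocity's actual code, and it ignores the corner case of a line break appearing inside a quoted field before the first row separator:

```java
public class LineEndingDetector {
    // Returns the line separator found in the first chunk of input:
    // "\r\n" (CRLF), "\n" (LF) or "\r" (CR). Falls back to "\n" if the
    // buffer contains no line break at all.
    public static String detect(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (c == '\n') return "\n";
            if (c == '\r') {
                return (i + 1 < length && buffer[i + 1] == '\n') ? "\r\n" : "\r";
            }
        }
        return "\n"; // no line break seen: default to LF
    }

    public static void main(String[] args) {
        char[] buf = "a,b\r\nc,d".toCharArray();
        System.out.println(detect(buf, buf.length).equals("\r\n")); // prints true
    }
}
```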
Also the parser is designed with the use of the RowProcessor
Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code though, it looks like in my case the parser is using an instance of NoopRowProcessor, and I'm assuming the cost is minimal - if not nil, I'd have thought this is the kind of dead code that a JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the interesting ones).
Try and see for yourself. It is ~15% slower.
Mmm, you're right. I thought that by unescaped quote handling, you meant something like something,a non quoted value with "quotes",something. Tabulate can handle your specific example, but it does break down if the character following the unescaped quote is a line break or a column separator.
As for the RowProcessor thing, you're the author, so I'm sure you're right: it must be faster to use the callback-based API rather than the iterator-like one. I'm specifically benchmarking iterator-like access, though - and I must say, if I'm using univocity for a use case it wasn't specifically optimised for, the results are quite impressive.
All the best parsers and serializers, including univocity, are now in the same very small performance bracket. All the gross misconfigurations are now fixed.
Thanks!
FYI, version 2.1.0 will be considerably faster than 2.0.2, and you can already test parsing with univocity-parsers-2.1.0-SNAPSHOT.
I have set things up to be notified of new versions of the parsers I'm benchmarking, and will be sure to update the results as soon as 2.1.0 is released - although I'm not sure it can get much faster than it already is; there's not much room for improvement left :)
Well, it got at least 30% faster. I ran a preliminary test against some other parsers using the worldcitiespop.txt file and it parsed everything in 880ms on my machine while Jackson took ~1.1 seconds and OpenCsv ~1.9, so it might be good to test again using your test scenario.
I'm not surprised about opencsv - my results show that 2.0.2 is already almost twice as fast - but faster than jackson is quite an achievement.
Version 2.1.0 released to include parsing and writing performance improvements. It should be way faster now.
Quite, second only to jackson in my benchmarks now.
Thank you!