
Comments (14)

jbax commented on June 24, 2024

Hi, I'm the author of the library and found this issue on Google. Take care when building a benchmark using small files. By default, univocity-parsers runs with concurrent input reading enabled. If you run the benchmark with a small input, multiple times, all you are testing is the time the parser waits for the input thread to get ready.

If this is the case, disable the extra thread by calling setReadInputOnSeparateThread(false); on the parser settings object.

By default, it also allocates a 1 MB buffer at startup, which might be overkill if you are testing inputs with just a few dozen records. You can tweak this by setting the buffer size using the setInputBufferSize() method.

Lastly, keep in mind the parser supports a lot of different configurations and depending on the configuration it will perform some initialization processes such as automatic detection of line endings. If you run a benchmark with small inputs, multiple times, you will be effectively testing the performance of the initialization process and not the parsing itself.
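Put together, the suggestions above amount to a settings tweak along these lines (a configuration sketch using the setter names mentioned in this comment; the buffer size is an arbitrary example value, and exact defaults may vary by version):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();

// Small inputs: don't read on a separate thread, or the benchmark mostly
// measures the time spent waiting for that thread to become ready.
settings.setReadInputOnSeparateThread(false);

// The default 1 MB input buffer is overkill for a few dozen records.
settings.setInputBufferSize(16 * 1024);

CsvParser parser = new CsvParser(settings);
```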

from kantan.csv.

nrinaudo commented on June 24, 2024

Thanks for taking the time to post this. You're right that univocity's default settings are bad for what my benchmarking is testing: after disabling the extra thread and setting the buffer size to something a bit more conservative, performance has improved drastically - still not the best of the lot (that'd be jackson), but at least part of the competition now.

I don't intend to disable automatic line ending detection - that wouldn't be fair to the other parsers I'm using, as they also have that feature enabled. I'll look into the other settings though - if there's something like automated quote character or column separator detection on by default, that's certainly unfair to univocity when all other parsers know to expect " and ,.

jbax commented on June 24, 2024

A few other things that univocity does by default and other parsers don't:

  • leading/trailing whitespace removal on each parsed value
  • comment skipping
  • blank line skipping
  • unescaped quote handling
  • line ending detection, which works by analysing the input, not by using
    the operating system's default line separator.

Also, the parser is designed with use of the RowProcessor in mind. This
is a callback interface used to delegate the processing of each parsed row
to all sorts of custom requirements. It is generally slower to use the
parser.parseNext method, as it triggers a few calls to the RowProcessor
under the hood.

nrinaudo commented on June 24, 2024

The other libraries I'm currently testing are:

  • my own, tabulate.
  • opencsv
  • jackson-csv
  • apache commons-csv
  • product-collections, which I believe uses opencsv as the underlying parser.

leading/trailing whitespace removal on each parsed value

Technically, that's not RFC 4180 compliant (spaces are considered part of a field and should not be ignored).
Is it something that I can disable?

comment skipping

You're right that I'd want to disable that, as it's not something that I'm testing against. Come to think of it, this is also something I should disable for the other libraries - Jackson also does it by default, I think, as well as commons-csv. Not sure about opencsv, I'll need to check.

blank line skipping

I think all libraries I'm testing do that. I'll need to make sure, but I think I even have an explicit test against it.

unescaped quote handling

This I know for a fact is handled by default by all libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.

the line ending detection works by analysing the input. Not by using the operating system default line separator.

Is it anything more fancy than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about that if you have links / source code you can share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.

Also the parser is designed with the use of the RowProcessor

Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code though, it looks like in my case the parser is using an instance of NoopRowProcessor, and I'm assuming the cost is minimal - if not nil, I'd have thought this is the kind of dead code that a JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the interesting ones).

jbax commented on June 24, 2024

leading/trailing whitespace removal on each parsed value
Technically, that's not RFC compliant (Spaces are considered part of a field and should not be ignored.)

Note that the RFC is just a proposal, not a standard. It's rarely followed, and there are many sorts of non-conforming CSV input out there.

Is it something that I can disable?

settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);

unescaped quote handling
This I know for a fact is handled by default by all libraries I'm using except opencsv. I'm still pondering whether this is worth dropping opencsv for.

No parser except univocity handles this. Try this:

something,"a quoted value "with unescaped quotes" can be parsed", something

univocity will parse 3 values instead of blowing up:

  1. something
  2. a quoted value "with unescaped quotes" can be parsed
  3. something

To disable this behavior and get an exception instead, use: settings.setParseUnescapedQuotes(false);
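The heuristic behind this can be sketched in plain Java (an approximation of the behaviour described above, not univocity's implementation): a quote inside a quoted field only closes the field when the next character is the delimiter or end of input; otherwise it is kept as a literal character.

```java
import java.util.ArrayList;
import java.util.List;

public class UnescapedQuotes {

    // Splits one CSV line, treating a quote inside a quoted field as a
    // literal character unless it is followed by the delimiter or end of line.
    static List<String> parseLine(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quoted) {
                if (c == '"') {
                    char next = i + 1 < line.length() ? line.charAt(i + 1) : delim;
                    if (next == delim) {           // closing quote
                        quoted = false;
                    } else if (next == '"') {      // escaped quote ""
                        field.append('"');
                        i++;
                    } else {                       // unescaped quote: keep it
                        field.append('"');
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"' && field.length() == 0) {
                quoted = true;                     // opening quote
            } else if (c == delim) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> row = parseLine(
            "something,\"a quoted value \"with unescaped quotes\" can be parsed\", something", ',');
        System.out.println(row.size());   // prints 3
        System.out.println(row.get(1));   // prints: a quoted value "with unescaped quotes" can be parsed
        // (univocity would additionally trim the leading space off the third value)
    }
}
```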

the line ending detection works by analysing the input. Not by using the operating system default line separator.
Is it anything more fancy than accepting both LF and CRLF as row separators when not escaped or quoted? If so, I'd be quite interested in reading about that if you have links / source code you can share. If not, that's also supported by default by all the libraries I'm benchmarking, and validated by an explicit test.

It just analyses the first loaded input buffer and tries to identify which line ending (CRLF, LF or CR) is present in the input. To make sure it doesn't run, use settings.setLineSeparatorDetectionEnabled(false);
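That buffer-sniffing idea can be sketched in a few lines (an illustration of the approach described here, not univocity's actual code; it also ignores the complication of line breaks inside quoted fields):

```java
public class LineEndingDetector {

    // Scans the start of the input (e.g. the first buffer-load) and returns
    // the first line separator found: "\r\n", "\n" or "\r". If the sample
    // contains no line break, the caller's default is kept.
    static String detect(String head, String fallback) {
        for (int i = 0; i < head.length(); i++) {
            char c = head.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < head.length() && head.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(detect("a,b\r\nc,d", "\n").equals("\r\n")); // prints true
    }
}
```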

Also the parser is designed with the use of the RowProcessor
Interesting. I am indeed calling parser.parseNext directly, as I need iterator-like access to the CSV rows. Looking at the code though, it looks like in my case the parser is using an instance of NoopRowProcessor, and I'm assuming the cost is minimal - if not nil, I'd have thought this is the kind of dead code that a JIT in a primed JVM would remove altogether. Note that all benchmarks are executed 10 times and discarded before I actually start collecting metrics, so most JIT optimisations should have kicked in (and we can indeed see a large performance gain between the warmup iterations and the interesting ones).

Try and see for yourself. It is ~15% slower.

nrinaudo commented on June 24, 2024

Mmm, you're right. I thought that by unescaped quote handling, you meant something like something,a non quoted value with "quotes",something. Tabulate can handle your specific example, but it does break down if the character following the unescaped quote is a line break or a column separator.

As for the RowProcessor thing, you're the author, I'm sure you're right: it must be faster when using the callback-based API rather than the iterator-like one. I'm specifically benchmarking iterator-like access though - and I must say, if I'm using univocity for a use-case for which it wasn't specifically optimised, the results are quite impressive.

nrinaudo commented on June 24, 2024

All the best parsers and serializers, including univocity, are now in the same very small performance bracket. All the gross misconfigurations are now fixed.

jbax commented on June 24, 2024

Thanks!

As an FYI, version 2.1.0 will be considerably faster than 2.0.2, and you can already test univocity-parsers-2.1.0-SNAPSHOT for parsing.

nrinaudo commented on June 24, 2024

I have set things up to be notified of new versions of the parsers I'm benchmarking and will be sure to update the results as soon as 2.1.0 is released - although I'm not sure it can get much faster than it already is; there's not much room for improvement left :)

jbax commented on June 24, 2024

Well, it got at least 30% faster. I ran a preliminary test against some other parsers using the worldcitiespop.txt file, and it parsed everything in 880ms on my machine, while Jackson took ~1.1 seconds and opencsv ~1.9, so it might be good to test again using your test scenario.

nrinaudo commented on June 24, 2024

I'm not surprised about opencsv - my results show that 2.0.2 is already almost twice as fast - but faster than jackson is quite an achievement.

jbax commented on June 24, 2024

Version 2.1.0 has been released, including parsing and writing performance improvements. It should be way faster now.

nrinaudo commented on June 24, 2024

Quite (see http://nrinaudo.github.io/kantan.csv/tut/benchmarks.html): second only to jackson in my benchmarks now.

jbax commented on June 24, 2024

Thank you!

