GithubHelp home page GithubHelp logo

Comments (5)

der avatar der commented on August 16, 2024

Elsewhere we've tended to use commons-csv rather than opencsv, partly because it's included in jena.
It may make sense to switch to that for dclib, though I've no idea how it handles non-standard line breaks in files.

Certainly support CR CR LF as a legal sequence for a single line break makes no sense to me, that's just mad.

from dclib.

der avatar der commented on August 16, 2024

Legal line endings are normally CR (old style Macs), CR LF (window) and of course LF so interpreting CR CR LF as two line breaks (albit in different styles) seems like the best interpretation. Preprocessing such broken files is the obvious solution and strikes me as better than bending the interpretation of CSVs.

Though, if you can provide a small test case then I guess we could check commons CSV and updated open CSV in case they have different behaviour.

from dclib.

skwlilac avatar skwlilac commented on August 16, 2024

Legal line endings are normally CR (old style Macs), CR LF (window) and of course LF so interpreting CR CR LF as two line breaks (albit in different styles) seems like the best interpretation.

Totally agree.

Interestingly (maybe) the text/csv media type registration registration states:

Encoding considerations:  CSV MIME entities consist of binary data
   [RFC6838].  As per section 4.1.1. of RFC 2046 [RFC2046], this
   media type uses CRLF to denote line breaks.  However, implementers
   should be aware that some implementations may use other values.

which is common at least with text/plain. I was blissfully unaware of this until @ajtucker pointed it out.

I guess a forensic reading would note that the absence of the word 'only' and the note to implementers do not prohibit the use of other line breaks (specifically a single LF or CR).

from dclib.

ajtucker avatar ajtucker commented on August 16, 2024

While platform line endings are indeed an issue, the crux of this particular issue is that we were surprised that the parser essentially strips any CRs from quoted cell values. We were expecting to be able to deal with these odd character sequences in a template, but they didn't make it that far.

The grammar in RFC4180 allows a quoted cell value to have CR and LF chars and I'd expect them to make it through unmolested.

When digging down, we saw that BufferedReader, to this day, reads lines and gobbles up any CR, LF or CR LF sequences, and the opencsv parser will then replace them with a single LF. More recent versions of opencsv appear to have a keepCarriageReturn(bool) method.

I've done some quick checking and it looks as though commons-csv:1.0 at least allows CRs to come through.

import au.com.bytecode.opencsv.{CSVParser, CSVReader}
import org.apache.commons.csv.CSVFormat

import java.io.StringReader
import scala.jdk.CollectionConverters.IterableHasAsScala

object CSVParserTest extends Examples with App {
  val acc_records = CSVFormat.RFC4180.parse(new StringReader(eg1)).asScala.toArray
  assert(acc_records(1).get(0) == "line1\r\nline2")
  
  val oc_reader = new CSVReader(
    new StringReader(eg1),
    CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, '\u0000')
  val oc_records = oc_reader.readAll().asScala.toArray
  assert(oc_records(1)(0) == "line1\nline2")
}

trait Examples {
  val eg1: String = "a,b\r\n\"line1\r\nline2\",1"
}

from dclib.

der avatar der commented on August 16, 2024

Well very few sources of CSV conform to other aspects of RFC4180 (hence all parsers having lots of configuration options to try to cope with the vagaries of CSVs), why should this be different? :)

In particular, all CSV tools will normally cope with, and typically generate, platform-compliant non-quoted line endings not RFC4180 CR LF. So similar handling of quoted line endings seems reasonable and expected to me unless running in a strict RFC4180 mode which we would never normally do.

I still think this specific case is simply broken data that should be cleaned by preprocessing rather than cleaned by text processing in the dclib rules, feels like a better separation of concerns.

However, if this is causing problems then I'd be OK to switch dclib to commons-csv and test whether using RFC4180 mode would break too many other things or, more likely, make it an option.

from dclib.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.