Comments (5)
Elsewhere we've tended to use commons-csv rather than opencsv, partly because it's included in jena.
It may make sense to switch to that for dclib, though I've no idea how it handles non-standard line breaks in files.
Certainly supporting CR CR LF as a legal sequence for a single line break makes no sense to me; that's just mad.
from dclib.
Legal line endings are normally CR (old-style Macs), CR LF (Windows) and of course LF, so interpreting CR CR LF as two line breaks (albeit in different styles) seems like the best interpretation. Preprocessing such broken files is the obvious solution and strikes me as better than bending the interpretation of CSVs.
Though, if you can provide a small test case then I guess we could check commons-csv and an updated opencsv in case they behave differently.
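A minimal sketch of the preprocessing idea, assuming the broken files use CR CR LF where a single CR LF line break was intended (that intent is an assumption, and `NormalizeLineBreaks` and `collapseCrCrLf` are hypothetical names, not part of dclib):

```scala
object NormalizeLineBreaks {
  // Assumption: every CR CR LF in the broken input was meant as ONE line break,
  // so it is collapsed to a plain CR LF before the text reaches a CSV parser.
  def collapseCrCrLf(text: String): String =
    text.replace("\r\r\n", "\r\n")

  def main(args: Array[String]): Unit = {
    val broken = "a,b\r\r\n1,2\r\r\n"
    val fixed = collapseCrCrLf(broken)
    // Each doubled CR is gone; well-formed CR LF sequences are left untouched.
    assert(fixed == "a,b\r\n1,2\r\n")
    assert(collapseCrCrLf("x\r\ny") == "x\r\ny")
  }
}
```

Doing this as a separate step before parsing keeps the CSV reader's own line-break handling unchanged.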
from dclib.
Legal line endings are normally CR (old-style Macs), CR LF (Windows) and of course LF, so interpreting CR CR LF as two line breaks (albeit in different styles) seems like the best interpretation.
Totally agree.
Interestingly (maybe), the text/csv media type registration states:
Encoding considerations: CSV MIME entities consist of binary data
[RFC6838]. As per section 4.1.1. of RFC 2046 [RFC2046], this
media type uses CRLF to denote line breaks. However, implementers
should be aware that some implementations may use other values.
which is common at least with text/plain. I was blissfully unaware of this until @ajtucker pointed it out.
I guess a forensic reading would note that the absence of the word 'only' and the note to implementers do not prohibit the use of other line breaks (specifically a single LF or CR).
from dclib.
While platform line endings are indeed an issue, the crux of this particular issue is that we were surprised that the parser essentially strips any CRs from quoted cell values. We were expecting to be able to deal with these odd character sequences in a template, but they didn't make it that far.
The grammar in RFC4180 allows a quoted cell value to have CR and LF chars and I'd expect them to make it through unmolested.
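To make that expectation concrete, here is a rough character-level sketch of a reader that treats CR LF as a record separator only outside quotes and keeps CR and LF inside quoted cells verbatim. It is an illustration only, not how opencsv or commons-csv are implemented, and `QuotedFieldSketch` is a hypothetical name:

```scala
object QuotedFieldSketch {
  /** Split CSV text into records; CR and LF inside quoted cells are preserved. */
  def parse(text: String): Vector[Vector[String]] = {
    val records = Vector.newBuilder[Vector[String]]
    var fields = Vector.empty[String]
    val field = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      val next = if (i + 1 < text.length) text.charAt(i + 1) else '\u0000'
      if (inQuotes) {
        if (c == '"' && next == '"') {
          field.append('"') // doubled quote is an escaped quote
          i += 1
        } else if (c == '"') {
          inQuotes = false
        } else {
          field.append(c) // CR and LF pass through untouched
        }
      } else if (c == '"') {
        inQuotes = true
      } else if (c == ',') {
        fields = fields :+ field.toString
        field.clear()
      } else if (c == '\r' && next == '\n') {
        records.addOne(fields :+ field.toString) // unquoted CR LF ends the record
        fields = Vector.empty
        field.clear()
        i += 1
      } else {
        field.append(c)
      }
      i += 1
    }
    records.addOne(fields :+ field.toString) // a trailing CR LF yields one extra empty record
    records.result()
  }

  def main(args: Array[String]): Unit = {
    val parsed = parse("a,b\r\n\"line1\r\nline2\",1")
    assert(parsed == Vector(Vector("a", "b"), Vector("line1\r\nline2", "1")))
  }
}
```

The point is only that nothing in the quoted branch needs to know about line breaks, so the CR survives to whatever consumes the cell value.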
When digging down, we saw that BufferedReader, to this day, reads lines and gobbles up any CR, LF or CR LF sequences, and the opencsv parser will then replace them with a single LF. More recent versions of opencsv appear to have a keepCarriageReturn(bool) method.
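The BufferedReader part can be seen with the standard library alone. The sketch below is a simplified model of the behaviour described above, not opencsv's actual code; `ReadLineDemo`, `lines` and `rejoinWithLf` are hypothetical names:

```scala
import java.io.{BufferedReader, StringReader}

object ReadLineDemo {
  // BufferedReader.readLine treats LF, CR and CR LF alike and returns
  // each line WITHOUT its terminator, so the original style is lost.
  def lines(text: String): List[String] = {
    val reader = new BufferedReader(new StringReader(text))
    Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
  }

  // Model of a line-based parser stitching a multi-line quoted cell back
  // together with a single LF.
  def rejoinWithLf(text: String): String = lines(text).mkString("\n")

  def main(args: Array[String]): Unit = {
    val crlf = "\"line1\r\nline2\",1"
    val cr = "\"line1\rline2\",1"
    val lf = "\"line1\nline2\",1"
    // All three inputs produce identical lines once read this way.
    assert(lines(crlf) == lines(cr) && lines(cr) == lines(lf))
    assert(rejoinWithLf(crlf) == lf)
  }
}
```

Once the input has gone through readLine there is no way for later code to tell which terminator was in the file, so the CR has to be kept, if at all, at the reader level.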
I've done some quick checking and it looks as though commons-csv:1.0 at least allows CRs to come through.
import au.com.bytecode.opencsv.{CSVParser, CSVReader}
import org.apache.commons.csv.CSVFormat
import java.io.StringReader
import scala.jdk.CollectionConverters.IterableHasAsScala

object CSVParserTest extends Examples with App {
  // commons-csv: the CR LF inside the quoted cell comes through unchanged.
  val acc_records = CSVFormat.RFC4180.parse(new StringReader(eg1)).asScala.toArray
  assert(acc_records(1).get(0) == "line1\r\nline2")

  // opencsv (escape character set to NUL): the CR is dropped, leaving a single LF.
  val oc_reader = new CSVReader(
    new StringReader(eg1),
    CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, '\u0000')
  val oc_records = oc_reader.readAll().asScala.toArray
  assert(oc_records(1)(0) == "line1\nline2")
}

trait Examples {
  // Header row, then a record whose first cell is quoted and contains CR LF.
  val eg1: String = "a,b\r\n\"line1\r\nline2\",1"
}
from dclib.
Well, very few sources of CSV conform to other aspects of RFC4180 (hence all parsers having lots of configuration options to try to cope with the vagaries of CSVs), so why should this be different? :)
In particular, all CSV tools will normally cope with, and typically generate, platform-compliant non-quoted line endings, not RFC4180 CR LF. So similar handling of quoted line endings seems reasonable and expected to me unless running in a strict RFC4180 mode, which we would never normally do.
I still think this specific case is simply broken data that should be cleaned by preprocessing rather than by text processing in the dclib rules; that feels like a better separation of concerns.
However, if this is causing problems then I'd be OK to switch dclib to commons-csv and test whether using RFC4180 mode would break too many other things or, more likely, make it an option.
from dclib.