nrinaudo / kantan.csv
CSV handling library for Scala
Home Page: http://nrinaudo.github.io/kantan.csv/
License: Apache License 2.0
Benchmarks currently take it on faith that each library's output is correct. Some simple tests validating that this is really the case would be good.
Neither type class has a mechanism for dealing with failure when opening the underlying resource; both must rely on throwing exceptions.
Libraries that should be included, if they pass tests, are:
There are probably more out there.
Shapeless requires macro paradise to run on 2.10; this needs to be stated clearly in the corresponding documentation.
The benchmark module depends on product-collections, which does not have 2.10 artifacts. This makes the publication process more painful than it needs to be.
For the case where the case class is laid out exactly as the columns in the CSV file, it would be a nice convenience to be able to do RowDecoder.decoder8(Foo.apply) instead of RowDecoder.decoder8(Foo.apply)(0, 1, 2, 3, 4, 5, 6, 7).
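A hand-rolled sketch of what such an overload could look like, at arity 2 to keep things short. This is not kantan.csv's actual API; all names below (ordered2, the function-based cell decoders) are made up for illustration.

```scala
// Minimal stand-in for RowDecoder, NOT kantan.csv's real type.
trait RowDecoder[A] { def decode(row: Seq[String]): A }

object RowDecoder {
  // Explicit-index version, mirroring decoder2(f)(i0, i1).
  def decoder2[A, B, R](f: (A, B) => R)(i0: Int, i1: Int)(
      implicit da: String => A, db: String => B): RowDecoder[R] =
    new RowDecoder[R] {
      def decode(row: Seq[String]): R = f(da(row(i0)), db(row(i1)))
    }

  // Proposed convenience: assume columns 0, 1, ... in field order.
  def ordered2[A, B, R](f: (A, B) => R)(
      implicit da: String => A, db: String => B): RowDecoder[R] =
    decoder2(f)(0, 1)
}

// Minimal cell decoders for the example.
implicit val readString: String => String = identity
implicit val readInt: String => Int       = _.toInt

case class Foo(name: String, count: Int)

// No index list needed when fields line up with columns.
val fooDecoder: RowDecoder[Foo] = RowDecoder.ordered2(Foo.apply)
```

The explicit-index version stays available for files whose column order differs from the case class layout.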
Univocity's performance is horrible, at least ten times slower than the slowest engine in our benchmarks. This is highly suspicious, as they have published benchmarks in which they outperform all other parsers.
While this is not strictly a tabulate issue, it might be one with our benchmarks. It would be worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed. If performance really is that bad, we should probably drop univocity from our benchmarks.
In other projects, these would be called CsvSource and CsvSink. Better to rename them (and maybe create deprecated type aliases for CsvInput and CsvOutput?).
In the current implementation, Iterators cannot be turned into a CSV string.
Some type classes, such as CsvInput and CsvOutput, would probably benefit from being made contravariant - there's no reason a CsvInput[InputStream] shouldn't be usable where a CsvInput[ByteArrayInputStream] is required, for example.
This is currently problematic, as the current version of simulacrum doesn't seem to behave well with variance annotations. A fix has been submitted and accepted, and should be released with version 0.6.0.
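A sketch of the variance argument, with a simplified, made-up CsvInput: once the type parameter is contravariant, a single instance for InputStream serves anywhere a more specific CsvInput[ByteArrayInputStream] is expected.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Simplified stand-in for CsvInput, contravariant in its type parameter.
trait CsvInput[-A] { def lines(a: A): List[String] }

// One instance covering any InputStream...
implicit val streamInput: CsvInput[InputStream] = new CsvInput[InputStream] {
  def lines(in: InputStream): List[String] =
    scala.io.Source.fromInputStream(in, "UTF-8").getLines().toList
}

// ...picked up even where a CsvInput[ByteArrayInputStream] is required,
// because CsvInput[InputStream] <: CsvInput[ByteArrayInputStream].
def readAll(in: ByteArrayInputStream)(
    implicit ci: CsvInput[ByteArrayInputStream]): List[String] =
  ci.lines(in)

val rows = readAll(new ByteArrayInputStream("a,b\nc,d".getBytes("UTF-8")))
```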
In the current version, pattern matching on Result is cumbersome: it requires importing it from kantan.codecs, where we'd rather users not even have to know about it.
See imp.
Maybe writeCsv should expect a TraversableOnce instead?
This will take some work to fix, but is probably not too difficult: all write operations simply need to return a WriteResult[CsvWriter] rather than a plain CsvWriter.
The current implementation of the opencsv engine is underwhelming: it has some of the worst performance numbers, and fails most parsing tests. It fails so hard, in fact, that something might be up with my implementation rather than with opencsv itself.
This should be investigated and possibly discussed with the opencsv team. Should the issue be with opencsv itself, we might want to drop support for it, at least until it's fixed.
In its current implementation, CsvIterator relies on scala.io.Source for IO.
Ideally, CsvIterator would only require some sort of closeable iterator on chars. Among other things, this would probably make scala.js integration easier.
See this conversation.
The problem comes from the fact that we rely on the StringDecoder instances. Manually defined CellDecoder instances are not picked up.
The same problem most probably affects:
- Either instances
- CellEncoder instances
kantan.csv doesn't appear to behave well when the column separator is "any non-0 number of spaces". Is this something we want to support?
In theory, it looks like it should be possible for CellEncoder and RowEncoder to both be specific instances of Encoder[E, D], where E is the encoded type and D the decoded type. CellEncoder[A] would be an Encoder[String, A] and RowEncoder[A] an Encoder[Seq[String], A].
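A sketch of that unified hierarchy. Since simulacrum can't annotate the two-parameter Encoder, the enrichment is written by hand below; the names follow the proposal but the code is illustrative only.

```scala
// The shared, two-parameter base type class.
trait Encoder[E, D] { def encode(d: D): E }

// Cell- and row-level encoders become specialisations.
trait CellEncoder[A] extends Encoder[String, A]
trait RowEncoder[A]  extends Encoder[Seq[String], A]

// Hand-written syntax, standing in for what @typeclass would generate.
implicit class CellEncoderOps[A](private val a: A) {
  def asCsvCell(implicit ev: CellEncoder[A]): String = ev.encode(a)
}

implicit val intCellEncoder: CellEncoder[Int] = new CellEncoder[Int] {
  def encode(d: Int): String = d.toString
}

// A row of encodable cells is itself encodable.
implicit def seqRowEncoder[A](implicit cell: CellEncoder[A]): RowEncoder[Seq[A]] =
  new RowEncoder[Seq[A]] {
    def encode(d: Seq[A]): Seq[String] = d.map(cell.encode)
  }

val cell = 12.asCsvCell
val row  = implicitly[RowEncoder[Seq[Int]]].encode(Seq(1, 2))
```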
I've tried that and am currently stuck with simulacrum not creating the operators I'd like it to.
Encoder cannot be annotated with @typeclass - it has two type parameters. And methods declared in Encoder do not get operators in subclasses that are annotated with @typeclass.
For example:
trait Encoder[E, D] {
def encode(d: D): E
}
@simulacrum.typeclass
trait CellEncoder[A] extends Encoder[String, A]
object Test {
import CellEncoder.ops._
def test[A: CellEncoder](a: A): String = a.encode
}
This fails with:
[error] /Users/nicolasrinaudo/dev/nrinaudo/tabulate/core/src/main/scala/tabulate/Encoder.scala:14: value encode is not a member of type parameter A
[error] def test[A: ColumnEncoder](a: A): String = a.encode
[error] ^
[error] one error found
It might be worth bringing this issue up on simulacrum's Gitter channel.
We have some odd dependencies coming up in the generated pom files. In particular, the scalaz-stream module depends on laws at compile time, which is just wrong.
A common use case of safe parsing is to simply skip over errors. This is currently a bit of a pain to write, but a collect(f: PartialFunction[A, B]): CsvReader[B] would make it both easier and more idiomatic.
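A sketch of the proposed collect, with a plain Iterator standing in for CsvReader and a minimal, made-up ReadResult:

```scala
// Simplified stand-in for kantan's result type.
sealed trait ReadResult[+A]
case class ReadSuccess[A](value: A) extends ReadResult[A]
case class ReadFailure(message: String) extends ReadResult[Nothing]

val rows: Iterator[ReadResult[Int]] =
  Iterator(ReadSuccess(1), ReadFailure("malformed row"), ReadSuccess(3))

// What a CsvReader.collect would let users write: skip failures in one step.
val kept: List[Int] = rows.collect { case ReadSuccess(i) => i }.toList
```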
CsvIterator is not exposed to the rest of the world, and its close method cannot be called explicitly. This means that, not unlike scala.io.Source, callers don't really control when the underlying stream is closed, which is likely to cause all sorts of issues in the real world.
Ideally, CsvIterator would become public, have a close method and all the bells and whistles of a normal Iterator.
The apply name clashes with the instance retrieval method, which can sometimes cause the compiler to be confused and require explicit type tags where they really should be inferred.
apply needs to be marked as deprecated (and deleted in the next version).
See machinist.
PrintWriter is too tied to Java, which makes integration with scala.js impossible in the current state.
It's a bit of a mystery, as they all look very straightforward, but they sometimes fail to generate test cases and cause the entire build to fail.
Specifically, the generators fail in either imap identity (encoding) or imap identity (decoding).
Engines are currently configured the way I feel they should be by default - RFC compliant. The underlying implementation is not exposed, however, which seriously limits the usefulness of external engines - what's the point of bringing in jackson csv if one cannot configure its more esoteric features?
Does something like:
val l: Long = 10
CellEncoder.encode(l)
make sense?
While java.util.Date is terrible, it doesn't cost much to have a simple ISO-8601 default implementation. People that need it will appreciate having it; people that want better tools (joda, say) can easily implement a better codec.
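A sketch of such a default codec, built only on the JDK. The CellCodec shape below is simplified (kantan's real one also deals with decoding failure); the point is just that an ISO-8601 default is a few lines of SimpleDateFormat.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Simplified stand-in for a cell-level codec.
trait CellCodec[A] { def encode(a: A): String; def decode(s: String): A }

val iso8601Date: CellCodec[Date] = new CellCodec[Date] {
  // SimpleDateFormat is not thread-safe, hence a fresh instance per call.
  private def fmt = {
    val f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }
  def encode(d: Date): String = fmt.format(d)
  def decode(s: String): Date = fmt.parse(s)
}

val epoch = iso8601Date.encode(new Date(0L)) // "1970-01-01T00:00:00Z"
```

Anyone preferring joda (or, later, java.time) can simply define their own codec and shadow this one.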
RowEncoder, RowDecoder and RowCodec have a horrible amount of duplicated code - case classes from arity 1 to 22, tuples from arity 1 to 22, ...
There has got to be a better way to generate that code. sbt-boilerplate might be one, or I think I remember argonaut uses an ad-hoc solution that might be worth investigating.
Attempting to create a decoder a second time (or to create any other decoder, for that matter) generates an error once one decoder is in scope.
Welcome to Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.
scala> case class Car(year: Int, make: String, model: String, desc: Option[String], price: Float)
defined class Car
scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
res0: kantan.csv.RowDecoder[Car] = kantan.codecs.Decoder$$anon$1@3f39cd95
scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
<console>:10: error: value decoder5 is not a member of object kantan.csv.RowDecoder
kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
^
Note that this means dropping support for scala 2.10, at least for the fs2 module.
There's probably a way to handle this, but I'm getting implicit errors when attempting to use kantan's asCsvReader and asCsvWriter against generics. I suspect this is either a limitation or there is some niceness I can try with implicits to move forward, but right now my solution is to simply leave the trait methods unimplemented and implement them in the implementing case classes.
If anyone has a better way of doing this, I'm all ears.
Code follows errors.
Information:6/25/16, 5:24 PM - Compilation completed with 6 errors and 0 warnings in 1s 749ms
...
Error:(24, 40) could not find implicit value for evidence parameter of type kantan.csv.RowEncoder[T]
val writer = out.asCsvWriter[T](',', line.headers)
^
Error:(24, 40) not enough arguments for method asCsvWriter: (implicit evidence$1: kantan.csv.RowEncoder[T], implicit oa: kantan.csv.CsvOutput[java.io.ByteArrayOutputStream], implicit e: kantan.csv.engine.WriterEngine)kantan.csv.CsvWriter[T].
Unspecified value parameters evidence$1, oa, e.
val writer = out.asCsvWriter[T](',', line.headers)
^
Error:(45, 66) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
val results = getClass.getResource(source).asCsvReader[T](',', false)
^
Error:(45, 66) not enough arguments for method asCsvReader: (implicit evidence$1: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[java.net.URL], implicit e: kantan.csv.engine.ReaderEngine)kantan.csv.CsvReader[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$1, ia, e.
val results = getClass.getResource(source).asCsvReader[T](',', false)
^
Error:(54, 65) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
^
Error:(54, 65) not enough arguments for method readCsv: (implicit evidence$3: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[String], implicit e: kantan.csv.engine.ReaderEngine, implicit cbf: scala.collection.generic.CanBuildFrom[Nothing,kantan.csv.ReadResult[T],List[kantan.csv.ReadResult[T]]])List[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$3, ia, e...
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
^ ^
import java.io.ByteArrayOutputStream
import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
/**
* Created by revprez on 6/25/16.
*/
trait CsvLine {
def headers : List[String]
}
trait CsvParsable[T <: CsvLine] {
def line : T
def toCsvString(header: Boolean = true) : String = {
val out : ByteArrayOutputStream = new ByteArrayOutputStream()
val writer = out.asCsvWriter[T](',', line.headers)
writer.write(line).close
val string = new String(out.toByteArray)
out.close
header match {
case true => return string.stripLineEnd
case false => {
return string.split("\\r?\\n")(1).stripLineEnd
}
}
}
}
trait CsvParser[T] {
implicit val codec = scala.io.Codec.ISO8859
def parseFile(source: String, sep : Char = ',', header : Boolean = true) : Stream[T] = {
val results = getClass.getResource(source).asCsvReader[T](',', false)
.toStream
.filter( _.isSuccess )
.map(_.get)
return results
}
def parse(line : String, sep : Char = ',', header : Boolean = false) : Option[T] = {
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
results match {
case Nil => None
case or::ors if (or.isFailure) => None
case or::ors if (or.isSuccess) => Some(or.get)
}
}
}
This supersedes #16.
Tabulate doesn't support Scala.js. It should.
There doesn't seem to be an example in the docs using for comprehensions. Can Kantan be used with them and can we add some documentation about that?
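A sketch of what such documentation could show, with the standard Either standing in for kantan's result type; cell() below is a made-up helper, not part of the library.

```scala
// A made-up cell decoder returning Either, to mimic result composition.
def cell(s: String): Either[String, Int] =
  try Right(s.trim.toInt)
  catch { case _: NumberFormatException => Left(s"not a number: $s") }

// Decoding a whole row in a for comprehension short-circuits on the
// first failure, which is the behaviour users typically want.
def decodeRow(row: Seq[String]): Either[String, (Int, Int)] =
  for {
    a <- cell(row(0))
    b <- cell(row(1))
  } yield (a, b)

val ok  = decodeRow(Seq("1", "2"))
val bad = decodeRow(Seq("1", "oops"))
```

Note that right-biased Either (Scala 2.12+) is assumed here.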
There are at least two unofficial CSV features that I've yet to encounter:
- comments (everything between a # and a line break, apparently)
- escape characters (\ is sometimes used to prevent interpretation of the next character)
I haven't encountered data files that exhibit these behaviours (or other, even rarer oddities), but if they do exist, it'd be good to support them / test against them.
In the current implementation, using, say, Jackson for parsing but opencsv for writing is hard work. While not necessarily a common use case, splitting all engines into an implicit ReaderEngine and WriterEngine solves this neatly without changing anything for more standard use cases.
An unforeseen consequence of giving them the name of the project they are connectors for is that it messes up imports, but good.
For example, in:
import com.nrinaudo.csv._
import scalaz.stream._
None of scalaz-stream's classes are imported - the second import is understood to be com.nrinaudo.csv.scalaz.stream._
I'd like to be able to find out more details on why a decode failed. In DecodeResult, exceptions are swallowed, inhibiting this. I'd consider returning a subclass of Failure that wraps the exception, or just using the standard Either type.
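A sketch of the first option, with a simplified, made-up DecodeResult whose Failure keeps the underlying exception instead of swallowing it:

```scala
// Simplified stand-in for DecodeResult; the failure case carries the cause.
sealed trait DecodeResult[+A]
case class DecodeSuccess[A](value: A) extends DecodeResult[A]
case class DecodeFailure(cause: Throwable) extends DecodeResult[Nothing]

def decodeInt(s: String): DecodeResult[Int] =
  try DecodeSuccess(s.toInt)
  catch { case e: NumberFormatException => DecodeFailure(e) }

// Callers can now recover the original error message.
val detail = decodeInt("abc") match {
  case DecodeFailure(cause) => cause.getMessage
  case DecodeSuccess(_)     => "ok"
}
```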
This works:
$ sbt console
scala> import kantan.csv.ops._
scala> List(List(1, 2), List(3, 4)).asCsv(',')
res0: String =
"1,2
3,4
"
This hangs (and hogs the processor):
$ sbt console
scala> import kantan.csv.ops._
scala> import kantan.csv.generic.codecs._
scala> List(List(1, 2), List(3, 4)).asCsv(',')
Later versions of scoverage mess up SBT's handling of scala versions.
All packages should be renamed to com.nrinaudo.tabulate.
For some reason, the generic module doesn't get any code coverage information. In fact, scoverage seems to believe there is nothing to instrument:
[info] [info] Instrumentation completed [0 statements]
When transforming a CSV file into another, we currently need to turn the CsvReader into an Iterator before passing it to CsvOutput.writeCsv. This seems unnecessary.
The question is: do we want a specialised implementation of writeCsv, or should CsvReader implement TraversableOnce (if at all possible)?