nrinaudo / kantan.csv

CSV handling library for Scala

Home Page: http://nrinaudo.github.io/kantan.csv/

License: Apache License 2.0

kantan.csv's Introduction

kantan.csv

CSV is an unfortunate part of life. kantan.csv attempts to alleviate the pain somewhat by letting developers treat CSV data as a simple iterator.

As much as possible, kantan.csv attempts to present a purely functional and safe interface to users. I've not hesitated to violate these principles internally, however, when doing so afforded better performance. This approach appears to be somewhat successful.

Documentation and tutorials are available on the companion site, but for those looking for a few quick examples:

import java.io.File
import kantan.csv._         // All kantan.csv types.
import kantan.csv.ops._     // Enriches types with useful methods.
import kantan.csv.generic._ // Automatic derivation of codecs.

// Reading from a file: returns an iterator-like structure on (Int, Int)
new File("points.csv").asCsvReader[(Int, Int)](rfc)

// "Complex" types derivation: the second column is either an int, or a string that might be empty.
new File("dodgy.csv").asCsvReader[(Int, Either[Int, Option[String]])](rfc)

case class Point2D(x: Int, y: Int)

// Parsing the content of a remote URL as a List[Point2D].
new java.net.URL("http://someserver.com/points.csv").readCsv[List, Point2D](rfc.withHeader)

// Writing to a CSV file.
new File("output.csv").asCsvWriter[Point2D](rfc)
  .write(Point2D(0, 1))
  .write(Point2D(2, 3))
  .close()

// Writing a collection to a CSV file
new File("output.csv").writeCsv[Point2D](List(Point2D(0, 1), Point2D(2, 3)), rfc)

kantan.csv is distributed under the Apache 2.0 License.

kantan.csv's People

Contributors

akiomik, cquiroz, dhleemarchex, hshn, jan0sch, nevillelyh, nrinaudo, paulpdaniels, scala-steward

kantan.csv's Issues

Investigate the opencsv engine

The current implementation of the opencsv engine is underwhelming: it has some of the worst performance numbers and fails most parsing tests. It fails so hard, in fact, that the problem might be with my implementation rather than with opencsv itself.

This should be investigated and possibly discussed with the opencsv team. Should the issue be with opencsv itself, we might want to drop support for it, at least until it's fixed.

Provide a default Date codec?

While java.util.Date is terrible, it doesn't cost much to provide a simple ISO-8601 default implementation. People who need it will appreciate having it, and people who want better tools (joda, say) can easily implement a better codec.
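As a sketch of what such a default could do, the following parses and formats java.util.Date with a fixed ISO-8601 pattern. The decodeDate / encodeDate names and the error-as-Either shape are illustrative assumptions, not kantan.csv API:

```scala
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Try

// ISO-8601-style pattern. A new SimpleDateFormat is created per call,
// since the class is not thread-safe.
val isoPattern = "yyyy-MM-dd'T'HH:mm:ss"

// Hypothetical names, not kantan.csv API: a default codec would wrap
// these in CellDecoder / CellEncoder instances.
def decodeDate(s: String): Either[Throwable, Date] =
  Try(new SimpleDateFormat(isoPattern).parse(s)).toEither

def encodeDate(d: Date): String =
  new SimpleDateFormat(isoPattern).format(d)
```

Parsing failures surface as Left values rather than exceptions, which matches the library's safe-interface goal.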

Test the benchmarks

Benchmarks currently take it on faith that each library's output is correct. Some simple tests to validate that this is really the case would be good.

Implicit value for evidence parameter errors

There is probably a way to handle this, but I'm getting implicit resolution errors when attempting to use kantan's asCsvReader and asCsvWriter against generic types. I suspect this is either a limitation or that there is some implicit trickery I could use to move forward, but right now my solution is simply to leave the trait methods unimplemented and implement them in the concrete case classes.

If anyone has a better way of doing this, I'm all ears.

Code follows errors.

Errors

Information:6/25/16, 5:24 PM - Compilation completed with 6 errors and 0 warnings in 1s 749ms

...

Error:(24, 40) could not find implicit value for evidence parameter of type kantan.csv.RowEncoder[T]
        val writer = out.asCsvWriter[T](',', line.headers)
                                       ^
Error:(24, 40) not enough arguments for method asCsvWriter: (implicit evidence$1: kantan.csv.RowEncoder[T], implicit oa: kantan.csv.CsvOutput[java.io.ByteArrayOutputStream], implicit e: kantan.csv.engine.WriterEngine)kantan.csv.CsvWriter[T].
Unspecified value parameters evidence$1, oa, e.
        val writer = out.asCsvWriter[T](',', line.headers)
                                       ^
Error:(45, 66) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
        val results = getClass.getResource(source).asCsvReader[T](',', false)
                                                                 ^
Error:(45, 66) not enough arguments for method asCsvReader: (implicit evidence$1: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[java.net.URL], implicit e: kantan.csv.engine.ReaderEngine)kantan.csv.CsvReader[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$1, ia, e.
        val results = getClass.getResource(source).asCsvReader[T](',', false)
                                                                 ^
Error:(54, 65) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
                                                                ^
Error:(54, 65) not enough arguments for method readCsv: (implicit evidence$3: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[String], implicit e: kantan.csv.engine.ReaderEngine, implicit cbf: scala.collection.generic.CanBuildFrom[Nothing,kantan.csv.ReadResult[T],List[kantan.csv.ReadResult[T]]])List[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$3, ia, e...
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
                                                                ^

Code

import java.io.ByteArrayOutputStream

import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
/**
  * Created by revprez on 6/25/16.
  */

trait CsvLine {

    def headers : List[String]
}

trait CsvParsable[T <: CsvLine] {

    def line : T

    def toCsvString(header: Boolean = true) : String = {
        val out : ByteArrayOutputStream = new ByteArrayOutputStream()

        val writer = out.asCsvWriter[T](',', line.headers)
        writer.write(line).close

        val string = new String(out.toByteArray)
        out.close

        header match {
            case true => return string.stripLineEnd
            case false => {
                return string.split("\\r?\\n")(1).stripLineEnd
            }
        }

    }
}

trait CsvParser[T] {

    implicit val codec = scala.io.Codec.ISO8859

    def parseFile(source: String, sep : Char = ',', header : Boolean = true) : Stream[T] = {
        val results = getClass.getResource(source).asCsvReader[T](',', false)
            .toStream
            .filter( _.isSuccess )
            .map(_.get)

        return results
    }

    def parse(line : String, sep : Char = ',', header : Boolean = false) : Option[T] = {
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)

        results match {
            case Nil => None
            case or::ors if (or.isFailure) => None
            case or::ors if (or.isSuccess) => Some(or.get)

        }
    }
}

RowDecoder methods hidden from resolution once an instance is created

Attempting to create a decoder twice (or any other decoder for that matter) generates an error after one decoder is in scope.

Welcome to Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class Car(year: Int, make: String, model: String, desc: Option[String], price: Float)
defined class Car

scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
res0: kantan.csv.RowDecoder[Car] = kantan.codecs.Decoder$$anon$1@3f39cd95

scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
<console>:10: error: value decoder5 is not a member of object kantan.csv.RowDecoder
              kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
                                    ^

Rename the scalaz, cats and scalaz-stream packages

An unforeseen consequence of giving them the name of the project they are connectors for is that it badly messes up imports.

For example, in:

import com.nrinaudo.csv._
import scalaz.stream._

None of the scalaz-stream classes are imported - the second import is understood to be com.nrinaudo.csv.scalaz.stream._.

Find more "odd" CSV data to test against

There are at least two unofficial CSV features that I've yet to encounter:

  • Comments (anything between a # and a line break, apparently).
  • Escaped characters (\ is sometimes used to prevent interpretation of the next character).

I haven't encountered data files that exhibit these behaviours (or other, even rarer oddities), but if they do exist, it'd be good to support them / test against them.

Give CellEncoder and RowEncoder a common parent?

In theory, it looks like it should be possible for CellEncoder and RowEncoder to both be specific instances of Encoder[E, D], where E is the encoded type and D the decoded type. CellEncoder[A] would be an Encoder[String, A] and RowEncoder[A] an Encoder[Seq[String], A].

I've tried that and am currently stuck: simulacrum is not creating the operators I'd like it to. Encoder cannot be annotated with @typeclass, as it has two type parameters, and methods declared in Encoder do not get operators in subclasses that are annotated with @typeclass.

For example:

trait Encoder[E, D] {
  def encode(d: D): E
}

@simulacrum.typeclass
trait CellEncoder[A] extends Encoder[String, A]

object Test {
  import CellEncoder.ops._
  def test[A: CellEncoder](a: A): String = a.encode
}

This fails with:

[error] /Users/nicolasrinaudo/dev/nrinaudo/tabulate/core/src/main/scala/tabulate/Encoder.scala:14: value encode is not a member of type parameter A
[error]   def test[A: ColumnEncoder](a: A): String = a.encode
[error]                                                ^
[error] one error found

It might be worth bringing this issue up on simulacrum's Gitter channel.
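For reference, the design itself works when the operators are written by hand rather than generated by simulacrum. The following standalone sketch (names mirror the issue, but none of it is actual kantan.csv code) compiles without @typeclass:

```scala
object encoders {
  // The proposed common parent: E is the encoded type, D the decoded one.
  trait Encoder[E, D] {
    def encode(d: D): E
  }

  trait CellEncoder[A] extends Encoder[String, A]
  trait RowEncoder[A]  extends Encoder[Seq[String], A]

  // Hand-written equivalent of the operators simulacrum would generate:
  implicit class CellEncoderOps[A](a: A) {
    def asCell(implicit ev: CellEncoder[A]): String = ev.encode(a)
  }

  // Sample instance for the test below.
  implicit val intCellEncoder: CellEncoder[Int] =
    new CellEncoder[Int] { def encode(i: Int): String = i.toString }
}
```

This suggests the blocker is purely simulacrum's code generation, not the encoding of the hierarchy itself.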

Split additional engines into explicit reader and writer instances

In the current implementation, using, say, Jackson for parsing but opencsv for writing is hard work. While not necessarily a common use case, splitting all engines into implicit ReaderEngine and WriterEngine instances solves this neatly without changing anything for more standard use cases.

CsvIterator cannot be closed explicitly

CsvIterator is not exposed to the rest of the world, and its close method cannot be called explicitly. This means that, not unlike scala.io.Source, callers don't really control when the underlying stream is closed, which is likely to cause all sorts of issues in the real world.

Ideally, CsvIterator would become public, with a close method and all the bells and whistles of a normal Iterator.

Clean dependencies up

We have some odd dependencies coming up in the generated pom files. In particular, the scalaz-stream module depends on laws at compile time, which is just wrong.

Work on type class variance

Some type classes, such as CsvInput and CsvOutput, would probably benefit from being made contravariant - there's no reason a CsvInput[InputStream] shouldn't be usable where a CsvInput[ByteArrayInputStream] is required, for example.

This is currently problematic, as the current version of simulacrum doesn't seem to behave well with variance annotations. A fix has been submitted and accepted, and should be released with version 0.6.0.
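The variance argument can be illustrated with a simplified stand-in (CsvInput here is not the real type class, just a sketch of the shape):

```scala
import java.io.{ByteArrayInputStream, InputStream}

// With a contravariant type parameter, one instance for InputStream
// covers all of its subtypes.
trait CsvInput[-A] {
  def rows(a: A): List[String]
}

implicit val streamInput: CsvInput[InputStream] = new CsvInput[InputStream] {
  def rows(in: InputStream): List[String] =
    scala.io.Source.fromInputStream(in).getLines().toList
}

// No dedicated CsvInput[ByteArrayInputStream] instance is needed:
// implicit search accepts the CsvInput[InputStream] one.
def read[A](a: A)(implicit in: CsvInput[A]): List[String] = in.rows(a)
```

Without the `-A` annotation, the call site below would fail to find an implicit for the subtype.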

Investigate univocity performances

Univocity's performance is horrible: at least ten times slower than the slowest engine in our benchmarks.

This is highly suspicious, as they have published benchmarks in which they performed better than all other parsers.

While this is not strictly a tabulate issue, it might be one with our benchmarks. It might be worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed.

If univocity's performance is really that bad, we should probably drop it from our benchmarks.

Whitespace as column separator

kantan.csv doesn't appear to behave well when the column separator is "any non-zero number of spaces".

Is this something we want to support?
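Since the separator is a single Char, one possible user-side workaround in the meantime is to normalise whitespace runs before handing lines to the parser. This is only a sketch, and it obviously breaks quoted fields that contain spaces:

```scala
// Collapse any run of whitespace into a single comma so the result can be
// parsed with a regular ',' separator. Not safe for quoted fields.
def normaliseSeparators(line: String): String =
  line.trim.split("\\s+").mkString(",")
```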

CsvInput and CsvOutput are unsafe

Both type classes have no mechanism for dealing with failure when opening the underlying resource, and must rely on throwing exceptions.
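A safe variant could return the failure as a value instead. The following is a sketch under assumed names (SafeCsvInput and open are hypothetical, not kantan.csv API):

```scala
import java.io.{File, FileInputStream, InputStream}
import scala.util.Try

// Hypothetical safe counterpart to CsvInput: opening the resource yields
// an Either instead of throwing.
trait SafeCsvInput[A] {
  def open(a: A): Either[Throwable, InputStream]
}

val fileInput: SafeCsvInput[File] = new SafeCsvInput[File] {
  def open(f: File): Either[Throwable, InputStream] =
    Try(new FileInputStream(f)).toEither
}
```

A missing file then surfaces as a Left rather than a thrown FileNotFoundException.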

Generators in DerivedRowDecoderTests sometimes fail

It's a bit of a mystery as they all look very straightforward, but they sometimes fail to generate test cases and cause the entire build to fail.

Specifically, the generators fail in either imap identity (encoding) or imap identity (decoding).

asCsv goes into infinite recursion when certain imports are present

This works:

  $ sbt console
  scala> import kantan.csv.ops._
  scala> List(List(1, 2), List(3, 4)).asCsv(',')
  res0: String =
  "1,2
  3,4
  "

This hangs (and hogs the processor):

  $ sbt console
  scala> import kantan.csv.ops._
  scala> import kantan.csv.generic.codecs._
  scala> List(List(1, 2), List(3, 4)).asCsv(',')

CsvIterator needs to not rely on scala.io.Source

In its current implementation, CsvIterator relies on scala.io.Source for IO.

Ideally, CsvIterator would only require some sort of closeable iterator on chars. Among other things, this would probably make scala.js integration easier.
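The minimal interface might look something like this (a sketch of the idea, not the actual design):

```scala
// The only capability CsvIterator really needs: iterate over chars,
// release the underlying resource when done.
trait CloseableCharIterator extends Iterator[Char] {
  def close(): Unit
}

// In-memory implementation, e.g. for tests or scala.js, where no real
// IO resource is involved.
def fromString(s: String): CloseableCharIterator = new CloseableCharIterator {
  private val underlying = s.iterator
  def hasNext: Boolean = underlying.hasNext
  def next(): Char = underlying.next()
  def close(): Unit = () // nothing to release for a string
}
```

A file-backed implementation would wrap a Reader and close it, without scala.io.Source ever entering the picture.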

Generic module code coverage

For some reason, the generic module doesn't get any code coverage information. In fact, scoverage seems to believe there is nothing to instrument:

[info] [info] Instrumentation completed [0 statements]

Normalise CsvInput and CsvOutput names

In other projects, these would be called CsvSource and CsvSink. Better to rename them (and maybe create deprecated type aliases for CsvInput and CsvOutput?).

CsvOutput.writeCsv should accept CsvReader as a parameter

When transforming a CSV file into another, we currently need to turn the CsvReader into an Iterator before passing it to CsvOutput.writeCsv. This seems unnecessary.

The question is, do we want a specialised implementation of writeCsv, or should CsvReader implement TraversableOnce (if at all possible)?

Better configurability of reader / writer engines

Engines are currently configured the way I feel they should be by default - RFC compliant. The underlying implementation is not exposed, however, which seriously limits the usefulness of external engines - what's the point of bringing in jackson csv if one cannot configure its more esoteric features?

Rename the apply TC instance creation method to from

The apply name clashes with the instance retrieval method, which can sometimes confuse the compiler and require explicit type annotations where they should really be inferred.

apply needs to be marked as deprecated (and deleted in the next version).
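To illustrate the clash and the proposed fix, here is a simplified type class whose companion keeps apply for instance retrieval and uses from for creation (a sketch, not kantan.csv's actual code):

```scala
trait CellDecoder[A] { def decode(s: String): A }

object CellDecoder {
  // Instance retrieval: CellDecoder[Int] summons the implicit instance.
  def apply[A](implicit ev: CellDecoder[A]): CellDecoder[A] = ev

  // Instance creation, named `from` so it no longer competes with the
  // summoner during overload resolution.
  def from[A](f: String => A): CellDecoder[A] =
    new CellDecoder[A] { def decode(s: String): A = f(s) }
}

implicit val intDecoder: CellDecoder[Int] = CellDecoder.from(_.toInt)
```

With both methods named apply, the compiler has to disambiguate by arity and inference, which is where the spurious demands for explicit type annotations come from.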
