nrinaudo / kantan.csv
CSV handling library for Scala
Home Page: http://nrinaudo.github.io/kantan.csv/
License: Apache License 2.0
Benchmarks currently take it on faith that each library's output is correct. Some simple tests validating that this is really the case would be good.
Neither type class has a mechanism for dealing with failure when opening the underlying resource; both must rely on throwing exceptions.
Libraries that should be included, if they pass tests, are:
There are probably more out there.
Shapeless requires macro paradise to run on 2.10; this needs to be stated clearly in the corresponding documentation.
The benchmark module depends on product-collections, which does not have 2.10 artifacts. This makes the publication process more painful than it needs to be.
For the case where the case class is laid out exactly as the columns in the CSV file, it would be a nice convenience to be able to do RowDecoder.decoder8(Foo.apply) instead of RowDecoder.decoder8(Foo.apply)(0, 1, 2, 3, 4, 5, 6, 7).
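A hand-rolled sketch of what such an overload could look like, at arity 2 to keep things short. This is not kantan.csv's actual API; all names below (ordered2, the function-based cell decoders) are made up for illustration.

```scala
// Minimal stand-in for RowDecoder, NOT kantan.csv's real type.
trait RowDecoder[A] { def decode(row: Seq[String]): A }

object RowDecoder {
  // Explicit-index version, mirroring decoder2(f)(i0, i1).
  def decoder2[A, B, R](f: (A, B) => R)(i0: Int, i1: Int)(
      implicit da: String => A, db: String => B): RowDecoder[R] =
    new RowDecoder[R] {
      def decode(row: Seq[String]): R = f(da(row(i0)), db(row(i1)))
    }

  // Proposed convenience: assume columns 0, 1, ... in field order.
  def ordered2[A, B, R](f: (A, B) => R)(
      implicit da: String => A, db: String => B): RowDecoder[R] =
    decoder2(f)(0, 1)
}

// Minimal cell decoders for the example.
implicit val readString: String => String = identity
implicit val readInt: String => Int       = _.toInt

case class Foo(name: String, count: Int)

// No index list needed when fields line up with columns.
val fooDecoder: RowDecoder[Foo] = RowDecoder.ordered2(Foo.apply)
```

The explicit-index version stays available for files whose column order differs from the case class layout.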
Univocity's performance is horrible, at least ten times slower than the slowest engine in our benchmarks. This is highly suspicious, as they have published benchmarks in which they outperform all other parsers.
While this is not strictly a tabulate issue, it might be one with our benchmarks. It would be worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed. If performance really is that bad, we should probably drop univocity from our benchmarks.
In other projects, these would be called CsvSource and CsvSink. Better to rename them (and maybe create deprecated type aliases for CsvInput and CsvOutput?).
In the current implementation, Iterators cannot be turned into a CSV string.
Some type classes, such as CsvInput and CsvOutput, would probably benefit from being made contravariant - there's no reason a CsvInput[InputStream] shouldn't be usable where a CsvInput[ByteArrayInputStream] is required, for example.
This is currently problematic, as the current version of simulacrum doesn't seem to behave well with variance annotations. A fix has been submitted and accepted, and should be released with version 0.6.0.
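A sketch of the variance argument, with a simplified, made-up CsvInput: once the type parameter is contravariant, a single instance for InputStream serves anywhere a more specific CsvInput[ByteArrayInputStream] is expected.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Simplified stand-in for CsvInput, contravariant in its type parameter.
trait CsvInput[-A] { def lines(a: A): List[String] }

// One instance covering any InputStream...
implicit val streamInput: CsvInput[InputStream] = new CsvInput[InputStream] {
  def lines(in: InputStream): List[String] =
    scala.io.Source.fromInputStream(in, "UTF-8").getLines().toList
}

// ...picked up even where a CsvInput[ByteArrayInputStream] is required,
// because CsvInput[InputStream] <: CsvInput[ByteArrayInputStream].
def readAll(in: ByteArrayInputStream)(
    implicit ci: CsvInput[ByteArrayInputStream]): List[String] =
  ci.lines(in)

val rows = readAll(new ByteArrayInputStream("a,b\nc,d".getBytes("UTF-8")))
```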
In the current version, pattern matching on Result is cumbersome: it requires importing it from kantan.codecs, where we'd rather users not even have to know about it.
See imp.
Maybe writeCsv should expect a TraversableOnce instead?
This will take some work to fix, but is probably not too difficult: all write operations simply need to return a WriteResult[CsvWriter] rather than a plain CsvWriter.
The current implementation of the opencsv engine is underwhelming: it has some of the worst performance numbers, and fails most parsing tests. It fails so hard, in fact, that something might be up with my implementation rather than with opencsv itself.
This should be investigated and possibly discussed with the opencsv team. Should the issue be with opencsv itself, we might want to drop support for it, at least until it's fixed.
In its current implementation, CsvIterator relies on scala.io.Source for IO.
Ideally, CsvIterator would only require some sort of closeable iterator on chars. Among other things, this would probably make scala.js integration easier.
See this conversation.
The problem comes from the fact that we rely on the StringDecoder instances. Manually defined CellDecoder instances are not picked up.
The same problem most probably affects:
- Either instances
- CellEncoder instances
kantan.csv doesn't appear to behave well when the column separator is "any non-0 number of spaces". Is this something we want to support?
In theory, it looks like it should be possible for CellEncoder and RowEncoder to both be specific instances of Encoder[E, D], where E is the encoded type and D the decoded type. CellEncoder[A] would be an Encoder[String, A] and RowEncoder[A] an Encoder[Seq[String], A].
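A sketch of that unified hierarchy. Since simulacrum can't annotate the two-parameter Encoder, the enrichment is written by hand below; the names follow the proposal but the code is illustrative only.

```scala
// The shared, two-parameter base type class.
trait Encoder[E, D] { def encode(d: D): E }

// Cell- and row-level encoders become specialisations.
trait CellEncoder[A] extends Encoder[String, A]
trait RowEncoder[A]  extends Encoder[Seq[String], A]

// Hand-written syntax, standing in for what @typeclass would generate.
implicit class CellEncoderOps[A](private val a: A) {
  def asCsvCell(implicit ev: CellEncoder[A]): String = ev.encode(a)
}

implicit val intCellEncoder: CellEncoder[Int] = new CellEncoder[Int] {
  def encode(d: Int): String = d.toString
}

// A row of encodable cells is itself encodable.
implicit def seqRowEncoder[A](implicit cell: CellEncoder[A]): RowEncoder[Seq[A]] =
  new RowEncoder[Seq[A]] {
    def encode(d: Seq[A]): Seq[String] = d.map(cell.encode)
  }

val cell = 12.asCsvCell
val row  = implicitly[RowEncoder[Seq[Int]]].encode(Seq(1, 2))
```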
I've tried that and am currently stuck with simulacrum not creating the operators I'd like it to.
Encoder cannot be annotated with @typeclass - it has two type parameters. And methods declared in Encoder do not get operators in subclasses that are annotated with @typeclass.
For example:
trait Encoder[E, D] {
def encode(d: D): E
}
@simulacrum.typeclass
trait CellEncoder[A] extends Encoder[String, A]
object Test {
import CellEncoder.ops._
def test[A: CellEncoder](a: A): String = a.encode
}
This fails with:
[error] /Users/nicolasrinaudo/dev/nrinaudo/tabulate/core/src/main/scala/tabulate/Encoder.scala:14: value encode is not a member of type parameter A
[error] def test[A: ColumnEncoder](a: A): String = a.encode
[error] ^
[error] one error found
It might be worth bringing this issue up on simulacrum's Gitter channel.
We have some odd dependencies coming up in the generated pom files. In particular, the scalaz-stream module depends on laws at compile time, which is just wrong.
A common use case of safe parsing is to simply skip over errors. This is currently a bit of a pain to write, but a collect(f: PartialFunction[A, B]): CsvReader[B] would make it both easier and more idiomatic.
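A sketch of the proposed collect, with a plain Iterator standing in for CsvReader and a minimal, made-up ReadResult:

```scala
// Simplified stand-in for kantan's result type.
sealed trait ReadResult[+A]
case class ReadSuccess[A](value: A) extends ReadResult[A]
case class ReadFailure(message: String) extends ReadResult[Nothing]

val rows: Iterator[ReadResult[Int]] =
  Iterator(ReadSuccess(1), ReadFailure("malformed row"), ReadSuccess(3))

// What a CsvReader.collect would let users write: skip failures in one step.
val kept: List[Int] = rows.collect { case ReadSuccess(i) => i }.toList
```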
CsvIterator is not exposed to the rest of the world, and its close method cannot be called explicitly. This means that, not unlike scala.io.Source, callers don't really control when the underlying stream is closed, which is likely to cause all sorts of issues in the real world.
Ideally, CsvIterator would become public, have a close method and all the bells and whistles of a normal Iterator.
The apply name clashes with the instance retrieval method, which can sometimes cause the compiler to be confused and require explicit type tags where they really should be inferred.
apply needs to be marked as deprecated (and deleted in the next version).
See machinist.
PrintWriter is too tied to Java, which makes integration with scala.js impossible in the current state.
It's a bit of a mystery, as they all look very straightforward, but they sometimes fail to generate test cases and cause the entire build to fail.
Specifically, the generators fail in either imap identity (encoding) or imap identity (decoding).
Engines are currently configured the way I feel they should be by default - RFC compliant. The underlying implementation is not exposed, however, which seriously limits the usefulness of external engines - what's the point of bringing in jackson csv if one cannot configure its more esoteric features?
Does something like:
val l: Long = 10
CellEncoder.encode(l)
make sense?
While java.util.Date is terrible, it doesn't cost much to have a simple ISO-8601 default implementation. People that need it will appreciate having it; people that want better tools (joda, say) can easily implement a better codec.
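A sketch of such a default codec, built only on the JDK. The CellCodec shape below is simplified (kantan's real one also deals with decoding failure); the point is just that an ISO-8601 default is a few lines of SimpleDateFormat.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Simplified stand-in for a cell-level codec.
trait CellCodec[A] { def encode(a: A): String; def decode(s: String): A }

val iso8601Date: CellCodec[Date] = new CellCodec[Date] {
  // SimpleDateFormat is not thread-safe, hence a fresh instance per call.
  private def fmt = {
    val f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }
  def encode(d: Date): String = fmt.format(d)
  def decode(s: String): Date = fmt.parse(s)
}

val epoch = iso8601Date.encode(new Date(0L)) // "1970-01-01T00:00:00Z"
```

Anyone preferring joda (or, later, java.time) can simply define their own codec and shadow this one.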
RowEncoder, RowDecoder and RowCodec have a horrible amount of duplicated code - case classes from arity 1 to 22, tuples from arity 1 to 22, ...
There has got to be a better way to generate that code. sbt-boilerplate might be one, or I think I remember argonaut uses an ad-hoc solution that might be worth investigating.
Attempting to create a decoder a second time (or to create any other decoder, for that matter) generates an error once one decoder is in scope.
Welcome to Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.
scala> case class Car(year: Int, make: String, model: String, desc: Option[String], price: Float)
defined class Car
scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
res0: kantan.csv.RowDecoder[Car] = kantan.codecs.Decoder$$anon$1@3f39cd95
scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
<console>:10: error: value decoder5 is not a member of object kantan.csv.RowDecoder
kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
^
Note that this means dropping support for scala 2.10, at least for the fs2 module.
There's probably a way to handle this, but I'm getting implicit errors when attempting to use kantan's asCsvReader and asCsvWriter against generics. I suspect this is either a limitation or there is some niceness I can try with implicits to move forward, but right now my solution is to simply leave the trait methods unimplemented and implement them in the implementing case classes.
If anyone has a better way of doing this, I'm all ears.
Code follows errors.
Information:6/25/16, 5:24 PM - Compilation completed with 6 errors and 0 warnings in 1s 749ms
...
Error:(24, 40) could not find implicit value for evidence parameter of type kantan.csv.RowEncoder[T]
val writer = out.asCsvWriter[T](',', line.headers)
^
Error:(24, 40) not enough arguments for method asCsvWriter: (implicit evidence$1: kantan.csv.RowEncoder[T], implicit oa: kantan.csv.CsvOutput[java.io.ByteArrayOutputStream], implicit e: kantan.csv.engine.WriterEngine)kantan.csv.CsvWriter[T].
Unspecified value parameters evidence$1, oa, e.
val writer = out.asCsvWriter[T](',', line.headers)
^
Error:(45, 66) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
val results = getClass.getResource(source).asCsvReader[T](',', false)
^
Error:(45, 66) not enough arguments for method asCsvReader: (implicit evidence$1: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[java.net.URL], implicit e: kantan.csv.engine.ReaderEngine)kantan.csv.CsvReader[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$1, ia, e.
val results = getClass.getResource(source).asCsvReader[T](',', false)
^
Error:(54, 65) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
^
Error:(54, 65) not enough arguments for method readCsv: (implicit evidence$3: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[String], implicit e: kantan.csv.engine.ReaderEngine, implicit cbf: scala.collection.generic.CanBuildFrom[Nothing,kantan.csv.ReadResult[T],List[kantan.csv.ReadResult[T]]])List[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$3, ia, e...
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
^ ^
import java.io.ByteArrayOutputStream
import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
/**
* Created by revprez on 6/25/16.
*/
trait CsvLine {
def headers : List[String]
}
trait CsvParsable[T <: CsvLine] {
def line : T
def toCsvString(header: Boolean = true) : String = {
val out : ByteArrayOutputStream = new ByteArrayOutputStream()
val writer = out.asCsvWriter[T](',', line.headers)
writer.write(line).close
val string = new String(out.toByteArray)
out.close
header match {
case true => return string.stripLineEnd
case false => {
return string.split("\\r?\\n")(1).stripLineEnd
}
}
}
}
trait CsvParser[T] {
implicit val codec = scala.io.Codec.ISO8859
def parseFile(source: String, sep : Char = ',', header : Boolean = true) : Stream[T] = {
val results = getClass.getResource(source).asCsvReader[T](',', false)
.toStream
.filter( _.isSuccess )
.map(_.get)
return results
}
def parse(line : String, sep : Char = ',', header : Boolean = false) : Option[T] = {
val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
results match {
case Nil => None
case or::ors if (or.isFailure) => None
case or::ors if (or.isSuccess) => Some(or.get)
}
}
}
This supersedes #16.
Tabulate doesn't support Scala.js. It should.
There doesn't seem to be an example in the docs using for comprehensions. Can Kantan be used with them and can we add some documentation about that?
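A sketch of what such documentation could show, with the standard Either standing in for kantan's result type; cell() below is a made-up helper, not part of the library.

```scala
// A made-up cell decoder returning Either, to mimic result composition.
def cell(s: String): Either[String, Int] =
  try Right(s.trim.toInt)
  catch { case _: NumberFormatException => Left(s"not a number: $s") }

// Decoding a whole row in a for comprehension short-circuits on the
// first failure, which is the behaviour users typically want.
def decodeRow(row: Seq[String]): Either[String, (Int, Int)] =
  for {
    a <- cell(row(0))
    b <- cell(row(1))
  } yield (a, b)

val ok  = decodeRow(Seq("1", "2"))
val bad = decodeRow(Seq("1", "oops"))
```

Note that right-biased Either (Scala 2.12+) is assumed here.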
There are at least two unofficial CSV features that I've yet to encounter:
- comments (everything between a # and a line break, apparently)
- escape characters (\ is sometimes used to prevent interpretation of the next character)
I haven't encountered data files that exhibit these behaviours (or other, even rarer oddities), but if they do exist, it'd be good to support them / test against them.
In the current implementation, using, say, Jackson for parsing but opencsv for writing is hard work. While not necessarily a common use case, splitting all engines into an implicit ReaderEngine and WriterEngine solves this neatly without changing anything for more standard use cases.
An unforeseen consequence of giving them the name of the project they are connectors for is that it messes up imports, but good.
For example, in:
import com.nrinaudo.csv._
import scalaz.stream._
None of scalaz-stream's classes are imported - the second import is understood to be com.nrinaudo.csv.scalaz.stream._
I'd like to be able to find out more details on why a decode failed. In DecodeResult, exceptions are swallowed, inhibiting this. I'd consider returning a subclass of Failure that wraps the exception, or just using the standard Either type.
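A sketch of the first option, with a simplified, made-up DecodeResult whose Failure keeps the underlying exception instead of swallowing it:

```scala
// Simplified stand-in for DecodeResult; the failure case carries the cause.
sealed trait DecodeResult[+A]
case class DecodeSuccess[A](value: A) extends DecodeResult[A]
case class DecodeFailure(cause: Throwable) extends DecodeResult[Nothing]

def decodeInt(s: String): DecodeResult[Int] =
  try DecodeSuccess(s.toInt)
  catch { case e: NumberFormatException => DecodeFailure(e) }

// Callers can now recover the original error message.
val detail = decodeInt("abc") match {
  case DecodeFailure(cause) => cause.getMessage
  case DecodeSuccess(_)     => "ok"
}
```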
This works:
$ sbt console
scala> import kantan.csv.ops._
scala> List(List(1, 2), List(3, 4)).asCsv(',')
res0: String =
"1,2
3,4
"
This hangs (and hogs the processor):
$ sbt console
scala> import kantan.csv.ops._
scala> import kantan.csv.generic.codecs._
scala> List(List(1, 2), List(3, 4)).asCsv(',')
Later versions of scoverage mess up SBT's handling of scala versions.
All packages should be renamed to com.nrinaudo.tabulate.
For some reason, the generic module doesn't get any code coverage information. In fact, scoverage seems to believe there is nothing to instrument:
[info] [info] Instrumentation completed [0 statements]
When transforming a CSV file into another, we currently need to turn the CsvReader into an Iterator before passing it to CsvOutput.writeCsv. This seems unnecessary.
The question is: do we want a specialised implementation of writeCsv, or should CsvReader implement TraversableOnce (if at all possible)?