nrinaudo / kantan.csv

CSV handling library for Scala

Home Page: http://nrinaudo.github.io/kantan.csv/

License: Apache License 2.0

kantan.csv's Introduction

kantan.csv

CSV is an unfortunate part of life. kantan.csv attempts to alleviate the pain somewhat by letting developers treat CSV data as a simple iterator.

As much as possible, kantan.csv attempts to present a purely functional and safe interface to users. I've not hesitated to violate these principles internally, however, when doing so afforded better performance. This approach appears to be somewhat successful.

Documentation and tutorials are available on the companion site, but for those looking for a few quick examples:

import java.io.File
import kantan.csv._         // All kantan.csv types.
import kantan.csv.ops._     // Enriches types with useful methods.
import kantan.csv.generic._ // Automatic derivation of codecs.

// Reading from a file: returns an iterator-like structure on (Int, Int)
new File("points.csv").asCsvReader[(Int, Int)](rfc)

// "Complex" types derivation: the second column is either an int, or a string that might be empty.
new File("dodgy.csv").asCsvReader[(Int, Either[Int, Option[String]])](rfc)

case class Point2D(x: Int, y: Int)

// Parsing the content of a remote URL as a List[Point2D].
new java.net.URL("http://someserver.com/points.csv").readCsv[List, Point2D](rfc.withHeader)

// Writing to a CSV file.
new File("output.csv").asCsvWriter[Point2D](rfc)
  .write(Point2D(0, 1))
  .write(Point2D(2, 3))
  .close()

// Writing a collection to a CSV file
new File("output.csv").writeCsv[Point2D](List(Point2D(0, 1), Point2D(2, 3)), rfc)

kantan.csv is distributed under the Apache 2.0 License.

kantan.csv's People

Contributors

akiomik, cquiroz, dhleemarchex, hshn, jan0sch, nevillelyh, nrinaudo, paulpdaniels, scala-steward

kantan.csv's Issues

Investigate the opencsv engine

The current implementation of the opencsv engine is underwhelming: it has some of the worst performance numbers and fails most parsing tests. It fails so hard, in fact, that the problem might be with my implementation rather than with opencsv itself.

This should be investigated and possibly discussed with the opencsv team. Should the issue be with opencsv itself, we might want to drop support for it, at least until it's fixed.

Provide a default Date codec?

While java.util.Date is terrible, it doesn't cost much to provide a simple ISO-8601 default implementation. People who need it will appreciate having it, and people who want better tools (joda, say) can easily implement a better codec.
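As a sketch of what such a default could do, the following parses and formats java.util.Date with a fixed ISO-8601 pattern. The decodeDate / encodeDate names and the error-as-Either shape are illustrative assumptions, not kantan.csv API:

```scala
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Try

// ISO-8601-style pattern. A new SimpleDateFormat is created per call,
// since the class is not thread-safe.
val isoPattern = "yyyy-MM-dd'T'HH:mm:ss"

// Hypothetical names, not kantan.csv API: a default codec would wrap
// these in CellDecoder / CellEncoder instances.
def decodeDate(s: String): Either[Throwable, Date] =
  Try(new SimpleDateFormat(isoPattern).parse(s)).toEither

def encodeDate(d: Date): String =
  new SimpleDateFormat(isoPattern).format(d)
```

Parsing failures surface as Left values rather than exceptions, which matches the library's safe-interface goal.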

Test the benchmarks

Benchmarks currently take it on faith that each library's output is correct. Some simple tests to validate that this is really the case would be good.

Implicit value for evidence parameter errors

There is probably a way to handle this, but I'm getting implicit resolution errors when attempting to use kantan's asCsvReader and asCsvWriter against generic types. I suspect this is either a limitation or that there is some implicit trickery I could use to move forward, but right now my solution is simply to leave the trait methods unimplemented and implement them in the concrete case classes.

If anyone has a better way of doing this, I'm all ears.

Code follows errors.

Errors

Information:6/25/16, 5:24 PM - Compilation completed with 6 errors and 0 warnings in 1s 749ms

...

Error:(24, 40) could not find implicit value for evidence parameter of type kantan.csv.RowEncoder[T]
        val writer = out.asCsvWriter[T](',', line.headers)
                                       ^
Error:(24, 40) not enough arguments for method asCsvWriter: (implicit evidence$1: kantan.csv.RowEncoder[T], implicit oa: kantan.csv.CsvOutput[java.io.ByteArrayOutputStream], implicit e: kantan.csv.engine.WriterEngine)kantan.csv.CsvWriter[T].
Unspecified value parameters evidence$1, oa, e.
        val writer = out.asCsvWriter[T](',', line.headers)
                                       ^
Error:(45, 66) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
        val results = getClass.getResource(source).asCsvReader[T](',', false)
                                                                 ^
Error:(45, 66) not enough arguments for method asCsvReader: (implicit evidence$1: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[java.net.URL], implicit e: kantan.csv.engine.ReaderEngine)kantan.csv.CsvReader[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$1, ia, e.
        val results = getClass.getResource(source).asCsvReader[T](',', false)
                                                                 ^
Error:(54, 65) could not find implicit value for evidence parameter of type kantan.csv.RowDecoder[T]
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
                                                                ^
Error:(54, 65) not enough arguments for method readCsv: (implicit evidence$3: kantan.csv.RowDecoder[T], implicit ia: kantan.csv.CsvInput[String], implicit e: kantan.csv.engine.ReaderEngine, implicit cbf: scala.collection.generic.CanBuildFrom[Nothing,kantan.csv.ReadResult[T],List[kantan.csv.ReadResult[T]]])List[kantan.csv.ReadResult[T]].
Unspecified value parameters evidence$3, ia, e...
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)
                                                                ^

Code

import java.io.ByteArrayOutputStream

import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
/**
  * Created by revprez on 6/25/16.
  */

trait CsvLine {

    def headers : List[String]
}

trait CsvParsable[T <: CsvLine] {

    def line : T

    def toCsvString(header: Boolean = true) : String = {
        val out : ByteArrayOutputStream = new ByteArrayOutputStream()

        val writer = out.asCsvWriter[T](',', line.headers)
        writer.write(line).close

        val string = new String(out.toByteArray)
        out.close

        header match {
            case true => return string.stripLineEnd
            case false => {
                return string.split("\\r?\\n")(1).stripLineEnd
            }
        }

    }
}

trait CsvParser[T] {

    implicit val codec = scala.io.Codec.ISO8859

    def parseFile(source: String, sep : Char = ',', header : Boolean = true) : Stream[T] = {
        val results = getClass.getResource(source).asCsvReader[T](',', false)
            .toStream
            .filter( _.isSuccess )
            .map(_.get)

        return results
    }

    def parse(line : String, sep : Char = ',', header : Boolean = false) : Option[T] = {
        val results : List[ReadResult[T]] = line.readCsv[List,T](sep, header)

        results match {
            case Nil => None
            case or::ors if (or.isFailure) => None
            case or::ors if (or.isSuccess) => Some(or.get)

        }
    }
}

RowDecoder methods hidden from resolution once an instance is created

Attempting to create a decoder twice (or any other decoder for that matter) generates an error after one decoder is in scope.

Welcome to Scala version 2.10.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class Car(year: Int, make: String, model: String, desc: Option[String], price: Float)
defined class Car

scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
res0: kantan.csv.RowDecoder[Car] = kantan.codecs.Decoder$$anon$1@3f39cd95

scala> kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
<console>:10: error: value decoder5 is not a member of object kantan.csv.RowDecoder
              kantan.csv.RowDecoder.decoder5(Car.apply)(0, 1, 2, 3, 4)
                                    ^

Rename the scalaz, cats and scalaz-stream packages

An unforeseen consequence of giving them the name of the project they are connectors for is that it badly messes up imports.

For example, in:

import com.nrinaudo.csv._
import scalaz.stream._

None of the scalaz-stream classes are imported - the second import is understood to be com.nrinaudo.csv.scalaz.stream._.

Find more "odd" CSV data to test against

There are at least two unofficial CSV features that I've yet to encounter:

  • Comments (anything between a # and a line break, apparently).
  • Escaped characters (\ is sometimes used to prevent interpretation of the next character).

I haven't encountered data files that exhibit these behaviours (or other, even rarer oddities), but if they do exist, it'd be good to support them / test against them.

Give CellEncoder and RowEncoder a common parent?

In theory, it looks like it should be possible for CellEncoder and RowEncoder to both be specific instances of Encoder[E, D], where E is the encoded type and D the decoded type. CellEncoder[A] would be an Encoder[String, A] and RowEncoder[A] an Encoder[Seq[String], A].

I've tried that and am currently stuck: simulacrum is not creating the operators I'd like it to. Encoder cannot be annotated with @typeclass, as it has two type parameters, and methods declared in Encoder do not get operators in subclasses that are annotated with @typeclass.

For example:

trait Encoder[E, D] {
  def encode(d: D): E
}

@simulacrum.typeclass
trait CellEncoder[A] extends Encoder[String, A]

object Test {
  import CellEncoder.ops._
  def test[A: CellEncoder](a: A): String = a.encode
}

This fails with:

[error] /Users/nicolasrinaudo/dev/nrinaudo/tabulate/core/src/main/scala/tabulate/Encoder.scala:14: value encode is not a member of type parameter A
[error]   def test[A: ColumnEncoder](a: A): String = a.encode
[error]                                                ^
[error] one error found

It might be worth bringing this issue up on simulacrum's Gitter channel.
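For reference, the design itself works when the operators are written by hand rather than generated by simulacrum. The following standalone sketch (names mirror the issue, but none of it is actual kantan.csv code) compiles without @typeclass:

```scala
object encoders {
  // The proposed common parent: E is the encoded type, D the decoded one.
  trait Encoder[E, D] {
    def encode(d: D): E
  }

  trait CellEncoder[A] extends Encoder[String, A]
  trait RowEncoder[A]  extends Encoder[Seq[String], A]

  // Hand-written equivalent of the operators simulacrum would generate:
  implicit class CellEncoderOps[A](a: A) {
    def asCell(implicit ev: CellEncoder[A]): String = ev.encode(a)
  }

  // Sample instance for the test below.
  implicit val intCellEncoder: CellEncoder[Int] =
    new CellEncoder[Int] { def encode(i: Int): String = i.toString }
}
```

This suggests the blocker is purely simulacrum's code generation, not the encoding of the hierarchy itself.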

Split additional engines into explicit reader and writer instances

In the current implementation, using, say, Jackson for parsing but opencsv for writing is hard work. While not necessarily a common use case, splitting all engines into implicit ReaderEngine and WriterEngine instances solves this neatly without changing anything for more standard use cases.

CsvIterator cannot be closed explicitly

CsvIterator is not exposed to the rest of the world, and its close method cannot be called explicitly. This means that, not unlike scala.io.Source, callers don't really control when the underlying stream is closed, which is likely to cause all sorts of issues in the real world.

Ideally, CsvIterator would become public, with a close method and all the bells and whistles of a normal Iterator.

Clean dependencies up

We have some odd dependencies coming up in the generated pom files. In particular, the scalaz-stream module depends on laws at compile time, which is just wrong.

Work on type class variance

Some type classes, such as CsvInput and CsvOutput, would probably benefit from being made contravariant - there's no reason a CsvInput[InputStream] shouldn't be usable where a CsvInput[ByteArrayInputStream] is required, for example.

This is currently problematic, as the current version of simulacrum doesn't seem to behave well with variance annotations. A fix has been submitted and accepted, and should be released with version 0.6.0.
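The variance argument can be illustrated with a simplified stand-in (CsvInput here is not the real type class, just a sketch of the shape):

```scala
import java.io.{ByteArrayInputStream, InputStream}

// With a contravariant type parameter, one instance for InputStream
// covers all of its subtypes.
trait CsvInput[-A] {
  def rows(a: A): List[String]
}

implicit val streamInput: CsvInput[InputStream] = new CsvInput[InputStream] {
  def rows(in: InputStream): List[String] =
    scala.io.Source.fromInputStream(in).getLines().toList
}

// No dedicated CsvInput[ByteArrayInputStream] instance is needed:
// implicit search accepts the CsvInput[InputStream] one.
def read[A](a: A)(implicit in: CsvInput[A]): List[String] = in.rows(a)
```

Without the `-A` annotation, the call site below would fail to find an implicit for the subtype.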

Investigate univocity performances

Univocity's performance is horrible: at least ten times slower than the slowest engine in our benchmarks.

This is highly suspicious, as they have published benchmarks in which they performed better than all other parsers.

While this is not strictly a tabulate issue, it might be one with our benchmarks. It might be worth investigating, possibly discussing with the univocity team, and fixing the benchmarks if needed.

If univocity's performance is really that bad, we should probably drop it from our benchmarks.

Whitespace as column separator

kantan.csv doesn't appear to behave well when the column separator is "any non-zero number of spaces".

Is this something we want to support?
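Since the separator is a single Char, one possible user-side workaround in the meantime is to normalise whitespace runs before handing lines to the parser. This is only a sketch, and it obviously breaks quoted fields that contain spaces:

```scala
// Collapse any run of whitespace into a single comma so the result can be
// parsed with a regular ',' separator. Not safe for quoted fields.
def normaliseSeparators(line: String): String =
  line.trim.split("\\s+").mkString(",")
```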

CsvInput and CsvOutput are unsafe

Both type classes have no mechanism for dealing with failure when opening the underlying resource, and must rely on throwing exceptions.
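A safe variant could return the failure as a value instead. The following is a sketch under assumed names (SafeCsvInput and open are hypothetical, not kantan.csv API):

```scala
import java.io.{File, FileInputStream, InputStream}
import scala.util.Try

// Hypothetical safe counterpart to CsvInput: opening the resource yields
// an Either instead of throwing.
trait SafeCsvInput[A] {
  def open(a: A): Either[Throwable, InputStream]
}

val fileInput: SafeCsvInput[File] = new SafeCsvInput[File] {
  def open(f: File): Either[Throwable, InputStream] =
    Try(new FileInputStream(f)).toEither
}
```

A missing file then surfaces as a Left rather than a thrown FileNotFoundException.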

Generators in DerivedRowDecoderTests sometimes fail

It's a bit of a mystery as they all look very straightforward, but they sometimes fail to generate test cases and cause the entire build to fail.

Specifically, the generators fail in either imap identity (encoding) or imap identity (decoding).

asCsv goes into infinite recursion when certain imports are present

This works:

  $ sbt console
  scala> import kantan.csv.ops._
  scala> List(List(1, 2), List(3, 4)).asCsv(',')
  res0: String =
  "1,2
  3,4
  "

This hangs (and hogs the processor):

  $ sbt console
  scala> import kantan.csv.ops._
  scala> import kantan.csv.generic.codecs._
  scala> List(List(1, 2), List(3, 4)).asCsv(',')

CsvIterator needs to not rely on scala.io.Source

In its current implementation, CsvIterator relies on scala.io.Source for IO.

Ideally, CsvIterator would only require some sort of closeable iterator on chars. Among other things, this would probably make scala.js integration easier.
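The minimal interface might look something like this (a sketch of the idea, not the actual design):

```scala
// The only capability CsvIterator really needs: iterate over chars,
// release the underlying resource when done.
trait CloseableCharIterator extends Iterator[Char] {
  def close(): Unit
}

// In-memory implementation, e.g. for tests or scala.js, where no real
// IO resource is involved.
def fromString(s: String): CloseableCharIterator = new CloseableCharIterator {
  private val underlying = s.iterator
  def hasNext: Boolean = underlying.hasNext
  def next(): Char = underlying.next()
  def close(): Unit = () // nothing to release for a string
}
```

A file-backed implementation would wrap a Reader and close it, without scala.io.Source ever entering the picture.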

Generic module code coverage

For some reason, the generic module doesn't get any code coverage information. In fact, scoverage seems to believe there is nothing to instrument:

[info] [info] Instrumentation completed [0 statements]

Normalise CsvInput and CsvOutput names

In other projects, these would be called CsvSource and CsvSink. Better to rename them (and maybe create deprecated type aliases for CsvInput and CsvOutput?).

CsvOutput.writeCsv should accept CsvReader as a parameter

When transforming a CSV file into another, we currently need to turn the CsvReader into an Iterator before passing it to CsvOutput.writeCsv. This seems unnecessary.

The question is, do we want a specialised implementation of writeCsv, or should CsvReader implement TraversableOnce (if at all possible)?

Better configurability of reader / writer engines

Engines are currently configured the way I feel they should be by default - RFC compliant. The underlying implementation is not exposed, however, which seriously limits the usefulness of external engines - what's the point of bringing in jackson csv if one cannot configure its more esoteric features?

Rename the apply TC instance creation method to from

The apply name clashes with the instance retrieval method, which can sometimes confuse the compiler and require explicit type annotations where they should really be inferred.

apply needs to be marked as deprecated (and deleted in the next version).
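To illustrate the clash and the proposed fix, here is a simplified type class whose companion keeps apply for instance retrieval and uses from for creation (a sketch, not kantan.csv's actual code):

```scala
trait CellDecoder[A] { def decode(s: String): A }

object CellDecoder {
  // Instance retrieval: CellDecoder[Int] summons the implicit instance.
  def apply[A](implicit ev: CellDecoder[A]): CellDecoder[A] = ev

  // Instance creation, named `from` so it no longer competes with the
  // summoner during overload resolution.
  def from[A](f: String => A): CellDecoder[A] =
    new CellDecoder[A] { def decode(s: String): A = f(s) }
}

implicit val intDecoder: CellDecoder[Int] = CellDecoder.from(_.toInt)
```

With both methods named apply, the compiler has to disambiguate by arity and inference, which is where the spurious demands for explicit type annotations come from.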
