brianhicks / elm-csv

Decode CSV in the most boring way possible.

Home Page: https://package.elm-lang.org/packages/BrianHicks/elm-csv/latest/

License: BSD 3-Clause "New" or "Revised" License

Languages: Elm 92.92%, Nix 6.81%, Shell 0.27%

Topics: csv, elm

elm-csv's Introduction

elm-csv

Decode CSV in the most boring way possible. Other CSV libraries have exciting, innovative APIs... not this one! Pretend you're writing a JSON decoder, gimme your data, get on with your life.

import Csv.Decode as Decode exposing (Decoder)


decoder : Decoder ( Int, Int, Int )
decoder =
    Decode.map3 (\r g b -> ( r, g, b ))
        (Decode.column 0 Decode.int)
        (Decode.column 1 Decode.int)
        (Decode.column 2 Decode.int)


csv : String
csv =
    "0,128,128\r\n112,128,144"


Decode.decodeCsv Decode.NoFieldNames decoder csv
--> Ok
-->     [ ( 0, 128, 128 )
-->     , ( 112, 128, 144 )
-->     ]

However, in an effort to avoid a common problem with elm/json ("How do I decode a record with more than 8 fields?") this library also exposes a pipeline-style decoder (inspired by NoRedInk/elm-json-decode-pipeline) for records:

import Csv.Decode as Decode exposing (Decoder)


type alias Pet =
    { id : Int
    , name : String
    , species : String
    , weight : Maybe Float
    }


decoder : Decoder Pet
decoder =
    Decode.into Pet
        |> Decode.pipeline (Decode.field "id" Decode.int)
        |> Decode.pipeline (Decode.field "name" Decode.string)
        |> Decode.pipeline (Decode.field "species" Decode.string)
        |> Decode.pipeline (Decode.field "weight" (Decode.blank Decode.float))


csv : String
csv =
    "id,name,species,weight\r\n1,Atlas,cat,14.5\r\n2,Pippi,dog,"


Decode.decodeCsv Decode.FieldNamesFromFirstRow decoder csv
--> Ok
-->     [ { id = 1, name = "Atlas", species = "cat", weight = Just 14.5 }
-->     , { id = 2, name = "Pippi", species = "dog", weight = Nothing }
-->     ]

FAQ

Can this do TSVs too? What about European-style CSVs that use semicolon instead of comma?

Yes to both! Use decodeCustom, which lets you choose the field separator (for example '\t' for TSVs or ';' for semicolon-separated data). Row endings (\r\n or \n) are handled automatically.
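For example, decoding semicolon-separated data looks like this (a minimal sketch, using the same options record that decodeCustom takes in the issues below):

import Csv.Decode as Decode exposing (Decoder)


decoder : Decoder ( String, Int )
decoder =
    Decode.map2 Tuple.pair
        (Decode.column 0 Decode.string)
        (Decode.column 1 Decode.int)


Decode.decodeCustom { fieldSeparator = ';' } Decode.NoFieldNames decoder "pilot;1\ncopilot;2"
--> Ok [ ( "pilot", 1 ), ( "copilot", 2 ) ]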

Aren't there like (checks) 8 other CSV libraries already?

Yes, there are! While I appreciate the hard work that other people have put into those, there are a couple problems:

First, you need to put together multiple libraries to successfully parse CSV. Before this package was published, you had to pick one package for parsing to List (List String) and another to decode from that into something you actually cared about. Props to those authors for making their hard work available, of course, but this situation bugs me!

I don't want to have to pick different libraries for parsing and converting. I just want it to work like elm/json where I write a decoder, give the package a string, and handle a Result. This should not require so much thought!

The second thing, and the one that prompted me to publish this package, is that none of the libraries available at the time implemented andThen. Sure, you can use a Result to do whatever you like, but there's not a good way to make a decoding decision for one field dependent on another.
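To illustrate the kind of dependent decoding andThen enables, here is a sketch (the "unit" and "value" columns are invented for the example):

import Csv.Decode as Decode exposing (Decoder)


-- Decide how to read the "value" column based on the "unit" column.
valueInMeters : Decoder Float
valueInMeters =
    Decode.field "unit" Decode.string
        |> Decode.andThen
            (\unit ->
                case unit of
                    "m" ->
                        Decode.field "value" Decode.float

                    "cm" ->
                        Decode.map (\cm -> cm / 100) (Decode.field "value" Decode.float)

                    _ ->
                        Decode.fail ("Unknown unit: " ++ unit)
            )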

Contributing

Hello! I'm so glad that you're interested in contributing to elm-csv! Just so you know, I consider this library "done". Unless something big changes in either the CSV standard or Elm, major changes to this package are unlikely. If you want to make a case for new decoder functions (or whatever) being added, feel free to do so (in an issue, not a PR!), but be aware that the bar for new inclusions is fairly high.

That said, I'll be publishing upgrades to track with new versions of Elm, and bug fixes as needed. I always welcome help with those, and with documentation improvements!

Still here? Ok, let's get set up. This project uses Nix to manage tool versions (you just need a Nix installation, not NixOS, so this will work on macOS). Install that, then run nix-shell to get into a development environment.

Things I'd appreciate help with:

  • Testing the parser on many kinds of CSV and TSV data. If you find that the parser has incorrectly interpreted some data you have, please open an issue. It would be very helpful if you could include a sample of the input that's giving you problems, the versions of the software used to produce the sample, and the locale settings on your computer.

  • Feedback on speed. Please let me know if you find out that parsing/decoding has become a bottleneck in your application. Our parser is fairly quick (see benchmarking in the source) but we can always go faster.

  • Docs. Always docs. Forever docs.

Climate Action

I want my open-source work to support projects addressing the climate crisis (for example, projects in clean energy, public transit, reforestation, or sustainable agriculture). If you are working on such a project and find a bug or missing feature in any of my libraries, please let me know and I will treat your issue as high priority. I'd also be happy to support such projects in other ways; in particular, I've worked with Elm for a long time and would be happy to advise on your implementation.

License

elm-csv is licensed under the BSD 3-Clause license, located at LICENSE.

elm-csv's People

Contributors

brianhicks, gampleman, jfmengels, jpagex


elm-csv's Issues

Should it support files with an empty last line?

I do not know if this would be desirable, but at the moment, if the file ends with an empty line, there is an error like:

There was a problem on row -1: I looked for a column named `Population`, but couldn't find one.

SSCCE (Short Self-Contained Correct Example):

decodeCustom
    { fieldSeparator = '\t' }
    FieldNamesFromFirstRow
    (map2 Tuple.pair
        (field "Country" string)
        (field "Population" int)
    )
    "Country\tPopulation\nArgentina\t44361150\nBrazil\t212652000\n"
    |> Result.mapError errorToString

If I remove the last \n it works fine.

Is this feature desirable or should we make sure the files do not have an empty line at the end?
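A possible workaround in the meantime is to drop trailing newlines before decoding; a sketch (dropTrailingNewlines is made up for the example):

-- Remove trailing \r\n / \n characters without touching trailing
-- tabs or spaces, which may be meaningful empty fields in a TSV.
dropTrailingNewlines : String -> String
dropTrailingNewlines input =
    if String.endsWith "\n" input || String.endsWith "\r" input then
        dropTrailingNewlines (String.dropRight 1 input)

    else
        input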

Should a field containing only spaces be considered "blank"?

Using the same example/use case as in issue #11, I wonder whether blank should work for cells containing only spaces (i.e. applying trim before isEmpty)?

Here is an example:

[ "   Name, Age"
, "  Alice,  12"
, "    Bob,    "
, "Charlie,  24"
]
|> String.join "\r\n"
-- |> String.replace " " ""
|> decodeCsv FieldNamesFromFirstRow (field "Age" (blank int))

I would expect to get:

Ok [Just 12, Nothing, Just 24]

But instead, I get:

Err (DecodingErrors [{ column = Field "Age" (Just 1), problems = [ExpectedInt ("    ")], row = 2 }])

If I uncomment the line String.replace " " "", it works as expected.

As we discussed previously, this could lead to problems if we do blank string. Maybe someone would expect to get Just "   " instead of Nothing?

However, should it work for blank int and blank float?

I can surely add the following to make it work (note customBlank in place of blank in the last line):

customBlank : Decoder a -> Decoder (Maybe a)
customBlank decoder =
    andThen
        (\maybeBlank ->
            if String.isEmpty (String.trim maybeBlank) then
                succeed Nothing

            else
                map Just decoder
        )
        string


[ "   Name, Age"
, "  Alice,  12"
, "    Bob,    "
, "Charlie,  24"
]
|> String.join "\r\n"
|> decodeCsv FieldNamesFromFirstRow (field "Age" (customBlank int))

Sorry for this long message. I have seen that you consider the package "done", but I was wondering what you would expect in this situation. Thanks a lot for your work and the nice 3.0.1 upgrade!

Optional columns

In our application there are some columns the user submits that are required, and some that are optional.

At the moment I encode this like this:

optional : Decoder a -> Decoder (Maybe a -> b) -> Decoder b
optional childDecoder =
    pipeline (Csv.oneOf (Csv.map Just childDecoder) [ Csv.succeed Nothing ])


decoder =
    Csv.into Tuple.pair
        |> pipeline (Csv.field "requiredColumn" Csv.float)
        |> optional (Csv.field "optionalColumn" Csv.float)

This sort of works, but has some subtle behavioural gotchas:

requiredColumn  optionalColumn
1               2
2               four
3               6

would successfully decode into:

[ ( 1, Just 2 ), ( 2, Nothing ), ( 3, Just 6 ) ]

which is not what I want. In this case I'd like to flag the error to the user, but I'd still like to mark some columns as optional. I can't really see a way to achieve this with the current API. So I'd propose something like

optionalField : String -> Decoder a -> Decoder (Maybe a)

which would only go to the Nothing case iff the column was not present at all in the CSV.

Feature Request: Function to parse only runtime-specified fields

I'm looking for an easy way to parse just the fields specified at runtime. So, something like:

parseFields :
    { fieldSeparator : Char }
    -> List String
    -> String
    -> Result Problem (List (List String))


parseFields { fieldSeparator = ',' } [ "id", "name", "color" ] "id,name,color\n1,bike shed,green\n"

If the specified fields are not available, it should result in an error (maybe there should also be a flag for whether to allow additional fields in the CSV).

Maybe this is already possible with decodeCsv, but if so I'm not sure how, since into needs a function with a predefined arity.
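That said, the fixed-arity limitation can be sidestepped by folding map2 over the runtime field names; a sketch (fieldsDecoder is a made-up helper, not part of the package):

import Csv.Decode as Decode exposing (Decoder)


-- Decode the runtime-specified fields, in order, as strings.
-- Missing fields surface as normal decoding errors.
fieldsDecoder : List String -> Decoder (List String)
fieldsDecoder names =
    List.foldr
        (\name rest ->
            Decode.map2 (::) (Decode.field name Decode.string) rest
        )
        (Decode.succeed [])
        names


Decode.decodeCsv Decode.FieldNamesFromFirstRow
    (fieldsDecoder [ "id", "name", "color" ])
    "id,name,color\n1,bike shed,green"
--> Ok [ [ "1", "bike shed", "green" ] ]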

Decoder API to get available fields

headers : Decoder (List String)

or some such.

The main use case I have is that the format of CSV my application accepts is dynamic based on the state of the application.

(i.e there are columns called 1_fx, 1_fy, 2_fx, 2_fy, 3_fx, 3_fy, ...)

Hence I would like an easy way to validate that there are no extra columns beyond the template, since extra columns would indicate that the user provided a CSV that didn't match the state of the application.
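Hypothetically, with such a decoder, the validation could look like this (both expectOnly and headers are made up here, building on the proposal above):

expectOnly : List String -> Decoder a -> Decoder a
expectOnly allowed decoder =
    headers
        |> andThen
            (\names ->
                if List.all (\name -> List.member name allowed) names then
                    decoder

                else
                    fail "the CSV has columns that don't match the template"
            )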

Run elm-review in CI

Hi!

I debugged this issue in ci.sh a little:

# elm-review tries to download elm-json, and it fails in CI. We'll try again
# in the 20.05 release of Nix, where it's packaged natively.
# elm-review

It turns out elm-json downloads just fine! It’s executing elm-json that fails. This is the command that elm-review executes:

elm-json solve --extra elm/json stil4m/elm-syntax elm/project-metadata-utils MartinSStewart/elm-serialize -- review/elm.json

Running that on its own in your CI setup results in:

phase: retrieve
 Jan 22 18:15:41.748 WARN Failed to fetch versions from package.elm-lang.org

-- NO VALID PACKAGE VERSION ----------------------------------------------------

Because elm/json does not appear to exist and this project depends on elm/json
at any version, no valid set of package versions could be found.

Trying to curl package.elm-lang.org results in:

curl https://package.elm-lang.org/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (77) error setting certificate verify locations:  CAfile: /no-cert-file.crt CApath: none

I found this GitHub issue while googling, but I don’t understand if/what the solution is: NixOS/nixpkgs#13744

Hope this helps!

Wrong `AdditionalCharactersAfterClosingQuote` with quoted string inside cell

The following test would fail:

test "quoted single values 2" <|
    \_ ->
        "x"
            ++ config.rowSeparator
            ++ "1 \"a\" 2"
            |> parse { fieldSeparator = config.fieldSeparator }
            |> Expect.equal (Ok [ [ "x", "1 \"a\" 2" ] ])

It yields

    Err (AdditionalCharactersAfterClosingQuote 2)
    ╷
    │ Expect.equal
    ╵
    Ok [["x","1 \"a\" 2"]]

Unfortunately I'm not familiar enough with the parser code to be able to fix this myself.

`map2`, `map3` and friends don't preserve all error messages

For instance:

decoder =
    Decode.map3 (\a b c -> ( a, b, c ))
        (Decode.field "bar" Decode.int)
        (Decode.field "foo" Decode.int)
        (Decode.column 3 Decode.int)


contents =
    """bar
1
2
d
d
5
"""

Decode.decodeCsv Decode.FieldNamesFromFirstRow decoder contents

You would expect rows 1, 2, and 5 to each get 2 errors (foo not provided, column 3 not found) and rows 3 and 4 to get 3 errors, but instead you just get 1 error for each row.

This is because they are implemented as Result.map2, Result.map3, etc., which keep only the first error encountered rather than aggregating the errors into a list.
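Sketching the aggregation idea over plain Results (this is only an illustration, not the library's internal representation):

map2Errors : (a -> b -> c) -> Result (List e) a -> Result (List e) b -> Result (List e) c
map2Errors f resultA resultB =
    case ( resultA, resultB ) of
        ( Ok a, Ok b ) ->
            Ok (f a b)

        ( Err errsA, Err errsB ) ->
            -- keep both sides' errors instead of dropping one
            Err (errsA ++ errsB)

        ( Err errsA, Ok _ ) ->
            Err errsA

        ( Ok _, Err errsB ) ->
            Err errsB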

Better error messages for missing columns

When a column is missing, the function errorToString produces extremely repetitive output:

I saw 5 problems while decoding this CSV:

There was a problem on row 1, in the `foo` field: The `foo` field wasn't provided in the field names.

There was a problem on row 2, in the `foo` field: The `foo` field wasn't provided in the field names.

There was a problem on row 3, in the `foo` field: The `foo` field wasn't provided in the field names.

There was a problem on row 4, in the `foo` field: The `foo` field wasn't provided in the field names.

There was a problem on row 5, in the `foo` field: The `foo` field wasn't provided in the field names.

This can easily lead to thousands of lines of error output, even though the problem is really a single missing column in the header row.

See this Ellie for a minimal example.

Trim cell content

I would expect to be able to add "padding" to cells in order to align the CSV columns. Example:

decodeCsv
    FieldNamesFromFirstRow
    (map2 Tuple.pair
        (field "Name" string)
        (field "Age" int)
    )
<|
    String.join "\n"
        [ "   Name, Age"
        , "  Alice,  12"
        , "    Bob,  14"
        , " Victor,  18"
        ]

I would expect it to decode into:

[ ( "Alice", 12 )
, ( "Bob", 14 )
, ( "Victor", 18 )
]

Would it be desirable? For cells where trimming is not wanted, it would still be possible to wrap the cell content in quotes ("...").
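In the meantime, trimming can be layered on in a custom decoder; a sketch using only string, andThen, succeed, and fail (trimmedInt is made up for the example):

trimmedInt : Decoder Int
trimmedInt =
    string
        |> andThen
            (\raw ->
                case String.toInt (String.trim raw) of
                    Just n ->
                        succeed n

                    Nothing ->
                        fail "expected an int (after trimming)"
            )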

Encode.withoutFieldNames to Dict Int String

Instead of

withoutFieldNames : (a -> List String) -> Encoder a

I'm looking for

withoutFieldNames : (a -> Dict Int String) -> Encoder a

because I have the following structure:

toDict a =
    [ ( 8, a.firstname )
    , ( 9, a.lastname )
    , ( 11, a.organization )

    --
    , ( 12, a.address )
    , ( 15, a.city )
    , ( 57, a.stateCode )
    , ( 18, a.countryCode )
    , ( 17, a.postcode )

    --
    , ( 51, a.email )
    , ( 52, a.phone )

    --
    , ( 47, a.type_ )
    , ( 22, a.services )
    , ( 21, a.weight )

    --
    , ( 37, a.note )
    , ( 38, a.labelNote )
    ]
        |> Dict.fromList

We can recover the current List String behavior simply by composing with (List.indexedMap Tuple.pair) >> Dict.fromList.
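Going the other way, so the structure above can be fed to the existing API today, here is a sketch that fills unmentioned indices with empty strings (dictToRow is made up for the example):

dictToRow : Dict Int String -> List String
dictToRow dict =
    case List.maximum (Dict.keys dict) of
        Nothing ->
            []

        Just highest ->
            List.range 0 highest
                |> List.map (\i -> Dict.get i dict |> Maybe.withDefault "")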

Great library!

Decode CSV as `Dict String (List String)` and/or `List (Dict String String)`

Use case: I'd like to provide my users a way to "explore" a CSV with "headers". So from a CSV which looks like:

name,first name,age
sheep,seb,18
hicks,brian,21

I'd like having either data by column:

Dict.fromList  [("name", ["sheep", "hicks"]), ("first name", ["seb", "brian"]), ("age", ["18", "21"])]

or by row:

[ Dict.fromList  [("name", "sheep"), ("first name", "seb"), ("age", "18")]
, Dict.fromList  [("name", "hicks"), ("first name", "brian"), ("age", "21")]
]
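The by-row shape can be built on the exposed parser; a sketch (byRow is a made-up helper that zips the header row with each body row):

import Csv.Parser as Parser
import Dict exposing (Dict)


byRow : String -> Result Parser.Problem (List (Dict String String))
byRow input =
    Parser.parse { fieldSeparator = ',' } input
        |> Result.map
            (\rows ->
                case rows of
                    header :: body ->
                        List.map
                            (\row -> Dict.fromList (List.map2 Tuple.pair header row))
                            body

                    [] ->
                        []
            )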

Bug: Incorrect handling of quoted fields

> import Csv.Parser exposing (parse)

> parse { fieldSeparator = ',' } "Name,Species\nAtlas,cat\nAxel,dog\n"
Ok [["Name","Species"],["Atlas","cat"],["Axel","dog"]]
    : Result Csv.Parser.Problem (List (List String))

> parse { fieldSeparator = ',' } "\"Name\",\"Species\"\n\"Atlas\",\"cat\"\n\"Axel\",\"dog\"\n"
Ok [["Name","Species"],["Atlas","cat"],["Axel","dog",""]]
    : Result Csv.Parser.Problem (List (List String))

When the fields are quoted, it wrongly adds an additional empty field at the end.

Column-oriented decoding

At work we represent our data in a column-oriented format, i.e. something like:

type alias Data =
    { time : List Float
    , speed : List Float
    , altitude : Maybe (List Float)
    }

rather than the more conventional row-oriented structure:

type alias Data =
    List
        { time : Float
        , speed : Float
        , altitude : Maybe Float
        }

(note that the two formats encode some subtly different invariants; also, the column-oriented structure is more space efficient when altitude is missing)


Now obviously I can write a decoder for the row-oriented data, then write some function to transpose it into the column-oriented shape and call it a day.

But I was wondering: is there a way to decode column-oriented data directly from the CSV?
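For reference, the transpose approach mentioned above might look like this (a sketch; Row and toColumns are made up, mirroring the first Data record):

type alias Row =
    { time : Float
    , speed : Float
    , altitude : Maybe Float
    }


toColumns : List Row -> Data
toColumns rows =
    { time = List.map .time rows
    , speed = List.map .speed rows
    , altitude =
        -- keep the altitude column only when every row has a value
        if List.all (\row -> row.altitude /= Nothing) rows then
            Just (List.filterMap .altitude rows)

        else
            Nothing
    }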
