beatrichartz / csv

CSV Decoding and Encoding for Elixir

License: MIT License

Topics: csv, decoder, decoding, elixir, encoder, encoding, hex, parser, parsing, rfc-4180, stream


csv's Issues

Unexpected token error

I am getting an error when parsing this:

 Message
"prof·li·gate\ˈprä-fli-gət, -ˌgāt\
adjective
: carelessly and foolishly wasting money, materials, etc. : very wasteful
Full Definition
1 : wildly extravagant <profligate spending>
2 : completely given up to dissipation and licentiousness <leading a profligate life>
prof·li·gate·ly adverb
Origin: Latin profligatus, from past participle of profligare to strike down, from pro- forward, down + -fligare (akin to fligere to strike); akin to Greek phlibein to squeeze.
First use: 1617
Synonyms: extravagant, high-rolling, prodigal, spendthrift, squandering, thriftless, unthrifty, wasteful
Antonyms: conserving, economical, economizing, frugal, penny-pinching, scrimping, skimping, thrifty
Synonyms: fritterer, high roller, prodigal, spender, spendthrift, squanderer, waster, wastrel
Antonyms: economizer, penny-pincher
2
prof·li·gate\ˈprä-fli-gət, -ˌgāt\
noun
: a person given to wildly extravagant and usually grossly self-indulgent expenditure
Origin: (see 1profligate ).
First use: 1709
Synonyms: extravagant, high-rolling, prodigal, spendthrift, squandering, thriftless, unthrifty, wasteful
Antonyms: conserving, economical, economizing, frugal, penny-pinching, scrimping, skimping, thrifty
Synonyms: fritterer, high roller, prodigal, spender, spendthrift, squanderer, waster, wastrel
Antonyms: economizer, penny-pincher"
"46984364136: Beamish
adjective

and this is the error:

%SyntaxError{description: "unexpected token: \":\" (column 61, codepoint U+003A)",
 file: "nofile", line: 4}

and this is my code:

defmodule SpellingListParser do
  File.stream!("SpellingList.csv") |> CSV.decode :headers: true |> Enum.map fn row ->
    Enum.each(row, IO.puts)
  end
end

Running in production/release slow

Hi, I'm having a problem:

full_rows |> CSV.encode |> Enum.each(&IO.write(file, &1))
is really slow, and won't write all the data to the file.

It works very well when I'm running mix phoenix.server, but as a release it's not working well.

I have consolidate_protocols: true in mix.exs

Any idea what it might be?

Decoder uses number of schedulers at compile time, rather than runtime

https://github.com/beatrichartz/csv/blob/master/lib/csv/decoder.ex#L13

This tells the decoder to use the number of schedulers that were present at compile time, which may not match the number available at runtime. If I were to build a release on a small machine and deploy it on a large machine, it wouldn't use the resources as well as you intend.

It's not hugely important, but it struck me as incorrect. Hope this isn't a bothersome thing to bring up! :)
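
A minimal sketch of the compile-time-vs-runtime distinction (SchedulerInfo is a hypothetical module, not the library's code):

defmodule SchedulerInfo do
  # A module attribute is evaluated when the module is compiled, so this
  # captures the scheduler count of the build machine.
  @compile_time_schedulers :erlang.system_info(:schedulers)

  def at_compile_time, do: @compile_time_schedulers

  # A function body is evaluated on every call, so this reflects the
  # machine the release is actually running on.
  def at_runtime, do: :erlang.system_info(:schedulers)
end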

Ignore lines with errors

Is there a way I could ignore lines that have invalid encoding? Or at the very least, know what this invalid encoding is? All I currently get is:

** (CSV.Lexer.EncodingError) Invalid encoding on line 3642
             lib/csv/decoder.ex:161: CSV.Decoder.handle_error_for_result!/1
    (elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
    (elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
    (elixir) lib/stream.ex:732: Stream.do_list_transform/9
    (elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
    (elixir) lib/enum.ex:1477: Enum.reduce/3
    (elixir) lib/enum.ex:609: Enum.each/2

I'm working with a very large file, so it's pretty hard to pick out a line that might be encoded incorrectly.
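
A possible user-side workaround (an assumption on my part, not a library feature): filter out lines that are not valid UTF-8 before they reach the decoder. Note that this silently drops data and can break quoted fields that span multiple lines:

File.stream!("big.csv")
|> Stream.filter(&String.valid?/1)
|> CSV.decode
|> Enum.each(&IO.inspect/1)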

Ability to specify header values and ignore headers in the csv file

Thanks for the CSV library!

In order to use the Map parsed from each row with Ecto, I need to define the headers as atoms that exactly match my model:

File.stream!("/Users/wsmoak/Downloads/transactions.csv")
  |> CSV.decode(headers: [:date, :description, :original_description, :amount, :transaction_type, :category, :account_name, :labels, :notes])

When I do this, the first line of the file that contains the headers is picked up as a row of values, which I don't want.

Is there a way to skip that first row the way headers: false would, and define the values for the headers for the map?

Would having two different attributes work better? Maybe :headers could be true/false only, to say whether the csv file contains headers, and then a separate attribute could be used to optionally specify the header values to use?

(My workaround is to define a function that matches on one of the values I know will be in the Map for that first unwanted record, and ignore it: def store_it(%{:date => "Date"}) do ... end.)
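
One way to get the asked-for behavior with the current API might be to drop the file's own header row before decoding (a sketch, not an official recommendation):

File.stream!("/Users/wsmoak/Downloads/transactions.csv")
|> Stream.drop(1)   # skip the header row contained in the file
|> CSV.decode(headers: [:date, :description, :original_description, :amount,
                        :transaction_type, :category, :account_name, :labels, :notes])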

Performance Issues

I have a tab-delimited file that's ~2.6GB. I'm attempting to do the following in iex, but it never completes:

file = File.open!("output.csv", [:write])
File.stream!("input.csv")
  |> CSV.Decoder.decode(separator: ?\t)
  |> CSV.encode 
  |> Enum.each(&(IO.write(file, &1)))

I thought this was because of the number of rows, but even if I do Enum.take(700) it doesn't complete. If I only take 500, however, it completes almost immediately. Any idea what's going on, or what I could do to debug this? I'm using Elixir 1.3.0.

Question (ErlangError) erlang error: :no_translation

Hi, thanks for this project 👍. It has been a big help for beginners like me. I'm getting a strange error and have no clue how to fix it.

I only get this error if I try to encode the field "Hotel 10 Aparecida de Goiânia"; without it, encoding works just fine. Do you have any tip or direction to give me?

table_data |> CSV.encode |> Enum.each(&IO.write(file, &1))
** (ErlangError) erlang error: :no_translation
(stdlib) :io.put_chars(#PID<0.248.0>, :unicode, "27,SinnPDV Atualização Goiânia,6125,Hotel 10 Aparecida de Goiânia,Felipe Borges Ferreira,Bruno\r\n")
(elixir) lib/enum.ex:657: anonymous fn/3 in Enum.each/2
(elixir) lib/enum.ex:1637: anonymous fn/3 in Enum.reduce/3
(elixir) lib/enum.ex:2843: Enumerable.List.reduce/3
(elixir) lib/stream.ex:769: Stream.do_list_transform/9
(elixir) lib/enum.ex:1636: Enum.reduce/3
(elixir) lib/enum.ex:656: Enum.each/2
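
For what it's worth, :no_translation from :io.put_chars usually indicates the output device was opened without Unicode translation. A hedged sketch of a possible fix (out.csv is a placeholder path):

# Assumption: the file was opened with File.open!("out.csv", [:write]).
# Opening it in :utf8 mode lets :io translate Unicode data on write.
file = File.open!("out.csv", [:write, :utf8])
table_data |> CSV.encode |> Enum.each(&IO.write(file, &1))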

Q: Does `num_workers` process the file in parallel?

Just as the title says: if I want to process a file with the rows processed independently of each other (non-sequentially), does setting num_workers to, say, 6 process the file in 6 independent processes?


iex> "../test/fixtures/docs/valid.csv"
iex> |> Path.expand(__DIR__)
iex> |> File.stream!
iex> |> CSV.decode(num_workers: 6)

Docs

Hi, I downloaded, installed, and started using your library, but I had to give up when I couldn't work out how to pattern match on a row in order to do a database insert. I've solved it with another library, but thought it worth recommending you beef up the examples for beginners like me.
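
For instance, a short pattern-matching example might help (users.csv and its three columns are hypothetical):

# Each decoded row is a list, so it can be pattern matched directly in
# the anonymous function head.
File.stream!("users.csv")
|> CSV.decode
|> Enum.each(fn [name, email, age] ->
  IO.puts("inserting #{name} <#{email}>, age #{age}")
end)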

UTF-8 with BOM

Hi,

I have a CSV saved as UTF-8 with a BOM. The decoder keeps the BOM attached to the first header. Would this library try to strip out the BOM, or is it the user's responsibility to decide whether to have the BOM in the CSV?
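
A possible user-side workaround, assuming the library does not strip it (with_bom.csv is a placeholder name):

File.stream!("with_bom.csv")
|> Stream.with_index()
|> Stream.map(fn
  # The UTF-8 BOM only ever appears at the start of the first line.
  {line, 0} -> String.replace_prefix(line, "\uFEFF", "")
  {line, _} -> line
end)
|> CSV.decode(headers: true)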

Allow stray quotes in unescaped fields

Quote escaping is a recommendation in RFC 4180; however, CSV should allow stray quotes in fields that do not start with a quote.

That means this should be valid:

A,B"C,D

Whereas this should still not be valid:

A,"B"C",D

Make row length check optional

Hello,

thank you for the wonderful library; it's a pleasure to use, except for one pain point: the hard check on row length.
If there are no strong objections, making this check optional would be a wonderful addition.

CSV.decode(headers: true) breaks when there is more than one column with a matching header

This is a subtle but important implementation problem when using a straight-up map as your target data structure: map keys cannot be reused.

Assume you have a CSV like this:

id,name,address,address,city,province
1,joe,123 city st,,moosejaw,SK

This will result in a map %{"id" => 1, "name" => "joe", "address" => "", "city" => "moosejaw", "province" => "SK"}. While this seems reasonable, this is completely incorrect and results in data loss. The Ruby CSV library handles this correctly, in that it returns a data structure that acts like a hash but is not quite a hash so that you can ask for row['address'] # => "123 city st" and row['address', 1] # => nil (the first can be row['address', 0], too).

I suspect that to fix this, you will need to return a specialized form that can be handled properly. The interface would be something like CSV.Row.get(row, header, index \\ 0) where is_binary(header), and CSV.Row.get(row, index) (because row[3] always returns the 4th column).
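
A minimal sketch of why a plain map loses data here (illustration only, not the library's implementation):

headers = ["id", "name", "address", "address", "city", "province"]
values  = ["1", "joe", "123 city st", "", "moosejaw", "SK"]

# Building a map from duplicate keys keeps only the last value per key,
# so the first "address" is silently dropped.
Enum.zip(headers, values) |> Map.new()
#=> %{"address" => "", "city" => "moosejaw", "id" => "1", "name" => "joe", "province" => "SK"}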

`CSV.encode` seems extremely slow

defmodule CSVTest do
  use ExUnit.Case, async: true

  for rows <- [1, 10, 100, 500] do
    @rows rows
    test "generating #{rows} rows" do
      data = List.duplicate([1, 2, 3, 4, 5], @rows)
      data |> CSV.encode |> Stream.run
    end
  end
end
➜ mix test test/csv_test.exs --trace

CSVTest
  * generating 100 rows (1858.3ms)
  * generating 1 rows (16.4ms)
  * generating 10 rows (173.0ms)
  * generating 500 rows (8899.1ms)


Finished in 10.9 seconds (0.05s on load, 10.9s on tests)
4 tests, 0 failures

It appears to be linear in the number of rows (as you'd expect) but with a huge constant factor -- about 17ms per row.

Any idea why the encoder is so slow?

Add an option to disable the row_length check

I'm trying to process an invalid CSV with 10 headers in the first row, followed by rows with 10 or fewer values.

decode (rightfully) fails with Row has length 8 - expected length 10 on line X

I could probably try to pre-process the rows and add the missing columns (append the right number of commas to each line), but that would require handling all the escaping cases you already handle in your code in order to count the columns correctly.

Probably easier (for me) would be to have an option to disable this check, maybe something like:

row_length: false, and maybe even row_length: 10 to enforce a certain row length.

Would you accept such a merge request?

Headers as atoms

Hello,

Is there any way to read the headers from the first line, but transform them into atoms before converting rows to maps?

What I want is to have atoms as keys in maps, not strings.
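
A possible workaround sketch (data.csv is a placeholder; note that String.to_atom/1 on untrusted input can exhaust the atom table, so a fixed whitelist of known headers is safer):

File.stream!("data.csv")
|> CSV.decode(headers: true)
|> Enum.map(fn row ->
  # Rebuild each row map with atom keys instead of string keys.
  Map.new(row, fn {k, v} -> {String.to_atom(k), v} end)
end)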

Thank you!

Crashes on unrecognized letters?

Here are two examples of individual lines that led the parser to barf and stop:

frações,1,M,12-Sep,,,,,,,,,,,,,,,,,,,
matemática financeira,1,M,12-Sep,,,,,,,,,,,,,,,,,,,

This is in a file generated by Excel (exported as XLS). Not sure what encoding it was in; saving it in Vim as UTF-8 resolved the issue. However, it seems like the library could either accept the file, even if some characters are messed up, or at least give a useful error message ("Invalid (non-UTF-8) character found on line x", etc.).

Error message:

17:41:09.397 [error] Error in process <0.112.0> with exit value: {function_clause,[{'Elixir.CSV.Lexer',lex,[411,<<37 bytes>>,<0.120.0>,{content,<<5 bytes>>},44],[{file,"lib/csv/lexer.ex"},{line,35}]},{'Elixir.CSV.Lexer',lex_into,2,[{file,"lib/csv/lexer.ex"},{line,24}]}]}

Add context and field number to syntax errors

CSV should add context to syntax errors in order to make the error messages more useful. That means an error like this:

Invalid escape sequence on line 418

should become:

Invalid escape sequence on line 418, field 4 near "B, \"C"

When there are headers, the field could be named:

Invalid escape sequence on line 418, field "ABC" near "B, \"C"

First row skipped when parsing StringIO

I'm trying to parse a CSV string, using StringIO.open/1 and IO.binstream/2 to convert the string to a stream first. When the :headers option is false, the decoder skips the first row of the CSV. When :headers is true, the CSV is decoded as expected.

Add this test to test/csv_test.exs:

  test "decodes from StringIO stream" do
    {:ok, out} =
      "a,b,c\nd,e,f"
      |> StringIO.open

    stream = out |> IO.binstream(:line)

    assert stream |> CSV.decode! |> Enum.map(&(&1)) == [~w(a b c), ~w(d e f)]
  end

Fails with:

  1) test decodes from StringIO stream (CSVTest)
     test/csv_test.exs:30
     Assertion with == failed
     code: stream |> CSV.decode!() |> Enum.map(&(&1)) == [~w"a b c", ~w"d e f"]
     lhs:  [["d", "e", "f"]]
     rhs:  [["a", "b", "c"], ["d", "e", "f"]]
     stacktrace:
       test/csv_test.exs:37

I spent quite a bit of time digging into the code and debugging, but I'm out of time. The thing that looks suspicious is the Enum.take(1) call in CSV.Decoder.get_first_row/2. It would seem to me that you can't pluck an item out of a stream like that without disturbing the stream.

But then again, if the stream comes from a file, it's not a problem. For files, maybe the stream is opened (?) twice? For example with a file stream, you can inspect the stream, then send it to the decoder without creating a new stream:

IO.puts "stream: #{stream |> Enum.map(&(&1)) |> inspect}"
assert stream |> CSV.decode! |> Enum.map(&(&1)) == [~w(a b c), ~w(d e f)]

But with a stream created from StringIO.open and IO.binstream, you can't inspect the stream before sending it to the decoder. The decoder outputs an empty stream.

I don't know if this is a bug in CSV, or a bug in StringIO or IO.binstream, or if I'm just doing it wrong. How does one reliably decode a CSV string? I like to do this in tests a lot. Fortunately, I've been using headers: true, but sometimes I'd like to have headers: false.
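
For what it's worth, one workaround sketch for decoding an in-memory string is to split it into lines yourself, so the decoder gets a plain, restartable enumerable instead of a one-shot IO stream:

# Assumes the decoder accepts any enumerable of lines, not just file streams.
"a,b,c\nd,e,f"
|> String.split("\n")
|> CSV.decode!
|> Enum.to_list
#=> [["a", "b", "c"], ["d", "e", "f"]]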

Just FYI - hexdocs.pm/csv is broken

Using hexdocs to search for csv leads us to an overview page which may be missing or something.

It seems minor; I can get to your docs through direct links to functions like decode.

CSV encode double-escapes escape codes like CR

When encoding a line which contains a field with an embedded escape code, CSV escapes the backslash, so the output contains a literal \r\n instead of a CRLF. This is an issue for double-quoted fields, which are allowed to have CRLFs embedded. See below:

iex(9)> IO.puts "foo\r\nbar"
foo
bar
:ok
iex(10)> [["foo\r\nbar", "foo", "bar"]] |> CSV.encode |> Enum.to_list
["\"foo\\r\\nbar\",foo,bar\r\n"]

I can work around it by removing the escaped backslashes with a regex, but unless I'm missing something, this is incorrect behavior.

Takes way too long for large files

Maybe I'm using the library wrong, but I noticed something strange. I have a large CSV file (~500MB). With streams it is very easy to output this file line by line with a very low memory footprint:

File.stream!("large.csv") |> Stream.each(&IO.inspect/1) |> Stream.run    

So I assumed it would be very easy to plug the CSV decoder in between:

File.stream!("large.csv") |> CSV.decode |> Stream.each(&IO.inspect/1) |> Stream.run    

I assumed this would print out the Elixir data structures that csv emits, but it turns out there is some kind of eager behavior in between: when I run this, there is no output at first, and only after a while is the output shown (and much more slowly). Memory usage also goes through the roof running this code.

Is there something I'm missing?

Thinking about it, wouldn't it be best to have the possibility of providing some kind of callback to the decode method to run side effects?

File.stream!("large.csv") |> CSV.decode(&IO.inspect/1)

Adding an option to set up the quote_char

Hi! I'm currently porting ETL Ruby code to Elixir and I have a problem with some files.

I'm on the master branch and I use | as the separator. I get a CSV.EscapeSequenceError exception when there is only one " in a line, which can happen because this character is not intended to be used for escaping in my case. Here is a line that triggers the exception:
TEST|RES "LES PRES LE ROY||ZZ

I was expecting, on the master branch, to receive a tuple for this error, {:error, EscapeSequenceError...}; can I do something to get that instead of the exception?
The error message in the exception was wrong about the line number of the problem. I'll look into why, but there may be a bug here.

In the Ruby csv library I can set a different character for escaping (the quote_char option), which solved my problem when I used a character I'm sure is not in the file: https://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html#method-c-new

Do you think that kind of option could be introduced in the library? I can try to submit a PR for it if that's OK.

if multiple options are specified, then strip_cells is ignored

When specifying both headers and strip_cells, strip_cells was ignored. I changed it to be consistent with the way Lex works, by checking whether the option is set.

File.stream!("data/user_1/data.csv") |> CSV.decode(strip_cells: true, headers: true) |> Enum.each(fn(row) -> IO.inspect(row) end)

Encoder becomes corrupted when it encounters an error

Playing around with your library in iex yielded some pretty odd results:

iex(1)> ["a"] |> CSV.encode |> CSV.decode |> Enum.to_list
** (Protocol.UndefinedError) protocol Enumerable not implemented for "a"
    (elixir) lib/enum.ex:1: Enumerable.impl_for!/1
    (elixir) lib/enum.ex:112: Enumerable.reduce/3
    (elixir) lib/enum.ex:981: Enum.map/2
             lib/csv/encoder.ex:47: CSV.Encoder.encode_row/3
             lib/csv/encoder.ex:42: anonymous fn/4 in CSV.Encoder.encode/2
    (elixir) lib/stream.ex:650: Stream.do_transform/7
    (elixir) lib/enum.ex:1740: Enum.take/2
             lib/csv/decoder.ex:120: CSV.Decoder.produce/2
iex(1)> [["b"]] |> CSV.encode |> CSV.decode |> Enum.to_list
[]
iex(2)> [["b"]] |> CSV.encode |> CSV.decode |> Enum.to_list
[["b"], ["b"]]
iex(3)> [["b"]] |> CSV.encode |> CSV.decode |> Enum.to_list
[]
iex(4)> [["b"]] |> CSV.encode |> CSV.decode |> Enum.to_list
[["b"]]

Notice that [["b"]] |> CSV.encode |> CSV.decode |> Enum.to_list resulted in [] the first time, then [["b"], ["b"]] and [], and then finally, on the 4th try, [["b"]], as expected. The mistake made initially causes the encoder to become corrupted apparently. If exit the iex session and re-do it without the bad line the problem does not occur.

Accept string input

Does this library accept a simple string input? Sometimes the entire csv string already exists in memory and it would be nice to simply pass it into this library.

(CSV.Parser.SyntaxError) Unterminated escape sequence

Hi,

I've recently exported data to CSV from Postgres, so I assume there is nothing wrong with the data from an RFC perspective.

I am using the latest csv release, 1.4.2, and I am facing the error mentioned in the subject during parsing.

A sample which gives me the error (with CSV header) is attached:

sample.csv.zip

id,question_id,user_id,year,locked,text,locale,inserted_at,updated_at,image_file_name,image_content_type,image_file_size,image_updated_at,comment_count
170,144,8,2015,f,"ООО...Неее... это не ко мне((
Но мое любимое вот это:
""Тихо-тихо ползи,
Улитка, по склону Фудзи,
Вверх, до самых высот.""",ru,2015-03-05 17:12:55,2015-03-05 17:12:55,,,,,0

P.S. I don't think the Russian text makes any difference, right? :)

Allow line breaks in fields

According to RFC 4180, page 2, fields that are enclosed in double quotes can have newlines in them. When trying to parse such a row:

first,row,here
one,two,"three
and newline"

The parser throws a syntax error:

(CSV.Parser.SyntaxError) Unterminated escape sequence. on line 1

Notice that the reported line number is incorrect as well, which is confusing.

I'll take a look to see if I can fix it, but I may need some direction :)

does not work with mac os legacy line feed CR

With a CSV file whose line separator is CR (legacy Mac OS), the module does not work: it cannot decode the file, treats everything as one line, and returns only a blank list when headers: true. As soon as the CSV file is rewritten with LF line endings, everything works as expected. This is with the latest stable Elixir on a Mac.

You get the legacy Mac OS CR line separator as soon as you create the CSV from Excel on a Mac, for instance.

Thanks.

New release?

Hi there,

thank you for this nice library!

It hasn't seen a release in quite some time and some of the new features on master are pretty nice and we'd like to use them, so a new release would be highly appreciated :)

Thanks!
Tobi

missing entry if parsing IO.stream

It turns out that a different result is obtained if IO.stream is used instead of File.stream!, e.g.:

iex(1)> "sample.csv" |> File.stream! |> CSV.Decoder.decode(headers: true)|>Enum.to_list
[%{"login" => "A", "name" => "Mr A", "shell" => "/bin/bash"},
 %{"login" => "admin", "name" => " Administrator", "shell" => " /bin/bash"},
 %{"login" => "root", "name" => " Root", "shell" => " /bin/sh"}]
iex(2)> f = File.open!("sample.csv")
#PID<0.176.0>
iex(3)> IO.stream(f, :line) |> CSV.Decoder.decode(headers: true) |> Enum.to_list 
[%{"login" => "admin", "name" => " Administrator", "shell" => " /bin/bash"}, 
 %{"login" => "root", "name" => " Root", "shell" => " /bin/sh"}]

content of "sample.csv":

login,name,shell
A,Mr A,/bin/bash
admin, Administrator, /bin/bash
root, Root, /bin/sh

Could you let me know if you have any idea what the problem is? Thanks.

Module version: 1.4.0

Calling `decode` with `num_pipes: 1` multiple times on the same stream yields different orderings each time

I'm noticing some weird stuff happening with the decoder when I call it multiple times in an IEx session.

Given a file my_file.csv with ~200 lines, doing

File.stream!("my_file.csv") |> CSV.Decoder.decode(num_pipes: 1) |> Enum.take(2)

yields the first 2 lines of the file the first time I run it, but subsequent invocations yield seemingly random / out-of-order lines.

I also get strange behavior when working with a simple stream (although I'm not able to reproduce the ordering issue):

iex(1)> stream = ~w(1,2,3,4 5,6,7,8 9,10,11,12 13,14,15,16) |> Stream.map(&(&1))
#Stream<[enum: ["1,2,3,4", "5,6,7,8"],
 funs: [#Function<45.113986093/1 in Stream.map/2>]]>
iex(2)> stream |> Enum.take(2)
["1,2,3,4", "5,6,7,8"]
iex(3)> stream |> Enum.take(2)
["1,2,3,4", "5,6,7,8"]
iex(4)> CSV.Decoder.decode(stream, num_pipes: 1) |> Enum.take(2)
#Function<25.113986093/2 in Stream.resource/3>
iex(5)> CSV.Decoder.decode(stream, num_pipes: 1) |> Enum.take(2)
[["1", "2", "3", "4"], ["5", "6", "7", "8"]]
iex(6)> CSV.Decoder.decode(stream, num_pipes: 1) |> Enum.take(2)
[]
iex(7)> CSV.Decoder.decode(stream, num_pipes: 1) |> Enum.take(2)
[["1", "2", "3", "4"], ["5", "6", "7", "8"]]
iex(8)> CSV.Decoder.decode(stream, num_pipes: 1) |> Enum.take(2)
[]

Decode is slow when using (headers: true)

When asking the CSV parser to add headers, it seems to take a while to start. Once going, it is generally fast. See the attached gif for a demo; the only difference in the code between the two panes is (headers: true).
[attached gif: slowcsv]

ParallelStream is not available

Hi!

I recently created a release with edeliver and I'm seeing this:

** (exit) an exception was raised:
    ** (UndefinedFunctionError) function ParallelStream.map/3 is undefined (module ParallelStream is not available)
        ParallelStream.map(#Function<57.89908360/2 in Stream.transform/3>, #Function<1.67996581/1 in CSV.Decoder.decode/2>, [num_workers: 3])

A naive question, but would requiring ParallelStream in the Decoder module fix the runtime error I am seeing?

Consider \r as a valid delimiter by default?

Files coming from Windows users will just contain \r as a newline delimiter, and I think it's good if the default decode method supports this out of the box. Otherwise there's no good/fast way to parse a CSV like that: we'd need to read each line, trim it, and then decode it. Using \r as a delimiter doesn't work because if \r\n appears in the file then \n will be part of the next line. \n could work, but then we have to trim each column's value.
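
A hedged preprocessing sketch (legacy.csv is a placeholder; this reads the whole file into memory, so it only suits files that fit in RAM):

File.read!("legacy.csv")
|> String.splitter(["\r\n", "\r", "\n"])   # longest pattern wins, so \r\n is one separator
|> Stream.reject(&(&1 == ""))              # drop the empty trailing chunk, if any
|> CSV.decode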

What do you think?

Support for csv in strings?

I'm getting an error when trying to use csv on strings rather than reading from a file:

iex(23)> string = "a,1
...(23)> b,4"
"a,1\nb,4"
iex(24)> string |> CSV.decode  |> Enum.map fn r -> Enum.map(r, &String.upcase/1) end
** (Protocol.UndefinedError) protocol Enumerable not implemented for "a,1\nb,4"
    (elixir) lib/enum.ex:1: Enumerable.impl_for!/1
    (elixir) lib/enum.ex:112: Enumerable.reduce/3
    (elixir) lib/stream.ex:1240: Enumerable.Stream.do_each/4
    (elixir) lib/stream.ex:700: Stream.do_transform/8
    (elixir) lib/enum.ex:1400: Enum.reduce/3
    (elixir) lib/enum.ex:1047: Enum.map/2
iex(24)> File.stream!("/Users/username/test_data.csv") |> CSV.decode |>
...(24)> Enum.map fn r -> Enum.map(r, &String.upcase/1) end
[["A", "1"], ["B", "4"]]
iex(26)>

The contents of test_data.csv are the same as the string used. Is there anything wrong I'm doing, or are strings not supported?

P.S.: I'm fairly new to the functional programming/Elixir world.

Make decode return errors in a tuple and decode! raise errors

The decoder interface could be separated into a function that raises errors and another that returns them in a tuple with the result:

myfile
|> File.stream!
|> CSV.decode! # potentially raises errors

myfile
|> File.stream!
|> CSV.decode # returns rows as { :ok, row } and errors as { :error, "Error message" }

Please tag v1.0.1

I'm adding a FreeBSD port, and it would be great to use the official tag :)

Too many processes

I'm processing a large number of CSV files, around 105GB in total. Thirty minutes after starting, it reaches "Too many processes", even if I configure ELIXIR_ERL_OPTIONS="+P 134217727".

The only way I can fix it is by changing num_workers to 1. I guess it is not cleaning up finished processes.

Add configurable maximum number of lines aggregated by line aggregator

Currently, when there is an escaped field that does not have a proper ending, the line aggregator will continue to aggregate lines indefinitely. This is problematic when working with large files that have syntactic errors.

CSV should aggregate up to n lines and then throw an error, but make this number configurable so that users can still decode files with large fields.

Explicit Ordering of Headers

Provide the ability for the client to decide the ordering of the headers.

Current Behavior: When headers are provided, the final output orders the headers in alphabetical order.

Desired Behavior: When headers are provided as a list in the options to CSV.Encoding.Encoder.encode/2, the headers in the final output should be ordered as in the list.
