tux / text-csv_xs

perl5 module for composition and decomposition of comma-separated values

Perl 80.53% XS 19.24% Shell 0.23% Vim Script 0.01%

text-csv_xs's People

Contributors

1nickt, bulk88, charsbar, choroba, dhgutteridge, manwar, soonix, sschoeling, tonycoz, tux, x12340, xsawyerx

text-csv_xs's Issues

Wishlist: ability to throw exception if formula seen in CSV

You may have seen this blog post, about potential security problems with CSV files that have formulae in cells:

http://georgemauer.net/2017/10/07/csv-injection.html

In a perfect world, all of my existing code that uses Text::CSV_XS would start throwing exceptions if it got CSVs with formulae in them. If users have to turn on some "don't allow formulae in cells" feature, then it's not going to help most people, because most people won't know about (a) the potential problem, or (b) the module's support for protecting you.

I'm guessing that for backwards compatibility reasons you might not want to add this as a feature that's enabled by default, but I think you should at least consider it.

That said, following on from email, I might want to write:

use Text::CSV_XS qw/ csv /;
csv(in => $fh, headers => 'auto', formula => 'croak');

The formula parameter could be croak, allow, diag, empty (blank out all such cells), or undef (return cell as undef).

Personally I'd make the default be croak, but I realise you don't want code to suddenly start breaking, so maybe diag could be the default?
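The five proposed policies can be sketched in plain Perl, independent of the module. This is only an illustration: apply_formula_policy is a hypothetical helper, not part of Text::CSV_XS, and it assumes that a field counts as a formula when it starts with "=".

```perl
use strict;
use warnings;
use Carp qw(croak carp);

# Hypothetical helper, not part of Text::CSV_XS: apply one of the
# proposed policies (croak, allow, diag, empty, undef) to one field.
sub apply_formula_policy {
    my ($field, $policy) = @_;

    # Assumption: a field is a formula when it starts with "=".
    return $field unless defined $field && $field =~ /^=/;

    return $field if $policy eq "allow";
    croak "Formula found in CSV field: $field" if $policy eq "croak";
    if ($policy eq "diag") {
        carp "Formula found in CSV field: $field";   # warn, keep the value
        return $field;
    }
    return ""    if $policy eq "empty";   # blank out the cell
    return undef if $policy eq "undef";   # return the cell as undef
    croak "Unknown formula policy: $policy";
}

# Usage: blank out formula cells in one row
my @row = map { apply_formula_policy ($_, "empty") }
    ("id", "=HYPERLINK(\"http://example.com\")", "42");
```

A check this small is cheap to run on every field, which is why a non-silent default such as diag seems feasible.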

Add an option to the csv function to neither return nor print anything

Let's say I just want to populate a structure in after_parse or on_in, but I don't want to get the whole structure back, and I don't want to print it. E.g. I might want to accumulate information by id:

csv (in      => shift,
     headers => "skip",
     on_in   => sub {
         my ($id, $email) = @{ $_[1] };
         push @{ $by_id{$id}{email} }, $email;
         },
    );

I'm only interested in %by_id. Something like out => undef or similar.

skip empty or blank

If a field is empty or blank, it would be nice if it could be skipped completely instead of being made undef. So how about skip_empty and skip_blank?

(I was going to try to see if I could write this based on blank_is_undef and empty_is_undef, but I could not figure out how those interact with the method or function.)
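As a rough picture of the requested behaviour, the two filters could be applied to an already parsed row like this. The function names mirror the proposed options but are hypothetical, and the sketch assumes that "empty" means undef or zero length and "blank" means undef or whitespace only.

```perl
use strict;
use warnings;

# Hypothetical post-processing filters, not module options.
# Assumption: "empty" = undef or zero-length string.
sub skip_empty { return grep { defined ($_) && length ($_) } @_ }

# Assumption: "blank" = undef or only whitespace (includes empty).
sub skip_blank { return grep { defined ($_) && /\S/ } @_ }

my @fields = ("a", "", undef, "  ", "b");
my @no_empty = skip_empty (@fields);   # keeps "a", "  ", "b"
my @no_blank = skip_blank (@fields);   # keeps "a", "b"
```

Note that dropping fields shifts the positions of the remaining ones, so this only makes sense when column position carries no meaning.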

csv2xls does not like (nearly) empty files

When run on an empty file or one consisting only of whitespace, csv2xls throws some errors:

Use of uninitialized value $sep in string eq at csv2xls line 132.
Use of uninitialized value $sep in string ne at .../perl-5.16.2/lib/perl5/site_perl/5.16.2/x86_64-linux/Text/CSV_XS.pm line 135.
# CSV_XS ERROR: 1008 - INI - SEP undefined @ rec 0 pos 0
Can't call method "getline" on an undefined value at /grasp_mob_c/local/bin/csv2xls line 163.

It would be nice if it would just output an empty file, or one with a cell containing the whitespace.

csv in => $aoa, out => … produces a CSV file with two empty records when $aoa is empty

fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [[1,2],[3,4]], out => "test.csv"'; xxd test.csv
00000000: 312c 320d 0a33 2c34 0d0a                 1,2..3,4..
fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [[1,2]], out => "test.csv"'; xxd test.csv
00000000: 312c 320d 0a                             1,2..
fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [], out => "test.csv"'; xxd test.csv
-rw------- 1 fany users 4 11. Mai 04:07 test.csv
00000000: 0d0a 0d0a                                ....

So two records in $aoa give two lines of CSV, one gives one line, but none give two empty lines.
I think it should just output an empty file in this case.

getter quote_char() and escape_char() returning 0 instead of undef

I have created a csv object with a no-arg constructor call. I have set the quote character and the escape character to the undef value

my $csv = Text::CSV_XS->new;
$csv->quote_char (undef);
$csv->escape_char(undef);

Instead of undef the quote_char and escape_char getters return 0. That means I am not able to use the getters to pass this information to a DBI:CSV connect() call?!

$dbh = DBI->connect ("DBI:CSV:", undef, undef, {
  ...
  csv_quote_char  => $csv->quote_char,
  csv_escape_char => $csv->escape_char,
  ...

Add support for configurable NULL encoding

Current NULL encoding options are limited. They work for some cases, where upstream can handle what we produce. In other cases, e.g. MySQL 'load data infile', the consumer is unable to correctly identify NULLs using our encoding method (e.g. ,, ). The docs here:
http://search.cpan.org/~hmbrand/Text-CSV_XS-1.35/CSV_XS.pm#csv
... suggest you can produce output that databases can parse by doing:

while (my $row = $sth->fetch) {
  $csv->print ($fh, [ map { $_ // "\\N" } @$row ]);
  }

... but this is absolutely not the case. Given the data:

[ "blah", undef, 3 ]

... the required output for importing into MySQL or other DBs would be:

"blah",\N,3

... but the above hack instead gives us:

"blah","\\N",3

There are 2 problems with this:

  1. Text::CSV_XS is escaping the \N, giving us \\N. DBs won't parse this correctly.
  2. Text::CSV_XS is quoting the \\N. DBs won't parse this correctly either.

What we really need is a way to pass in any string sequence that can be used to encode a NULL value. Additionally, this string sequence should not be quoted.
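To make the requested output concrete, here is a hand-rolled formatter in plain Perl that produces it. It is a sketch, not a proposal for the module's implementation: format_row is a hypothetical name, and it assumes that numbers stay unquoted, all other defined values are quoted with embedded quotes doubled, and the NULL marker is emitted verbatim.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical formatter: undef becomes the caller-supplied NULL marker,
# written without quoting or escaping.
sub format_row {
    my ($row, $null) = @_;
    $null //= "\\N";                      # default marker: backslash + N
    return join ",", map {
        my $f = $_;
        !defined $f             ? $null   # NULL marker, never quoted
      : looks_like_number ($f)  ? $f      # assumption: numbers stay bare
      : do {
            (my $q = $f) =~ s/"/""/g;     # double embedded quotes
            qq{"$q"};
        };
    } @$row;
}

my $line = format_row ([ "blah", undef, 3 ]);   # "blah",\N,3
```

Passing the marker as a parameter is what the request boils down to: the caller chooses the string, and the writer promises not to touch it.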

Can't parse fields with nested quotes and commas

I have the following line in my CSV file:
nj23h32n,"By using "pseudoenergies", we were able to design multiple peptide sequences that showed low micromolar viral entry inhibitory activity.",2010-06-22,

The 2nd field is the "abstract" field which could contain double quotes and commas as in the example above. I use the following one-liner to parse the file:
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>"test.csv",sep=>",",allow_loose_quotes=>1,allow_loose_escapes=>1))'

But I get:
nj23h32n,"By using ""pseudoenergies"," we were able to design multiple peptide sequences that showed low micromolar viral entry inhibitory activity.""",2010-06-22,

Any idea how to correctly parse this example?

Thanks in advance,
Andrej

Processing fails on loose quotes with alternate sep_char

v1.23, perl 5.22.1

Test case as follows, note the third line of the __DATA__ section with badly balanced quotes.

#!perl

use strict;
use warnings;

use Test::More;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new(
    {
        binary             => 1,
        sep_char           => '|',
        allow_loose_quotes => 1,
    }
);

my $fh = \*DATA;
$csv->column_names( @{ $csv->getline( $fh ) } );

my $nbr_lines = 0;
$nbr_lines++ while ( my $row = $csv->getline( $fh ) );

is( $nbr_lines,3,'processed expected number of lines' );

done_testing();

__DATA__
first|second|third|fourth|fifth|sixth|seventh|eigth|ninth|tenth|eleventh|twelth|thriteenth|fourteenth|fifteenth|sixteenth|seventeenth|eighteenth|nineteenth
1|||||||156999||12 Valley||D||N|3610|||68 V D|EA MATCH
2|||||||195658|"""The Cottage"" 54"|"""The "|K|||||307652|R, M|"""The ", K, |EA MATCH
3|||||||216058|117 The K|||||||||117 The K, |EA MATCH

If you use Text::CSV_PP to process the above file then the test passes, so there is a difference in behaviour. Possibly unexpected for those using Text::CSV as they will get different behaviour on different machines if they haven't watched their installation/dependency tree.

I don't know if the behaviour of Text::CSV_XS is supposed to map to Text::CSV_PP exactly, hence this issue.

FR: Reading multiple CSVs from a single file

re: makamaka/Text-CSV#62

Occasionally I have to process files that contain multiple CSVs. Each of these CSVs is stored in the file as a heading, followed by the data lines, followed by an empty line (or eof).

It would be nice if Text::CSV had an option that basically says: stop reading after an empty line. This would make it possible to write something similar to:

csv (in => $fh, out => @aoh1, stop_at_empty => 1);
csv (in => $fh, out => @aoh2, stop_at_empty => 1);
csv (in => $fh, out => @aoh3, stop_at_empty => 1);
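Until such an option exists, one possible workaround is to cut the file into blocks first and parse each block separately. The sketch below only does the cutting: read_block is a hypothetical helper, and it assumes that no quoted field contains an embedded empty line. Handing each block to csv () as a scalar reference is likewise an assumption about how one would continue from here.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $fh up to (and consuming) the
# next empty line or EOF, and return them as one string.
# Assumption: no quoted field spans an empty line.
sub read_block {
    my ($fh) = @_;
    my @lines;
    while (defined (my $line = <$fh>)) {
        last if $line =~ /^\r?\n?\z/;   # an empty line ends the block
        push @lines, $line;
    }
    return join "", @lines;
}

# Usage with an in-memory file holding two CSVs
my $data = "a,b\n1,2\n\nc,d\n3,4\n";
open my $fh, "<", \$data or die "Cannot open in-memory file: $!";
my @blocks;
push @blocks, read_block ($fh) until eof $fh;
close $fh;
```

Each element of @blocks then holds one heading line plus its data lines.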

Multiple spaces as a separator?

Is this possible (e.g. to parse the output of the ps shell command)? With sep_char => ' ', allow_whitespace => 1, every run of spaces is parsed as multiple empty columns.
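For whitespace-aligned output such as that of ps, Perl's built-in split with the special pattern ' ' may be a simpler fit than a CSV parser: it treats any run of whitespace as one separator and ignores leading whitespace. A minimal sketch, where split_columns is a hypothetical helper and the sample line is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on runs of whitespace into at most
# $max fields, so the last field can keep its embedded spaces.
sub split_columns {
    my ($line, $max) = @_;
    chomp $line;
    return split ' ', $line, $max // 0;   # 0 = no limit
}

# Made-up sample in the shape of a ps line: PID TTY TIME CMD
my @f = split_columns ("  123 pts/0    00:00:01 perl -e 1\n", 4);
```

The limit matters because the command column can itself contain spaces.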

Typo in sub csv() causes sep chars set to not work

A minor typo causes header() not to use the set of sep chars defined.

Line 1363 sets the key 'set_set'

defined $c->{'hd_s'} and $harg{'set_set'} = $c->{'hd_s'};

But it should be 'sep_set' :

defined $c->{'hd_s'} and $harg{'sep_set'} = $c->{'hd_s'};

Calling $csv->header breaks auto_diag in 1.40

In this short test script, I expect a "CSV_XS ERROR: 2023 - EIQ - QUO character not allowed @ rec 2 pos 6 field 2" exception in the loop, and that happens if the line calling $csv->header is commented out. However, after calling ->header it seems Text::CSV_XS::error_diag is getting called with no arguments and so $self is falsy and so the croak is never triggered. Instead we fall through to the "why no auto_diag" exception.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new( { auto_diag => 2 } );
$csv->header(*DATA); # comment out this line to hide the bug
while ( my $row = $csv->getline(*DATA) ) { print "$row->[0]\n" }
die "why no auto_diag for: " . $csv->error_diag
    if $csv->error_diag and $csv->error_diag != 2012;

__DATA__
Foo,Bar,Baz
a,xxx,1
b,"xx"xx", 2
c, foo , 3

POD: Decide if it is $fh or $io

In the POD, there are many usage examples. However, they flip between $fh and $io. $fh is used more often, so I would suggest turning all $io uses into $fh. Also, $fh is more widely used by the community in general when making a common variable for a file handle.

I could copy the POD and make the changes, but I don't know how to do a pull request. I could paste the edit here or elsewhere if you want me to do it.

Lady Aleena

_is_arrayref, _is_hashref and _is_coderef are incorrectly defined

See implementation: https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs#L60:

#define _is_arrayref(f) ( f && \
     (SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
      SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVAV )
#define _is_hashref(f) ( f && \
     (SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
      SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVHV )
#define _is_coderef(f) ( f && \
     (SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
      SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVCV )

The code first checks SvRMAGICAL() and then calls mg_get(). But SvRMAGICAL() tests for SVs_RMG, which is magic other than get/set and is in most cases used for the clear function. So it does not make sense to call mg_get() based on the SVs_RMG result.

Instead of SvRMAGICAL(), SvGMAGICAL() should be used: it checks SVs_GMG, i.e. that the scalar has get magic, which means that mg_get() needs to be called.

To simplify the code, I would propose using the SvGETMAGIC() macro, which calls mg_get() when it is needed. E.g. _is_arrayref(f) could look like this:

static inline bool _is_arrayref(SV *sv) {
    if (!sv) return false;
    SvGETMAGIC(sv);
    if (!SvROK(sv)) return false;
    if (SvTYPE(SvRV(sv)) != SVt_PVAV) return false;
    return true;
}

csv() cannot have both in and out be filenames

Currently, if both of these are file names, you get an error like:

Can't use string ("infile.csv") as an ARRAY ref while "strict refs" in use at /home/greg/perl5/perlbrew/perls/perl-5.26.0/lib/site_perl/5.26.0/x86_64-linux/Text/CSV_XS.pm line 1169.

I was trying to do something like:

csv (
    in      => 'infile.csv',
    headers => "auto",
    on_in   => sub { $_{'domain'} =~ s/^(.+)$/${1}.com/ },
    out     => 'outfile.csv',
    );

csv munge_column_names function fails with sep=; header

Hi,

I'm attempting to use the 'csv' function of Text::CSV_XS with munge_column_names set to "db". This fails when the .csv file passed to the script starts with the sep=; line to indicate the field separator, with the following error message:

CSV_XS ERROR: 1012 - INI - the header contains an empty field @ rec 1 pos 0

Setting sep_set to 1 doesn't seem to make a difference, but it all works correctly when I manually remove the sep=; line from the .csv file. Reading the documentation, I was under the impression that the 'sep=' feature was implemented in version 1.17. Am I missing a detail to get this to work?

I am running Text::CSV_XS version 1.41 built from MacPorts on Perl v5.26.3 (Mac OS X 10.13.6)

Turn off column name munging

I wanted to turn off column name munging, and ended up with:

$csv->header( $fh, { munge_column_names => sub { $_[0] } } );

Aside from noting that lc is the default, I'd like another special value, perhaps undef or any false, to turn off any processing:

$csv->header( $fh, { munge_column_names => undef } );

Not a big deal and there's no urgency.
