tux / text-csv_xs
perl5 module for composition and decomposition of comma-separated values
You may have seen this blog post, about potential security problems with CSV files that have formulae in cells:
In a perfect world, all of my existing code that uses Text::CSV_XS would start throwing exceptions if it got CSVs with formulae in them. If users have to turn on some "don't allow formulae in cells" feature, then it's not going to help most people, because most people won't know about (a) the potential problem, or (b) the module's support for protecting against it.
I'm guessing that for backwards compatibility reasons you might not want to add this as a feature that's enabled by default, but I think you should at least consider it.
That said, following on from email, I might want to write:
use Text::CSV_XS qw/ csv /;
csv(in => $fh, headers => 'auto', formula => 'croak');
The formula parameter could be croak, allow, diag, empty (blank out all such cells), or undef (return cell as undef).
Personally I'd make the default croak, but I realise you don't want code to suddenly start breaking, so maybe diag could be the default?
Maybe this could be part of the planned attribute strict.
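To make the proposed policies concrete, here is a minimal sketch in plain Perl of how they might behave on a single parsed row. The helper name apply_formula_policy is hypothetical and not part of Text::CSV_XS, and the rule that a formula is any field starting with "=" is an assumption made for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak carp);

# Hypothetical helper, NOT part of Text::CSV_XS: apply a formula policy
# to one parsed row (an array ref of fields) and return a new row.
# Assumption: a field counts as a formula when it starts with "=".
sub apply_formula_policy {
    my ($row, $policy) = @_;
    $policy //= "diag";
    my @out;
    for my $i (0 .. $#$row) {
        my $field = $row->[$i];
        if (defined $field && $field =~ /^=/ && $policy ne "allow") {
            if    ($policy eq "croak") { croak "Formula found in field " . ($i + 1) }
            elsif ($policy eq "diag")  { carp  "Formula found in field " . ($i + 1) }
            elsif ($policy eq "empty") { $field = "" }      # blank out the cell
            elsif ($policy eq "undef") { $field = undef }   # return the cell as undef
            else                       { croak "Unknown formula policy '$policy'" }
        }
        push @out, $field;
    }
    return \@out;
}
```

With this sketch, apply_formula_policy (["a", "=1+1"], "empty") returns a row whose second field is the empty string, while the "croak" policy throws an exception for the same input.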
Let's say I just want to populate a structure in after_parse or on_in, but I don't want to get the whole structure back, and I don't want to print it. E.g. I might want to accumulate information by id:
csv (in      => shift,
     headers => "skip",
     on_in   => sub {
         my ($id, $email) = @{ $_[1] };
         push @{ $by_id{$id}{email} }, $email;
         },
     );
I'm only interested in %by_id. Something like out => undef or similar.
If a field is empty or blank, it would be nice if it could be skipped completely rather than made undef. So how about skip_empty and skip_blank?
(I was going to try to see if I could write this based on blank_is_undef and empty_is_undef, but I could not figure out how those interact with the method or function.)
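As a rough illustration of the requested behaviour, the sketch below filters a parsed row in plain Perl. The helper skip_fields and its exact semantics are assumptions: here skip_empty drops empty (or undefined) fields, and skip_blank also drops fields that contain only whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::CSV_XS: drop fields from a row
# instead of turning them into undef.
# skip_empty: drop fields that are undef or the empty string.
# skip_blank: drop fields that are undef, empty, or whitespace only.
sub skip_fields {
    my ($row, %opt) = @_;
    my @kept;
    for my $field (@$row) {
        my $is_empty = !defined $field || $field eq "";
        my $is_blank = !defined $field || $field =~ /^\s*$/;
        next if $opt{skip_empty} && $is_empty;
        next if $opt{skip_blank} && $is_blank;
        push @kept, $field;
    }
    return \@kept;
}
```

Note that dropping fields shifts the positions of the remaining ones, so a real option would have to define how it interacts with column names.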
When run on an empty file or one consisting only of whitespace, csv2xls throws some errors:
Use of uninitialized value $sep in string eq at csv2xls line 132.
Use of uninitialized value $sep in string ne at .../perl-5.16.2/lib/perl5/site_perl/5.16.2/x86_64-linux/Text/CSV_XS.pm line 135.
# CSV_XS ERROR: 1008 - INI - SEP undefined @ rec 0 pos 0
Can't call method "getline" on an undefined value at /grasp_mob_c/local/bin/csv2xls line 163.
It would be nice if it would just output an empty file, or one with a cell containing the whitespace.
fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [[1,2],[3,4]], out => "test.csv"'; xxd test.csv
00000000: 312c 320d 0a33 2c34 0d0a 1,2..3,4..
fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [[1,2]], out => "test.csv"'; xxd test.csv
00000000: 312c 320d 0a 1,2..
fany@homer:/tmp> perl -MText::CSV_XS=csv -E 'csv in => [], out => "test.csv"'; xxd test.csv
-rw------- 1 fany users 4 11. Mai 04:07 test.csv
00000000: 0d0a 0d0a ....
So two records in $aoa give two lines of CSV, one gives one line, but none gives two empty lines.
I think it should just output an empty file in this case.
I have created a csv object with a no-arg constructor call. I have set the quote character and the escape character to the undef value
my $csv = Text::CSV_XS->new;
$csv->quote_char (undef);
$csv->escape_char(undef);
Instead of undef the quote_char and escape_char getters return 0. That means I am not able to use the getters to pass this information to a DBI:CSV connect() call?!
$dbh = DBI->connect ("DBI:CSV:", undef, undef, {
...
csv_quote_char => $csv->quote_char,
csv_escape_char => $csv->escape_char,
...
Current NULL encoding options are limited. They work for some cases, where upstream can handle what we produce. Other consumers, e.g. MySQL's 'load data infile', are unable to correctly identify NULLs using our encoding method (e.g. ,, ). The docs here:
http://search.cpan.org/~hmbrand/Text-CSV_XS-1.35/CSV_XS.pm#csv
... suggest you can produce output that databases can parse by doing:
while (my $row = $sth->fetch) {
$csv->print ($fh, [ map { $_ // "\\N" } @$row ]);
}
... but this is absolutely not the case. Given the data:
[ "blah", undef, 3 ]
... the required output for importing into MySQL or other DBs would be:
"blah",\N,3
... but the above hack instead gives us:
"blah","\\N",3
There are 2 problems with this. First, the backslash in \N has been escaped, giving us \\N. DBs won't parse this correctly. Second, the field has been quoted, giving us "\\N". DBs won't parse this correctly either.
What we really need is a way to pass in any string sequence that can be used to encode a NULL value. Additionally, this string sequence should not be quoted.
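The required output described above can be sketched with a hand-rolled formatter in plain Perl. This is not how Text::CSV_XS works and the helper names format_field and format_row_with_null are hypothetical; the sketch assumes that numeric fields are left bare, other defined fields are quoted with embedded quotes doubled, and undef is written as a caller-supplied, unquoted NULL string.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helpers, NOT part of Text::CSV_XS.
# Format one field: undef becomes the bare NULL marker, numbers stay bare,
# everything else is quoted with embedded double quotes doubled.
sub format_field {
    my ($field, $null) = @_;
    return $null  unless defined $field;
    return $field if looks_like_number($field);
    (my $quoted = $field) =~ s/"/""/g;
    return qq{"$quoted"};
}

# Format a whole row; the NULL marker defaults to backslash-N.
sub format_row_with_null {
    my ($row, $null) = @_;
    $null //= "\\N";    # two characters: a backslash followed by N
    return join ",", map { format_field($_, $null) } @$row;
}
```

For the row [ "blah", undef, 3 ] this returns "blah",\N,3, which is the form the issue asks for.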
yewtc commented 19 hours ago
The option detect_bom is only available in the header method, but not all CSVs have header lines. The detect_bom functionality should be used on the first line of the file, not just the header line.
Please see makamaka/Text-CSV#48 .
I don't know how to reassign this bug to here.
I have the following line in my CSV file:
nj23h32n,"By using "pseudoenergies", we were able to design multiple peptide sequences that showed low micromolar viral entry inhibitory activity.",2010-06-22,
The 2nd field is the "abstract" field which could contain double quotes and commas as in the example above. I use the following one-liner to parse the file:
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>"test.csv",sep=>",",allow_loose_quotes=>1,allow_loose_escapes=>1))'
But I get:
nj23h32n,"By using ""pseudoenergies"," we were able to design multiple peptide sequences that showed low micromolar viral entry inhibitory activity.""",2010-06-22,
Any idea how to correctly parse this example?
Thanks in advance,
Andrej
v1.23, perl 5.22.1
Test case as follows; note the third line of the __DATA__ section with badly balanced quotes.
#!perl
use strict;
use warnings;
use Test::More;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new(
{
binary => 1,
sep_char => '|',
allow_loose_quotes => 1,
}
);
my $fh = \*DATA;
$csv->column_names( @{ $csv->getline( $fh ) } );
my $nbr_lines = 0;
$nbr_lines++ while ( my $row = $csv->getline( $fh ) );
is( $nbr_lines,3,'processed expected number of lines' );
done_testing();
__DATA__
first|second|third|fourth|fifth|sixth|seventh|eigth|ninth|tenth|eleventh|twelth|thriteenth|fourteenth|fifteenth|sixteenth|seventeenth|eighteenth|nineteenth
1|||||||156999||12 Valley||D||N|3610|||68 V D|EA MATCH
2|||||||195658|"""The Cottage"" 54"|"""The "|K|||||307652|R, M|"""The ", K, |EA MATCH
3|||||||216058|117 The K|||||||||117 The K, |EA MATCH
If you use Text::CSV_PP to process the above file then the test passes, so there is a difference in behaviour. Possibly unexpected for those using Text::CSV as they will get different behaviour on different machines if they haven't watched their installation/dependency tree.
I don't know if the behaviour of Text::CSV_XS is supposed to map to Text::CSV_PP exactly, hence this issue.
Occasionally I have to process files that contain multiple CSVs. Each of these CSVs is stored in the file as a heading, followed by the data lines, followed by an empty line (or eof).
It would be nice if Text::CSV had an option that basically says: stop reading after an empty line. This would make it possible to write something similar to:
csv (in => $fh, out => @aoh1, stop_at_empty => 1);
csv (in => $fh, out => @aoh2, stop_at_empty => 1);
csv (in => $fh, out => @aoh3, stop_at_empty => 1);
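Until such an option exists, the idea can be approximated by reading from the file handle until the first empty line. The sketch below uses plain Perl and a hypothetical helper read_block; it assumes simple data with no quoted separators or embedded newlines, so a plain split is enough.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::CSV: read one CSV block
# (a heading line followed by data lines) and stop at the first
# empty line or at end of file. Returns an array ref of hash refs.
# Assumption: fields contain no quoted separators or embedded newlines.
sub read_block {
    my ($fh) = @_;
    my (@header, @rows);
    while (defined (my $line = <$fh>)) {
        chomp $line;
        last if $line =~ /^\s*$/;            # an empty line ends this block
        my @fields = split /,/, $line, -1;   # -1 keeps trailing empty fields
        if (!@header) {
            @header = @fields;               # first line of the block is the heading
            next;
        }
        my %record;
        @record{@header} = @fields;
        push @rows, \%record;
    }
    return \@rows;
}
```

Calling read_block ($fh) repeatedly on the same handle then returns one block per call, because the handle stays positioned just after the empty line.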
Is this possible (e.g. to parse the output of the ps shell command)? With sep_char => ' ', allow_whitespace => 1, runs of spaces parse to multiple empty columns.
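For whitespace-aligned output like this, Perl's built-in split with the special ' ' pattern treats any run of whitespace as one separator. The sketch below is a plain-Perl workaround rather than a Text::CSV_XS feature; the helper parse_columns is hypothetical and assumes the first line holds the column names and that only the last column may contain spaces.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::CSV_XS: parse whitespace-aligned
# lines (such as ps output) into an array ref of hash refs.
# Assumptions: the first line holds the column names, and only the last
# column may contain embedded spaces.
sub parse_columns {
    my (@lines) = @_;
    return [] unless @lines;
    chomp @lines;
    # split ' ' skips leading whitespace and splits on runs of whitespace
    my @header = split ' ', shift @lines;
    my @rows;
    for my $line (@lines) {
        # Limit the number of fields so the last column keeps its spaces
        my @fields = split ' ', $line, scalar @header;
        my %record;
        @record{@header} = @fields;
        push @rows, \%record;
    }
    return \@rows;
}
```

A line such as " 1234 pts/0    00:00:00 perl -e 1" under the heading "PID TTY TIME CMD" then yields four fields, with CMD holding "perl -e 1".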
A minor typo causes header() not to use the set of sep chars defined.
Line 1363 sets the key 'set_set'
defined $c->{'hd_s'} and $harg{'set_set'} = $c->{'hd_s'};
But it should be 'sep_set' :
defined $c->{'hd_s'} and $harg{'sep_set'} = $c->{'hd_s'};
The error codes are not very end-user friendly, so a textual form would help, either as a method (to be used instead of $csv->error_diag) or as a procedure wrapper (e.g. textual_error($csv->error_diag)).
In this short test script, I expect a "CSV_XS ERROR: 2023 - EIQ - QUO character not allowed @ rec 2 pos 6 field 2" exception in the loop, and that happens if the line calling $csv->header is commented out. However, after calling ->header it seems Text::CSV_XS::error_diag is getting called with no arguments and so $self is falsy and so the croak is never triggered. Instead we fall through to the "why no auto_diag" exception.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { auto_diag => 2 } );
$csv->header(*DATA); # comment out this line to hide the bug
while ( my $row = $csv->getline(*DATA) ) { print "$row->[0]\n" }
die "why no auto_diag for: " . $csv->error_diag
if $csv->error_diag and $csv->error_diag != 2012;
__DATA__
Foo,Bar,Baz
a,xxx,1
b,"xx"xx", 2
c, foo , 3
Hello
Consider input file with 1 billion records:
my $batch = $csv->getline_all ($fh, 0, 1e6)
will return all data
my $batch = $csv->getline_all ($fh, 0, 1000000)
will return 1000000 recs
In the POD, there are many usage examples. However, they flip between $fh and $io. $fh is used more often, so I would suggest turning all $io uses into $fh. Also, $fh is more widely used by the community in general when making a common variable for a file handle.
I could copy the POD and make the changes, but I don't know how to do a pull request. I could paste the edit here or elsewhere if you want me to do it.
Lady Aleena
See implementation: https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs#L60:
#define _is_arrayref(f) ( f && \
(SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVAV )
#define _is_hashref(f) ( f && \
(SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVHV )
#define _is_coderef(f) ( f && \
(SvROK (f) || (SvRMAGICAL (f) && (mg_get (f), 1) && SvROK (f))) && \
SvOK (f) && SvTYPE (SvRV (f)) == SVt_PVCV )
The code first checks SvRMAGICAL() and then calls mg_get(). But SvRMAGICAL checks for SVs_RMG, which is magic other than get/set and is in most cases used for the clear function. So it does not make sense to call mg_get() based on the SVs_RMG result.
Instead of SvRMAGICAL(), SvGMAGICAL() should be used. It checks SVs_GMG, i.e. that the scalar has get magic, which means that mg_get needs to be called.
To simplify the code I would propose using the SvGETMAGIC() macro, which calls mg_get() when it is needed. E.g. _is_arrayref(f) could look like this:
static inline bool _is_arrayref(SV *sv) {
if (!sv) return false;
SvGETMAGIC(sv);
if (!SvROK(sv)) return false;
if (SvTYPE(SvRV(sv)) != SVt_PVAV) return false;
return true;
}
Currently, if both of these are file names, you get an error like:
Can't use string ("infile.csv") as an ARRAY ref while "strict refs" in use at /home/greg/perl5/perlbrew/perls/perl-5.26.0/lib/site_perl/5.26.0/x86_64-linux/Text/CSV_XS.pm line 1169.
I was trying to do something like:
csv (
in => 'infile.csv',
headers => "auto",
on_in => sub { $_{'domain'} =~ s/^(.+)$/${1}.com/ },
out => 'outfile.csv',
);
Hi,
I'm attempting to use the 'csv' function of Text::CSV_XS with munge_column_names set to "db". This fails when the .csv file passed to the script starts with the sep=; line to indicate the field separator, with the following error message:
Setting sep_set to 1 doesn't seem to make a difference, but it all works correctly when I manually remove the sep=; line from the .csv file. Reading the documentation, I was under the impression that the 'sep=' feature was implemented in version 1.17. Am I missing a detail to get this to work?
I am running Text::CSV_XS version 1.41 built from MacPorts on Perl v5.26.3 (Mac OS X 10.13.6)
I wanted to turn off column name munging, and ended up with:
$csv->header( $fh, { munge_column_names => sub { $_[0] } } );
Aside from noting that lc is the default, I'd like another special value, perhaps undef or any false value, to turn off any processing:
$csv->header( $fh, { munge_column_names => undef } );
Not a big deal and there's no urgency.