So I tried fixing the code by using this check:
if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
return;
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
return;
}
However, this produces unwanted output:
use feature 'say';
use YAML::PP;
use Encode;
my $input = decode_utf8("ä");
my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
say $dump;
__END__
--- !!binary |
5A==
Maybe I didn't understand your suggestions correctly?
from yaml-pp-p5.
Maybe it's simply not possible to detect if data is binary.
My tests in t/45.binary.t have been working like I expected, but maybe my assumptions in there are wrong.
I will document that only loading binary data this way is recommended.
For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.
from yaml-pp-p5.
> Maybe it's simply not possible to detect if data is binary.
Yes, this is not possible. The API must declare whether the input is treated as binary (in which case only input matching qr/^[\x00-\xff]*$/ is accepted, and the function should die otherwise) or as a (Unicode) string, in which case any defined input is accepted.
> However, this produces unwanted output:
> use YAML::PP;
> use Encode;
> my $input = decode_utf8("ä");
Let's stop at the line above, because what it does is not necessarily obvious.
Unless you specify the use utf8 pragma, Perl parses the source file as Latin-1. Because Latin-1 is defined over the whole 8-bit range, it can also represent the byte sequence of UTF-8 encoded characters. It is ugly, but I see it (mis)used in Perl a lot.
So when use utf8 is not specified (which is your case), the line above is equivalent to the following (I'm assuming that you wrote that file in UTF-8 encoding):
my $input = decode_utf8("\xC3\xA4");
Encode::decode_utf8 expects a sequence of bytes as input (which should be a valid UTF-8 sequence) and returns the Unicode string decoded from it. So $input holds the Unicode string "ä" (now written in UTF-8, not in Latin-1), which is equivalent to the string "\xE4" and to the string "\N{U+E4}".
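The equivalence described above can be checked directly. This is a minimal sketch using only core modules; the byte escapes stand in for a source file saved as UTF-8 without use utf8, which is an assumption about the original script.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Without "use utf8", a UTF-8 encoded "ä" in the source is seen as two bytes.
my $bytes = "\xC3\xA4";
my $input = decode_utf8($bytes);

# After decoding there is a single character with code point 0xE4.
printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($input), ord($input);

# String comparison works on characters, so all three spellings agree.
print "equal to \\xE4\n"     if $input eq "\xE4";
print "equal to \\N{U+E4}\n" if $input eq "\N{U+E4}";
```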
> my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
> say $dump;
> __END__
> --- !!binary |
> 5A==
> Maybe I didn't understand your suggestions correctly?
And what should the Binary schema/encoding do? I would say that it should expect a binary buffer as input and produce YAML output marked as binary.
And because the input was the single byte 0xE4, its base64 representation is 5A==, which seems to be correct.
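The base64 value can be reproduced with the core MIME::Base64 module. This is only an illustration of the encoding step, not the schema's actual code path.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# One byte, 0xE4 (binary 11100100), padded to a 4-character base64 group.
# The empty second argument suppresses the trailing newline.
my $base64 = encode_base64("\xE4", "");
print "$base64\n";    # prints 5A==
```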
So the main problem is how the API of the Binary schema is (or rather should be) defined.
If you define the API so that the input is expected to be, and must be, 8-bit binary, then the code should look like:
if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
die "Input is not 8-bit";
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
# Only 7-bit ascii, base64 encoding is not needed
return;
}
# now apply base64 encoding
...
There is still one small problem: 64-bit perl also supports characters above 0x10FFFF, up to 2^64-1. They are marked as Extended Perl Unicode and are not portable, but in Perl you can create and use them. So, to be fully precise, you should check for non-8-bit input as:
unless ($binary =~ m/^[\x00-\xFF]*$/) {
die "Input is not 8-bit";
}
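Putting the two checks together, the decision could be sketched as a standalone function. The name binary_to_base64 and its return convention are hypothetical and only for illustration; this is not YAML::PP's representer code.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper, not part of YAML::PP: returns the base64 text for
# 8-bit input, undef for plain 7-bit ASCII, and dies on anything wider.
sub binary_to_base64 {
    my ($binary) = @_;
    die "Input is not 8-bit\n" if $binary =~ m/[^\x00-\xFF]/;
    return undef unless $binary =~ m/[\x80-\xFF]/;
    return encode_base64($binary, "");
}

for my $value ("abc", "\xE4", "\x{20AC}") {
    my $result = eval { binary_to_base64($value) };
    if    ($@)              { print "rejected: $@" }
    elsif (defined $result) { print "base64: $result\n" }
    else                    { print "plain ASCII, no encoding needed\n" }
}
```

The negated class catches every character above 0xFF, including those beyond 0x10FFFF, so no upper bound has to be spelled out.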
> For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.
This should work fine.
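A rough sketch of what such a marker class might look like follows. The package name My::BinaryData and the represent function are hypothetical; YAML::PP does not provide them, and a real integration would hook into the schema's representers instead.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);
use Scalar::Util qw(blessed);

{
    # Hypothetical marker class; nothing with this name ships with YAML::PP.
    package My::BinaryData;

    sub new {
        my ($class, $bytes) = @_;
        die "Input is not 8-bit\n" if $bytes =~ m/[^\x00-\xFF]/;
        return bless \$bytes, $class;
    }

    sub bytes { ${ $_[0] } }
}

# A dumper can then decide by type instead of guessing from the content.
sub represent {
    my ($value) = @_;
    if (blessed($value) && $value->isa('My::BinaryData')) {
        return "!!binary " . encode_base64($value->bytes, "");
    }
    return $value;
}

print represent(My::BinaryData->new("abc")), "\n";    # marked: base64 output
print represent("abc"), "\n";                         # unmarked: left as text
```

With this design even pure ASCII data is dumped as binary when the caller marks it, which content sniffing cannot express.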
An alternative way to handle types when dumping a Perl structure to a typed format (YAML, JSON, ...) is to provide the types explicitly via an additional argument or interface. This is implemented, for example, in Cpanel::JSON::XS; see the examples in its documentation: https://metacpan.org/pod/Cpanel::JSON::XS::Type
from yaml-pp-p5.
Based on the description above, here is my proposed change, with an API in which the Binary schema expects 8-bit data:
diff --git a/lib/YAML/PP/Schema/Binary.pm b/lib/YAML/PP/Schema/Binary.pm
index 30b4491..a63fa64 100644
--- a/lib/YAML/PP/Schema/Binary.pm
+++ b/lib/YAML/PP/Schema/Binary.pm
@@ -27,15 +27,15 @@ sub register {
code => sub {
my ($rep, $node) = @_;
my $binary = $node->{value};
- unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
- # ASCII
- return;
+ if ($binary =~ m/[^\x{00}-\x{FF}]/) {
+ # non 8-bit
+ die "Input is not 8-bit binary\n";
}
- if (utf8::is_utf8($binary)) {
- # utf8
+ if ($binary =~ m/^[\x{00}-\x{7F}]*$/) {
+ # 7-bit ASCII
return;
}
- # everything else must be base64 encoded
+ # 8-bit must be base64 encoded
my $base64 = encode_base64($binary);
$node->{style} = YAML_ANY_SCALAR_STYLE;
$node->{data} = $base64;
@@ -84,10 +84,10 @@ See <https://yaml.org/type/binary.html>
By prepending a base64 encoded binary string with the C<!!binary> tag, it can
be automatically decoded when loading.
-Note that the logic for dumping is probably broken, see
-L<https://github.com/perlpunk/YAML-PP-p5/issues/28>.
+If you are using this schema, any string containing C<[\x{80}-\x{FF}]>
+(non-7-bit) will be dumped as binary.
-Suggestions welcome.
+This schema cannot be used for non-8-bit (non-binary) data.
=head1 METHODS
diff --git a/t/45.binary.t b/t/45.binary.t
index caed569..24e4b58 100644
--- a/t/45.binary.t
+++ b/t/45.binary.t
@@ -1,6 +1,7 @@
#!/usr/bin/env perl
use strict;
use warnings;
+use utf8;
use Test::More tests => 3;
use YAML::PP;
@@ -41,16 +42,15 @@ EOM
};
-my $latin1_a_umlaut = encode(latin1 => (decode_utf8 "ä"));
+my $latin1_a_umlaut = encode(latin1 => "ä");
my @tests = (
- [utf8 => "a"],
+ [ascii => "a"],
+ [ascii => "test"],
+ [ascii => "euro"],
[binary => $latin1_a_umlaut],
- [binary => "\304\244",],
- [utf8 => decode_utf8("\304\244"),],
- [binary => "a umlaut ä",],
- [utf8 => decode_utf8("a umlaut ä"),],
- [binary => "euro €",],
- [utf8 => decode_utf8("euro €"),],
+ [binary => "\304\244",],
+ [binary => encode_utf8("a umlaut ä"),],
+ [binary => encode_utf8("euro €"),],
[binary => "\303\274 \374",],
[binary => "\xC0\x80"],
[binary => "\xC0\xAF"],
@@ -59,7 +59,7 @@ my @tests = (
[binary => "\xE0\x83\xBF"],
[binary => "\xF0\x80\x83\xBF"],
[binary => "\xF0\x80\xA3\x80"],
- [binary => [$gif, decode_utf8("ä")],],
+ [binary => [$gif, encode_utf8("ä")],],
[binary => [$gif, 'foo'],],
);
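To see why the test data in the patch is passed through encode or encode_utf8, here is a small sketch of the 8-bit check applied to encoded and decoded strings. Escapes are used in place of the literal characters so the snippet does not depend on the file encoding; the sample labels are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode encode_utf8);

# Under "use utf8", "ä" is the single character U+00E4 and "€" is U+20AC.
my $a_umlaut = "\x{E4}";
my $euro     = "euro \x{20AC}";

my @samples = (
    [ 'latin1 bytes' => encode(latin1 => $a_umlaut) ],    # "\xE4"
    [ 'utf8 bytes'   => encode_utf8($euro) ],             # "euro \xE2\x82\xAC"
    [ 'decoded euro' => $euro ],                          # contains U+20AC
);

for my $sample (@samples) {
    my ($name, $value) = @$sample;
    my $verdict = $value =~ m/^[\x00-\xFF]*\z/ ? 'accepted as 8-bit' : 'rejected';
    print "$name: $verdict\n";
}
```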
from yaml-pp-p5.