GithubHelp home page GithubHelp logo

Comments (4)

perlpunk avatar perlpunk commented on June 14, 2024

So I tried fixing the code by using this check:

if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
    return;
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
    return;
}

However, this produces unwanted output:

use YAML::PP;
use Encode;
my $input = decode_utf8("ä");
my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
say $dump;
__END__
--- !!binary |
  5A==

Maybe I didn't understand your suggestions correctly?

from yaml-pp-p5.

perlpunk avatar perlpunk commented on June 14, 2024

Maybe it's simply not possible to detect if data is binary.
My tests in t/45.binary.t have been working like I expected, but maybe my assumptions in there are wrong.

I will document that only loading binary data this way is recommended.
For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.

from yaml-pp-p5.

pali avatar pali commented on June 14, 2024

Maybe it's simply not possible to detect if data is binary.

Yes, this is not possible. You must declare in API if input should be treated as binary (and therefore only qr/^[\x00-\xff]*$/ is accepted, otherwise function should die) or input should be treated as (Unicode) string and then any non-undef input is accepted.

However, this produces unwanted output:

use YAML::PP;
use Encode;
my $input = decode_utf8("ä");

Lets stop at above line as it does not have to be obvious and simple what is doing here.

Unless you specify use utf8 pragma, Perl parses source code/file in Latin1 encoding. As Latin1 is defined at whole 8-bit domain, you can represent with it also byte sequence of characters in UTF-8. It is ugly but I see it (mis)used in Perl lot of times. So when use utf8 is not specified (which is your case) above line is equivalent to (I'm assuming that you have wrote that file in UTF-8 encoding):

my $input = decode_utf8("\xC3\xA4");

Encode::decode_utf8 expects on its input sequence of bytes (which should represent UTF-8 sequence) and returns Unicode string, decoded from UTF-8.

So in $input would be stored Unicode string "ä" (now written in UTF-8, not in Latin1) which is equivalent to string "\xE4" and to string "\N{U+E4}".

my $dump = YAML::PP->new( schema => ['Binary'] )->dump_string($input);
say $dump;
__END__
--- !!binary |
  5A==

Maybe I didn't understand your suggestions correctly?

And what should Binary schema/encoding do? I would say that it should expects on its input binary buffer and produce ouput in YAML marked as binary.

And because on input was one byte 0xE4 its representation in base64 is 5A==. Which seems to be correct.

So, main problem is there how is (or rather how should be) defined API of Binary schema.

If you define API in way that input is expected and must be in binary 8bit, then code should look like:

if ($binary =~ m/[\x{100}-\x{10FFFF}]/) {
    die "Input is not 8-bit";
}
unless ($binary =~ m/[\x{80}-\x{FF}]/) {
    # Only 7-bit ascii, base64 encoding is not needed
    return;
}
# now apply base64 encoding
...

Plus there is a still small problem, 64bit perl supports also characters above 0x10FFFF up to the 2^64-1. They are marked as Extended Perl Unicode and are not portable. But in Perl you can create them and use them. So to be fully precise, you should check for non-8-bit as:

unless ($binary =~ m/^[\x00-\xFF]*$/) {
    die "Input is not 8-bit";
}

For dumping, I might provide a special class that users can bless/tie their data with to mark them as binary.

This should work fine.

Alternative solution how to handle types when dumping perl structure to typed-formats (YAML/JSON/...) is to provide types explicitly via additional argument / interface. This is for example implemented in Cpanel::JSON::XS, you can look at examples in documentation: https://metacpan.org/pod/Cpanel::JSON::XS::Type

from yaml-pp-p5.

pali avatar pali commented on June 14, 2024

Based on above description, here is my proposed change with API that binary schema expects 8-bit data:

diff --git a/lib/YAML/PP/Schema/Binary.pm b/lib/YAML/PP/Schema/Binary.pm
index 30b4491..a63fa64 100644
--- a/lib/YAML/PP/Schema/Binary.pm
+++ b/lib/YAML/PP/Schema/Binary.pm
@@ -27,15 +27,15 @@ sub register {
         code => sub {
             my ($rep, $node) = @_;
             my $binary = $node->{value};
-            unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
-                # ASCII
-                return;
+            if ($binary =~ m/[^\x{00}-\x{FF}]/) {
+                # non 8-bit
+                die "Input is not 8-bit binary\n";
             }
-            if (utf8::is_utf8($binary)) {
-                # utf8
+            if ($binary =~ m/^[\x{00}-\x{7F}]*$/) {
+                # 7-bit ASCII
                 return;
             }
-            # everything else must be base64 encoded
+            # 8-bit must be base64 encoded
             my $base64 = encode_base64($binary);
             $node->{style} = YAML_ANY_SCALAR_STYLE;
             $node->{data} = $base64;
@@ -84,10 +84,10 @@ See <https://yaml.org/type/binary.html>
 By prepending a base64 encoded binary string with the C<!!binary> tag, it can
 be automatically decoded when loading.
 
-Note that the logic for dumping is probably broken, see
-L<https://github.com/perlpunk/YAML-PP-p5/issues/28>.
+If you are using this schema, any string containing C<[\x{80}-\x{FF}]>
+(non-7-bit) will be dumped as binary.
 
-Suggestions welcome.
+This schema cannot be used for non-8-bit (non-binary) data.
 
 =head1 METHODS
 
diff --git a/t/45.binary.t b/t/45.binary.t
index caed569..24e4b58 100644
--- a/t/45.binary.t
+++ b/t/45.binary.t
@@ -1,6 +1,7 @@
 #!/usr/bin/env perl
 use strict;
 use warnings;
+use utf8;
 use Test::More tests => 3;
 
 use YAML::PP;
@@ -41,16 +42,15 @@ EOM
 
 };
 
-my $latin1_a_umlaut = encode(latin1 => (decode_utf8 "ä"));
+my $latin1_a_umlaut = encode(latin1 => "ä");
 my @tests = (
-    [utf8 => "a"],
+    [ascii => "a"],
+    [ascii => "test"],
+    [ascii => "euro"],
     [binary => $latin1_a_umlaut],
-    [binary =>  "\304\244",],
-    [utf8 =>  decode_utf8("\304\244"),],
-    [binary => "a umlaut ä",],
-    [utf8 => decode_utf8("a umlaut ä"),],
-    [binary => "euro €",],
-    [utf8 => decode_utf8("euro €"),],
+    [binary => "\304\244",],
+    [binary => encode_utf8("a umlaut ä"),],
+    [binary => encode_utf8("euro €"),],
     [binary => "\303\274 \374",],
     [binary => "\xC0\x80"],
     [binary => "\xC0\xAF"],
@@ -59,7 +59,7 @@ my @tests = (
     [binary => "\xE0\x83\xBF"],
     [binary => "\xF0\x80\x83\xBF"],
     [binary => "\xF0\x80\xA3\x80"],
-    [binary => [$gif, decode_utf8("ä")],],
+    [binary => [$gif, encode_utf8("ä")],],
     [binary => [$gif, 'foo'],],
 );
 

from yaml-pp-p5.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.