Comments (11)

lopopolo avatar lopopolo commented on June 2, 2024 1

I'm not sure if String#length is the right conditional here. Look at this for Binary and ASCII strings:

$ irb
[3.1.2] > s = "\u{1F600}"
=> "😀"
[3.1.2] > s.encoding
=> #<Encoding:UTF-8>
[3.1.2] > s.length
=> 1
[3.1.2] > a = s.force_encoding(Encoding::ASCII)
=> "\xF0\x9F\x98\x80"
[3.1.2] > b = s.b
=> "\xF0\x9F\x98\x80"
[3.1.2] > a.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > b.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > b == a
=> false
[3.1.2] > a == b
=> false

I think we'll have to dive through the MRI sources to find out what the extra conditional check is.

from artichoke.

lopopolo avatar lopopolo commented on June 2, 2024

That length comparison should be self.char_len(). len() is not encoding aware, it just looks at the underlying bytes.
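
For reference, the same byte-versus-character distinction is visible from Ruby itself. This is only an illustrative sketch: it assumes `String#bytesize` and `String#length` are the Ruby-level counterparts of the `len()` and `char_len()` methods mentioned above, which the thread does not state. The helper name `byte_and_char_len` is made up for this example.

```ruby
# Hypothetical helper: returns the byte count and the encoding-aware
# character count of a string.
def byte_and_char_len(str)
  { bytes: str.bytesize, chars: str.length }
end

s = "\u{1F600}"            # one code point, four bytes in UTF-8
p byte_and_char_len(s)     # UTF-8: byte count and character count differ
p byte_and_char_len(s.b)   # ASCII-8BIT: every byte counts as one character
```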

lopopolo avatar lopopolo commented on June 2, 2024

Let's add test cases for this in the string_test.rb functional tests.

b-n avatar b-n commented on June 2, 2024

Suspect:

[3.1.2] > s = "\u{1F600}"
=> "😀"
[3.1.2] > s.force_encoding(Encoding::ASCII)
=> "\xF0\x9F\x98\x80"
[3.1.2] > t = s.b
=> "\xF0\x9F\x98\x80"
[3.1.2] > s == t
=> false
[3.1.2] > s.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > t.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > s.valid_encoding?
=> false
[3.1.2] > t.valid_encoding?
=> true
[3.1.2] > s = "A"
=> "A"
[3.1.2] > s.force_encoding(Encoding::ASCII)
=> "A"
[3.1.2] > t = s.b
=> "A"
[3.1.2] > s == t
=> true
[3.1.2] > s.valid_encoding?
=> true
[3.1.2] > t.valid_encoding?
=> true

E.g. \xF0 is not a valid US-ASCII char, so s.valid_encoding? is false. Ruby is still doing its best to show the underlying bytes, of course. And since all bytes are valid in ASCII-8BIT, t.valid_encoding? is true.

AI-Mozi avatar AI-Mozi commented on June 2, 2024

Hey!
I will be happy to try to solve this issue. Is there anything else I should know about it?

lopopolo avatar lopopolo commented on June 2, 2024

Go for it @AI-Mozi

b-n avatar b-n commented on June 2, 2024

@AI-Mozi In case it helps, I think this line in the Ruby docs is the crux of the issue:

Returns false if the two strings' encodings are not compatible

โ˜๏ธ from here: https://ruby-doc.org/core-3.1.2/String.html#method-i-3D-3D

I spent some time with a colleague at work thinking about what this actually means, and I think I now have a grasp. Sorry in advance if my explanation is lacking and/or this becomes a thesis ๐Ÿ˜ฌ.

In short, it comes down to how the characters are represented, rather than their binary values. This hopefully can be explained by these two code pages (ISO 8859-1 and ISO 8859-2):

Here, you can see that \x30 is represented by 0 in both code pages. However, later on, \xA1 is represented by ¡ in 8859-1, whereas it is Ą in 8859-2. So although their binary contents are the same, how they are displayed to the user would be different.
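
The code page difference can also be checked without the charts by transcoding the raw byte to UTF-8. This is a small sketch rather than anything from the original discussion; `decode_byte` is a hypothetical helper name, and the printed glyphs depend on the terminal being able to display UTF-8.

```ruby
# Hypothetical helper: interpret one byte under a single-byte encoding
# and return the resulting character as a UTF-8 string.
def decode_byte(byte, encoding)
  [byte].pack("C").force_encoding(encoding).encode(Encoding::UTF_8)
end

[0x30, 0xA1].each do |byte|
  latin1 = decode_byte(byte, Encoding::ISO_8859_1)
  latin2 = decode_byte(byte, Encoding::ISO_8859_2)
  puts format("\\x%02X  8859-1: %s  8859-2: %s", byte, latin1, latin2)
end
```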

Note, this is confusing in Ruby, because it depends on whether your shell is set up to view these character sets or not (I believe this is the limitation, but I'm not 100% sure; run echo $LANG and you'll likely see UTF-8, for example). How this manifests is as follows:

[3.1.2] > s = "\x30".force_encoding(Encoding::ISO_8859_1)
=> "0"
[3.1.2] > t = "\x30".force_encoding(Encoding::ISO_8859_2)
=> "0"
[3.1.2] > s.encoding
=> #<Encoding:ISO-8859-1>
[3.1.2] > t.encoding
=> #<Encoding:ISO-8859-2>
[3.1.2] > s == t
=> true
[3.1.2] > u = "\xA1".force_encoding(Encoding::ISO_8859_1)
=> "\xA1"
[3.1.2] > v = "\xA1".force_encoding(Encoding::ISO_8859_2)
=> "\xA1"
[3.1.2] > u == v
=> false

Some analysis:

  • \x30 outputs as a 0, since this symbol is encoded the same way as in what the shell supports (in UTF-8, \x30 is also represented by 0)
  • \xA1 is output as hex, since that character is not the same in UTF-8 as it is in the specified encoding, i.e. my shell knows this text is not UTF-8, but it also doesn't know how to output it
  • When Ruby says: "Returns false if the two strings' encodings are not compatible" - I believe it's saying: "If all characters are represented in the same way in both encodings, then it's equal, otherwise it is not"
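
One way to probe the "compatible" wording is Ruby's `Encoding.compatible?`, which returns an encoding when two strings can be combined and `nil` otherwise. The sketch below only shows that it lines up with `==` for the four strings above; whether `String#==` uses exactly this check internally is an assumption, and `tagged` and `compatible?` are hypothetical helper names.

```ruby
# Hypothetical helper: build a one-byte string tagged with the given encoding.
def tagged(byte, encoding)
  [byte].pack("C").force_encoding(encoding)
end

# Hypothetical helper: true when Ruby reports a common encoding for a and b.
def compatible?(a, b)
  !Encoding.compatible?(a, b).nil?
end

s = tagged(0x30, Encoding::ISO_8859_1)
t = tagged(0x30, Encoding::ISO_8859_2)
u = tagged(0xA1, Encoding::ISO_8859_1)
v = tagged(0xA1, Encoding::ISO_8859_2)

puts "0x30: compatible=#{compatible?(s, t)} equal=#{s == t}"
puts "0xA1: compatible=#{compatible?(u, v)} equal=#{u == v}"
```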

โ˜๏ธ In saying all the above, Artichoke currently only supports Binary, ASCII, and UTF-8 strings. The good news, is that the first 128 characters (0 indexed) are represented the same across all of these encodings. e.g. any two byte strings that only include the characters \x00 => \x7F should be equal, regardless of encoding. If the strings contain anything \x80 and above, I'd expect to give equality. Proof:

[3.1.2] > s = "\x80"
=> "\x80"
[3.1.2] > t = s.b
=> "\x80"
[3.1.2] > u = s.dup.force_encoding(Encoding::ASCII)
=> "\x80"
[3.1.2] > s.encoding
=> #<Encoding:UTF-8>
[3.1.2] > t.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > u.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > s == t
=> false
[3.1.2] > t == u
=> false
[3.1.2] > s == u
=> false
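
The proof above only covers \x80. As a rough sketch, the same check can be run over every single byte for the three encodings Artichoke supports. The constant and method names here are made up for the example.

```ruby
# The three encodings discussed in this thread.
THREAD_ENCODINGS = [Encoding::UTF_8, Encoding::US_ASCII, Encoding::ASCII_8BIT].freeze

# Hypothetical helper: the same byte, tagged once per encoding.
def tagged_variants(byte)
  THREAD_ENCODINGS.map { |enc| [byte].pack("C").force_encoding(enc) }
end

# True when the byte compares equal under every pair of encodings.
def equal_across_encodings?(byte)
  tagged_variants(byte).combination(2).all? { |a, b| a == b }
end

# True when the byte compares equal under no pair of differing encodings.
def unequal_across_encodings?(byte)
  tagged_variants(byte).combination(2).none? { |a, b| a == b }
end

low_ok  = (0x00..0x7F).all? { |byte| equal_across_encodings?(byte) }
high_ok = (0x80..0xFF).all? { |byte| unequal_across_encodings?(byte) }
puts "0x00..0x7F always equal: #{low_ok}"
puts "0x80..0xFF never equal:  #{high_ok}"
```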

Hopefully the above makes some sense. I'm not actually sure how/where I'd implement this, but I thought the info might help

b-n avatar b-n commented on June 2, 2024

In saying this:

[3.1.2] > u = "\xD6".force_encoding(Encoding::ISO_8859_1)
=> "\xD6"
[3.1.2] > v = "\xD6".force_encoding(Encoding::ISO_8859_2)
=> "\xD6"
[3.1.2] > u == v
=> false

โ˜๏ธ I'm not sure if this is "correct" from MRI. In both code pages, this would be ร– so "In theory" they should equal since the those two encodings are compatible for that character. I imagine this is because it would be massively hard to manage giant equality tables of "this code point looks like this here" etc. Although I do like the challenge of making a Rust library that can do this sort of equality ๐Ÿ˜ฌ.

Note, I suspect MRI (and we could also use this same logic in Artichoke) uses these conditions for equality:

  • internal bytes are the same
  • Encoding is the same OR both strings only contain ascii chars
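
The two suspected conditions can be written down as a predicate and compared against MRI's own `==`. This is a sketch of the guess above, not MRI's actual implementation; `suspected_equal?` is a hypothetical name, and the sample set is limited to non-empty strings in the ASCII-compatible encodings used in this thread.

```ruby
# Hypothetical helper encoding the two suspected conditions:
# same bytes, and either the same encoding or both sides ASCII-only.
def suspected_equal?(a, b)
  same_bytes = a.bytes == b.bytes
  same_encoding_or_ascii = a.encoding == b.encoding || (a.ascii_only? && b.ascii_only?)
  same_bytes && same_encoding_or_ascii
end

# Every sample literal, tagged with every encoding from the thread.
def sample_strings
  encodings = [Encoding::UTF_8, Encoding::US_ASCII, Encoding::ASCII_8BIT,
               Encoding::ISO_8859_1, Encoding::ISO_8859_2]
  ["abc", "\u{1F600}", "\x80", "\xD6"].flat_map do |literal|
    encodings.map { |enc| literal.dup.force_encoding(enc) }
  end
end

# Pairs where the guess disagrees with Ruby's String#==.
def mismatches
  samples = sample_strings
  samples.product(samples).reject { |a, b| suspected_equal?(a, b) == (a == b) }
end

puts "mismatches against String#==: #{mismatches.size}"
```

If the guess holds for these samples, the mismatch count is zero. Empty strings and non-ASCII-compatible encodings such as UTF-16 are deliberately left out and would need separate checks.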

AI-Mozi avatar AI-Mozi commented on June 2, 2024

Hey! I've had a bit of a break but now I'd like to complete this task :)
Is spinoso-string the only place where I should add changes?
And could you please provide some examples of tests that would help me check if my changes are correct?

lopopolo avatar lopopolo commented on June 2, 2024

hi @AI-Mozi. you'll want to modify the PartialEq implementation on EncodedString to also check for the left and right sides having the same char_len:

impl PartialEq for EncodedString {
    fn eq(&self, other: &Self) -> bool {
        // Equality only depends on each `EncodedString`'s byte contents.
        //
        // ```
        // [3.0.2] > s = "abc"
        // => "abc"
        // [3.0.2] > t = s.dup.force_encoding(Encoding::ASCII)
        // => "abc"
        // [3.0.2] > s == t
        // => true
        // ```
        *self.as_slice() == *other.as_slice()
    }
}
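
As for test examples, the MRI sessions earlier in this thread can be turned into a small table of cases. The sketch below is plain Ruby with expectations copied from those sessions; how it would be wired into string_test.rb is not shown in the thread, and `EQUALITY_CASES` and `failing_cases` are hypothetical names.

```ruby
# Expected results are copied from the MRI sessions earlier in this thread.
# Each entry is [left, right, expected result of left == right].
EQUALITY_CASES = [
  ["abc", "abc".dup.force_encoding(Encoding::US_ASCII), true],
  ["A".dup.force_encoding(Encoding::US_ASCII), "A".b, true],
  ["\u{1F600}", "\u{1F600}".b, false],
  ["\u{1F600}".dup.force_encoding(Encoding::US_ASCII), "\u{1F600}".b, false],
  ["\x80", "\x80".b, false],
  ["\x80".b, "\x80".dup.force_encoding(Encoding::US_ASCII), false]
].freeze

# Returns the cases where String#== disagrees with the expected result.
def failing_cases(cases)
  cases.reject { |left, right, expected| (left == right) == expected }
end

failures = failing_cases(EQUALITY_CASES)
puts failures.empty? ? "all cases pass" : "failures: #{failures.inspect}"
```

One thing worth noting about the `"\x80"` cases: under MRI both sides have the same bytes and the same `length` (1), yet they still compare unequal, so a character-length comparison on its own may not cover every case.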

AI-Mozi avatar AI-Mozi commented on June 2, 2024

And that's all? Just check whether both sides have the same char_len?
