roc-lang / unicode
License: Universal Permissive License v1.0
This repo has a GitHub Actions workflow that deploys the docs to GitHub Pages, but the pipeline is failing because GitHub Pages is not enabled. To fix this, a repo owner needs to go to https://github.com/roc-lang/unicode/settings/pages, choose "GitHub Actions" as the source, and then re-run the deploy-docs job.
Here's a little example to reproduce the issue:
app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.12.0/Lb8EgiejTUzbggO2HVVuPJFkwvvsfW6LojkLR20kTVE.tar.br",
    unicode: "https://github.com/roc-lang/unicode/releases/download/0.1.1/-FQDoegpSfMS-a7B0noOnZQs3-A2aq9RSOR5VVLMePg.tar.br"
}

import pf.Task exposing [Task]
import pf.Stdout
import unicode.Grapheme

main =
    when Grapheme.split "" is
        Ok _ -> Stdout.line!("Ok")
        Err _ -> Stdout.line!("Err")
The Unicode Character Database (UCD) assigns to each Unicode character, as its default width property, one of six values: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (= Not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.
We already have a few examples that do this in our package, so this should be easy to implement as a good first issue.
Add the EastAsianWidth.txt data file to unicode/package/data, then write an InternalEAWGen.roc file, almost a copy-paste of InternalGBPGen.roc, that parses the data file and generates a Roc file mapping code points (CP) to an East Asian Width property EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide]. Then implement a corresponding helper that uses this mapping to walk through a List U8 or a Str and sum the widths.
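The logic of the steps above can be sketched outside Roc. The following is a minimal Python illustration, not the package's implementation: the function names (parse_eaw, lookup, width) are hypothetical, the sample lines are hand-written in the style of EastAsianWidth.txt, and it assumes that Wide and Fullwidth count as two columns while every other value, including Ambiguous, counts as one. Code points not listed are treated as Neutral, and the "@missing" comment lines in the real file are ignored.

```python
import bisect

# Abbreviations used in the second field of EastAsianWidth.txt.
ABBREVIATIONS = {
    "A": "Ambiguous",
    "F": "Fullwidth",
    "H": "Halfwidth",
    "N": "Neutral",
    "Na": "Narrow",
    "W": "Wide",
}

# Hand-written sample lines in the data file's format (not the full file).
SAMPLE = """\
# code point or range ; property  # comment
0020;Na          # SPACE
0041..005A;Na    # LATIN CAPITAL LETTER A..Z
00A1;A           # INVERTED EXCLAMATION MARK
3041..3096;W     # HIRAGANA LETTER SMALL A..SMALL KE
FF21..FF3A;F     # FULLWIDTH LATIN CAPITAL LETTER A..Z
FF61;H           # HALFWIDTH IDEOGRAPHIC FULL STOP
"""


def parse_eaw(text):
    """Parse data-file text into a sorted list of (first, last, property)."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_points, abbreviation = (f.strip() for f in line.split(";", 1))
        first, _, last = code_points.partition("..")
        ranges.append(
            (int(first, 16), int(last or first, 16), ABBREVIATIONS[abbreviation])
        )
    ranges.sort()
    return ranges


def lookup(ranges, cp):
    """Return the property for a code point (assumption: unlisted is Neutral)."""
    # The key sorts after every range that starts at or before cp.
    index = bisect.bisect_right(ranges, (cp, 0x110000)) - 1
    if index >= 0 and ranges[index][1] >= cp:
        return ranges[index][2]
    return "Neutral"


def width(ranges, text):
    """Sum column widths: 2 for Wide/Fullwidth, 1 otherwise (an assumption)."""
    return sum(
        2 if lookup(ranges, ord(ch)) in ("Wide", "Fullwidth") else 1
        for ch in text
    )


table = parse_eaw(SAMPLE)
print(lookup(table, 0x3042), width(table, "A\u3042"))
```

A real helper would also need a policy for Ambiguous characters and for zero-width code points such as combining marks, neither of which this sketch handles.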
The Grapheme.split function crashes on some edge cases. For example, running:
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Crashes with the output:
The program crashed with:
This is definitely a bug in the roc-lang/unicode package, caused by an unhandled edge case in grapheme text segmentation.
It is difficult to track down and catch every possible combination, so it would be helpful if you could log this as an issue with a reproduction.
Grapheme.split state machine state at the time was:
((AfterZWJ <opaque>), [8205, 4417], [ZWJ, L])
Here is the call stack that led to the crash:
roc.panic
Grapheme.splitHelp
Grapheme.(anonymous function)
Result.try
Grapheme.split
app.(anonymous function)
Task.(anonymous function)
.(anonymous function)
rust.main
Optimizations can make this list inaccurate! If it looks wrong, try running without `--optimize` and with `--linker=legacy`
Here is a list of examples that crash this function:
Grapheme.split (Str.fromUtf8 [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129])
Grapheme.split (Str.fromUtf8 [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 31])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 204, 136, 205, 184])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 224, 164, 149])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 10])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 181, 142])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 225, 134, 168])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 36])
Grapheme.split (Str.fromUtf8 [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188])
They all contain U+200D, the zero-width joiner character, so that's probably the source of the crash.
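One way to check that observation is to decode the byte lists and look for the joiner. The Python sketch below does this for four of the inputs listed above; the helper name code_points is hypothetical, and Python is used only as a neutral way to inspect the bytes.

```python
ZWJ = 0x200D  # ZERO WIDTH JOINER, encoded in UTF-8 as 226, 128, 141

# A subset of the crashing inputs listed above, as UTF-8 byte lists.
CRASHING_INPUTS = [
    [225, 134, 168, 226, 128, 141, 225, 133, 129],
    [225, 134, 168, 226, 128, 141, 10],
    [234, 176, 129, 226, 128, 141, 36],
    [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166],
]


def code_points(utf8_bytes):
    """Decode a list of UTF-8 bytes into a list of code point numbers."""
    return [ord(ch) for ch in bytes(utf8_bytes).decode("utf-8")]


for utf8_bytes in CRASHING_INPUTS:
    cps = code_points(utf8_bytes)
    labels = " ".join(f"U+{cp:04X}" for cp in cps)
    print(labels, "| contains ZWJ:", ZWJ in cps)
```

The first input decodes to U+11A8, U+200D, U+1141. The last two code points are 8205 and 4417 in decimal, which matches the [8205, 4417] reported in the state machine dump above.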
These examples were found by running the radamsa fuzzer using the examples in the GraphemeBreakTest data file. Hopefully this fuzz testing could be automated in the future as mentioned in #7.
Quoted from Luke:
coverage of the unicode data file test points is pretty average, like it might only have a test that covers an emoji at the start of a string, but not the middle or end or before a CRLF or after a Hangul sequence... etc.
So I'm reasonably confident there are a couple of edge cases we haven't caught, and could end up crashing someone's code. It would be nice to get that to a point where we are reasonably confident that is not going to happen.
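The coverage gap described here, where the same fragment is tested at the start of a string but not in the middle, at the end, or next to a CRLF or Hangul sequence, could be narrowed by generating those placements mechanically. The Python sketch below is one hypothetical way to do that, not the project's test setup: the fragment list is an assumption drawn from the cases mentioned above, and naive_split is a stand-in for the function under test.

```python
from itertools import product

# Fragments to combine; the selection is an assumption based on the
# cases named above (emoji, ZWJ, Hangul, CRLF, plain ASCII).
FRAGMENTS = [
    "\U0001F1E6",  # regional indicator symbol
    "\u200D",      # zero-width joiner
    "\u1100",      # Hangul leading consonant
    "\u11A8",      # Hangul trailing consonant
    "\r\n",        # CRLF
    "$",           # plain ASCII
]


def placements(fragment, fillers):
    """Yield the fragment at the start, middle, and end of each filler pair."""
    for before, after in product(fillers, repeat=2):
        yield fragment + before + after
        yield before + fragment + after
        yield before + after + fragment


def check(split, text):
    """Properties any splitter should satisfy: no crash, no empty part, lossless."""
    parts = split(text)
    assert all(parts), f"empty cluster for {text!r}"
    assert "".join(parts) == text, f"lossy split for {text!r}"


def naive_split(text):
    # Stand-in for the function under test: one part per code point.
    return list(text)


def run(split=naive_split):
    """Check every placement and return how many inputs were tried."""
    tried = 0
    for fragment in FRAGMENTS:
        for text in placements(fragment, FRAGMENTS):
            check(split, text)
            tried += 1
    return tried


print(run())
```

These checks only catch crashes and lost or empty output. Verifying that the break positions themselves are correct would still need the expected results from the GraphemeBreakTest data file.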