roc-lang / unicode

License: Universal Permissive License v1.0

Shell 0.24% Roc 99.60% Nix 0.17%

unicode's People

Contributors

Stargazers

Watchers

Forkers

lukewilliamboswell ricardo-valero dilsonhiga ageron bhansconnect

unicode's Issues

Enable GitHub Pages

This repo has a GitHub Actions workflow that deploys the docs to GitHub Pages, but the pipeline is failing because GitHub Pages is not enabled. To fix this, a repo owner needs to go to https://github.com/roc-lang/unicode/settings/pages, choose "GitHub Actions" as the source, and then re-run the deploy-docs job.

Grapheme.split fails on empty strings

Here's a little example to reproduce the issue:

app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.12.0/Lb8EgiejTUzbggO2HVVuPJFkwvvsfW6LojkLR20kTVE.tar.br",
    unicode: "https://github.com/roc-lang/unicode/releases/download/0.1.1/-FQDoegpSfMS-a7B0noOnZQs3-A2aq9RSOR5VVLMePg.tar.br"
}

import pf.Task exposing [Task]
import pf.Stdout
import unicode.Grapheme

main =
    when Grapheme.split "" is
        Ok _ -> Stdout.line!("Ok")
        Err _ -> Stdout.line!("Err")

Implement Visual Width

The Unicode Character Database (UCD) assigns each Unicode character one of six values as its default width property: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.

Zulip discussion

We already have a few examples that do this in our package, so this should be easy to implement as a good first issue.

Add the EastAsianWidth.txt data file to unicode/package/data, then write an InternalEAWGen.roc file, almost a copy-paste of InternalGBPGen.roc, that parses the data file and generates a Roc file mapping each CodePoint (CP) to an East Asian Width property EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide]. Then implement a corresponding helper that uses this mapping to walk through a List U8 or a Str and sum the widths.
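
To make the two steps concrete, here is a minimal sketch in Python (not Roc, and not the package's actual generator). It assumes the usual `first..last;code  # comment` layout of EastAsianWidth.txt, uses a small hand-written excerpt instead of the real file, and treats Wide and Fullwidth as two columns and everything else, including Ambiguous, as one. The names `parse_eaw`, `eaw_of` and `visual_width` are hypothetical.

```python
# Hand-written excerpt in the EastAsianWidth.txt layout (not the full file).
SAMPLE = """\
# code point or range ; property   # comment
0041..005A;Na  # LATIN CAPITAL LETTER A..Z
00A1;A         # INVERTED EXCLAMATION MARK
3041..3096;W   # HIRAGANA LETTER SMALL A..SMALL KE
FF21..FF3A;F   # FULLWIDTH LATIN CAPITAL LETTER A..Z
"""

# Short codes used in the data file, mapped to the tag names from the issue.
CODES = {
    "A": "Ambiguous", "F": "Fullwidth", "H": "Halfwidth",
    "N": "Neutral", "Na": "Narrow", "W": "Wide",
}


def parse_eaw(text):
    """Return a list of (first, last, property) tuples from the data text."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        cps, code = (field.strip() for field in data.split(";"))
        first, _, last = cps.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point: lo == hi
        ranges.append((lo, hi, CODES[code]))
    return ranges


def eaw_of(ranges, cp):
    """Look up a code point; unlisted code points default to Neutral here."""
    for lo, hi, prop in ranges:
        if lo <= cp <= hi:
            return prop
    return "Neutral"


def visual_width(ranges, s):
    """Sum column widths: 2 for Wide/Fullwidth, 1 otherwise (an assumption)."""
    return sum(2 if eaw_of(ranges, ord(ch)) in ("Wide", "Fullwidth") else 1
               for ch in s)


TABLE = parse_eaw(SAMPLE)
```

This sketch counts every code point separately, so combining marks and ZWJ sequences would be overcounted. A real helper would probably need to sum per grapheme cluster and decide how to resolve Ambiguous from context.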

Grapheme.split function crashes

The Grapheme.split function crashes on some edge cases. For example, running:

Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])

Crashes with the output:

The program crashed with:

        This is definitely a bug in the roc-lang/unicode package, caused by an unhandled edge case in grapheme text segmentation.

It is difficult to track down and catch every possible combination, so it would be helpful if you could log this as an issue with a reproduction.

Grapheme.split state machine state at the time was:
((AfterZWJ <opaque>), [8205, 4417], [ZWJ, L])

Here is the call stack that led to the crash:

        roc.panic
        Grapheme.splitHelp
        Grapheme.(anonymous function)
        Result.try
        Grapheme.split
        app.(anonymous function)
        Task.(anonymous function)
        .(anonymous function)
        rust.main

Optimizations can make this list inaccurate! If it looks wrong, try running without `--optimize` and with `--linker=legacy`

Here is a list of inputs that crash this function:

Grapheme.split (Str.fromUtf8 [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129])
Grapheme.split (Str.fromUtf8 [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 31])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 204, 136, 205, 184])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 224, 164, 149])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 10])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 181, 142])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 225, 134, 168])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 36])
Grapheme.split (Str.fromUtf8 [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188])

They all contain U+200D, the zero-width joiner (ZWJ) character, so that is probably the source of the crash.
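
The U+200D observation can be checked by decoding the byte lists into code points. The following is a small Python sketch over a subset of the inputs listed above; the helper name `code_points` is hypothetical.

```python
ZWJ = 0x200D  # zero-width joiner, 8205 in decimal

# A subset of the crashing inputs from the list above, as UTF-8 byte values.
CRASHING_INPUTS = [
    [225, 134, 168, 226, 128, 141, 225, 133, 129],
    [225, 134, 168, 226, 128, 141, 10],
    [234, 176, 129, 226, 128, 141, 36],
    [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188],
]


def code_points(utf8_bytes):
    """Decode a list of UTF-8 byte values into a list of code points."""
    return [ord(ch) for ch in bytes(utf8_bytes).decode("utf-8")]


for raw in CRASHING_INPUTS:
    cps = code_points(raw)
    # Show each input as U+XXXX values and whether it contains the joiner.
    print(" ".join(f"U+{cp:04X}" for cp in cps), "| has ZWJ:", ZWJ in cps)
```

The first input decodes to the code points 4520, 8205, 4417. Its last two values match the `[8205, 4417]` recorded in the state machine dump above.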

These examples were found by running the radamsa fuzzer using the examples in the GraphemeBreakTest data file. Hopefully this fuzz testing could be automated in the future as mentioned in #7.

Improve Grapheme.split testing

Quoted from Luke:

Coverage of the Unicode data file test points is pretty average. For example, it might only have a test that covers an emoji at the start of a string, but not in the middle, at the end, before a CRLF, or after a Hangul sequence, etc.
So I'm reasonably confident there are a couple of edge cases we haven't caught, and could end up crashing someone's code. It would be nice to get that to a point where we are reasonably confident that is not going to happen.
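
One possible way to widen that coverage is to place each test sequence in several surrounding contexts and check a simple invariant on the result. The sketch below is in Python rather than Roc and is only an illustration. The context list, the helper names `positional_cases` and `check_split`, and the invariant (the clusters join back to the input) are all assumptions, not part of the package.

```python
# Each context embeds a sequence at a different position in a string.
CONTEXTS = {
    "start": lambda seq: seq + "ab",
    "middle": lambda seq: "a" + seq + "b",
    "end": lambda seq: "ab" + seq,
    "before CRLF": lambda seq: seq + "\r\n",
    "after Hangul": lambda seq: "\uAC01" + seq,  # Hangul syllable U+AC01
}


def positional_cases(seq):
    """Return {context name: test string} for one sequence."""
    return {name: place(seq) for name, place in CONTEXTS.items()}


def check_split(split, text):
    """Weak invariant: splitting must not raise and must lose no text.

    `split` is whatever function is under test; it takes a string and
    returns a list of strings.
    """
    clusters = split(text)
    return "".join(clusters) == text


# Usage with a stand-in splitter (one cluster per code point), only to show
# the harness shape; a real run would call the grapheme splitter under test.
cases = positional_cases("\U0001F600")
results = {name: check_split(list, text) for name, text in cases.items()}
```

The invariant does not prove the boundaries are correct. It only catches crashes and dropped or duplicated text, which is the failure mode described in the crash report.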
