Comments (6)

MarkReedZ commented on June 23, 2024

Multi-byte UTF-8 sequences look like this. Once you see a byte with the leftmost bit set, you can count its leading 1 bits to get the size of the character in bytes. Text in languages like Chinese will be all multi-byte characters. I speak Chinese, so I optimized this in mrjson. I'll set up tests next.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
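
As a minimal sketch of the bit counting described above (not the mrjson or StringZilla implementation; the function name is made up for illustration), the sequence length can be read off the lead byte with a few masks:

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the byte length of a UTF-8 sequence, given its lead byte."""
    if (lead & 0x80) == 0x00:  # 0xxxxxxx: plain ASCII, one byte
        return 1
    if (lead & 0xE0) == 0xC0:  # 110xxxxx: two-byte sequence
        return 2
    if (lead & 0xF0) == 0xE0:  # 1110xxxx: three-byte sequence
        return 3
    if (lead & 0xF8) == 0xF0:  # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never valid in UTF-8.
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")
```

Most Chinese characters fall in the three-byte case, so a scanner can skip ahead by the returned length instead of inspecting every byte.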

from stringzilla.

ashvardanian commented on June 23, 2024

I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented. Like this:

    if args.max_line_length:
        max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
        counts["max_line_length"] = max_line_length

It is expensive to convert to str and even more expensive to split it.
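
One possible direction, sketched here in plain Python over a `bytes` object: find newline offsets directly and track the longest gap, without building a `str` or a list of lines. The helper name is hypothetical, and the result is a length in bytes, not a display width.

```python
def max_line_length(data: bytes) -> int:
    """Length in bytes of the longest line, found by scanning for newlines."""
    longest = 0
    start = 0
    while True:
        end = data.find(b"\n", start)
        if end == -1:
            # No more newlines: the remainder is a trailing partial line.
            return max(longest, len(data) - start)
        longest = max(longest, end - start)
        start = end + 1
```

The same loop should work on any object that offers a `find` method returning -1 on a miss, but that is an assumption to verify against the actual type of `mapped_bytes`.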

MarkReedZ commented on June 23, 2024

We're missing tests and don't handle locale.

Some thoughts on tests:

Stdin

Redirection: note that --files0-from needs to read a NUL-delimited list of filenames.

    find . -name '*.[ch]' -print0 | wc -L --files0-from=-
    cat xxx | wc -l

Word Count

We only count spaces, so add tests for adjacent whitespace and for other whitespace characters.
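
To make the gap concrete, here is a small sketch contrasting a space-only count with a count of whitespace-delimited runs. Both function names are hypothetical, and `bytes.split()` with no argument only knows about ASCII whitespace.

```python
def count_words_spaces_only(data: bytes) -> int:
    """Naive count: treats every single space as a word boundary."""
    return data.count(b" ") + 1 if data else 0


def count_words(data: bytes) -> int:
    """Count runs of non-whitespace bytes; split() collapses adjacent
    ASCII whitespace (space, tab, newline, CR, FF, VT)."""
    return len(data.split())
```

Inputs such as `b"a  b"` (two adjacent spaces) and `b"a\tb\nc"` (tab and newline) give different answers from the two functions, which makes them useful test cases.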

Line Count

If a file ends in a non-newline character, its trailing partial line is not counted.
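
In other words, the line count is the number of newline bytes. A one-line sketch (hypothetical helper name) that a test could compare against:

```python
def count_lines(data: bytes) -> int:
    """Count newline bytes; a trailing partial line adds nothing."""
    return data.count(b"\n")
```

For example, `b"a\nb"` has one newline and a partial line `b`, so it should count as 1 line, not 2.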

Max Line Length

Tabs are set at every 8th column. Display widths of wide characters are considered. Non-printable characters are given 0 width.
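
A rough sketch of these rules in Python, assuming the line has already been decoded to `str`. The function name is made up, and `unicodedata.east_asian_width` is used here as a stand-in for the C `wcwidth` function, so results may differ from GNU wc for some characters.

```python
import unicodedata


def display_width(line: str, tab_size: int = 8) -> int:
    """Approximate the display width of one line (no newline included)."""
    col = 0
    for ch in line:
        if ch == "\t":
            # Advance to the next tab stop (every 8th column by default).
            col += tab_size - (col % tab_size)
        elif not ch.isprintable() or unicodedata.combining(ch):
            # Non-printable and combining characters take no columns.
            continue
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters take two columns.
            col += 2
        else:
            col += 1
    return col
```

Tests could cover a tab after a single character, a line of CJK text, and a line containing control characters.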

Locale

-m, --chars: Print only the character counts, as per the current locale (UTF-8 and UTF-16 support needed). Encoding errors are not counted. locale.getencoding / setencoding

  • We'd need to scan for non-ASCII code points in the input.

-w, --words: Uses locale-specific whitespace.

  • wc likely doesn't really do this per locale. We'd need to test a few locales. To do this we'd have to scan for non-ASCII code points and have a list of Unicode whitespace to compare against.
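
As a starting point for such a list, the sketch below collects every code point that Python's `str.isspace()` treats as whitespace and counts words against it. This is Python's definition, which is an assumption here and may not match what wc does in a given locale; the names are hypothetical.

```python
import sys

# Every code point Python considers whitespace (includes U+00A0 and U+3000).
UNICODE_WHITESPACE = frozenset(
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
)


def count_words(text: str, whitespace=UNICODE_WHITESPACE) -> int:
    """Count maximal runs of characters that are not in `whitespace`."""
    words = 0
    in_word = False
    for ch in text:
        if ch in whitespace:
            in_word = False
        elif not in_word:
            in_word = True
            words += 1
    return words
```

Passing a smaller set, such as `frozenset(" \t\n")`, makes it easy to compare ASCII-only behavior against the Unicode-aware one on the same input.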

References

https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
https://www.mkssoftware.com/docs/man1/wc.1.asp#:~:text=wc%20counts%20the%20number%20of,16%2Dbit%20wide%20Unicode%20files.


ashvardanian commented on June 23, 2024

Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?


MarkReedZ commented on June 23, 2024

For counting characters, we can call locale.getencoding() in Python. A naive approach would then be len(bytes.decode('utf-8')), which would not be performant. Ultimately we'd want to be able to scan for non-ASCII bytes (& 0x80) and consume them as a unit, since the character could be 2-4 bytes long. If the library does not have a way to find bytes with the first bit set (& 0x80), we'd have to add it.
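
One way to count characters without decoding, sketched in plain Python and assuming the input is valid UTF-8: every character has exactly one byte that is not a continuation byte (10xxxxxx), so dropping continuation bytes leaves one byte per character. The names are hypothetical and this is not a StringZilla API.

```python
# All 64 continuation byte values, 0x80..0xBF (bit pattern 10xxxxxx).
_CONTINUATION_BYTES = bytes(range(0x80, 0xC0))


def count_utf8_chars(data: bytes) -> int:
    """Count code points in valid UTF-8 by ignoring continuation bytes."""
    if data.isascii():
        # Fast path: no byte has the high bit (0x80) set.
        return len(data)
    # translate(None, delete) removes the listed bytes in a single pass.
    return len(data.translate(None, _CONTINUATION_BYTES))
```

Note that `translate` allocates a filtered copy, so this is a correctness reference for tests more than a performance target.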

For counting words I believe we want a find_charset function that we can use with the whitespace character set.


ashvardanian commented on June 23, 2024

For the first part, we can temporarily compensate for that by making several passes over the data, one for each multi-byte rune.
