Standard-compliant `wc` implementation about stringzilla HOT 6 OPEN

ashvardanian commented on June 23, 2024 1

Standard-compliant `wc` implementation

from stringzilla.

Comments (6)

MarkReedZ commented on June 23, 2024 1

Multi-byte UTF-8 sequences look like this: once you see a byte with the leftmost bit set, the number of leading 1 bits in that byte gives the size of the character in bytes. Text in languages like Chinese consists almost entirely of multi-byte characters. I speak Chinese, so I optimized this in mrjson. I'll set up tests next.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
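The bit patterns above can be turned into a small lookup on the first byte of each sequence. This is only a sketch in pure Python: the function names are made up for illustration, it is not taken from mrjson or StringZilla, and it does not validate the continuation bytes.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the length in bytes of a UTF-8 sequence, given its first byte.

    Returns 0 for bytes that cannot start a sequence (continuation bytes
    of the form 10xxxxxx, or the invalid pattern 11111xxx).
    """
    if lead_byte < 0x80:          # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte & 0xE0 == 0xC0:  # 110xxxxx
        return 2
    if lead_byte & 0xF0 == 0xE0:  # 1110xxxx
        return 3
    if lead_byte & 0xF8 == 0xF0:  # 11110xxx
        return 4
    return 0


def count_characters_by_lead_byte(data: bytes) -> int:
    """Count characters by hopping from one lead byte to the next."""
    count = 0
    position = 0
    while position < len(data):
        step = utf8_sequence_length(data[position])
        if step:
            count += 1
            position += step
        else:
            position += 1  # stray byte: skip it without counting
    return count
```

For example, `count_characters_by_lead_byte("héllo 中文".encode("utf-8"))` visits 13 bytes but counts 8 characters.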

ashvardanian commented on June 23, 2024

I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented. Like this:

    if args.max_line_length:
        max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
        counts["max_line_length"] = max_line_length

It is expensive to convert to str and even more expensive to split it.
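One way to avoid both the conversion and the split is to walk the buffer from newline to newline with `find`, which both `bytes` and `mmap.mmap` objects provide. This is a rough sketch rather than the actual replacement: `max_line_length` is a hypothetical helper, and it measures lines in bytes, not in display columns.

```python
def max_line_length(data) -> int:
    """Length in bytes of the longest line; `data` is bytes or an mmap.mmap."""
    longest = 0
    start = 0
    end = len(data)
    while start < end:
        newline = data.find(b"\n", start)
        if newline == -1:
            newline = end  # last line has no trailing newline
        longest = max(longest, newline - start)
        start = newline + 1
    return longest
```

No intermediate `str` or list of lines is allocated; the only state is two integers.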

MarkReedZ commented on June 23, 2024

We're missing tests and don't handle locale.

Some thoughts on tests:

Stdin

Redirection: note that `--files0-from` needs to read a NUL-delimited list of filenames.

    find . -name '*.[ch]' -print0 | wc -L --files0-from=-
    cat xxx | wc -l

Word Count

We currently only count spaces, so add tests for adjacent whitespace and for other whitespace characters.

Line Count

If a file ends in a non-newline character, its trailing partial line is not counted.
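In other words, the line count is just the number of newline bytes. A minimal sketch of that rule, with a hypothetical `count_lines` helper, which could double as a test case:

```python
def count_lines(data: bytes) -> int:
    """Count newline bytes; a trailing partial line is not counted."""
    return data.count(b"\n")


# A file without a final newline reports one line fewer than it "looks" like.
assert count_lines(b"first\nsecond\n") == 2
assert count_lines(b"first\nsecond") == 1
```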

Max Line Length

Tabs are set at every 8th column. Display widths of wide characters are considered. Non-printable characters are given 0 width.
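These three rules can be approximated in pure Python with the standard `unicodedata` module. This is only a sketch under assumptions: `display_width` is a hypothetical helper, East Asian "W" and "F" characters are treated as two columns wide, and combining marks are given zero width, which may not match what the C library reports in every locale.

```python
import unicodedata


def display_width(line: str, tab_size: int = 8) -> int:
    """Approximate the number of terminal columns a decoded line occupies."""
    column = 0
    for ch in line:
        if ch == "\t":
            # Advance to the next tab stop (every `tab_size` columns).
            column += tab_size - column % tab_size
        elif not ch.isprintable() or unicodedata.combining(ch):
            continue  # non-printable and combining characters: zero width
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            column += 2  # wide and fullwidth characters
        else:
            column += 1
    return column
```

For instance, `display_width("a\tb")` places `b` right after the first tab stop, and `display_width("中文")` counts two columns per character.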

Locale

`-m`, `--chars`: print only the character counts, as per the current locale (UTF-8 and UTF-16 support needed). Encoding errors are not counted. See locale.getencoding / setencoding.

We'd need to scan for non-ASCII code points in the input.

`-w`, `--words`: uses locale-specific whitespace.

wc likely doesn't really do this per locale. We'd need to test a few locales. To do this, we'd have to scan for non-ASCII bytes and compare them against a list of Unicode whitespace characters.

References

https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
https://www.mkssoftware.com/docs/man1/wc.1.asp#:~:text=wc%20counts%20the%20number%20of,16%2Dbit%20wide%20Unicode%20files.

ashvardanian commented on June 23, 2024

Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?

MarkReedZ commented on June 23, 2024

For counting characters, we can call locale.getencoding() in Python; a naive approach would then be len(bytes.decode('utf-8')), which would not be performant. Ultimately, we'd want to scan for bytes with the high bit set (& 0x80) and consume the whole character, since it could be 2-4 bytes long. If the library does not have a way to find bytes with the high bit set, we'd have to add it.
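As a stopgap that stays in Python and never builds a `str`, the character count can be derived from the bytes alone: every UTF-8 continuation byte has the form 10xxxxxx, so the number of characters is the number of bytes left after dropping those. A sketch, assuming valid UTF-8 input; the helper name is hypothetical and this is not a StringZilla API.

```python
# Every UTF-8 continuation byte lies in the range 0x80..0xBF (10xxxxxx).
CONTINUATION_BYTES = bytes(range(0x80, 0xC0))


def count_characters_skipping_continuations(data: bytes) -> int:
    """Count code points as the bytes that are not continuation bytes."""
    if data.isascii():
        return len(data)  # fast path: one byte per character
    # translate(None, delete) removes the listed bytes in a single C-level pass.
    return len(data.translate(None, CONTINUATION_BYTES))
```

This still copies the buffer once, so it is a reference to test against rather than the scan described above.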

For counting words I believe we want a find_charset function that we can use with the whitespace character set.
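To make the intended behavior concrete, here is a pure-Python reference for what a charset-based word count would compute: a word is a maximal run of bytes that are not in the whitespace set. The names `ASCII_WHITESPACE` and `count_words` are illustrative assumptions, and only ASCII whitespace is covered; Unicode whitespace would need the multi-byte handling discussed earlier.

```python
# Space, tab, newline, vertical tab, form feed, carriage return.
ASCII_WHITESPACE = frozenset(b" \t\n\v\f\r")


def count_words(data: bytes, whitespace=ASCII_WHITESPACE) -> int:
    """Count maximal runs of non-whitespace bytes without splitting."""
    words = 0
    in_word = False
    for byte in data:
        if byte in whitespace:
            in_word = False
        elif not in_word:
            in_word = True  # first byte of a new word
            words += 1
    return words
```

Adjacent whitespace collapses naturally: `count_words(b"  hello   world\t\nfoo ")` sees three words.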

ashvardanian commented on June 23, 2024

For the first part, we can temporarily compensate for that by performing several passes over the data: one for each multi-byte rune.
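One possible reading of this, sketched with nothing but a byte-count primitive: make one pass per possible lead byte of a multi-byte sequence, and subtract the continuation bytes each one implies from the total length. The function name is hypothetical, plain `bytes.count` stands in for whatever search primitive the library exposes, and valid UTF-8 input is assumed.

```python
def count_characters_multipass(data: bytes) -> int:
    """Count UTF-8 characters using one counting pass per lead-byte value."""
    extra_bytes = 0
    # Valid multi-byte lead bytes are 0xC2..0xF4.
    for lead in range(0xC2, 0xF5):
        if lead < 0xE0:
            trailing = 1  # 110xxxxx: one continuation byte
        elif lead < 0xF0:
            trailing = 2  # 1110xxxx: two continuation bytes
        else:
            trailing = 3  # 11110xxx: three continuation bytes
        extra_bytes += data.count(bytes([lead])) * trailing
    return len(data) - extra_bytes
```

That is 51 passes in the worst case, which is wasteful but needs no changes to the core library.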
