Comments (6)
We have numerous functions to compute the size after transcoding, such as simdutf::utf16_length_from_utf8. We also have functions that count characters.
Typically, in C/C++, if we have a string (that can contain the zero character), then we have a pointer to the beginning of the string and either a length parameter or a pointer to the end of the string. Thus determining the size in bytes is direct: it is either given by the length parameter or by pointer arithmetic. If you expect that your string is null terminated (as in C), then you should invoke strlen. However, we cannot assume in general that the null character terminates the string: for example, this is a perfectly valid string in JavaScript: t = "fdsfd\0 fdsfds".
How does JavaScript know the size of the string t = "fdsfd\0 fdsfds"? Well, it knows how long it is, probably because it has a length parameter.
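To make the point about embedded zero characters concrete, here is a minimal C++ sketch. The helper names (make_sample, c_style_length, stored_length, demo_embedded_zero) are hypothetical and only serve the illustration; std::string stands in for any string type that carries an explicit length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds the sample string, keeping its embedded zero byte.
// sizeof(raw) includes the terminator appended by the compiler, so we drop it.
std::string make_sample() {
  static const char raw[] = "fdsfd\0 fdsfds";
  return std::string(raw, sizeof(raw) - 1);
}

// Length as strlen sees it: scanning stops at the first zero byte.
std::size_t c_style_length(const std::string& s) {
  return std::strlen(s.c_str());
}

// Length as recorded by the string object itself.
std::size_t stored_length(const std::string& s) {
  return s.size();
}

// Usage: the two notions of length disagree for this string.
void demo_embedded_zero() {
  const std::string t = make_sample();
  assert(c_style_length(t) == 5);   // only "fdsfd" is seen
  assert(stored_length(t) == 13);   // all 13 bytes are counted
}
```

With a string like this, only the stored length is a safe input to a length-computing or transcoding function.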
from simdutf.
The internal v8 implementation looks something like this:
int Utf8LengthHelper(const char* s) {
  unibrow::Utf8::Utf8IncrementalBuffer buffer(unibrow::Utf8::kBufferEmpty);
  unibrow::Utf8::State state = unibrow::Utf8::State::kAccept;
  int length = 0;
  const uint8_t* c = reinterpret_cast<const uint8_t*>(s);
  while (*c != '\0') {
    unibrow::uchar tmp = unibrow::Utf8::ValueOfIncremental(&c, &state, &buffer);
    length += Ucs2CharLength(tmp);
  }
  unibrow::uchar tmp = unibrow::Utf8::ValueOfIncrementalFinish(&state);
  length += Ucs2CharLength(tmp);
  return length;
}
Upon searching, I found v8-primitive.h: https://github.com/v8/v8/blob/70bdadce8f79e9ab12b9e8972803aea708fd36e7/include/v8-primitive.h#L143
At a glance, this code (Utf8LengthHelper) does something like simdutf::utf16_length_from_utf8(s, strlen(s)), but much slower.
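For readers who want to see what such a length computation amounts to, here is a portable scalar sketch. It is not simdutf's implementation and not V8's: the function name utf16_length_from_utf8_scalar is hypothetical, and the code assumes the input is already valid UTF-8. Every byte that is not a continuation byte starts one code point, and a four-byte sequence needs a surrogate pair, hence one extra UTF-16 code unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in (assumes valid UTF-8 input).
// Returns the number of UTF-16 code units needed to hold the input.
std::size_t utf16_length_from_utf8_scalar(const char* input, std::size_t length) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const unsigned char byte = static_cast<unsigned char>(input[i]);
    // Any byte other than a continuation byte (10xxxxxx) starts a code point.
    if ((byte & 0xC0) != 0x80) { count++; }
    // A lead byte of a four-byte sequence maps to a surrogate pair.
    if (byte >= 0xF0) { count++; }
  }
  return count;
}

// Usage: ASCII, a two-byte sequence, and a four-byte sequence.
void demo_utf16_length() {
  assert(utf16_length_from_utf8_scalar("abc", 3) == 3);
  assert(utf16_length_from_utf8_scalar("\xC3\xA9", 2) == 1);          // U+00E9
  assert(utf16_length_from_utf8_scalar("\xF0\x9F\x98\x80", 4) == 2);  // U+1F600
}
```

Note that the length is passed explicitly, so the caller decides whether it comes from strlen or from a stored length field.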
I found the correct implementation in v8:
int String::Utf8Length(Isolate* v8_isolate) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  int utf8_length = 0;
  if (flat.IsOneByte()) {
    for (uint8_t c : flat.ToOneByteVector()) {
      utf8_length += c >> 7;
    }
    utf8_length += length;
  } else {
    int last_character = unibrow::Utf16::kNoPreviousCharacter;
    for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
    }
  }
  return utf8_length;
}
If I had to guess, this code does one of two things:
- If the input is in a European ISO-encoded character set (e.g., Latin-1), then this function computes the size in bytes of the output as a UTF-8 string. The simdutf library only supports Unicode at this time. We may add support for non-Unicode encodings later, but not in the immediate future.
- If the input is UTF-16, then it effectively does simdutf::utf8_length_from_utf16. I would expect simdutf::utf8_length_from_utf16 to be much faster.
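The two cases can be sketched in portable scalar C++ as follows. These are illustrations only, not simdutf's or V8's code: the function names carry a hypothetical _scalar suffix, and the UTF-16 version assumes valid input (properly paired surrogates), so it does not reproduce how the V8 loop treats an unpaired surrogate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: each Latin-1 byte takes one UTF-8 byte,
// plus one more when its high bit is set (values 0x80 to 0xFF).
std::size_t utf8_length_from_latin1_scalar(const std::uint8_t* input, std::size_t length) {
  std::size_t extra = 0;
  for (std::size_t i = 0; i < length; i++) { extra += input[i] >> 7; }
  return length + extra;
}

// Hypothetical sketch (assumes valid UTF-16): sum the UTF-8 size of each code unit.
std::size_t utf8_length_from_utf16_scalar(const char16_t* input, std::size_t length) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const std::uint16_t c = static_cast<std::uint16_t>(input[i]);
    if (c < 0x80) { count += 1; }                       // ASCII
    else if (c < 0x800) { count += 2; }                 // two-byte sequence
    else if (c >= 0xD800 && c <= 0xDFFF) { count += 2; } // half of a pair: 4 bytes per pair
    else { count += 3; }                                // rest of the BMP
  }
  return count;
}

// Usage: "a" and U+00E9 in Latin-1; a, U+00E9, U+20AC, U+1F600 in UTF-16.
void demo_utf8_length() {
  const std::uint8_t latin1[] = {0x61, 0xE9};
  assert(utf8_length_from_latin1_scalar(latin1, 2) == 3);
  const char16_t utf16[] = {0x0061, 0x00E9, 0x20AC, 0xD83D, 0xDE00};
  assert(utf8_length_from_utf16_scalar(utf16, 5) == 10);
}
```

The Latin-1 function mirrors the one-byte branch of the V8 code above (length plus the count of bytes with the high bit set).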
@anonrig In our main branch, we have full support for Latin1. So simdutf::utf8_length_from_latin1 is available, along with optimized transcoding functions. A release should follow soon.
I am closing this issue. I invite anyone interested in this feature to review our API and comment if needed.