Node has Buffer.byteLength() explained in <a href="ht

Feature request: UTF8 Byte Length about simdutf HOT 6 CLOSED

anonrig commented on June 3, 2024

Feature request: UTF8 Byte Length

from simdutf.

Comments (6)

lemire commented on June 3, 2024

We have numerous functions to compute the size after transcoding such as simdutf::utf16_length_from_utf8. We also have functions which count characters.

Typically, in C/C++, if we have a string (that can contain the zero character) then we have a pointer to the beginning of the string and either a length parameter or a pointer to the end of the string. Thus determining the size in bytes is direct: it is either given by the length parameter or by pointer arithmetic. If we expect that your string is null terminated (as in C), then we should invoke strlen. However we cannot tell in general that the null character terminates the string: for example, this is a perfectly valid string in JavaScript: t= "fdsfd\0 fdsfds".

How does JavaScript know the size of the string t= "fdsfd\0 fdsfds"? It knows how long the string is, most likely because it stores a length alongside the characters.
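To make the point about embedded zero characters concrete, here is a minimal C++ sketch. The helper names `make_example` and `c_length` are made up for illustration; the sketch only assumes the standard library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds the example string with an explicit length,
// so the embedded zero byte is kept as part of the string.
inline std::string make_example() {
  static const char raw[] = "fdsfd\0 fdsfds";
  return std::string(raw, sizeof(raw) - 1);  // drop only the literal's final terminator
}

// Hypothetical helper: length as seen by C-style code, which stops
// at the first zero byte.
inline std::size_t c_length(const std::string& s) {
  return std::strlen(s.c_str());
}
```

With these, `make_example().size()` is 13 while `c_length(make_example())` is 5, which is why a length parameter (or an end pointer) is needed rather than strlen.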


anonrig commented on June 3, 2024

The internal v8 implementation looks something like this:

int Utf8LengthHelper(const char* s) {
  unibrow::Utf8::Utf8IncrementalBuffer buffer(unibrow::Utf8::kBufferEmpty);
  unibrow::Utf8::State state = unibrow::Utf8::State::kAccept;

  int length = 0;
  const uint8_t* c = reinterpret_cast<const uint8_t*>(s);
  while (*c != '\0') {
    unibrow::uchar tmp = unibrow::Utf8::ValueOfIncremental(&c, &state, &buffer);
    length += Ucs2CharLength(tmp);
  }
  unibrow::uchar tmp = unibrow::Utf8::ValueOfIncrementalFinish(&state);
  length += Ucs2CharLength(tmp);
  return length;
}

Upon searching I found v8-primitive.h - https://github.com/v8/v8/blob/70bdadce8f79e9ab12b9e8972803aea708fd36e7/include/v8-primitive.h#L143


lemire commented on June 3, 2024

At a glance, this code (Utf8LengthHelper) does something like simdutf::utf16_length_from_utf8(s, strlen(s)), but much slower.
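For readers who want to see what that computation amounts to, here is a scalar sketch of counting the UTF-16 code units needed for a UTF-8 input. It is not simdutf's implementation: the function name `utf16_units_from_utf8` is hypothetical, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: number of UTF-16 code units required
// to represent a valid UTF-8 input of `len` bytes.
inline std::size_t utf16_units_from_utf8(const char* data, std::size_t len) {
  std::size_t units = 0;
  for (std::size_t i = 0; i < len; i++) {
    const std::uint8_t b = static_cast<std::uint8_t>(data[i]);
    // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
    if ((b & 0xC0) != 0x80) units++;
    // A 4-byte lead (11110xxx) encodes a code point above U+FFFF,
    // which needs a surrogate pair, so it costs one extra code unit.
    if (b >= 0xF0) units++;
  }
  return units;
}
```

Unlike Utf8LengthHelper, this takes an explicit byte length, so it does not depend on a terminating zero byte.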


anonrig commented on June 3, 2024

I found the correct implementation in v8

int String::Utf8Length(Isolate* v8_isolate) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  int utf8_length = 0;
  if (flat.IsOneByte()) {
    for (uint8_t c : flat.ToOneByteVector()) {
      utf8_length += c >> 7;
    }
    utf8_length += length;
  } else {
    int last_character = unibrow::Utf16::kNoPreviousCharacter;
    for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
    }
  }
  return utf8_length;
}


lemire commented on June 3, 2024

If I had to guess, what this code does is one of two things...

If the input is a European iso-encoded character set (e.g., latin1), then this function computes the size in bytes of the output as a UTF-8 string. The simdutf library only supports Unicode at this time. We may add support for non-Unicode encodings later, but not in the immediate future.
If the input is UTF-16, then it effectively does simdutf::utf8_length_from_utf16. I would expect simdutf::utf8_length_from_utf16 to be much faster.
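The two cases can be sketched in plain scalar C++ as follows. These are illustrative reference functions with hypothetical names, not the simdutf or V8 code. The UTF-16 version assumes well-formed input with properly paired surrogates; the V8 code quoted above tracks the previous code unit, so it may count unpaired surrogates differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: UTF-8 byte count for Latin1 input.
// Each Latin1 byte needs 1 UTF-8 byte, plus 1 more when its high bit is set.
inline std::size_t utf8_bytes_from_latin1(const std::uint8_t* data, std::size_t len) {
  std::size_t bytes = len;
  for (std::size_t i = 0; i < len; i++) {
    bytes += data[i] >> 7;  // 1 if the byte is >= 0x80, else 0
  }
  return bytes;
}

// Hypothetical helper: UTF-8 byte count for well-formed UTF-16 input.
inline std::size_t utf8_bytes_from_utf16(const char16_t* data, std::size_t len) {
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < len; i++) {
    const char16_t c = data[i];
    if (c < 0x80) {
      bytes += 1;  // ASCII
    } else if (c < 0x800) {
      bytes += 2;  // two-byte UTF-8 sequence
    } else if (c >= 0xD800 && c <= 0xDFFF) {
      bytes += 2;  // each half of a surrogate pair accounts for 2 of the 4 bytes
    } else {
      bytes += 3;  // rest of the Basic Multilingual Plane
    }
  }
  return bytes;
}
```

The Latin1 loop matches the one-byte branch of the quoted code (`utf8_length += c >> 7` followed by adding the length), written with an explicit pointer and length.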


lemire commented on June 3, 2024

@anonrig In our main branch, we have full support for Latin1. So simdutf::utf8_length_from_latin1 is available, along with optimized transcoding functions. A release should follow soon.

I am closing this issue. I invite anyone interested in this feature to review our API and comment if needed.

