Comments (6)

lemire commented on June 3, 2024

We have numerous functions to compute the size after transcoding such as simdutf::utf16_length_from_utf8. We also have functions which count characters.

Typically, in C/C++, if we have a string (that can contain the zero character), then we have a pointer to the beginning of the string and either a length parameter or a pointer to the end of the string. Thus determining the size in bytes is direct: it is given either by the length parameter or by pointer arithmetic. If we expect the string to be null-terminated (as in C), then we should call strlen. However, we cannot assume in general that the null character terminates the string. For example, this is a perfectly valid string in JavaScript: t = "fdsfd\0 fdsfds".

How does JavaScript know the size of the string t = "fdsfd\0 fdsfds"? Well, it knows how long it is, probably because it has a length parameter.
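
A minimal C++ sketch of the difference, assuming C++17 for std::string_view. The names c_length, explicit_length, kRaw and kText are hypothetical and only serve the illustration: strlen stops at the first zero byte, while a length carried next to the pointer counts the embedded zero.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Length as a C-style function sees it: scanning stops at the first zero byte.
inline std::size_t c_length(const char* s) { return std::strlen(s); }

// Length carried alongside the pointer: embedded zero bytes are counted.
inline std::size_t explicit_length(std::string_view s) { return s.size(); }

// The string from the discussion. sizeof includes the terminator that the
// compiler appends to the literal, hence the -1.
constexpr char kRaw[] = "fdsfd\0 fdsfds";
constexpr std::string_view kText(kRaw, sizeof(kRaw) - 1);
```

Here c_length(kRaw) sees only the 5 bytes before the zero character, whereas explicit_length(kText) reports all 13 bytes.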

from simdutf.

anonrig commented on June 3, 2024

The internal v8 implementation looks something like this:

int Utf8LengthHelper(const char* s) {
  unibrow::Utf8::Utf8IncrementalBuffer buffer(unibrow::Utf8::kBufferEmpty);
  unibrow::Utf8::State state = unibrow::Utf8::State::kAccept;

  int length = 0;
  const uint8_t* c = reinterpret_cast<const uint8_t*>(s);
  while (*c != '\0') {
    unibrow::uchar tmp = unibrow::Utf8::ValueOfIncremental(&c, &state, &buffer);
    length += Ucs2CharLength(tmp);
  }
  unibrow::uchar tmp = unibrow::Utf8::ValueOfIncrementalFinish(&state);
  length += Ucs2CharLength(tmp);
  return length;
}

Upon searching, I found v8-primitive.h: https://github.com/v8/v8/blob/70bdadce8f79e9ab12b9e8972803aea708fd36e7/include/v8-primitive.h#L143

lemire commented on June 3, 2024

At a glance, this code (Utf8LengthHelper) does something like simdutf::utf16_length_from_utf8(s, strlen(s)), but much slower.
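
To make the comparison concrete, here is a rough scalar sketch of what computing the UTF-16 length of UTF-8 input amounts to. It is not simdutf's implementation, and the name utf16_length_from_utf8_scalar is hypothetical. It assumes the input is valid UTF-8: each byte that is not a continuation byte starts one code point, and a lead byte of 0xF0 or above starts a 4-byte sequence that needs a surrogate pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Counts the UTF-16 code units needed to represent valid UTF-8 input.
// Hypothetical helper for illustration only; invalid input is not detected.
std::size_t utf16_length_from_utf8_scalar(const char* data, std::size_t len) {
  std::size_t units = 0;
  for (std::size_t i = 0; i < len; ++i) {
    const std::uint8_t b = static_cast<std::uint8_t>(data[i]);
    if ((b & 0xC0) != 0x80) ++units;  // not a continuation byte: new code point
    if (b >= 0xF0) ++units;           // 4-byte sequence: one extra unit (surrogate pair)
  }
  return units;
}
```

For a null-terminated string s, the call would be utf16_length_from_utf8_scalar(s, std::strlen(s)), mirroring the simdutf call mentioned above.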

anonrig commented on June 3, 2024

I found the correct implementation in V8:

int String::Utf8Length(Isolate* v8_isolate) const {
  i::Handle<i::String> str = Utils::OpenHandle(this);
  str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
  int length = str->length();
  if (length == 0) return 0;
  i::DisallowGarbageCollection no_gc;
  i::String::FlatContent flat = str->GetFlatContent(no_gc);
  DCHECK(flat.IsFlat());
  int utf8_length = 0;
  if (flat.IsOneByte()) {
    for (uint8_t c : flat.ToOneByteVector()) {
      utf8_length += c >> 7;
    }
    utf8_length += length;
  } else {
    int last_character = unibrow::Utf16::kNoPreviousCharacter;
    for (uint16_t c : flat.ToUC16Vector()) {
      utf8_length += unibrow::Utf8::Length(c, last_character);
      last_character = c;
    }
  }
  return utf8_length;
}

lemire commented on June 3, 2024

If I had to guess, what this code does is one of two things...

  1. If the input is in a European ISO-encoded character set (e.g., Latin-1), then this function computes the size in bytes of the output as a UTF-8 string. The simdutf library only supports Unicode at this time. We may add support for non-Unicode encodings later, but not in the immediate future.
  2. If the input is UTF-16, then it effectively does simdutf::utf8_length_from_utf16. I would expect simdutf::utf8_length_from_utf16 to be much faster.
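
The two cases can be sketched in scalar C++ as follows. This is an illustration under stated assumptions, not the V8 or simdutf code: the function names are hypothetical, and the UTF-16 version assumes well-formed input in which every surrogate is part of a pair (the V8 loop above tracks the previous code unit instead).

```cpp
#include <cstddef>
#include <cstdint>

// Case 1: Latin-1 input. Bytes below 0x80 become 1 UTF-8 byte and bytes at
// or above 0x80 become 2, so the result is len plus the count of high bytes.
std::size_t utf8_length_from_latin1_scalar(const std::uint8_t* data,
                                           std::size_t len) {
  std::size_t bytes = len;
  for (std::size_t i = 0; i < len; ++i) {
    bytes += data[i] >> 7;  // adds 1 exactly when the top bit is set
  }
  return bytes;
}

// Case 2: UTF-16 input, assumed well-formed (hypothetical helper).
std::size_t utf8_length_from_utf16_scalar(const char16_t* data,
                                          std::size_t len) {
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < len; ++i) {
    const std::uint16_t c = static_cast<std::uint16_t>(data[i]);
    if (c < 0x80) {
      bytes += 1;  // ASCII
    } else if (c < 0x800) {
      bytes += 2;  // two-byte UTF-8 sequence
    } else if ((c & 0xF800) == 0xD800) {
      bytes += 2;  // each half of a surrogate pair: 2 + 2 = 4 bytes in total
    } else {
      bytes += 3;  // rest of the Basic Multilingual Plane
    }
  }
  return bytes;
}
```

The Latin-1 loop performs the same arithmetic as the one-byte branch of the V8 function quoted above (length plus the sum of c >> 7).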

lemire commented on June 3, 2024

@anonrig In our main branch, we have full support for Latin1. So simdutf::utf8_length_from_latin1 is available, along with optimized transcoding functions. A release should follow soon.

I am closing this issue. I invite anyone interested in this feature to review our API and comment if needed.
