Comments (6)

GoogleCodeExporter commented on July 27, 2024
I've done a few more experiments, which made the situation a bit clearer to me.

I applied the attached patch to my local copy. This patch allows CLD2 to report 
any chunk size up to 1 GiB, instead of being artificially limited to 64 KiB. 
Then I ran it over ~56,000 web pages (~4 GB of text). The patch eliminated all 
the cases where one chunk ended before the next one started. I also tried 
running the stock CLD2, but ignoring the 'length' field and instead treating 
each chunk as ending at the point when the next chunk begins. It turned out 
that both of these approaches (patched CLD2, and stock CLD2 + ignoring length 
field) produced identical output on my test corpus, which I take as evidence 
that the artificially limited length field is the *only* reason we see "gaps" 
in the middle of the chunk output.

So as a workaround, my plan (and recommendation to any other users who stumble 
across this) is: (1) If the first chunk begins after position 0, then pretend 
there's an extra chunk covering positions 0 through <first chunk.offset> with 
language tag "un". (2) Ignore the length fields in all cases; they can only 
mislead you. Instead look at the "offset" fields, and treat each chunk as 
running from its offset until the offset of the next chunk (or the end of the 
file for the last chunk).
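
A minimal Python sketch of the workaround described above, for anyone who gets the chunk data out of CLD2 through some binding. The `Chunk` tuple, the `chunks_to_spans` name, and the `total_bytes` parameter are made up for illustration and are not part of the CLD2 API; the only assumption about CLD2 is that each chunk carries an offset, a length, and a language code.

```python
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for one CLD2 result chunk."""
    offset: int   # byte offset where the chunk starts
    length: int   # reported length; deliberately ignored below
    lang: str     # language code, e.g. "en"


def chunks_to_spans(chunks: List[Chunk], total_bytes: int) -> List[Tuple[int, int, str]]:
    """Return (start, end, lang) spans, end exclusive, derived from offsets only."""
    if not chunks:
        # Assumption: with no chunks there is nothing to report.
        return []

    ordered = sorted(chunks, key=lambda c: c.offset)
    spans: List[Tuple[int, int, str]] = []

    # (1) If the first chunk starts after position 0, cover the gap with "un".
    if ordered[0].offset > 0:
        spans.append((0, ordered[0].offset, "un"))

    # (2) Ignore the length field: each chunk runs until the next chunk's
    # offset, and the last chunk runs to the end of the input.
    for i, chunk in enumerate(ordered):
        end = ordered[i + 1].offset if i + 1 < len(ordered) else total_bytes
        spans.append((chunk.offset, end, chunk.lang))

    return spans
```

For example, given a first chunk at offset 3 whose reported length stops at 65535 and a second chunk at offset 100000, the first span is extended to end at 100000 rather than at 3 + 65535, and an "un" span is added for bytes 0 through 3.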

It would be nice if these changes were made directly in CLD2, though, to avoid 
the need for such workarounds. 

Original comment by [email protected] on 4 Jul 2014 at 2:30

Attachments:

from cld2.

GoogleCodeExporter commented on July 27, 2024
This seems like a reasonable request to me in principle. My first guess was 
that the limit comes from using 'int' as the buffer length in 
DetectLanguageSummaryV2, but CLD2 defines int to be int32, so that can't be it. 
The motivation may simply have been saving space; I don't know.

Most likely this is just a use case that hasn't been prevalent enough to be a 
problem. I think the proposed patch is entirely reasonable. I'll ping Dick and 
see if he has any objections to putting this in. I don't.

Original comment by [email protected] on 25 Jul 2014 at 10:08

GoogleCodeExporter commented on July 27, 2024
PS - Thank you for taking the time to make and upload a patch.

Original comment by [email protected] on 25 Jul 2014 at 10:08

GoogleCodeExporter commented on July 27, 2024
You're welcome, and hope it helps :-)

Note that the patch only implements one half of my suggestion (stopping the 
spans from being truncated too early), not the other half (inserting an "un" 
span at the beginning of files that begin with punctuation/whitespace).

Original comment by [email protected] on 26 Jul 2014 at 9:52

GoogleCodeExporter commented on July 27, 2024
I have pinged Dick about this and I believe (though I can't speak for him 
directly) that he's also in support of this. Hopefully we'll get this fixed 
shortly. Thanks again for the report.

Original comment by [email protected] on 27 Oct 2014 at 8:42

GoogleCodeExporter commented on July 27, 2024
Fixed in svn revisions 170-176. The ResultChunk output now covers all the bytes 
of the input buffer, with the byte length field now increased to 32 bits and 
the endpoints explicitly covered. Thank you for finding this.

Original comment by [email protected] on 28 Oct 2014 at 9:13

  • Changed state: Fixed
