
CLD2 result chunk vector omits portions of input file

denik commented on July 27, 2024

CLD2 result chunk vector omits portions of input file

from cld2.

Comments (6)

GoogleCodeExporter commented on July 27, 2024

I've done a few more experiments, which made the situation a bit clearer to me.

I applied the attached patch to my local copy. This patch allows CLD2 to report 
any chunk size up to 1 GiB, instead of being artificially limited to 64 KiB. 
Then I ran it over ~56,000 web pages (~4 GB of text). The patch eliminated all 
the cases where one chunk ended before the next one started. I also tried 
running the stock CLD2, but ignoring the 'length' field and instead treating 
each chunk as ending at the point where the next chunk begins. It turned out 
that both of these approaches (patched CLD2, and stock CLD2 + ignoring length 
field) produced identical output on my test corpus, which I take as evidence 
that the artificially limited length field is the *only* reason we see "gaps" in 
the middle of the chunk output.
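
For anyone who wants to check their own output for this symptom, here is a 
minimal sketch of a gap finder. It does not call CLD2 itself: `Chunk` is a 
hypothetical stand-in for one entry of the result chunk vector, and 
`find_gaps` and `total_bytes` are names made up for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for one CLD2 result chunk (not the real struct)."""
    offset: int  # byte offset where the chunk starts
    length: int  # reported byte length of the chunk
    lang: str    # language code, e.g. "en"


def find_gaps(chunks: List[Chunk], total_bytes: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges of the input that no chunk covers."""
    gaps = []
    pos = 0  # first byte not yet covered by any chunk
    for chunk in sorted(chunks, key=lambda c: c.offset):
        if chunk.offset > pos:
            # The previous chunk ended before this one started.
            gaps.append((pos, chunk.offset))
        pos = max(pos, chunk.offset + chunk.length)
    if pos < total_bytes:
        # The last chunk stopped short of the end of the input.
        gaps.append((pos, total_bytes))
    return gaps


# Made-up data: the first chunk's reported length stops well short of the
# second chunk's offset, and nothing covers bytes 0 through 9.
example = [Chunk(10, 65535, "en"), Chunk(100000, 500, "fr")]
print(find_gaps(example, 100500))  # [(0, 10), (65545, 100000)]
```

An empty result means the chunks cover the whole input with no holes.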

So as a workaround, my plan (and recommendation to any other users who stumble 
across this) is: (1) If the first chunk begins after position 0, then pretend 
there's an extra chunk covering positions 0 through <first chunk.offset> with 
language tag "un". (2) Ignore the length fields in all cases; they can only 
mislead you. Instead look at the "offset" fields, and treat each chunk as 
running from its offset until the offset of the next chunk (or the end of the 
file for the last chunk).
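
The two rules above can be sketched as follows. This is only an illustration 
under stated assumptions, not code from CLD2 or from the attached patch: the 
input is a list of hypothetical `(offset, lang)` pairs already sorted by 
offset, and `Span`, `spans_from_offsets`, and `total_bytes` are names invented 
here. The handling of an empty chunk list is also my own assumption.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A contiguous byte range [start, end) tagged with a language code."""
    start: int
    end: int
    lang: str


def spans_from_offsets(chunks: List[Tuple[int, str]], total_bytes: int) -> List[Span]:
    """Rebuild gap-free spans from chunk offsets, ignoring reported lengths.

    `chunks` holds (offset, lang) pairs sorted by offset.
    """
    if not chunks:
        # Assumption (not from the thread): with no chunks at all, treat the
        # whole input as unknown.
        return [Span(0, total_bytes, "un")] if total_bytes > 0 else []

    spans = []
    first_offset = chunks[0][0]
    if first_offset > 0:
        # Rule (1): pretend there is an "un" chunk before the first real one.
        spans.append(Span(0, first_offset, "un"))

    for i, (offset, lang) in enumerate(chunks):
        # Rule (2): a chunk runs until the next chunk's offset, or to the end
        # of the input if it is the last chunk.
        end = chunks[i + 1][0] if i + 1 < len(chunks) else total_bytes
        spans.append(Span(offset, end, lang))
    return spans


for span in spans_from_offsets([(10, "en"), (100000, "fr")], 100500):
    print(span)
# Span(start=0, end=10, lang='un')
# Span(start=10, end=100000, lang='en')
# Span(start=100000, end=100500, lang='fr')
```

Because each span ends exactly where the next one starts, the result covers 
every byte of the input by construction.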

It would be nice if these changes were made directly in CLD2, though, to avoid 
the need for such workarounds.

Original comment by [email protected] on 4 Jul 2014 at 2:30

Attachments:

bigchunks.diff

GoogleCodeExporter commented on July 27, 2024

This seems like a reasonable request to me in principle. My first guess was 
that the limit comes from using 'int' as the buffer length in 
DetectLanguageSummaryV2, but CLD2 defines int to be int32, so that can't be it. 
The motivation may simply have been saving space, I don't know.

Most likely this is just a use case that hasn't been prevalent enough to be a 
problem. I think the proposed patch is entirely reasonable. I'll ping Dick and 
see if he has any objections to putting this in. I don't.

Original comment by [email protected] on 25 Jul 2014 at 10:08

GoogleCodeExporter commented on July 27, 2024

PS - Thank you for taking the time to make and upload a patch.

Original comment by [email protected] on 25 Jul 2014 at 10:08

GoogleCodeExporter commented on July 27, 2024

You're welcome, and hope it helps :-)

Note that the patch only implements one half of my suggestion (stopping the 
spans from being truncated too early), not the other half (inserting an "un" 
span at the beginning of files that begin with punctuation/whitespace).

Original comment by [email protected] on 26 Jul 2014 at 9:52

GoogleCodeExporter commented on July 27, 2024

I have pinged Dick about this and I believe (though I can't speak for him 
directly) that he also supports it. Hopefully we'll get this fixed shortly. 
Thanks again for the report.

Original comment by [email protected] on 27 Oct 2014 at 8:42

GoogleCodeExporter commented on July 27, 2024

Fixed in svn revisions 170-176. The ResultChunk output now covers all the bytes 
of the input buffer, with the byte length field now increased to 32 bits and 
the endpoints explicitly covered. Thank you for finding this.

Original comment by [email protected] on 28 Oct 2014 at 9:13

Changed state: Fixed
