Comments (6)
I've done a few more experiments, which made the situation a bit clearer to me.
I applied the attached patch to my local copy. This patch allows CLD2 to report
any chunk size up to 1 GiB, instead of being artificially limited to 64 KiB.
Then I ran it over ~56,000 web pages (~4 GB of text). The patch eliminated all
the cases where one chunk ended before the next one started. I also tried
running the stock CLD2, but ignoring the 'length' field and instead treating
each chunk as ending at the point where the next chunk begins. It turned out
that both of these approaches (patched CLD2, and stock CLD2 + ignoring length
field) produced identical output on my test corpus, which I take as evidence
that the artificially limited length field is the *only* reason we see "gaps" in
the middle of the chunk output.
So as a workaround, my plan (and recommendation to any other users who stumble
across this) is: (1) If the first chunk begins after position 0, then pretend
there's an extra chunk covering positions 0 through <first chunk.offset> with
language tag "un". (2) Ignore the length fields in all cases; they can only
mislead you. Instead look at the "offset" fields, and treat each chunk as
running from its offset until the offset of the next chunk (or the end of the
file for the last chunk).
It would be nice if these changes were made directly in CLD2, though, to avoid
the need for such workarounds.
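
For anyone who wants to apply the workaround described above, here is a minimal sketch in Python. It is not part of CLD2 or of any binding: the `Chunk` and `Span` types and the `chunks_to_spans` function are hypothetical names made up for illustration. It assumes the chunks arrive sorted by offset, that offsets are byte offsets into the input buffer, and (my own assumption, not covered above) that an input with no chunks at all should be reported as a single "un" span.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    """One chunk as reported by the detector (hypothetical stand-in type)."""
    offset: int   # byte offset where the chunk starts
    length: int   # reported length; deliberately ignored below
    lang: str     # language tag, e.g. "en"


class Span(NamedTuple):
    """A half-open byte range [start, end) with a language tag."""
    start: int
    end: int
    lang: str


def chunks_to_spans(chunks: List[Chunk], total_len: int) -> List[Span]:
    """Rebuild gap-free spans from chunk offsets, ignoring the length fields."""
    if not chunks:
        # Assumption: no chunks means the whole buffer is "unknown".
        return [Span(0, total_len, "un")] if total_len > 0 else []

    spans: List[Span] = []

    # Step (1): cover any leading bytes before the first chunk with "un".
    if chunks[0].offset > 0:
        spans.append(Span(0, chunks[0].offset, "un"))

    # Step (2): each chunk runs until the next chunk's offset,
    # or until the end of the input for the last chunk.
    for i, chunk in enumerate(chunks):
        end = chunks[i + 1].offset if i + 1 < len(chunks) else total_len
        spans.append(Span(chunk.offset, end, chunk.lang))

    return spans
```

For example, two chunks starting at offsets 3 and 20 in a 40-byte input become three spans covering 0-3 ("un"), 3-20, and 20-40, whatever the length fields say.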
Original comment by [email protected]
on 4 Jul 2014 at 2:30
Attachments:
from cld2.
This seems like a reasonable request to me in principle. My first guess was
that the limit comes from using 'int' as the buffer length in
DetectLanguageSummaryV2, but CLD2 defines int to be int32, so that can't be it.
The motivation may simply have been saving space, I don't know.
Most likely this is just a use case that hasn't been prevalent enough to be a
problem. I think the proposed patch is entirely reasonable. I'll ping Dick and
see if he has any objections to putting this in. I don't.
Original comment by [email protected]
on 25 Jul 2014 at 10:08
PS - Thank you for taking the time to make and upload a patch.
Original comment by [email protected]
on 25 Jul 2014 at 10:08
You're welcome, and hope it helps :-)
Note that the patch only implements one half of my suggestion (stopping the
spans from being truncated too early), not the other half (inserting an "un"
span at the beginning of files that begin with punctuation/whitespace).
Original comment by [email protected]
on 26 Jul 2014 at 9:52
I have pinged Dick about this and I believe (though I can't speak for him
directly) that he's also in support of this. Hopefully we'll get this fixed
shortly. Thanks again for the report.
Original comment by [email protected]
on 27 Oct 2014 at 8:42
Fixed in svn revisions 170-176. The ResultChunk output now covers all the bytes
of the input buffer, with the byte length field now increased to 32 bits and
the endpoints explicitly covered. Thank you for finding this.
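
One way to check this property on your own data is sketched below in Python. The `covers_buffer` helper is a hypothetical name, not a CLD2 function, and it assumes each chunk has been reduced to an `(offset, length)` pair of byte counts, listed in output order. It simply verifies that the chunks tile the input from byte 0 to the end with no gaps or overlaps.

```python
from typing import Iterable, Tuple


def covers_buffer(chunks: Iterable[Tuple[int, int]], total_len: int) -> bool:
    """Return True if the (offset, length) chunks tile [0, total_len) exactly."""
    pos = 0
    for offset, length in chunks:
        # Each chunk must start exactly where the previous one ended.
        if offset != pos:
            return False
        pos += length
    # The last chunk must end exactly at the end of the input.
    return pos == total_len
```

With output from before the fix, a chunk whose length was capped near 64 KiB would fail this check as soon as the next chunk started later than the capped end.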
Original comment by [email protected]
on 28 Oct 2014 at 9:13
- Changed state: Fixed