
dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Home Page: https://dkpro.github.io/dkpro-c4corpus

License: Apache License 2.0

Java 96.76% Python 2.97% HTML 0.27%

dkpro-c4corpus's Issues

inconsistent package hierarchy and groupId

The project has been released with groupId

org.dkpro.c4corpus

But is still using the old package hierarchy i.e.

de.tudarmstadt.ukp.dkpro.c4corpus

This should be fixed; it causes confusion when referencing classes inside the artifacts.

Text normalization too aggressive?

The text normalization in Utils.normalize() seems pretty heavy-handed for something which is irreversible and non-optional. Additionally, it's not computationally expensive, so it can easily be done by downstream consumers if they want that level of normalization.

On the flip side, if one were going to normalize that heavily, you'd probably also want to do Unicode normalization and output one of the canonical/compatibility forms such as NFKC.

Perhaps this could all be packaged up into a small set of utility methods which are made available, but not run on the base corpus.
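For illustration, a minimal sketch of such an opt-in utility, assuming plain java.text.Normalizer is sufficient (the class and method names here are hypothetical, not part of the project):

    import java.text.Normalizer;

    /** Hypothetical opt-in helper: heavy normalization left to downstream consumers. */
    public final class TextNormalization {

        private TextNormalization() {
            // static utility, no instances
        }

        /** Applies Unicode compatibility normalization (NFKC) to the input text. */
        public static String toNfkc(String text) {
            return Normalizer.normalize(text, Normalizer.Form.NFKC);
        }
    }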

HTML entities not decoded

Comparing these two files:

  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt

It appears that the Python program is dropping &nbsp; entities, but not decoding some others such as &lt;. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities and convert them to their equivalent Unicode characters, even though this is different from what the original Python program did.
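For reference, decoding everything to Unicode is essentially a one-liner with Jsoup, which the Java implementation already uses (the helper name is mine):

    import org.jsoup.parser.Parser;

    /** Decodes every HTML entity to its Unicode character, e.g. "&nbsp;" -> U+00A0, "&lt;" -> "<". */
    static String decodeEntities(String rawText) {
        // second argument: false = decode as text content, not as an attribute value
        return Parser.unescapeEntities(rawText, false);
    }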

Add example of reading processed data

Into dkpro-c4corpus-hadoop; a word count example over the processed C4Corpus; how to select only particular languages and licenses (maybe with an intro on how to spin up a Hadoop cluster on EMR).

Update path to CommonCrawl in documentation

As announced on the CC mailing list, CC is moving within AWS:

For users of the data, this means that the path to access any data in the corpus, from https or S3, is modified because the data has been moved to a new bucket (location) on AWS S3. Going forward, all Common Crawl data is accessible below https://commoncrawl.s3.amazonaws.com/ or s3://commoncrawl/.
For the next few weeks, the entire corpus will be available at both the old and new locations. During this time, all links on the Common Crawl website that point to datasets in the corpus will be updated to point to the new location.
This group will receive a reminder of this change and notification when the paths to the previous location are no longer active.
The first new dataset shared at the new location is the April crawl (s3://commoncrawl/crawl-data/CC-MAIN-2016-18/). Detail on the crawl archive of April 2016 is posted here on the Common Crawl blog. (Please note that the April crawl is not available at the old location.)

WARCFileWriter throws IOException if file already exists

Method createSegment() should create a new segment (file) and not overwrite the existing one; however, this is not the case on S3.

This should be updated:

FSDataOutputStream fsStream = (progress == null) ?
                fs.create(path, false) :
                fs.create(path, progress);

as fs.create(path, false) sometimes throws:

Error: java.io.IOException: File already exists:s3://ukp-research-data/c4corpus/cc-phase1out-2016-07/part-r-00000.seg-00000.warc.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:634)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:912)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:893)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:790)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:182)
    at de.tudarmstadt.ukp.dkpro.c4corpus.hadoop.io.WARCFileWriter.createSegment(WARCFileWriter.java:152)
...
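
One possible direction, sketched only (the helper name is hypothetical; this is not the project's actual fix): keep overwrite=false, but move on to the next segment number when the create fails because the object already exists.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Try increasing segment numbers until a create with overwrite=false succeeds,
    // so an existing S3 object is never overwritten. A real implementation would
    // distinguish "file already exists" from other IOExceptions instead of
    // swallowing them all.
    static FSDataOutputStream createNextSegment(FileSystem fs, Path base)
            throws IOException
    {
        for (int segment = 0; segment < 100_000; segment++) {
            Path candidate = base.suffix(String.format(".seg-%05d.warc.gz", segment));
            try {
                return fs.create(candidate, false);
            }
            catch (IOException alreadyExists) {
                // fall through and try the next segment number
            }
        }
        throw new IOException("No free segment name found for " + base);
    }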

Phase4 Deduplication broken?

I'm only going by inspection here, but I don't think this actually works.

The mappers take two input streams and tag them with their source (0=WARC, 1=dupe list) then output to the reducer using a composite key consisting of the WARC ID plus the source.

The reducer looks at how many values a key has and only outputs the document if the number of values is 1. This would work if the key was just the WARC ID, but since it includes the source tag as well, it's actually going to add the dummy documents to the output stream rather than deleting the duplicate documents.

I've reorganized the data flow and collapsed phases 1, 2, 3.1, 3.2, 3.3, 3.4 into a single job which generates both text-only WARCs and a list of duplicate document IDs, and was converting Phase 4 into my new phase 2 when I ran across this.

If you're open to adopting my new workflow (when it's ready to be reviewed), this can probably be deferred, but if you're going to use the existing pipeline for a while, someone should have a look and see if I'm confused or this is a real problem.
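For concreteness, a rough sketch of how I'd expect the reduce side to look; the value tags and types are hypothetical, not the project's actual code:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Key on the WARC ID alone. Values are tagged by the mapper: "0\t<record>"
    // from the WARC stream, "1" from the duplicate-ID list. A document is
    // emitted only if no duplicate marker arrived for its WARC ID.
    public class DedupReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text warcId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            Text document = null;
            boolean duplicate = false;
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("1")) {
                    // marker from the duplicate-ID list
                    duplicate = true;
                }
                else {
                    // WARC record; strip the "0\t" tag and copy (Hadoop reuses the Text object)
                    document = new Text(v.substring(2));
                }
            }
            if (document != null && !duplicate) {
                context.write(warcId, document);
            }
        }
    }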

Phase 1 loses English documents with License=none

Mitigating the single reducer bottleneck in de.tudarmstadt.ukp.dkpro.c4corpus.hadoop.full.Phase1FullJob caused no output to be written for "Lic_none_Lang_en".

See output in bucket s3://ukp-research-data/c4corpus/cc-phase1out-2015-11

Questions on statistics

I've been trying to wrap my head around the overall process and understand the associated numbers. The questions below are things that I can't figure out:

  • Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?
  • The Phase 1 stats are inconsistent. The text says 22 hours, but the pasted log says 10.5 hrs.
  • The Phase 1 log says there were 34901 map tasks, which is suspiciously close to the number of files in the CC-MAIN-2016-07 crawl, not the 2015-48 crawl. Are these stats for a different crawl than the others?
  • Phase 1 mapper output is 1.1 billion records, which is significantly lower than the 1.73B (or 1.82B) URLs listed for the crawl. That seems like too big a difference to be accounted for by content type filters (or is my perception wrong?). Is it known what factors contribute to this delta?
  • The paper says that there were only ~1% duplicates in the Common Crawl, but the Phase 2 reducer (exact duplicates filter) appears to have only output 39% of the input records (i.e. it filtered 60%+). Am I misunderstanding the stats, or is this the actual number of exact duplicates?
  • The Phase 1 stats seem to indicate that a significant amount (40%) of time was spent in the shuffle phase, but it doesn't look like the reducer actually does anything. Could Phase 1 be implemented as a map only job? Conversely, could Phase 1 & Phase 2 be merged so that the reducer actually does useful work?
  • The Phase 3 Step 3 stats for Tuples Creation (36 hrs, 7104 normalized instance hours) seem to indicate that very few instances were used for this phase. Is that an accurate observation? Would more instances reduce the elapsed time?
  • Are there stats on how many near-duplicate documents were eliminated in Phase 3/4?

Thanks for any answers/insights you can offer!

Character encoding issues in boilerplate processing

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.
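If the corruption happens on the write side, the fix may be as simple as being explicit about the charset when the evaluation output is written; a sketch only (hypothetical method, not the project's code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Write the cleaned text as UTF-8 explicitly instead of relying on the
    // platform default encoding, which corrupts characters like "®" and curly quotes.
    static void writeCleanedText(String cleanedText, String outputFile) throws IOException {
        Path out = Paths.get(outputFile);
        Files.write(out, cleanedText.getBytes(StandardCharsets.UTF_8));
    }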

Apache Commons projects are versioned separately...

... thus, controlling them via a single property in the POM doesn't seem to make too much sense:

        <!-- Apache Commons version should be consistent with the one used in hadoop -->
        <commons.version>2.4</commons.version>

Boilerplate removal header post processing incorrect

The conditional here is wrong:
https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350
causing the algorithm to attempt to reclassify non-headings, not just headings. The conditionals inverted just to save a little indentation whitespace make my head hurt and are error-prone, so I'd recommend using straightforward logic that matches the algorithm description, i.e. in this case, instead of:

        if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
            continue;
        }

use

        if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {

The current code goes pathologically wrong in the case of documents with a large number of empty elements (45,000 "paragraphs", a large number of which were consecutive <br> elements in the example I looked at). In this case the 200-character distance limit never gets reached to trigger the loop exit, causing O(n!) processing of 45,000 elements.

This suggests a couple other possible improvements:

  • compress runs of more than 2 <br> elements (see the sketch after this list)
  • introduce a maximum-element-count distance limit in addition to the maximum-character-count limit
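
A rough sketch of the first suggestion, assuming Jsoup (which the Java implementation already uses); it deliberately ignores text nodes sitting between the <br> tags:

    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Remove every <br> whose two nearest preceding element siblings are also <br>,
    // i.e. collapse any run of 3+ <br> elements down to 2. Note that
    // previousElementSibling() skips text nodes, so runs separated only by text
    // would also be collapsed; a fuller version would check for non-blank text
    // between the tags.
    static void collapseBrRuns(Document doc) {
        List<Element> extras = new ArrayList<>();
        for (Element br : doc.select("br")) {
            Element prev1 = br.previousElementSibling();
            Element prev2 = (prev1 == null) ? null : prev1.previousElementSibling();
            if (prev1 != null && prev2 != null
                    && "br".equals(prev1.tagName()) && "br".equals(prev2.tagName())) {
                extras.add(br);
            }
        }
        for (Element extra : extras) {
            extra.remove();
        }
    }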

O(n!) processing in tag name/path for Paragraph in dedupe code

Attempts to process this segment:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz

stall between 7k and 8k records when they encounter a deeply nested tag structure that triggers the O(n!) complexity in the tree-depth processing of Paragraph.getPath(Node).

The document is pathological in that it is nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.

Make Java JusText implementation match Python and/or document differences

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

  • header postprocessing implemented incorrectly (#36)
  • <select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
  • HTML entity decoding not done (#30)
  • min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
  • <textarea> is marked as ignorable rather than a block level separator
  • block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
  • <br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

Clarify license for Java JusText implementation

The source file headers mention an original author, but make no mention of what license the "found code" was under. It would appear that the code was derived from https://github.com/duongphuhiep/justext/tree/master/JusText/src/main/java/dh/tool/justext but that repository doesn't include any license declaration, which effectively means that it's copyrighted and unusable unless a separate license or clearance was obtained.

Was a compatible license provided by the original author? If so, could a statement to that effect please be added to the relevant source files?

SimHash slicing algorithm incorrect & inefficient

The current implementation will never output the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that's less serious since the computations are consistent, so the comparisons aren't affected.

Given input 0X0800040002000100L the current algorithm will generate

[0_{8}, 1_{8}, 2_{8}]

when it should generate:

[0_{8}, 1_{9},  2_{10}, 3_{11}]

It would actually be much more efficient (and easier to understand) if it switched the Hadoop type to Long instead of Text and just generated:

[0X0000000000000100L,
 0X0000000002000000L,
 0X0000040000000000L,
 0X0800000000000000L]

This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.
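
To make that concrete, a minimal sketch of the long-based slicing described above (the method name is mine, not the project's):

    // Produce all four 16-bit slices, each kept in its original bit position and
    // emitted as a long. For 0x0800040002000100L this yields exactly the four
    // values listed above.
    static long[] sliceSimHash(long simHash) {
        long[] slices = new long[4];
        for (int i = 0; i < 4; i++) {
            slices[i] = simHash & (0xFFFFL << (16 * i));
        }
        return slices;
    }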

Limit charset detection to first 8k bytes

I thought I had already reported this, but apparently not. Currently the character set detection uses all the bytes that it reads from the input stream. If called with a stream, ICU limits itself to the first 8K bytes, because that should be enough to determine what the character encoding is, but if it's handed a buffer instead, it uses the entire thing. For very large documents, this is inefficient without adding any accuracy.
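
A sketch of how the buffer path could impose the same limit, assuming ICU4J's CharsetDetector (which the paragraph above refers to):

    import java.util.Arrays;

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    // Hand the detector at most the first 8 KB of the buffer; more data does not
    // improve accuracy but slows detection down for very large documents.
    static String detectCharset(byte[] bytes) {
        byte[] head = (bytes.length <= 8192) ? bytes : Arrays.copyOf(bytes, 8192);
        CharsetDetector detector = new CharsetDetector();
        detector.setText(head);
        CharsetMatch match = detector.detect();
        return (match != null) ? match.getName() : null;
    }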

SimHash returning 32-bit results, not 64-bits

Although the code and paper suggest that 64-bit hashes are being used, the Java Object.hashCode() function only returns 32 bits. The good news is that the bug in #19 has no effect since the upper 16-bits are always 0 (or perhaps all 1s, depending on sign extension effects).

The bad news is that because bits 32-47 are either all zero (or perhaps evenly divided between all zero & all one), I suspect all (or at least half) of the documents will end up being clustered together, making for a very expensive O(n^2) comparison.
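
For illustration, one way to get genuinely 64-bit per-token hashes without new dependencies is a hand-rolled FNV-1a, sketched below (not the project's code; any solid 64-bit hash would do):

    // 64-bit FNV-1a over the token's chars; could replace the 32-bit Object.hashCode()
    // so that all 64 SimHash bit positions actually carry information.
    static long hash64(CharSequence token) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (int i = 0; i < token.length(); i++) {
            hash ^= token.charAt(i);
            hash *= 0x100000001b3L;               // FNV-1a 64-bit prime
        }
        return hash;
    }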

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.
