
dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Home Page: https://dkpro.github.io/dkpro-c4corpus

License: Apache License 2.0

Java 96.76% Python 2.97% HTML 0.27%

dkpro-c4corpus's Issues

inconsistent package hierarchy and groupId

The project has been released with groupId

org.dkpro.c4corpus

But is still using the old package hierarchy i.e.

de.tudarmstadt.ukp.dkpro.c4corpus

This should be fixed; it causes confusion when referencing classes inside the artifacts.

Text normalization too aggressive?

The text normalization in Utils.normalize() seems pretty heavy-handed for something which is irreversible and non-optional. Additionally, it's not computationally expensive, so it can easily be done by downstream consumers if they want that level of normalization.

On the flip side, if one were going to normalize that heavily, you'd probably also want to do Unicode normalization and output one of the canonical/compatibility forms such as NFKC.

Perhaps this could all be packaged up into a small set of utility methods which are made available, but not run on the base corpus.
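For illustration, a minimal sketch of such an opt-in utility, assuming plain java.text.Normalizer is sufficient (the class and method names here are hypothetical, not part of the project):

    import java.text.Normalizer;

    /** Hypothetical opt-in helper: heavy normalization left to downstream consumers. */
    public final class TextNormalization {

        private TextNormalization() {
            // static utility, no instances
        }

        /** Applies Unicode compatibility normalization (NFKC) to the input text. */
        public static String toNfkc(String text) {
            return Normalizer.normalize(text, Normalizer.Form.NFKC);
        }
    }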

HTML entities not decoded

Comparing these two files:

  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt
  • /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt

It appears that the Python program is dropping &nbsp; entities, but not decoding some others such as &lt;. The gold standard doesn't include any HTML entities, naturally. I'd argue that the correct approach is to decode all HTML entities and convert them to their equivalent Unicode characters, even though this is different from what the original Python program did.
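For reference, decoding everything to Unicode is essentially a one-liner with Jsoup, which the Java implementation already uses (the helper name is mine):

    import org.jsoup.parser.Parser;

    /** Decodes every HTML entity to its Unicode character, e.g. "&nbsp;" -> U+00A0, "&lt;" -> "<". */
    static String decodeEntities(String rawText) {
        // second argument: false = decode as text content, not as an attribute value
        return Parser.unescapeEntities(rawText, false);
    }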

Add example of reading processed data

Into dkpro-c4corpus-hadoop; a word count example over the processed C4Corpus; how to select only particular languages and licenses (maybe with an intro on how to spin up a Hadoop cluster on EMR).

Update path to CommonCrawl in documentation

As announced on the CC mailing list, CC is moving within AWS:

For users of the data, this means that the path to access any data in the corpus, from https or S3, is modified because the data has been moved to a new bucket (location) on AWS S3. Going forward, all Common Crawl data is accessible below https://commoncrawl.s3.amazonaws.com/ or s3://commoncrawl/.
For the next few weeks, the entire corpus will be available at both the old and new locations. During this time, all links on the Common Crawl website that point to datasets in the corpus will be updated to point to the new location.
This group will receive a reminder of this change and notification when the paths to the previous location are no longer active.
The first new dataset shared at the new location is the April crawl (s3://commoncrawl/crawl-data/CC-MAIN-2016-18/). Detail on the crawl archive of April 2016 is posted here on the Common Crawl blog. (Please note that the April crawl is not available at the old location.)

WARCFileWriter throws IOException if file already exists

Method createSegment() should create a new segment (file) and not overwrite the existing one; however, this is not the case on S3.

This should be updated:

FSDataOutputStream fsStream = (progress == null) ?
                fs.create(path, false) :
                fs.create(path, progress);

as fs.create(path, false) sometimes throws:

Error: java.io.IOException: File already exists:s3://ukp-research-data/c4corpus/cc-phase1out-2016-07/part-r-00000.seg-00000.warc.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:634)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:912)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:893)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:790)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:182)
    at de.tudarmstadt.ukp.dkpro.c4corpus.hadoop.io.WARCFileWriter.createSegment(WARCFileWriter.java:152)
...
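
One possible direction, sketched only (the helper name is hypothetical; this is not the project's actual fix): keep overwrite=false, but move on to the next segment number when the create fails because the object already exists.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Try increasing segment numbers until a create with overwrite=false succeeds,
    // so an existing S3 object is never overwritten. A real implementation would
    // distinguish "file already exists" from other IOExceptions instead of
    // swallowing them all.
    static FSDataOutputStream createNextSegment(FileSystem fs, Path base)
            throws IOException
    {
        for (int segment = 0; segment < 100_000; segment++) {
            Path candidate = base.suffix(String.format(".seg-%05d.warc.gz", segment));
            try {
                return fs.create(candidate, false);
            }
            catch (IOException alreadyExists) {
                // fall through and try the next segment number
            }
        }
        throw new IOException("No free segment name found for " + base);
    }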

Phase4 Deduplication broken?

I'm only going by inspection here, but I don't think this actually works.

The mappers take two input streams and tag them with their source (0=WARC, 1=dupe list) then output to the reducer using a composite key consisting of the WARC ID plus the source.

The reducer looks at how many values a key has and only outputs the document if the number of values is 1. This would work if the key was just the WARC ID, but since it includes the source tag as well, it's actually going to add the dummy documents to the output stream rather than deleting the duplicate documents.

I've reorganized the data flow and collapsed phases 1, 2, 3.1, 3.2, 3.3, 3.4 into a single job which generates both text-only WARCs and a list of duplicate document IDs, and was converting Phase 4 into my new phase 2 when I ran across this.

If you're open to adopting my new workflow (when it's ready to be reviewed), this can probably be deferred, but if you're going to use the existing pipeline for a while, someone should have a look and see if I'm confused or this is a real problem.
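For concreteness, a rough sketch of how I'd expect the reduce side to look; the value tags and types are hypothetical, not the project's actual code:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Key on the WARC ID alone. Values are tagged by the mapper: "0\t<record>"
    // from the WARC stream, "1" from the duplicate-ID list. A document is
    // emitted only if no duplicate marker arrived for its WARC ID.
    public class DedupReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text warcId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            Text document = null;
            boolean duplicate = false;
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("1")) {
                    // marker from the duplicate-ID list
                    duplicate = true;
                }
                else {
                    // WARC record; strip the "0\t" tag and copy (Hadoop reuses the Text object)
                    document = new Text(v.substring(2));
                }
            }
            if (document != null && !duplicate) {
                context.write(warcId, document);
            }
        }
    }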

Phase 1 loses English documents with License=none

Mitigating the single reducer bottleneck in de.tudarmstadt.ukp.dkpro.c4corpus.hadoop.full.Phase1FullJob caused no output to be written for "Lic_none_Lang_en".

See output in bucket s3://ukp-research-data/c4corpus/cc-phase1out-2015-11

Questions on statistics

I've been trying to wrap my head around the overall process and understand the associated numbers. The questions below are things that I can't figure out:

  • Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?
  • The Phase 1 stats are inconsistent. The text says 22 hours, but the pasted log says 10.5 hrs.
  • The Phase 1 log says there were 34901 map tasks, which is suspiciously close to the number of files in the CC-MAIN-2016-07 crawl, not the 2015-48 crawl. Are these stats for a different crawl than the others?
  • Phase 1 mapper output is 1.1 billion records, which is significantly lower than the 1.73B (or 1.82B) URLs listed for the crawl. That seems like too big a difference to be accounted for by content type filters (or is my perception wrong?). Is it known what factors contribute to this delta?
  • The paper says that there were only ~1% duplicates in the Common Crawl, but the Phase 2 reducer (exact duplicates filter) appears to have only output 39% of the input records (i.e. it filtered 60%+). Am I misunderstanding the stats, or is this the actual number of exact duplicates?
  • The Phase 1 stats seem to indicate that a significant amount (40%) of time was spent in the shuffle phase, but it doesn't look like the reducer actually does anything. Could Phase 1 be implemented as a map only job? Conversely, could Phase 1 & Phase 2 be merged so that the reducer actually does useful work?
  • The Phase 3 Step 3 stats for Tuples Creation (36 hrs, 7104 normalized instance hours) seem to indicate that very few instances were used for this phase. Is that an accurate observation? Would more instances reduce the elapsed time?
  • Are there stats on how many near-duplicate documents were eliminated in Phase 3/4?

Thanks for any answers/insights you can offer!

Character encoding issues in boilerplate processing

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.
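If the corruption happens on the write side, the fix may be as simple as being explicit about the charset when the evaluation output is written; a sketch only (hypothetical method, not the project's code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Write the cleaned text as UTF-8 explicitly instead of relying on the
    // platform default encoding, which corrupts characters like "®" and curly quotes.
    static void writeCleanedText(String cleanedText, String outputFile) throws IOException {
        Path out = Paths.get(outputFile);
        Files.write(out, cleanedText.getBytes(StandardCharsets.UTF_8));
    }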

Apache Commons projects are versioned separately...

... thus, controlling them via a single property in the POM doesn't seem to make too much sense:

        <!-- Apache Commons version should be consistent with the one used in hadoop -->
        <commons.version>2.4</commons.version>

Boilerplate removal header post processing incorrect

The conditional here is wrong:
https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350
causing the algorithm to attempt to reclassify non-headings, not just headings. The conditionals inverted just to save a little indentation whitespace make my head hurt and are error-prone, so I'd recommend using straightforward logic that matches the algorithm description, i.e. in this case, instead of:

        if (!(paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad"))) {
            continue;
        }

use

        if (paragraph.isHeading() && paragraph.getClassType().equalsIgnoreCase("bad")
                && !paragraph.getContextFreeClass().equalsIgnoreCase("bad")) {

The current code goes pathologically wrong in the case of documents with a large number of empty elements (45,000 "paragraphs", a large number of which were consecutive <br> elements in the example I looked at). In this case the 200-character distance limit never gets reached to trigger the loop exit, causing O(n!) processing of 45,000 elements.

This suggests a couple other possible improvements:

  • compress runs of more than 2 <br> elements (see the sketch after this list)
  • introduce a maximum-element-count distance limit in addition to the maximum-character-count limit
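
A rough sketch of the first suggestion, assuming Jsoup (which the Java implementation already uses); it deliberately ignores text nodes sitting between the <br> tags:

    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Remove every <br> whose two nearest preceding element siblings are also <br>,
    // i.e. collapse any run of 3+ <br> elements down to 2. Note that
    // previousElementSibling() skips text nodes, so runs separated only by text
    // would also be collapsed; a fuller version would check for non-blank text
    // between the tags.
    static void collapseBrRuns(Document doc) {
        List<Element> extras = new ArrayList<>();
        for (Element br : doc.select("br")) {
            Element prev1 = br.previousElementSibling();
            Element prev2 = (prev1 == null) ? null : prev1.previousElementSibling();
            if (prev1 != null && prev2 != null
                    && "br".equals(prev1.tagName()) && "br".equals(prev2.tagName())) {
                extras.add(br);
            }
        }
        for (Element extra : extras) {
            extra.remove();
        }
    }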

O(n!) processing in tag name/path for Paragraph in dedupe code

Attempts to process this segment:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz

stall between 7k and 8k records when they encounter a deeply nested tag structure that triggers the O(n!) complexity in the tree-depth processing of Paragraph.getPath(Node).

The document is pathological in that it is nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.

Make Java JusText implementation match Python and/or document differences

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

  • header postprocessing implemented incorrectly (#36)
  • <select> elements not automatically tagged as boilerplate as described in the algorithm description at http://corpus.tools/wiki/Justext/Algorithm
  • HTML entity decoding not done (#30)
  • min/max lengths implemented as doubles instead of integers (probably doesn't affect the output, but it seems an unnecessary deviation)
  • <textarea> is marked as ignorable rather than a block level separator
  • block level tags are computed using JSoup's Element.isBlock() method, rather than the list of tags defined by the jusText algorithm resulting in a different tag set being used for paragraph splitting. The sets have substantial overlap and JSoup's may be better, but I'm not sure the difference is anything other than arbitrary.
  • <br><br> handling is different

Bottom line - what's implemented is not the JusText algorithm as documented.

Clarify license for Java JusText implementation

The source file headers mention an original author, but make no mention of what license the "found code" was under. It would appear that the code was derived from https://github.com/duongphuhiep/justext/tree/master/JusText/src/main/java/dh/tool/justext but that repository doesn't include any license declaration, which effectively means that it's copyrighted and unusable unless a separate license or clearance was obtained.

Was a compatible license provided by the original author? If so, could a statement to that effect please be added to the relevant source files?

SimHash slicing algorithm incorrect & inefficient

The current implementation will never output the top 16-bit slice of the simhash. It also computes the remaining slices incorrectly, but that's less serious since the computations are consistent, so the comparisons aren't affected.

Given input 0X0800040002000100L the current algorithm will generate

[0_{8}, 1_{8}, 2_{8}]

when it should generate:

[0_{8}, 1_{9},  2_{10}, 3_{11}]

It would actually be much more efficient (and easier to understand) if it switched the Hadoop type to Long instead of Text and just generated:

[0X0000000000000100L,
 0X0000000002000000L,
 0X0000040000000000L,
 0X0800000000000000L]

This would also speed up sorting and comparisons, particularly for the more common cases where many bits are set and the text strings become very long and inefficient to compare.
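
To make that concrete, a minimal sketch of the long-based slicing described above (the method name is mine, not the project's):

    // Produce all four 16-bit slices, each kept in its original bit position and
    // emitted as a long. For 0x0800040002000100L this yields exactly the four
    // values listed above.
    static long[] sliceSimHash(long simHash) {
        long[] slices = new long[4];
        for (int i = 0; i < 4; i++) {
            slices[i] = simHash & (0xFFFFL << (16 * i));
        }
        return slices;
    }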

Limit charset detection to first 8k bytes

I thought I had already reported this, but apparently not. Currently the character set detection uses all the bytes that it reads from the input stream. If called with a stream, ICU limits itself to the first 8K bytes, because that should be enough to determine what the character encoding is, but if it's handed a buffer instead, it uses the entire thing. For very large documents, this is inefficient without adding any accuracy.
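
A sketch of how the buffer path could impose the same limit, assuming ICU4J's CharsetDetector (which the paragraph above refers to):

    import java.util.Arrays;

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    // Hand the detector at most the first 8 KB of the buffer; more data does not
    // improve accuracy but slows detection down for very large documents.
    static String detectCharset(byte[] bytes) {
        byte[] head = (bytes.length <= 8192) ? bytes : Arrays.copyOf(bytes, 8192);
        CharsetDetector detector = new CharsetDetector();
        detector.setText(head);
        CharsetMatch match = detector.detect();
        return (match != null) ? match.getName() : null;
    }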

SimHash returning 32-bit results, not 64-bits

Although the code and paper suggest that 64-bit hashes are being used, the Java Object.hashCode() function only returns 32 bits. The good news is that the bug in #19 has no effect since the upper 16-bits are always 0 (or perhaps all 1s, depending on sign extension effects).

The bad news is that because bits 32-47 are either all zero (or perhaps evenly divided between all zero & all one), I suspect all (or at least half) of the documents will end up being clustered together, making for a very expensive O(n^2) comparison.
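
For illustration, one way to get genuinely 64-bit per-token hashes without new dependencies is a hand-rolled FNV-1a, sketched below (not the project's code; any solid 64-bit hash would do):

    // 64-bit FNV-1a over the token's chars; could replace the 32-bit Object.hashCode()
    // so that all 64 SimHash bit positions actually carry information.
    static long hash64(CharSequence token) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (int i = 0; i < token.length(); i++) {
            hash ^= token.charAt(i);
            hash *= 0x100000001b3L;               // FNV-1a 64-bit prime
        }
        return hash;
    }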

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.
