cc_net's Introduction

cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should be possible on macOS too.

  1. Create or symlink a data folder pointing to where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler, you can also run pip install .[getpy]; it provides a more memory-efficient hash set.

  4. Install the following tools manually if make install fails:

Training Language Models

The Makefile is used to train SentencePiece models and LMs on Wikipedia data.

  • make help shows help
  • make lang=de lm trains a SentencePiece model and an LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lm downloads all of them

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects languages, runs the LM and splits by lang/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4GB

Each step needs the previous step to finish before it can start. You can launch the full pipeline using python -m cc_net.

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts the run to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field in mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses the configuration from the given config file (see the sketch below)
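
For reference, here is a minimal sketch of what such a config file could contain and how to launch the pipeline with it. It assumes you only want Portuguese data from the 2019-09 snapshot; the field names mirror mine.Config as it appears elsewhere on this page, and the values are purely illustrative.

import json
import subprocess
from pathlib import Path

# Hypothetical config: mine only Portuguese pages from the 2019-09 snapshot.
# Field names follow mine.Config; the values are examples, not recommendations.
config = {
    "dump": "2019-09",
    "num_shards": 9,
    "lang_whitelist": ["pt"],
    "lm_languages": ["pt"],
    "mine_num_processes": 8,
    "output_dir": "data",
    "execution": "auto",
}

Path("config").mkdir(exist_ok=True)
Path("config/my_config.json").write_text(json.dumps(config, indent=4))

# Equivalent to: python -m cc_net --config config/my_config.json
subprocess.run(["python", "-m", "cc_net", "--config", "config/my_config.json"], check=True)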

Reproducing our work

Given the CPU time required to run the full pipeline on such a big corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper with:

python -m cc_net --conf reproduce --dump 2019-09

Extract XLM-R data

The models in Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) were trained on data extracted by an internal version of cc_net.

Because the format is slightly different, please use the following command instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computation cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit will default to spawning processes on your machine if no Slurm cluster is found. You should tweak --task_parallelism to something adapted to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.
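
If you are new to submitit, here is a small, self-contained sketch (not cc_net's actual scheduling code) of the pattern it relies on: an AutoExecutor submits an array of tasks to Slurm when a cluster is available and otherwise falls back to local processes. The function, folder and parallelism value below are placeholders.

import submitit

def word_count(text: str) -> int:
    # Stand-in for a real pipeline step such as hashing one shard.
    return len(text.split())

executor = submitit.AutoExecutor(folder="data/logs")
# Roughly analogous to --task_parallelism: caps how many array tasks run at once on Slurm.
executor.update_parameters(timeout_min=60, slurm_array_parallelism=4)
jobs = executor.map_array(word_count, ["common crawl snapshot", "web crawl data"])
print([job.result() for job in jobs])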

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars
  • nlines: number of lines
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language score
  • perplexity: perplexity of a LM trained on Wikipedia

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11,
}

You can peek at those files using the UNIX tools zcat and jq, e.g.: zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocessing support for more complex operations, such as LM scoring of documents.
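
If you prefer to stay in Python, here is a minimal sketch using only the standard library to iterate over the documents of a mined shard; the file path is just the example from above and the perplexity threshold is arbitrary.

import gzip
import json

# Example path from above; substitute any file produced by the regroup step.
path = "data/mined/2019-09/en_head_0000.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)        # one JSON object per line
        if doc["perplexity"] < 300:   # e.g. keep only low-perplexity documents
            print(doc["url"], doc["language_score"])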

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.

cc_net's People

Contributors

gwenzek


cc_net's Issues

Variance of hash file sizes in newer crawls

Hello,
I noticed that the hash files I've produced from the dump of January 21 (and several other months in 2020) are much smaller (about 100x) than the hashes from the dumps of April and May 2019, even though the original wet files were the same size.

In both cases there are 2 shards per hash file and all the other parameters are the same.

Trying to understand why, thanks :)

Doing hashing, mining and regroup from each bin order

Hi,

Sorry, I think I might have missed something in the docs about this, but I am trying to get the JSON files as soon as the hash bin files are created, instead of waiting for all the hash bins to be downloaded before the mining is done. Is there a way to go about it?

Thanks in advance,
Aswin

Extracting text from the wet format

I have already downloaded the files. How do I extract the text from them? The script does the downloading and extraction together, but I only want the extraction part. How should I handle this?

When using Odoo 16.0 in PyCharm, this error shows up

Traceback (most recent call last):
File "E:\odoo 16\odoo source code\16.0\PyPDF2_utils.py", line 53, in
from typing import TypeAlias # type: ignore[attr-defined]
ImportError: cannot import name 'TypeAlias' from 'typing' (C:\Users\HP\AppData\Local\Programs\Python\Python39\lib\typing.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:\odoo 16\odoo source code\16.0\odoo-16.0\odoo-bin", line 5, in
import odoo
File "E:\odoo 16\odoo source code\16.0\odoo-16.0\odoo_init_.py", line 75, in
import PyPDF2
File "E:\odoo 16\odoo source code\16.0\PyPDF2_init_.py", line 12, in
from ._encryption import PasswordType
File "E:\odoo 16\odoo source code\16.0\PyPDF2_encryption.py", line 34, in
from ._utils import logger_warning
File "E:\odoo 16\odoo source code\16.0\PyPDF2_utils.py", line 55, in
from typing_extensions import TypeAlias
ModuleNotFoundError: No module named 'typing_extensions'

Process finished with exit code 1

ModuleNotFoundError: No module named 'typing_extensions'

After running make install, I was getting

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data1/alferre/cc_net/cc_net/__main__.py", line 11, in <module>
    import cc_net.mine
  File "/data1/alferre/cc_net/cc_net/mine.py", line 29, in <module>
    from cc_net import dedup, execution, jsonql, perplexity, process_wet_file
  File "/data1/alferre/cc_net/cc_net/dedup.py", line 25, in <module>
    from cc_net import jsonql
  File "/data1/alferre/cc_net/cc_net/jsonql.py", line 50, in <module>
    from typing_extensions import Literal, Protocol

Running pip install typing_extensions fixed it. So this package is probably missing from the setup.py.

Error: Mining phase failure

Hello everyone, I'm having problems running the mining phase. I'm using a computer with 60GB of RAM and 16 CPU cores. When running the mining phase I get the error below.

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/raphael_assis4347/.local/lib/python3.8/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 631, in main
    all_files = mine(conf)
  File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 341, in mine
    ex(_mine_shard, repeat(conf), hashes_files, *_transpose(missing_outputs))
  File "/home/raphael_assis4347/raphael/cc_net/cc_net/execution.py", line 200, in custom_map_array
    raise Exception(message)
Exception: 9 / 9 jobs failed while running _mine_shard
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('data'), mined_dir='mined_data', execution='auto', num_shards=9, num_segments_per_shard=750, metadata=None, min_len=300, hash_in_mem=9, lang_whitelist=['pt'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=['head', 'middle'], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/raphael_assis4347/raphael/cc_net/cc_net/data/cutoff.csv'), lm_languages=['pt'], mine_num_processes=9, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=PosixPath('data/wet_cache'))
Submitting 9 jobs for _mine_shard, with task_parallelism=16
Waiting on 9 running jobs. Job ids: 69378,69397,69400,69419...
Failed job 69516 (1 / 9): Job 69516 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69516_0_result.pkl
has not produced any output (state: FINISHED)
Error stream produced:
----------------------------------------
2022-09-14 16:43 INFO 69535:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fd7313c8d00>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fd7313c8df0>, <cc_net.perplexity.MultiSentencePiece object at 0x7fd7313c8d30>, <cc_net.perplexity.DocLM object at 0x7fd7313c8e50>, <cc_net.perplexity.PerplexityBucket object at 0x7fd7313c8d60>, <cc_net.perplexity.DropKeys object at 0x7fd7313c8fa0>]

Waiting on 8 running jobs. Job ids: 69378,69397,69400,69419...
Failed job 69400 (2 / 9): Job 69400 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69400_0_result.pkl
has not produced any output (state: FINISHED)
Error stream produced:
----------------------------------------
2022-09-14 16:43 INFO 69418:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fcee923a970>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fcee923aa60>, <cc_net.perplexity.MultiSentencePiece object at 0x7fcee923a9a0>, <cc_net.perplexity.DocLM object at 0x7fcee923aac0>, <cc_net.perplexity.PerplexityBucket object at 0x7fcee923a9d0>, <cc_net.perplexity.DropKeys object at 0x7fcee923ac10>]

I couldn't identify the problem by looking at the logs. The process's .log.err file only contains the vector of pipeline objects. Does anyone have any idea what it could be?

This is my configuration file:

{
    "dump": "2019-09",
    "hash_in_mem": 9,
    "num_shards": 9,
    "mine_num_processes": 9,
    "num_segments_per_shard": 750,
    "lang_whitelist": ["pt"],
    "lm_languages": ["pt"],
    "pipeline": [
        "dedup",
        "lid",
        "keep_lang",
        "sp",
        "lm",
        "pp_bucket",
        "drop",
        "split_by_lang"
    ],
    "execution": "auto",
    "output_dir": "data",
    "mined_dir": "mined_data",
    "target_size": "4G",
    "keep_bucket": ["head", "middle"],
    "cache_dir": "data/wet_cache"
}

make dl_all_lm failing

When I run the command to download all of the language models:
make dl_all_lm
I can't run it; it fails with the response:
make: *** No rule to make target 'dl_all_lm'. Stop.

Also, I am able to download the English language model (using make lang=en dl_lm), but the perplexity calculation doesn't work.

Any ideas?

EOFError: Compressed file ended before the end-of-stream marker was reached

Hi there, I was trying to run the code with the MPExecutor but got the following error:

2020-07-23 20:44 INFO 156:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz [200]
2020-07-23 20:44 INFO 156:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512054.0/wet/CC-MAIN-20171211014442-20171211034442-00400.warc.wet.gz
2020-07-23 20:48 INFO 171:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz [200]
2020-07-23 20:48 INFO 171:HashesCollector - Processed 2_915 documents in 0.078h ( 10.4 doc/s).
2020-07-23 20:48 INFO 171:HashesCollector - Found 0k unique hashes over 522 lines. Using 0.1GB of RAM.
multiprocessing.pool.RemoteTraceback: 

Traceback (most recent call last):
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/cc_net-master/cc_net/execution.py", line 145, in global_fn
    return f(*args[1:])
  File "/home/cc_net-master/cc_net/mine.py", line 233, in _hashes_shard
    file=conf.get_cc_shard(shard),
  File "/home/cc_net-master/cc_net/jsonql.py", line 449, in run_pipes
    for res in results:
  File "/home/cc_net-master/cc_net/jsonql.py", line 296, in map
    for x in source:
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 199, in __iter__
    for doc in parse_warc_file(iter(f), self.min_len):
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 117, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 89, in group_by_docs
    for warc in warc_lines:
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 493, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cc_net-master/cc_net/__main__.py", line 24, in <module>
    main()
  File "/home/cc_net-master/cc_net/__main__.py", line 20, in main
    func_argparse.parse_and_call(parser)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/home/cc_net-master/cc_net/mine.py", line 524, in main
    regroup(conf)
  File "/home/cc_net-master/cc_net/mine.py", line 379, in regroup
    mine(conf)
  File "/home/cc_net-master/cc_net/mine.py", line 272, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/home/cc_net-master/cc_net/mine.py", line 221, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/home/cc_net-master/cc_net/execution.py", line 174, in __call__
    global_fn, zip(itertools.repeat(f_name), *args)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
EOFError: Compressed file ended before the end-of-stream marker was reached

and the config I changed in mine.py is as follows:

    config_name: str = "base"
    dump: str = "2017-51"
    output_dir: Path = Path("data")  
    execution: str = "mp"
    num_shards: int = 800
    num_segments_per_shard: int = -1
    min_len: int = 300
    hash_in_mem: int = 25
    lang_whitelist: Sequence[str] = ["zh"]
    lang_blacklist: Sequence[str] = []
    lang_threshold: float = 0.5
    lm_dir: Path = Path("data/lm_sp")
    cutoff: Path = CUTOFF_CSV
    lm_languages: Optional[Sequence[str]] = ["zh"]
    mine_num_processes: int = 10
    target_size: str = "2G"
    cleanup_after_regroup: bool = True
    task_parallelism: int = 500
    pipeline: Sequence[str] = []
    experiments: Sequence[str] = []

I searched for this error and the answers all say it is caused by an incompletely downloaded file, but I saw your code comment in the jsonql.py function open_remote_file: "Download the files at the given url to memory and opens it as a file". How can I delete these incomplete downloads from memory? Or is there any other solution to fix this error?

By the way, the environment I was running the code in is a Docker container with Ubuntu 20.04.

Using cc_net on Windows 10

Is there a tutorial for installing and using this project on Windows 10?

requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url

When I execute:
python -m cc_net --dump 2019-13

Here is the full log.err:

2023-05-10 08:56 INFO 259781:cc_net.jsonql - preparing [<cc_net.minify.MetadataFetcher object at 0x7f6b262a5d60>, <cc_net.jsonql.where object at 0x7f6b262a5b20>, <cc_net.jsonql.where object at 0x7f6b262a5d30>]
2023-05-10 08:56 INFO 259781:cc_net.jsonql - Opening /tmp/wet_2019-09.paths.gz with mode 'rt'
2023-05-10 08:56 INFO 259781:root - Starting download of https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz
/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py:1102: UserWarning: Swallowed error 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz while downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz (1 out of 3)
  warnings.warn(
/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py:1102: UserWarning: Swallowed error 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz while downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz (2 out of 3)
  warnings.warn(
2023-05-10 08:57 INFO 259781:split - Processed 0 documents in 0.017h (  0.0 doc/s).
2023-05-10 08:57 INFO 259781:split - Found 0 splits.
2023-05-10 08:57 INFO 259781:MetadataFetcher - Processed 0 documents in 0.017h (  0.0 doc/s).
2023-05-10 08:57 INFO 259781:MetadataFetcher - Read 0, stocking 0 doc in 0.1g.
2023-05-10 08:57 INFO 259781:where - Selected 0 documents out of 0 ( 0.0%)
2023-05-10 08:57 INFO 259781:where - Selected 0 documents out of 0 ( 0.0%)
submitit ERROR (2023-05-10 08:57:54,174) - Submitted job triggered an exception
2023-05-10 08:57 ERROR 259781:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/mine.py", line 432, in _mine_shard
    jsonql.run_pipes(
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 277, in map
    for x in source:
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/process_wet_file.py", line 206, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/process_wet_file.py", line 199, in open_segment
    return jsonql.open_remote_file(url, cache=file)
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
    raise e
  File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
    r.raise_for_status()
  File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz

Early exit when desired number of documents is reached?

Apologies if this is mentioned somewhere or is otherwise obvious, but:

Is there a way to early-exit when a desired number of documents have been collected? Say I only wanted 1 million documents, can I somehow exit the call to python cc_net mine once I have hit that number?

Thanks a lot in advance.

support of Hausa

Thanks for your contribution to the community. I am wondering whether cc_net contains the Hausa language (ISO id: ha/hau)? In the XLM-R paper, Table 6 mentions that Hausa was included in CCNet. However, I didn't find the language code for Hausa in the dumped files or in the fastText LID documentation.

ChunkedEncodingError & ConnectionResetError

Here's the log with command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

2019-11-12 00:26 INFO 22835:HashesCollector - Processed 519_187 documents in 1e+01h ( 14.4 doc/s).
2019-11-12 00:26 INFO 22835:HashesCollector - Found 25_229k unique hashes over 90_967 lines. Using 3.6GB of RAM.
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Kept 43_340 documents over 45_437 (95.4%).
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Parsed 13 / 35 files. Estimated remaining time: 9.2h
2019-11-12 00:27 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz (1 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 01:16 INFO 22835:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz [200]
2019-11-12 01:16 INFO 22835:HashesCollector - Processed 562_527 documents in 1.1e+01h ( 14.4 doc/s).
2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Kept 43_268 documents over 45_427 (95.2%).
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Parsed 14 / 35 files. Estimated remaining time: 17.7h
2019-11-12 01:17 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (1 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (2 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 02:11 INFO 22835:HashesCollector - Processed 605_794 documents in 1.2e+01h ( 14.3 doc/s).
2019-11-12 02:11 INFO 22835:HashesCollector - Found 0k unique hashes over 106_217 lines. Using 3.7GB of RAM.
Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 501, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 31, in <module>
    main()
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 27, in main
    command(**parsed_args)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 512, in main
    regroup(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 364, in regroup
    mine(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 257, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 206, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/data/myusername/projects/cc_net/cc_net/execution.py", line 128, in debug_executor
    message = function(*x)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 218, in _hashes_shard
    file=conf.get_cc_shard(shard),
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 448, in run_pipes
    for res in results:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 295, in map
    for x in source:
  File "/data/myusername/projects/cc_net/cc_net/process_wet_file.py", line 198, in __iter__
    with jsonql.open_remote_file(self.segment_url(segment)) as f:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1151, in open_remote_file
    content = io.BytesIO(request_get_content(url))
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1136, in request_get_content
    raise e
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1129, in request_get_content
    r = requests.get(url)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
    r.content
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? If I don't have a C++17 compiler, how much memory do I need?
Thanks a lot.

cc_net/tools/dl_cc_100.py fails to extract complete dataset

python3.7 cc_net/tools/dl_cc_100.py --outdir data/cc100 --processes 96 provides only 99GB (277GB uncompressed) of data across 10 languages:

780M    /mnt/data/cc100/bn_IN
2.0G    /mnt/data/cc100/hi_IN
25G     /mnt/data/cc100/id_ID
12G     /mnt/data/cc100/ko_KR
89M     /mnt/data/cc100/my_MM
25G     /mnt/data/cc100/sv_SE
270M    /mnt/data/cc100/sw_KE
6.7G    /mnt/data/cc100/th_TH
475M    /mnt/data/cc100/tl_XX
21G     /mnt/data/cc100/vi_VN

The script should provide all 100 languages listed in Figure 1 of https://arxiv.org/pdf/1911.02116.pdf.


Running on local files

Hi,
Is it possible to mine and analyze local wet files, without downloading from AWS?

Thanks!

Inquiries about utilizing Common Crawl snapshots collected in 2022

In the paper, it is stated that CCNet conducted the study with the "common crawl snapshot in February 2019" dataset.
I want to use the Common Crawl data snapshots collected after 2022.
Is it also possible to classify Common Crawl data collected after 2022 by language using the CCNet github code?

requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url

The dev branch fails to start the _mine_shard stage; it times out and raises the following exception even with parallelism=1:

python3 -m cc_net --config reproduce --dump 2019-09 --task_parallelism 1                                                                                                                               
Will run cc_net.mine.main with the following config: Config(config_name='reproduce', dump='2019-09', output_dir=PosixPath('data'), mined_dir='reproduce', execution='local', num_shards=1600, num_segments_per_shard=-1, metadata='https://dl
.fbaipublicfiles.com/cc_net/1.0.0', min_len=300, hash_in_mem=50, lang_whitelist=[], lang_blacklist=[], lang_threshold=0.5, lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/zbr/awork/cc_net/cc_net/data/cutoff.csv'), lm_languages=No
ne, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=1, pipeline=['fetch_metadata', 'split_by_lang'], experiments=[], cache_dir=None)                                                                   
Submitting 1600 jobs for _mine_shard, with parallelism=1                                                                                                                                                                                     
Waiting on 1 running jobs. Job ids: 17305                                                                                                                                                                                                    
Failed job 17305 (1 / 1600): Job (task=0) failed during processing with trace:
----------------------                                                                                                
multiprocessing.pool.RemoteTraceback:                                                                                 
"""                                                                                                                   
Traceback (most recent call last):                                                                                                                                                                                                           
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker                                                                                                                                                                     
    result = (True, func(*args, **kwds))                                                                                                                                                                                                     
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar                                                                                                                                                                     
    return list(map(*args))                                                                                           
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer                                                                                                                                                           
    return _GLOBAL_TRANSFORMER(document)                                                                                                                                                                                                     
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__                                               
    y = self.do(x)                                                                                                                                                                                                                           
  File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do                                                                                                                                                                            
    self.fetch_metadata(doc["cc_segment"])                                                                                                                                                                                                   
  File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata                                         
    for m in jsonql.read_jsons(meta_file):                                                                            
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons                                             
    lines = open_read(file)                                                                                           
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read                                              
    return open_remote_file(filename)                                                                                 
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file                                                                                                                                                             
    raw_bytes = request_get_content(url)                                                                                                                                                                                                     
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content                                                                                                                                                          
    raise e                                                                                                           
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content                                   
    r.raise_for_status()                                                                                                                                                                                                                     
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz

Meanwhile, that file can be downloaded in parallel using wget or other tools, so it is not related to network issues.
Last commit id: 9a0d5c2

The end of the log shows:

2020-10-12 20:39 INFO 17351:root - Downloaded https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz [200] took 9s (420.6kB/s)
2020-10-12 20:39 INFO 17351:JsonReader - Processed 30_532 documents in 0.0026h (3278.9 doc/s).
2020-10-12 20:39 INFO 17351:MetadataFetcher - Loaded 30532 metadatas from https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz
submitit WARNING (2020-10-13 01:26:39,791) - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 WARNING 17307:submitit - Caught signal 10 on gpurnd14: this job is timed-out.
2020-10-13 01:26 INFO 17307:submitit - Job not requeued because: timed-out and not checkpointable.
2020-10-13 01:26 INFO 17307:MetadataFetcher - Processed 0 documents in 5.0h (  0.0 doc/s).
2020-10-13 01:26 INFO 17307:MetadataFetcher - Read 0, stocking 0 doc in 0.3g.
submitit ERROR (2020-10-13 01:26:39,797) - Submitted job triggered an exception
2020-10-13 01:26 ERROR 17307:submitit - Submitted job triggered an exception
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 476, in _global_transformer
    return _GLOBAL_TRANSFORMER(document)
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 246, in __call__
    y = self.do(x)
  File "/home/zbr/awork/cc_net/cc_net/minify.py", line 164, in do
    self.fetch_metadata(doc["cc_segment"])
  File "/home/zbr/awork/cc_net/cc_net/minify.py", line 146, in fetch_metadata
    for m in jsonql.read_jsons(meta_file):
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 485, in read_jsons
    lines = open_read(file)
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 942, in open_read
    return open_remote_file(filename)
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1117, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1094, in request_get_content
    raise e
  File "/home/zbr/awork/cc_net/cc_net/jsonql.py", line 1088, in request_get_content
    r.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://dl.fbaipublicfiles.com/cc_net/1.0.0/2019-09/CC-MAIN-20190215183319-20190215205319-00026.json.gz

How can this be debugged, and what are the next steps to build a dataset?

Questions about the stats.json configuration file

I want to crawl the latest 2023-06 snapshot data; how do I configure my stats.json?
I notice that the JSON file has two fields, size and checksum. How do I define the values of these two fields, or how do I obtain them?

Inquiries about korean datasets utilized in the CCNet pipeline

While studying data pipelines, I found CCNet. CCNet is very intriguing to me. I'm going to use CCNet to create a better data pipeline for Korean datasets.
I have a question. In the paper, it is stated that CCNet conducted a study with the "Feb. 2019 snapshot of Common Crawl" dataset.
I wonder how many Korean datasets are in that dataset.
In the paper, the size of the datasets in table 6 is written as the size after data preprocessing. I wonder if the data preprocessing is only deduplication. Also, I'm curious about the size of the Korean dataset before the data preprocessing. If you share the size of the Korean dataset, it will be of great help to me who is conducting research using CCNet.

Error: Job not requeued because: timed-out and not checkpointable.

When I execute:

python -m cc_net -l fa

It throws the following exception:

  File "/usr/local/Cellar/[email protected]/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 502, in readinto
    n = self.fp.readinto(b)
  File "/usr/local/Cellar/[email protected]/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/[email protected]/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/[email protected]/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
  File "/usr/local/lib/python3.8/site-packages/submitit/core/job_environment.py", line 185, in checkpoint_and_try_requeue
    raise utils.UncompletedJobError(message)
submitit.core.utils.UncompletedJobError: Job not requeued because: timed-out and not checkpointable.

Here is the full log.err:

2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 1 / 16000 files. Estimated remaining time: 177.9h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Kept 41_939 documents over 44_039 (95.2%).
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 2 / 16000 files. Estimated remaining time: 147.7h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
submitit WARNING (2021-01-20 22:08:54,313) - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 WARNING 20945:submitit - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 INFO 20945:submitit - Job not requeued because: timed-out and not checkpointable.
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
2021-01-20 22:08 WARNING 20956:submitit - Bypassing signal 15
2021-01-20 22:08 WARNING 20957:submitit - Bypassing signal 15
2021-01-20 22:08 INFO 20945:Classifier - Processed 0 documents in 0.025h (  0.0 doc/s).
2021-01-20 22:08 INFO 20945:Classifier - Kept 0 docs over 0 (0.0%)
2021-01-20 22:08 INFO 20945:Classifier - Found 0 language labels: {}
2021-01-20 22:08 INFO 20945:where - Selected 0 documents out of 0 ( 0.0%)
submitit ERROR (2021-01-20 22:08:54,541) - Submitted job triggered an exception
2021-01-20 22:08 ERROR 20945:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 851, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in submitit_main
    process_job(args.folder)
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 58, in process_job
    raise error
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 47, in process_job

Cannot download the precomputed files

Hi

I am trying to reproduce the results from your paper. However, after downloading the Common Crawl data from AWS, access to the precomputed files seems to fail.

Did you change the location of the precomputed files?

The error messages look like the following:

/.local/lib/python3.7/site-packages/cc_net/jsonql.py:1141: 
UserWarning: Swallowed error HTTPSConnectionPool(host='dl.fbaipublicfiles.com', port=443): Max retries exceeded with url: 
/cc_net/2019-09/en_head_0017.json.gz (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff4d87913d0>: 
Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')) while downloading https://dl.fbaipublicfiles.com/cc_net/2019-09/en_head_0017.json.gz (2 out of 3)

"Reproducing our work" does not specify set of languages and snapshots

README.md provides python -m cc_net --config reproduce --dump 2019-09 as an example to reproduce the cc_net corpus, which relies on

cc_net/cc_net/mine.py

Lines 172 to 191 in 242e10d

REPRODUCE_CONFIG = Config(
    config_name="reproduce",
    dump="2019-09",
    mined_dir="reproduce",
    pipeline=["fetch_metadata", "keep_lang", "keep_bucket", "split_by_lang"],
    metadata="https://dl.fbaipublicfiles.com/cc_net/1.0.0",
    # Optional filtering:
    # It won't change much the execution speed, but decreases the disk requirement.
    # Restrict languages
    lang_whitelist=["fr"],
    # Restrict perplexity buckets
    # Top languages have been split in perplexity buckets according
    # to a Wikipedia trained LM.
    # The buckets from low perplexity (good) to high (bad) are:
    # ["head", "middle", "tail"]
    # Languages without a LM have only one bucket "all".
    # It won't change much the execution speed, but decreases the disk requirement.
    keep_bucket=["head", "all"],
    mine_num_processes=1,
)

The combination of dump 2019-09 and the French language provides only a small corpus. As the metadata files are only accessible via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. Thus it would be helpful if you could provide the complete list in your README.

403 forbidden while downloading

Hi there, I encountered a 403 error while trying to download cc_net data using this pipeline.
I'm wondering if this is because of the network settings on my side, or is something else wrong?
Thanks in advance.

/ldap_home/raven.ren/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try pip install cc_net[getpy]
  warnings.warn(
2022-08-23 19:25 INFO 6898:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz
2022-08-23 19:25 INFO 6898:HashesCollector - Processed 0 documents in 0.00034h ( 0.0 doc/s).
2022-08-23 19:25 INFO 6898:HashesCollector - Found 0k unique hashes over 0k lines. Using 0.1GB of RAM.
submitit ERROR (2022-08-23 19:25:23,974) - Submitted job triggered an exception
2022-08-23 19:25 ERROR 6898:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ldap_home/raven.ren/cc_net/cc_net/mine.py", line 273, in _hashes_shard
    jsonql.run_pipes(
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 195, in __iter__
    n = len(self.segments)
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 243, in segments
    segments = cc_segments(self.dump, self.cache_dir)
  File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 38, in cc_segments
    f = jsonql.open_remote_file(wet_paths, cache=wet_paths_cache)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
    raise e
  File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
    r.raise_for_status()
  File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz

Can reproduce still run normally?

hi, there
Can reproduce still run normally? When I run it, the message is
Will run cc_net.mine.main with the following config: Config(config_name='reproduce', dump='2019-09', output_dir=PosixPath('data'), mined_dir='reproduce', execution='auto', num_shards=1600, min_shard=-1, num_segments_per_shard=-1, metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', min_len=300, hash_in_mem=50, lang_whitelist=['zh'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=['head', 'all'], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/yutuan.ma/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=1, target_size='4G', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['fetch_metadata', 'keep_lang', 'keep_bucket', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting 1600 jobs for _mine_shard, with task_parallelism=64
Waiting on 64 running jobs. Job ids: 43439,43506,43573,43640...
but nothing happens; the same occurs when changing lang_whitelist to 'en'.

Batch job submission failed: Invalid job array specification

Hi, when I run "python -m cc_net", this error happened:

Submitting _hashes_shard in a job array (1600 jobs)
sbatch: error: Batch job submission failed: Invalid job array specification
subprocess.CalledProcessError: Command '['sbatch', '/data/gsw/test/cc_net/data/logs/submission_file_479eba35e148432da4432891c1191887.sh']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/gsw/test/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/data/gsw/test/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/data/gsw/test/cc_net/cc_net/mine.py", line 632, in main
    all_files = mine(conf)
  File "/data/gsw/test/cc_net/cc_net/mine.py", line 335, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/data/gsw/test/cc_net/cc_net/mine.py", line 263, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/data/gsw/test/cc_net/cc_net/execution.py", line 89, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 701, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
    first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 864, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/utils.py", line 350, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification
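A hedged diagnostic (my reading, not an upstream answer): sbatch usually rejects an array specification when the array is larger than the cluster's MaxArraySize, and cc_net submits 1600 tasks here. The snippet below only reads that limit so it can be compared against num_shards; lowering num_shards or running without Slurm are possible workarounds depending on your setup.

    # Hedged diagnostic sketch: print Slurm's MaxArraySize to compare it with
    # the 1600-task array that cc_net tries to submit.
    import subprocess

    slurm_config = subprocess.run(
        ["scontrol", "show", "config"], capture_output=True, text=True
    ).stdout
    for line in slurm_config.splitlines():
        if "MaxArraySize" in line:
            print(line.strip())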

Error when Running 2020-34 dumps

When running the full pipeline with the newest dumps (e.g. 2020-34), there seems to be an issue with the header format.

It only seems to occur on texts with a non-Latin alphabet. Because of this issue, the hashing pipeline cannot be run on some newer dumps. The last dump I could process successfully was 2020-10.

Are there any quick-fixes available for this problem?

  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 98, in group_by_docs
    parsed = parse_doc(headers, doc)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 70, in parse_doc
    logger.warning("Can't parse header:", e, headers, doc)
TypeError: not all arguments converted during string formatting
Call stack:
Traceback (most recent call last):
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 68, in parse_doc
    length = int(headers[8].split()[1])
ValueError: invalid literal for int() with base 10: 'text/plain'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
Message: "Can't parse header:"
Arguments: (ValueError("invalid literal for int() with base 10: 'text/plain'"), ['WARC/1.0', 'WARC-Type: conversion', 'WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=321000&planId=1622&trimId=147575&rd=0', 'WARC-Date: 2020-08-04T02:58:40Z', 'WARC-Record-ID: <urn:uuid:b941a87a-cb63-49f2-8fcb-792b4e90e803>', 'WARC-Refers-To: <urn:uuid:3360da81-ad19-498f-b94a-5f5e52dc5ef4>', 'WARC-Block-Digest: sha1:N2PD5RJ7SNBYO4IV27IGIPF5LO63UZQK', 'WARC-Identified-Content-Language: zho,eng', 'Content-Type: text/plain', 'Content-Length: 4476', ''], ['【扬州|沃尔沃(进口) XC90 2020款 T5四驱智行豪华版 5座】_贷款买车_零零购车|搜狐汽车', '搜狐汽车零零购车', '扬州', '首页 > 沃尔沃 XC90', '沃尔沃(进口) XC90', '2020款 T5四驱智行豪华版 5座', '更换车款', '2.0T涡轮增压 254马力', '2020款 T5四驱智行豪华版 7座', '2.0T涡轮+机械增压 310马力', '2020款 改款 T6四驱智逸豪华版 7座', '2020款 改款 T6四驱智逸运动版 7座', '2020款 改款 T6四驱智雅豪华版 7座', '2020款 改款 T6四驱智雅运动版 7座', '2020款 改款 T6四驱智尊豪华版 7座', '2.0T涡轮+机械增压 320马力', '2020款 T6四驱智逸豪华版 7座', '2020款 T6四驱智逸运动版 7座', '2020款 T6四驱智雅豪华版 7座', '2020款 T6四驱智雅运动版 7座', '2020款 T6四驱智尊豪华版 7座', '扬州 4S店均价:63.39万', '平安车管家', '一成首付 含购置税,送一年保险', '所需材料:身份证,房产证,六个月流水,还款卡', '申请资质:信用记录良好', '总花费:----元', '月供:----元', '首付--%', '----元', '月供----X--期', '月利率----%', '尾款--%', '----元', '首付:', '10%', '期限:', '48期', '立即申请', '办理时需另付保证金----万元(车款报价 X --%)总花费中不包含税费、保险。车款报价随市场行情随时波动。以上价格仅供参考,以实际合同为准。', '方案详情', '[产品优势]', '门槛低:一成首付,含购置税,送一年保险', '月供无压力:低月供,轻松还款无压力', '省时省心:一站式服务,海量新车,无忧上牌', '灵活选择:可买可退', '[套餐介绍]', '常见问题:', '1、平安车管家售后怎么样?', '与正常购车一致,在品牌方授权4S店进行维修保养', '2、提车城市?:', '免运费仅限一下提车城市:济南,南京,苏州,武汉,长沙,成都,昆明。郑州,东莞,南宁,南通。具体提车城市需根据车型售卖情况和活动决定,详情请咨询客服。', '3、关于上牌:', '合同期内,汽车上平安租赁的牌照,按照合同约定支付尾款,车辆将过户给到您。', '4、所需资料:', '10万<贷款额<=30万:二证一卡(身份证+房产证或半年以上银行流水+还款卡)', '贷款额>30万:三证一卡(身份证+房产证+半年以上银行流水+还款卡)', '备注:', '首付、月供金额仅供参考,实际贷款金额将包含购置税、保险等费用,详情可咨询购车顾问,联系电话:021-20662667', '购车流程', '提交申请', '电话回访', '提交材料', '审核通过', '提车上户', '按期还款', '常见问题', 'Q页面中的价格是如何计算的?', 'A页面中计算的首付额、月供额等信息,是以您提车城市的4S店平均报价为准计算的。此价格随市场行情随时波动,仅供参考。了解更精确的价格,您可在页面中填写您的联系方式,我们的工作人员会与您沟通。', 'Q办理分期购车有什么要求?', 'A办理分期购车要求您是18岁以上的**公民,具有一定的还款能力,且需要提交相应的证明材料。不同金融方案要求的资质不同,您可以在方案详情中查看具体的要求,找到最适合自己的金融方案。', 'Q申请贷款时,材料审核需要多久?', 'A零零购车不同的金融方案由于所需材料的不同,材料审核的时间也不同。所需材料齐全后,服务商会立即提交审核,大部分方案在2-24小时即可出审核结果,并第一时间放款。', 'Q车辆的上牌如何办理?', 'A在您的贷款申请通过审批,提车时,车辆的交税、上牌等业务会有专业的客服人员为您统一办理,让您不再被复杂的手续所困扰。', '看过XC90的还看过', '沃尔沃XC60', '月供 9254元起', '奔驰GLC级', '月供 10067元起', '奥迪Q7', '月供 17590元起', '大众途锐', '月供 16060元起', '沃尔沃V90 Cross Country', '月供 11319元起', '丰田普拉多', '月供 12617元起', '完善您的信息,车贷申请极速审核', '×', '姓名:', '请填写您的真实姓名', '手机号:', '请填写常用手机号码', '提车地:', '扬州', '请选择提车城市', '我已阅读并同意 《搜狐汽车隐私政策》', '请同意《搜狐汽车隐私政策》', '提交申请', '×', '您的申请已成功提交,我们会尽快处理!', '关于我们', 'Copyright © Sohu.com Inc. All Rights Reserved. 搜狐公司 版权所有', '免责声明 | 搜狐不良信息举报邮箱:[email protected]', '客服:业务咨询、投诉建议', '279530178', '战略合作、代理商加盟010-61134396', '周一至周五 9:30-18:30', '反馈 顶部'])
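Judging from the headers printed above, a likely cause (my reading, not an official fix) is that recent WET files carry an extra WARC-Identified-Content-Language field, so the positional lookup headers[8] in parse_doc now lands on "Content-Type: text/plain" instead of "Content-Length", which produces the int() error. A small sketch reusing the header order shown above:

    # Sketch of the failure mode, using the header order from the log above.
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        "WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=321000&planId=1622&trimId=147575&rd=0",
        "WARC-Date: 2020-08-04T02:58:40Z",
        "WARC-Record-ID: <urn:uuid:b941a87a-cb63-49f2-8fcb-792b4e90e803>",
        "WARC-Refers-To: <urn:uuid:3360da81-ad19-498f-b94a-5f5e52dc5ef4>",
        "WARC-Block-Digest: sha1:N2PD5RJ7SNBYO4IV27IGIPF5LO63UZQK",
        "WARC-Identified-Content-Language: zho,eng",  # extra field in newer dumps
        "Content-Type: text/plain",                   # what headers[8] now points at
        "Content-Length: 4476",                       # what the old code expected at index 8
    ]

    # headers[8].split()[1] == "text/plain", hence the ValueError above.
    # A position-independent lookup avoids the problem:
    length = int(next(h for h in headers if h.startswith("Content-Length")).split(": ", 1)[1])
    print(length)  # 4476

The header-dict rewrite quoted in the "Numerous Errors" report below fixes the same problem.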

CC-100 in statmt version is different from paper

Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus from statmt is very different from the description in the XLM-R paper.
For example, in the case of Esperanto, there are 157M tokens in the paper, but in the statmt version there are actually about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and transformers tokenizer (xlm-roberta-base) for double-checking.

I would guess the content of the corpora is similar (knowing that CC-100 is based on web scraping), since the file sizes are similar (about 0.9 GiB), but what makes the token counts so different?

getpy version specified in setup.py no longer available

"getpy": ["getpy @ git+https://github.com/gwenzek/[email protected]"],

but only v0.9.9-beta, v0.9.6, v0.2.0-alpha exist: https://github.com/atom-moyer/getpy/tags

This causes cc_net installation to fail:


Could not find a version that satisfies the requirement getpy@ git+https://github.com/gwenzek/[email protected] (from cc-net==1.0.0) (from versions: )
Cleaning up...
  Removing source in /tmp/pip-2x4sgis2-build
  Removing source in /tmp/pip-build-5smbzi07/fasttext
  Removing source in /tmp/pip-build-5smbzi07/kenlm
  Removing source in /tmp/pip-build-5smbzi07/psutil
  Removing source in /tmp/pip-build-5smbzi07/sacremoses
  Removing source in /tmp/pip-build-5smbzi07/sentencepiece
  Removing source in /tmp/pip-build-5smbzi07/submitit
No matching distribution found for getpy@ git+https://github.com/gwenzek/[email protected] (from cc-net==1.0.0)
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 353, in run
    wb.build(autobuilding=True)
  File "/usr/lib/python3/dist-packages/pip/wheel.py", line 749, in build
    self.requirement_set.prepare_files(self.finder)
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 380, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 554, in _prepare_file
    require_hashes
  File "/usr/lib/python3/dist-packages/pip/req/req_install.py", line 278, in populate_link
    self.link = finder.find_requirement(self, upgrade)
  File "/usr/lib/python3/dist-packages/pip/index.py", line 514, in find_requirement
    'No matching distribution found for %s' % req
pip.exceptions.DistributionNotFound: No matching distribution found for getpy@ git+https://github.com/gwenzek/[email protected] (from cc-net==1.0.0)
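One workaround (my own sketch, not an upstream fix) is to point the extra at a tag that still exists, for instance on the upstream atom-moyer/getpy repository, assuming that version builds with your compiler and is API-compatible with what cc_net expects:

    # setup.py (excerpt) -- hypothetical edit: pin the getpy extra to a tag
    # that is still published (v0.9.9-beta is one of the tags listed above).
    extras_require={
        "getpy": ["getpy @ git+https://github.com/atom-moyer/getpy@v0.9.9-beta"],
    },

Alternatively, installing without the [getpy] extra also works: as the warning at the top of this page shows, deduplication then simply uses more RAM.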

503 Server Error: Service Unavailable for url

When I use python -m cc_net to download and extract data, the connection cannot be opened:
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320299852.23/wet/CC-MAIN-20220116093137-20220116123137-00540.warc.wet.gz
process_wet_file.py already sets WET_URL_ROOT = "https://data.commoncrawl.org".
How can I solve this problem?
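A 503 from data.commoncrawl.org is usually transient (the host throttles heavy download traffic), so retrying with a backoff is a common workaround. A minimal, hedged sketch, separate from whatever retry logic cc_net itself applies:

    # Hedged sketch: fetch a segment URL, backing off and retrying on
    # server-side errors such as 503 before giving up.
    import time
    import requests

    def get_with_retries(url: str, retries: int = 5) -> bytes:
        for attempt in range(retries):
            r = requests.get(url)
            if r.status_code not in (500, 502, 503, 504):
                r.raise_for_status()  # surface other HTTP errors (403, 404, ...)
                return r.content
            time.sleep(2 ** attempt)  # exponential backoff before retrying
        r.raise_for_status()  # still failing after all retries
        return r.content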

Model finding

When specifying a language model in the config (using lm_languages:=en), the process throws an error:
OSError: Not found: "data/lm_sp/e.sp.model": No such file or directory Error #2

The code works fine when no lm_languages are specified.

I think the issue is the following line, since it only considers a single character for the model name:

{l: conf.lm_dir / f"{l}.sp.model" for l in conf.get_lm_languages()},
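A hedged guess at what is going on: if lm_languages arrives as a bare string (e.g. "en"), iterating over it in that dict comprehension yields single characters, which is exactly the "e.sp.model" path in the error. Splitting the string into a list first gives the expected behaviour; a standalone sketch (not the actual cc_net code):

    # Hypothetical sketch; in mine.py the values would come from
    # conf.get_lm_languages() and conf.lm_dir.
    from pathlib import Path

    lm_dir = Path("data/lm_sp")
    lm_languages = "en"  # a bare string coming from the command line

    if isinstance(lm_languages, str):
        lm_languages = [lang.strip() for lang in lm_languages.split(",")]

    models = {lang: lm_dir / f"{lang}.sp.model" for lang in lm_languages}
    print(models)  # {'en': PosixPath('data/lm_sp/en.sp.model')}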

Numerous Errors

Hello,

Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into multiple errors.

It seems as if the link to download from cc has changed to:
https://data.commoncrawl.org/

Some of the header names were changed as well. This fixed those errors:

        # Parse the WARC headers into a dict instead of relying on fixed
        # positions (newer dumps add fields such as WARC-Identified-Content-Language).
        headers_map = {}

        for header in headers[1:]:  # skip the WARC/1.0 version line
            if not header:
                continue
            key, value = header.split(": ", 1)
            headers_map[key] = value

        warc_type = headers_map["WARC-Type"]
        if warc_type != "conversion":
            return None
        url = headers_map["WARC-Target-URI"]
        date = headers_map["WARC-Date"]
        digest = headers_map["WARC-Block-Digest"]
        length = int(headers_map["Content-Length"])

Finally, running into this other issue:

requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/wet/CC-MAIN-20230320083513-20230320113513-00114.warc.wet.gz

I have not been able to resolve this error yet.

Any help would be greatly appreciated.

Thank you,

Enrico

Failing to use mp execution

I am trying to use the MPExecutor but I am getting the following error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/data1/alferre/cc_net/cc_net/execution.py", line 145, in global_fn
    return f(*args[1:])
  File "/data1/alferre/cc_net/cc_net/mine.py", line 347, in _mine_shard
    output=tmp_output if not conf.will_split else None,
  File "/data1/alferre/cc_net/cc_net/jsonql.py", line 435, in run_pipes
    initargs=(transform,),
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/process.py", line 110, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data1/alferre/cc_net/cc_net/__main__.py", line 24, in <module>
    main()
  File "/data1/alferre/cc_net/cc_net/__main__.py", line 20, in main
    func_argparse.parse_and_call(parser)
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/data1/alferre/cc_net/cc_net/mine.py", line 509, in main
    regroup(conf)
  File "/data1/alferre/cc_net/cc_net/mine.py", line 364, in regroup
    mine(conf)
  File "/data1/alferre/cc_net/cc_net/mine.py", line 271, in mine
    ex(_mine_shard, repeat(conf), hashes_files, *_transpose(missing_outputs))
  File "/data1/alferre/cc_net/cc_net/execution.py", line 174, in __call__
    global_fn, zip(itertools.repeat(f_name), *args)
  File "/home/alferre/anaconda3/envs/mtdev/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
AssertionError: daemonic processes are not allowed to have children

I am running the following command

python -m cc_net mine --config /home/alferre/data/cc_net/config/config_alex.json

And this is my config file:

{
    "output_dir": "/home/alferre/data/cc_net/data_alex",
    "dump": "2019-09",
    "num_shards": 1,
    "num_segments_per_shard": 1,
    "hash_in_mem": 2,
    "mine_num_processes": 4,
    "lang_whitelist": [
        "pt"
    ],
    "execution": "mp",
    "target_size": "32M",
    "cleanup_after_regroup": false
}
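For context, my reading of the traceback (not an official answer): with "execution": "mp" the shards are already dispatched from a multiprocessing pool, and "mine_num_processes": 4 then asks each worker to open a second pool inside run_pipes, which Python forbids for daemonic workers. A minimal standalone reproduction of that failure mode:

    # Standalone reproduction: a Pool worker is a daemonic process and may not
    # create children, so opening a nested Pool raises the same AssertionError.
    import multiprocessing

    def inner(x):
        return x * x

    def outer(_):
        with multiprocessing.Pool(2) as nested:  # nested pool -> AssertionError
            return nested.map(inner, range(4))

    if __name__ == "__main__":
        with multiprocessing.Pool(1) as pool:
            print(pool.map(outer, range(1)))

Reducing mine_num_processes so that no nested pool is needed (or using a different execution mode) may avoid the error, but I have not verified that against this version of the code.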

The final json files are not as expected

python3 -m cc_net --config config/test_segment.json

and finally I get:

Regrouped test_data3/mined_by_lang/2019-09/en_head_0000.json.gz (1 / 3)
Regrouped test_data3/mined_by_lang/2019-09/en_tail_0000.json.gz (2 / 3)
Regrouped test_data3/mined_by_lang/2019-09/en_middle_0000.json.gz (3 / 3)

but the JSON files do not contain the cleaned-up documents; they look like this:

{"url": "http://3stepbreath.com/shikhandin.html", "digest": "sha1:TNOFSVGSL4OE4F3JZKAMAAXW2VA5KORA", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 1.0, "perplexity": 296.6, "bucket": "head", "line_ids": "JwAoACkAKgArAA=="}
{"url": "http://911forum.org.uk/board/viewtopic.php?p=175455", "digest": "sha1:HTWRWQKQPGOAPRU3KXF6XWXUIFJIE2GE", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 0.96, "perplexity": 266.5, "bucket": "head", "line_ids": "AAABAAIAAwAIAAsADAANAA4ADwAQABEAEgATABQAFQAWABcAGAAZABoAGwAdACQAJQAmACcAKAApACoAKwAsAC0ALgAvADAAMQAyADwAPQA+AEcATABNAE4ATwBTAFcAWABZAFoAWwBcAF0AXgBfAGAAYgBjAGQAZQBmAGcAaABpAGoAawBsAG0AbgBvAHAAcQByAHMAdAB1AHYAdwB4AHoAewB8AH0AfgB/AIAAgQCCAIMAhACFAIYAhwCIAIkAigCLAJIAkwCUAJUAlgCXAJgAmQCaAJsAnACdAJ4AnwCgAKEAogC8AL0AvgC/AMAAwQDCAMMAxADFAMYAxwDIAMkAygDLAMwAzQDOAM8A0ADRANIA0wDUANUA1gDXANgA2QDaANsA3ADdAN4A3wDoAOkA6gDrAOwA9QD7APwA/QD+AAcBCAEJAQoBCwEMAQ0BEgEVAQ=="}
{"url": "http://965kvki.com/no-more-saturday-postal-service/", "digest": "sha1:M2WX6RZ3E3KISXFCB7GXCMU432RKIYWO", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 0.94, "perplexity": 251.9, "bucket": "head", "line_ids": "VABVAFgAWgBbAFwA"}

Why? I just want the clean corpus.
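For what it is worth, this looks like the metadata-only (minified) output rather than broken documents: line_ids appears to be a base64-encoded array of 16-bit line indices into the original WET document, and the clean text has to be reconstructed from the crawl itself, which is what the reproduce configuration does via fetch_metadata. A hedged decoding sketch consistent with the records above:

    # Hedged sketch: decode "line_ids" assuming it is a little-endian uint16
    # array serialized with base64 (consistent with the values shown above).
    import base64
    import struct

    def decode_line_ids(encoded: str) -> list:
        raw = base64.b64decode(encoded)
        count = len(raw) // 2
        return list(struct.unpack(f"<{count}H", raw))

    print(decode_line_ids("JwAoACkAKgArAA=="))
    # [39, 40, 41, 42, 43] -> the line numbers of the original document that were kept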
