
Comments (4)

gwenzek commented on August 21, 2024

Did you try again?
The error means that the file downloaded from https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz isn't a valid gzip file.
It looks like a silent network error.

In the dev branch you can choose to keep the downloaded files on disk (by setting cache_dir), so you can inspect them manually.
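
Once a file is kept on disk, one way to inspect it is to decompress it all the way to the end and see whether Python raises. The sketch below uses only the standard library; `is_valid_gzip` is a hypothetical helper name, not part of cc_net.

```python
import gzip
import zlib
from pathlib import Path


def is_valid_gzip(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole file decompresses without error.

    Hypothetical helper for manual inspection; not part of cc_net.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read to the very end: a truncated download is only detected
            # when the end-of-stream marker turns out to be missing.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended early (truncated file).
        # OSError: not a gzip file, or a CRC/length mismatch.
        # zlib.error: corrupted compressed data.
        return False
```

For example, `is_valid_gzip(Path("some_file.warc.wet.gz"))` returns `False` for a download that was cut short, where the path is whatever file ended up in your cache_dir.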

from cc_net.

gwenzek commented on August 21, 2024

Closing due to inactivity; I'm trying to clean up my backlog. Feel free to reopen if you observe a non-transient failure.
Note that the dev branch I mentioned is now merged into master.


datquocnguyen commented on August 21, 2024

Hi @gwenzek, I got the same error when running in test mode on my local computer, but at a later processing stage. Is it because I am not running on Slurm, or because the bin file is not a valid file? Is there a way to fix this error? Thanks.

/data/cc_net/cc_net/flat_hash_set.py:116: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
  "Module 'getpy' not found. Deduplication will take more RAM."
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/wet_2022-49.paths.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:HashesCollector - Saved 27105 hashes to test_data/hashes/2022-49/0002.tmp.bin
2023-02-01 14:54 INFO 27658:HashesCollector - Processed 175 documents in 0.00058h ( 84.3 doc/s).
2023-02-01 14:54 INFO 27658:HashesCollector - Found 27k unique hashes over 40k lines. Using 0.1GB of RAM.
submitit ERROR (2023-02-01 14:54:04,589) - Submitted job triggered an exception
2023-02-01 14:54 ERROR 27658:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/data/cc_net/cc_net/mine.py", line 276, in _hashes_shard
    inputs=conf.get_cc_shard(shard),
  File "/data/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/data/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/data/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/data/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/data/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/data/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/data/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/usr/lib/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached


yanzhouyoung commented on August 21, 2024

@datquocnguyen I have the same problem. The WET file (test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz) is corrupted; you need to delete it and run the program again.
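
If several cached files might be affected, the delete step can be scripted. This is a minimal sketch using only the standard library; `find_corrupted` and `remove_corrupted` are hypothetical names, not cc_net functions. It assumes the cache directory holds `.gz` files, as in the log above, and that a file is corrupted if it cannot be decompressed to the end.

```python
import gzip
import zlib
from pathlib import Path
from typing import List


def find_corrupted(cache_dir: Path) -> List[Path]:
    """List every .gz file under cache_dir that fails to decompress fully."""
    bad = []
    for path in sorted(cache_dir.rglob("*.gz")):
        try:
            with gzip.open(path, "rb") as f:
                # Drain the stream in 1 MiB chunks; errors surface while reading.
                while f.read(1 << 20):
                    pass
        except (EOFError, OSError, zlib.error):
            bad.append(path)
    return bad


def remove_corrupted(cache_dir: Path, dry_run: bool = True) -> List[Path]:
    """Delete corrupted files (only when dry_run is False) and return them."""
    bad = find_corrupted(cache_dir)
    for path in bad:
        if not dry_run:
            path.unlink()
    return bad
```

Calling `remove_corrupted(Path("test_data/wet_cache"))` only reports the broken files; pass `dry_run=False` to actually delete them before running the program again.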

