commoncrawl / cc-index-server

This project forked from ikreymer/cc-index-server


Common Crawl Index Server

Home Page: http://index.commoncrawl.org/


cc-index-server's Introduction

Common Crawl Support Library

Overview

This library provides support code for consuming the Common Crawl corpus raw crawl data (ARC files) stored on S3. More information about how to access the corpus can be found at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop cluster on EC2, or use EMR, to run a Hadoop job. In this case, you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two versions of the InputFormat: one written to conform to the deprecated mapred package, located at org.commoncrawl.hadoop.io.mapred, and one written for the mapreduce package, located correspondingly at org.commoncrawl.hadoop.io.mapreduce.

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class located in the org.commoncrawl.util.shared package.

Both routes (the InputFormat or the direct ARCFileReader route) produce a tuple consisting of a UTF-8 encoded URL (Text) and the raw content (BytesWritable), including HTTP headers, that was downloaded by the crawler. The HTTP headers are UTF-8 encoded, and the headers and content are delimited by a consecutive pair of CRLF tokens. The content itself, when it is of a text MIME type, is encoded using the source text encoding.
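The header/content layout described above can be illustrated with a small sketch. This is not part of the library; it is a hypothetical Python helper assuming the raw record bytes contain UTF-8 headers, a blank line (CRLF CRLF), and then the body in its source encoding:

```python
def split_record(raw: bytes):
    """Split a raw crawler record into (headers, body).

    Assumes the layout described above: UTF-8 HTTP headers, then a
    consecutive pair of CRLF tokens, then the body in its source encoding.
    """
    sep = raw.find(b"\r\n\r\n")
    if sep == -1:
        # No delimiter found: treat the whole record as headers.
        return raw.decode("utf-8"), b""
    headers = raw[:sep].decode("utf-8")  # headers are UTF-8 encoded
    body = raw[sep + 4:]                 # body keeps its source encoding
    return headers, body

raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
headers, body = split_record(raw)
```

In the Hadoop route, the same splitting would be applied to the BytesWritable value inside a mapper.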

Build Notes:

  1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop distribution.

Sample Usage:

Once the commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by executing the sample command line from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz

cc-index-server's People

Contributors

chillaranand, erikcw, ikreymer, sebastian-nagel


cc-index-server's Issues

[PyWB2] Remove "source" and "source-coll" fields from results

With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", which are absent from the original index, e.g.:

{
  "url": "http://commoncrawl.org/",
  "mime": "text/html",
  "mime-detected": "text/html",
  "status": "200",
  "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
  "length": "5413",
  "offset": "42695747",
  "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
  "charset": "UTF-8",
  "languages": "eng",
  "source": "CC-MAIN-2019-35/indexes/cluster.idx",
  "source-coll": "CC-MAIN-2019-35"
}

This is redundant, as the collection (aka "source") is explicitly queried, and it means roughly 20% more content with Content-Encoding "identity" (which is used for most requests). The 20% matter, given that the index server answers about 10 million requests per month, sending multiple TiB of results.

Note: there is a nosource param in BaseAggregator; it must either be passed permanently or made configurable in config.yaml.
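Until the fields are suppressed server-side, a client could strip them after parsing. A minimal sketch (not part of cc-index-server; the record is trimmed from the example above):

```python
import json

# A trimmed version of the result record shown in the issue.
record = json.loads(
    '{"url": "http://commoncrawl.org/", "status": "200", '
    '"source": "CC-MAIN-2019-35/indexes/cluster.idx", '
    '"source-coll": "CC-MAIN-2019-35"}'
)

# Drop the two redundant fields added by PyWB 2.x.
slim = {k: v for k, v in record.items() if k not in ("source", "source-coll")}
```

The server-side fix via the nosource param would avoid this extra work (and the extra bytes on the wire) entirely.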

Allow fl= parameter to request partially absent fields

If a field requested by the fl parameter is missing in one of the records, the query processing exits with an exception and the result list is truncated:

Traceback (most recent call last):
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 186, in to_text
    result = ' '.join(str(self[x]) for x in fields) + '\n'
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 186, in <genexpr>
    result = ' '.join(str(self[x]) for x in fields) + '\n'
KeyError: 'languages'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/var/venv/lib/python3.5/site-packages/pywb/framework/wbrequestresponse.py", line 221, in encode
    for obj in stream:
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxops.py", line 53, in cdx_to_text
    yield cdx.to_text(fields)
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 190, in to_text
    raise CDXException(msg)
pywb.cdx.cdxobject.CDXException: Invalid field "'languages'" found in fields= argument

The absence of a field should be handled. Ideally, fl=url,languages and fl=url should return the same number of results, with empty values for the missing fields.

Currently, the URL index is still based on PyWB 0.33.2.
PyWB 2.3.0 likewise crashes on non-existent fields (the param name there is fields, see #8) with output=text:

  File ".../pywb/warcserver/index/cdxobject.py", line 186, in to_text
    result = ' '.join(str(self[x]) for x in fields) + '\n'
  File ".../pywb/warcserver/index/cdxobject.py", line 186, in <genexpr>
    result = ' '.join(str(self[x]) for x in fields) + '\n'
KeyError: 'languages'
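The tolerant behavior proposed above can be sketched in a few lines. This is a hypothetical stand-in for pywb's to_text rendering, not its actual code: it substitutes an empty value for absent fields instead of raising KeyError:

```python
def to_text(record: dict, fields):
    """Render one index record as space-separated text.

    Unlike a plain record[x] lookup, record.get(x, "") emits an empty
    value for fields absent from this record, so the result list is not
    truncated by a KeyError.
    """
    return " ".join(str(record.get(x, "")) for x in fields) + "\n"

rec = {"url": "http://commoncrawl.org/"}  # no "languages" field
line = to_text(rec, ["url", "languages"])
```

With this behavior, fl=url,languages and fl=url return the same number of result lines.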

[PyWB2] Query param `fl` is ignored

The query parameter to select the result fields (fl) is ignored by PyWB 2.3.0. As visible in the code, it has been renamed from fl to fields, with a fall-back for the old param name. But the fall-back does not work, and it furthermore causes the output param to be ignored:

> curl 'http://index-pywb2.commoncrawl.org/CC-MAIN-2019-35-index?url=commoncrawl.org&matchType=domain&fields=url&output=text&limit=1'
http://commoncrawl.org/

> curl 'http://index-pywb2.commoncrawl.org/CC-MAIN-2019-35-index?url=commoncrawl.org&matchType=domain&fl=url&output=text&limit=1'
{"urlkey": "org,commoncrawl)/", "timestamp": "20190818052150", "charset": "UTF-8", "languages": "eng", "url": "http://commoncrawl.org/", "status": "200", "mime": "text/html", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz", "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT", "offset": "42695747", "mime-detected": "text/html", "length": "5413", "source": "CC-MAIN-2019-35/indexes/cluster.idx", "source-coll": "CC-MAIN-2019-35"}
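Until the fall-back is fixed, clients talking to a PyWB 2.x backend can work around it by sending the new param name. A small sketch building the working variant of the query URL (the host and collection are taken from the curl examples above):

```python
from urllib.parse import urlencode

# Use "fields" (the PyWB 2.x name) instead of the legacy "fl" param,
# which is currently ignored and also breaks output=text.
base = "http://index.commoncrawl.org/CC-MAIN-2019-35-index"
params = {
    "url": "commoncrawl.org",
    "matchType": "domain",
    "fields": "url",
    "output": "text",
    "limit": 1,
}
query = base + "?" + urlencode(params)
```

This reproduces the first (working) curl invocation shown above.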

403 when locally hosted cc-index-server tries to connect to s3://commoncrawl/

Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the session token.

Is this only designed to work with long-term IAM user creds, or does it support short-term creds? If I were to go edit the file building that Authorization, where would I find it? I searched the code globally for Authorization, access_key, and access, excluding the cluster.idx files, and found nothing that matched.

I'd be happy to contribute the fix for supporting short-term creds if you help me find where the fix goes in your code.
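For context: with AWS Signature Version 4, temporary (STS) credentials must send the session token as an X-Amz-Security-Token header alongside the signed Authorization header; omitting it yields exactly this kind of 403. A hypothetical sketch of the header assembly (the helper name and the signature placeholder are illustrative, not cc-index-server code):

```python
def s3_auth_headers(access_key: str, signature: str, session_token=None):
    """Hypothetical helper: assemble S3 request auth headers.

    With temporary (STS) credentials, X-Amz-Security-Token must
    accompany the signed Authorization header, or S3 responds 403.
    The credential scope and signature are placeholders here.
    """
    headers = {
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/..., "
            f"Signature={signature}"
        )
    }
    if session_token:
        headers["X-Amz-Security-Token"] = session_token
    return headers

with_token = s3_auth_headers("AKIDEXAMPLE", "deadbeef", session_token="FwoG...")
without_token = s3_auth_headers("AKIDEXAMPLE", "deadbeef")
```

If the observed Authorization is built without this header, that would explain why only long-term IAM user creds work.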
