cc-lambda's Introduction

Cognito Common Crawl

This program uses pywren to search the Common Crawl dataset.

Setup

virtualenv env/
source env/bin/activate
pip install -r requirements.txt

Set your AWS credentials as [default] in ~/.aws/credentials and make sure your default region is set to us-east-1.

Then configure pywren:

pywren get_aws_account_id
pywren create_config --force

Edit the ~/.pywren_config file and specify:

  • aws_region should be us-east-1
  • bucket should be a unique bucket name
  • memory should be 512

pywren create_bucket
pywren create_role
pywren deploy_lambda

Confirm that everything is working using pywren test_function

Configuration

Change the following in cc-lambda.py:

  • MATCH_S3_BUCKET: the bucket where you want to store your findings
  • sentry_sdk.init("..."): should either be removed or changed to use your own Sentry DSN

Running the application

Each run spawns multiple Lambda functions that analyze Common Crawl WARC files at scale. Running the application will have an impact on your AWS bill!

The application reads the input/warc.paths file and writes to:

  • processed.paths: text file containing the WARC paths that were successfully analyzed
  • failed.paths: text file containing the WARC paths that failed (most likely because the Lambda timeout was reached)

When you call cc-lambda.py, the script checks the input for WARC paths that have not yet been processed or failed, and works through those. Remove processed.paths and failed.paths if you want to re-process all WARC paths.
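
The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual code in cc-lambda.py: the helper names read_paths and pending_paths are hypothetical, and only the file names come from this README.

```python
import os


def read_paths(filename):
    """Return the non-empty lines of a paths file, or [] if the file is missing."""
    if not os.path.exists(filename):
        return []
    with open(filename) as paths_file:
        return [line.strip() for line in paths_file if line.strip()]


def pending_paths(input_file="input/warc.paths",
                  processed_file="processed.paths",
                  failed_file="failed.paths"):
    """Return the input WARC paths that are in neither processed nor failed."""
    # Paths already handled in earlier runs, whether they succeeded or not.
    done = set(read_paths(processed_file)) | set(read_paths(failed_file))
    # Keep the input order so runs stay predictable.
    return [path for path in read_paths(input_file) if path not in done]
```

Because a missing file is treated as empty, deleting processed.paths and failed.paths makes every input path pending again, which matches the re-processing advice above.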

HTTP responses that match the search are stored in the MATCH_S3_BUCKET S3 bucket.

$ python cc-lambda.py 
No handlers could be found for logger "pywren.executor"
Overall progress: 1.55%
Going to process 250 WARC paths
Got futures from map(), waiting for results...

crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/warc/CC-MAIN-20190215183319-20190215205319-00000.warc.gz
  - Time (seconds): 191.205149174
  - Processed pages: 44969
  - Ignored pages: 93005
  - Matches: {'aws_re_matcher': 9, 'cognito_matcher': 4}

After running the application a few times and fine-tuning your search, you can leave it running against the entire Common Crawl dataset:

while python cc-lambda.py; do :; done

Debugging

Remember: AWS Lambda sends logs to CloudWatch, where you can inspect them.

PYWREN_LOGLEVEL=INFO python cc-lambda.py

Costs

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it without incurring any transfer costs.

The costs you'll incur by running this software are:

  • Lambda function
  • S3 storage

The highest cost will come from AWS lambda. In order to reduce this cost you should:

  • Improve the lambda function code to run faster
  • Improve the lambda function to use less RAM
  • Search for Max Memory Used in the CloudWatch logs for Lambda and make sure the Lambda function is configured with ~50 MB more RAM than the max memory used reported in the log.
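
The memory-sizing advice in the last bullet could be automated along these lines. This is a sketch under assumptions: it expects log lines that have already been exported from CloudWatch as plain text, it relies on the "Max Memory Used: N MB" field that Lambda writes in its REPORT lines, and the function name recommended_memory_mb is hypothetical.

```python
import re

# Lambda REPORT lines contain a field such as "Max Memory Used: 180 MB".
MAX_MEMORY_RE = re.compile(r"Max Memory Used: (\d+) MB")


def recommended_memory_mb(log_lines, headroom_mb=50, minimum_mb=128):
    """Suggest a memory setting: the highest observed usage plus some headroom."""
    used = []
    for line in log_lines:
        match = MAX_MEMORY_RE.search(line)
        if match:
            used.append(int(match.group(1)))
    if not used:
        return None  # no REPORT lines found, so nothing to recommend
    # Lambda does not accept memory settings below 128 MB.
    return max(max(used) + headroom_mb, minimum_mb)
```

The result would then be written to the memory setting in ~/.pywren_config before redeploying the function.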

After running the tool a few times make sure you also run lambda-cost-calculator:

+---------------------+-----------+--------------------------+-----------------------------+
| Function            | Region    | Cost in the Last Day ($) | Monthly Cost Estimation ($) |
+---------------------+-----------+--------------------------+-----------------------------+
| pywren_cc_search_v3 | us-east-1 | 6.410                    | 192.296                     |
+---------------------+-----------+--------------------------+-----------------------------+
Total monthly cost estimation: $192.296
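
For a rough sanity check of these numbers, the compute cost of a single invocation can be approximated from its duration and configured memory. The sketch below is illustrative only: the function names are hypothetical, the default price per GB-second is an assumption that should be checked against current AWS Lambda pricing for your region, and request charges and the free tier are ignored. The monthly figure in the table appears to be consistent with a simple 30-day extrapolation of the daily cost.

```python
def lambda_compute_cost(duration_seconds, memory_mb,
                        price_per_gb_second=0.0000166667):
    """Approximate compute cost in USD of one Lambda invocation.

    price_per_gb_second is an assumed value; verify it against AWS pricing.
    """
    gb_seconds = duration_seconds * (memory_mb / 1024.0)
    return gb_seconds * price_per_gb_second


def monthly_estimate(daily_cost, days=30):
    """Extrapolate a daily cost to a month of the given length."""
    return daily_cost * days


# One WARC file from the sample run above: about 191 seconds at 512 MB.
cost_per_warc = lambda_compute_cost(191.205149174, 512)
```

Multiplying cost_per_warc by the number of WARC paths in input/warc.paths gives a rough idea of what a full run might cost before you start it.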

Monitoring

It is possible to monitor the progress of the analysis functions using:

pywren print_latest_logs | grep total_seen

And the progress of the whole solution using:

pywren print_latest_logs | grep -v Running

cc-lambda's Issues

pywren logs to cloudwatch every 2 seconds

pywren logs to CloudWatch every 2 seconds, which fills the log with unnecessary lines that say "Running ..."

Send a PR to pywren to make the period between Running ... log lines configurable.

No longer functional

Not sure if the author is interested in updating this project, but I just wanted to leave a note for anyone else who attempts to use it.

Although the project itself is nice and well documented, too much time has passed since it last received an update, and it isn't functional in its current state. I think it's important to note this, since it is one of the example projects linked from commoncrawl.org.

I'm not sure what it would take to get this running again, but I would imagine the biggest hurdles are going to be that it's based on Python 2.7 (which is now being sunset), and that pywren has not been maintained either.
