tacitus / lambda-scraper-queue

This project forked from jimpick/lambda-scraper-queue

Demo project showing how to create a simple web scraping service using AWS Lambda and API Gateway

License: ISC License

Lambda Scraper Queue

This is a demo project which implements a trivial REST service for queuing web scraping jobs.

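It is completely "serverless" and runs entirely on Amazon services: Lambda and API Gateway for the service itself, plus the DynamoDB, S3, and CloudWatch resources listed under Costs.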

The Lambda functions are written in ES6, with async/await, transpiled using Babel, and bundled using Webpack.

The AWS resources are provisioned with CloudFormation, plus an add-on custom resource handler that allocates the API Gateway resources (which CloudFormation doesn't support natively yet).

Additionally, we use Apex to simplify the uploading of the Lambda functions.

Costs

It should cost very little to run.

  • DynamoDB - only provisioned for 1 read capacity unit, 1 write capacity unit (which limits it to 1 job per second)
  • S3 - storage for retrieved files and JSON, plus data transfer
  • CloudWatch logs
  • Lambda invocations

Demo Instance

API: https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod

Web Interface: Under construction

API

Submit a job

curl -X POST -d url=http://jimpick.com/ https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod/jobs
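
The same request can be issued from JavaScript. The following is a minimal sketch, assuming a runtime with global fetch and URLSearchParams (for example Node.js 18 or later). The helper names buildJobRequest and submitJob are hypothetical and not part of this project. The response format isn't documented here, so the sketch returns the raw response text.

```javascript
// Base URL of the demo instance listed above; substitute your own deployment.
const API_BASE = 'https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod'

// Build a form-encoded POST equivalent to `curl -X POST -d url=...`.
// URLSearchParams percent-encodes the target URL; the server-side form
// decoder is assumed to turn it back into the original value.
function buildJobRequest (apiBase, targetUrl) {
  return {
    url: `${apiBase}/jobs`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({ url: targetUrl }).toString()
    }
  }
}

// Send the request (hypothetical helper; needs a runtime with global fetch).
async function submitJob (apiBase, targetUrl) {
  const { url, options } = buildJobRequest(apiBase, targetUrl)
  const response = await fetch(url, options)
  if (!response.ok) {
    throw new Error(`Job submission failed: HTTP ${response.status}`)
  }
  return response.text()
}
```

Calling submitJob(API_BASE, 'http://jimpick.com/') would then queue the same job as the curl command above.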

Deployment Instructions

Prerequisites

  • You will need an AWS Account
  • You will need OS X, Linux, *BSD or another Unix-based OS (scripts will need some modifications for Windows)
  • Install the AWS CLI and ensure credentials are set up under ~/.aws/credentials (Instructions)
  • Install Node.js (tested with v4.2.6 and v5.7.0)
  • git clone https://github.com/jimpick/lambda-scraper-queue.git (https)
    or
    git clone git@github.com:jimpick/lambda-scraper-queue.git (git)
  • cd lambda-scraper-queue
  • npm install

Set up IAM permissions

Note: These instructions are copied from: https://github.com/carlnordenfelt/aws-api-gateway-for-cloudformation#setup-iam-permissions

Installing the Custom Resource library requires a set of permissions. Configure your IAM user with the following policy, and make sure that you have configured your aws-cli with an access key and secret key.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:DescribeStacks",
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:AttachRolePolicy",
        "iam:GetRole",
        "iam:PassRole",
        "lambda:CreateFunction",
        "lambda:UpdateFunctionCode",
        "lambda:GetFunctionConfiguration",

        "cloudformation:DeleteStack",
        "lambda:DeleteFunction",
        "iam:ListPolicyVersions",
        "iam:DetachRolePolicy",
        "iam:DeletePolicy",
        "iam:DeleteRole"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Install the Custom Resource Library

This installs a special AWS Lambda function so that the CloudFormation recipe can provision the API Gateway using custom resources from Carl Nordenfelt's API Gateway for CloudFormation project.

npm run deploy-custom-resource

If successful, a 'service token' will be saved to deploy/state/SERVICE_TOKEN

Configuration

Copy config.template.js to config.js and customize it.

cp config.template.js config.js

The default config.template.js is:

export default {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod'
}

Parameters

cloudFormation: The name of the CloudFormation stack

region: The AWS region

stage: The API Gateway stage to create
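
To illustrate how these parameters fit together, here is a rough sketch that validates a config object and derives the API Gateway invoke URL from region and stage. The functions validateConfig and apiBaseUrl are hypothetical helpers, not part of this project, and restApiId is an assumed extra argument: API Gateway assigns that ID, and it is not stored in config.js.

```javascript
// Mirrors the default config.template.js shown above.
const config = {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod'
}

// Check that every documented parameter is a non-empty string.
function validateConfig (cfg) {
  for (const key of ['cloudFormation', 'region', 'stage']) {
    if (typeof cfg[key] !== 'string' || cfg[key].length === 0) {
      throw new Error(`config.${key} must be a non-empty string`)
    }
  }
  return cfg
}

// Build the invoke URL for a deployed stage.
// `restApiId` is a hypothetical argument (assigned by API Gateway).
function apiBaseUrl (cfg, restApiId) {
  const { region, stage } = validateConfig(cfg)
  return `https://${restApiId}.execute-api.${region}.amazonaws.com/${stage}`
}
```

With the default config, apiBaseUrl(config, '3m7171w3c9') reproduces the demo instance URL listed under Demo Instance.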

Use CloudFormation to create the AWS resources

npm run create-cloudformation

The command returns immediately, but the stack takes a while to complete because it deploys a lot of resources. It's a good idea to watch the CloudFormation stack in the AWS Web Console to ensure that it completes without errors.
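
Waiting for the stack can also be scripted. The sketch below polls until the stack reaches a terminal status. It is an illustration only: waitForStack is a hypothetical helper, and getStatus is an assumed caller-supplied async function that resolves to the stack's current StackStatus string (for example by wrapping aws cloudformation describe-stacks). The project itself does not ship such a script.

```javascript
// Terminal stack statuses reported by CloudFormation for a create operation.
const SUCCESS = new Set(['CREATE_COMPLETE'])
const FAILURE = new Set(['CREATE_FAILED', 'ROLLBACK_COMPLETE', 'ROLLBACK_FAILED'])

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Poll `getStatus` until the stack succeeds, fails, or attempts run out.
// `getStatus` is a hypothetical callback, not an API of this project.
async function waitForStack (getStatus, { intervalMs = 10000, maxAttempts = 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus()
    if (SUCCESS.has(status)) return status
    if (FAILURE.has(status)) throw new Error(`Stack creation failed: ${status}`)
    // Any other status (e.g. CREATE_IN_PROGRESS) means keep waiting.
    await sleep(intervalMs)
  }
  throw new Error(`Stack did not finish after ${maxAttempts} attempts`)
}
```

Passing the status lookup in as a callback keeps the polling logic independent of whichever AWS client is used to query the stack.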

Note: When working with the CloudFormation recipe, you can also use npm run update-cloudformation and npm run delete-cloudformation

Manually create the "prod" deployment stage in API gateway

When the CloudFormation stack in the previous step has been successfully provisioned (check the AWS Web Console), do this step.

The Custom Resource library currently doesn't support this from CloudFormation, so, for now, we need to do it manually.

Go to "API Gateway" in the Amazon web console, and select the desired API. Click the Deploy API button, and under Deployment Stage, select New Stage. Enter prod for the Stage Name, and click the Deploy button.

Save the references to the provisioned CloudFormation resources

npm run save-cloudformation

This will create a file in deploy/state/cloudFormation.json

Set up the Apex build directory

npm run setup-apex

This generates build/apex/project.json

Compile the Lambda scripts using babel

npm run compile-lambda

This uses Webpack and Babel to compile the source code in src/server/lambdaFunctions into build/apex/functions.

The webpack configuration is in deploy/apex/webpack.config.es6.js

Deploy the lambda functions

npm run deploy-lambda

This will run apex deploy in the build/apex directory to upload the compiled lambda functions.

Alternatively, if you want to execute the compile and deploy steps in one command, you can run: npm run deploy

Run the test suite

npm run test

This will run both the local tests and the remote tests, which exercise the deployed API and Lambda functions.

The local tests can be run as npm run test-local, and the remote tests can be run as npm run test-remote.

View logs

You can tail the CloudWatch logs:

npm run logs

This just executes apex logs -f in build/apex

Submit a job

npm run post-url

Submits a job to the API that scrapes http://jimpick.com/

You should be able to see the Lambda output in the logs (after a few seconds' delay). You should also be able to see the files in S3 via the AWS Web Console.

To Do List

  • Integration test
  • Better error handling
  • Handle DynamoDB ProvisionedThroughputExceededException
  • Status subsystem (API + Firebase)
  • Web interface
  • Quotas / Whitelists for public demo
  • Blog post
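
As an illustration of the ProvisionedThroughputExceededException item above, one common approach is to retry the throttled call with exponential backoff. The sketch below is not the project's implementation (the item is still on the to-do list). The helper withBackoff is hypothetical, and it assumes the thrown error exposes the exception name in its code or name property.

```javascript
const THROTTLE_ERROR = 'ProvisionedThroughputExceededException'
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Run `operation` (an async function), retrying only on throttling errors.
// The delay doubles on each retry: baseDelayMs, 2x, 4x, ...
async function withBackoff (operation, { retries = 5, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation()
    } catch (err) {
      // Assumption: the error carries the exception name in `code` or `name`.
      const throttled = err.code === THROTTLE_ERROR || err.name === THROTTLE_ERROR
      if (!throttled || attempt >= retries) throw err
      await sleep(baseDelayMs * Math.pow(2, attempt))
    }
  }
}
```

A DynamoDB write would be wrapped as withBackoff(() => putJob(item)), where putJob stands in for whatever function performs the write. With only 1 write capacity unit provisioned, short bursts of job submissions are the likely trigger for this error.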

Similar Work

I'm using Apex, but just for uploading the functions. I haven't investigated the other projects yet.
