
This project is forked from vevo/dynamodbdump.


DynamoDB backups made easier (and cheaper)

License: Apache License 2.0


dynamodbdump: DynamoDB backups made easier


Table of Contents

  • What is it?
  • Why create this tool?
  • How to use it?
  • Contributing to the project

What is it?

This tool backs up a given DynamoDB table and pushes the backup to a given folder in S3, in a format compatible with the AWS Data Pipeline functionality.

It can also restore a backup from S3 into a given table, whether the backup was produced by this tool or by the Data Pipeline functionality.

Why create this tool?

Backing up DynamoDB tables with AWS Data Pipeline spawns EMR clusters, which take time to start: for a small table you end up paying for about 20 minutes of EMR run time for just a few seconds of actual backup, which makes no sense.

This tool can be run from the command line, in a Docker container, or as a Kubernetes cronjob very easily, allowing you to leverage your existing infrastructure without additional costs.

How to use it?

With the command-line

To build dynamodbdump, you can use the make build command, or fetch the dependencies manually (using glide or go get) and then run go build.

Then you can use the dynamodbdump binary you just built to start a backup.

Example:

make build
./dynamodbdump -action backup -dynamo-table my-table -wait-ms 2000  -batch-size 1000 -s3-bucket my-dynamo-backup-bucket -s3-folder "backups/my-table" -s3-date-folder
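
A restore works the same way: pass -action restore, the target table (which must be empty unless -restore-append is set; see the option list below) and the S3 folder holding the backup files. For example, assuming the backup above landed in a date-suffixed folder (the table name and folder shown here are purely illustrative):

./dynamodbdump -action restore -dynamo-table my-table-restored -wait-ms 2000 -batch-size 1000 -s3-bucket my-dynamo-backup-bucket -s3-folder "backups/my-table/2019-01-01-02-00-00"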

Note: the command-line options are available via the -h argument. Example:

$ ./dynamodbdump -h
Usage of ./dynamodbdump:
  -action string
        Action to perform. Only accept 'backup' or 'restore'. Environment variable: ACTION (default "backup")
  -batch-size int
        Max number of records to read from the dynamo table at once or to write in case of a restore. Environment variable: BATCH_SIZE (default 1000)
  -dynamo-table string
        Name of the Dynamo table to backup from or to restore in. Environment variable: DYNAMO_TABLE
  -restore-append
        Appends the rows to a non-empty table when restoring instead of aborting. Environment variable: RESTORE_APPEND
  -s3-bucket string
        Name of the s3 bucket where to put the backup or where to restore from. Environment variable: S3_BUCKET
  -s3-date-folder
        Adds an autogenerated suffix folder named using the UTC date in the format YYYY-mm-dd-HH24-MI-SS to the provided S3 folder. Environment variable: S3_DATE_FOLDER
  -s3-folder string
        Path inside the s3 bucket where to put or grab (for restore) the backup. Environment variable: S3_FOLDER
  -wait-ms int
        Number of milliseconds to wait between batches. If a ProvisionedThroughputExceededException is encountered, the script will wait twice that amount of time before retrying. Environment variable: WAIT_MS (default 100)
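
Every option also has an environment-variable equivalent, as listed above. Assuming the variables behave exactly like the flags they mirror, the backup from the earlier example could also be written as:

ACTION=backup \
DYNAMO_TABLE=my-table \
BATCH_SIZE=1000 \
WAIT_MS=2000 \
S3_BUCKET=my-dynamo-backup-bucket \
S3_FOLDER="backups/my-table" \
S3_DATE_FOLDER=1 \
./dynamodbdump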

Inside docker

Each time a merge is done to the master branch, a Docker image is built with the latest tag. Each time a tag is pushed to the git repository, an image with the corresponding tag is built on Docker Hub (or Docker Cloud, if you prefer).

To pull the image, you'll need docker installed and then just pull the image as usual:

docker pull vevo/dynamodbdump:latest

Then, set the environment variables you want to use. If you are running locally you will need to provide AWS credentials via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables or, as the example below shows, mount your .aws configuration folder and (optionally) provide an AWS_PROFILE environment variable.

For example:

docker run -e ACTION=backup \
           -e AWS_REGION=us-east-1 \
           -e BATCH_SIZE=100 \
           -e DYNAMO_TABLE='mytable' \
           -e S3_BUCKET='bucket-for-data-dumps' \
           -e S3_DATE_FOLDER=1 \
           -e S3_FOLDER='dynamodb/mytable' \
           -e WAIT_MS=500 \
           -e AWS_PROFILE=dev \
           -v $HOME/.aws:/root/.aws \
           -t vevo/dynamodbdump:latest

Note that if you are inside AWS and have properly set up IAM roles, you will not need to specify credentials.

Inside a Kubernetes job

In our setup we use IAM roles to get access to the DynamoDB tables and S3 buckets, so we don't need to set AWS credentials.

We run dynamodbdump as a cronjob in every Kubernetes cluster that hosts applications using DynamoDB tables. That use case is really what the tool was designed for.

An example of a Kubernetes configuration for such a cronjob can be found in resources/kubernetes-cronjob.yaml.

In this example we back up the mytable table every night at 2AM into s3://bucket-for-data-dumps/dynamodb/mytable/${TIMESTAMP}, where ${TIMESTAMP} is the date at which the backup starts, in the format YYYY-mm-dd-HH24-MI-SS.

The backup processes batches of 100 records and waits 500ms between batches. If a ProvisionedThroughputExceededException is encountered, it waits twice that time before retrying.

Note that in the example we cap the memory at 512M, but most of the time 256M is sufficient.
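
For reference, here is a minimal sketch of what such a CronJob manifest can look like, using the values discussed above. The structure is illustrative only (older clusters may need apiVersion batch/v1beta1); the manifest actually shipped with the project is resources/kubernetes-cronjob.yaml.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: dynamodbdump-mytable
spec:
  schedule: "0 2 * * *"   # every night at 2AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: dynamodbdump
            image: vevo/dynamodbdump:latest
            env:
            - name: ACTION
              value: "backup"
            - name: DYNAMO_TABLE
              value: "mytable"
            - name: S3_BUCKET
              value: "bucket-for-data-dumps"
            - name: S3_FOLDER
              value: "dynamodb/mytable"
            - name: S3_DATE_FOLDER
              value: "1"
            - name: BATCH_SIZE
              value: "100"
            - name: WAIT_MS
              value: "500"
            resources:
              limits:
                memory: 512Mi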

Contributing to the project

Anybody is more than welcome to open a PR if you want to contribute to the project. Minimal testing and an explanation of the problem being solved will be asked for, but that's just for sanity purposes.

We're friendly people, we won't bite if the code is not done the way we like! :)

If you don't have a lot of ideas but still want to contribute, we maintain a list of ideas we want to explore in TODO.md; that's a good place to start!

