Database Backups to Avro

The goal of this project is to load database backups (e.g. Oracle Data Pump exports, MS SQL Server .bak files) into a temporary Docker container running the appropriate vendor database, and then export the tables within that restored database as .avro files for loading into other big data systems.

At the present time (August 2022) this goal remains aspirational: the current system requires a pre-existing database, whose connection string and credentials must be passed on the command line.

Current Functionality

This jar file connects to an existing database and emits .avro files, one for each table in the database, or only for the tables specified on the command line.

MS SQL Server backups and Oracle Data Pump exports are currently supported. Oracle supports row-level multi-threading for high performance.

Architecture

On a host running Docker, the application will (one day) instantiate a database (as a container), load the backup files into the container, and then dump the database as Avro files. If a database is already running, the restore step can be skipped by providing a JDBC connection string.

The application is asynchronous and multi-threaded using RxJava2.
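
As a rough sketch of that concurrency pattern (not the project's actual code; exportTable is a hypothetical stand-in for the real JDBC-to-Avro step):

import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

import java.util.Arrays;
import java.util.List;

public class ParallelExportSketch {

  public static void main(String[] args) {
    // Hypothetical table list; the real application discovers tables via JDBC metadata.
    List<String> tables = Arrays.asList("dbo.Person", "dbo.SalesOrderHeader");

    Flowable.fromIterable(tables)
        .parallel()                      // fan the tables out across parallel rails
        .runOn(Schedulers.io())          // each rail runs on an I/O thread
        .map(ParallelExportSketch::exportTable)
        .sequential()                    // merge results back into one stream
        .blockingSubscribe(path -> System.out.println("wrote " + path));
  }

  // Hypothetical stand-in for the real JDBC-read/Avro-write step.
  private static String exportTable(String table) {
    return table + ".avro";
  }
}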

Basic Command Line Usage

MSSQL Server

java -jar db-to-avro.jar \
   --flavor=sqlserver \
   --backup-dir=backups/ \
   --backup-files=AdventureWorksDW2016_EXT.bak \
   --catalog=AdventureWorks2016 \
   --schemas=dbo \
   --destination=avro/

This command currently produces the error message "You must provide database.url".

Once this project is complete, the above command will restore the Microsoft AdventureWorks database from a backup set located in backups/, using the single backup file AdventureWorksDW2016_EXT.bak. All tables within schema AdventureWorks2016.dbo will be exported as Avro files to the directory avro/.

Oracle

java -jar db-to-avro.jar \
   --flavor=oracle \
   --backup-dir=backups/ \
   --backup-files=HR.par \
   --schemas=HR \
   --destination=avro/

Once this project is complete, this command will restore the Oracle HR demo database from a backup set located in backups/, using the parameter file HR.par. All tables within schema HR will be exported as Avro files to the directory avro/.

To get either of the above two commands to work at this time, you must also supply a connection string and credentials on the command line, as documented in the following section:

--connect=...
--user=...
--password=...
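
For example, a hypothetical invocation of the SQL Server restore command against an existing local database (credentials elided):

java -jar db-to-avro.jar \
   --flavor=sqlserver \
   --backup-dir=backups/ \
   --backup-files=AdventureWorksDW2016_EXT.bak \
   --catalog=AdventureWorks2016 \
   --schemas=dbo \
   --destination=avro/ \
   --connect="jdbc:sqlserver://localhost" \
   --user="SA" \
   --password="..."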

Exporting an Existing Database

MSSQL Server

java -jar db-to-avro.jar \
  --connect="jdbc:sqlserver://localhost" \
  --user="SA" \
  --password="..."  \
  --flavor=sqlserver  \
  --catalog=AdventureWorks2016 \
  --schemas=dbo  \
  --destination=avro/

Oracle

java -jar db-to-avro.jar \
  --connect="jdbc:oracle:thin:@localhost:1521/ORCLPDB1" \
  --user="system" \
  --password="..."  \
  --flavor=oracle  \
  --schemas=HR  \
  --destination=avro/

These commands do the same as the above, but instead of restoring a database backup, they export from an existing database reached via the connection string.

Command Line Options

Option (* = required)    Description
---------------------    -----------
--backup-dir             directory containing backup to restore
--backup-files           comma-delimited list of backup files, or a single .par file
--catalog                catalog to export (Oracle N/A)
--connect                jdbc connection string for existing database
--destination            avro destination directory
--exclude                exclusions in the form schema(.table)(.column)
* --flavor               database type (sqlserver, oracle)
--password               database password (existing db)
--password-file          read database password from file (existing db)
--post-sql               path of sql file to execute after restore/connect
--pre-sql                path of sql file to execute before restore/connect
--schemas                only export this comma-delimited list of schemas
--split-table-strategy   flexible split-table strategies (only applicable to sqlserver)
--tables                 only export this comma-delimited list of tables
--user                   database user (existing db)
-h, --help               show help
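
For example, a hypothetical run that exports only two tables and excludes a single column (the --exclude value follows the schema(.table)(.column) form above):

java -jar db-to-avro.jar \
  --connect="jdbc:sqlserver://localhost" \
  --user="SA" \
  --password="..." \
  --flavor=sqlserver \
  --catalog=AdventureWorks2016 \
  --schemas=dbo \
  --tables=Person,SalesOrderHeader \
  --exclude=dbo.Person.rowguid \
  --destination=avro/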

Example Configuration Properties

# Application defaults, can be overridden by a job
#
# Note ${UUID} is substituted everywhere with the same random UUID at startup time

# Docker host for managing containers
docker.host=unix:///var/run/docker.sock

# Make sure this is at least the number of threads or you will get Hikari connection timeouts
database.pool.size=64

# MS SQL Server defaults
sqlserver.database.url=jdbc:sqlserver://localhost;user=SA;password=${UUID};database=master;autoCommit=false
sqlserver.database.user=SA
sqlserver.database.password=${UUID}
sqlserver.image=mcr.microsoft.com/mssql/server:2017-latest
sqlserver.env=ACCEPT_EULA=Y,SA_PASSWORD=${UUID}
# Attempt to dump a table in parallel using partitions based on PKs (beta)
sqlserver.optimized.enable=false

# Oracle defaults
oracle.database.url=jdbc:oracle:thin:@10.10.10.100:1521/ORCLPDB1
oracle.database.user=system
oracle.database.password=${UUID}
oracle.ports=1521:1521
oracle.image=gcr.io/som-rit-phi-starr-dev/oracle-database:12.2.0.1-ee
oracle.mounts=/mnt/database:/opt/oracle/oradata,/mnt/backups:/backup
oracle.env=ORACLE_CHARACTERSET=WE8ISO8859P1,ORACLE_PWD=${UUID}
oracle.impdp.core.multiplier=1
oracle.optimized.enable=true

avro.filename=%{SCHEMA}.%{TABLE}-%{PART}.avro
avro.logfile=job.json
avro.codec=snappy
# Target size for generated Avro files, based on *uncompressed* source table bytes.
# Set to zero for unlimited file size.
avro.targetsize=1000000000
avro.fetchsize=5000
# Normalize table names (columns always normalized)
avro.tidy=true

# Core-count multiplier determines the number of Avro export threads (e.g. 0.75 on an 8-core host gives 6 threads)
avro.core.multiplier=0.75
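
As a rough illustration only (the application drives Docker through the API socket configured by docker.host above, not the CLI), the Oracle defaults correspond to starting a container along these lines, with the startup UUID in place of <uuid>:

docker run -d \
  -p 1521:1521 \
  -v /mnt/database:/opt/oracle/oradata \
  -v /mnt/backups:/backup \
  -e ORACLE_CHARACTERSET=WE8ISO8859P1 \
  -e ORACLE_PWD=<uuid> \
  gcr.io/som-rit-phi-starr-dev/oracle-database:12.2.0.1-ee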

Future Features

Currently the application consists of a single module, "db-to-avro-runner". More modules can be added in the future to call this runner from a REST API, Pub/Sub queue, etc.

TODO, in order of priority:

  • Pass docker database mounts as command line options in addition to properties file

  • Resume features:

    • Use temp files when writing to disk, rename to final name when complete.

    • Cancelled/crashed jobs should resume at last table exported

    • Option: if destination file exists, don’t export that table

  • Option to save directly to a GCS bucket

  • Support for regex in schema/table/column exclusion filters

  • Get row counts from Avro generation, instead of DB introspection (which is too slow for billions of rows)

  • Ability to restore backup into a pre-existing database (e.g. from CoS boot creation)

  • Schema introspection should be able to exclude tables that will cause problems, such as a '?' in a column name

  • Validation: Number of rows in database should match number of rows written by db-goodies ETL

  • Automation

    • Ability to self-bootstrap into a new VM created in GCP and monitor output (?)

    • Job runner that reads VM metadata for job input (?)

    • Pub/sub job runner (?)

  • Formalized progress reporting (not logging)

  • Switches for:

    • Deleting docker container after successful export

    • Listing catalogs, schemas, and tables

    • Testing connection to db

  • Option for saving output as .avro instead of .json, so the manifest is included with the dataset itself.

  • Unit tests(!)

Known Issues

  • Table and column names are normalized by db-goodies ETL, which is not reflected in the job output log.

  • Excluded tables are not explicitly noted in job output (they just aren’t listed)
