Shunting Yard is a real-time data replication tool that copies data between Hive Metastores.

License: Apache License 2.0

Shunting Yard

Shunting Yard reads serialized Hive Metastore Events from a queue (AWS SQS is currently supported) and replicates the data between two data lakes. It does this by building a YAML file with the information provided in the event, which is then passed to Circus Train to perform the replication.

Status ⚠️

This project is no longer in active development.

Start using

You can obtain Shunting Yard from Maven Central.

System architecture

Shunting Yard system diagram.

Overview

Shunting Yard is intended to be a constantly running service which listens to a queue for Hive events. These events are emitted from the Hive Metastore based on the operations performed on the Hive tables. For instance, an ADD_PARTITION_EVENT is emitted from the Hive Metastore when a new partition is added to a table. Similarly, a CREATE_TABLE_EVENT is emitted when a new table is created in the Hive Metastore. We recommend using the Apiary Metastore Listener for getting these events from your Hive Metastore.
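
For illustration, a deserialized ADD_PARTITION event might carry information along the lines of the sketch below. This is a hypothetical, simplified payload: the actual field names and serialization format are defined by the event emitter (e.g. the Apiary Metastore Listener), not by this document.

# Hypothetical, simplified event payload -- real messages will differ.
event-type: ADD_PARTITION
database-name: test_database
table-name: test_table_1
partition-values:
  - '2019-01-01'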

Once Shunting Yard receives an event from the queue, it extracts the relevant information from it to build a YAML file, which it then passes on to Circus Train to perform the replication. Shunting Yard also aggregates series of events so that a minimal number of replications is performed via Circus Train.
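
As an illustration, an event for test_database.test_table_1 might be translated into a Circus Train configuration roughly like the fragment below. This is a hand-written sketch of the idea, not the exact file Shunting Yard generates; additional settings (such as the replica table location) are also derived from the event and from the Shunting Yard configuration.

source-catalog:
  name: source-cluster
  hive-metastore-uris: thrift://emr-master-1.compute.amazonaws.com:9083
replica-catalog:
  name: replica-cluster
  hive-metastore-uris: thrift://emr-master-2.compute.amazonaws.com:9083
table-replications:
  - source-table:
      database-name: test_database
      table-name: test_table_1
    replica-table:
      database-name: test_database
      table-name: test_table_1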

Install

  1. Download the TGZ from Maven Central and then uncompress the file by executing:
tar -xzf shunting-yard-binary-<version>-bin.tgz

Although it's not necessary, we recommend exporting the environment variable SHUNTING_YARD_HOME, setting its value to the directory you extracted it into:

export SHUNTING_YARD_HOME=/<foo>/<bar>/shunting-yard-<version>
  2. Download the latest version of Circus Train and uncompress it:
tar -xzf circus-train-<version>-bin.tgz

Set the CIRCUS_TRAIN_HOME environment variable:

export CIRCUS_TRAIN_HOME=/<foo>/<bar>/circus-train-<circus-train-version>

Usage

To run Shunting Yard you just need to execute the bin/replicator.sh script in the installation directory and pass the configuration file:

$SHUNTING_YARD_HOME/bin/replicator.sh --config=/path/to/config/file.yml

EMR

If you are planning to run Shunting Yard on EMR you will need to set up the EMR classpath by exporting the following environment variables before calling the bin/replicator.sh script:

export HCAT_LIB=/usr/lib/hive-hcatalog/share/hcatalog/
export HIVE_LIB=/usr/lib/hive/lib/

Note that the paths above are correct as of when this document was last updated but may differ across EMR versions. Refer to the EMR release guide for more up-to-date information if necessary.

Configuring Shunting Yard

The examples below all demonstrate configuration using YAML and provide fragments covering the most common use cases that should be useful as a basis for building your own configuration. A full configuration reference is provided in the following sections.

Configuring source, replica and SQS queue

The YAML fragment below shows some common options for setting up the base source (where data is coming from), replica (where data is going to) and the SQS queue to read Hive events from.

source-catalog:
  name: source-cluster
  hive-metastore-uris: thrift://emr-master-1.compute.amazonaws.com:9083
replica-catalog:
  name: replica-cluster
  hive-metastore-uris: thrift://emr-master-2.compute.amazonaws.com:9083
event-receiver:
  configuration-properties:
    sqs.queue: https://sqs.us-west-2.amazonaws.com/123456789/sqs-queue
    sqs.wait.time.seconds: 20
source-table-filter:
  table-names:
    - test_database.test_table_1
table-replications:
  - source-table:
      database-name: source_database
      table-name: test_table
    replica-table:
      database-name: replica_database
      table-name: test_table_1

Selecting tables to monitor

Shunting Yard by default will not replicate any tables unless they are selected using a source-table-filter. For example, if you want Shunting Yard to monitor only two tables, test_database.test_table_1 and test_database.test_table_2, you can configure it as follows:

source-table-filter:
  table-names:
    - test_database.test_table_1
    - test_database.test_table_2

Specifying target database & table names

Shunting Yard will by default replicate data into the replica data lake using the same database name and table name as the source. Sometimes a user might need to change the replica database name, the table name, or both. The YAML fragments below show some common options for specifying the replica database and table name for the selected tables.

Specify both replica database and table name

table-replications:
  - source-table:
      database-name: source_database
      table-name: test_table
    replica-table:
      database-name: replica_database
      table-name: test_table_1    

Change only the replica database; the table name remains the same as the source

In this case, the replica table name is not provided in the table-replications and it will therefore be the same as the source table name.

table-replications:
  - source-table:
      database-name: source_database
      table-name: test_table
    replica-table:
      database-name: replica_database 

Change only the replica table name; the database remains the same as the source

In this case, the replica database name is not provided in the table-replications and it will therefore be the same as the source database name.

table-replications:
  - source-table:
      database-name: source_database
      table-name: test_table
    replica-table:
      table-name: test_table_1

Orphaned data strategy

Shunting Yard will invoke Circus Train's default orphaned data strategy, which is to run the Housekeeping process to clean up dereferenced snapshots. This means that Housekeeping configuration will need to be provided (see the Housekeeping section below).

To override this behaviour, the configuration parameter orphaned-data-strategy can be provided as follows:

table-replications:
  - source-table:
      database-name: source_database
      table-name: test_table
    replica-table:
      database-name: replica_database
      table-name: test_table_1 
orphaned-data-strategy: NONE 

This will ensure that Shunting Yard only starts up Circus Train's replication module, and that no paths are added to a Housekeeping database.

This is a necessary step if you want to use a different housekeeping mechanism for your orphaned data, e.g. Beekeeper (see "Using Beekeeper for housekeeping" below).

Shunting Yard configuration reference

The table below describes all the available configuration values for Shunting Yard.

| Property | Required | Description |
| --- | --- | --- |
| source-catalog.name | Yes | A name for the source catalog, used for events and logging. |
| source-catalog.hive-metastore-uris | Yes | Fully qualified URI of the source cluster's Hive Metastore Thrift service. |
| replica-catalog.name | Yes | A name for the replica catalog, used for events and logging. |
| replica-catalog.hive-metastore-uris | Yes | Fully qualified URI of the replica cluster's Hive Metastore Thrift service. |
| sqs.queue | Yes | Fully qualified URI of the AWS SQS queue to read Hive events from. |
| sqs.wait.time.seconds | No | Wait time in seconds for which the receiver will poll the SQS queue for a batch of messages. Default is 10 seconds. See the AWS SQS documentation on long polling for more details. |
| source-table-filter.table-names | No | A list of tables selected for Shunting Yard replication. Supported format: database_1.table_1, database_2.table_2. If these are not provided, Shunting Yard will not replicate any tables. |
| orphaned-data-strategy | No | Orphaned data strategy to use for replications. Possible values: NONE and HOUSEKEEPING. Default is HOUSEKEEPING. |
| table-replications[n].source-table.database-name | No | The name of the database in which the table you wish to replicate is located. The table-replications section is optional; if it is not provided, Shunting Yard will use the source database and table names for the replica. |
| table-replications[n].source-table.table-name | No | The name of the table which you wish to replicate. |
| table-replications[n].replica-table.database-name | No | The name of the destination database in which to replicate the table. Defaults to source-table.database-name. |
| table-replications[n].replica-table.table-name | No | The name of the table at the destination. Defaults to source-table.table-name. |
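
Putting the reference together, the fragment below assembles all of the options above into a single illustrative configuration (the values are placeholders; required and optional properties are marked in comments):

source-catalog:
  name: source-cluster                 # required
  hive-metastore-uris: thrift://emr-master-1.compute.amazonaws.com:9083   # required
replica-catalog:
  name: replica-cluster                # required
  hive-metastore-uris: thrift://emr-master-2.compute.amazonaws.com:9083   # required
event-receiver:
  configuration-properties:
    sqs.queue: https://sqs.us-west-2.amazonaws.com/123456789/sqs-queue    # required
    sqs.wait.time.seconds: 20          # optional, defaults to 10
source-table-filter:
  table-names:                         # optional, but nothing is replicated without it
    - test_database.test_table_1
orphaned-data-strategy: NONE           # optional, defaults to HOUSEKEEPING
table-replications:                    # optional
  - source-table:
      database-name: test_database
      table-name: test_table_1
    replica-table:
      database-name: replica_database  # optional, defaults to the source database name
      table-name: test_table_1         # optional, defaults to the source table name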

Configuring Graphite metrics

Graphite configuration can be passed to Shunting Yard using an optional --ct-config argument, which takes a different YAML file from the one described above and passes it directly to the internal Circus Train instance. Refer to the Circus Train README for more details.

Sample ct-config.yml for graphite metrics:

graphite:
  host: graphite-host:2003
  namespace: com.company.shuntingyard
  prefix: dev

Housekeeping

Housekeeping is the process that removes expired and orphaned data on the replica. Shunting Yard delegates housekeeping responsibility to Circus Train. Similar to Graphite configuration, the Housekeeping configuration can also be directly passed to the internal Circus Train instance using the --ct-config argument. Refer to the Circus Train README for more details.

Sample ct-config.yml for housekeeping:

housekeeping:
  expired-path-duration: P3D
  db-init-script: classpath:/schema.sql
  data-source:
    driver-class-name: org.h2.Driver 
    url: jdbc:h2:${housekeeping.h2.database};AUTO_SERVER=TRUE;DB_CLOSE_ON_EXIT=FALSE
    username: user
    password: secret

Using Beekeeper for housekeeping

If Beekeeper is installed in your data lake, Circus Train can be configured to use Beekeeper to delete orphaned data by adding table parameters to the replica table during the replication. Please see Metadata transformations in the Circus Train docs for more detailed instructions.

Sample ct-config.yml to use Beekeeper:

transform-options:
  table-properties:
    '[beekeeper.remove.unreferenced.data]': true

Usage with Circus Train common config

To run Shunting Yard with a Circus Train common config file in addition to its own config file, you just need to execute the bin/replicator.sh script in the installation directory and pass both configuration files:

$SHUNTING_YARD_HOME/bin/replicator.sh --config=/path/to/config/file.yml --ct-config=/path/to/config/ct-common.yml
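
For example, a ct-common.yml combining the Graphite and housekeeping fragments shown earlier might look as follows; this is just one possible combination, and any valid Circus Train configuration can be supplied this way:

graphite:
  host: graphite-host:2003
  namespace: com.company.shuntingyard
  prefix: dev
housekeeping:
  expired-path-duration: P3D
  db-init-script: classpath:/schema.sql
  data-source:
    driver-class-name: org.h2.Driver
    url: jdbc:h2:${housekeeping.h2.database};AUTO_SERVER=TRUE;DB_CLOSE_ON_EXIT=FALSE
    username: user
    password: secret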

Legal

This project is available under the Apache 2.0 License.

Copyright 2016-2019 Expedia, Inc.


shunting-yard's Issues

Expose replication logs

As a Shunting Yard user
I'd like to be able to access Circus Train logs of a replication
So I can troubleshoot problems and make improvements to the replication process

Acceptance criteria

Users must be able to access replication logs of any replication

Make Shunting Yard auditable

As a Shunting Yard user/operator
I'd like to keep track of how events were processed by the service
So I can easily track the outputs generated from inputs and any events that may have occurred during processing

Overview

Auditability is a must-have in any system. Shunting Yard must be auditable in order to allow for quality improvement and correctness verification.

This issue probably needs to be split up into more fine-grained tasks.

Create a production-ready adapter interface for the Kinesis client

As a Shunting Yard developer
I'd like to write a Kinesis adapter
So I can safely use KCL as the event receiver

Overview

Shunting Yard uses polling to read messages from the topic, transform them and invoke Circus Train.
KCL provides an out-of-the-box push client which includes features like rebalancing, load balancing, etc. Writing a poll client with all these features is not easy and it would take time to code it, test it and make sure it works as expected.
We would like to reuse this KCL client in CT Event Based, so we must write a robust adapter interface to adapt the KCL push protocol to the CT polling mechanism. At the moment this has been done with a BlockingQueue.

Allow users to select the tables to be replicated

As a Shunting Yard user
I'd like to be able to select the tables to be replicated from a source metastore
So I can only replicate those that are of interest for the analysts/applications

Acceptance criteria

Users must be able to select tables from one of the pre-configured source metastores and select the target metastores where the table must be replicated.

Allow users to select whether DROP events should be processed

As a Shunting Yard user
I'd like to choose whether drop table and drop partition events should be processed in the replica metastore
So I can decide whether my replica is going to be a mirror of the original metastore or whether it can be unsynced

Overview

Some users may decide to keep their replica tables partially synced. This could be particularly useful for testing environments.

Acceptance Criteria

For each execution context provide the option to ignore DROP TABLE and DROP PARTITION events.

Create an endpoint to monitor the status of the service

As a Shunting Yard user
I'd like to know the status of the different components of the application
So I can easily find out exceptional conditions that may affect the replication

Acceptance criteria

A REST endpoint that exposes the service status, including different Shunting Yard and Circus Train components, source and target metastores.

Allow users to configure the context of selected tables for replication

As a Shunting Yard user
I'd like to configure the replication context for each of the selected tables
So I can tweak different configuration options for the replication

Overview

In general, when someone wants to replicate a table, they must specify the replica table location. Occasionally, users must also specify the target database and table names as well as configuration and copier options. These values should be exposed for configuration when the tables are selected for replication.

Acceptance criteria

TBD

Create an OffsetCommitter abstraction for the event receiver

As a Shunting Yard developer
I'd like to have control over the offset commits
So I can make sure that only successfully processed messages are committed and failures can be reprocessed

Overview

Both Kinesis and Kafka receivers commit the offset as soon as the messages are read. This may be an issue if the application fails to process any of the read events.

Enforce error handling

As a Shunting Yard user
I'd like to have errors handled and monitored properly
So I can make sure the service won't go down upon exceptions

Overview

Shunting Yard is meant to be a 24/7 service and as such it must handle exceptional conditions correctly:

Emitter side:

  • What happens if marshaling fails?
  • Should the message still be posted to the topic?
  • How do we notify about these errors if we care at all?

Receiver side:

  • What happens if a Circus Train replication fails?
  • What happens if unmarshaling fails?
  • How do we notify about these errors if we care at all?

Acceptance Criteria

Emitter:

  • Add metrics if possible: successes and failures. Log all errors using the keywords Error, ShuntingYard and Emitter so they can be analyzed easily.

Receiver:

  • Add metrics: ingestion successes and failures. Log all errors using the keywords Error, ShuntingYard and Receiver so they can be analyzed easily.

Aggregate MetaStore events

As a Shunting Yard user
I'd like to make sure my tables are replicated as efficiently as possible
So that many events in a short period of time trigger only one replication

Overview

Triggering a Circus Train replication per event can be really expensive, mainly because more than one event can occur in a short window of time, e.g. an INSERT INTO executed with TEZ will fire two ALTER TABLE events with scarcely milliseconds between them. This would cause Circus Train to run twice to replicate the same data. The Metastore/execution engine behaviour cannot be changed, so we must find a way to deal with these scenarios. The same applies to jobs that add new partitions just before completion - these jobs usually add each partition individually, one right after the other.

Acceptance criteria

Create a component that takes a set of events on the same table and aggregates them into a single event. The time window for event aggregation must be configurable.

Deploy emitter JAR

As a Shunting Yard user
I'd like to be able to install the MetaStoreEventListener from a URL
So I can automate the deployment

Acceptance Criteria

Emitter fat JAR available via Maven Central

Create a suite of Integration Tests

As a Shunting Yard developer
I'd like to make sure that the code in master is correctly integrated with the infrastructure
So I can ensure the application behaves as expected in runtime

Overview

At the moment we don't have a test suite to guarantee the correct behaviour of Shunting Yard at runtime. This is critical for a service of this nature and we must do our best to avoid releasing critical bugs that may have been introduced during the coding phase.

Acceptance Criteria

  • New project with the integration test bed
  • The new project must be deployable to AWS
  • Basic cases that include the whole lifecycle of partitioned and unpartitioned tables:
    ** Partitioned table: create table, add partition, update partition, update table cascade, drop partition, drop table
    ** Unpartitioned table: create table, update table, drop table
  • Include edge cases:
    ** Update/drop a table that does not exist in the replica
    ** Update/drop a partition that does not exist in the replica
    ** Create a table that already exists in the replica
    ** Create a partition that already exists in the replica
    ** Others

Run as a "fully executable application for Unix systems"

As a user of Shunting Yard
I want to be able to deploy it as a Unix service
So that I can start it, stop it and query its status like I would any other Unix service
and so that it shuts down gracefully without dropping any in-progress replications.

Acceptance Criteria

  • Shunting Yard deployable as a Unix service (see https://docs.spring.io/spring-boot/docs/current/reference/html/deployment-install.html) [TODO: figure out whether init.d or systemd is recommended by AWS and use that, as the default run environment will be on EMR].
  • On start the service goes into a loop where it tries to read messages forever.
  • On stop the service
    ** Stops reading any new messages
    ** Runs any replications that are currently queued up in any of the receivers (e.g. aggregated)
    ** Waits for all in-progress replications to complete
    ** Only then does it shut down and the service exits
