
Thalamus-Analytics

Thalamus Analytics is a consulting project developed at Insight Data Science for ThalamusGME.

Note: This public repository contains only a few source code samples. The snippets highlight the use of PySpark DataFrames; you can find them in the src/etl and src/cnx directories.

Data Pipeline

Ingestion and Databases

The data pipeline starts with an Azure SQL Server database, where ThalamusGME's operational data resides. A lightweight ODBC/JDBC connector (single-node, serial) in this project transfers data from Azure SQL Server to Amazon Redshift, our data warehouse. In Redshift there are three schemas: TASTG, TADW, and TARPT.

TASTG: Staging tables are created in this schema. When tables are migrated from Azure SQL to Redshift, mirror tables are created in TASTG and all data is transferred. Some nvarchar and text columns are truncated to a size of 1000 to moderately reduce data volume; binary columns are not supported in Redshift and therefore are not migrated. There are two types of data load for staging tables: initial load and incremental load. In this project, data is loaded as an initial load, i.e., a full table copy. As of 06/19/2017, with a relatively small data size, an initial load of all tables in the Azure SQL dbo schema takes a few minutes. After the initial load, it is recommended to load only incremental data, to keep the workload on the ODBC/JDBC connector small. Once the data volume grows large enough, a full table copy from Azure SQL to Redshift through the JDBC connector may take too long to be viable.
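As a sketch of how an incremental load could be expressed (the helper name, table, and watermark column below are illustrative assumptions, not part of this repository):

```python
def incremental_load_query(table, ts_column, last_loaded):
    """Build a SELECT that fetches only rows changed since the last load.

    All argument values are hypothetical; in practice the watermark would
    come from a load-audit table, and the query should be parameterized
    rather than string-formatted to avoid SQL injection.
    """
    return (
        "SELECT * FROM dbo.{table} "
        "WHERE {col} > '{ts}'".format(table=table, col=ts_column, ts=last_loaded)
    )

# Only rows updated after the previous run would be copied into TASTG.
q = incremental_load_query("applications", "updated_at", "2017-06-19 00:00:00")
```

The idea is simply to replace the full `SELECT *` copy with a watermark-bounded query once initial loads are done.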

TADW: Data warehouse tables are included in this schema. Tables in TADW are slightly denormalized to reduce the number of joins; typically, program_id and calendar_season_id are added to a few tables. Redshift does not enforce primary and foreign keys, so they are declared only to document the schema design.

TARPT: Report tables are created in this schema. The complex logic that generates report tables is implemented in Spark (PySpark with DataFrames).

Spark (ETL)

Spark is used to handle the ETL process on Redshift data. Spark scripts are written with the PySpark DataFrame API, which expresses Spark RDD operations in a SQL-like fashion.
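A minimal sketch of how such a read could be wired up through the spark-redshift connector (the JDBC URL, table name, and S3 bucket below are placeholders; the option names follow the Databricks connector):

```python
def redshift_read_options(jdbc_url, table, tempdir):
    """Assemble options for a spark-redshift DataFrame read.

    `tempdir` must point at the S3 location the connector uses for
    temporary unload files; all values here are placeholders.
    """
    return {"url": jdbc_url, "dbtable": table, "tempdir": tempdir}

opts = redshift_read_options(
    "jdbc:redshift://example-cluster:5439/dev?user=u&password=p",
    "tadw.some_table",
    "s3n://example-bucket/tmp/",
)

# On a live Spark cluster (not runnable here) this would become:
# df = spark.read.format("com.databricks.spark.redshift").options(**opts).load()
# report = df.groupBy("program_id").count()  # SQL-like DataFrame operation
```

The S3 `tempdir` is why the AWS stack below lists S3 as a requirement.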

Data Visualization and RestAPI

Data visualization is achieved with redash.io and Amazon QuickSight. A RESTful API is generated with flask-restful.

System Setup and Configuration

Python

Install Python 2.7 on your machine: https://www.python.org/downloads/

AWS Stack

Here's a list of services needed for infrastructure:

Redshift: Distributed columnar database.

EC2: Cloud servers used to build the Spark cluster via Pegasus (see Pegasus & Spark).

S3: Needed by Spark-Redshift to store temporary data files.

VPC: Network Security

IAM: Role, group, and user access control

Azure SQL ODBC

Follow the instructions on the Azure website.

Linux / Unix Users:

sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql.list
exit
sudo apt-get update
sudo apt-get install msodbcsql mssql-tools unixodbc-dev
sudo pip install pyodbc==3.1.1

Windows Users:

pip install pyodbc==3.1.1

You can check src/cnx/odbc_azure_cnx.py for implementation details.
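For orientation, a hedged sketch of the kind of connection string pyodbc expects for Azure SQL (server, database, and credentials are placeholders; the driver name corresponds to the msodbcsql package installed above):

```python
def azure_odbc_conn_str(server, database, user, password):
    """Build an ODBC connection string for Azure SQL Server.

    The driver name matches the msodbcsql package; every other value is a
    placeholder to be replaced with real credentials.
    """
    template = (
        "DRIVER={ODBC Driver 13 for SQL Server};"
        "SERVER=%s.database.windows.net,1433;"
        "DATABASE=%s;UID=%s;PWD=%s;"
        "Encrypt=yes;TrustServerCertificate=no;"
    )
    return template % (server, database, user, password)

cnx_str = azure_odbc_conn_str("myserver", "mydb", "user", "secret")
# import pyodbc                        # requires network access to Azure
# cnx = pyodbc.connect(cnx_str)
# rows = cnx.cursor().execute("SELECT 1").fetchall()
```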

Redshift, psycopg2 & JDBC

Redshift is built on PostgreSQL, so we can use psycopg2 to connect.

pip install psycopg2

You can check src/cnx/jdbc_redshift_cnx.py for implementation details.
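A minimal sketch of a psycopg2 connection to Redshift (host, database, and credentials are placeholders; Redshift listens on port 5439 by default):

```python
def redshift_conn_params(host, dbname, user, password, port=5439):
    """Keyword arguments for psycopg2.connect(); all values are placeholders."""
    return {
        "host": host,
        "dbname": dbname,
        "user": user,
        "password": password,
        "port": port,
    }

params = redshift_conn_params(
    "example-cluster.abc123.us-east-1.redshift.amazonaws.com", "dev", "admin", "secret"
)
# import psycopg2                      # requires a reachable cluster
# cnx = psycopg2.connect(**params)
# cur = cnx.cursor()
# cur.execute("SELECT count(*) FROM tastg.some_table")
```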

Pegasus & Spark

We use Pegasus (an Insight open source project) to set up the Spark cluster. You will need to install Pegasus on a Linux machine.

$ git clone https://github.com/InsightDataScience/pegasus.git
$ pip install awscli

Spark-Redshift(DataBricks)

The Spark-Redshift connector, an open source product by Databricks, is used to connect the Spark cluster to the Redshift database. The JAR package RedshiftJDBC42-1.2.1.1001.jar is already included in the jars directory, so no further configuration is needed.

Flask

We use Flask to generate RESTful APIs.

pip install flask-restful
