kartikra / dattasa

Python project for the dattasa package in PyPI

Home Page: https://pypi.python.org/pypi/dattasa

License: MIT License

Python 100.00%
Topics: data-pipeline greenplum redshift rabbitmq kafka mysql airflow salesforce delighted wootric
dattasa's Introduction

Project Background

A Python package that helps data engineers and data scientists accelerate data-pipeline development.

The goal of this project is to build a set of wrappers that can be reused for building data pipelines from:

  • Relational databases: Postgres, MySQL, Greenplum, Redshift, etc.
  • NoSQL databases: Hive, Mongo, etc.
  • Messaging sources and caches: Kafka, Redis, RabbitMQ, etc.
  • Cloud service providers: Salesforce, Mixpanel, JIRA, Google Drive, Delighted, Wootric, etc.

Installation

There are 3 ways to install the dattasa package:

  1. Easiest way: install from PyPI using pip
     pip install dattasa
  2. Download from GitHub and build from scratch
     git clone git@github.com:kartikra/dattasa.git
     cd dattasa
     python setup.py build
     python setup.py clean
     python setup.py install
  3. Download from GitHub and install using pip
     git clone git@github.com:kartikra/dattasa.git
     cd dattasa
     pip install -e .
     pip install -U -e .    (use -U when upgrading an existing install)

Config Files

By default, dattasa expects the config files to be in the home directory of the user. These locations can be overridden. See the links to sample code in the README below to find out more. There are 2 YAML config files.
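As a rough illustration only (the keys and entry names below are assumptions for a hypothetical setup, not dattasa's actual schema -- consult the sample code links for the real layout), a database entry in such a YAML config might look like:

```yaml
# hypothetical example entry -- not dattasa's documented schema
warehouse:
  type: postgres
  host: db.example.com
  port: 5432
  database: analytics
  user: etl_user
```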

Environment Variables

The dattasa package relies on the following environment variables. Make sure to set these in your bash profile.

  • GPLOAD_HOME: Path to the gpload package (needed only if using gpload utilities for Greenplum or Redshift)
  • PROJECT_HOME: Path to the Python project directory
  • PROJECT_HOME/python_bash_scripts: Python scripts that invoke gpload (needed only if using gpload utilities for Greenplum or Redshift)
  • SQL_DIR: Place to keep all SQL scripts
  • TEMP_DIR: All temp files are created in this folder
  • LOG_DIR: All log files are created in this folder
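The variables above can be set in a bash profile. A minimal sketch, assuming a hypothetical project layout (the paths shown are illustrative, not values required by dattasa):

```shell
# Illustrative ~/.bash_profile entries; paths are assumptions for an
# example project layout.
export PROJECT_HOME="$HOME/projects/my_pipeline"
export SQL_DIR="$PROJECT_HOME/sql"
export TEMP_DIR="$PROJECT_HOME/tmp"
export LOG_DIR="$PROJECT_HOME/logs"
# Only needed when using the gpload utilities for Greenplum / Redshift:
export GPLOAD_HOME="/usr/local/greenplum-loaders"
```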

Description of classes

v1.0 of the package comprises the following classes. Please see the linked sample code for details on how to use each of them.

| Class | Description | Sample Code |
|---|---|---|
| environment | Lets you source all the OS environment variables | see first row in mongo example |
| postgres_client | Lets you make connections to a postgres / redshift database using psycopg2 or sqlalchemy. Use the connections to interact with the database in an interactive program, or run queries from a SQL file using the connection | sample postgres code |
| greenplum_client (inherits postgres_client) | Lets you use the psql and gpload utilities provided by Pivotal Greenplum. Make connections to a postgres / greenplum database using psycopg2 or sqlalchemy. Use the connections to interact with the database in an interactive program, or run queries from a SQL file using the connection | sample greenplum code |
| mysql_client | Lets you use mysql and other methods provided by the PyMySQL package | sample mysql code |
| file_processor | Create sftp connections using the paramiko package. Other file manipulations like row_count, encryption, archive (File class) | see file processing example |
| notification | Send email notifications | |
| mongo_client | Load data to mongodb using bulk load. Run JavaScript queries | see mongo example |
| redis_client | Read data from a redis cache or load a redis cache | see redis example |
| kafka_system | Currently allows Publisher and Consumer to use kafka in batch mode | see kafka example |
| rabbitmq_system | Currently has a Publisher to publish messages to rabbitmq | |
| mixpanel_client | Connect to the mixpanel API and fetch data using JQL or export raw events data. mixpanel api documentation | see mixpanel section in api example |
| salesforce_client | Create a connection to salesforce using the simple_salesforce package | see salesforce section in api example |
| delighted_client | Get NPS scores and survey responses from delighted. api documentation | see delighted section in api example |
| wootric_client | Gets NPS scores and survey responses from wootric. api documentation | see wootric section in api example |
| dag_controller | Functions needed to integrate this package within an airflow DAG. airflow documentation and github project | |

data_pipeline classes

This module comprises the classes that are accessible to other projects. data_pipeline uses a factory design pattern: it decides which modules to call and creates the appropriate object based on the type of database or API being accessed (as defined in the config file). The data pipeline consists of data components and API calls. Each data-processor object can consume individual data streams and process them. data_pipeline comprises 3 classes:

  • DataComponent: Each database connection is considered to be a data-component object. See the examples for postgres, mysql, greenplum, etc. above
  • APICall: Each API call is an APICall object. See the examples for mixpanel, delighted, salesforce and wootric above
  • DataProcessor: Transfers and loads data between data components. See examples
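The factory dispatch described above can be sketched roughly as follows. This is an illustrative sketch of the design pattern, not dattasa's actual implementation; the class internals, function name, and config keys are assumptions:

```python
# Sketch of a factory that picks DataComponent vs. APICall from the
# "type" declared in the config file. Names here are illustrative.

class DataComponent:
    """Wraps a single database connection (postgres, mysql, etc.)."""
    def __init__(self, config):
        self.config = config

class APICall:
    """Wraps a single API connection (mixpanel, delighted, etc.)."""
    def __init__(self, config):
        self.config = config

# Types treated as databases rather than APIs (assumed list)
DATABASE_TYPES = {"postgres", "mysql", "greenplum", "redshift", "mongo", "redis", "kafka"}

def make_pipeline_object(name, config):
    """Create the right object for a named entry based on its declared type."""
    kind = config[name]["type"]
    if kind in DATABASE_TYPES:
        return DataComponent(config[name])
    return APICall(config[name])

config = {
    "warehouse": {"type": "postgres", "host": "localhost"},
    "nps": {"type": "delighted", "api_key": "..."},
}
warehouse = make_pipeline_object("warehouse", config)  # DataComponent
nps = make_pipeline_object("nps", config)              # APICall
```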

Adding ipython notebook files to github

Use Git Large File Storage (git lfs). See documentation

  • if using a Mac, install git-lfs using brew: brew install git-lfs
  • initialize lfs: git lfs install
  • track ipynb files in your project: go to the project folder and run git lfs track "*.ipynb"
  • add .ipynb_checkpoints/ to the .gitignore file
  • finally, add the .gitattributes file: git add .gitattributes
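After the tracking step, the generated .gitattributes should contain an entry along these lines (the standard line git lfs track writes):

```
*.ipynb filter=lfs diff=lfs merge=lfs -text
```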

hosting python package in pypi

  • add your credentials to the $HOME/.pypirc file (see instructions)
  • build the code: python setup.py build && python setup.py clean && python setup.py install
  • push to pypitest: python setup.py sdist upload -r pypitest
  • push to pypi prod: python setup.py sdist upload -r pypi
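For the credentials step, a minimal $HOME/.pypirc defining both the pypi and pypitest index servers might look like the sketch below (repository URLs and the placeholder username are illustrative; check the PyPI instructions for current values):

```
[distutils]
index-servers =
    pypi
    pypitest

[pypi]
repository = https://upload.pypi.org/legacy/
username = your-username

[pypitest]
repository = https://test.pypi.org/legacy/
username = your-username
```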

dattasa's People

Contributors: kram202
