licht1stein / pandas-to-postgres

This project forked from cid-harvard/pandas-to-postgres

Copy Pandas DataFrames and HDF5 files to PostgreSQL database

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

Pandas-to-postgres

Pandas-to-postgres bulk loads the contents of large DataFrames into PostgreSQL as quickly as possible. The main differences from pandas' to_sql function are:

  • Uses COPY combined with to_csv instead of execute / executemany, which runs much faster for large volumes of data.
  • Uses COPY FROM STDIN with StringIO to avoid the IO overhead of intermediate files. This matters in particular for data stored in formats such as HDF, Stata, and Parquet, which are common in the scientific world.
  • Provides chunked loading methods so that larger-than-memory tables can be loaded. In particular, the HDF5 functions load data in chunks directly from the file, an approach that is easily extensible to other formats that support random access by row range.
  • Removes indexing overhead by automatically detecting and dropping indexes before the load, and then re-creating them afterwards.
  • Allows you to load multiple separate HDF tables in parallel using multiprocessing.Pool.
  • Works around pandas null value representation issues: float pandas columns that map to an integer SQL type are converted into object columns with int values where applicable and NaN elsewhere.
  • Provides hooks to modify data as it is loaded.
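
The COPY FROM STDIN, chunking, and null-handling ideas above can be sketched with the standard library alone. This is a minimal illustration and not the package's actual implementation: the helper names (row_ranges, as_sql_int, rows_to_csv_buffer, copy_in_chunks) are hypothetical, the csv module stands in for to_csv, and the cursor is assumed to be a psycopg2 cursor, whose copy_expert method accepts a SQL string and a file-like object.

```python
import csv
import io
import math


def row_ranges(total_rows, chunksize):
    """Yield (start, stop) row ranges that cover total_rows in chunks."""
    for start in range(0, total_rows, chunksize):
        yield start, min(start + chunksize, total_rows)


def as_sql_int(value):
    """Turn integral floats into ints and NaN into None; leave other values alone."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def rows_to_csv_buffer(rows):
    """Serialize rows into an in-memory CSV buffer, with no intermediate file.

    csv.writer emits None as an empty field, which COPY's CSV format reads as
    NULL by default. Single-column rows need extra care, because csv quotes a
    lone empty field.
    """
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf


def copy_in_chunks(cursor, table_name, columns, rows, chunksize=1000):
    """Send rows to PostgreSQL chunk by chunk with COPY ... FROM STDIN.

    Assumption: `cursor` is a psycopg2 cursor (anything exposing copy_expert).
    The table and column names are interpolated without quoting, so they must
    come from a trusted source.
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table_name, ", ".join(columns)
    )
    for start, stop in row_ranges(len(rows), chunksize):
        cursor.copy_expert(sql, rows_to_csv_buffer(rows[start:stop]))
```

Only one chunk is held as CSV text at a time, so peak memory is bounded by the chunk size rather than by the table size, provided the rows themselves are read lazily from a source that supports access by row range.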

Anecdotally, we use this to load approximately 640 million rows of data from a 7.1 GB HDF file (zlib-compressed) containing 23 tables, with 75% of the data concentrated in 3 of them and a mean of 6 columns per table. Loading this into an m4.xlarge RDS instance running PostgreSQL 10.3 takes 54 minutes using 4 threads, of which approximately 10-15 minutes is spent re-creating indexes.

Dependencies

  • Python 3
  • psycopg2 (for the low level COPY from stdin)
  • sqlalchemy (for reflection for indexes)
  • pandas

Usage Example

from pandas_to_postgres import (
    DataFrameCopy,
    hdf_to_postgres,
)

# Assumes `db` exposes a SQLAlchemy `engine` and `metadata`, and `df` is a pandas DataFrame
table_model = db.metadata.tables['my_awesome_table']

# already loaded DataFrame & SQLAlchemy Table model
with db.engine.connect() as c:
    DataFrameCopy(df, conn=c, table_obj=table_model).copy()

# HDF from file
hdf_to_postgres('./data.h5', engine_args=["psycopg://..."])

# Parallel HDF from file
hdf_to_postgres('./data.h5', engine_args=["psycopg://..."], processes=4)

Other Comparisons

  • Odo: A much more general tool that provides some similar features across many formats and databases, but lacks many of our specific features. Unfortunately, it is currently buggy and unmaintained.
  • Postgres Binary Parser: Uses COPY WITH BINARY to remove the pandas-to-CSV bottleneck, but it did not provide as large an improvement for us.
  • pg_bulkload: The industry standard, has some overlap with us. Works extremely well if you have CSV files, but not if you have any other format (you'd have to write your own chunked read/write code and pipe it through, at which point you might as well use ours). Judging by benchmarks we're in the same ballpark. Could perhaps replace psycopg2 as our backend eventually.

Contributors

bleonard33, licht1stein, makmanalp
