csv2parquet

Wrapper script to convert CSV files to Parquet format using the excellent DuckDB.

Since DuckDB does all the real work, this is just a simple Python wrapper so I don't forget the precise command to use.

Assumes a true CSV (comma-separated, not tab- or semicolon-separated, although this could be made configurable). The Parquet file is written out with the ZSTD codec.

DuckDB's read_csv_auto does an excellent job of guessing the appropriate type for each column, so there has been no need so far to provide a way of explicitly casting columns to a certain type.

DuckDB converts the file in a streaming fashion rather than loading the entire file first, so converting large files should not pose any issues (short of bugs, maybe).
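The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual script from this repository: the function names `sql_quote`, `build_copy_sql` and `convert` are hypothetical, and it assumes the `duckdb` Python package is installed.

```python
def sql_quote(value: str) -> str:
    """Return value as a single-quoted SQL string literal (hypothetical helper)."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_sql(csv_path: str, parquet_path: str) -> str:
    """Build a COPY statement that reads a comma-separated CSV with
    read_csv_auto and writes a ZSTD-compressed Parquet file."""
    return (
        f"COPY (SELECT * FROM read_csv_auto({sql_quote(csv_path)}, delim=',')) "
        f"TO {sql_quote(parquet_path)} (FORMAT PARQUET, COMPRESSION 'zstd')"
    )


def convert(csv_path: str, parquet_path: str) -> None:
    """Run the conversion in an in-memory DuckDB instance."""
    import duckdb  # third-party package, assumed installed (pip install duckdb)

    con = duckdb.connect()
    try:
        con.execute(build_copy_sql(csv_path, parquet_path))
    finally:
        con.close()
```

Calling `convert("input.csv", "output.parquet")` would then write the Parquet file next to the CSV. Building the statement in a separate function makes it easy to print and check the exact command before running it.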

parquet_info

A short script that outputs some metadata and the schema of a parquet file. Useful if you get a parquet file from somewhere and want to quickly check its contents.
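As a rough idea of what such a script might do, here is a sketch that prints the raw Parquet schema, the column types as DuckDB sees them, and the row count. It is not the actual parquet_info script: the names `sql_quote`, `info_queries` and `print_info` are hypothetical, and it assumes the `duckdb` Python package is installed.

```python
def sql_quote(value: str) -> str:
    """Return value as a single-quoted SQL string literal (hypothetical helper)."""
    return "'" + value.replace("'", "''") + "'"


def info_queries(parquet_path: str) -> dict:
    """Map a short label to the DuckDB query that produces that piece of info."""
    src = sql_quote(parquet_path)
    return {
        "raw parquet schema": f"SELECT * FROM parquet_schema({src})",
        "columns as seen by DuckDB": f"DESCRIBE SELECT * FROM read_parquet({src})",
        "row count": f"SELECT count(*) FROM read_parquet({src})",
    }


def print_info(parquet_path: str) -> None:
    """Run each query and print its rows under a small heading."""
    import duckdb  # third-party package, assumed installed (pip install duckdb)

    con = duckdb.connect()
    try:
        for label, query in info_queries(parquet_path).items():
            print(f"-- {label}")
            for row in con.execute(query).fetchall():
                print(row)
    finally:
        con.close()
```

Calling `print_info("filename.parquet")` prints one block of rows per query.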

Using DuckDB itself

If you have the DuckDB CLI at hand, another quick solution (for showing the raw Parquet schema information) is to execute this in the DuckDB CLI:
SELECT * FROM parquet_schema('filename.parquet');

If you want to see what the Parquet file will look like in DuckDB, use:
create view test as select * from 'filename.parquet';
describe table test;

This also creates a view on the Parquet file, which you can use like a regular table. So to get the number of entries in the Parquet file after this, simply do:
select count(*) from test;
