GithubHelp home page GithubHelp logo

povstenko / parquet_convertor Goto Github PK

View Code? Open in Web Editor NEW
0.0 2.0 1.0 43 KB

๐Ÿ”„ Convert csv to parquet and explore parquet data structure

License: Apache License 2.0

Python 100.00%
python csv parquet apache-parquet csv-converter argument-parser json parquet-schema

parquet_convertor's Introduction

Data Files Convertor

GitHub Apache 2.0

This script allows user to convert different data types.

This tool accepts comma-separated value files (.csv) as well as apache parquet (.parquet) files. It is assumed that the first row of the spreadsheet is the location of the columns.

This script requires that pandas, pyarrow, argparse and pathlib be installed within the Python environment you are running this script in.

This file can also be imported as a module and contains the following functions:

  • csv_to_parquet - convert csv to parquet and save to file
  • parquet_to_csv - convert parquet to csv and save to file
  • parquet_schema - returns schema of parquet file
  • add_filename_suffix - returns filename string with added suffix for filename and change extension
  • is_file_ext_correct - returns returns True if filename has correct file extension and prints message otherwise
  • print_success_message - prints message of successfully conversion with elapsed time
  • construct_argument_parser - constructs the argument parser
  • main - the main function of the script

Table of Contents

Parameters

  • -h, --help show help message and exit
  • -cp, --csv2parquet convert csv to parquet. Set input csv filename string (example: data.csv)
  • -pc, --parquet2csv convert parquet to csv. Set input parquet filename string (example: data.parquet)
  • -s, --get_schema get schema of parquet file. Set input parquet filename string (example: data.parquet)
  • -o, --output set output file name without extension (example: newfile)
  • -d, --delimiter set delimiter for csv file (default: ,)

Usage

Convert CSV to Parquet

Assume that you have some data in your data.csv file:

id,first_name,second_name,age
0,Vitaliy,Povstenko,19
1,John,Doe,25
2,Bill,Gates,40
3,Elon,Musk,30
4,Don,Joel,25

need to convert some data.csv file to parquet, so you need to write the following command:

$python convertor.py --csv2parquet data.csv

And you receive new file data_converted.parquet in your directory

Convert Parquet to CSV

You can specify --parquet2csv parameter in order to convert data_converted.parquet file back to csv

$python convertor.py --parquet2csv data_converted.parquet

New data_converted_converted.parquet now added to your directory, but it is good to specify the output file name (without extension) in parameter --output:

$python convertor.py --parquet2csv data_converted.parquet --output newfile

Successfully converted from data_converted.parquet to newfile.csv. In order, you need to save or read CSV files using a special delimiter:

$python convertor.py -pc data_converted.parquet -o newfile --delimiter ;

Now file newfile.csv id delimited by ;:

id;first_name;second_name;age
0;Vitaliy;Povstenko;19
1;John;Doe;25
2;Bill;Gates;40
3;Elon;Musk;30
4;Don;Joel;25

Convert CSV to JSON

This option currently is disabled and moved to json-convertor GitHub branch

The script allows you to convert from CSV to JSON in a similar way:

$python convertor.py --csv2json data.csv  

data_converted.json looks like:

[{"id": "0", "first_name": "Vitaliy", "second_name": "Povstenko", "age": "19"}...]

In default, script save JSON file without the indent, but you can specify indent in new JSON file:

$python convertor.py -cj data.csv -o names --json_indent 4

New names.json contains:

[
    {
        "id": "0",
        "first_name": "Vitaliy",
        "second_name": "Povstenko",
        "age": "19"
    }
...
]

Get Parquet schema

There are some cases when you need to know the schema of your Parquet file. For example:

$python convertor.py --get_schema data_converted.parquet

The script produces the following output:

id: int64
first_name: string
second_name: string
age: int64

License

Apache 2.0 License: www.apache.org/licenses/LICENSE-2.0 or see the LICENSE file.

parquet_convertor's People

Contributors

povstenko avatar

Watchers

 avatar  avatar

Forkers

ramyshete

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.