apicrafter / pyiterable

Python library to read, write and convert data files with formats BSON, JSON, NDJSON, Parquet, ORC, XLS, XLSX and XML

License: MIT License

Python 100.00%
bson json jsonlines orc datafile file-conversion parquet xls xlsx xml

pyiterable's Introduction

Iterable Data

Work in progress. Documentation in progress

Iterable Data is a Python library for reading data files row by row and for writing data files. Its iterable classes work much like file objects or csv.DictReader, or like reading Parquet files row by row.

This library was written to simplify data processing and conversion between formats.

Supported file types:

  • BSON
  • JSON
  • NDJSON (JSON lines)
  • XML
  • XLS
  • XLSX
  • Parquet
  • ORC
  • Avro
  • Pickle

Supported file compression: GZip, BZip2, LZMA (.xz), LZ4, ZIP
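To illustrate how a compression codec can be chosen from a file name, here is a minimal sketch that uses only the Python standard library. The `OPENERS` table and the `open_compressed` helper are hypothetical names for this illustration and are not part of pyiterable. LZ4 and ZIP are left out because they need other modules.

```python
import bz2
import gzip
import lzma

# Hypothetical mapping from file suffix to a stdlib opener (not pyiterable's own table).
OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}

def open_compressed(path, mode="rt", encoding="utf-8"):
    """Open a file in text mode, choosing a (de)compressor from the file suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, mode, encoding=encoding)
    # No known compression suffix: fall back to a plain file.
    return open(path, mode, encoding=encoding)
```

With this helper, `open_compressed('data.csv.xz')` returns a text file object that can be passed to csv.DictReader.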

Why write this lib?

Python has many high-quality data processing tools and libraries, especially pandas and other data frame libraries. The main issue with most of them is that they expect flat data. Data frames don't support complex data types, so you must flatten the data each time.

pyiterable helps you read any data as a Python dictionary instead of flattening data. It makes it much easier to work with such data sources as JSON, NDJSON, or BSON files.
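The difference between nested records and flattened rows can be sketched with the standard library alone. The sample data and the `iter_records` and `flatten` helpers below are made up for this illustration and do not come from pyiterable.

```python
import io
import json

# Two NDJSON records with nested fields (made-up sample data).
ndjson = io.StringIO(
    '{"name": "Alice", "address": {"city": "Paris"}, "tags": ["a", "b"]}\n'
    '{"name": "Bob", "address": {"city": "Oslo"}, "tags": []}\n'
)

def iter_records(fileobj):
    """Yield one dictionary per non-empty line."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

def flatten(record, prefix=""):
    """Flatten nested dictionaries into dotted keys, as a flat table would need."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

records = list(iter_records(ndjson))
# Nested access works directly on the dictionaries.
cities = [r["address"]["city"] for r in records]
# Flattening turns "address" -> "city" into the single key "address.city".
flat_first = flatten(records[0])
```

Working with dictionaries keeps the original structure, while the flattened form has to encode nesting in the key names.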

This code is used in several tools written by its author: the command-line tool undatum and the data processing ETL engine datacrafter.

Requirements

Python 3.8+

Installation

pip install iterabledata or use this repository

Documentation

In progress. Please see usage and examples.

Usage and examples

Read compressed CSV file

Read compressed csv.xz file


from iterable.helpers.detect import open_iterable

# The file type and compression codec are detected from the file name
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Add data processing code here
    if n % 1000 == 0:
        print('Processing %d' % (n))

Detect encoding and file delimiter

Detects the encoding and delimiter of the selected CSV file and uses them to open it as an iterable


from iterable.helpers.detect import open_iterable
from iterable.helpers.utils import detect_encoding, detect_delimiter

delimiter = detect_delimiter('data.csv')
encoding = detect_encoding('data.csv')

# Pass the detected values to the CSV iterable
source = open_iterable('data.csv', iterableargs={'encoding': encoding['encoding'], 'delimiter': delimiter})
n = 0
for row in source:
    n += 1
    # Add data processing code here
    if n % 1000 == 0:
        print('Processing %d' % (n))
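For comparison, delimiter detection can be sketched with csv.Sniffer from the standard library. This is only an illustration of the idea; the library's detect_delimiter may work differently. The `sniff_delimiter` helper and the sample text are hypothetical.

```python
import csv

def sniff_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a CSV text sample with the standard library."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter

# Made-up sample: a semicolon-separated file with a header row.
sample = "id;name;city\n1;Alice;Paris\n2;Bob;Oslo\n"
delimiter = sniff_delimiter(sample)

# Use the detected delimiter to read the rows as dictionaries.
rows = list(csv.DictReader(sample.splitlines(), delimiter=delimiter))
```

In practice the sample would be the first few kilobytes of the file, decoded with the detected encoding.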

Convert Parquet file to JSON lines (NDJSON) compressed with LZMA using pipeline

Uses the pipeline class to iterate through a Parquet file and write its selected fields to an LZMA-compressed JSON lines (NDJSON) file


from iterable.helpers.detect import open_iterable
from iterable.pipeline import pipeline

source = open_iterable('data/data.parquet')
destination = open_iterable('data/data.jsonl.xz', mode='w')

def extract_fields(record, state):
    # Keep only the selected fields of each record
    out = {}
    record = dict(record)
    print(record)
    for k in ['name',]:
        out[k] = record[k]
    return out

def print_process(stats, state):
    # Print processing statistics
    print(stats)

pipeline(source, destination=destination, process_func=extract_fields,
         trigger_on=2, trigger_func=print_process, final_func=print_process,
         start_state={})
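The control flow that such a pipeline call suggests can be sketched in plain Python. Everything below is an assumption for illustration: `run_pipeline`, `ListWriter`, the `stats` keys, and the reading that trigger_func fires every trigger_on records while final_func runs once at the end are not taken from pyiterable's source.

```python
class ListWriter:
    """Hypothetical destination that collects written records in a list."""
    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def run_pipeline(source, destination, process_func, trigger_on=1000,
                 trigger_func=None, final_func=None, start_state=None):
    """Hypothetical stand-in for a pipeline loop (not pyiterable's implementation)."""
    state = {} if start_state is None else start_state
    stats = {"rec_count": 0, "nulls": 0}
    for record in source:
        stats["rec_count"] += 1
        result = process_func(record, state)
        if result is None:
            stats["nulls"] += 1          # record dropped by process_func
        else:
            destination.write(result)
        # Assumed semantics: report progress every `trigger_on` records.
        if trigger_func and stats["rec_count"] % trigger_on == 0:
            trigger_func(stats, state)
    if final_func:
        final_func(stats, state)         # assumed to run once, after the loop
    return stats

def extract_name(record, state):
    # Keep only the "name" field; drop records that lack it.
    return {"name": record["name"]} if "name" in record else None

progress = []
source = [{"name": "a", "x": 1}, {"x": 2}, {"name": "c"}, {"name": "d"}]
destination = ListWriter()
stats = run_pipeline(source, destination, extract_name, trigger_on=2,
                     trigger_func=lambda s, st: progress.append(s["rec_count"]))
```

Passing the state dictionary to every callback lets process_func accumulate values across records without global variables.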

Convert gzipped JSON lines (NDJSON) file to BSON compressed with LZMA

Reads each row from a JSON lines file using the Gzip codec and writes it as BSON data using the LZMA codec


from iterable.datatypes import JSONLinesIterable, BSONIterable
from iterable.codecs import GZIPCodec, LZMACodec

# Source: gzip-compressed JSON lines file
codecobj = GZIPCodec('data.jsonl.gz', mode='r', open_it=True)
iterable = JSONLinesIterable(codec=codecobj)
# Destination: LZMA-compressed BSON file
codecobj = LZMACodec('data.bson.xz', mode='wb', open_it=False)
write_iterable = BSONIterable(codec=codecobj, mode='w')
n = 0
for row in iterable:
    n += 1
    if n % 10000 == 0:
        print('Processing %d' % (n))
    write_iterable.write(row)
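The same streaming pattern can be sketched with the standard library only. Because BSON needs a third-party module, this hypothetical `recompress_ndjson` helper writes JSON lines on the output side instead. It illustrates the row-by-row approach and is not a replacement for the codec classes above.

```python
import gzip
import json
import lzma

def recompress_ndjson(src_path, dst_path, report_every=10000):
    """Stream records from a gzipped NDJSON file into an LZMA-compressed one."""
    n = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         lzma.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue                      # skip blank lines
            record = json.loads(line)         # one dictionary per line
            dst.write(json.dumps(record) + "\n")
            n += 1
            if n % report_every == 0:
                print("Processing %d" % n)
    return n
```

Only one record is held in memory at a time, so the file size does not limit the conversion.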

More examples and tests

See tests for more usage examples

pyiterable's Issues

No type hints

Type hints need to be added. They will make the code more self-documenting and make the library simpler and more surprise-free to use.

Error: No module named 'bson'

Hi, Thanks for sharing this library!

Use case

In an ongoing scheduled task, I need to process and convert XLS and XLSX files both ways. I can probably get away with just XLSX to XLS conversion.

Error

Using the following code, which is similar to your examples, I get the error below. In fact, any of your examples produces the same error:

iterable = XLSIterable('routes/admin/tms/202404/TMS_202404D02144.xls')
write_iterable = XLSXIterable('routes/admin/xlsx/TMS_202404D02144.xlsx', mode='w')
n = 0
for row in iterable:
    n += 1
    write_iterable.write(row)
iterable.close()
write_iterable.close()
[image: screenshot of the error]

BSON is not even in the picture!

I installed your library on my MacBookPro M1 running Sonoma 14.4.1 without errors:

pip install iterabledata
(venv) nuri@MacBook-Pro rdsp % pip show iterabledata                               
Name: iterabledata
Version: 1.0.2
Summary: Iterable data processing Python library
Home-page: https://github.com/apicrafter/pyiterable/
Author: Ivan Begtin
Author-email: [email protected]
License: MIT
Location: /Users/nuri/DELL 5720 REPOS/rdsp/venv/lib/python3.9/site-packages
Requires: avro, chardet, jsonlines, lxml, lz4, openpyxl, orjson, parquet, pyorc, xlrd

I would appreciate any insight.
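One way to narrow this down is to check which modules can be imported in the active virtual environment. The sketch below uses a hypothetical `missing_modules` helper. The module names are taken from the error title and the Requires line above and are illustrative, not the library's full import list.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Illustrative list of top-level modules to check in the current environment.
print(missing_modules(["bson", "openpyxl", "xlrd", "lxml"]))
```

If bson is reported as missing, a likely cause (an assumption, not confirmed here) is that no installed package provides that module, since none of the packages on the Requires line is named bson. The pymongo package is one distribution that ships a bson module.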
