datacoon / undatum

undatum: a command-line tool for data processing. Brings CSV simplicity to JSON lines and BSON

License: MIT License

Python 100.00%
bson jsonlines jsonl json csv cli command-line data dataset convert

undatum's People

Contributors

ivbeg

undatum's Issues

Add XML analysis command

Add a command for XML structure analysis. It should detect the structure of an XML file and identify the data container tag and its child tags, to make XML file conversion easier.

Right now, to convert an XML file you need to provide tagname: the name of the XML tag that is assumed to contain the data we are interested in. But only the user knows the name of this tag. A command that analyzes the XML structure and prints it should help the user make that choice.
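
One possible way to approach such an analysis is sketched below. This is a minimal illustration using only the Python standard library, not undatum's implementation; the function names analyze_xml and guess_record_path and the sample document are hypothetical. It counts how often each tag path occurs and assumes that the shallowest path with the highest count is the tag that holds the records.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def analyze_xml(source):
    """Count how often each tag path occurs in an XML document.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    counts = Counter()
    stack = []  # tags of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            counts["/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # release finished elements to keep memory low
    return counts


def guess_record_path(counts):
    """Return the shallowest path among those with the highest count.

    Assumption: the data container is the most frequently repeated element.
    """
    top = max(counts.values())
    candidates = [path for path, n in counts.items() if n == top]
    return min(candidates, key=lambda path: path.count("/"))


# Made-up sample document, for illustration only.
sample = b"""<catalog>
  <items>
    <item><id>1</id><name>a</name></item>
    <item><id>2</id><name>b</name></item>
    <item><id>3</id><name>c</name></item>
  </items>
</catalog>"""

counts = analyze_xml(io.BytesIO(sample))
for path, n in sorted(counts.items()):
    print(f"{n:>4}  {path}")
print("likely record tag:", guess_record_path(counts).split("/")[-1])
```

The heuristic is deliberately simple; a real command would probably also print the full tree so the user can override the guess.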

Add more commands: sort, merge, join, print, query, search

More commands needed:

  • sort - sort by one or more columns
  • merge - merge two or more data files
  • join - join two or more data files
  • print - multiple ways to print values in data files
  • query - simple yet effective query command
  • search - search for value using substring and regex
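
As an illustration of what the proposed search command could do, here is a rough sketch for JSON lines input. The function name search_records and its parameters are assumptions made for this example and are not part of undatum.

```python
import json
import re


def search_records(lines, pattern, fields=None, use_regex=False):
    """Yield records from JSON lines in which a selected field matches.

    Hypothetical helper: `fields` limits the search to the given keys;
    by default every top-level field is checked.
    """
    if use_regex:
        matcher = re.compile(pattern).search
    else:
        def matcher(text):
            return pattern in text
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        keys = fields if fields is not None else list(record.keys())
        # Values are compared as text so numbers can be searched too.
        if any(matcher(str(record[key])) for key in keys if key in record):
            yield record


# Made-up sample data.
data = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob"}',
    '{"id": 3, "name": "Alina"}',
]

substring_hits = list(search_records(data, "Ali"))
regex_hits = list(search_records(data, r"^B", fields=["name"], use_regex=True))
print([r["id"] for r in substring_hits])
print([r["id"] for r in regex_hits])
```

Because the function is a generator, it could be applied to a large file line by line without loading it fully into memory.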

No available formula for brew install undatum

On macOS, brew install undatum does not work:

$ brew install undatum
Error: No available formula with the name "undatum"
==> Searching for a previously deleted formula (in the last month)...
Error: No previously deleted formula found.
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
==> Searching taps on GitHub...
Error: No formulae found in taps.

No module named `xmltodict`

There seems to be a missing dependency when undatum is installed using pip install --upgrade undatum:

➜  undatum --help
Traceback (most recent call last):
  File "/usr/local/bin/undatum", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/undatum/__main__.py", line 10, in main
    from .core import cli
  File "/usr/local/lib/python3.9/site-packages/undatum/core.py", line 9, in <module>
    from .cmds.analyzer import Analyzer
  File "/usr/local/lib/python3.9/site-packages/undatum/cmds/analyzer.py", line 13, in <module>
    import xmltodict
ModuleNotFoundError: No module named 'xmltodict'

After installing the dependency manually (pip install xmltodict), it worked.

JSONL to CSV conversion

Need to focus on:

  • identify whether the JSONL objects are flat
  • use the "--fields" option
  • use a "skip", "flatten" or "tables" strategy
  • if "skip", just ignore all complex fields
  • if "flatten", then merge complex fields into flat columns using a configurable separator
  • if "tables", then generate several CSV files from the objects; need to think about how to do that right

Key questions:

  • Do we need a simple CSV or a data package with a schema, like a Frictionless Data package?
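
The "skip" and "flatten" strategies listed above could look roughly like the following sketch. It uses only the standard library; all function names, the default "." separator and the choice to store lists as JSON text are assumptions for illustration, and the "tables" strategy is left out.

```python
import csv
import io
import json


def is_flat(record):
    """True if no value is a nested object or an array."""
    return not any(isinstance(v, (dict, list)) for v in record.values())


def skip_complex(record):
    """'skip' strategy: drop every nested object or array."""
    return {k: v for k, v in record.items() if not isinstance(v, (dict, list))}


def flatten(record, sep=".", prefix=""):
    """'flatten' strategy: join nested keys with `sep`; lists become JSON text."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, name))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def jsonl_to_csv(lines, out, strategy="skip", fields=None, sep="."):
    """Write JSON lines to `out` as CSV. Reads all rows first to find columns."""
    if strategy == "skip":
        convert = skip_complex
    elif strategy == "flatten":
        def convert(record):
            return flatten(record, sep)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    rows = [convert(json.loads(line)) for line in lines if line.strip()]
    if fields is None:
        fields = []
        for row in rows:
            for key in row:
                if key not in fields:
                    fields.append(key)  # keep first-seen column order
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data.
lines = [
    '{"id": 1, "user": {"name": "a", "tags": ["x"]}}',
    '{"id": 2, "user": {"name": "b"}}',
]
skipped, flattened = io.StringIO(), io.StringIO()
jsonl_to_csv(lines, skipped, strategy="skip")
jsonl_to_csv(lines, flattened, strategy="flatten")
print(skipped.getvalue())
print(flattened.getvalue())
```

Collecting the column names requires a full pass over the data, so a streaming version would need either a fixed field list (as with "--fields") or two passes over the input.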

Implement universal data files conversion

Rewrite the code and make the data conversion processes universal. Right now this is partially implemented with the IterableData and DataWriter classes, but they aren't used in the data conversion functions.

Some ideas on how to implement it:

  • analyze existing data formats to understand cross-convertibility
  • write short research notes and a convertibility comparison table
  • create generic classes for a data source and a data writer
  • rewrite the current conversion code to use a more universal approach for all supported data formats
  • write documentation on how to add more data sources/destinations for other data formats
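
The "generic classes for a data source and a data writer" idea could be sketched as below. This is an illustration only: the class names (RecordSource, RecordWriter and their subclasses) are hypothetical and are not the existing IterableData and DataWriter classes, whose interfaces are not shown here.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class RecordSource(ABC):
    """Hypothetical base class: anything that yields records as dicts."""

    @abstractmethod
    def __iter__(self):
        ...


class RecordWriter(ABC):
    """Hypothetical base class: anything that accepts records as dicts."""

    @abstractmethod
    def write(self, record):
        ...

    def close(self):
        pass  # writers without buffered state need no cleanup


class CSVSource(RecordSource):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __iter__(self):
        yield from csv.DictReader(self.fileobj)


class JSONLSource(RecordSource):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __iter__(self):
        for line in self.fileobj:
            if line.strip():
                yield json.loads(line)


class JSONLWriter(RecordWriter):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def write(self, record):
        self.fileobj.write(json.dumps(record) + "\n")


class CSVWriter(RecordWriter):
    def __init__(self, fileobj, fields):
        self.writer = csv.DictWriter(fileobj, fieldnames=fields,
                                     extrasaction="ignore", lineterminator="\n")
        self.writer.writeheader()

    def write(self, record):
        self.writer.writerow(record)


def convert(source, writer):
    """Copy every record from any source to any writer; return the count."""
    count = 0
    for record in source:
        writer.write(record)
        count += 1
    writer.close()
    return count


# CSV to JSON lines and back, using made-up data.
jsonl_out = io.StringIO()
n = convert(CSVSource(io.StringIO("id,name\n1,a\n2,b\n")), JSONLWriter(jsonl_out))
csv_out = io.StringIO()
convert(JSONLSource(io.StringIO(jsonl_out.getvalue())), CSVWriter(csv_out, ["id", "name"]))
print(n, jsonl_out.getvalue(), csv_out.getvalue(), sep="\n")
```

With this shape, supporting a new format means adding one source class and one writer class, while convert stays unchanged for every pair of formats.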

Do data serialization formats review

Review the following formats:

  • BSON
  • JSON
  • JSON lines
  • XDR
  • ASN
  • ASN.1 DER
  • Cap'n Proto
  • Veriform
  • NoProto
  • CBOR
  • csexp
  • MessagePack
  • Protobuf
  • Avro
  • ORC
  • Parquet
  • FlatBuffers
  • Bincode
  • CSV
  • XML
  • UBJSON
  • Thrift
  • Sereal
  • Bencode
  • Gobs
  • ROOT (https://root.cern)
  • HDF5
  • VelocyPack

Parquet compression

It would be nice to have options for compression. It looks like there is no compression by default?

parq RS_2008-04.parquet 

 # Metadata 
 <pyarrow._parquet.FileMetaData object at 0x7f5d6f635490>
  created_by: parquet-cpp-arrow version 7.0.0
  num_columns: 124
  num_rows: 167472
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 53334
