datacoon / undatum
undatum: a command-line tool for data processing. Brings CSV simplicity to JSON lines and BSON
License: MIT License
Add a command for XML structure analysis. It should detect the XML structure and find the data container and its child tags, to make XML file conversion easier.
Right now, to convert an XML file you need to provide tagname: the name of the XML tag assumed to contain the data we are interested in. But only the user knows the name of this tag. A command that analyzes the XML structure and prints it should help the user make that choice.
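A minimal sketch of what such an analysis could look like, using only the Python standard library. This is not undatum's implementation: the function names analyze_xml and guess_data_tag and the sample document are hypothetical, and the heuristic (pick the most frequently repeated tag that has child tags) is an assumption that will not fit every file. Namespaced documents would report tags in ElementTree's "{uri}tag" form.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document used only for illustration.
SAMPLE = b"""<catalog>
  <meta><source>demo</source></meta>
  <items>
    <item><id>1</id><name>a</name></item>
    <item><id>2</id><name>b</name></item>
    <item><id>3</id><name>c</name></item>
  </items>
</catalog>"""


def analyze_xml(source):
    """Stream through the XML and collect, for every tag path,
    how often it occurs and which child tags it contains."""
    path_counts = Counter()
    children = {}
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if stack:
                # Register this tag as a child of the enclosing path.
                children[tuple(stack)].add(elem.tag)
            stack.append(elem.tag)
            path = tuple(stack)
            path_counts[path] += 1
            children.setdefault(path, set())
        else:
            stack.pop()
            elem.clear()  # drop parsed content to keep memory use low
    return path_counts, children


def guess_data_tag(path_counts, children):
    """Assumed heuristic: the most frequent path that has child tags,
    preferring shallower paths on ties."""
    candidates = [p for p in path_counts if children[p]]
    if not candidates:
        return None
    best = max(candidates, key=lambda p: (path_counts[p], -len(p)))
    return best[-1]


if __name__ == "__main__":
    counts, kids = analyze_xml(io.BytesIO(SAMPLE))
    for path, count in counts.items():
        print("/".join(path), count, sorted(kids[path]))
    print("suggested tagname:", guess_data_tag(counts, kids))
```

Printing every path with its count lets the user confirm or override the suggestion before running a conversion.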
More commands needed:
On macOS, brew install undatum does not work:
$ brew install undatum
Error: No available formula with the name "undatum"
==> Searching for a previously deleted formula (in the last month)...
Error: No previously deleted formula found.
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
==> Searching taps on GitHub...
Error: No formulae found in taps.
For example, using the same code as in https://www.kdnuggets.com/2022/07/parallel-processing-large-file-python.html
Support Excel files (XLS and XLSX) in the analyze command and report on their metadata
Add support for commands uniq, frequency, select and other text processing commands with Excel files
There seems to be a missing dependency when installing with pip install --upgrade undatum
$ undatum --help
Traceback (most recent call last):
  File "/usr/local/bin/undatum", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/undatum/__main__.py", line 10, in main
    from .core import cli
  File "/usr/local/lib/python3.9/site-packages/undatum/core.py", line 9, in <module>
    from .cmds.analyzer import Analyzer
  File "/usr/local/lib/python3.9/site-packages/undatum/cmds/analyzer.py", line 13, in <module>
    import xmltodict
ModuleNotFoundError: No module named 'xmltodict'
After installing the dependency manually (pip install xmltodict), it worked.
Need to focus on:
Key questions:
Add support for ORC files https://cwiki.apache.org/confluence/display/hive/languagemanual+orc
Transfer the code from https://github.com/datacoon/datadifflib and implement diff and apply commands to generate and apply patches to JSONL, BSON and CSV files
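To make the diff/apply idea concrete, here is a small sketch for JSON lines data. It does not reproduce datadifflib's code or patch format: the function names, the add/update/delete patch entries, and the assumption that every record carries a unique "id" field are all hypothetical choices made for illustration.

```python
import json


def load_jsonl(text, key="id"):
    """Parse JSON lines text into a dict keyed by the (assumed unique) key field."""
    records = {}
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            records[record[key]] = record
    return records


def diff_records(old, new):
    """Build a patch: a list of add/update/delete entries turning old into new."""
    patch = []
    for key, record in old.items():
        if key not in new:
            patch.append({"op": "delete", "key": key})
        elif new[key] != record:
            patch.append({"op": "update", "key": key, "record": new[key]})
    for key, record in new.items():
        if key not in old:
            patch.append({"op": "add", "key": key, "record": record})
    return patch


def apply_patch(records, patch):
    """Return a new dict with the patch applied; the input is left untouched."""
    result = dict(records)
    for entry in patch:
        if entry["op"] == "delete":
            result.pop(entry["key"], None)
        else:  # "add" and "update" both store the full record
            result[entry["key"]] = entry["record"]
    return result


if __name__ == "__main__":
    old = load_jsonl('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
    new = load_jsonl('{"id": 2, "name": "B"}\n{"id": 3, "name": "c"}\n')
    patch = diff_records(old, new)
    for entry in patch:
        print(json.dumps(entry))
    print(apply_patch(old, patch) == new)
```

Because each patch entry is itself a JSON object, a patch could be stored as a JSON lines file; CSV and BSON inputs would need only a different loader feeding the same dict structure.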
Rewrite the code to make the data conversion process universal. Right now this is partially implemented with the IterableData and DataWriter classes, but they aren't used in the data conversion functions.
Ideas on how to implement it:
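One possible shape for this, offered as an illustration and not taken from the issue or from undatum's code: every format gets a reader that yields dict records and a writer that consumes them, so a single convert function covers any pair of formats. The names read_csv, read_jsonl, write_csv, write_jsonl, READERS, WRITERS and convert are hypothetical, and the sketch does not use the project's IterableData or DataWriter classes.

```python
import csv
import io
import json


def read_csv(stream):
    """Yield each CSV row as a dict (all values are strings)."""
    yield from csv.DictReader(stream)


def read_jsonl(stream):
    """Yield one dict per non-empty JSON line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def write_jsonl(records, stream):
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def write_csv(records, stream):
    """Write records as CSV; the header comes from the first record's keys."""
    records = iter(records)
    first = next(records, None)
    if first is None:
        return
    writer = csv.DictWriter(stream, fieldnames=list(first),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerow(first)
    writer.writerows(records)


# Registries: supporting a new format means adding one reader and one writer.
READERS = {"csv": read_csv, "jsonl": read_jsonl}
WRITERS = {"csv": write_csv, "jsonl": write_jsonl}


def convert(src, src_format, dst, dst_format):
    """Stream records from src to dst without loading the whole file."""
    WRITERS[dst_format](READERS[src_format](src), dst)


if __name__ == "__main__":
    out = io.StringIO()
    convert(io.StringIO("id,name\n1,a\n2,b\n"), "csv", out, "jsonl")
    print(out.getvalue(), end="")
```

With N formats this needs N readers and N writers instead of a separate function for each of the N×N conversion pairs.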
Add support for Avro files https://avro.apache.org/docs/1.2.0/spec.html
More info https://github.com/iqlusioninc/veriform
Following list:
Consider adding table detection and extraction for non-tabular files such as the .docx file format, for example using the docx2csv command https://github.com/ivbeg/docx2csv
Review other possible data sources and data formats
Add support for YAML files for stats, conversion and other operations
Add support for the Cap'n Proto protocol https://capnproto.org
It would be nice to have options for compression. Looks like there is no compression by default?
parq RS_2008-04.parquet
# Metadata
<pyarrow._parquet.FileMetaData object at 0x7f5d6f635490>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 124
num_rows: 167472
num_row_groups: 1
format_version: 1.0
serialized_size: 53334
Dump MongoDB, ArangoDB, SQL DBMS and other databases directly to BSON, JSON lines, CSV and other formats.
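As a rough illustration of the SQL case only, the sketch below streams one table from a SQLite database to JSON lines using the Python standard library. SQLite stands in for "SQL DBMS" here as an assumption; the function name dump_table_jsonl and the demo table are hypothetical, and MongoDB or ArangoDB would need their own client libraries.

```python
import io
import json
import sqlite3


def dump_table_jsonl(conn, table, stream):
    """Write every row of `table` to `stream` as one JSON object per line.

    Returns the number of rows written. The table name is quoted as an
    identifier, but it should still come from a trusted source.
    """
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [description[0] for description in cursor.description]
    count = 0
    for row in cursor:  # rows are fetched lazily, not all at once
        stream.write(json.dumps(dict(zip(columns, row))) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical in-memory database used only for the demo.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "a"), (2, "b")])
    out = io.StringIO()
    written = dump_table_jsonl(conn, "people", out)
    print(written, "rows")
    print(out.getvalue(), end="")
```

Column types that JSON cannot represent directly, such as BLOB values, would need an explicit encoding step before json.dumps.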
NoProto is a new data serialization definition https://github.com/only-cliches/NoProto
Add more file conversion types, like: