caltechlibrary / datatools

A set of tools for working with JSON, CSV and Excel workbooks

Home Page: https://caltechlibrary.github.io/datatools

License: Other

Makefile 1.57% Go 47.50% CSS 3.24% Shell 7.80% HTML 32.34% Batchfile 0.50% Lua 0.03% JavaScript 7.02%
json excel-workbook csv data-munging shell-scripting structured-data xlsx

datatools's Introduction

datatools

datatools is a rich collection of command line programs targeting data conversion, cleanup and analysis directly from your favorite POSIX shell. It has proven useful for data collaborations where individual members of a project may prefer different toolsets in their analysis (e.g. Julia, R, Python) but want to work from a common baseline. It has also been used intensively for internal reporting from various Caltech Library metadata sources.

The tools fall into three broad categories

  • data transformation and conversion
  • shell scripting helpers
  • "string", a tool providing the common string operations missing from shell

See the user manual for a complete list of the command line programs. The data transformation tools include support for formats such as Excel XML, CSV, tab delimited files, JSON, YAML and TOML.

Compiled versions of the datatools collection are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/datatools/releases.

Use the "-help" option for a full list of options for each utility (e.g. csv2json -help).

Data transformation

The transformation tooling includes data conversion tools that work with CSV, tab delimited, JSON, TOML, YAML and Excel XML.

There is also tooling to change data shapes using JSON as the intermediate data format.

For the shell

Various utilities for simplifying work on the command line.

  • findfile - find files based on prefix, suffix or contained string
  • finddir - find directories based on prefix, suffix or contained string
  • mergepath - prefix, append, clip path variables
  • range - emit a range of integers (useful for numbered loops in Bash)
  • reldate - display a relative date in YYYY-MM-DD format
  • reltime - display a relative time in 24 hour notation, HH:MM:SS format
  • timefmt - format a time value based on Golang's time format language
  • urlparse - split a URL into parts

For strings

datatools provides the string command for working with text strings (limited to memory available). This is commonly needed when cleaning up data for analysis. The string command was created for when the old Unix standbys (grep, awk, sed, tr) are unwieldy or inconvenient. string provides operations that are common in most languages, like trimming, splitting, and transforming letter case. The string command also makes it easy to join JSON string arrays into a single string using a delimiter, or to split a string into a JSON array based on a delimiter. The form of the command is string [OPTIONS] [ACTION] [ACTION_PARAMETERS...]

    string toupper "one two three"

Would yield "ONE TWO THREE".

Some of the features included

  • change case (upper, lower, title, English title)
  • length, position and count of substrings
  • has prefix, suffix or contains
  • trim prefix, suffix and cutsets
  • split and join to/from JSON string arrays

See string for full details
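
To make the split and join operations concrete, here is a minimal Go sketch of what "split a string into a JSON array" and "join a JSON string array" mean. It illustrates the idea only and is not the string command's actual implementation; the function names splitToJSON and joinFromJSON are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitToJSON (hypothetical name) splits s on delim and returns the
// parts encoded as a JSON array of strings.
func splitToJSON(s, delim string) (string, error) {
	parts := strings.Split(s, delim)
	out, err := json.Marshal(parts)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// joinFromJSON (hypothetical name) decodes a JSON array of strings
// and joins its elements with delim.
func joinFromJSON(src, delim string) (string, error) {
	var parts []string
	if err := json.Unmarshal([]byte(src), &parts); err != nil {
		return "", err
	}
	return strings.Join(parts, delim), nil
}

func main() {
	arr, err := splitToJSON("one,two,three", ",")
	if err != nil {
		panic(err)
	}
	fmt.Println(arr) // ["one","two","three"]

	s, err := joinFromJSON(arr, " ")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // one two three
}
```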

Installation

See INSTALL.md for details on installing pre-compiled versions of the programs.

datatools's People

Contributors

rsdoiel, tmorrell

datatools's Issues

codemeta.json and CITATION.cff

Since we're driving our releases increasingly from the codemeta.json file it would be really nice to have two tools. One that would generate a CITATION.cff from a codemeta.json file. Another that would scan a codemeta.json file and prompt for any changes (e.g. like version number, links to releases, etc.).

json encoding, decoding is doing odd things

Over the development of Go the json package has evolved. It used to be sensible to use Marshal()/MarshalIndent() or Unmarshal() most of the time when writing JSON processing code. This is no longer the case. When decoding JSON you want to create a custom decoder so you can handle JSON Numbers sensibly. Likewise you want to use custom encoders so you can avoid characters magically turning into Unicode code point notation because someone thought HTML entities were a problem (note, they are not, they can easily be used as is based on the JSON spec). I need to review all the places I use the json package to see where these two things are causing problems for our datatools cli.

A custom Unmarshal() should probably look something like

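// jsonUnmarshal decodes src into obj, which must be a non-nil pointer.
// It needs the "bytes", "encoding/json" and "io" packages.
func jsonUnmarshal(src []byte, obj interface{}) error {
        dec := json.NewDecoder(bytes.NewReader(src))
        // Keep numbers as json.Number instead of converting them to float64.
        dec.UseNumber()
        // Pass obj directly so a non-pointer argument is reported as an
        // error instead of being decoded into a local copy and lost.
        err := dec.Decode(obj)
        if err != nil && err != io.EOF {
                return err
        }
        return nil
}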

A MarshalIndent should probably look like

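// MarshalIndent encodes obj as indented JSON without escaping HTML characters.
// It needs the "bytes" and "encoding/json" packages.
func MarshalIndent(obj interface{}, prefix string, indent string) ([]byte, error) {
        var buf bytes.Buffer
        enc := json.NewEncoder(&buf)
        // Leave <, > and & as is instead of \u003c, \u003e and \u0026.
        enc.SetEscapeHTML(false)
        enc.SetIndent(prefix, indent)
        if err := enc.Encode(obj); err != nil {
                return nil, err
        }
        // Note: Encode appends a trailing newline that json.MarshalIndent does not.
        return buf.Bytes(), nil
}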
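
The escaping behavior described above can be seen with a short comparison. This is a sketch only; encodeNoEscape is a hypothetical helper and the sample value is made up for the demonstration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeNoEscape (hypothetical helper) encodes obj as compact JSON
// with HTML escaping turned off.
func encodeNoEscape(obj interface{}) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(obj); err != nil {
		return "", err
	}
	// Encode appends a newline; trim it so the result matches json.Marshal.
	return string(bytes.TrimSuffix(buf.Bytes(), []byte("\n"))), nil
}

func main() {
	v := map[string]string{"title": "Q&A <draft>"}

	// json.Marshal escapes &, < and > by default.
	escaped, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(escaped)) // {"title":"Q\u0026A \u003cdraft\u003e"}

	plain, err := encodeNoEscape(v)
	if err != nil {
		panic(err)
	}
	fmt.Println(plain) // {"title":"Q&A <draft>"}
}
```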

csvfind -trimspaces issue

-trimspaces doesn't seem to work

Example file:
text.txt

Works as expected:
csvfind -i text.txt -col=2 -trimspaces "library"
csvfind -i text.txt -col=1 -trimspaces "red"

Doesn't work:
csvfind -i text.txt -col=2 -trimspaces "field"
csvfind -i text.txt -col=1 -trimspaces "yellow"

Add option to supply source codemeta.json and CITATION.cff on command line

While the common use case will likely use the codemeta.json and CITATION.cff in the current directory, you might have a tool that updates lots of projects. In that case the full path is needed. Additionally, there could be a use case of wanting to test the output at the shell script level. Being able to specify the full path and filename allows us to address both use cases with minor modification to the CLI wrapper for the package.

csvcols has issues with quotation marks

csvcols -i example.csv -o out.csv -col 1 2

with
example.txt

(Change .txt to .csv)

Expect:
"A","B"
"C","D"

Result:
example.csv, line 2, column 89: extraneous " in field
[]string []
example.csv, line 3, column 0: wrong number of fields in line
[]string [ ]

File:
A,B
,
" ",

Update build processes, doc and version.go generation

I need to drop codemeta2cff and replace with Pandoc template approach. version.go should be generated from my latest pandoc templates. The manpage generation from -help should also include release dates and hash and reflect my current practices. This needs to happen for all the cli.

jsonmunge should use Mustache templates rather than Go templates

Go templates are just not widely understood outside of Go developers. Mustache is common in many languages and there is a good Go Mustache package. I need to swap out the template engine for Mustache, maybe leaving the Go one as an option, though I don't think anyone actually used it besides me.

If I do this I'll need to bump the semver from 1.2 to 1.3.

sql2csv config file not handling delimiter mapping correctly

Looks like I have a bug in the 1.2.0 release. If you set the delimiter to tab it still comes out comma delimited when using the JSON configuration.

JSON example

{
    "dsn_url": "sqlite://file:mydb.sqlite3",
    "delimiter": "\t"
}

Should result in tab delimited output when a query is run via sql2csv. I wonder if the slash escaping isn't happening correctly when I convert the string, which represents a single character, to the rune for the CSV writer in the Comma attribute.
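
One way the string-to-rune conversion could be handled is sketched below. This is not the sql2csv code: the config struct and the delimiterRune helper are hypothetical, and the sketch assumes the delimiter arrives either as a real tab character (after JSON decoding) or as the two literal characters backslash and t.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"os"
	"unicode/utf8"
)

// config is a hypothetical struct mirroring the JSON example above.
type config struct {
	DSN       string `json:"dsn_url"`
	Delimiter string `json:"delimiter"`
}

// delimiterRune (hypothetical helper) converts a delimiter string into
// the single rune that csv.Writer expects in its Comma field.
func delimiterRune(s string) (rune, error) {
	// Accept a literal backslash followed by t, which is what arrives
	// if the escape sequence was never interpreted upstream.
	if s == `\t` {
		return '\t', nil
	}
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError || size != len(s) {
		return 0, fmt.Errorf("delimiter must be a single character, got %q", s)
	}
	return r, nil
}

func main() {
	src := []byte(`{"dsn_url": "sqlite://file:mydb.sqlite3", "delimiter": "\t"}`)
	var cfg config
	if err := json.Unmarshal(src, &cfg); err != nil {
		panic(err)
	}
	comma, err := delimiterRune(cfg.Delimiter)
	if err != nil {
		panic(err)
	}
	w := csv.NewWriter(os.Stdout)
	w.Comma = comma
	// Writes the two fields separated by a tab character.
	if err := w.Write([]string{"a", "b"}); err != nil {
		panic(err)
	}
	w.Flush()
}
```

Rejecting empty or multi-character delimiters up front avoids silently falling back to the writer's default comma.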

Why marshal the Context in codemeta2cff?

Hi, overall thanks for your nice work. I tried out the codemeta2cff.

In my codemeta files @context is an array (in general it can be nested, not just a simple string), which fails since codemeta.go assumes it to be a string.

I was wondering why you need to parse the @context part at all since it is not represented in the Citation.cff?
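
One possible way to tolerate any shape of @context without interpreting it is to hold it as json.RawMessage. The sketch below is an assumption about how that could look, not the codemeta.go code; the codemeta struct here is a hypothetical, trimmed-down stand-in and the sample document is invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// codemeta is a hypothetical, trimmed-down struct. Context is kept as
// raw JSON so a string, an array or a nested object are all accepted.
type codemeta struct {
	Context json.RawMessage `json:"@context,omitempty"`
	Name    string          `json:"name"`
	Version string          `json:"version"`
}

// parseCodemeta (hypothetical helper) decodes a codemeta document
// without inspecting the @context value.
func parseCodemeta(src []byte) (*codemeta, error) {
	cm := new(codemeta)
	if err := json.Unmarshal(src, cm); err != nil {
		return nil, err
	}
	return cm, nil
}

func main() {
	src := []byte(`{
  "@context": ["https://doi.org/10.5063/schema/codemeta-2.0", {"schema": "http://schema.org/"}],
  "name": "datatools",
  "version": "1.2.4"
}`)
	cm, err := parseCodemeta(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(cm.Name, cm.Version) // datatools 1.2.4
}
```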

Issue on clean compile

It is very likely that Tealeg's xlsx package has changed APIs. When compiling datatools on a new system I get an error

    env GOBIN=/Users/rsdoiel/bin go install cmd/xlsx2json/xlsx2json.go
# command-line-arguments
cmd/xlsx2json/xlsx2json.go:111:28: sheet.Rows undefined (type *xlsx.Sheet has no field or method Rows)

funder field issues with an array

Mike has run across a problem with citation2cff. Most of the codemeta.json in the wild treat funder as a single value dict (map) rather than an array. I'm going to switch it back to a map to bring codemeta2cff in alignment with practice.

Mike's example

  "funder": { 
    "@id": "https://ror.org/05dxps055",
    "@type": "Organization",
    "name": "California Institute of Technology Library"
  },

documentation is out of date, 2023-09-19

Reviewing the website after integrating pagefind, I noticed almost all the docs are stale. All the pages need to be refreshed to the 1.2.4 version (see cmd files -help for content) and should include compile hash and version. The organization of the directory structure has left many data files unlinked. The cleanup in test_cmd.bash has left many unnecessary data files in the how-to directory; those that remain should migrate to a testdata directory or get autogenerated and cleaned up by test_cmd.bash.
