GithubHelp home page GithubHelp logo

csvz's Introduction

csvz

csvz is the hot new open database standard that is taking the entire technological world by storm.

A csvz file is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.


Are you using csvz ? Why not? csvz is the brave technology that unites the worlds of data science, sql and no-sql. Is it no-sql's answer to the rdbms? Or is it the rdbms answer to no-sql? You decide.


Contents


The csvz specification

The csvz specification is broken into meaningful fragments.

Files can call themselves csvz-compliant if they only comply with the first fragment of the specification, csvz-0.

They can also indicate other fragments of the specification that they have implemented, such as csvz-meta-tables, csv-meta-relations etc.


csvz-0 A csvz file is literally just a bunch of csv files, in a zip file with a file name that ends with ".csvz"

A csvz file is compliant with csvz-0 if it is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.

(Note that each fragment has a fragment identifier written at the beginning of the fragment. For example this is csvz-0 and the next fragment is csvz-meta-tables. Fragments are optional, but it is good to know which fragments you do or do not comply with.)

The csv files themselves should be parseable with most csv reading software.

(Anywhere that this spec refers to "a csv file" it means a file that complies with RFC 4180 or a compatible dialect as described by the CSV on the Web Working Group, unless a stricter definition is explicitly given.)

(Anywhere that the csvz specification refers to "this spec" it means the csvz specification.)


csvz-meta-tables A csvz file can contain a file called tables.csv describing the contents of the file

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file tables.csv, if present, is inside this directory.

(Assume that the csvz reserves the right to create other .csv files under the _meta folder, and to create more folders under it. Details appear in subsequent spec fragments.)

The file tables.csv contains metadata about all of the csv files included in the csvz file.

(The file tables.csv is a csv file.)

(Anywhere that this spec refers to a file with a name that ends with ".csv" it means the file is a "csv file", as described in csvz-0.)

The file tables.csv meets the following description:

  • There is a header row naming the columns in this file
  • Each data row describes a different csv file within this csvz file
  • The columns must include a column called "filename"
  • There may be more columns.
  • Here are some suggestions:
    • bytes - the size of the file in bytes
    • rows - the number of rows in the file
    • columns - the number of columns in the file
    • description - a description of the file
    • published - the date the data in the file was first published
    • source - information about the source of the data in the file
    • has-column-names - a true/false value indicating if the file has a header row containing column names
    • skip-rows - How many rows need to be skipped, before the data begins? (Rarely need to specify this, but when you need it, you need it!)
    • (todo: where information in table.csv conflicts with information in csv.csv, then tables.csv has precedence over csv.csv, for the file it describes. For example csv.csv may indicate that all files have header rows, but a specific file may not, and this would be indicated in tables.csv)
  • The file tables.csv may also describe itself. See Russell. Note that bytes (for example) might cause a paradox.

(The word "must" is used for parts of the specification that are required for a file or tool to claim compliance with the standards described in this spec. The word "may" is used for parts which are not required; Optional sections may be covered in more detail, as required elements in a subsequent fragment of this spec.)

(Whenever suggestions are provided, they are not required for conformance with the current spec fragment. These suggestion may be described more fully in later spec fragments, in which they may be required.)

(Expectations around the encoding of true/false values, and other fundamental data-types, are not currently defined.)


csvz-meta-columns A csvz file can contain a file called columns.csv

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file columns.csv, if present, is inside this directory.

The file columns.csv contains metadata about all of the columns in all of the csv files included in the csvz file.

The file columns.csv meets the following description:

  • There is a header row naming the columns in this file
  • Each data row describes a different column in a different file
  • The columns must include a column called "filename" and a column called "column".
  • It is expected that the columns "filename" and "column" are unique.
    • If the columns "filename" and "column" are not unique, then any meta data about that file may not be correctly interpreted. This may cause difficulties
  • There should be more columns than just the "filename" and "column" column. Some suggestions:
    • data-type - the type of the column. (Data-types are not described in this spec fragment, and will be covered in later spec fragments.)
    • nullable - a true/false value indicating if the column can be null
    • max-length - a nullable column, that describes the maximum length of the column, in cases where the data-type supports a maximum length
    • unique - a true/false value indicating if the values in the column should be unique
    • primary-key - a true/false value indicating if the column can serve as (part or whole of) the primary key of the table.
    • description - a description of the column
    • units - a nullable name description of the unit of measure
    • ordinal - the order in which the columns have been written to the file. In cases where there is no header row, or where columns are re-ordered, this can be helpful.
    • published - the date the data in the file was first published
    • source - information about the source of the data in the file

(The word "should" is used for parts of the specification that are not required, but which will lead to difficulty for users of the data or the tools if they are not complied with.)


csvz-meta-relations A csvz file can contain a file called relations.csv

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file relations.csv, if present, is inside this directory.

The file relations.csv contains metadata about all of the relationships between any of the columns in any of the files in the csvz file.

The file relations.csv meets the following description:

  • There is a header row naming the columns in this file
  • Each data row describes a different foreign key relationship within the csvz file.
  • The columns must include these columns:
    • called "table", "key-column", "foreign-table","foreign-key-column"
  • There may also be a column called "key-name". In the case of a composite keys, there would be multiple rows with the same "key-name".
  • todo: There may be more columns needed to describe the relationships.

csvz-meta-csv A csvz file can contain a file called csv.csv

(todo: this section is still very much a draft)

Metadata about the rules of the csvz file are contained in a directory called "_meta". The file csv.csv, if present, is inside this directory.

The file csv.csv contains metadata about how the csv files in this csvz file are formatted, from a general csv standards point of view.

(Later spec fragments will give exact definitions for the expected columns and supported columns, their possible values and the meanings of those values.)

But to comply with csvz-meta-csv the file csv.csv must:

  • Be formatted as strict-4180
  • Have a header row naming the columns in this file
  • Data rows, each of which describes a different and very specific but fundamental aspect of the csv format used by all other csv files in this csvz file.
  • Suggested aspects that can be described (and which should be described in subsequent fragments)
    • `encoding - what file encoding is used for the csv files (utf-x, BOM, etc.)
    • field separator - examples comma, tab, semicolon, space, various emoji
    • row separator - examples CRLF, LF, CR, semicolon, exclamation point, backtick
    • qualifier - what qualifiers (if any) are used for embedding delimiters. perhaps qualifiers are not used. Can single/double/mixed/other be used?
    • escaping - Are qualifiers doubled or escaped? (If escaped, escaped with what?)
    • null-values - how are null-values represented? e.g. the literal string null with no quotes? or NULL, or nil? Or are empty strings, unquoted, to be treated as NULLs?
    • has-column-names - a true/false value indicating if every csv file (other than this one) file has a header row containing column names. (Can be over-ridden by a has-column-names value in the tables.csv file, if present.)
    • user defined data-types can be handled elsewhere, but a limited number of common fundamental data-types could be most expediently described in csv.csv, such as:
      • date formats - e.g. ISO 8601
      • boolean formats
      • binary data (hint: base 64 coded)
      • integer ranges.
      • floats
  • Later spec fragments may further describe "sensible defaults" for these things
  • Later spec fragments may describe "sensible heuristics" for detecting delimiters/qualifiers/quoting and escaping rules, etc.

(todo: See also csvw dialect descriptions)


csvz-meta-per-file The ability to include individual meta-files per csv file

This fragment extends all other csvz-meta-* fragments.

Consider an example where a single csv file, people.csv inside the csvz follows different standards to the other files.

It's csv conventions could be described in a file: _meta/csv/people.csv and those would be taken to override the conventions in _meta/csv.csv

Similarly, a file can have its own _meta/tables/{filename}.csv file, _meta/columns/{filename}.csv and _meta/relations/{filename}.csv.

This methods can be assumed to extend for all other _meta/*.csv files.

A per-file meta file is assumed to have higher precedence than the files directly contained in _meta/*.csv.

For example: if _meta/columns.csv decribed the columns of states.csv in one way, but _meta/columns/states.csv described those columns in another way, all details for states.csv in _meta/columns.csv should be ignored, and those in _meta/columns/states.csv used instead. (i.e. they are not combined).

(Note - combining might be more interesting, useful. Would let you build up/inherit attributes. But would also need a way to "erase" a rule, and I can't think of a way to do that so let's stick with "no combining")

(Suggestion for authors of Tooling that reads these files: they may want to provide optional debug information that describes where meta data was sourced from, highlighting situations where precedence rules needed to be applied.)

You can also mix and match _meta/*.csv with per-file meta information, without loss of meaning.

For example the table states.csv may be described in _meta/tables.csv while it's columns may be described in _meta/columns/states.csv


Unwritten meta fragments

More meta-* spec fragments may be needed to describe other meta files.

For example:

  • indexes - what indexes can/should be built on the tables (if the data)
  • data-types - what types are used, how are they encoded (e.g. dates: how? binary data base-64 encoded? etc), what ranges exist for numbers etc.
  • user-defined-types - how can types be extended?
  • schemas - consider situations where directories are used to describe separate schemas\databases (or other namespacing concepts)
  • directories - instead of defining schemas (or other namespaces) perhaps the concept of directories could be directly described, a kind of set of routing rules/conventions. in the directories.csv you might in effect say, the directories in the root directory (other than _meta) are to be treated as "server" names. the next level are to be treated as "share" names... or perhaps you will say, "under "/databases" the directory names in there are treated as "database" names, and the names under that are "schema" names.
  • naming - perhaps you will define naming conventions, e.g. ways to pull data from names, or use names to know which files can be combined into one logical unit later.
  • deltas - csv files may hold operations on data, instead of data itself, i.e. details of insert,update,delete,(upsert) operations
  • constraints - what other constraints exist on the data
  • partitions - consider situations where a single table is split across multiple files, and or a csvz itself is split amongst multiple files, including
  • formulas - are there calculated columns? what form do the calculations take?

A list of csvz-compliant Tools and Libraries

The following tools and libraries are able to read, write or process .csvz files.

Tool Actions Compliance Description
Sylvan.Data.CsvZip Create / Read csvz-0 csvz-meta-tables csvz-meta-columns Library for programatically creating and reading .csvz files
Sylvan.Tools.CsvZip Create csvz-0 csvz-meta-tables csvz-meta-columns .NET global tool for creating .csvz files from the commandline
Packs a set of csv files into a new csvz file, and generates a tables.csv and columns.csv
Converts a .csvz file into a .xlsx file, that can be opened by Excel.
Converts a .csvz file into a .xlsx file, that can be opened by Excel.
Converts a .xlsx file into a .csvz file (note that not all of Excel's features are respected.)
Exports a sqlite database into a new .csvz file
Creates a new sqlite database from a .csvz file
Exports a mysql database into a new .csvz file
Creates a new PostgreSQL database from a .csvz file
Exports a PostgreSQL database into a new .csvz file
Save a JSON file as a series of csv files and _meta files (ready for zipping)
Load some or all of an unzipped csvz as a single json object (limited filtering ability)
Validates which spec fragments a csvz file complies with
(More tools...)

If you know of a csvz compliant tool, or you have created one (hint hint), a pull request is welcome.

Suggestion: You can use existing csvz or csv libraries to build a new type of connection (e.g. A tool to create/read csvz files from an Oracle database, using existing libraries, would take some Oracle knowledge, and not much else.)


Contribute

To experience the fun of contributing, see Contributing

Contributors definitely includes people who raise issues. Raising issues is the quickest way to contribute. Also look for issues marked good first issue or help wanted

A community forum for discussion/ideas for implementors and tool builders is much needed, following issue #14 to find where the community will be built.

License

CC0

To the extent possible under law, Leon Bambrick has waived all copyright and related or neighboring rights to this work.


Some ideas are too smart to live; other ideas are too dumb to die.

csvz's People

Contributors

secretgeek avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

doekman 0xflotus

csvz's Issues

`meta-per-file` -- allow individual meta files for each file?

Have you considered using a columns meta file per-table instead of putting all columns into a single csv?

So instead of:
_meta/tables.csv
_meta/columns.csv
states.csv
citites.csv

It would be something like:
_meta/tables.csv
_meta/states_columns.csv
_meta/cities_columns.csv
states.csv
cities.csv

The advantage is that it would be easier to get the schema for a single table.

csvz-meta-relations composite keys

Sometimes (not often) keys are defined as a combination of two columns. In the columns.csv data it would be possible to identify two columns as being "primary-key". Should this mean that the combination of those two columns constitutes a primary key?

Likewise, how would a foreign key relationship in the relations.csv data targeting that primary key be represented?

MIME type

As adoption inevitably grows, a MIME type should be appropriately registered.

Haven't defined zip file

The specification doesn't specify what a Zip file is.

At the least, there should be a link to either the 2015 ISO standard or alternatively Pkware's technical note (including a version number).

Is there a GZip variant?
Similarly, are there 7Zup, bzip2, and rzip variants?
Is Zip64 supported, to allow more than 4GB files?

How can a .tar.z file full of csvs fit into the csvz specs?

Related to #4 ...

Example:

If someone:

  1. unzipped a compliant .csvz file, to a folder “MyData”
  2. Ran tar -z “MyData” (Todo: correct syntax here to specify output name Eg MyData.csv.t.z ??)

...then what standards would this now comply with?

Suggestion: there could be an optional fragment

csvz-0-tz

...which also invites/allows other mutually exclusive 0 sub standards....

Csvz-0-t

... for a compliant tar that is not gzip’d

Csvz-O-7z

...for a compliant 7z file? (Details needed for such)

Could there be a csvz *folder* without zip?

When you unzip a csvz file, you end up with a folder with csv-files.

Can this be considered a csvz-container?
I think it could be useful. For example, when putting datafiles into git.

To differentiate a csvz-container from a folder containing some csv files, I propose such a folder to have the .csvd extension. So if you extract the my_data.csvz file, you get the my_data.csvd folder.

The folder could have the same extension (thus not introducing a new extension). The disadvantage is you can't extract an .csvz file into the same folder without deleting/moving the original file.

Tools are not expected to open these .csvd-folders directly. It's only to denote that when zipped, it's automatically a .csvz file.

csvz-meta-columns column types

The spec for this file feels rather useless unless some minimal set of standard types is defined. A well-defined schema would allow an database import tool to construct the appropriate table in the database. Without a standard set fallback to "string" would be needed when an unknown type was encountered.

I would propose as a minimum:

  • boolean (true/false, 0/1)
  • int (byte/short/long?)
  • float (float/double)
  • date (datetime)
  • string (might be worth specifying ascii vs unicode)
  • binary (Base64)

Possibly also include:

  • guid
  • time (timespan/duration)

Make the spec more meta

I think the specification is now too broad.

Imagine the spec only says there can be meta csv, tables, columns and relations tables (specifying the name). Tool makes then can come up with profiles that describe what data in what meta-tables are put, and what the semantics are. These profiles then can be registered and published in this repository. There could be discussion of course.

You could have an ANSI-96-SQL-IMPORT-EXPORT profile (hope they come up with a better name). But you could also have a MY_OPEN_SOURCE_FORM_APPLICATION profile, that describes how data on forms are validated.

The advantage of tool makers (implementors): you only create was is being used. I do think there should be a csvz implementers forum.

So I propose a more minimalistic base specification, with extensions as profiles. The _meta/csv.csv profile can be built in and be called "localized" profile (if you want to call it that).

What do you think?

Is .csv.z an acceptable variant?

If I take a single csv file, and then zip it to end up with a .csv.z, is it conformant with the basic standard, or must the file end with a .csvz extension?

Update toc

(What is a way to do that automatically in vs code?)

Add `csvz-meta-meta`

A file can have

_meta/meta.csv

describing which of the standards you claim conformance with. e.g.

fragment conformance notes
csvz-0 strict this csvz file claims strict adherence with csv-0
csvz-meta-tables strict yes we have a _meta folder with a tables.csv file in it

...If they only claimed those two rules were followed then it would be up to the consumer to read the files and determine for themselves how to make sense of them.)

(suggestion: tools could generate this, or at least a draft f this, and tools can use this for configuring their own expectactions.

csv-meta-columns column ordinal

With columns headers being optional in the 4180 spec, it might also be useful to specify/require the column ordinal in the meta-columns file. This would allow attaching headers to a csv via the column metadata.

For that matter, in the tables metadata would be useful to include a "HasHeaders" boolean column.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.