Distributed JSON schema discovery

Home Page: https://dataunitylab.github.io/jsonoid-discovery/

License: MIT License

Scala 99.73% JavaScript 0.15% Shell 0.12%
json scala

jsonoid-discovery's Introduction

JSONoid Discovery

Schema discovery for JSON Schema draft 2020-12 using monoids. The goal of JSONoid is to produce a useful JSON Schema from a collection of JSON documents. For an idea of what JSONoid does, you can view example schemas with their corresponding datasets or see the example below.

Example of a JSON document and a schema produced by JSONoid

Input/Output Format 📋

JSONoid accepts newline-delimited JSON from either standard input or a file, so the input must contain exactly one JSON value per line. If your JSON is not formatted this way, one option is the -c option of jq, which can convert files to the appropriate format. Any invalid JSON is skipped without producing an error, so it is recommended to validate the JSON before passing it to JSONoid if invalid input needs to be handled. The generated schema is printed as JSON Schema. Note that, depending on the configuration, JSONoid adds properties which are not part of the JSON Schema standard. The format is described in the JSON Schema Profile draft and is subject to change.
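
Because invalid lines are skipped silently, it can help to check the input first. The following is a minimal sketch of such a check, not part of JSONoid; the function name split_valid and the sample data are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def split_valid(lines: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Return the lines that parse as JSON and the 1-based numbers of those that do not."""
    valid, bad = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines carry no JSON value, so ignore them
        try:
            json.loads(text)
        except json.JSONDecodeError:
            bad.append(number)  # record the line number for reporting
        else:
            valid.append(text)
    return valid, bad


sample = ['{"a": 1}', "not json", "", "[1, 2, 3]"]
valid, bad = split_valid(sample)
print(valid, bad)
```

Reporting the offending line numbers before running discovery makes it clear which documents would otherwise be dropped.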

Running 🏃

To quickly run jsonoid, you can use the Docker image which is built from the latest commit on the main branch. Note that by default, jsonoid accepts newline-delimited JSON on standard input, so it will hang waiting for input. Add the --help option to see possible configuration options.

docker run -i --rm michaelmior/jsonoid-discovery

To simplify, you may wish to add a shell alias so jsonoid can be run directly as a command.

alias jsonoid='docker run -i --rm michaelmior/jsonoid-discovery'
jsonoid --help

Compiling 👷

To produce a JAR file which is suitable for running either locally or via Spark, run sbt assembly. This requires an installation of sbt. Alternatively, you can use ./sbtx assembly to attempt to automatically download and install the appropriate sbt and Scala versions using sbt-extras. This will produce a JAR file under target/scala-2.13/ which can either be run directly or passed to spark-submit to run via Spark.

Schema monoids ✖️

In JSONoid, the primary way information is collected for a schema is using monoids. A monoid stores a piece of information extracted from a JSON document along with a rule for combining that information across all documents in a collection in a scalable way.
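
To make the idea concrete, here is a minimal sketch of a monoid that tracks minimum and maximum values. The MinMax class is hypothetical and is not JSONoid's implementation; it only illustrates why an associative merge with an identity element lets partitions be summarized independently and combined later.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class MinMax:
    """Hypothetical monoid tracking the smallest and largest value seen."""

    low: Optional[float] = None   # None means "no values observed yet"
    high: Optional[float] = None

    @staticmethod
    def of(value: float) -> "MinMax":
        """Extract the information from a single value."""
        return MinMax(value, value)

    def merge(self, other: "MinMax") -> "MinMax":
        """Combine two summaries; MinMax() acts as the identity element."""
        lows = [v for v in (self.low, other.low) if v is not None]
        highs = [v for v in (self.high, other.high) if v is not None]
        return MinMax(min(lows, default=None), max(highs, default=None))


values = [3, 9, 1, 7]
summaries = [MinMax.of(v) for v in values]

# Merging everything in order...
sequential = reduce(MinMax.merge, summaries, MinMax())
# ...gives the same answer as summarizing two partitions first.
left = reduce(MinMax.merge, summaries[:2], MinMax())
right = reduce(MinMax.merge, summaries[2:], MinMax())
partitioned = left.merge(right)
print(sequential, partitioned)
```

The same property is what allows a collection to be split across workers and the partial results merged in any grouping.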

The set of monoids (also referred to as properties) used for discovery can be controlled using the --prop command line option. The Min set of monoids will produce only simple type information and nothing more. Simple extends this set of monoids to cover a large set of keywords supported by JSON Schema. Finally, All monoids can be enabled to discover the maximum amount of information possible. Note that for large collections of documents, there may be a performance penalty for using all possible monoids in the discovery process.

For each primitive type, the following monoids are defined.

  • BloomFilter - A Bloom filter allows for approximate membership testing. The Bloom filters generated are a Base64 encoded serialized library object.
  • Examples - Corresponding to the examples JSON Schema keyword, a number of example values will be randomly sampled from the observed documents.
  • HyperLogLog - HyperLogLog allows estimates of the number of unique values of a particular key. As with Bloom filters, the generated value is a Base64 encoded library object.

Arrays

  • Histogram, MaxItems, MinItems - Produces a histogram of array size and the maximum and minimum number of elements.
  • Unique - Detects whether the elements of an array are unique, corresponding to the uniqueItems JSON Schema keyword.

Numbers (integer and decimal)

  • Histogram, MaxValue, MinValue - A histogram of all values and the maximum and minimum values.
  • MultipleOf - If all numerical values are a multiple of a particular constant, this will be detected using Euclid's GCD algorithm. This corresponds to the JSON Schema multipleOf keyword.
  • Stats - Several statistical properties including mean, standard deviation, skewness, and kurtosis are calculated.
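
As an illustration of the MultipleOf idea, the sketch below applies Euclid's algorithm to exact fractions so that decimal values are handled as well as integers. The function names are hypothetical, and converting through str and Fraction is an assumption made for this example rather than a description of how JSONoid represents numbers.

```python
from fractions import Fraction
from functools import reduce
from typing import Iterable, Union


def euclid_gcd(a: Fraction, b: Fraction) -> Fraction:
    """Euclid's algorithm; it also terminates for rational numbers."""
    while b:
        a, b = b, a % b
    return abs(a)


def multiple_of(values: Iterable[Union[int, float]]) -> Fraction:
    """Largest constant that every value is an exact multiple of."""
    # Going through str keeps 1.5 as 3/2 instead of a binary approximation.
    fractions = [Fraction(str(v)) for v in values]
    # Fraction(0) is the identity: gcd(0, x) == x.
    return reduce(euclid_gcd, fractions, Fraction(0))


print(multiple_of([10, 25, 40]))
print(multiple_of([1.5, 4.5, 6]))
```

Since the GCD is associative and has zero as its identity, it fits the monoid pattern: partial results from separate batches of documents can be combined with the same function.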

Objects

  • Dependencies - In some schemas, a key must exist in an object if some other key exists, as in the JSON Schema dependentRequired keyword. For example, if a city is provided, it may also be necessary to provide a state.
  • FieldPresence - For keys which are not required, this tracks the percentage of objects which contain this property.
  • Required - This tracks which keys are always present in a schema, suggesting that they are required.
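
The three object properties above can be sketched over a small in-memory collection. This is a simplified illustration under stated assumptions: summarize_keys and the sample documents are hypothetical, and dependencies are only reported between optional keys.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def summarize_keys(objects: List[dict]) -> Tuple[Set[str], Dict[str, float], Dict[str, List[str]]]:
    """Return (required keys, presence ratio of optional keys, dependencies)."""
    total = len(objects)
    counts = Counter(key for obj in objects for key in obj)

    # Required: keys seen in every object.
    required = {k for k, c in counts.items() if c == total}
    # FieldPresence: fraction of objects containing each optional key.
    presence = {k: c / total for k, c in counts.items() if c < total}

    # Dependencies: a -> b when every object containing a also contains b.
    dependencies = {}
    for a in presence:
        holders = [obj for obj in objects if a in obj]
        implied = sorted(b for b in presence if b != a and all(b in obj for obj in holders))
        if implied:
            dependencies[a] = implied
    return required, presence, dependencies


docs = [
    {"name": "A", "city": "X", "state": "NY"},
    {"name": "B"},
    {"name": "C", "city": "Y", "state": "CA"},
    {"name": "D", "state": "TX"},
]
required, presence, dependencies = summarize_keys(docs)
print(required, presence, dependencies)
```

Here city implies state, matching the example in the Dependencies description, while state does not imply city because one object has a state without a city.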

Strings

  • Format - This attempts to infer a value for the format keyword. Formats are semantic types of strings such as URLs or email addresses. A string will be labelled with the most common format detected.
  • LengthHistogram, MaxLength, MinLength - Both the minimum and maximum length of strings as well as a histogram of all string lengths will be included.
  • Pattern - This attempts to infer a value for the pattern keyword. A pattern is a regular expression which all string values must match. Currently this property simply finds common prefixes and suffixes of strings in the schema.
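
A pattern built from a common prefix and suffix can be sketched as follows. This is an assumption about the general technique rather than JSONoid's actual code, and infer_pattern is a hypothetical name.

```python
import os
import re
from typing import List, Optional


def infer_pattern(strings: List[str]) -> Optional[str]:
    """Build a regular expression from the common prefix and suffix, if any."""
    if not strings:
        return None
    # commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(strings)
    reversed_suffix = os.path.commonprefix([s[::-1] for s in strings])
    # Trim the suffix so it cannot overlap the prefix in the shortest string.
    shortest = min(len(s) for s in strings)
    suffix = reversed_suffix[: shortest - len(prefix)][::-1]
    if not prefix and not suffix:
        return None
    return "^" + re.escape(prefix) + ".*" + re.escape(suffix) + "$"


names = ["img_001.png", "img_002.png", "img_abc.png"]
pattern = infer_pattern(names)
print(pattern)
```

Escaping the prefix and suffix matters because characters such as the dot would otherwise be treated as regular expression syntax.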

Equivalence relations ↔️

The concept of equivalence relations was first introduced by Baazizi et al. in Parametric schema inference for massive JSON datasets. The idea is that some JSON Schemas may contain some level of variation such as optional values and multiple possible types for a given key. Whether any particular schemas should be considered equivalent is dependent on the particular dataset in question, so this equivalence is configurable.

JSONoid currently supports four equivalence relations (which can be specified using the --equivalence-relation command line option):

  1. Kind equivalence (the default) will combine schemas when they are of the same kind, e.g. both objects, regardless of the contents of the objects.

  2. Label equivalence will combine object schemas only if they have the same keys, regardless of the value of the key.

  3. IntersectingLabel equivalence will combine object schemas if they have any keys in common. This can be helpful when some keys are optional since label equivalence would consider two schemas as different if one is missing an optional key.

  4. TypeMatch equivalence will combine object schemas if any keys that they have in common have the same type. Note that this equivalence is shallow, meaning that two values are considered the same type if they are both objects or arrays, without considering the contained types (similar to kind equivalence).
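
The four relations can be sketched as predicates. For brevity this sketch compares plain JSON objects rather than schemas, all function names are hypothetical, and treating objects with no keys in common as a type match is an assumption of the example.

```python
def kind(value) -> str:
    """Shallow JSON kind of a value, ignoring any nested contents."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def kind_equivalent(a, b) -> bool:
    return kind(a) == kind(b)


def label_equivalent(a: dict, b: dict) -> bool:
    return a.keys() == b.keys()


def intersecting_label_equivalent(a: dict, b: dict) -> bool:
    return bool(a.keys() & b.keys())


def type_match_equivalent(a: dict, b: dict) -> bool:
    # Every shared key must have the same shallow kind in both objects.
    return all(kind(a[k]) == kind(b[k]) for k in a.keys() & b.keys())


x = {"id": 1, "name": "a"}
y = {"id": 2, "tags": []}
z = {"id": "3", "name": "b"}
print(label_equivalent(x, y), intersecting_label_equivalent(x, y), type_match_equivalent(x, z))
```

In this sample, x and y are merged under kind, intersecting label, and type match equivalence but not under label equivalence, while x and z share labels yet fail the type match because id changes from a number to a string.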

Transformers

Some useful transformations of schemas can only be applied after the entire schema has been computed. The transformations currently implemented in JSONoid are detailed below.

DefinitionTransformer

This transformer will attempt to discover common substructures present in the schema for the purpose of creating reusable definitions. The transformer will consider common sets of keys which occur across objects in the schema and try to find those which are similar and group them together into a definition. This experimental feature is disabled by default and can be enabled with the --add-definitions command line option.

DisjointObjectTransformer

The disjoint object transformer attempts to identify cases in a schema where there are multiple objects at the same location in the schema, but with different sets of keys. Consider for example the set of documents below:

{"a": 1, "b": 2}
{"c": 5, "d": 6}
{"a": 3, "b": 4}
{"c": 7, "d": 8}

In this case, we can see there are two types of objects: those with keys a and b and those with keys c and d. The disjoint object transformer will attempt to identify these two types of objects and instead of creating a single object schema with multiple keys, create a schema that uses oneOf and includes each option. This feature is not currently available via the CLI.
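
One simple way to find such groups is to merge key sets that overlap and treat the remaining disjoint sets as separate options. The sketch below illustrates that idea only; it is not the algorithm JSONoid uses, and disjoint_key_groups and one_of_schema are hypothetical names.

```python
from typing import Dict, List, Set


def disjoint_key_groups(objects: List[dict]) -> List[Set[str]]:
    """Group keys so that objects sharing any key end up in the same group."""
    groups = []
    for obj in objects:
        keys = set(obj)
        # Absorb every existing group that overlaps with this object's keys.
        for group in [g for g in groups if g & keys]:
            keys |= group
            groups.remove(group)
        groups.append(keys)
    return groups


def one_of_schema(groups: List[Set[str]]) -> Dict:
    """Emit one object option per group instead of a single merged object."""
    options = [
        {"type": "object", "properties": {k: {} for k in sorted(g)}}
        for g in sorted(groups, key=sorted)
    ]
    return {"oneOf": options}


docs = [{"a": 1, "b": 2}, {"c": 5, "d": 6}, {"a": 3, "b": 4}, {"c": 7, "d": 8}]
groups = disjoint_key_groups(docs)
schema = one_of_schema(groups)
print(schema)
```

For the four documents above this yields two options, one with keys a and b and one with keys c and d.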

DynamicObjectTransformer

This transformer will attempt to identify cases when the keys for an object in the schema are not fixed, but the values have a common schema. This is commonly implemented using the additionalProperties keyword. This transformer implements the approach described in the paper Reducing Ambiguity in Json Schema Discovery by Spoth et al. This is also disabled by default and can be enabled with the --detect-dynamic command line option.

EnumTransformer

This transformer will attempt to infer a value for the enum keyword. This is based on examples which were found in the schema. If only a small number of examples are found, then the set of examples is transformed into an enum. This transformer is always enabled.
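
The decision can be sketched as a threshold on the number of distinct examples. The function maybe_enum and the limit of five values are assumptions chosen for illustration; the source does not state the threshold JSONoid uses.

```python
from typing import Dict, List, Optional


def maybe_enum(examples: List[str], max_values: int = 5) -> Optional[Dict[str, List[str]]]:
    """Return an enum fragment when few distinct examples were observed."""
    distinct = sorted(set(examples))
    if 0 < len(distinct) <= max_values:
        return {"enum": distinct}
    return None  # too many distinct values (or none) to treat as an enum


colours = maybe_enum(["red", "green", "red", "blue"])
identifiers = maybe_enum([str(i) for i in range(100)])
print(colours, identifiers)
```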

MergeAllOfTransformer

This transformer will find cases in a schema where allOf is used and merge all the schemas together. This will remove the use of allOf but produce a schema which should accept the same documents. This is only useful for schemas not generated by JSONoid since JSONoid does not currently generate schemas with allOf. Accordingly, there is no option for this transformer in the CLI, but it may be useful via the API.

Apache Spark ✨

JSONoid also supports distributed schema discovery via Apache Spark. There are two options for running JSONoid on Spark. The first is to use the JsonoidSpark class as your main class when running Spark. You can either use the JAR file produced via sbt assembly or download one from the latest release. In this case, you can pass a file path as input and the schema will be written to standard output. Alternatively, you can use the JsonoidRdd#fromString method to convert an RDD of strings to an RDD of schemas that supports schema discovery via the reduceSchemas or treeReduceSchemas method. The result of the reduction will be a JsonSchema object.

Running tests

Tests are written with ScalaTest and can be run with sbt test. It is also possible to run fuzz tests via Jazzer with ./run-fuzzer.sh.

Reporting issues 🚩

If you encounter any issues, please open an issue on the GitHub repository. Any potential security vulnerabilities should be reported privately.

Datasets

Validation ✅

JSONoid also contains a partial implementation of a JSON Schema validator. More details on validation can be found in this repository.

jsonoid-discovery's People

Contributors

dependabot[bot], michaelmior

jsonoid-discovery's Issues

Unable to Run Docker Container

I pulled down the michaelmior/jsonoid-discovery:latest image today and it is not working as expected. I am aware that this might be because I'm trying to run it on a Macbook with an M2 chip. I have been able to run other amd64 images.

Image ID: a4f4c6137331
Digest: sha256:f911fda8d977eccdb64e206126e01fc1060539b23b8181b396fe6f707a6fec22
Created: 6 days ago (as of March 13, 2024)

When I run the docker container as described in the README, it hangs forever. Even just running the --help hangs forever.
I'm running a command like: docker run --rm -i --platform linux/amd64 michaelmior/jsonoid-discovery -h

If I use the Docker terminal to modify the /opt/docker/bin/discovery-schema file and update the first line of the usage function like:

usage() {
-   cat <<EOM
+   cat <<'EOM' 

Then I am able to run ./bin/discovery-schema -h and see the usage output from within the docker container's terminal.

Notably, if I edit that same file and add an echo command immediately after the "Main script" comment, I do not ever see the log and it continues to hang forever.

Unable to run Docker image - unauthorized registry

I am not sure if this is expected behavior but I am running into the following issue - the first line copied directly from the README:

  โฏ docker run -i --rm ghcr.io/michaelmior/jsonoid-discovery:latest
  Unable to find image 'ghcr.io/michaelmior/jsonoid-discovery:latest' locally
  docker: Error response from daemon: Head "https://ghcr.io/v2/michaelmior/jsonoid-discovery/manifests/latest": unauthorized.
  See 'docker run --help'.
  โฏ docker pull ghcr.io/michaelmior/jsonoid-discovery:latest
  Error response from daemon: Head "https://ghcr.io/v2/michaelmior/jsonoid-discovery/manifests/latest": unauthorized
