
JSON-Schema-Analysis


Description

JSON-Schema-Analysis is a project that analyses real-world JSON Schema documents with respect to their usage of the components and features allowed by the JSON Schema standard. To this end, the JSON Schema files available from the JSON Schema Store are downloaded and analysed using a set of Python scripts.

Running the Code

Preliminaries

To run this code on your device, download or clone this repository. You need Python 3 installed; the project is tested with Python 3.7.17. You also have to install the required packages with pipenv. Additionally, you have to install the GraphViz software.
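The setup steps above look roughly as follows (a sketch only; the repository URL is a placeholder, and the snippet assumes the repository ships a Pipfile for pipenv):

```shell
git clone <repository-url>   # or download and unpack the repository
cd schemastore-analysis
pipenv install               # installs the Python dependencies (assumes a Pipfile)
# GraphViz has to be installed separately, e.g. via your system's package manager
```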

Running the code

After you have installed the packages named above, you can rerun the analysis provided by this project. The main file is JSON_Schema_Analysis.py; it takes several optional command line arguments. Providing -a makes the code analyse all files in the directory JSON. With the argument -c <arg> you can specify the number of files to analyse. It is also possible to print all results to the CLI with the argument -v.
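Typical invocations might then look like this (flag semantics as described above; running via `pipenv run` is an assumption about the setup):

```shell
pipenv run python JSON_Schema_Analysis.py -a       # analyse all files in JSON/
pipenv run python JSON_Schema_Analysis.py -c 100   # analyse 100 files
pipenv run python JSON_Schema_Analysis.py -a -v    # analyse all files and print results to the CLI
```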

To analyse additional schemas, put them in the directory JSON and add them to the two CSV files responsible for category matching: filename_spec.csv and categorisation.csv, both located in the main directory of this project. For each file you have to specify a nickname, the real filename and a category. Insert a line with nickname,filename in filename_spec.csv and a line with nickname,category in categorisation.csv for every additional JSON Schema you want to analyse. Although JSON_Schema_Analysis specifies the four categories app, data, conf and meta, it can handle other categories simply by naming them as the category in categorisation.csv.
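For instance, registering a hypothetical schema stored as my_schema.json under the nickname my_schema with the category conf would require one line in each file. In filename_spec.csv:

```
my_schema,my_schema.json
```

and in categorisation.csv:

```
my_schema,conf
```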

Results

The results are stored in the project's main directory as an Excel sheet named AnalysisResults.xlsx and as a CSV file AnalysisResults.csv. Plots have to be generated separately with the provided scripts, explained at the bottom of this document. Each row of AnalysisResults.xlsx describes one JSON Schema document: the first column gives the filename of the document in the JSON directory, and the following columns provide the information about this document generated by the analysis. The following table explains the meaning of each column.

| Column name | Meaning |
| --- | --- |
| add_prop_count | Number of occurrences of the additionalProperties keyword. |
| all_of_count | Number of occurrences of the allOf keyword. |
| any_of_count | Number of occurrences of the anyOf keyword. |
| array_count | Number of occurrences of the array type keyword. |
| str_count | Number of occurrences of the string type keyword. |
| enum_count | Number of occurrences of the enum keyword. |
| mult_of_count | Number of occurrences of the multipleOf keyword. |
| not_count | Number of occurrences of the not keyword. |
| number_count | Number of occurrences of the integer plus number type keywords. |
| pattern_count | Number of occurrences of the pattern plus patternProperties keywords. |
| required_count | Number of occurrences of the required keyword. |
| unique_items_count | Number of occurrences of the uniqueItems keyword. |
| value_restriction_count | Sum of occurrences of the min, max, minLength, maxLength, exclusiveMinimum and exclusiveMaximum keywords. |
| boolean_count | Number of occurrences of the boolean type keyword. |
| nulltype_count | Number of occurrences of the null type keyword. |
| object_count | Number of occurrences of the object type keyword. |
| ref_count | Number of occurrences of the $ref keyword. |
| depth_schema | Depth of the tree that emerges from loading the raw JSON Schema into a schema_graph. |
| depth_resolvedTree | Depth of the tree after resolving the references. If has_recursion is true, this is the maximum cycle length in the recursive document. |
| fan_in | Maximum fan-in over all nodes included in the schema_graph. |
| fan_out | Maximum fan-out over all nodes included in the schema_graph. |
| has_recursion | Boolean flag that indicates whether the JSON Schema document (i.e. the resolved graph) is recursive. |
| min_cycle_len | Minimum cycle length of a recursive document. If has_recursion is false, this column is 0. |
| width | Number of leaf nodes in the schema_graph of the raw JSON Schema document. |
| reachability | Boolean flag that indicates whether the schema contains unreachable (unused) definitions. |
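Since the CSV shares the layout above, its contents are easy to post-process. The following sketch uses Python's csv module on an in-memory sample with the same header layout as AnalysisResults.csv (the column names are taken from the table; the filenames and values are invented); in practice you would open AnalysisResults.csv from the project root instead.

```python
import csv
import io

# In-memory stand-in for AnalysisResults.csv (invented example values).
sample = io.StringIO(
    "filename,ref_count,has_recursion,min_cycle_len\n"
    "example_a.json,12,True,2\n"
    "example_b.json,3,False,0\n"
)

rows = list(csv.DictReader(sample))

# Collect all schemas flagged as recursive.
recursive = [r["filename"] for r in rows if r["has_recursion"] == "True"]
print(recursive)  # ['example_a.json']
```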

Project Structure

The Python script JSON_Schema_Analysis.py contains the main function. When started, it creates as many processes as there are virtual CPU cores on the current machine. These processes are described in the file Analytic_Process.py. The project uses the Python multiprocessing library; an Analytic_Process inherits from the Process class defined there. Each Analytic_Process fetches a file, performs all necessary analytic steps, stores the results and fetches the next file, as long as unprocessed files are available. This design avoids problems with concurrency.

The Analytic_Processes build a schema_graph from each JSON Schema document. These graphs are represented by the class defined in schema_graph.py, where most of the computation is performed. Three types of graph nodes are defined in the project: KeyValueNodes, ArrayNodes and ObjectNodes, which all inherit from SchemaNode; each is defined in the file of the same name. The file load_schema_from_web.py downloads additional files in the resolving process whenever an external reference is required. The schema_checker.py file performs the validity check with one validator. All type counts and some other counts are performed using the visitor pattern; all used visitors are defined in the subdirectory Visitors. The Meta_Schemas directory contains the JSON Schema meta schemas for each draft.

All unit tests performed can be found in the directory PyTest. There is an additional README that describes the structure of the tests.

The top directory of the project contains the results in AnalysisResults.xlsx and AnalysisResults.csv; both contain the same information. The file categorisation.csv maps each JSON Schema document's short name to its category. The file filename_spec.csv maps the document short names (see schemastore.org) to the actual filenames of the stored JSON Schema documents. Both files are used by the project to determine the category of each file. The file filename_spec.csv is generated by get_schemas_from_store.py; this script downloads all JSON Schema documents from schemastore.org, generates the filenames and stores the schemas in the directory JSON. The file typeCompareBoxplot_CombinedCount.py generates the plot typeCompareBoxplot_CombinedCount.png in the directory Plots by reading the required data from AnalysisResults.xlsx. The three bar charts are generated by hist.py. The file countsSpecialCategoriesTotal.csv is generated by table.py. The file writer.py implements helper functions for table.py; before table.py can be executed, writer.py has to be run at least once.
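The script dependencies above imply a run order, which can be summarised as follows (a sketch based only on the description; the scripts take no arguments here, which is an assumption):

```shell
python get_schemas_from_store.py              # downloads schemas into JSON/, writes filename_spec.csv
python JSON_Schema_Analysis.py -a             # produces AnalysisResults.xlsx / AnalysisResults.csv
python typeCompareBoxplot_CombinedCount.py    # writes Plots/typeCompareBoxplot_CombinedCount.png
python hist.py                                # generates the three bar charts
python writer.py                              # must run at least once before table.py
python table.py                               # writes countsSpecialCategoriesTotal.csv
```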

The directory JsonSchemaAnalysis contains a reference implementation that was used to validate the results calculated by the Python project.

Contributors

michaelmior, ben757, minihive, dependabot[bot]
