
JSON query expressions using SQLite

License: Mozilla Public License 2.0


jx-sqlite

JSON query expressions using SQLite

Summary

This library will manage your database schema to store JSON documents. You get all the speed of a well-formed database schema without the schema migration headaches.

https://www.youtube.com/watch?v=0_YLzb7BegI&list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U&index=26&t=260s

Status

Significant updates to the supporting libraries have broken this code. It still works for the simple cases that require it.

Jan 2020 - 96/283 tests failing

Installation

pip install jx-sqlite

Code Example

Open a database

container = Container()

Declare a table

table = container.get_or_create_facts("my_table")

Pour JSON documents into it

table.add({"os":"linux", "value":42})

Query the table

table.query({
    "select": "os", 
    "where": {"gt": {"value": 0}}
})

An attempt to store JSON documents in SQLite so that they are accessible via SQL. The hope is that this will serve as a basis for a general document-relational map (DRM) and leverage the database's query optimizer. jx-sqlite is also responsible for making the schema, changing it dynamically as new JSON schemas are encountered, and ensuring that old queries against the new schema keep the same meaning.

The most interesting, and most important feature is that we query nested object arrays as if they were just another table. This is important for two reasons:

Inner objects {"a": {"b": 0}} are a shortcut for nested arrays {"a": [{"b": 0}]}, plus
Schemas can be expanded from one-to-one to one-to-many {"a": [{"b": 0}, {"b": 1}]}.
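
To make the equivalence above concrete, here is a minimal sketch in plain Python. It is not the library's code: `as_nested_array` is a hypothetical helper that treats an inner object as a one-row array, so the one-to-one and one-to-many shapes can be read the same way.

```python
def as_nested_array(value):
    """Return value as a list of rows; an inner object becomes a one-row array.

    Hypothetical helper for illustration only, not part of jx-sqlite.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


one_to_one = {"a": {"b": 0}}
one_to_many = {"a": [{"b": 0}, {"b": 1}]}

# Both shapes are iterated the same way once normalised
rows_single = [row["b"] for row in as_nested_array(one_to_one["a"])]
rows_many = [row["b"] for row in as_nested_array(one_to_many["a"])]
```

A query written against `a.b` can then keep its meaning when a later document turns `a` from a single object into an array.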

Motivation

JSON is a nice format to store data, and it has become quite prevalent. Unfortunately, databases do not handle it well: often a human is required to declare a schema that can hold the JSON before it can be queried. If we are not overwhelmed by the diversity of JSON now, we soon will be. There will be more JSON, of more different shapes, as the number of connected devices (and the information they generate) continues to increase.

Contributing

Contributions are always welcome! The best thing to do is find a failing test, and try to fix it.

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

$ git clone https://github.com/mozilla/jx-sqlite
$ cd jx-sqlite

Running tests

There are over 200 tests used to confirm the expected behaviour. They test a variety of JSON forms and the queries that can be performed on them. Most tests are further split into three different output formats (list, table and cube).

export PYTHONPATH=.
python -m unittest discover -v -s tests

Technical Docs

License

This project is licensed under Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

History

Sep 2018 - Upgrade libs, start refactoring to work with other libs

Dec 2017 - A number of tests were added, but they do not pass.

Sep 2017 - GSoC work completed, all but a few tests pass.

GSOC

Work done up to the deadline of GSoC'17:


jx-sqlite's Issues

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

Required Text - All text under the headings Community Participation Guidelines and How to Report, are required, and should not be altered.
Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

ModuleNotFoundError: No module named 'jx_base'

pip install jx-sqlite completes successfully.

But when I then run python -c 'from jx_sqlite.container import Container'

I get ModuleNotFoundError: No module named 'jx_base'

When in doubt, I tried pip install jx-base but got ERROR: No matching distribution found for jx-base.

Escaping dots in query (test_dots_in_property_names)

Should we escape dots in the query's select clause? When the query executes, there won't be any column with such a name. I suspect the escaping is unnecessary, because the tests pass when it is removed, though that may be a coincidence. I am working on this test and wanted to confirm this first.

Wiki changes

FYI: The following changes were made to this repository's wiki:

defacing spam has been removed
Restricting write access to contributors is strongly encouraged. Please make that change (documentation).

These were made as the result of a recent automated defacement of publicly writable wikis.

Add GUID to every fact record (fix TestDeepOps.test_id_select)

Each fact record has a UID = "__id__" used for joining nested records. All fact records should also have a GUID = "__guid__" which is some long hex number to be used for inter-shard uniqueness.

The actual value of the GUID is hex to pass the test, but it should be easy to change.
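
As a rough illustration of the idea, the sketch below attaches a random hex GUID to a record using the standard-library `uuid` module. The helper name `add_guid` is hypothetical, and the choice of `uuid4` is an assumption; the issue only asks for "some long hex number".

```python
import uuid


def add_guid(record):
    """Return a copy of record with a hex __guid__ added if it has none.

    Hypothetical helper; uuid4 is one possible source of a long hex value.
    """
    out = dict(record)
    out.setdefault("__guid__", uuid.uuid4().hex)
    return out


fact = add_guid({"__id__": 1, "os": "linux"})
```

Because the GUID is generated independently of `__id__`, the join logic that relies on `__id__` is untouched.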

test_edge_2 not working

Specifically, the test_count_rows

Simplify the logic that manipulates column/property names

The logic that determines the name appears to be complicated. There is opportunity to simplify it, and possibly other code that touches it.

       if not startswith_field(cname, self.var):
           cols.append({"name": cname, "sql": types, "nested_path": nested_path})
       else:
           cols.append({"name": relative_field(cname, self.var), "sql": types, "nested_path": nested_path})

Simplify setop formatting logic?

This piece of code is a bit mysterious:

    if query.format == "cube":
        for f, _ in self.sf.tables.items():
            if frum.endswith(f) or (test_dots(cols) and isinstance(query.select, list)):

Since it is switching to code that does almost the same thing as the other cube formatter, maybe they are the same.

Why does this code work? Refactor so it is simpler

Should list-formatted aggregates include missing coordinates?

test_2edge_and_sort was recently fixed by changing the test to include all coordinates of the result. But is this the right decision?

Sparse cube-formatted data is space inefficient, but has the benefit of uniformity. Lists can be more efficient at transmitting sparse data, but only if they exclude the coordinates with no data. By forcing a list to contain as many rows as there are coordinates in the cube, we are generating redundant data (zeros, or nulls) that can be a significant portion of the bytes in a query result.
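
The trade-off can be seen in a small sketch. The edge names, the `count` column and the `densify` helper below are all made up for illustration; the point is only to compare the serialized size of a sparse list against the same list padded to every coordinate.

```python
import json
from itertools import product

# Two edges with three parts each give nine coordinates (hypothetical data)
edges = {"os": ["linux", "mac", "win"], "browser": ["firefox", "chrome", "safari"]}

# Only two coordinates actually have data
sparse_rows = [
    {"os": "linux", "browser": "firefox", "count": 7},
    {"os": "win", "browser": "chrome", "count": 3},
]


def densify(rows, edges, value="count", default=None):
    """Pad rows so there is one row per coordinate in the cube."""
    lookup = {tuple(r[e] for e in edges): r[value] for r in rows}
    return [
        {**dict(zip(edges, coord)), value: lookup.get(coord, default)}
        for coord in product(*edges.values())
    ]


dense_rows = densify(sparse_rows, edges)
sparse_bytes = len(json.dumps(sparse_rows))
dense_bytes = len(json.dumps(dense_rows))
```

Here seven of the nine dense rows carry only a null, which is the redundancy the question is about.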

Use json_tree() to parse the JSON

json_tree() requires higher-level metaprogramming, but may be faster for inserting documents into the database.

pour the raw JSON into a temp table of raw JSON
use json_tree() to make a temp table of all the properties for all documents
use SQL to figure how to alter the main schema, and alter as required
generate SQL that loads the schema from temp table of properties

The problem is that json_tree() makes one record for each property, which may be a volume problem. I am also not sure if this works for arbitrarily deep nested object arrays.
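
The first two steps can be sketched with Python's built-in `sqlite3` module, assuming the SQLite build includes the JSON functions. The table name `raw_docs` and the sample documents are hypothetical, and this only lists the leaf properties; it does not attempt the schema alteration or load steps.

```python
import json
import sqlite3

docs = [{"os": "linux", "value": 42}, {"os": "mac", "a": {"b": 0}}]

conn = sqlite3.connect(":memory:")

# Step 1: pour the raw JSON into a table of raw JSON text
conn.execute("CREATE TABLE raw_docs (doc TEXT)")
conn.executemany(
    "INSERT INTO raw_docs (doc) VALUES (?)", [(json.dumps(d),) for d in docs]
)

# Step 2: json_tree() yields one row per node; keep only the leaf properties
properties = conn.execute(
    """
    SELECT DISTINCT t.fullkey, t.type
    FROM raw_docs, json_tree(raw_docs.doc) AS t
    WHERE t.type NOT IN ('object', 'array')
    ORDER BY t.fullkey
    """
).fetchall()
```

Each row pairs a JSON path with the type found there, which is the information a later step would need to decide how to alter the main schema. Paths inside arrays include the element index, so they would need normalising before they could be mapped to columns.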

Definition of `eq`, `ne`, and other operators returning Boolean

What is the Boolean logic in the face of null values or missing properties?
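
As background for this question, the sketch below shows how SQLite itself treats comparisons with NULL. It does not say what `eq` and `ne` should return; it only shows the underlying behaviour any definition has to be translated into.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "=" and "!=" return NULL when either side is NULL; "IS NOT" never returns NULL
eq_null, ne_null, is_not = conn.execute(
    "SELECT NULL = 1, NULL != 1, NULL IS NOT 1"
).fetchone()
```

With the plain operators both results come back as `None` in Python, so a two-valued definition of `eq` and `ne` would need `IS` / `IS NOT` or an explicit NULL check.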

Use sqlite metadata to hold Snowflake data structure?

Sqlite has its own metadata tables that can be queried. The Snowflake data structure has the same information for its own use. Can the Snowflake use the sqlite metadata instead? Could it be faster?

My concern is that a large number of columns may be slow for Python to process. I might be wrong: sufficient caching and hashing may make pure Python metadata manipulation plenty fast; taking more time to phrase a query to do the work than it would to do the work directly.
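
For reference, this is roughly what reading SQLite's own metadata looks like from Python. The table and column names are placeholders; `sqlite_master` and `PRAGMA table_info` are standard SQLite features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (__id__ INTEGER PRIMARY KEY, os TEXT, value REAL)"
)

# sqlite_master lists the schema objects in the database
tables = [
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(my_table)")}
```

Whether this is faster than a cached Python structure would have to be measured; each lookup here is a round trip through the SQL engine.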

Ambiguous function definition.

In the /jx_sqlite/__init__.py file, the functions
[1.] sql_text_array_to_set(column) at line 194 and
[2.] get_column(column) at line 204 are ambiguous.
I think both should take one more argument, row, and the calls to the inner functions should be updated with the proper arguments.
@klahnakoski

Support multiple snowflakes

This project is to be integrated into the annotation server project. This project will change so that multiple snowflakes can exist in the same database.

Right now the BaseTable assumes it is the only entity in the database; I am not sure if multiple BaseTable instances can connect to the same Sqlite database, or if BaseTable is going to defer to some Container that manages the database.

Isolate union aggregates in own subqueries

To properly isolate each union, they need their own subquery that performs the grouping:

SELECT 
	*,
	a.c2
FROM testing
LEFT JOIN (
	SELECT 
		__parent__,
		JSON_GROUP_ARRAY(DISTINCT (testing._c.$NUMBER)) c2
	FROM testing._c
	) a ON a.__parent__ = testing.__id__

The subquery performs the grouping as required so it results in just one value per fact table. That subquery is then joined with the fact table.
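
A runnable version of the same shape is sketched below with simplified, made-up names (`facts`, `facts_c`, `n`) in place of `testing`, `testing._c` and `$NUMBER`. The `GROUP BY __parent__` is an assumption about the intended grouping, since the query above does not spell it out. It also assumes the SQLite build includes the JSON functions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE facts (__id__ INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE facts_c (__parent__ INTEGER, n INTEGER);
    INSERT INTO facts VALUES (1, 'x'), (2, 'y');
    INSERT INTO facts_c VALUES (1, 10), (1, 10), (1, 20), (2, 30);
    """
)

# The subquery collapses the nested rows to one row per parent before the join
rows = conn.execute(
    """
    SELECT f.__id__, a.c2
    FROM facts AS f
    LEFT JOIN (
        SELECT __parent__, JSON_GROUP_ARRAY(DISTINCT n) AS c2
        FROM facts_c
        GROUP BY __parent__
    ) AS a ON a.__parent__ = f.__id__
    ORDER BY f.__id__
    """
).fetchall()

# The element order inside the JSON array is not guaranteed, so sort it
result = {fact_id: sorted(json.loads(c2)) for fact_id, c2 in rows}
```

Because each union gets its own subquery, adding a second one does not multiply the rows of the first.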

remove deepcopy and wrap from example

The main readme can be simplified so it does not use deepcopy or wrap.

Dev branch needs fixes

This project has a sibling, called ActiveData. ActiveData recently went through an upgrade to Elasticsearch version 5. The dev branch for this repo has changes from this upgrade. It has a few more tests, and it has expression simplification. Unfortunately, it breaks a number of things in this project.

I believe the biggest breakage is the lack of a class to represent SQL expressions. The ES5 has one, called Painless (because that is the name of the scripting language it uses) https://github.com/klahnakoski/ActiveData/blob/5673c313e81dfed03596cbfb1ed3b33afa2a524b/jx_elasticsearch/es52/expressions.py#L40

This class represents the Painless expression required to get a value, it has type information, and it has an expression for if the expression results in a missing (null) value. This structure is used by the partial_eval logic to simplify expressions. A similar one is required for SQL.

In any case, please fix the bugs in the dev branch

select as list, or object: pushing values to form documents

There are two common patterns:

if/else blocks to handle the cases of when select is a list or a dict (object).
row[c.push_name][c.push_child] = <some value>

When dealing with example=Data() objects, example["."] refers to self; example["."]="hello" is the same as example="hello". With that in mind, can all select clauses be turned into arrays by setting select.name="."? If so, the code will be simpler.

This strategy may have been avoided before because there was some (unjustified) concern that {"select":{"name":"."}} will not work well with any other select clauses.
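
A minimal sketch of the idea, using plain dicts rather than Data objects. The helpers `push` and `form_document` are hypothetical and ignore escaped dots; they only show that once every select is a list, the "." name removes the need for a separate list/object branch.

```python
def push(doc, name, value):
    """Return doc with value stored at dotted path name; "." is the document itself."""
    if name == ".":
        return value
    if not isinstance(doc, dict):
        doc = {}
    cursor = doc
    *parents, leaf = name.split(".")
    for step in parents:
        cursor = cursor.setdefault(step, {})
    cursor[leaf] = value
    return doc


def form_document(select, row_values):
    """Build one output document from a select clause and the matching row values."""
    # Normalise to a list so there is a single code path
    select = select if isinstance(select, list) else [select]
    doc = None
    for clause, value in zip(select, row_values):
        doc = push(doc, clause["name"], value)
    return doc


as_value = form_document({"name": "."}, ["hello"])
as_object = form_document([{"name": "a.b"}, {"name": "c"}], [1, 2])
```

Note that in this sketch a "." clause replaces whatever was built before it, which is one way the concern about mixing it with other select clauses could show up.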

Add constant propagation for expressions?

There appears to be a problem with the Elasticsearch version of this project. ESv5+ performs a constant propagation test on the scripts it compiles: throwing an error if it recognizes expressions that could be simplified. This makes automated code generation harder.

I believe we can perform naive constant propagation, and expression simplification enough to avoid this problem.
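
What "naive" could look like is sketched below on a toy expression format. The operator names `add` and `mul`, and the convention that strings are variable names, are assumptions made for this example and may not match the project's expression classes.

```python
import operator

# Operators this toy simplifier knows how to fold (assumed names)
FOLDABLE = {"add": operator.add, "mul": operator.mul}


def is_number(value):
    """True for int/float literals, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def simplify(expr):
    """Fold binary operators whose operands are both numeric literals."""
    if not isinstance(expr, dict) or len(expr) != 1:
        return expr  # literal, variable name, or unknown shape
    op, args = next(iter(expr.items()))
    if op not in FOLDABLE or not isinstance(args, list):
        return expr
    args = [simplify(a) for a in args]  # simplify children first
    if len(args) == 2 and all(is_number(a) for a in args):
        return FOLDABLE[op](*args)
    return {op: args}


all_constant = simplify({"add": [{"mul": [2, 3]}, 4]})
partly_constant = simplify({"add": ["x", {"mul": [2, 3]}]})
```

Expressions that mention a variable are left in place with their constant sub-expressions folded, which is the part that would keep the generated script free of foldable constants.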

optimize _accumulate_nested by consuming SQL resultset in order

Here is a piece of frightening code I wrote:

rows = list(reversed(unwrap(result.data)))

Please change the logic in _accumulate_nested() to use these records in order, or have the database deliver them in reverse order.

Library update (Question)

Recent changes made in pylibrary now result in a number of test case failures. Is pylibrary part of the jx-sqlite project or just a dependency? Are modifications to it supposed to happen while working on jx-sqlite?