
totalhack / zillion

Make sense of it all. Semantic data modeling and analytics with a sprinkle of AI. https://totalhack.github.io/zillion/

License: MIT License

Languages: Python 99.55%, Makefile 0.34%, Shell 0.11%

Topics: ai, analytics, data-analysis, data-warehousing, datasources, openai, python, query-builder, reporting, semantic-data-model, semantic-layer, sql, text-to-sql, warehouse

zillion's People

Contributors: totalhack, yazgoo

zillion's Issues

Improve Warehouse save()

Currently a Warehouse can only be safely saved if it was created with a reference to a config file and its config was not changed in memory after init (otherwise it would be out of sync with the referenced file). This works fine when a config file is the master definition of the Warehouse and you aren't editing the Warehouse in memory, but it would be better if save() could reconstruct the current active config and write it back to a specified file path even when the Warehouse wasn't created from a config file/URL.

One caveat: if the Warehouse was created from a remote config file there may be no way to post changes back, so save() could only write a local config file unless additional work is done to support pushing warehouse changes to other locations (remote files, git, S3, etc.).

One way or another this process should be cleaned up. Either the door needs to be closed on in-memory editing of a Warehouse config (a file is the only possible master and all changes go through that) or we need to support reconstructing and saving a config from the current Warehouse settings.
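The second option above (reconstructing and saving the active config) could look roughly like this. This is a minimal sketch under stated assumptions: `save_config` and the config shape shown are illustrative, not zillion's actual save() implementation.

```python
import json
import os
import tempfile

def save_config(config: dict, path: str) -> None:
    """Serialize the current active warehouse config to a local JSON file.

    Hypothetical helper -- not zillion's actual save() logic.
    """
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)

# Simulate an in-memory edit after init, then persist the active config:
config = {"metrics": [{"name": "sales"}], "dimensions": [{"name": "partner"}]}
config["metrics"].append({"name": "revenue"})  # edit made in memory
path = os.path.join(tempfile.gettempdir(), "wh_config.json")
save_config(config, path)
with open(path) as f:
    reloaded = json.load(f)
print(reloaded == config)  # True -- the edited config round-trips to disk
```

The key design point is that the file is written from the Warehouse's current state rather than assuming the original config file is still authoritative.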

Support Entity-Attribute-Value Tables

Currently using entity-attribute-value tables (see example image) requires creating views for each attribute and putting those views in your warehouse config as individual dimension tables. It would be nice if zillion had a new "attribute" table type that could automatically adjust the warehouse definition based on the attributes that are supported.

(image: example entity-attribute-value table layout)

To support this we'd need to:

  1. Adjust core.py:TableTypes to support a new type
  2. Adjust configs.py to support a new table type in the config
  3. Review/adjust Datasource.find_neighbor_tables to make sure attribute tables can join as needed
  4. (Hard part) Review/adjust the logic that builds tables and joins that can meet the required grain -- this will need special logic to look at attribute tables since they store their attributes as rows instead of columns.
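The current workaround described above can be sketched with an in-memory SQLite database: pivot the entity-attribute-value table into one view per attribute, so each view can be registered as a normal dimension table. Table and column names here are illustrative.

```python
import sqlite3

# EAV table: attributes stored as rows rather than columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE partner_attributes (
    partner_id INTEGER,
    attribute  TEXT,
    value      TEXT
);
INSERT INTO partner_attributes VALUES
    (1, 'region', 'EMEA'),
    (1, 'tier',   'gold'),
    (2, 'region', 'APAC');
-- One view per attribute, usable as an individual dimension table:
CREATE VIEW partner_region AS
    SELECT partner_id, value AS region
    FROM partner_attributes WHERE attribute = 'region';
""")
rows = conn.execute(
    "SELECT * FROM partner_region ORDER BY partner_id").fetchall()
print(rows)  # [(1, 'EMEA'), (2, 'APAC')]
```

A native "attribute" table type would effectively generate these per-attribute projections automatically instead of requiring a hand-written view per attribute.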

Add Warehouse.chat()

Add a more flexible method to chat with the warehouse, including the ability to execute other warehouse methods besides just running reports (which is all execute_text does). An example might be asking for the list of dimensions in table X.
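A rough dispatch sketch of the idea, using a stand-in class since this is a proposed feature: `FakeWarehouse`, its method names, and the keyword routing are all illustrative, not zillion's API.

```python
class FakeWarehouse:
    """Stand-in for zillion's Warehouse; method names are illustrative."""

    def get_dimension_names(self):
        return ["partner_name", "date"]

    def execute_text(self, text):
        return f"<report for: {text}>"

def chat(wh, message: str) -> str:
    # Route metadata questions to warehouse methods instead of always
    # falling through to report execution (all execute_text does today).
    if "dimensions" in message.lower():
        return ", ".join(wh.get_dimension_names())
    return wh.execute_text(message)

print(chat(FakeWarehouse(), "what dimensions are available?"))
# partner_name, date
```

A real implementation would presumably use an LLM tool-calling layer rather than keyword matching, but the shape is the same: one entry point that can invoke multiple warehouse methods.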

Many-to-Many Table Support

Currently only a parent-child or parent-sibling (for dimensions only) join model is supported in Zillion. This keeps things simpler, works for many analytics use-cases, and helps prevent introduction of "bad joins" that can throw off aggregation. It's worth investigating how many-to-many relationships might be supported via bridge tables.
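The aggregation risk mentioned above is easy to demonstrate: joining a fact table through a many-to-many bridge table fans out rows and double counts. The schema here is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (partner_id INTEGER, amount REAL);
CREATE TABLE partner_category (partner_id INTEGER, category TEXT);
INSERT INTO sales VALUES (1, 100.0);
-- Bridge table: partner 1 belongs to two categories
INSERT INTO partner_category VALUES (1, 'retail'), (1, 'online');
""")
total = conn.execute("""
    SELECT SUM(s.amount) FROM sales s
    JOIN partner_category pc ON s.partner_id = pc.partner_id
""").fetchone()[0]
print(total)  # 200.0 -- one fact row matched two bridge rows
```

Supporting bridge tables would mean detecting this fan-out and compensating (e.g. weighting factors or de-duplicating before aggregation), which is why the simpler parent-child model is attractive.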

Support "in report" operator for criteria

Sometimes it's useful to filter one report based on the results of another. This theoretically would not be that hard to implement: something like [("some_field", "in_report", <report_id>)], where a sub-report is spawned first to get that result. The sub-report would need to include "some_field" as a dimension.
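One way the expansion could work, as a sketch: spawn the sub-report first, then rewrite the criterion as a plain "in" over its values. The `run_report` callable and criteria tuples are illustrative, not zillion's internals.

```python
def expand_criteria(criteria, run_report):
    """Rewrite any ("field", "in_report", report_spec) criterion into a
    plain ("field", "in", [values...]) by running the sub-report first."""
    expanded = []
    for field, op, value in criteria:
        if op == "in_report":
            sub_result = run_report(value)  # value identifies the sub-report
            values = sorted({row[field] for row in sub_result})
            expanded.append((field, "in", values))
        else:
            expanded.append((field, op, value))
    return expanded

# Usage with a fake sub-report runner returning rows as dicts:
fake_runner = lambda report_spec: [{"partner": "A"}, {"partner": "B"}]
print(expand_criteria([("partner", "in_report", 123)], fake_runner))
# [('partner', 'in', ['A', 'B'])]
```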

Extracting the raw SQL command without executing it

Zillion's NLP support allows executing SQL from normal text.

result = wh.execute_text("sales for Partner A")
print(result.df) # Pandas DataFrame

The above example uses the NLP capability to execute a query.
Under the hood, the execute_text callable must be constructing a SQL command before executing it.
I'm wondering if the raw SQL can be exposed before execution.
Here is some pseudocode of what I'm thinking:

q = wh.create_nlp_q("sales Partner A")
print(q)
# SELECT sales from ...

Investigate Steampipe Integration

Steampipe allows querying a variety of APIs as SQL, and provides sqlite extensions to help. In theory it should be possible to represent each API source as a zillion config and run reports against it as you would any other SQL datasource. It would be nice if there were already metadata we could use to convert each datasource to a zillion config. It would also help if there are tools that make the sqlite extension installation more seamless, so the extension could be handled automatically if it is missing at warehouse init time.

https://steampipe.io/
https://til.simonwillison.net/sqlite/steampipe

Ability to register new technical computations

Currently supported technical computations are defined in configs.py and added to TECHNICAL_CLASS_MAP which maps the technical name to a computation class. That map could be updated in place but it would be better to provide an API to manage which technical computations are supported.
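A small registration API over a name-to-class map, analogous to updating TECHNICAL_CLASS_MAP in place, could look like this. The map is redefined locally here and the computation class is illustrative; this is a sketch of the proposed API, not zillion's code.

```python
# Local stand-in for the name -> computation-class map in configs.py:
TECHNICAL_CLASS_MAP = {}

def register_technical(name, cls, overwrite=False):
    """Register a technical computation class under a name, refusing
    silent overwrites unless explicitly requested."""
    if name in TECHNICAL_CLASS_MAP and not overwrite:
        raise ValueError(f"technical '{name}' already registered")
    TECHNICAL_CLASS_MAP[name] = cls

class CumulativeSum:
    """Illustrative computation class: running total of a series."""

    def apply(self, values):
        out, total = [], 0
        for v in values:
            total += v
            out.append(total)
        return out

register_technical("cumsum", CumulativeSum)
result = TECHNICAL_CLASS_MAP["cumsum"]().apply([1, 2, 3])
print(result)  # [1, 3, 6]
```

The advantage over mutating the map directly is that registration can validate names, guard against collisions, and later enforce an interface on the computation class.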

Investigate more flexible table models

A user questioned why we need the restriction of defining tables as either metric/fact tables or dimension tables. This organization is currently important to zillion's understanding of how to appropriately form queries, and it roughly follows the data warehousing ideas outlined by Kimball. I think it's worth investigating a more flexible model, perhaps enabled by a mode flag in the zillion config, that doesn't require you to label your tables as metric or dimension tables and instead determines dynamically how each table must act to satisfy a report request. In other words, when a metric is requested from a table, treat it like a metric table for the scope of that report request, and so on.

Disclaimer: this may be a bad idea. It's possible this could make it too easy for users to have zillion put together bad queries/reports based on a poorly defined warehouse structure. But if there are cases where this would be valuable then I would lean towards letting users utilize this at their own risk, assuming I can manage to fully understand and explain the caveats and gotchas.

Ability to Customize DataSource Group By Clause by Dimension

Currently the datasource level queries just group by the dimensions numerically (i.e. group by 1,2,3 if there are 3 dimensions in use). There are cases where you might want to be able to customize the clause that gets used here, such as adjusting/coercing column collation in MySQL at query time. The flexibility should maybe be limited here though, as screwing up the group by logic when trying to do something more complex could lead to unexpected behavior/output that might be hard to diagnose at first.
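The numeric grouping described above can be shown with SQLite: "GROUP BY 1" groups on the first selected column and is equivalent to naming it. A per-dimension customization would swap the ordinal for an expression, e.g. a collation-coerced column in MySQL (not demonstrated here since SQLite lacks MySQL collations).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (partner TEXT, amount REAL);
INSERT INTO sales VALUES ('A', 10), ('A', 5), ('B', 7);
""")
# Current behavior: group dimensions numerically by select-list position.
by_ordinal = conn.execute(
    "SELECT partner, SUM(amount) FROM sales GROUP BY 1 ORDER BY 1").fetchall()
# Equivalent explicit form a custom clause could replace/extend:
by_name = conn.execute(
    "SELECT partner, SUM(amount) FROM sales "
    "GROUP BY partner ORDER BY partner").fetchall()
print(by_ordinal)  # [('A', 15.0), ('B', 7.0)]
print(by_ordinal == by_name)  # True
```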

Example Sales Analytics : sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file

Following the example tutorial from the docs, I'm running into an error.

System:
Windows 10
Python 3.11
qdrant docker-compose file in use

Code I ran:

from zillion import Warehouse

wh = Warehouse(config="https://raw.githubusercontent.com/totalhack/zillion/master/examples/example_wh_config.json")

Error

No ZILLION_CONFIG specified, using default settings
Traceback (most recent call last):
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3366, in _wrap_pool_connect
    return fn()
           ^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 493, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\impl.py", line 256, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 273, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 388, in __init__
    self.__connect()
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 690, in __connect
    with util.safe_reraise():
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\util\compat.py", line 211, in raise_
    raise exception
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\yeman_s1h20q2\Yemane\zill\main.py", line 1, in <module>
    from zillion import Warehouse
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\zillion\__init__.py", line 21, in <module>
    from .datasource import DataSource
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\zillion\datasource.py", line 30, in <module>
    from zillion.field import (
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\zillion\field.py", line 18, in <module>
    from zillion.model import zillion_engine, DimensionValues
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\zillion\model.py", line 51, in <module>
    zillion_metadata.create_all(zillion_engine)
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\sql\schema.py", line 4930, in create_all
    bind._run_ddl_visitor(
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3232, in _run_ddl_visitor
    with self.begin() as conn:
         ^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3148, in begin
    conn = self.connect(close_with_result=close_with_result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3320, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 96, in __init__
    else engine.raw_connection()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3399, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3369, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 2203, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\util\compat.py", line 211, in raise_
    raise exception
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\base.py", line 3366, in _wrap_pool_connect
    return fn()
           ^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 493, in checkout
    rec = pool._do_get()
          ^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\impl.py", line 256, in _do_get
    return self._create_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 273, in _create_connection
    return _ConnectionRecord(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 388, in __init__
    self.__connect()
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 690, in __connect
    with util.safe_reraise():
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\util\langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\util\compat.py", line 211, in raise_
    raise exception
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\pool\base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\yeman_s1h20q2\Yemane\zill\.venv\Lib\site-packages\sqlalchemy\engine\default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Adding GitHub Actions to have CI

I am using Pydigger to monitor recent uploads to PyPI that don't have any Continuous Integration (CI) system configured. A CI system can greatly improve the development experience by providing quick feedback to developers and contributors, even for a toy or experimental project. As my contribution to open source (see why), I try to contribute a simple CI configuration to help these projects get started.

I've started to work on adding GitHub Actions, and I'll report any issues here as I encounter them.

Datasource-level Natural Language Querying

Currently natural language interfaces are limited to using existing field definitions. It might be useful to also allow a more direct form of querying that can produce arbitrary datasource level SQL formulas if an appropriate field doesn't exist to satisfy a request.

Natural Language Report Result Updates

The NLP features leverage langchain under the hood, and I think there are examples out there of using it to build a natural language interface for editing a DataFrame (which the Report has as output). It might be interesting to support further modifying report data via natural language, or editing and re-running the report with natural language (assuming we don't cover that in a warehouse-level chat interface).

Datasource methods to add and remove tables

Currently there are methods to add a new Datasource to an existing Warehouse object but not to add a new table to an existing Datasource. This would allow for more flexibility in how ad hoc tables can be combined with existing data. The current workaround is to just add any tables you want to stick around to your config file and recreate the Warehouse.
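A sketch of what add/remove table methods on a DataSource might look like, managing a name-to-table-config map. All names and the config shape are illustrative; this is not zillion's actual DataSource API.

```python
class DataSourceTables:
    """Illustrative container for a DataSource's table configs."""

    def __init__(self):
        self._tables = {}

    def add_table(self, name, table_config):
        # Refuse silent overwrites of existing tables
        if name in self._tables:
            raise ValueError(f"table '{name}' already exists")
        self._tables[name] = table_config

    def remove_table(self, name):
        self._tables.pop(name)  # raises KeyError if missing

    def table_names(self):
        return sorted(self._tables)

ds = DataSourceTables()
ds.add_table("adhoc_sales", {"type": "metric"})
print(ds.table_names())  # ['adhoc_sales']
```

The real methods would also need to trigger re-validation of the warehouse's join graph, since adding or removing a table changes which reports can be satisfied.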
