crate-workbench / cratedb-toolkit

CrateDB Toolkit.

Home Page: https://cratedb-toolkit.readthedocs.io/

License: GNU Affero General Public License v3.0

Python 99.58% Dockerfile 0.37% Shell 0.05%
data-retention olap olap-database expiration data-expiration retention retention-policies retention-policy toolkit cratedb

cratedb-toolkit's Introduction

CrateDB Toolkit


» Documentation | Changelog | Community Forum | PyPI | Issues | Source code | License | CrateDB

About

This software package includes a range of modules and subsystems to work with CrateDB and CrateDB Cloud efficiently.

You can use CrateDB Toolkit to run data I/O procedures and automation tasks of different kinds around CrateDB and CrateDB Cloud. It can be used both as a standalone program, and as a library.

It aims for DWIM-like usefulness and UX, and provides CLI and HTTP interfaces, among others.

Status

Please note that the cratedb-toolkit package contains alpha-, beta-, and incubation-quality code and, as such, is considered a work in progress. Contributions of all kinds are very welcome, in order to make it more solid and to add features.

Breaking changes should be expected until a 1.0 release, so version pinning is strongly recommended, especially when using it as a library.

Install

Install package.

pip install --upgrade cratedb-toolkit

Verify installation.

ctk --version

Run with Docker.

alias ctk="docker run --rm ghcr.io/crate-workbench/cratedb-toolkit ctk"
ctk --version

Development

Contributions are very much welcome. Please visit the documentation to learn about how to spin up a sandbox environment on your workstation, or create a ticket to report a bug or share an idea about a possible feature.

cratedb-toolkit's People

Contributors

amotl, dependabot[bot], hammerhead, pilosus, seut, surister


cratedb-toolkit's Issues

Testing: Adapt "Testcontainers" implementation to `unittest`

Introduction

Over here, we reported on the state of the "Testcontainers for Python" implementation for supporting application testing with CrateDB.

About

Per the issue referenced above, we will need to resolve this backlog item in order to make the test layer usable for applications/libraries which use Python's unittest module for testing.

While a pytest-based wrapper adapter around the "Testcontainers" implementation is nice, the crate-python and crash projects use Python's builtin unittest module. Can we also grow a unittest-based wrapper adapter which is reusable by both downstream projects?

Task

Use testing infrastructure from cratedb_toolkit.testing.testcontainers.cratedb and maybe cratedb_toolkit.tests.conftest.CrateDBFixture, and adapt that to unittest instead of using the pytest-specific details.
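A minimal sketch of such a unittest adapter, assuming the container object follows the Testcontainers `start()`/`stop()` convention; the `make_container` hook and class names here are made up for illustration:

```python
class ContainerLayerMixin:
    """
    Manage a container lifecycle per test class, without pytest.

    `make_container` is a hypothetical hook: a real implementation would
    return a container object from
    `cratedb_toolkit.testing.testcontainers.cratedb`, exposing `start()`
    and `stop()` methods like the Testcontainers API does.
    """

    @classmethod
    def make_container(cls):
        raise NotImplementedError("Subclasses define which container to run")

    @classmethod
    def setUpClass(cls):
        # Spin the container up once per test class.
        cls.container = cls.make_container()
        cls.container.start()

    @classmethod
    def tearDownClass(cls):
        cls.container.stop()
```

A downstream test case would then be declared as `class MyTest(ContainerLayerMixin, unittest.TestCase)`, override `make_container`, and connect to `cls.container` inside its test methods.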

First Candidate

As a first candidate for applying this adapter, we identified the crash terminal program. This ticket over there outlines how/where to use the unittest-based adapter instead of the previous one.

Apply PyMongo-like amalgamation to AstraPy, to emulate DataStax Astra DB

Introduction

In the spirit of the PyMongo driver amalgamation, it looks like AstraPy, the Python client SDK for DataStax Astra and Stargate, based on the DataStax python-driver, has a very similar interface.

Features

According to the data sheet of DataStax Astra DB, some or all of the following features would need to be unlocked to achieve reasonable feature parity.

Supported APIs

  • REST
  • Document (JSON)
  • GraphQL
  • gRPC API with performance equivalent to the drivers
  • CQL API

Supported Languages

  • Java
  • Node.js
  • C#
  • Python
  • Go

Supported Data formats

  • Tabular (Column-family)
  • Document (JSON)
  • Key-Value

Resources

[infra] Improve `quote_table_name` to accept full-qualified table identifiers

@seut: Do you think this routine needs to be improved? You mentioned something about "quoting going south". Maybe the root cause is here, because the routine may only handle a few situations correctly?

Yes, the issue here is that it will not quote anything if the identifier contains a . like foo.bar. This looks a bit wrong, as it should normally quote an identifier which contains dots; that would be a valid table name for PostgreSQL, but not for CrateDB, which forbids using a . inside a table identifier (see https://cratedb.com/docs/crate/reference/en/latest/general/ddl/create-table.html#naming-restrictions).
But then I wonder why you're not using sqlalchemy.sql.expression.quoted_name, which works similarly, afaik.
Having it try to detect a fully-qualified identifier, split it, and quote all parts separately would be a bit custom, but is needed if the schema and table identifiers are not quoted individually.

Originally posted by @seut in #88 (comment)
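A sketch of the suggested behavior, splitting a fully-qualified identifier and quoting each part separately; the escaping rule (doubling embedded quotes) follows standard SQL, and this is not the current implementation:

```python
def quote_table_name(ident: str) -> str:
    """
    Quote a possibly fully-qualified table identifier like "foo.bar".

    Each dot-separated part is quoted on its own, since CrateDB forbids
    dots inside a single table identifier. Embedded double quotes are
    escaped by doubling them, as in standard SQL.
    """
    def quote(part: str) -> str:
        return '"' + part.replace('"', '""') + '"'

    return ".".join(quote(part) for part in ident.split("."))


print(quote_table_name("foo.bar"))  # "foo"."bar"
```

A real routine may additionally need to skip parts which are already quoted, which this sketch does not handle.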

CFR: Problem with `sys.jobs_log` table on `sys-export` operation

Problem

On a CrateDB database instance that had been up for two days or so, I received this error when running ctk cfr --debug sys-export.

polars.exceptions.ComputeError: could not append value: "line 1:25: mismatched input '-' expecting {<EOF>, ';'}" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

Details

14:05:17        [cratedb_toolkit.util.cli            ] ERROR   : could not append value: "line 1:25: mismatched input '-' expecting {<EOF>, ';'}" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity
Traceback (most recent call last):
  File "/path/to/cratedb-toolkit/cratedb_toolkit/cfr/cli.py", line 50, in sys_export
    path = stc.save()
           ^^^^^^^^^^
  File "/path/to/cratedb-toolkit/cratedb_toolkit/cfr/systable.py", line 149, in save
    df = self.read_table(tablename=tablename)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/cratedb-toolkit/cratedb_toolkit/cfr/systable.py", line 107, in read_table
    return pl.read_database(
           ^^^^^^^^^^^^^^^^^
  File "/path/to/polars/io/database/functions.py", line 267, in read_database
    ).to_polars(
      ^^^^^^^^^^
  File "/path/to/polars/io/database/_executor.py", line 462, in to_polars
    frame = frame_init(
            ^^^^^^^^^^^
  File "/path/to/polars/io/database/_executor.py", line 274, in _from_rows
    return frames if iter_batches else next(frames)  # type: ignore[arg-type]
                                       ^^^^^^^^^^^^
  File "/path/to/polars/io/database/_executor.py", line 261, in <genexpr>
    DataFrame(
  File "/path/to/polars/dataframe/frame.py", line 376, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/path/to/polars/_utils/construction/dataframe.py", line 433, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/polars/_utils/construction/dataframe.py", line 644, in _sequence_of_tuple_to_pydf
    return _sequence_of_sequence_to_pydf(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/polars/_utils/construction/dataframe.py", line 561, in _sequence_of_sequence_to_pydf
    pydf = PyDataFrame.from_rows(
           ^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: "line 1:25: mismatched input '-' expecting {<EOF>, ';'}" of type: str to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`
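The error indicates that mixed-type values within one column defeat polars' schema inference. Besides increasing `infer_schema_length`, as the message itself suggests, one possible workaround (a sketch, not necessarily the fix applied in cratedb-toolkit) is to normalize all values to strings before handing rows to the DataFrame builder:

```python
def stringify_rows(rows):
    """
    Normalize heterogeneous row values to strings, so that a DataFrame
    builder with strict schema inference does not fail on mixed types.
    None is preserved to keep null semantics intact.
    """
    return [
        tuple(value if value is None else str(value) for value in row)
        for row in rows
    ]


rows = [("stmt", 200), ("line 1:25: mismatched input '-'", None)]
print(stringify_rows(rows))
# [('stmt', '200'), ("line 1:25: mismatched input '-'", None)]
```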

[wtf] Improve query library

About

The CrateDB SQL query collection in library.py, added with GH-88, needs further improvements. It has been assembled from a wave of quick-and-dirty operations, collected from different sources, without much review.

Details

While working on the code base, @seut discovered a few specific shortcomings in this area. Thank you. Maybe @WalBeh, @hlcianfagna, @hammerhead, or others have something to contribute to answering those questions.

Settings

Why only this small subset of settings? It's also not really dedicated to a concrete topic, as both rebalance and recovery settings are queried.

class Settings:
    """
    Reflect cluster settings.
    """
    info = """
        SELECT
            name,
            master_node,
            settings['cluster']['routing']['allocation']['cluster_concurrent_rebalance']
                AS cluster_concurrent_rebalance,
            settings['indices']['recovery']['max_bytes_per_sec'] AS max_bytes_per_sec
        FROM sys.cluster
        LIMIT 1;
    """

Shards

This looks unreasonably complicated just to translate the primary boolean into a string.

allocation = InfoElement(
    name="shard_allocation",
    sql="""
        SELECT
            IF(s.primary = TRUE, 'primary', 'replica') AS shard_type,
            COALESCE(shards, 0) AS shards
        FROM
            UNNEST([true, false]) s(primary)
        LEFT JOIN (
            SELECT primary, COUNT(*) AS shards
            FROM sys.allocations
            WHERE current_state != 'STARTED'
            GROUP BY 1
        ) a ON s.primary = a.primary;
    """,
    label="Shard Allocation",
    description="Support identifying issues with shard allocation.",
)

Why select the 2nd decision? This looks problematic, e.g. when only one shard exists, there is no 2nd decision.

table_allocation_special = InfoElement(
    name="table_allocation_special",
    label="Table Allocations Special",
    sql="""
        SELECT decisions[2]['node_name'] AS node_name, COUNT(*) AS table_count
        FROM sys.allocations
        GROUP BY decisions[2]['node_name'];
    """,
    description="Table allocation. Special.",
)

Isn't the query above more detailed? I think this one can be skipped...

translog_uncommitted_size = InfoElement(
    name="translog_uncommitted_size",
    label="Total uncommitted translog size",
    description="A large number of uncommitted translog operations can indicate issues with shard replication.",
    sql="""
        SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size
        FROM sys.shards;
    """,
    transform=get_single_value("translog_uncommitted_size"),
    unit="bytes",
)

Thoughts

In general, I am happy to remove any item which should be skipped, and to improve all others which have shortcomings into a DWIM shape, based on your suggestions. Thanks already, and thanks in advance!

Prevent multiple strategies operating on the same table

About

The idea behind the composite primary key PRIMARY KEY ("strategy", "table_schema", "table_name") was to prevent duplicate strategies on the same table. Too bad we don't have UNIQUE constraints in CrateDB.

Regression?

Is there a check elsewhere in the code to prevent duplicates (i.e., for the same table, one entry with DELETE and 3 days retention, and another with DELETE and 5 days retention)?

Originally posted by @hammerhead in #20 (comment)
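Lacking UNIQUE constraints, one option is an application-side guard before inserting a new policy. A sketch, assuming a hypothetical `execute` helper and a retention table named `retention_policy`; note that without transactional guarantees this check is racy:

```python
def check_no_existing_policy(execute, table_schema: str, table_name: str):
    """
    Emulate a UNIQUE constraint in application code: refuse to add a
    retention policy when any policy already exists for the same table.

    `execute` is a hypothetical helper which runs parameterized SQL and
    returns rows; the table name "retention_policy" is an assumption.
    """
    sql = (
        "SELECT COUNT(*) FROM retention_policy "
        "WHERE table_schema = ? AND table_name = ?"
    )
    (count,) = execute(sql, (table_schema, table_name))[0]
    if count:
        raise ValueError(
            f"A retention policy already exists for {table_schema}.{table_name}"
        )
```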

Apply database schema already when connecting

In 1 and 2, we have been using SQLAlchemy's ability to specify the database schema on the connection string, using ?schema=foobar. This way, table names do not need to be addressed in fully-qualified notation "by hand"; instead, they can be addressed by basename only, since the schema is selected at connection time.

Let's also do it in the same spirit here.

Footnotes

  1. https://github.com/crate-workbench/mlflow-cratedb

  2. https://github.com/crate-workbench/langchain
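Selecting the schema at connection time boils down to appending a `?schema=` query parameter to the SQLAlchemy connection string. A small helper sketch (the helper name is made up for illustration):

```python
from urllib.parse import urlencode


def with_schema(url: str, schema: str) -> str:
    """
    Append a `?schema=` query parameter to an SQLAlchemy connection URL,
    so that tables can later be addressed by basename only.
    """
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({"schema": schema})


print(with_schema("crate://localhost:4200/", "foobar"))
# crate://localhost:4200/?schema=foobar
```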

[I/O] InfluxDB adapter: `Failed to establish a new connection: [Errno 111] Connection refused`, when using Docker runtime

Procedure

docker run -d --name crate -p 4200:4200 crate/crate:latest

docker run -d --name influxdb -p 8086:8086 \
    --env=DOCKER_INFLUXDB_INIT_MODE=setup \
    --env=DOCKER_INFLUXDB_INIT_USERNAME=user1 \
    --env=DOCKER_INFLUXDB_INIT_PASSWORD=secret1234 \
    --env=DOCKER_INFLUXDB_INIT_ORG=example \
    --env=DOCKER_INFLUXDB_INIT_BUCKET=testdrive \
    --env=DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=token \
    influxdb:latest

alias ctk="docker run --rm -it ghcr.io/crate-workbench/cratedb-toolkit:latest ctk"
ctk load table influxdb2://example:token@localhost:8086/testdrive/demo --cratedb-sqlalchemy-url "crate://crate@localhost:4200/testdrive/demo"

Problem

urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9ebfd32450>: Failed to establish a new connection: [Errno 111] Connection refused

References

[LIB] Improve UX for ad hoc applications

About

For certain ad hoc applications, like presenting functionality in Jupyter Notebooks, accessing data from CrateDB in Python, or otherwise exploring it, querying should not be more difficult than how EasyDB, TinyDB, dataset, and Datasette demonstrate it, with or without SQLite.

EasyDB

from easydb import EasyDB

db = EasyDB("filename.db")
for record in db.query("SELECT * FROM mytable"):
  print(record)

TinyDB

from tinydb import TinyDB, Query

db = TinyDB("/path/to/db.json")
db.insert({'int': 1, 'char': 'a'})
db.insert({'int': 1, 'char': 'b'})

User = Query()
db.search((User.name == 'John') & (User.age <= 30))

dataset

import dataset

db = dataset.connect('sqlite:///:memory:')

table = db['sometable']
table.insert(dict(name='John Doe', age=37))
table.insert(dict(name='Jane Doe', age=34, gender='female'))

john = table.find_one(name='John Doe')

Datasette

datasette serve path/to/database.db
open http://localhost:8001/
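For CrateDB, the desired UX could look roughly like the `dataset` example. A sketch using the standard library's sqlite3 for illustration; a CrateDB implementation would swap in the CrateDB driver, and the class name is made up:

```python
import sqlite3


class EasyQuery:
    """
    Minimal DWIM-style query helper: connect, query, get dictionaries.
    """

    def __init__(self, dsn: str):
        self.connection = sqlite3.connect(dsn)
        # Return rows which support key-based access.
        self.connection.row_factory = sqlite3.Row

    def query(self, sql: str, parameters=()):
        return [dict(row) for row in self.connection.execute(sql, parameters)]


db = EasyQuery(":memory:")
db.query("CREATE TABLE mytable (name TEXT, age INT)")
db.query("INSERT INTO mytable VALUES ('John Doe', 37)")
print(db.query("SELECT * FROM mytable"))  # [{'name': 'John Doe', 'age': 37}]
```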

References

Share and use datasets via Python code

About

Easily consume datasets from tutorials and/or production applications, like others are doing, using Python code.

References

Standards

[I/O] Use cr8 for loading tables from PostgreSQL

About

@hlcianfagna elaborated on typical cr8 usage patterns which have not made it into the ctk load table interface yet. Thanks!

Details

Regarding copying the content from one table to a new one with different settings/partitioning options/etc., you can use the cr8 insert-from-sql utility. It accepts a --fetch-size parameter, which defaults to 100 records, and a --concurrency parameter, which defaults to 25.

This tool reads via the PostgreSQL protocol, connecting on port 5432, so the username and password for the source need to be encoded in the connection string. Writing happens through the HTTP endpoint of CrateDB on port 4200, and can go to a separate cluster.

If your passwords contain special characters, you need to encode them properly.
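Percent-encoding credentials can be done with Python's standard library, for example:

```python
from urllib.parse import quote

# Percent-encode a password containing special characters before
# embedding it in a connection string.
password = quote("rea:d/pw@d", safe="")
print(password)  # rea%3Ad%2Fpw%40d

url = f"postgresql://readuser:{password}@localhost:5432/doc"
```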

CLI Example

cr8 insert-from-sql \
  --src-uri "postgresql://readuser:readpwd@localhost:5432/doc" --query "SELECT * FROM sourcetable;" \
  --hosts writeuser:writepassword@localhost:4200 --table doc.targettable

[I/O] InfluxDB adapter: CRATEDB_SQLALCHEMY_URL not working when using the Docker-aliased command

Procedure

alias ctk="docker run --rm -it ghcr.io/crate-workbench/cratedb-toolkit:latest ctk"
export CRATEDB_SQLALCHEMY_URL=crate://crate@localhost:4200/testdrive/demo
ctk load table influxdb2://example:token@localhost:8086/testdrive/demo

Problem

KeyError: 'Either CrateDB Cloud Cluster identifier or CrateDB SQLAlchemy or HTTP URL needs to be supplied. Use --cluster-id / --cratedb-sqlalchemy-url / --cratedb-http-url CLI options or CRATEDB_CLOUD_CLUSTER_ID / CRATEDB_SQLALCHEMY_URL / CRATEDB_HTTP_URL environment variables.'

References

ValueError: max() arg is an empty sequence

@hammerhead reported this problem, which happens right away when invoking cratedb-toolkit without any command-line options.

~/ cratedb-toolkit          
Traceback (most recent call last):
  File "/usr/local/bin/cratedb-toolkit", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1054, in main
    with self.make_context(prog_name, args, **extra) as ctx:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 920, in make_context
    self.parse_args(ctx, args)
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1610, in parse_args
    echo(ctx.get_help(), color=ctx.color)
         ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 699, in get_help
    return self.command.get_help(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1298, in get_help
    self.format_help(ctx, formatter)
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1331, in format_help
    self.format_options(ctx, formatter)
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1533, in format_options
    self.format_commands(ctx, formatter)
  File "/usr/local/lib/python3.11/site-packages/click_aliases/__init__.py", line 65, in format_commands
    max_len = max(len(cmd) for cmd in sub_commands)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence
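The crash originates in click_aliases computing max() over an empty command list. A candidate fix, sketched here rather than taken from any upstream patch, is to give max() a default:

```python
# click_aliases/__init__.py computes:
#     max_len = max(len(cmd) for cmd in sub_commands)
# which raises ValueError when `sub_commands` is empty. Supplying a
# default for the empty case avoids the crash:
sub_commands = []
max_len = max((len(cmd) for cmd in sub_commands), default=0)
print(max_len)  # 0
```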

Testing: Improve "Testcontainers for Python" implementation

Introduction

We are aiming to provide canonical "Testcontainers" implementations for Java and Python, per testcontainers-java and testcontainers-python.

About

At the spots enumerated below, we added the first version of a corresponding Python implementation, originally conceived at daq-tools/lorrystream#47.

Backlog

  • Add documentation
  • GH-53
  • GH-58
  • Currently, the adapter and test layer are being exercised using an SQLAlchemy connection and a corresponding test case. It makes sense to also exercise and demonstrate a pure DBAPI-based variant of the same thing.
  • It would be nice to have a modern test layer which forms a cluster, for both Java and Python. I think cr8 has this already?
  • Cherry-pick CrateDB invocation options from cr8: '-Cdiscovery.initial_state_timeout=0', '-Cnetwork.host=127.0.0.1', '-Cudc.enabled=false', '-Ccluster.name=cr8-tests'
  • Revisit downstream issues crate/cratedb-examples#72 and crate/cratedb-examples#282.
  • Upstream to testcontainers-python.

[io] Croud related IO subsystem tests may fail when a croud configuration file exists for the local user

Some croud-related IO subsystem tests may fail when croud configuration settings exist for the local user. These settings will be taken into account and passed to the croud API, causing the mocked croud calls to no longer match.

Failing tests:

  • cratedb_toolkit.io.croud.test_import_url
  • cratedb_toolkit.io.croud.test_import_file

They are failing in my setup when the stored croud.yml configuration file contains a different endpoint, for example:

current-profile: dev
profiles:
  dev:
    endpoint: https://console.cratedb-dev.cloud
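One way to make these tests hermetic is to point HOME at an empty directory while they run, so user-level configuration is not picked up. A sketch; the exact path croud consults is an assumption:

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def isolated_home():
    """
    Temporarily point HOME at an empty directory, so user-level
    configuration files (e.g. ~/.config/croud/croud.yaml) are not
    picked up by the code under test.
    """
    saved = os.environ.get("HOME")
    with tempfile.TemporaryDirectory() as tmpdir:
        os.environ["HOME"] = tmpdir
        try:
            yield tmpdir
        finally:
            # Restore the previous HOME, or unset it if it was absent.
            if saved is None:
                os.environ.pop("HOME", None)
            else:
                os.environ["HOME"] = saved
```

A pytest fixture wrapping this context manager would achieve the same isolation for the tests listed above.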

AnyIO

About


AnyIO is an asynchronous networking and concurrency library that works on top of either asyncio or trio. It implements trio-like structured concurrency (SC) on top of asyncio and works in harmony with the native SC of trio itself.

References
