
mara / mara-db


Lightweight configuration and access to multiple databases in a single project

License: MIT License

Python 97.92% JavaScript 1.37% Makefile 0.71%
backend flask mara sqlalchemy

mara-db's People

Contributors

gathineou, gozrehm, ice1e0, jankatins, jrolland82, leo-schick, martin-loetzsch, tafkas


mara-db's Issues

add function to return the jdbc url

Add a function that returns the JDBC URL for a specific DB. This is helpful when dealing with Java-based systems like Spark or Mondrian. This could be done either via a dedicated module (mara_db.jdbc.jdbc_url(db_alias)) or via a property on the DB class (DB.jdbc_url).
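
A minimal sketch of what such a helper could look like for PostgreSQL (the function name and the JDBC URL layout are assumptions, not existing mara-db API):

# Hypothetical sketch: derive a JDBC URL from a PostgreSQLDB config.
# The URL layout would need to be adapted for each database backend.
import mara_db.dbs

def postgres_jdbc_url(db: mara_db.dbs.PostgreSQLDB) -> str:
    """Returns a JDBC connection URL for a PostgreSQL database config"""
    port = db.port or 5432  # assumption: fall back to the default PostgreSQL port
    return f'jdbc:postgresql://{db.host}:{port}/{db.database}'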

Remove UI requirements

Currently the packages graphviz and mara-page are listed as requirements. But there might be use cases where you want to use the mara-db package without a UI, e.g. when running mara via the console or inside a notebook without a Flask app.

I think we should remove these packages from the requirements of this package.
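
A minimal sketch of how the UI-only imports could then be guarded (the flag name is an assumption, not existing mara-db code):

# Hypothetical sketch: treat graphviz / mara-page as optional dependencies so
# mara_db can be used from a console or notebook without a Flask app.
try:
    import graphviz
    import mara_page
    HAS_UI = True
except ImportError:
    HAS_UI = False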

mara_db.dbs.PostgreSQLDB does not work with port given as int

Running the following piece of code results in the exception below.

import mara_db.auto_migration
import mara_db.config
import mara_db.dbs

mara_db.config.databases = lambda: {
    "mara": mara_db.dbs.PostgreSQLDB(
        host="localhost", port=32768, user="postgres", database="example_etl_mara"
    )
}

print(mara_db.auto_migration.engine("mara"))

Exception:

<ipython-input-1-016eb66e9ef8> in <module>()
      9 }
     10
---> 11 print(mara_db.auto_migration.engine("mara"))

~/.python_envs/test/lib/python3.7/functools.py in wrapper(*args, **kw)
    818
    819     def wrapper(*args, **kw):
--> 820         return dispatch(args[0].__class__)(*args, **kw)
    821
    822     registry[object] = func

~/.python_envs/test/lib/python3.7/site-packages/mara_db/auto_migration.py in __(alias, **_)
    141 @engine.register(str)
    142 def __(alias: str, **_):
--> 143     return engine(mara_db.dbs.db(alias))
    144
    145

~/.python_envs/test/lib/python3.7/functools.py in wrapper(*args, **kw)
    818
    819     def wrapper(*args, **kw):
--> 820         return dispatch(args[0].__class__)(*args, **kw)
    821
    822     registry[object] = func

~/.python_envs/test/lib/python3.7/site-packages/mara_db/auto_migration.py in __(db)
    152 def __(db: mara_db.dbs.PostgreSQLDB):
    153     return sqlalchemy.create_engine(
--> 154         f'postgresql+psycopg2://{db.user}{":"+db.password if db.password else ""}@{db.host}{":"+db.port if db.port else "" }/{db.database}')
    155
    156

TypeError: can only concatenate str (not "int") to str

Passing the port as a string works.

I would have expected the port to be cast to str, or a more informative error to be raised when the PostgreSQLDB object is created.
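
A minimal sketch of the kind of fix this suggests, adapted from the f-string shown in the traceback (the function name is illustrative):

# Sketch of a possible fix: cast db.port to str so that integer ports no
# longer raise a TypeError when the engine URL is built.
import sqlalchemy

def postgres_engine(db):
    return sqlalchemy.create_engine(
        f'postgresql+psycopg2://{db.user}{":" + db.password if db.password else ""}'
        f'@{db.host}{":" + str(db.port) if db.port else ""}/{db.database}')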

Schema view for Azure Synapse

Currently the schema view in the UI does not work for Azure Synapse databases, because the function OBJECT_SCHEMA_NAME is not supported by Azure Synapse. This is used here. The function should be replaced with another solution so that the schema view also works for Azure Synapse, e.g. a left join to sys.schemas.

Sort out sqlalchemy.engine vs dbs.engine

Currently, the import is done like:

from sqlalchemy import engine
[...]
@lru_cache(maxsize=None)
def engine(alias) -> engine.Engine:
    [...]

When I do from dbs import engine, PyCharm thinks I get the sqlalchemy engine module when in fact I get the engine(alias) function. (This is probably a bug in PyCharm...)

It would probably be easiest to change the import to from sqlalchemy import engine as sql_engine, or even import sqlalchemy.engine, and use the full name in the return type annotation.
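
A minimal sketch of the second variant (the function body is elided):

# Sketch of the suggested change: import the module under its full name so the
# engine(alias) function no longer shadows sqlalchemy.engine.
from functools import lru_cache
import sqlalchemy.engine

@lru_cache(maxsize=None)
def engine(alias) -> sqlalchemy.engine.Engine:
    ...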

breaking constraint BigQuery implementation of copy_from_stdin_command

The current BigQuery implementation uses the function copy_from_stdin_command to get the bq load ... command for a local file without piping when the gcloud_gcs_bucket_name parameter of BigQueryDB is not set. See here.

This is not aligned with the copy_from_stdin_command function name, since it does not use piping. In my business case I don't need support for this, but in case we want to keep it, I think it should be moved to another function (e.g. copy_from_file_command); a sketch follows below.

This "workaround" to load local files without using a GCS bucket will cause issues when calling other commands, like mara_pipelines.commands.sql.Copy.

Migration not working on postgres database with non default port

Works fine with the default port, but not when another port is specified.
localsetup.py

'mara': mara_db.dbs.PostgreSQLDB(user='user', password='pw', port=10000, host='localhost', database='database')

make

...
migrate-mara-db: Could not access or create database "postgresql+psycopg2://user:pw@localhost/database":
migrate-mara-db: (psycopg2.ProgrammingError) permission denied to create database
...

Add support for different file formats [refactoring]

Currently the shell methods copy_to_stdout_command and copy_from_stdin_command only support the TEXT and CSV file formats.

Some databases support additional formats which I would like to use here.
Example: BigQuery supports AVRO, Parquet, ORC, CSV, JSON and newline-delimited JSON (aka JSONL).

I would like to redesign this so that the methods copy_to_stdout_command and copy_from_stdin_command support these formats as well.

Concept

  1. Create a base class just for file formats (e.g. Format, analogous to DB) and add sub classes like CsvFormat, ParquetFormat, ...
  2. Add a new parameter pipe_format to the functions copy_to_stdout_command and copy_from_stdin_command that carries the format definition.
  3. The currently used format-specific parameters (csv_format, skip_header, delimiter_char, quote_char, null_value_string) will be moved into the sub classes of Format and marked as deprecated.

Sample

new file format.py:

"""Different formats for piping"""

class Format:
    """Base format definition"""
    def __repr__(self) -> str:
        return (f'<{self.__class__.__name__}: '
                + ', '.join([f'{var}={getattr(self, var)}'
                             for var in vars(self) if getattr(self, var)])
                + '>')

class CsvFormat(Format):
    """
    CSV file format. See https://tools.ietf.org/html/rfc4180
    """
    def __init__(self, delimiter_char: str = None, quote_char: str = None, header: bool = None):
        """
        CSV file format. See https://tools.ietf.org/html/rfc4180

        Args:
            delimiter_char: The character that separates columns
            quote_char: The character for quoting strings
            header: Whether a csv header with the column name(s) is part of the CSV file.
        """
        self.delimiter_char = delimiter_char
        self.quote_char = quote_char
        self.header = header

class ParquetFormat(Format):
    """Apache Parquet"""
    def __init__(self):
        pass

[...]

Sample implementation for Postgres and BigQuery (in shell.py):

@copy_from_stdin_command.register(dbs.BigQueryDB)
def __(db: dbs.BigQueryDB, target_table: str, csv_format: bool = None, skip_header: bool = None,
       delimiter_char: str = None, quote_char: str = None, null_value_string: str = None, timezone: str = None,
       pipe_format: Format = None):
    assert db.gcloud_gcs_bucket_name, f"Please provide the 'gcloud_gcs_bucket_name' parameter to database '{db}' "

    import uuid
    import datetime

    if csv_format or isinstance(pipe_format, CsvFormat):
        bq_format = 'CSV'
        if isinstance(pipe_format, CsvFormat):
            if not delimiter_char:
                delimiter_char = pipe_format.delimiter_char
            if not quote_char:
                quote_char = pipe_format.quote_char
            if not skip_header:
                skip_header = pipe_format.header
    elif not pipe_format or isinstance(pipe_format, JsonlFormat):
        bq_format = 'NEWLINE_DELIMITED_JSON'
    elif isinstance(pipe_format, AvroFormat):
        bq_format = 'AVRO'
    elif isinstance(pipe_format, ParquetFormat):
        bq_format = 'PARQUET'
    elif isinstance(pipe_format, OrcFormat):
        bq_format = 'ORC'
    else:
        raise ValueError(f'Unsupported pipe_format for BigQueryDB: {pipe_format}')

    tmp_file_name = f'tmp-{datetime.datetime.now().isoformat()}-{uuid.uuid4().hex}.' + (
        'csv' if csv_format else 'json')

    service_account_email = bigquery_credentials(db).service_account_email

    set_env_prefix = f'CLOUDSDK_CORE_ACCOUNT={service_account_email}'
    bq_load_command = (set_env_prefix
                       + ' bq load'
                       + ' --headless'
                       + ' --quiet'
                       + (f' --location={db.location}' if db.location else '')
                       + (f' --project_id={db.project}' if db.project else '')
                       + (f' --dataset_id={db.dataset}' if db.dataset else '')
                       + (f' --skip_leading_rows=1' if skip_header else '')
                       )

    bq_load_command += f' --source_format={bq_format}'

    if delimiter_char is not None:
        bq_load_command += f" --field_delimiter='{delimiter_char}'"
    if null_value_string is not None:
        bq_load_command += f" --null_marker='{null_value_string}'"
    if quote_char is not None:
        bq_load_command += f" --quote='{quote_char}'"

    bq_load_command += f" '{target_table}'  gs://{db.gcloud_gcs_bucket_name}/{tmp_file_name}"

    gcs_write_command = f'{set_env_prefix} gsutil -q cp - gs://{db.gcloud_gcs_bucket_name}/{tmp_file_name}'
    gcs_delete_temp_file_command = f'{set_env_prefix} gsutil -q rm gs://{db.gcloud_gcs_bucket_name}/{tmp_file_name}'

    return gcs_write_command + '\\\n  \\\n  && ' \
           + bq_load_command + '\\\n  \\\n  && ' \
           + gcs_delete_temp_file_command

Sample implementation for Postgres (in shell.py):

@copy_from_stdin_command.register(dbs.PostgreSQLDB)
def __(db: dbs.PostgreSQLDB, target_table: str, csv_format: bool = None, skip_header: bool = None,
       delimiter_char: str = None, quote_char: str = None, null_value_string: str = None, timezone: str = None,
       pipe_format: Format = None):
    if pipe_format and type(pipe_format) not in [CsvFormat, JsonlFormat]:
        raise ValueError(f'Unsupported pipe_format for PostgreSQLDB: {pipe_format}')

    columns = ''
    if isinstance(pipe_format, JsonlFormat):
        columns = ' (' + ', '.join(['data']) + ')'

    sql = f'COPY {target_table}{columns} FROM STDIN WITH'
    if csv_format or isinstance(pipe_format, CsvFormat):
        sql += ' CSV'
        if isinstance(pipe_format, CsvFormat):
            if delimiter_char is None:
                delimiter_char = pipe_format.delimiter_char
            if quote_char is None:
                quote_char = pipe_format.quote_char
            if skip_header is None and pipe_format.header:
                skip_header = True
    if skip_header:
        sql += ' HEADER'
    if delimiter_char is not None:
        sql += f" DELIMITER AS '{delimiter_char}'"
    if null_value_string is not None:
        sql += f" NULL AS '{null_value_string}'"
    if quote_char is not None:
        sql += f" QUOTE AS '{quote_char}'"

    # escape double quotes
    sql = sql.replace('"', '\\"')

    sed_stdin = ''
    if isinstance(pipe_format, JsonlFormat):
        # escapes JSON escapings since PostgreSQL interprets C-escapes in TEXT mode
        sed_stdin += "sed 's/\\\\/\\\\x5C/g' \\\n| "

    return f'{sed_stdin}{query_command(db, timezone)} \\\n      --command="{sql}"'

Creating DB from sqlalchemy URL

It would be nice to have a static method on the class mara_db.dbs.DB that creates a mara db config from its SQLAlchemy URL.

Sample use case

from mara_db.dbs import DB

db = DB.from_sqlalchemy_url("postgresql://scott:tiger@localhost:5432/mydatabase")
# db is now a PostgreSQLDB instance with all parameters set from the SQLAlchemy URL
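
A minimal sketch of how such a factory could be built on top of sqlalchemy.engine.url.make_url (only PostgreSQL is mapped here; the helper name is illustrative, the classmethod does not exist yet):

# Hypothetical sketch: build a mara DB config from an SQLAlchemy URL.
# Other backends would follow the same pattern as the postgresql branch.
from sqlalchemy.engine.url import make_url
import mara_db.dbs

def db_from_sqlalchemy_url(url: str) -> mara_db.dbs.DB:
    u = make_url(url)
    if u.get_backend_name() == 'postgresql':
        return mara_db.dbs.PostgreSQLDB(host=u.host, port=u.port, user=u.username,
                                        password=u.password, database=u.database)
    raise ValueError(f'Unsupported backend: {u.get_backend_name()}')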

Copy jsonb from postgres to postgres doesn't handle escaped quotes

For a jsonb payload like
{"text": "Some \"great\" text"}
the Copy will fail. The copy_to_stdout_command correctly writes the payload as an escaped string;
the copy_from_stdin_command, however, does not recognize the escaping. When trying to work around this by setting the target column to TEXT, the double quotes get unescaped.

tried:

  • running COPY FROM in CSV mode using ESCAPE '', but this does not easily work due to interference with double quotes in JSON tokens.

possible solutions:

  • maybe running postgres to postgres via COPY TO STDOUT BINARY | COPY FROM STDIN BINARY?

Useful hints or workarounds are welcome. I can probably move these imports to postgres_fdw for the time being.

mara_db

Could not find suitable distribution for Requirement.parse('mara-page>=1.3.0')

making package requirement `psycopg2-binary` optional

I think the package required for PostgreSQL, psycopg2-binary>=2.7.3, should not be a default requirement of this package. I am currently experimenting with BigQuery as a database backend for mara, and there I do not need a PostgreSQL db connection.

To do this I suggest the following changes:

  1. In version 4.8.0 we add an extra postgres with the packages required for postgres. People can then start adding it as an extra requirement in their projects (see the sketch below).
  2. In the next major version 5 we drop the packages as a default requirement.
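
A minimal sketch of what step 1 could look like in setup.py (assuming setuptools; the core requirement list is elided):

# Hypothetical sketch for step 1: move the PostgreSQL driver into an extra so
# projects opt in with `pip install mara-db[postgres]`.
from setuptools import setup

setup(
    name='mara-db',
    install_requires=[
        # ... existing core requirements, minus psycopg2-binary ...
    ],
    extras_require={
        'postgres': ['psycopg2-binary>=2.7.3'],
    },
)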

sqsh vs sqlcmd for SQL Server support

Current sqsh implementation

Currently SQL Server is supported via sqsh, which has some disadvantages:

  • does not support named instances
  • requires the GO command to be written in lowercase, although it is normally case insensitive (GO, Go, gO and go should all work)
  • requires an additional ;;\n\go to be appended after the sql statement, see here

Possible alternative sqlcmd

As an alternative, the official Microsoft tool sqlcmd (officially available for Linux) could be used instead.

Sample shell call with piping: echo "select 1;" | sqlcmd -S "server" -U "user" -P "password" -i /dev/stdin (ref.)

BigQuery use environment variable to detect service account json file

Currently it is required to set the service_account_json_file_name property for BigQueryDB. Google by default uses the environment variable GOOGLE_APPLICATION_CREDENTIALS to define a path to the service account private key.

I think the service_account_json_file_name property should be made optional and, if not set, the value should be taken from the environment variable GOOGLE_APPLICATION_CREDENTIALS, as sketched below.

See as well: https://cloud.google.com/docs/authentication/getting-started
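
A minimal sketch of the proposed fallback (the helper function is hypothetical; the attribute name is taken from the current BigQueryDB config):

# Hypothetical sketch: fall back to GOOGLE_APPLICATION_CREDENTIALS when
# service_account_json_file_name is not set on the BigQueryDB config.
import os

def resolve_service_account_file(db) -> str:
    file_name = db.service_account_json_file_name or os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if not file_name:
        raise ValueError('Set service_account_json_file_name or GOOGLE_APPLICATION_CREDENTIALS')
    return file_name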

Remove timezone parameter for shell commands

All of the functions in the shell.py file have a timezone parameter. I would like to challenge this design pattern.

As a database designer, you usually decide in which timezone datetime data is stored; UTC is common in environments that span multiple timezones. The conversion from UTC to the client/user timezone then takes place either a) in the receiving client or b) in SQL. In recent years, DBMSs have started to add timezone handling via SQL.

The timezone parameter is only implemented for PostgreSQL; it looks like the other shell clients do not implement timezone handling. Therefore, I suggest to

  • remove the timezone parameter from the functions in shell.py
  • remove the default_timezone config
  • add a timezone config parameter to the PostgreSQLDB config class

With this design, timezone handling is expected to happen in SQL (or after reading the data). If someone needs to call psql with different timezones, this will still be possible by duplicating the db config; I consider that an uncommon use case. A sketch of the proposed config parameter follows below.
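
A hypothetical sketch of the third point; the timezone parameter is the suggestion of this issue and does not exist in mara_db.dbs today, and the other parameters are simplified:

# Hypothetical sketch: a timezone setting on the PostgreSQLDB config itself,
# replacing the per-command timezone parameter and the default_timezone config.
from mara_db.dbs import DB

class PostgreSQLDB(DB):
    def __init__(self, host: str = None, port: int = None, database: str = None,
                 user: str = None, password: str = None, timezone: str = None):
        self.host = host
        self.port = port
        self.database = database
        self.user = user
        self.password = password
        self.timezone = timezone  # applied by the psql-based shell commands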
