sqlalchemy-redshift / sqlalchemy-redshift
Amazon Redshift SQLAlchemy Dialect
Home Page: https://sqlalchemy-redshift.readthedocs.org/en/latest/
License: MIT License
class Example(Model):
    id = Column(types.Integer(), primary_key=True, info={'encode': 'lzo'})
    name = Column(types.Unicode(255), info={'encode': 'lzo'})
I wasn't sure if this was an alembic bug or setting, but given that things like identity columns and column encodings are especially important for Redshift tables, I would have expected them to be included in the generated migrations.
Sorry, I should've tested this earlier as soon as you finished working on #64 , but I kept using 0.1.2 and didn't check if there were any more issues.
The issue now is with tables that have foreign key constraints. I've been able to reproduce the problem with the following 2 tables:
create table ref.foo(id integer identity(1, 1) not null unique);
create table ref.bar(
    foo_id integer not null,
    foreign key(foo_id) references ref.foo(id))
And, when running .reflect(only=['bar'], schema='ref'), I get:
AttributeError: 'NoneType' object has no attribute 'group'
This is the traceback:
File "/home/dario/Projects/l2/Hydra/database/hydra_database/__init__.py", line 81, in cached_metadatas
redshiftmeta.reflect(only=tablenames, schema=schema)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 3647, in reflect
Table(name, self, **reflect_opts)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 416, in __new__
metadata._remove_table(name, schema)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 411, in __new__
table._init(name, metadata, *args, **kw)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 484, in _init
self._autoload(metadata, autoload_with, include_columns)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 496, in _autoload
self, include_columns, exclude_columns
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1477, in run_callable
return callable_(self, *args, **kwargs)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 364, in reflecttable
return insp.reflecttable(table, include_columns, exclude_columns)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 578, in reflecttable
exclude_columns, reflection_options)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 666, in _reflect_fk
table_name, schema, **table.dialect_kwargs)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 447, in get_foreign_keys
**kw)
File "<string>", line 2, in get_foreign_keys
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
ret = fn(self, con, *args, **kw)
File "/home/dario/.local/share/virtualenvs/hydra_worker/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 396, in get_foreign_keys
referred_column = m.group('referred_column')
AttributeError: 'NoneType' object has no attribute 'group'
The last release was in April, and there have been a whole mass of patches made since then.
As discussed in #101, reflection on Redshift can be time-consuming because the per-query overhead is high and SQLAlchemy issues new queries for every table you need to reflect.
I'd love to find a way to "freeze" the schema cache and have it persist while you reflect multiple tables. We already do bulk queries to get info about all tables on reflection, but that info gets thrown away after each table is reflected.
One possibility is to introduce a RedshiftDialect._global_reflection_cache member that would normally be None, and modify the reflection.cache decorator to check for it. We'd need to be very intentional about how this gets used.
It might be nice to wrap it up in a context manager. You could write something like:
with RedshiftDialect.caching_schema():
    # reflect a bunch of tables here, utilizing the persistent cache

# Reflection is back to normal now that we've left the caching_schema context
I've never used alembic. I wonder if there's some way we could hook this into alembic so that migrations could be faster.
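A minimal sketch of what such a context manager could look like. RedshiftDialectSketch and its members are hypothetical stand-ins for illustration, not the real dialect; the real implementation would also need the reflection.cache decorator changes described above:

```python
from contextlib import contextmanager


class RedshiftDialectSketch:
    """Hypothetical stand-in for RedshiftDialect, for illustration only."""

    # Normally None; reflection queries would consult it when set.
    _global_reflection_cache = None

    @classmethod
    @contextmanager
    def caching_schema(cls):
        # Install a persistent cache for the duration of the block.
        cls._global_reflection_cache = {}
        try:
            yield cls._global_reflection_cache
        finally:
            # Leaving the block restores normal, uncached reflection.
            cls._global_reflection_cache = None
```

Reflecting many tables inside one `with RedshiftDialectSketch.caching_schema():` block would then reuse the bulk query results instead of discarding them after each table.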
I'm trying to execute a basic query and pass a tuple parameter into it, like:
engine = create_engine(app.config['REDSHIFT_DATABASE_URI'], echo=True)
conn = engine.connect()
s = text("select distinct t1.hh_id, t2.mri_id from msa t1 left outer join (select hh_id, mri_id from msa where mri_id in :code_param) t2 on t1.hh_id = t2.hh_id")
mriset = conn.execute(s, code_param = tuple(codes)).fetchall()
Here is an error:
2015-11-11 22:49:37,373 INFO sqlalchemy.engine.base.Engine select distinct t1.hh_id, t2.mri_id from msa t1 left outer join (select hh_id, mri_id from msa where mri_id in %(code_param)s) t2 on t1.hh_id = t2.hh_id
2015-11-11 22:49:37,374 INFO sqlalchemy.engine.base.Engine {'code_param': (-207426274, -205158201, -174737541)}
2015-11-11 22:49:37,681 INFO sqlalchemy.engine.base.Engine ROLLBACK
It works fine with a Postgres DB, though.
Do you have any clue why?
We use BZIP2 compression on our S3 files; it doesn't seem possible to specify this with the CopyCommand.
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-bzip2
>>> import sqlalchemy as sa
>>> from redshift_sqlalchemy.dialect import CopyCommand
>>> table = sa.table("foo")
>>> CopyCommand(table, "s3://bucket/data/", "XXXYYYZZZXXXYYYZZZ11", "XXXYYYZZZXXXYYYZZZ11XXXYYYZZZXXXYYYZZZ11", format="BZIP2")
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/virtualenv/local/lib/python2.7/site-packages/sqlalchemy_redshift/commands.py", line 362, in __init__
self.formats)
ValueError: "format" parameter must be one of ['CSV', 'JSON', 'AVRO', None]
>>> CopyCommand(table, "s3://bucket/data/", "XXXYYYZZZXXXYYYZZZ11", "XXXYYYZZZXXXYYYZZZ11XXXYYYZZZXXXYYYZZZ11", compression="BZIP2")
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/virtualenv/local/lib/python2.7/site-packages/sqlalchemy_redshift/commands.py", line 368, in __init__
self.compression_types
ValueError: "compression" parameter must be one of ['GZIP', 'LZOP']
Redshift's delete statement varies slightly from Postgresql's. See here for documentation.
Basic delete statements have the same syntax.
For instance, the following is valid SQL in both dialects:
DELETE FROM customer_table WHERE customer_table.id > 1000
However, while the following is a valid statement in Postgresql:
DELETE FROM customer_table
WHERE customer_table.id = order_table.customer_id
AND order_table.id < 100
It needs to be written for Redshift as:
DELETE FROM customer_table
USING order_table
WHERE customer_table.id = order_table.customer_id
AND order_table.id < 100
SQLAlchemy should be able to build the resultant query above with the following Python snippet:
from sqlalchemy import delete, Table, Column, Integer, MetaData
from redshift_sqlalchemy import RedshiftDialect

meta = MetaData()
customer = Table('customer_table', meta, Column('id', Integer, primary_key=True))
order = Table('order_table', meta, Column('id', Integer, primary_key=True), Column('customer_id', Integer))

del_stmt = delete(customer).where(customer.c.id == order.c.customer_id).where(order.c.id < 100)
print(del_stmt.compile(dialect=RedshiftDialect()))
The dialect needs to specify that sequences are not supported. Otherwise sqlalchemy tries to reference a nonexistent sequence and you get an error such as the following:
(sqlalchemy.exc.ProgrammingError) (psycopg2.ProgrammingError) relation "tablename_model_id_seq" does not exist [SQL: 'select nextval(\'"tablename_model_id_seq"\')'] [SQL: u'INSERT INTO tablename (model_id, model_name) VALUES (%(model_id)s, %(model_name)s,)']
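As a rough illustration of the kind of one-line change being suggested: SQLAlchemy dialects advertise sequence support through a supports_sequences class attribute. RedshiftDialectSketch below is a hypothetical sketch, not the real dialect class:

```python
from sqlalchemy.dialects.postgresql.base import PGDialect


class RedshiftDialectSketch(PGDialect):
    """Hypothetical sketch, not the real dialect class."""

    # With sequences marked unsupported, SQLAlchemy should no longer try
    # to run select nextval('tablename_model_id_seq') before an INSERT.
    supports_sequences = False
```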
After Amazon issued a new SSL root certificate for Redshift last week, I started getting connection failures with "sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL error: certificate verify failed".
I followed the instructions at http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-transitioning-to-acm-certs.html and copied the new https://s3.amazonaws.com/redshift-downloads/redshift-ca-bundle.crt to ~/.postgresql/root.crt , which is the default location according to postgresql docs.
However, for the client to actually check the certificate, I had to provide the file path:
redshift+psycopg2://username:[email protected]:5439/mydb?sslmode=verify-ca&sslrootcert=/home/mydir/.postgresql/root.crt
I guess this is because the module provides its own root certificate file, which is now out of date:
def create_connect_args(self, *args, **kwargs):
"""
Build DB-API compatible connection arguments.
Overrides interface
:meth:`~sqlalchemy.engine.interfaces.Dialect.create_connect_args`.
"""
default_args = {
'sslmode': 'verify-full',
'sslrootcert': pkg_resources.resource_filename(
__name__,
'redshift-ca-bundle.crt'
),
}
cargs, cparams = super(RedshiftDialect, self).create_connect_args(
*args, **kwargs
)
default_args.update(cparams)
return cargs, default_args
Hey,
I just wanted to give it a spin, but I get a NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:psycopg2. I'm running Python 3.6.1 and installed sqlalchemy 1.1.9 and sqlalchemy-redshift 0.6.0.
From there I just followed the docs and used my custom connection string to create the engine. However, I'm not sure if this is actually a bug in this project.
Hello! The impetus for the investigation that led to issue #101 is a class of errors that we've been seeing that seem to be due to concurrent access to pg_catalog tables. We have a test which makes sure that our alembic migrations are up to date. This test necessarily does a lot of reflection. We've been seeing this test fail intermittently with errors like these:
1
E sqlalchemy.exc.InternalError: (psycopg2.InternalError) could not find tuple for constraint 839833
E [SQL: '\n SELECT\n n.nspname as "schema",\n c.relname as "table_name",\n t.contype,\n t.conname,\n t.conkey,\n a.attnum,\n a.attname,\n pg_catalog.pg_get_constraintdef(t.oid, true) as condef,\n n.oid as "schema_oid",\n c.oid as "rel_oid"\n FROM pg_catalog.pg_class c\n LEFT JOIN pg_catalog.pg_namespace n\n ON n.oid = c.relnamespace\n JOIN pg_catalog.pg_constraint t\n ON t.conrelid = c.oid\n JOIN pg_catalog.pg_attribute a\n ON t.conrelid = a.attrelid AND a.attnum = ANY(t.conkey)\n WHERE n.nspname !~ \'^pg_\'\n ORDER BY n.nspname, c.relname\n ']
2
E sqlalchemy.exc.InternalError: (psycopg2.InternalError) cache lookup failed for relation 870981
E [SQL: '\n SELECT\n n.nspname as "schema",\n c.relname as "table_name",\n d.column as "name",\n encoding as "encode",\n type, distkey, sortkey, "notnull", adsrc, attnum,\n pg_catalog.format_type(att.atttypid, att.atttypmod),\n pg_catalog.pg_get_expr(ad.adbin, ad.adrelid) AS DEFAULT,\n n.oid as "schema_oid",\n c.oid as "table_oid"\n FROM pg_catalog.pg_class c\n LEFT JOIN pg_catalog.pg_namespace n\n ON n.oid = c.relnamespace\n JOIN pg_catalog.pg_table_def d\n ON (d.schemaname, d.tablename) = (n.nspname, c.relname)\n JOIN pg_catalog.pg_attribute att\n ON (att.attrelid, att.attname) = (c.oid, d.column)\n LEFT JOIN pg_catalog.pg_attrdef ad\n ON (att.attrelid, att.attnum) = (ad.adrelid, ad.adnum)\n WHERE n.nspname !~ \'^pg_\'\n ORDER BY n.nspname, c.relname, att.attnum\n ']
3
E sqlalchemy.exc.InternalError: (psycopg2.InternalError) cache lookup failed for relation 731237
E [SQL: "\n SELECT c.oid\n FROM pg_catalog.pg_class c\n LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace\n WHERE (pg_catalog.pg_table_is_visible(c.oid))\n AND c.relname = %(table_name)s AND c.relkind in ('r', 'v', 'm', 'f')\n "] [parameters: {'table_name': 'ref_cm_icd9'}]
We think that these errors happen when one of the other schemas in our database is being dropped while one of these queries is being run. Since these queries grab information about the entire database, they are not scoped to schemas at all. We have also found this page in the redshift docs: http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html which seems to be saying that this might be expected behavior.
Error number 1 is the one that we've seen the most. We think this is because of issue #101 - because the constraint query was being run many times, there was more of a chance that it would be run at the wrong time. However, we just recently saw error number 2, which is the query that this dialect uses for getting column info. Error number 3 is actually a query from the regular postgres dialect, so this issue isn't specific to this dialect. We think that these sorts of errors happen specifically when functions are involved - some data is fetched, a function is run on some fetched id, and if the entity associated with that id disappears between those two steps, an error occurs.
I'm creating this issue more to ask if you guys have encountered something like this before and whether you've found any resolution. However, it may be useful for our purposes to restrict the _get_all_* queries so that they only query information about the requested schema. Since checking the schema name would not involve any functions, we think it would resolve our issue. I don't know whether that would be a good change to make upstream as well.
I am trying to connect to Redshift from my Python code. My pip installed:
psycopg2==2.6.1
redshift-sqlalchemy==0.4.1
SQLAlchemy==1.0.9
and my virtual machine(ubuntu) has:
libpq-dev
python-psycopg2
But I am still getting
engine = create_engine('redshift+psycopg2://{}:{}@{}'.format(username, password, url))
File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/__init__.py", line 386, in create_engine
return strategy.create(*args, **kwargs)
File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 51, in create
entrypoint = u._get_entrypoint()
File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/engine/url.py", line 131, in _get_entrypoint
cls = registry.load(name)
File "/opt/project/env/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 205, in load
(self.group, name))
NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:redshift.psycopg2
I tried SQLAlchemy version 0.8.0, but it doesn't work either.
With the same config, I am able to run from my laptop (Mac), but on Linux I guess some packages are still missing? Any suggestion will be appreciated, thanks!
With Spectrum, Amazon has added support for querying external data to Redshift. There are a few new features that make this work. It would be nice to support these for migrations/ddl and querying:
DDL:
Querying:
For most PRs, we are requiring tests, documentation, and CHANGES.rst notes. Sometimes we ask for these upfront, but sometimes we forget until later in the review, and it delays merging good changes.
We should provide a CONTRIBUTING.md both as a checklist for those doing review to make sure they aren't forgetting things, and for contributors so that they don't feel like requirements are popping up out of nowhere.
cc @graingert
(psycopg2.OperationalError) server certificate for "ec2-[IP address].compute-1.amazonaws.com" does not match host name "[Hostname].redshift.amazonaws.com"
The same connection URI was fine until last Friday. I've checked that both hostnames correspond to the same IP address. If I set 'sslmode': 'prefer' in dialect.py, the connection is successful.
According to https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html,
SSL support in Amazon Redshift is strictly for encrypting the connection between your client and your cluster; it should not be relied on for authenticating the server. To authenticate the server, install the public key (.pem file) for the SSL certificate on your client and use the key to connect to your clusters.
Perhaps sslmode: require should be the default?
Had a discussion with @graingert that he's no longer using Redshift, so likely won't be able to be active in owning and maintaining this project.
I'm still using Redshift and can be involved, but probably can't give the project the attention it deserves without help. I propose we advertise for a new maintainer.
A proposed path forward:
If we get someone interested in being maintainer, then we'd proceed with making sure they also have rights on pypi, and we can update metadata in setup.py as appropriate.
How does the above sound, @graingert ?
I have a working copy command that works in SQL Workbench/J. Mirroring it, I created a CopyCommand object. What do I do next to run the copy command against Redshift?
I see that there's a sqlalchemy_redshift.commands.visit_copy_command(element, compiler, **kw) method. It looks like I'll have element = my_copy_command_object; is there more documentation on the compiler argument?
It'd be super if there's an end-to-end example somewhere, thanks a lot.
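For what it's worth, you normally never call visit_copy_command yourself: CopyCommand is a SQLAlchemy Executable, so you hand the object to conn.execute() and SQLAlchemy's compilation machinery supplies the compiler argument. The toy construct below (ToyCommand and its visitor are hypothetical, modeled on the documented @compiles pattern that CopyCommand also uses) shows the mechanics end to end against a throwaway SQLite engine:

```python
import sqlalchemy as sa
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.expression import ClauseElement, Executable


class ToyCommand(Executable, ClauseElement):
    """Hypothetical custom statement, wired up like CopyCommand."""
    inherit_cache = False

    def __init__(self, table_name):
        self.table_name = table_name


@compiles(ToyCommand)
def visit_toy_command(element, compiler, **kw):
    # 'compiler' is the dialect's SQLCompiler; SQLAlchemy passes it in
    # automatically when the statement is executed or compiled.
    return "ANALYZE %s" % element.table_name


engine = sa.create_engine("sqlite://")  # stand-in engine for the demo
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE my_table (id INTEGER)"))
    conn.execute(ToyCommand("my_table"))  # compiled via visit_toy_command
```

So for the real thing, the same shape should apply: build the CopyCommand object, then run `conn.execute(my_copy_command_object)` on a Redshift connection.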
In RedshiftDialect, there are three methods, _get_all_relation_info, _get_all_column_info, and _get_all_constraint_info, which query information about the whole cluster. These methods are called by several other methods which pass their info_cache argument down to one of the above methods. These methods, in turn, are called by other methods which, in most cases, pass down all of their keyword arguments. The one exception is _get_redshift_constraints: this method is never called with the keyword arguments of its caller, and therefore never gets the info_cache object that was passed in by SQLAlchemy. The result is that constraint information is never cached. When running something like alembic, which does a lot of reflection, this can slow things down considerably because the dialect winds up querying information about the entire database every time alembic asks for anything.
It'd be swell to have some documentation that, for now, could consist of answers to questions like "How is this different from the postgresql dialect?" rtfd.org + sphinx would be my tool of choice here, but others may have thoughts on this.
Hi there,
It looks like SQLAlchemy has changed a few method signatures in version >=1.2. The following statement works on SQLAlchemy <1.2 but not on SQLAlchemy >= 1.2:
table = sa.Table(table_name, meta, autoload=True, postgresql_ignore_search_path=True)
I get the following traceback with SQLAlchemy version 1.2.5:
File "/home/colin/work/spectrify/spectrify/utils/schema.py", line 54, in get_table_schema
table = sa.Table(table_name, meta, **table_kwargs)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 456, in __new__
metadata._remove_table(name, schema)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 451, in __new__
table._init(name, metadata, *args, **kw)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 533, in _init
include_columns, _extend_on=_extend_on)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/sql/schema.py", line 559, in _autoload
_extend_on=_extend_on
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2056, in run_callable
return conn.run_callable(callable_, *args, **kwargs)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1545, in run_callable
return callable_(self, *args, **kwargs)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 389, in reflecttable
table, include_columns, exclude_columns, **opts)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 618, in reflecttable
table_name, schema, **table.dialect_kwargs):
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 369, in get_columns
**kw)
File "<string>", line 2, in get_columns
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
ret = fn(self, con, *args, **kw)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 404, in get_columns
enums=[], schema=col.schema, encode=col.encode)
File "/home/colin/.virtualenvs/spectrify/local/lib/python2.7/site-packages/sqlalchemy_redshift/dialect.py", line 604, in _get_column_info
**kw
TypeError: _get_column_info() takes exactly 9 arguments (8 given)
I tracked the issue down to the following commit: zzzeek/sqlalchemy@fadb8d6#diff-d33159d80d3deef1d5bdcd057dcc3d6bR2443
We're running tests against Redshift now and have AWS access keys available to Travis, but the cluster is currently running full-time.
@graingert - you mentioned that you had a good idea of how to hook a script in to spin up the cluster for tests and then spin it back down. Are there any blockers at this point to make that happen?
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_LIBRARY.html
Would be nice for this to be supported in sqlalchemy-redshift.
Can't currently justify the cost of a redshift database just to run travis tests.
I want to copy some files from S3 in a different region: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-source-s3.html#copy-region
Redshift is transitioning to ACM certificates. http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-transitioning-to-acm-certs.html
Does the cert in this project need to be updated to support this transition? https://github.com/sqlalchemy-redshift/sqlalchemy-redshift/blob/master/sqlalchemy_redshift/redshift-ssl-ca-cert.pem
Right now, when reflection fails, it's usually with a KeyError, but the other sqlalchemy dialects raise sa.exc.NoSuch<object>Error. I have code that depends on this exception being raised, which is how I found this issue.
I've got a PR in the works.
There are some details we'll need to deal with when this project moves to the https://github.com/sqlalchemy-redshift org.
sqlalchemy-redshift (it's currently inverted)
I wanted to make sure to document thoughts as I have them. Other things we'll need to keep in mind?
Pretty long traceback due to us wrapping alembic within a click-based command:
Traceback (most recent call last):
File "/home/gsliwinski/.virtualenvs/x/bin/dw", line 11, in <module>
load_entry_point('dw', 'console_scripts', 'dw')()
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/gsliwinski/dw/src/dw/scripts/decorators.py", line 11, in call_with_app_context
return cmd(*args, **kwargs)
File "/home/gsliwinski/dw/src/dw/scripts/alembic_migrations.py", line 137, in autogenerate
alembic_ctx = command.revision(alembic_config(click_ctx.obj['DB']), autogenerate=True, message=message)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/command.py", line 176, in revision
script_directory.run_env()
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/script/base.py", line 425, in run_env
util.load_python_file(self.dir, 'env.py')
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/util/pyfiles.py", line 93, in load_python_file
module = load_module_py(module_id, path)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/util/compat.py", line 75, in load_module_py
mod = imp.load_source(module_id, path, fp)
File "alembic/env.py", line 88, in <module>
run_migrations_online()
File "alembic/env.py", line 82, in run_migrations_online
context.run_migrations()
File "<string>", line 8, in run_migrations
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/runtime/environment.py", line 836, in run_migrations
self.get_context().run_migrations(**kw)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/runtime/migration.py", line 321, in run_migrations
for step in self._migrations_fn(heads, self):
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/command.py", line 156, in retrieve_migrations
revision_context.run_autogenerate(rev, context)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/api.py", line 415, in run_autogenerate
self._run_environment(rev, migration_context, True)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/api.py", line 451, in _run_environment
autogen_context, migration_script)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/compare.py", line 22, in _populate_migration_script
_produce_net_changes(autogen_context, upgrade_ops)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/alembic/autogenerate/compare.py", line 38, in _produce_net_changes
schemas = set(inspector.get_schema_names())
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 158, in get_schema_names
info_cache=self.info_cache)
File "<string>", line 2, in get_schema_names
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/reflection.py", line 54, in cache
ret = fn(self, con, *args, **kw)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/dialects/postgresql/base.py", line 2349, in get_schema_names
).columns(nspname=sqltypes.Unicode))
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
return meth(self, multiparams, params)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
compiled_sql, distilled_params
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
context)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1405, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
context)
File "/home/gsliwinski/.virtualenvs/x/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
cursor.execute(statement, parameters)
TypeError: 'dict' object does not support indexing
I'm trying to reproduce an example from the RS documentation using:
python 3.4
sqlalchemy-redshift==0.3.1
Here is an example from RedShift doc (http://goo.gl/V2PuZh):
begin;
[waits]
copy listing from ;
end;
But it throws an exception:
ERROR - [RedShift] Programming Error: (psycopg2.ProgrammingError) syntax error at or near "["
LINE 1: begin; [waits] copy my_schema.my_table (userid, flag,...
^
[SQL: "begin; [waits] copy my_schema.my_table (userid, flag, campid, condition, ts) from 's3://mybucket/mydata' credentials 'aws_access_key_id=<my_aws_secret_key>;aws_secret_access_key=<my_aws_secret_access_key>' delimiter ','; end;"]
It works fine without the [waits] operator. Is there any workaround so I can use [waits]?
This breaks because of what was added in sqlalchemy-redshift/sqlalchemy_redshift/dialect.py, lines 406 to 409 in 92852b7: when it's called from get_columns(), comment is missing from there.
I have encountered table names that look like schema.table.name, and reflection is breaking because SQL_IDENTIFER_RE.findall(key) returns a list of length 3 in dialect._get_schema_and_relation.
Hi all,
In our database, we have a relation named quote sitting inside a schema called platform.
Because the word quote is a Redshift keyword, the string representation of a dialect.RelationKey corresponding to this relation is platform."quote". This is all good.
The class RelationKey has an unquote method, which returns a string representation of it with any "" quotes removed. Unfortunately, this method correctly removes the quotes only when the schema is None. In our case, the code is used with a non-None schema (equal to platform). What we see is that unquote erroneously does not remove the quotes, as it triggers only when str(RelationKey) both starts and ends with quotes.
I think a more comprehensive behaviour for the unquote method would be to apply unquoting to both the schema name AND the relation name, so that unquoted(platform."quote") -> platform.quote.
This bug prevents sqlalchemy from reflecting our platform schema, ending with:
File "/home/vlasisva/.local/lib/python3.5/site-packages/sqlalchemy_redshift/dialect.py", line 630, in _get_redshift_relation
raise sa.exc.NoSuchTableError(key)
sqlalchemy.exc.NoSuchTableError: platform."quote"
Happy to elaborate, if above is not clear.
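The proposed behaviour can be sketched as a small helper. unquote_key is a hypothetical standalone function, not the real RelationKey.unquote method:

```python
def unquote_key(key):
    """Strip surrounding double quotes from each dot-separated part of a
    schema-qualified relation name, instead of only from the whole string,
    e.g. 'platform."quote"' -> 'platform.quote'."""
    def strip_quotes(part):
        if len(part) >= 2 and part.startswith('"') and part.endswith('"'):
            return part[1:-1]
        return part

    return ".".join(strip_quotes(part) for part in key.split("."))
```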
It looks like all the action is happening in this fork rather than the original @binarydud repo. I'm working on a PR that fills out the reflection capabilities. Should I propose that here?
sqlalchemy-redshift does not appear to support authenticating with an AWS token/secret key. I did find something in commands.py that looks like it might, but there are no docs describing how to supply the credentials if it does exist.
begin;
DECLARE curname cursor FOR
SELECT * FROM schema.table;
fetch forward 10 from curname;
fetch next from curname;
close curname;
commit;
or
command_text = """
SELECT * FROM schema.table;
"""
conn = psycopg2.connect(host="HOST", dbname="dbname", password="****", user="user", port="5439")
cursor2 = conn.cursor('curname')
cursor2.itersize = 100
cursor2.execute(command_text)
for item in cursor2:
print(item)
This method executes, but it still allocates the whole result set on the server side, and it runs for about 10 minutes. In some cases it still has memory-overflow problems, but this time on the server side.
package connection;

import java.sql.*;
import java.util.Properties;

public class Redshift {
    // Redshift driver: "jdbc:redshift://host:5439/dev";
    static final String dbURL = "jdbc:redshift://host:5439/dbname";
    static final String MasterUsername = "user";
    static final String MasterUserPassword = "****";

    public static void main(String[] args) {
        Connection conn = null;
        Statement stmt = null;
        try {
            Class.forName("com.amazon.redshift.jdbc41.Driver");
            // Open a connection and define properties.
            System.out.println("Connecting to database...");
            Properties props = new Properties();
            // Uncomment the following line if using a keystore.
            // props.setProperty("ssl", "true");
            props.setProperty("user", MasterUsername);
            props.setProperty("password", MasterUserPassword);
            conn = DriverManager.getConnection(dbURL, props);
            // Try a simple query.
            System.out.println("Listing system tables...");
            stmt = conn.createStatement();
            String sql = "SELECT * FROM schema.table;";
            ResultSet rs = stmt.executeQuery(sql);
            // Get the data from the result set.
            while (rs.next()) {
                // Retrieve two columns.
                String catalog = rs.getString("list_entry_id");
                String name = rs.getString("source_code");
                // Display values.
                System.out.print("Catalog: " + catalog);
                System.out.println(", Name: " + name);
            }
            rs.close();
            stmt.close();
            conn.close();
        } catch (Exception ex) {
            // For convenience, handle all errors here.
            ex.printStackTrace();
        } finally {
            // Finally block to close resources.
            try {
                if (stmt != null)
                    stmt.close();
            } catch (Exception ex) {
                // nothing we can do
            }
            try {
                if (conn != null)
                    conn.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        System.out.println("Finished connectivity test.");
    }
}
As I mentioned a couple of weeks ago, I hit another regression; here's a minimal schema that reproduces it:
create table foo.domain(
id integer identity(1,1) not null unique);
create table foo.bar(
domain_id integer not null,
foreign key(domain_id) references foo.domain(id))
(I almost got it working last time, but then life intervened and I haven't touched this code since :( )
This is quite weird: redshiftmeta.reflect(only=['domain'], schema='foo') works just fine, but redshiftmeta.reflect(only=['bar'], schema='foo') fails with
*** NoSuchTableError: foo."domain"
Also, the same schema with the table renamed to foo.domain2 works just fine... so I'd guess there's some special-cased/hardcoded handling for relations named domain, though grepping briefly through the code I didn't find anything suspicious.
I got an error trying to create a table that had a column called "tag". The list of reserved words is here: http://docs.aws.amazon.com/redshift/latest/dg/r_pg_keywords.html
I fixed it in a very dirty way:
import sqlalchemy.dialects.postgresql.base as ps
REDSHIFT_RESERVED_WORDS = set("""aes128,aes256,all,allowoverwrite,analyse,
analyze,and,any,array,as,asc,authorization,backup,between,binary,blanksasnull,
both,bytedict,bzip2,case,cast,check,collate,column,constraint,create,
credentials,cross,current_date,current_time,current_timestamp,current_user,
current_user_id,default,deferrable,deflate,defrag,delta,delta32k,desc,disable,
distinct,do,else,emptyasnull,enable,encode,encrypt,encryption,end,except,
explicit,false,for,foreign,freeze,from,full,globaldict256,globaldict64k,grant,
group,gzip,having,identity,ignore,ilike,in,initially,inner,intersect,into,is,
isnull,join,leading,left,like,limit,localtime,localtimestamp,lun,luns,lzo,lzop,
minus,mostly13,mostly32,mostly8,natural,new,not,notnull,null,nulls,off,offline,
offset,oid,old,on,only,open,or,order,outer,overlaps,parallel,partition,percent,
permissions,placing,primary,raw,readratio,recover,references,respect,rejectlog,
resort,restore,right,select,session_user,similar,some,sysdate,system,table,tag,
tdes,text255,text32k,then,timestamp,to,top,trailing,true,truncatecolumns,union,
unique,user,using,verbose,wallet,when,where,with,without""".replace("\n", "").split(","))
ps.PGIdentifierPreparer.reserved_words = ps.PGIdentifierPreparer.reserved_words.union(REDSHIFT_RESERVED_WORDS)
Would be good if the redshift dialect would support this out of the box.
Hi, I'm using the latest version of the repo (commit 7691341) with SQLAlchemy 1.0.12. I'm noticing some strange behavior when I define a model with an identity column. Here's an example:
import os
import sqlalchemy as sa
from sqlalchemy.ext import declarative
Base = declarative.declarative_base()
class IdentityExample(Base):
    __tablename__ = 'identity_example'
    id = sa.Column(sa.Integer, primary_key=True, info={'identity': (0, 1)})
    name = sa.Column(sa.String)
engine = sa.create_engine(os.environ['REDSHIFT_URL'])
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
RedshiftSession = sa.orm.sessionmaker()
session = RedshiftSession(bind=engine)
example_1 = IdentityExample(name='Example #1')
session.add(example_1)
print("EXAMPLE 1 BEFORE: ", example_1.id, example_1.name)
session.commit()
print("EXAMPLE 1 AFTER: ", example_1.id, example_1.name)
example_2 = IdentityExample(name='Example #2')
session.add(example_2)
print("EXAMPLE 2 BEFORE: ", example_2.id, example_2.name)
session.commit()
print("EXAMPLE 2 AFTER: ", example_2.id, example_2.name)
I'd expect to see output like this:
EXAMPLE 1 BEFORE: None Example #1
EXAMPLE 1 AFTER: 0 Example #1
EXAMPLE 2 BEFORE: None Example #2
EXAMPLE 2 AFTER: 1 Example #2
Instead I see this:
EXAMPLE 1 BEFORE: None Example #1
EXAMPLE 1 AFTER: 0 Example #1
EXAMPLE 2 BEFORE: None Example #2
EXAMPLE 2 AFTER: 0 Example #1
Am I making some mistake, or is this a bug? Thanks!
When I try to connect using redshift+psycopg2, I end up with this error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) sslmode value "verify-full" invalid when SSL support is not compiled in
What do I need to do to actually connect to my redshift instance? I'm able to connect fine using postgresql+psycopg2
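As a stopgap sketch (not an official fix): SQLAlchemy's create_engine accepts connect_args, which are passed straight through to psycopg2.connect(), so the dialect's strict default sslmode of 'verify-full' can be overridden where the local libpq has no SSL support. Only do this on networks where an unencrypted connection is acceptable.

```python
def connect_kwargs(require_ssl=True):
    """Keyword overrides for psycopg2.connect() (sketch).

    'prefer' uses SSL when the client supports it and falls back to
    plaintext otherwise; 'verify-full' is the dialect's strict default.
    """
    return {'sslmode': 'verify-full' if require_ssl else 'prefer'}

# Usage (hypothetical URL):
# engine = sa.create_engine('redshift+psycopg2://user:pass@host:5439/db',
#                           connect_args=connect_kwargs(require_ssl=False))
```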
Also do not depend on psycopg2
# SQLAlchemy==1.0.10, sqlalchemy-redshift==0.4.0, psycopg2==2.6
import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
engine = sa.create_engine('redshift+psycopg2://example')
user = sa.Table(
'user',
sa.MetaData(),
# http://docs.sqlalchemy.org/en/latest/core/metadata.html#sqlalchemy.schema.Column.params.autoincrement
sa.Column('id', sa.Integer, primary_key=True, autoincrement=True),
)
col = user.columns.get('id')
print(col.expression.autoincrement)
# >>> True
print(CreateTable(user).compile(engine))
# CREATE TABLE user (
# id INTEGER NOT NULL,
# PRIMARY KEY (id)
# )
The ddl_compiler.get_column_default_string() method returns None. I'm unable to figure out whether this is just a gap in the documentation with an easy solution, or a deeper problem with the DDL compiler for Redshift.
Thanks in advance!
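One workaround sketch, based on the info={'identity': (seed, step)} convention that appears in the identity_example model earlier in this dump: declare the identity explicitly on the column rather than relying on autoincrement=True (whether the dialect then renders IDENTITY(1,1) in the DDL is an assumption, not verified here).

```python
# Sketch: declare the identity via Column.info, the convention the
# Redshift dialect reads, instead of autoincrement=True.
import sqlalchemy as sa

user = sa.Table(
    'user',
    sa.MetaData(),
    sa.Column('id', sa.Integer, primary_key=True,
              info={'identity': (1, 1)}),
)
```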
Support alternative drivers, e.g. pg8000 or psycopg2cffi; also, do not depend on psycopg2.
As discovered in #130 (comment) it looks like there's a change in Redshift behavior that's causing tests to fail in master.
We may simply need to update tests to match the new behavior.
Parameters for where clauses are all properly escaped, but adding a .limit() to the SQLAlchemy expression causes an error, because its parameter is passed in unescaped (as %param1).
@graingert - What would you think about hosting the bigcrunch repo on GitHub under the sqlalchemy-redshift org? It would make it easier to browse the code, and it would let us make PRs.
Another thought on bigcrunch: I'd be for upping the timeout. It looks like it's supposed to shut down one hour after the last test session is created, and I've been burned by that a few times; it means waiting for another whole round of shutdown and spin-up before I can run more tests. Maybe let it keep running for four hours?
The UNLOAD command now has a very useful MAXFILESIZE option to control the maximum size of the produced files. The UnloadFromSelect construct in the dialect should support it.
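To illustrate the SQL such support would need to emit (max_file_size_mb is a hypothetical parameter, not part of the current UnloadFromSelect API, and the quoting here is naive, for illustration only):

```python
def unload_sql(query, s3_path, iam_role, max_file_size_mb=None):
    """Build an UNLOAD statement with an optional MAXFILESIZE clause."""
    sql = ("UNLOAD ('{q}') TO '{p}' IAM_ROLE '{r}'"
           .format(q=query.replace("'", "''"), p=s3_path, r=iam_role))
    if max_file_size_mb is not None:
        # Redshift accepts MAXFILESIZE [AS] size [MB | GB]
        sql += " MAXFILESIZE {0} MB".format(max_file_size_mb)
    return sql
```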
Getting an error when trying to create a table with a name longer than 63 characters, even though the Redshift docs state that 127 characters is the limit:
Probably only @graingert has full admin access to this repo. We should set the URL for this project to our readthedocs URL, so that it displays at the top of the GitHub page, right next to the project description.
We use role-based credentials (http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-access-role-based) instead of an access key and secret key. However, it appears that UnloadFromSelect and CopyCommand do not support them.
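For reference, role-based COPY/UNLOAD authentication passes an IAM role ARN in the credentials clause; aws_iam_role=... is the format from the AWS docs linked above. A small helper sketch:

```python
def role_credentials(account_id, role_name):
    """Build the CREDENTIALS string for role-based COPY/UNLOAD."""
    return ("aws_iam_role=arn:aws:iam::{0}:role/{1}"
            .format(account_id, role_name))
```

Supporting this in the dialect would presumably mean accepting such a string (or the ARN directly) as an alternative to access_key_id/secret_access_key.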
I'm new to SQLAlchemy, so apologies if this is obvious.
I'm not able to reflect the non-default schemas.
When I run:
meta = sa.MetaData(schema="booker")
meta.reflect(bind=engine)
I get
OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[SQL: '
SELECT
c.relkind,
n.oid as "schema_oid",
n.nspname as "schema",
c.oid as "rel_oid",
c.relname,
CASE c.reldiststyle
WHEN 0 THEN \'EVEN\' WHEN 1 THEN \'KEY\' WHEN 8 THEN \'ALL\' END
AS "diststyle",
c.relowner AS "owner_id",
u.usename AS "owner_name",
TRIM(TRAILING \';\' FROM pg_catalog.pg_get_viewdef(c.oid, true))
AS "view_definition",
pg_catalog.array_to_string(c.relacl, \'
\') AS "privileges"
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
JOIN pg_catalog.pg_user u ON u.usesysid = c.relowner
WHERE c.relkind IN (\'r\', \'v\', \'m\', \'S\', \'f\')
AND n.nspname !~ \'^pg_\'
ORDER BY c.relkind, n.oid, n.nspname;
']
P.S. This seems similar to apache/superset#217