drgfreeman / dynamo-pandas

Make working with pandas data and AWS DynamoDB easy

Home Page: https://dynamo-pandas.readthedocs.io/en/stable/

License: MIT License

Language: Python 100.00%
Topics: aws, aws-dynamodb, boto3, database, dataframe, deserialization, dynamo-pandas, dynamodb, interface, pandas, serialization

dynamo-pandas's People

Contributors

dependabot[bot], drgfreeman, sreyan-ghosh

dynamo-pandas's Issues

Make boto3 an "extra" requirement

boto3 is currently defined in the install_requires parameter of setup in setup.py. This results in the boto3 and botocore packages being added to lambda layers built using AWS SAM tools. These two packages use about 60 MB of layer storage space, a significant fraction of the 250 MB AWS lambda layer size limit, even though they do not need to be installed in the layer since they are already included in the lambda runtime environment.

Moving boto3 to the extras_require parameter of the setup function would prevent the addition of boto3 and botocore to lambda layers while allowing their installation using the 'boto' extra option.
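A minimal sketch of the corresponding setup.py change (other setup arguments omitted; the extra name "boto" follows the wording above and is not confirmed as the final name):

from setuptools import setup

setup(
    name="dynamo-pandas",
    # boto3 is no longer a hard requirement, so it is not pulled into lambda
    # layers built with AWS SAM tools.
    install_requires=["pandas"],
    # Installable on demand with: pip install "dynamo-pandas[boto]"
    extras_require={"boto": ["boto3"]},
)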

Update the Installation section of the README and the docs to reflect the change in installation options.

Returned unprocessed items have incorrect format

The unprocessed items returned by the _put_items function embedded in the transactions.put_items function are not in the same format as the items passed to the function.

def _put_items(items, table=table):
    response = client.batch_write_item(
        RequestItems={table: [{"PutRequest": {"Item": item}} for item in items]}
    )
    if response["UnprocessedItems"] != {}:
        return response["UprocessedItems"][table]
    else:
        return []

The _put_items function expects a list of item dictionaries serialized with the serde.TypeSerializer.serialize() method, whereas it returns a list of dictionaries in the format {"PutRequest": {"Item": item}}, where item is a serialized item dictionary.

A correct implementation would be:

def _put_items(items, table=table):
    response = client.batch_write_item(
        RequestItems={table: [{"PutRequest": {"Item": item}} for item in items]}
    )
    if response["UnprocessedItems"] != {}:
        return [
            item["PutRequest"]["Item"]
            for item in response["UnprocessedItems"][table]
        ]
    else:
        return []

This bug currently passes unit tests since the handling of unprocessed items is not covered by tests (ref. #43).

Release version 1.2.1

Release version 1.2.1 to make the bug fixes from #45 available on PyPI.

Also add a CHANGELOG.md file to make tracking of changes easier.

Handling of unprocessed items from the client's batch_write_item function is not tested

The handling of the unprocessed items from the client's batch_write_item function called in transactions.put_items is not covered by unit tests. This can lead to bugs like #42 remaining unnoticed.

if response["UnprocessedItems"] != {}:
return response["UprocessedItems"][table]

Investigate whether mocking with moto can be used to return unprocessed items. Otherwise, potentially use a custom mock to return unprocessed items and ensure the whole function is covered by tests.
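If moto cannot be made to return unprocessed items, a custom mock of the low-level client could exercise this branch. A rough sketch using unittest.mock (it assumes the transactions module obtains its client via boto3.client; the exact assertion would depend on how put_items surfaces or retries unprocessed items):

from unittest import mock

from dynamo_pandas import transactions


def test_put_items_handles_unprocessed_items():
    fake_client = mock.Mock()
    # The first call reports one unprocessed item (the shape mirrors a real
    # batch_write_item response); the second call reports none, so the test
    # terminates whether or not put_items retries internally.
    fake_client.batch_write_item.side_effect = [
        {
            "UnprocessedItems": {
                "players": [
                    {"PutRequest": {"Item": {"player_id": {"S": "player_one"}}}}
                ]
            }
        },
        {"UnprocessedItems": {}},
    ]
    with mock.patch("boto3.client", return_value=fake_client):
        transactions.put_items(items=[{"player_id": "player_one"}], table="players")
    # At minimum, the unprocessed-items branch runs without raising; a stronger
    # assertion would check the retried or returned items.
    assert fake_client.batch_write_item.call_count >= 1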

Timedelta string values cannot be converted with the dtype parameter

Timedelta string values stored in a table cannot be converted with the dtype parameter of the get_df and to_df functions or using the dataframe astype method. This is due to a known bug in pandas (ref.: pandas-dev/pandas#38509).

As a result, the unit tests for the dtype parameter of the get_df and to_df functions do not test this conversion. Once the pandas issue is resolved, this conversion can be added to the tests.

As a workaround, the Timedelta columns can be converted using pd.to_timedelta(df.column_name).

Move the keys function to the main module

Move the keys function from the transactions module to the main module.

When using the package with the high level interface functions, a user should not have to import functions from sub-modules. Since the keys function is meant as a helper to keep the interface simple, it makes more sense to have it as part of the main module.

Configure tox

Use tox to run the unit tests on different Python versions, both locally and in CI.

Add high level transaction functions

Add high level transaction functions that integrate conversion and transactions in a single function call (see the usage sketch after the list):

  • put_df(df, table): add or update all items from a dataframe.
  • get_df(keys, table): get specific items (or all items if keys=None) from a table into a dataframe.
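A possible usage sketch of the proposed functions (assuming they are exposed at the package's top level; names and signatures are subject to change):

import pandas as pd

from dynamo_pandas import get_df, put_df

df = pd.DataFrame([{"player_id": "player_one", "bonus_points": 4}])

# Add or update all items from the dataframe.
put_df(df, table="players")

# Get specific items back into a dataframe (or all items with keys=None).
df = get_df(table="players", keys=[{"player_id": "player_one"}])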

Add functions to convert DataFrame and Series to items dict and vice-versa

Add functions to convert pandas DataFrame and Series to items dict and vice-versa.

Examples (subject to modification; a usage sketch follows the list):

  • to_items(df) to convert a dataframe to a list of dictionaries.
  • to_item(obj) to convert a single-row dataframe or a series to a dictionary.
  • to_df(items, dtype=None) to convert a single item or multiple items to a dataframe with optional data types.
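A rough usage sketch of the proposed conversion functions (assuming they are exposed at the package's top level; names subject to modification as noted):

import pandas as pd

from dynamo_pandas import to_df, to_item, to_items

df = pd.DataFrame([
    {"player_id": "player_one", "bonus_points": 4},
    {"player_id": "player_two", "bonus_points": 1},
])

items = to_items(df)  # dataframe -> list of dictionaries, one per row
item = to_item(df.loc[[0]])  # single-row dataframe (or a series) -> dictionary
df2 = to_df(items, dtype={"bonus_points": "Int8"})  # items -> dataframe with optional dtypes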

Unprocessed keys in get_items are not handled correctly

In the transactions.get_items function, the unprocessed keys returned by the boto3.resource().batch_get_item() function are not handled correctly: the function is simply called again with all of the original keys instead of only the unprocessed ones:

while response["UnprocessedKeys"] != {}:
response = resource.batch_get_item(RequestItems=_request(keys))
items.extend(response["Responses"][table])

Also, this block of code is not covered by unit tests, which prevented this bug from being caught by the tests.
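A possible fix (a sketch, not the final implementation) is to retry with the unprocessed keys rather than the original ones. The UnprocessedKeys value returned by DynamoDB has the same structure as the RequestItems parameter, so it can be passed back directly:

while response["UnprocessedKeys"] != {}:
    # Retry only the keys DynamoDB did not process on the previous call.
    response = resource.batch_get_item(RequestItems=response["UnprocessedKeys"])
    items.extend(response["Responses"][table])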

AWS configuration parameters cannot be overwritten

While AWS configuration parameters can be set via a config file or environment variables, there may be cases where these parameters need to be overwritten.

The current put_df, get_df and transactions module functions do not provide a means to pass these parameters.

Adding a **kwargs argument to the different functions and passing it to the underlying boto3.client or boto3.resource function call would provide this functionality.

For example, the get_df function signature would become:

def get_df(*, table, keys=None, attributes=None, dtype=None, **resource_kwargs):
    ...
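With the proposed signature, overriding the endpoint to target a local DynamoDB instance could look like this (illustrative only; the keys helper is the one from the transactions module):

df = get_df(
    table="players",
    keys=keys(player_id=["player_one"]),
    endpoint_url="http://localhost:8000",  # forwarded to boto3.resource(...)
)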

filter with attribute value

Could we get a dataframe from DynamoDB with filtering on an attribute value? I know that right now we can filter on keys, but I am not sure if we can filter on an attribute value. Thank you.

Bad indentation in Overview documentation code example

The indentation of the dtype parameter and closing parenthesis in the get_df with dtype example in docs/overview.rst are incorrect:

df = get_df(
    table="players",
    keys=keys(player_id=["player_two", "player_four"]),
        dtype={
            "bonus_points": "Int8",
            "last_play": "datetime64[ns, UTC]",
            # "play_time": "timedelta64[ns]"  # See note below.
        }
    )

Should read:

df = get_df(
    table="players",
    keys=keys(player_id=["player_two", "player_four"]),
    dtype={
        "bonus_points": "Int8",
        "last_play": "datetime64[ns, UTC]",
        # "play_time": "timedelta64[ns]"  # See note below.
    }
)

Tables with GSI & LSI?

Hi, firstly this package looks like it could really make my life easier, so thanks for putting the time in!
I'm not a DynamoDB expert, so sorry if this is a stupid error on my part.
I'm receiving a client error when using get_df on DynamoDB tables that have either a GSI or an LSI:
"An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"

Following your examples, it works for all tables that don't have a GSI or LSI. Should I be using a different "keys"/query structure for those tables?

error when calling get_df()

I defined boto3_args as a dictionary:

boto3_args={}
boto3_args["endpoint_url"] = "http://localhost:8000"
boto3_args["aws_access_key_id"] = "fakeMyKeyId"
boto3_args["aws_secret_access_key"] = "fakeSecretAccessKey"

And tried to execute
df = get_df(table = "Employee", boto3_kwargs = boto3_args)

Error: TypeError: get_df() got an unexpected keyword argument 'boto3_kwargs'

But when I checked the source code, the method signature in dynamo_pandas.py is:
def get_df(*, table, keys=None, attributes=None, dtype=None, boto3_kwargs={}):

This does accept boto3_kwargs as a keyword argument.

Add parameter to select item attributes to get

Add a parameter to select which item attributes to get when calling the following functions:

  • get_df
  • transactions.get_all_items
  • transactions.get_item
  • transactions.get_items

The parameter would take a list of attribute names.

Example

>>> df = get_df(
...     table="players",
...     keys=[{"player_id": "player_three"}, {"player_id": "player_one"}],
...     attributes=["player_id", "play_time"],
... )
>>> print(df)
      player_id        play_time
0  player_three  1 days 14:01:19
1    player_one  2 days 17:41:55

Release version 1.0.0

  • Remove development notices from README.
  • Change version in __init__.py and docs/conf.py.
