innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
License: BSD 3-Clause "New" or "Revised" License
This should be achievable now that tableauhyperapi is distributed with a manylinux tag.
Right now the public-facing functions create new Hyper Processes / Connections for the user. This, however, can be rather expensive (I think 300-400 ms) and limiting for advanced use cases.
If someone out there is willing to benchmark and propose an API accordingly, contributions would certainly be welcome.
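As a sketch of what such an API could look like (all names here are hypothetical, not the actual pantab API), the idea is to pay the process startup cost once and pass the handle into many write calls:

```python
from contextlib import contextmanager

class FakeHyperProcess:
    """Stand-in for tableauhyperapi.HyperProcess so the pattern runs anywhere."""
    started = 0  # counts how many (expensive) processes were spawned

    def __init__(self):
        FakeHyperProcess.started += 1

    def close(self):
        pass

@contextmanager
def hyper_session():
    """Start one Hyper process and hand it to many operations."""
    proc = FakeHyperProcess()
    try:
        yield proc
    finally:
        proc.close()

# Five writes share one process instead of paying the startup cost five times
with hyper_session() as hyper:
    for _ in range(5):
        # hypothetical: pantab.frame_to_hyper(df, path, table=..., hyper_process=hyper)
        pass
```

pantab could accept the process (or connection) as an optional keyword, falling back to spawning its own when none is given.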
Looks like there was a change to the Hyper API between releases 8707 and 8953. At least on the former, the beta1 and prior versions of pantab will throw that error when creating Hyper extracts.
Either pinning the Hyper API to 8707 or upgrading pantab to beta2 or above will fix the issue.
Describe the bug
I read a DF from SQL Server in chunks, but when I try to append the chunks to my Hyper file it raises a TypeError for all text columns:
TypeError: Mismatched column definitions: (Name="ID (End User)", Type=TEXT, Nullability=Nullability.NULLABLE), (Name="Name (End User)", Type=TEXT, Nullability=Nullability.NULLABLE),...
PS: I cast the types of all columns before trying to append the DF.
It would be nice to document a contributing guide. I'm willing to work with anyone out there who may be interested in learning more about pantab or open source.
Would ideally like to move this into setup.py instead of keeping the MANIFEST file, as MANIFEST is pretty poorly documented and, I think, a packaging relic.
As noted by @vogelsgesang, there is a bug in the current writer where pd.NaT is not recognized as a null value and is written as 0001-01-01 instead. This stems from isNull in the writer module being incorrectly implemented.
We might be able to use pd.isnull for the check, but I'm not sure what performance implications that would have. If it adds a lot, the other option is to check for np.iinfo(np.int64).min, as that is the sentinel used for NULL timestamps, though that is again somewhat fragile.
One other element to consider: if you write this date into a Hyper file and open it, it will also display as NULL in Tableau. We might want to deprecate reading those dates or give some kind of warning to users that this may change in the future.
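A minimal sketch of what a corrected null check could look like (illustrative only — the real isNull lives in the C writer module, and the exact values it sees are an assumption here):

```python
import numpy as np
import pandas as pd

NAT_SENTINEL = np.iinfo(np.int64).min  # pandas stores NaT as the int64 minimum

def is_null_timestamp(value) -> bool:
    """Treat both pd.NaT (boxed) and the raw int64 sentinel as NULL."""
    if pd.isnull(value):
        return True
    return isinstance(value, (int, np.integer)) and int(value) == NAT_SENTINEL
```

Benchmarking would then tell us whether the generic pd.isnull path or the sentinel comparison alone is the better trade-off.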
Describe the bug
Writing a dataframe with NaT timestamps dotted around it to Hyper works, but reading it back gives an out-of-bounds timestamp 1-01-01 00:00:00 for those fields, which throws an error.
To Reproduce
Steps to reproduce the behavior:
```python
import pandas as pd
import pantab
from pathlib import Path
import numpy as np
from datetime import datetime
from tableauhyperapi import TableName

np.random.seed(0)
df = pd.DataFrame(
    {"A": np.random.rand(10), "B": np.random.rand(10)}, index=range(10)
)
df["datetime"] = datetime.now()
df.iloc[1] = None
pantab.frame_to_hyper(df, Path("rng.hyper"), table=TableName("rng"))
pantab.frame_from_hyper("rng.hyper", table=TableName("rng"))  # Throws the error
```
Expected behavior
Should be able to read these out-of-bounds timestamps back as NaT in the dataframe.
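Until the reader handles this, one workaround sketch is to map the sentinel back to NaT before pandas boxes the value (SENTINEL here is an assumption based on the 0001-01-01 value described above):

```python
import datetime
import pandas as pd

SENTINEL = datetime.datetime(1, 1, 1)  # what Hyper hands back for the mis-written NULLs

def to_series_with_nat(values):
    """Replace the out-of-range sentinel with NaT so the Series builds cleanly."""
    cleaned = [pd.NaT if v == SENTINEL else v for v in values]
    return pd.Series(cleaned, dtype="datetime64[ns]")
```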
When performing a clean install of pantab from pip without tableauhyperapi installed first, building the extension module breaks.
Full traceback below:
$ pip install pantab
Collecting pantab
Using cached https://files.pythonhosted.org/packages/e3/82/e3844ce038e52f9927f43f5ab94984f9bbfb6f6dee25ea1fbd6e4b40eeae/pantab-0.1.0.tar.gz
Requirement already satisfied: pandas in /home/akos/akos_venv/lib/python3.6/site-packages (from pantab) (0.25.3)
Requirement already satisfied: numpy>=1.13.3 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (1.17.4)
Requirement already satisfied: python-dateutil>=2.6.1 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (2019.3)
Requirement already satisfied: six>=1.5 in /home/akos/akos_venv/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas->pantab) (1.13.0)
Building wheels for collected packages: pantab
Building wheel for pantab (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/akos/akos_venv/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-p1jbd85j/pantab/setup.py'"'"'; __file__='"'"'/tmp/pip-install-p1jbd85j/pantab/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-o3f3iwl1 --python-tag cp36
cwd: /tmp/pip-install-p1jbd85j/pantab/
Complete output (28 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/__init__.py -> build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/custom.py -> build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/benchmarks.py -> build/lib.linux-x86_64-3.6/benchmarks
creating build/lib.linux-x86_64-3.6/pantab
copying pantab/__init__.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_writer.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_reader.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_types.py -> build/lib.linux-x86_64-3.6/pantab
creating build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/__init__.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_writer.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/conftest.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_roundtrip.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_reader.py -> build/lib.linux-x86_64-3.6/pantab/tests
running build_ext
building 'libwriter' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pantab
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/akos/akos_venv/include -I/usr/local/include/python3.6m -c pantab/_writermodule.c -o build/temp.linux-x86_64-3.6/pantab/_writermodule.o
pantab/_writermodule.c:6:29: fatal error: tableauhyperapi.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pantab
Right now in the writer extension module we parse an integer into an unsigned long long to hold an address in C. While this works on most platforms, I don't think this behavior is always correct.
Instead we should probably use uintptr_t from stdint.h. The appropriate format syntax is described here:
https://stackoverflow.com/questions/5795978/string-format-for-intptr-t-and-uintptr-t
A C extension was developed in #31 to help read performance, but there are a few tweaks that could still make this more efficient. An obvious improvement would be to yield values during iteration of the hyper rowsets from the C extension, rather than placing them all into a list and returning the list
The Hyper API establishes a date as a uint32 and a time as a uint64, but I think writes both as an int64.
There are probably some weird edge cases when values cross the limits of int64 into uint64, so we might want to do explicit bounds checking rather than the casts currently in the code.
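The checks themselves belong on the C side, but the idea in Python terms would be something like this (a sketch, not pantab code):

```python
import numpy as np

def check_int64_bounds(value: int) -> int:
    """Raise instead of silently wrapping when a value does not fit in int64."""
    info = np.iinfo(np.int64)
    if not (info.min <= value <= info.max):
        raise OverflowError(f"{value} does not fit in int64")
    return value
```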
We should set up doctest to avoid issues like #73 in the future
Looking for some assistance debugging the following issue.
Describe the bug
I've got multiple .hyper files that have been downloaded from a Tableau Server using TSC. When using pantab.frame_from_hyper to read any of them in, the python application crashes.
Trace shows this as the final exec before the crash:

```
_reader.py(45): df = pd.DataFrame(libreader.read_hyper_query(address, query, dtype_strs))
```
To Reproduce
Unfortunately I can't share the data sources this is occurring with; I'm happy to debug locally and provide as much detail as possible. This is a snippet of what I'm running:

```python
import pantab
from tableauhyperapi import TableName

hyper = 'Data/Extracts/Job _Datasource.hyper'
df = pantab.frame_from_hyper(hyper, table=TableName("Extract", "Extract"))
```
Expected behavior
The dataframe should be read in without crashing the application.
Desktop (please complete the following information):
Additional context
From what I've tested so far, .hyper files created using pantab are not affected. They can be read and do not cause the application to crash (even if I set a schema of TableName("Extract", "Extract")).
This only occurs when using pantab and I'm currently using tableauhyperapi as a temp workaround without issue:
```python
import pandas as pd
from tableauhyperapi import HyperProcess, Telemetry, Connection, TableName

def tabapi_frame_from_hyper(db):
    table = TableName("Extract", "Extract")
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint, database=db) as connection:
            table_definition = connection.catalog.get_table_definition(table)
            with connection.execute_query(query=f"SELECT * FROM {table}") as result:
                rows = list(result)
    columns = [column.name.unescaped for column in table_definition.columns]
    return pd.DataFrame(rows, columns=columns)

df = tabapi_frame_from_hyper(db="Data/Extracts/Job _Datasource.hyper")
```
Tableau offers Spatial as a distinct type in the Hyper API, so we should probably offer some support for GeoPandas Dataframes during serialization
Describe the bug
After the 0.2.0 release, datetime data types written to the .hyper file are not being interpreted properly by Tableau Desktop. Loading data from the .hyper file to a new pandas dataframe displays the correct date/time information. This only affects dates that have been parsed with pandas and does not affect dates stored as an object.
Reverting back to pantab version 0.1.1 resolved the issue.
To Reproduce
Steps to reproduce the behavior:
```python
import pandas as pd
import datetime
import pantab

url = 'http://bit.ly/uforeports'
# read csv file
df = pd.read_csv(url, low_memory=False, parse_dates=['Time'])
# Time should have a data type of datetime
print(df.info())
# write the dataframe to a hyper file
pantab.frame_to_hyper(df, 'UFO.hyper', table="UFO")
# read the hyper to a new dataframe
df2 = pantab.frame_from_hyper('UFO.hyper', table="UFO")
# display the top rows - dates/times come back properly
print(df2.head())
```
Next, open the created .hyper file with Tableau Desktop. Describing the Time column shows "Error" for the domain.
Expected behavior
Expected behavior is that the Time column is properly interpreted by Tableau as a series of Nulls and Dates/Times.
Describe the bug
A Timedelta field containing pd.NaT throws ValueError: cannot convert float NaN to integer.
To Reproduce
```
>>> df = pd.DataFrame([pd.Timedelta(None), pd.Timedelta(days=1)], columns=list("a"))
>>> pantab.frame_to_hyper(df, "test.hyper", table="test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 177, in frame_to_hyper
    _insert_frame(df, connection=connection, table=table, table_mode=table_mode)
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 143, in _insert_frame
    df, dtypes = _maybe_convert_timedelta(df)
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 96, in _maybe_convert_timedelta
    df.iloc[:, index] = content.apply(_timedelta_to_interval)
  File "/Users/williamayd/miniconda3/envs/pantab/lib/python3.8/site-packages/pandas/core/series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 25, in _timedelta_to_interval
    without_days = td - pd.Timedelta(days=days)
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1245, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
ValueError: cannot convert float NaN to integer
```
Expected behavior
Should work without error
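A sketch of a guarded conversion, assuming the fix is simply to short-circuit missing values before the Timedelta arithmetic (the tuple here stands in for whatever interval object the real writer builds):

```python
import pandas as pd

def timedelta_to_interval_safe(td):
    """Return None for missing values instead of hitting NaN arithmetic."""
    if pd.isnull(td):
        return None
    days = td.days
    without_days = td - pd.Timedelta(days=days)
    microseconds = without_days // pd.Timedelta(microseconds=1)
    return (0, days, microseconds)  # (months, days, microseconds); Timedelta has no months
```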
When running the function:
```python
pantab.frame_to_hyper(self.data_frame, file_name, table=table_name)
```

the process crashes with a segfault:

```
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
hyperapi::InserterBuffer::addInt64 (this=0x55a67711dce0, value=<optimized out>) at ./api/cxx/hyperapi-c/hyperapi/api/InserterBuffer.hpp:102
102    ./api/cxx/hyperapi-c/hyperapi/api/InserterBuffer.hpp: No such file or directory.
```
To Reproduce
Steps to reproduce the behavior:
In a Docker image:

```
FROM python:3.8-slim
RUN pip install tableauhyperapi==0.0.10309 pantab==1.0.1
```
Build a dataframe with an int64 column type
Run pantab.frame_to_hyper function
Expected behavior
Hyper file successfully built
Additional context
It looks like this error is specific to having a different version of either pantab or tableauhyperapi than the one each release was built against.
A simple fix in pantab would seem to be specifying hard requirements for the tableauhyperapi versions that are supported.
Right now if you have an arbitrary data type stored within an object column (say, a dict), you get a rather cryptic message: TypeError: a bytes-like object is required, not XXX.
We should be able to catch this and at the very least give users a better indication of what went wrong.
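One possible shape for such a check, as a sketch (the function name and message are assumptions, not pantab's actual internals):

```python
import pandas as pd

def check_object_column(series: pd.Series) -> None:
    """Pre-flight check: fail with a readable message instead of the
    cryptic 'a bytes-like object is required' from deeper in the stack."""
    bad = series[~series.map(lambda v: isinstance(v, str) or pd.isnull(v))]
    if not bad.empty:
        raise TypeError(
            f"Column {series.name!r} must contain only strings; "
            f"found {type(bad.iloc[0]).__name__} at row {bad.index[0]}"
        )
```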
Right now we unconditionally copy the data frame for timedelta handling, which is pretty wasteful. We should fix that :-)
We could add tqdm as a pretty lightweight dependency for progress bars
This was done for writing in #30 and had significant performance improvements. Would welcome PRs to do something similar for reading.
Though its utility in Tableau is limited, this should be allowable.
One of my data sets has a column that's mixed between integers and strings. The vast majority are integers, but there are a number of string values within the same column. Pandas interprets this column correctly as an object data type, but pantab raises a dtype error when writing to an extract.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv\lib\site-packages\pantab\_writer.py", line 186, in frame_to_hyper
    _insert_frame(df, connection=connection, table=table, table_mode=table_mode)
  File ".venv\lib\site-packages\pantab\_writer.py", line 162, in _insert_frame
    dtypes,
TypeError: Invalid value "10216040" found (row 0 column 0)
```
Describe the bug
The Usage Example "Working with Schemas" code snippet throws an error due to an unnecessary df argument.

```
Traceback (most recent call last):
  File ".\pantabtest.py", line 16, in <module>
    df2 = pantab.frame_from_hyper(df, "example.hyper", table=table)
TypeError: frame_from_hyper() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given
```
Expected behavior
The code snippet should work as expected.
Additional context
@WillAyd on the road at the moment but I'll fix this in a PR tonight.
After discussing with @rhelenius on the Tableau forums, I think we need to add support for an Append Mode with Hyper Extracts. This can support delta processes and potential chunking of datasets into Hyper.
Some consideration points:
I'll jot some more thoughts down later, but we may want to offer similar enumerations for table creation and add them as a keyword argument to the existing functions, or maybe even just re-purpose the enum already exposed by Hyper.
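For illustration, such an enumeration could look something like this (a sketch; the names and the stub signature are assumptions, not the shipped API):

```python
import enum

class TableMode(enum.Enum):
    """Hypothetical enumeration for a table_mode keyword argument."""
    WRITE = "w"   # drop and recreate the table
    APPEND = "a"  # insert into the existing table

def frame_to_hyper_stub(df, path, table, table_mode=TableMode.WRITE):
    """Illustrative signature only; the real writing logic is omitted."""
    mode = TableMode(table_mode)  # accept either the enum or its string value
    return mode
```

Accepting either the enum member or its string value keeps the door open for plain "w"/"a" strings as well.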
I'm trying to write a dataframe that contains datetime.date objects to Hyper using the pantab.frame_to_hyper method, and it raises a TypeError.
Steps to reproduce the problem:
```python
import pandas as pd
import datetime

date = datetime.date(2020, 5, 8)
df = pd.DataFrame({'Date': [date, date, date], 'Col': list('ABC')})
df.head()
df.info()

import pantab
from tableauhyperapi import TableName

table = TableName('Extract', 'Extract')
pantab.frame_to_hyper(df, 'random_db.hyper', table=table)
# => TypeError: Invalid value "datetime.date(2020, 5, 8)" found (row 0 column 0)
```
Converting datetime.date to pd.datetime solves the problem:

```python
df.iloc[0, 0]
df['Date'] = pd.to_datetime(df['Date'])
pantab.frame_to_hyper(df, 'random_db.hyper', table=table)
```
other info:
OS: macOS Catalina 10.15.3
pandas version 1.0.0
pantab version 1.1.0
Thanks
Hadi
Follow-up to #60: we should deprecate this behavior on the reading side and read literally. Would certainly welcome a community contribution and am happy to offer guidance if anyone out there is interested.
#94 fixed a pretty critical bug but exposed a flaw in internal terminology. Specifically, we create a lot of mappings back to "pandas dtypes" that we pass to the C extension, which then creates the appropriate objects
"date" however is not a real pandas dtype, but needs differentiation from datetime objects in the C extension. I hacked things together by just adding date to the "pandas_dtype" dictionary that we have and subsequently casting after creating a dataframe, but there is a cleaner way to do this
For historical reasons we write datetimes with time zones as UTC timestamps to a Hyper file. Recent Hyper releases have support for IANA time zone designations, so we could in theory add support for actual time zones now.
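The historical behavior can be sketched in pandas terms as follows (a sketch of the conversion, not pantab's actual code path):

```python
import pandas as pd

def to_utc_naive(series: pd.Series) -> pd.Series:
    """Convert tz-aware timestamps to UTC and drop the zone before writing."""
    if series.dt.tz is None:
        return series  # already naive; written as-is
    return series.dt.tz_convert("UTC").dt.tz_localize(None)
```

With IANA support, the alternative would be to preserve series.dt.tz and emit the zone name alongside the timestamp rather than flattening to naive UTC.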
Now that tableauhyperapi is available on pip (starting with v0.0.9746), we can rip out all of the custom solutions in CI that manually install it.
Right now we explicitly disable telemetry in the project. I don't have any objection to allowing users to share usage stats with Tableau, but at the same time I don't know the legal implications of doing so.
If someone has legal resources at their disposal and an interest in this, I would be open to it.
The asv package in the environment.yml file is in the wrong location. It should be placed under the pip section in dependencies because otherwise the environment setup will fail.
To Reproduce
Steps to reproduce the behavior:
Running the line below directly after cloning the repo:

```
conda env create -f environment.yml
```

gives this error message:

```
Solving environment: failed

ResolvePackageNotFound:
  - asv
```
Expected behavior
pantab-dev environment should install without issue.
Additional context
I just moved the `- asv` line from under `dependencies:` to under `- pip:` to resolve this in my build.
Unfortunately, mixing False and np.nan coerces to object in pandas, and when writing object we get TypeError: a bytes-like object is required, not 'bool'.
Not sure of the best way to handle this without a BoolNA type, but posting here for now.
Amazing work on the C module for enhanced performance. A question for large table conversion, since I've noticed that the Hyper API is able to COPY from CSV faster than from a pandas DataFrame already in memory:
Is there room for even more improvement by passing the underlying NumPy arrays instead of iterating through the tuples, which requires wrapping and unwrapping each value in a PyObject?
Line 273 in 5fc6a64
(I don't know enough about how Pandas String Series are handled to know if this would help or not.)
Another option might be to convert the Pandas DataFrame to a pyarrow Table and then pass that, as pyarrow has nullable ints and strings. This would allow for a pyarrow.parquet.read_table().to_hyper() as a data path.
Would be good to have an INSERT option ('i') when writing to existing hyper extract to insert new records based on a given key.
Running this package from fresh installs of Anaconda 3.6 and the tableauhyperapi results in the error message below. Doing a pip install of msgpack and then of cffi resolves this issue. It is related to the pycparser version, but I can't tell you which version is needed as I didn't think to save the relevant logs.
```
  File "<ipython-input-3-62d421bcaed4>", line 1, in <module>
    extract_queries('Y:/UiPath/Tableau Extract/','Hyper_Refresh.xlsx')
  File "<ipython-input-1-44a04ab0d533>", line 92, in extract_queries
    process_query(path, filename, connection, sql)
  File "<ipython-input-1-44a04ab0d533>", line 74, in process_query
    pantab.frame_to_hyper(combined_csv, top + '\\' + filename, table = filename.rsplit('.')[0])
  File "C:\ProgramData\Anaconda3\lib\site-packages\pantab\_pantab.py", line 157, in frame_to_hyper
    with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hpe:
  File "C:\ProgramData\Anaconda3\lib\site-packages\tableauhyperapi-0.0.8707-py3.6.egg\tableauhyperapi\hyperprocess.py", line 83, in __init__
    InteropUtil.string_to_char_p(user_agent),
  File "C:\ProgramData\Anaconda3\lib\site-packages\tableauhyperapi-0.0.8707-py3.6.egg\tableauhyperapi\impl\dllutil.py", line 34, in string_to_char_p
    return ffi.from_buffer('char[]', s.encode())
TypeError: from_buffer() takes 2 positional arguments but 3 were given
```
This has happened since the switch to GH Actions; it is temporarily disabled in CI for now.
Describe the bug
When running `make html` in the /doc folder as per the CONTRIBUTING.md instructions, this error occurs:

```
Running Sphinx v3.0.3

Extension error:
Could not import extension sphinx_rtd_theme (exception: No module named 'sphinx_rtd_theme')
```
To Reproduce
Steps to reproduce the behavior:
Run this in the /doc folder after a fresh creation of the pantab-dev environment:

```
make html
```
Expected behavior
The /build folder should be created with HTML documentation in the /doc folder.
Additional context
Recommended to add sphinx_rtd_theme to the environment.yml file under the `- pip` section, as this resolved my build of the HTML documentation.
Some types have a clearly defined nullable status, like np.int* and boolean types (always non-null) alongside the more experimental pd.Int* types (nullable). We can add explicit support for these.
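A sketch of how the nullability decision could be derived from the dtype (the function name is an assumption, not pantab's internals):

```python
import numpy as np
import pandas as pd

def is_nullable_dtype(dtype) -> bool:
    """Decide Hyper Nullability from a pandas/NumPy dtype: np.int*/bool can
    never hold NA; extension types like Int64/boolean (and floats/objects) can."""
    return pd.api.types.is_extension_array_dtype(dtype) or not (
        pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_bool_dtype(dtype)
    )
```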
Describe the bug
When trying to publish .hyper files created using pantab using tableauserverclient you will get the generic publishing error "400011: Bad Request -- There was a problem publishing the file".
This is similar to what is described in this tableau forum thread: https://community.tableau.com/thread/324979
To Reproduce
Additional context
The fix for this will be to do as described in the forum thread and edit the _insert_frame function in the _writer.py file to properly create the schema when defining tables.
The lines:

```python
if isinstance(table, tab_api.TableName) and table.schema_name:
    connection.catalog.create_schema_if_not_exists(table.schema_name)
```

are never executed because table.schema_name is never defined, so the schema never gets created.
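One way to normalize the incoming table so a schema name is always present could look like this (a sketch; the helper name and the default schema are assumptions for illustration):

```python
def ensure_schema(table, default_schema="Extract"):
    """Return (schema_name, table_name), falling back to a default schema
    when the caller passed a bare table name."""
    if isinstance(table, str):
        return (default_schema, table)
    schema = getattr(table, "schema_name", None) or default_schema
    return (str(schema), str(getattr(table, "name", table)))
```

With a real TableName, the writer could then run create_schema_if_not_exists on the first element before creating the table, so the condition above always fires.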
I've just had a look at this fine project. When I threw my own testing data set at it, it choked. I read the data set using the pandas pd.read_csv() function for a CSV file, or pd.read_parquet() for an (equivalent) Parquet file. This sample file contained some empty string fields, which got mapped to a float with value nan (in pandas terms, when reading the CSV file) or a None (when reading the Parquet file).
Unfortunately, when pantab passed the values through to tableausdk's setString function, it choked with an exception. I've been able to patch the __init__.py to work around this (for now), but I don't think it's a clean enough solution. I'm just testing for the setString accessor in _append_args_for_val_and_accessor and appending '' to arg_l in the case of a nan or None.
If you wish, I can provide a PR with that fix. But I do think it deserves a cleaner solution.
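A cleaner, frame-level version of the same workaround might normalize missing values up front (a sketch; whether '' or a real NULL is the right replacement is exactly the open design question):

```python
import numpy as np
import pandas as pd

def fill_missing_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN/None in object (string) columns with '' before the
    values ever reach setString; other columns are left untouched."""
    out = df.copy()
    obj_cols = out.select_dtypes(include="object").columns
    out[obj_cols] = out[obj_cols].fillna("")
    return out
```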
When building from source, a user faces this import error with the latest version of the tableauhyperapi released March 25. It looks like find_hyper_api_dll was changed to find_hyper_api_library.
We will want to update the setup script accordingly to handle both; I don't think the tableauhyperapi allows for version inspection yet, so we probably just have to try...except for now.
I think it would be nicer if we followed convention by placing the C files in a src folder separate from the Python modules.
Erroring on build warnings would make the extensions a lot more stable. Need to see the cross-compiler way of doing this, but at least for GCC it would be the -Wall and -Werror flags sent to the extra_compile_args argument of the Extension objects.
I think the docs could do a better job of selling "why pantab". Performance and ease of use (i.e. automatic schema management) come to mind, but I'm biased, so I would love input from others. Would definitely accept PRs to update the documentation as well; something like benchmarks against the basic Hyper API bindings would be cool to publish.