innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
License: BSD 3-Clause "New" or "Revised" License
This should be achievable now that tableauhyperapi is distributed with a manylinux tag.
Right now the public-facing functions create new Hyper Processes / Connections for the user. This, however, can be rather expensive (I think 300-400 ms) and limiting for advanced use cases.
If someone out there is willing to benchmark and propose an API accordingly, contributions would certainly be welcome.
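As a sketch of what such an API could look like (all names here are hypothetical, not the actual pantab API), the idea is to pay the process startup cost once and pass the handle into many write calls:

```python
from contextlib import contextmanager

class FakeHyperProcess:
    """Stand-in for tableauhyperapi.HyperProcess so the pattern runs anywhere."""
    started = 0  # counts how many (expensive) processes were spawned

    def __init__(self):
        FakeHyperProcess.started += 1

    def close(self):
        pass

@contextmanager
def hyper_session():
    """Start one Hyper process and hand it to many operations."""
    proc = FakeHyperProcess()
    try:
        yield proc
    finally:
        proc.close()

# Five writes share one process instead of paying the startup cost five times
with hyper_session() as hyper:
    for _ in range(5):
        # hypothetical: pantab.frame_to_hyper(df, path, table=..., hyper_process=hyper)
        pass
```

pantab could accept the process (or connection) as an optional keyword, falling back to spawning its own when none is given.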
Looks like there was a change to the Hyper API between releases 8707 and 8953. At least on the former, the beta1 and prior versions of pantab will throw that error when creating Hyper extracts.
Either pinning the Hyper API to 8707 or upgrading pantab to beta2 or above will fix the issue.
Describe the bug
I read a DF from SQL Server in chunks, but when I try to append the chunks to my Hyper file it raises a TypeError for all text columns:
TypeError: Mismatched column definitions: (Name="ID (End User)", Type=TEXT, Nullability=Nullability.NULLABLE), (Name="Name (End User)", Type=TEXT, Nullability=Nullability.NULLABLE),...
PS: I cast the types of all columns before trying to append the DF.
It would be nice to document a contributing guide. I'm willing to work with anyone out there who may be interested in learning more about pantab or open source.
Would ideally like to move this into setup.py instead of keeping the MANIFEST file, as MANIFEST is pretty poorly documented and, I think, a packaging relic.
As noted by @vogelsgesang, there is a bug in the current writer where pd.NaT is not recognized as a null value and is written as 0001-01-01 instead. This stems from isNull in the writer module being incorrectly implemented.
We might be able to use pd.isnull for the check, but I'm not sure what performance implications that would have. If it adds a lot, the other option is to check for np.iinfo(np.int64).min, as that is the sentinel used for NULL timestamps, though that is again somewhat fragile.
One other element to consider: if you write this date into a Hyper file and open it, it will also display as NULL in Tableau. We might want to deprecate reading those dates or give some kind of warning to users that this may change in the future.
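A minimal sketch of what a corrected null check could look like (illustrative only — the real isNull lives in the C writer module, and the exact values it sees are an assumption here):

```python
import numpy as np
import pandas as pd

NAT_SENTINEL = np.iinfo(np.int64).min  # pandas stores NaT as the int64 minimum

def is_null_timestamp(value) -> bool:
    """Treat both pd.NaT (boxed) and the raw int64 sentinel as NULL."""
    if pd.isnull(value):
        return True
    return isinstance(value, (int, np.integer)) and int(value) == NAT_SENTINEL
```

Benchmarking would then tell us whether the generic pd.isnull path or the sentinel comparison alone is the better trade-off.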
Describe the bug
Writing a dataframe with NaT timestamps dotted around it to Hyper works, but reading it back gives an out-of-bounds timestamp 1-01-01 00:00:00 for those fields, which throws an error.
To Reproduce
Steps to reproduce the behavior:
```python
import pandas as pd
import pantab
from pathlib import Path
import numpy as np
from datetime import datetime
from tableauhyperapi import TableName

np.random.seed(0)
df = pd.DataFrame(
    {"A": np.random.rand(10), "B": np.random.rand(10)}, index=range(10)
)
df["datetime"] = datetime.now()
df.iloc[1] = None
pantab.frame_to_hyper(df, Path("rng.hyper"), table=TableName("rng"))
pantab.frame_from_hyper("rng.hyper", table=TableName("rng"))  # Throws the error
```
Expected behavior
Should be able to read these out-of-bounds timestamps back as NaT in the dataframe.
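Until the reader handles this, one workaround sketch is to map the sentinel back to NaT before pandas boxes the value (SENTINEL here is an assumption based on the 0001-01-01 value described above):

```python
import datetime
import pandas as pd

SENTINEL = datetime.datetime(1, 1, 1)  # what Hyper hands back for the mis-written NULLs

def to_series_with_nat(values):
    """Replace the out-of-range sentinel with NaT so the Series builds cleanly."""
    cleaned = [pd.NaT if v == SENTINEL else v for v in values]
    return pd.Series(cleaned, dtype="datetime64[ns]")
```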
When performing a clean install of pantab from pip without tableauhyperapi installed first, building the extension module breaks.
Full traceback below:
$ pip install pantab
Collecting pantab
Using cached https://files.pythonhosted.org/packages/e3/82/e3844ce038e52f9927f43f5ab94984f9bbfb6f6dee25ea1fbd6e4b40eeae/pantab-0.1.0.tar.gz
Requirement already satisfied: pandas in /home/akos/akos_venv/lib/python3.6/site-packages (from pantab) (0.25.3)
Requirement already satisfied: numpy>=1.13.3 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (1.17.4)
Requirement already satisfied: python-dateutil>=2.6.1 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /home/akos/akos_venv/lib/python3.6/site-packages (from pandas->pantab) (2019.3)
Requirement already satisfied: six>=1.5 in /home/akos/akos_venv/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas->pantab) (1.13.0)
Building wheels for collected packages: pantab
Building wheel for pantab (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/akos/akos_venv/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-p1jbd85j/pantab/setup.py'"'"'; __file__='"'"'/tmp/pip-install-p1jbd85j/pantab/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-o3f3iwl1 --python-tag cp36
cwd: /tmp/pip-install-p1jbd85j/pantab/
Complete output (28 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/__init__.py -> build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/custom.py -> build/lib.linux-x86_64-3.6/benchmarks
copying benchmarks/benchmarks.py -> build/lib.linux-x86_64-3.6/benchmarks
creating build/lib.linux-x86_64-3.6/pantab
copying pantab/__init__.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_writer.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_reader.py -> build/lib.linux-x86_64-3.6/pantab
copying pantab/_types.py -> build/lib.linux-x86_64-3.6/pantab
creating build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/__init__.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_writer.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/conftest.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_roundtrip.py -> build/lib.linux-x86_64-3.6/pantab/tests
copying pantab/tests/test_reader.py -> build/lib.linux-x86_64-3.6/pantab/tests
running build_ext
building 'libwriter' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/pantab
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/akos/akos_venv/include -I/usr/local/include/python3.6m -c pantab/_writermodule.c -o build/temp.linux-x86_64-3.6/pantab/_writermodule.o
pantab/_writermodule.c:6:29: fatal error: tableauhyperapi.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for pantab
Right now in the writer extension module we parse an integer into an unsigned long long to hold an address in C. While this works on most platforms, I don't think this behavior is always correct.
Instead we should probably use uintptr_t from stdint.h. The appropriate format syntax is described here:
https://stackoverflow.com/questions/5795978/string-format-for-intptr-t-and-uintptr-t
A C extension was developed in #31 to help read performance, but there are a few tweaks that could still make this more efficient. An obvious improvement would be to yield values during iteration of the hyper rowsets from the C extension, rather than placing them all into a list and returning the list
The Hyper API establishes a date as a uint32 and a time as a uint64, but I think writes both as an int64.
There are probably some weird edge cases when values cross the limits of int64 into uint64, so we might want to do explicit bounds checking rather than the casts currently in the code.
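The checks themselves belong on the C side, but the idea in Python terms would be something like this (a sketch, not pantab code):

```python
import numpy as np

def check_int64_bounds(value: int) -> int:
    """Raise instead of silently wrapping when a value does not fit in int64."""
    info = np.iinfo(np.int64)
    if not (info.min <= value <= info.max):
        raise OverflowError(f"{value} does not fit in int64")
    return value
```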
We should set up doctest to avoid issues like #73 in the future
Looking for some assistance debugging the following issue.
Describe the bug
I've got multiple .hyper files that have been downloaded from a Tableau Server using TSC. When using pantab.frame_from_hyper to read any of them in, the python application crashes.
Trace shows this as the final exec before the crash:

```
_reader.py(45): df = pd.DataFrame(libreader.read_hyper_query(address, query, dtype_strs))
```
To Reproduce
Unfortunately I can't share the data sources this is occurring with; I'm happy to debug locally and provide as much detail as possible. This is a snippet of what I'm running:

```python
import pantab
from tableauhyperapi import TableName

hyper = 'Data/Extracts/Job _Datasource.hyper'
df = pantab.frame_from_hyper(hyper, table=TableName("Extract", "Extract"))
```
Expected behavior
The dataframe should be read in without crashing the application.
Desktop (please complete the following information):
Additional context
From what I've tested so far, .hyper files created using pantab are not affected. They can be read and do not cause the application to crash (even if I set a schema of TableName("Extract", "Extract")).
This only occurs when using pantab and I'm currently using tableauhyperapi as a temp workaround without issue:
```python
import pandas as pd
from tableauhyperapi import HyperProcess, Telemetry, Connection, TableName

def tabapi_frame_from_hyper(db):
    table = TableName("Extract", "Extract")
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint, database=db) as connection:
            table_definition = connection.catalog.get_table_definition(table)
            with connection.execute_query(query=f"SELECT * FROM {table}") as result:
                rows = list(result)
    columns = [column.name.unescaped for column in table_definition.columns]
    return pd.DataFrame(rows, columns=columns)

df = tabapi_frame_from_hyper(db="Data/Extracts/Job _Datasource.hyper")
```
Tableau offers Spatial as a distinct type in the Hyper API, so we should probably offer some support for GeoPandas Dataframes during serialization
Describe the bug
After the 0.2.0 release, datetime data types written to the .hyper file are not being interpreted properly by Tableau Desktop. Loading data from the .hyper file to a new pandas dataframe displays the correct date/time information. This only affects dates that have been parsed with pandas and does not affect dates stored as an object.
Reverting back to pantab version 0.1.1 resolved the issue.
To Reproduce
Steps to reproduce the behavior:
```python
import pandas as pd
import datetime
import pantab

url = 'http://bit.ly/uforeports'
# read csv file
df = pd.read_csv(url, low_memory=False, parse_dates=['Time'])
# Time should have a data type of datetime
print(df.info())
# write the dataframe to a hyper file
pantab.frame_to_hyper(df, 'UFO.hyper', table="UFO")
# read the hyper to a new dataframe
df2 = pantab.frame_from_hyper('UFO.hyper', table="UFO")
# display the top rows - dates/times come back properly
print(df2.head())
```
Next, open the created .hyper file with Tableau Desktop. Describing the Time column shows "Error" for the domain.
Expected behavior
Expected behavior is that the Time column is properly interpreted by Tableau as a series of Nulls and Dates/Times.
Describe the bug
A Timedelta field containing pd.NaT throws ValueError: cannot convert float NaN to integer.
To Reproduce
```
>>> df = pd.DataFrame([pd.Timedelta(None), pd.Timedelta(days=1)], columns=list("a"))
>>> pantab.frame_to_hyper(df, "test.hyper", table="test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 177, in frame_to_hyper
    _insert_frame(df, connection=connection, table=table, table_mode=table_mode)
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 143, in _insert_frame
    df, dtypes = _maybe_convert_timedelta(df)
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 96, in _maybe_convert_timedelta
    df.iloc[:, index] = content.apply(_timedelta_to_interval)
  File "/Users/williamayd/miniconda3/envs/pantab/lib/python3.8/site-packages/pandas/core/series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "/Users/williamayd/clones/pantab/pantab/_writer.py", line 25, in _timedelta_to_interval
    without_days = td - pd.Timedelta(days=days)
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1245, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
ValueError: cannot convert float NaN to integer
```
Expected behavior
Should work without error
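A sketch of a guarded conversion, assuming the fix is simply to short-circuit missing values before the Timedelta arithmetic (the tuple here stands in for whatever interval object the real writer builds):

```python
import pandas as pd

def timedelta_to_interval_safe(td):
    """Return None for missing values instead of hitting NaN arithmetic."""
    if pd.isnull(td):
        return None
    days = td.days
    without_days = td - pd.Timedelta(days=days)
    microseconds = without_days // pd.Timedelta(microseconds=1)
    return (0, days, microseconds)  # (months, days, microseconds); Timedelta has no months
```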
When running the function:
```python
pantab.frame_to_hyper(self.data_frame, file_name, table=table_name)
```

the process crashes with a segfault:

```
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
hyperapi::InserterBuffer::addInt64 (this=0x55a67711dce0, value=<optimized out>) at ./api/cxx/hyperapi-c/hyperapi/api/InserterBuffer.hpp:102
102    ./api/cxx/hyperapi-c/hyperapi/api/InserterBuffer.hpp: No such file or directory.
```
To Reproduce
Steps to reproduce the behavior:
In a Docker image:

```
FROM python:3.8-slim
RUN pip install tableauhyperapi==0.0.10309 pantab==1.0.1
```
Build a dataframe with an int64 column type
Run pantab.frame_to_hyper function
Expected behavior
Hyper file successfully built
Additional context
It looks like this error is specific to having a different version of either pantab or tableauhyperapi than the one each release was built against.
A simple fix in pantab would seem to be specifying hard requirements for the tableauhyperapi versions that are supported.
Right now if you have an arbitrary data type stored within an object column (say, a dict), you get a rather cryptic message: TypeError: a bytes-like object is required, not XXX.
We should be able to catch this and at the very least give users a better indication of what went wrong.
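One possible shape for such a check, as a sketch (the function name and message are assumptions, not pantab's actual internals):

```python
import pandas as pd

def check_object_column(series: pd.Series) -> None:
    """Pre-flight check: fail with a readable message instead of the
    cryptic 'a bytes-like object is required' from deeper in the stack."""
    bad = series[~series.map(lambda v: isinstance(v, str) or pd.isnull(v))]
    if not bad.empty:
        raise TypeError(
            f"Column {series.name!r} must contain only strings; "
            f"found {type(bad.iloc[0]).__name__} at row {bad.index[0]}"
        )
```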
Right now we unconditionally copy the data frame for timedelta handling, which is pretty wasteful. We should fix that :-)
We could add tqdm as a pretty lightweight dependency for progress bars
This was done for writing in #30 and had significant performance improvements. Would welcome PRs to do something similar for reading.
Though its utility in Tableau is limited, this should be allowable.
One of my data sets has a column that's mixed between integers and strings. The vast majority are integers, but there are a number of string values within the same column. Pandas interprets this column correctly as an object data type, but pantab raises a dtype error when writing to an extract.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv\lib\site-packages\pantab\_writer.py", line 186, in frame_to_hyper
    _insert_frame(df, connection=connection, table=table, table_mode=table_mode)
  File ".venv\lib\site-packages\pantab\_writer.py", line 162, in _insert_frame
    dtypes,
TypeError: Invalid value "10216040" found (row 0 column 0)
```
Describe the bug
The Usage Example "Working with Schemas" code snippet throws an error due to an unnecessary df argument.

```
Traceback (most recent call last):
  File ".\pantabtest.py", line 16, in <module>
    df2 = pantab.frame_from_hyper(df, "example.hyper", table=table)
TypeError: frame_from_hyper() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given
```
Expected behavior
The code snippet should work as expected.
Additional context
@WillAyd on the road at the moment but I'll fix this in a PR tonight.
After discussing with @rhelenius on the Tableau forums, I think we need to add support for an Append Mode with Hyper Extracts. This can support delta processes and potential chunking of datasets into Hyper.
Some consideration points:
I'll jot some more thoughts down later, but we may want to offer similar enumerations for table creation and add them as a keyword argument to the existing functions, or maybe even just re-purpose the enum already exposed by Hyper.
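For illustration, such an enumeration could look something like this (a sketch; the names and the stub signature are assumptions, not the shipped API):

```python
import enum

class TableMode(enum.Enum):
    """Hypothetical enumeration for a table_mode keyword argument."""
    WRITE = "w"   # drop and recreate the table
    APPEND = "a"  # insert into the existing table

def frame_to_hyper_stub(df, path, table, table_mode=TableMode.WRITE):
    """Illustrative signature only; the real writing logic is omitted."""
    mode = TableMode(table_mode)  # accept either the enum or its string value
    return mode
```

Accepting either the enum member or its string value keeps the door open for plain "w"/"a" strings as well.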
I'm trying to write a dataframe that contains datetime.date objects to Hyper using the pantab.frame_to_hyper method, and it raises a TypeError.
Steps to reproduce the problem:
```python
import pandas as pd
import datetime

date = datetime.date(2020, 5, 8)
df = pd.DataFrame({'Date': [date, date, date], 'Col': list('ABC')})
df.head()
df.info()

import pantab
from tableauhyperapi import TableName

table = TableName('Extract', 'Extract')
pantab.frame_to_hyper(df, 'random_db.hyper', table=table)
# => TypeError: Invalid value "datetime.date(2020, 5, 8)" found (row 0 column 0)
```
Converting datetime.date to pd.datetime solves the problem:

```python
df.iloc[0, 0]
df['Date'] = pd.to_datetime(df['Date'])
pantab.frame_to_hyper(df, 'random_db.hyper', table=table)
```
other info:
OS: macOS Catalina 10.15.3
pandas version 1.0.0
pantab version 1.1.0
Thanks
Hadi
Follow-up to #60: we should deprecate this behavior on the reading side and read literally. Would certainly welcome a community contribution and am happy to offer guidance if anyone out there is interested.
#94 fixed a pretty critical bug but exposed a flaw in internal terminology. Specifically, we create a lot of mappings back to "pandas dtypes" that we pass to the C extension, which then creates the appropriate objects
"date" however is not a real pandas dtype, but needs differentiation from datetime objects in the C extension. I hacked things together by just adding date to the "pandas_dtype" dictionary that we have and subsequently casting after creating a dataframe, but there is a cleaner way to do this
For historical reasons we write datetimes with time zones as UTC timestamps to a Hyper file. Recent Hyper releases have support for IANA time zone designations, so we could in theory add support for actual time zones now.
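The historical behavior can be sketched in pandas terms as follows (a sketch of the conversion, not pantab's actual code path):

```python
import pandas as pd

def to_utc_naive(series: pd.Series) -> pd.Series:
    """Convert tz-aware timestamps to UTC and drop the zone before writing."""
    if series.dt.tz is None:
        return series  # already naive; written as-is
    return series.dt.tz_convert("UTC").dt.tz_localize(None)
```

With IANA support, the alternative would be to preserve series.dt.tz and emit the zone name alongside the timestamp rather than flattening to naive UTC.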
Now that tableauhyperapi is available on pip (starting with v0.0.9746), we can rip out all of the custom solutions in CI that manually install it.
Right now we explicitly disable telemetry in the project. I don't have any objection to allowing users to share usage stats with Tableau, but at the same time I don't know the legal implications of doing so.
If someone has legal resources at their disposal and an interest in this, I would be open to it.
The asv package in the environment.yml file is in the wrong location. It should be placed under the pip section in dependencies because otherwise the environment setup will fail.
To Reproduce
Steps to reproduce the behavior:
Running the line below directly after cloning the repo:

```
conda env create -f environment.yml
```

gives this error message:

```
Solving environment: failed

ResolvePackageNotFound:
  - asv
```
Expected behavior
pantab-dev environment should install without issue.
Additional context
I just moved the `- asv` line from under `dependencies:` to under `- pip:` to resolve this in my build.
Unfortunately, mixing False and np.nan coerces to object in pandas, and when writing object we get TypeError: a bytes-like object is required, not 'bool'.
Not sure of the best way to handle this without a BoolNA type, but posting here for now.
Amazing work on the C module for enhanced performance. A question for large table conversion, since I've noticed that the Hyper API is able to COPY from CSV faster than from a pandas DataFrame already in memory:
Is there room for even more improvement by passing the underlying NumPy arrays instead of iterating through the tuples, which requires wrapping and unwrapping each value in a PyObject?
Line 273 in 5fc6a64
(I don't know enough about how Pandas String Series are handled to know if this would help or not.)
Another option might be to convert the Pandas DataFrame to a pyarrow Table and then pass that, as pyarrow has nullable ints and strings. This would allow for a pyarrow.parquet.read_table().to_hyper() as a data path.
Would be good to have an INSERT option ('i') when writing to existing hyper extract to insert new records based on a given key.
Running this package from fresh installs of Anaconda 3.6 and the tableauhyperapi results in the error message below. Doing a pip install of msgpack and then of cffi resolves this issue. It is related to the pycparser version, but I can't tell you which version is needed as I didn't think to save the relevant logs.
```
  File "<ipython-input-3-62d421bcaed4>", line 1, in <module>
    extract_queries('Y:/UiPath/Tableau Extract/','Hyper_Refresh.xlsx')
  File "<ipython-input-1-44a04ab0d533>", line 92, in extract_queries
    process_query(path, filename, connection, sql)
  File "<ipython-input-1-44a04ab0d533>", line 74, in process_query
    pantab.frame_to_hyper(combined_csv, top + '\\' + filename, table = filename.rsplit('.')[0])
  File "C:\ProgramData\Anaconda3\lib\site-packages\pantab\_pantab.py", line 157, in frame_to_hyper
    with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hpe:
  File "C:\ProgramData\Anaconda3\lib\site-packages\tableauhyperapi-0.0.8707-py3.6.egg\tableauhyperapi\hyperprocess.py", line 83, in __init__
    InteropUtil.string_to_char_p(user_agent),
  File "C:\ProgramData\Anaconda3\lib\site-packages\tableauhyperapi-0.0.8707-py3.6.egg\tableauhyperapi\impl\dllutil.py", line 34, in string_to_char_p
    return ffi.from_buffer('char[]', s.encode())
TypeError: from_buffer() takes 2 positional arguments but 3 were given
```
This has happened since the switch to GH Actions; it is temporarily disabled in CI for now.
Describe the bug
When running `make html` in the /doc folder as per the CONTRIBUTING.md instructions, this error occurs:

```
Running Sphinx v3.0.3

Extension error:
Could not import extension sphinx_rtd_theme (exception: No module named 'sphinx_rtd_theme')
```
To Reproduce
Steps to reproduce the behavior:
Run this in the /doc folder after a fresh creation of the pantab-dev environment:

```
make html
```
Expected behavior
The /build folder should be created with HTML documentation in the /doc folder.
Additional context
Recommended to add sphinx_rtd_theme to the environment.yml file under the `- pip` section, as this resolved my build of the HTML documentation.
Some types have a clearly defined nullable status, like np.int* and boolean types (always non-null) alongside the more experimental pd.Int* types (nullable). We can add explicit support for these.
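A sketch of how the nullability decision could be derived from the dtype (the function name is an assumption, not pantab's internals):

```python
import numpy as np
import pandas as pd

def is_nullable_dtype(dtype) -> bool:
    """Decide Hyper Nullability from a pandas/NumPy dtype: np.int*/bool can
    never hold NA; extension types like Int64/boolean (and floats/objects) can."""
    return pd.api.types.is_extension_array_dtype(dtype) or not (
        pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_bool_dtype(dtype)
    )
```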
Describe the bug
When trying to publish .hyper files created using pantab using tableauserverclient you will get the generic publishing error "400011: Bad Request -- There was a problem publishing the file".
This is similar to what is described in this tableau forum thread: https://community.tableau.com/thread/324979
To Reproduce
Additional context
The fix for this will be to do as described in the forum thread and edit the _insert_frame function in the _writer.py file to properly create the schema when defining tables.
The lines:

```python
if isinstance(table, tab_api.TableName) and table.schema_name:
    connection.catalog.create_schema_if_not_exists(table.schema_name)
```

are never executed because table.schema_name is never defined, so the schema never gets created.
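One way to normalize the incoming table so a schema name is always present could look like this (a sketch; the helper name and the default schema are assumptions for illustration):

```python
def ensure_schema(table, default_schema="Extract"):
    """Return (schema_name, table_name), falling back to a default schema
    when the caller passed a bare table name."""
    if isinstance(table, str):
        return (default_schema, table)
    schema = getattr(table, "schema_name", None) or default_schema
    return (str(schema), str(getattr(table, "name", table)))
```

With a real TableName, the writer could then run create_schema_if_not_exists on the first element before creating the table, so the condition above always fires.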
I've just had a look at this fine project. When I threw my own testing data set at it, it choked. I read the data set using the pandas pd.read_csv() function for a CSV file, or pd.read_parquet() for an (equivalent) Parquet file. This sample file contained some empty string fields, which got mapped to a float with value nan (in pandas terms, when reading the CSV file) or a None (when reading the Parquet file).
Unfortunately, when pantab passed the values through to tableausdk's setString function, it choked with an exception. I've been able to patch the __init__.py to work around this (for now), but I don't think it's a clean enough solution. I'm just testing for the setString accessor in _append_args_for_val_and_accessor and appending '' to arg_l in the case of a nan or None.
If you wish, I can provide a PR with that fix. But I do think it deserves a cleaner solution.
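A cleaner, frame-level version of the same workaround might normalize missing values up front (a sketch; whether '' or a real NULL is the right replacement is exactly the open design question):

```python
import numpy as np
import pandas as pd

def fill_missing_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN/None in object (string) columns with '' before the
    values ever reach setString; other columns are left untouched."""
    out = df.copy()
    obj_cols = out.select_dtypes(include="object").columns
    out[obj_cols] = out[obj_cols].fillna("")
    return out
```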
When building from source, a user faces this import error with the latest version of the tableauhyperapi released March 25. It looks like find_hyper_api_dll was changed to find_hyper_api_library.
We will want to update the setup script accordingly to handle both; I don't think the tableauhyperapi allows for version inspection yet, so we probably just have to try...except for now.
I think it would be nicer if we followed convention by placing the C files in a src folder separate from the Python modules.
Erroring on build warnings would make the extensions a lot more stable. Need to see the cross-compiler way of doing this, but at least for GCC it would be the -Wall and -Werror flags sent to the extra_compile_args argument of the Extension objects.
I think the docs could do a better job of selling "why pantab". Performance and ease of use (i.e. automatic schema management) come to mind, but I'm biased, so I would love input from others. Would definitely accept PRs to update the documentation as well; something like benchmarks against the basic Hyper API bindings would be cool to publish.