lbnl-eta / adapter

The Adapter software is developed at the Energy Efficiency Standards Department and it provides a convenient data table loader from various formats such as xlsx, csv, db (sqlite database), and sqlalchemy. Its main feature is the ability to convert data tables identified in one main and optionally one or more additional input files into database tables and Pandas DataFrames for downstream usage in any compatible software.

Python 100.00%


adapter's Issues

'unicode' in create_named_range should be removed

https://github.com/LBNL-ETA/Adapter/blob/master/adapter/comm/excel.py#L408

I have a script that uses create_named_range from my local copy of an older version of adapter, where I had apparently removed this a while ago. Now that I'm installing adapterio, I need to remove it there too. It looks like unicode was deprecated a while ago (Python 3 has no unicode built-in).

Add example to readme on how to handle multiple platforms

Add an example below the current similar examples:

To automatically convert paths between platforms, for example if you are using a VPN connection to access input data files, use the os_mapping argument:

from adapter.i_o import IO

input_loader = IO(
    <fullpath_to_the_main_input_file>,
    os_mapping={'win32': 'C:', 'darwin': '/Volumes/A', 'linux': '/media/A'},
)
data = input_loader.load()
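
To illustrate what such a mapping implies, here is a minimal sketch of prefix-based path conversion. It is not Adapter's implementation: `convert_path` and its `target_platform` parameter are hypothetical names, and the sketch assumes the mapping values are root prefixes that can be swapped one for another.

```python
def convert_path(path, os_mapping, target_platform):
    """Swap a known platform root for the target platform's root.

    Hypothetical helper for illustration only; not part of Adapter.
    """
    target_root = os_mapping[target_platform]
    for root in os_mapping.values():
        if path.startswith(root):
            # Keep the remainder and normalize separators to "/"
            rest = path[len(root):].replace("\\", "/")
            converted = target_root + rest
            if target_platform == "win32":
                converted = converted.replace("/", "\\")
            return converted
    # No known root found: leave the path untouched
    return path


mapping = {"win32": "C:", "darwin": "/Volumes/A", "linux": "/media/A"}
print(convert_path(r"C:\data\in.xlsx", mapping, "linux"))
print(convert_path("/Volumes/A/data/in.xlsx", mapping, "win32"))
```

In practice the target platform would presumably come from sys.platform, whose values match the keys used in the mapping above.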

xlwings incompatible with Linux

Unfortunately, xlwings couldn't run successfully on Linux because of missing dependencies.
ModuleNotFoundError: No module named 'aem'

29_openpyxl_keep_vba

keep_vba is set to True in Adapter's Excel class (line), which invokes some zipfile methods and raises a ValueError in a new connector PR (link).

To solve this issue, just change keep_vba to False, which is openpyxl's default. As far as I know, we only read non-visual data from Excel when running Python tools (we don't normally have images in input files), so setting keep_vba to True has no advantage over False. I'm wondering if @0xd5dc could set this in the currently open PR, since it's just a one-line change. I can also make another PR if we treat it as a separate problem.

PS: keep_vba controls whether any Visual Basic elements (images and charts) are preserved or not (default). If preserved, they are not editable (source).

win path issue X: vs X:\

import os

dir1 = os.path.join(
    "X:",  # will get converted for a given OS
    "First_Level",
    "Second_Level",
    "Third_Level",
    "input",
)
# On Windows, dir1 is "X:First_Level\Second_Level\Third_Level\input"
dir2 = r"X:\First_Level\Second_Level\Third_Level\input"

dir1 and dir2 are not equal.
Adapter should handle both cases.
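
One possible way to treat both spellings alike is sketched below. `normalize_drive_path` is a hypothetical helper, not an Adapter function. It uses the standard-library ntpath module so that the Windows join behaviour can be reproduced on any OS, and it assumes that a drive-relative path such as "X:First_Level" is meant as an absolute path on that drive.

```python
import ntpath  # Windows path semantics, available on every platform


def normalize_drive_path(path):
    """Insert the missing separator after a drive letter, if needed.

    Hypothetical helper; assumes "X:foo" is intended to mean "X:\\foo".
    """
    drive, tail = ntpath.splitdrive(path)
    if drive and not tail.startswith(("\\", "/")):
        tail = "\\" + tail
    return ntpath.normpath(drive + tail)


dir1 = ntpath.join("X:", "First_Level", "Second_Level", "Third_Level", "input")
dir2 = r"X:\First_Level\Second_Level\Third_Level\input"

print(dir1 == dir2)
print(normalize_drive_path(dir1) == normalize_drive_path(dir2))
```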

Simplify `comm.tools.user_select_file()` method

The comm.tools.user_select_file() method is unnecessarily complex, and it relies on pywin32 on Windows. Ultimately, the functionality of this method can be achieved with the tkinter library (as is already implemented for the OSX platform).

Using tkinter should make this method more robust and maintainable in the future.
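
A tkinter-only version could look roughly like the sketch below. This is an assumption about the desired behaviour, not the existing method: the signature, the default file types, and the `ask` parameter (added here only so the function can be exercised without a display) are all hypothetical.

```python
def user_select_file(
    title="Select input file",
    filetypes=(("Excel files", "*.xlsx"), ("All files", "*.*")),
    ask=None,
):
    """Prompt for a file and return its path, or None if cancelled.

    `ask` is a hypothetical hook: any callable accepting title and
    filetypes keyword arguments; it defaults to the tkinter dialog.
    """
    if ask is not None:
        path = ask(title=title, filetypes=filetypes)
    else:
        # Import lazily so headless environments can still import this module
        import tkinter as tk
        from tkinter import filedialog

        root = tk.Tk()
        root.withdraw()  # hide the empty main window
        try:
            path = filedialog.askopenfilename(title=title, filetypes=filetypes)
        finally:
            root.destroy()
    # The dialog returns an empty value when the user cancels
    return path or None
```

Calling user_select_file() with no arguments opens the native dialog on Windows, OSX, and Linux, provided Python was built with Tk support.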

Reduce Excessive Logging Output

The current logging implementation in our project generates an excessive amount of log messages, leading to log files becoming cluttered and difficult to analyze. This issue aims to address this problem by implementing a more streamlined and efficient logging strategy.
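
One common way to cut down library log output is to raise the default level on the package logger and let callers opt in to detail. The sketch below uses only the standard logging module; the logger name "adapter" and the `configure_logging` helper are assumptions for illustration, not the project's current setup.

```python
import logging


def configure_logging(verbose=False, logger_name="adapter"):
    """Quiet by default: only warnings and errors unless verbose is set.

    Hypothetical helper; the logger name is an assumption.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    if not logger.handlers:
        # A library should not print on its own; the application adds handlers
        logger.addHandler(logging.NullHandler())
    return logger


log = configure_logging()
log.info("loading table 1 of 40")  # suppressed at the default level
log.warning("table 'prices' is empty")  # still passed on to handlers
```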

'DefinedNameDict' object has no 'definedName' issue on Mac only

I got this error report from a Mac user. Did anyone else get this error while loading an xlsx input file on a Mac?

Error messages:

file_path_convert

Add Linux support to convert_file_path.

reformat folder name with timestamp

Use the adapter to generate folder names with a timestamp in two formats: short format (default) and long format.
For example:

long format                                  short format
prefix_2022_07_25-13h_50m                    abbr.ver._220725_1350
product_version_branch_2022_08_02_14h_01m    p100_220802_1401
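
The two formats can be produced with datetime.strftime, as in the sketch below. `timestamped_name` and its parameters are hypothetical names. The long pattern follows the first example row (a dash between date and time); the second row uses an underscore there, so the exact separator is an open detail.

```python
from datetime import datetime


def timestamped_name(prefix, when=None, long_format=False):
    """Build a folder name such as p100_220802_1401 (short, default).

    Hypothetical helper; patterns inferred from the examples above.
    """
    when = when or datetime.now()
    pattern = "%Y_%m_%d-%Hh_%Mm" if long_format else "%y%m%d_%H%M"
    return f"{prefix}_{when.strftime(pattern)}"


print(timestamped_name("p100", datetime(2022, 8, 2, 14, 1)))
print(timestamped_name("prefix", datetime(2022, 7, 25, 13, 50), long_format=True))
```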

Update user_select_file to remove unnecessary dependencies

The user_select_file function will try to import either win32ui, win32con, or tkinter depending on sys.platform. But win32ui and win32con seem to induce a dependency on pywin32==225, which I see included in the setup.py of other tools that use this, and that is undesirable. It appears that using later versions of pywin32 can result in an error on Windows.

I imagine it is possible nowadays to find a single package (native or otherwise) that supports file prompt dialogs across OSX, Windows, and Linux.

Low priority because it works as-is, but if managing dependencies for these imports is a touch cumbersome now, it can only get worse in the future.

convert_network_drive_path should be applied to outpath in run_parameters

The output path can be specified in run_parameters, but it gets interpreted literally. It should also go through convert_network_drive_path so that the same input file read by a person on Windows and a person on OSX will write to the same location.

handled here

read pickle compatibility with pandas 2

Pandas 2.0.2 breaks backward compatibility with Pandas 1.x.
In i_o.py, from_pickle() uses the pickle module to read pickle files, which raises ModuleNotFoundError: No module named 'pandas.core.indexes.numeric' under Pandas 2.
A possible solution is to read pickle files with pd.read_pickle() instead.

refs:
https://stackoverflow.com/questions/75953279/modulenotfounderror-no-module-named-pandas-core-indexes-numeric-using-metaflo

Test input file path - readin for functional test modules

We have some repeating code in the functional tests to handle variation in OS and the Adapter should be able to make the repeating sections obsolete.

This issue is to ensure that:

Adapter IO load can load data from an input file on any OS;
all the error checking and logging currently handled in the functional tests occurs in the Adapter;
corresponding issues are created on our repos to replace the resulting obsolete code in the functional tests of the in-house packages.

Db always downloads all tables

Adapter/adapter/to_python.py

Line 254 in 95362e0

all_dict_of_dfs = self.db.tables2dict(close=True)

The tables2dict method doesn't take table_names, so it requests that all tables be loaded from the database no matter how few were asked for.

Updating this could greatly speed up requests for individual tables (or a small subset of tables) in most cases.

The error for a requested table not existing could be moved down into tables2dict as well.
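
The idea can be sketched with the standard-library sqlite3 module. This is not Adapter's Db class: the function below is a free-standing stand-in that returns lists of rows rather than DataFrames, and its `table_names` parameter is the proposed addition, not an existing argument.

```python
import sqlite3


def tables2dict(conn, table_names=None):
    """Return {table_name: rows}, reading only the requested tables.

    Stand-in sketch; the real method would return pandas DataFrames.
    """
    available = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table_names is None:
        table_names = sorted(available)
    # The missing-table check lives here instead of in the caller
    missing = [name for name in table_names if name not in available]
    if missing:
        raise KeyError(f"Tables not found in database: {missing}")
    # Names were checked against sqlite_master before being interpolated
    return {
        name: conn.execute(f'SELECT * FROM "{name}"').fetchall()
        for name in table_names
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (year INTEGER, value REAL)")
conn.execute("CREATE TABLE stock (year INTEGER, units INTEGER)")
conn.execute("INSERT INTO prices VALUES (2022, 1.5)")
subset = tables2dict(conn, table_names=["prices"])  # stock is never queried
```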

Add flag to omit any writeout

Update sqlalchemy and/or pandas dependency

The current setup.py fixes the sqlalchemy requirement at 1.4.29, though for versions of pandas ≥2.2.0 (which is not restricted in setup.py), this will result in some adapter dependents no longer working.

It seems to make sense to commit to having dependents of adapter become compatible with pandas ≥2.2.0, and then commit to updating adapter to require pandas≥2.2.0 and newer sqlalchemy.

Related to this issue, which is a matter of newer versions of pandas requiring newer versions of openpyxl.

update_excel

The latest Openpyxl (v3.1.0, released on 2023.01.31) leads to the error

AttributeError: 'DefinedNameDict' object has no attribute 'definedName'

when loading an Excel file with named ranges at line 77 in adapter, so an update to Adapter for compatibility with newer Openpyxl may be needed. An easy solution is to change line 77 to

all_input_ranges = {object_range for object_range in self.wb.defined_names}

However, this only works for those who use v3.1.0, and the change would break backward compatibility for users with older Openpyxl versions.

Also, it may be good to check whether there are any named ranges or tables in the input file when users specify kind='ranges' or kind='tables'.

A substitute for xlwings

Because Adapter uses xlwings, it keeps asking for permission to open Excel files to extract table ranges, which can be distracting when there are multiple input files.

Adapter/adapter/comm/excel.py

Line 6 in 95362e0

import xlwings as xw

A potential substitute function for handling Excel table ranges without actually opening files is data_frame_from_xlsx (https://stackoverflow.com/questions/20486453/reading-an-excel-named-range-into-a-pandas-dataframe).

Logger and note

Please address comments I provided here:

#11 (review)

create IO for pickles

This feature enables reading a pickled dict of DataFrames.

Add functionality to prompt user to select an input file and return path to code

This is useful for many of our runscripts.

All keys are being checked for in the pre-existing keys, instead of just the table names that need to be read

In the load function of class Db, duplicates of keys are being checked for.
https://github.com/LBNL-ETA/Adapter/blob/master/adapter/to_python.py#L283

However, the docstring indicates that only table_names will be read from the database. Therefore, only duplicates of table_names should be checked for.

For reference, this issue didn't exist in v1.2.1.
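
The intended check can be sketched as follows. `conflicting_tables` and its arguments are hypothetical names for illustration; the actual logic in to_python.py may be organized differently.

```python
def conflicting_tables(existing_keys, db_table_names, table_names=None):
    """Return requested tables whose names are already used as keys.

    Hypothetical helper. When table_names is given, only those tables
    will be read, so only they can collide with pre-existing keys.
    """
    to_read = db_table_names if table_names is None else table_names
    return sorted(set(to_read) & set(existing_keys))


existing = {"prices", "stock"}
in_db = ["prices", "stock", "lifetimes"]

# Only "lifetimes" is requested, so the other tables are not a conflict
print(conflicting_tables(existing, in_db, table_names=["lifetimes"]))
# Without table_names every table is read, so both duplicates are reported
print(conflicting_tables(existing, in_db))
```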

Remove "Unnamed: 0" column when reading from files

Implement solution as here: https://stackoverflow.com/questions/36519086/how-to-get-rid-of-unnamed-0-column-in-a-pandas-dataframe-read-in-from-csv-fil

Update openpyxl dependency

The current setup.py fixes the openpyxl dependency at 3.0.9 because later versions break one very specific segment of code in adapter.to_python here.

I think this could be fixed with:

if hasattr(self.wb.defined_names, "definedName"):
    # This case is for openpyxl <3.1.0
    all_input_ranges = {object_range.name for object_range in self.wb.defined_names.definedName}
else:
    # This case is for openpyxl ≥3.1.0
    all_input_ranges = set(self.wb.defined_names.keys())

This change got around the error in my local environment, but it should be tried and run through the adapter tests. If all seems good, then I'd advise just getting rid of the version requirement on openpyxl.
