pyxlsb2's Introduction

pyxlsb2

pyxlsb2 (a variant of pyxlsb - https://github.com/wwwiiilll/pyxlsb) is an Excel 2007+ Binary Workbook (xlsb) parser written in Python.

pyxlsb2 offers the following improvements and changes compared with pyxlsb:

  1. By default, keeps all data in memory instead of creating temporary files. This speeds up processing and avoids modifying the local filesystem.
  2. Relies on both "xl\workbook.bin" and "xl\_rels\workbook.bin.rels" to locate boundsheets. As a result, it can load all worksheets as well as all macrosheets.
  3. Extracts macro formulas:
  • accurately shows the formulas
  • supports A1 addressing
  • supports external addressing (partially implemented)
  4. Extracts defined names such as auto_open
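
An .xlsb file is a ZIP package, so the two parts named in item 2 can be checked with the standard library alone. The following is a minimal sketch, not part of pyxlsb2: missing_workbook_parts is a hypothetical helper, and the part names use forward slashes because that is how ZIP archives store paths.

```python
import io
import zipfile

# Parts that item 2 says are needed to locate the sheets.
REQUIRED_PARTS = ("xl/workbook.bin", "xl/_rels/workbook.bin.rels")


def missing_workbook_parts(xlsb_file):
    """Return the required parts that are absent from the package.

    xlsb_file may be a path or a binary file-like object.
    Hypothetical helper for illustration; not a pyxlsb2 function.
    """
    with zipfile.ZipFile(xlsb_file) as archive:
        names = set(archive.namelist())
    return [part for part in REQUIRED_PARTS if part not in names]


# Demo on an in-memory package that contains both (empty) parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for part in REQUIRED_PARTS:
        archive.writestr(part, b"")
buffer.seek(0)
demo_missing = missing_workbook_parts(buffer)
```

An empty result means both parts are present. It says nothing about whether their contents are well formed.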

Install

  1. Installing the whl file

Download the .whl file from the release section

pip install -U [path to whl file]

  2. Installing the latest development version

Using pip

pip install -U https://github.com/DissectMalware/pyxlsb2/archive/master.zip

Or download the latest version

wget https://github.com/DissectMalware/pyxlsb2/archive/master.zip

Extract the zip file and go to the extracted directory

python setup.py install --user

Usage

The module exposes an open_workbook(name) function (similar to xlrd and openpyxl) for opening XLSB files. It returns the Workbook object representing the file.

from pyxlsb2 import open_workbook
with open_workbook('Book1.xlsb') as wb:
    pass  # Do stuff with wb

The Workbook object exposes the get_sheet_by_index(idx) and get_sheet_by_name(name) methods to retrieve Worksheet instances.

# Using the sheet index (0-based, unlike VBA)
with wb.get_sheet_by_index(0) as sheet:
    pass  # Do stuff with sheet

# Using the sheet name
with wb.get_sheet_by_name('Sheet1') as sheet:
    pass  # Do stuff with sheet

A sheets property containing the sheet names is available on the Workbook instance.

The rows() method will hand out an iterator to read the worksheet rows. The Worksheet object is also directly iterable and is equivalent to calling rows().

# You can use .rows(sparse=False) to include empty rows
for row in sheet.rows():
    print(row)
# [Cell(r=0, c=0, v='TEXT'), Cell(r=0, c=1, v=42.1337)]

NOTE: Iterating the same Worksheet instance multiple times in parallel (nested for loops, for instance) will yield unexpected results; retrieve a separate instance for each loop instead.
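
The note above is consistent with an object whose iterations all share one read position. The sketch below illustrates that failure mode with a hypothetical stand-in class, SingleCursorSheet. Whether the Worksheet class works this way internally is an assumption, not something confirmed here.

```python
class SingleCursorSheet:
    """Hypothetical stand-in: every iteration shares one underlying cursor."""

    def __init__(self, rows):
        self._cursor = iter(rows)

    def __iter__(self):
        # Returns the same iterator each time, so nested loops compete
        # for the same rows.
        return self._cursor


rows = ["r0", "r1"]

# Nested iteration over ONE instance: the inner loop consumes rows that
# the outer loop never gets to see.
shared = SingleCursorSheet(rows)
broken = [(a, b) for a in shared for b in shared]

# A fresh instance per loop gives the full cross product.
independent = [
    (a, b)
    for a in SingleCursorSheet(rows)
    for b in SingleCursorSheet(rows)
]
```

Here broken contains a single pair while independent contains all four, which is why retrieving a new instance per loop avoids the problem.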

Note that dates will appear as floats. You must use the convert_date(date) method from the corresponding Workbook instance to turn them into datetime.

print(wb.convert_date(41235.45578))
# datetime.datetime(2012, 11, 22, 10, 56, 19)
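
For intuition about what such a float means: the integer part counts days and the fraction is the time of day. The sketch below assumes the 1900 date system with an effective epoch of 1899-12-30, which holds for serial values from 1 March 1900 onward. serial_to_datetime is a hypothetical helper and not necessarily how convert_date is implemented.

```python
from datetime import datetime, timedelta

# Effective epoch of the 1900 date system (an assumption of this sketch).
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial):
    """Convert an Excel serial date (days, with fractional time) to datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# Drop the sub-second part to compare with the value shown above.
converted = serial_to_datetime(41235.45578).replace(microsecond=0)
```

Workbooks that use the 1904 date system would need a different epoch, so prefer wb.convert_date when a Workbook instance is available.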

Example

Converting a workbook to CSV:

import csv
from pyxlsb2 import open_workbook

with open_workbook('Book1.xlsb') as wb:
    for name in wb.sheets:
        with wb.get_sheet_by_name(name) as sheet:
            # newline='' lets the csv module control line endings (Python 3)
            with open(name + '.csv', 'w', newline='') as f:
                writer = csv.writer(f)
                for row in sheet.rows():
                    writer.writerow([c.v for c in row])

Limitations

Non-exhaustive list of things that are currently not supported:

  • Style and formatting WIP
  • Rich text cells (formatting is lost, but getting the text works)
  • Encrypted (password protected) workbooks
  • Comments and other annotations
  • Writing (out of scope)

pyxlsb2's People

Contributors

cccs-jh, dissectmalware, oleglpts, wmetcalf

pyxlsb2's Issues

Licensing

Hi!

I was wondering whether this project could be used with pandas instead of the original pyxlsb. However, pandas is rather strict about licensing.

The original pyxlsb was released under LGPLv3 (which was fine). It seems derived work can't be licensed under Apache 2.0 according to this. You could consider relicensing pyxlsb2 under LGPLv3 so it would be compliant with licenses and usable.

Missing argument to __str__ method for Formula

Hi,

There's a small bug in pyxlsb2/formula.py:Formula.__str__

It calls the stringify method, which expects a workbook as an argument, without passing any arguments.
As far as I could tell, this code is not used, but if it were, it would crash because the required argument is missing, so I thought I would bring it to your attention.

Thanks!
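
The failure mode described in this report can be reproduced with a simplified stand-in. This is a sketch only: the classes below are hypothetical and are not the code in pyxlsb2/formula.py, and passing None as the workbook is just one possible fix, assumed here for illustration.

```python
class Formula:
    """Simplified stand-in for illustration; not the actual pyxlsb2 class."""

    def __init__(self, tokens):
        self._tokens = tokens

    def stringify(self, workbook):
        # Requires a workbook argument, as described in the report.
        return "".join(str(token) for token in self._tokens)

    def __str__(self):
        # Bug pattern: the required argument is not passed.
        return self.stringify()


class PatchedFormula(Formula):
    def __str__(self):
        # One possible fix (an assumption): pass an explicit placeholder.
        return self.stringify(None)


try:
    str(Formula(["A1", "+", "1"]))
    failure = None
except TypeError as exc:
    failure = exc  # stringify() is missing its required argument

patched_text = str(PatchedFormula(["A1", "+", "1"]))
```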

IndexError: list index out of range

Hello,

I want to use pyxlsb2 to read .xlsb files. I found some files that don't open, while others work.
The failing files open with pyxlsb, but not with pyxlsb2.

I have two files:

  1. x.xlsb (works)
  2. y.xlsb (doesn't work)

Source code:

from pyxlsb2 import open_workbook

with open_workbook("y.xlsb") as wb:
    for sheet in wb.sheets:
        print(sheet)

I can't find any difference between the files. Both are .xlsb files. I need to see whether the files have any hidden worksheets. As far as I know, this is only possible with pyxlsb2, not with pyxlsb.

Best regards

Patrick

IndexError: list index out of range

Hello,
I am reading an .xlsb file. I found some files that don't open, while others work.

The files that don't work throw this error:
IndexError: list index out of range

from pyxlsb2 import open_workbook
xlsb_file_path= "sample_file.xlsb"
with open_workbook(xlsb_file_path) as wb:
    for sheet in wb.sheets:
        print(sheet.name)

This is what ChatGPT suggests:

The error you're encountering is an IndexError with the message "list index out of range." This typically occurs when you're trying to access an element in a list using an index that is outside the valid range of indices for that list. In this specific case, the error is happening in the pyxlsb2 library code while parsing the formula of a cell.

The relevant part of the traceback is:

File "/home/ujwala/.venvs/ingest_venv/lib/python3.9/site-packages/pyxlsb2/formula.py", line 16, in stringify
    return '' if not tokens else tokens.pop().stringify(tokens, workbook)

IndexError: list index out of range

Here's what's happening:

The tokens list is being accessed with tokens.pop().
The pop() method removes and returns the last item from the list.
The stringify method is then called on the popped item.
However, it seems that the tokens list is empty when pop() is called, resulting in an IndexError. The IndexError occurs because there is no item to pop from an empty list.

To fix this issue:

Check if tokens is not empty before calling pop().

Update the return statement in the stringify method of the pyxlsb2/formula.py file to include a check for an empty list:

return '' if not tokens else tokens.pop().stringify(tokens, workbook)

becomes:

return '' if not tokens else tokens.pop().stringify(tokens, workbook) if tokens else ''

This modification ensures that pop() is only called when tokens is not empty.

Update pyxlsb2 library to the latest version.

Ensure that you are using the latest version of the pyxlsb2 library, as this issue may have been addressed in a newer release. You can check the library's official repository for updates.

After making these changes, try running your code again and see if the issue persists. If the problem persists or if there's a specific version of pyxlsb2 that you're using, you might want to check the library's GitHub repository for open issues or consider reporting this issue to the library maintainers.
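
A caveat about the suggestion quoted above: the original return statement already checks for an empty list before calling pop(), so the extra guard does not change its behavior. One plausible explanation, sketched below with hypothetical token classes that are not pyxlsb2's real ones, is that the IndexError comes from a nested stringify call in which an operator token pops operands that are missing from the list.

```python
class Operand:
    """Hypothetical leaf token holding literal text."""

    def __init__(self, text):
        self.text = text

    def stringify(self, tokens, workbook):
        return self.text


class Add:
    """Hypothetical binary operator token that pops its two operands."""

    def stringify(self, tokens, workbook):
        right = tokens.pop().stringify(tokens, workbook)
        left = tokens.pop().stringify(tokens, workbook)  # fails if missing
        return left + "+" + right


def stringify(tokens, workbook=None):
    tokens = list(tokens)
    # Same shape as the quoted line: the top-level pop is already guarded.
    return '' if not tokens else tokens.pop().stringify(tokens, workbook)


def stringify_with_extra_guard(tokens, workbook=None):
    tokens = list(tokens)
    # The suggested change: behaves the same as stringify() above.
    return '' if not tokens else tokens.pop().stringify(tokens, workbook) if tokens else ''


def raises_index_error(func, tokens):
    try:
        func(tokens)
    except IndexError:
        return True
    return False


well_formed = [Operand("1"), Operand("2"), Add()]
truncated = [Operand("1"), Add()]  # one operand is missing
```

With the truncated list, both functions raise IndexError inside Add.stringify, so a real fix would more likely need to address how the token list is parsed or how operators handle missing operands.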

Installation Issue

When I try to install the package with

pip install enum34 pyxlsb2

It fails:

Collecting enum34==1.1.10
  Downloading enum34-1.1.10-py2-none-any.whl (11 kB)
Collecting pyxlsb2==0.0.2
  Downloading pyxlsb2-0.0.2.tar.gz (31 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-CQMm6n/pyxlsb2/setup.py'"'"'; __file__='"'"'/tmp/pip-install-CQMm6n/pyxlsb2/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-6J3CgU
         cwd: /tmp/pip-install-CQMm6n/pyxlsb2/
    Complete output (13 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-CQMm6n/pyxlsb2/setup.py", line 3, in <module>
        from pyxlsb2 import __version__
      File "pyxlsb2/__init__.py", line 3, in <module>
        from .workbook import Workbook
      File "pyxlsb2/workbook.py", line 7, in <module>
        from .recordreader import RecordReader
      File "pyxlsb2/recordreader.py", line 4, in <module>
        from . import records as recs
      File "pyxlsb2/records.py", line 1, in <module>
        from enum import Enum
    ImportError: No module named enum
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

But when I install the two packages separately, it works:

pip install enum34
pip install pyxlsb2
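
The traceback shows setup.py importing the package to read __version__, which pulls in enum before enum34 has been installed when both are requested in one command. A common way to avoid this kind of build-time import, sketched here as an assumption rather than the project's actual fix, is to parse the version string from the source text. read_version is a hypothetical helper.

```python
import re

# Matches a line such as: __version__ = '0.0.2'
VERSION_PATTERN = re.compile(
    r'''^__version__\s*=\s*["']([^"']+)["']''', re.MULTILINE
)


def read_version(init_source):
    """Extract __version__ from module source text without importing it."""
    match = VERSION_PATTERN.search(init_source)
    if match is None:
        raise RuntimeError("__version__ not found")
    return match.group(1)


# Demo on a small source string; a setup.py would read the file contents.
sample_source = "from enum import Enum\n__version__ = '0.0.2'\n"
version = read_version(sample_source)
```

Because the source is only read as text, the enum import inside it is never executed during installation.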

Handling shared string

Handle shared string (BrtCellIsst)

38e01ea82f15a2dcd6905daf98e2f51886e1611ccc0dfc0e76a933b0b6db719d
