Comments (2)

JosePVB commented on August 30, 2024

First of all, let me say that this is a great library! Parsing table data out of PDFs is very hard and this library makes this much more accessible.

I'd like to express my interest in this feature as well. For the last page of the following PDF, http://web2.gov.mb.ca/bills/40-5/billstatus.en.pdf, I tried the table_regions argument suggested in atlanhq/camelot#357, but I hit the IndexError mentioned in that same issue, because camelot identified two tables within the specified table_regions.

Here is the exact call:

>>> columns = ['93,242,305,350,395,468,517,566,629,693']
>>> tables = camelot.read_pdf('camb-40-5.pdf', flavor='stream', edge_tol=200, pages='5', columns=columns, table_regions=[(40,540,730,55)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jose/venv/camelot/local/lib/python2.7/site-packages/camelot/io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "/home/jose/venv/camelot/local/lib/python2.7/site-packages/camelot/handlers.py", line 162, in parse
    layout_kwargs=layout_kwargs)
  File "/home/jose/venv/camelot/local/lib/python2.7/site-packages/camelot/parsers/stream.py", line 425, in extract_tables
    cols, rows = self._generate_columns_and_rows(table_idx, tk)
  File "/home/jose/venv/camelot/local/lib/python2.7/site-packages/camelot/parsers/stream.py", line 321, in _generate_columns_and_rows
    if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range
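
My reading of the traceback is that the check on line 321 of stream.py indexes the columns list by table index, so a one-entry list fails as soon as a second table is detected. Below is a minimal pure-Python sketch of that pattern. It does not call camelot; `columns_for` and `probe` are hypothetical stand-ins written for illustration, and only the condition itself is copied from the traceback.

```python
# One columns entry, exactly as passed in the call above.
columns = ['93,242,305,350,395,468,517,566,629,693']


def columns_for(table_idx, columns):
    """Hypothetical stand-in mirroring the condition shown in the traceback."""
    if columns is not None and columns[table_idx] != "":
        return columns[table_idx]
    return None


def probe(n_tables, columns):
    """Run the check once per detected table and record what happens."""
    results = []
    for idx in range(n_tables):
        try:
            results.append(columns_for(idx, columns))
        except IndexError as exc:
            results.append("IndexError: " + str(exc))
    return results


# Two detected tables, one columns entry: the second lookup fails.
for idx, outcome in enumerate(probe(2, columns)):
    print(idx, outcome)
```

If this reading is right, the failure depends only on the number of tables detected exceeding the number of entries in columns, not on the coordinates themselves.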

Here is my OS/Python information:

>>> import platform; print(platform.platform())
Linux-4.15.0-70-generic-x86_64-with-Ubuntu-18.04-bionic
>>> import sys; print('Python', sys.version)
('Python', '2.7.17 (default, Nov  7 2019, 10:07:09) \n[GCC 7.4.0]')
>>> import numpy; print('NumPy', numpy.__version__)
('NumPy', '1.16.6')
>>> import cv2; print('OpenCV', cv2.__version__)
('OpenCV', '4.1.2')
>>> import camelot; print('Camelot', camelot.__version__)
('Camelot', '0.7.2')

sh-ankar commented on August 30, 2024

If I pass the "columns=" parameter, I get exactly the same error as above.
If I remove "columns=", I get no errors, but the resulting TableList contains duplicated subsets of the data. Here is what I mean:

>>> table = camelot.read_pdf("file20.pdf", password="xxxx", pages="20",
...               flavor='stream', table_regions=['21,761,575,55'],
...               columns=['74,345,412,470,520'])
>>> table
<TableList n=3>
>>> table[0]
<Table shape=(25, 7)>
>>> table[1]
<Table shape=(28, 6)>
>>> table[2]
<Table shape=(66, 7)>

Note that the complete page is correctly extracted into table[2]. table[0] and table[1] are subsets of it, hence I am getting duplicate data.

Interestingly, on another page in the same doc I get a similar result with 2 tables, but in that case table[0] has the full page and table[1] has a subset, so it's not consistent either.
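
As a stopgap, since the full page does show up as one of the tables in both cases, one option is to keep only the table with the most rows. This is a rough sketch under that assumption, not a fix for the underlying behaviour. `FakeTable` and `largest_table` are hypothetical names used here so the snippet runs without the PDF; the only thing it relies on is the `shape` attribute shown in the output above.

```python
from collections import namedtuple

# Hypothetical stand-in for a camelot Table; only `shape` is used.
FakeTable = namedtuple("FakeTable", ["name", "shape"])


def largest_table(tables):
    """Return the table with the most rows (shape[0])."""
    return max(tables, key=lambda t: t.shape[0])


# Shapes taken from the page 20 output above.
page_20 = [
    FakeTable("table[0]", (25, 7)),
    FakeTable("table[1]", (28, 6)),
    FakeTable("table[2]", (66, 7)),
]
print(largest_table(page_20).name)
```

This would discard a genuine second table if a page ever contained one, so it only makes sense for pages known to hold a single table.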

The problem seems to be that the same rows are being parsed multiple times. Is there a parameter I am missing?
The PDF source file I am using is a report from the CAMS mutual fund detailed statement, so I cannot share it :(

BTW, this works awesome for the rest of the pages ...
