artifexsoftware / pdf2docx

Open source Python library for converting PDF to DOCX.

Home Page: https://pdf2docx.readthedocs.io

License: GNU Affero General Public License v3.0

Python 99.51% Makefile 0.49%

pdf-converter docx pdf-to-word extract-table pymupdf

pdf2docx's Introduction

pdf2docx

Extract data from PDF with PyMuPDF, e.g. text, images and drawings
Parse layout with rules, e.g. sections, paragraphs, images and tables
Generate docx with python-docx

Features

Parse and re-create page layout
- page margin
- section and column (1 or 2 columns only)
- page header and footer [TODO]
Parse and re-create paragraph
- OCR text [TODO]
- text in horizontal/vertical direction: from left to right, from bottom to top
- font style, e.g. font name, size, weight, italic and color
- text format, e.g. highlight, underline, strike-through
- list style [TODO]
- external hyper link
- paragraph horizontal alignment (left/right/center/justify) and vertical spacing
Parse and re-create image
- in-line image
- image in Gray/RGB/CMYK mode
- transparent image
- floating image, i.e. picture behind text
Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- vertical direction cell
- table with partly hidden borders
- nested tables
Parsing pages with multi-processing

It can also be used as a tool to extract table contents, since both table content and format/style are parsed.

Limitations

Text-based PDF file
Left to right language
Normal reading direction, no word transformation / rotation
Rule-based method can't 100% convert the PDF layout

pdf2docx's Issues

Converted document has no text.

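import base64

from django.conf import settings
from pdf2docx import parse
from rest_framework.decorators import api_view
from rest_framework.response import Response


@api_view(['POST'])
def pdfToDocx(request):
    # assumes the payload is a data URL: keep only the base64 part after the comma
    base64String = request.data[0]['data'].split(',')[1]
    filename = request.data[0]['file']['path'].split('.')[0]

    pdf_file = f'{settings.BASE_DIR}\\static\\files\\{filename}.pdf'
    docx_file = f'{settings.BASE_DIR}\\static\\files\\{filename}.docx'

    # write the decoded upload to disk before converting it
    with open(pdf_file, 'wb') as f:
        f.write(base64.b64decode(base64String))

    parse(pdf_file, docx_file, start=0, end=None)
    return Response('OK')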

Can you please help me with this?

AttributeError: module 'pdf2docx.common.utils' has no attribute 'reset_paragraph_format'

When I convert PDF to docx, I have this problem.

I can't reproduce a style like this

Hi, I would like to convert a PDF file like this one, but I can't reproduce its style. Can you give me some suggestions?
scheda_USICCD.pdf
Thanks
E

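Some text was not converted, and some images are misplaced.

I can't upload the file. Part of the text in the middle of the PDF was not converted; it was turned into an image. Some images are also in the wrong position.

Here the text was converted into an image.

Here the images are misplaced.
Could you leave some contact information? There are other projects I would like to discuss.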

import parse: start = 1

Is it possible to make start = 1, rather than 0, refer to the first page? I mean as a global setting.

ERROR: Could not consume arg: test.pdf

ERROR: Could not consume arg: test.pdf
Usage: pdf2docx
available commands: convert | table

For detailed information on this command, run:
pdf2docx --help

IndexError: list index out of range

Hello.

I tried to use pdf2docx.
When I try to convert a PDF in Korean to docx, I get an error.

Here is the error message.

Traceback (most recent call last):
File "E:/insu_mrc/insu_terms_comparision/pdftohtmltest.py", line 45, in <module>
print(cv.parse(page))
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\converter.py", line 112, in parse
self.init(page).parse(self._debug_kwargs)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\layout\Layout.py", line 153, in parse
self.parse_implicit_tables(kwargs)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\common\utils.py", line 148, in inner
res = func(*args, **kwargs)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\layout\Layout.py", line 278, in parse_implicit_tables
tables = self._tables_constructor.implicit_tables(X0, X1)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\table\TablesConstructor.py", line 85, in implicit_tables
table_rects = self._implicit_borders(table_bboxes, X0, X1)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\table\TablesConstructor.py", line 510, in _implicit_borders
inner_borders = self._borders_from_bboxes(rects, border_bbox)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\table\TablesConstructor.py", line 547, in _borders_from_bboxes
cols_rects = self._column_borders_from_bboxes(rects)
File "C:\Users\19021003\AppData\Local\Programs\Python\Python37\lib\site-packages\pdf2docx\table\TablesConstructor.py", line 604, in _column_borders_from_bboxes
cols_rects[-1].append(rect)
IndexError: list index out of range

If that is okay with you, I would like to send you some files for debugging.
Could you let me know your e-mail address?

raise ValueError("no valid image found")

Processing PDF with svg images

PDF to DOCX conversion!!

Can we convert all the pages in a PDF file without specifying the end page in the parse parameters?

Ignore page due to error: 'TableBlock' object has no attribute 'lines'

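Hello, I ran into this error while parsing: Ignore page due to error: 'TableBlock' object has no attribute 'lines'. I just tried it and found that 0.5.0 can parse the file, although the result is not especially good (which is no big deal), but version 0.5.1, which I am using now, raises the error above. I wonder whether some code changed between versions and introduced this problem.

I have sent the test file to your mailbox.

Thanks!

Hello, I am trying to convert a report like this one. Because it has a logo in the top-left corner, a slogan in the top-right corner, a slogan in the bottom-left corner and a page number in the bottom-right corner, the program recognizes it as a single block with 2 rows. It also treats all of the middle content (as shown in the link) as an inline image inserted into the table, so the image cannot be displayed properly.
I tried extracting the image and creating a separate block for it, but the document produced by cv.restore + make_docx is blank.
I would like to ask what causes this kind of recognition failure. If it cannot be solved in the short term, I would like to take the image out and display it as a single block (the final json should be block (header table) + block (image) + block (footer table)). How can I do this?

Thanks!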

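[Bug] The program raises an error when a page in the PDF is empty.

Dear pdf2docx developer, first of all thank you very much for your work on PDF to docx conversion and for open-sourcing this library. I tried it out and the parsing results are very good!
However, the program currently has a bug: when a page in the PDF is empty, the program fails with an error.
For example, when a page's content is:

The error message is: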

Traceback (most recent call last):
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\common\Collection.py", line 24, in __getitem__
    instances = self._instances[idx]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Projects/GiteeProjects/doc_parser/tests/pdf2docx_test.py", line 11, in <module>
    parse(pdf_path, docx_path)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\main.py", line 32, in parse
    cv.parse(page).make_page()
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\converter.py", line 131, in make_page
    self._layout.make_page(self._doc_docx)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\layout\Layout.py", line 234, in make_page
    if self.blocks[-1].is_table_block():
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\common\Collection.py", line 27, in __getitem__
    raise IndexError(msg)
IndexError: Collection index -1 out of range

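My understanding is that, at line 235 of pdf2docx/Layout.py at master · dothinking/pdf2docx, in

if self.blocks[-1].is_table_block():

blocks is empty at this point, and accessing an item directly with index -1 causes the error.
I hope you can find time to fix this bug. Thanks~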
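
The guard suggested above can be sketched as follows. This is a minimal, self-contained illustration and not the library's actual code: `Block` and `last_block_is_table` are hypothetical stand-ins for the objects used in Layout.py.

```python
class Block:
    """Hypothetical stand-in for a pdf2docx layout block."""

    def __init__(self, is_table=False):
        self._is_table = is_table

    def is_table_block(self):
        return self._is_table


def last_block_is_table(blocks):
    """Return True only if the page has blocks and the last one is a table.

    Checking `blocks` first avoids indexing [-1] on an empty page.
    """
    return bool(blocks) and blocks[-1].is_table_block()
```

With this check an empty page simply yields False instead of raising IndexError.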

Arabic texts are saved in the document from left to right instead of right-left

pdf2docx writes Arabic text (or text in any RTL language) left-to-right in the output docx. My input text is in the correct order, i.e. right-to-left, but when I convert this PDF to docx, the whole string is reversed.

My PDF has these words:
"دبي: طرح مجموعة جديدة من الإجراءات الاحترازية في دبي"

docx shows:
يبد يف ةيزارتحالا تاءارجإلا نم ةديدج ةعومجم حرط :يبد

Error when start=0

Hello,

First, thanks for your amazing work.
I have a problem with the lib when I try to use it in a Python script.

I am trying to convert a PDF that has 14 pages and I can't convert the first one. When I use the parameter start=0 for the parse() method, I get this error:

>>> parse(pdf_file, docx_file, start=0, end=13)
Processing 0/14...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/luca/.local/share/virtualenvs/taptouche_account_creator-2TJMnHyn/lib/python3.8/site-packages/pdf2docx/main.py", line 35, in parse
    layout = pdf.parse(page)
  File "/home/luca/.local/share/virtualenvs/taptouche_account_creator-2TJMnHyn/lib/python3.8/site-packages/pdf2docx/reader.py", line 126, in parse
    layout = self.layout(page)
  File "/home/luca/.local/share/virtualenvs/taptouche_account_creator-2TJMnHyn/lib/python3.8/site-packages/pdf2docx/reader.py", line 108, in layout
    raw_layout['rects'] = self.rects(page)
  File "/home/luca/.local/share/virtualenvs/taptouche_account_creator-2TJMnHyn/lib/python3.8/site-packages/pdf2docx/reader.py", line 89, in rects
    rects = pdf_shape.rects_from_source(page_content, height)
  File "/home/luca/.local/share/virtualenvs/taptouche_account_creator-2TJMnHyn/lib/python3.8/site-packages/pdf2docx/pdf_shape.py", line 194, in rects_from_source
    if  lines[i+1] in ('f', 'F', 'f*'):
IndexError: list index out of range

If I use parse() with start=1, it works, but I don't get the first page of the document. I tried to use parse() without parameters and I got the same error.

Am I using it incorrectly? Am I missing something?
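
The traceback above ends at a lookahead, `lines[i+1]`, which fails when `i` is already the last index. A bounds-checked version of such a lookahead might look like the sketch below. `is_followed_by_fill` and `FILL_OPERATORS` are hypothetical names, and whether a missing bounds check is the actual root cause in pdf_shape.py is an assumption.

```python
FILL_OPERATORS = ('f', 'F', 'f*')


def is_followed_by_fill(lines, i):
    """Return True if the token after position i is a fill operator.

    The bounds check makes the lookahead safe when i is the last index.
    """
    return i + 1 < len(lines) and lines[i + 1] in FILL_OPERATORS
```

Because `and` short-circuits, `lines[i + 1]` is only evaluated when that index exists.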

Installation via Pip?

Hi @dothinking
First of all, thanks so much for making this library! How can we install it? Is it possible via pip?

Garbled Chinese characters?

Chinese text comes out garbled. How can this be solved?

What causes the error AttributeError: 'Page' object has no attribute 'getDrawings'?

File "D:/Pycharmproject/pdf2docx/test/local_test.py", line 180, in <module>
local_test(filename, compare=False, make_test_case=False)
File "D:/Pycharmproject/pdf2docx/test/local_test.py", line 151, in local_test
cv.debug_page(0, docx_file)
File "D:\Pycharmproject\pdf2docx\pdf2docx\converter.py", line 75, in debug_page
layouts = self.make_docx(docx_filename, pages=[i], config=config)
File "D:\Pycharmproject\pdf2docx\pdf2docx\converter.py", line 107, in make_docx
layouts = self._make_docx(docx_file, page_indexes, config)
File "D:\Pycharmproject\pdf2docx\pdf2docx\converter.py", line 145, in _make_docx
layout = self.parse(self.doc_pdf[i], config)
File "D:\Pycharmproject\pdf2docx\pdf2docx\converter.py", line 53, in parse
return Layout(page, config).parse()
File "D:\Pycharmproject\pdf2docx\pdf2docx\layout\Layout.py", line 61, in __init__
data = self.__source_from_page(parent) if parent else {}
File "D:\Pycharmproject\pdf2docx\pdf2docx\layout\Layout.py", line 271, in __source_from_page
self.__preprocess_shapes(page, raw_layout)
File "D:\Pycharmproject\pdf2docx\pdf2docx\common\share.py", line 197, in inner
objects = func(*args, **kwargs)
File "D:\Pycharmproject\pdf2docx\pdf2docx\layout\Layout.py", line 329, in __preprocess_shapes
raw_paths = page.getDrawings()
AttributeError: 'Page' object has no attribute 'getDrawings'
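
The error says that the `Page` object of the installed PyMuPDF does not provide `getDrawings`, which usually points to a version mismatch between pdf2docx and PyMuPDF (an assumption, since the issue does not state the installed versions). The sketch below shows a defensive lookup. `get_page_drawings` is a hypothetical helper, and `get_drawings` is the snake_case name used by later PyMuPDF releases.

```python
def get_page_drawings(page):
    """Call whichever drawings method the given page object provides."""
    for name in ('get_drawings', 'getDrawings'):
        method = getattr(page, name, None)
        if callable(method):
            return method()
    raise AttributeError(
        "Page has neither get_drawings nor getDrawings; "
        "the installed PyMuPDF may not match this pdf2docx version"
    )
```

If neither method exists, upgrading PyMuPDF to a release that matches the pdf2docx version in use is the more likely fix than patching the call site.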

Extra line after table

Hi team,
I have a problem parsing my PDF report.
An extra line is always added after each table. Is there any way I can fix it?

Thank you!

Original pdf and result docx :
myreport.pdf
myreport.docx

Processing images

I am clear on the limitations of pdf2docx, but I don't understand this one very well:

No floating images

What exactly counts as a floating image? My doubt arises because I processed two files with images, one with more images than the other. One of them throws an error and the other does not. I attach the files and the error information.

File that throws error: ValueError: no valid image found
test.pdf

File processed correctly
test2.pdf

I assume the error occurs on page 1 because the output console shows me:

Processing Pages: 1/16...Traceback (most recent call last):

So apparently that's as far as processing goes.

Beyond simply catching the exception when there are errors with the images, something like this:

from pdf2docx import parse

try:
    parse('Input.pdf', 'Output.docx')
except ValueError:
    print("Image error, this file can't be processed.")

I would like to know exactly how to identify a floating image in a PDF.

ZeroDivisionError: float division by zero

Hello, I used the code from your documentation, but I think there is a problem in your module.
I get this error again and again.
I get this error when the script get to my 3 pdf page
Processing Pages: 2/3...Traceback (most recent call last):
  File "m.py", line 8, in <module>
    parse(pdf_file, docx_file, start=0, end=3)
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/converter.py", line 117, in make_docx
    self._make_docx(page_indexes)
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/converter.py", line 191, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/converter.py", line 171, in initialize
    images, paths = self._paths_extractor.extract_paths(page)
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/shape/Path.py", line 73, in extract_paths
    if largest.contains_curve(constants.FACTOR_A_FEW):
  File "/home/submissive/.local/lib/python3.6/site-packages/pdf2docx/shape/Path.py", line 128, in contains_curve
    return bbox.getArea()/self.bbox.getArea() >= ratio
ZeroDivisionError: float division by zero

Please Help me ❤️
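
The last frame of the traceback divides one bounding-box area by another, so a path with a zero-area bbox triggers the error. A guarded version of that ratio could look like the sketch below. `area_ratio` and this standalone `contains_curve` are hypothetical helpers, and treating a degenerate path as "ratio 0" is an assumption about the desired behavior.

```python
def area_ratio(inner_area, outer_area):
    """Return inner_area / outer_area, or 0.0 when the outer area is zero."""
    if outer_area == 0:
        return 0.0
    return inner_area / outer_area


def contains_curve(curve_area, path_area, ratio):
    """Mirror the comparison in the traceback without dividing by zero."""
    return area_ratio(curve_area, path_area) >= ratio
```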

a href link missing from word

When I convert a PDF file to docx, the href URL of a link is missing in the docx file. Kindly help.

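[Feature] Joining wrapped lines within a paragraph

There is another issue concerning line breaks within paragraphs. Test document link: test document.

The original looks like this:

The result parsed by pdf2docx is:

As you can see, every line is followed by a line break. Corresponding Word document link: pdf2docx result.

The result parsed by smallpdf is:

As you can see, the paragraphs are continuous, with no line breaks in between. Corresponding Word document link: smallpdf result.

This problem does not look easy to handle, but would you consider adding this feature in the future?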
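
One common heuristic for this kind of line joining is to treat a line that reaches the right margin as wrapped and a line that stops short of it as the end of a paragraph. The sketch below illustrates that idea only. It is not how pdf2docx or smallpdf work, and `join_wrapped_lines`, its input format and the `tolerance` value are all assumptions.

```python
def join_wrapped_lines(lines, right_margin, tolerance=5.0):
    """Merge consecutive lines into paragraphs.

    lines: list of (text, x1) tuples, x1 being the right edge of the line.
    A line whose right edge reaches the right margin is assumed to wrap,
    so the next line is appended to the same paragraph.
    """
    paragraphs = []
    current = ''
    for text, x1 in lines:
        current += text
        if right_margin - x1 > tolerance:
            # line ends well before the margin: treat it as a paragraph end
            paragraphs.append(current)
            current = ''
    if current:
        paragraphs.append(current)
    return paragraphs
```

Lines are concatenated without a separator, which suits Chinese text. Space-separated languages would need a space inserted at each join.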

ValueError: unsupported colorspace for 'png'

When trying to getImageData(output="png") from an image in DeviceCMYK color space, an error shows:

File "d:\workspace\github\pdf2docx\pdf2docx\image\ImageSpan.py", line 42, in make_docx
    docx.add_image(paragraph, self.image, self.bbox.x1-self.bbox.x0)
  File "d:\workspace\github\pdf2docx\pdf2docx\image\Image.py", line 45, in image
    return fitz.Pixmap(self._image).getImageData(output="png") # convert to png image
  File "E:\Python\Python37\lib\site-packages\fitz\fitz.py", line 5187, in getImageData
    raise ValueError("unsupported colorspace for '%s'" % output)
ValueError: unsupported colorspace for 'png'

cannot import name 'parse' from 'pdf2docx'

I tried to run the example in README.md and it gives that error.

from pdf2docx import parse

pdf_file = '09_0370.pdf'
docx_file = 'File.docx'

# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=1)

What's wrong?

Text in table

Why is the text obtained after conversion in the table?

Skip items that cause errors

Let's assume that for now there is no way to process floating images. As an enhancement, I would like to make a small recommendation for future updates: a parameter that allows skipping the images or objects that cause errors would be very useful. That way, even if the incoming PDF file has unprocessable elements, they can be omitted and the output file produced without them, and the programmer is then responsible for explaining this to the user.
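
The requested behavior amounts to wrapping each element's conversion in its own try/except and collecting what was skipped. The sketch below shows that pattern in isolation. `convert_items` and its parameters are hypothetical and do not correspond to any existing pdf2docx option.

```python
def convert_items(items, convert, on_error=None):
    """Convert each item, skipping those whose conversion raises.

    Returns (results, skipped), where skipped holds (item, exception) pairs
    so the caller can report what was left out.
    """
    results, skipped = [], []
    for item in items:
        try:
            results.append(convert(item))
        except Exception as exc:  # broad on purpose: any failure skips the item
            skipped.append((item, exc))
            if on_error is not None:
                on_error(item, exc)
    return results, skipped
```

The `skipped` list gives the calling program what it needs to tell the user which elements are missing from the output.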

align borders in stream table

Currently, a stream table border is determined by two adjacent cell blocks: it is placed right in the middle between them.

Though such stream borders are hidden in docx, it's better to make them aligned, so that the structure of the stream table is simpler.
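
The two ideas above, a border halfway between adjacent blocks and nearly coincident borders snapped to one position, can be sketched as follows. `middle_border`, `align_borders` and the `tolerance` value are hypothetical, and this is not the library's implementation.

```python
def middle_border(left_block_x1, right_block_x0):
    """Border position halfway between two horizontally adjacent blocks."""
    return (left_block_x1 + right_block_x0) / 2.0


def align_borders(positions, tolerance=3.0):
    """Snap border positions that lie within `tolerance` of each other
    (chained in sorted order) to their common mean."""
    if not positions:
        return []
    ordered = sorted(positions)
    groups = [[ordered[0]]]
    for pos in ordered[1:]:
        if pos - groups[-1][-1] <= tolerance:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    snapped = {}
    for group in groups:
        mean = sum(group) / len(group)
        for pos in group:
            snapped[pos] = mean
    # keep the caller's original ordering
    return [snapped[pos] for pos in positions]
```

After snapping, rows whose borders differed by a point or two share one column boundary, which reduces the number of distinct columns in the resulting table.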

Each docx conversion overwrites the previous pages

My PDF has many pages, and each docx conversion overwrites the previous pages. How can I save every converted page into a single docx file?
pdf_file = os.path.join(output, f'{filename}.pdf')
docx_file = os.path.join(output, f'{filename}.docx')
cv = Converter(pdf_file)
for i in range(len(cv)):
    cv.debug_page(i, docx_file)
# cv.debug_page(0, docx_file)
cv.close() # close pdf

# check results
if compare:
    check_result(pdf_file, docx_file, 'comparison.pdf', make_test_case)

Problem with the docx file after the convert

Hello to the community. I'm new to programming, so thanks in advance. I run the program in PyCharm; the conversion starts and seems to work without problems (Parsing Page... -> Creating Page... etc.). Then, when I go to the directory where my file was saved to check whether the conversion worked, I see what is shown in the attached picture: the docx content appears as pictures, in pieces, not as text. I was wondering if you have any idea why this is happening and how to fix it.

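The program raises ValueError: no valid image found

Test sample: link: https://pan.baidu.com/s/1yP1v554PKxk0UAkSpk8CHw extraction code: knvt
During format conversion, the error ValueError: no valid image found was raised. Please take a look.
The conversion output on the dev branch is as follows: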

Parsing Pages: 154/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 77/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 232/297 per CPU 3...Ignore Line "<image>" due to overlap
Parsing Pages: 168/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 88/297 per CPU 1...Ignore Line "�" due to overlap
Parsing Pages: 185/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 186/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 101/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 191/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 263/297 per CPU 3...Ignore Line "<image>" due to overlap
Parsing Pages: 109/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 111/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 268/297 per CPU 3...Ignore Line "<image>" due to overlap
Parsing Pages: 124/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 285/297 per CPU 3...Ignore Line "<image>" due to overlap
Parsing Pages: 289/297 per CPU 3...Ignore Line "<image>" due to overlap
Parsing Pages: 224/297 per CPU 2...Ignore Line "<image>" due to overlap
Parsing Pages: 138/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 144/297 per CPU 1...Ignore Line "<image>" due to overlap
Parsing Pages: 146/297 per CPU 1...Ignore Line "<image>" due to overlap

@dothinking

Can't reproduce transparent PNG images correctly

A png image showing in PDF

was converted to wrong style in docx:

compression error -2

Running into an error compression error -2. It would be great if anyone is able to provide some pointers

Attached the PDF with the issue:
5_EN.pdf

Error message:

Processing Pages: 1/28...mupdf: compression error -2
Traceback (most recent call last):
  File "/Users/erikchan/Downloads/convert.py", line 10, in <module>
    parse(pdf_files[i], docx_files[i])
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 118, in make_docx
    self._make_docx(page_indexes)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 192, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 172, in initialize
    images, paths = self._paths_extractor.extract_paths(page)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 61, in extract_paths
    image = largest.to_image(page) if largest.contains_curve else None
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 140, in to_image
    return ImagesExtractor.clip_page(page, bbox, zoom)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 60, in clip_page
    return cls.to_raw_dict(image, bbox)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 50, in to_raw_dict
    'image': image.getPNGData()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5899, in getPNGData
    barray = self._getImageData(1)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5868, in _getImageData
    return _fitz.Pixmap__getImageData(self, format)
RuntimeError: compression error -2

Converting PDF with `/ColorSpace /CSp /DeviceRGB` causes ValueError

I'm trying to convert a PDF and later parse some of the text data. However, the PDF contains an image with color spaces, which causes a ValueError when converting it.

Unfortunately, the PDF is private so I can't share it, but I will try to make a reproducible PDF if needed. The PDFs are not generated by me. It is probably not caused by file corruption, because I have multiple similar PDFs from the same source (only the text in the tables is different), and all of them have the same problem.

The XREF object of one of PDF pages is (obj_contents variable in _check_device_cs function):

<<
  /Type /Page
  /Parent 2 0 R
  /Contents 14 0 R
  /Resources <<
    /ExtGState <<
      /GSa 3 0 R
    >>
    /ColorSpace <<
      /CSp /DeviceRGB
    >>
    /Font <<
      /F6 6 0 R
    >>
    /XObject <<
      /Im1 9 0 R
    >>
  >>
  /Annots 17 0 R
  /MediaBox [ 0 0 595 842 ]
>>

Problem is the color space definition. pdf2docx will detect it and try to parse /CSp /DeviceRGB to check if it is a device based color space. Based on comments in the code, CS should be in format /Cs6 14 0 R, so 14 will be converted to int and then passed to _is_device_cs. However, in my case, CS is /CSp /DeviceRGB. This means int conversion of /DeviceRGB will fail and program will throw ValueError:

Traceback (most recent call last):
  File "D:\Users\filips\Downloads\pdf2docx\venv\Scripts\pdf2docx-script.py", line 33, in <module>
    sys.exit(load_entry_point('pdf2docx', 'console_scripts', 'pdf2docx')())
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\main.py", line 67, in main
    fire.Fire(parse)
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 117, in make_docx
    self._make_docx(page_indexes)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 170, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 150, in initialize
    images, paths = self._paths.parse(page).filter_pixmaps(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\shape\Path.py", line 42, in parse
    raw_paths = pdf.paths_from_stream(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\common\pdf.py", line 270, in paths_from_stream
    color_spaces = _check_device_cs(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\common\pdf.py", line 579, in _check_device_cs
    cs[cs_name] = _is_device_cs(int(xref), doc)
ValueError: invalid literal for int() with base 10: '/DeviceRGB'

This happens both on latest PyPI version and latest version of dev branch.

I found a workaround that is enough for me, because I only need text data: if I surround the failing line with try, PDF parsing works and all text data is correct. However, the background of that image becomes black instead of white (other non-white colors stay the same). The wrong background is not a big issue for me and is still better than a ValueError, but it would be nice to have this fixed.

I will open PR with that workaround soon, but it would be nice to have it fixed completely.
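
The two forms of a /ColorSpace entry described above, an indirect reference such as `/Cs6 14 0 R` and a direct device name such as `/CSp /DeviceRGB`, can be told apart before any `int()` conversion. The sketch below shows one way to do that with a regular expression. `parse_color_spaces` is a hypothetical helper and not the library's `_check_device_cs`.

```python
import re

# "/Name 14 0 R" (indirect reference) or "/Name /DeviceRGB" (direct name)
CS_ENTRY = re.compile(r'/(\w+)\s+(?:(\d+)\s+\d+\s+R|/(\w+))')
DEVICE_SPACES = {'DeviceGray', 'DeviceRGB', 'DeviceCMYK'}


def parse_color_spaces(cs_dict_text):
    """Map each color space name to ('xref', int) or ('device', bool)."""
    entries = {}
    for name, xref, direct in CS_ENTRY.findall(cs_dict_text):
        if xref:
            # indirect object: the caller still has to resolve the xref
            entries[name] = ('xref', int(xref))
        else:
            # direct name: device-based if it is one of the Device* spaces
            entries[name] = ('device', direct in DEVICE_SPACES)
    return entries
```

For the page object shown in this issue, the `/CSp /DeviceRGB` entry is classified as a device color space directly, so no xref lookup is attempted.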

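Can page headers and footers be recognized?

When converting a PDF with Word 2016, page headers and footers are recognized. It would be great if this library could support that too.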

Generate all pages from file

Is there a way to process all pages of a file without setting a range? For example:

parse(pdf_file, docx_file, start=0, end=pdf2docx.allpages())
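
Other snippets on this page pass `end=None` to parse() and describe `start=0, end=None` as the default parameters, which suggests that leaving `end` unset already covers the whole document. The sketch below only illustrates those semantics. `page_indexes` is a hypothetical helper, and treating `end` as exclusive is an assumption.

```python
def page_indexes(page_count, start=0, end=None):
    """Zero-based page indexes from start up to, but not including, end.

    end=None means "through the last page", so no range has to be given.
    """
    stop = page_count if end is None else min(end, page_count)
    return list(range(start, stop))
```

For a 14-page document, `page_indexes(14)` covers every page and `page_indexes(14, start=1)` skips only the first one.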

Parse table looks weird: merged cells not in rectangular shape

Merged cells are identified by checking adjacent borders. According to this rule, the cell on the top-right corner is merged with two cells on the left and bottom of it, respectively. But, it's invalid in docx.

A proper parsing is that,

the top-right cell is merged with the cell on the bottom of it,
and the left border of top-right cell is hidden.

[Bug] ValueError: could not convert string to float: 'Td'

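Hello developer, I found another bug while using pdf2docx. After some simple debugging, I found that the xref_stream content for the relevant part of the PDF page is:

-0.0009 Tc 0.0042 Tw 5.247 0 Td
(15 K )Tj
EMC

The corresponding part of the lines object at line 293 of pdf2docx/common/pdf.py is

['-0.0009', 'Tc', '0.0042', 'Tw', '5.247', '0', 'Td', '(15', 'K', ')Tj', 'EMC']

So an error occurs when executing lines 334 and 335, i.e. the following code.

# - CMYK mode
elif line.upper()=='K': # c m y k K
    c, m, y, k = map(float, lines[i-4:i])

So my guess is that this should be changed to

try:
    c, m, y, k = map(float, lines[i-4:i])
except ValueError:
    continue

and this passed my local tests.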

ImportError: cannot import name 'Converter' from 'pdf2docx'

from pdf2docx import Converter

pdf_file = 'D:\\dev\\python\\test.pdf'
docx_file = 'D:\\dev\\python\\test.docx'

# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file) # default parameters: start=0, end=None
cv.close()

ImportError: cannot import name 'Converter' from 'pdf2docx'

consider multi-processing for large size pdf

explicit border in stream table

The structure of a stream table is determined by the layout of text blocks, but explicit (visible) borders should also be considered.

Such borders in stream tables are currently ignored (as of 0.4.0).

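[Feature] Some consecutive text appears to have different weights

As shown:

Screenshot of the original:

What puzzles me is that in the line "一、引论" ("1. Introduction"), the fonts clearly look different to the naked eye, yet in Word the style is bold throughout and the font-family is uniform.
Solid Documents, which you mentioned in the pdf2docx development overview, does not seem to have this problem, but it also loses the bold style, as shown:

Can pdf2docx implement this? I am using pdf2docx version 0.3.4 (currently the latest).

Test document link:

PDF test document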

License

Hi,

first I'd like to thank you for opening this code.

Unfortunately, and you may not be aware of this, without a license no one can legally use your code, which also makes any contribution legally impossible (http://opensource.stackexchange.com/q/1720/775).

We are thinking about contributing further development to this project, but in its current state we would not be allowed to. It would be nice if you could add a license to your project.

Cheers

after convert:

Traceback (most recent call last):
  File "d:\21_GitHub\pdf2docx\test\local_test.py", line 30, in <module>
    docx.make_page(layout)
  File "d:\21_GitHub\pdf2docx\pdf2docx\writer.py", line 17, in make_page
    docx.make_page(self._doc, layout)
  File "d:\21_GitHub\pdf2docx\pdf2docx\docx.py", line 50, in make_page
    section.right_margin = Pt(right)
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\section.py", line 234, in right_margin
    self._sectPr.right_margin = value
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\oxml\section.py", line 292, in right_margin
    pgMar.right = value
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\oxml\xmlchemy.py", line 192, in set_attr_value
    str_value = self._simple_type.to_xml(value)
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\oxml\simpletypes.py", line 25, in to_xml
    cls.validate(value)
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\oxml\simpletypes.py", line 185, in validate
    cls.validate_int_in_range(value, 0, 18446744073709551615)
  File "D:\89_Program_Files\Python368\lib\site-packages\docx\oxml\simpletypes.py", line 42, in validate_int_in_range
    (min_inclusive, max_inclusive, value)
ValueError: value must be in range 0 to 18446744073709551615 inclusive, got -161543
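
The traceback shows python-docx rejecting a negative right margin. One way such a value can arise is content that extends past the page edge, although that cause is an assumption here. The sketch below clamps computed margins to zero before they would be handed to python-docx. `safe_margin` and `page_margins` are hypothetical helpers.

```python
def safe_margin(value):
    """Clamp a computed margin so it is never negative.

    python-docx rejects negative lengths, as the ValueError above shows.
    """
    return max(0, value)


def page_margins(page_width, content_x0, content_x1):
    """Left and right margins for content spanning [content_x0, content_x1].

    Content that extends past the page edge would give a negative margin,
    which is clamped to zero here.
    """
    left = safe_margin(content_x0)
    right = safe_margin(page_width - content_x1)
    return left, right
```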