I'm trying to convert a PDF and later parse some of its text data. However, the PDF contains an image with color spaces that cause a ValueError during conversion.
Unfortunately, the PDF is private, so I can't share it, but I will try to make a reproducible PDF if needed. The PDFs are not generated by me. The problem is probably not caused by file corruption, because I have multiple similar PDFs from the same source (only the text in the tables differs), and all of them fail the same way. Here is the page object:
<<
/Type /Page
/Parent 2 0 R
/Contents 14 0 R
/Resources <<
/ExtGState <<
/GSa 3 0 R
>>
/ColorSpace <<
/CSp /DeviceRGB
>>
/Font <<
/F6 6 0 R
>>
/XObject <<
/Im1 9 0 R
>>
>>
/Annots 17 0 R
/MediaBox [ 0 0 595 842 ]
>>
The problem is the color space definition. pdf2docx detects it and tries to parse /CSp /DeviceRGB to check whether it is a device-based color space. Based on the comments in the code, the color space entry is expected in the form /Cs6 14 0 R, so 14 is converted to int and then passed to _is_device_cs. In my case, however, the entry is /CSp /DeviceRGB. This means the int conversion of /DeviceRGB fails and the program throws a ValueError:
Traceback (most recent call last):
  File "D:\Users\filips\Downloads\pdf2docx\venv\Scripts\pdf2docx-script.py", line 33, in <module>
    sys.exit(load_entry_point('pdf2docx', 'console_scripts', 'pdf2docx')())
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\main.py", line 67, in main
    fire.Fire(parse)
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "d:\users\filips\downloads\pdf2docx\venv\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 117, in make_docx
    self._make_docx(page_indexes)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 170, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\converter.py", line 150, in initialize
    images, paths = self._paths.parse(page).filter_pixmaps(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\shape\Path.py", line 42, in parse
    raw_paths = pdf.paths_from_stream(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\common\pdf.py", line 270, in paths_from_stream
    color_spaces = _check_device_cs(page)
  File "d:\users\filips\downloads\pdf2docx\pdf2docx\common\pdf.py", line 579, in _check_device_cs
    cs[cs_name] = _is_device_cs(int(xref), doc)
ValueError: invalid literal for int() with base 10: '/DeviceRGB'
I found a workaround that is good enough for me, because I only need the text data: if I wrap the failing line in a try block, PDF parsing works and all text data is correct. However, the whole background of that image becomes black instead of white (other, non-white colors stay the same). The wrong background is not a big issue for me, and it is still better than a ValueError, but it would be nice to have this fixed properly.
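For illustration, here is a minimal sketch of how the two forms of a color space entry could be told apart before calling int(). This is not the actual pdf2docx code: classify_cs_value and DEVICE_CS_NAMES are hypothetical names I made up, and I am assuming the value arrives as the string that follows the color space name (for example "14 0 R" or "/DeviceRGB").

```python
import re

# Device color space names defined by the PDF specification.
DEVICE_CS_NAMES = {"/DeviceGray", "/DeviceRGB", "/DeviceCMYK"}

# Matches an indirect reference such as "14 0 R" (object number, generation, R).
_INDIRECT_REF = re.compile(r"^(\d+)\s+(\d+)\s+R$")


def classify_cs_value(value):
    """Classify the value of a /ColorSpace resource entry (hypothetical helper).

    Returns one of:
      ("device", name)   - direct device color space name, e.g. "/DeviceRGB"
      ("xref", number)   - indirect reference, e.g. "14 0 R" gives 14
      ("unknown", value) - anything else; the caller decides how to handle it
    """
    value = value.strip()
    if value in DEVICE_CS_NAMES:
        # Direct name: no object lookup needed, and int() must not be called.
        return ("device", value)
    match = _INDIRECT_REF.match(value)
    if match:
        # Indirect reference: only the object number is safe to convert.
        return ("xref", int(match.group(1)))
    return ("unknown", value)


print(classify_cs_value("/DeviceRGB"))
print(classify_cs_value("14 0 R"))
```

With something like this, a direct /DeviceRGB entry could be treated as a device color space right away, and only the "xref" case would go through int() and _is_device_cs. I have not checked whether this also fixes the black background.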
I will open a PR with that workaround soon, but a complete fix would be better.