ssine / pptx2md
a pptx to markdown converter
I tried virtualenv with Python 3.9, 3.8 and 3.7 and got the same error each time:
ImportError: cannot import name 'etree' from 'lxml' (/usr/lib/python3/dist-packages/lxml/__init__.py)
virtualenv -p /usr/bin/python3.9 pptx2md
source pptx2md/bin/activate
pip install pptx2md
pptx2md file.pptx
I tried the solutions from these issues, but nothing worked:
babybuddy/babybuddy#336
WeblateOrg/weblate#4183
Every time I use the -o option, I have to manually move the img folder to the output path I passed.
I'm not sure this is clear; if you need more information, I can explain in more detail.
This may be an issue in a package dependency, but I am seeing the following error on an Ubuntu 22.04 box, when converting a given PowerPoint file:
$ pptx2md --disable-image --disable-color --disable-escaping \
--disable-notes -o mypowerpint.pptx.md mypowerpoint.pptx
Traceback (most recent call last):
File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 202, in parse
shapes = sorted(ungroup_shapes(slide.shapes), key=attrgetter('top', 'left'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 185, in ungroup_shapes
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
^^^^^^^^^^^^^^^^
File "/home/me/project/venv/lib/python3.11/site-packages/pptx/shapes/autoshape.py", line 362, in shape_type
raise NotImplementedError(msg)
NotImplementedError: Shape instance of unrecognized shape type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/me/project/venv/bin/pptx2md", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/__main__.py", line 141, in main
parse(prs, out)
File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 207, in parse
print(sp.shape_type)
^^^^^^^^^^^^^
File "/home/me/project/venv/lib/python3.11/site-packages/pptx/shapes/autoshape.py", line 362, in shape_type
raise NotImplementedError(msg)
NotImplementedError: Shape instance of unrecognized shape type
I am only interested in the text, so I would be thrilled with a solution that simply ignored unknown shapes and kept moving. 😄
Is there additional info I can provide to help troubleshoot?
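A minimal sketch of the "ignore unknown shapes and keep moving" idea, assuming the only failure is the NotImplementedError shown in the traceback. The helper names safe_shape_type and iter_known_shapes are hypothetical, and the stand-in classes exist only so the example runs without python-pptx installed.

```python
def safe_shape_type(shape):
    """Return shape.shape_type, or None when the type cannot be determined."""
    try:
        return shape.shape_type
    except NotImplementedError:
        # Raised for shapes the library does not recognize (see traceback above).
        return None


def iter_known_shapes(shapes):
    """Yield (shape, shape_type) pairs, skipping shapes of unrecognized type."""
    for shape in shapes:
        shape_type = safe_shape_type(shape)
        if shape_type is None:
            continue  # ignore the shape and keep going
        yield shape, shape_type


# Stand-in objects (assumptions) so the sketch is self-contained.
class KnownShape:
    shape_type = "PICTURE"


class UnknownShape:
    @property
    def shape_type(self):
        raise NotImplementedError("Shape instance of unrecognized shape type")


kept = [t for _, t in iter_known_shapes([KnownShape(), UnknownShape(), KnownShape()])]
```

Skipping a shape entirely also drops any text it carries, so a converter that only needs text might instead treat None as "generic shape" and still look for a text frame.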
When I run:
import pptx2md as p
p [A1News.pptx]
the result is NameError: name 'A1News' is not defined.
I then tried to run:
import pptx2md as p
p [pptx A1News.pptx]
and the outcome is SyntaxError: invalid syntax. Perhaps you forgot a comma?
But if I run:
import pptx2md as p
p A1News.pptx
it also gives the error SyntaxError: invalid syntax.
Could you please advise?
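The snippets above are shell syntax typed into Python, which explains the NameError and SyntaxError. One way to drive the converter from a Python session is to start the command-line tool in a child process. This is a sketch under that assumption: build_pptx2md_command and convert are hypothetical helper names, and the -o option is the one used elsewhere in this thread.

```python
import subprocess
import sys


def build_pptx2md_command(pptx_path, output_path=None):
    """Build the argument list for running pptx2md as a module."""
    command = [sys.executable, "-m", "pptx2md", pptx_path]
    if output_path is not None:
        command += ["-o", output_path]
    return command


def convert(pptx_path, output_path=None):
    """Run the converter in a child process; raises CalledProcessError on failure."""
    subprocess.run(build_pptx2md_command(pptx_path, output_path), check=True)
```

Calling convert("A1News.pptx") would then be the Python-side equivalent of typing pptx2md A1News.pptx in a terminal.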
pptx2md gives the following error:
Traceback (most recent call last):
File "/home/murali/.local/lib/python3.10/site-packages/pptx/compat/__init__.py", line 10, in <module>
Container = collections.abc.Container
AttributeError: module 'collections' has no attribute 'abc'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/murali/.local/bin/pptx2md", line 5, in <module>
from pptx2md.__main__ import main
File "/home/murali/.local/lib/python3.10/site-packages/pptx2md/__main__.py", line 1, in <module>
from pptx import Presentation
File "/home/murali/.local/lib/python3.10/site-packages/pptx/__init__.py", line 14, in <module>
from pptx.api import Presentation # noqa
File "/home/murali/.local/lib/python3.10/site-packages/pptx/api.py", line 15, in <module>
from .package import Package
File "/home/murali/.local/lib/python3.10/site-packages/pptx/package.py", line 6, in <module>
from pptx.opc.package import OpcPackage
File "/home/murali/.local/lib/python3.10/site-packages/pptx/opc/package.py", line 11, in <module>
from pptx.compat import is_string, Mapping
File "/home/murali/.local/lib/python3.10/site-packages/pptx/compat/__init__.py", line 14, in <module>
Container = collections.Container
AttributeError: module 'collections' has no attribute 'Container'
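Both failures in this traceback come from how the abstract base classes are looked up. Python 3.10 removed the old aliases such as collections.Container, and collections.abc is a submodule that is only reliably available after an explicit import. The sketch below shows the lookup that works on current Python versions; it illustrates the cause and is not a patch to python-pptx itself.

```python
import collections.abc  # explicit import of the submodule

# The ABCs live in collections.abc, not in the collections namespace itself.
Container = collections.abc.Container
Mapping = collections.abc.Mapping

is_list_a_container = issubclass(list, Container)
is_dict_a_mapping = issubclass(dict, Mapping)
```

Since the failing file belongs to the installed python-pptx package, upgrading that dependency is probably the practical route, though that is an assumption and not something confirmed in this thread.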
Thanks a lot for the script. While testing the conversion, I got the following error and am not sure what to do. Do I need to adjust an image in the presentation? Also, how does pptx2md handle graphical forms such as boxes and ellipses drawn in PowerPoint?
File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/site-packages/pptx/parts/slide.py", line 30, in get_image
return self.related_part(rId).image
AttributeError: 'Part' object has no attribute 'image'
After installing pptx2md successfully, I ran into this error:
root@mgt01:~/terraform# pptx2md kubevirt.pptx -o kubevirt.md
Traceback (most recent call last):
File "/usr/local/bin/pptx2md", line 5, in <module>
from pptx2md.__main__ import main
File "/usr/local/lib/python2.7/dist-packages/pptx2md/__main__.py", line 78
print(f'source file {file_path} not exist!')
^
SyntaxError: invalid syntax
Even running the bare pptx2md command produces this error.
Could you take a look, @ssine?
In __init__.py, line 100 calls exit(0) without importing it from sys (from sys import exit), which leads to an error.
However, once that is fixed, using pptx2md from within another script terminates the calling script, which might not be wanted. It would be better to return a nonzero exit value instead.
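A common pattern for this is to have the entry function return a status code and leave the decision to exit to the caller. The sketch below assumes a hypothetical main(argv) signature; it is not the project's actual code.

```python
import sys


def main(argv=None):
    """Return an exit status instead of terminating the interpreter."""
    args = sys.argv[1:] if argv is None else argv
    if not args:
        print("usage: convert FILE", file=sys.stderr)
        return 2  # nonzero signals failure to the shell
    # ... the conversion itself would happen here ...
    return 0
```

A console entry point can wrap this as sys.exit(main()), while another script can call main([...]) directly and inspect the returned value without being terminated.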
Is there a way to export the speaker's notes too?
Thanks
Albert
MacOS
Python: 3.9
Is there an 'ignore' option for shapes it can't render?
Can it be fixed in the PowerPoint file?
Could it be this issue in the library?
scanny/python-pptx#333
python3.9/site-packages/pptx/shapes/base.py", line 153, in placeholder_format
raise ValueError("shape is not a placeholder")
ValueError: shape is not a placeholder
Traceback (most recent call last):
File "/usr/local/bin/pptx2md", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/pptx2md/__main__.py", line 148, in main
parse(prs, out)
File "/usr/local/lib/python3.9/site-packages/pptx2md/parser.py", line 237, in parse
if hasattr(shape, "placeholder_format"):
File "/usr/local/lib/python3.9/site-packages/pptx/shapes/base.py", line 153, in placeholder_format
raise ValueError("shape is not a placeholder")
ValueError: shape is not a placeholder
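The traceback shows hasattr() itself raising. In Python 3, hasattr only swallows AttributeError, so a property that raises ValueError propagates straight through. The sketch below reproduces that with a stand-in class and shows a guarded accessor. get_placeholder_format is a hypothetical helper, and the is_placeholder check assumes python-pptx shapes expose that attribute, as they do to my knowledge.

```python
class FakeShape:
    """Stand-in for a non-placeholder shape (no python-pptx needed)."""

    is_placeholder = False

    @property
    def placeholder_format(self):
        raise ValueError("shape is not a placeholder")


def get_placeholder_format(shape):
    """Return the placeholder format, or None for non-placeholder shapes."""
    if not getattr(shape, "is_placeholder", False):
        return None
    try:
        return shape.placeholder_format
    except ValueError:
        return None


shape = FakeShape()
try:
    hasattr(shape, "placeholder_format")
    hasattr_raised = False
except ValueError:
    # hasattr() does not catch ValueError, only AttributeError.
    hasattr_raised = True
```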
I would like the ability to change the slide separator "\n---\n" to a different one, ideally one that contains the current slide number and that I can set via the CLI.
For example, --enable-slides --slides-separator=[---(slide_number)---] or something like that, instead of "\n---\n".
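One way such a separator could work is a template string with a named placeholder. This is only a sketch of the request: render_separator and the {slide_number} placeholder are hypothetical and do not describe an existing pptx2md option.

```python
def render_separator(template, slide_number):
    """Fill a separator template; '{slide_number}' is a hypothetical placeholder."""
    return "\n" + template.format(slide_number=slide_number) + "\n"


numbered = render_separator("---({slide_number})---", 3)
plain = render_separator("---", 1)
```

A template without the placeholder reproduces the current fixed separator, so the default behavior would stay unchanged. Templates containing literal braces would need escaping with this approach.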
Thanks, really like the project.
Not sure if this is expected to work or not, but currently it does not. Would you consider supporting preserving hyperlinks within table cells?
I am practically Python-illiterate. Thank you in advance for creating exactly what I was looking for. I ran into my first issue the first time I used this program.
File "/Library/Frameworks/Python.framework/Versions/3.12/bin/pptx2md", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx2md/__main__.py", line 121, in main
prs = Presentation(file_path)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 73, in open
return cls(pkg_file)._load()
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 157, in _load
pkg_xml_rels, parts = _PackageLoader.load(self._pkg_file, self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 186, in load
return cls(pkg_file, package)._load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 190, in _load
parts, xml_rels = self._parts, self._xml_rels
^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 219, in _parts
content_types = self._content_types
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 203, in _content_types
return _ContentTypeMap.from_xml(self._package_reader[CONTENT_TYPES_URI])
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 35, in __getitem__
return self._blob_reader[pack_uri]
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 49, in _blob_reader
return _PhysPkgReader.factory(self._pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 135, in factory
raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at '3._Analisi_delle_dinamiche_II.pptx'
I am on Mac
Thank you for your patience
Alessandro
Hi. This software is great!
When pptx2md reaches a slide that contains certain types of images (created in PowerPoint, such as arrows), the process is interrupted. Perhaps Pillow throws an error when it tries to process an image it can't load. The evidence is that --disable-image eliminates the error.
Translated from Japanese to English by Google Translate.
Command: pptx2md -o redme.md -i images infile.pptx
Output:
processing slide 1...
Traceback (most recent call last):
File "/home/tangarora/.local/bin/pptx2md", line 8, in <module>
sys.exit(main())
File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/__main__.py", line 117, in main
parse(prs, out)
File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/parser.py", line 176, in parse
process_picture(shape)
File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/parser.py", line 130, in process_picture
common_path = os.path.commonpath([g.out_path, g.img_path])
File "/usr/lib/python3.9/posixpath.py", line 510, in commonpath
raise ValueError("Can't mix absolute and relative paths") from None
ValueError: Can't mix absolute and relative paths
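os.path.commonpath refuses a mix of absolute and relative paths, which is what happens when one of -o and -i is absolute and the other is not. A minimal sketch of one way around it is to normalize both paths first. The function names here are hypothetical and do not come from pptx2md.

```python
import os


def common_root(out_path, img_path):
    """Normalize both paths to absolute before comparing them."""
    return os.path.commonpath([os.path.abspath(out_path), os.path.abspath(img_path)])


def image_path_relative_to_output(out_path, img_path):
    """Path of the image directory relative to the directory of the output file."""
    out_dir = os.path.dirname(os.path.abspath(out_path))
    return os.path.relpath(os.path.abspath(img_path), start=out_dir)


# One relative path and one absolute path no longer raise ValueError.
root = common_root("readme.md", os.path.join(os.getcwd(), "images"))
relative_images = image_path_relative_to_output("readme.md", "images")
```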
When I tried to run pptx2md [filename]
I get a bunch of errors. This is what I am left with:
Traceback (most recent call last):
File "/opt/homebrew/bin/pptx2md", line 5, in <module>
from pptx2md.__main__ import main
File "/opt/homebrew/lib/python3.9/site-packages/pptx2md/__main__.py", line 1, in <module>
from pptx import Presentation
File "/opt/homebrew/lib/python3.9/site-packages/pptx/__init__.py", line 14, in <module>
from pptx.api import Presentation # noqa
File "/opt/homebrew/lib/python3.9/site-packages/pptx/api.py", line 15, in <module>
from .package import Package
File "/opt/homebrew/lib/python3.9/site-packages/pptx/package.py", line 6, in <module>
from pptx.opc.package import OpcPackage
File "/opt/homebrew/lib/python3.9/site-packages/pptx/opc/package.py", line 13, in <module>
from pptx.opc.oxml import CT_Relationships, serialize_part_xml
File "/opt/homebrew/lib/python3.9/site-packages/pptx/opc/oxml.py", line 5, in <module>
from lxml import etree
ImportError: dlopen(/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 2): no suitable image found. Did find:
/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture
/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture
When exporting with basic options (only the output path), the images are extracted correctly, but they are inserted into the Markdown with a URL-encoded path, so the image cannot be displayed.
Actual:
![](img%5Cmy-pptx0.png)
Expected:
![](img/my-pptx0.png)
Just to note, I am using this for a one-off conversion of 2 files, so a simple search and replace is enough for me. It might not be for someone using pptx2md in a workflow.
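The %5C in the generated link is a percent-encoded backslash, which suggests a Windows path separator was encoded as part of the file name. A sketch of one possible fix is to convert the path to forward slashes before encoding. markdown_image_path is a hypothetical helper, and it assumes relative image paths.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def markdown_image_path(path):
    """Use forward slashes, then percent-encode the remaining special characters."""
    posix_style = PureWindowsPath(path).as_posix()
    return quote(posix_style)  # "/" is left unescaped by default


fixed = markdown_image_path("img\\my-pptx0.png")
with_space = markdown_image_path("img/my pptx0.png")
```

Treating every backslash as a separator is a simplification: on POSIX systems a file name may legitimately contain one.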
I can't really understand how I am supposed to use the -o flag, and I think other people don't know either.
I get something like the following; I also run into it in my own code that uses Pillow:
Cannot convert wmf image xxxxxxxx.wmf in slide 24 to png, this probably won't be displayed correctly.
Wand can solve this. I added the following to an image.py:
def convert_wmf_to_png(input_file, output_png_path):
    """
    Convert WMF data to a PNG file.
    """
    from wand.image import Image
    with Image(filename=input_file) as img:
        img.format = 'png'
        img.save(filename=output_png_path)
Then I replaced the image-processing lines with something like this:
# wmf images, try to convert, if failed, output as original
try:
    try:
        # First attempt: let Pillow do the conversion
        Image.open(output_path).save(os.path.splitext(output_path)[0] + '.png')
        out.put_image(os.path.splitext(img_outputter_path)[0] + '.png', g.max_img_width)
        notes.append(f'Image {output_path} in slide {slide_idx} converted to png.')
    except Exception:  # Pillow failed, fall back to Wand
        from image import convert_wmf_to_png
        convert_wmf_to_png(output_path, os.path.splitext(output_path)[0] + '.png')
        out.put_image(os.path.splitext(img_outputter_path)[0] + '.png', g.max_img_width)
        notes.append(f'Image {output_path} in slide {slide_idx} converted to png.')
except Exception as e:
    # Both converters failed: record the reason and emit the original image
    notes.append(
        f'Cannot convert image {output_path} in slide {slide_idx} to png, this probably won\'t be displayed correctly. {e}'
    )
    out.put_image(img_outputter_path, g.max_img_width)
Since Wand requires ImageMagick to be installed, and I am not sure you would like this approach, I did not fork and open a formal merge request.
If someone runs into this, they can merge my code; it works perfectly. And thank you for writing pptx2md. I am using it to parse ppt essays, which makes my life much easier.
Para fazer classificação das áreas de forma
robusta, é preciso elaborar uma análise de
risco detalhada para identificar as fontes
Para fazer classificação das áreas de forma robusta\, é preciso elaborar uma análise de risco detalhada para identificar as fontes
As you can see it doesn't happen every time, although I've seen an entire file full of ""
you are awesome!!!!
In my use case, I have many shapes, images and text boxes that are grouped to form a schema. pptx2md exports each element on its own, so the schema is lost in the Markdown.
The feature request is to check whether a group contains shapes and images. If so, convert the group to an image and leave its content out of the textual export.
The current workaround is to group the items, export them as an image, insert that image into the presentation, and delete the grouped items, keeping only the image. While this works for my use case, it may not work for somebody using the tool in a workflow, since they would lose the ability to edit the grouped elements.
Just to note, I am using this for a one-off conversion of 2 files. The workaround is enough for me, but at least the request will be documented.
Error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\__main__.py", line 123, in <module>
main()
File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\__main__.py", line 119, in main
parse(prs, out)
File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\parser.py", line 175, in parse
process_picture(shape)
File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\parser.py", line 137, in process_picture
out.put_image(os.path.splitext(img_outputter_path)[0]+'.png', width)
NameError: name 'width' is not defined
width is undefined in lines 118 to 139 at commit 6d904dc.
I can open a pull request to fix this.
It would be lovely if this program could turn inserted spreadsheets into Markdown tables. Currently I copy and paste each spreadsheet into https://tabletomarkdown.com/convert-spreadsheet-to-markdown/ to convert it to a table.
Error when launching pptx2md:
AttributeError: module 'collections' has no attribute 'abc'
Traceback (most recent call last):
File "C:\Python\Python310\lib\site-packages\pptx\compat\__init__.py", line 10, in <module>
Container = collections.abc.Container
AttributeError: module 'collections' has no attribute 'abc'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Python\Python310\Scripts\pptx2md.exe\__main__.py", line 4, in <module>
File "C:\Python\Python310\lib\site-packages\pptx2md\__main__.py", line 1, in <module>
from pptx import Presentation
File "C:\Python\Python310\lib\site-packages\pptx\__init__.py", line 14, in <module>
from pptx.api import Presentation # noqa
File "C:\Python\Python310\lib\site-packages\pptx\api.py", line 15, in <module>
from .package import Package
File "C:\Python\Python310\lib\site-packages\pptx\package.py", line 6, in <module>
from pptx.opc.package import OpcPackage
File "C:\Python\Python310\lib\site-packages\pptx\opc\package.py", line 11, in <module>
from pptx.compat import is_string, Mapping
File "C:\Python\Python310\lib\site-packages\pptx\compat\__init__.py", line 14, in <module>
Container = collections.Container
AttributeError: module 'collections' has no attribute 'Container'
I'm having a bug while installing the package. I don't know if it's a problem on my machine or the package has a bug, so I'll post this here.
$ pip install pptx2md
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
Collecting pptx2md==0.7.9
Downloading pptx2md-0.7.9-py3-none-any.whl (7.3 kB)
Requirement already satisfied: python-pptx in c:\python310\lib\site-packages (from pptx2md==0.7.9) (0.6.21)
Requirement already satisfied: rapidfuzz in c:\python310\lib\site-packages (from pptx2md==0.7.9) (1.9.1)
Requirement already satisfied: pillow in c:\python310\lib\site-packages (from pptx2md==0.7.9) (8.4.0)
Requirement already satisfied: lxml>=3.1.0 in c:\python310\lib\site-packages (from python-pptx->pptx2md==0.7.9) (4.7.1)
Requirement already satisfied: XlsxWriter>=0.5.7 in c:\python310\lib\site-packages (from python-pptx->pptx2md==0.7.9) (3.0.2)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
Installing collected packages: pptx2md
Attempting uninstall: pptx2md
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
Found existing installation: pptx2md 1.0.0
Uninstalling pptx2md-1.0.0:
Successfully uninstalled pptx2md-1.0.0
WARNING: Failed to write executable - trying to use .deleteme logic
Rolling back uninstall of pptx2md
Moving to c:\python310\lib\site-packages\pptx2md-1.0.0.dist-info\
from C:\Python310\Lib\site-packages\~ptx2md-1.0.0.dist-info
Moving to c:\python310\lib\site-packages\pptx2md\
from C:\Python310\Lib\site-packages\~ptx2md
ERROR: Could not install packages due to an OSError: [WinError 2] O sistema não pode encontrar o arquivo especificado: 'C:\\Python310\\Scripts\\pptx2md.exe' -> 'C:\\Python310\\Scripts\\pptx2md.exe.deleteme'
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
The same error happens on commands:
I often see presentations that use a single text box for a multi-line title, where the first line is the main section and the second line is the more specific section.
Either behind an option flag or through a specific layout of the title file, it would be nice if, when such a title block is found, the first line were treated as a level-1 header (or whatever title.txt says it should be) and the second line as a level-2 header (or whatever title.txt says it should be).
See the demo-subtitle.pptx presentation that I made to explain it better.
Output of pptx2md demo-subtitle.pptx --out demo-subtitle.md --disable-escaping
# Section Title
My content 1
My content 2
# Section Title�Sub-Title Made with « enter »
My content 1.1
My content 1.2
# Section Title�Sub-Title using « shift+enter »
My content 2.1
My content 2.2
# Section Title Sub-Title 3 on same line – Made with « tab »
My content 3.1
My content 3.2
# Section Title that is very long so that it is on two line without using « enter »
My content 4.1
My content 4.2
You will see that the output uses the character � (U+000B) for both sub-titles made with enter. The one made with tab doesn't show well in the render, but in an editor that shows tabs, you will see that the character between Title and Sub-Title is U+0009.
I expect either behavior to be behind an enable or disable flag.
I think the algorithm could be something along these lines:
1. If U+000B or U+0009 is encountered in what is recognized as a title block, split the title into before and after.
2. If before matches the previous title OR the previous header-level title, output after as a header of level X+1, where X is the header level of before.
3. Otherwise, output before as a header of level X and after as a header of level X+1.
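The proposed algorithm can be sketched roughly as follows. This is an illustration of the request and not pptx2md code: split_title and render_title are hypothetical names, the header level is passed in directly, and the title.txt lookup is left out.

```python
import re

# Vertical tab (U+000B) and tab (U+0009), the separators named in the report.
TITLE_SEPARATORS = re.compile(r"[\x0b\t]")


def split_title(text):
    """Split a title into (before, after); after is None when no separator exists."""
    parts = TITLE_SEPARATORS.split(text, maxsplit=1)
    before = parts[0].strip()
    after = parts[1].strip() if len(parts) == 2 else None
    return before, after


def render_title(text, previous_title, level=1):
    """Return Markdown header lines for one title block."""
    before, after = split_title(text)
    if not after:
        return ["#" * level + " " + before]
    if before == previous_title:
        # Same section as the previous slide: emit only the sub-title.
        return ["#" * (level + 1) + " " + after]
    return ["#" * level + " " + before, "#" * (level + 1) + " " + after]


repeated = render_title("Section Title\x0bSub-Title", "Section Title")
first_seen = render_title("Section Title\tSub-Title", None)
```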
title.txt
# Section Title
My content 1
My content 2
## Sub-Title Made with « enter »
My content 1.1
My content 1.2
## Sub-Title using « shift+enter »
My content 2.1
My content 2.2
## Sub-Title 3 on same line – Made with « tab »
My content 3.1
My content 3.2
# Section Title that is very long so that it is on two line without using « enter »
My content 4.1
My content 4.2
title.txt
First Header
Section Title
Sub-Title Made with « enter »
Sub-Title using « shift+enter »
Sub-Title 3 on same line – Made with « tab »
Section Title that is very long so that it is on two line without using « enter »
## Section Title
My content 1
My content 2
### Sub-Title Made with « enter »
My content 1.1
My content 1.2
### Sub-Title using « shift+enter »
My content 2.1
My content 2.2
### Sub-Title 3 on same line – Made with « tab »
My content 3.1
My content 3.2
## Section Title that is very long so that it is on two line without using « enter »
My content 4.1
My content 4.2
The Markdown output uses HTML <img> tags for images, though there is dedicated Markdown syntax for it:
![Alt text](/path/to/img.jpg)
![Alt text](/path/to/img.jpg "Optional title")
Could you please support the Markdown syntax, at least as an option? The advantage is that this allows further conversion to formats other than HTML, e.g. via Pandoc. My application is to convert pptx to Markdown, and then use Pandoc's beamer option to create slides.
I'm guessing the reason for this choice is that this syntax does not allow specifying the image width. Here you could use Pandoc's syntax, inspired by PHP Markdown Extra:
![](file.jpg){ width=50% }
Of course only optionally.
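A small sketch of what such an optional output mode could produce. markdown_image is a hypothetical helper name, and the width attribute follows the Pandoc syntax quoted above.

```python
def markdown_image(path, alt="", width_percent=None):
    """Render an image as Markdown; the width uses Pandoc's attribute syntax."""
    image = f"![{alt}]({path})"
    if width_percent is not None:
        # No space between ")" and "{", as in the quoted Pandoc example.
        image += f"{{ width={width_percent}% }}"
    return image


plain_image = markdown_image("/path/to/img.jpg", alt="Alt text")
sized_image = markdown_image("file.jpg", width_percent=50)
```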
Dear,
I have installed Python and pip. Following your wiki, I used the command
pip install pptx2md
in the Python terminal, only to get the following:
File "", line 1
pip install pptx2md
^
SyntaxError: invalid syntax
could you please assist me with this?
kr,
j
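The SyntaxError appears because pip install is a shell command and the Python prompt (>>>) only accepts Python code. The command belongs in the system terminal. If an install has to be started from Python, one option is to launch pip as a child process, sketched below with a hypothetical helper name.

```python
import subprocess
import sys


def build_pip_install_command(package):
    """Argument list that runs pip with the same interpreter as this script."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run pip in a child process; raises CalledProcessError if pip fails."""
    subprocess.check_call(build_pip_install_command(package))
```

Calling install("pptx2md") from a Python session would then match running pip install pptx2md in the terminal.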
What is the license on this repo? Thank you in advance.
(great tool!)
Sometimes PowerPoint embeds images as WMF files (which LibreOffice converts to EMF). Even though these are vector files, Pillow should be able to rasterize them, which may be better than nothing.
Unfortunately, your program fails on WMF files: "cannot find loader for this WMF file". It does not even continue without the images; it stops processing.
Question: do you think you could extend your program so that it can handle WMF files?
Sorry if this is a silly question.
I have followed the instructions to install both Python and the extension.
When you say:
Once you have installed it, use the command pptx2md [pptx filename] to convert pptx file into markdown.
Where should I place the pptx file, and which directory should I be in within Terminal when executing the command?
I am on macOS Monterey.
I'm trying to convert a back catalog of powerpoint lectures to md. The first two I did worked great. The next two, I'm getting a mysterious error:
pptx2md ../Lecture03bECD.pptx -o Lecture03bECD.md
/Users/ralmond/Library/Python/3.7/lib/python/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
File "/Users/ralmond/bin/pptx2md", line 8, in <module>
sys.exit(main())
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx2md/__main__.py", line 77, in main
prs = Presentation(file_path)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/package.py", line 125, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 33, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/phys_pkg.py", line 32, in __new__
raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at '../Lecture03bECD.pptx'
dhcp138177:Markdown ralmond$
I just figured this out. PPT saved the file to the wrong directory, so the file did not exist. I'll leave this as a minor issue, because "Package not found" is a confusing error message.
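A friendlier message could come from checking the path before it is handed to the parser. This is a minimal sketch with a hypothetical helper name, check_input_file.

```python
import os


def check_input_file(file_path):
    """Raise a descriptive error before the path reaches the parser."""
    if not os.path.isfile(file_path):
        searched_dir = os.path.abspath(os.path.dirname(file_path) or ".")
        raise FileNotFoundError(
            f"source file {file_path!r} does not exist (looked in {searched_dir})"
        )
    return file_path
```

Reporting the absolute directory that was searched makes a wrong working directory or a misplaced file easier to spot than "Package not found".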
I am trying to install it. From a very bare global installation I did pip install pptx2md. When trying to run pptx2md --help, I had many ModuleNotFoundError: No module named '[pkg name]'.
Here is the bigger issue. Sometimes I get a mysterious error about a missing NULL slide. I am not sure where this comes from.
dhcp138177:Markdown ralmond$ pptx2md ../ECDQuestions1.pptx -o ECDQuestions1.md
/Users/ralmond/Library/Python/3.7/lib/python/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
File "/Users/ralmond/bin/pptx2md", line 8, in <module>
sys.exit(main())
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx2md/__main__.py", line 77, in main
prs = Presentation(file_path)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/api.py", line 28, in Presentation
presentation_part = Package.open(pptx).main_document_part
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/package.py", line 125, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 37, in from_file
phys_reader, pkg_srels, content_types
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 70, in _load_serialized_parts
for partname, blob, srels in part_walker:
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 106, in _walk_phys_parts
phys_reader, part_srels, visited_partnames
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 106, in _walk_phys_parts
phys_reader, part_srels, visited_partnames
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 103, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/phys_pkg.py", line 111, in blob_for
return self._zipf.read(pack_uri.membername)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1431, in read
with self.open(name, "r", pwd) as fp:
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1470, in open
zinfo = self.getinfo(name)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1398, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'ppt/slides/NULL' in the archive"
I'll attach 2 slides which seem to generate this error.
ECDQuestions1.pptx
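One plausible reading of the KeyError is that a relationship file inside the .pptx archive points at a target literally named NULL, which is then resolved to ppt/slides/NULL. That is an assumption, not a confirmed diagnosis. The sketch below scans the archive's .rels parts for such targets using only the standard library. find_null_targets is a hypothetical helper, and the in-memory archive is made-up test data.

```python
import io
import re
import zipfile

TARGET_PATTERN = re.compile(r'Target="([^"]*)"')


def find_null_targets(pptx_file):
    """List (rels_part, target) pairs whose relationship target is literally NULL."""
    hits = []
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            if not name.endswith(".rels"):
                continue
            xml = archive.read(name).decode("utf-8", errors="replace")
            for target in TARGET_PATTERN.findall(xml):
                if target.rsplit("/", 1)[-1] == "NULL":
                    hits.append((name, target))
    return hits


# Build a tiny in-memory archive that mimics the suspected broken relationship.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "ppt/slides/_rels/slide1.xml.rels",
        '<Relationships><Relationship Id="rId1" Target="NULL"/>'
        '<Relationship Id="rId2" Target="../media/image1.png"/></Relationships>',
    )
buffer.seek(0)
null_targets = find_null_targets(buffer)
```

Running find_null_targets on the real file path would show which slide carries the dangling reference, if that is indeed the cause.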
Tables and charts in the pptx cannot be converted.
The result is a blank page.
Can this package be used as a module instead of the CLI? If yes, are there any examples? Thanks!
Hi, when I convert my pptx, I get correct Markdown except that the spaces between words are omitted. Could you please correct this? Thanks
You may have forgotten to push 0.7.5 to this repo.
MacBook Pro (16-inch, 2021) M1 Pro MacOS 12.1
Python 3.8.9
Package Version
----------- -------
lxml 4.7.1
Pillow 9.0.0
pip 21.3.1
pptx2md 1.1.1
python-pptx 0.6.21
rapidfuzz 1.9.1
setuptools 49.2.1
six 1.15.0
wheel 0.33.1
XlsxWriter 3.0.2
Input: test.pptx
Output:
# TE;ST
* Lorem ipsum
* dolor sitamet\,
* consectetur/adipiscingelit\.
* Phasellus/tortorturpis\,
* semper etporttitorvel\,
* ullamcorpersempermassa\.
* Phasellustempor
* felisutnullafermentum
* hendrerit\. Donec et lacinia ipsum\.
* Fuscelectuslacus\,
* auctor aquama\,
* sagittisvehiculatortor\.
* Pellentesqueiaculisfelisodio\,
* vitaescelerisque
Hello I am trying to use pptx2md as part of my diff processing within git.
Is there a way to direct the standard out to a variable in my git config?
i.e. textconv = pptx2md
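Git's textconv expects a command that takes a file path as its argument and writes text to standard output, while pptx2md writes to a file chosen with -o. A thin wrapper can bridge the two. The sketch below is one way to do that: pptx_to_text is a hypothetical helper, and it relies on the -o and --disable-image options that appear elsewhere in this thread.

```python
import os
import subprocess
import sys
import tempfile


def pptx_to_text(pptx_path, run=subprocess.run):
    """Convert to a temporary Markdown file and return its contents."""
    with tempfile.TemporaryDirectory() as workdir:
        out_path = os.path.join(workdir, "out.md")
        run(
            [sys.executable, "-m", "pptx2md", pptx_path, "-o", out_path, "--disable-image"],
            check=True,
            stdout=subprocess.DEVNULL,  # keep progress messages out of the diff
        )
        with open(out_path, encoding="utf-8") as handle:
            return handle.read()
```

A wrapper script would end with sys.stdout.write(pptx_to_text(sys.argv[1])), and the git configuration would point textconv at that script instead of at pptx2md directly. The run parameter exists only so the function can be exercised without the converter installed.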
This is incredibly useful. Unbelievably fast and runs like a charm on Win10Edu.
I didn't know where else to put a commendation.
I need help urgently.
I despair of getting the program to work under Windows.
I have uninstalled and reinstalled everything. I have tried it on different computers.
No matter if I try with PIP or with PIP3. No chance.
Please, what mistake am I making? Is there any other idea?
Thank you very much in advance,
Tio
As the title says, I have also turned the fix for the previous issue ("can't save wmf") into a patch. Since I am busy working on something else, please pardon me for not submitting it as a formal pull request.
The key fix is in parser.py:
elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
    notes += process_picture(shape, idx + 1)
elif shape.shape_type == MSO_SHAPE_TYPE.TABLE:
    notes += process_table(shape, idx + 1)
else:  # Add the following lines
    if hasattr(shape, "placeholder_format"):
        ph = shape.placeholder_format
        if ph.type == PP_PLACEHOLDER.OBJECT and hasattr(shape, "image") and getattr(shape, "image"):
            notes += process_picture(shape, idx + 1)
        else:
            print(f"Unrecognized shape: {shape.shape_type}, place holder: {ph.type}, place at page: {idx + 1}")
    else:
        print(f"Unrecognized shape: {shape.shape_type}, place at page: {idx + 1}")
0001-Fix-some-image-used-as-object-that-can-t-be-output-a.patch
The ppt that can recreate the problem is also attached: