ssine / pptx2md

a pptx to markdown converter

Python 99.58% Makefile 0.17% SCSS 0.25%

pptx2md's Issues

Python versioning problem: cannot import name 'etree' from 'lxml'

Tried to use virtualenv with versions python3.9, python3.8 and python3.7 and got the same error:
ImportError: cannot import name 'etree' from 'lxml' (/usr/lib/python3/dist-packages/lxml/__init__.py)

virtualenv -p /usr/bin/python3.9 pptx2md
source pptx2md/bin/activate
pip install pptx2md
pptx2md file.pptx

Tried the solutions from these issues, but nothing worked:
babybuddy/babybuddy#336
WeblateOrg/weblate#4183

-o tag won't change the img folder

Every time I use the -o flag, I have to manually move the img folder to the path that was passed.
I don't know if that's clear; if you need more info, I can explain in more detail.

NotImplementedError: Shape instance of unrecognized shape type

This may be an issue in a package dependency, but I am seeing the following error on an Ubuntu 22.04 box, when converting a given PowerPoint file:

$ pptx2md --disable-image --disable-color --disable-escaping \
    --disable-notes -o mypowerpint.pptx.md mypowerpoint.pptx

Traceback (most recent call last):
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 202, in parse
    shapes = sorted(ungroup_shapes(slide.shapes), key=attrgetter('top', 'left'))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 185, in ungroup_shapes
    if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
       ^^^^^^^^^^^^^^^^
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx/shapes/autoshape.py", line 362, in shape_type
    raise NotImplementedError(msg)
NotImplementedError: Shape instance of unrecognized shape type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/me/project/venv/bin/pptx2md", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/__main__.py", line 141, in main
    parse(prs, out)
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx2md/parser.py", line 207, in parse
    print(sp.shape_type)
          ^^^^^^^^^^^^^
  File "/home/me/project/venv/lib/python3.11/site-packages/pptx/shapes/autoshape.py", line 362, in shape_type
    raise NotImplementedError(msg)
NotImplementedError: Shape instance of unrecognized shape type

I am only interested in the text, so I would be thrilled with a solution that simply ignored unknown shapes and kept moving. 😄

Is there additional info I can provide to help troubleshoot?
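
One way to get the "ignore unknown shapes and keep moving" behavior requested above is to catch the NotImplementedError around the shape_type lookup instead of letting it propagate. This is a minimal sketch, assuming shapes expose shape_type and group shapes expose shapes, as the traceback suggests. The names safe_shape_type and ungroup_shapes_tolerant are hypothetical and not part of pptx2md.

```python
def safe_shape_type(shape):
    """Return shape.shape_type, or None if the library cannot classify the shape."""
    try:
        return shape.shape_type
    except NotImplementedError:
        # python-pptx raises this for shapes it does not recognize.
        return None


def ungroup_shapes_tolerant(shapes, group_type):
    """Flatten group shapes recursively without failing on unrecognized shapes.

    group_type is the value that marks a group (MSO_SHAPE_TYPE.GROUP in python-pptx).
    """
    flat = []
    for shape in shapes:
        if safe_shape_type(shape) == group_type:
            flat.extend(ungroup_shapes_tolerant(shape.shapes, group_type))
        else:
            # Unrecognized shapes are kept; later steps can still read their text.
            flat.append(shape)
    return flat
```

Any later code that reads shape_type (such as the print call in the second traceback) would need the same guard.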

filename is not defined

When I run:

import pptx2md as p

p [A1News.pptx]

The result is NameError: name 'A1News' is not defined

I tried to run

import pptx2md as p

p [pptx A1News.pptx]

the outcome is SyntaxError: invalid syntax. Perhaps you forgot a comma?

But if I run

import pptx2md as p

p A1News.pptx

it also gives the error SyntaxError: invalid syntax

Could you please kindly advise?
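
The attempts above fail because pptx2md A1News.pptx is a shell command, not Python syntax. One way to launch it from a Python session is to call the command-line tool through subprocess. This is a minimal sketch, assuming the pptx2md executable is on the PATH. The helpers build_command and convert are hypothetical names, not part of the package.

```python
import subprocess


def build_command(pptx_path, output_path=None):
    """Build the argument list for the pptx2md command-line tool."""
    command = ["pptx2md", pptx_path]
    if output_path is not None:
        command += ["-o", output_path]
    return command


def convert(pptx_path, output_path=None):
    """Run pptx2md as a child process and return its exit code."""
    return subprocess.run(build_command(pptx_path, output_path)).returncode
```

A call would look like convert("A1News.pptx"). Note that the file name is passed as a quoted string.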

giving error on python 3.10.1

pptx2md gives the following error:

Traceback (most recent call last):
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/compat/__init__.py", line 10, in <module>
    Container = collections.abc.Container
AttributeError: module 'collections' has no attribute 'abc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/murali/.local/bin/pptx2md", line 5, in <module>
    from pptx2md.__main__ import main
  File "/home/murali/.local/lib/python3.10/site-packages/pptx2md/__main__.py", line 1, in <module>
    from pptx import Presentation
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/__init__.py", line 14, in <module>
    from pptx.api import Presentation  # noqa
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/api.py", line 15, in <module>
    from .package import Package
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/package.py", line 6, in <module>
    from pptx.opc.package import OpcPackage
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/opc/package.py", line 11, in <module>
    from pptx.compat import is_string, Mapping
  File "/home/murali/.local/lib/python3.10/site-packages/pptx/compat/__init__.py", line 14, in <module>
    Container = collections.Container
AttributeError: module 'collections' has no attribute 'Container'
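
The traceback shows the installed python-pptx first trying collections.abc.Container and then falling back to collections.Container. The fallback alias was removed in Python 3.10, and the first form is only guaranteed to work once the collections.abc submodule has been imported explicitly. The sketch below illustrates that import pattern. It shows the cause and is not the actual python-pptx code; is_mapping is a hypothetical helper.

```python
import collections.abc  # importing the submodule makes collections.abc available

Container = collections.abc.Container
Mapping = collections.abc.Mapping


def is_mapping(obj):
    """Return True if obj behaves like a mapping (dict-like)."""
    return isinstance(obj, Mapping)
```

A newer python-pptx release may already include an equivalent fix, so upgrading that dependency is worth trying first.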

Error: AttributeError: 'Part' object has no attribute 'image'

Thanks a lot for the script. While testing the conversion, I get the following error. I'm not sure what to do; do I need to adjust an image in the presentation? Also, how does pptx2md handle graphical shapes such as boxes and ellipses drawn in PowerPoint?

File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/site-packages/pptx/parts/slide.py", line 30, in get_image
    return self.related_part(rId).image
AttributeError: 'Part' object has no attribute 'image'

SyntaxError: invalid syntax

After installing pptx2md successfully, I ran into this error:

root@mgt01:~/terraform# pptx2md kubevirt.pptx -o kubevirt.md
Traceback (most recent call last):
  File "/usr/local/bin/pptx2md", line 5, in <module>
    from pptx2md.__main__ import main
  File "/usr/local/lib/python2.7/dist-packages/pptx2md/__main__.py", line 78
    print(f'source file {file_path} not exist!')
                                              ^
SyntaxError: invalid syntax

Even just running the pptx2md command produces this error.
Could you take a look? @ssine

exit(0) call without importing sys

Line 100 of __init__.py calls exit(0) without importing it (from sys import exit), which leads to an error.
However, once that is fixed, using pptx2md from within another script terminates that script, which might not be wanted. It would be better to return a nonzero exit value.
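
A common pattern for this is to have main return a status code and leave the actual process exit to the command-line entry point. This is a minimal sketch under that assumption. The argument handling and messages are illustrative and do not reproduce the real pptx2md code.

```python
import os
import sys


def main(argv=None):
    """Return an exit status instead of calling exit(), so callers keep control."""
    args = sys.argv[1:] if argv is None else argv
    if not args:
        print("usage: pptx2md [pptx filename]", file=sys.stderr)
        return 2
    file_path = args[0]
    if not os.path.exists(file_path):
        print(f"source file {file_path} does not exist", file=sys.stderr)
        return 1
    # ... the conversion itself would run here ...
    return 0
```

The console script would then call sys.exit(main()), while another script can call main([...]) and inspect the returned value without being terminated.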

Speaker's note

Is there a way to export the speaker's notes too?
Thanks
Albert

ValueError: shape is not a placeholder errors on some PPT's

MacOS
Python: 3.9

Is there an 'ignore' option for shapes it can't render?
Can it be fixed in the PowerPoint file?

Could it be this issue in the library?
scanny/python-pptx#333

python3.9/site-packages/pptx/shapes/base.py", line 153, in placeholder_format
raise ValueError("shape is not a placeholder")
ValueError: shape is not a placeholder

Traceback (most recent call last):
  File "/usr/local/bin/pptx2md", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pptx2md/__main__.py", line 148, in main
    parse(prs, out)
  File "/usr/local/lib/python3.9/site-packages/pptx2md/parser.py", line 237, in parse
    if hasattr(shape, "placeholder_format"):
  File "/usr/local/lib/python3.9/site-packages/pptx/shapes/base.py", line 153, in placeholder_format
    raise ValueError("shape is not a placeholder")
ValueError: shape is not a placeholder
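
Note that hasattr only suppresses AttributeError, so a property that raises ValueError, as placeholder_format does in the traceback, still propagates through the hasattr check. A minimal sketch of a tolerant lookup follows. The helper get_placeholder_format is hypothetical and not part of pptx2md or python-pptx.

```python
def get_placeholder_format(shape):
    """Return shape.placeholder_format, or None when the shape is not a placeholder."""
    try:
        return shape.placeholder_format
    except (AttributeError, ValueError):
        # AttributeError: the object has no such attribute at all.
        # ValueError: python-pptx reports "shape is not a placeholder".
        return None
```

Callers can then test the result against None instead of relying on hasattr.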

The ability to change "\n---\n" to a custom string when --enable-slides is set.

I would like to be able to change "\n---\n" to a different separator, one that contains the current slide number, preferably settable via the CLI.

For example, --enable-slides --slides-separator=[---(slide_number)---] or something like that instead of "\n---\n".

Thanks, really like the project.
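
The requested behavior could be as small as a template substitution. This is a minimal sketch, assuming the separator is passed as a template string containing the "(slide_number)" token from the example above. The function render_separator and the --slides-separator flag are hypothetical and do not exist in pptx2md.

```python
DEFAULT_SEPARATOR = "\n---\n"


def render_separator(template, slide_number):
    """Fill the slide number into a separator template.

    Templates without the "(slide_number)" token are returned unchanged.
    """
    return template.replace("(slide_number)", str(slide_number))
```

Using a plain token replacement rather than str.format means that braces or percent signs in a user-supplied separator cannot cause formatting errors.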

[Feature Request] Support hyperlinks in tables

Not sure if this is expected to work or not, but currently it does not. Would you consider preserving hyperlinks within table cells?

I can't comprehend this error

I am practically Python-illiterate. Thank you in advance for creating exactly what I was looking for. I ran into my first issue the first time I used this program.

Trying to convert my first file, I get this message:

  File "/Library/Frameworks/Python.framework/Versions/3.12/bin/pptx2md", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx2md/__main__.py", line 121, in main
    prs = Presentation(file_path)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
                        ^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 73, in open
    return cls(pkg_file)._load()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 157, in _load
    pkg_xml_rels, parts = _PackageLoader.load(self._pkg_file, self)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 186, in load
    return cls(pkg_file, package)._load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 190, in _load
    parts, xml_rels = self._parts, self._xml_rels
                      ^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 219, in _parts
    content_types = self._content_types
                    ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/package.py", line 203, in _content_types
    return _ContentTypeMap.from_xml(self._package_reader[CONTENT_TYPES_URI])
                                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 35, in __getitem__
    return self._blob_reader[pack_uri]
           ^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/util.py", line 215, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 49, in _blob_reader
    return _PhysPkgReader.factory(self._pkg_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pptx/opc/serialized.py", line 135, in factory
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at '3._Analisi_delle_dinamiche_II.pptx'

I am on Mac

Thank you for your patience

Alessandro

The conversion to md is interrupted.

Hi. This software is great!

When pptx2md reaches a slide that contains certain types of images (created in PowerPoint, like arrows), the process is interrupted. Perhaps Pillow throws an error when it tries to process an image it can't load. The evidence is that --disable-image eliminates the error.

Translated from Japanese to English by Google Translate.

Does not work at all

Command: pptx2md -o redme.md -i images infile.pptx

Output:
processing slide 1...
Traceback (most recent call last):
  File "/home/tangarora/.local/bin/pptx2md", line 8, in <module>
    sys.exit(main())
  File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/__main__.py", line 117, in main
    parse(prs, out)
  File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/parser.py", line 176, in parse
    process_picture(shape)
  File "/home/tangarora/.local/lib/python3.9/site-packages/pptx2md/parser.py", line 130, in process_picture
    common_path = os.path.commonpath([g.out_path, g.img_path])
  File "/usr/lib/python3.9/posixpath.py", line 510, in commonpath
    raise ValueError("Can't mix absolute and relative paths") from None
ValueError: Can't mix absolute and relative paths
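
os.path.commonpath raises this ValueError whenever one argument is absolute and the other is relative, which is what happens when the -o and -i values are given in different forms. One way around it is to make both paths absolute before comparing them. This is a minimal sketch; image_link_path is a hypothetical helper and not the actual pptx2md implementation.

```python
import os


def image_link_path(out_path, img_file_path):
    """Return the image path relative to the directory of the markdown file.

    Both paths are made absolute first, so relative and absolute inputs can be mixed.
    """
    out_dir = os.path.dirname(os.path.abspath(out_path))
    return os.path.relpath(os.path.abspath(img_file_path), out_dir)
```

Computing the link relative to the output file's directory also keeps the image reference valid when the markdown file is written somewhere other than the current directory.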

M1 Mac is unable to run command

When I try to run pptx2md [filename], I get a bunch of errors. This is what I am left with:

Traceback (most recent call last):
  File "/opt/homebrew/bin/pptx2md", line 5, in <module>
    from pptx2md.__main__ import main
  File "/opt/homebrew/lib/python3.9/site-packages/pptx2md/__main__.py", line 1, in <module>
    from pptx import Presentation
  File "/opt/homebrew/lib/python3.9/site-packages/pptx/__init__.py", line 14, in <module>
    from pptx.api import Presentation  # noqa
  File "/opt/homebrew/lib/python3.9/site-packages/pptx/api.py", line 15, in <module>
    from .package import Package
  File "/opt/homebrew/lib/python3.9/site-packages/pptx/package.py", line 6, in <module>
    from pptx.opc.package import OpcPackage
  File "/opt/homebrew/lib/python3.9/site-packages/pptx/opc/package.py", line 13, in <module>
    from pptx.opc.oxml import CT_Relationships, serialize_part_xml
  File "/opt/homebrew/lib/python3.9/site-packages/pptx/opc/oxml.py", line 5, in <module>
    from lxml import etree
ImportError: dlopen(/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 2): no suitable image found.  Did find:
	/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture
	/opt/homebrew/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so: mach-o, but wrong architecture

Image path is URL-encoded in exported markdown

When exporting with basic options (only the output), the images are extracted correctly, but they are inserted in the markdown with a URL-encoded path. As a result, the image cannot be displayed.

Actual output

![](img%5Cmy-pptx0.png)

Expected output

![](img/my-pptx0.png)

Just to note, I am using this for a one-off conversion of 2 files. A simple search and replace is enough for me, but it might not be for someone using pptx2md in a workflow.
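
The %5C in the actual output is a percent-encoded backslash, which suggests that a Windows-style path was encoded as-is. This is a minimal sketch of a fix, assuming the path is encoded with something like urllib.parse.quote (an assumption about the internals). The helper markdown_image_path is a hypothetical name.

```python
from urllib.parse import quote


def markdown_image_path(path):
    """Convert a file-system path into a URL-style path for a markdown link."""
    # Switch to forward slashes first, so quote() never sees a backslash to encode.
    return quote(path.replace("\\", "/"))
```

Spaces and other special characters are still percent-encoded, while path separators stay readable.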

Can you add on readme how do I use all those flags?

I can't really understand how I am supposed to use the -o flag, and I think other people don't know either.

Cannot convert wmf image, and the possible solution

I get a message like the one below. I also run into this in my own code that uses Pillow.

Cannot convert wmf image xxxxxxxx.wmf in slide 24 to png, this probably won't be displayed correctly.

wand can solve this. I added the following to an image.py:

def convert_wmf_to_png(input_file, output_png_path):
    """
    Convert WMF data to a PNG file.
    """
    from wand.image import Image

    with Image(filename=input_file) as img:
        img.format = 'png'
        img.save(filename=output_png_path)

Then I replaced the image-processing lines with something like this:

# wmf images: try to convert; if that fails, output the original file
png_path = os.path.splitext(output_path)[0] + '.png'
png_link = os.path.splitext(img_outputter_path)[0] + '.png'
try:
  try:
    Image.open(output_path).save(png_path)  # try Pillow first
  except Exception:  # Pillow failed, fall back to wand (requires ImageMagick)
    from image import convert_wmf_to_png
    convert_wmf_to_png(output_path, png_path)
  out.put_image(png_link, g.max_img_width)
  notes.append(f'Image {output_path} in slide {slide_idx} converted to png.')
except Exception as e:
  notes.append(
      f'Cannot convert image {output_path} in slide {slide_idx} to png, this probably won\'t be displayed correctly. {e}'
  )
  out.put_image(img_outputter_path, g.max_img_width)

Since wand requires ImageMagick to be installed, and I am not sure you would like this idea, I did not fork the repo and open a formal merge request.

If anyone else runs into this, feel free to merge my code; it works well for me. And thank you for writing pptx2md. I am using it to parse ppt essays, which makes my life much easier.

🐛 BUG: Converter creating some "\" after breakline

PPTX

Para fazer classificação das áreas de forma 
robusta, é preciso elaborar uma análise de 
risco detalhada para identificar as fontes

Markdown

Para fazer classificação das áreas de forma robusta\, é preciso elaborar uma análise de risco detalhada para identificar as fontes

More details

As you can see, it doesn't happen every time, although I've seen an entire file full of "\"

Thanks

you are awesome!!!!

feature request: Convert grouped elements as images

In my use case, I have many shapes, images and text boxes that are grouped to form a schema. pptx2md exports each element on its own, so the schema is lost in the markdown.

The feature request is to check whether a group contains shapes and images. If so, convert the group to an image and ignore its content in the textual export.

The current workaround is to group the items, export them as an image, insert the image into the presentation, and delete the grouped items, keeping only the image. While this works for my use case, it may not work for somebody using the tool in a workflow, since they would lose the ability to edit the grouped elements.

Just to note, I am using this for a one-off conversion of 2 files. The workaround is enough for me, but at least the request will be documented.

NameError: name 'width' is not defined when the exported image is wmf format.

Error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\__main__.py", line 123, in <module>
    main()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\__main__.py", line 119, in main
    parse(prs, out)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\parser.py", line 175, in parse
    process_picture(shape)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pptx2md\parser.py", line 137, in process_picture
    out.put_image(os.path.splitext(img_outputter_path)[0]+'.png', width)
NameError: name 'width' is not defined

pptx2md/pptx2md/parser.py

Lines 118 to 139 in 6d904dc

def process_picture(shape):
  if g.disable_image:
    return
  global picture_count
  global out
  pic_name = g.file_prefix + str(picture_count)
  pic_ext = shape.image.ext
  if not os.path.exists(g.img_path):
    os.makedirs(g.img_path)
  output_path = g.path_name_ext(g.img_path, pic_name, pic_ext)
  common_path = os.path.commonpath([g.out_path, g.img_path])
  img_outputter_path = os.path.relpath(output_path, common_path)
  with open(output_path, 'wb') as f:
    f.write(shape.image.blob)
  picture_count += 1
  if pic_ext == 'wmf':
    if not g.disable_wmf:
      Image.open(output_path).save(os.path.splitext(output_path)[0]+'.png')
      out.put_image(os.path.splitext(img_outputter_path)[0]+'.png', width)
  else:
    out.put_image(img_outputter_path, g.max_img_width)

width is undefined.

I can open a pull request to fix this.

Feature request: Table conversion

It'd be lovely if this program could turn inserted spreadsheets into markdown tables. Currently I'm copying and pasting each spreadsheet into https://tabletomarkdown.com/convert-spreadsheet-to-markdown/ to convert it to a table.
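
For illustration, the markdown side of such a conversion is small once the cell values are available. This is a minimal sketch, assuming the table has already been read into a list of rows with the header first. The function rows_to_markdown is hypothetical and says nothing about how pptx2md would extract the cells.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown pipe table."""

    def format_row(cells):
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [format_row(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [format_row(row) for row in body]
    return "\n".join(lines)
```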

Error on Windows: module 'collections' has no attribute 'abc'

Error when launching pptx2md:

AttributeError: module 'collections' has no attribute 'abc'

Traceback (most recent call last):
  File "C:\Python\Python310\lib\site-packages\pptx\compat\__init__.py", line 10, in <module>
    Container = collections.abc.Container
AttributeError: module 'collections' has no attribute 'abc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python\Python310\Scripts\pptx2md.exe\__main__.py", line 4, in <module>
  File "C:\Python\Python310\lib\site-packages\pptx2md\__main__.py", line 1, in <module>
    from pptx import Presentation
  File "C:\Python\Python310\lib\site-packages\pptx\__init__.py", line 14, in <module>
    from pptx.api import Presentation  # noqa
  File "C:\Python\Python310\lib\site-packages\pptx\api.py", line 15, in <module>
    from .package import Package
  File "C:\Python\Python310\lib\site-packages\pptx\package.py", line 6, in <module>
    from pptx.opc.package import OpcPackage
  File "C:\Python\Python310\lib\site-packages\pptx\opc\package.py", line 11, in <module>
    from pptx.compat import is_string, Mapping
  File "C:\Python\Python310\lib\site-packages\pptx\compat\__init__.py", line 14, in <module>
    Container = collections.Container
AttributeError: module 'collections' has no attribute 'Container'

It's saying that the syntax is invalid or that pptx2md isn't defined. I'm in Spyder on Mac.

Bug on installation

I'm having a bug while installing the package. I don't know if it's a problem on my machine or the package has a bug, so I'll post this here.

$ pip install pptx2md
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
Collecting pptx2md==0.7.9
  Downloading pptx2md-0.7.9-py3-none-any.whl (7.3 kB)
Requirement already satisfied: python-pptx in c:\python310\lib\site-packages (from pptx2md==0.7.9) (0.6.21)
Requirement already satisfied: rapidfuzz in c:\python310\lib\site-packages (from pptx2md==0.7.9) (1.9.1)
Requirement already satisfied: pillow in c:\python310\lib\site-packages (from pptx2md==0.7.9) (8.4.0)
Requirement already satisfied: lxml>=3.1.0 in c:\python310\lib\site-packages (from python-pptx->pptx2md==0.7.9) (4.7.1)
Requirement already satisfied: XlsxWriter>=0.5.7 in c:\python310\lib\site-packages (from python-pptx->pptx2md==0.7.9) (3.0.2)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
Installing collected packages: pptx2md
  Attempting uninstall: pptx2md
    WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
    Found existing installation: pptx2md 1.0.0
    Uninstalling pptx2md-1.0.0:
      Successfully uninstalled pptx2md-1.0.0
  WARNING: Failed to write executable - trying to use .deleteme logic
  Rolling back uninstall of pptx2md
  Moving to c:\python310\lib\site-packages\pptx2md-1.0.0.dist-info\
   from C:\Python310\Lib\site-packages\~ptx2md-1.0.0.dist-info
  Moving to c:\python310\lib\site-packages\pptx2md\
   from C:\Python310\Lib\site-packages\~ptx2md
ERROR: Could not install packages due to an OSError: [WinError 2] O sistema não pode encontrar o arquivo especificado: 'C:\\Python310\\Scripts\\pptx2md.exe' -> 'C:\\Python310\\Scripts\\pptx2md.exe.deleteme'

WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)
WARNING: Ignoring invalid distribution -ip (c:\python310\lib\site-packages)

The same error happens on commands:

pip install --upgrade pptx2md
pip install pptx2md==0.7.9

feature request: Consider multiline titles as two markdown headers

I often see presentations that use a single text box for a multiline title, where the first line is the main section and the second line is the more specific section.

Either behind an option flag or through a specific layout of the title file, it would be nice if, when such a title block is found, the first line were treated as a level-1 header (or whatever title.txt says it should be) and the second line as a level-2 header (or whatever title.txt says it should be).

See the demo-subtitle.pptx presentation that I made to explain it better.

Actual Behavior

Output of pptx2md demo-subtitle.pptx --out demo-subtitle.md --disable-escaping

# Section Title

My content 1

My content 2

# Section Title�Sub-Title Made with « enter »

My content 1.1

My content 1.2

# Section Title�Sub-Title using « shift+enter »

My content 2.1

My content 2.2

# Section Title	Sub-Title 3 on same line – Made with « tab »

My content 3.1

My content 3.2

# Section Title that is very long so that it is on two line without using « enter »

My content 4.1

My content 4.2

You will see that the output uses the character � (U+000B) for both subtitles made with enter. The one with the tab doesn't show well when rendered, but in an editor that shows tabs, you will see that the character between Title and Sub-Title is U+0009.

Feature Behavior

I expect either behavior to be behind an enable or disable flag.

I think the algorithm could be something along these lines:

If U+000B or U+0009 are encountered in what is recognized as a title block
- Split text in before and after
- If before match the previous title OR the previous header level title
  - Output after as one header level X+1, where X is the header level of before
- Else
  - Output before as header level X
  - Output after as header level X+1
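
The steps above can be sketched in Python as follows. This is a rough illustration under the stated assumptions, not pptx2md code: split_title is a hypothetical function, and previous_titles is an assumed dict that maps each header level to the last title emitted at that level.

```python
import re

# Vertical tab (U+000B, a line break inside a title) or tab (U+0009).
TITLE_BREAKS = re.compile("[\u000b\u0009]")


def split_title(text, level, previous_titles):
    """Split a multiline title into a list of (header level, text) pairs."""
    parts = TITLE_BREAKS.split(text, maxsplit=1)
    if len(parts) == 1:
        # No break character: a regular single-line title.
        previous_titles[level] = text.strip()
        return [(level, text.strip())]
    before, after = parts[0].strip(), parts[1].strip()
    headers = []
    if previous_titles.get(level) != before:
        # New main section: emit it before the sub-title.
        headers.append((level, before))
        previous_titles[level] = before
    headers.append((level + 1, after))
    previous_titles[level + 1] = after
    return headers
```

For example, after "Section Title" has been emitted at level 1, a title block "Section Title", U+000B, "Sub-Title" yields only the level-2 header, which matches the expected output shown below.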

Without `title.txt`

# Section Title

My content 1

My content 2

## Sub-Title Made with « enter »

My content 1.1

My content 1.2

## Sub-Title using « shift+enter »

My content 2.1

My content 2.2

## Sub-Title 3 on same line – Made with « tab »

My content 3.1

My content 3.2

# Section Title that is very long so that it is on two line without using « enter »

My content 4.1

My content 4.2

With a `title.txt`

First Header
  Section Title
    Sub-Title Made with « enter »
    Sub-Title using « shift+enter »
    Sub-Title 3 on same line – Made with « tab »
  Section Title that is very long so that it is on two line without using « enter »

## Section Title

My content 1

My content 2

### Sub-Title Made with « enter »

My content 1.1

My content 1.2

### Sub-Title using « shift+enter »

My content 2.1

My content 2.2

### Sub-Title 3 on same line – Made with « tab »

My content 3.1

My content 3.2

## Section Title that is very long so that it is on two line without using « enter »

My content 4.1

My content 4.2

Markdown syntax for images

The Markdown output uses HTML <img> tags for images, though there is dedicated Markdown syntax for it:

![Alt text](/path/to/img.jpg)
![Alt text](/path/to/img.jpg "Optional title")

Could you please support the Markdown syntax, at least as an option? The advantage is that this allows further conversion to formats other than HTML, e.g. via Pandoc. My application is to convert pptx to Markdown, and then use Pandoc's beamer option to create slides.

I'm guessing the reason for this choice is that this syntax does not allow specifying the image width. Here you could use Pandoc's syntax, inspired by PHP Markdown Extra:

![](file.jpg){ width=50% }

Of course only optionally.
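
A minimal sketch of what such an option could emit, assuming the image path and an optional width are already known. The function image_markdown is hypothetical, and the width attribute relies on Pandoc's link_attributes extension, so other Markdown renderers will show it as literal text.

```python
def image_markdown(path, alt="", width=None):
    """Format an image in Markdown syntax, optionally with a Pandoc width attribute."""
    markdown = f"![{alt}]({path})"
    if width is not None:
        # Doubled braces produce literal "{" and "}" in an f-string.
        markdown += f"{{ width={width} }}"
    return markdown
```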

python 3.9 - Not able to install

Dear,

I have installed Python and pip. Following your wiki, I ran the command
pip install pptx2md in the Python terminal, only to get the following error message:

File "", line 1
pip install pptx2md
^
SyntaxError: invalid syntax

could you please assist me with this?

kr,
j

If the name of the input pptx file has any Chinese (maybe any non-ascii) character, the image in the output md file is broken.

License

What is the license on this repo? Thank you in advance.

wmf / emf

(great tool!)

Sometimes PowerPoint embeds images as WMF files (which LibreOffice converts to EMF). Even though WMF is a vector format, Pillow should be able to convert it to a raster image, which may be better than nothing.
Unfortunately, your program fails to load WMF files: "cannot find loader for this WMF file". It does not even continue without the images; it stops processing entirely.

Question: do you think you could extend your program in a way that it can handle WMF files?

Where to go to convert

Sorry if this is a silly question.
I have followed the instructions to install both Python and the extension.

When you say:

Once you have installed it, use the command pptx2md [pptx filename] to convert pptx file into markdown.

Where should I place the pptx file? / Which directory should I be in Terminal when executing the command?

I am on macOS Monterey.

pptx.exc.PackageNotFoundError: Package not found at '../Lecture03bECD.pptx'

I'm trying to convert a back catalog of powerpoint lectures to md. The first two I did worked great. The next two, I'm getting a mysterious error:

pptx2md ../Lecture03bECD.pptx -o Lecture03bECD.md
/Users/ralmond/Library/Python/3.7/lib/python/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
  File "/Users/ralmond/bin/pptx2md", line 8, in <module>
    sys.exit(main())
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx2md/__main__.py", line 77, in main
    prs = Presentation(file_path)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 33, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/phys_pkg.py", line 32, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at '../Lecture03bECD.pptx'
dhcp138177:Markdown ralmond$

I just figured this out. PPT saved the file to the wrong directory, so the file did not exist. I'll leave this as a minor issue, because "Package not found" is a confusing error message.

Missing dependencies

I am trying to install it. From a very bare global installation I did pip install pptx2md. When trying to run pptx2md --help, I had many ModuleNotFoundError: No module named '[pkg name]'.

numpy
matplotlib
scipy

KeyError: "There is no item named 'ppt/slides/NULL' in the archive"

Here is the bigger issue. Sometimes I get a mysterious error about a missing NULL slide. Not sure where this comes from.

dhcp138177:Markdown ralmond$ pptx2md ../ECDQuestions1.pptx -o ECDQuestions1.md
/Users/ralmond/Library/Python/3.7/lib/python/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Traceback (most recent call last):
  File "/Users/ralmond/bin/pptx2md", line 8, in <module>
    sys.exit(main())
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx2md/__main__.py", line 77, in main
    prs = Presentation(file_path)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 37, in from_file
    phys_reader, pkg_srels, content_types
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 70, in _load_serialized_parts
    for partname, blob, srels in part_walker:
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/pkgreader.py", line 103, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/Users/ralmond/Library/Python/3.7/lib/python/site-packages/pptx/opc/phys_pkg.py", line 111, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1431, in read
    with self.open(name, "r", pwd) as fp:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1470, in open
    zinfo = self.getinfo(name)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1398, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'ppt/slides/NULL' in the archive"

I'll attach 2 slides which seem to generate this error.
ECDQuestions1.pptx
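
The traceback shows the package reader following a relationship to a part named ppt/slides/NULL that is not in the zip archive. A .pptx file is a zip whose .rels files list such relationships, so one way to locate the offending entry is to scan every .rels file for internal targets that do not exist. The sketch below uses only the standard library; `find_dangling_targets` is a hypothetical helper name, and it assumes the file follows the usual Open Packaging Conventions layout.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_dangling_targets(pptx_file):
    """Return (rels_file, target) pairs whose internal target is not in the archive."""
    dangling = []
    with zipfile.ZipFile(pptx_file) as zf:
        names = set(zf.namelist())
        for rels_name in sorted(names):
            if not rels_name.endswith(".rels"):
                continue
            # "ppt/slides/_rels/slide1.xml.rels" resolves targets relative to "ppt/slides".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # hyperlinks and similar do not point into the archive
                target = rel.get("Target", "")
                if target.startswith("/"):
                    member = target.lstrip("/")
                else:
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    dangling.append((rels_name, target))
    return dangling


if __name__ == "__main__":
    import sys

    for rels_name, target in find_dangling_targets(sys.argv[1]):
        print(f"{rels_name}: target {target!r} is not in the archive")
```

If this reports a target of NULL in one of the slide .rels files, that slide's relationship is the likely source of the KeyError.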

Cannot convert the table and chart

Tables and charts in a pptx file cannot be converted.
The result is a blank page.

Using as a module

Can this package be used as a module instead of the CLI? If yes, are there any examples? Thanks!
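
One approach that does not depend on any internal API is to call the installed CLI from Python. This is a minimal sketch, not the package's documented module interface: `build_command` and `convert` are hypothetical helper names, and the only flag used is `-o`, which appears elsewhere in these issues.

```python
import subprocess
from pathlib import Path


def build_command(pptx_path, md_path):
    """Build the argument list for the pptx2md command-line tool."""
    return ["pptx2md", str(pptx_path), "-o", str(md_path)]


def convert(pptx_path, md_path):
    """Run pptx2md as a subprocess; raises CalledProcessError if the tool fails."""
    subprocess.run(build_command(pptx_path, md_path), check=True)
    return Path(md_path)
```

Calling `convert("deck.pptx", "deck.md")` would then behave like running the command by hand, provided pptx2md is on the PATH.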

No space between words

Hi, when I convert my pptx, I get correct Markdown, except that the spaces between words are omitted. Could you please fix this? Thanks

Code here is version 0.7.4 and pip has 0.7.5

You may have forgotten to push 0.7.5 to this repo.

Spaces not correctly converted

Environment

MacBook Pro (16-inch, 2021) M1 Pro MacOS 12.1
Python 3.8.9

Package     Version
----------- -------
lxml        4.7.1
Pillow      9.0.0
pip         21.3.1
pptx2md     1.1.1
python-pptx 0.6.21
rapidfuzz   1.9.1
setuptools  49.2.1
six         1.15.0
wheel       0.33.1
XlsxWriter  3.0.2

Reproduce

Input: test.pptx

Output:

# TE;ST

* Lorem ipsum
  * dolor sitamet\,
  * consectetur/adipiscingelit\.
  * Phasellus/tortorturpis\,
  * semper etporttitorvel\,
    * ullamcorpersempermassa\.
* Phasellustempor
  * felisutnullafermentum
  * hendrerit\. Donec et lacinia ipsum\.
  * Fuscelectuslacus\,
    * auctor aquama\,
    * sagittisvehiculatortor\.
    * Pellentesqueiaculisfelisodio\,
  * vitaescelerisque

TE;ST

Lorem ipsum
- dolor sitamet,
- consectetur/adipiscingelit.
- Phasellus/tortorturpis,
- semper etporttitorvel,
  - ullamcorpersempermassa.
Phasellustempor
- felisutnullafermentum
- hendrerit. Donec et lacinia ipsum.
- Fuscelectuslacus,
  - auctor aquama,
  - sagittisvehiculatortor.
  - Pellentesqueiaculisfelisodio,
- vitaescelerisque

Thank you very much in advance,
Tio

Can't save images that are marked as a placeholder with an object instead of a picture, and the solution

As the title says. I fixed it, and I also turned the fix for the previous issue "cant save wmf" into a patch. Since I am busy working on something else, pardon me for not making it a formal pull request.

The key fix is in parser.py:

      elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        notes += process_picture(shape, idx + 1)
      elif shape.shape_type == MSO_SHAPE_TYPE.TABLE:
        notes += process_table(shape, idx + 1)
      else:  # Add the following lines
        if hasattr(shape, "placeholder_format"):
          ph = shape.placeholder_format
          if ph.type == PP_PLACEHOLDER.OBJECT and hasattr(shape, "image") and getattr(shape, "image"):
            notes += process_picture(shape, idx + 1)
          else:
            print(f"Unrecognized shape: {shape.shape_type}, place holder: {ph.type}, place at page: {idx + 1}")
        else:
          print(f"Unrecognized shape: {shape.shape_type}, place at page: {idx + 1}")

0001-Fix-some-image-used-as-object-that-can-t-be-output-a.patch

The ppt that can recreate the problem is also attached:

test.pptx

	# Assumes the enclosing module imports os and PIL.Image, and that g and out are module-level objects.
	def process_picture(shape):
		if g.disable_image:
			return
		global picture_count
		global out
		pic_name = g.file_prefix + str(picture_count)
		pic_ext = shape.image.ext
		if not os.path.exists(g.img_path):
			os.makedirs(g.img_path)

		# Write the image to disk and compute the path the Markdown output should reference.
		output_path = g.path_name_ext(g.img_path, pic_name, pic_ext)
		common_path = os.path.commonpath([g.out_path, g.img_path])
		img_outputter_path = os.path.relpath(output_path, common_path)
		with open(output_path, 'wb') as f:
			f.write(shape.image.blob)
		picture_count += 1
		if pic_ext == 'wmf':
			if not g.disable_wmf:
				# Convert WMF to PNG. The original passed an undefined name `width` here;
				# g.max_img_width is assumed to be the intended value.
				Image.open(output_path).save(os.path.splitext(output_path)[0]+'.png')
				out.put_image(os.path.splitext(img_outputter_path)[0]+'.png', g.max_img_width)
		else:
			out.put_image(img_outputter_path, g.max_img_width)