artiomn / markdown_articles_tool

Parse a Markdown article, download its images and replace image URLs with local paths

License: MIT License

markdown_articles_tool's Introduction

Markdown articles tool 0.1.3

Free command line utility, written in Python, designed to help you manage online and downloaded Markdown documents (e.g., articles). The Markdown Articles Tool is available for macOS, Windows, and Linux.

The tool can be used:

- To download Markdown documents with images and:
  - Find all image links, download the images and fix the links in the document.
  - Skip broken links.
  - Deduplicate similar images by content hash or by using a hash as the name.
- To process images linked with the HTML <img> tag.
- To process local image files.
- To convert Markdown documents to:
  - HTML.
  - PDF.
  - Or save them as plain Markdown.

Also, if you want to use separate functions, you can just import the package.

Installation

From the repository

You need Python 3.9+.

Run:

git clone "https://github.com/artiomn/markdown_articles_tool"
pip3 install -r markdown_articles_tool/requirements.txt

From PyPI

pip3 install markdown-tool

Usage

Syntax:

markdown_tool [options] <article_file_path_or_url>

options:
  -h, --help            show this help message and exit
  -D {disabled,names_hashing,content_hash}, --deduplication-type {disabled,names_hashing,content_hash}
                        Deduplicate images, using content hash or SHA1(image_name) (default: disabled)
  -d IMAGES_DIRNAME, --images-dirname IMAGES_DIRNAME
                        Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url) (default: images)
  -a, --skip-all-incorrect
                        skip all incorrect images (default: False)
  -E, --download-incorrect-mime
                        download "images" with unrecognized MIME type (default: False)
  -s SKIP_LIST, --skip-list SKIP_LIST
                        skip URL's from the comma-separated list (or file with a leading '@') (default: None)
  -i {md,html,md+html,html+md}, --input-format {md,html,md+html,html+md}
                        input format (default: md)
  -l, --process-local-images
                        [DEPRECATED] Process local images (default: False)
  -n, --replace-image-names
                        Replace image names, using content hash (default: False)
  -o {md,html}, --output-format {md,html}
                        output format (default: md)
  -p IMAGES_PUBLIC_PATH, --images-public-path IMAGES_PUBLIC_PATH
                        Public path to the folder of downloaded images (possible variables: $article_name, $time, $date, $dt, $base_url)
  -P, --prepend-images-with-path
                        Save relative images paths (default: False)
  -R, --remove-source   Remove or replace source file (default: False)
  -t DOWNLOADING_TIMEOUT, --downloading-timeout DOWNLOADING_TIMEOUT
                        how many seconds to wait before downloading will be failed (default: -1)
  -O OUTPUT_PATH, --output-path OUTPUT_PATH
                        article output file name or path
  --verbose, -v         More verbose logging (default: False)
  --version             return version number
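
The help text says that -d and -p accept variables such as $article_name and $date. A minimal sketch of how such $-style variables could be expanded is shown below. It assumes the syntax of Python's string.Template; the helper name expand_images_dirname and the date and time formats are assumptions for illustration, not taken from the tool.

```python
from datetime import datetime
from string import Template


def expand_images_dirname(pattern: str, article_name: str,
                          base_url: str, now: datetime) -> str:
    """Expand $-style variables in a directory pattern (hypothetical helper)."""
    variables = {
        "article_name": article_name,
        "base_url": base_url,
        # The formats below are assumptions; the tool may use different ones.
        "time": now.strftime("%H-%M-%S"),
        "date": now.strftime("%Y-%m-%d"),
        "dt": now.strftime("%Y-%m-%d_%H-%M-%S"),
    }
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(pattern).safe_substitute(variables)


print(expand_images_dirname("images/$article_name/$date", "article",
                            "example.org", datetime(2023, 8, 2, 5, 10, 39)))
# prints: images/article/2023-08-02
```

When passing such a pattern on the command line, put it in single quotes (for example -d 'images/$article_name') so that the shell does not expand the variable itself.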

Run example 1:

./markdown_tool.py nc-1-zfs/article.md

Run example 2:

./markdown_tool.py not-nas/sov/article.md -o html -s "http://www.ossec.net/_images/ossec-arch.jpg" -a

Run example 3 (run on a folder):

find content/ -name "*.md" | xargs -n1 ./markdown_tool.py

Changes

0.1.3

Mostly technical fixes needed for the GUI tool to work.
The tool now has a Qt-based GUI.

0.1.2

-l, --process-local-images is deprecated as of version 0.1.2 and has no effect: local images are always processed.
Images with unrecognized MIME type will not be downloaded by default (use -E to disable this behaviour).
New option -P, --prepend-images-with-path changes image output path structure. If this option is enabled, "remote" image path will be saved in the local directory structure.
Code was significantly refactored.
Some auto tests were added.

0.0.8

The -D (deduplication) option was changed in version 0.0.8. The option is no longer boolean; it takes one of several values: "disabled", "names_hashing", "content_hash". The long option name was changed too: it is now deduplication-type.

Internals

The tool is a pipeline that gets Markdown from the source and processes it using blocks:

- Source downloads the article.
- ImageDownloader downloads every image. Inside it, image deduplicator blocks may be applied to each image.
- Transform the article file, i.e. fix image URLs.
- Format the article to a specific format (Markdown, HTML, PDF, etc.), using the selected formatters.

The ArticleProcessor class is a strategy that applies blocks based on the parameters (from the CLI, for example).
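
The block pipeline described above can be sketched roughly as follows. This is an illustration only: the names Article, download_images, fix_links and run_pipeline are hypothetical and do not come from the package, and the download block only computes local names instead of fetching anything.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Matches Markdown image links such as ![alt](url) and captures the URL.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


@dataclass
class Article:
    text: str
    # Maps the original image URL to the local path chosen for it.
    images: Dict[str, str] = field(default_factory=dict)


Block = Callable[[Article], Article]


def download_images(article: Article) -> Article:
    """Stand-in for a downloader block: picks a local path per image URL."""
    for url in IMAGE_LINK.findall(article.text):
        article.images[url] = "images/" + url.rsplit("/", 1)[-1]
    return article


def fix_links(article: Article) -> Article:
    """Transform block: point every image link at its local path."""
    for url, local_path in article.images.items():
        article.text = article.text.replace(f"]({url})", f"]({local_path})")
    return article


def run_pipeline(article: Article, blocks: List[Block]) -> Article:
    """Apply the selected blocks in order, as a strategy object might do."""
    for block in blocks:
        article = block(article)
    return article


result = run_pipeline(Article("![](https://pandoc.org/diagram.jpg)"),
                      [download_images, fix_links])
print(result.text)  # prints: ![](images/diagram.jpg)
```

Keeping each block a plain callable makes it easy to choose the list of blocks from the CLI parameters.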

markdown_articles_tool's Issues

images

Are only the images pertinent to the article downloaded, or all images?

Refactoring

Use a single parameters block in ArticleProcessor instead of many separate parameters.

TODO: Build and publication to PyPI from Github

Option to replace image file name with hash

Hello!
Can you add an option to replace the image file name with a hash?
This is needed when using one big "media" folder for Markdown notes: to deduplicate images while processing many notes and to get unique file names.

Support of misplaced local images

It would be great if this plugin can also search for missing images in the vault. So far I was not able to find a plugin that can find missing images in the vault. Sometimes we move notes around and the attachments/images are not moved properly, so we end up with missing images even though images would be somewhere in the vault.

thanks

Fix deduplicators

Deduplicators currently:

Don't work correctly.
Don't handle hash collisions.
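
The issue does not say how the deduplicators fail, so the following is only a generic sketch of content-hash deduplication that also handles collisions. The class name ContentHashDeduplicator is hypothetical. It keeps the image bytes in memory to compare them, which a real implementation would probably replace by comparing files on disk.

```python
import hashlib
from typing import Dict, Tuple


class ContentHashDeduplicator:
    """Hypothetical helper: maps identical image bytes to one file name."""

    def __init__(self) -> None:
        # key (hash, possibly with a collision counter) -> (bytes, file name)
        self._by_key: Dict[str, Tuple[bytes, str]] = {}

    def name_for(self, data: bytes, suggested_name: str) -> Tuple[str, bool]:
        """Return (file_name, is_new) for the given image content."""
        digest = hashlib.sha256(data).hexdigest()
        key, counter = digest, 0
        while key in self._by_key:
            stored_data, stored_name = self._by_key[key]
            if stored_data == data:
                # Same bytes: a true duplicate, reuse the stored file.
                return stored_name, False
            # Same hash but different bytes: a collision, try the next slot.
            counter += 1
            key = f"{digest}-{counter}"
        self._by_key[key] = (data, suggested_name)
        return suggested_name, True
```

Comparing the stored bytes before reusing a name is what distinguishes a real duplicate from a hash collision.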

Local image processing

Hello!
Can you add processing of "local" images? That is, copying them (relative and absolute paths) and replacing the paths in the file.
For example, with these files in folders:

1\Test.md
1\_d\1.png
_resources\2.png

Content of Test.md:

![](https://pandoc.org/diagram.jpg)

![](_d/1.png)

![](../_resources/2.png)

and command:

markdown_tool.py Test.md -D -d out -p out -O Test2.md

Produces files:

out\diagram.jpg

and file Test2.md

![](out/diagram.jpg)

![](_d/1.png)

![](../_resources/2.png)

Needed:

out\diagram.jpg
out\1.png
out\2.png

and file:

![](out/diagram.jpg)

![](out/1.png)

![](out/2.png)
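
The requested behaviour could be sketched like this. It is a simplified illustration, not the tool's code: relocate_images is a hypothetical function that only rewrites the links and returns the list of copies to perform; copying the files (for example with shutil.copy2) and handling name clashes are left to the caller.

```python
import re
from pathlib import PurePosixPath
from typing import List, Tuple

# Captures the part before the target, the target itself, and the closing ")".
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")
# Anything starting with "scheme://" is treated as remote and left alone.
REMOTE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://")


def relocate_images(text: str, out_dir: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Rewrite local image links to point into out_dir.

    Returns the new text and a list of (source, destination) pairs to copy.
    """
    copies: List[Tuple[str, str]] = []

    def replace(match: "re.Match[str]") -> str:
        target = match.group(2)
        if REMOTE.match(target):
            return match.group(0)
        destination = str(PurePosixPath(out_dir) / PurePosixPath(target).name)
        copies.append((target, destination))
        return f"{match.group(1)}{destination}{match.group(3)}"

    return IMAGE_LINK.sub(replace, text), copies


new_text, copies = relocate_images("![](_d/1.png)\n\n![](../_resources/2.png)", "out")
print(new_text)
print(copies)
```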

Script fails when image URLs have the same suffix

I tend to share images publicly on my Nextcloud. They get their own preview web page at links like

https://nextcloud/s/xj3DQpSJXgzsA2x

and can be downloaded/inserted into markdown files by appending '/preview'

https://nextcloud/s/xj3DQpSJXgzsA2x/preview

Unfortunately markdown_images_downloader will download all images as preview.ext, overwriting the same file again and again.
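
One possible way to avoid such overwrites is to make the local name depend on the whole URL, not only on its last path segment. The sketch below is an assumption about how that could look; unique_image_name is a hypothetical helper and the eight-character SHA-1 prefix is an arbitrary choice.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def unique_image_name(url: str, extension: str = "") -> str:
    """Build a file name from the URL's last segment plus a hash of the URL."""
    path = PurePosixPath(urlsplit(url).path)
    # Short hash of the full URL keeps names distinct for equal suffixes.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    stem = path.stem or "image"
    suffix = extension or path.suffix
    return f"{stem}-{url_hash}{suffix}"


print(unique_image_name("https://nextcloud/s/xj3DQpSJXgzsA2x/preview", ".png"))
```

Two share links that both end in /preview then produce different file names, because the share token is part of the hashed URL.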

Option to specify output filename

Thanks for this nice tool !
Currently the output filename is non-deterministic, which makes the tool hard to use in batch scripts.
Please provide a parameter like --out FILENAME or something similar.

Make Docker container with a tool

And publish it.

--replace-image-names Option Not Implemented

I tried using the --replace-image-names option as described in the README, but it seems like this feature is not implemented yet. When I use it, no image names are actually replaced. Could you please confirm if this feature is currently available or not? If it's not implemented yet, it would be a really useful addition to the tool.

Steps to Reproduce:

Run command markdown_articles_tool --replace-image-names ...
Observe that image names are not replaced.
Expected Outcome:
Image names should be replaced as per the documentation.

Actual Outcome:
No image names were replaced.

Thank you for looking into this issue.

Download image with blank inside

Some tools, such as CodiMD, use a space to set the image size, as in [name](url =300px).

A simple fix to make such images download is:

diff --git a/markdown_toolset/www_tools.py b/markdown_toolset/www_tools.py
index bc2e58c..ac54304 100644
--- a/markdown_toolset/www_tools.py
+++ b/markdown_toolset/www_tools.py
@@ -34,6 +34,7 @@ def download_from_url(url: str, timeout=None):
     :param url: URL to download.
     :param timeout: timeout before fail.
     """
+    url = url.split()[0]
 
     try:
         response = requests.get(url, allow_redirects=True, timeout=timeout, headers=NECESSARY_HEADERS)

TODO: Improve tests

Additionally:

Replace print's with logging.
Add missing docstrings.

Image download is skipped when using md image size syntax

Hello

Following the 0.1.2 update, I have noticed that some images were not downloaded. It comes from the fact that in Markdown, you can specify an image width or height by adding " =WIDTHxHEIGHT". But when trying to download the image, the tool includes this information in the image URL. For instance, if a markdown file contains

My avatar scaled to 300 pixels width: ![](https://avatars.githubusercontent.com/u/32387838 =300x)

the tool will try to download the image at

https://avatars.githubusercontent.com/u/32387838 =300x

which is an invalid URL. Thus, the error message for unrecognized MIME type will be printed, and the download will be skipped.

Notes:

This syntax is not recognized by every md parser, but it works on CodiMD.
A link may still be valid if a query is used in the URL, as =300x will be considered a parameter. For example, https://avatars.githubusercontent.com/u/32387838?s=80&v=4 =300x is a valid URL
I found this syntax described in this StackOverflow answer: https://stackoverflow.com/a/21242579
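
A possible way to handle this is to split the link target into the URL and the size suffix before downloading. The sketch below is an assumption, not the tool's code: split_image_target is a hypothetical helper, and it only recognizes the " =WIDTHxHEIGHT" form described in this issue, not Markdown link titles.

```python
import re
from typing import Optional, Tuple

# Whitespace, then "=", optional width, "x", optional height, at the end.
SIZE_SUFFIX = re.compile(r"\s+=(\d*)x(\d*)\s*$")


def split_image_target(target: str) -> Tuple[str, Optional[str]]:
    """Split 'URL =WIDTHxHEIGHT' into (URL, size); size is None if absent."""
    match = SIZE_SUFFIX.search(target)
    if match is None:
        return target.strip(), None
    url = target[:match.start()].strip()
    size = target[match.start():].strip()
    return url, size


print(split_image_target("https://avatars.githubusercontent.com/u/32387838 =300x"))
# prints: ('https://avatars.githubusercontent.com/u/32387838', '=300x')
```

Keeping the size part separately allows it to be appended again when the link is rewritten, so the scaling is not lost in the output document.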

Make PyQT UI

Link is deleted even though no image was found

Hello

My users may introduce a mistake when writing markdown, tagging an important link as an image. As a result, the tool will try to download the html and then delete the link. Losing the link from the article is a problem.

For example in a md file containing

Important link to remember: ![](https://www.google.com/)

the link would be deleted when processing the article, resulting in the following

Important link to remember: ![](.html)

A solution to this particular case would be to raise an exception when replacing the image link if the file name is empty. In www_tools.py after line 73 in function get_filename_from_url, add

if f_name == "": raise ValueError(f'F_name is empty {req.url}')

However, the problem would persist for certain links.

Important link: ![](https://github.com/artiomn/markdown_articles_tool)
would still be replaced by
Important link: ![](markdown_articles_tool.html)

In this case it would be necessary to check the MIME type of the downloaded content before replacing the link.
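
The MIME check suggested above could look roughly like this. It is a sketch under assumptions: is_image_content_type and choose_link are hypothetical helpers, and content_type stands for the value of the Content-Type header of the HTTP response.

```python
def is_image_content_type(content_type: str) -> bool:
    """Return True if a Content-Type header value denotes an image."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def choose_link(original_url: str, local_path: str, content_type: str) -> str:
    """Keep the original link unless the downloaded content is an image."""
    if is_image_content_type(content_type):
        return local_path
    return original_url


print(choose_link("https://www.google.com/", ".html", "text/html; charset=UTF-8"))
# prints: https://www.google.com/
```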

Image links and folder doesn't match

The script downloaded the images perfectly into an /images folder in the same directory as the files. However, the markdown links only reference the image by name and should be prepended with /images/. Maybe I just need to specify this when running the command, but I'm not sure from the instructions how to do that.

images with unrecognized MIME type work wrong

Hello!
I use image links in this format:

![](https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpicx.zhimg.com%2F50%2Fv2-53de590b6bb3f42d1a06d28c806c698d_720w.jpg%3Fsource%3D1940ef5c)

so I run

python markdown_articles_tool/markdown_tool.py 1.md -E

The program treated several different image links as identical and wrote them to the same files:

root@pdf:/home/guang# python markdown_articles_tool/markdown_tool.py 1.md -E
Markdown tool version 0.1.3 started...
02.08.2023 05:10:39 File "1.md" will be processed...
02.08.2023 05:10:39 Image public path: 
02.08.2023 05:10:39 Images links count = 17
02.08.2023 05:10:39 Downloading image 1 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpicx.zhimg.com%2F50%2Fv2-53de590b6bb3f42d1a06d28c806c698d_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image will be written to the file "/home/guang/images/1.png"...
02.08.2023 05:10:40 Downloading image 2 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpica.zhimg.com%2F50%2Fv2-872d10f75dfa52172835fe6fbf22c5fe_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image will be written to the file "/home/guang/images/1.jpg"...
02.08.2023 05:10:40 Downloading image 3 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-c4b89a30d2a3fe1897cfe24388ec935e_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image "/home/guang/images/1.jpg" already exists and will not be written...
02.08.2023 05:10:40 Downloading image 4 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-2a53a9691dd1823bf8e268bccd5ddc33_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:41 Image "/home/guang/images/1.png" already exists and will not be written...
02.08.2023 05:10:41 Downloading image 5 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-0efb5de65201ba08c47863d88b61669f_720w.jpg%3Fsource%3D1940ef5c"...

Hope you can help me to solve this problem, thanks

Missing png file extension

When downloading from Unsplash, no file extension is appended.
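
One way to address this would be to derive the extension from the Content-Type of the response when the name has none. The sketch below uses the standard mimetypes module; ensure_extension is a hypothetical helper and is not part of the tool.

```python
import mimetypes
from pathlib import PurePosixPath


def ensure_extension(file_name: str, content_type: str) -> str:
    """Append an extension guessed from the Content-Type if none is present."""
    if PurePosixPath(file_name).suffix:
        return file_name
    # Drop parameters such as "; charset=..." before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for unknown types; keep the name as-is then.
    extension = mimetypes.guess_extension(media_type) or ""
    return file_name + extension


print(ensure_extension("photo", "image/png"))  # prints: photo.png
```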

TODO: think how to rework publication to pypi

From master?
Don't use releases/* branches or use?
Restore func. with ref.started and the master branch?