artiomn / markdown_articles_tool

Parse markdown article, download images and replace images URL's with local paths

License: MIT License

Python 99.94% Shell 0.06%
markdown markdown-converter images md markdown-parser downloader markdown-to-html markdown-to-pdf html markdown-articles

markdown_articles_tool's Introduction

Markdown articles tool 0.1.3

Free command line utility, written in Python, designed to help you manage online and downloaded Markdown documents (e.g., articles). The Markdown Articles Tool is available for macOS, Windows, and Linux.

The tool can:

  • Download Markdown documents together with their images:
    • Find all image links, download the images, and fix the links in the document.
    • Skip broken links.
    • Deduplicate identical images by content hash, or use a hash as the file name.
  • Process images linked with the HTML <img> tag.
  • Process local image files.
  • Convert Markdown documents to:
    • HTML.
    • PDF.
    • Plain Markdown.

To use individual functions, import the package directly.

Installation

From the repository

You need Python 3.9+.

Run:

git clone "https://github.com/artiomn/markdown_articles_tool"
pip3 install -r markdown_articles_tool/requirements.txt

From PyPI

pip3 install markdown-tool

Usage

Syntax:

markdown_tool [options] <article_file_path_or_url>

options:
  -h, --help            show this help message and exit
  -D {disabled,names_hashing,content_hash}, --deduplication-type {disabled,names_hashing,content_hash}
                        Deduplicate images, using content hash or SHA1(image_name) (default: disabled)
  -d IMAGES_DIRNAME, --images-dirname IMAGES_DIRNAME
                        Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url) (default: images)
  -a, --skip-all-incorrect
                        skip all incorrect images (default: False)
  -E, --download-incorrect-mime
                        download "images" with unrecognized MIME type (default: False)
  -s SKIP_LIST, --skip-list SKIP_LIST
                        skip URL's from the comma-separated list (or file with a leading '@') (default: None)
  -i {md,html,md+html,html+md}, --input-format {md,html,md+html,html+md}
                        input format (default: md)
  -l, --process-local-images
                        [DEPRECATED] Process local images (default: False)
  -n, --replace-image-names
                        Replace image names, using content hash (default: False)
  -o {md,html}, --output-format {md,html}
                        output format (default: md)
  -p IMAGES_PUBLIC_PATH, --images-public-path IMAGES_PUBLIC_PATH
                        Public path to the folder of downloaded images (possible variables: $article_name, $time, $date, $dt, $base_url)
  -P, --prepend-images-with-path
                        Save relative images paths (default: False)
  -R, --remove-source   Remove or replace source file (default: False)
  -t DOWNLOADING_TIMEOUT, --downloading-timeout DOWNLOADING_TIMEOUT
                        how many seconds to wait before downloading will be failed (default: -1)
  -O OUTPUT_PATH, --output-path OUTPUT_PATH
                        article output file name or path
  --verbose, -v         More verbose logging (default: False)
  --version             return version number
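
The $-variables accepted by -d and -p look like Python template placeholders. The sketch below shows one way such a pattern could be expanded. It is not the tool's own code: the function name `expand_dirname` and the date and time formats are assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path
from string import Template
from typing import Optional


def expand_dirname(pattern: str, article_path: str, base_url: str = "",
                   now: Optional[datetime] = None) -> str:
    """Substitute $-variables in a directory pattern (formats are assumed)."""
    now = now or datetime.now()
    variables = {
        "article_name": Path(article_path).stem,
        "time": now.strftime("%H-%M-%S"),            # assumed format
        "date": now.strftime("%Y-%m-%d"),            # assumed format
        "dt": now.strftime("%Y-%m-%d_%H-%M-%S"),     # assumed format
        "base_url": base_url,
    }
    # safe_substitute leaves unknown $names untouched instead of raising.
    return Template(pattern).safe_substitute(variables)


print(expand_dirname("images/$article_name/$date", "nc-1-zfs/article.md",
                     now=datetime(2023, 8, 2, 5, 10, 39)))
# prints: images/article/2023-08-02
```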

Run example 1:

./markdown_tool.py nc-1-zfs/article.md

Run example 2:

./markdown_tool.py not-nas/sov/article.md -o html -s "http://www.ossec.net/_images/ossec-arch.jpg" -a

Run example 3 (run on a folder):

find content/ -name "*.md" | xargs -n1 ./markdown_tool.py
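
Where find and xargs are not available (for example on Windows), the same batch run could be driven from Python. This is a minimal sketch: `build_commands` and `run_all` are hypothetical helpers, and the path ./markdown_tool.py is taken from the examples above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_commands(root: str, tool: str = "./markdown_tool.py") -> List[List[str]]:
    """Return one command line per Markdown file found under root."""
    return [[sys.executable, tool, str(path)]
            for path in sorted(Path(root).rglob("*.md"))]


def run_all(root: str) -> None:
    """Run the tool on every article; a failing file does not stop the batch."""
    for command in build_commands(root):
        subprocess.run(command, check=False)
```

Calling `run_all("content")` processes the same files as the find command above, one file per invocation.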

Changes

0.1.3

  • Mostly technical fixes needed for the GUI tool to work.
  • Now the tool has Qt-based GUI.

0.1.2

  • -l, --process-local-images is deprecated since version 0.1.2 and has no effect: local images are always processed.
  • Images with an unrecognized MIME type are not downloaded by default (use -E to download them anyway).
  • The new option -P, --prepend-images-with-path changes the output path structure for images: when it is enabled, the path of each "remote" image is reproduced in the local directory structure.
  • Code was significantly refactored.
  • Some auto tests were added.

0.0.8

The -D (deduplication) option changed in version 0.0.8. It is no longer boolean and takes one of several values: "disabled", "names_hashing", "content_hash". The long option name changed too: it is now --deduplication-type.
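
To show how the three values differ, here is a minimal sketch of a naming function. It is not the tool's implementation: `deduplicated_name` is a hypothetical helper, and SHA-256 for the content hash is an assumption. The help text only states that names are hashed with SHA1.

```python
import hashlib
from pathlib import PurePosixPath


def deduplicated_name(image_name: str, content: bytes, mode: str = "disabled") -> str:
    """Pick a file name for an image according to the deduplication mode."""
    if mode == "disabled":
        return image_name
    if mode == "names_hashing":
        # The help text describes this mode as SHA1(image_name).
        digest = hashlib.sha1(image_name.encode("utf-8")).hexdigest()
    elif mode == "content_hash":
        # Assumption: the real hash algorithm is not documented here.
        digest = hashlib.sha256(content).hexdigest()
    else:
        raise ValueError(f"unknown deduplication type: {mode}")
    # Keep the original extension so viewers still recognize the file type.
    return digest + PurePosixPath(image_name).suffix
```

With "content_hash", two downloads with identical bytes and the same extension get the same file name, so the second one can be skipped. With "names_hashing", only images with the same name are merged.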

Internals

The tool is a pipeline that gets Markdown from the source and processes it using blocks:

  • Source downloads the article.
  • ImageDownloader downloads every image. Image deduplicator blocks may be applied to each image inside it.
  • Transform the article file, i.e. fix the image URLs.
  • Format the article to a specific format (Markdown, HTML, PDF, etc.) using the selected formatters.

The ArticleProcessor class is a strategy that applies the blocks based on the parameters (from the CLI, for example).
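
The pipeline can be pictured with the following sketch. It only illustrates the block structure: `ArticlePipeline` and its fields are hypothetical stand-ins, not the package's real classes or signatures, and the image regex is deliberately simplified.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Simplified pattern for ![alt](url); real Markdown allows more forms.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


@dataclass
class ArticlePipeline:
    """Hypothetical stand-in for ArticleProcessor: runs the blocks in order."""
    load: Callable[[str], str]            # source block: returns article text
    fetch_image: Callable[[str], str]     # downloader block: URL -> local path
    formatters: List[Callable[[str], str]] = field(default_factory=list)

    def process(self, source: str) -> List[str]:
        text = self.load(source)
        # Transform step: point every image link at the downloaded copy.
        text = IMAGE_LINK.sub(
            lambda m: f"![{m.group(1)}]({self.fetch_image(m.group(2))})", text)
        # Format step: one output per selected formatter.
        return [fmt(text) for fmt in self.formatters]


pipeline = ArticlePipeline(
    load=lambda src: "![logo](https://example.com/img/logo.png)",
    fetch_image=lambda url: "images/" + url.rsplit("/", 1)[-1],
    formatters=[str],  # identity formatter standing in for plain Markdown
)
print(pipeline.process("article.md"))
# prints: ['![logo](images/logo.png)']
```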

markdown_articles_tool's Issues

images

Are only the images pertinent to the article downloaded, or all images?

Refactoring

  • Use one parameters block in the ArticleProcessor instead of many separate parameters.

Option to replace image file name with hash

Hello!
Can you add an option to replace the image file name with a hash?
This is needed when using one big "media" folder for Markdown notes: it deduplicates images while processing many notes and gives unique file names.

Support of misplaced local images

Hi

It would be great if this plugin can also search for missing images in the vault. So far I was not able to find a plugin that can find missing images in the vault. Sometimes we move notes around and the attachments/images are not moved properly, so we end up with missing images even though images would be somewhere in the vault.

thanks

Fix deduplicators

Deduplicators currently:

  • Don't work correctly.
  • Don't resolve hash collisions.

Local image processing

Hello!
Can you add processing of "local" images? That is, copying them (relative and absolute paths) and replacing the paths in the file.
For example, given these files in folders:

1\Test.md
1\_d\1.png
_resources\2.png

Content of Test.md:

![](https://pandoc.org/diagram.jpg)

![](_d/1.png)

![](../_resources/2.png)

and command:

markdown_tool.py Test.md -D -d out -p out -O Test2.md

Produces files:

out\diagram.jpg

and file Test2.md

![](out/diagram.jpg)

![](_d/1.png)

![](../_resources/2.png)

Needed:

out\diagram.jpg
out\1.png
out\2.png

and file:

![](out/diagram.jpg)

![](out/1.png)

![](out/2.png)
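
The requested behaviour can be sketched as a planning step that resolves each local link against the article's folder and computes the new link. `plan_local_copies` is a hypothetical helper, not part of the tool, and it uses POSIX-style separators for the Markdown links.

```python
import posixpath
from typing import Dict, List, Tuple


def plan_local_copies(article_dir: str, links: List[str],
                      out_dir: str) -> Dict[str, Tuple[str, str]]:
    """Map each local image link to (file to copy, rewritten link)."""
    plan = {}
    for link in links:
        if "://" in link:
            continue  # remote images are handled by the downloader
        # Resolve the link relative to the folder that holds the article.
        source = posixpath.normpath(posixpath.join(article_dir, link))
        new_link = posixpath.join(out_dir, posixpath.basename(link))
        plan[link] = (source, new_link)
    return plan


links = ["https://pandoc.org/diagram.jpg", "_d/1.png", "../_resources/2.png"]
for old, (source, new) in plan_local_copies("1", links, "out").items():
    print(f"{old}: copy {source} -> {new}")
# prints:
# _d/1.png: copy 1/_d/1.png -> out/1.png
# ../_resources/2.png: copy _resources/2.png -> out/2.png
```

Two images with the same base name in different folders would collide in this sketch. Hashing the file name or content is one way to avoid that.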

Option to specify output filename

Thanks for this nice tool !
Currently the output filename is non-deterministic which makes it hard to use in batch scripts.
Please provide a parameter like --out FILENAME or something similar.

--replace-image-names Option Not Implemented

I tried using the --replace-image-names option as described in the README, but it seems like this feature is not implemented yet. When I use it, no image names are actually replaced. Could you please confirm if this feature is currently available or not? If it's not implemented yet, it would be a really useful addition to the tool.

Steps to Reproduce:

Run command markdown_articles_tool --replace-image-names ...
Observe that image names are not replaced.
Expected Outcome:
Image names should be replaced as per the documentation.

Actual Outcome:
No image names were replaced.

Thank you for looking into this issue.

Download image with blank inside

Some tools, such as CodiMD, use a space inside the link target to set the image size, as in [name](url =300px).

A simple fix that lets such images download:

diff --git a/markdown_toolset/www_tools.py b/markdown_toolset/www_tools.py
index bc2e58c..ac54304 100644
--- a/markdown_toolset/www_tools.py
+++ b/markdown_toolset/www_tools.py
@@ -34,6 +34,7 @@ def download_from_url(url: str, timeout=None):
     :param url: URL to download.
     :param timeout: timeout before fail.
     """
+    url = url.split()[0]
 
     try:
         response = requests.get(url, allow_redirects=True, timeout=timeout, headers=NECESSARY_HEADERS)

Image download is skipped when using md image size syntax

Hello

Following the 0.1.2 update, I have noticed that some images were not downloaded. It comes from the fact that in Markdown, you can specify an image width or height by adding " =WIDTHxHEIGHT". But when trying to download the image, the tool includes this information in the image URL. For instance, if a markdown file contains

My avatar scaled to 300 pixels width: ![](https://avatars.githubusercontent.com/u/32387838 =300x)

the tool will try to download the image at

https://avatars.githubusercontent.com/u/32387838 =300x

which is an invalid URL. Thus, the error message for unrecognized MIME type will be printed, and the download will be skipped.

Notes:

  • This syntax is not recognized by every md parser, but it works on CodiMD.
  • A link may still be valid if a query is used in the URL, as =300x will be considered a parameter. For example, https://avatars.githubusercontent.com/u/32387838?s=80&v=4 =300x is a valid URL.
  • I found this syntax described in this StackOverflow answer: https://stackoverflow.com/a/21242579
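
One possible way to handle this is to separate the size suffix from the URL before downloading. The sketch below is an illustration, not the tool's code: `split_image_target` is a hypothetical helper and it recognizes only the " =WIDTHxHEIGHT" form described above.

```python
import re
from typing import Optional, Tuple

# URL, whitespace, then "=WIDTHxHEIGHT" where either number may be omitted.
SIZE_SUFFIX = re.compile(r"^(?P<url>\S+)\s+=(?P<width>\d*)x(?P<height>\d*)$")


def split_image_target(target: str) -> Tuple[str, Optional[str]]:
    """Return (url, size suffix or None) for a Markdown image target."""
    target = target.strip()
    match = SIZE_SUFFIX.match(target)
    if match is None:
        return target, None
    size = f"={match.group('width')}x{match.group('height')}"
    return match.group("url"), size


print(split_image_target("https://avatars.githubusercontent.com/u/32387838 =300x"))
# prints: ('https://avatars.githubusercontent.com/u/32387838', '=300x')
```

Keeping the suffix lets the rewritten link carry the size through, e.g. `![](images/32387838.png =300x)`.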

Link is deleted even though no image was found

Hello

My users may introduce a mistake when writing markdown, tagging an important link as an image. As a result, the tool will try to download the html and then delete the link. Losing the link from the article is a problem.

For example in a md file containing

Important link to remember: ![](https://www.google.com/)

the link would be deleted when processing the article, resulting in the following

Important link to remember: ![](.html)

A solution to this particular case would be to raise an exception when replacing the image link if the file name is empty. In www_tools.py after line 73 in function get_filename_from_url, add

if f_name == "": raise ValueError(f'F_name is empty {req.url}')

However, the problem would persist for certain links.

Important link: ![](https://github.com/artiomn/markdown_articles_tool)
would still be replaced by
Important link: ![](markdown_articles_tool.html)

In this case it would be necessary to check the MIME type of the downloaded content before replacing the link.
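
A minimal sketch of such a check follows, assuming the Content-Type header of the response is available as a string. `is_image_response` and `rewrite_link` are hypothetical names, not functions from www_tools.py.

```python
def is_image_response(content_type: str) -> bool:
    """True when a Content-Type header value denotes an image."""
    # Drop parameters such as "; charset=UTF-8" and normalize the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("image/")


def rewrite_link(original_url: str, local_path: str, content_type: str) -> str:
    """Keep the original link unless the downloaded content is an image."""
    return local_path if is_image_response(content_type) else original_url


print(rewrite_link("https://www.google.com/", ".html", "text/html; charset=UTF-8"))
# prints: https://www.google.com/
```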

Image links and folder don't match

The script downloaded the images perfectly into an /images folder in the same directory as the files. However, the markdown links only reference the image by name and should be prepended with /images/. Maybe I just need to specify this when running the command, but I'm not sure from the instructions how to do that.

images with unrecognized MIME type work wrong

Hello!
I use an image link format like this:

![](https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpicx.zhimg.com%2F50%2Fv2-53de590b6bb3f42d1a06d28c806c698d_720w.jpg%3Fsource%3D1940ef5c)

so I run the command

python markdown_articles_tool/markdown_tool.py 1.md -E

The program treated some different image links as identical, as the log shows:

root@pdf:/home/guang# python markdown_articles_tool/markdown_tool.py 1.md -E
Markdown tool version 0.1.3 started...
02.08.2023 05:10:39 File "1.md" will be processed...
02.08.2023 05:10:39 Image public path: 
02.08.2023 05:10:39 Images links count = 17
02.08.2023 05:10:39 Downloading image 1 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpicx.zhimg.com%2F50%2Fv2-53de590b6bb3f42d1a06d28c806c698d_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image will be written to the file "/home/guang/images/1.png"...
02.08.2023 05:10:40 Downloading image 2 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpica.zhimg.com%2F50%2Fv2-872d10f75dfa52172835fe6fbf22c5fe_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image will be written to the file "/home/guang/images/1.jpg"...
02.08.2023 05:10:40 Downloading image 3 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-c4b89a30d2a3fe1897cfe24388ec935e_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:40 Image "/home/guang/images/1.jpg" already exists and will not be written...
02.08.2023 05:10:40 Downloading image 4 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-2a53a9691dd1823bf8e268bccd5ddc33_720w.jpg%3Fsource%3D1940ef5c"...
02.08.2023 05:10:41 Image "/home/guang/images/1.png" already exists and will not be written...
02.08.2023 05:10:41 Downloading image 5 of 17 from "https://cubox.pro/c/filters:no_upscale()?valid=false&imageUrl=https%3A%2F%2Fpic1.zhimg.com%2F50%2Fv2-0efb5de65201ba08c47863d88b61669f_720w.jpg%3Fsource%3D1940ef5c"...

Hope you can help me to solve this problem, thanks
