A Python script that reads the Debian news page and writes a Markdown file that renders the same content.
Clone this repo locally, create a virtual environment, install the dependencies, and run main.py.
git clone https://github.com/mungai-njoroge/html2md.git
cd html2md
If you have Poetry installed:
poetry install
# run main.py
poetry run python main.py
Without Poetry:
# create virtual environment
python -m venv venv
# activate it
source venv/bin/activate
# install dependencies
pip install -r requirements.txt
# run script
python main.py
- requests - Downloading the webpage
- BeautifulSoup - Parsing HTML into a tree structure
- markdownify - Generating Markdown from an HTML string
The page is fetched using the requests package and then parsed into a tree structure using BeautifulSoup.
Information that a wiki engine can use is extracted from the page and stored as front matter in the final Markdown file.
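As a rough illustration of the front matter step, the sketch below renders a metadata mapping as a YAML front matter block using only the standard library. The helper name `render_front_matter`, the field names, and the example URL are hypothetical; the fields main.py actually extracts may differ.

```python
import json


def render_front_matter(metadata: dict[str, str]) -> str:
    """Render a metadata mapping as a YAML front matter block."""
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps produces a double-quoted scalar, which is also valid YAML,
        # so colons or quotes inside a value cannot break the block.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    # A blank line separates the front matter from the Markdown body.
    return "\n".join(lines) + "\n\n"


# Hypothetical field names and values, for illustration only.
front_matter = render_front_matter(
    {"title": "News", "source": "https://example.org/news"}
)
print(front_matter + "# News\n")
```

Quoting every value is a conservative choice: it avoids having to decide which strings YAML would otherwise misread.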
The relevant section of the webpage is inside the element with id content. This section is singled out using the BeautifulSoup.find method. Unneeded elements in the content section are identified and removed using the BeautifulSoup.decompose method. The markdownify package is then used to generate Markdown from the remainder.
Tests are defined in the test_main.py file. You can run them with pytest (which was installed as a dependency).
python -m pytest
With Poetry:
poetry run python -m pytest