html2md's Introduction

html2md

A Python script that reads the Debian news page and writes a Markdown file that renders the same content.

Read the thought process

Running it

Clone this repo locally, create a virtual environment, install the dependencies, and run main.py.

git clone https://github.com/mungai-njoroge/html2md.git

cd html2md

If you have Poetry installed:

poetry install

# run main.py
poetry run python main.py

Without Poetry:

# create virtual environment
python -m venv venv

# activate it
source venv/bin/activate

# install dependencies
pip install -r requirements.txt

# run script
python main.py

Libraries used

requests - Downloading the webpage
BeautifulSoup - Parsing HTML into a tree structure
markdownify - Generating Markdown from an HTML string

How it works

The page is fetched using the requests package and then parsed into a tree structure using BeautifulSoup.
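The fetch-and-parse step could look roughly like the sketch below. It is an illustration rather than the code in main.py: the NEWS_URL value, the function names, and the timeout are assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL for illustration; the script's actual target may differ.
NEWS_URL = "https://www.debian.org/News/"


def parse_html(html: str) -> BeautifulSoup:
    """Parse an HTML string into a BeautifulSoup tree."""
    # "html.parser" is backed by the standard library, so no extra install is needed.
    return BeautifulSoup(html, "html.parser")


def fetch_page(url: str = NEWS_URL, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and return its parsed tree (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse_html(response.text)


if __name__ == "__main__":
    soup = fetch_page()
    print(soup.title.string if soup.title else "(no title)")
```

Keeping the parsing in its own function means it can be exercised on a saved HTML string without touching the network.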

Metadata that a wiki engine can use is extracted from the page and stored for use as front matter in the final Markdown file.
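As a rough sketch of the front matter step: the README does not say which fields are extracted, so the title and description used here are assumptions, as are the helper names and the YAML-style layout.

```python
from bs4 import BeautifulSoup


def extract_metadata(soup: BeautifulSoup) -> dict:
    """Collect a few page-level fields (the choice of fields is hypothetical)."""
    metadata = {}
    if soup.title and soup.title.string:
        metadata["title"] = soup.title.string.strip()
    description = soup.find("meta", attrs={"name": "description"})
    if description and description.get("content"):
        metadata["description"] = description["content"].strip()
    return metadata


def to_front_matter(metadata: dict) -> str:
    """Render metadata as a YAML-style front matter block."""
    lines = ["---"]
    for key, value in metadata.items():
        # Quote every value so colons or quotes in a title do not break the YAML.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}: "{escaped}"')
    lines.append("---")
    return "\n".join(lines) + "\n"
```

The resulting string can be written at the top of the output file, ahead of the generated Markdown body.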

The relevant section of the webpage is inside the element whose id is "content". This element is located with the BeautifulSoup.find method. Unneeded elements inside it are then removed with the decompose method, which deletes a tag and its children from the tree. Finally, the markdownify package generates Markdown from what remains.

Running tests

Tests are defined in test_main.py. Run them with pytest, which is installed along with the other dependencies.

python -m pytest

With Poetry:

poetry run python -m pytest
