html2md

A Python script that reads the Debian wiki news page and spits out a markdown file that renders the same page.

Running it

Clone this repo locally, create a virtual environment, install dependencies and run main.py.

git clone https://github.com/mungai-njoroge/html2md.git

cd html2md

If you have Poetry installed:

poetry install

# run main.py
poetry run python main.py

Without Poetry:

# create virtual environment
python -m venv venv

# activate it
source venv/bin/activate

# install dependencies
pip install -r requirements.txt

# run script
python main.py

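Libraries used

requests (fetching the page), BeautifulSoup (parsing and pruning the HTML), markdownify (HTML to markdown conversion) and pytest (tests).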

How it works

The page is fetched using the requests package and then parsed into a tree structure using BeautifulSoup.
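The fetch-and-parse step could look roughly like the sketch below. The helper names `fetch_html` and `parse_html` are hypothetical and not taken from main.py, and the choice of the built-in `html.parser` backend is an assumption. The demonstration parses an inline document, so it runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_html(html: str) -> BeautifulSoup:
    """Parse HTML text into a tree using Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration: parse an inline document instead of fetching one.
sample = (
    "<html><head><title>News</title></head>"
    "<body><div id='content'><p>Hello</p></div></body></html>"
)
soup = parse_html(sample)
print(soup.title.string)
```

For a real run, `parse_html(fetch_html(url))` would chain the two steps, with `url` pointing at the page to convert.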

Metadata that a wiki engine can use is extracted from the page and stored, then written as front matter at the top of the final markdown file.
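As a rough illustration of the front matter step, the sketch below pulls two properties from a parsed page and renders them as a YAML block. The fields chosen (`title`, `lang`) and the helper names `extract_metadata` and `to_front_matter` are assumptions made for the example. The script may collect different information.

```python
import json

from bs4 import BeautifulSoup


def extract_metadata(soup: BeautifulSoup) -> dict:
    """Collect a few page properties; the chosen fields are illustrative."""
    metadata = {}
    if soup.title and soup.title.string:
        metadata["title"] = soup.title.string.strip()
    html_tag = soup.find("html")
    if html_tag is not None and html_tag.get("lang"):
        metadata["lang"] = html_tag["lang"]
    return metadata


def to_front_matter(metadata: dict) -> str:
    """Render metadata as a YAML front matter block delimited by '---'."""
    lines = ["---"]
    for key, value in metadata.items():
        # JSON string quoting is also valid YAML, so colons and quotes stay safe.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


html = '<html lang="en"><head><title> News </title></head><body></body></html>'
soup = BeautifulSoup(html, "html.parser")
front_matter = to_front_matter(extract_metadata(soup))
print(front_matter)
```

The resulting string can be prepended to the generated markdown before the file is written.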

The relevant section of the webpage is inside the element with id content. This element is singled out using BeautifulSoup's find method. Unneeded elements inside it are identified and removed using the decompose method, which deletes a tag and everything it contains from the tree. The markdownify package is then used to generate markdown from the remainder.

Running tests

Tests are defined in the test_main.py file. Run them with pytest, which is installed as a dependency.

python -m pytest

With Poetry:

poetry run python -m pytest
