This project forked from evangoer/scrape-markdown

scrape-markdown

A simple utility for scraping web pages or raw HTML data and returning the results in Markdown.

Installation

Install using npm:

$ npm install scrape-markdown

Usage

scrape-markdown [-h] [-s selector] [url|file|html]...

scrape-markdown accepts one or more URLs, filepaths, or HTML strings. If you supply an HTML string, scrape-markdown converts the data to Markdown directly. If you supply a URL or filepath, scrape-markdown attempts to fetch the contents first.
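As a rough illustration of the three input kinds described above, the sketch below sorts an argument into "url", "html", or "file". The classifyInput name and its rules (an http:// or https:// prefix means a URL, a leading < means raw HTML, anything else is a filepath) are assumptions made for this example; scrape-markdown's actual detection logic may differ.

```javascript
// Hypothetical helper, not taken from scrape-markdown's source.
// Assumed rules: an http(s) prefix means URL, a leading "<" means raw HTML,
// and everything else is treated as a path to a local file.
function classifyInput(arg) {
  const text = arg.trim();
  if (/^https?:\/\//i.test(text)) {
    return 'url';
  }
  if (text.startsWith('<')) {
    return 'html';
  }
  return 'file';
}

console.log(classifyInput('http://yahoo.com'));   // url
console.log(classifyInput('<h1>Hello</h1>'));     // html
console.log(classifyInput('path/to/file.html'));  // file
```

Under these assumptions, only the "url" and "file" cases would need a fetch or read step before conversion; the "html" case can be converted directly.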

-h, --help

Displays a usage statement.

-s, --selector

Scrapes HTML from each page using the specified CSS selector. scrape-markdown selects every matching node on the page using querySelectorAll and concatenates the innerHTML of each match. The default selector is 'body'.

Each invocation of scrape-markdown accepts only a single selector. To apply different selectors to different pages, invoke scrape-markdown once per selector.

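Depending on your selector, you might need to enclose the value in quotes. For example, a selector of h1 does not require special treatment, but h1 a or #main must be quoted, because the shell would otherwise split h1 a into two arguments and can treat # as the start of a comment.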
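The selection step described for --selector (querySelectorAll, then concatenating each match's innerHTML) can be sketched as follows. The extractHtml name, the empty-string join, and the stub document are assumptions made for this example; in practice the document object would come from whatever DOM implementation the tool uses.

```javascript
// Hypothetical sketch of the selector step, not scrape-markdown's own code.
// `doc` is any object exposing querySelectorAll, such as a DOM Document.
function extractHtml(doc, selector = 'body') {
  const nodes = Array.from(doc.querySelectorAll(selector));
  // Assumption: matches are joined with no separator.
  return nodes.map((node) => node.innerHTML).join('');
}

// Minimal stand-in for a parsed page, so the sketch runs without a DOM library.
const fakeDoc = {
  querySelectorAll(selector) {
    const matches = {
      '.warning': [{ innerHTML: '<p>Hot</p>' }, { innerHTML: '<p>Sharp</p>' }],
    };
    return matches[selector] || [];
  },
};

console.log(extractHtml(fakeDoc, '.warning')); // <p>Hot</p><p>Sharp</p>
```

Because the function takes the selector as a single parameter, it mirrors the one-selector-per-invocation rule: a different selector means a separate call.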

Examples

Convert the Yahoo! and Google homepages to Markdown

$ scrape-markdown http://yahoo.com http://google.com

Extract all Express API documentation

$ scrape-markdown --selector "#right" http://expressjs.com/api.html

Convert an arbitrary string to Markdown

$ echo "<h1>Hello</h1>" | scrape-markdown

or

$ scrape-markdown "<h1>Hello</h1>"

Scrape all warning divs out of a local file

$ scrape-markdown --selector .warning path/to/file.html

or

$ cat path/to/file.html | scrape-markdown --selector .warning 

Fetch and scrape a page using curl

$ curl http://example.com | scrape-markdown --selector .content

scrape-markdown can fetch URLs on its own, but it does not give you fine-grained control over the request. If you need to use retries, set HTTP headers, etc., you can use a more powerful utility such as curl or wget and pipe the output to scrape-markdown.

Get all unordered lists in a collection of local HTML files

$ find . -name "*.html" | xargs scrape-markdown --selector ul

License

This software is free to use under a 3-clause BSD license. See the LICENSE file for license text and copyright information.
