A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

Home Page: https://danburzo.ro/projects/percollate/

License: MIT License

percollate

Percollate is a command-line tool that turns web pages into beautifully formatted PDF, EPUB, HTML or Markdown files.

Sample Output

Sample spread from the generated PDF of a chapter in Dimensions of Colour; rendered here in black & white for a smaller image file size.

Installation

percollate is a Node.js command-line tool which you can install globally from npm:

npm install -g percollate

Percollate and its dependencies require Node.js 14.17.0 or later.

Community-maintained packages

There's a packaged version available on the Arch User Repository, which you can install using your local AUR helper (yay, pacaur, or similar):

yay -S nodejs-percollate

Some Docker images are available in this tracking issue.

Usage

Run percollate --help for a list of available commands and options.

Percollate is invoked on one or more operands (usually URLs):

percollate <command> [options] url [url]...

The following commands are available:

  • percollate pdf produces a PDF file;
  • percollate epub produces an EPUB file;
  • percollate html produces an HTML file;
  • percollate md produces a Markdown file.

The operands can be URLs, paths to local files, or the - character, which stands for stdin (the standard input).

Available options

Unless otherwise stated, these options apply to all four commands.

-o, --output

Specify the path of the resulting bundle relative to the current folder.

percollate pdf https://example.com -o my-example.pdf

-u, --url

Using the - operand you can read the HTML content from stdin, as fetched by a separate command, such as curl. In this sort of setup, percollate does not know the URL from which the content has been fetched, and relative paths on images, anchors, et cetera won't resolve correctly.

Use the --url option to supply the source's original URL.

curl https://example.com | percollate pdf - --url=https://example.com
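
To see why the base URL matters, here is a minimal JavaScript sketch built on the standard `URL` class that ships with Node.js. It is not percollate's own code, and the helper name `resolveAgainst` is made up for illustration.

```javascript
// Hypothetical helper (not part of percollate): resolve a reference found in
// the HTML against the URL the page was originally fetched from.
function resolveAgainst(reference, baseUrl) {
	try {
		return new URL(reference, baseUrl).href;
	} catch (err) {
		// A relative reference without a usable base cannot be resolved.
		return null;
	}
}

console.log(resolveAgainst('/images/logo.png', 'https://example.com/blog/post'));
console.log(resolveAgainst('figure.png', 'https://example.com/blog/post'));
console.log(resolveAgainst('figure.png', undefined));
```

The last call returns `null`: without a base URL, a relative path has nothing to resolve against, which is the situation `--url` is meant to avoid.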

-w, --wait

By default, percollate processes URLs in parallel. Use the --wait option to process them sequentially instead, with a pause between items. The delay is specified in seconds, and can be zero.

percollate epub --wait=1 url1 url2 url3
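
The difference between the two modes can be sketched in JavaScript as follows. This is an illustration under assumptions, not percollate's implementation: `processAll`, `processOne`, and `sleep` are hypothetical names, and `processOne` stands in for whatever fetches and cleans up a single page.

```javascript
// Hypothetical helper: resolve after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Process all URLs in parallel when no wait is given; otherwise process them
// one at a time, pausing `waitSeconds` between consecutive items.
async function processAll(urls, processOne, waitSeconds) {
	if (waitSeconds === undefined) {
		return Promise.all(urls.map(url => processOne(url)));
	}
	const results = [];
	for (let i = 0; i < urls.length; i++) {
		if (i > 0) {
			await sleep(waitSeconds * 1000);
		}
		results.push(await processOne(urls[i]));
	}
	return results;
}
```

With a wait of zero, the items are still handled strictly one after another, just without a pause in between.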

--individual

By default, percollate bundles all web pages in a single file. Use the --individual flag to export each source to a separate file.

percollate pdf --individual http://example.com/page1 http://example.com/page2

--template

Path to a custom HTML template. Applies to pdf, html, and md.

--style

Path to a custom CSS stylesheet, relative to the current folder.

--css

Additional CSS styles you can pass from the command-line to override styles specified by the default/custom stylesheet.

--no-amp

Don't prefer the AMP version of the web page.

--debug

Print more detailed information.

-t, --title

Provide a title for the bundle.

percollate epub http://example.com/page-1 http://example.com/page-2 --title="Best Of Example"

-a, --author

Provide an author for the bundle.

percollate pdf --author="Ella Example" http://example.com

--cover

Generate a cover. The option is implicitly enabled when the --title option is provided, or when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-cover flag.

--toc

Generate a hyperlinked table of contents. The option is implicitly enabled when bundling more than one web page to a single file. Disable this implicit behavior by passing the --no-toc flag.

Applies to pdf, html, and md.
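
The implicit defaults described for --cover and --toc can be summarized in a short JavaScript sketch. The function and field names are assumptions made for illustration, not percollate's internals; `explicit` is `true` for `--cover` / `--toc`, `false` for `--no-cover` / `--no-toc`, and `undefined` when neither flag was passed.

```javascript
// Hypothetical resolver for the --cover option.
function resolveCover({ explicit, title, pageCount, individual }) {
	// An explicit --cover or --no-cover always wins.
	if (explicit !== undefined) return explicit;
	// Implicitly on when a title is given, or several pages go into one file.
	const bundlingSeveral = pageCount > 1 && !individual;
	return Boolean(title) || bundlingSeveral;
}

// Hypothetical resolver for the --toc option.
function resolveToc({ explicit, pageCount, individual }) {
	if (explicit !== undefined) return explicit;
	// Implicitly on only when several pages go into one file.
	return pageCount > 1 && !individual;
}

console.log(resolveCover({ title: 'Best Of Example', pageCount: 1 }));
console.log(resolveToc({ explicit: false, pageCount: 3 }));
```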

--hyphenate

Hyphenation is enabled by default for pdf, and disabled for epub, html, and md. You can opt into hyphenation with the --hyphenate flag, or disable it with the --no-hyphenate flag.

See also the Hyphenation and justification recipe.

--inline

Embed images inline with the document. Images are fetched and converted to Base64-encoded data URLs.

This option is particularly useful for html to produce self-contained HTML files.
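
As a rough illustration of what such a conversion involves, the following JavaScript sketch encodes raw bytes as a Base64 data URL using Node's built-in `Buffer`. The helper name `toDataUrl` is hypothetical, and the sketch says nothing about how percollate fetches images or detects their MIME types.

```javascript
// Hypothetical helper: turn raw image bytes into a Base64-encoded data URL.
function toDataUrl(bytes, mimeType) {
	const base64 = Buffer.from(bytes).toString('base64');
	return `data:${mimeType};base64,${base64}`;
}

// Two bytes of sample data stand in for a real image here.
console.log(toDataUrl(Buffer.from('hi'), 'text/plain'));
```

The resulting string can be used directly as the value of an `src` attribute, which is what makes the HTML file self-contained.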

--md.<option>=<value>

Pass options to the underlying Markdown stringifier, mdast-util-to-markdown. These are the default Markdown options:

const DEFAULT_MARKDOWN_OPTIONS = {
	fences: true,
	emphasis: '_',
	strong: '_',
	resourceLink: true,
	rule: '-'
};
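
To make the `--md.<option>=<value>` syntax concrete, here is a hedged JavaScript sketch of how such arguments could be layered on top of the defaults. The function `parseMarkdownArgs` is hypothetical, and the coercion of `true` / `false` strings to booleans is an assumption; percollate's actual argument parser may behave differently.

```javascript
const DEFAULT_MARKDOWN_OPTIONS = {
	fences: true,
	emphasis: '_',
	strong: '_',
	resourceLink: true,
	rule: '-'
};

// Hypothetical parser: collect --md.<option>=<value> arguments and merge
// them over the defaults. Unrelated arguments are ignored.
function parseMarkdownArgs(argv) {
	const overrides = {};
	for (const arg of argv) {
		const match = /^--md\.([^=]+)=(.*)$/.exec(arg);
		if (!match) continue;
		const [, key, raw] = match;
		// Assumption: "true" / "false" become booleans, the rest stay strings.
		overrides[key] = raw === 'true' ? true : raw === 'false' ? false : raw;
	}
	return { ...DEFAULT_MARKDOWN_OPTIONS, ...overrides };
}

console.log(parseMarkdownArgs(['--md.emphasis=*', '--md.fences=false']));
```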

Recipes

Basic bundling

To turn a single web page into a PDF:

percollate pdf --output=some.pdf https://example.com

To bundle several web pages into a single PDF, specify them as separate arguments to the command:

percollate pdf --output=some.pdf https://example.com/page1 https://example.com/page2

You can use common Unix commands and keep the list of URLs in a newline-delimited text file:

cat urls.txt | xargs percollate pdf --output=some.pdf

To transform several web pages into individual PDF files at once, use the --individual flag:

percollate pdf --individual https://example.com/page1 https://example.com/page2

If you'd like to fetch the HTML with an external command, you can use - as an operand, which stands for stdin (the standard input):

curl https://example.com/page1 | percollate pdf --url=https://example.com/page1 -

Notice we're using the --url option to tell percollate where the HTML it receives on stdin originally came from, so that relative URLs on links and images resolve correctly.

The --css option

The --css option lets you pass a small snippet of CSS to percollate. Here are some common use-cases:

Custom page size / margins

The default page size is A5 (portrait). You can use the --css option to override it using any supported CSS size:

percollate pdf --css "@page { size: A3 landscape }" http://example.com

Similarly, you can define:

  • custom margins, e.g. @page { margin: 0 }
  • the base font size: html { font-size: 10pt }

Changing the font stacks

The default stylesheet includes CSS variables for the fonts used in the PDF:

:root {
	--main-font: Palatino, 'Palatino Linotype', 'Times New Roman',
		'Droid Serif', Times, 'Source Serif Pro', serif, 'Apple Color Emoji',
		'Segoe UI Emoji', 'Segoe UI Symbol';
	--alt-font: 'helvetica neue', ubuntu, roboto, noto, 'segoe ui', arial,
		sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol';
	--code-font: Menlo, Consolas, monospace;
}

| CSS variable | What it does |
| --- | --- |
| --main-font | The font stack used for body text |
| --alt-font | Used in headings, captions, et cetera |
| --code-font | Used for code snippets |

To override them, use the --css option:

percollate pdf --css ":root { --main-font: 'PT Serif';  --alt-font: Roboto; }" http://example.com

💡 To work correctly, you must have the fonts installed on your machine. Custom web fonts currently require you to use a custom CSS stylesheet / HTML template.

Remove the appended hrefs from hyperlinks

The idea with percollate is to make PDFs that can be printed without losing where the hyperlinks point to. However, for some link-heavy pages, the appended hrefs can become bothersome. You can remove them using:

percollate pdf --css "a:after { display: none }" http://example.com

Hyphenation and justification

Hyphenation is only enabled by default for PDFs, but you can opt in or out of it for any output format with a flag.

When hyphenation is enabled, paragraphs will be justified:

.article__content p {
	text-align: justify;
}

If you prefer left-aligned text:

percollate pdf --css ".article__content p { text-align: left }" http://example.com

The --style option

The --style option lets you use your own CSS stylesheet instead of the default one. Here are some common use-cases for this option:

⚠️ TODO add examples here

The --template option

The --template option lets you use a custom HTML template for the PDF.

💡 The HTML template is parsed with nunjucks, which is a close JavaScript relative of Twig for PHP, Jinja2 for Python and Liquid for Ruby.

Here are some common use-cases:

Customizing the page header / footer

Puppeteer can print some basic information about the page in the PDF. The following CSS class names are available for the header / footer, into which the appropriate content will be injected:

  • date: the formatted print date
  • title: the document title
  • url: the document location (Note: this will print the path of the temporary HTML file, not the original web page URL)
  • pageNumber: the current page number
  • totalPages: the total number of pages in the document

👉 See the Chromium source code for details.

You place your header / footer template in a template element in your HTML:

<template class="header-template"> My header </template>

<template class="footer-template">
	<div class="text center">
		<span class="pageNumber"></span>
	</div>
</template>

See the default HTML for example usage.

You can add CSS styles to the header / footer with either the --css option or a separate CSS stylesheet (the --style option).

💡 The header / footer templates do not inherit their styles from the rest of the page (i.e. they are not part of the cascade), so you'll have to write the full CSS you want to apply to them.

An example from the default stylesheet:

.footer-template {
	font-size: 10pt;
	font-weight: bold;
}

Updating

To keep the tool up-to-date, you can run:

npm install -g percollate

Occasionally, an upgrade might not go according to plan; in this case, you can uninstall and re-install percollate:

npm uninstall -g percollate && npm install -g percollate

How it works

All export formats follow a common pipeline:

  1. Fetch the page(s) using node-fetch
  2. If an AMP version of the page exists, use that instead (disable with --no-amp flag)
  3. Enhance the DOM using jsdom
  4. Pass the DOM through mozilla/readability to strip unnecessary elements
  5. Apply the HTML template and the stylesheet to the resulting HTML
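
The steps above can be sketched as a single function. This is an outline under stated assumptions rather than percollate's source: every stage (`fetch`, `findAmpUrl`, `enhance`, `extractReadable`, `render`) is a hypothetical function injected by the caller, so the sketch runs without network access or third-party packages.

```javascript
// Hypothetical pipeline: `stages` bundles one function per numbered step.
async function processPage(url, stages, options = {}) {
	// 1. Fetch the page.
	let html = await stages.fetch(url);
	// 2. Prefer the AMP version, unless disabled (as with --no-amp).
	if (options.amp !== false) {
		const ampUrl = stages.findAmpUrl(html);
		if (ampUrl) {
			html = await stages.fetch(ampUrl);
		}
	}
	// 3. Enhance the DOM.
	const dom = stages.enhance(html);
	// 4. Strip unnecessary elements.
	const article = stages.extractReadable(dom);
	// 5. Apply the HTML template and the stylesheet.
	return stages.render(article);
}
```

Keeping the stages injectable also makes it easy to see where a format-specific step would take over once the common pipeline returns.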

Different formats then use different tools to produce the final file.

PDFs are rendered with puppeteer.

EPUBs have external images fetched and bundled together with the HTML of each article. When the --inline option is used, images are instead converted to data URLs and embedded into the HTML.

HTMLs are saved without any further changes. When the --inline option is used, images are converted to data URLs and embedded into the HTML. External images are not otherwise fetched.

Markdown files are produced the same way as HTMLs, then processed with a series of utilities from the unified.js umbrella.

Limitations

Percollate inherits the limitations of two of its main components, Readability and Puppeteer (headless Chrome).

The imperative approach Readability takes will not be perfect in each case, especially on HTML pages with atypical markup; you may occasionally notice that it either leaves in superfluous content, or that it strips out parts of the content. You can confirm the problem against Firefox's Reader View. In this case, consider filing an issue on mozilla/readability.

Using a browser to generate the PDF is a double-edged sword. On the one hand, you get excellent support for web platform features. On the other hand, print CSS as defined by W3C specifications is only partially implemented, and it seems unlikely that support will be improved any time soon. However, even with modest print support, I think Chrome is the best (free) tool for the job.

Troubleshooting

On some Linux machines you'll need to install a few more Chrome dependencies before percollate works correctly. (Thanks to @ptica for sorting it out)

The percollate pdf command supports the --no-sandbox Puppeteer flag, but make sure you're aware of the implications before disabling the sandbox.

Using Firefox to render PDFs

This feature is experimental. Please log an issue if you notice anything wrong.

Starting with percollate 3.x, it's possible to use Firefox Nightly as an alternative browser with which to render PDFs. To make Firefox available to Percollate, use the following install command:

PUPPETEER_PRODUCT=firefox npm install percollate

After installation, percollate pdf commands can be run with the --browser=firefox option.

Limitations of Firefox PDF rendering

At the moment, rendering PDFs with Firefox has the following limitations:

  • The pages can't have headers and footers, so there are no page numbers.

Contributing

Contributions of all kinds are welcome! See CONTRIBUTING.md for details.

See also

Here are some other projects to check out if you're interested in building books using the browser:

percollate's Issues

anchor is not defined

$percollate pdf --output 1.pdf https://reactjs.org/docs/hello-world.html
Fetching: https://reactjs.org/docs/hello-world.html
Enhancing web page
(node:10750) UnhandledPromiseRejectionWarning: ReferenceError: anchor is not defined
    at Array.from.forEach.img (/usr/lib/node_modules/percollate/src/enhancements.js:12:18)
    at Array.forEach (<anonymous>)
    at imagesAtFullSize (/usr/lib/node_modules/percollate/src/enhancements.js:11:57)
    at cleanup (/usr/lib/node_modules/percollate/index.js:40:2)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:10750) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10750) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Transform static html folders ?

Let's say we have a static web site which is generated by a static site generator.
Could we transform them into one pdf ?

Browser extension

I tried percollate and found it a very useful tool. However, I would love to use it from within HTML pages, to convert a page to PDF on demand.
How can I run percollate from within the browser instead of Node.js or the command line?

Unexpected token function

Tried running percollate --v, returns the "Unexpected token function" error.

percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at Object.exports.runInThisContext (vm.js:76:16)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:149:9)
at bootstrap_node.js:509:3

Default save location?

Sometimes I might just want to create PDFs from a list of web pages, using the page title as the default file name (using the page title is the default printing behavior of Chrome I believe).

Currently, if I run percollate without specifying --output, it claims to have saved the pdf, but I can't find it in the folder where I executed the command.

Can it just save the web page to the current folder using its title as the filename, when an --output flag is omitted?

noUselessHref: provide option to remove all hrefs

#31 Introduced some href filtering, but I wonder - do we really need to show those hrefs at all? I think they just make it harder to read the text.

Please provide an option to skip all href generation. For now I'm using this hack:

function noUselessHref(doc) {
	// Add the 'no-href' class to every link, without any filtering
	doc.querySelectorAll('a').forEach(el => el.classList.add('no-href'));
}

PDF: Add a Table of Contents to the metadata

A PDF generated from many web pages would benefit from a Table of Contents, implemented as PDF bookmarks. We'll probably need to post-process the PDF with something like HummusJS to write the TOC. (Also, I'd appreciate if someone with more experience would explain whether its license is compatible with our MIT License)

Related: #25

Skip / replace Readability

Readability's results are not always perfect, so let's make it flexible enough so that we can take out the readability step, or replace it with some other way of parsing the content.

This means standardizing what we get back from the parser, and we can take Readability as the baseline output.

Add --css CLI option

Add a --css CLI option to allow sending short style snippets from the CLI directly, without having to use a custom HTML/CSS file. For example, changing the page size:

percollate --output some.pdf --css "@page { size: A4; }" http://example.com

The CSS will be appended to the stylesheet.

Screenshots to README

Hi,

Good work on this one!
It would be nice to include a few screenshots on the README page just to get a brief idea of what the output would look like.

Regards.

Add command `read`

This will start a local server (via serve) that processes the HTMLs and shows a web reader interface.

TOC with multiple levels

It would be great to have an option to feed not just a plain list of URLs, but a tabbed, spaced, or somehow formatted (see below) file with captions and URLs to form a multilevel TOC in a resulting PDF.

A sample of such an input file:

<h1><a href="http://url1.com">Level 1 caption</a></h1>
	<h2><a href="http://url11.com">Level 1-1 caption</a></h2>
		<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>
		<h3><a href="http://url112.com">Level 1-1-2 caption</a></h3>
	<h2><a href="http://url12.com">Level 1-2 caption</a></h2>
		<h3><a href="http://url121.com">Level 1-2-1 caption</a></h3>
		<h3><a href="http://url122.com">Level 1-2-2 caption</a></h3>

In-page anchors on github.com pages don't work

For example:

percollate pdf https://github.com/danburzo/percollate

Will result in a PDF where the links in the Table of Contents don't work:

danburzopercollate.pdf

(produced on macOS / [email protected])

I may be dense, but I can't tell how the anchors work in browsers in the first place 😰 (Later edit: the behavior is dependent on JavaScript, of course 😄)

Failed to launch chrome because running as root without --no-sandbox

Saving as PDF
(node:24806) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
[1012/081529.390835:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

at onClose (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:339:14)
at Interface.helper.addEventListener (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:328:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:139:11)

(node:24806) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24806) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

SyntaxError: Unexpected token ...

When I run percollate pdf --output as32.pdf https://github.com/danburzo/percollate, I get this error:

SyntaxError: Unexpected token ...
    at createScript (vm.js:74:10)
    at Object.runInThisContext (vm.js:116:10)
    at Module._compile (module.js:533:28)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Module.require (module.js:513:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/percollate/cli.js:5:40)

Unexpected token function

I installed percollate globally via npm; however, when I try to run it (on any site), I get the following error:

/usr/local/lib/node_modules/percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at createScript (vm.js:56:10)
at Object.runInThisContext (vm.js:97:10)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:390:7)
at startup (bootstrap_node.js:150:9)

Allow proxy parameters

got accepts proxy parameters.

Either allow command line parameters for proxy to be passed over to got, or even better, honor http_proxy, https_proxy, no_proxy env variables.

Use cases for --stylesheet vs. --css handling in the HTML template

Initially, the HTML template received the path for the stylesheet (either the default one, or a custom one provided with the --stylesheet option):

<head>
<meta charset="utf-8">
<title>Percollate</title>
<link rel='stylesheet' media='all' href="{{ stylesheet }}"/>
</head>

With the introduction of the --css option, I changed it to:

<style type='text/css'>
{{ style }}
</style>

And deprecated the passing of the stylesheet property to the template.

However, I think I might have missed a valid use-case for an external stylesheet, namely being able to reference outside resources (images, web fonts) in the custom CSS.

This issue outlines some use cases, to make sure the final solution covers all of them elegantly:

  • Override just the page size, margins, font sizes
  • Override the default stylesheet with a custom one
  • Use local web fonts in the custom stylesheet

(Adding to this list as more use-cases arise)

Use largest available size for images in Wikipedia articles

The idea of the imagesAtFullSize enhancement is to get the largest available image from blogs using Blogspot, WordPress, and the like:

function imagesAtFullSize(doc) {
	/*
		Replace:
			<a href='original-size.png'>
				<img src='small-size.png'/>
			</a>
		With:
			<img src='original-size.png'/>
	*/
	Array.from(doc.querySelectorAll('a > img:only-child')).forEach(img => {
		let anchor = img.parentNode;
		let original = anchor.href;
		// only replace if the HREF matches an image file
		if (original.match(/\.(png|jpg|jpeg|gif|svg)$/)) {
			img.setAttribute('src', original);
			anchor.parentNode.replaceChild(img, anchor);
		}
	});
}

However, Wikipedia images are an exception:

<a href="/wiki/File:Perkulator.jpg" class="image">
  <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/250px-Perkulator.jpg" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/375px-Perkulator.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/500px-Perkulator.jpg 2x" data-file-width="1944" data-file-height="2592" width="250" height="333">
</a>

They link to what looks like an image file, but is in fact an HTML page for that image. How can we handle this situation gracefully?
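
One possible direction, offered only as a sketch and not as something this issue settled on, is to ignore the link and pick the highest-density candidate from the image's own `srcset`. The helper name `largestFromSrcset` is hypothetical. It only understands density descriptors such as `1.5x` and `2x`, and it splits candidates on commas, so URLs that themselves contain commas are not handled.

```javascript
// Hypothetical helper: return the highest-density URL listed in `srcset`,
// falling back to `src` (treated as 1x) when nothing better is found.
function largestFromSrcset(src, srcset) {
	let best = { url: src, density: 1 };
	for (const candidate of (srcset || '').split(',')) {
		const [url, descriptor] = candidate.trim().split(/\s+/);
		const match = /^(\d+(?:\.\d+)?)x$/.exec(descriptor || '');
		if (url && match && Number(match[1]) > best.density) {
			best = { url, density: Number(match[1]) };
		}
	}
	return best.url;
}
```

For the Wikipedia markup above, this would select the `500px-Perkulator.jpg` candidate marked `2x`, which is larger than the thumbnail but still not necessarily the original file.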
