arachnys / athenapdf

Drop-in replacement for wkhtmltopdf built on Go, Electron and Docker

License: MIT License

Makefile 2.89% JavaScript 12.78% Shell 0.97% Go 80.95% Dockerfile 2.41%

docker pdf-conversion pdf-converter microservice go golang electron javascript cli html-to-pdf

athenapdf's Introduction

Athena - deprecated

Simple, Docker-powered PDF conversions.

This project has been deprecated and is no longer supported.

Athena comprises an Electron command line interface (CLI) tool and a Go microservice for converting HTML to PDF documents.

Athena transformed Arachne into a spider for challenging her as a weaver and/or weaving a tapestry that insulted the gods.

Examples:

Original: Google isn’t even close as a tool for proper due diligence. Why not? (Converted: PDF | Aggressive)
Original: Panamanian Law Firm Is Gatekeeper To Vast Flow of Murky Offshore Secrets (Converted: PDF | Aggressive)

When aggressive mode is enabled, only the essential contents of a page are kept in the generated PDF document. It is a clutter-free version of the web page, perfect for reading.

Background

Athena is an open source project.

It was designed to do one thing and to do it well - PDF conversions; to work together with other programs; and to be able to handle text streams, because that is a universal interface.

It aims to give users an on-demand capability to convert HTML to PDF without frills.

At the lowest level, its CLI component (athenapdf) was designed to be an alternative / drop-in replacement for wkhtmltopdf, a popular CLI tool for HTML to PDF conversions. Because it runs through Docker, the CLI syntax is a bit more complex, but conversions are much more reliable.

(For what it's worth, wkhtmltopdf is great, but it has a horrible habit of crashing unexpectedly - especially when printing documents with invalid HTML, problematic CSS or other issues).

There is also a microservice component (weaver), allowing you to leverage Athena over HTTP.

Getting Started

CLI vs Microservice

Our CLI tool will suffice for most simple, everyday HTML to PDF conversions.

However, for conversions at scale / PDF conversion as a service, we recommend getting started with our microservice component instead.

The microservice is packaged with athenapdf, and you can run both components independently.

Docker

Both components are packaged, and distributed as Docker images.

The only dependency you will need is Docker, and the rest will be handled for you (even if you are running in an environment without a display server - headless environment).

Quick Start

Before starting, ensure your Docker environment is set up, and ready-to-use.

For OSX / Windows users, ensure your Docker Machine is prepared, and the appropriate environment variables are established.

CLI

docker pull arachnysdocker/athenapdf
docker run --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf <input_path> [output_path]
See cli for full documentation

The [output_path] can be omitted.

Example: docker run --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/

For Windows users, an additional forward slash must precede the volume when using Git Bash / MinGW:

docker run --rm -v /$(pwd):/converted/ arachnysdocker/athenapdf athenapdf https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/

Alternatively, if using the Windows command prompt:

docker run --rm -v %cd%:/converted/ arachnysdocker/athenapdf athenapdf https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/
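When the CLI is called from another program rather than typed into a shell, the volume argument can be built from the working directory directly, so the `$(pwd)` / `%cd%` differences above do not apply. The following is only a sketch in Go: the helper name `athenaArgs` is hypothetical, and it assumes `docker` is on the PATH and that the image and mount point are the ones shown in the quick start.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// athenaArgs builds the argument list for `docker run`, mirroring the
// quick-start command. hostDir is mounted at /converted/ in the container.
// An empty output omits the optional [output_path] argument.
func athenaArgs(hostDir, input, output string) []string {
	args := []string{
		"run", "--rm",
		"-v", hostDir + ":/converted/",
		"arachnysdocker/athenapdf",
		"athenapdf", input,
	}
	if output != "" {
		args = append(args, output)
	}
	return args
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: convert <input_path> [output_path]")
		os.Exit(2)
	}
	dir, err := os.Getwd()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	output := ""
	if len(os.Args) > 2 {
		output = os.Args[2]
	}
	// Arguments are passed as a list, so no shell quoting is involved.
	cmd := exec.Command("docker", athenaArgs(dir, os.Args[1], output)...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "conversion failed:", err)
		os.Exit(1)
	}
}
```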

Microservice

docker pull arachnysdocker/athenapdf-service
docker run -p 8080:8080 --rm arachnysdocker/athenapdf-service
Inline conversion: http://<docker-address>:8080/convert?auth=arachnys-weaver&url=https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/
OR cURL, and redirect output to file: curl http://dockerhost:8080/convert\?auth\=arachnys-weaver\&url\=https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/ > out.pdf
See weaver for full documentation

The default authentication key is arachnys-weaver. This can be changed through the WEAVER_AUTH_KEY environment variable.
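The cURL example above escapes `?`, `=` and `&` by hand. When the request URL is built in code, the same effect can be had by letting a URL library encode the query. A minimal sketch in Go, assuming the service listens on `http://localhost:8080` as in the `docker run -p 8080:8080` command; the function name `convertURL` is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// convertURL returns the weaver /convert URL for a target page.
// base is the address of the microservice, e.g. "http://localhost:8080".
func convertURL(base, authKey, target string) string {
	q := url.Values{}
	q.Set("auth", authKey)
	// Encode percent-escapes the value, so "?", "&" and "#" inside the
	// target URL are not mistaken for parts of the request URL itself.
	q.Set("url", target)
	return base + "/convert?" + q.Encode()
}

func main() {
	fmt.Println(convertURL(
		"http://localhost:8080",
		"arachnys-weaver",
		"https://www.arachnys.com/the-long-road-to-achieving-true-perpetual-kyc/",
	))
}
```

If `WEAVER_AUTH_KEY` has been changed, pass that value as `authKey` instead of the default.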

The microservice can be deployed scalably to ECS if you want to build your own conversion farm.

License

Please note athenapdf is NEITHER affiliated with NOR endorsed by Google Inc. and GitHub Inc.

See LICENSE.

An Arachnys Christmas project.


athenapdf's Issues

Fonts not displaying properly

Hi,

Set it all up in a docker environment and working beautifully! However, when the pdf is generated, the letter spacing is really close together... Using Dejavu Sans. Any idea what the issue is?

Unable to execute built executable

I tried to build athenapdf as per https://github.com/arachnys/athenapdf/blob/master/cli/docs/building.md using NPM.

Everything went fine but when I tried to execute the binary, I am getting the following error:

xxx@ubuntu:~/athenapdf/cli/build/athenapdf-linux-x64$ ./athenapdf
./athenapdf: error while loading shared libraries: libgtk-x11-2.0.so.0: cannot open shared object file: No such file or directory

I found a way to fix this by doing

sudo apt-get install libgtk2.0-0:amd64

Next I got the following error:

xxx@ubuntu:~/athenapdf/cli/build/athenapdf-linux-x64$ ./athenapdf
./athenapdf: error while loading shared libraries: libXtst.so.6: cannot open shared object file: No such file or directory

Now when I tried doing

sudo apt install libxext6

I got this:

Reading package lists... Done
Building dependency tree
Reading state information... Done
libxext6 is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.

How do I fix this? Also, am I doing anything wrong? Also, as a wkhtmltopdf drop-in replacement, I would expect it to run without any dependency issues after building.

Machine Used: VirtualBox VM running Ubuntu Server 14.04 with 4GB RAM

x509: certificate signed by unknown authority

Is it possible to add a runtime option which allows the conversion of sites using untrusted SSL?

We need to convert some internal facing sites to PDF and the certificates are not trusted.

Proxy Support

AthenaPDF looks amazing! Can't wait to give it a try, but need to fetch HTML remotely via proxy. Any planned support for this?

Conversion process hit the URL twice

I have a setup which generates a URL with a one-time access token. I was wondering why PDF generation always failed. My log shows two requests from Athena: the first yields HTTP 200, and the second (for which the token is no longer valid) yields 401.

Apparently this line https://github.com/arachnys/athenapdf/blob/master/weaver/converter/athenapdf/athenapdf.go#L45
actually hits my URL before the actual conversion task.

log.Printf("[AthenaPDF] converting to PDF: %s\n", s.GetActualURI())

Of course, switching GIN_MODE to release immediately fixed my issue (after hours of frustration). Maybe we can log s.URI instead of calling GetActualURI.

Allow override of acl for s3 uploads

Currently all pdfs generated by athena and uploaded to s3 are public by default. I have a use case where we want to define our own acls for uploaded files. Is there any plan to fix this issue?

Support for PNG

Would love support for PNG output, or a good integrated workflow for doing PDF->PNG.

Thanks!

Better docs for CLI args?

Athena mentions being a drop-in replacement for wkhtmltopdf despite the change from an outdated WebKit to Blink. I'm aware I can use --help for a list of options; are they meant to be similar to wkhtmltopdf's? The supported ones seem different. The CLI help isn't too clear about some of the accepted values.

    -h, --help                   output usage information
    -V, --version                output the version number
    --debug                      show GUI
    -T, --timeout <seconds>      seconds before timing out (default: 120)
    -D, --delay <milliseconds>   milliseconds delay before saving (default: 200)
    -P, --pagesize <size>        page size of the generated PDF (default: A4)
    -M, --margins <marginsType>  margins to use when generating the PDF (default: standard)
    -Z --zoom <factor>           zoom factor for higher scale rendering (default: 1 - represents 100%)
    -S, --stdout                 write conversion to stdout
    -A, --aggressive             aggressive mode / runs dom-distiller
    -B, --bypass                 bypasses paywalls on digital publications (experimental feature)
    --proxy <url>                use proxy to load remote HTML
    --no-portrait                render in landscape
    --no-background              omit CSS backgrounds
    --no-cache                   disables caching

zoom: I saw in another issue that it doesn't really have an effect, or at least not the expected one (browser zoom). Presumably it takes a floating point value.
pagesize: I haven't looked into it as A4 is fine, but is there a way to list the available values?
debug: I'm not sure what this does. I tried it and didn't really see anything.
margins: this is the one I'm interested in. I'm not sure how these are meant to be defined. I've seen standard and none as values, but there are no docs or examples elsewhere, nor a way to find out the accepted values. I want to set differing horizontal/vertical margins.

Option for converting from html string in memory?

This is more of a question, but not sure where else to ask. I have a use case where I'm manipulating some html in software and then generating a pdf. Is there a way to use Athenapdf without having to have either a local file or a public url?

instructables.com ?ALLSTEPS stripped?

Hi,

For this URL: http://www.instructables.com/id/DIY-2k2560x1440-Beam-Projector/?ALLSTEPS
the pdf generated online actually shows the content of http://www.instructables.com/id/DIY-2k2560x1440-Beam-Projector/

Is the URL converted? Is it possible to keep ?ALLSTEPS? That allows downloading the full page in a single shot.

Thanks.

Stop PDF from writing to file?

Is there any way to not write the output of athenapdf to a file? I'd like to be able to serve the stdout up to a browser and not have to save it somewhere on the machine itself.

Webgl support

Hi,

This tool generates by far the best PDFs I've seen out of any tool. Great work!

I'd like to be able to print webgl canvases. Right now I don't think this works: for example if you try converting: http://webglsamples.org/aquarium/aquarium.html.

Do you know if it's possible to add support for this and roughly how it might be done?

"Failed to load: -6" with local file

vagrant@mrpdf:/opt/MrPDF/log$ docker run --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf /tmp/348e3a26fe6c19b7e80e2338c720792e.html
Xlib:  extension "RANDR" missing on display ":99".
Xlib:  extension "RANDR" missing on display ":99".
Failed to load: -6  (file:///tmp/348e3a26fe6c19b7e80e2338c720792e.html)

I keep getting this and I am not sure what I am doing wrong?

Page number in a footer

Hi guys,

can you give me an example how to display the page number with a footer? #42 seems to go into the right direction but I do not get what "relying on stamping / watermarking to achieve headers, and footers" means @MrSaints :-(. Any code examples?

Cheers Florian

Zoom option does not affect output

Trying to use the --zoom CLI option to scale the PDF output, but no matter what I pass as factor, the output is identical. I've tried 0.1, 0.5, 1.5, 2, 5, 10, 100 and 200 and the output is always the same.

Version being the "latest" from docker as of today.

Upgrade to latest electron

The latest version of electron is 1.6.1. The project currently uses 1.4.0, so it would be nice if electron can be bumped! 😄

(weaver) Renderer Process Crashes when trying to convert some web pages

This happens when I try to convert blog.arachnys.com to a PDF, this has not happened when converting other web pages.

The error might be related to libudev . The log in the console is pasted below


[Worker #3] processing conversion job (pending conversions: 0)
[AthenaPDF] converting to PDF: http://blog.arachnys.com/
captured errors:
Error #01: exit status 1 : Xlib:  extension "RANDR" missing on display ":99".
Xlib:  extension "RANDR" missing on display ":99".
libudev: udev_has_devtmpfs: name_to_handle_at on /dev: Operation not permitted
The renderer process has crashed.

I tried adding --security-opt seccomp=unconfined to the docker run command, but this only removed the libudev: udev ... error. The response is still json encoded and says something along the lines of "Internal Error"

dirty stream with --stdout option

It's not ideal to have logs and timers within the stdout stream when the --stdout option is set. Is there any way to disable this behaviour?
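Until the stream is clean, a consumer could trim the captured output down to the PDF itself. The sketch below is a heuristic workaround in Go, not something athenapdf provides: it keeps everything from the first `%PDF-` header to the last `%%EOF` marker, and assumes the surrounding log lines never contain those strings. The function name `extractPDF` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// extractPDF returns the bytes from the first "%PDF-" header up to and
// including the last "%%EOF" marker, dropping anything printed around them.
// The boolean is false when no complete PDF is found.
func extractPDF(stream []byte) ([]byte, bool) {
	start := bytes.Index(stream, []byte("%PDF-"))
	end := bytes.LastIndex(stream, []byte("%%EOF"))
	if start < 0 || end < start {
		return nil, false
	}
	return stream[start : end+len("%%EOF")], true
}

func main() {
	// Reads the mixed output of `athenapdf -S ...` from stdin.
	in, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	pdf, ok := extractPDF(in)
	if !ok {
		fmt.Fprintln(os.Stderr, "no PDF found in input")
		os.Exit(1)
	}
	os.Stdout.Write(pdf)
}
```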

Disable JS

I'd like to use this tool instead of a self built solution based on wkhtmltopdf, however since users can upload custom html, I disabled JS on wkhtmltopdf, I didn't find any option to do this with athenapdf. Is it possible and how could I do it? - Environment variable would be preferred, but I could also set it on every request or extend the docker image.

Webfonts not rendering

I attempted to convert the following into a PDF: https://gist.github.com/jacobwgillespie/de11f4100154042b8b8d615230594acd.

This is the resulting PDF: resume.pdf, and it looks like it's missing the webfonts from the stylesheet.

Use headless Chrome

See https://developers.google.com/web/updates/2017/04/headless-chrome, and https://medium.com/@dschnr/using-headless-chrome-as-an-automated-screenshot-tool-4b07dffba79a

I don't think we can use it as a full replacement, but it can definitely be included as a simple, non-configurable adapter (at least for the initial release).

Brainstorm: new conversion backend for weasyprint

http://weasyprint.readthedocs.io/en/latest/#

A little about the service:

WeasyPrint is a visual rendering engine for HTML and CSS that can export to PDF. It aims to support web standards for printing. WeasyPrint is free software made available under a BSD license.

It is based on various libraries but not on a full rendering engine like WebKit or Gecko. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.

Background:

I've been trying to convert a not-so-straightforward HTML document to PDF using the CSS3 specification for paged media. The specification is great and everything I need is covered there; the problem is that it doesn't work at the moment with Athena's engine (WebKit, if I'm not mistaken?).

Athena has been of great use for our organisation, but this is something we really need support for.

I noticed that on #42 someone is also reporting sort of the same problem.

Additionally, I'm willing to help implement this if there's interest in doing so from your side.

Thank you!

Duplication of thead, and other table related issues

It has come to my attention that the recent upgrade to Electron 1.4.0, and thus, Chrome 53 has resulted in various issues when rendering tables.

Problematic commit: 462778a

Related upstream issues:

A fix is available in Chromium Canary, but that is unlikely to reach Electron any time soon.

If you are experiencing this issue, the recommendation for Athena is to pin to an older tag / version.

That is, use:

athenapdf:2.7.0
athenapdf-service:2.7.0

Future upgrade-related releases will always be pinned / tagged for both the cli, and microservice image.

Apologies for this oversight.

CSS background colors are ignored

I've got a page with some SVG elements as well as regular HTML. I'm impressed to see the SVG stuff seems to render as expected* but the background colors of ordinary divs are being discarded, even without the --no-background commandline argument.

*by "as expected," I mean "same as in the browser."

This is with athenapdf version 2.7.0

weaver: Multiple files web page.

Hi guys, I'm thinking of using weaver for an html to pdf conversion. The html file I'm going to convert is not hosted in a webserver so I was planning on sending it to the server using the convertByFileHandler. And a curl similar to:

 curl -v -F file="@index.html" http://localhost:8080/convert?auth=arachnys-weaver

But of course, I can't include css files nor images doing this...is there a way of sending a whole webpage (css, js and html) that I'm not seeing??

S3 configuration to Weaver?

In the Weaver README, it says it can be configured to upload to S3. How can I do this configuration? Thank you for your help!

Page orientation setup

From the CLI I can change orientation with the --no-portrait param.
How can I change it from an inline call?
http://[domain-name]/convert?auth=arachnys-weaver&url=[url]& ... ?

SignatureDoesNotMatch with weaver to s3

I am getting the following when I request that weaver upload the PDF to S3:

Error #1: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
mfpdf_service_1 | status code: 403, request id: 423990AD92B37E16, host id: DfNwssdCx3PTB2Dv6x2DmYP2xzlGWWAaV/NK5wUJglenz8mGFqb6QnA1L//E8iZA2G078HO0p9w=

the credentials, bucket, and key combination work with other tools such as Cyberduck and minio client.

Support for page with authentication

Hi,

Great project! Is there support for authentication (or sessions) on web pages that require a username/password?

Thanks :)

Cookie support

wkhtmltopdf has a "--cookie" option, which is very useful to convert pages that are part of a user session.

It would be great to have the same option also for athenapdf.

Pipe HTML directly via stdin instead of local file or remote URL

Hi, I really like how this looks so far, but I've come across a rather major limitation that prevents easy integration for me: Athenapdf cannot load HTML directly from stdin.

Currently Athenapdf only supports loading local files and remote files via http. I want to skip the whole 'file' part and just pipe the HTML source directly into Athenapdf. wkhtmltopdf supports this through the use of - as a replacement for the input URL, but Athenapdf does not. I tried using /dev/stdin instead, but it seems Athenapdf (or perhaps Electron) does not handle this.

/dev/stdin definitely points to the data I want:

$ echo '<html><head></head><body><h1>hi</h1></body></html>' | sudo docker run --rm -i arachnysdocker/athenapdf cat /dev/stdin
<html><head></head><body><h1>hi</h1></body></html>

However, trying to use /dev/stdin with athenapdf doesn't work at all - it renders an empty pdf:

$ echo '<html><head></head><body><h1>hi</h1></body></html>' | sudo docker run --rm -i arachnysdocker/athenapdf athenapdf -S /dev/stdin
Xlib:  extension "RANDR" missing on display ":99".
Xlib:  extension "RANDR" missing on display ":99".
%PDF-1.4
%▒▒▒▒
1 0 obj
<</Creator (Chromium)
/Producer (Skia/PDF)
/CreationDate (D:20160411201956+00'00')
/ModDate (D:20160411201956+00'00')>>
endobj
2 0 obj
<</Type /Catalog
/Pages 3 0 R>>
endobj
3 0 obj
<</Type /Pages
/Count 1
/Kids [4 0 R]>>
endobj
4 0 obj
<</Type /Page
/Resources <</ProcSets [/PDF /Text /ImageB /ImageC /ImageI]>>
/MediaBox [0 0 596 843]
/Contents 5 0 R
/Parent 3 0 R>>
endobj
5 0 obj
<</Length 18>> stream
1 0 0 -1 0 843 cm

endstream
endobj
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000150 00000 n
0000000197 00000 n
0000000252 00000 n
0000000399 00000 n
trailer
<</Size 6
/Root 2 0 R
/Info 1 0 R>>
startxref
465
%%EOFPDF Conversion: 340.725ms

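In the meantime, here's a simple bash script that uses temporary files to achieve the effect I want: read HTML from stdin and output the PDF to stdout. It is used like this:

echo '<html><head></head><body><h1>Hello, world!</h1></body></html>' | bash ~/athenapdf.sh

athenapdf.sh

#!/bin/bash
# Read HTML from stdin, convert it with athenapdf, and write the PDF to stdout.

# mktemp creates the file under /tmp, which is mounted into the container below.
tmp=$(mktemp)
# Remove the temporary file on exit and on common termination signals.
trap 'rm -f "$tmp"; exit 1' 0 1 2 3 13 15
cat /dev/stdin > "$tmp"
docker run --rm -v /tmp:/tmp/ -v "$(pwd)":/converted/ arachnysdocker/athenapdf athenapdf -S "$tmp" 2> /dev/null
rm -f "$tmp"
trap 0
exit 0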

Support for headers/footers for each page

Is there a way to output arbitrary header/footers (e.g. page numbers) similar to the Save to PDF functionality of the Chrome browser?

EDIT: I read #16 right after posting this and see this is a general problem with the webkit renderer

How to Post HTML data to athena?

Similar to #54, I want to POST some pre-rendered HTML data to athena running as a docker instance, and receive a PDF back that I can save.
I can't send it as a file path because the program running this is inside its own docker container, and setting up a web server inside a container just seems dumb.

I have some HTML that I'm trying to render

<!DOCTYPE html>
<html>
<head>
....
</head>
<body>
.....
</body>
</html>

nothing special.
Using Node, I've read the text of the file (a template that's already had the data mixed in) into the variable htmlData.
and I'm trying to approximate the method used by #54, and POST it as such:

request({
            url: "http://127.0.0.1:8100/convert?auth=arachnys-weaver",
            method: "POST",
            headers: {
                "content-type": "text/html", 
            },
            body: htmlData
        }, function (error, response, body){
            console.log(error);
            // console.log(response);
            console.log(body);
        });

This is a bog-standard POST using the request library (I've also used the standard HTTP library).

I've also tried changing the content type to multipart/form-data, as mentioned by @MrSaints, and sending no content type at all.

In every case, athena logs:

captured errors:
Error #01: invalid file provided

[GIN] 2017/01/13 - 19:34:51 | 400 |      118.96µs | 172.17.0.1 |   POST    /convert

Is there a doc I'm missing on how to POST/ what am I doing wrong?
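The "invalid file provided" error suggests the handler is looking for an uploaded file rather than a raw request body. One thing to try is sending the HTML as a multipart/form-data part named `file`, which is what the `curl -F file="@index.html"` command in the multiple-files issue does. The sketch below shows that request shape in Go; the field name is an assumption taken from that curl command, the address and auth key are copied from the request above, and `newUploadRequest` is a hypothetical helper, so whether weaver accepts it is untested here.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

// newUploadRequest wraps html in a multipart/form-data body with a single
// part named "file" (assumed field name) and returns a POST for endpoint.
func newUploadRequest(endpoint, filename string, html []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(html); err != nil {
		return nil, err
	}
	// Close writes the terminating boundary; the body is invalid without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, &body)
	if err != nil {
		return nil, err
	}
	// The header must carry the boundary generated by the writer.
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func run() error {
	html := []byte("<!DOCTYPE html><html><body><h1>Hello</h1></body></html>")
	req, err := newUploadRequest("http://127.0.0.1:8100/convert?auth=arachnys-weaver", "index.html", html)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("conversion failed: %s", resp.Status)
	}
	out, err := os.Create("out.pdf")
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Setting only the `Content-Type` header to multipart/form-data while sending the raw HTML as the body is not equivalent: the body itself has to be encoded into parts with a boundary.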

Aggressive mode is not working

athenapdf CLI is currently using a version of Electron where executeJavaScript is broken when webSecurity is set to false.

It'll be upgraded with the fix in the next Electron release as it is not an urgent bug.

EDIT: This affects the print media plugin as well.

Allow weaver to change Athena cmd at runtime - Page orientation

Currently the athena command is set in the config file with WEAVER_ATHENA_CMD, but this means the command is fixed.

I need to be able to change page orientation & likely other options as parameters to weaver, just like you can with the auth & s3 credentials.

Is this already possible & I've just missed it, or likely something that you would be willing to add?

Thanks for your work on this, really great tool!

Windows volume issues with example commands

As it stands, the example commands will fail on most Windows instances, due to an issue with path resolution. When using Git Bash (or rather, MSYS), you will get an error similar to the following:

$ docker run --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf https://www.github.com test.pdf
C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: invalid bind mount spec "/D/Repositories/athena;C:\\Program Files\\Git\\converted\\": invalid volume specification: '/D/Repositories/athena;C:\Program Files\Git\converted\': invalid mount config for type "bind": invalid mount path: '\Program Files\Git\converted\' mount path must be absolute.
See 'C:\Program Files\Docker\Docker\Resources\bin\docker.exe run --help'.

According to this comment, you can get around this issue by adding an additional forward slash in front of $(pwd):

$ docker run --rm -v /$(pwd):/converted/ arachnysdocker/athenapdf athenapdf https://www.github.com test.pdf
...
Converted 'https://www.github.com' to PDF: 'test.pdf'
PDF Conversion: 3292.573ms

Alternatively, using the Windows CMD, pwd will not work. For that, you have to use %cd%:

C:\Users\Jake>docker run --rm -v %cd%:/converted/ arachnysdocker/athenapdf athenapdf http://github.com/ test.pdf
...
Converted 'http://github.com/' to PDF: 'test.pdf'
PDF Conversion: 3670.077ms

Naturally, this is down to how Docker works, and not necessarily the fault of Athena, but since we're claiming that this is a "Drop-in replacement for wkhtmltopdf", we might as well provide commands that work correctly for Windows users.

I can author a PR including these alternative invocations, but I was unsure if you wanted a warning for using MSYS, CMD, or both. I think these notices would be particularly helpful and save other people some troubleshooting time, particularly if they have never used Docker and are installing it purely as a wkhtmltopdf replacement.

question: Running it all inside Electron possible ?

In order for weaver to work, it has to run in Docker, right?
It's essentially running Electron in a headless container to get the job done, I am guessing.

It's a great design, but I need it to run on desktops with no Docker or reliance on a server.

Please advise. I can make the agreed changes if you are open to that and they are required.

is there any known way to create a table of contents based on anchor links of an html page?

Custom page size

I've tried inline css width: 300mm; height: 300mm; on the html/body/element being rendered, also tried params like --page-width 300mm from wkhtmltopdf - all with no luck.

DOC: url for aggressive mode

Hi,

I couldn't find in your documentation that the url for the aggressive mode would be: convert?auth=arachnys-weaver&aggressive&url=http://blog.arachnys.com/

Thanks.

Line being split across pages

I'm using the Athenapdf docker image to convert html to pdfs. The image is working and things look good, but on one of my test documents a line is being split across two pages.

I had this same issue come up occasionally on wkhtmltopdf and this is one of the main reasons that I'm looking for something else. Have you seen this before? Any thoughts on how to remedy it?

Can I set delay config when using Weaver?

I need to set a timeout for the Weaver (hosted) version. Do I need to modify the WEAVER_ATHENA_CMD command to include the delay?

Thanks for your help!

Would be awesome if we could generate a table of contents based on h1 - h6.

Ebooks. :D

(Currently I managed to achieve this only by compiling wkhtmltopdf or using the great, but EXPENSIVE PrinceXML)

Save output as user not root? + harmless errors?

I've tried to print from a local dev server(provided the host network and port), the delay is to get around a toast notification when the page is loaded(BrowserSync based). This all works good.

$ docker run --net="host" --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf http://localhost:3000 --delay 5000
Xlib:  extension "RANDR" missing on display ":99".
Xlib:  extension "RANDR" missing on display ":99".
libudev: udev_has_devtmpfs: name_to_handle_at on /dev: Operation not permitted
Converted 'http://localhost:3000' to PDF: '2bf0fa1d7db9ecd60b2c436978513d3c1c5536a8.pdf'
PDF Conversion: 5517.037ms

$ ls -la
-rw-r--r--   1 root root  167730 Nov 11 18:53 2bf0fa1d7db9ecd60b2c436978513d3c1c5536a8.pdf

The errors don't seem to be causing any problems, the PDF is written, it's just done so as the root user(due to running via Docker I guess). The file doesn't show up in my file browser, I have to open an instance of the file browser as root. A docker setting can get around this I guess?(if this isn't the case for you it could be due to my distro/package settings for docker too, I'm in a group called docker).

EDIT: Just saw the libudev suppression note here. Still looking into the file ownership issue, seems using gosu with an entrypoint script should do the trick.

Monitor and restart Xvfb

I've been using athenapdf for about a month now in a production app and roughly once a week the Xvfb instance dies in the weaver container and we have to restart it. We get a general error message: "PDF conversion failed due to an internal server error". It would really be helpful if the Xvfb process was monitored automatically to keep it alive.

CLI tool failing on FreeBSD

After installing athenapdf with docker pull arachnysdocker/athenapdf, I tried to convert an HTML file called review.html to PDF.

This fails as follows:

$ docker run --rm -v $(pwd):/converted/ arachnysdocker/athenapdf athenapdf ./review.html .
[89390:0405/141456.780648:FATAL:zygote_host_impl_linux.cc(140)] Check failed: base::UnixDomainSocket::EnableReceiveProcessId(fds[0]).
#0 0x000001ccd51e <unknown>
#1 0x000001cb27cb <unknown>
#2 0x000002c6126b <unknown>
#3 0x000002bfa64f <unknown>
#4 0x000002ad1f4e <unknown>
#5 0x000002991d2c <unknown>
#6 0x0000029e36ef <unknown>
#7 0x000002989128 <unknown>
#8 0x0000026d99e4 <unknown>
#9 0x0000026d9df0 <unknown>
#10 0x000003b85483 main
#11 0x000811021b45 <unknown>
#12 0x0000005a9009 <unknown>

jail: /athenapdf/entrypoint.sh athenapdf ./review.html .: exited on signal 6

My system details are:

$ docker --version
Docker version 1.7.0-dev, build 582db78
$ uname -a
FreeBSD x220 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017     [email protected]:/usr/obj/usr/src/sys/GENERIC  amd64

Support MHTML conversions

Chromium / Electron is able to open, and decode MHTML files provided that a "hint" is given through the file extension. i.e. It won't work if you're opening a remote MHTML file (considered an octet-stream) or if you're converting a local file without the .mhtml extension.

The solution would've been simpler had Go's http.DetectContentType supported the detection of the message/rfc822 header, but it doesn't. Plus, ioutil.TempFile doesn't support file suffixes, only prefixes. So even if we detected that the file is an MHTML file, we would have to rename the temporary file (assuming we are saving the octet-stream for local conversion, since the CLI can't convert it remotely) to include the necessary .mhtml extension.

It's not necessarily a challenge, but it will pollute the source handler with MHTML specific code.

From Wikipedia:

MIME type for MHTML is not well agreed upon. Used MIME types include:

multipart/related
application/x-mimearchive
message/rfc822

Preliminary checker:

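import (
    "io"
    "mime"
    "net/mail"
    "strings"
)

// isMHTML reports whether r holds a MIME message whose Content-Type is
// multipart/*, which is treated here as a hint that the file is MHTML.
func isMHTML(r io.Reader) (bool, error) {
    // MHTML files start with RFC 822-style headers, so parse them as mail.
    m, err := mail.ReadMessage(r)
    if err != nil {
        return false, err
    }

    mt, _, err := mime.ParseMediaType(m.Header.Get("Content-Type"))
    if err != nil {
        return false, err
    }

    return strings.HasPrefix(mt, "multipart/"), nil
}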

Loading of Angular-style URLs

Thanks for making a start on a great conversion tool. So far it seems to work more consistently on complex web pages than wkhtmltopdf.

I've run into an issue with Angular-style URLs that have a # in them. To see this in action, run a conversion on http://triangular.oxygenna.com/#/dashboards/sales - rather than convert this page, it just converts the default page http://triangular.oxygenna.com/. I suspect this will turn out to be an issue for quite a few people.

Replace `did-finish-load` with `ready-to-show`

With the introduction of Electron 1.2.3, there is a new ready-to-show event which might work better than did-finish-load when it comes to capturing the contents of a page in a timely manner.

Advantages:

It does not wait for slow resources to finish loading (initial render)
Possibly addresses FOUC without timeout hacks

Disadvantages:

Some legitimate remote resources may not be loaded
Debug mode won't trigger save

Possible compromise:

An option to switch between the two.

Currently, a page with several slow to load remote resources may take up to 2 minutes before timing out (the HTTP requests). As such, Weaver may terminate the "work" before the page can be converted, even if the main frame, and the core assets are successfully loaded. With this event, the page can be converted despite these slow resources.

Athenapdf seems to strip images

I've been trying out Athenapdf with some websites, but it seems like images are not included in the output.

The issue is reproducible with the google frontpage.

Command:

docker run -v (pwd):/converted/ arachnysdocker/athenapdf athenapdf -D 10000 -T 100000 --no-cache --no-portrait https://google.de

Output:

a413860bd1d2c36582bc11bb306cee8620b2f911.pdf