gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

License: GNU Affero General Public License v3.0

JavaScript 91.31% Dockerfile 0.46% Shell 7.81% Batchfile 0.21% TypeScript 0.22%
cli deno nodejs single-file web-archiving web-scraper web-scraping

single-file-cli's Introduction

SingleFile CLI (Command Line Interface)

Introduction

SingleFile can be launched from the command line by running it in a (headless) browser.

It runs on Deno as a standalone script injected into the web page via the Chrome DevTools Protocol, instead of being embedded in a WebExtension.

Installation

SingleFile can be run without installation: download the executable file from https://github.com/gildas-lormeau/single-file-cli/releases and save it in the directory of your choice.

Make sure Chrome or a Chromium-based browser is installed in the default location. Otherwise, you may need to set the --browser-executable-path option to help SingleFile locate the executable file.
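For example, assuming a Chromium binary installed at /usr/bin/chromium (the path is illustrative):

  single-file --browser-executable-path=/usr/bin/chromium "https://www.wikipedia.org" wikipedia.html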

Installation with Docker

  • Installation from Docker Hub

    docker pull capsulecode/singlefile

    docker tag capsulecode/singlefile singlefile

  • Manual installation

    git clone --depth 1 --recursive https://github.com/gildas-lormeau/single-file-cli.git

    cd single-file-cli

    docker build --no-cache -t singlefile .

  • Run

    docker run singlefile "https://www.wikipedia.org"

  • Run and redirect the result into a file

    docker run singlefile "https://www.wikipedia.org" > wikipedia.html

  • Run and mount a volume to get the saved file in the current directory

    • Save one page

      docker run -v %cd%:/usr/src/app/out singlefile "https://www.wikipedia.org" wikipedia.html (Windows)

      docker run -v $(pwd):/usr/src/app/out singlefile "https://www.wikipedia.org" wikipedia.html (Linux/UNIX)

    • Save one or multiple pages by using the filename template (see --filename-template option)

      docker run -v %cd%:/usr/src/app/out singlefile "https://www.wikipedia.org" --dump-content=false (Windows)

      docker run -v $(pwd):/usr/src/app/out singlefile "https://www.wikipedia.org" --dump-content=false (Linux/UNIX)

  • An alternative Dockerfile can be found at https://github.com/screenbreak/SingleFile-dockerized. It allows you to save pages from the command line or through an HTTP server.

Manual installation

  • Install Deno

  • There are three ways to download the code of SingleFile; choose the one you prefer:

    • Download and install with npm (npm is installed with Node.js)

      npm install "single-file-cli"
      cd node_modules/single-file-cli

      You can also install SingleFile globally with -g when running npm install.

    • Manually download and unzip the master archive provided by GitHub

      unzip master.zip
      cd single-file-cli-master
    • Download with git

      git clone --depth 1 --recursive https://github.com/gildas-lormeau/single-file-cli.git
      cd single-file-cli
  • Make single-file executable (Linux/Unix/BSD etc.).

    chmod +x single-file

Run

  • Syntax

    single-file <url> [output] [options ...]
  • Display help

    single-file --help
  • Examples

    • Save a page and print the result to stdout

      single-file https://www.wikipedia.org --dump-content
    • Save a page into wikipedia.html

      single-file https://www.wikipedia.org wikipedia.html
    • Save a list of URLs stored in list-urls.txt in the current folder

      single-file --urls-file=list-urls.txt
    • Crawl a page and its inner links (one level deep), rewriting the crawled URLs to strip query strings

      single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"
    • Crawl a page and its external links (one level deep), using a rewrite rule with no replacement to filter out URLs containing "wikipedia"

      single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=false --crawl-external-links-max-depth=1 --crawl-rewrite-rule="^.*wikipedia.*$"

Compile executables

  • Compile executables into /dist

    ./compile.sh

License

SingleFile and SingleFile CLI are licensed under AGPL. Code derived from third-party projects is licensed under MIT. Please contact me at gildas.lormeau <at> gmail.com if you are interested in licensing the SingleFile code for a commercial service or product.

single-file-cli's People

Contributors

gildas-lormeau, homedirectory, pirate, sissbruecker, yakabuff

single-file-cli's Issues

Add an option to import settings from browser when using cli

Describe the solution you'd like
Support importing the settings file (singlefile-settings.json) exported from the browser extension into the CLI configuration.

The option name could be:
--browser-settings-file
Description:
Override the default CLI settings with the settings file exported from the browser extension.
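A hypothetical invocation, using the option name proposed above (the option does not exist yet):

single-file --browser-settings-file=singlefile-settings.json "https://www.wikipedia.org" wikipedia.html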

single-file-cli doesn't exit with an error code when retrieving a page fails

Hi. On encountering a network error as shown below, single-file-cli prints the error but does not propagate a non-zero exit code, which makes it difficult to determine whether any URLs in a batch failed to download.

net::ERR_NAME_NOT_RESOLVED at <url elided> URL: <url elided>                                                                                                                                                                                      
Stack: Error: net::ERR_NAME_NOT_RESOLVED at <url elided>                                                                 
    at navigate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:236:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)                        
    at async Frame.goto (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:206:21)
    at async CDPPage.goto (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:439:16)
    at async pageGoto (<path>/single-file-cli/back-ends/puppeteer.js:194:3)    
    at async getPageData (<path>/single-file-cli/back-ends/puppeteer.js:132:3)
    at async exports.getPageData (<path>/single-file-cli/back-ends/puppeteer.js:56:10)
    at async capturePage (<path>/single-file-cli/single-file-cli-api.js:254:20)
    at async runNextTask (<path>/single-file-cli/single-file-cli-api.js:175:20)                                        
    at async Promise.all (index 0)

Can you ensure single-file-cli exits with an appropriate code on error? It would also be nice if there were a way to determine whether the provided URL returns a 404 (or any 4xx or 5xx code), but maybe that's harder to do.
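For reference, a minimal sketch of the scripting this would enable, assuming (as requested above) that a non-zero exit code signals a failed capture:

# Record URLs whose capture failed so they can be retried later
# with --urls-file=failed-urls.txt.
while read -r url; do
  if ! single-file "$url" --dump-content=false; then
    echo "$url" >> failed-urls.txt
  fi
done < list-urls.txt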

Stacktrace when running SingleFile CLI for a particular website

I installed SingleFile from the Docker image in the usual way:

docker pull capsulecode/singlefile
docker tag capsulecode/singlefile singlefile

I can then save a webpage as usual and everything works as expected:

docker run -v $(pwd):/usr/src/app/out singlefile "https://www.wikipedia.org" --dump-content=false

However, when I try to save the webpage https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html I get a stacktrace:

$ docker run -v $(pwd):/usr/src/app/out singlefile "https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html" --dump-content=false
Evaluation failed: SyntaxError: Invalid regular expression: /^var(--/: Unterminated group
    at String.match (<anonymous>)
    at String.startsWith (https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/search/nwSearchFnt.js:871:15)
    at <anonymous>:1:217815
    at Array.find (<anonymous>)
    at <anonymous>:1:217804
    at Array.find (<anonymous>)
    at tp (<anonymous>:1:217793)
    at Object.removeUnusedFonts (<anonymous>:1:300431)
    at jm.removeUnusedFonts (<anonymous>:1:267447)
    at Nm.executeTask (<anonymous>:1:243712) URL: https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html
Stack: Error: Evaluation failed: SyntaxError: Invalid regular expression: /^var(--/: Unterminated group
    at String.match (<anonymous>)
    at String.startsWith (https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/search/nwSearchFnt.js:871:15)
    at <anonymous>:1:217815
    at Array.find (<anonymous>)
    at <anonymous>:1:217804
    at Array.find (<anonymous>)
    at tp (<anonymous>:1:217793)
    at Object.removeUnusedFonts (<anonymous>:1:300431)
    at jm.removeUnusedFonts (<anonymous>:1:267447)
    at Nm.executeTask (<anonymous>:1:243712)
    at ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:221:19)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
    at async getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:139:10)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:51:10)
    at async capturePage (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:254:20)
    at async runNextTask (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:175:20)
    at async Promise.all (index 0)
    at async capture (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:126:2)
    at async run (/usr/src/app/node_modules/single-file-cli/single-file:54:2)

I do not get an error when I try to save the same page using the SingleFile extension for Firefox. Thanks!

When spawning singlefile from within a docker container, it will leave defunct chrome processes behind.

This is not a problem when doing single-shot executions with a docker container that starts, runs singlefile, and stops again. However, when running a small flask server as a frontend to singlefile which spawns singlefile via subprocess, it will leave a couple of defunct chrome processes behind.

It seems this is a general problem with Puppeteer and does not originate from SingleFile itself. To overcome this issue, tini can be used as a prefix when calling SingleFile, which makes sure all child processes are terminated:

tini -s /usr/src/app/node_modules/single-file-cli/single-file ...

Leaving this here for documentation purposes.
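For example, a hedged Dockerfile sketch for such a container (the base image and server.js entry point are illustrative assumptions; tini's -s flag enables subreaper mode so orphaned grandchildren are also reaped):

# Container that spawns single-file as a subprocess; tini reaps the
# defunct chrome processes left behind.
FROM node:18-slim
RUN apt-get update && apt-get install -y tini
COPY . /usr/src/app
ENTRYPOINT ["tini", "-s", "--"]
CMD ["node", "/usr/src/app/server.js"]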

Page download failing for Youtube when using browser cookie

single-file-cli fails when used on a YouTube Community page with the error below, and only fails when the browser height is very large (greater than roughly 50,000 pixels) and a browser cookie is used from an account that has a membership of the channel.

Command: single-file --browser-cookies-file="I:\archive scripts\batch scripts\singlefile_script\cookie.txt" --browser-executable-path="C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --browser-script="I:\archive scripts\batch scripts\singlefile_script\replace_image_url_script.js" --browser-load-max-time=120000 --browser-wait-delay=20000 --browser-height=200000 --crawl-replace-urls=true "https://www.youtube.com/channel/UCqm3BQLlJfvkTsX_hvm0UmA/community"

Error:

Evaluation failed: RangeError: Invalid string length
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479 URL: https://www.youtube.com/channel/UCqm3BQLlJfvkTsX_hvm0UmA/community
Stack: Error: Evaluation failed: RangeError: Invalid string length
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at Array.forEach (<anonymous>)
    at <anonymous>:1:239464
    at cy (<anonymous>:1:239632)
    at <anonymous>:1:239479
    at ExecutionContext._ExecutionContext_evaluate (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:229:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ExecutionContext.evaluate (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:107:16)
    at async getPageData (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\puppeteer.js:150:10)
    at async Object.exports.getPageData (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\puppeteer.js:56:10)
    at async capturePage (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:254:20)
    at async runNextTask (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:175:20)
    at async Promise.all (index 0)
    at async capture (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:126:2)

Crawl restriction

Is it possible to restrict a crawl to only URLs with a specific prefix? Or only URLs that contain a specific word?

E.g., I only want to save a page if its URL starts with "https://en.wikipedia.org/wiki/", or only if the URL contains the word "wiki", etc.
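One possible approach, assuming (as the README example with "^.*wikipedia.*$" suggests) that a --crawl-rewrite-rule with no replacement excludes the URLs it matches; the rule below rewrites away every URL that does not start with the desired prefix:

# Sketch: only crawl pages whose URL starts with https://en.wikipedia.org/wiki/
single-file "https://en.wikipedia.org/wiki/Main_Page" --crawl-links=true --crawl-max-depth=1 --crawl-rewrite-rule='^(?!https://en\.wikipedia\.org/wiki/).*$'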

single-file-cli hangs even on `domcontentloaded`

Hello! I love single-file and single-file-cli. Truly the best option out there for saving full and complete webpages!

Right now, it's hitting this strange problem where, as far as I can tell, the page hangs indefinitely, even under the lowest load settings.

It might be related to puppeteer/puppeteer#9196, but I'm not entirely sure.

To Reproduce
Run the following command in bash:

single-file --browser-executable-path=/opt/homebrew/bin/chromium https://web.archive.org/web/20140325112710/http://www.sfchronicle.com/ --block-images --dump-content --browser-headless false --browser-wait-until domcontentloaded --browser-debug

Environment

  • OS: MacOS
  • Browser: Chrome

Error: command not found, after `npm install -g "gildas-lormeau/single-file-cli"`

First of all, thanks for building this amazing tool, love the chrome extension!

A problem I face when trying to use the CLI tool:

After installing it on macOS with npm install -g "gildas-lormeau/single-file-cli", when I try to run the single-file command I get the error "command not found".

➜  ~ npm list -g single-file-cli
/opt/homebrew/lib
└── single-file-cli@ -> ./../../../Users/gudh/.npm/_cacache/tmp/git-cloneIMlDig

➜  ~ single-file
zsh: command not found: single-file
➜  ~ single-file-cli
zsh: command not found: single-file-cli

Is there anything I am missing?

Unable to download a particular website using cli

Hi. I get the following error on seemingly any web page from "http://highscalability.com" when using single-file-cli. However, the Firefox extension works as expected. Can you investigate? Thanks.

$ single-file --browser-executable-path chromium "http://highscalability.com/blog/2022/12/16/what-is-cloud-computing-according-to-chatgpt.html"
Evaluation failed: TypeError: Cannot read properties of undefined (reading 'type')
    at Object.node (<anonymous>:1:55192)
    at Object.node (<anonymous>:1:56252)
    at Object.Hi (<anonymous>:1:97189)
    at Object.node (<anonymous>:1:55258)
    at Object.node (<anonymous>:1:56252)
    at Fr.forEach (<anonymous>:1:44421)
    at Object.ln [as children] (<anonymous>:1:54840)
    at Object.Cc (<anonymous>:1:116289)
    at Object.node (<anonymous>:1:55258)
    at Object.generate (<anonymous>:1:56320) URL: http://highscalability.com/blog/2022/12/16/what-is-cloud-computing-according-to-chatgpt.html
Stack: Error: Evaluation failed: TypeError: Cannot read properties of undefined (reading 'type')
    at Object.node (<anonymous>:1:55192)
    at Object.node (<anonymous>:1:56252)
    at Object.Hi (<anonymous>:1:97189)
    at Object.node (<anonymous>:1:55258)
    at Object.node (<anonymous>:1:56252)
    at Fr.forEach (<anonymous>:1:44421)
    at Object.ln [as children] (<anonymous>:1:54840)
    at Object.Cc (<anonymous>:1:116289)
    at Object.node (<anonymous>:1:55258)
    at Object.generate (<anonymous>:1:56320)
    at ExecutionContext._ExecutionContext_evaluate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:229:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ExecutionContext.evaluate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:107:16)
    at async getPageData (<path>/single-file-cli/back-ends/puppeteer.js:150:10)
    at async exports.getPageData (<path>/single-file-cli/back-ends/puppeteer.js:56:10)
    at async capturePage (<path>/single-file-cli/single-file-cli-api.js:254:20)
    at async runNextTask (<path>/single-file-cli/single-file-cli-api.js:175:20)
    at async Promise.all (index 0)
    at async capture (<path>/single-file-cli/single-file-cli-api.js:126:2)
    at async run (<path>/single-file-cli/single-file:54:2)

working with a given url seems to dump code into the terminal

Running the command below seems to dump code into the terminal and then results in the error message below.

single-file https://epa.oszk.hu/00300/00336/00003/bbcikk19.html --back-end=jsdom x.html

This might be related to an IFRAME in the HTML, which can be ignored with --remove-frames. That option seems to be supported by the jsdom back-end, even though the CLI --help says it is not.

Occasional stack traces from the CLI

I am running the following command from a Bash shell (MinGW on Windows 10):

docker run --mount "type=bind,src=$PWD/cookiedir,dst=/cookiedir" --mount "type=bind,src=$PWD/sitedir,dst=/sitedir" singlefile --browser-cookies-file=/cookiedir/cookies.txt --urls-file="/sitedir/urls.txt" --output-directory="/sitedir" --dump-content=false --filename-template="{url-pathname-flat}.html"

Note that I am using the Docker image and the --urls-file option.

Sometimes I get the following error:

Execution context was destroyed, most likely because of a navigation. URL: <redacted>
Stack: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:265:23)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:219:60)
    at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
    at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:139:10)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)

Sometimes I get the following different error:

Navigation failed because browser has disconnected! URL: <redacted>
Stack: Error: Navigation failed because browser has disconnected!
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:62
    at Array.map (<anonymous>)
    at Object.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:43)
    at CDPSession.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
    at CDPSession._onClosed (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:256:14)
    at Connection._onMessage (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:99:25)
    at WebSocket.<anonymous> (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:13:32)
    at WebSocket.onMessage (/usr/src/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:132:16)
    at WebSocket.emit (events.js:315:20)

I can download the pages at the URLs that failed by trying again. However, I would only usually expect to get a stack trace from an internal error (not a network connection error, or whatever might be the underlying cause here).

One difficulty I have is that there is no option to "resume" downloading pages should some pages fail to download. Utilities such as youtube-dl allow you to run them a second time to continue downloading files that were not downloaded in the previous run.

  1. It would be good if the above errors were more user friendly (or explained what to do to fix the problem).
  2. It would also be good if downloads from a list of URLs could be resumed if interrupted / incomplete (similar to youtube-dl for example).
  3. Finally, is it guaranteed that if there is an error, then no file will be produced (i.e. HTML files are only created after a successful download)?
    If partial files or zero-byte files can be left behind after an error, then one has to inspect the log to be sure that all pages have downloaded correctly. (youtube-dl, by contrast, creates .part files that are renamed only once the file is fully downloaded, to avoid this problem and allow resuming of downloads.)

Many thanks!
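In the meantime, a possible workaround sketch for point 2 using existing options (see also the --filename-conflict-action=skip issue further down, which notes that the skip check currently happens only after the page has been downloaded, so this is slow in practice):

# Re-running the same command after an interruption skips files that
# already exist instead of overwriting them.
single-file --urls-file=urls.txt --output-directory=sitedir --dump-content=false --filename-template="{url-pathname-flat}.html" --filename-conflict-action=skip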

CLI option --crawl-replace-urls does not do anything

When I run this command:

single-file --output-directory=outdir --dump-content=false --filename-template="{url-pathname-flat}.html" --crawl-links --crawl-save-session=session.json --crawl-replace-urls=true https://en.wikipedia.org/wiki/Thomas_Lipton

none of the files in the outdir directory have URLs of saved pages replaced with relative paths of other saved pages in outdir.

When I run this command, _wiki_Thomas_Lipton.html is downloaded to outdir. This is the file for the URL from which the crawl started.

The Wikipedia page https://en.wikipedia.org/wiki/Thomas_Lipton has a link to https://en.wikipedia.org/wiki/Self-made_man in the first sentence. This page was also downloaded by SingleFile as _wiki_Self-made_man.html.

I was expecting the href to https://en.wikipedia.org/wiki/Self-made_man in _wiki_Thomas_Lipton.html to be rewritten to _wiki_Self-made_man.html but it was not. Am I using the CLI options incorrectly?

Set Chrome profile for puppeteer when using CLI

Is your feature request related to a problem? Please describe.
When SingleFile CLI saves a page with puppeteer, the page is saved without any of the familiar extensions your personal Chrome profile uses (for example, ad blockers), because puppeteer runs on a clean profile by default.

Describe the solution you'd like
Would it be possible to add an argument that points to a Chrome profile folder for the CLI? It would allow the page to be saved the way you would normally see it when you're using Chrome, with extensions and all. Thanks!
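Until such an option exists, a possible workaround sketch is to pass Chrome's --user-data-dir flag through the existing --browser-args option (assuming the profile directory is not locked by a running Chrome instance; the path is illustrative):

single-file "https://www.wikipedia.org" wikipedia.html --browser-args='["--user-data-dir=/home/user/.config/google-chrome"]'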

Can we save a page without images?

I do see the following 3 options.

  --remove-alternative-fonts                    Remove alternative fonts to the ones displayed  [boolean] [default: true]
  --remove-alternative-medias                   Remove alternative CSS stylesheets  [boolean] [default: true]
  --remove-alternative-images                   Remove images for alternative sizes of screen  [boolean] [default: true]

But what if I want to save pages without the normal images, not just the alternative ones? Could we also get these options?

   --remove-fonts                    Remove all fonts [boolean] [default: false]
   --remove-medias                   Remove all CSS stylesheets  [boolean] [default: false]
   --remove-images                   Remove all images [boolean] [default: false]
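For what it's worth, a command in the domcontentloaded issue above already passes --block-images, and another passes --block-scripts=false, so blocking options of this kind appear to exist already; e.g.:

# Save a page without downloading its images.
single-file https://www.wikipedia.org wikipedia.html --block-images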

unknown error: no chrome binary at "/usr/bin/chromium-browser"

It appears that npm automatically updated all my packages, including single-file-cli, because after this event my args.js got overwritten and I lost all my configurations :( [It would be great if there were a flag so the user could provide a path to a custom args.js; that way we wouldn't have to mess with the original, which can get overwritten after an update.]

Anyway, the problem now is that for some reason I'm getting a "no chrome binary" error! I have tried uninstalling and reinstalling both chromedriver and chromium-browser, but no luck. I'm not sure whether the issue is with Selenium or whether single-file-cli is not setting the paths correctly. I think Selenium 4.11.0 should support Chromium 116, but I'm not completely sure.

Here is what I have:

Note: Both chromedriver and chromium-browser files are present and located in /usr/bin/
chromedriver --version
ChromeDriver 116.0.5845.96
chromium-browser --version
Chromium 116.0.5845.96 snap

my args.js

		"back-end": "webdriver-chromium",
		...
		"browser-executable-path": "/usr/bin/chromium-browser",
		...
		"web-driver-executable-path": "/usr/bin/chromedriver",

LOGS:

Selenium Manager binary found at /home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/bin/linux/selenium-manager
Driver path: /usr/bin/chromedriver
Browser path: "/usr/bin/chromium-browser"
unknown error: no chrome binary at "/usr/bin/chromium-browser" URL: https://mysite.com/
Stack: WebDriverError: unknown error: no chrome binary at "/usr/bin/chromium-browser"
    at Object.throwDecodedError (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/error.js:524:15)
    at parseHttpResponse (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:601:13)
    at Executor.execute (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:529:28)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

WebDriverError: unknown error: no chrome binary at "/usr/bin/chromium-browser"
    at Object.throwDecodedError (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/error.js:524:15)
    at parseHttpResponse (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:601:13)
    at Executor.execute (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:529:28)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  remoteStacktrace: '#0 0x55b439b8fee3 <unknown>\n' +
    '#1 0x55b4398cab77 <unknown>\n' +
    '#2 0x55b4398f33f9 <unknown>\n' +
    '#3 0x55b4398f1f19 <unknown>\n' +
    '#4 0x55b439931da1 <unknown>\n' +
    '#5 0x55b4399313ef <unknown>\n' +
    '#6 0x55b439928ef3 <unknown>\n' +
    '#7 0x55b4398fd132 <unknown>\n' +
    '#8 0x55b4398fdede <unknown>\n' +
    '#9 0x55b439b5478d <unknown>\n' +
    '#10 0x55b439b59017 <unknown>\n' +
    '#11 0x55b439b625e8 <unknown>\n' +
    '#12 0x55b439b59a50 <unknown>\n' +
    '#13 0x55b439b2a92e <unknown>\n' +
    '#14 0x55b439b7a7f8 <unknown>\n' +
    '#15 0x55b439b7a9ea <unknown>\n' +
    '#16 0x55b439b894e8 <unknown>\n' +
    '#17 0x7f2c6eee1b43 <unknown>\n'
}

Node.js v18.14.2

Images don't get downloaded using CLI


Describe the bug
I'm trying to automate SingleFile with my small project for downloading a webpage exactly like I can do manually from the extension bar.

However, when using the CLI with Node.js, I found out that the downloaded webpage doesn't have images, specifically the ones hosted on amazonaws.

I managed to download the page successfully with the extension, but when using the CLI I can't download everything on the page.

To Reproduce
Steps to reproduce the behavior:
You can try with this page, for example, to see what I'm talking about:
https://www.tigergroup.ae/

The images hosted on amazonaws can't be downloaded, but the image in the popup is hosted on their own server, so it was downloaded.

I tried using the extension and it worked fine.

Expected behavior
A good HTML file should be downloaded with all resources embedded, especially images.

Environment

  • OS: [e.g. Win10 Pro, ]
  • Browser: [e.g. Chrome]
  • Version: [e.g. 64]

Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed

Stack: ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
    at <instance_members_initializer> (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:49:14)
    at new Callback (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:53:16)
    at CallbackRegistry.create (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:93:26)
    at Connection._rawSend (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:207:26)
    at CDPSessionImpl.send (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:416:33)
    at #evaluate (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:234:50)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ExecutionContext.evaluate (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:149:16)
    at async getPageData (C:\Users\Richard\single-file-cli\back-ends\puppeteer.js:153:10)
    at async exports.getPageData (C:\Users\Richard\single-file-cli\back-ends\puppeteer.js:56:10)

node -v is v20.5.1. Can't share the webpage unfortunately. Used the latest version of the project: v1.0.65.

Notable options used (to fix the use case here: gildas-lormeau/SingleFile#105):

--remove-hidden-elements=false
--remove-unused-styles=false
--block-scripts=false

Used Google Chrome as the browser.
Only tested once in browser with the extension; unfortunately the whole browser crashed or force quit (when it was nearing completion) with the above settings and I haven't gotten the chance to retry.

No option to disable saved date header

gildas-lormeau/SingleFile#1058

Is there a flag to disable the saved date header in the .html from the CLI, like what was implemented in the issue above? It's causing the snapshots to have different hashes despite having exactly the same content.

diff f0c76867d8e8e2e998e84f1d21af6fee62004f79dcc810f58e7a4a466061c145.html d8a1d1b260a32f2a4e0e0cdf0c5a73e77b944c11a9d6868bcaa6494fc7ce5a10.html
4c4
<  saved date: Thu Apr 06 2023 20:55:06 GMT-0400 (Eastern Daylight Time)
---
>  saved date: Thu Apr 06 2023 22:57:13 GMT-0400 (Eastern Daylight Time)
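As a workaround until such a flag exists, a sketch that hashes snapshots with the volatile saved-date line stripped (file names are placeholders; assumes the header keeps the "saved date:" prefix shown in the diff above):

sed '/saved date:/d' snapshot-a.html | sha256sum
sed '/saved date:/d' snapshot-b.html | sha256sum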

Not a bug/issue; wondering if we can get some insight/progress when running with a big URL file?

Describe the bug
Not really a bug/issue, but I would like to know if we can gain some insight when running single-file from the command line.

To Reproduce
Steps to reproduce the behavior:

  1. nohup single-file --urls-file=lotOfUrl.txt --browser-executable-path /usr/bin/google-chrome --filename-template="{page-title}{date-locale}{time-locale}{url-pathname-flat}" > output.txt &
  2. Wait. Keep waiting.

Expected behavior
I would like to see progress/error/info messages when tailing output.txt.
It would also be nice to see the URLs that failed to save.

Environment

  • OS: Debian 10
  • Browser: Chrome
  • Version: not sure how to check it yet; I just installed it fresh today using "npm install -g "gildas-lormeau/SingleFile#master"". I can confirm it works well with the above command when the file contains only 5 URLs.

Thank you for sharing this tool.

Unable to download alternative screen-size images within the <picture> tag

If I use the --remove-alternative-images option, the <source> tag and the <img> tag in the exported HTML file are the same image.
Not all pictures are saved.

<picture>
    <source media="(min-width: 835px)"
        srcset="https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1">
    <img src="https://images.pexels.com/photos/45201/kitty-cat-kitten-pet-45201.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1"
        alt="Cat" style="width:auto;">
</picture>

Please check this, Thanks!

Inserting scripts using browser-script does not seem to be working

Hi,

Running single-file --browser-script "./script.js" "<url>" simply exits, seemingly without any output or error, but removing the --browser-script argument makes it work as expected.

script.js contains the second script from https://github.com/gildas-lormeau/SingleFile/wiki/How-to-execute-a-user-script-before-a-page-is-saved.

How do I get it to work?

Thanks.

How can I use docker cli instead of curl with cookie.json?

There is a website whose HTML I can download using curl only with -H 'cookie: ghost-members-ssr=example; ghost-members-ssr.sig=example'.

For single-file-cli, I prepare the JSON file
echo '{"ghost-members-ssr": "example", "ghost-members-ssr.sig":"example"}' > cookie.json
and mount this file into Docker.

However, when I run the CLI with the option --browser-cookie-file="/usr/src/app/node_modules/single-file-cli/cookie.json", the resulting HTML file is no different from the one produced without the option.

How can I use the Docker CLI instead of curl with cookie.json?
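One thing to check: the commands elsewhere on this page pass --browser-cookies-file (note the plural) and point it at Netscape-format .txt files rather than JSON. A hedged sketch of the equivalent Netscape-format file (fields separated by real TAB characters: domain, include-subdomains, path, secure, expiry where 0 means session, name, value; example.com is a placeholder for the actual site domain):

example.com	FALSE	/	TRUE	0	ghost-members-ssr	example
example.com	FALSE	/	TRUE	0	ghost-members-ssr.sig	example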

where to set ignoreHTTPSErrors config

Options should not be hardcoded, especially when they have an impact on security. So, the value should come from options and MUST NOT be set to true by default

Originally posted by @gildas-lormeau in #5 (comment)

When I download a website with an expired SSL certificate, manually modifying the code as above is effective.

I don't see an entry for setting 'ignoreHTTPSErrors' in the current version. Is it necessary to add one like this?

  if (options.ignoreHTTPSErrors !== undefined) {
    browserOptions.ignoreHTTPSErrors = options.ignoreHTTPSErrors;
  }

Fails to run in drive root on Windows

It fails trying to create a directory called . (which can't succeed for obvious reasons 😅).

EPERM: operation not permitted, mkdir '.' URL: xxx
Stack: Error: EPERM: operation not permitted, mkdir '.'
    at Object.mkdirSync (node:fs:1398:3)
    at capturePage (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:271:8)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async runNextTask (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:176:20)
    at async Promise.all (index 0)
    at async capture (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:127:2)
    at async run (C:\xxx\amd64\node_modules\single-file-cli\single-file:54:2)

Tested with v1.1.4.

Running singlefile-cli with a chrome extension

How do I run a chrome extension (e.g. IDontCareAboutCookies) within the webdriver?
I tried running the single-file cli with

./single-file --browser-executable-path="/opt/google/chrome/google-chrome" --browser-args='["--no-sandbox", "--load-extension=./3.4.6_0.crx"]' https://www.bbc.com --dump-content

but the extension doesn't seem to work, and the returned single-file contains the popups. The path to the extension is correct. Any ideas?
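One hedged guess: Chrome's classic headless mode historically does not load extensions, and the CLI runs headless by default, so it may be worth retrying with --browser-headless=false (a flag used elsewhere on this page). Also note that Chrome's --load-extension flag expects an unpacked extension directory rather than a packed .crx file (the directory name below is a placeholder):

./single-file --browser-executable-path="/opt/google/chrome/google-chrome" --browser-args='["--no-sandbox", "--load-extension=./extension-dir"]' --browser-headless=false https://www.bbc.com --dump-content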

Singlefile CLI fails to clone website on first run

I am running SingleFile on a Raspberry Pi 3B+ running Chromium. (I know that SingleFile isn't officially supported on a Raspberry Pi.)

I am using this command:

./single-file --browser-executable-path=chromium-browser https://yahoo.com .

When I boot my Pi and run this command, it just hangs. However, if I interrupt it and then rerun the command, the webpage downloads.

Expected behavior
Downloaded website on first run of SingleFile.

Environment
Raspberry Pi 3B+
Raspbian 64 bit

Thanks for your help!

EDIT: This can be closed. I figured out how to get this working on a Raspberry Pi if you want to use the SingleFile CLI in a script. You can run /usr/bin/chromium-browser --no-sandbox 2>/dev/null first as root. This will suppress the error but launch a browser instance. You can then run your SingleFile command without having to interrupt the first run.

Example:

#!/bin/bash
/usr/bin/chromium-browser --no-sandbox 2>/dev/null
runuser -u pi -- /single-file --browser-executable-path=chromium-browser https://yahoo.com .

Can we use sessionData in CLI?

I'm using the CLI with cookies.

I need to include sessionData and also open the browser in stealth mode.

I use the stealth-mode plugin with Puppeteer along with adding sessionData, and it works fine. I just can't include them while using the SingleFile CLI.

Any idea how to use it?

include-infobar option causes error

Describe the bug
When I do not use --include-infobar=true on the CLI, single-file works correctly; when I do include it, no matter where I include it, I get an error that starts with Stack: Error: Evaluation failed: ReferenceError: infobar is not defined.

To Reproduce
Steps to reproduce the behavior:

In the terminal, enter the following:
D:\work\single-file-cli-master>single-file.bat https://www.wikipedia.org wikipedia.html --browser-executable-path="C:\Program Files\Google\Chrome\Application\chrome.exe"

This correctly downloads the webpage and creates the HTML file on my Desktop.
However, if I include --include-infobar=true in the command, it gives an error and the webpage is not saved on my Desktop:

D:\work\single-file-cli-master>single-file.bat https://www.wikipedia.org wikipedia.html --browser-executable-path="C:\Program Files\Google\Chrome\Application\chrome.exe" --include-infobar=true


Evaluation failed: ReferenceError: infobar is not defined
    at pptr://__puppeteer_evaluation_script__:4:5 URL: https://www.wikipedia.org
Stack: Error: Evaluation failed: ReferenceError: infobar is not defined
    at pptr://__puppeteer_evaluation_script__:4:5
    at ExecutionContext._ExecutionContext_evaluate (D:\work\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:271:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ExecutionContext.evaluate (D:\work\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:118:16)
    at async getPageData (D:\work\single-file-cli-master\back-ends\puppeteer.js:139:10)
    at async exports.getPageData (D:\work\single-file-cli-master\back-ends\puppeteer.js:51:10)
    at async capturePage (D:\work\single-file-cli-master\single-file-cli-api.js:253:20)
    at async runNextTask (D:\work\single-file-cli-master\single-file-cli-api.js:174:20)
    at async Promise.all (index 0)
    at async capture (D:\work\single-file-cli-master\single-file-cli-api.js:125:2)
    at async run (D:\work\single-file-cli-master\single-file:54:2

Expected behavior
I expect the webpage to download without the error.

Environment

OS: windows 11 21H2
Browser: Chrome
Version: 104.0.5112.102 (Official Build) (64-bit)
Additional Context
This error occurs in WSL Linux and Windows with Node.js versions 13, 16, and 18.
It works well with --include-infobar=false or without --include-infobar.
The --include-infobar=true option works well with SingleFileZ with the same usage.

CLI: Automatically cancel a process that would take too much time

Thank you for taking the time to build and maintain this extension; it's been of great use!

Describe the bug
SingleFile hangs indefinitely when saving webpage using Chrome browser, either with CLI or manually with the Chrome extension.

To Reproduce
Steps to reproduce the behavior:

For CLI, run: docker run singlefile https://web.archive.org/web/20120502081049/http://www.cbsnews.com/8301-505245_162-57425417/oscars-home-renamed-dolby-theatre/
Otherwise, navigate to URL with Chrome browser with default settings, save the page using SingleFile extension

Expected behavior
Saves page successfully.

Environment

OS: Ubuntu 18.04.1
Browser: Chrome

--save-original-urls does not work on cli

I've tried it with docker and npx. save-original-urls does work in the extension for me.

podman run --rm capsulecode/singlefile https://wikipedia.org --save-original-urls=true
npx single-file-cli https://wikipedia.org --browser-executable-path=/usr/bin/google-chrome-stable --save-original-urls=true

Timeout error on crawl

I started a new crawl using single-file. I fetched 10 pages successfully and then it gave me this error.

Timed out after 60000 ms URL: https://xxx
Stack: ScriptTimeoutError: Timed out after 60000 ms
    at Object.throwDecodedError (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\error.js:522:15)
    at parseHttpResponse (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:549:13)
    at Executor.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:475:28)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async thenableWebDriverProxy.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\webdriver.js:735:17)
    at async getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:141:17)
    at async Object.exports.getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:37:10)
    at async capturePage (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:253:20)
    at async runNextTask (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:174:20)

Is it from the remote website? Perhaps I am crawling too fast? Is it possible to delay requests by some random time, etc.? Also, if I restart the same crawl, can I get single-file to ignore the pages that it has already downloaded?

Not working with sites that have CSS or JavaScript

For a webpage like https://github.com/gildas-lormeau/single-file-cli, the page is downloaded somewhat as wget would do it.

To reproduce:
System: Arch Linux

Install single-file-cli with:

unzip master.zip
cd single-file-cli-master
npm install

cd to the directory and make the script executable with chmod +x single-file.
Symlink single-file into ~/.local/bin, which is in the PATH.

Install geckodriver with sudo pacman -S geckodriver.

Run the commands below in a terminal; both download the page, but the outcome is just like wget:
single-file https://github.com/gildas-lormeau/single-file-cli --back-end=webdriver-gecko
single-file https://github.com/gildas-lormeau/single-file-cli --back-end=webdriver-chromium

And I have no idea how Puppeteer works without the --back-end flag; it just prints the error "Failed to launch the browser process! spawn chrome ENOENT" every time.

`Cannot read properties of undefined` when following the docker example

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Cannot read properties of undefined (reading 'close') URL: https://www.wikipedia.org
Stack: TypeError: Cannot read properties of undefined (reading 'close')
    at Object.exports.getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:54:15)
    at async capturePage (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:254:20)
    at async runNextTask (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:175:20)
    at async Promise.all (index 0)
    at async capture (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:126:2)
    at async run (/usr/src/app/node_modules/single-file-cli/single-file:54:2)

Docker version 20.10.20, build 9fdeb9, macOS

Recent tagged Docker image?

I'm curious when you will release a new Docker tag.

From the discussion on ArchiveBox, I see that newer Chromium versions are breaking some functionality, so I don't want to rely on the latest tag, which might break later.

However, I also don't want to use a very old version of single-file.

Could you tag a new release that appears stable, possibly pinned to a particular Chromium version in case local users need to rebuild the Docker image?

webdriver-chromium not working anymore

Just tried a fresh NPM install and I can't seem to use webdriver-chromium anymore - I get an error from selenium-webdriver every time:

./single-file --back-end=webdriver-chromium https://github.com/gildas-lormeau/single-file-cli                                                                           
/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:51
    throw new Error('Invalid URL: ' + aUrl)
          ^

Error: Invalid URL: http:/
    at getRequestOptions (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:51:11)
    at new HttpClient (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:90:21)
    at getStatus (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:38:18)
    at checkServerStatus (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:76:14)
    at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:74:5
    at new Promise (<anonymous>)
    at Object.waitForServer (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:57:10)
    at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/remote/index.js:251:24
    at new Promise (<anonymous>)
    at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/remote/index.js:246:20

Node.js v18.3.0

Puppeteer still works, this command completed successfully:

./single-file --browser-executable-path=/usr/bin/chromium https://github.com/gildas-lormeau/single-file-cli

Failed to launch the browser process! (Apple Silicon)

Running from the README:

joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker pull capsulecode/singlefile
Using default tag: latest
latest: Pulling from capsulecode/singlefile
8921db27df28: Pull complete
7acbef36938e: Pull complete
d8066cda4048: Pull complete
04686419c3aa: Pull complete
4f4fb700ef54: Pull complete
198871f01b71: Pull complete
7eac83a8531f: Pull complete
Digest: sha256:f307b2b87df68835c5397b7f7e77a55af8e1abfea140cbc15c49d6d89e1b4b52
Status: Downloaded newer image for capsulecode/singlefile:latest
docker.io/capsulecode/singlefile:latest
joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker tag capsulecode/singlefile singlefile
joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker run singlefile "https://www.wikipedia.org"
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Failed to launch the browser process!
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 11 (Segmentation fault) - core dumped


TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

This is on OSX.

Saved page missing images using CLI, while fine with browser extension

Describe the bug
https://blogs.cisco.com/datacenter/the-napkins-dialogues-life-of-a-packet-walk-part-1
When saving the page above with the browser extension, the saved page is accurate.
However, if it is saved with the CLI, the images are missing.

To Reproduce
Steps to reproduce the behavior:

  1. Save the page with the browser extension
  2. Saved page is around 2 MB. Displays images accurately.
  3. Save the page with the CLI
  4. Saved page is around 350 KB. It is missing images.

Expected behavior
CLI and browser extension results should be the same (at least I hope).


Environment

  • OS: Win10 Pro
  • Browser: Chrome
  • Version: 88

CLI option --filename-conflict-action=skip should not attempt to download page if file already exists

The current behavior of --filename-conflict-action=skip is as follows:

  1. download the page as usual
  2. if the file to be created already exists, do not overwrite the file.

It would be more efficient to first check whether the file already exists, and only download the page if it does not.

This would support the following use case:

Suppose we use --urls-file to download a list of URLs. Some of those pages may fail to download (e.g., due to a network failure). In my experience, for a large list of URLs, it is likely that at least one page will fail to download. If there is an error downloading the page then no file will be created (at least this seems to be the behavior).

I was hoping to be able to use the options --filename-template="{url-pathname-flat}.html" and --filename-conflict-action=skip, combined with the --urls-file option, to resume after an error. I was hoping that SingleFile would only attempt to download the URLs that did not already have files.

However, with the current implementation, because SingleFile attempts to download the page again, this is too slow to be practical.

--browser-script after --browser-executable-path results in silent failure

This works:

# ./single-file --browser-script remove-images.js --browser-executable-path chromium "https://www.google.com/"

But this fails, exiting silently with status code 0:

# ./single-file --browser-executable-path chromium --browser-script remove-images.js "https://www.google.com/"

I just saw that this was already reported in #8 (comment) but I wanted to make this more visible, because it can take a while to find the cause.

I first observed this on NixOS / google-chrome-unstable and reproduced it on Debian unstable / chromium:

# uname -a
Linux inspired-ocelot 6.1.24 #1-NixOS SMP PREEMPT_DYNAMIC Thu Apr 13 14:55:40 UTC 2023 x86_64 GNU/Linux

# node --version
v18.13.0

# npm --version
9.2.0

# git clone --depth 1 --recursive https://github.com/gildas-lormeau/single-file-cli.git
Cloning into 'single-file-cli'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 48 (delta 9), reused 24 (delta 4), pack-reused 0
Receiving objects: 100% (48/48), 179.88 KiB | 1.49 MiB/s, done.
Resolving deltas: 100% (9/9), done.

# npm install
npm WARN deprecated [email protected]: This package has been deprecated and is no longer maintained. Please use @rollup/plugin-terser

added 193 packages, and audited 194 packages in 7s

20 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

# chmod +x single-file

# chromium --version
find: ‘/home/at/.config/chromium/Crash Reports/pending/’: No such file or directory
Chromium 112.0.5615.121 built on Debian 12.0, running on Debian 12.0

# ./single-file --browser-executable-path chromium "https://www.google.com/"

# ls -lrt | tail -n 1
-rw-r----- 1 at at 128,821 2023-04-17 10:47 Google (2023-04-17 10_47_38 AM).html

# cat remove-images.js # copied from https://github.com/gildas-lormeau/SingleFile/wiki/How-to-execute-a-user-script-before-a-page-is-saved
// ==UserScript==
// @name         Remove images
// @namespace    https://github.com/gildas-lormeau/SingleFile
// @version      1.0
// @description  [SingleFile] Remove all the images
// @author       Gildas Lormeau
// @match        *://*/*
// @grant        none
// ==/UserScript==


(() => {

  const elements = new Map();
  const removedElementsSelector = "img";
  dispatchEvent(new CustomEvent("single-file-user-script-init"));

  addEventListener("single-file-on-before-capture-request", () => {
    document.querySelectorAll(removedElementsSelector).forEach(element => {
      const placeHolderElement = document.createElement(element.tagName);
      elements.set(placeHolderElement, element);
      element.parentElement.replaceChild(placeHolderElement, element);
    });
  });

  addEventListener("single-file-on-after-capture-request", () => {
    Array.from(elements).forEach(([placeHolderElement, element]) => {
      placeHolderElement.parentElement.replaceChild(element, placeHolderElement);
    });
    elements.clear();
  });

})();

# Works:

# ./single-file --browser-script remove-images.js --browser-executable-path chromium "https://www.google.com/"

# Fails (exits silently with status code 0):

# ./single-file --browser-executable-path chromium --browser-script remove-images.js "https://www.google.com/"

TypeError: window.singlefile is undefined

Describe the bug
Could not save the page and got a TypeError.

To Reproduce
Following command works fine:
single-file https://www.baidu.com 1.html --back-end=webdriver-gecko --browser-headless=false

But not this:
single-file https://baidu.com 1.html --back-end=webdriver-gecko --browser-headless=false
I get this error:

TypeError: window.singlefile is undefined URL: https://baidu.com
Stack: undefined


Environment

  • OS: Win10 Pro
  • Browser: Firefox 94
  • geckodriver v0.30.0
