gildas-lormeau / single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
License: GNU Affero General Public License v3.0
It appears that npm automatically updated all my packages, including single-file-cli, because after this event my args.js got overwritten and I lost all my configuration :( (It would be great if there were a flag so the user could provide a path to a custom args.js; that way we wouldn't have to touch the original, which can get overwritten by an update.)
Anyway, the problem now is that for some reason I'm getting a "no chrome binary" error! I have tried uninstalling and reinstalling both chromedriver and chromium-browser, but no luck. I'm not sure whether the issue is with Selenium or whether single-file-cli is not setting the paths correctly. I think Selenium 4.11.0 should support Chromium 116, but I'm not completely sure.
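Until such a flag exists, one workaround sketch is to keep a personal copy of args.js outside node_modules and copy it back after an update. The paths below are assumptions for illustration; `npm root -g` prints the global node_modules directory.

```shell
# Sketch: restore a personal args.js after an npm update overwrote the
# installed one. The backup location is an assumption, not a real convention.
cp ~/backup/args.js "$(npm root -g)/single-file-cli/args.js"
```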
Note: Both the chromedriver and chromium-browser binaries are present and located in /usr/bin/:
chromedriver --version
ChromeDriver 116.0.5845.96
chromium-browser --version
Chromium 116.0.5845.96 snap
My args.js:
"back-end": "webdriver-chromium",
...
"browser-executable-path": "/usr/bin/chromium-browser",
...
"web-driver-executable-path": "/usr/bin/chromedriver",
LOGS:
Selenium Manager binary found at /home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/bin/linux/selenium-manager
Driver path: /usr/bin/chromedriver
Browser path: "/usr/bin/chromium-browser"
unknown error: no chrome binary at "/usr/bin/chromium-browser" URL: https://mysite.com/
Stack: WebDriverError: unknown error: no chrome binary at "/usr/bin/chromium-browser"
at Object.throwDecodedError (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/error.js:524:15)
at parseHttpResponse (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:601:13)
at Executor.execute (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:529:28)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
WebDriverError: unknown error: no chrome binary at "/usr/bin/chromium-browser"
at Object.throwDecodedError (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/error.js:524:15)
at parseHttpResponse (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:601:13)
at Executor.execute (/home/arc/.nvm/versions/node/v18.14.2/lib/node_modules/single-file-cli/node_modules/selenium-webdriver/lib/http.js:529:28)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
remoteStacktrace: '#0 0x55b439b8fee3 <unknown>\n' +
'#1 0x55b4398cab77 <unknown>\n' +
'#2 0x55b4398f33f9 <unknown>\n' +
'#3 0x55b4398f1f19 <unknown>\n' +
'#4 0x55b439931da1 <unknown>\n' +
'#5 0x55b4399313ef <unknown>\n' +
'#6 0x55b439928ef3 <unknown>\n' +
'#7 0x55b4398fd132 <unknown>\n' +
'#8 0x55b4398fdede <unknown>\n' +
'#9 0x55b439b5478d <unknown>\n' +
'#10 0x55b439b59017 <unknown>\n' +
'#11 0x55b439b625e8 <unknown>\n' +
'#12 0x55b439b59a50 <unknown>\n' +
'#13 0x55b439b2a92e <unknown>\n' +
'#14 0x55b439b7a7f8 <unknown>\n' +
'#15 0x55b439b7a9ea <unknown>\n' +
'#16 0x55b439b894e8 <unknown>\n' +
'#17 0x7f2c6eee1b43 <unknown>\n'
}
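One hedged thing to check, since the version string above ends in "snap": on snap-based systems, /usr/bin/chromium-browser is often a wrapper script or symlink rather than the real browser binary, which can trip up the driver's "no chrome binary" check. This is an assumption on my part, not a confirmed diagnosis; resolving the path shows what is actually there:

```shell
# Check whether /usr/bin/chromium-browser is a symlink or a shell wrapper
# (assumption: snap-packaged Chromium, as the "snap" version suffix suggests).
readlink -f /usr/bin/chromium-browser
file /usr/bin/chromium-browser
# If it is a wrapper, pointing "browser-executable-path" at the real binary
# it launches may resolve the "no chrome binary" error.
```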
Node.js v18.14.2
I am running SingleFile on a Raspberry Pi 3B+ running Chromium. (I know that SingleFile is officially supported on a Raspberry Pi.)
I am using this command:
./single-file --browser-executable-path=chromium-browser https://yahoo.com .
When I boot my Pi and run this command, it just hangs. However, if I interrupt it and then rerun the command, the webpage downloads.
Expected behavior
Downloaded website on first run of SingleFile.
Environment
Raspberry Pi 3B+
Raspbian 64 bit
Thanks for your help!
EDIT: This can be closed. I figured out how to get this working on a Raspberry Pi when using the SingleFile CLI in a script. You can first run /usr/bin/chromium-browser --no-sandbox 2>/dev/null as root. This suppresses the error but launches a browser instance. You can then run your SingleFile command without having to interrupt the first run.
Example:
#!/bin/bash
/usr/bin/chromium-browser --no-sandbox 2>/dev/null
runuser -u pi -- /single-file --browser-executable-path=chromium-browser https://yahoo.com .
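A possible refinement of the script above (hedged, untested on a Pi): bound the warm-up step with coreutils `timeout` so the script cannot hang on the first Chromium launch.

```shell
#!/bin/bash
# Warm up Chromium for at most 10 seconds, ignoring its exit status,
# then run the actual capture. Flags and paths mirror the script above.
timeout 10 /usr/bin/chromium-browser --no-sandbox 2>/dev/null || true
runuser -u pi -- /single-file --browser-executable-path=chromium-browser https://yahoo.com .
```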
Describe the bug
When I do not use --include-infobar=true on the CLI, single-file works correctly; when I do include it, no matter where I place it, I get an error that starts with Stack: Error: Evaluation failed: ReferenceError: infobar is not defined.
To Reproduce
Steps to reproduce the behavior:
In the terminal, enter the following:
D:\work\single-file-cli-master>single-file.bat https://www.wikipedia.org wikipedia.html --browser-executable-path="C:\Program Files\Google\Chrome\Application\chrome.exe"
This correctly downloads the webpage and creates the HTML file on my Desktop.
However, if I include --include-infobar=true in the command, it gives an error and the webpage is not saved on my Desktop:
D:\work\single-file-cli-master>single-file.bat https://www.wikipedia.org wikipedia.html --browser-executable-path="C:\Program Files\Google\Chrome\Application\chrome.exe" --include-infobar=true
Evaluation failed: ReferenceError: infobar is not defined
at pptr://__puppeteer_evaluation_script__:4:5 URL: https://www.wikipedia.org
Stack: Error: Evaluation failed: ReferenceError: infobar is not defined
at pptr://__puppeteer_evaluation_script__:4:5
at ExecutionContext._ExecutionContext_evaluate (D:\work\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:271:15)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async ExecutionContext.evaluate (D:\work\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:118:16)
at async getPageData (D:\work\single-file-cli-master\back-ends\puppeteer.js:139:10)
at async exports.getPageData (D:\work\single-file-cli-master\back-ends\puppeteer.js:51:10)
at async capturePage (D:\work\single-file-cli-master\single-file-cli-api.js:253:20)
at async runNextTask (D:\work\single-file-cli-master\single-file-cli-api.js:174:20)
at async Promise.all (index 0)
at async capture (D:\work\single-file-cli-master\single-file-cli-api.js:125:2)
at async run (D:\work\single-file-cli-master\single-file:54:2
Expected behavior
I expect the webpage to download without the error.
Environment
OS: Windows 11 21H2
Browser: Chrome
Version: 104.0.5112.102 (Official Build) (64-bit)
Additional Context
This error occurs on WSL Linux and Windows, with Node.js versions 13, 16, and 18.
It works well with --include-infobar=false or without --include-infobar.
The --include-infobar=true option works well with SingleFileZ with the same usage.
I do see the following 3 options.
--remove-alternative-fonts Remove alternative fonts to the ones displayed [boolean] [default: true]
--remove-alternative-medias Remove alternative CSS stylesheets [boolean] [default: true]
--remove-alternative-images Remove images for alternative sizes of screen [boolean] [default: true]
But what if I want to save pages without the normal images, not just the alternative ones? Can we get these options as well?
--remove-fonts Remove all fonts [boolean] [default: false]
--remove-medias Remove all CSS stylesheets [boolean] [default: false]
--remove-images Remove all images [boolean] [default: false]
I started a new crawl using single-file. It fetched 10 pages successfully and then gave me this error.
Timed out after 60000 ms URL: https://xxx
Stack: ScriptTimeoutError: Timed out after 60000 ms
at Object.throwDecodedError (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\error.js:522:15)
at parseHttpResponse (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:549:13)
at Executor.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:475:28)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async thenableWebDriverProxy.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\webdriver.js:735:17)
at async getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:141:17)
at async Object.exports.getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:37:10)
at async capturePage (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:253:20)
at async runNextTask (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:174:20)
Is it from the remote website? Perhaps I am crawling too fast? Is it possible to delay requests by some random time? Also, if I restart the same crawl, can I get single-file to ignore the pages that it has already downloaded?
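On the delay and skip questions above: until there are built-in options, a wrapper loop can approximate both. This is a sketch; the flat-name transform below is a crude stand-in of my own, not SingleFile's actual {url-pathname-flat} logic.

```shell
#!/bin/bash
# Read urls.txt, skip URLs whose output file already exists, and sleep a
# random 1-5 seconds between requests. The name mangling is an assumption.
while IFS= read -r url; do
  out="$(printf '%s' "$url" | tr -c 'A-Za-z0-9.' '_').html"
  [ -e "$out" ] && continue
  ./single-file "$url" "$out"
  sleep $(( (RANDOM % 5) + 1 ))
done < urls.txt
```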
gildas-lormeau/SingleFile#1058
Is there a flag to disable the saved date header in the .html from the CLI, like what was implemented in the issue above? It's causing the snapshots to have different hashes despite having exactly the same content.
diff f0c76867d8e8e2e998e84f1d21af6fee62004f79dcc810f58e7a4a466061c145.html d8a1d1b260a32f2a4e0e0cdf0c5a73e77b944c11a9d6868bcaa6494fc7ce5a10.html
4c4
< saved date: Thu Apr 06 2023 20:55:06 GMT-0400 (Eastern Daylight Time)
---
> saved date: Thu Apr 06 2023 22:57:13 GMT-0400 (Eastern Daylight Time)
Thank you for taking the time to build and maintain this extension, it's been of great use!
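Until such a flag exists, a hedged workaround is to normalize the file before hashing by dropping the line that carries the saved date; the pattern below is assumed from the diff output shown above.

```shell
# Strip the "saved date" line before hashing, so snapshots with identical
# content produce identical hashes regardless of when they were saved.
grep -v 'saved date:' snapshot.html | sha256sum
```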
Describe the bug
SingleFile hangs indefinitely when saving webpage using Chrome browser, either with CLI or manually with the Chrome extension.
To Reproduce
Steps to reproduce the behavior:
For CLI, run: docker run singlefile https://web.archive.org/web/20120502081049/http://www.cbsnews.com/8301-505245_162-57425417/oscars-home-renamed-dolby-theatre/
Otherwise, navigate to URL with Chrome browser with default settings, save the page using SingleFile extension
Expected behavior
Saves page successfully.
Environment
OS: Ubuntu 18.04.1
Browser: Chrome
There is a website whose HTML I can download using curl only with -H 'cookie: ghost-members-ssr=example; ghost-members-ssr.sig=example'.
For single-file-cli, I prepare the json file
echo '{"ghost-members-ssr": "example", "ghost-members-ssr.sig":"example"}' > cookie.json
and mount this file into Docker.
However, when I run the CLI with the option --browser-cookie-file="/usr/src/app/node_modules/single-file-cli/cookie.json", the resulting HTML file is unchanged from running without the option.
How can I use the Docker CLI with cookie.json instead of curl?
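One hedged thing to check: the option used elsewhere in this thread is --browser-cookies-file (plural), and as far as I can tell it expects a Netscape-format cookies.txt (seven tab-separated fields per line) rather than a flat JSON object. A minimal sketch of that format, with placeholder values:

```shell
# Hypothetical cookies.txt entries in Netscape format. Fields, tab-separated:
# domain, includeSubdomains, path, secure, expiry, name, value.
printf 'example.com\tFALSE\t/\tTRUE\t0\tghost-members-ssr\texample\n' > cookies.txt
printf 'example.com\tFALSE\t/\tTRUE\t0\tghost-members-ssr.sig\texample\n' >> cookies.txt
```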
This works:
# ./single-file --browser-script remove-images.js --browser-executable-path chromium "https://www.google.com/"
But this fails, exiting silently with status code 0:
# ./single-file --browser-executable-path chromium --browser-script remove-images.js "https://www.google.com/"
I just saw that this was already reported in #8 (comment), but I wanted to make this more visible, because it can take a while to find the cause.
I first observed this on NixOS / google-chrome-unstable and reproduced it on Debian unstable / chromium:
# uname -a
Linux inspired-ocelot 6.1.24 #1-NixOS SMP PREEMPT_DYNAMIC Thu Apr 13 14:55:40 UTC 2023 x86_64 GNU/Linux
# node --version
v18.13.0
# npm --version
9.2.0
# git clone --depth 1 --recursive https://github.com/gildas-lormeau/single-file-cli.git
Cloning into 'single-file-cli'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 48 (delta 9), reused 24 (delta 4), pack-reused 0
Receiving objects: 100% (48/48), 179.88 KiB | 1.49 MiB/s, done.
Resolving deltas: 100% (9/9), done.
# npm install
npm WARN deprecated [email protected]: This package has been deprecated and is no longer maintained. Please use @rollup/plugin-terser
added 193 packages, and audited 194 packages in 7s
20 packages are looking for funding
run `npm fund` for details
found 0 vulnerabilities
# chmod +x single-file
# chromium --version
find: ‘/home/at/.config/chromium/Crash Reports/pending/’: No such file or directory
Chromium 112.0.5615.121 built on Debian 12.0, running on Debian 12.0
# ./single-file --browser-executable-path chromium "https://www.google.com/"
# ls -lrt | tail -n 1
-rw-r----- 1 at at 128,821 2023-04-17 10:47 Google (2023-04-17 10_47_38 AM).html
# cat remove-images.js # copied from https://github.com/gildas-lormeau/SingleFile/wiki/How-to-execute-a-user-script-before-a-page-is-saved
// ==UserScript==
// @name Remove images
// @namespace https://github.com/gildas-lormeau/SingleFile
// @version 1.0
// @description [SingleFile] Remove all the images
// @author Gildas Lormeau
// @match *://*/*
// @grant none
// ==/UserScript==
(() => {
const elements = new Map();
const removedElementsSelector = "img";
dispatchEvent(new CustomEvent("single-file-user-script-init"));
addEventListener("single-file-on-before-capture-request", () => {
document.querySelectorAll(removedElementsSelector).forEach(element => {
const placeHolderElement = document.createElement(element.tagName);
elements.set(placeHolderElement, element);
element.parentElement.replaceChild(placeHolderElement, element);
});
});
addEventListener("single-file-on-after-capture-request", () => {
Array.from(elements).forEach(([placeHolderElement, element]) => {
placeHolderElement.parentElement.replaceChild(element, placeHolderElement);
});
elements.clear();
});
})();
# Works:
# ./single-file --browser-script remove-images.js --browser-executable-path chromium "https://www.google.com/"
# Fails (exits silently with status code 0):
# ./single-file --browser-executable-path chromium --browser-script remove-images.js "https://www.google.com/"
Hi. On encountering a network error as shown below, single-file-cli throws the error but does not propagate the error code on exit, which makes it difficult to determine, when retrieving a batch of URLs, whether any of them failed.
net::ERR_NAME_NOT_RESOLVED at <url elided> URL: <url elided>
Stack: Error: net::ERR_NAME_NOT_RESOLVED at <url elided>
at navigate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:236:23)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Frame.goto (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:206:21)
at async CDPPage.goto (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:439:16)
at async pageGoto (<path>/single-file-cli/back-ends/puppeteer.js:194:3)
at async getPageData (<path>/single-file-cli/back-ends/puppeteer.js:132:3)
at async exports.getPageData (<path>/single-file-cli/back-ends/puppeteer.js:56:10)
at async capturePage (<path>/single-file-cli/single-file-cli-api.js:254:20)
at async runNextTask (<path>/single-file-cli/single-file-cli-api.js:175:20)
at async Promise.all (index 0)
Can you ensure single-file-cli exits with an appropriate code on error? It would also be nice if there were a way to determine whether the provided URL returns a 404 (or any 4xx or 5xx code), but maybe that's harder to do.
Describe the solution you'd like
Support importing the browser-exported settings file singlefile-settings.json into the CLI config.
The option could be named:
--browser-settings-file
Description:
Override the CLI's default settings with the settings file exported from the browser extension.
I installed SingleFile from the Docker image in the usual way:
docker pull capsulecode/singlefile
docker tag capsulecode/singlefile singlefile
I can then save a webpage as usual and everything works as expected:
docker run -v $(pwd):/usr/src/app/out singlefile "https://www.wikipedia.org" --dump-content=false
However, when I try to save the webpage https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html
I get a stacktrace:
$ docker run -v $(pwd):/usr/src/app/out singlefile "https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html" --dump-content=false
Evaluation failed: SyntaxError: Invalid regular expression: /^var(--/: Unterminated group
at String.match (<anonymous>)
at String.startsWith (https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/search/nwSearchFnt.js:871:15)
at <anonymous>:1:217815
at Array.find (<anonymous>)
at <anonymous>:1:217804
at Array.find (<anonymous>)
at tp (<anonymous>:1:217793)
at Object.removeUnusedFonts (<anonymous>:1:300431)
at jm.removeUnusedFonts (<anonymous>:1:267447)
at Nm.executeTask (<anonymous>:1:243712) URL: https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/ch03s02.html
Stack: Error: Evaluation failed: SyntaxError: Invalid regular expression: /^var(--/: Unterminated group
at String.match (<anonymous>)
at String.startsWith (https://cdn.docbook.org/release/xsl-nons/1.79.2/webhelp/docs/search/nwSearchFnt.js:871:15)
at <anonymous>:1:217815
at Array.find (<anonymous>)
at <anonymous>:1:217804
at Array.find (<anonymous>)
at tp (<anonymous>:1:217793)
at Object.removeUnusedFonts (<anonymous>:1:300431)
at jm.removeUnusedFonts (<anonymous>:1:267447)
at Nm.executeTask (<anonymous>:1:243712)
at ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:221:19)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
at async getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:139:10)
at async Object.exports.getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:51:10)
at async capturePage (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:254:20)
at async runNextTask (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:175:20)
at async Promise.all (index 0)
at async capture (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:126:2)
at async run (/usr/src/app/node_modules/single-file-cli/single-file:54:2)
I do not get an error when I try to save the same page using the SingleFile extension for Firefox. Thanks!
When using the --urls-file option, it misses the first URL in the list.
Workaround: add an empty line at the start of the file.
This issue is present in single-filez-cli too.
How do I run a chrome extension (e.g. IDontCareAboutCookies) within the webdriver?
I tried running the single-file cli with
./single-file --browser-executable-path="/opt/google/chrome/google-chrome" --browser-args='["--no-sandbox", "--load-extension=./3.4.6_0.crx"]' https://www.bbc.com --dump-content
but the extension doesn't seem to work, and the returned single-file contains the popups. The path to the extension is correct. Any ideas?
I'm using the CLI with cookies.
I need to include sessionData and also open the browser in stealth mode.
I use the stealth mode plugin with puppeteer along with adding sessionData, and it works fine. I just can't figure out how to include them while using the SingleFile CLI.
Any idea how to use them?
This is not a problem when doing single-shot executions with a Docker container that starts, runs SingleFile, and stops again. However, when running a small Flask server as a frontend to SingleFile, which spawns SingleFile via subprocess, it leaves a couple of defunct Chrome processes behind.
It seems this is a general problem with puppeteer and does not originate from SingleFile itself. To overcome this issue, tini can be used as a prefix when calling SingleFile, which makes sure all child processes are terminated:
tini -s /usr/src/app/node_modules/single-file-cli/single-file ...
Leaving this here for documentation purposes.
If you like the idea, and it's possible and feasible, then combining the contents of IFRAMEs on a page into inline DIVs would be a welcome addition. In a way this is also about combining a URL into a single file. Thanks.
Like the webpage https://github.com/gildas-lormeau/single-file-cli: it downloaded the page somewhat just like wget does.
To reproduce:
System: Arch Linux
install single-file-cli by command:
unzip master.zip .
cd single-file-cli-master
npm install
cd to the directory, make it executable with chmod +x single-file, and symlink single-file into ~/.local/bin, which is in the PATH.
Installed geckodriver with sudo pacman -S geckodriver.
Ran the commands below in a terminal; both download, but the outcome is just like wget:
single-file https://github.com/gildas-lormeau/single-file-cli --back-end=webdriver-gecko
single-file https://github.com/gildas-lormeau/single-file-cli --back-end=webdriver-chromium
And I have no idea how puppeteer works without the --back-end flag; it just prints the error Failed to launch the browser process! spawn chrome ENOENT every time.
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Cannot read properties of undefined (reading 'close') URL: https://www.wikipedia.org
Stack: TypeError: Cannot read properties of undefined (reading 'close')
at Object.exports.getPageData (/usr/src/app/node_modules/single-file-cli/back-ends/puppeteer.js:54:15)
at async capturePage (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:254:20)
at async runNextTask (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:175:20)
at async Promise.all (index 0)
at async capture (/usr/src/app/node_modules/single-file-cli/single-file-cli-api.js:126:2)
at async run (/usr/src/app/node_modules/single-file-cli/single-file:54:2)
Docker version 20.10.20, build 9fdeb9, macOS
Describe the bug
Could not save page and got TypeError.
To Reproduce
Following command works fine:
single-file https://www.baidu.com 1.html --back-end=webdriver-gecko --browser-headless=false
But not this:
single-file https://baidu.com 1.html --back-end=webdriver-gecko --browser-headless=false
I will get this error:
TypeError: window.singlefile is undefined URL: https://baidu.com
Stack: undefined
Describe the bug
Not really a bug or issue, but I would like to know if we can gain some insight when running single-file from the command line.
Expected behavior
Would like to see progress/error/info when tailing output.txt.
It would be nice to see the URL that failed to save.
Thank you for sharing this tool.
--compress-content --self-extracting-archive false outputs a zip file, but it receives the extension .html rather than .zip.
I'm curious when you will release a new Docker tag.
From discussion on ArchiveBox, I see that newer Chromium versions are breaking some functionality. So I don't want to rely on the latest tag, which might break later.
However, I also don't want to use a very old version of single-file.
Could you tag a new release that appears stable, possibly with a particular chromium version in case local users need to rebuild the docker image?
The current behavior of --filename-conflict-action=skip is that the page is downloaded before the filename conflict is detected.
It would be more efficient to first check if the file already exists, and only download the page if it does not.
This would support the following use case:
Suppose we use --urls-file
to download a list of URLs. Some of those pages may fail to download (e.g., due to a network failure). In my experience, for a large list of URLs, it is likely that at least one page will fail to download. If there is an error downloading the page then no file will be created (at least this seems to be the behavior).
I was hoping to be able to use the options --filename-template="{url-pathname-flat}.html" and --filename-conflict-action=skip combined with the --urls-file option to resume after an error. That is, I was hoping SingleFile would only attempt to download the URLs that did not already have files.
However, with the current implementation, because SingleFile attempts to download each page again, this is too slow to be practical.
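A pre-filtering sketch that approximates the desired resume behavior outside SingleFile. The flat-name transform is a crude assumption of mine, not the real {url-pathname-flat} implementation:

```shell
#!/bin/bash
# Emit only the URLs whose (approximated) output file does not exist yet,
# then pass remaining.txt back via --urls-file.
while IFS= read -r url; do
  name="$(printf '%s' "$url" | tr -c 'A-Za-z0-9.' '_').html"
  [ -e "$name" ] || printf '%s\n' "$url"
done < urls.txt > remaining.txt
```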
Just tried a fresh NPM install and I can't seem to use webdriver-chromium anymore - I get an error from selenium-webdriver every time:
./single-file --back-end=webdriver-chromium https://github.com/gildas-lormeau/single-file-cli
/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:51
throw new Error('Invalid URL: ' + aUrl)
^
Error: Invalid URL: http:/
at getRequestOptions (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:51:11)
at new HttpClient (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/index.js:90:21)
at getStatus (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:38:18)
at checkServerStatus (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:76:14)
at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:74:5
at new Promise (<anonymous>)
at Object.waitForServer (/home/eugene/Scratch/sf/node_modules/selenium-webdriver/http/util.js:57:10)
at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/remote/index.js:251:24
at new Promise (<anonymous>)
at /home/eugene/Scratch/sf/node_modules/selenium-webdriver/remote/index.js:246:20
Node.js v18.3.0
Puppeteer still works, this command completed successfully:
./single-file --browser-executable-path=/usr/bin/chromium https://github.com/gildas-lormeau/single-file-cli
I've tried it with Docker and npx. save-original-urls does work in the extension for me.
podman run --rm capsulecode/singlefile https://wikipedia.org --save-original-urls=true
npx single-file-cli https://wikipedia.org --browser-executable-path=/usr/bin/google-chrome-stable --save-original-urls=true
Originally posted by @gildas-lormeau in #5 (comment)
When I download a website with an expired SSL certificate, manually modifying the code as above is effective.
I don't see an entry for setting 'ignoreHTTPSErrors' in the current version. Would it be possible to add an entry like this?
if (options.ignoreHTTPSErrors !== undefined) {
browserOptions.ignoreHTTPSErrors = options.ignoreHTTPSErrors;
}
single-file fails (hangs) when retrieving a URL: https://epa.oszk.hu/00100/00181/00060/vers_16_ady.htm, probably because it has links to nonfunctional hostnames, but I see no way to debug this. Could you add a verbose mode, to see where it fails? Adding --crawl-inner-links-only does not help with the above URL either. Thanks!
Is your feature request related to a problem? Please describe.
When SingleFile CLI saves a page with puppeteer, the page is saved without any of the familiar extensions your personal Chrome profile uses (for example, ad blockers), because puppeteer runs on a clean profile by default.
Describe the solution you'd like
Would it be possible to add an argument that points to a Chrome profile folder for the CLI? It would allow the page to be saved the way you would normally see it when you're using Chrome, with extensions and all. Thanks!
Can you please provide an example of passing --crawl-rewrite-rule to single-file-cli?
Is it possible to restrict a crawl to only URLs with a specific prefix, or ones containing a specific word?
E.g., I only want to save a page if its URL starts with "https://en.wikipedia.org/wiki/", or only if the URL contains the word "wiki".
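Filtering can also be done outside SingleFile: collect candidate URLs first, keep only the prefix you want, and feed the result back via the --urls-file option. A sketch with grep; the filenames are placeholders:

```shell
# Keep only URLs under the English Wikipedia /wiki/ prefix.
grep '^https://en\.wikipedia\.org/wiki/' all-urls.txt > wiki-urls.txt
```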
Hi. I get the following error on seemingly any web page from "http://highscalability.com" when using single-file-cli. However, the Firefox extension works as expected. Can you investigate? Thanks.
$ single-file --browser-executable-path chromium "http://highscalability.com/blog/2022/12/16/what-is-cloud-computing-according-to-chatgpt.html"
Evaluation failed: TypeError: Cannot read properties of undefined (reading 'type')
at Object.node (<anonymous>:1:55192)
at Object.node (<anonymous>:1:56252)
at Object.Hi (<anonymous>:1:97189)
at Object.node (<anonymous>:1:55258)
at Object.node (<anonymous>:1:56252)
at Fr.forEach (<anonymous>:1:44421)
at Object.ln [as children] (<anonymous>:1:54840)
at Object.Cc (<anonymous>:1:116289)
at Object.node (<anonymous>:1:55258)
at Object.generate (<anonymous>:1:56320) URL: http://highscalability.com/blog/2022/12/16/what-is-cloud-computing-according-to-chatgpt.html
Stack: Error: Evaluation failed: TypeError: Cannot read properties of undefined (reading 'type')
at Object.node (<anonymous>:1:55192)
at Object.node (<anonymous>:1:56252)
at Object.Hi (<anonymous>:1:97189)
at Object.node (<anonymous>:1:55258)
at Object.node (<anonymous>:1:56252)
at Fr.forEach (<anonymous>:1:44421)
at Object.ln [as children] (<anonymous>:1:54840)
at Object.Cc (<anonymous>:1:116289)
at Object.node (<anonymous>:1:55258)
at Object.generate (<anonymous>:1:56320)
at ExecutionContext._ExecutionContext_evaluate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:229:15)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async ExecutionContext.evaluate (<path>/single-file-cli/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:107:16)
at async getPageData (<path>/single-file-cli/back-ends/puppeteer.js:150:10)
at async exports.getPageData (<path>/single-file-cli/back-ends/puppeteer.js:56:10)
at async capturePage (<path>/single-file-cli/single-file-cli-api.js:254:20)
at async runNextTask (<path>/single-file-cli/single-file-cli-api.js:175:20)
at async Promise.all (index 0)
at async capture (<path>/single-file-cli/single-file-cli-api.js:126:2)
at async run (<path>/single-file-cli/single-file:54:2)
I am running the following command from a Bash shell (MinGW on Windows 10):
docker run --mount "type=bind,src=$PWD/cookiedir,dst=/cookiedir" --mount "type=bind,src=$PWD/sitedir,dst=/sitedir" singlefile --browser-cookies-file=/cookiedir/cookies.txt --urls-file="/sitedir/urls.txt" --output-directory="/sitedir" --dump-content=false --filename-template="{url-pathname-flat}.html"
Note that I am using the Docker image and the --urls-file option.
Sometimes I get the following error:
Execution context was destroyed, most likely because of a navigation. URL: <redacted>
Stack: Error: Execution context was destroyed, most likely because of a navigation.
at rewriteError (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:265:23)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:219:60)
at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:139:10)
at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)
at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
at async Promise.all (index 0)
Sometimes I get the following different error:
Navigation failed because browser has disconnected! URL: <redacted>
Stack: Error: Navigation failed because browser has disconnected!
at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
at /usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:62
at Array.map (<anonymous>)
at Object.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:43)
at CDPSession.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
at CDPSession._onClosed (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:256:14)
at Connection._onMessage (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:99:25)
at WebSocket.<anonymous> (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:13:32)
at WebSocket.onMessage (/usr/src/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:132:16)
at WebSocket.emit (events.js:315:20)
I can download the pages at the URLs that failed by trying again. However, I would usually only expect a stack trace from an internal error, not from a network connection error or whatever the underlying cause is here.
One difficulty I have is that there is no option to "resume" downloading pages should some pages fail to download. Utilities such as youtube-dl allow you to run them a second time to continue downloading files that were not downloaded in the previous run. (youtube-dl creates .part files that are renamed only once a file is fully downloaded, which avoids half-written output and allows downloads to be resumed.) Many thanks!
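As a workaround until resuming is built in, a small wrapper can skip URLs whose output file already exists. This is only a sketch: flatten_url below is my own approximation of the {url-pathname-flat} template (scheme and host stripped, slashes replaced by underscores), and the single-file invocation is left commented out as illustrative.

```shell
# Approximate the {url-pathname-flat} template: drop scheme/host, replace "/" with "_".
flatten_url() {
  path="${1#*://*/}"                               # "https://host/a/b" -> "a/b"
  printf '_%s.html\n' "$(printf '%s' "$path" | tr '/' '_')"
}

# Hypothetical resume loop: skip URLs already saved in a previous run.
if [ -f urls.txt ]; then
  while IFS= read -r url; do
    out="sitedir/$(flatten_url "$url")"
    if [ -e "$out" ]; then
      echo "skipping $url (already saved)"
      continue
    fi
    # single-file "$url" --output-directory=sitedir --filename-template="{url-pathname-flat}.html"
  done < urls.txt
fi
```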
It would be more convenient to specify the arguments in a config file, perhaps a TOML one, rather than typing them over and over in the terminal. It would also help users who keep multiple sets of arguments for different pages.
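Until a config-file option exists, one workaround is to keep the flags in a plain "profile" file, one per line, and splice them into the command. A rough sketch (the file name and flag values here are just examples; the splice only works if no flag value contains spaces):

```shell
# Store one flag per line in a reusable profile file.
cat > wikipedia.args <<'EOF'
--output-directory=outdir
--dump-content=false
--filename-template={url-pathname-flat}.html
EOF

# Splice the stored flags into the command line:
# single-file $(cat wikipedia.args) https://en.wikipedia.org/wiki/Thomas_Lipton
echo "stored $(grep -c . wikipedia.args) flags"
```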
When a Browserless endpoint is provided via the browserServer option and a GitHub URL such as https://github.com/pawanpaudel93 is captured, single-file does not capture the page completely: the captured page loads without CSS. It works perfectly without Browserless, though.
Hi,
Running single-file --browser-script "./script.js" "<url>" simply exits, seemingly without any output or error, but removing the --browser-script argument makes it work as expected.
script.js contains the second script from https://github.com/gildas-lormeau/SingleFile/wiki/How-to-execute-a-user-script-before-a-page-is-saved.
How do I get it to work?
Thanks.
Hello! I love single-file and single-file-cli. Truly the best option out there for saving full and complete webpages!
Right now, it's hitting this strange problem where, as far as I can tell, the page hangs indefinitely, even under the lowest load settings.
It might be related to puppeteer/puppeteer#9196, but I'm not entirely sure.
To Reproduce
Run the following command in bash:
single-file --browser-executable-path=/opt/homebrew/bin/chromium https://web.archive.org/web/20140325112710/http://www.sfchronicle.com/ --block-images --dump-content --browser-headless false --browser-wait-until domcontentloaded --browser-debug
Could you please add support for web pages with the GBK charset? Currently, when saving such pages, the text comes out garbled.
Test URL: https://www.52pojie.cn/
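For what it's worth, when a saved page was decoded with the wrong charset but the original bytes survived, iconv can sometimes recover the text after the fact. A hedged illustration with a two-character GBK sample (this is a post-processing trick, not a fix for the underlying issue):

```shell
# "你好" ("hello") encoded as GBK is the 4-byte sequence C4 E3 BA C3
# (written here as octal escapes for portability).
printf '\304\343\272\303' > sample.gbk

# Transcode GBK -> UTF-8; prints the readable text if the bytes are intact.
iconv -f GBK -t UTF-8 sample.gbk
```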
The extension is better because it even saves the images, but the CLI version does not.
Running from the README:
joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker pull capsulecode/singlefile (base)
Using default tag: latest
latest: Pulling from capsulecode/singlefile
8921db27df28: Pull complete
7acbef36938e: Pull complete
d8066cda4048: Pull complete
04686419c3aa: Pull complete
4f4fb700ef54: Pull complete
198871f01b71: Pull complete
7eac83a8531f: Pull complete
Digest: sha256:f307b2b87df68835c5397b7f7e77a55af8e1abfea140cbc15c49d6d89e1b4b52
Status: Downloaded newer image for capsulecode/singlefile:latest
docker.io/capsulecode/singlefile:latest
joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker tag capsulecode/singlefile singlefile (base)
joseph@Josephs-MacBook-Air ~/d/easypin (fastapi2)> docker run singlefile "https://www.wikipedia.org" (base)
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Failed to launch the browser process!
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 5 (Trace/breakpoint trap) - core dumped
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
This is on OSX.
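The qemu crashes suggest the amd64 image is being emulated on an Apple Silicon host, and headless Chromium tends to crash under qemu. Two avenues worth trying, sketched below; --platform is a standard Docker flag, but whether a native arm64 build of this image works is an assumption:

```shell
# Report the host architecture first ("arm64" here would confirm the mismatch).
uname -m

# Option 1: request amd64 emulation explicitly (may still crash inside Chromium):
# docker run --platform linux/amd64 singlefile "https://www.wikipedia.org"

# Option 2: build a native arm64 image from the repository's Dockerfile:
# git clone https://github.com/gildas-lormeau/single-file-cli
# cd single-file-cli && docker build -t singlefile .
```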
When I run this command:
single-file --output-directory=outdir --dump-content=false --filename-template="{url-pathname-flat}.html" --crawl-links --crawl-save-session=session.json --crawl-replace-urls=true https://en.wikipedia.org/wiki/Thomas_Lipton
none of the files in the outdir directory have URLs of saved pages replaced with relative paths of other saved pages in outdir.
When I run this command, _wiki_Thomas_Lipton.html is downloaded to outdir. This is the file for the URL from which the crawl started.
The Wikipedia page https://en.wikipedia.org/wiki/Thomas_Lipton has a link to https://en.wikipedia.org/wiki/Self-made_man in the first sentence. That page was also downloaded by SingleFile, as _wiki_Self-made_man.html.
I was expecting the href to https://en.wikipedia.org/wiki/Self-made_man in _wiki_Thomas_Lipton.html to be rewritten to _wiki_Self-made_man.html, but it was not. Am I using the CLI options incorrectly?
First of all, thanks for building this amazing tool, love the chrome extension!
A problem I face when trying to use the CLI tool:
After installing it on a Mac with npm install -g "gildas-lormeau/single-file-cli", when I try to run the single-file command, I get the error "command not found".
➜ ~ npm list -g single-file-cli
/opt/homebrew/lib
└── single-file-cli@ -> ./../../../Users/gudh/.npm/_cacache/tmp/git-cloneIMlDig
➜ ~ single-file
zsh: command not found: single-file
➜ ~ single-file-cli
zsh: command not found: single-file-cli
anything I am missing?
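A common cause of this is npm's global bin directory not being on PATH (and the cache-path symlink shown above also suggests the global install itself may be worth redoing). A diagnostic sketch:

```shell
# Locate npm's global prefix; globally installed binaries live in its bin/ subdirectory.
npm_bin="$(npm prefix -g 2>/dev/null)/bin"
echo "global bin dir: $npm_bin"

# Check whether that directory is on PATH.
case ":$PATH:" in
  *":$npm_bin:"*) echo "on PATH" ;;
  *)              echo "NOT on PATH (add: export PATH=\"$npm_bin:\$PATH\")" ;;
esac
```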
Stack: ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
at <instance_members_initializer> (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:49:14)
at new Callback (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:53:16)
at CallbackRegistry.create (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:93:26)
at Connection._rawSend (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:207:26)
at CDPSessionImpl.send (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:416:33)
at #evaluate (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:234:50)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async ExecutionContext.evaluate (C:\Users\Richard\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:149:16)
at async getPageData (C:\Users\Richard\single-file-cli\back-ends\puppeteer.js:153:10)
at async exports.getPageData (C:\Users\Richard\single-file-cli\back-ends\puppeteer.js:56:10)
node -v is v20.5.1. I can't share the webpage, unfortunately. Used the latest version of the project: v1.0.65.
Notable options used (to fix the use case here: gildas-lormeau/SingleFile#105):
--remove-hidden-elements=false
--remove-unused-styles=false
--block-scripts=false
Used Google Chrome as the browser.
I only tested once in the browser with the extension; unfortunately the whole browser crashed or force quit (when it was nearing completion) with the above settings, and I haven't had the chance to retry.
Running the command below seems to dump code and then produce the error message below.
single-file https://epa.oszk.hu/00300/00336/00003/bbcikk19.html --back-end=jsdom x.html
It might be related to an IFRAME in the HTML, which can be ignored with --remove-frames; that option seems to be supported by the jsdom back end even though the CLI --help says it's not.
single-file-cli fails on a YouTube Community page with the error below, and only when the browser height is very large (greater than roughly 50,000 pixels) and a browser cookies file from an account with a membership of the channel is used.
Command: single-file --browser-cookies-file="I:\archive scripts\batch scripts\singlefile_script\cookie.txt" --browser-executable-path="C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --browser-script="I:\archive scripts\batch scripts\singlefile_script\replace_image_url_script.js" --browser-load-max-time=120000 --browser-wait-delay=20000 --browser-height=200000 --crawl-replace-urls=true "https://www.youtube.com/channel/UCqm3BQLlJfvkTsX_hvm0UmA/community"
Error:
Evaluation failed: RangeError: Invalid string length
at <anonymous>:1:239479
at Array.forEach (<anonymous>)
at <anonymous>:1:239464
at cy (<anonymous>:1:239632)
... (the four frames above repeat several more times) ...
at <anonymous>:1:239479 URL: https://www.youtube.com/channel/UCqm3BQLlJfvkTsX_hvm0UmA/community
Stack: Error: Evaluation failed: RangeError: Invalid string length
at <anonymous>:1:239479
at Array.forEach (<anonymous>)
at <anonymous>:1:239464
at cy (<anonymous>:1:239632)
... (the four frames above repeat several more times) ...
at <anonymous>:1:239479
at ExecutionContext._ExecutionContext_evaluate (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:229:15)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async ExecutionContext.evaluate (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\puppeteer-core\lib\cjs\puppeteer\common\ExecutionContext.js:107:16)
at async getPageData (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\puppeteer.js:150:10)
at async Object.exports.getPageData (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\puppeteer.js:56:10)
at async capturePage (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:254:20)
at async runNextTask (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:175:20)
at async Promise.all (index 0)
at async capture (C:\Users\test\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:126:2)
Is there a way to specify output file names when using --urls-file? I tried adding the filename after the URL, and it doesn't appear to work (there is no error either).
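Until --urls-file supports per-URL filenames, one workaround is to keep URL/filename pairs in a file and invoke single-file once per pair via --filename-template (assuming a template with no placeholders is used literally). A sketch, with the actual invocation commented out as illustrative:

```shell
# Each line: URL, then the desired output filename, separated by whitespace.
cat > pairs.txt <<'EOF'
https://example.com/a page-a.html
https://example.com/b page-b.html
EOF

while read -r url name; do
  [ -n "$url" ] || continue
  echo "would save $url as $name"
  # single-file "$url" --output-directory=outdir --filename-template="$name"
done < pairs.txt
```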
It fails trying to create a directory called "." (which can't succeed for obvious reasons 😅).
EPERM: operation not permitted, mkdir '.' URL: xxx
Stack: Error: EPERM: operation not permitted, mkdir '.'
at Object.mkdirSync (node:fs:1398:3)
at capturePage (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:271:8)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async runNextTask (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:176:20)
at async Promise.all (index 0)
at async capture (C:\xxx\amd64\node_modules\single-file-cli\single-file-cli-api.js:127:2)
at async run (C:\xxx\amd64\node_modules\single-file-cli\single-file:54:2)
Tested with v1.1.4.
When using the --remove-alternative-images option, the <source> tag and the <img> tag in the exported HTML file end up pointing to the same image, so not all pictures are saved.
<picture>
<source media="(min-width: 835px)"
srcset="https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1">
<img src="https://images.pexels.com/photos/45201/kitty-cat-kitten-pet-45201.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1"
alt="Cat" style="width:auto;">
</picture>
Please check this, Thanks!
Describe the bug
I'm trying to automate SingleFile in my small project for downloading a webpage exactly as I can do manually from the extension bar.
However, when using the CLI with Node.js, I found that the downloaded webpage doesn't have images, specifically the ones hosted on amazonaws.
I managed to download the page successfully with the extension, but with the CLI I can't download everything on the page.
To Reproduce
Steps to reproduce the behavior:
You can try the page below, for example, to see what I'm talking about:
https://www.tigergroup.ae/
The images hosted on amazonaws can't be downloaded, but the popup image is hosted on their own server, so it was downloaded.
I tried using the extension and it worked fine.
Expected behavior
A good HTML file should be downloaded with all resources embedded, especially images.
Describe the bug
https://blogs.cisco.com/datacenter/the-napkins-dialogues-life-of-a-packet-walk-part-1
When saving the page above with the browser extension, the saved page is accurate. However, when it is saved with the CLI, the images are missing.
Expected behavior
CLI and browser extension results should be the same (at least I hope).