microlinkhq / html-get
Get the HTML from any website, using prerendering when necessary.
License: MIT License
The dependency @metascraper/helpers was updated from 4.10.1 to 4.10.2.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
@metascraper/helpers is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
related: microlinkhq/open#35
The Bloomberg website returns "To continue, please click the box below to let us know you're not a robot." I used to have my own fetcher and scraper, and I used puppeteer with page.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9") to avoid the robot error. I am wondering how to set that using html-get. I appreciate your help.
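One possible approach, sketched under assumptions: the html-get README describes a headers option that is passed along to the fetch/prerender request, so a custom user-agent could be supplied there. The helper name withUserAgent is hypothetical, the option should be checked against the README of the installed version, and there is no guarantee that Bloomberg will accept the request.

```javascript
// Hypothetical helper: merge a user-agent header into html-get options.
// Assumption: html-get accepts a `headers` option (verify in your version's README).
const USER_AGENT =
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'

const withUserAgent = (opts = {}, userAgent = USER_AGENT) => ({
  ...opts,
  // Keep any headers the caller already set, then add or override user-agent.
  headers: { ...opts.headers, 'user-agent': userAgent }
})

// Usage (requires html-get to be installed):
// const getHTML = require('html-get')
// const { html } = await getHTML(url, withUserAgent({ getBrowserless }))
```

The helper only builds the options object, so it can be checked in isolation before any request is made.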
The getBrowserless function is defined with zero arguments, but when it is called it is passed an argument (a callback that receives browser).
Can you explain or correct it?
const getContent = async url => {
// create a browser context inside Chromium process
const browserContext = browserlessFactory.createContext()
const getBrowserless = () => browserContext
const result = await getHTML(url, { getBrowserless })
// close the browser context after it's used
await getBrowserless((browser) => browser.destroyContext())
return result
}
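The concern can be reproduced without Chromium. The sketch below uses a stand-in factory (makeFakeFactory, a hypothetical stub and not the browserless API) whose createContext returns a promise, which is an assumption about how browserless behaves. It shows that a zero-argument getBrowserless ignores the callback, so destroyContext never runs, and one way to close the context explicitly.

```javascript
// Stand-in for browserlessFactory; NOT the real browserless API.
// Assumption: createContext() returns a promise of an object with destroyContext().
const makeFakeFactory = () => {
  const state = { destroyed: 0 }
  const createContext = async () => ({
    destroyContext: async () => { state.destroyed += 1 }
  })
  return { state, createContext }
}

const demo = async () => {
  const factory = makeFakeFactory()
  const browserContext = factory.createContext()
  const getBrowserless = () => browserContext

  // The callback is silently ignored: getBrowserless declares no parameters.
  await getBrowserless(browser => browser.destroyContext())
  const afterCallbackStyle = factory.state.destroyed

  // Awaiting the context and calling destroyContext() directly does close it.
  const browser = await getBrowserless()
  await browser.destroyContext()
  const afterExplicitStyle = factory.state.destroyed

  return { afterCallbackStyle, afterExplicitStyle }
}
```

Calling demo() resolves to the two counters, so the difference between the callback style and the explicit style can be inspected directly.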
I was just testing against the demo. Sometimes it works, sometimes it throws this error:
/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/p-cancelable/index.js:48
throw new Error('The `onCancel` handler was attached after the promise settled.');
^
Error: The `onCancel` handler was attached after the promise settled.
at onCancel (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/p-cancelable/index.js:48:12)
at makeRequest (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/as-promise/index.js:38:13)
at Request.<anonymous> (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/as-promise/index.js:143:17)
at Object.onceWrapper (node:events:628:26)
at Request.emit (node:events:513:28)
at Timeout.retry (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/core/index.js:1278:30)
Environment: macOS m1, nodejs v18.
[email protected] /Users/songkeys/GitHub/Crossbell-Box/faas
βββ¬ [email protected]
βββ¬ [email protected]
β βββ [email protected] deduped
βββ [email protected]
βββ¬ [email protected]
βββ¬ [email protected]
βββ [email protected] deduped
βββ¬ [email protected]
βββ [email protected] deduped
I believe this is an issue related to the got package. See sindresorhus/got#1489.
I would also highly recommend deprecating the use of got, as it's somewhat buggy and slow. Instead, use the undici package, which would be much more performant. In the future (less than a month from now), when Node.js v18 comes out, there is going to be a built-in standard fetch, which is also provided by undici.
edit: After taking another deep look, I think I found the fix...
+ if (req._isPending) {
onCancel(() => {
debug('fetch:cancel', { url, reflect })
req.cancel()
})
+ }
let me know what you think.
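To make the idea behind the patch concrete, here is a minimal sketch of guarding a cancel-handler registration on whether the request is still pending. FakeRequest and createOnCancel are hypothetical stand-ins written for this illustration; they are not p-cancelable or got, and the public isPending flag only mirrors the private req._isPending used in the diff above.

```javascript
// Minimal stand-in for a cancelable request; NOT p-cancelable or got.
class FakeRequest {
  constructor () {
    this.isPending = true
    this.canceled = false
  }

  settle () { this.isPending = false }
  cancel () { this.canceled = true }
}

// Mimics an onCancel registrar that rejects late registration.
const createOnCancel = () => {
  let settled = false
  const handlers = []
  const onCancel = handler => {
    if (settled) throw new Error('handler attached after the promise settled')
    handlers.push(handler)
  }
  return { onCancel, handlers, settle: () => { settled = true } }
}

// Guarded registration: only attach the handler while the request is pending.
const attachCancel = (req, onCancel) => {
  if (!req.isPending) return false
  onCancel(() => req.cancel())
  return true
}
```

With the guard in place, a request that has already settled is skipped instead of reaching the registrar and throwing.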
Hello,
I am currently using this package for my work and am facing an issue: relative URLs in the HTML are being resolved to absolute URLs.
URL: https://spiritualtactics.com/
After some trial and error, I found that if the following call in index.js (line 140)
rewriteHtmlUrls({ $, url, headers })
is commented out, the downloaded HTML contains relative URLs.
I would appreciate your help in solving this issue. I want the downloaded HTML to be unmodified: if a URL in the original HTML is absolute it should stay absolute, and if it is relative it should stay relative.
Thanks
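To illustrate the difference being described, the sketch below resolves an href against the page URL with the standard URL class. It only shows what a URL-rewriting step produces and is not html-get's implementation; the /about path and the example.com link are made-up inputs.

```javascript
// Resolve an href the way a URL-rewriting step would (illustration only).
const toAbsolute = (href, pageUrl) => new URL(href, pageUrl).href

const pageUrl = 'https://spiritualtactics.com/'

// A relative href gains the page's origin...
console.log(toAbsolute('/about', pageUrl))

// ...while an already absolute href is left pointing at its own host.
console.log(toAbsolute('https://example.com/x', pageUrl))
```

Unmodified output would keep the first href as /about, which is the behaviour requested in this issue.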
Branch | Build failing 🚨 |
---|---|
Dependency | browserless |
Current Version | 3.6.1 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Can't get it to work because of this error. How to get around this error? The example in the README file is pretty vague about this.
The dependency require-one-of was updated from 1.0.0 to 1.0.1.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
As I suggested in the code comment at https://github.com/Kikobeats/html-get/blob/master/src/index.js#L13
This will change in the next Puppeteer major release.
The Chromium process stays alive even after getHTML is done (it shows up in Task Manager). It would be fine if it stayed alive and did nothing, but it uses a lot of CPU.
Is there any way to automatically close Chromium after the getHTML function has finished?
I need to investigate whether it is possible to move from got to curl.
curl is more reliable at getting HTML from secured websites.
This is the main library used by the Insomnia client:
https://github.com/getinsomnia/node-libcurl
Example
var Curl = require('insomnia-node-libcurl').Curl
var curl = new Curl()
curl.setOpt('URL', 'www.google.com')
curl.setOpt('FOLLOWLOCATION', true)
curl.on('end', function (statusCode, body, headers) {
console.info(statusCode)
console.info('---')
console.info(body.length)
console.info('---')
console.info(this.getInfo('TOTAL_TIME'))
console.log(body)
this.close()
})
curl.on('error', curl.close.bind(curl))
curl.perform()
My concern is that the API is a bit hard to use; we would need to create a good wrapper similar to the got interface.
Testing URLs
Branch | Build failing 🚨 |
---|---|
Dependency | got |
Current Version | 8.3.1 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
Fix Got throwing an error in some cases when trying to pipe one got.stream into another one. 7ac705f
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Hi,
Are you currently using the mem dependency in master?
I can't find it in the repository.
Its usage here is a problem: with a 1-day cache maxAge, an attacker could fill memory with many URL lookups.
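One way to limit that exposure is to bound the cache by size as well as by age. The sketch below is a hand-rolled illustration, not mem's API; createBoundedCache and its maxSize option are hypothetical names, and the 1000-entry default is an arbitrary choice.

```javascript
// Small in-memory cache bounded by both age (maxAge) and size (maxSize).
const createBoundedCache = ({
  maxAge = 24 * 60 * 60 * 1000, // 1 day, as in the issue
  maxSize = 1000, // arbitrary illustrative limit
  now = Date.now
} = {}) => {
  const store = new Map() // insertion order doubles as eviction order

  const get = key => {
    const entry = store.get(key)
    if (entry === undefined) return undefined
    if (now() - entry.time > maxAge) {
      store.delete(key) // expired: drop it and report a miss
      return undefined
    }
    return entry.value
  }

  const set = (key, value) => {
    store.delete(key) // refresh position if the key already exists
    store.set(key, { value, time: now() })
    // Evict the oldest entries once the size limit is exceeded.
    while (store.size > maxSize) store.delete(store.keys().next().value)
  }

  return { get, set, size: () => store.size }
}
```

With a size cap, a flood of distinct URLs evicts older entries instead of growing memory for a full day.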
Branch | Build failing 🚨 |
---|---|
Dependency | browserless |
Current Version | 4.1.0 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
If I try to get the HTML with:
(async () => {
const { url, html, stats } = await getHTML('https://www.thewashingtondailynews.com/2019/09/04/early-voting-ends-early-in-advance-of-dorian/');
console.log(html)
})()
The returned HTML is not complete; for example, it doesn't contain the og:description tag.
The dependency tldts was updated from 4.0.5 to 4.0.6.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
tldts is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
The new version differs by 17 commits.
f17ce70 Release v4.0.6 (#123)
90e4277 Bump eslint-plugin-import from 2.16.0 to 2.17.1 (#122)
4d96491 Bump @types/node from 11.13.0 to 11.13.4 (#120)
3f8f548 Bump rollup from 1.9.1 to 1.10.0 (#121)
5c161f5 Bump typescript from 3.4.2 to 3.4.3 (#117)
21a83ea Bump rollup from 1.9.0 to 1.9.1 (#118)
cfd2551 Bump rollup from 1.8.0 to 1.9.0 (#113)
b446372 Bump ts-jest from 24.0.1 to 24.0.2 (#114)
18c9b2f Bump typescript from 3.4.1 to 3.4.2 (#115)
b69182f Bump jest from 24.7.0 to 24.7.1 (#112)
ddffd1d Bump jest from 24.5.0 to 24.7.0 (#111)
8e38c6e Bump rollup from 1.7.4 to 1.8.0 (#110)
a9988b1 Bump tslint from 5.14.0 to 5.15.0 (#109)
007a2b1 Bump @types/node from 11.12.1 to 11.13.0 (#108)
a72ceab Bump eslint from 5.15.3 to 5.16.0 (#105)
There are 17 commits in total.
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
I am trying to fetch the HTML content for the URL: चंद्रयान-3
The URL contains characters in a language other than English,
and looks like this when copied: https://hi.wikipedia.org/wiki/%E0%A4%9A%E0%A4%82%E0%A4%A6%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A4%BE%E0%A4%A8-3
Here is my code:
import createBrowserless from 'browserless'
import getHTML from 'html-get'
export var browserlessFactory
// Kill the process when Node.js exit
process.on('exit', () => {
console.log('Closing browser!')
browserlessFactory && browserlessFactory.close()
})
function initializeBrowserless() {
console.log('Creating browserless...')
browserlessFactory = createBrowserless()
}
const getContent = async url => {
// Spawn Chromium process once
if (!browserlessFactory) {
initializeBrowserless()
}
// create a browser context inside Chromium process
const browserContext = browserlessFactory.createContext()
const getBrowserless = () => browserContext
const result = await getHTML(url, { getBrowserless, rewriteUrls: true })
// close the browser context after it's used
await getBrowserless((browser) => browser.destroyContext())
if (!result) {
throw new Error('Failed to get HTML content')
}
if (result.statusCode !== 200) {
throw new Error(`Failed to get HTML content. Status code: ${result.statusCode}`)
}
// browserlessFactory.close()
return result.html
}
export { getContent }
This throws the following error for the URL:
<path>/node_modules/browserless/src/index.js:62
getBrowser().then(browser => browser.createIncognitoBrowserContext(contextOpts))
^
TypeError: browser.createIncognitoBrowserContext is not a function
at <path>/node_modules/browserless/src/index.js:62:42
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Can somebody help me with this, or tell me if I am doing something wrong?
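Two separate things appear in this report. The stack trace points at browser.createIncognitoBrowserContext, which, as far as I know, was removed in Puppeteer v22 in favor of createBrowserContext, so a version mismatch between browserless and puppeteer is a plausible cause rather than the URL itself; that is an assumption to verify against the installed versions. As for the URL, the sketch below only shows that the readable and percent-encoded forms are the same address, using the standard decodeURI function and URL class.

```javascript
// The percent-encoded form copied from the browser's address bar.
const encoded =
  'https://hi.wikipedia.org/wiki/%E0%A4%9A%E0%A4%82%E0%A4%A6%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A4%BE%E0%A4%A8-3'

// Decode it to the human-readable form...
const readable = decodeURI(encoded)

// ...and let the URL class percent-encode the path again.
const roundTripped = new URL(readable).href

console.log(readable)
console.log(roundTripped === encoded)
```

Either form can therefore be passed wherever a URL string is expected, as long as it is parsed by a standards-compliant URL parser.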
Branch | Build failing 🚨 |
---|---|
Dependency | browserless |
Current Version | 4.1.1 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
I'm trying to figure out why I keep getting this error:
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 exit listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGINT listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGTERM listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGHUP listeners added. Use emitter.setMaxListeners() to increase limit
To be honest, I'm quite new to Node.js, but I found online that this error is caused by too many listeners being attached to an EventEmitter. I don't know how to solve this, but I'm quite sure it's caused by this module, because when I remove it everything works fine.
This is my code:
getHTML(url)
.then(({url, html, stats, headers, statusCode})=>{
if(statusCode==200){
console.log(statusCode);
}
})
.catch((err)=>{
onCheckError(err);
});
It is executed every 5 seconds.
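One plausible explanation, not confirmed in this thread, is that something registers new process listeners (exit, SIGINT, and so on) on every call, for example by creating a new browser instance each time. The sketch below reproduces that pattern with a plain EventEmitter and a hypothetical createResource function, and shows that creating the resource once keeps the listener count from growing.

```javascript
const { EventEmitter } = require('events')

// Stand-in for `process`; a private emitter keeps the demo side-effect free.
const fakeProcess = new EventEmitter()

// Hypothetical resource that registers a cleanup listener when created.
const createResource = () => {
  const resource = { closed: false }
  fakeProcess.on('exit', () => { resource.closed = true })
  return resource
}

// Anti-pattern: a new resource (and a new listener) on every call.
const perCall = () => createResource()

// Alternative: create the resource lazily, once, and reuse it.
let shared
const reuse = () => (shared ??= createResource())

for (let i = 0; i < 5; i++) perCall()
const afterPerCall = fakeProcess.listenerCount('exit') // 5: one per call

for (let i = 0; i < 5; i++) reuse()
const afterReuse = fakeProcess.listenerCount('exit') // 6: only one more was added
```

If the same pattern applies here, creating the browser once outside the 5-second loop and passing it in on each call would be the equivalent change.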
The dependency @metascraper/helpers was updated from 5.2.0 to 5.2.4.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
@metascraper/helpers is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
The dependency require-one-of was updated from 1.0.8 to 1.0.9.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
The dependency got was updated from 9.2.1 to 9.2.2.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
The new version differs by 4 commits.
248d68c 9.2.2
3ad3950 Don't override hooks when merging arguments
292f78a Merge hooks on got.extend() (#608)
7ae6939 Gracefully handle invalid Location redirect URLs (#605)
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
On my machine, html-get prerenders JavaScript to HTML, which is what I want.
But on a colleague's machine it does not prerender the JavaScript.
How do I force prerendering from the command line? Or do I need to adjust a script somewhere? If so, where specifically? I am only an npm command-line user, not a JavaScript developer, so I need very specific directions.
The dependency parse-domain was updated from 2.1.4 to 2.1.5.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
parse-domain is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
The new version differs by 2 commits.
8f572b2 chore(release): 2.1.5
d9d782b fix: Compatibility problems with older JavaScript engines (#51)
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
The dependency require-one-of was updated from 1.0.7 to 1.0.8.
🚨 View failing branch.
This version is covered by your current version range and after updating it in your project the build failed.
require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Branch | Build failing 🚨 |
---|---|
Dependency | got |
Current Version | 9.2.0 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
The new version differs by 7 commits.
0ddf3ac 9.2.1
b8480f3 Fix cache reusing bodies of HTTP errors
a3a6a94 Dump cacheable-request to 5.0.0
e55e52c Mention Ky in the readme
5f191b9 Fix merging default & custom handlers
9c50d26 Update dev dependencies
cf56247 Meta tweaks
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴