html-get's People

Contributors

dependabot-preview[bot], dependabot[bot], greenkeeper[bot], kikobeats

html-get's Issues

An in-range update of @metascraper/helpers is breaking the build 🚨

The dependency @metascraper/helpers was updated from 4.10.1 to 4.10.2.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

@metascraper/helpers is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

[Question] How to setUserAgent for puppeteer.

The Bloomberg website returns "To continue, please click the box below to let us know you're not a robot." I used to have my own fetcher and scraper, and I used puppeteer with page.setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9") to avoid the robot check. How can I set the user agent when using html-get? I appreciate your help.
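One possible approach, sketched under an assumption: html-get is assumed here to accept a `headers` option that is forwarded to the fetch/prerender step, so please verify this against the README of the installed version. The helper name `withUserAgent` is hypothetical and only merges a user agent into an options object.

```javascript
// The user agent string quoted in the question above.
const SAFARI_UA =
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'

// Hypothetical helper: returns a copy of `opts` with a `user-agent` header set.
// Assumption: html-get forwards `opts.headers` to the request it performs.
const withUserAgent = (userAgent, opts = {}) => ({
  ...opts,
  headers: { ...opts.headers, 'user-agent': userAgent }
})

// Intended usage (needs html-get and a browserless context, not shown here):
// const result = await getHTML(url, withUserAgent(SAFARI_UA, { getBrowserless }))
```

Existing headers are preserved, so the helper can be combined with other options without overwriting them.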

Incorrect function parameter, in readme example

The function getBrowserless is defined with zero arguments, but when it is called it is passed a function that takes an argument browser.

Can you explain or correct it?

const getContent = async url => {
  // create a browser context inside Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // close the browser context after it's used
  await getBrowserless((browser) => browser.destroyContext())
  return result
}
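A possible correction, as a sketch rather than the maintainers' fix: since getBrowserless ignores its arguments, the callback is never run, so the context can instead be resolved first and then destroyed. Here `getHTML` and `browserlessFactory` are injected (the wrapper name `createGetContent` is hypothetical) so the example runs without Chromium; in real code they would come from html-get and browserless.

```javascript
// Hypothetical wrapper: dependencies are passed in instead of imported.
const createGetContent = ({ getHTML, browserlessFactory }) => async url => {
  // createContext() may return a promise, so keep it and await it when needed
  const browserContext = browserlessFactory.createContext()
  // takes no arguments and simply hands back the context
  const getBrowserless = () => browserContext
  try {
    return await getHTML(url, { getBrowserless })
  } finally {
    // resolve the context first, then destroy it, even if getHTML threw
    const browserless = await getBrowserless()
    await browserless.destroyContext()
  }
}
```

The try/finally is a design choice: it keeps the cleanup from being skipped when the fetch fails.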

Error: The `onCancel` handler was attached after the promise settled.

I was just testing against the demo. Sometimes it works, sometimes it throws this error:

/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/p-cancelable/index.js:48
                                        throw new Error('The `onCancel` handler was attached after the promise settled.');
           ^
Error: The `onCancel` handler was attached after the promise settled.
    at onCancel (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/p-cancelable/index.js:48:12)
    at makeRequest (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/as-promise/index.js:38:13)
    at Request.<anonymous> (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/as-promise/index.js:143:17)
    at Object.onceWrapper (node:events:628:26)
    at Request.emit (node:events:513:28)
    at Timeout.retry (/Users/songkeys/GitHub/Crossbell-Box/faas/node_modules/got/dist/source/core/index.js:1278:30)

Environment: macOS m1, nodejs v18.

[email protected] /Users/songkeys/GitHub/Crossbell-Box/faas
└─┬ [email protected]
  ├─┬ [email protected]
  │ └── [email protected] deduped
  ├── [email protected]
  └─┬ [email protected]
    └─┬ [email protected]
      ├── [email protected] deduped
      └─┬ [email protected]
        └── [email protected] deduped

I believe this is an issue related to the got package. See sindresorhus/got#1489.

I would also highly recommend deprecating the use of got, as it is somewhat buggy and slow. Instead, use the undici package, which should be much more performant. In the future (less than a month from now), when Node.js v18 comes out, there will be a built-in standard fetch, which is also provided by undici.


edit: After taking another deep look, I think I found the fix...

+  if (req._isPending) {
      onCancel(() => {
        debug('fetch:cancel', { url, reflect })
        req.cancel()
      })
+   }

let me know what you think.

Relative URLs getting resolved to absolute URLs

Hello,

I am currently using this package for my work and am facing an issue: relative URLs in the HTML are being resolved to absolute URLs.

URL: https://spiritualtactics.com/

After some trial and error, I found that if the following call in index.js (line 140)
rewriteHtmlUrls({ $, url, headers })

is commented out, the downloaded HTML contains relative URLs.

I request your assistance in solving this issue. I want the downloaded HTML to be unmodified: if a URL in the original HTML is absolute it should stay absolute, and if it is relative it should stay relative.

Thanks

An in-range update of browserless is breaking the build 🚨

Version 3.6.2 of browserless was just published.

Branch Build failing 🚨
Dependency browserless
Current Version 3.6.1
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

Commits

The new version differs by 3 commits.

See the full diff

An in-range update of require-one-of is breaking the build 🚨

The dependency require-one-of was updated from 1.0.0 to 1.0.1.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Commits

The new version differs by 2 commits.

See the full diff

The 'Chromium' process stays alive even after getHTML is done

The 'Chromium' process stays alive even after getHTML is done (it shows up in the task manager). It would be fine if it stayed alive while idle, but it uses a lot of CPU.

Is there any way to automatically close Chromium after the getHTML function has finished?
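One way this could be handled, as a sketch only: close the browserless factory once the fetch has finished. The `close()` and `createContext()` calls mirror the usage shown in other issues on this page; the function name `fetchAndClose` and the injected `getHTML` and `createBrowserless` parameters are hypothetical, used so the example runs without Chromium.

```javascript
// Hypothetical helper: spawns a browser for one fetch and always closes it.
const fetchAndClose = async ({ getHTML, createBrowserless }, url) => {
  // spawn the Chromium process for this single fetch
  const browserlessFactory = createBrowserless()
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  try {
    return await getHTML(url, { getBrowserless })
  } finally {
    // terminate the Chromium process whether the fetch succeeded or not
    await browserlessFactory.close()
  }
}
```

Spawning a browser per call is slow, so for repeated fetches it is probably better to keep one factory alive and close it once at shutdown.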

Move from `got` to `curl`

I need to investigate whether it is possible to move from got to curl.

curl is more accurate at getting HTML from secured websites.

It is the main library used by the Insomnia client:

https://github.com/getinsomnia/node-libcurl

Example

var Curl = require('insomnia-node-libcurl').Curl

var curl = new Curl()

curl.setOpt('URL', 'www.google.com')
curl.setOpt('FOLLOWLOCATION', true)

curl.on('end', function (statusCode, body, headers) {
  console.info(statusCode)
  console.info('---')
  console.info(body.length)
  console.info('---')
  console.info(this.getInfo('TOTAL_TIME'))
  console.log(body)
  this.close()
})

curl.on('error', curl.close.bind(curl))
curl.perform()

My concern is that the API is a bit hard to use; we would need to create a good wrapper similar to the got interface.

Testing URLs

An in-range update of got is breaking the build 🚨

Version 8.3.2 of got was just published.

Branch Build failing 🚨
Dependency got
Current Version 8.3.1
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

Release Notes v8.3.2

Fix Got throwing an error in some cases when trying to pipe one got.stream into another one. 7ac705f

Commits

The new version differs by 2 commits.

  • ad7b361 8.3.2
  • 7ac705f fix Buffer.byteLength(req._header) throwing error (#490)

See the full diff

Still using mem dependency?

Hi,

Are you currently using the mem dependency in master?
I can't find it in the repository.

The way it is used is a problem: with a 1-day cache maxAge, an attacker could fill the memory with many URL lookups.
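To illustrate the concern, here is a minimal sketch of a cache that bounds both entry age and entry count, so a flood of distinct URLs cannot grow memory without limit. It is not how html-get or mem work; `createBoundedCache` and its options are hypothetical names.

```javascript
// Hypothetical bounded cache: entries expire after `maxAge` ms and the
// oldest entry is evicted once more than `maxSize` entries are stored.
const createBoundedCache = ({
  maxSize = 1000,
  maxAge = 24 * 60 * 60 * 1000,
  now = Date.now
} = {}) => {
  const store = new Map() // a Map iterates in insertion order, oldest first
  return {
    get (key) {
      const entry = store.get(key)
      if (!entry) return undefined
      if (now() - entry.createdAt > maxAge) {
        store.delete(key) // drop expired entries lazily on read
        return undefined
      }
      return entry.value
    },
    set (key, value) {
      store.delete(key) // re-inserting moves the key to the newest position
      store.set(key, { value, createdAt: now() })
      if (store.size > maxSize) store.delete(store.keys().next().value)
    },
    get size () {
      return store.size
    }
  }
}
```

With `maxSize` set, the worst case is a fixed number of cached lookups regardless of how many distinct URLs are requested.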

An in-range update of browserless is breaking the build 🚨

Version 4.1.1 of browserless was just published.

Branch Build failing 🚨
Dependency browserless
Current Version 4.1.0
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).

Commits

The new version differs by 3 commits.

See the full diff

returned html structure not complete

If I try to get the HTML structure with:

(async () => {
  const { url, html, stats } = await getHTML('https://www.thewashingtondailynews.com/2019/09/04/early-voting-ends-early-in-advance-of-dorian/');
  console.log(html)
})()

The returned HTML structure is not complete; for example, it doesn't contain the og:description tag.

Result

An in-range update of tldts is breaking the build 🚨

The dependency tldts was updated from 4.0.5 to 4.0.6.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

tldts is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Release Notes for v4.0.6
  • #123 Update Public Suffix Lists to 033221af7f600bcfce38dcbfafe03b9a2269c4cc

Commits

The new version differs by 17 commits.

  • f17ce70 Release v4.0.6 (#123)
  • 90e4277 Bump eslint-plugin-import from 2.16.0 to 2.17.1 (#122)
  • 4d96491 Bump @types/node from 11.13.0 to 11.13.4 (#120)
  • 3f8f548 Bump rollup from 1.9.1 to 1.10.0 (#121)
  • 5c161f5 Bump typescript from 3.4.2 to 3.4.3 (#117)
  • 21a83ea Bump rollup from 1.9.0 to 1.9.1 (#118)
  • cfd2551 Bump rollup from 1.8.0 to 1.9.0 (#113)
  • b446372 Bump ts-jest from 24.0.1 to 24.0.2 (#114)
  • 18c9b2f Bump typescript from 3.4.1 to 3.4.2 (#115)
  • b69182f Bump jest from 24.7.0 to 24.7.1 (#112)
  • ddffd1d Bump jest from 24.5.0 to 24.7.0 (#111)
  • 8e38c6e Bump rollup from 1.7.4 to 1.8.0 (#110)
  • a9988b1 Bump tslint from 5.14.0 to 5.15.0 (#109)
  • 007a2b1 Bump @types/node from 11.12.1 to 11.13.0 (#108)
  • a72ceab Bump eslint from 5.15.3 to 5.16.0 (#105)

There are 17 commits in total.

See the full diff

Throwing TypeError: browser.createIncognitoBrowserContext is not a function; for some URLs

I am trying to fetch the HTML content for the URL चंद्रयान-3.

The URL contains characters in a language other than English, and looks like this when copied: https://hi.wikipedia.org/wiki/%E0%A4%9A%E0%A4%82%E0%A4%A6%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A4%BE%E0%A4%A8-3

Here is my code:

import createBrowserless from 'browserless'
import getHTML from 'html-get'

export var browserlessFactory

// Kill the process when Node.js exit
process.on('exit', () => {
    console.log('Closing browser!')
    browserlessFactory && browserlessFactory.close()
})

function initializeBrowserless() {
    console.log('Creating browserless...')
    browserlessFactory = createBrowserless()
}

const getContent = async url => {
    // Spawn Chromium process once
    if (!browserlessFactory) {
        initializeBrowserless()
    }

    // create a browser context inside Chromium process
    const browserContext = browserlessFactory.createContext()
    const getBrowserless = () => browserContext
    const result = await getHTML(url, { getBrowserless, rewriteUrls: true })
    // close the browser context after it's used
    await getBrowserless((browser) => browser.destroyContext())

    if (!result) {
        throw new Error('Failed to get HTML content')
    }
    if (result.statusCode !== 200) {
        throw new Error(`Failed to get HTML content. Status code: ${result.statusCode}`)
    }

    // browserlessFactory.close()
    return result.html
}

export { getContent }

This throws the following error for that URL:

<path>/node_modules/browserless/src/index.js:62
    getBrowser().then(browser => browser.createIncognitoBrowserContext(contextOpts))
                                         ^

TypeError: browser.createIncognitoBrowserContext is not a function
    at <path>/node_modules/browserless/src/index.js:62:42
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Can somebody help me with this? Am I doing something wrong?

An in-range update of browserless is breaking the build 🚨

Version 4.1.2 of browserless was just published.

Branch Build failing 🚨
Dependency browserless
Current Version 4.1.1
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

browserless is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Commits

The new version differs by 3 commits.

See the full diff

MaxListenersExceededWarning

I'm trying to figure out why I keep getting this error:

(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 exit listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGINT listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGTERM listeners added. Use emitter.setMaxListeners() to increase limit
(node:22272) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGHUP listeners added. Use emitter.setMaxListeners() to increase limit

To be honest, I'm quite new to Node.js, but I found online that this warning is caused by too many listeners being attached to an EventEmitter. I don't know how to solve this, but I'm quite sure it's caused by this module, because when I remove it everything works fine.
This is my code:

getHTML(url)
  .then(({ url, html, stats, headers, statusCode }) => {
    if (statusCode === 200) {
      console.log(statusCode)
    }
  })
  .catch(err => {
    onCheckError(err)
  })

It is executed every 5 seconds.
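The general mechanism behind this warning can be sketched as follows. This is an illustration under an assumption, not a diagnosis of html-get: it assumes something registers a new cleanup listener on every call. `fakeProcess`, `leakyCreate`, and `createShared` are hypothetical names, and a plain EventEmitter stands in for `process`.

```javascript
const { EventEmitter } = require('node:events')

// Stand-in for `process`, so the sketch has no side effects.
const fakeProcess = new EventEmitter()

// Leaky pattern: every call attaches one more 'exit' listener.
const leakyCreate = emitter => {
  emitter.on('exit', () => {})
  return { createdAt: Date.now() }
}

// Reuse pattern: the resource and its single listener are created once.
const createShared = emitter => {
  let shared
  return () => {
    if (!shared) {
      emitter.on('exit', () => {})
      shared = { createdAt: Date.now() }
    }
    return shared
  }
}
```

If the repeated call creates a browser instance each time, creating it once and reusing it across the 5-second runs would keep the listener count constant.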

An in-range update of @metascraper/helpers is breaking the build 🚨

The dependency @metascraper/helpers was updated from 5.2.0 to 5.2.4.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

@metascraper/helpers is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

An in-range update of require-one-of is breaking the build 🚨

The dependency require-one-of was updated from 1.0.8 to 1.0.9.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Commits

The new version differs by 2 commits.

See the full diff

An in-range update of got is breaking the build 🚨

The dependency got was updated from 9.2.1 to 9.2.2.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Release Notes for v9.2.2
  • Gracefully handle invalid Location redirect URLs. (#605) 7ae6939
  • Don't override hooks when merging arguments. 3ad3950
  • Merge hooks on got.extend(). (#608) 292f78a

v9.2.1...v9.2.2

Commits

The new version differs by 4 commits.

  • 248d68c 9.2.2
  • 3ad3950 Don't override hooks when merging arguments
  • 292f78a Merge hooks on got.extend() (#608)
  • 7ae6939 Gracefully handle invalid Location redirect URLs (#605)

See the full diff

how to force prerendering from the command line?

On my machine, html-get prerenders JavaScript to HTML, which is what I want.
But on a colleague's machine it does not prerender the JavaScript.

How do I force prerendering from the command line? Or do I need to adjust a script somewhere? If so, where specifically? I am only an npm command-line user, not a JavaScript developer, so I need very specific directions.

An in-range update of parse-domain is breaking the build 🚨

The dependency parse-domain was updated from 2.1.4 to 2.1.5.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

parse-domain is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Release Notes for v2.1.5

Bug Fixes

  • Compatibility problems with older JavaScript engines (#51) (d9d782b)

Commits

The new version differs by 2 commits.

  • 8f572b2 chore(release): 2.1.5
  • d9d782b fix: Compatibility problems with older JavaScript engines (#51)

See the full diff

An in-range update of require-one-of is breaking the build 🚨

The dependency require-one-of was updated from 1.0.7 to 1.0.8.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

require-one-of is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Commits

The new version differs by 2 commits.

  • e9b7790 chore(release): 1.0.8
  • fd2988c docs: fix typo on variable name

See the full diff

An in-range update of got is breaking the build 🚨

Version 9.2.1 of got was just published.

Branch Build failing 🚨
Dependency got
Current Version 9.2.0
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

got is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ❌ continuous-integration/travis-ci/push: The Travis CI build failed (Details).

Release Notes v9.2.1
  • Don't cache response when HTTP error was received. #597 b8480f3
  • Fix merging default & custom handlers. 5f191b9

v9.2.0...v9.2.1

Commits

The new version differs by 7 commits.

See the full diff
