
emadehsan / thal

Getting started with Puppeteer and Chrome Headless for Web Scraping

Home Page: https://emadehsan.com

License: MIT License

JavaScript 100.00%
puppeteer chrome-headless nodejs scraping mongoose mongodb

thal's Issues

TimeoutError: Navigation timeout, stuck after login

Description

The page gets stuck after logging in to GitHub. After 30 seconds in which nothing happens, this error message appears:

(node:24805) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
    at Promise.then (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21)
  -- ASYNC --
    at Frame.<anonymous> (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.waitForNavigation (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/Page.js:690:49)
    at Page.<anonymous> (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/helper.js:112:23)
    at run (/home/loia5tqd001/Desktop/thal/index.js:30:14)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:24805) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24805) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

How Has This Been Tested?

I cloned the repository, ran npm install, and then ran index.js with Code Runner (equivalent to running node index.js).

Screenshots

[Animated GIF: ezgif com-video-to-gif]
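
One common cause of this timeout is calling page.waitForNavigation after the click has already triggered (or never triggers) a full navigation. A minimal sketch, assuming that is what happens here: start the wait before the click and catch the timeout so it no longer surfaces as an unhandled rejection. The helper name clickAndWaitForNavigation and the buttonSelector parameter are hypothetical; page is a Puppeteer Page.

```javascript
// Hypothetical helper: click a button and wait for the resulting navigation.
// Returns true if the page navigated, false if the wait timed out.
async function clickAndWaitForNavigation(page, buttonSelector, timeout = 30000) {
  try {
    // Start waiting before the click so a fast navigation is not missed.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'domcontentloaded', timeout }),
      page.click(buttonSelector),
    ]);
    return true;
  } catch (err) {
    // The log above shows the rejection is named TimeoutError.
    if (err.name === 'TimeoutError') {
      return false;
    }
    throw err; // anything else is a real failure
  }
}
```

If the function returns false, the login may have updated the page without a full navigation; waiting for a selector that only exists after login (page.waitForSelector) is an alternative worth trying.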

Unhandled promise rejections are deprecated

When I run index.js, I get this error:

node index.js
(node:2199) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Navigation Timeout Exceeded: 30000ms exceeded
(node:2199) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

env:

  • node: v8.3.0
  • os: 10.12.6
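
The deprecation warning itself means the promise returned by the script's async entry point rejected and nothing handled it. A minimal sketch of handling it, assuming the entry point is an async function like the run() that appears in the stack trace of the previous issue; runSafely is a hypothetical name.

```javascript
// Wrap an async entry point so that a rejection is always handled.
// `task` stands in for the script's run() function.
function runSafely(task) {
  return task().then(
    () => 0, // success: exit code 0
    (err) => {
      // Report the failure instead of leaving the rejection unhandled.
      console.error('Scrape failed:', err.message);
      return 1; // failure: non-zero exit code
    }
  );
}
```

At the bottom of index.js this could be used as runSafely(run).then((code) => { process.exitCode = code; }). It only removes the warning; the underlying navigation timeout still has to be fixed separately.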

Feature Request: eCommerce scraping request

Hello,
We do e-commerce web scraping and found this library a perfect starting point. Could you tell us how we can integrate amazon.com, best.com, and ebay.com with this scraper?

We would love to use it in production.

Thanks in advance,
Rahul

How do you access `document` on Node?

For example

let listLength = await page.evaluate((sel) => {
  return document.getElementsByClassName(sel).length;
}, LENGTH_SELECTOR_CLASS);

Where does document come from?
I don't want to import it implicitly, only explicitly...

Thanks!
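
For context: the function passed to page.evaluate does not run in Node. Puppeteer serializes it, sends it to the browser, and runs it inside the page, where document is the page's own DOM global. Nothing is imported into the Node process. The toy below imitates that idea with Node's vm module; toyEvaluate, fakePageGlobals, and the class name 'user-list-item' are hypothetical stand-ins for illustration, not Puppeteer's implementation.

```javascript
const vm = require('vm');

// Toy stand-in for page.evaluate: turn the function into source text and
// run that text in a separate context that has its own globals.
function toyEvaluate(fn, arg, pageGlobals) {
  const source = `(${fn.toString()})(${JSON.stringify(arg)})`;
  return vm.runInNewContext(source, pageGlobals);
}

// Fake "page" globals for the demo. In a real browser page, `document`
// is the DOM of the page that was loaded.
const fakePageGlobals = {
  document: {
    getElementsByClassName: (cls) => (cls === 'user-list-item' ? [{}, {}, {}] : []),
  },
};

// Same shape as the function in the question: it refers to `document`,
// which only resolves where the function finally executes.
const countByClass = (sel) => document.getElementsByClassName(sel).length;

// Node itself has no `document` global.
const nodeHasDocument = typeof document !== 'undefined';

// Inside the other context, `document` resolves to that context's global.
const listLength = toyEvaluate(countByClass, 'user-list-item', fakePageGlobals);

console.log(nodeHasDocument, listLength);
```

This is also why only serializable values (such as the selector string) can be passed in as arguments: the function cannot see variables from the surrounding Node scope.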

username selector

At the Extract Emails step of your tutorial, when I copy the username selector in DevTools, this is what I get:

#user_search_results > div.user-list > div:nth-child(1) > div.d-flex > div > a > em

Note the em at the end. If you use this, the loop doesn't work. You have to change it to #user_search_results > div.user-list > div:nth-child(1) > div.d-flex > div > a for it to run.

Do you know why this might be happening?

Great tutorial. Thank you.
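
A possible explanation, offered as an assumption rather than something confirmed in this thread: the em element is the highlighted search match inside the link, so it may not be present in every row, and a selector ending in em then matches nothing for those rows. A small sketch of normalizing a copied selector so it targets the link itself; toRowSelector is a hypothetical helper.

```javascript
// Turn a selector copied from DevTools into one for the link in row `index`.
function toRowSelector(copiedSelector, index) {
  return copiedSelector
    // Drop a trailing "> em" so the selector ends at the <a> element.
    .replace(/\s*>\s*em\s*$/, '')
    // Point the first :nth-child(...) at the requested row.
    .replace(/:nth-child\(\d+\)/, `:nth-child(${index})`);
}

const copied =
  '#user_search_results > div.user-list > div:nth-child(1) > div.d-flex > div > a > em';
console.log(toRowSelector(copied, 3));
```

Reading the text from the a element still includes any highlighted part, because an element's text content covers its child elements.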

let -> const in README

In ES6, it's idiomatic to use const when a variable binding doesn't change. Therefore, most let bindings in the README should be const, right?
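
A short illustration of the distinction, with made-up values: const prevents rebinding, while let remains the right choice for bindings that do change, such as loop counters.

```javascript
// A selector constant never changes, so const fits. The value is illustrative.
const LENGTH_SELECTOR_CLASS = 'user-list-item';

let reassignError = null;
try {
  // Reassigning a const binding throws at runtime.
  LENGTH_SELECTOR_CLASS = 'something-else';
} catch (err) {
  reassignError = err;
}

// let is still needed where the binding is reassigned.
let visited = 0;
for (let i = 1; i <= 3; i++) {
  visited += 1;
}

console.log(reassignError instanceof TypeError, visited);
```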

How to test the crawl module

Hi,

I'm a new user of your module. I correctly installed Node and MongoDB.

I did this:

git clone https://github.com/emadehsan/thal.git

cd thal

npm install

=> modules are correctly installed

I don't know how to run it. Can you tell me, please?

When I ran npm test, I got this error:

> [email protected] test /root/puppeteer/thal
> echo "Error: no test specified" && exit 1

Error: no test specified
npm ERR! Test failed.  See above for more details.

I tried node index.js and got this error:

module.js:491
    throw err;
    ^

Error: Cannot find module './creds'
    at Function.Module._resolveFilename (module.js:489:15)
    at Function.Module._load (module.js:439:25)
    at Module.require (module.js:517:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/root/puppeteer/thal/index.js:2:15)
    at Module._compile (module.js:573:30)
    at Object.Module._extensions..js (module.js:584:10)
    at Module.load (module.js:507:32)
    at tryModuleLoad (module.js:470:12)
    at Function.Module._load (module.js:462:3)

Thanks
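
The stack trace shows that line 2 of index.js requires a local module ./creds that is not part of the clone, so a creds.js file has to be created next to index.js before the script can start. The sketch below turns the raw failure into a more helpful message. It assumes the file exports an object with username and password fields; that shape and the helper name loadCreds are assumptions, not confirmed by this thread.

```javascript
// Load a local credentials module and explain what to do if it is missing.
function loadCreds(modulePath) {
  try {
    return require(modulePath);
  } catch (err) {
    if (err.code === 'MODULE_NOT_FOUND') {
      // Assumed shape: module.exports = { username: '...', password: '...' };
      throw new Error(
        `Missing ${modulePath}.js: create it and export { username, password }`
      );
    }
    throw err; // syntax errors and other failures pass through unchanged
  }
}
```

The npm test failure is unrelated: the package's test script is just a placeholder that prints "no test specified", so the scraper is started with node index.js instead.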

Error executing node.js

nodejs index.js

(node:13914) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at :2:43
(node:13914) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
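
This error means a selector matched nothing inside the page, so the lookup returned null before innerHTML was read. Typical reasons are a changed page layout or a list with fewer rows than expected. A minimal sketch of a guarded read; readInnerHTML is a hypothetical helper and page is a Puppeteer Page.

```javascript
// Read an element's innerHTML inside the page, or null if nothing matches.
async function readInnerHTML(page, selector) {
  return page.evaluate((sel) => {
    // This callback runs in the browser page, where `document` exists.
    const element = document.querySelector(sel);
    return element ? element.innerHTML : null;
  }, selector);
}
```

The caller can then skip rows where the result is null instead of crashing the whole run.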

LENGHT_SELECTOR_CLASS misspelled?

Firstly, great work on the tutorial!

I've noticed you've possibly misspelled your LENGHT_SELECTOR_CLASS variable. It should be LENGTH_SELECTOR_CLASS. 👍

Downloading in puppeteer

I have a list of PDF links that I need to download after web scraping with Puppeteer. page.pdf() doesn't seem to work!
Any suggestions?
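
Note that page.pdf() renders the currently open page as a PDF; it does not download a file that a link points to. One option, sketched below under the assumption that the links are plain HTTPS URLs that need no login cookies, is to fetch them with Node's own https module once the scraper has collected them. fileNameFromUrl and downloadFile are hypothetical helpers.

```javascript
const fs = require('fs');
const https = require('https');
const path = require('path');

// Derive a local file name from the last segment of the URL path.
function fileNameFromUrl(url) {
  const name = path.posix.basename(new URL(url).pathname);
  return name || 'download.pdf';
}

// Stream one URL into destDir. Redirects are not followed, for brevity.
function downloadFile(url, destDir) {
  return new Promise((resolve, reject) => {
    https
      .get(url, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the response body
          reject(new Error(`Unexpected status ${res.statusCode} for ${url}`));
          return;
        }
        const destPath = path.join(destDir, fileNameFromUrl(url));
        const file = fs.createWriteStream(destPath);
        res.pipe(file);
        file.on('finish', () => resolve(destPath));
        file.on('error', reject);
      })
      .on('error', reject);
  });
}
```

With an array of scraped links, the files can be fetched one after another with for (const link of links) { await downloadFile(link, './pdfs'); } inside an async function, provided the ./pdfs directory already exists.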

You don't need to use JSDOM

I believe two of my colleagues already left a comment on the Medium post with this information.

But you don't need to use JSDOM for text extraction. You can use the $ method instead. It should make this a lot simpler.
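
A minimal sketch of what that could look like, assuming the goal is to read the text of one element: page.$ returns a handle to the first match (or null), and the handle can be passed into page.evaluate. extractText is a hypothetical helper and page is a Puppeteer Page.

```javascript
// Read the trimmed text of the first element matching `selector`,
// without parsing HTML in Node. Returns null when nothing matches.
async function extractText(page, selector) {
  const handle = await page.$(selector);
  if (handle === null) {
    return null;
  }
  // Inside the page, the handle arrives as the actual DOM element.
  return page.evaluate((el) => el.textContent.trim(), handle);
}
```

Puppeteer also offers page.$eval(selector, fn), which combines both steps but throws when no element matches.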
