emadehsan / thal

Getting started with Puppeteer and Chrome Headless for Web Scraping

Home Page: https://emadehsan.com

License: MIT License

JavaScript 100.00%
chrome-headless mongodb mongoose nodejs puppeteer scraping

thal's Introduction

Getting started with Puppeteer and Chrome Headless for Web Scraping

Here is a link to the Medium article.

Here is the Chinese version, thanks to @csbun.

A desert in a painter's perception

Puppeteer is the official tool for Chrome Headless, built by the Google Chrome team. Since the official announcement of Chrome Headless, many of the industry-standard libraries for automated testing have been discontinued by their maintainers, including PhantomJS. Selenium IDE for Firefox has also been discontinued due to a lack of maintainers.

With Chrome being the market leader in web browsing, Chrome Headless is surely going to be the industry leader in automated testing of web applications. So, I have put together this starter guide on how to get started with web scraping in Chrome Headless.

TL;DR

In this guide we will scrape GitHub: log in to it, then extract and save the emails of users, using Chrome Headless, Puppeteer, Node and MongoDB. Don't worry, GitHub has a rate-limiting mechanism in place to keep you under control, but this post will still give you a good idea of scraping with Chrome Headless and Node. Also, always stay updated with the documentation, because Puppeteer is under development and its APIs are prone to change.

Getting Started

Before we start, we need the following tools installed: Node and MongoDB. Head over to their websites and install them.

Project setup

Start off by making the project directory:

$ mkdir thal
$ cd thal

Initialize npm and fill in the necessary details:

$ npm init

Install Puppeteer. It is not yet stable and the repository is updated daily. If you want to use the latest functionality, you can install it directly from its GitHub repository:

$ npm i --save puppeteer

Puppeteer includes its own Chromium, which is guaranteed to work headless. So each time you install or update Puppeteer, it will download its specific Chromium version.
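If you would rather reuse a Chrome that is already installed on your machine, the launch method accepts an executablePath option. A minimal sketch (the path below is an example for macOS; adjust it to match your own installation):

const puppeteer = require('puppeteer');

// point Puppeteer at an existing Chrome binary instead of the bundled one;
// this path is an assumption and must match your system
const browser = await puppeteer.launch({
  executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
});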

Coding

We will start by taking a screenshot of the page. This code is from their documentation.

Screenshot

const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://github.com');
  await page.screenshot({ path: 'screenshots/github.png' });

  await browser.close();
}

run();

If it's your first time using Node 7 or 8, you might be unfamiliar with the async and await keywords. To put async/await in really simple terms: an async function returns a Promise, and that Promise, when it resolves, may carry the result you asked for. To get that result in a single line, you tie the call to the async function with await. Save this code in index.js inside the project directory.
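To illustrate, the two styles below are equivalent. getTitle is a hypothetical helper used only for this example:

// a hypothetical async function; page.title() resolves to the page's title
async function getTitle(page) {
  return page.title();
}

// plain Promise style:
getTitle(page).then(title => console.log(title));

// await style (only valid inside another async function):
const title = await getTitle(page);
console.log(title);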

Also create the screenshots directory:

$ mkdir screenshots

Run the code with:

$ node index.js

The screenshot is now saved inside the screenshots/ directory.

GitHub

Login to GitHub

If you go to GitHub, search for john and then click the Users tab, you will see a list of all users with that name.

Johns

Some of them have made their emails publicly visible, and some have chosen not to. But the thing is, you can't see these emails without logging in. So, let's log in. We will make heavy use of the Puppeteer documentation.

Add a file creds.js in the project root. I highly recommend signing up for a new account with a new dummy email, because you might end up getting your account blocked.

module.exports = {
    username: '<GITHUB_USERNAME>',
    password: '<GITHUB_PASSWORD>'
}

Add another file, .gitignore, and put the following content inside it:

node_modules/
creds.js

Launch in non-headless mode

For visual debugging, make Chrome launch with a GUI by passing an object with headless: false to the launch method.

const browser = await puppeteer.launch({
  headless: false
});
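While debugging, it can also help to slow Puppeteer down so you can watch each action happen. The launch method accepts a slowMo option in milliseconds; the value below is just an example:

const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250 // slow each Puppeteer operation down by 250 ms
});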

Let's navigate to the login page:

await page.goto('https://github.com/login');

Open https://github.com/login in your browser. Right-click on the input box below Username or email address and select Inspect. In the developer tools, right-click on the highlighted code and select Copy, then Copy selector.

Copy DOM element selector

Paste that value into the following constant:

const USERNAME_SELECTOR = '#login_field'; // "#login_field" is the copied value

Repeat the process for the Password input box and the Sign in button. You should end up with the following:

// dom element selectors
const USERNAME_SELECTOR = '#login_field';
const PASSWORD_SELECTOR = '#password';
const BUTTON_SELECTOR = '#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block';

Logging in

Puppeteer provides a click method to click a DOM element and a type method to type text into an input box. Let's fill in the credentials, then click login and wait for the redirect.

Up on top, require the creds.js file:

const CREDS = require('./creds');

And then:

await page.click(USERNAME_SELECTOR);
await page.keyboard.type(CREDS.username);

await page.click(PASSWORD_SELECTOR);
await page.keyboard.type(CREDS.password);

await Promise.all([
  page.click(BUTTON_SELECTOR),
  page.waitForNavigation()
]);
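As a side note, recent versions of Puppeteer let you combine the click and type steps: page.type accepts a selector and the text to type into the matched element, so the credential filling above could be written as:

await page.type(USERNAME_SELECTOR, CREDS.username);
await page.type(PASSWORD_SELECTOR, CREDS.password);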

Search GitHub

Now we are logged in. We could programmatically click on the search box, fill it in and, on the results page, click the Users tab. But there's an easier way: search requests are usually GET requests, so everything is sent via the URL. Manually type john inside the search box, then click the Users tab and copy the URL. It would be:

const searchUrl = 'https://github.com/search?q=john&type=Users&utf8=%E2%9C%93';

Rearranging it a bit:

const userToSearch = 'john';
const searchUrl = `https://github.com/search?q=${userToSearch}&type=Users&utf8=%E2%9C%93`;

Let's navigate to this page and wait a couple of seconds to see if it actually searched:

await page.goto(searchUrl);
await page.waitFor(2*1000);
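A fixed delay is a blunt instrument, though. If your Puppeteer version supports it, waiting for an element that only exists on the results page is more reliable. A sketch, using the user-list-item class we rely on in the next section:

await page.goto(searchUrl);
// resolves as soon as at least one search result is rendered
await page.waitForSelector('.user-list-item');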

Extract Emails

We are interested in extracting the username and email of each user. Let's copy the DOM element selectors like we did above:

const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';

const LENGTH_SELECTOR_CLASS = 'user-list-item';

You can see that I also added LENGTH_SELECTOR_CLASS above. If you look at the GitHub page's code inside the developer tools, you will observe that the divs with class user-list-item each house information about a single user.

Currently, one way to extract text from an element is by using the evaluate method of Page or ElementHandle. When we navigate to the page with search results, we will use the page.evaluate method to get the length of the users list on the page. The evaluate method evaluates the given code inside the browser context.

let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);
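Recent Puppeteer versions also provide page.$$eval, which runs a function over all elements matching a CSS selector inside the browser context. Counting the list items could then be a one-liner (note the leading dot, since $$eval takes a selector rather than a class name):

let listLength = await page.$$eval('.' + LENGTH_SELECTOR_CLASS, items => items.length);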

Let's loop through all the listed users and extract their emails. As we loop through the DOM, we have to change the index inside the selectors to point to the next DOM element. So, I put the INDEX string at the place where we want to insert the index as we loop through:

  // const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > a';
  // const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > ul > li:nth-child(2) > a';
const LENGTH_SELECTOR_CLASS = 'user-list-item';

The loop and extraction

for (let i = 1; i <= listLength; i++) {
  // change the index to the next child
  let usernameSelector = LIST_USERNAME_SELECTOR.replace("INDEX", i);
  let emailSelector = LIST_EMAIL_SELECTOR.replace("INDEX", i);

  let username = await page.evaluate((sel) => {
    return document.querySelector(sel).getAttribute('href').replace('/', '');
  }, usernameSelector);

  let email = await page.evaluate((sel) => {
    let element = document.querySelector(sel);
    return element ? element.innerHTML : null;
  }, emailSelector);

  // not all users have emails visible
  if (!email)
    continue;

  console.log(username, ' -> ', email);

  // TODO save this user
}
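As an aside, the indexed selectors can be avoided entirely by evaluating over all list items in one call. A minimal sketch, assuming each user-list-item contains the profile link as its first anchor and an optional email link in the position used above:

// alternative: extract all users in a single evaluate call;
// the inner selectors are assumptions based on the page structure above
let users = await page.evaluate((sel) => {
  return Array.from(document.getElementsByClassName(sel)).map((item) => {
    let link = item.querySelector('a');
    let emailElement = item.querySelector('ul > li:nth-child(2) > a');
    return {
      username: link ? link.getAttribute('href').replace('/', '') : null,
      email: emailElement ? emailElement.innerHTML : null
    };
  });
}, LENGTH_SELECTOR_CLASS);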

Now, if you run the script with node index.js, you will see usernames and their corresponding emails printed.

Go over all the pages

First, we estimate the last page number of the search results. On the search results page, at the top, you can see 69,769 users at the time of this writing.

Fun fact: if you compare with the previous screenshot of the page, you will notice that 6 more johns have joined GitHub in the matter of a few hours.

Number of search items

Copy its selector from the developer tools. We will write a new function below the run function to return the number of pages we can go through:

async function getNumPages(page) {
  const NUM_USER_SELECTOR = '#js-pjax-container > div.container > div > div.column.three-fourths.codesearch-results.pr-6 > div.d-flex.flex-justify-between.border-bottom.pb-3 > h3';

  let inner = await page.evaluate((sel) => {
    let html = document.querySelector(sel).innerHTML;
    
    // format is: "69,803 users"
    return html.replace(',', '').replace('users', '').trim();
  }, NUM_USER_SELECTOR);

  let numUsers = parseInt(inner);

  console.log('numUsers: ', numUsers);

  // GitHub shows 10 results per page, so round up
  let numPages = Math.ceil(numUsers / 10);
  return numPages;
}

At the bottom of the search results page, if you hover the mouse over the buttons with page numbers, you can see that they link to the next pages. The link to the 2nd page of results is https://github.com/search?p=2&q=john&type=Users&utf8=%E2%9C%93. Notice the p=2 query parameter in the URL. This will help us navigate to the next page.

After wrapping our previous loop in an outer loop that goes through all the pages, the code looks like this:

let numPages = await getNumPages(page);

console.log('Numpages: ', numPages);

for (let h = 1; h <= numPages; h++) {

  let pageUrl = searchUrl + '&p=' + h;

  await page.goto(pageUrl);

  let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);

  for (let i = 1; i <= listLength; i++) {
    // change the index to the next child
    let usernameSelector = LIST_USERNAME_SELECTOR.replace("INDEX", i);
    let emailSelector = LIST_EMAIL_SELECTOR.replace("INDEX", i);

    let username = await page.evaluate((sel) => {
      return document.querySelector(sel).getAttribute('href').replace('/', '');
    }, usernameSelector);

    let email = await page.evaluate((sel) => {
      let element = document.querySelector(sel);
      return element ? element.innerHTML : null;
    }, emailSelector);

    // not all users have emails visible
    if (!email)
      continue;

    console.log(username, ' -> ', email);

    // TODO save this user
  }
}
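Since this loop now fires a request for every results page, it may be worth pausing briefly between iterations to stay gentle on GitHub's servers. One simple (if crude) option is to add a delay at the end of the outer loop body:

// wait a second before fetching the next results page
await page.waitFor(1000);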

Save to MongoDB

The Puppeteer part is over now. We will use Mongoose to store the information in MongoDB. It is an ODM, actually just a library that facilitates storing information in, and retrieving it from, the database.

$ npm i --save mongoose

MongoDB is a schema-less NoSQL database, but we can make it follow some rules using Mongoose. First, we have to create a Model, which is just a representation of a MongoDB collection in code. Create a directory models, create a file user.js inside it and put the following code in it: the structure of our collection. From now on, whenever we insert something into the users collection with Mongoose, it will have to follow this structure.

const mongoose = require('mongoose');

let userSchema = new mongoose.Schema({
    username: String,
    email: String,
    dateCrawled: Date
});

let User = mongoose.model('User', userSchema);

module.exports = User;

Let's now actually insert. We don't want duplicate emails in our database, so we only insert a user's information if the email is not already present; otherwise we just update the existing information. For this we use Mongoose's Model.findOneAndUpdate method.

At the top of index.js, add the imports:

const mongoose = require('mongoose');
const User = require('./models/user');

Add the following function at the bottom of index.js to upsert (update or insert) the User model:

function upsertUser(userObj) {

  const DB_URL = 'mongodb://localhost/thal';

  if (mongoose.connection.readyState == 0) {
    mongoose.connect(DB_URL);
  }

  // if this email exists, update the entry, don't insert
  const conditions = { email: userObj.email };
  const options = { upsert: true, new: true, setDefaultsOnInsert: true };

  User.findOneAndUpdate(conditions, userObj, options, (err, result) => {
    if (err) throw err;
  });
}

Start the MongoDB server. Put the following code inside the for loops, in place of the comment // TODO save this user, in order to save the user:

upsertUser({
  username: username,
  email: email,
  dateCrawled: new Date()
});
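One thing this script never does is close the database connection, so the Node process may keep running after the crawl finishes. A minimal fix, assuming you call it once at the very end of run:

// close the MongoDB connection so the process can exit cleanly
mongoose.connection.close();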

To check that users are actually being saved, get inside the mongo shell:

$ mongo
> use thal
> db.users.find().pretty()

You should see multiple users added there. That completes the core of this guide.

Conclusion

Chrome Headless and Puppeteer are the start of a new era in web scraping and automated testing. Chrome Headless also supports WebGL. You can deploy your scraper in the cloud, sit back, and let it do the heavy lifting. Remember to remove the headless: false option when you deploy on a server.
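One deployment caveat: on many Linux servers, the bundled Chromium refuses to launch when run as root (inside containers, for example) unless sandboxing is disabled. A commonly used, environment-dependent workaround:

const browser = await puppeteer.launch({
  // disables Chromium's sandbox; only do this when scraping trusted pages
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});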

  • While scraping, you might be halted by GitHub's rate limiting:

Whoa

  • Another thing I noticed: you cannot go beyond 100 pages of search results on GitHub.

End note

Deserts symbolize vastness and are witnesses to the struggles and sacrifices of the people who traversed these giant mountains of sand. Thal is a desert in Pakistan spanning multiple districts, including my home district Bhakkar. Somewhat similar is the case of the Internet that we traversed today in quest of data. That's why I named the repository Thal. If you like this effort, please like and share it with others. If you have any suggestions, comment here or approach me directly @e_mad_ehsan. I would love to hear from you.

thal's People

Contributors

ahsenkh, caroso1222, csbun, emadehsan, jabranr, kevinwucodes, loia5tqd001, yelluw, zac-garby


thal's Issues

TimeoutError: Navigation timeout, stuck after login

Description

The page is stuck after login on GitHub. Here's the error message, after 30 seconds of waiting with nothing happening:

(node:24805) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
    at Promise.then (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21)
  -- ASYNC --
    at Frame.<anonymous> (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.waitForNavigation (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/Page.js:690:49)
    at Page.<anonymous> (/home/loia5tqd001/Desktop/thal/node_modules/puppeteer/lib/helper.js:112:23)
    at run (/home/loia5tqd001/Desktop/thal/index.js:30:14)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:24805) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24805) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

How Has This Been Tested?

I cloned the repository, ran npm install, and then ran index.js with Code Runner (similar to running node index.js).


How do you access `document` on Node?

For example

let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);

Where does document come from?
I don't want to import it implicitly, but only explicitly...

Thanks!

LENGHT_SELECTOR_CLASS misspelled?

Firstly, great work on the tutorial!

I've noticed you've possibly misspelled your LENGHT_SELECTOR_CLASS variable. It should be LENGTH_SELECTOR_CLASS. 👍

let -> const in README

In ES6, it's idiomatic to use const when a variable binding doesn't change. Therefore, most let bindings in the README should be const, right?

Feature Request: eCommerce scraping request

Hello,
We use e-commerce web scraping and found the library a perfect starting point. Could you tell us how we can integrate amazon.com, best.com and ebay.com with this scraper?

Would love to use it in production.

Thanks in advance,
Rahul

You don't need to use JSDOM

I believe two of my colleagues already left a comment on the Medium post with this information.

But you don't need to use JSDOM for text extraction. You can use the $ method instead. It should make this a lot simpler.

Unhandled promise rejections are deprecated

When I run index.js, I get this error:

node index.js
(node:2199) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Navigation Timeout Exceeded: 30000ms exceeded
(node:2199) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

env:

  • node: v8.3.0
  • os: 10.12.6

Error executing node.js

nodejs index.js

(node:13914) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at :2:43
(node:13914) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

How to test the crawl module

Hi,

I'm a new user of your module. I correctly installed Node and MongoDB.

I did this :

git clone https://github.com/emadehsan/thal.git

cd thal

npm install

=> modules are correctly installed

I don't know how to run it. Can you tell me, please?

When I did npm test, I got this error:

> [email protected] test /root/puppeteer/thal
> echo "Error: no test specified" && exit 1

Error: no test specified
npm ERR! Test failed.  See above for more details.

I tried node index.js, I have this error

module.js:491
    throw err;
    ^

Error: Cannot find module './creds'
    at Function.Module._resolveFilename (module.js:489:15)
    at Function.Module._load (module.js:439:25)
    at Module.require (module.js:517:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/root/puppeteer/thal/index.js:2:15)
    at Module._compile (module.js:573:30)
    at Object.Module._extensions..js (module.js:584:10)
    at Module.load (module.js:507:32)
    at tryModuleLoad (module.js:470:12)
    at Function.Module._load (module.js:462:3)

Thanks

Downloading in puppeteer

I have a list of PDF links which I need to download after web scraping using Puppeteer. page.pdf() doesn't seem to work!
Any suggestions?

username selector

At this point in your tutorial, Extract Emails, when you selected the username selector in devtools, this is what I am getting:

#user_search_results > div.user-list > div:nth-child(1) > div.d-flex > div > a > em

Note the em at the end. If you use this, the loop doesn't work. You have to change it to #user_search_results > div.user-list > div:nth-child(1) > div.d-flex > div > a for it to run.

Do you know why this might be happening?

Great tutorial. Thank you.
