contacts-web-scraper

Scrape contact info (e.g. emails, Facebook and Twitter profiles) into Google Sheets by entering search phrases or website URLs directly.

Description

Manually entering keywords into a search engine works well for varied searches, but it is poorly suited to repetitive ones; for those, web scraping, the automated extraction of data from websites, is far more convenient.

Contacts Web Scraper is built to scrape websites that list email and SNS contact information on an About, Contact, or Our Team page. Blogs, for instance, commonly have such pages.
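To give a concrete picture of what scraping a contact page involves, here is a minimal, hypothetical sketch (not the repository's actual code) that pulls emails and SNS links out of a page using requests and BeautifulSoup, the libraries this project is built on:

```python
import re
import requests
from bs4 import BeautifulSoup

# A simple email pattern; the real program's matching rules may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_contacts(url):
    """Fetch a page and collect email addresses and SNS profile links."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    emails = set(EMAIL_RE.findall(soup.get_text()))
    socials = {a["href"] for a in soup.find_all("a", href=True)
               if "facebook.com" in a["href"] or "twitter.com" in a["href"]}
    return emails, socials
```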

Contacts Web Scraper allows you to:

  1. scrape contacts from websites whose URLs are listed on websites retrieved from a Google search page
    • e.g. Let the search phrase of the Google search page be top 100 fitness blogs. If the search page returns websites that each list 100 URLs to fitness blogs, the scraper scrapes each website from the search page for those URLs and then scrapes each URL for contacts. Duplicate websites are removed automatically (see the sketch after this list).
  2. scrape contacts from the websites listed directly on a Google search page
  3. scrape contacts from websites whose URLs you enter directly
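The deduplication mentioned in operation 1 can be as simple as normalizing each URL and keeping only its first occurrence; an illustrative sketch, not the program's actual code:

```python
def dedupe_urls(urls):
    """Drop duplicate URLs while preserving their original order."""
    seen = set()
    unique = []
    for url in urls:
        key = url.rstrip("/").lower()  # ignore trailing-slash and case differences
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```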

All the contact information is filed into a Google Sheets file that you create manually. The file must contain at least 4 sheets and be shared with the service account email ([email protected], which accompanies the credentials.json file). The first sheet displays the scraped contacts, while the next three sheets take the input for operations 1, 2, and 3 respectively (listed above).
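With gspread (the Google Sheets library this project uses), the connection and the 4-sheet requirement look roughly like this; the file name "my contacts" is a placeholder:

```python
import gspread

# Authenticate with the service account behind credentials.json.
gc = gspread.service_account(filename="credentials.json")
sh = gc.open("my contacts")            # must already be shared with the service account
assert len(sh.worksheets()) >= 4, "the file must contain at least 4 sheets"
contacts_sheet = sh.get_worksheet(0)   # first sheet: scraped contacts
```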

Getting Started

In order to run the program, you must have ‘python3’ installed. After downloading all the program files, navigate to the directory containing the WebContactSheetManager.py file in a terminal:

Introductory Terminal Guide

‘cd’ by itself navigates to your home directory. On Unix-like systems (macOS, Linux, etc.), ‘ls’ displays the current directory’s contents; use ‘ls’ to see which directories you can navigate into from the current one. On Windows, the terminal uses ‘dir’ instead. ‘cd Documents’ navigates into the Documents directory. Once you arrive in the directory containing WebContactSheetManager.py, run the script with the command ‘python3 WebContactSheetManager.py’. To stop and exit the program, press ‘Ctrl + C’ in the terminal.
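A typical session might look like this (the directory names are illustrative):

```
cd
ls
cd Documents/contacts-web-scraper
python3 WebContactSheetManager.py
```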

When the program runs, the first notice instructs you to share the Google Sheets file with the service account email. In your Google Drive, create a new spreadsheet and add sheets until the file contains at least 4 in total. Once the file is shared with the service account email, enter its name. The name must be unique among your Google Sheets files; otherwise the program will not create a new spreadsheet. If the spreadsheet you want to access already exists, confirm that you would like to proceed. Depending on whether you are creating a new spreadsheet, the program will prompt you either to enter the password or to create a new one.
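The uniqueness requirement follows from how a spreadsheet is looked up by name; a hedged sketch of the open-or-create logic, not necessarily the program's own:

```python
import gspread

gc = gspread.service_account(filename="credentials.json")
name = input("Enter the name of the Google Sheets file: ")
try:
    sh = gc.open(name)                 # file already exists: confirm before proceeding
except gspread.SpreadsheetNotFound:
    sh = gc.create(name)               # no file with this name: create a new one
```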

The program will then display a warning that “the file must contain 4 sheets for the program to work.” If your file indeed has at least 4 sheets, press ‘Enter’ to proceed. Connecting to and setting up the Google Sheets file may take a few seconds.

A menu with the scraping operations will then appear, and you may proceed to your desired scrape. At the same time, open your Google Sheets file: during setup, the program inserted headers into the first row of every sheet.
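Continuing the connection sketch above, the header setup could be one insert_row call per sheet; the column names below are assumptions, not the program's actual headers:

```python
headers = ["website", "email", "facebook", "twitter", "status"]
for ws in sh.worksheets():
    ws.insert_row(headers, 1)  # write the headers into the first row
```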

Operations

Note that for operations 1 and 2, Google’s Terms of Service do not permit scraping its search results; however, according to a Stack Overflow answer, scraping at a rate of fewer than 6 unique keyphrases per hour will not be detected.
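At 6 unique keyphrases per hour, that works out to at least 3600 s / 6 = 600 s between Google queries; a sketch of such throttling:

```python
import time

def scrape_keyphrases(keyphrases, scrape_one):
    for phrase in keyphrases:
        scrape_one(phrase)  # run one Google search and scrape its results
        time.sleep(600)     # stay under 6 unique keyphrases per hour
```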

For the first operation, insert the keyphrases into the first column of the second sheet of your file, in order, so that there are no empty cells between keyphrases in the column. Once you confirm the operation, the program begins scraping and updates the status of each input to “to scrape” in the spreadsheet.
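One plausible way to read these inputs (illustrative, continuing the connection sketch above): gspread’s col_values() returns the column up to its last non-empty cell, so a gap between keyphrases would come back as an empty string, which is why the keyphrases must be contiguous:

```python
input_sheet = sh.get_worksheet(1)                 # second sheet: operation 1 input
keyphrases = input_sheet.col_values(1)            # first column, top to bottom
for row, phrase in enumerate(keyphrases, start=1):
    input_sheet.update_cell(row, 2, "to scrape")  # status column is an assumption
```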

Occasionally, you may see a message stating that the “HTML from [website url] was not extracted.” The program simply skips that website and moves on to the next one.
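The skip-and-continue behavior suggests error handling along these lines (a sketch, not the actual code):

```python
import requests

def get_html(url):
    """Return the page HTML, or None if the fetch fails for any reason."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        print(f"HTML from {url} was not extracted.")
        return None
```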

The same process applies to operations 2 and 3. Both operations append the extracted contact information to the first sheet.
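Appending one scraped contact to the first sheet takes a single gspread call; the column layout here is assumed:

```python
contacts_sheet.append_row(
    ["https://example.com", "hello@example.com",
     "facebook.com/example", "twitter.com/example"]
)
```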
