frankfanslc / webscraping_email_phone

This project is forked from data-ethusiast/webscraping_email_phone

Web scraping of Emails and Phone numbers from various websites

Python 100.00%

webscraping_email_phone's Introduction

Email-Phone scraping

This project crawls the HTML of websites to collect email addresses and phone numbers in bulk, then writes them to a .csv file in an organized way.

The main goal of this 'Advanced' email and phone scraper, written in Python 3, is to provide a way to gather the data (emails and phone numbers) neatly and quickly.

Applications:

  1. Generally used by marketers to collect contact data for multiple organizations.
  2. Used in business/eCommerce for market analysis.

Getting Started

These instructions will help you set up this project on your local system for development and testing. Follow the steps below in order.

Pre-requisites

What needs to be installed on your system?

  • This project is built using Python 3.7.

Libraries to be installed:

  • pip install regex (2020.7.14)
  • pip install google-search (1.0.2)
  • pip install requests (2.24.0)
  • pip install beautifulsoup4 (4.9.1)
  • pip install tld (0.12.2)

Deployment

Now you are good to go :)

  1. Clone the repository or download the zip file.
  2. If you downloaded the zip file, extract it into the directory of your choice.
  3. Erase the contents of the .csv file, but leave the header row intact.
  4. Run the script.
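
Step 3 can also be done programmatically. The following is a minimal sketch, not part of the original project: the function name `reset_csv_keep_header` is hypothetical, and it assumes the first row of the file is the header.

```python
import csv


def reset_csv_keep_header(path):
    """Truncate a CSV file so that only its first (header) row remains.

    Hypothetical helper; assumes the header is the first row of the file.
    Returns the header as a list, or None if the file was empty.
    """
    # Read just the first row before the file is overwritten.
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), None)

    # Re-open in write mode, which empties the file, then restore the header.
    with open(path, "w", newline="", encoding="utf-8") as f:
        if header is not None:
            csv.writer(f).writerow(header)
    return header
```

Call it with the path of the project's .csv file before each fresh run. The README does not name that file, so the path is left as a parameter.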

Execution

  1. Enter the organization name, along with the location if necessary. Example: Deloitte Hyderabad
  2. The links associated with it will be stored in 'web_urls.txt'.
  3. Enjoy Harvesting Emails and Phone numbers :)
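
As an illustration of step 2, the sketch below shows one way a list of URLs could be stored in and read back from 'web_urls.txt'. The helper names `save_urls` and `load_urls` are hypothetical. Overwriting the file on each run and dropping duplicates are assumptions, since the README does not say how the script handles either.

```python
def save_urls(urls, path="web_urls.txt"):
    """Write URLs to a text file, one per line, dropping blanks and duplicates.

    Assumption: the file is overwritten on every run.
    Returns the list of URLs actually written, in first-seen order.
    """
    cleaned = (u.strip() for u in urls)
    unique = list(dict.fromkeys(u for u in cleaned if u))
    with open(path, "w", encoding="utf-8") as f:
        for url in unique:
            f.write(url + "\n")
    return unique


def load_urls(path="web_urls.txt"):
    """Read the URLs back, ignoring empty lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

The search step itself is left out here, because the README only names 'search' from the google-search library without showing how it is called.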

How does it Work?

  1. First, it generates links for the given input using 'search' from the google-search library, and stores the first and all subsequent URLs in 'web_urls.txt'.
  2. Next, each URL is processed by sending an HTTP request to the website.
  3. The page returned for each URL is parsed into HTML text using bs4.
  4. With the page content extracted, all emails and phone numbers present on the home page are scraped.
  5. The scraping itself is done with regular expressions.
  6. The regular expressions used in this project are generalized, so they detect and return emails and phone numbers from most websites. For some sites, however, they may not work well.
  7. If no data is detected on the home page, the script locates the contact page and collects any data present there, since most websites keep their contact details on a contact-us page.
  8. The home page data and contact page data are then merged into a single data structure.
  9. Finally, everything is written to a .csv file, so that the data is organized and ready for inspection.
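
Steps 4 to 9 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the two regular expressions are generic examples (the README does not show the real ones), all function names are hypothetical, and the standard library's `html.parser` stands in for bs4 so the block has no third-party dependencies. Fetching is delegated to a caller-supplied `fetch` function, which in the real project would wrap requests.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative patterns only; the project's own regexes may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return (emails, phones) found in text as sorted, de-duplicated lists."""
    emails = sorted({m.lower() for m in EMAIL_RE.findall(text)})
    # Normalize phone numbers by keeping only digits and a leading '+'.
    phones = sorted({re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)})
    return emails, phones


class _ContactLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose href mentions 'contact'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "contact" in href.lower():
                self.links.append(href)


def find_contact_url(html, base_url):
    """Return the absolute URL of the first contact-like link, or None."""
    finder = _ContactLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.links[0]) if finder.links else None


def merge_contacts(home, contact):
    """Union two (emails, phones) pairs into one."""
    return (sorted(set(home[0]) | set(contact[0])),
            sorted(set(home[1]) | set(contact[1])))


def scrape_site(url, fetch):
    """Scrape the home page; fall back to the contact page if nothing is found.

    fetch is a caller-supplied function mapping a URL to its HTML text.
    """
    home_html = fetch(url)
    found = extract_contacts(home_html)
    if not any(found):  # both lists empty: try the contact page
        contact_url = find_contact_url(home_html, url)
        if contact_url:
            found = merge_contacts(found, extract_contacts(fetch(contact_url)))
    return found


def to_csv(rows):
    """Render (url, emails, phones) rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "emails", "phones"])  # assumed column names
    for url, emails, phones in rows:
        writer.writerow([url, "; ".join(emails), "; ".join(phones)])
    return buf.getvalue()
```

For a quick offline check, `fetch` can be the `__getitem__` method of a dict that maps URLs to HTML strings. Note that this sketch runs the regular expressions on raw HTML, whereas the project first converts the page to text with bs4, so the sketch may pick up digit runs from markup that the original would not.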

Built with

Python 3.x - A Programming Language

Contributing

Open to contributions from the public.

Author

  • K Sai Chaitanya
