
akshatjainbafna / profile-scraper


A profile scraper which can be given a username and a phone number to scrape details from various apps like WhatsApp, Facebook, TrueCaller etc.

Python 100.00%
facebook-scraper profile-scraper selenium web-scraper whatsapp-scraper

profile-scraper's Introduction

Profile Scraper Web App

This is a simple profile scraper that can be used to scrape data from a variety of web apps such as WhatsApp, Facebook, TrueCaller, and GooglePay. It is built using:

  • Flask: Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.
  • Selenium: Selenium is an open source umbrella project for a range of tools and libraries aimed at supporting browser automation. It provides a playback tool for authoring functional tests across most modern web browsers, without the need to learn a test scripting language.
  • BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

Note:

  • The app runs on port 5002, so you can send a request from any browser or Postman to the /scrap-data endpoint to scrape the data for a particular phone number and username, for example: http://localhost:5002/scrap-data?username=virat&phone_number=7021292055

  • Several steps have to be done manually, such as authenticating in WhatsApp, ticking the checkbox in the confirm box, and accepting the confirm box that allows WhatsApp to open a chat in the browser.

  • The driver will start twice: once when the app is first run and once after the Flask server has started. Close the driver that started first and authenticate WhatsApp in the one that started after the server, because all requests are sent to that instance. I tried multiple ways to start the driver only once after the server has started, and not on every refresh: multi-processing, multi-threading, Flask-SocketIO, and the " app_context " provided by Flask itself. None of them worked well: the first three crash the browser, whereas app_context still runs the driver twice, so it does not solve the problem.

  • You can adjust time.sleep( X ) based on how fast your browser loads pages and on your connection speed. If both are fast, you can reduce the time the code waits for a window to load.

  • Add your Facebook cookies in JSON format to a file named " cookies.txt " in the base folder. You can use a Chrome extension such as "Export cookie JSON file for Puppeteer" to export all the cookies in JSON format. If the cookies are not sent with the request, Facebook restricts the request.

  • Learning: The script will be executed twice no matter where you place it, whether you keep it:

  1. Above the Flask instance:

    `< Some script/code >`

    `app = Flask(__name__)`

    `CORS(app)`

  2. Below the Flask instance:

    `app = Flask(__name__)`

    `CORS(app)`

    `< Some script >`

  3. In the "__main__" section:

    `< Some script >`

    `app.run()`

When placed after app.run(), the script is not executed while the server is running, because app.run() blocks until the server stops.
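One likely cause of the double execution, assuming the server is started with debug=True (the README does not show the arguments passed to app.run()): Flask's development server then uses the Werkzeug reloader, which runs the script once in a watcher process and again in a child process that serves the requests. The child process has the environment variable WERKZEUG_RUN_MAIN set to "true". The sketch below is a minimal guard built on that variable. The helper name should_start_driver and the usage shown in the comments are hypothetical and not taken from the repository.

```python
import os


def should_start_driver(debug, environ=os.environ):
    """Return True only in the process that actually serves requests.

    Hypothetical helper, not part of the repository.
    """
    if not debug:
        # Without debug mode there is no reloader, so the script runs once.
        return True
    # With the reloader active, only the child process has this variable set.
    return environ.get("WERKZEUG_RUN_MAIN") == "true"


# Hypothetical usage in module1.py (names below are assumptions):
#
#   if should_start_driver(debug=True):
#       driver = webdriver.Chrome()
#   app.run(port=5002, debug=True)
#
# Alternatively, the reloader can be switched off so the script runs once:
#
#   app.run(port=5002, debug=True, use_reloader=False)
```

If the server is not started in debug mode, the reloader is not involved and this guard will not change anything, so the cause would have to be looked for elsewhere.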

How to Run the Flask App?

  1. Clone the repository.

git clone https://github.com/Akshatjainbafna/Dinosys-Web-Scrapping-Assignment.git

  2. Install all the dependencies.

pip install -r requirements.txt

  3. Run the Flask server.

python module1.py

  4. You can request the API now.

http://localhost:5002/scrap-data?username=virat&phone_number=7021292055
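The request URL can also be built in Python rather than typed by hand, which keeps the query string correctly encoded. This is a small sketch using only the standard library. The port, endpoint, and parameter names come from the example URL, while the function name build_scrape_url is an assumption made for illustration.

```python
from urllib.parse import urlencode

# Port and endpoint as given in the notes; adjust if the server runs elsewhere.
BASE_URL = "http://localhost:5002/scrap-data"


def build_scrape_url(username, phone_number, base_url=BASE_URL):
    """Build the request URL with a properly encoded query string.

    Hypothetical helper, not part of the repository.
    """
    query = urlencode({"username": username, "phone_number": phone_number})
    return f"{base_url}?{query}"


url = build_scrape_url("virat", "7021292055")
print(url)  # http://localhost:5002/scrap-data?username=virat&phone_number=7021292055

# With the Flask server running, the URL could then be requested with, e.g.:
#   from urllib.request import urlopen
#   body = urlopen(url).read()
```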
