Dental Stall Scraper

This project is a web scraping application built with FastAPI. It scrapes product data from the Dental Stall website and stores it in a JSON file. It caches product prices to avoid storing redundant data, and it can route requests through an optional proxy.

Project Structure

The project has the following structure:

scraper/
├── main.py # Entry point for the FastAPI application
├── cache.py # Cache management
├── models.py # Data models
├── scraper.py # Web scraper logic
├── utils.py # Utility functions, including authentication
├── images/ # Directory to store downloaded images
├── scraped_data.json # Output file for scraped data
└── cache.json # Cache file for storing product prices

Requirements

  • Python 3.8+
  • FastAPI
  • Requests
  • BeautifulSoup4
  • Pillow (imported as PIL)
  • Uvicorn

Installation

  1. Clone the repository:

    git clone https://github.com/codeffe/scraper.git
    cd scraper
    
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install the required packages:

    pip install -r requirements.txt

  4. Create the necessary directories and files if they don't exist:

    mkdir -p images
    touch scraped_data.json cache.json

Running the Application

  1. Start the FastAPI server:

    uvicorn main:app --host 0.0.0.0 --port 8000

  2. Access the API:

    The API will be available at http://localhost:8000.

API Endpoints

The FastAPI application has the following endpoint:

  Endpoint: /scrape
  Method: POST
  Headers: token: xyz

  Request Body: 
     {
       "pages": 1,
       "proxy": null
     }

  Response:
     {
       "scraped_products": [ ... ],
       "message": "Scraped X products"
     }

Authentication

The application uses a static token for authentication. Include the header token: xyz in every request to the API.
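The README does not show how utils.py performs this check, so the following is only a minimal sketch of a static-token comparison. The names `EXPECTED_TOKEN` and `is_authorized` are hypothetical, and the use of a constant-time comparison is an assumption rather than something the project is known to do.

```python
import hmac
from typing import Mapping, Optional

# Assumption: the static token is the "xyz" value documented in this README.
EXPECTED_TOKEN = "xyz"


def is_authorized(headers: Mapping[str, str], expected: str = EXPECTED_TOKEN) -> bool:
    """Return True when the `token` header matches the expected static token."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    supplied: Optional[str] = None
    for name, value in headers.items():
        if name.lower() == "token":
            supplied = value
            break
    if supplied is None:
        return False
    # compare_digest avoids revealing how many leading characters matched.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

In a FastAPI application a check like this would typically run before the /scrape handler and reject the request with a 401 response when it returns False.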

How it Works

Initialization: The Scraper class initializes with the number of pages to scrape, an optional proxy, and a cache instance.
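As a rough illustration of these inputs, the sketch below groups the page count and optional proxy into a small settings object. `ScrapeSettings` is a hypothetical name and not the project's Scraper class; the only library fact relied on is that Requests accepts a `proxies` mapping keyed by URL scheme.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ScrapeSettings:
    """Hypothetical container for the values the Scraper is initialized with."""

    pages: int = 1
    proxy: Optional[str] = None

    def __post_init__(self) -> None:
        # Scraping zero or a negative number of pages makes no sense.
        if self.pages < 1:
            raise ValueError("pages must be at least 1")

    def proxies(self) -> Optional[Dict[str, str]]:
        """Return a mapping in the shape of the `proxies` argument of Requests."""
        if self.proxy is None:
            return None
        # Route both plain and TLS traffic through the same proxy URL.
        return {"http": self.proxy, "https": self.proxy}
```

With `proxy` left as `None`, `proxies()` returns `None`, which matches the "proxy": null case in the request body.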

Scraping Process: The scrape method fetches the specified pages from the Dental Stall website. It parses product information including title, price, and image URL. Images are downloaded and saved to the images/ directory.

Image Validation: Downloaded images are validated to ensure they are not corrupted.
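One common way to perform such a check with Pillow is `Image.verify()`, sketched below. Whether the project uses this exact call is not stated in the README, and `is_valid_image` is a hypothetical helper name.

```python
from io import BytesIO

from PIL import Image


def is_valid_image(data: bytes) -> bool:
    """Return True if Pillow can identify and verify the image bytes."""
    try:
        with Image.open(BytesIO(data)) as img:
            # verify() checks file integrity without decoding the full image.
            img.verify()
        return True
    except (OSError, SyntaxError, ValueError):
        # Pillow raises OSError subclasses for unidentifiable data and may
        # raise SyntaxError or ValueError for malformed files.
        return False
```

Note that an image object cannot be reused after `verify()`; the file has to be reopened if the pixels are needed afterwards.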

Caching: The cache is used to store product prices to avoid redundant data scraping.

Saving Data: Scraped data is saved to scraped_data.json. Cache is updated and saved to cache.json.
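The caching and saving steps could look roughly like the sketch below. It assumes that cache.json maps a product title to its last seen price and that a product is only worth saving when its price differs from the cached one; the README does not spell out either detail, and `PriceCache` is a hypothetical class name.

```python
import json
from pathlib import Path
from typing import Dict


class PriceCache:
    """Hypothetical JSON-backed cache mapping product title to last seen price."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.prices: Dict[str, float] = {}
        # An empty file (for example one created with `touch`) counts as an empty cache.
        if path.exists():
            text = path.read_text(encoding="utf-8").strip()
            if text:
                self.prices = json.loads(text)

    def is_changed(self, title: str, price: float) -> bool:
        """Return True when the product is new or its price differs from the cache."""
        return self.prices.get(title) != price

    def update(self, title: str, price: float) -> None:
        self.prices[title] = price

    def save(self) -> None:
        self.path.write_text(json.dumps(self.prices, indent=2), encoding="utf-8")
```

A scraper using this sketch would call `is_changed` for each parsed product, append only the changed ones to the output list, and call `save` once at the end of the run.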

Notification: A message is logged indicating the number of products scraped and saved.

Example Request

To scrape data from the Dental Stall website, you can use the following curl command:

curl -X POST "http://localhost:8000/scrape" -H "Content-Type: application/json" -H "token: xyz" -d '{"pages": 1, "proxy": null}'

This will initiate the scraping process for 1 page without using a proxy.
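The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl command above; `build_scrape_request` and `send` are hypothetical helper names, and the default token and URL are the values documented in this README.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_scrape_request(
    pages: int = 1,
    proxy: Optional[str] = None,
    token: str = "xyz",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST /scrape request without sending it."""
    body = json.dumps({"pages": pages, "proxy": proxy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=body,
        headers={"Content-Type": "application/json", "token": token},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_scrape_request(pages=1))` returns the decoded response, whose `message` field reports how many products were scraped.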

License

This project is licensed under the MIT License.
