This project is a web scraping application built with FastAPI. It scrapes product data from the Dental Stall website and stores it in a JSON file. The application caches product prices so unchanged products are not saved again, and it supports optional proxy usage.
The project has the following structure:
scraper/
├── main.py # Entry point for the FastAPI application
├── cache.py # Cache management
├── models.py # Data models
├── scraper.py # Web scraper logic
├── utils.py # Utility functions, including authentication
├── images/ # Directory to store downloaded images
├── scraped_data.json # Output file for scraped data
└── cache.json # Cache file for storing product prices
Requirements

- Python 3.8+
- FastAPI
- Requests
- BeautifulSoup4
- PIL (Pillow)
- Uvicorn
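A requirements.txt matching the list above might look like this (package names only; the project's actual file may pin specific versions):

```
fastapi
requests
beautifulsoup4
Pillow
uvicorn
```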
Installation

- Clone the repository:

  git clone https://github.com/codeffe/scraper.git
  cd scraper
- Create and activate a virtual environment:

  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the required packages:

  pip install -r requirements.txt
- Create the necessary directories and files if they don't exist:

  mkdir -p images
  touch scraped_data.json cache.json

Running the Application
- Start the FastAPI server:

  uvicorn main:app --host 0.0.0.0 --port 8000
- Access the API:

  The API will be available at http://localhost:8000.
The FastAPI application has the following endpoint:
Endpoint: /scrape
Method: POST
Headers: token: xyz
Request Body:
{
"pages": 1,
"proxy": null
}
Response:
{
"scraped_products": [ ... ],
"message": "Scraped X products"
}
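In the application these request and response shapes would typically live in models.py as Pydantic models; the sketch below uses only stdlib dataclasses to illustrate them (the class and field names are assumptions based on the JSON above, not taken from the project):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ScrapeRequest:
    # Number of catalogue pages to fetch; proxy is optional.
    pages: int = 1
    proxy: Optional[str] = None

@dataclass
class ScrapeResponse:
    scraped_products: List[Dict[str, Any]] = field(default_factory=list)
    message: str = ""
```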
The application uses a static token for authentication. Be sure to include the header token: xyz in every request.
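The check itself can be a constant-time comparison against the static token, sketched below (the real implementation in utils.py may differ; in FastAPI this would usually be wired up as a dependency that reads the header and raises a 401 on mismatch):

```python
import hmac

API_TOKEN = "xyz"  # static token expected in the "token" header

def verify_token(token: str) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(token, API_TOKEN)
```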
Initialization: The Scraper class initializes with the number of pages to scrape, an optional proxy, and a cache instance.
Scraping Process: The scrape method fetches the specified pages from the Dental Stall website. It parses product information including title, price, and image URL. Images are downloaded and saved to the images/ directory.
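The parsing step might look roughly like this with BeautifulSoup (the CSS selectors and dictionary keys below are placeholders; the real ones depend on the Dental Stall markup and the project's models):

```python
from typing import Dict, List
from bs4 import BeautifulSoup

def parse_products(html: str) -> List[Dict[str, str]]:
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select(".product"):  # one card per product listing
        title = card.select_one(".product-title")
        price = card.select_one(".product-price")
        image = card.select_one("img")
        products.append({
            "product_title": title.get_text(strip=True) if title else "",
            "product_price": price.get_text(strip=True) if price else "",
            "image_url": image["src"] if image and image.has_attr("src") else "",
        })
    return products
```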
Image Validation: Downloaded images are validated to ensure they are not corrupted.
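Pillow can verify an opened image (img.verify() raises on truncated data); a lighter, dependency-free sketch is to check the file signature before accepting a download, which catches the common case of an HTML error page saved as an image (the signatures below cover only JPEG and PNG and are illustrative):

```python
# Known magic bytes for the formats the scraper is likely to see.
SIGNATURES = (
    b"\xff\xd8\xff",          # JPEG
    b"\x89PNG\r\n\x1a\n",     # PNG
)

def looks_like_image(data: bytes) -> bool:
    # Reject empty or obviously non-image payloads.
    return any(data.startswith(sig) for sig in SIGNATURES)
```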
Caching: Product prices are cached so that products whose price is unchanged are not scraped and saved again.
Saving Data: Scraped data is saved to scraped_data.json. Cache is updated and saved to cache.json.
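Writing the output through a temporary file plus an atomic rename avoids leaving a half-written scraped_data.json if the process dies mid-save. This is a design note rather than what the project necessarily does:

```python
import json
import os
import tempfile

def save_json_atomic(path: str, data) -> None:
    # Write to a temp file in the same directory, then atomically replace
    # the target so readers never see a partially written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp)
        raise
```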
Notification: A message is logged indicating the number of products scraped and saved.
To scrape data from the Dental Stall website, you can use the following curl command:
curl -X POST "http://localhost:8000/scrape" -H "Content-Type: application/json" -H "token: xyz" -d '{"pages": 1, "proxy": null}'
This will initiate the scraping process for 1 page without using a proxy.
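The same call can be made from Python using only the standard library. The sketch below just builds the request object; actually sending it (the commented lines) requires the server from the previous section to be running:

```python
import json
import urllib.request
from typing import Optional

def build_scrape_request(pages: int = 1, proxy: Optional[str] = None):
    body = json.dumps({"pages": pages, "proxy": proxy}).encode()
    return urllib.request.Request(
        "http://localhost:8000/scrape",
        data=body,
        headers={"Content-Type": "application/json", "token": "xyz"},
        method="POST",
    )

# To actually send it:
# with urllib.request.urlopen(build_scrape_request(pages=1)) as resp:
#     print(json.loads(resp.read()))
```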
This project is licensed under the MIT License.