MINDSITE INTERVIEW TASK

Powerful web scraping tool for e-commerce data with email notifications and flexible data export. Supports N11 and Trendyol.

Home Page: https://www.themindsite.com/


The system takes a list of product URLs as input. It then prepares the specified data types for parsing. Once the scanning process begins, it exports the collected data in various file formats. Finally, it can notify users by email, with the gathered data either attached to the message or embedded in its body.

Used Libraries

  • Beautiful Soup: A library for parsing HTML and extracting data from web pages.
  • Requests: A library for making HTTP requests.
  • asyncio: The core library for asynchronous programming.
  • pandas: A library for data analysis and manipulation.
  • tabulate: A library for creating tables.
  • smtplib: A built-in library for sending emails using the Simple Mail Transfer Protocol.
  • re: A built-in library for regular expressions.
  • logging: A built-in library for flexible event logging.
  • openpyxl: A library for reading and writing Excel files.

Project Structure

The project includes the following main components:

  • src/crawler.py: Contains the Crawler class for asynchronously fetching web pages.
  • src/parser.py: Includes URLParser classes for extracting product details from web pages.
  • src/storage.py: Contains the StorageExporter class for storing obtained product information.
  • src/email_sender.py: Includes the EmailSender class for handling email sending operations.
  • product.py: Contains the Product class representing a product with its details (see the sketch after this list).
  • logs/crawler.log: Log file containing information about the crawling process.
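
For illustration, the Product class in product.py might be modeled as a small dataclass; the field names below are assumptions made for the sketch, not the repository's actual attributes.

from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Product:
    # Illustrative fields; the real Product class may track different details.
    url: str
    title: Optional[str] = None
    price: Optional[str] = None
    seller: Optional[str] = None

    def to_dict(self) -> dict:
        # A plain dict is convenient for pandas-based export and HTML tables.
        return asdict(self)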

Usage

How to run this project:

# Clone the repository
git clone https://github.com/your-username/your-web-scraper.git

# Navigate to the project directory
cd your-web-scraper

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (Windows)
venv\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the scraper
python main.py

Project Overview

Base Requirements:

1. Asynchronous Handling for Reduced Response Wait Time

The project takes a concurrent approach: the input URLs are collected in a list, and multiple data retrieval operations run simultaneously rather than one request at a time.
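
A minimal sketch of this pattern, assuming the blocking requests library is combined with asyncio via worker threads (asyncio.to_thread, Python 3.9+); the repository's actual Crawler implementation may differ:

import asyncio
import requests

async def fetch(url: str) -> str:
    # requests blocks, so run each call in a worker thread.
    response = await asyncio.to_thread(requests.get, url, timeout=10)
    return response.text

async def fetch_all(urls: list[str]) -> list[str]:
    # Launch every download at once and wait for all of them together.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(fetch_all(["https://www.n11.com/", "https://www.trendyol.com/"]))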

2. Failure Handling Mechanism for Unresponsive or Blocked Requests

crawler.py catches unresponsive, blocked, and otherwise erroneous requests, so a single failing URL does not abort the whole run.
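
One way such a mechanism can look, as a sketch with an assumed retry count and backoff (the actual logic in crawler.py may differ):

import asyncio
from typing import Optional

import requests

async def fetch_with_retries(url: str, retries: int = 3, backoff: float = 1.0) -> Optional[str]:
    for attempt in range(1, retries + 1):
        try:
            response = await asyncio.to_thread(requests.get, url, timeout=10)
            response.raise_for_status()  # treat 4xx/5xx (e.g. blocked requests) as errors
            return response.text
        except requests.RequestException:
            if attempt == retries:
                return None  # give up; the caller logs and skips this URL
            await asyncio.sleep(backoff * attempt)  # linear backoff before retrying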

3. Logging for Consistency

The project maintains a log file (logs/crawler.log) to record regular and consistent log information throughout the run.
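
A typical configuration for this with the standard logging module; the format string and level here are placeholders, not necessarily what the project uses:

import logging

logging.basicConfig(
    filename="logs/crawler.log",   # the log file listed in the project structure; logs/ must exist
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("crawler")
logger.info("Crawl started with %d URLs", 2)
logger.warning("Request to %s timed out, retrying", "https://www.n11.com/")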

4. Comprehensive Error Handling for Uninterrupted Runtime

Errors are handled gracefully, displaying error messages while allowing the project to continue running.

5. Data Export in Specified Formats

Data can be exported in JSON, XLSX, and CSV formats to ensure flexibility in data sharing and analysis.
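
A sketch of such an export step built on pandas; the function name and signature are assumptions, not the StorageExporter API:

import pandas as pd

def export(products: list[dict], fmt: str, path: str) -> None:
    # One DataFrame feeds all three writers; to_excel uses openpyxl under the hood.
    df = pd.DataFrame(products)
    if fmt == "json":
        df.to_json(path, orient="records", force_ascii=False)
    elif fmt == "csv":
        df.to_csv(path, index=False)
    elif fmt == "xlsx":
        df.to_excel(path, index=False)
    else:
        raise ValueError(f"Unsupported format: {fmt}")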

6. Support for Two Specified Retailers

Two distinct classes for N11 and Trendyol have been created, enabling data parsing according to the respective retailer's structure.
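
The pattern might look like the following sketch; the CSS selectors are placeholders, since each retailer's real markup has to be inspected:

from bs4 import BeautifulSoup

class URLParser:
    """Base class; each retailer subclass knows its own HTML structure."""
    def parse(self, html: str) -> dict:
        raise NotImplementedError

class N11Parser(URLParser):
    def parse(self, html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select_one("h1.product-title")   # placeholder selector
        price = soup.select_one("span.price")         # placeholder selector
        return {
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        }

class TrendyolParser(URLParser):
    def parse(self, html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select_one("h1.product-name")    # placeholder selector
        price = soup.select_one("span.product-price") # placeholder selector
        return {
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        }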

7. Support for Two Specified File Formats

Export operations support JSON, XLSX, and CSV, providing versatility in exporting data files.

Bonus Requirements:

1. Email Notification Module upon Completion

The email_sender.py module sends a notification email to the configured recipient address once the data retrieval process is completed (see the sketch after the next requirement).

2. Attach Exported Files to Notification Email

Exported files are attached to the notification email using the email_sender.py module.
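
Together, requirements 1 and 2 might be covered by a sketch like this one, built on smtplib and the standard email package; the host, port, addresses, and credentials are placeholders:

import smtplib
from email.message import EmailMessage

def send_notification(subject: str, body: str, attachments: tuple[str, ...] = ()) -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "scraper@example.com"       # placeholder sender
    msg["To"] = "recipient@example.com"       # placeholder recipient
    msg.set_content(body)
    for path in attachments:
        with open(path, "rb") as f:
            # Generic MIME type keeps the sketch format-agnostic (json/csv/xlsx).
            msg.add_attachment(f.read(), maintype="application",
                               subtype="octet-stream", filename=path)
    with smtplib.SMTP("smtp.example.com", 587) as server:  # placeholder host/port
        server.starttls()
        server.login("scraper@example.com", "app-password")  # placeholder credentials
        server.send_message(msg)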

3. Beautifully Formatted Table in Email Body

The send_email method in email_sender.py has been updated, and a new method _create_html_table has been added to present collected data in a well-formatted HTML table within the email body.
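
One way to build such a helper with the tabulate library already listed above; the method name _create_html_table comes from the README, but the body is a sketch:

from tabulate import tabulate

def _create_html_table(products: list[dict]) -> str:
    # tabulate renders straight to an HTML <table>; styling is up to the caller.
    rows = [[p.get("title"), p.get("price"), p.get("url")] for p in products]
    return tabulate(rows, headers=["Title", "Price", "URL"], tablefmt="html")

# The result can be embedded into the message from the previous sketch with
# msg.add_alternative(_create_html_table(products), subtype="html").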

4. Support for More Than Two Specified Retailers

The parser design accommodates more than the two required retailers: each retailer gets its own class, so additional sites can be supported by adding a class that parses their structure.

5. Support for More Than Two Specified File Formats

Data export operations support JSON, CSV, and XLSX, providing a wider range of choices for file formats.

6. Validating That Input URLs Are in the Correct Format

Prior to the web crawling process, the program includes a validation step to confirm that input URLs adhere to the standard HTTP or HTTPS format, preventing potential errors during execution.
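
A sketch of such a check with the re module; the pattern below is a common lightweight one, not a full RFC 3986 validator:

import re

URL_PATTERN = re.compile(r"^https?://[^\s/$.?#][^\s]*$", re.IGNORECASE)

def is_valid_url(url: str) -> bool:
    return bool(URL_PATTERN.match(url))

raw_urls = ["https://www.n11.com/", "not-a-url"]
valid = [u for u in raw_urls if is_valid_url(u)]  # keeps only the well-formed entry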


Evaluation

  1. Architectural Decisions:

    • The project is organized into modules, including Crawler, URLParser, StorageExporter, and EmailSender. Each module has distinct responsibilities, promoting code readability and maintainability.
  2. Pythonic Usages:

    • Leveraging Python's features and standard libraries, the code follows Pythonic conventions. Asynchronous programming is implemented using the asyncio module.
  3. Performance:

    • Asynchronous programming enhances performance by allowing the scraper to download and process web pages concurrently.
  4. Manageability and Configurability:

    • The code is designed to be extendable and configurable. Different URLParser classes can be added to extract data from various e-commerce websites. The Crawler class handles page downloading and processing, and the StorageExporter classes manage data export operations.
  5. Asynchronous Programming Best Practices:

    • The asyncio module is used for asynchronous programming, and await expressions are appropriately utilized within asynchronous functions.
  6. Alignment with Requirements:

    • The project meets the specified requirements, extracting product information from different e-commerce websites asynchronously.
  7. Technical Documentation:

    • The code includes comments and docstrings providing technical documentation. Each module and class has explanations for methods and functionalities, enhancing code understandability.


