jordan9675 / mydramascraper

Scrapy project and spiders designed to scrape information on MyDramaList

License: MIT License

Python 100.00%
webscraping scrapy spiders drama mydramalist

mydramascraper's Introduction

MyDramaScraper

This repository provides a classic Scrapy project with a spider named dramalist, which scrapes information about the dramas listed on MyDramaList.

It can also scrape the list of dramas a user has completed, along with the ratings that user gave them.

Quick start

To run the spiders, first install the dependencies:

pip install -r requirements.txt

Then navigate to the Scrapy project directory, dramascraper, which is located in the repository root. From there, you can scrape MyDramaList's list of dramas by running:

scrapy crawl dramalist

To scrape data about users' drama lists, run:

scrapy crawl userdramalist -a "<user1>,<user2>...<userN>"

Note: The list of users must be comma-separated and enclosed in double quotes. A single user can also be passed.
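As a minimal sketch of how such a comma-separated value could be turned into a list of usernames, consider the helper below. The function name parse_users is hypothetical and is not taken from the project; the spider may handle its argument differently.

```python
def parse_users(raw: str) -> list[str]:
    """Split a comma-separated string of usernames into a clean list."""
    # Strip surrounding whitespace and drop empty entries (e.g. a trailing comma).
    return [name.strip() for name in raw.split(",") if name.strip()]


print(parse_users("alice,bob, carol"))  # ['alice', 'bob', 'carol']
print(parse_users("alice"))             # ['alice']
```

A single username is simply the one-element case, which matches the note above.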

Data types

Drama information

| Key | Type | Description |
|---|---|---|
| name | String | Name of the drama |
| synopsis | String | Synopsis of the drama |
| duration_in_minutes | Integer | Duration of each episode (in minutes) |
| nb_episodes | Integer | Number of episodes |
| country_origin | String | Drama's country of origin |
| ratings | Float | Rating given by MyDramaList's users |
| ranking | Integer | Ranking based on the ratings |
| popularity_rank | Integer | Ranking based on the drama's popularity |
| nb_watchers | Integer | Number of MyDramaList users currently watching the drama |
| nb_ratings | Integer | Number of users who have rated the drama |
| nb_reviews | Integer | Number of reviews written by MyDramaList users |
| streamed_on | List | List of platforms on which the drama is streamed |
| genres | List | List of genres associated with the drama |
| tags | List | List of tags associated with the drama |
| mydramalist_url | String | URL of the drama on MyDramaList |
| director | List | List of the drama's director(s) |
| screenwriter | List | List of the drama's screenwriter(s) |
| main_roles | List | List of the actors with a main role in the drama |
| support_roles | List | List of the actors with a support role in the drama |
| guest_roles | List | List of the actors with a guest role in the drama |
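For readers who consume the scraped records in Python, the keys and types listed above could be mirrored as a typed dictionary. This is only a sketch: DramaRecord and missing_keys are hypothetical names, and the project itself may define its items differently (for example as Scrapy items).

```python
from typing import TypedDict


class DramaRecord(TypedDict):
    """Keys and types as documented in the drama information table."""
    name: str
    synopsis: str
    duration_in_minutes: int
    nb_episodes: int
    country_origin: str
    ratings: float
    ranking: int
    popularity_rank: int
    nb_watchers: int
    nb_ratings: int
    nb_reviews: int
    streamed_on: list
    genres: list
    tags: list
    mydramalist_url: str
    director: list
    screenwriter: list
    main_roles: list
    support_roles: list
    guest_roles: list


def missing_keys(record: dict) -> list[str]:
    """Return the documented keys that are absent from a scraped record."""
    return [key for key in DramaRecord.__annotations__ if key not in record]


# Placeholder values for illustration only, not real scraped data.
print(missing_keys({"name": "Example Drama", "nb_episodes": 16}))
```

A check like this can be handy before inserting records into a database, since it flags incomplete items early.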

User's drama list

| Key | Type | Description |
|---|---|---|
| title | String | Title of the drama |
| user | String | User's username |
| score | Integer | Rating given by the user to the drama |

Insert in MySQL

This feature is only available when using the spider designed to scrape information about dramas.

This Scrapy project comes with a pipeline that can insert the results into a MySQL database.

How to enable the pipeline?

The pipeline is enabled by default, but it won't insert anything unless it is asked to.

Loading it adds a small delay when the spider initializes, so if you don't need it, I recommend disabling it by emptying ITEM_PIPELINES in the settings.py file of the Scrapy project. The setting should then look like this:

ITEM_PIPELINES = {}

Where to provide the credentials?

The database credentials are currently provided to the pipeline as environment variables through a hidden .env file, which is loaded using the dotenv Python package.

Here is an example of what the file looks like:

HOST=xxxxxxxxxxxxx
DB=xxxxxxxxxxxxx
USERNAME=xxxxxxxxxxxxx
PASSWORD=xxxxxxxxxxxxx

The names of the environment variables can be changed in this file, but remember to change them as well in the db_connection method of the InsertItem pipeline.

For simplicity, you could also hardcode your credentials in the db_connection method, but this is not recommended.
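To illustrate the idea, here is a minimal sketch of reading the four variables from the environment and failing early when one is missing. The helper read_db_credentials is hypothetical and is not the project's db_connection method; it takes a mapping so it can be tried without a real .env file.

```python
import os
from typing import Mapping

# Variable names taken from the example .env file shown earlier.
REQUIRED_VARS = ("HOST", "DB", "USERNAME", "PASSWORD")


def read_db_credentials(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the database credentials, raising if any variable is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# In the project, dotenv would load the .env file into os.environ first;
# a plain dict with placeholder values stands in for it here.
creds = read_db_credentials(
    {"HOST": "localhost", "DB": "dramas", "USERNAME": "scraper", "PASSWORD": "secret"}
)
print(creds["HOST"])  # localhost
```

One thing to watch for: some systems (Windows in particular) already define a USERNAME environment variable, and dotenv does not override existing variables by default, so a more specific name may be safer if the connection picks up the wrong user.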

How to ask the pipeline to insert the records?

Even though the pipeline is enabled by default, it won't insert your records unless you ask it to by passing an argument when launching the spider.

To do so, you could run something like:

scrapy crawl dramalist -a sql=True

Note: The value of the sql argument is case-insensitive.

What columns should I create in my DB?
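Arguments passed with -a reach a Scrapy spider as strings, so a case-insensitive flag is typically handled by normalizing the text before comparing it. The sketch below shows one way this could be done; sql_enabled is a hypothetical helper, and the project's own check may differ.

```python
def sql_enabled(value) -> bool:
    """Interpret a spider argument such as sql=True case-insensitively."""
    # Command-line values arrive as text, so compare strings rather than booleans.
    return str(value).strip().lower() == "true"


print(sql_enabled("True"))   # True
print(sql_enabled("tRuE"))   # True
print(sql_enabled("False"))  # False
```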

Currently, the insertion relies on hardcoded, arbitrarily chosen column names.

They can be found in the query variable (which holds the SQL query) within the process_item method of the InsertItem pipeline.

Feel free to rename them and to process the data in whatever way suits your needs. However, don't forget to assign the VARCHAR or JSON type to the columns that store data of type list.

If you change the predetermined column names, make sure their order still matches the order of the keys in the item returned by the Scrapy spider.

The insertion is based on values wrapped in a tuple. The order of the values in the tuple must match the order of the column names they correspond to.

For example, name is listed as the first column to be filled, so its value is the first element of the tuple.

When changing the names of the columns or their order, remember to change the order of the values in the tuple accordingly.
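The column/tuple correspondence described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual query: the table name dramas, the three-column subset, and the build_insert helper are all hypothetical. Deriving both the column list and the value tuple from a single ordered sequence keeps them aligned, and list values are serialized to JSON text so they fit a VARCHAR or JSON column.

```python
import json

# Hypothetical subset of columns; the real names live in the pipeline's query variable.
COLUMNS = ("name", "nb_episodes", "genres")


def build_insert(table: str, item: dict) -> tuple[str, tuple]:
    """Build a parameterized INSERT and its value tuple in matching order."""
    column_list = ", ".join(COLUMNS)
    # %s is the placeholder style used by common MySQL drivers for Python.
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    query = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    # Values follow COLUMNS, not the item's own key order; lists become JSON text.
    values = tuple(
        json.dumps(item[col]) if isinstance(item[col], list) else item[col]
        for col in COLUMNS
    )
    return query, values


query, values = build_insert(
    "dramas",
    {"genres": ["Romance", "Comedy"], "name": "Example Drama", "nb_episodes": 16},
)
print(query)   # INSERT INTO dramas (name, nb_episodes, genres) VALUES (%s, %s, %s)
print(values)  # ('Example Drama', 16, '["Romance", "Comedy"]')
```

The resulting pair would then be passed to a database cursor's execute method, which substitutes the values safely instead of formatting them into the SQL string.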

Motivation

This project is a building block for another upcoming project: we are scraping information from MyDramaList so that we can later build a drama recommendation system based on each user's taste in dramas.

We are mainly interested in retrieving information that could be relevant to such a system, which explains why we decided not to retrieve some types of information from MyDramaList.

To do

  • Add a requirements.txt file
  • Add docstrings
  • Implement some user arguments :
    • Maximum pages to scrape
    • Minimum rating
    • Date range
  • Scrape information located in the statistics tab of the drama's page
  • Create a spider to retrieve information about the actors
  • Implement a pipeline to allow insertion into MySQL databases

Contact

If there is any missing information that you would be interested in retrieving through the spiders, please send me an e-mail at [email protected].

mydramascraper's Issues

Is this project open source?

Hi there, as there is currently no license on this repo, the project is not open source? Thus I wanted to ask if the project is open source, and if so if you could add a license to it?
