GithubHelp home page GithubHelp logo

cleansing-data's Introduction

Amazon Prime Titles Data Cleaning

This repository contains the code and steps for cleaning and preparing the Amazon Prime Titles dataset for analysis. The dataset provides information about movies and TV shows available on Amazon Prime, including details such as title, director, cast, release year, and more. The dataset is loaded from a CSV file, and various data cleaning and preprocessing steps are performed.

1. Python Libraries Import

The first step involves importing essential Python libraries, which facilitate the data cleaning and analysis processes.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

2. Load Data Set

The dataset is loaded by creating a pandas DataFrame from the CSV file 'amazon_prime_titles.csv.'

df = pd.read_csv('amazon_prime_titles.csv')

Displays the data set that has been loaded image

3. Check Data Condition

An overview of the dataset is provided, including details about its dimensions and data types.

df.shape

image From the output, the dataset consists of 9,668 rows and 12 columns. It provides valuable information about the Amazon Prime Titles dataset.

df.info()

Provide detailed information about the dataset image

4. Check Percentage Missing Values

Calculate and display the percentage of missing values for each column in the dataset. Develop strategies for handling missing values based on the data types and their relevance to the analysis.

missing_values = df.isnull().sum()

missing_percentages = (missing_values / len(df)) * 100

missing_info = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Values': missing_values.values,
    'Missing Percentage': missing_percentages.values
})

print(missing_info)

image Insights and Considerations

  • Data Completeness: Columns like 'director,' 'cast,' 'country,' 'date_added,' 'rating,' and others have missing values that need attention.
  • Data Types: The 'release_year' column is correctly identified as an integer ('int64'). Other columns, such as 'date_added' and 'duration,' might need conversion to datetime or numeric types if they contain date or duration information.

5. Check and Remove Duplicate

Duplicate rows are identified and removed from the dataset if any are found.

df.duplicated().sum()

image As the output is 0, no further elimination is required. This confirms that there are no duplicates in the dataset.

6. Drop Columns

Columns 'date_added' and 'country' are dropped due to their high percentage of missing data (more than 50%) and limited relevance to the analysis.

df.drop(['date_added','country'], axis = 1, inplace=True)
df.columns

image Show dataset info to see if the date and country columns are missing. Check the others columns missing values.

df.info()

image The output shows the continued presence of null values within columns such as 'director', 'cast', 'release_year' and 'rating'.

7. Enhance Data

In this section, the dataset is enhanced to improve its quality and usability. This includes renaming the 'listed_in' column, converting 'duration' values into a numerical format, and splitting the 'duration' column into 'duration_movie' and 'duration_season'. Additionally, the 'categories' column is divided into multiple columns to better represent the genres.

df.rename(columns={'listed_in' : 'categories'},inplace=True)
df.head(2)

image To solve the data type problems at duration and date added.

  • 'date_added' is removed from the dataset.
  • Attention is focused on the 'duration' columns. A new column named 'duration_num' is created in the DataFrame to store numerical values extracted from the 'duration' column.
df['duration_num'] = df['duration'].str.extract('(\d+)').astype(float)

The 'duration' column is then split into 'duration_movie' and 'duration_season'

df['duration_movie'] = df['duration'].str.extract('(\d+) min', expand=False)

df['duration_season'] = df['duration'].str.extract('(\d+) Seasons', expand=False)

df.drop('duration', axis=1, inplace=True)

df.head()

image To enhance the dataset further, the 'categories' column is divided into three separate columns to represent the different genres

df[['Genre1', 'Genre2', 'Genre3']] = df['categories'].str.split(',', expand=True, n=2)
df.head()

image After renaming the 'listed_in' column and extracting numerical values from 'duration', the dataset is more structured and suitable for analysis. It allows for better categorization of genres and simplifies the representation of content duration.

8. Handling Missing Values

This section is dedicated to addressing missing values in the dataset, focusing on non-categorical and non-numerical columns. The primary objective is to properly impute the missing data. The 'rating' column will be handled differently by assigning it the mode value. Missing values in other columns will be filled with 'unavailable' for non-categorical and non-numerical columns.

df.isnull().sum()

image The inspection reveals that there are missing values in the columns 'director', 'cast', 'rating', 'duration_movie', 'duration_season', 'Genre2', and 'Genre3'.

df['director'].fillna('Unavailable', inplace=True)
df['cast'].fillna('Unavailable', inplace=True)
df['duration_season'].fillna('Unavailable', inplace=True)
df['duration_movie'].fillna('Unavailable', inplace=True)
df['Genre1'].fillna('Unavailable', inplace=True)
df['Genre2'].fillna('Unavailable', inplace=True)
df['Genre3'].fillna('Unavailable', inplace=True)

df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
df

image Following the filling of missing values, the dataset is inspected again to confirm the absence of any remaining missing values

df.isnull().sum() 

image The output confirms that there are no missing values in the dataset, meaning that all columns have been appropriately handled. The dataset is now more complete and ready for further analysis and processing.

9. Check Outliers

Analyze the distribution of film duration for 'Movie' type films. The calculated lower and upper bounds can then be used to identify and manage outliers in the film's duration.

movie_df = df[df['type'] == 'Movie']

q1_movie = movie_df['duration_num'].quantile(0.25)
q3_movie = movie_df['duration_num'].quantile(0.75)

print("Quartil 1 (Q1) untuk Movie:", q1_movie)
print("Quartil 3 (Q3) untuk Movie:", q3_movie)

iqr=q3_movie-q1_movie
print('IQR : ',iqr)

lwr_bound = q1_movie - (1.5 * iqr) #lower bound
upr_bound = q3_movie + (1.5 * iqr) #upper bound

print('Lower Bound : ',lwr_bound)
print('Upper Bound: ',upr_bound)

image Arrange the rows of the DataFrame in descending order based on the values in the 'duration_num' column. In the context of movies, this operation would arrange the movies from the longest to the shortest duration.

movie_df.sort_values(by='duration_num', ascending=False)

image Filter and select rows from the sorted DataFrame movie_df_sorted where the 'duration_num' values fall outside a specific range.

movie_df_sorted = movie_df.sort_values(by='duration_num', ascending=False)
movie_df_sorted[(movie_df_sorted['duration_num'] > 152) | (movie_df_sorted['duration_num'] < 28)]

image

10. Save Clean Data in CSV

The cleaned dataset is saved as a CSV file, named "amazon_clean.csv."

df_clean = pd.DataFrame(df)
df_clean.to_csv("amazon_clean.csv")
df_clean

image

cleansing-data's People

Contributors

pritaaa avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.