🛒 Big Mart Sales Data Preprocessing & EDA 🛒

using SAS Studio 🖥

📃 Table of Contents:

About Project
Objectives
Data Set Description
Initial Data Exploration
Data Pre Processing
EDA
Hypothesis
Feature Engineering
New Metadata

🖋 About Project

👉 Dataset is taken from Kaggle.
👉 In this project, will perform:

Initial data exploration
Data Preprocessing
EDA
Hypothesis testing (statistical and non-statistical)
Feature engineering (one-hot encoding, label encoding, and binning)

📌 Objectives

Perform initial data exploration
Perform data preprocessing
Perform EDA
Perform hypothesis testing (statistical and non-statistical)
Perform feature engineering (one-hot encoding, label encoding, and binning)

🧾 Data Set Description

👉 There are 12 variables in this data sets:

5 categorical variables,
5 continuous variables,
1 variable to accommodate the Item identifier (ID), and
1 variable to accommodate the Outlet identifier (ID).

👉 The variables available in this dataset are:

Column Name	Data Type	Type	Description
Item_Identifier	char	Nominal	Product ID
Item_Weight	num	Ratio	Weight of product
Item_Fat_Content	char	Nominal	Content of product (low fat or regular)
Item_Visibility	num	Nominal	The percentage of all products in the store that are assigned to a specific product in the total display area
Item_Type	char	Nominal	Category of product
Item_MRP	num	Ratio	Maximum retail price of a product
Outlet_Identifier	char	Nominal	Store ID
Outlet_Establishment_Year	num	Nominal	Year the store established
Outlet_Size	char	Ordinal	Size of the store
Outlet_Location_Type	char	Ordinal	The type of city where the store is located
Outlet_Type	char	Ordinal	Type of the store
Item_Outlet_Sales	num	Nominal	Sales of product

📊 Initial Data Exploration

🏛 Data Sets Summary:

🔠 Character Column Exploration:

Item_Identifier
- It can be seen that there are 1559 different item IDs in the “Item_Identifier” column.
- Besides, there are “Item_Identifier” values that start with the same specific character, such as “DR”, “FD”, and “NC”, followed by numbers at the end of a value.
Item_Fat_Content
- It can be seen that most of the products are categorized as “Low Fat” products with a percentage of 59.71%.
- It can be seen that there are inconsistent values have the same meaning but in different values, such as “LF”, “low fat”, and “reg”.
- These variables will be preprocessed in the next section.
Item_Type
- There are 16 item types in the dataset, with “Fruits and vegetables” became the item type with the highest number, with a percentage of 14.46%.
- However, the lowest number item type is “Seafood”, which only has a percentage of 0.75%.
Outlet_Identifier
- It can be seen there are ten outlets in the dataset, with all the outlets percentage have almost the same number of outlets (about 10%).
- However, the percentage of outlets “OUT10” and “OUT19” have the least number of outlets, only 6%.
Outlet_Size
- There are 3 types of outlet size, and “Medium” size becomes the size of outlet with the most number in the dataset with a percentage of 45.69%.
- The smallest number outlet size is the “High” size which only has the percentage of 15.25%.
Outlet_Location_Type
- There are 3 levels of outlet, and “Tier 3” becomes the tier of outlet with the most number in the dataset with a percentage of 39.31%.
- The smallest tier of outlet is the “Tier 1”, which only have a percentage of 28.02%.
Outlet_Type
- It can be seen there are four outlet types in this dataset, and “Supermarket Type 1” become the type of outlet with the most number in the dataset with a percentage of 65.43%.
- The smallest number of outlet types are “Supermarket Type 2” with only a percentage of 10.89% and “Supermarket Type 3” with only 10.97%.

🔢 Continuous Column Exploration:

Item_Weight
- It can be seen that the distribution is normal, which means no outliers in this column.
Item_Visibility
- This figure shows that the distribution is right-skewed distributions, which has a long right tail, and the mean position is on the right side of the data.
- It can be seen that there are outliers in this column and the minimum value in this column is 0.
- These outliers will be pre processed in the next section.
Item_MRP
- It can be seen that the distribution is normal (no outliers detected).
Outlet_Establishment_Year
- It can be seen that the distribution is normal, which means no outliers in this column.
Item_Outlet_Sales
- It can be seen that the distribution is right-skewed distributions, which has a long right tail, and the mean is usually to the right of the median of the data.
- It can be seen that there are outliers value in this column.
- These outliers will be pre processed in the next section.

❓ Missing Values:

👉 It can be seen that there are missing values in the “Item_Weight” column (1463 missing values) and the “Outlet_Size” column (2410 missing values).
👉 These missing values will be pre processed in the next section.

⚙ Data Pre-processing

🧹 Handling Dirty Data:

👉 The previous section already mentioned inconsistent values in the “Item_Fat_Content” column, which have the same meaning but in different forms.

👉 It can be seen now that the inconsistent values have been replaced into the same values, such as “LF” to “Low Fat”, “reg” to “Regular”, etc.

❓ Handling Missing Values:

Outlet_Size

👉 It can be seen that the “Small” outlet size belongs to “Grocer Store” and “Supermarket Type1” outlet type.
👉 However, if the total “Outlet_Size” outlets are added up, the “Medium” size outlet is much bigger than the “Small” size outlets.
👉 In this case, it was decided to fill in the missing data by following the previous data (last value replacement method) to reduce the biased of analysis.

👉 Based on "Outlet_Identifier" column, it can be seen that “OUT010”, “OUT017”, and “OUT045” outlets have missing values in the “Outlet_Size” column.

👉 The table shows the comparison between before and after imputing missing values in the “Outlet_Size” column.
👉 It can be seen that the missing values have disappeared, and outlet size for “OUT010”, “OUT017”, and “OUT045” already imputed.

Item_Weight

👉 It can be seen that there are some previous “Item_Weight” values in the dataset based on the “Item_Identifier” column.
👉 In this case, it was decided to fill these missing values by following the previous data (last value replacement method) to reduce the biased of analysis that will be performed in the next section.

👉 There are still missing values in the “Item_Weight” column.
👉 This is because the item ID does not have the value in the previous record.
👉 The values will be filled with the “Item_Weight” column's mean value.

👉 The table shows the comparison between before and after imputing missing values in the “Item_Weight” column.
👉 It can be seen that the missing values have disappeared.

🔧 Handling Outliers:

👉 From previous section, it can be seen that "Item_Visibility" and "Item_Outlet_Sales" have outliers.
👉 This section will transform the column values using log and square root transformation, then compare which one is the best transformation for this case.

👉 It can be seen that the square root transformation is better than the log transformation for these two columns since the frequency distribution of the square root transformation is closer to the normal distribution even though there are still outliers in both columns.

📈 EDA

🧮 Pearson Correlation:

👉 It can be seen there is a high correlation between the “Item_MRP” and the square root transformation of “Item_Outlet_Sales”, with a coefficient of 0.56343.
👉 there is also a weak correlation between “Item_MRP” with “Item_Weight”, with a coefficient of 0.02597.

🥧 Item Fat Content based on Item Type:

👉 As can be seen, the “Household” item becomes the item type with the most number in the “Low Fat” category with a percentage of 16.49%. 👉 However, the item type with the smallest number in the “Low Fat” category is the “Baking Good” items with a percentage of 5.96%.
👉 In addition, “Fruits and Vegetables” is the most number in the “Regular” category with a percentage of 20.03%.
👉 However, “Meat” is the item with the smallest number in the “Regular” category with a percentage of 8.48%.

📉 Outlet Location Type based on Outlet Type:

👉 In general, “Supermarket Type1” is an outlet type owned only by the “Tier 2” outlet locations.
👉 In addition, the “Supermarket Type1” outlet type has the most “Tier 1” outlet locations, with a percentage of 21.8% compared to the “Grocery Store” outlet type with a percentage of 6.2%.
👉 The total number of “Tier 3” outlet locations is almost the same for each outlet type, approximately 11%.

💹 Outlet Identifier based on Outlet Size:

👉 Most outlets have more “Small” size-types than the other size types, and the number each outlet has is approximately the same.
👉 Other than that, outlets “OUT046”, “OUT045”, “OUT035”, “OUT019”, “OUT017”, and “OUT010” only have the “Small” size type.
👉 However, the “OUT019” and “OUT010” outlets have fewer outlets than the other outlets with a similar type size, around 500 outlets.
👉 “OUT049”, “OUT027”, and “OUT018” only have the size type “Medium,” and “OUT013” only have the size type “High”.

📊 Outlet Identifier based on Outlet Type:

👉 In general, the data for “Supermarket Type1” has symmetrical data (the median is in the middle of the box plot), which means that the data is normally distributed.
👉 The larger the outlet type, the more items are purchased at the outlet.

🧪 Hypothesis

1️⃣ Hypothesis 1:

H0: Outlet_Establishment_Year is normally distributed.
H1: Outlet_Establishment_Year is not normally distributed.

▶ Conclusion: H0 rejected
👉 From Kolmogorov-Smirnov test, it can be concluded that H0 rejected since the D value is < 1 and the p-value < 0.05.

2️⃣ Hypothesis 2:

H0: There is no heteroscedasticity between “Item_MRP” and “Item_Weight”.
H1: There is heteroscedasticity between “Item_MRP” and “Item_Weight”.

▶ Conclusion: H0 accepted
👉 It can be seen that there is no specific pattern for irregular points spread above and below the 0 axes on the Y-axis.
👉 It can be concluded that there is no heteroscedasticity between “Item_MRP” and “Item_Weight”.

3️⃣ Hypothesis 3:

H0: Tier 3 has the most numbers in “High” size outlets.
H1: Tier 3 does not have the most numbers in “High” size outlets.

▶ Conclusion: H0 rejected
👉 As can be seen, “Tier 3” has the most numbers in “Medium” size outlets with 1863 outlets, followed by “High” size outlets with 932 outlets.

4️⃣ Hypothesis 4:

H0: There was a decrease in the number of outlets from 1985 to 1998.
H1: There was an increase in the number of outlets from 1985 to 1998.

▶ Conclusion: H0 accepted
👉 It can be seen that there was a decrease in the number of outlets from 1463 outlets in the year 1985 to 555 outlets in the year 1998.

5️⃣ Hypothesis 5:

H0: The number of “Supermarket Type1” reached its peak in 1987.
H1: The number of “Supermarket Type1” not reached its peak in 1987.

▶ Conclusion: H0 accepted
👉 It can be seen that the number of “Supermarket Type1” reached its peak in 1987 with 932 outlets.
👉 However, in 2007, the number of “Supermarket Type1” only 926 outlets.

⚒ Feature Engineering

🧺 Binning

Item_Category
- It can be seen that for item ID in the "Item_Identifier" column that starts with character "DR" will be classified as "Drink, "FD" will be classified as "Food", and "NC" will be classified as "Non-Consumable".
Age_Outlet
- It can be seen that the age of an outlet is derived from the reduction in the year the outlet was established by 2021.
Outlet_Category
- It can be seen that the age of outlets is classified into three categories: "New Outlet" for outlets aged 1-10 years, "Moderate Outlet" for outlets aged 11-20 years, and "Old Outlet" for those over 20 years old

🔖 Label Encoding

👉 This section will perform label encoding for "Item_Fat_Content" column.
👉 The label encoding will be described as follows:

Column Name	Level	Encoded Label
Item_Fat_Content	Low Fat	0
Item_Fat_Content	Regular	1

🔥 One-Hot Encoding

Item_Category
- For "Drink" category will be included in group 1, "Food" will be included in group 2, and "Non-Consumable" will be included in group 3.
Outlet_Category
- For "New Outlet" category will be included in group 1, "Moderate Outlet" will be included in group 2, and "Old Outlet" will be included in group 3.
Outlet_Size
- For "Small" category will be included in group 1, "Medium" will be included in group 2, and "High" will be included in group 3.
Outlet_Location_Type
- For "Tier 1" category will be included in group 1, "Tier 2" will be included in group 2, and "Tier 3" will be included in group 3.
Outlet_Type
- For "Grocery Store" category will be included in group 1, "Supermarket Type1" will be included in group 2, "Supermarket Type2" will be included in group 3, and "Supermarket Type3" will be included in group 4.

📜 New Metadata

Column Name	Data Type	Description	Instances
Item_Identifier	Char	ProductID	{FDA15, DRC01, FDN15, …}
Item_Category	Char	Category of a product (Drink, Food, Non-Consumable)	{Food, Drink, Food, …}
Item_Category_GROUP_1	Num	Encoded Item_Category 0≠ Drink 1= Drink	{0, 1, 0, …}
Item_Category_GROUP_2	Num	Encoded Item_Category 0≠ Food 1= Food	{1, 0, 1, …}
Item_Category_GROUP_3	Num	Encoded Item_Category 0≠ Non-Consumable 1= Non-Consumable	{0, 0, 0, …}
Item_Weight	Num	The product's weight	{9.3, 5.92, 17.5, …}
Item_Fat_Content	Char	Content of product (Low Fat or Regular)	{Low Fat, Regular, Low Fat, …}
Item_Fat_Content_Encoding	Num	Types of Product Content 0= Low Fat 1= Regular	{0,1, 0, …}
Item_Visibility	Num	The percentage of a store's overall display area that is dedicated to a single product.	{0.016047301,0.019278216, 0.016760075, …}
Item_Type	Char	Category of product (Soft Drinks, Dairy, Hard Drinks, Canned, Frozen Foods, Fruits and Vegetables, Snack Foods, Baking Goods, Starchy Foods, Meat, Seafood, Breakfast, Breads, Health and Hygiene, Household, Others)	{Dairy, Soft Drinks, Meat, …}
Item_MRP	Num	Maximum Retail Price of a product	{249.8092,48.2692, 141.618, …}
Outlet_Identifier	Char	Outlet ID	{OUT049,OUT018, OUT049, …}
Outlet_Establishment_Year	Num	Year the outlet established	{1999,2009, 1999, …}
Age_Outlet	Num	Age of an outlet based on "Outlet_Establishment_Year" and 2021	{22,12, 22, …}
Outlet_Category	Char	Category of an outlet based on age 0-10= New Outlet 11-20= Moderate Outlet >20= Old Outlet	{OldOutlet, Moderate Outlet, Old Outlet, …}
Outlet_Category_GROUP_1	Num	Encoded Outlet_Category 0≠ New Outlet 1= New Outlet	{0, 0, 0, …}
Outlet_Category_GROUP_2	Num	Encoded Outlet_Category 0≠ Moderate Outlet 1= Moderate Outlet	{0, 1, 0, …}
Outlet_Category_GROUP_3	Num	Encoded Outlet_Category 0≠ Old Outlet 1= Old Outlet	{1, 0, 1, …}
Outlet_Size	Char	Size of the store (Small, Medium, High)	{Medium, Medium, Medium, …}
Outlet_Size_GROUP_1	Num	EncodedOutlet_Size 0≠ Small 1= Small	{0, 0, 0, …}
Outlet_Size_GROUP_2	Num	EncodedOutlet_Size 0≠ Medium 1= Medium	{1, 1, 1, …}
Outlet_Size_GROUP_3	Num	EncodedOutlet_Size 0≠ High 1= High	{0, 0, 0, …}
Outlet_Location_Type	Char	Thetype of city where the store is located(Tier1, Tier 2, Tier 3)	{Tier1, Tier 3, Tier 1, …}
Outlet_Location_Type_GROUP_1	Num	Encoded Outlet_Location_Type 0≠ Tier 1 1= Tier 1	{1, 0, 1, …}
Outlet_Location_Type_GROUP_2	Num	Encoded Outlet_Location_Type 0≠ Tier 2 1= Tier 2	{0, 0, 0, …}
Outlet_Location_Type_GROUP_3	Num	Encoded Outlet_Location_Type 0≠ Tier 3 1= Tier 3	{0, 1, 0, …}
Outlet_Type	Char	Type of the store (Grocery Store, Supermarket Type1, Supermarket Type2, Supermarket Type3)	{Supermarket Type1, Supermarket Type2, Supermarket Type1, …}
Outlet_Type_GROUP_1	Num	EncodedOutlet_Type 0≠ Grocery Store 1= Grocery Store	{0, 0, 0, …}
Outlet_Type_GROUP_2	Num	Encoded Outlet_Type 0≠ Supermarket Type1 1= Supermarket Type1	{1, 0, 1, …}
Outlet_Type_GROUP_3	Num	EncodedOutlet_Type 0≠ Supermarket Type2 1= Supermarket Type2	{0, 1, 0, …}
Outlet_Type_GROUP_4	Num	Encoded Outlet_Type 0≠ Supermarket Type3 1= Supermarket Type3	{0, 0, 0, …}
Item_Outlet_Sales	Num	Sales of product	{3735.138, 443.4228, 2097.27, …}

🙌 Support me!

👉 If you find this project useful, please ⭐ this repository 😆!

🎈 Check out my data pre-processing and feature engineering using Python on Kaggle here!

👉 More about myself: here

caesarmario / big-mart-sales-preprocessing-sas-studio Goto Github PK

big-mart-sales-preprocessing-sas-studio's Introduction