You are given bhp.csv, which contains property prices in the city of Bangalore, India. Examine the price_per_sqft column and do the following:
(1) Remove outliers using IQR
(2) After removing outliers in step 1, you get a new dataframe.
(3) Use a z-score threshold of 3 to remove outliers. This serves the same purpose as the IQR method, but the two methods do not necessarily remove the same rows.
(4) For the data set height_weight.csv, find the following:
(i) Using IQR, detect weight outliers and print them
(ii) Using IQR, detect height outliers and print them
To detect and remove the outliers in the given data sets and save the final data.
Step 1 Import the required packages (pandas, numpy, scipy, seaborn)
Step 2 Read the given csv file
Step 3 Convert the file into a dataframe and get information of the data.
Step 4 Remove the non numerical data columns using drop() method.
Step 5 Detect the outliers in the data set using z scores method.
Step 6 Remove the outliers using the z-scores and list manipulation, or using the Interquartile Range (IQR)
Step 7 Check that the outliers are removed from the data set using graphical methods.
Step 8 Save the final data set into the file.
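The steps above can be sketched end to end on a small in-memory table. This is only an illustration under stated assumptions: the column names label and value, the planted value 500, and the temporary output path are all made up and stand in for the real CSV files.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the CSV file: 20 ordinary values and one extreme one.
df = pd.DataFrame({
    "label": ["row%d" % i for i in range(21)],  # non-numeric column
    "value": list(range(50, 70)) + [500],       # 500 is the planted outlier
})

# Step 4: drop the non-numeric columns.
non_numeric = df.select_dtypes(exclude="number").columns
numeric = df.drop(columns=non_numeric)

# Step 5: absolute z-score of every value, computed column by column.
z = np.abs(stats.zscore(numeric.to_numpy(dtype=float), axis=0))

# Step 6: keep only the rows whose z-score is below 3 in every column.
cleaned = numeric[(z < 3).all(axis=1)]

# Step 8: save the cleaned data (a temporary file here; any path would do).
out_path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")
cleaned.to_csv(out_path, index=False)

print(len(df), "rows before,", len(cleaned), "rows after")
```

Step 7 (the graphical check) is left out of the sketch; a box plot of cleaned before and after filtering would serve that purpose.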
FOR BHP.CSV FILE
import pandas as pd
df=pd.read_csv("/bhp.csv")
df
df.info()
df.shape
import seaborn as sns
sns.boxplot(x="price_per_sqft",data=df)
Q1=df['price_per_sqft'].quantile(0.25)
Q3=df['price_per_sqft'].quantile(0.75)
IQR=Q3-Q1
lower=Q1-1.5*IQR
upper=Q3+1.5*IQR
newdata=df[(df['price_per_sqft']>=lower) & (df['price_per_sqft']<=upper)]
print(newdata)
# Rows outside the IQR fences are the outliers
outliers=df[(df['price_per_sqft']<lower) | (df['price_per_sqft']>upper)]
print(outliers)
newdata.shape
sns.boxplot(x="price_per_sqft",data=newdata)
import numpy as np
from scipy import stats
z_score=np.abs(stats.zscore(df['price_per_sqft']))
newdata2=df[(z_score<3)]
print(newdata2)
outlier2=df[(z_score>=3)]
print(outlier2)
newdata2.shape
sns.boxplot(x="price_per_sqft",data=newdata2)
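The IQR filter and the z-score filter used above need not remove the same rows. The following sketch shows this on made-up numbers (the column name value and the data are assumptions, not taken from bhp.csv): the value 82 lies above the upper IQR fence, yet its z-score stays below 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 20 evenly spread values plus one moderately large value.
toy = pd.DataFrame({"value": list(range(50, 70)) + [82]})

# IQR method: keep the rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_kept = toy[(toy["value"] >= lower) & (toy["value"] <= upper)]

# z-score method: keep the rows whose absolute z-score is below 3.
z = np.abs(stats.zscore(toy["value"]))
z_kept = toy[z < 3]

print("rows kept by IQR:    ", len(iqr_kept))
print("rows kept by z-score:", len(z_kept))
```

On this data the IQR method drops the row with 82 while the z-score method keeps every row, so it is worth comparing the shapes of both results rather than assuming they match.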
FOR HEIGHT_WEIGHT.CSV FILE
import pandas as pd
df=pd.read_csv("/height_weight.csv")
df
df.info()
df.shape
df.describe()
import seaborn as sns
sns.boxplot(x="height",data=df)
Q1=df['height'].quantile(0.25)
Q3=df['height'].quantile(0.75)
IQR=Q3-Q1
lower=Q1-1.5*IQR
upper=Q3+1.5*IQR
# Height outliers: rows outside the IQR fences
outliers=df[(df['height']<lower) | (df['height']>upper)]
print(outliers)
# Data with the height outliers removed
newdata=df[(df['height']>=lower) & (df['height']<=upper)]
print(newdata)
sns.boxplot(x='height',data=newdata)
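The same IQR steps can be applied to the weight column for part (i). The sketch below assumes the file has a column named weight, which the listing above does not confirm, and runs on a small made-up table in place of height_weight.csv. The helper iqr_outliers is a hypothetical name introduced only for this illustration.

```python
import pandas as pd


def iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose value in column lies outside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]


# Hypothetical stand-in for height_weight.csv; the column names are assumed.
sample = pd.DataFrame({
    "height": list(range(150, 172)),
    "weight": [20] + list(range(55, 75)) + [150],  # 20 and 150 are planted
})

weight_outliers = iqr_outliers(sample, "weight")
height_outliers = iqr_outliers(sample, "height")
print(weight_outliers)
print(height_outliers)
```

With the real file, passing the dataframe read from height_weight.csv in place of sample would print the weight and height outliers, provided the column names match.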
Thus, the outliers in the given data sets are detected and removed, and the final data set is saved to a file.