huzaifarjkhan / nashvillehousing

Data Cleaning and ETL for Nashville Housing Real Estate

TSQL 100.00%
database datacleaning datamunging datawrangling etl

nashvillehousing's Introduction

NashVilleHousing_

This project required an intensive data cleaning process. The client expected the data to be cleaned so thoroughly that any future analysis could be performed on it easily and effectively. For this purpose I used my extensive data cleaning checklist. The data and information are shared here with the client's explicit approval.

The changes made include:

  1. Standardizing Data: Sale Date was converted from Date Time format to Date format

                  UPDATE PortfolioProject..NashvilleHousing 
                  SET    SaleDate = CAST(SaleDate AS Date) 
    
                  SELECT
                    *
                  FROM
                    PortfolioProject..NashvilleHousing
    
                  -- The UPDATE above only changes the stored values; it does not change the column's data type.
                  -- ALTER TABLE ... ALTER COLUMN is required to change the type of the column itself.
    
                  ALTER TABLE PortfolioProject..NashvilleHousing
                  ALTER COLUMN SaleDate Date
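
Before altering the column type, it can help to confirm that every stored value converts cleanly. The check below is a minimal sketch and not part of the original script; it assumes SQL Server 2012 or later, where `TRY_CONVERT` is available.

```sql
-- Sketch (not from the original project): list SaleDate values that
-- would fail conversion to DATE before running ALTER COLUMN.
-- TRY_CONVERT returns NULL instead of raising an error when a conversion fails.
SELECT
    SaleDate
FROM
    PortfolioProject..NashvilleHousing
WHERE
    SaleDate IS NOT NULL
    AND TRY_CONVERT(date, SaleDate) IS NULL;
```

An empty result suggests the `ALTER COLUMN` statement should succeed without conversion errors.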
    

  1. NULLs in Property Address were populated with the correct value using an advanced self-join

                  UPDATE a
                  SET a.PropertyAddress = COALESCE(a.PropertyAddress, b.PropertyAddress) -- can use ISNULL() as well
                  FROM PortfolioProject..NashvilleHousing a
                  JOIN PortfolioProject..NashvilleHousing b
                    ON a.ParcelID = b.ParcelID
                    AND a.[UniqueID ] <> b.[UniqueID ]
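
The self-join works because rows that share a ParcelID refer to the same property, so a row with a missing address can borrow it from another row for the same parcel. As a hedged sketch (not part of the original script), the rows that would be filled can be previewed before running the UPDATE:

```sql
-- Sketch: preview the rows the self-join would fill, before running the UPDATE.
SELECT
    a.ParcelID,
    a.PropertyAddress AS CurrentAddress,   -- NULL for the rows being fixed
    b.PropertyAddress AS AddressToCopy     -- value taken from another row of the same parcel
FROM
    PortfolioProject..NashvilleHousing a
    JOIN PortfolioProject..NashvilleHousing b
        ON a.ParcelID = b.ParcelID
        AND a.[UniqueID ] <> b.[UniqueID ]
WHERE
    a.PropertyAddress IS NULL;
```

Running the same query after the UPDATE should return no rows, unless every row for a given parcel lacks an address.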

  1. Splitting the Property and Owner Address fields into Address, City and State columns for easier analysis

                  -- Accidentally modified the PropertyAddress column without first extracting the city from it
                  UPDATE
                    PortfolioProject..NashvilleHousing
                  SET
                    PropertyAddress = SUBSTRING(PropertyAddress,1, CHARINDEX(',',PropertyAddress)-1) 
    
    
                  -- Re-uploaded the original dataset (OriginalNashvilleHousing) to extract PropertyCity from it
                  -- *NULLs in PropertyAddress were filled in that dataset using the method above
    
                  -- SEPARATING COLUMNS FOR Address and City
                  ALTER TABLE 
                      PortfolioProject..NashvilleHousing
                  ADD PropertyCity varchar(255)
    
    
                  -- Populating PropertyCity 
                  UPDATE 
                    a
                  SET 
                    a.PropertyCity = SUBSTRING(b.PropertyAddress, CHARINDEX(',',b.PropertyAddress)+1, LEN(b.PropertyAddress))
                  FROM
                    PortfolioProject..NashvilleHousing a
                    JOIN 
                    [PortfolioProject].[dbo].[OriginalNashvilleHousing] b
                    ON 
                      a.[UniqueID ] = b.[UniqueID ]
    
                  SELECT
                    *
                  FROM
                    PortfolioProject..NashvilleHousing
    
                  -- OwnerAddress splitting
                  -- PARSENAME splits a string on '.' (period) and returns the requested part, counting from the right,
                  -- so the commas are first replaced with periods
    
                  -- This could have been done with SUBSTRING, but the code would have been longer
                  ALTER TABLE
                    PortfolioProject..NashvilleHousing
                  ADD OwnerSplitAddress NVARCHAR(255), 
                    OwnerCity NVARCHAR(255), 
                    OwnerState NVARCHAR(255)
    
    
                  UPDATE
                    PortfolioProject..NashvilleHousing
                  SET
                    OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress,',','.'),3),
                    OwnerCity         = PARSENAME(REPLACE(OwnerAddress,',','.'),2),
                    OwnerState        = PARSENAME(REPLACE(OwnerAddress,',','.'),1)
    
                  ALTER TABLE 
                    PortfolioProject..NashvilleHousing
                  DROP COLUMN OwnerAddress
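
To make the `PARSENAME` trick concrete, here is a small sketch that runs it on a single literal. The address is made up for illustration and is not taken from the dataset, and the `LTRIM` calls are an addition that strips the space left after each comma; the original script stores the parts untrimmed.

```sql
-- Sketch with a made-up address literal (hypothetical, not from the dataset).
DECLARE @OwnerAddress NVARCHAR(255) = N'100 EXAMPLE ST, NASHVILLE, TN';

SELECT
    -- PARSENAME counts parts from the right: 1 = last part, 3 = first of three parts
    LTRIM(PARSENAME(REPLACE(@OwnerAddress, ',', '.'), 3)) AS OwnerSplitAddress,
    LTRIM(PARSENAME(REPLACE(@OwnerAddress, ',', '.'), 2)) AS OwnerCity,
    LTRIM(PARSENAME(REPLACE(@OwnerAddress, ',', '.'), 1)) AS OwnerState;
```

For this literal the three columns should come back as `100 EXAMPLE ST`, `NASHVILLE` and `TN`. One caveat of this approach: an address that already contains a period would be split at that period as well, so such rows are worth checking by hand.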
    

  1. Standardized the 'Yes', 'No', 'Y' and 'N' values in the 'Sold as Vacant' column to only 'Yes' and 'No'

                  UPDATE
                    PortfolioProject..NashvilleHousing
                  SET
                    SoldAsVacant =
                    (CASE
                      WHEN 
                        SoldAsVacant = 'Y' THEN 'Yes' 
                      WHEN
                        SoldAsVacant = 'N' THEN 'No'
                      ELSE SoldAsVacant
                    END )
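
A quick way to confirm the standardization is to count the distinct values that remain in the column. This is a minimal sketch rather than part of the original script; the `NumRows` alias is chosen here for illustration.

```sql
-- Sketch: after the UPDATE, only 'Yes' and 'No' should remain
-- (plus NULL, if the column contains any missing values).
SELECT
    SoldAsVacant,
    COUNT(*) AS NumRows
FROM
    PortfolioProject..NashvilleHousing
GROUP BY
    SoldAsVacant
ORDER BY
    NumRows DESC;
```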
    

  1. Removed Duplicates

                      WITH RowNumCTE AS(
                      Select *, 
                        ROW_NUMBER() OVER (Partition By ParcelID,     --Assuming that UniqueID is not available
                                        PropertyAddress,
                                        SaleDate,
                                        LegalReference,
                                        OwnerName
                                  Order By
                                        ParcelID) AS RowNumb -- Numbers the rows within each partition, starting at 1,
                                        -- and restarts the numbering whenever the partitioned values differ, so duplicates get RowNumb > 1
    
    
                      FROM
                        PortfolioProject..NashvilleHousing
                      --WHERE
                      --	RowNumb > 1 -- A window function's alias can't be filtered with WHERE in the same query, hence the CTE
                      )
                      --SELECT *
                      --FROM
                      --	RowNumCTE
                      --Where RowNumb > 1
    
                      DELETE 
                      FROM
                        RowNumCTE
                      Where RowNumb > 1
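
Because the DELETE is destructive, one cautious option is to run it inside a transaction and inspect the number of affected rows before committing. The sketch below is an assumption about how that could look, not the author's original workflow; it reuses the same CTE and ends with a ROLLBACK so nothing is deleted until that line is deliberately changed.

```sql
-- Sketch: run the duplicate removal inside a transaction so it can be undone.
BEGIN TRANSACTION;

WITH RowNumCTE AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY ParcelID, PropertyAddress, SaleDate, LegalReference, OwnerName
            ORDER BY ParcelID
        ) AS RowNumb
    FROM PortfolioProject..NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE RowNumb > 1;

-- @@ROWCOUNT holds the number of rows affected by the previous statement.
SELECT @@ROWCOUNT AS DeletedRows;

-- Undo the delete; replace with COMMIT TRANSACTION once the count looks right.
ROLLBACK TRANSACTION;
```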
    

**At the end of this ETL and data wrangling process, the client received clean data that is ready to be analyzed or stored for future use.**
