Indian-Census-Data-Analysis-Using-SQL

Formulated questions to explore population trends and characteristics, shedding light on the 2011 Indian Census.

Author

@saadharoon27

Project Overview
About The Dataset
Queries, Reasons, and Code

Project Overview

This project analyzes the dataset of the Indian census of 2011, which is structured into two distinct tables. The first table covers geographical and demographic aspects of the population: district, state, sex ratio, population growth rate, and literacy rate. The second table contains the district, state, area in square kilometres, and population count.

In my analysis, I formulated various questions to explore the dataset comprehensively, aiming to uncover insights into the population trends and characteristics. Each question was selected with a specific purpose in mind, based on its relevance to demographic patterns and regional dynamics. The conclusions drawn from these questions are presented, shedding light on notable findings and contributing to a deeper understanding of the data's implications.

About The Dataset

Indian Census 2011 Dataset

Dataset 1: Demographic Insights

| Column | Description |
| --- | --- |
| District | The name of the district within India. |
| State | The state to which the district belongs. |
| Growth | The population growth rate of the district. |
| Sex_Ratio | The number of females per 1000 males in the district. |
| Literacy | The literacy rate of the district's population. |

Dataset 2: Geographical Information

| Column | Description |
| --- | --- |
| District | The name of the district within India. |
| State | The state to which the district belongs. |
| Area_km2 | The geographical area of the district in square kilometers. |
| Population | The population count of the district. |

Queries, Reasons, and Code

Calculate the total number of rows in both datasets.
- Reason: Verifying the number of rows in an SQL dataset is crucial as it offers an essential measure of data volume and completeness. This information helps ensure data integrity, aids in identifying potential data discrepancies, and provides a fundamental understanding of the dataset's scale and scope.
- Code:
```
SELECT COUNT(*) FROM dataset1;
SELECT COUNT(*) FROM dataset2;
```
- Finding: Both datasets have exactly 640 rows of data.
Finding the population of India.
- Reason: Before we proceed with the in-depth analysis of the Indian Census 2011 data, it's essential to establish the total population figure that our analysis encompasses. The analysis utilized Dataset2 because it contained the necessary information to determine the population of each district.
- Code:
```
SELECT SUM(population) population FROM dataset2;
```
- Finding: The populations of all districts sum to 1,210,854,977.
Average growth percentage of India.
- Reason: Calculating the average growth percentage of India aids in comprehending the pace of the country's overall population expansion. This information assists in estimating future population sizes after a specific number of years, enabling informed projections and planning.
- Code:
```
SELECT AVG(growth) AverageGrowth FROM dataset1;
```
- Finding: The average rate of growth of India’s population is 19.24%.
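
As a rough illustration of how such a projection could work, the sketch below compounds the reported total population by the reported average growth rate once per census interval. It is written in Python rather than SQL for convenience. The function name `project_population` is hypothetical, and the sketch assumes that the growth rate stays constant from one decade to the next and that the unweighted district average is a fair stand-in for national growth. Neither assumption is guaranteed to hold.

```python
# Figures reported in the findings above.
POPULATION_2011 = 1_210_854_977
AVG_GROWTH_PCT = 19.24  # percent per census interval (10 years)


def project_population(current: int, growth_pct: float, decades: int) -> int:
    """Compound `current` forward by `decades` census intervals.

    Assumes a constant growth rate, which is a simplification.
    """
    return round(current * (1 + growth_pct / 100) ** decades)


# Hypothetical projections for the next two census years.
for decades in (1, 2):
    year = 2011 + 10 * decades
    print(year, project_population(POPULATION_2011, AVG_GROWTH_PCT, decades))
```

Because the growth is compounded, the second decade adds more people than the first even though the rate is unchanged.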
Average growth percentage by state, with the top 3 states.
- Reason: In dataset1, the growth percentage is presented on a district level, offering a detailed perspective of the data. However, for a broader understanding and to formulate more effective strategies, a more comprehensive overview might be more beneficial. Zooming out to view the data on a larger scale could provide a clearer insight into the trends and help in identifying actionable steps.
- Code:
```
SELECT state, AVG(growth) AS AvgStatesGrowth FROM dataset1 
GROUP BY state
ORDER BY AvgStatesGrowth DESC;

SELECT state, AVG(growth) AS AvgStatesGrowth FROM dataset1 
GROUP BY state
ORDER BY AvgStatesGrowth DESC
LIMIT 3;
```
- Finding: Nagaland has the highest average growth at 82.28%, followed by Dadra and Nagar Haveli at 55.88% and Daman and Diu at 42.74%.
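
The grouping pattern can be tried on a small scale. The following is a minimal sketch that runs the same query shape against an in-memory SQLite database through Python's built-in `sqlite3` module. The state names and growth values are made up for illustration and are not census figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset1 (district TEXT, state TEXT, growth REAL)")
# Made-up rows for illustration only; these are not census figures.
conn.executemany(
    "INSERT INTO dataset1 VALUES (?, ?, ?)",
    [
        ("D1", "StateA", 10.0), ("D2", "StateA", 20.0),
        ("D3", "StateB", 40.0),
        ("D4", "StateC", 5.0), ("D5", "StateC", 15.0),
        ("D6", "StateD", 30.0),
    ],
)

# Average the district-level growth per state, then keep the three highest.
top_states = conn.execute(
    """
    SELECT state, AVG(growth) AS avg_growth
    FROM dataset1
    GROUP BY state
    ORDER BY avg_growth DESC
    LIMIT 3
    """
).fetchall()
print(top_states)
```

Note that `AVG(growth)` gives every district the same weight, so a small district influences its state's average as much as a large one.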
Average sex ratio of different states and find the worst 3 performers.
- Reason: Determining the average gender distribution across various states can aid in tailoring product offerings to suit specific regional demographics. This approach ensures that products are aligned with the preferences of different states' populations, enhancing the potential for successful market penetration.
- Code:
```
SELECT state, ROUND(AVG(sex_ratio)) AS sex_ratio FROM dataset1 
GROUP BY state
ORDER BY sex_ratio DESC;

SELECT state, ROUND(AVG(sex_ratio)) AS sex_ratio FROM dataset1 
GROUP BY state
ORDER BY sex_ratio ASC
LIMIT 3;
```
- Finding: Kerala has the highest sex ratio, with 1080 females per 1000 males. The three lowest are Dadra and Nagar Haveli, Daman and Diu, and Chandigarh.
Literacy rate of different states, and the states with a rate above 90%.
- Reason: The literacy rate serves as a significant parameter for determining the most effective marketing approach. This factor ensures that marketing materials resonate better with the audience by considering their level of understanding and engagement.
- Code:
```
SELECT state, ROUND(AVG(literacy)) AS literacy_rate FROM dataset1 
GROUP BY state
ORDER BY literacy_rate DESC;

SELECT state, ROUND(AVG(literacy)) AS literacy_rate 
FROM dataset1 
GROUP BY state
HAVING ROUND(AVG(literacy)) > 90
ORDER BY literacy_rate DESC;
```
- Finding: Kerala again comes out on top with the highest literacy rate in India at 94%, followed by Lakshadweep at 92%.
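
One detail of the `HAVING` clause is that it filters on the rounded average, so a state averaging slightly above 90 can still be dropped. The sketch below illustrates this with an in-memory SQLite database via Python's `sqlite3` module. All state names and literacy values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset1 (district TEXT, state TEXT, literacy REAL)")
# Invented values chosen so that two states sit just above 90 before rounding.
conn.executemany(
    "INSERT INTO dataset1 VALUES (?, ?, ?)",
    [
        ("D1", "StateA", 95.0), ("D2", "StateA", 93.0),  # average 94.0
        ("D3", "StateB", 90.2), ("D4", "StateB", 90.6),  # average 90.4
        ("D5", "StateC", 90.5), ("D6", "StateC", 90.7),  # average 90.6
        ("D7", "StateD", 70.0), ("D8", "StateD", 80.0),  # average 75.0
    ],
)

# HAVING is evaluated after grouping, so it can test the aggregate value.
rows = conn.execute(
    """
    SELECT state, ROUND(AVG(literacy)) AS literacy_rate
    FROM dataset1
    GROUP BY state
    HAVING ROUND(AVG(literacy)) > 90
    ORDER BY literacy_rate DESC
    """
).fetchall()
print(rows)
```

Here StateB averages 90.4, which rounds to 90 and is excluded, while StateC averages 90.6, which rounds to 91 and is kept. Filtering on `AVG(literacy) > 90` instead would keep both.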

Top and bottom 3 states in literacy rates.
- Reason: Finding the extremes helps us understand the spread of the data that we are dealing with.
- Code:
```
/* Method 1 */
(SELECT state, ROUND(AVG(literacy)) AS literacy_rate 
FROM dataset1 
GROUP BY state
ORDER BY literacy_rate ASC
LIMIT 3)
UNION
(SELECT state, ROUND(AVG(literacy)) AS literacy_rate 
FROM dataset1 
GROUP BY state
ORDER BY literacy_rate DESC
LIMIT 3)
ORDER BY literacy_rate DESC;

/* Method 2 */
WITH literacy_cte AS (
    SELECT state, ROUND(AVG(literacy)) AS literacy_rate
    FROM dataset1
    GROUP BY state
)
SELECT state, literacy_rate
FROM (
    SELECT state, literacy_rate
    FROM literacy_cte
    ORDER BY literacy_rate ASC
    LIMIT 3
    ) AS lower_literacy
UNION ALL
SELECT state, literacy_rate
FROM (
    SELECT state, literacy_rate
    FROM literacy_cte
    ORDER BY literacy_rate DESC
    LIMIT 3
    ) AS higher_literacy
ORDER BY literacy_rate DESC;
```
- Finding: The top 3 are Kerala, Lakshadweep, and Mizoram with 94%, 92%, and 89% respectively, and the bottom 3 are Rajasthan, Arunachal Pradesh, and Bihar with 65%, 64%, and 62% respectively.

States starting with the letter ‘A’ or ‘B’.
- Reason: This question demonstrates pattern matching with the LIKE operator.
- Code:
```
SELECT DISTINCT state FROM dataset1 
WHERE LOWER(state) LIKE 'a%' OR LOWER(state) LIKE 'b%';
```
- Finding: The states that start with the letter ‘A’ are Andaman and Nicobar Islands, Andhra Pradesh, Arunachal Pradesh, and Assam. The only state that starts with ‘B’ is Bihar.
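
To see the pattern match in isolation, here is a small sketch against an in-memory SQLite database using Python's `sqlite3` module. The district identifiers are placeholders, and the second query (using `SUBSTR` and `IN`) is an alternative formulation added for comparison, not part of the original analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset1 (district TEXT, state TEXT)")
# Placeholder district identifiers; only the state column matters here.
conn.executemany(
    "INSERT INTO dataset1 VALUES (?, ?)",
    [
        ("X1", "Andhra Pradesh"), ("X2", "Assam"), ("X3", "Assam"),
        ("X4", "Bihar"), ("X5", "Kerala"),
    ],
)

# 'a%' matches any value that begins with "a"; LOWER() makes the
# comparison independent of how the state names are capitalised.
like_rows = conn.execute(
    """
    SELECT DISTINCT state FROM dataset1
    WHERE LOWER(state) LIKE 'a%' OR LOWER(state) LIKE 'b%'
    ORDER BY state
    """
).fetchall()

# Alternative: compare only the first character against a list.
substr_rows = conn.execute(
    """
    SELECT DISTINCT state FROM dataset1
    WHERE SUBSTR(LOWER(state), 1, 1) IN ('a', 'b')
    ORDER BY state
    """
).fetchall()
print(like_rows)
```

`DISTINCT` collapses the two Assam rows into one, which is why it is needed when the table has one row per district.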
Calculate the number of males and females.
- Reason: The earlier analysis only calculated the average sex ratio, a relative measure that says nothing about the actual number of males and females in each state. To address this limitation, this query derives the male and female population figures for each state from the district populations and sex ratios, allowing for a more comprehensive assessment.
- Code:
```
/* sex_ratio is females per 1000 males, so with r = sex_ratio / 1000:
   Males   = population / (r + 1)
   Females = population * r / (r + 1) */
SELECT c.state,
       SUM(ROUND(c.population / (c.sex_ratio + 1))) AS male,
       SUM(ROUND(c.population * c.sex_ratio / (c.sex_ratio + 1))) AS female
FROM
(SELECT d1.district, d1.state, d1.sex_ratio / 1000 AS sex_ratio, d2.population
 FROM dataset1 AS d1
 INNER JOIN dataset2 AS d2
 -- Match on state as well, since district names can repeat across states
 ON d1.district = d2.district AND d1.state = d2.state) AS c
GROUP BY c.state;
```
- Finding: State Wise Gender Distribution
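
The formulas in the code comment follow from two facts: females = r × males, where r is the sex ratio expressed per male, and population = males + females. Substituting gives population = males × (1 + r). The sketch below works through that arithmetic in Python with made-up numbers; the function name `split_population` is hypothetical.

```python
from typing import Tuple


def split_population(population: int, sex_ratio_per_1000: float) -> Tuple[int, int]:
    """Estimate (males, females) from a total and a females-per-1000-males ratio."""
    r = sex_ratio_per_1000 / 1000  # females per male
    males = round(population / (1 + r))
    females = round(population * r / (1 + r))
    return males, females


# Made-up district: 2080 people with 1080 females per 1000 males.
males, females = split_population(2080, 1080)
print(males, females)
```

Because each figure is rounded separately, the two estimates can differ from the true total by one person in some districts.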

Population in the previous census and in the current census.
- Reason: The difference between the two values shows the pace at which the population is growing. To estimate the previous census population, I divided the current population by (1 + growth / 100), using each state's average growth percentage.
- Code:
```
SELECT i.state,
       ROUND(i.current_population / (1 + i.states_growth / 100)) AS previous_population,
       i.current_population
FROM
    (SELECT d1.state,
            AVG(d1.growth) AS states_growth,
            SUM(d2.population) AS current_population
     FROM dataset1 AS d1
     -- Join on district and state; joining on state alone pairs every district
     -- with every other district in the state and inflates SUM(population)
     INNER JOIN dataset2 AS d2
        ON d1.district = d2.district AND d1.state = d2.state
     GROUP BY d1.state) AS i
ORDER BY i.state ASC;
```
- Finding: State Wise Population Change
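
One thing to watch in a query like this is the join key. The sketch below uses an in-memory SQLite database (through Python's `sqlite3` module) with invented districts and populations to compare a join on state alone with a join on district and state. The table layout mirrors the two datasets, but every value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset1 (district TEXT, state TEXT, growth REAL)")
conn.execute("CREATE TABLE dataset2 (district TEXT, state TEXT, population INTEGER)")
# Invented rows: StateA has two districts, StateB has one.
conn.executemany(
    "INSERT INTO dataset1 VALUES (?, ?, ?)",
    [("D1", "StateA", 10.0), ("D2", "StateA", 20.0), ("D3", "StateB", 30.0)],
)
conn.executemany(
    "INSERT INTO dataset2 VALUES (?, ?, ?)",
    [("D1", "StateA", 100), ("D2", "StateA", 200), ("D3", "StateB", 50)],
)

QUERY = """
    SELECT d1.state, SUM(d2.population) AS current_population
    FROM dataset1 AS d1
    INNER JOIN dataset2 AS d2 ON {condition}
    GROUP BY d1.state
    ORDER BY d1.state
"""

# State-only join: each StateA district matches both StateA rows in dataset2.
by_state = conn.execute(QUERY.format(condition="d1.state = d2.state")).fetchall()

# District-and-state join: each district matches exactly one row.
by_district = conn.execute(
    QUERY.format(condition="d1.district = d2.district AND d1.state = d2.state")
).fetchall()

print(by_state)
print(by_district)
```

With the state-only join, StateA's total comes out at 600 instead of 300, because every population row is counted once per district in the state. A state with a single district is unaffected, which can make the problem easy to miss.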

How the change in population affected the area (km2) available per person.
- Reason: As the country's population grows, the available land area per person is likely to decrease. This could lead to a more condensed living space, potentially resulting in the construction of skyscrapers and tall buildings to accommodate the increasing population within limited land resources.
- Code:
```
SELECT 
    (g.total_area / g.previous_census_population) AS previous_census_population_vs_area, 
    (g.total_area / g.current_census_population) AS current_census_population_vs_area 
FROM (
    SELECT q.*, r.total_area 
    FROM (
        SELECT '1' AS keyy, n.* 
        FROM (
            SELECT 
                SUM(m.previous_census_population) AS previous_census_population, 
                SUM(m.current_census_population) AS current_census_population 
            FROM (
                SELECT e.state,
                    SUM(e.previous_census_population) AS previous_census_population,
                    SUM(e.current_census_population) AS current_census_population 
                FROM (
                    -- growth is treated as a percentage here, as in the previous query, so it is divided by 100
                    SELECT d.district, d.state, ROUND(d.population / (1 + d.growth / 100)) AS previous_census_population, d.population AS current_census_population 
                    FROM (
                        SELECT a.district, a.state, a.growth, b.population 
                        FROM dataset1 a 
                        INNER JOIN dataset2 b ON a.district = b.district AND a.state = b.state
                    ) d
                ) e
                GROUP BY e.state
            ) m
        ) n
    ) q 
    INNER JOIN (
        SELECT '1' AS keyy, z.* 
        FROM (
            SELECT SUM(area_km2) AS total_area 
            FROM dataset2
        ) z
    ) r ON q.keyy = r.keyy
) g;
```
- Finding:

| Area km2 per person (Previous Census) | Area km2 per person (Current Census) |
| --- | --- |
| 0.04806182205366204 | 0.0026745920896968024 |

The previous-census figure above is about 18 times the current one, which an average growth rate of roughly 19% cannot explain. It appears to come from using growth as a percentage without dividing by 100, and should be recomputed with the query above.

Calculate the top 3 districts with the highest literacy rates in each state.
- Reason: The primary objective of this calculation is to showcase the effectiveness of window functions in SQL. These functions simplify complex coding tasks by allowing us to achieve significant results through straightforward steps.
- Code:
```
SELECT a.* FROM
	(SELECT district, state, literacy, RANK() OVER(PARTITION BY state 
	 ORDER BY literacy DESC) AS rnk FROM dataset1) AS a
WHERE a.rnk IN (1, 2, 3) ORDER BY state;
```
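
To show how `RANK()` behaves when districts tie, the following sketch runs the same query shape on an in-memory SQLite database through Python's `sqlite3` module (window functions need SQLite 3.25 or newer). The districts, states, and literacy values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset1 (district TEXT, state TEXT, literacy REAL)")
# Invented values: three StateA districts tie for second place.
conn.executemany(
    "INSERT INTO dataset1 VALUES (?, ?, ?)",
    [
        ("A1", "StateA", 90.0), ("A2", "StateA", 85.0), ("A3", "StateA", 85.0),
        ("A4", "StateA", 85.0), ("A5", "StateA", 70.0),
        ("B1", "StateB", 60.0), ("B2", "StateB", 50.0),
    ],
)

# PARTITION BY restarts the ranking for every state;
# ORDER BY literacy DESC gives rank 1 to the most literate district.
top3 = conn.execute(
    """
    SELECT district, state, literacy, rnk
    FROM (
        SELECT district, state, literacy,
               RANK() OVER (PARTITION BY state ORDER BY literacy DESC) AS rnk
        FROM dataset1
    ) AS a
    WHERE rnk <= 3
    ORDER BY state, rnk, district
    """
).fetchall()
for row in top3:
    print(row)
```

Because tied rows share a rank, StateA returns four districts here rather than three. If exactly three rows per state are required, `ROW_NUMBER()` could be used in place of `RANK()`, at the cost of breaking ties arbitrarily.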
