The rent-houses--germany from felipedmnq

This project has the coal to undertand and practice tools to load data into Postgres database and to extract data from Postgrees using python.

Postgres Database.

This solution was build to extract the needed information from a website (https://www.immonet.de/) and store it localy in a postgres database.

The first step was building the extration part of the data pipeline. Three python scripts were written to perform the task. It could be done in a simplier way, but the real intention here was to practice and try some new packeges using python to connect with a postgres database, load data into the database eng get data from the database.

The first script (get_offers_by_city.py) gets the quantity of rent offers for 11 german cities and load the data into a database in a cumulative way (append) to build a "historical table".

The table below ilustrates the extracted data:

The second script (get_offer_ids.py) get the previous table from the database and use it as base to run each offer for each city and extract the "offer id". All ids from each offer are stored in a new table and load into the database.

The following table shows a sample of the returned data.

The third script (get_offers_infos.py) uses the previous one (use offer ids) to extract all main information from each offer from each city. All the information is extracted in a raw format, all the main information comes in as a list to be tranformed and cleaned later.

Data Analasysis

Dataframe Description

offer_id - unique id for each offer.
extraction_date - extraction datetime.
lat - latitude.
lng - longitude.
city - city.
area_m2 - area in squared meters.
furnished - whether is furnished or unfurnished.
zip_code - zip code.
main_category - main category of the property.
rooms - ho many rooms the property have.
build_year - year of construction of the property.
state - state.
sub_category - sub-category of the property.
balcony - whether is has a balcony or not.
heat_type - which time of head system.
offer_title - the title of the property ad.
kitchen - whether it has a kitchen installed or not.
rent_price - rent price in Euros.
garden - whether it has a garden or not.

Descriptive Analysis

	attributes	min	max	range	avg	median	std	skewness	kurtosis
1	area_m2	15	400	385	68.9725	63.01	33.8502	2.04937	8.77325
2	rooms	1	11	10	2.28184	2	0.976887	0.961887	2.65717
3	build_year	1740	2021	281	1978.87	nan	40.9994	-0.78613	-0.162868
4	balcony	0	1	1	0.419285	0	0.493442	0.32723	-1.89355
6	kitchen	0	1	1	0.506451	1	0.499958	-0.0258107	-2
7	rent_price	200	8600	8400	1023.42	850	719.659	3.08872	17.7604
18	garden	0	1	1	0.142077	0	0.34913	2.05088	2.20684

From this statistic description we can see that our two main columns (area_m2 and rent_price) are highly skewed and have high kurtosis values. The positive skewness tells that for rent prices and areas we have a higher frequency of values below the average than above the average. The positive and high kurtosis tell us that our data are heavy-tailed and suggest the presence of outliers.

Exploratory Data Analysis

The first step was to check the characteristics and distributions of rent price.

Here we can see the presence of some outliers, the positive skewness with a tail directed to higher prices and the scatter plot shows us the exactly distribution of the rent price.

The next chart shows us The rent price distribution for each city.

Some Hypothesis

H1. Offers with larger areas are more expensive.

As expected, this hypothesis is FALSE.

As we can see in the chart below, as the area size increases, the rent average price increses as well and this is natural, places with bigger areas should have a bigger cost.

But when we analyze the average price per m2 for each range of area, the data shows us that not necessarily the bigger places are more expensive, they have a bigger cost, but the most expensive places are the smallest ones, they have a much bigger cost by m2 than the medium to bigger places.

H2. The more rooms, the more expensive.

The situation here is similar than the previous and the hypotesys is aldo FALSE.

When we analyze the rent cost and the number of rooms, is natural that as bigger the number of room, the bigger the rent cost and the next chart shows us this.

Again, when we analyze the m2 price by number of rooms, we can clearly see that the price per m2 is much bigger for places with less number of rooms, in other words, the relative cost is much bigger in smaller places or places with less number of rooms.

In order to have an overview about the offer_title column and check what are the main words that appear there, was build a "word cloud" image, displayed below:

To display all the offers in a map was used folium library, where we can see the distribution though Germany in a aggregated format. Each point displays the offer id, type, rent price and area.

Conclusion

This project had the main goal to practice the usage of tools to perform a full ETL process. It was used Python and BeautfulSoup library to performa web scraper and extrated the needed data from a website, pysicopg library to load the extracted data into a Postgres database and to extract data from it, configparser library to have access to the database hiding password and critical information, logging library to save the logs, Pandas and regular expressions to transform and cleam the data and Seaborn, Matplotlib, WordCloud and Folium to analyse and display the data into charts and maps.

Python	BeautifulSoup	Pandas	Psycopg2
RegEx	Seaborn	Folium	WordCloud

felipedmnq / rent-houses--germany Goto Github PK

rent-houses--germany's Introduction

Postgres Database.

Data Analasysis

Dataframe Description

Descriptive Analysis

Exploratory Data Analysis

Some Hypothesis

Conclusion

rent-houses--germany's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs