- To develop an API that draws on real-estate data to provide the following functionalities:
- modeling a house in 3D from LiDAR satellite images (GeoTIFF files) by entering only a home address. This part is an extension of a previous project
- locating the house on a map by entering its address
- forecasting the price of a building (i.e. a house or an apartment) from multiple features (postal code, number of rooms, living space, surface area, etc.)
- To deploy the API on Azure (using, among others, Docker and Travis)
- To consolidate knowledge of Python, specifically NumPy, Pandas, Sklearn, Matplotlib, ...
- To be able to search for and implement new libraries
- To consolidate knowledge of data science and machine learning algorithms for developing an accurate regression model
- To be able to construct the project with object-oriented programming (OOP)
- To be able to implement the whole project, and make it work, through an API (using Flask)
- To be able to deploy the API on a web-based environment (in this case Azure)
- The API must be functional
- Your model must be functional
- The API must be deployed on a web-based environment (e.g. Heroku, Azure, etc.)
- Optimize your solution to have the result as fast as possible.
- The API searches for as much information as possible on its own (for example, area => cadastre).
- Better visualization: you provide a 3D representation of the house
- All the work was done during BeCode's AI/data science bootcamp 2020-2021
- Research and understand the terms, concepts and requirements of the project.
- Discover new libraries that can serve the project's purposes
- Developing, using and testing machine learning algorithms (i.a. sklearn with linear, SVM and decision tree regression, XGBoost, ...)
- for 3D house reconstruction
- for real-estate data
- Data collection was done in the context of a previous project whose aim was to develop a scraping bot, written in Python, to scrape data (50,000+) from the real-estate website "Zimmo.be", for a challenge given by BeCode.
- Data cleaning: including, among others, removing outliers and features with too many missing values (>15%), and conducting multivariate feature imputation for the features with fewer missing values (using sklearn.impute.IterativeImputer)
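
The cleaning step could look roughly like the sketch below. It is only an illustration under assumptions: the function name `clean_and_impute`, the toy column names and the synthetic data are hypothetical and not taken from the project notebooks, and outlier removal is left out. Only the 15% threshold and the use of `sklearn.impute.IterativeImputer` come from the description above.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def clean_and_impute(df, max_missing=0.15, random_state=0):
    """Drop columns with more than `max_missing` missing values, impute the rest."""
    missing_share = df.isna().mean()            # fraction of NaN per column
    kept = df.loc[:, missing_share <= max_missing]
    imputer = IterativeImputer(random_state=random_state)
    values = imputer.fit_transform(kept)        # each feature is modelled from the others
    return pd.DataFrame(values, columns=kept.columns, index=kept.index)


# Hypothetical toy data: 200 listings with three numeric features.
rng = np.random.default_rng(0)
n = 200
living_area = rng.uniform(40, 300, n)
toy = pd.DataFrame({
    "living_area": living_area,
    "bedrooms": np.round(living_area / 40),
    "terrace_area": rng.uniform(0, 40, n),
})
# 10% missing: below the threshold, so the column is kept and imputed.
toy.loc[toy.sample(frac=0.10, random_state=1).index, "bedrooms"] = np.nan
# 50% missing: above the threshold, so the column is dropped.
toy.loc[toy.sample(frac=0.50, random_state=2).index, "terrace_area"] = np.nan

cleaned = clean_and_impute(toy)
```

With this toy data, `cleaned` keeps `living_area` and `bedrooms` and no longer contains missing values.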
- Feature engineering: as location (postal code) is not readily amenable to integration in a quantitative model, but nonetheless has a huge impact on real-estate prices, a ranking index was computed based on the average house price for each entity in Belgium. As shown, this index demonstrates a good association with house prices, and its third-degree polynomial seemed to explain the target best (more than 25% of the 'house price' variance explained by this sole feature, based on the r_square coefficient).
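
One way such a ranking index could be built is sketched below. This is an assumption-laden illustration, not the project's code: the helper `add_price_rank`, the column names and the toy prices are hypothetical, and the "entity" is taken here to be the postal code. The cubic fit uses `PolynomialFeatures(degree=3)` followed by a linear regression, which is one common way to fit a third-degree polynomial of a single feature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def add_price_rank(df, group_col="postal_code", price_col="price"):
    """Rank each group by its mean price (1 = cheapest) and attach the rank to every row."""
    mean_price = df.groupby(group_col)[price_col].mean()
    rank = mean_price.rank(method="dense").astype(int)
    return df.assign(price_rank=df[group_col].map(rank))


# Hypothetical toy listings: two per postal code.
toy = pd.DataFrame({
    "postal_code": [1000, 1000, 2000, 2000, 4000, 4000, 6000, 6000, 9000, 9000],
    "price": [400_000, 420_000, 250_000, 270_000, 200_000,
              220_000, 170_000, 190_000, 300_000, 310_000],
})
ranked = add_price_rank(toy)

# Fit price ~ third-degree polynomial of the ranking index and report R^2.
cubic = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
cubic.fit(ranked[["price_rank"]], ranked["price"])
r_square = cubic.score(ranked[["price_rank"]], ranked["price"])
```

In a real pipeline the index would presumably be computed on the training split only and then mapped onto the test split, so that test prices do not leak into the feature.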
- Features:
- type of building: house/apartment
- living area: square meters
- land area: square meters
- number of facades
- number of bedrooms
- garden: yes/no
- terrace: yes/no
- terrace area: square meters
- equipped kitchen: yes/no
- fireplace: yes/no
- swimming pool: yes/no
- state of the building: as new, just renovated, good, to refresh, to renovate, to restore (one hot encoding)
- Target:
- House price: euros
- Machine learning model:
- Multiple models, using an increasing number of features and based on various algorithms (i.a. linear, SVM, decision tree, XGBoost), were trained and evaluated.
- The best model was based on the XGBoost algorithm (n_estimators=700, max_depth=4, learning_rate=0.3) and achieved an r_square coefficient of 0.82 on the training set and 0.76 on the test set
- The best-fitting model was saved as a pickle file, which was integrated in the API for price estimation
- Examples of Python code for data manipulation and algorithm development are stored in the notebook folder of the current repository
- Estimate: in
- Estimate: out
- Map: in
- Map: out
- 3D reconstruction: in
- 3D reconstruction: out