Real Estate Analysis in GTA
Project Overview
Our project examines house prices in the Greater Toronto Area (GTA) in 2020 and 2021, using data from the 2020-2021 Toronto real estate sold house listings. We want to find out whether sold listings can predict future house prices based on a house's features (number of washrooms, bedrooms, etc.) and location. The following steps will allow us to achieve this goal:
- Create a database using SQLite.
- Run various machine learning models to predict the sold price and compare which model is most accurate.
- Create a fully functioning, interactive dashboard using Tableau.
- Create and host a web application on GitHub to showcase the results.
Data Source Description
We gathered our data from a licensed realtor who has access to the most recent listing information about house sales in Brampton for the last six months. The data are PDF files which show 26 rows. The PDF files are converted into Excel files, cleaned up, then saved as CSV. A Multiple Linear Regression model will be used to predict the future price of a house based on the house features listed above. We are using SQLite for the database.
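The CSV-to-SQLite step described above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the table name `listings`, the snake_case column names, the subset of six fields, and the sample rows are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for the cleaned CSV file (values are not real listings).
SAMPLE_CSV = """MLS #,Municipality,List Price,Sold Price,Br,Wr
W0000001,Brampton,899000,950000,3,3
W0000002,Brampton,1099000,1080000,4,4
"""


def load_listings(conn, csv_file):
    """Create a `listings` table (assumed name) and insert every CSV row."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               mls_number   TEXT PRIMARY KEY,
               municipality TEXT,
               list_price   INTEGER,
               sold_price   INTEGER,
               bedrooms     INTEGER,
               washrooms    INTEGER
           )"""
    )
    rows = [
        (r["MLS #"], r["Municipality"], int(r["List Price"]),
         int(r["Sold Price"]), int(r["Br"]), int(r["Wr"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database keeps the sketch self-contained;
# pass a file path to sqlite3.connect for a persistent database.
conn = sqlite3.connect(":memory:")
inserted = load_listings(conn, io.StringIO(SAMPLE_CSV))
avg_sold = conn.execute("SELECT AVG(sold_price) FROM listings").fetchone()[0]
print(inserted, avg_sold)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened CSV file, and the table would need one column per field in the dataset.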
Data content
The CSV file contains 26 fields. The description of each field is as below:
- "#" - The row number.
- LSC - The contract status of the listing as of its Last Status Change (LSC).
- EC - Indicates whether the property has received an Encumbrance Certificate, a certificate of assurance that the property is free from any legal or monetary liability.
- St# - The Street number
- Street name - The name of the street.
- Abbr - Abbreviation of Street type.
- Dir - Direction (North, South, East and West).
- Municipality - The name of the city in which the property is located.
- Community - The name of the locality in which the property is located within a given municipality.
- List Price - The price of the property at the time of listing.
- Sold Price - The price that the property was sold for.
- Type - Describes the type of the property (semi-detached, detached or attached).
- Style - The number of storeys in the listed property.
- Br - The number of bedrooms in the listed property.
- Additional - Any additional rooms in the listed property.
- Wr - The number of washrooms in the listed property.
- Fam - Indicates if there is a family room in the listed property.
- Kit - Number of kitchens in the listed property.
- Garage type - The type of the garage.
- A/C - The type of A/C (Centralized or non-centralized).
- Heat - The type of Furnace.
- Contract Date - The date the contract was signed.
- Sold Date - The date the property was sold on.
- List Brokerage - Name of brokerage that the listed property was under.
- Co op Brokerage - Buyer's brokerage.
- MLS # - A unique number assigned to a real estate listing.
Questions to Answer:
- Do the features of the house (washrooms, bedrooms, area, semi-detached, attached) play an integral role in determining the sold price?
- Do the location, type, style, listing date and listing price factor into the sold price?
- When was it listed and how fast was it sold?
- Does location dictate how long a house stays on the market, and are there any patterns?
- Does the location show a pattern in over-asking prices?
- What will be the Sold Price of a house based on different features mentioned in the dataset?
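Two of the questions above, how fast a house sold and how far it sold over asking, reduce to simple arithmetic on the date and price fields. The sketch below illustrates this under stated assumptions: the rows are made up, the community names "A" and "B" are placeholders, and Contract Date is treated as the listing date.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up rows; the keys mirror the CSV field names described above.
listings = [
    {"Community": "A", "Contract Date": date(2021, 3, 1), "Sold Date": date(2021, 3, 9),
     "List Price": 900_000, "Sold Price": 990_000},
    {"Community": "A", "Contract Date": date(2021, 4, 1), "Sold Date": date(2021, 4, 5),
     "List Price": 1_000_000, "Sold Price": 1_100_000},
    {"Community": "B", "Contract Date": date(2021, 3, 10), "Sold Date": date(2021, 3, 30),
     "List Price": 800_000, "Sold Price": 780_000},
]


def days_on_market(row):
    """Days between the contract date (assumed to be the listing date) and the sale."""
    return (row["Sold Date"] - row["Contract Date"]).days


def over_asking_pct(row):
    """Percentage by which the sold price exceeds (or falls below) the list price."""
    return (row["Sold Price"] - row["List Price"]) / row["List Price"] * 100


# Group both metrics by community and average them.
by_community = defaultdict(list)
for row in listings:
    by_community[row["Community"]].append((days_on_market(row), over_asking_pct(row)))

summary = {
    community: {
        "avg_days_on_market": mean(d for d, _ in values),
        "avg_over_asking_pct": mean(p for _, p in values),
    }
    for community, values in by_community.items()
}
print(summary)
```

A negative over-asking percentage means the house sold below its list price.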
Machine Learning
We have created a machine learning model to predict the "Sold Price" of a house based on its style, type, bedrooms, washrooms, and list price. This can help a prospective buyer decide how much to bid for the house. We have used a Multiple Linear Regression model for this purpose.
Since we are trying to predict a continuous numerical output (i.e. “Sold Price” of homes) from a number of input variables, we selected Multiple Linear Regression as the machine learning model. It takes a set of input factors (the training dataset), learns patterns and finds relationships between data points, and then predicts the value of the dependent variable. The 10 columns mentioned on the ML Input page are used as the feature (input) variables.
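A minimal sketch of multiple linear regression with scikit-learn is shown below. It is not the project's actual model: the data are synthetic (generated from a made-up linear formula plus noise), only four of the features are used, and the category labels and coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data only: none of these numbers come from the real listings.
rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(2, 6, size=n)            # 2 to 5 bedrooms
washrooms = rng.integers(1, 5, size=n)           # 1 to 4 washrooms
list_price = rng.uniform(600_000, 1_500_000, size=n)
house_type = rng.choice(["Detached", "Semi-Detached", "Attached"], size=n)

# Made-up ground truth used to generate the target.
type_bonus = np.where(house_type == "Detached", 30_000, 0)
sold_price = (1.05 * list_price + 10_000 * bedrooms + 5_000 * washrooms
              + type_bonus + rng.normal(0, 20_000, size=n))

# One-hot encode the categorical column; drop one level to avoid redundancy.
encoder = OneHotEncoder(drop="first")
type_encoded = encoder.fit_transform(house_type.reshape(-1, 1)).toarray()

X = np.hstack([np.column_stack([bedrooms, washrooms, list_price]), type_encoded])
y = sold_price

# Fit on the training split, evaluate on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)     # coefficient of determination on unseen data
predictions = model.predict(X_test)
print(f"R^2 on the test split: {r2:.3f}")
```

The held-out split matters here: the model learns its coefficients from the training rows only, so the test score estimates how well it would predict sold prices for listings it has not seen.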
We also tested XGBRegressor from the XGBoost library and Support Vector Regression (SVR) from sklearn as alternative machine learning models. However, performance dropped significantly with both. Therefore, Linear Regression was chosen as the best option among the tested models.
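One way such a comparison could be run is with cross-validated R² scores, sketched below. The data are synthetic and generated from a linear formula, so the scores say nothing about the real listings; the model settings are assumptions, and XGBRegressor is left out to keep the example dependent on scikit-learn only (it follows the same estimator interface and could be added to the dictionary).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features: bedrooms, washrooms, list price (made-up values).
rng = np.random.default_rng(42)
n = 150
X = np.column_stack([
    rng.integers(2, 6, size=n),
    rng.integers(1, 5, size=n),
    rng.uniform(600_000, 1_500_000, size=n),
])
# Made-up linear relationship plus noise, used only to have a target to fit.
y = 1.03 * X[:, 2] + 8_000 * X[:, 0] + 4_000 * X[:, 1] + rng.normal(0, 15_000, size=n)

candidates = {
    "linear_regression": LinearRegression(),
    # SVR is sensitive to feature scale, so the inputs are standardized first.
    "svr_rbf": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}

# Mean R^2 over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("best:", best_model)
```

Scoring every candidate with the same folds and the same metric keeps the comparison fair. Untuned SVR hyperparameters (default `C` and `epsilon`) can score poorly on price-scale targets, so a tuned model may behave differently.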
Team:
- Sanket Kumar : https://github.com/sanketkumaronline
- Agnieszka Blanchard : https://github.com/agnieszka-web
- Andrew Tymkiv : https://github.com/AndrewTymkiv
- Yashitha Bhuvanagiri : https://github.com/yashithab