This repo covers exploratory data analysis (EDA) and data visualization.
Valued skills: Data pre-processing, descriptive statistics, Python
Skills: Regression methods, Prediction methods.
You work for the city of Seattle. To reach its goal of becoming a carbon-neutral city by 2050, your team is taking a close interest in emissions from non-residential buildings. To that end, your agents took careful readings in 2015 and 2016. However, these surveys are expensive to carry out, so you want to use the ones already done to predict the emissions of buildings whose emissions have not yet been measured. Two measures interest you: CO2 emissions and total energy consumption. You also want to evaluate how useful the ENERGY STAR Score is for predicting emissions, since it is complicated to calculate with the approach currently used by your team.
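One way to frame the last question is to fit the same model with and without the score and compare the results on held-out data. The following is a minimal sketch under stated assumptions, not the method used in this repo: the data is synthetic (not the Seattle records), the feature `floor_area` and the helpers `fit_ols` and `r2_score` are hypothetical names, and plain least squares stands in for whatever regression method is eventually chosen.

```python
import numpy as np

def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2_score(X, y, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the Seattle records): emissions depend on
# floor area and, by construction, on a score between 1 and 100.
rng = np.random.default_rng(0)
n = 200
floor_area = rng.uniform(1_000, 100_000, n)
score = rng.uniform(1, 100, n)
emissions = 0.002 * floor_area - 0.5 * score + rng.normal(0, 10, n)

X_base = floor_area.reshape(-1, 1)               # model without the score
X_full = np.column_stack([floor_area, score])    # model with the score

# Fit on the first 150 rows, evaluate on the remaining 50.
train, test = slice(0, 150), slice(150, n)

r2_base = r2_score(X_base[test], emissions[test],
                   fit_ols(X_base[train], emissions[train]))
r2_full = r2_score(X_full[test], emissions[test],
                   fit_ols(X_full[train], emissions[train]))

print(f"R^2 without score: {r2_base:.3f}")
print(f"R^2 with score:    {r2_full:.3f}")
```

If the held-out R^2 barely changes when the score is added, the cost of computing it may not be justified. On the real data the comparison would need a proper cross-validation rather than a single split.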
# This Python 3 environment comes with many helpful analytics libraries installed
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# pandas exposes a plot() method on DataFrame and Series for creating charts.
# pyplot, a submodule of the Matplotlib library, draws and displays the figures.
# Matplotlib is a low-level plotting library for Python that serves as a visualization utility.
import matplotlib.pyplot as plt
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
import seaborn as sns
%matplotlib inline
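The cells that follow operate on `df_2015` and `df_2016`, which are not created in this excerpt. A minimal sketch of the loading step might look like the following. The helper `load_benchmark` and the tiny in-memory CSV are illustrative assumptions, and the real file names are not given here, so none is assumed.

```python
import io
import pandas as pd

def load_benchmark(source):
    """Read one year of benchmarking data from a path or file-like object."""
    return pd.read_csv(source)

# Tiny in-memory stand-in so the sketch runs without the real files.
sample_csv = io.StringIO(
    "OSEBuildingID,DataYear,PropertyName\n"
    "1,2015,Example Building A\n"
    "2,2015,Example Building B\n"
)
df_sample = load_benchmark(sample_csv)
print(df_sample.shape)

# With the real data, the same helper would be called once per year, e.g.
# df_2015 = load_benchmark(<path to the 2015 CSV>)
# df_2016 = load_benchmark(<path to the 2016 CSV>)
```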
# Drop the 2015 columns that are not used in the analysis.
df_2015.drop(
    [
        'PropertyName', 'TaxParcelIdentificationNumber', 'CouncilDistrictCode',
        'Neighborhood', 'DataYear', 'ListOfAllPropertyUseTypes',
        'LargestPropertyUseType', 'LargestPropertyUseTypeGFA',
        'SecondLargestPropertyUseType', 'SecondLargestPropertyUseTypeGFA',
        'ThirdLargestPropertyUseType', 'ThirdLargestPropertyUseTypeGFA',
        'YearsENERGYSTARCertified', 'Comment', 'Outlier', '2010 Census Tracts',
        'City Council Districts', 'DefaultData', 'ComplianceStatus',
        'Seattle Police Department Micro Community Policing Plan Areas',
        'SPD Beats', 'Zip Codes',
    ],
    axis=1,
    inplace=True,
)
# head() returns the first n rows of the DataFrame (5 by default), along with the column headers.
df_2015.head()
# Drop the 2016 columns that are not used in the analysis.
df_2016.drop(
    [
        'OSEBuildingID', 'DataYear', 'BuildingType', 'PrimaryPropertyType',
        'PropertyName', 'Address', 'City', 'State', 'ZipCode',
        'TaxParcelIdentificationNumber', 'ListOfAllPropertyUseTypes',
        'LargestPropertyUseType', 'LargestPropertyUseTypeGFA', 'ENERGYSTARScore',
        'Neighborhood', 'CouncilDistrictCode', 'SecondLargestPropertyUseType',
        'SecondLargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType',
        'ThirdLargestPropertyUseTypeGFA', 'YearsENERGYSTARCertified', 'Comments',
        'Outlier', 'DefaultData', 'ComplianceStatus',
    ],
    axis=1,
    inplace=True,
)
# head() returns the first n rows of the DataFrame (5 by default), along with the column headers.
df_2016.head()
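The two drop calls above use slightly different column names for each year ('Comment' in 2015, 'Comments' in 2016), and `DataFrame.drop` raises a `KeyError` by default when a listed label is absent. A more forgiving variant could look like the sketch below. The helper `drop_existing` and the small `demo` frame are hypothetical, introduced only for illustration. It relies on the `errors="ignore"` option of `drop` and returns a new frame rather than modifying in place.

```python
import pandas as pd

def drop_existing(df, columns):
    """Return a copy of df without the listed columns, plus the names not found."""
    missing = [c for c in columns if c not in df.columns]
    # errors="ignore" skips labels that are absent instead of raising KeyError.
    trimmed = df.drop(columns=columns, errors="ignore")
    return trimmed, missing

# Small illustrative frame (not the real data).
demo = pd.DataFrame({
    "OSEBuildingID": [1, 2],
    "Comment": ["a", "b"],
    "ENERGYSTARScore": [65.0, 80.0],
})

trimmed, missing = drop_existing(demo, ["Comment", "Comments"])
print(list(trimmed.columns))
print(missing)
```

Reporting the names that were not found makes it easier to spot a typo or a column that was renamed between survey years, which a silent `errors="ignore"` would otherwise hide.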