top of page
Search

Nashville Housing Data Cleaning

Updated: Nov 17, 2023

In this project, I wrote SQL queries in Microsoft SQL Server Management Studio to clean and make the Nashville, Tennessee Housing Data more usable.


METHODS


Standardize Date Format


For the column 'SalesDate', I wanted to change the date format by removing the timestamp and leaving only the year, month, and day (yyyy-mm-dd). In the initial query I wrote to convert the date format, I used the CONVERT function to create a new value with the date in the new format. I then attempted to UPDATE the database to SET the 'SalesDate' column equal to the converted date value, however this did not change the date format.


When this didn't work, I wrote a new query where I created an ALTER TABLE statement to ADD a new empty column named 'SaleDateConverted'. I then successfully used UPDATE to SET the 'SalesDateConverted' column equal to the converted date value.



Populate Property Address Data


In the column for Property Address, many rows were found to have NULL entries where the address should be. I wanted to update all rows containing a NULL entry in this column with the correct addresses using an appropriate reference. I sorted the table by Parcel ID and found that the address should be the same for rows with the same Parcel ID.


I chose to use a self join for the table, joining table 'a' to table 'b' on Parcel ID, and selected for the Parcel ID and Property address in both tables. I added the condition that the Parcel ID be NULL in table 'a', where the Unique ID in table 'a' was not equal to the Unique ID in the table 'b', in order to match with a property address that was not NULL. Once I confirmed that there was a correct address for every NULL entry in the Property Address column, I used UPDATE to SET every NULL entry to be equal to the output of the ISNULL function, where every NULL property address in table 'a' would be changed to the property address in table 'b' based on matching Parcel ID.



Breaking out Address into Individual Columns (Street, City, State)


After filling in every NULL property address, I wanted to create new columns where I break the property address and owner address into its individual parts for street, city, and state. We did this using two different methods:


1) Using the SUBSTRING function, I broke the property address into the street address and city. Within the SUBSTRING function, I used the CHARINDEX function to isolate and separate the street from the city with comma in the property address as the delimiter. I then used ALTER TABLE to add columns for the newly split property street address and city, then used UPDATE to SET each column equal to the corresponding substring values.


2) Using the PASENAME function, I broke the owner address into the street address, city, and state. Since PARSENAME automatically uses periods as the delimiter, I used REPLACE to change each comma in the owner address to a period so that PARSENAME would separate the string into street, city, and state based on the location of the period. I then used ALTER TABLE to add columns for the newly split owner street address, city, and state, then used UPDATE to SET each column equal to the corresponding substring values.



Change Y and N to Yes and No in "Sold as Vacant" field


After noticing inconsistencies in the naming of entries in the "Sold as Vacant" field, I selected for each DISTINCT value in the column along with the total COUNT of that value. The four distinct entries were 'Yes', 'No', 'Y', and 'N', with 'Yes' and 'No' being the two most common entries. I then used a CASE statement that changed the values 'Y' to 'Yes' and 'N' to 'No' respectively in the "Sold as Vacant" column. After confirming the CASE statement was successful, I used UPDATE to SET the "Sold as Vacant" column equal to the output of the CASE statement.



Remove Duplicates


In order to determine if there were rows with duplicate information, I ran a CTE named "RowNumCTE" with PARTITION BY in order to partition the temp table by unique information like Parcel ID property address, sale price, sale date, legal reference. The ROW_NUMBER function gave the row number for each line in each partition, so that any row that contained duplicated information would have a row number of 2 or higher. After confirming that rows with a number of 2 or higher were in fact duplicates, I used DELETE to remove the duplicate rows with a number greater than 1.



Delete Unused Columns


Lastly, in order to streamline the database, ALTER TABLE and DROP COLUMN were used to remove unused and unnecessary columns like tax district, sale date, and the original owner address and property address columns.



SUMMARY


In this project, I demonstrated how to use multiple SQL functions in order to clean and better organize a database for housing information in Nashville, Tennessee. I plan to use these skills and many more in future projects.



Click on the GitHub icon below to view the SQL script and access the raw data.



68 views

1 Comment


Aiswaryarani Dhal
Aiswaryarani Dhal
Oct 28, 2023

By any chance, can i hav an access to the data and get a clear picture ..jus for my knowledge

Like
bottom of page