Victor Ezechukwu
6 min readDec 28, 2023

--

My Data Cleaning Journey In Excel

A raw dataset teeming with inconsistencies and missing values.
Raw dataset compiled by Kenji Explains

This week, I cleaned up this fictional raw dataset compiled by Kenji Explains. I transformed the raw dataset in the image above to the clean dataset below in ten steps.

The dataset purged of inconsistencies and errors
The final look of the dataset after cleaning

1. Autofit Rows and Columns

The first thing that stands out in the raw dataset is the text/numbers that take up more space vertically, and the text/numbers that are too wide for the default column width. To fix this, I pressed Ctrl-A to select the entire worksheet, and in the Home tab, I selected the Format command in the Cells group to access the autofit feature. I selected the autofit Row Height and Column Width, and Excel automatically adjusted the height of the rows and the width of the columns to fit the contents within them.

Autofit rows and columns
The look after using the autofit feature

2. Find and Replace
With a clear view of the dataset, I noticed that column C which contains the names of companies has an abbreviation of each company in a bracket. The abbreviation would not be useful in any future analysis, so I used the find-and-replace feature to remove them. I selected the first cell in column C and with Ctrl-Shift- selected the entire contents in the column. I clicked on the Home tab, and in the Editing group, I selected the Find and Select command, then the Replace option, which opened the Find and Replace window.

Find and replace

In the first option, I wrapped an asterisk in a bracket (*), in the second option, I left it blank, which meant finding everything in the bracket and replacing it with nothing. I clicked on Replace All and Excel did its magic.

The find and replace window

3. Lower & Upper
The contents in column C were all in uppercase, which was not consistent with the other contents in the dataset. I converted them to lowercase using the lower function. I selected column D and with Ctrl-Shift-+ created a new column next to column C. In the new column, I entered this formula =LOWER(C3). The formula tells Excel to convert the letters in cell C3 to lowercase.

Lower and Upper functions

After that, I populated the empty cells in the column using autofill. Now I had two columns with the same contents. I selected and copied the contents in column D and pasted them as a value. This is to ensure that the new column stops referencing the previous one as a formula. With that done, I deleted the column with uppercase.

Pasting the result as a value

4. Trim & Proper
The letters in column D were all over the place, with some letters poorly aligned, while some had a mix of uppercase and lowercase. This looked messy, so I fixed this using the Trim and Proper function nested together.
I created a new column next to column D and entered the formula =TRIM(PROPER(D3)). The TRIM function removed extra spaces in the text, leaving only spaces between words while the PROPER function capitalized the first letter in each word, and converted other letters to lowercase.

Trim and proper function

Afterward, I used autofill to populate the remaining empty cells, copied and pasted the new column as a value, and deleted the previous column just like I did earlier.

5. Text to Columns
Looking at column E, I noticed that it has departments and regions combined. I had to separate them into two different columns using the awesome Text to Columns feature.

Text to columns

I created a column next to column E and named it Regions. With cell E3 selected, I pressed Ctrl-Shift- to select everything in the column, clicked on the Data Tab, and in the Data Tools group, I selected Text to Columns.

Text to Columns has three windows. In the first window, I checked the box that asked if my data had a delimiter and clicked next.

Text to column’s first argument window

In the next window, I checked the "Other" box and entered the type of delimiter in my data in the box provided, which was an underscore (_).

Text to column’s second argument window

The last window helps determine the data format and destination of the data. I left everything here by default and clicked the Finish button, and the departments and regions were perfectly separated.

Text to column’s third argument window
The look after using the text-to-columns feature

6. Removing Duplicates
To remove duplicates, I selected all the data in the dataset using Ctrl-A. I clicked the Data Tab and in the Data Tools group, I selected Remove Duplicates. With all the boxes in the remove duplicates window checked, I clicked OK. Easy-peasy!

Removing duplicate values

7. Filling Empty Cells
With duplicates removed, having empty cells in columns G and H doesn’t look good, I set out to fill all the empty cells with a value to indicate missing data. To find the blank cells, I selected the entire dataset, and in the Home Tab, clicked on the Find and Select command in the Editing group, then click on the Go-To Special option, checked the box named “Blanks”, and pressed OK.

With the blank cells identified, I typed "NA" (Not Available) in the formula bar, and pressed the Ctrl and Enter keys simultaneously to fill the blank cells.

Filling the blank cells with a value

8. IFERROR
I finally got the chance to use a conditional statement I learned earlier. I used the IFERROR conditional statement to change the error cells in column J to "NA". To achieve this, I double-clicked on J3 to show the calculation in the cell, in the formula bar, I entered the formula =IFERROR(I3/H3, "NA"). The formula tells Excel to return the value “NA” if there is any error in the calculation.

Typing the IFERROR formula into the formula bar

Then I again used autofill to populate the cells in column J, giving me the desired result.

The look after using the IFFERROR formula

9. Formatting
With the dataset looking cleaner, I applied a simple formatting. I highlighted all the headers, made them bold, changed the background color of the cells to Blue, and changed the text to white to make the headers stand out.

Formatting the headers

10 Gridlines
Finally, I moved to the View Tab and unchecked the gridlines box to remove them. Leaving me with this clean dataset ready to be used for analysis.

The dataset purged of inconsistencies and errors
Final look after removing the gridlines

As always, thank you for going on this journey with me.

--

--

Victor Ezechukwu

Passionate Educator|| DataNewB|| Data Enthusiast|| Book Lover|| Book Reviews||