🔍 Why Data Cleaning Matters
Raw data is often messy, inconsistent, and full of errors. Without cleaning, data visualizations can be misleading or incorrect.
For example:
- A company analyzing customer orders might find that names appear as “ada lovelace” and “Ada Lovelace”, making it hard to track orders correctly.
- Date inconsistencies like “01/01/24” vs. “2024-01-01” might cause incorrect sorting.
- Numeric values stored as text can prevent accurate calculations.
Cleaning data ensures accuracy, reliability, and usability for analysis and visualization.
Try-It | Messy Dataset Alert!
Okay, we know we just started this lesson, but we think you, our data detective, has what it takes to step in and clean up the mess.
- Look at the dataset and identify at least three inconsistencies.
- How might these errors affect analysis?
- What would you do to clean up these errors?
Solutions
- Inconsistent name formatting (e.g., lowercase names, missing last names)
- Inconsistent age formats (e.g., numbers as text, missing values)
- Inconsistent date formats (e.g., different separators, full month names)
- Inconsistent grade levels (e.g., mix of numbers and words)
- Inconsistent score values (e.g., numbers as text, letter grades, missing data)
Step 1: Load the Data
- Open Tableau Public in the browser.
- Click "Upload from computer" and select the Messy_Dataset.csv
- Browse and open the messy dataset.
- You may have to select ‘Update Now’ or ‘Create extract’
Step 2: Fix Name Formatting
🔴 Issues:
- Some names are in lowercase.
- Some entries are missing last names.
- It might be better to have a column for first names and a column for last names.
✅ Solution:
- In a sheet, select the
Namecolumn.
- Create a calculated field to capitalize names correctly:
- Right-click on the Name column
- Select Create > Calculate Field
- Name your new field
Name_Proper - Enter the following formula:
- Click Apply and then OK
What’s a calculated field? Calculated fields are a powerful way to level up your Tableau skills. A calculated field is a custom field you create using a formula to manipulate your data. When you create one, Tableau opens a formula editor, it kind of looks like SQL… but it’s not quite! Instead of standard SQL syntax, Tableau uses its own formula language.
Still, if you’ve used
IF, SUM, or LEFT in SQL, some of this will feel familiar. Tableau even helps you out by showing a list of all the possible functions you can use, like IF, ZNULL, LEFT, and more, so you don’t have to memorize everything.
PROPER([Name])
✨ This will create a new column with properly capitalized names
- Split the name into First Name and Last Name:
- Right-click on your Name_Proper field
- Select Create > Calculate Field
- Name your new field First_Name
- Enter the following formula:
- Click Apply and then OK
- This extracts the first word before any space as the first name
SPLIT([Name_Proper], " ", 1)
- Create a calculated field for Last Name:
- Right-click on your Name_Proper field
- Select Create > Calculate Field
- Name your new field
Last_Name - Enter the following formula:
- Click Apply and then OK
- This extracts everything after the first space as the last name
// For names with multiple parts, get everything after the first space IFNULL(SPLIT([Name_Proper], " ", 2), "Unknown")
- Handle missing last names:
- Use filters to find entries with empty
Last_Name: - Click on the
Last_Namecolumn - Click the filter icon (funnel symbol) in the toolbar
- Filter for blank values
- For records with missing last names, you could:
- Add a note in your documentation to manually review these entries
- Use external reference data if available
- ✨ Use "Unknown" as a placeholder if appropriate for your analysis. We’re going to use this method today! ✨
Step 3: Standardize Age Values
🔴 Issues:
- Some ages are stored as text (
"seventeen","Sixteen").
- Some values are missing.
✅ Solution:
- Select the
Agecolumn.
- Create a calculated field to convert text to numbers:
- Right-click on the
Agecolumn - Select Create > Calculate Field
- Name your new field
Age_Standardized - Enter the following formula:
CASE [Age] WHEN "seventeen" THEN 17 WHEN "Sixteen" THEN 16 WHEN "Fifteen" THEN 15 ELSE INT([Age]) // Converts numbers stored as text END
- Replace missing values using an estimated average or interpolation.
Step 4: Standardize Join Dates
🔴 Issues:
- Mixed formats (
YYYY-MM-DD,MM/DD/YYYY,Month DD, YYYY).
- Inconsistent separators (
/, ,.).
✅ Solution:
- Create a calculated field to standardize date formats:
- Right-click on the Join Date column
- Select Create > Calculate Field
- Name your new field
Join_Date_Standardized - Enter the following formula:
- Click Apply and then OK
- This will create a new column with standardized date values
DATE([Join Date])
If Tableau doesn't detect formats automatically, create a more complex calculated field:
- Right-click on the Join Date column
- Select Create > Calculate Field
- Name your new field
Join_Date_Standardized
- Enter a formula that handles multiple formats:
// Try to parse the date using different formats CASE IF DATEPARSE("yyyy-MM-dd", [Join Date]) IS NOT NULL THEN DATEPARSE("yyyy-MM-dd", [Join Date]) ELSEIF DATEPARSE("MM/dd/yyyy", [Join Date]) IS NOT NULL THEN DATEPARSE("MM/dd/yyyy", [Join Date]) ELSEIF DATEPARSE("MMMM dd, yyyy", [Join Date]) IS NOT NULL THEN DATEPARSE("MMMM dd, yyyy", [Join Date]) ELSE NULL END
- Click Apply and then OK
- After creating your calculated field, right-click on the new field
- Select Format
- In the Format pane, select Date
- Choose your preferred date format (e.g., YYYY-MM-DD)
- Click OK
Step 5: Normalize Grade Levels
🔴 Issues:
- Grades appear in mixed formats (
"10","Tenth","11th").
✅ Solution:
- Select the Grade column.
- Create a calculated field to standardize grade levels to a consistent text format:
- Right-click on the Grade column
- Select Create > Calculate Field
- Name your new field
Grade_Norm - Enter the following formula:
- Click Apply and then OK
- This will create a new column with standardized grade values
CASE [Grade] WHEN "Ninth" THEN "9th" WHEN "Tenth" THEN "10th" ELSE [Grade] // Keep correct values END
- Remove text suffixes to create a numeric-only field:
- Right-click on your Grade_Norm field
- Select Create > Calculate Field
- Name your new field
Grade_Number - Enter the following formula:
- Click Apply and then OK
// Extract numeric part from grade values by removing suffixes REPLACE(REPLACE(REPLACE([Grade_Norm], "th", ""), "nd", ""), "rd", "")
Why are we creating a numeric-only grade field? This allows us to do calculations with them, like calculating the average or median grade level. This is especially helpful for spotting trends or summarizing data by grade.
Step 6: Fix Programming Language Formatting
🔴 Issues:
- Extra spaces (
"Python "vs."Python").
- Inconsistent capitalization (
"java"vs."Java").
- Multiple values combined (
"HTML & CSS").
✅ Solution:
- Select the
Favorite Languagecolumn.
- Create a calculated field to remove extra spaces:
- Right-click on the Favorite Language column
- Select Create > Calculate Field
- Name your new field
Language_Trimmed - Enter the following formula:
- Click Apply and then OK
- This removes leading and trailing spaces from the values
TRIM([Favorite Language])
- Create another calculated field to fix capitalization:
- Right-click on your
Language_Trimmedfield - Select Create > Calculate Field
- Name your new field
Language_Proper - Enter the following formula:
- Click Apply and then OK
- This capitalizes the first letter of each word
PROPER([Language_Trimmed])
- Create a calculated field to separate multiple languages:
- Right-click on your Language_Proper field
- Select Create > Calculate Field
- Name your new field Primary_Language
- Enter a formula to split at common separators:
- Click Apply and then OK
- This extracts just the first language before any "&" symbol
// Extract first language when multiple are listed SPLIT([Language_Proper], "&", 1)
Practice | Club Memberships Dataset Cleaning
Toggle to see how to export the file as .csv

The university's student office needs your help! They have a dataset of student club memberships, but the data is a bit inconsistent and needs cleaning before they can generate accurate reports on student engagement.
Toggle for an explanation of the columns
- StudentID: A unique identifier assigned to each student (e.g., S001, S002). This helps in distinguishing individual students.
- FullName: The full name of the student. You'll notice some inconsistencies here, like varied capitalization, which will need cleaning.
- EnrollDate: The date on which the student enrolled in a particular club. These dates are in various formats (e.g., MM/DD/YY, Month DD, YYYY, YYYY-MM-DD) and will require standardization.
- ClubName: The name of the student club the student is a member of (e.g., Photography Club, Debate Club).
- FeePaidStatus: Indicates whether the student has paid their club membership fee. This column contains inconsistent entries (e.g., "Paid", "YES", "N", "Unpaid", "pending") that need to be cleaned into categories.
- ActivityPoints: A numerical score representing the student's engagement level or participation in club activities.
Specifically, you’ll want to:
- standardize the names
- standardize the enrollment dates
- clean up the fee paid status
🌶️🌶️ Medium Challenge: Standardize Activity Points
The student office wants to understand overall student engagement better. They'd like you to create a new category for each member based on their
ActivityPoints.- Goal: Create a new calculated field named
ActivityLevelthat categorizes members as "Low", "Medium", or "High" based on theirActivityPoints. You can define the thresholds as follows: - Low:
ActivityPoints< 60 - Medium:
ActivityPointsbetween 60 and 80 (inclusive) - High:
ActivityPoints> 80
For a summary of this lesson, check out the 10. Data Cleaning with Tableau One-Pager!
