Data_Wrangling_Capstone_SUSS

🛍️ Data Wrangling Project: Customer Satisfaction Analysis for a New Shopping Mall

This repository contains the code, SQL queries, and visualisation assets developed and submitted in April 2025 for the End-of-Course Assessment Data Wrangling at the Singapore University of Social Sciences (SUSS).
The project focuses on processing, cleaning, transforming, and visualising survey data collected from mall visitors in 2024.

🎯 Project Objective

To clean, transform, and summarise mall visitor survey data, and provide a monthly summary of customer satisfaction indicators. The project tasks include:

Importing and storing raw survey data into MySQL
Cleaning inconsistent date formats
Optimising data types for efficient storage
Creating a summary table with monthly averages and derived KPIs
Visualising the data for business interpretation

🛠️ Technologies Used

Language: Python, SQL (MySQL), R
Packages (R): DBI, RMySQL, ggplot2, dplyr, scales
Tools: RStudio, MySQL Workbench, Jupyter Notebook

Codes Explained

Python script to access the dataset and store it as a MySQL table in MySQL database
MySQL statement to clean up inconsistent date format

In this code, the doi column values have been updated to consistently use YYYY-MM-DD format.

Alter Table statement to optimise data type for efficient storage

For satis, confirm, ideal and handle, these four columns correspond to respondent’s rating on a scale of 1-10. Therefore, using tinyint is space efficient (1 byte) and UNSIGNED restrict it to a non-negative value with a range of 0 to 255.

For comp, it has an input value of either 0 or 1, hence, using tinyint is also justified for saving storage space.

For nocomp, it has an input value from 1 to 4, hence it is optimal to use tinyint as data type to achieve space efficiency.

MySQL statement to create summary table

A new table was created containing 12 rows, one for each month of the year, and 7 columns. The mth column contains an integer representing the month of the year.
The satis, confirm, ideal, comp, handle columns contain the corresponding month average values for those variables.

R code to read in summary table and store it as an R data frame

This R code read in the summary table and store it as an R data frame.

R code for visualization

A single visual was created to make it easy for the mall manager to understand the information in the table.

Visualisation

To visualize multiple variables of different scales over time (months) from the ECA_summary dataset, I evaluated options like a dual-axis line chart and a facet grid of line charts, as suggested by ChatGPT. The advantages of a dual-axis line chart are it could plot the six variables in ECA_summary in one single chart as separate lines in different colors and the x-axis representing the months. The dual y-axes will take care of the different scale of the variables, TNCR in percentage and the other variables in scale of 0-10. The disadvantages of a dual-axis line chart are that it could become cluttered to have six different variables in a single chart, especially if the lines overlapped. Furthermore, the dual y-axes will add to the clutter and confuse viewer due to the different scale. On the other hand, the advantages of a facet grid of line chart are clutter-free individual chart for each variable and reduce confusion as it is easier to read and interpret as there is only one y-axis. The disadvantages of a dual-axis line chart are that it is harder to compare the different variables since they are in sperate panels and less likely for user to spot correlation between the variables. In conclusion, I have chosen a facet grid of line charts as it offers clutter-free and easy to understand visualization, especially without the dual axes to add to the chart clutter. In addition, the customer experience manager of the mall is likely to want to view the individual variable on its own and very rarely will require to compare those variables against each other.

Back to Project main page