This is the second and final part of the project developed and submitted in August 2025 for the End-of-Course Assignment in the Analysing Data for Business Success course at Republic Polytechnic. We will build a churn prediction model using KNIME and evaluate its results. You can find the first part of the project here.
In KNIME, create this node sequence:
CSV Reader → Column Filter → Partitioning → Decision Tree Learner → Decision Tree Predictor → Scorer (JavaScript)
In the CSV Reader node, load the customer summary data created in the first part of the project.
In the Column Filter node, exclude Customer ID and keep Customer Age, Total_Purchases, Transaction_Count, Return_Rate, and Churn.
In the Partitioning node, use a 70/30 split: 70% of the rows for training and 30% for testing.
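For readers who prefer code, the random split the Partitioning node performs can be sketched in a few lines of Python (the function name, seed, and sample data below are illustrative, not part of the KNIME workflow):

```python
import random

def partition(rows, train_frac=0.7, seed=42):
    """Shuffle rows and split them into train/test sets,
    mirroring what KNIME's Partitioning node does with
    'random' sampling."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))          # stand-in for the customer summary rows
train, test = partition(rows)
print(len(train), len(test))     # 70 30
```

A fixed seed makes the split reproducible, which is also why KNIME's node offers a "use random seed" option.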
In the Decision Tree Learner node, make sure the Target column is set to Churn.
In the Decision Tree Predictor node, increase the “Maximum number of stored patterns” to 55,000 so the whole dataset fits.
The confusion matrix showed the model was strong in predicting class 0 (81%) but performed very poorly in predicting class 1 (19%).
This is not a good classification performance and can even be dangerous if predicting Class 1 is important (e.g. fraud detection, medical diagnosis).
This is a result of class imbalance in the dataset as there are many more active customers (0) than churned customers (1) in the dataset.
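To see how imbalance produces exactly this pattern, here is a small Python sketch that computes class-wise recall from a confusion matrix; the counts are invented to mirror the 81%/19% figures above, not taken from the project's actual output:

```python
def class_recall(cm, cls):
    """Recall for class `cls` from a confusion matrix stored as
    {(actual, predicted): count}."""
    correct = cm.get((cls, cls), 0)
    total = sum(n for (actual, _), n in cm.items() if actual == cls)
    return correct / total

# Illustrative counts only: many class-0 rows, few class-1 rows.
cm = {
    (0, 0): 8100, (0, 1): 1900,   # actual class 0 (active customers)
    (1, 0): 810,  (1, 1): 190,    # actual class 1 (churners)
}
print(class_recall(cm, 0))  # 0.81
print(class_recall(cm, 1))  # 0.19
```

Overall accuracy can look respectable here simply because class 0 dominates the data, which is why per-class recall is the metric to watch.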
To rebalance the dataset, we will add a SMOTE node and switch the Decision Tree model to a Random Forest.
In the SMOTE node, set the target column to Churn and keep the default settings.
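KNIME's SMOTE node hides the mechanics, but the core idea is simple: synthesize new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. A minimal sketch in Python (function name, parameters, and sample data are illustrative):

```python
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority points,
    each placed at a random spot on the line between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy 2-D "churner" feature vectors
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote_oversample(minority, n_new=4)
print(len(new_points))  # 4
```

Because each synthetic point lies between two real minority samples, SMOTE adds plausible new churners rather than simply duplicating existing rows.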
Before balancing, the model was missing most of the churners (Class 1). After applying SMOTE, performance improved markedly. For Class 1, there are 3,806 true positives (correctly predicted churners) and 5,106 false negatives (customers who actually churned but whom the model failed to flag). With recall at 42%, the model still misses more than half of the customers who are actually at risk of churning.
The model is now better at identifying potential churners, but not perfect. We can still make actionable business recommendations to reduce churn, focusing on those correctly identified churners (TP = 3,806) and considering the missed churners (FN = 5,106).
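The recall figure can be verified directly from the confusion-matrix counts quoted above:

```python
# Class-1 (churn) counts from the post-SMOTE confusion matrix.
tp = 3806  # churners correctly predicted
fn = 5106  # churners the model missed

recall = tp / (tp + fn)
print(f"Class 1 recall: {recall:.1%}")  # 42.7%
```

The same arithmetic explains the "misses more than half" claim: 5,106 of the 8,912 actual churners go undetected.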
Out of the input variables considered by the Random Forest, Transaction_Count has the highest split count at level 2 among all features. This suggests that how often a customer transacts is the strongest predictor in the model, followed closely by Return_Rate and Avg_Order_Value as highly influential features.
In summary, customer transaction frequency is the key driver of the model. Return behaviour and spending amount per order add significant predictive power.
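Reading feature importance off split counts amounts to a simple tally of which feature each tree split on. A toy sketch (the split records below are invented for illustration, not the project's actual forest):

```python
from collections import Counter

# Hypothetical record of which feature each split in the forest used.
splits = [
    "Transaction_Count", "Return_Rate", "Transaction_Count",
    "Avg_Order_Value", "Transaction_Count", "Return_Rate",
]

# Features that appear in more splits contribute more to the model's decisions.
importance = Counter(splits)
print(importance.most_common())
```

KNIME's Random Forest Learner exposes this same information in its attribute-statistics output, which is where the level-2 split counts above come from.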