This lab uses simple linear regression to explore the relationship between two continuous variables. The project will perform a complete simple linear regression analysis, which includes:
This activity will develop your knowledge of linear regression and your skills in evaluating regression results, which will help prepare you for modeling to provide business recommendations in the future.
This lab uses a marketing and sales dataset that includes information about marketing campaigns across TV, radio, and social media, as well as how much sales revenue these campaigns generated. The features in the data are:
Each row corresponds to an independent marketing promotion where the business invests in TV, Social_Media, and Radio promotions to increase Sales.
The business would like to determine which feature most strongly predicts Sales so that it better understands which promotions to invest in going forward. To accomplish this, you'll construct a simple linear regression model that predicts Sales using a single independent variable.
Some reasons for conducting EDA before constructing a simple linear regression model:
Before fitting the model, ensure that Sales is present for each promotion (i.e., row). A row with a missing Sales value is of little use to the simple linear regression model.
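The missing-value check described above could be sketched as follows. This is a minimal example that assumes the data is loaded into a pandas DataFrame; the rows below are made up for illustration and are not taken from the lab's dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the real dataset is not reproduced here.
data = pd.DataFrame({
    "TV": [16.0, 13.0, 41.0, 83.0],
    "Radio": [6.5, 9.2, 15.8, 30.0],
    "Social_Media": [2.9, 2.4, 2.9, 6.9],
    "Sales": [54.7, np.nan, 150.2, 298.2],
})

# Count missing Sales values before deciding how to handle them.
missing_sales = int(data["Sales"].isna().sum())

# Keep only the rows where Sales is present.
clean = data.dropna(subset=["Sales"]).reset_index(drop=True)

print(f"Rows with missing Sales: {missing_sales}")
print(f"Rows kept for modeling: {len(clean)}")
```

Dropping rows is only one option; it is reasonable here because a row without the target value cannot contribute to fitting the model.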
Create a pairplot to visualize the relationships between pairs of variables in the data. You will use this to visually determine which variable has the strongest linear relationship with Sales. This will help you select the X variable for the simple linear regression.
In the pairplot, TV clearly shows the strongest linear relationship with Sales, so TV was selected as the X variable for the simple linear regression.
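A numeric companion to the visual check is to compare the Pearson correlation of each feature with Sales. The sketch below is not the pairplot itself, and the values are invented for illustration; it only assumes the column names mentioned in this write-up.

```python
import pandas as pd

# Invented values for illustration only (not the lab's dataset).
data = pd.DataFrame({
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Radio": [4.0, 9.0, 8.0, 15.0, 14.0],
    "Social_Media": [3.0, 1.0, 4.0, 2.0, 5.0],
    "Sales": [36.0, 71.0, 107.0, 142.0, 178.0],
})

# Pearson correlation of every feature with Sales, strongest first.
corr_with_sales = (
    data.corr()["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)

# Use the absolute value so a strong negative relationship would also count.
strongest = corr_with_sales.abs().idxmax()

print(corr_with_sales)
print(f"Strongest linear relationship with Sales: {strongest}")
```

Correlation only summarizes linear association, so it complements the pairplot rather than replacing it: the plot can still reveal curvature or outliers that a single number hides.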
The steps are:
To justify using simple linear regression, check that the four linear regression assumptions are not violated. These assumptions are: linearity, independent observations, normality, and homoscedasticity.
The linearity assumption requires a linear relationship between the independent and dependent variables. Check this assumption by creating a scatterplot comparing the independent variable with the dependent variable.
There is a clear linear relationship between TV and Sales, so the linearity assumption is met.
The normality assumption states that the errors are normally distributed.
Create two plots to check this assumption: a histogram of the residuals and a Q-Q plot of the residuals.
The histogram of the residuals is approximately normal, which supports the normality assumption for this model. The residuals in the Q-Q plot fall along a straight line, which further supports it.
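The two normality checks can also be computed numerically before plotting. The sketch below assumes NumPy and SciPy are available and uses simulated residuals as a stand-in for the model's residuals; `scipy.stats.probplot` returns the Q-Q points together with the correlation of those points, which is close to 1 when they fall on a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in residuals drawn from a normal distribution (illustrative only).
residuals = rng.normal(loc=0.0, scale=3.0, size=200)

# Histogram counts: roughly bell-shaped bins suggest approximate normality.
counts, bin_edges = np.histogram(residuals, bins=15)

# Q-Q points: theoretical normal quantiles vs. ordered residuals.
# r is the correlation of the Q-Q points; values near 1 mean a straight line.
(theoretical_q, ordered_resid), (qq_slope, qq_intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"Histogram counts per bin: {counts.tolist()}")
print(f"Q-Q correlation: {r:.4f}")
```

The arrays `theoretical_q` and `ordered_resid` can be passed to any plotting library to draw the Q-Q plot itself.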
The homoscedasticity (constant variance) assumption is that the residuals have a constant variance for all values of X.
Check that this assumption is not violated by creating a scatterplot with the fitted values and residuals. Add a line at y = 0 to visualize the variance of residuals above and below y = 0.
The variance of the residuals is consistent across all values of X. Thus, the homoscedasticity assumption is met.
When TV is used as the independent variable, the intercept is -0.1263 and the slope is 3.5614.
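A fitted-values-versus-residuals plot of the kind described above could be built as in the following sketch. It assumes NumPy and Matplotlib, fits the line with `np.polyfit` rather than whatever library the lab used, and runs on synthetic data generated only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-in data (illustrative only, not the lab's dataset).
tv = rng.uniform(10, 100, size=150)
sales = -0.1 + 3.5 * tv + rng.normal(0, 3, size=150)

# Least-squares line; np.polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(tv, sales, 1)
fitted = intercept + slope * tv
residuals = sales - fitted

fig, ax = plt.subplots()
ax.scatter(fitted, residuals, s=12)
ax.axhline(0, color="red", linestyle="--")  # reference line at y = 0
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Fitted values vs. residuals")
```

Call `plt.show()` or `fig.savefig(...)` to view the figure. A cloud of points with roughly the same vertical spread along the whole x-axis is consistent with constant variance; a funnel or cone shape would suggest a violation.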
Y = Intercept + Slope * X
Sales (in millions) = -0.1263 + 3.5614 * TV (in millions)
According to the model, when TV is used as the independent variable X, an increase of one million dollars in the TV promotional budget is associated with an estimated increase of 3.5614 million dollars in sales.
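To make the slope interpretation concrete, the fitted equation can be evaluated at two budgets one million dollars apart. The coefficients below are the ones reported above; the function name `predict_sales` and the example budgets are hypothetical.

```python
# Coefficients reported for the model with TV as the independent variable.
INTERCEPT = -0.1263
SLOPE = 3.5614


def predict_sales(tv_millions: float) -> float:
    """Estimated sales (in millions) for a given TV budget (in millions)."""
    return INTERCEPT + SLOPE * tv_millions


pred_10 = predict_sales(10.0)  # approx. 35.4877
pred_11 = predict_sales(11.0)  # approx. 39.0491
difference = pred_11 - pred_10  # approx. 3.5614, i.e. the slope

print(f"Estimated sales at TV = 10: {pred_10:.4f}")
print(f"Estimated sales at TV = 11: {pred_11:.4f}")
print(f"Change per extra million in TV budget: {difference:.4f}")
```

The difference between the two predictions equals the slope regardless of the starting budget, which is what a constant per-unit effect means in a linear model.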
Using TV as X results in a simple linear regression model with R-squared = 0.999. In other words, TV explains 99.9% of the variation in Sales.
Nearly all of the variation in Sales can therefore be explained by the TV promotional budget alone, which makes TV an excellent predictor of Sales.
The R-squared value will depend on the variable selected for X.
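The dependence of R-squared on the chosen X can be illustrated by computing it directly as 1 minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch on synthetic data; the helper `r_squared` is hypothetical, and the simulated Radio variable is only constructed to be a weaker predictor than TV.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of the least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)       # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(2)

# Synthetic stand-in data (illustrative only, not the lab's dataset).
tv = rng.uniform(10, 100, size=200)
radio = 0.3 * tv + rng.normal(0, 8, size=200)  # only loosely tied to TV
sales = 3.5 * tv + rng.normal(0, 3, size=200)

r2_tv = r_squared(tv, sales)
r2_radio = r_squared(radio, sales)

print(f"R-squared with TV as X:    {r2_tv:.4f}")
print(f"R-squared with Radio as X: {r2_radio:.4f}")
```

For a simple linear regression with an intercept, this value equals the squared Pearson correlation between X and Y, which is why the variable with the strongest linear relationship also gives the highest R-squared.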
When TV is used as the independent variable, its slope has a p-value reported as 0.000 (it rounds to zero at three decimal places) and a 95% confidence interval of [3.558, 3.565].
This means the interval [3.558, 3.565] was produced by a procedure that captures the true slope in about 95% of repeated samples. The narrow interval indicates little uncertainty in the estimated slope. Therefore, the business can be confident in the estimated relationship between TV and Sales.
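A 95% confidence interval for the slope is the estimate plus or minus a t critical value times the slope's standard error. The sketch below works through that calculation with NumPy and SciPy on synthetic data; it illustrates the formula and does not reproduce the lab's numbers or its modeling library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-in data (illustrative only, not the lab's dataset).
x = rng.uniform(10, 100, size=100)
y = -0.1 + 3.5 * x + rng.normal(0, 3, size=100)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)

# Closed-form least-squares estimates.
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()

# Residual variance uses n - 2 degrees of freedom (two fitted parameters).
residuals = y - (intercept + slope * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)
se_slope = np.sqrt(sigma2 / sxx)

# Two-sided 95% interval: slope +/- t(0.975, n - 2) * SE(slope).
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_lower = slope - t_crit * se_slope
ci_upper = slope + t_crit * se_slope

print(f"Slope: {slope:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
```

The interval narrows as the sample size grows or the residual variance shrinks, which is why a model that fits very closely yields a tight interval around the slope.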