simple-linear-regression

This lab uses simple linear regression to explore the relationship between two continuous variables. The project will perform a complete simple linear regression analysis, which includes:

  1. creating and fitting a model,
  2. checking model assumptions,
  3. analyzing model performance,
  4. interpreting model coefficients,
  5. communicating results to stakeholders.

This activity will develop your knowledge of linear regression and your skill in evaluating regression results, which will help prepare you to build models that inform business recommendations.

This lab uses the marketing and sales dataset, which includes information about marketing campaigns across TV, radio, and social media, as well as how much sales revenue these campaigns generated. The features referenced in this write-up are TV, Radio, Social_Media, and Sales.

Each row corresponds to an independent marketing promotion where the business invests in TV, Social_Media, and Radio promotions to increase Sales.

The business would like to determine which feature most strongly predicts Sales so they have a better understanding of which promotions to invest in going forward. To accomplish this, you'll construct a simple linear regression model that predicts sales using a single independent variable.

EDA & data cleaning

EDA comes before fitting a simple linear regression model so that data problems, such as missing values, are caught first.

Before fitting the model, ensure the Sales for each promotion (i.e., row) is present. If the Sales in a row is missing, that row isn't of much value to the simple linear regression model.
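A minimal sketch of this cleaning step, assuming the data is loaded into a pandas DataFrame. The rows below are made-up stand-ins rather than values from the lab's dataset, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical rows standing in for the marketing data; the values are made up.
data = pd.DataFrame({
    "TV": [16.0, 13.0, 41.0, 83.0],
    "Radio": [6.5, 9.2, 15.9, 30.0],
    "Social_Media": [2.9, 2.4, 2.9, 6.9],
    "Sales": [54.7, None, 150.2, 298.2],
})

# Count rows where the dependent variable is missing.
missing_sales = data["Sales"].isna().sum()

# Keep only rows with an observed Sales value, then renumber the index.
data = data.dropna(subset=["Sales"]).reset_index(drop=True)

print(f"Dropped {missing_sales} row(s); {len(data)} remain.")
```

Passing `subset=["Sales"]` drops a row only when Sales itself is missing, so rows with gaps in other columns are left for a separate decision.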

Model building

Create a pairplot to visualize the relationships between pairs of variables in the data. You will use this to visually determine which variable has the strongest linear relationship with Sales. This will help you select the X variable for the simple linear regression.

In the pairplot, TV shows the strongest linear relationship with Sales, so TV was selected as the X variable for the simple linear regression.

Build and fit the model

The steps are:

  1. Define the OLS formula
  2. Create an OLS model using the ols() function
  3. Fit the model
  4. Save the results summary
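The four steps above can be sketched with the statsmodels formula API. The data here is synthetic (generated with a known slope of 3.5 plus random noise) and only stands in for the lab's dataset; names such as `ols_data` and `model_results` are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (an assumption for illustration, not the lab's dataset).
rng = np.random.default_rng(0)
tv = np.linspace(10, 100, 50)
ols_data = pd.DataFrame({"TV": tv, "Sales": 3.5 * tv + rng.normal(0, 2, size=tv.size)})

# 1. Define the OLS formula: dependent variable ~ independent variable.
ols_formula = "Sales ~ TV"

# 2. Create the OLS model object from the formula and the data.
OLS = ols(formula=ols_formula, data=ols_data)

# 3. Fit the model to estimate the intercept and slope.
model = OLS.fit()

# 4. Save the results summary for later inspection.
model_results = model.summary()
print(model_results)
```

The formula interface adds the intercept term automatically, so only the dependent and independent variables need to be named.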

Check model assumptions

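To justify using simple linear regression, check that the four linear regression assumptions are not violated. These assumptions are linearity, independent observations, normality, and homoscedasticity. Independence follows from the data description, since each row is an independent marketing promotion; the other three are checked in turn.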

Linearity

The linearity assumption requires a linear relationship between the independent and dependent variables. Check this assumption by creating a scatterplot comparing the independent variable with the dependent variable.

There is a clear linear relationship between TV and Sales, so the linearity assumption is met.
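A scatterplot for the linearity check might look like the following sketch, which uses matplotlib and a handful of made-up points rather than the real data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

# Made-up illustration values, not the lab's dataset.
tv = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sales = [35.1, 71.8, 106.2, 142.9, 177.4, 214.0]

fig, ax = plt.subplots()
ax.scatter(tv, sales)           # independent variable on x, dependent on y
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
ax.set_title("Sales vs. TV promotional budget")
```

Points that fall roughly along a straight band support the linearity assumption; a curved pattern would suggest otherwise.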

Normality

The normality assumption states that the errors are normally distributed.

Create two plots to check this assumption: a histogram of the residuals and a Q-Q plot of the residuals.

The histogram of the residuals is approximately normal, which supports that the normality assumption is met for this model. The residuals in the Q-Q plot fall along a straight line, further supporting that the normality assumption is met.

Homoscedasticity

The homoscedasticity (constant variance) assumption is that the residuals have a constant variance for all values of X.

Check that this assumption is not violated by creating a scatterplot with the fitted values and residuals. Add a line at y = 0 to visualize the variance of residuals above and below y = 0.

The variance of the residuals is consistent across all values of X. Thus, the assumption of homoscedasticity is met.
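A fitted-values-versus-residuals plot with a reference line at y = 0 can be sketched as follows, again on synthetic data that only stands in for the lab's dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (not the lab's dataset).
rng = np.random.default_rng(0)
tv = np.linspace(10, 100, 50)
ols_data = pd.DataFrame({"TV": tv, "Sales": 3.5 * tv + rng.normal(0, 2, size=tv.size)})

model = ols(formula="Sales ~ TV", data=ols_data).fit()

fig, ax = plt.subplots()
ax.scatter(model.fittedvalues, model.resid)  # one point per observation
ax.axhline(0, color="gray", linestyle="--")  # reference line at y = 0
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Fitted values vs. residuals")
```

A cloud of roughly even vertical spread around the line supports homoscedasticity, while a funnel or cone shape would indicate that the variance changes with the fitted values.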

Results and evaluation

Interpret model results

When TV is used as the independent variable, the estimated intercept is -0.1263 and the estimated slope is 3.5614. The general form of the model is:

Y = Intercept + Slope * X

Substituting the estimates gives:

Sales (in millions) = -0.1263 + 3.5614 * TV (in millions)

According to the model, when TV is used as the independent variable X, an increase of one million dollars in the TV promotional budget is associated with an estimated increase of 3.5614 million dollars in sales.
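As a quick worked check of this interpretation, the fitted equation can be written as a small function. The intercept and slope are the values reported above; the function name and the example budgets are illustrative.

```python
# Coefficients reported for the model with TV as the independent variable.
INTERCEPT = -0.1263
SLOPE = 3.5614


def predict_sales(tv_budget_millions: float) -> float:
    """Return estimated Sales (in millions) for a TV budget (in millions)."""
    return INTERCEPT + SLOPE * tv_budget_millions


# Compare two hypothetical budgets that differ by one million dollars.
base = predict_sales(10.0)
plus_one = predict_sales(11.0)
difference = plus_one - base

print(f"Estimated change in Sales: {difference:.4f} million")
```

Whatever starting budget is chosen, the difference equals the slope, because the intercept cancels out.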

R-squared interpretation:

Using TV as X results in a simple linear regression model with R-squared = 0.999. In other words, the model estimates that 99.9% of the variation in Sales is explained by the TV promotional budget alone, which makes TV a strong predictor of Sales. The R-squared value depends on the variable selected for X.
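To make "variation explained" concrete, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses only the standard library and made-up numbers; `r_squared` is a hypothetical helper, not a library function.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """Fit y = intercept + slope * x by least squares and return R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Closed-form least-squares estimates for simple linear regression.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left unexplained by the line vs. total variation in y.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


# Made-up illustration values, not the lab's dataset.
tv = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sales = [35.1, 71.8, 106.2, 142.9, 177.4, 214.0]
print(f"R-squared: {r_squared(tv, sales):.4f}")
```

With a fitted statsmodels result, the same quantity is available as the `rsquared` attribute.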

Interpretation of the p-value and confidence interval for the coefficient estimate of X

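When TV is used as the independent variable, its coefficient has a p-value reported as 0.000 (that is, it rounds to zero at three decimal places) and a 95% confidence interval of [3.558, 3.565]. The small p-value indicates that the slope is statistically significantly different from zero.
The confidence interval means that, if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true value of the slope. The narrow interval indicates little uncertainty in the estimate of the slope of X. Therefore, the business can be confident in the estimated relationship between TV and Sales.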
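The p-value and confidence interval can also be read directly from a fitted statsmodels result instead of the summary table. This sketch refits a model on synthetic data, so its numbers will not match the ones reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (not the lab's dataset).
rng = np.random.default_rng(0)
tv = np.linspace(10, 100, 50)
ols_data = pd.DataFrame({"TV": tv, "Sales": 3.5 * tv + rng.normal(0, 2, size=tv.size)})

model = ols(formula="Sales ~ TV", data=ols_data).fit()

slope = model.params["TV"]      # point estimate of the slope
p_value = model.pvalues["TV"]   # test of the null hypothesis that the slope is zero

# 95% confidence interval for the slope (alpha = 0.05).
ci_lower, ci_upper = model.conf_int(alpha=0.05).loc["TV"]

print(f"slope = {slope:.4f}, p-value = {p_value:.3g}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Printing the p-value with more significant digits than the summary table shows makes it clear that a reported 0.000 is a very small number rather than exactly zero.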
