Linear Regression

Author

Mahesh Divakaran & Dr. Jayadevan Sreedharan

1. Loading the Necessary Libraries

# Load necessary libraries
library(readxl)      # For reading Excel files
library(ggplot2)     # For data visualization
library(car)         # For assumptions testing
Loading required package: carData
library(broom)       # For tidying model outputs
library(dplyr)       # For data manipulation

Attaching package: 'dplyr'
The following object is masked from 'package:car':

    recode
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(psych)       # For descriptive statistics

Attaching package: 'psych'
The following object is masked from 'package:car':

    logit
The following objects are masked from 'package:ggplot2':

    %+%, alpha

2. Reading the Data

# Read the Excel file
file_path <- "data/LInear_data.xlsx"
data <- read_excel(file_path)

# Display the first few rows of the data
head(data)
# A tibble: 6 × 9
  SL_No `Mother age` Mother_BMI Gestational_age Smoking Prenatal_visit DM   
  <dbl>        <dbl>      <dbl>           <dbl> <chr>            <dbl> <chr>
1     1           33       29.1              39 No                  12 No   
2     2           19       30.9              31 Yes                  8 Yes  
3     3           37       26.7              38 No                   3 No   
4     4           33       24.1              36 No                   9 No   
5     5           23       29.5              39 No                  11 No   
6     6           29       18.5              38 Yes                  5 Yes  
# ℹ 2 more variables: HT <chr>, Birth_weight <dbl>

The dataset was successfully imported from an Excel file into R for analysis. It comprises variables such as Mother_age, Mother_BMI, Gestational_age, Smoking, Prenatal_visit, DM, HT, and Birth_weight. These variables represent various maternal and prenatal factors, with Birth_weight being the outcome variable of interest. The data contains [X] observations and [Y] variables, all of which will be used in the subsequent regression analysis.

The dataset contains the following columns:

  1. SL_No: Serial number (not used in analysis).

  2. Mother age: Age of the mother.

  3. Mother_BMI: Body Mass Index (BMI) of the mother.

  4. Gestational_age: Gestational age in weeks.

  5. Smoking: Whether the mother smoked during pregnancy (Yes/No).

  6. Prenatal_visit: Number of prenatal visits.

  7. DM: Presence of diabetes mellitus (Yes/No).

  8. HT: Presence of hypertension (Yes/No).

  9. Birth_weight: Birth weight of the baby in grams (dependent variable).

3. Descriptive Statistics

Before proceeding with the regression analysis, basic descriptive statistics were calculated to summarize the data. The mean, median, standard deviation, minimum, and maximum values provide an overview of the central tendency, spread, and range of each variable.

# Get a summary of the data
summary(data)
     SL_No         Mother age      Mother_BMI    Gestational_age
 Min.   : 1.00   Min.   :19.00   Min.   :18.29   Min.   :29.00  
 1st Qu.:23.75   1st Qu.:26.00   1st Qu.:21.39   1st Qu.:33.00  
 Median :46.50   Median :30.00   Median :25.34   Median :36.00  
 Mean   :46.50   Mean   :30.71   Mean   :25.65   Mean   :35.65  
 3rd Qu.:69.25   3rd Qu.:36.00   3rd Qu.:29.54   3rd Qu.:38.00  
 Max.   :92.00   Max.   :43.00   Max.   :33.90   Max.   :41.00  
   Smoking          Prenatal_visit        DM                 HT           
 Length:92          Min.   : 0.000   Length:92          Length:92         
 Class :character   1st Qu.: 4.000   Class :character   Class :character  
 Mode  :character   Median : 6.000   Mode  :character   Mode  :character  
                    Mean   : 6.326                                        
                    3rd Qu.: 9.000                                        
                    Max.   :13.000                                        
  Birth_weight 
 Min.   :2802  
 1st Qu.:3319  
 Median :3489  
 Mean   :3490  
 3rd Qu.:3656  
 Max.   :3963  
# Get detailed descriptive statistics
describe(data)
                vars  n    mean     sd  median trimmed    mad     min     max
SL_No              1 92   46.50  26.70   46.50   46.50  34.10    1.00   92.00
Mother age         2 92   30.71   6.74   30.00   30.64   7.41   19.00   43.00
Mother_BMI         3 92   25.65   4.70   25.34   25.56   6.12   18.29   33.90
Gestational_age    4 92   35.65   3.30   36.00   35.72   4.45   29.00   41.00
Smoking*           5 92    1.54   0.50    2.00    1.55   0.00    1.00    2.00
Prenatal_visit     6 92    6.33   3.02    6.00    6.27   2.97    0.00   13.00
DM*                7 92    1.32   0.47    1.00    1.27   0.00    1.00    2.00
HT*                8 92    1.86   0.99    1.00    1.82   0.00    1.00    3.00
Birth_weight       9 92 3490.45 243.73 3488.76 3488.83 259.39 2801.71 3963.44
                  range  skew kurtosis    se
SL_No             91.00  0.00    -1.24  2.78
Mother age        24.00  0.06    -0.97  0.70
Mother_BMI        15.61  0.09    -1.28  0.49
Gestational_age   12.00 -0.12    -1.07  0.34
Smoking*           1.00 -0.17    -1.99  0.05
Prenatal_visit    13.00  0.22    -0.85  0.31
DM*                1.00  0.78    -1.40  0.05
HT*                2.00  0.28    -1.93  0.10
Birth_weight    1161.73 -0.01    -0.51 25.41

For instance, the mean Birth_weight was found to be [X] grams, with a standard deviation of [Y] grams, indicating considerable variability in the birth weights within the dataset. Such variability is essential to consider when assessing the relationship between the predictors and the outcome variable.

4. Visualizing the Data

Scatter plots were created to visualize the relationships between Birth_weight and each predictor variable. These plots allow for a preliminary examination of potential linear relationships.

# Pairwise scatter plots for initial visualization
pairs(~ Birth_weight + `Mother age` + Mother_BMI + Gestational_age + Prenatal_visit, data = data)

For example, the scatter plot between Gestational_age and Birth_weight suggests a positive linear relationship, meaning that as the gestational age increases, birth weight tends to increase as well. Conversely, Smoking appears to have a negative relationship with Birth_weight, suggesting that maternal smoking may lead to lower birth weights. Such visualizations are crucial for identifying patterns and potential outliers that might influence the model’s accuracy.

# Scatter plot of the data
ggplot(data, aes(x = Gestational_age, y = Birth_weight)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Scatter Plot with Linear Regression Line",
       x = 'Mother age',
       y = "Birth Weight") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

5. Checking Assumptions

5.1. Linearity

Scatterplots of the dependent variable against each independent variable are plotted to assess the linearity assumption. The linearity assumption was checked using scatter plots, which revealed that the relationship between most predictors, such as Gestational_age, and the outcome variable Birth_weight is approximately linear. This supports the use of linear regression for this analysis. Non-linear relationships would require a different modeling approach, but in this case, the linearity assumption appears to be reasonably satisfied.

# Scatter plot to check linearity
ggplot(data, aes(x = `Mother age`, y = Birth_weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Checking Linearity for Mother age",
       x = "Mother age",
       y = "Birth weight")
`geom_smooth()` using formula = 'y ~ x'

# Repeat for other variables...

The scatterplots show a roughly linear relationship between the dependent variable and each of the independent variables, supporting the linearity assumption.

5.2. Normality of Residuals

The normality of the residuals was assessed using a Q-Q plot.

# Fitting the model
model <- lm(Birth_weight ~ `Mother age` + Mother_BMI + Gestational_age + Smoking + Prenatal_visit + DM + HT, data = data)

# Plotting residuals to check normality
ggplot(model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(col = "red") +
  labs(title = "Q-Q Plot of Residuals")

The residuals mostly follow the diagonal line, suggesting that they are approximately normally distributed. This is important because normally distributed residuals validate the use of the linear regression model and ensure that the statistical tests for significance are reliable.

5.3. Homoscedasticity

The assumption of homoscedasticity, which means that the variance of the residuals is constant across all levels of the independent variables, was checked using a Residuals vs Fitted plot.

# Plot to check homoscedasticity
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals")
`geom_smooth()` using formula = 'y ~ x'

The plot does not display any discernible pattern, indicating that the assumption is met. This is crucial for ensuring that the model’s predictions are equally reliable across different levels of the predictors.

5.4. No Multicollinearity

To detect any multicollinearity among the predictors, the Variance Inflation Factor (VIF) was calculated.

# Check VIF (Variance Inflation Factor)
vif(model)
                    GVIF Df GVIF^(1/(2*Df))
`Mother age`    1.107833  1        1.052536
Mother_BMI      1.065599  1        1.032278
Gestational_age 1.050315  1        1.024849
Smoking         1.057255  1        1.028229
Prenatal_visit  1.094781  1        1.046318
DM              1.091795  1        1.044890
HT              1.145505  2        1.034545

All VIF values were found to be below 5, indicating that multicollinearity is not a concern in this dataset. This suggests that the predictor variables are not excessively correlated, allowing for a more stable and interpretable model.

6. Fitting the Linear Regression Model

The linear regression model was fitted with Birth_weight as the dependent variable and Mother_age, Mother_BMI, Gestational_age, Smoking, Prenatal_visit, DM, and HT as the independent variables. The summary of the model provides the estimated coefficients, R-squared value, and significance levels.

# Fit the linear model
model <- lm(Birth_weight ~ `Mother age` + Mother_BMI + Gestational_age + Smoking + Prenatal_visit + DM + HT, data = data)

# Summary of the model
summary(model)

Call:
lm(formula = Birth_weight ~ `Mother age` + Mother_BMI + Gestational_age + 
    Smoking + Prenatal_visit + DM + HT, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-427.64  -43.44   13.25   57.09  225.78 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1016.260    147.821   6.875 1.07e-09 ***
`Mother age`       2.179      1.647   1.323   0.1895    
Mother_BMI        12.576      2.318   5.425 5.61e-07 ***
Gestational_age   59.503      3.280  18.140  < 2e-16 ***
SmokingYes      -125.890     21.662  -5.812 1.11e-07 ***
Prenatal_visit     7.159      3.655   1.959   0.0535 .  
DMYes             46.831     23.600   1.984   0.0505 .  
HTNo|              3.081    104.413   0.030   0.9765    
HTYes            -67.036     22.152  -3.026   0.0033 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 100.7 on 83 degrees of freedom
Multiple R-squared:  0.8445,    Adjusted R-squared:  0.8295 
F-statistic: 56.33 on 8 and 83 DF,  p-value: < 2.2e-16

The R-squared value indicates that 84.45% of the variability in Birth_weight is explained by the predictors in the model. Significant predictors, such as Gestational_age and Smoking, provide insights into how these factors influence birth weight. For instance, a positive coefficient for Gestational_age suggests that longer gestation periods are associated with higher birth weights, while a negative coefficient for Smoking indicates that maternal smoking is associated with lower birth weights.

7. Interpreting the Results

# Tidy output
tidy(model)
# A tibble: 9 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      1016.      148.      6.87   1.07e- 9
2 `Mother age`        2.18      1.65    1.32   1.89e- 1
3 Mother_BMI         12.6       2.32    5.42   5.61e- 7
4 Gestational_age    59.5       3.28   18.1    1.29e-30
5 SmokingYes       -126.       21.7    -5.81   1.11e- 7
6 Prenatal_visit      7.16      3.65    1.96   5.35e- 2
7 DMYes              46.8      23.6     1.98   5.05e- 2
8 HTNo|               3.08    104.      0.0295 9.77e- 1
9 HTYes             -67.0      22.2    -3.03   3.30e- 3
# Confidence intervals
confint(model)
                       2.5 %      97.5 %
(Intercept)      722.2495347 1310.271008
`Mother age`      -1.0967909    5.455035
Mother_BMI         7.9649441   17.186303
Gestational_age   52.9790986   66.027646
SmokingYes      -168.9742053  -82.806167
Prenatal_visit    -0.1109154   14.428255
DMYes             -0.1085250   93.770216
HTNo|           -204.5921769  210.755050
HTYes           -111.0950356  -22.976930
# ANOVA table
anova(model)
Analysis of Variance Table

Response: Birth_weight
                Df  Sum Sq Mean Sq  F value    Pr(>F)    
`Mother age`     1   19509   19509   1.9258 0.1689375    
Mother_BMI       1  119124  119124  11.7589 0.0009453 ***
Gestational_age  1 3860915 3860915 381.1170 < 2.2e-16 ***
Smoking          1  364923  364923  36.0221 4.929e-08 ***
Prenatal_visit   1   64403   64403   6.3573 0.0136023 *  
DM               1   43031   43031   4.2476 0.0424355 *  
HT               2   93173   46587   4.5986 0.0127627 *  
Residuals       83  840834   10131                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8. Visualization of Results

# Plot the regression line with the data
ggplot(data, aes(x = `Mother age`, y = Birth_weight)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Linear Regression Line for Mother age",
       x = "Mother age",
       y = "Birth weight") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# Repeat for other variables...

Residual vs Fitted Plot

This plot shows the residuals on the y-axis and leverage on the x-axis. Leverage indicates how far away the independent variable values of an observation are from those of the other observations. High leverage points have more influence on the model’s coefficients.

plot(model, which=1)

Normal Q-Q Plot

The Q-Q (Quantile-Quantile) plot compares the distribution of the residuals to a theoretical normal distribution. It plots the quantiles of the residuals against the quantiles of a standard normal distribution.

plot(model, which=2)
Warning: not plotting observations with leverage one:
  72

Model Diagnostics

Cook’s Distance

Cook’s distance is a measure used to identify influential data points. It combines information on both leverage and residuals to indicate observations that are affecting the model fit.

plot(model, which=4)

Leverage Plot

This plot identifies points with high leverage.

plot(model, which=5)
Warning: not plotting observations with leverage one:
  72

Prediction

# Create a sample data
new_data <- data[sample(nrow(data), 15, replace = TRUE), ]


predict(model, newdata = new_data, interval = "confidence")
        fit      lwr      upr
1  3194.963 3133.327 3256.599
2  3536.656 3469.583 3603.730
3  3204.934 3142.700 3267.168
4  3424.808 3377.324 3472.293
5  3136.332 3062.407 3210.257
6  3809.255 3755.977 3862.533
7  3386.338 3332.731 3439.945
8  3482.464 3437.689 3527.238
9  3257.200 3193.901 3320.500
10 3535.538 3483.099 3587.978
11 3808.048 3757.205 3858.891
12 3053.999 2986.479 3121.519
13 3694.964 3626.323 3763.605
14 3764.027 3704.565 3823.489
15 3557.803 3510.200 3605.405

9. Assumptions Review

# Summary of assumptions checks
assumptions <- list(
  "Linearity" = "Checked with scatter plots",
  "Normality" = "Checked with Q-Q plot of residuals",
  "Homoscedasticity" = "Checked with Residuals vs Fitted plot",
  "No Multicollinearity" = "Checked with VIF"
)

assumptions
$Linearity
[1] "Checked with scatter plots"

$Normality
[1] "Checked with Q-Q plot of residuals"

$Homoscedasticity
[1] "Checked with Residuals vs Fitted plot"

$`No Multicollinearity`
[1] "Checked with VIF"

10. Save the Results (Optional)

# Save the model summary to a text file
capture.output(summary(model), file = "model_summary.txt")