  • Methods: Python
  • Data Analysis: Pandas, NumPy
  • Data Visualization: Matplotlib
  • Data Modeling: Scikit-Learn, OLS
  • Web Development: HTML, CSS

OLS Linear Regression Model Based on the NCSU.EDU Diabetes Dataset

  • This exercise explores the NCSU.EDU Diabetes Dataset with an OLS linear regression model. The model is used to predict the progression of diabetes one year after baseline. The coding exercise is provided in the Python Jupyter notebook below.
  • The NCSU Diabetes Data website notes: From Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499, we have "Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."
  • The OLS linear regression model summary using all ten baseline variables showed that several variables were statistically insignificant. For this exercise, a variable was treated as statistically insignificant when its P>|t| value was greater than 0.05. Additional information on the statsmodels OLS class can be found here: Statsmodels.org Regression.linear_model.OLS
  

Results

-The NCSU diabetes data covers 442 diabetes patients. For each patient there are ten baseline variables and a quantitative measure of disease progression one year after baseline.

-The first OLS linear regression model, fitted on all ten baseline variables, showed that several variables were statistically insignificant. For this exercise, a variable was treated as statistically insignificant when its P>|t| value was greater than 0.05.

-The second OLS linear regression model, using only the baseline variables sex, bmi, s3, and s5, yielded the best results, with an out-of-sample R-squared of 48.5%.

-A different type of model, such as a decision tree, random forest, or XGBoost, may yield a higher out-of-sample R-squared. However, the scope of this exercise is limited to OLS. A similar diabetes dataset is examined with XGBoost.


Contact

Please feel free to reach out through the following platforms: