Kenneth Griffin, PhD, PE - Analyzing Diabetes with XGBoost

Methods: Python

Data Analysis: Pandas, Numpy

Data Visualization: Matplotlib

Data Modeling: sklearn, XGBoost

Web Development: HTML, CSS

Analyzing Diabetes with XGBoost

The objective of this project is to analyze a diabetes dataset using XGBoost in a Python Jupyter notebook.
The dataset contains the variables pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age, and outcome (binary: positive or negative).
XGBoost is an optimized, distributed gradient boosting library for implementing machine learning algorithms under the gradient boosting framework. More information can be found here: XGboost.ReadtheDocs.io
Tables and graphs can be found here:

XGBoost Diabetes Python Jupyter Notebook (7/30/2024 Update)

Results

The dataset analyzed in this project contains 768 entries covering the variables pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age, and outcome (binary: positive or negative).

-The pandas describe function shows that participant ages range from 21 to 81 years.

-The dataset is made up of 34.9% (268) positive and 65.1% (500) negative participants.

-The data was split into 70% (537) training and 30% (231) testing data using sklearn train_test_split.

-A model was built using XGBoost with a learning rate of 0.05, a maximum tree depth of 20, a minimum loss reduction (gamma) of 10, and 500 estimators.

-The XGBoost model achieved an accuracy of 75.7% and a Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 72.8%.

-The feature importance plot ranks the features by importance score: glucose (41), BMI (30), age (20), diabetes pedigree function (5), and insulin (1).
