Python Data Analysis | Early stage diabetes risk prediction

Matt Karan
10 min read · Jan 3, 2021

Introduction

I’ve always been interested in data and how it applies to our lives. For some time now I’ve been downloading datasets, analyzing them in Python, and training models for fun to continue learning in the field. I’m just now getting around to writing about it. I’d like to preface with the acknowledgement that Medium is full of articles such as these, but for my own purposes I wanted to embark on a written history of some of the little projects I’ve undertaken. If anyone stumbles upon this and takes away anything useful, great! Likewise, if anyone identifies areas for improvement, I’m always excited to learn new methods and absorb more expertise. Now that we have that out of the way…

Initial data exploration

For this exploration, we’ll be using the Early stage diabetes risk prediction dataset from the UCI Machine Learning Repository.

Starting with the basics, we’ll take a quick look at our dataset. df.info() gives us an overview of the rows and columns, as well as the data type of each column. We have one int64 column, ‘Age’, and we can use df.describe() to glance at some stats. df.describe() provides an overview of any numerical columns, but we’ve just got the one in this dataset. Using value_counts() we can also view a couple of columns of interest, ‘Gender’ and ‘class’. The ‘class’ column indicates whether the patient has a positive or negative diagnosis.
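If you’d like to follow along, here’s a minimal sketch of that first pass. The CSV filename below is an assumption; point pandas at wherever you saved the UCI download.

    import pandas as pd

    # Load the dataset (filename is an assumption; adjust to your local copy)
    df = pd.read_csv('diabetes_data_upload.csv')

    # Overview of rows, columns, and dtypes -- 'Age' is the lone int64 column
    df.info()

    # Summary stats for the numerical column(s)
    print(df.describe())

    # Counts for the two columns of interest
    print(df['Gender'].value_counts())
    print(df['class'].value_counts())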

I’m a visual person so next we’ll create a bar plot of the count of patients by gender in our dataset as well as the distribution of age for each. In our dataset there are 328 males and 192 females, or 63.08% and 36.92% respectively.
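A quick seaborn countplot is one way to get that bar plot; this sketch assumes the DataFrame from above.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Bar plot of patient counts by gender
    sns.countplot(x='Gender', data=df)
    plt.title('Patient count by gender')
    plt.show()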

We’re also interested in the age distribution by gender so we’ll create some kde plots.
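Something along these lines produces the KDE plots (the fill argument needs seaborn 0.11+):

    # Overlay an age-distribution KDE for each gender
    for gender in df['Gender'].unique():
        sns.kdeplot(df.loc[df['Gender'] == gender, 'Age'], label=gender, fill=True)
    plt.title('Age distribution by gender')
    plt.legend()
    plt.show()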

This reveals some interesting information. Despite males comprising almost two-thirds of our dataset, they don’t account for as many positive cases. If we isolate the positive class in our dataset and look at the value counts, we find that males actually account for fewer cases than females. Of the 320 positive cases, 45.94% are male, and 54.06% are female. Down the line we’ll explore how this could impact our logistic regression model.
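Isolating the positive class is a one-line filter; the ‘Positive’ label matches how the dataset records the ‘class’ column.

    # Gender split among positive diagnoses only
    positive = df[df['class'] == 'Positive']
    print(positive['Gender'].value_counts())
    print((positive['Gender'].value_counts(normalize=True) * 100).round(2))  # as percentages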

In the meantime, let’s take a look at our reported symptoms to see if anything stands out. We’ll start by looking at the total number of reported symptoms for all diagnoses and all genders.

Now let’s look at the reported symptoms by diagnosis to see if any are more prominent in positive or negative cases.

Looking at the same visualization as a percentage in the plot below, we find that the majority of patients with positive cases reported symptoms of polyuria, polydipsia, weakness, partial paresis, polyphagia, sudden weight loss, and visual blurring.
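One way to tabulate this: treat every column other than ‘Age’, ‘Gender’, and ‘class’ as a symptom, count the ‘Yes’ responses per diagnosis, and normalize by group size. A rough sketch:

    # Symptom columns are everything except Age, Gender, and the class label
    symptom_cols = [c for c in df.columns if c not in ('Age', 'Gender', 'class')]

    # Count of 'Yes' responses per symptom, split by diagnosis
    counts = df[symptom_cols].eq('Yes').groupby(df['class']).sum().T
    print(counts)

    # Same counts as a percentage of each diagnosis group
    print((counts / df['class'].value_counts() * 100).round(2))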

Let’s look even further and at each diagnosis by gender. We want to check for the possibility that different genders may exhibit different symptoms.

This reveals an interesting tidbit. While most of the top reported symptoms are present in both genders, we do see that partial paresis is much more common in females than males. Let’s look at the same graph as a percentage to see which symptoms a majority of each gender reported.

The majority of male positive cases reported polyuria (77.55%), weakness (68.71%), polydipsia (68.03%), polyphagia (51.02%), and sudden weight loss (50.34%).

Meanwhile, the majority of female positive cases reported polyuria (74.57%), polydipsia (72.25%), partial paresis (70.52%), weakness (67.63%), sudden weight loss (65.9%), polyphagia (65.9%), visual blurring (60.12%), delayed healing (50.87%), and itching (50.29%).

This is worth noting for when we reach the model training stage as we have a difference in the number of reported symptoms between genders.

It’s also worth looking at the top symptoms reported by gender for the negative diagnosis.

Reaching the end of our data visualization stage, we’ll take a look at the symptoms using a heatmap to get an idea of how they correlate to each other.
Initially we can see in the heatmap on the left that there are stronger correlations between symptoms such as polyuria, polydipsia, sudden weight loss, partial paresis, and class. However, once we isolate positive cases in the heatmap on the right, the correlations do not appear as strong.
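To reproduce these heatmaps, the Yes/No columns need a numeric encoding first. A sketch, reusing symptom_cols from earlier (alphabetical category order maps No to 0 and Yes to 1):

    # Encode Yes/No symptoms and the class label as 0/1 for correlation
    encoded = df[symptom_cols + ['class']].apply(lambda s: s.astype('category').cat.codes)

    # Left heatmap: correlations across all patients
    sns.heatmap(encoded.corr(), cmap='coolwarm', center=0)
    plt.title('Symptom correlations (all patients)')
    plt.show()

    # Right heatmap: positive cases only (drop 'class' since it's now constant)
    pos = encoded[df['class'] == 'Positive'].drop(columns='class')
    sns.heatmap(pos.corr(), cmap='coolwarm', center=0)
    plt.title('Symptom correlations (positive cases)')
    plt.show()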

Let’s look into this further by separating our heatmaps by gender. In the top row we immediately see several stronger correlations for symptoms reported by females. In the second row of heatmaps, if we again isolate only the positive cases, the correlation of symptoms for males becomes even less apparent. Meanwhile, we can still see stronger correlations for females even when we’re only looking at positive cases. This could be interesting when we train the model: will it perform worse when predicting cases for males?

Model training

Now that we’ve explored our data a bit, let’s move on to training a model. In this example, we’ll be using Logistic Regression from the sklearn library in Python.

Preparing the data

As we saw at the beginning of the data exploration, most of our columns are objects, with the exception of ‘Age’, which is an integer. We will need to convert these object columns into categorical columns. Each category will be assigned a numerical mapping which will be passed into our model.
(e.g. 0: Female, 1: Male, or 0: Negative, 1: Positive)
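My helpers wrap this up, but stripped down, the conversion boils down to roughly the following (an approximation of the repo code, not a copy). Pandas orders the categories alphabetically, which gives exactly the mappings above.

    # Convert each object column to its categorical codes
    obj_cols = df.select_dtypes(include='object').columns
    for col in obj_cols:
        df[col] = df[col].astype('category').cat.codes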

As a brief note, in the screenshot examples I’ve created a class called DataHolder that contains some methods that abstract the code to make the logic easier to read in our main function. If you’d like to dig into these, my code can be found at: https://github.com/automattrix/diabetes

Below, we select the object columns and then call ‘convert_data’ which is a small helper function I’ve written. This ends up calling my ‘convert_to_categories’ function to return the categorical codes for each object column that is passed.

Next, we will group the ‘Age’ column into bins, as we don’t have many data points at any specific age. We’ll group as 0–9, 10–19, 20–29, 30–39, and so on (i.e. 10s, 20s, 30s…). This new column will also be converted into a categorical column once the bins are created. Again, I’ve coded some helper functions to make the code a bit more readable. Feel free to check out the GitHub link to explore the functions.

Now that we have the new age bin column, we’ll drop the original column as I won’t be using it for training the model.
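With pd.cut the decade bins take only a couple of lines; the ‘age_bin’ column name below is my own placeholder, not necessarily what the repo uses.

    # Bin ages by decade: [0, 10), [10, 20), ... then encode the bins as categories
    bins = list(range(0, df['Age'].max() + 11, 10))
    df['age_bin'] = pd.cut(df['Age'], bins=bins, right=False).cat.codes

    # Drop the original Age column since we won't train on it
    df = df.drop(columns='Age')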

We’re getting close to actually creating our model now. We’ll start by selecting our target and features. In this dataset we want to predict whether a diagnosis will be positive or negative based on the reported symptoms. Therefore our target will be ‘class’ and our features will be our remaining columns (gender, age bin, and all the reported symptoms). You’ll notice in the image below that we’ve also removed the target from our features so we don’t contaminate the model with the answers. Otherwise when we test the model, it would already know the diagnosis that we’ve been attempting to predict. That would be like handing a magician a card face up and then asking them to guess which number you drew.
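The selection itself is short: ‘class’ is the target, and everything else becomes a feature.

    # Target is the diagnosis; features are every remaining column
    target = 'class'
    features = [col for col in df.columns if col != target]

    X = df[features]
    y = df[target]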

Another note: For the model training I’ve once again abstracted the code for readability. I’ve structured the code in this manner to make it easier to expand with additional models in the future, as well as creating a template of sorts for future projects.

Now onto the model!
Let’s begin with the imports. I’m using train_test_split to separate the data into training and test datasets, and LogisticRegression for the actual model. I’ve also imported some of my helper functions (build_evaluation_df and graph_log_performance) that will create a Pandas DataFrame we’ll use for visualizing our predictions.

First, we’ll select our data to pass into train_test_split. ‘X’ will be our feature columns and ‘y’ will be the target column. Then we create our training and test datasets (X_train, X_test, y_train, y_test). We’ll reserve 20% of our data for testing the model. I’ve set the random state to 42 so we can run the code multiple times and return consistent results. You can set the random state to whatever you desire; I chose a small nod to Hitchhiker’s Guide to the Galaxy.
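As a minimal sketch of the split:

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data for testing; fixed random state for repeatable runs
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)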

Now we’ll create our model (again setting the random state).
Then we fit the model to our training data with model.fit(X_train, y_train). In simplistic terms, we’re providing examples of which inputs resulted in a positive or negative diagnosis.
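Creating and fitting the model looks roughly like this (the max_iter bump is my own safeguard against convergence warnings; it isn’t in the original code):

    from sklearn.linear_model import LogisticRegression

    # Build the model and fit it on the training data
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)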

With the fitted model we can now make predictions on data that wasn’t seen during training (X_test).

Calling model.predict_proba(X_test) returns a 2-dimensional array containing the probability of each data point being a negative or positive diagnosis (one column per class).

Calling model.predict(X_test) will return the actual positive/negative (0 or 1) prediction for each data point.
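Both calls in one place:

    # Per-class probabilities: one row per sample, columns ordered [negative, positive]
    probabilities = model.predict_proba(X_test)

    # Hard 0/1 predictions for each test sample
    predictions = model.predict(X_test)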

The adventure doesn’t stop there though. Now that we have predictions, we’ll want to know if we can trust our model. If you were actually going to use this model in real life to predict whether or not a person may have diabetes, you’d want to be sure that you’re making a confident prediction.

Luckily, we can review information about our model to determine its performance. We made predictions on X_test, but we also reserved the actual outcomes in y_test. Using those values, we can review how many correct predictions were made.

As an initial look, we can call model.score(X_test, y_test). For a classifier like LogisticRegression this returns the mean accuracy on the test data (not an R-squared score, which applies to regression estimators).
In the current state of our model, we have an accuracy of 0.9134615384615384.
With 1 being the best outcome, our current score tells us the model classifies the vast majority of the test samples correctly.

Using our helper function, we can also graph the predictions that our model has made.

As we can see, our model has mostly made correct predictions!

We can use a few more functions from sklearn metrics to evaluate the model.

Using a confusion matrix we can see how many true positives, false positives, false negatives, and true negatives the model predicted. In a perfect world we’d only have true positives and negatives.

In our case, with Negative encoded as 0 and Positive as 1, sklearn lays the matrix out with actual classes as rows and predicted classes as columns:

[[28 5]
[ 4 67]]

That’s 28 true negative predictions, 5 false positives, 4 false negatives, and 67 true positives. Using those numbers we can also calculate precision, accuracy, and recall. We could do the math manually, but we can save some time by using the sklearn functions.

Precision score: 0.9305555555555556
Accuracy score: 0.9134615384615384
Recall score: 0.9436619718309859
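All of the numbers above come from a handful of sklearn.metrics calls:

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    print(model.score(X_test, y_test))            # mean accuracy on the test set

    print(confusion_matrix(y_test, predictions))  # [[TN FP]
                                                  #  [FN TP]]
    print(precision_score(y_test, predictions))
    print(accuracy_score(y_test, predictions))
    print(recall_score(y_test, predictions))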

As a final check, let’s take a look at our results and determine which gender had more incorrect predictions. Earlier on during the data exploration phase we noticed that males did not appear to have a very strong symptom correlation in our dataset.
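My build_evaluation_df helper does something along these lines; here’s a hand-rolled approximation for counting misses by gender (0 = Female, 1 = Male after the earlier categorical mapping):

    # Line up predictions with the held-out answers, then count misses by gender
    results = X_test.copy()
    results['actual'] = y_test
    results['predicted'] = predictions
    wrong = results[results['actual'] != results['predicted']]
    print(wrong['Gender'].value_counts())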

It seems that when our model makes incorrect predictions, it is more likely to happen when predicting for males. In a follow-up post, we’ll take a look at ways we might refine the model to reduce the number of incorrect predictions, although I’m quite happy with the current performance.

If you’ve made it this far, thanks for toughing it out! While there are many posts like this, hopefully you found something interesting here. Again, feel free to take a look through my code. I’ll be continuing to update the repository as I work on part II for refining the model.

Until next time!

Acknowledgements

  • Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • Islam, M. M. F., et al. ‘Likelihood prediction of diabetes at early stage using data mining techniques.’ Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore, 2020. 113–125.
  • Dataset: https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset
