How Accurately Can Cause of Death Be Predicted With Minimal Data?
I built two classification models, a logistic regression and a decision tree classifier, to see how accurately the cause of death “type” (cardiac, immune issues, etc.) can be predicted when only the following features are known:
- Individual’s Age Group
- Individual’s Race
- Cause of Death
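Since age group and race are categorical, each category has to become its own 0/1 column before either model can use it. The sketch below shows one way to do that with plain Python. The field names and category labels are illustrative assumptions, not the actual columns of the CDC Wonder export, and this is not necessarily how the original preprocessing was done.

```python
# Hypothetical records; the field names and category labels are
# illustrative assumptions, not the actual CDC WONDER export columns.
records = [
    {"age_group": "< 1 year", "race": "White"},
    {"age_group": "20-24 years", "race": "Black or African American"},
    {"age_group": "20-24 years", "race": "American Indian or Alaska Native"},
]

def one_hot_encode(rows, columns):
    """Turn each categorical column into one 0/1 column per category."""
    # Collect the sorted set of categories seen in each column.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    # Name each new column "<original column>=<category>".
    feature_names = [f"{col}={cat}" for col in columns for cat in categories[col]]
    # Build one row of 0/1 flags per input record, in the same column order.
    matrix = [
        [1 if row[col] == cat else 0 for col in columns for cat in categories[col]]
        for row in rows
    ]
    return feature_names, matrix

feature_names, X = one_hot_encode(records, ["age_group", "race"])
for name, column in zip(feature_names, zip(*X)):
    print(name, list(column))
```

Every row ends up with exactly one flag set per original column, which is why a feature importance can point at a single group rather than at "race" as a whole.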
To begin with, I had to remove some serious leakage from one group of cause of death codes while preprocessing the data, which came from the CDC Wonder tool. Every ICD-10 code beginning with P covers perinatal conditions, and one of the age groups is <1 Year, so the age group alone would give those deaths away. Because the CDC Wonder tool only allows for 70,000 rows, I also had to limit my search to one state, and I chose Texas.
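A minimal sketch of that filtering step, assuming each row carries its ICD-10 code in a field called `icd_code` (a hypothetical name, as is `age_group`); the sample rows are made up for illustration:

```python
# Hypothetical rows; "icd_code" and "age_group" are assumed field names.
rows = [
    {"icd_code": "P07.2", "age_group": "< 1 year"},
    {"icd_code": "I21.9", "age_group": "65-74 years"},
    {"icd_code": "D86.0", "age_group": "25-34 years"},
]

def drop_perinatal(rows):
    """Remove rows whose ICD-10 code starts with 'P' (perinatal conditions)."""
    return [r for r in rows if not r["icd_code"].upper().startswith("P")]

cleaned = drop_perinatal(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Dropping the rows, rather than the age group, keeps the <1 Year category available for the causes of death that are not tied to it by definition.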
As I expected, I wasn’t able to get a very high accuracy score at all. There are so many factors in each individual’s life that can take a toll on their health: living in an impoverished area may lead to higher stress, which may lead to cardiac issues later in life. While race may be associated with a higher likelihood of specific medical issues (for example, Sarcoidosis disproportionately affects younger African-American women), it still does not account for genetic and environmental factors.
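One way to put a low accuracy score in context is to compare it with a simple baseline. The sketch below assumes a majority-class baseline, which always predicts the most common label in the training data; the labels are made-up examples, and the baseline actually used for these models may have been defined differently.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Made-up labels for illustration only.
train = ["cardiac", "cardiac", "cardiac", "immune", "respiratory"]
test = ["cardiac", "immune", "cardiac", "respiratory"]

label, acc = majority_baseline_accuracy(train, test)
print(f"always predicting {label!r} scores {acc:.2f} on the test labels")
```

A model is only adding information if it scores above this number on the same test set.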
Although neither model was able to beat my baseline accuracy, the following are the test scores for these models:
About the Model
When I looked into the feature importances for the decision tree classifier, it caught my attention that the three most important features are whether an individual is African-American, whether they are Native American, and whether they’re 20–24 years of age. I plan on looking further into this to identify how these importances arise and what specific health concerns are related to these groups.
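As a starting point for that, it helps to remember that decision tree libraries typically score a feature by how much its splits reduce impurity. The sketch below computes the Gini impurity decrease for a single split on one 0/1 feature. It assumes the Gini criterion and covers only one split, so it is a building block rather than the full importance calculation, and the labels are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(feature_column, labels):
    """Gini decrease from splitting the rows on a single 0/1 feature."""
    left = [y for x, y in zip(feature_column, labels) if x == 0]
    right = [y for x, y in zip(feature_column, labels) if x == 1]
    n = len(labels)
    # Weight each child's impurity by the share of rows it receives.
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Made-up labels and two hypothetical binary features.
labels = ["cardiac", "cardiac", "immune", "immune"]
informative = [0, 0, 1, 1]    # separates the two classes perfectly
uninformative = [0, 1, 0, 1]  # leaves both children as mixed as the parent

print(impurity_decrease(informative, labels))
print(impurity_decrease(uninformative, labels))
```

A feature that separates the classes well produces a large decrease, so a high importance says the feature was useful for splitting this dataset, not that it causes the outcome.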
Another reason I was unable to accurately predict the cause of death is that I predicted only the cause of death type rather than the specific cause of death. Take Sarcoidosis, for example: it is an inflammatory disease affecting the immune system, so it would have been classified the same way as any other disease related to the immune system, which blurs any pattern specific to that one disease.
What I Would Do Differently Next Time
If I were to repeat this, I would want at minimum economic data, and I would classify each individual’s specific cause of death. However, the latter is computationally expensive, considering how many different causes of death there are. Another helpful feature might be the individual’s specific location, such as county of residence: an individual who lives in a county where their racial group is disproportionately represented may face more issues on a social level, leading to higher levels of stress and thus health concerns. Even with all of this, there are genetic, social, and environmental factors that simply cannot be predicted with this data.