Cogtix - Healthcare Analytics

Project Overview

This case study covers our work with a large Fortune 500 healthcare company. The client wanted to improve patient care and reduce the costs of late-stage, high-cost medical treatment for potentially high-risk patients. For this project, the client aimed to develop predictive systems that generate a risk score, stratify the patient population by that score, and support relevant action for high-risk patients. The client had historical demographic data, medical claims, pharmacy claims, and clinical data such as doctors' notes and lab results for their patients. We first developed analytical solutions to understand the data and draw insights from it. We then ran experiments to identify effective features and to test various ML models on them. Ultimately, we developed an end-to-end data pipeline that cleans the data and selects and creates effective features for each model, and we built cost-effective classification and forecasting machine learning models using deep neural networks. The models predict 30-day patient readmission risk, congestive heart failure rate, mortality rate, and facility utilization rate.
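To illustrate what stratifying a population by risk score can look like, here is a minimal Python sketch. It is not the client's implementation: the function name, the tier names, the cutoff values (0.7 and 0.3), and the patient IDs are all hypothetical, and real cutoffs would be set with clinical and business input.

```python
def stratify_by_risk(risk_scores, high_cutoff=0.7, medium_cutoff=0.3):
    """Group patient IDs into risk tiers from model scores in [0, 1].

    The cutoffs are hypothetical placeholders, not values from the project.
    """
    tiers = {"high": [], "medium": [], "low": []}
    for patient_id, score in risk_scores.items():
        if score >= high_cutoff:
            tiers["high"].append(patient_id)
        elif score >= medium_cutoff:
            tiers["medium"].append(patient_id)
        else:
            tiers["low"].append(patient_id)
    return tiers


# Hypothetical scores, e.g. a predicted 30-day readmission probability.
scores = {"p1": 0.85, "p2": 0.45, "p3": 0.10}
tiers = stratify_by_risk(scores)
print(tiers["high"])  # patients who would be flagged for follow-up
```

In a setup like this, the high tier is the group a care team would prioritize, while the cutoffs control how many patients land in it.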

The Challenge

Our client operates in healthcare, a highly regulated industry, and patient data is protected by HIPAA, a US federal law. The patients' claims data was neither clean nor easy to use, and understanding it involved a steep learning curve for us. We also had to use a different strategy for each model to prepare its training and validation datasets and its features. Finally, there were many data standardization and cleanliness issues that we had to fix efficiently and accurately before delivering the final product.

Our Solution

We first performed a detailed data analysis using PySpark, pandas, and plotting libraries such as Matplotlib to understand the data. We then created and selected features and assessed the relevance of each feature to each model. Using feature selection processes suited to each model, we decreased the number of features by up to 75%, which made data processing in production more efficient while still maintaining high accuracy. We developed the final version of each model after testing various techniques such as boosting, bagging, and recurrent neural networks. We also periodically presented our analysis reports, the importance of each feature, and the performance of the models on various evaluation metrics to the client. The end product included data pipelines and ML pipelines that extract, transform, and load data from various sources, prepare features, and perform scoring on a weekly basis. In addition, the pipelines retrain and tune the models on a monthly basis.
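As a rough illustration of importance-based feature reduction, the sketch below keeps only the top-ranked quarter of features, which matches a 75% reduction. It is a simplified stand-in, not the selection process used in the project: the function name, the feature names, and the importance scores are hypothetical, and in practice the scores would come from a fitted model or a statistical test.

```python
import math


def select_top_features(importances, keep_fraction=0.25):
    """Return the names of the highest-importance features.

    keep_fraction=0.25 drops 75% of the features; it is an assumed setting.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(importances) * keep_fraction))
    # Rank features from most to least important.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n_keep]]


# Hypothetical importance scores for illustration only.
importances = {
    "prior_admissions_12m": 0.31,
    "age": 0.22,
    "num_chronic_conditions": 0.18,
    "days_since_last_discharge": 0.12,
    "num_active_prescriptions": 0.07,
    "er_visits_6m": 0.05,
    "zip_code_bucket": 0.03,
    "claim_line_count": 0.02,
}
selected = select_top_features(importances, keep_fraction=0.25)
print(selected)
```

Any such reduction would need to be checked against held-out accuracy, since the right fraction to keep differs from model to model.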