In the eight-step approach to population health analytics (PHA) that we at the SAS Center for Health Analytics and Insights (CHAI) recommend, the first step is to “Integrate and Prepare Data.” But before we jump directly into a discussion about this first step, let’s take a moment to consider the following question: How much time did you spend in a doctor’s office or hospital (as a patient) last year? And now, compare that number to the amount of time you spent at work, at home, at an airport, in a car, on a bus, on the phone, out with friends, online, shopping, dancing, cooking, exercising, and whatever else you do while awake. I sincerely hope your ratio of minutes spent in a health care setting was similar to mine: approximately 92:350,308. Keep that ratio in your mind as you consider how much data was generated by you – or about you – in a health care setting. Got a number in mind? Now, how does that number compare to the amount of data generated by you – or about you – in all those other settings combined?
This thought experiment helps us recognize the need to include non-traditional 'big data' – social, consumer, survey, and environmental data – along with our more traditional clinical, pharmaceutical, biometric, and lab data. Additionally, we need to plan for new data types and sources such as streaming data from wearable fitness devices, self-reported data from smartphone apps, blogs and forums, genomic data, and the digital output (audio and video) from telemedicine encounters. Successful integration of this data requires fuzzy-logic techniques to match, merge, and de-duplicate records with a high degree of accuracy. Perhaps most important is the ability to mine unstructured text data, applying machine learning and natural language processing to extract value from clinician notes and patient verbatims.
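To make the fuzzy-matching idea concrete, here is a minimal sketch using Python's standard-library `difflib`. The records, field names, and the 0.85 threshold are all illustrative assumptions, not a production matching rule; real master-data-management tools use far more sophisticated probabilistic matching.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-to-1 ratio of how alike two normalized strings are."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_probable_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Average the name and address similarity; flag likely duplicates."""
    name_score = similarity(rec_a["name"], rec_b["name"])
    addr_score = similarity(rec_a["address"], rec_b["address"])
    return (name_score + addr_score) / 2 >= threshold

# Hypothetical duplicate pair: same person, two slightly different entries.
a = {"name": "Katherine O'Neil", "address": "12 Elm St, Cary NC"}
b = {"name": "Kathrine ONeil", "address": "12 Elm Street, Cary NC"}
print(is_probable_match(a, b))  # the typo'd pair still scores as a match
```

The key design point is that an exact-match join would miss this pair entirely; fuzzy comparison tolerates the typos and abbreviations that manual data entry produces.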
Matching and attributing data from disparate sources to the correct person – be they patient or provider – is no mean feat. It's not uncommon to find 30 percent or more duplication in patients' EMR records. The problem is a bit like trying to bail water out of a boat before plugging the holes in the hull: an ounce of prevention, in the form of data quality at the point of entry, is worth a pound of cure. However, most health care settings don't yet enforce rigorous data-entry protocols, so it becomes necessary to profile the data and set up repeatable processes to remediate data quality issues. Whatever technology you choose, be sure it routes data quality decisions to the appropriate data steward for quick and secure resolution.
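A first profiling pass often just measures how complete each field is, so that problem fields can be routed for remediation. The sketch below is a bare-bones version of that idea; the field names, sample rows, and the 20 percent flagging threshold are invented for illustration.

```python
from collections import Counter

def profile_missingness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records where each field is absent or blank."""
    missing = Counter()
    for rec in records:
        for f in fields:
            value = rec.get(f)
            if value is None or str(value).strip() == "":
                missing[f] += 1
    return {f: missing[f] / len(records) for f in fields}

# Hypothetical patient rows with the gaps typical of manual entry.
rows = [
    {"mrn": "001", "dob": "1980-02-14", "zip": "27513"},
    {"mrn": "002", "dob": "",           "zip": "27511"},
    {"mrn": "003", "dob": "1975-07-30", "zip": None},
    {"mrn": "004", "dob": "1990-11-02", "zip": "27518"},
]

rates = profile_missingness(rows, ["mrn", "dob", "zip"])
flagged = [f for f, r in rates.items() if r > 0.20]  # candidates for steward review
print(rates, flagged)
```

In a repeatable process, a report like this would run on every data load, with fields that cross the threshold queued to the appropriate data steward.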
Once you have a cleanly integrated dataset, you can begin preparing that data for analysis. It's often noted that 80 percent of the work required to build an analytically driven solution goes into data preparation. Preparing data for meaningful use requires a combination of subject-matter knowledge (to identify what a signal might look like) and data science (to know how to tease that signal out of the noise). Volumes have been written on the subject of data preparation. For population health analytics, the first and most formidable problem you'll likely face is missing data, so it will be necessary to impute values and use statistical methods to gauge the reliability of those imputations.
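As a toy example of imputation, the sketch below fills gaps with the median of the observed values. Median imputation is one of the simplest possible approaches (the blood-pressure readings here are made up); rigorous work would use methods such as multiple imputation precisely to quantify the reliability the text mentions.

```python
from statistics import median

def impute_median(values: list) -> list:
    """Replace missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [v if v is not None else fill for v in values]

# Hypothetical systolic blood-pressure readings with two gaps.
sbp = [128.0, None, 141.0, 119.0, None, 133.0]
print(impute_median(sbp))  # [128.0, 130.5, 141.0, 119.0, 130.5, 133.0]
```

Note the trade-off: a single fill value preserves the center of the distribution but understates its variance, which is why gauging the reliability of imputed values matters.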
While this first step seems monumental, it’s important to anticipate and plan for the subsequent phases of the strategy. In Part 3 of this series, we’ll explore what it takes to assess performance across the continuum and report it in impactful ways.