How big is 'big data' in healthcare?

In my last post I spoke to the problem of medical data overload as it relates to our innate human inability to store and process more than 7 variables (plus or minus 2) at any one time.  This week, I’ll explore what’s driving the data overload and, in particular, how big is "big data" in Healthcare.

So, why all the recent hype about Big Data in Healthcare? Have hospitals suddenly started to treat significantly more patients or double the number of lab tests and EKGs? Have Health Plans recently dramatically increased the number of patients that they provide insurance coverage to? Has there been a sudden surge in the number of Pharmaceutical trials that are underway? Well, the simple answer is no, no and no. Then why do we keep hearing that Healthcare is emerging as the poster child for Big Data? The answer lies in the fact that Big Data isn’t simply about the volume, velocity and variety of the data in storage; it is also about the potential value of data that already exist but are poorly coordinated and stored in widely disparate formats across industries that haven't typically shared data openly.

I’ll write more about the opportunities and challenges of Big Data next time.

Let’s start by addressing the first part of the question. How big is Big Data in Healthcare? This is a deceptively simple question, but not one that I could easily find any existing answers to. The McKinsey Global Institute recently published a Big Data report in which they estimate that, in ten years or so, the total value of Big Data in Healthcare could be as much as $300B. However, I was still unable to find any estimates of the actual global size of healthcare data.

So, not to be deterred, I decided to take a crack at it myself and set out to gather some data that would provide me with a foundation for an estimate of the total size of all digital healthcare data stored in the world today. An ex-colleague of mine from a leading PACS vendor provided me with some very helpful input as did a CIO from a highly prestigious integrated delivery system in the Boston area.

With the disclaimer that all the following data should be taken with a liberal pinch of salt, here are my estimates. In these examples, data sizes are expressed as Exabytes, where one Exabyte = 1x10^18 Bytes. If my assumptions sound way off base, if my math is wrong, or if any of you have seen any other published estimates with associated sources and assumptions, I would be interested to hear from you.

Based on data provided by the PACS vendor, I estimate the total amount of PACS data to be about 78 Exabytes (one Exabyte = 1x10^18 Bytes), using the following assumptions. The vendor apparently has copies of all its (global) customers' PACS images in a central archive. That archive contains approximately 550 Billion images. At roughly 20MB per image, and with a claimed 15% global share of the PACS market, that would be a total of 550 Billion * 20MB / 0.15 = 73 Exabytes. I then added a near random 5 Exabytes to take account of redundant storage in local PACS stores and some regional and national image exchanges. My assumption is also that these PACS numbers contain not only traditional radiological data (Digital X-Ray, CT, MR and USS) but also other diagnostic images and some stored video-based data; if not, the estimate could probably be increased by an additional 30%. The estimate does not include data that are of a transient nature, such as full video from interventional procedures or real-time bedside medical device data.
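For anyone who wants to check or adjust the arithmetic, here is a minimal Python sketch of the PACS calculation. It assumes decimal units (1 MB = 10^6 bytes, 1 Exabyte = 10^18 bytes); the function and variable names are purely illustrative, and the inputs are just the rough figures quoted above.

```python
# Back-of-envelope PACS estimate using the figures quoted in the post.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 EB = 10**18 bytes).
MB = 10**6
EB = 10**18

images_in_archive = 550 * 10**9   # ~550 billion images in the vendor's archive
bytes_per_image = 20 * MB         # ~20 MB per image
vendor_market_share = 0.15        # vendor's claimed global PACS market share
redundancy_allowance_eb = 5       # near-random allowance for redundant copies


def pacs_estimate_eb(images, image_bytes, share, allowance_eb):
    """Scale one vendor's archive up to the whole market, then add the allowance."""
    vendor_bytes = images * image_bytes      # data held by this one vendor
    global_bytes = vendor_bytes / share      # extrapolate to 100% of the market
    return global_bytes / EB + allowance_eb  # convert to exabytes, add allowance


pacs_eb = pacs_estimate_eb(
    images_in_archive, bytes_per_image, vendor_market_share, redundancy_allowance_eb
)
print(f"Estimated global PACS data: {pacs_eb:.0f} EB")
```

Changing `vendor_market_share` or `bytes_per_image` shows how sensitive the result is to those two guesses.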

Now, this is where things get really tricky. My gut feel (which may be way off but also sounded reasonable to the two colleagues I mentioned above) is that PACS images currently account for around 50% of the total medical data stored per patient; the rest includes all other stored data such as demographic data, laboratory results, nurse-charted data, other diagnostic reports, physicians' notes, scanned documents, claims data, trials data, current genomic stores etc. So my total estimate for the currently stored global healthcare data would be something in the region of 150 Exabytes.
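Since the 50% share is a gut feel, it may help to see how much the total moves when that share changes. The sketch below is only an illustration: the 40% and 60% values are hypothetical alternatives that I have added for comparison, not figures from either colleague.

```python
def total_from_pacs(pacs_eb: float, pacs_share: float) -> float:
    """Total stored healthcare data (EB), given PACS data and its share of the total."""
    if not 0 < pacs_share <= 1:
        raise ValueError("pacs_share must be in (0, 1]")
    return pacs_eb / pacs_share


PACS_EB = 78.0  # PACS estimate from the post, in exabytes

# 0.5 is the post's gut-feel share; 0.4 and 0.6 are hypothetical alternatives.
for share in (0.4, 0.5, 0.6):
    total = total_from_pacs(PACS_EB, share)
    print(f"PACS share {share:.0%}: total ~ {total:.0f} EB")
```

At a 50% share the total works out to 156 Exabytes, which the post rounds to roughly 150.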

Next, my growth rate estimates are based on the Boston CIO’s data. He sees a growth rate of approximately 100MB per patient per year for hospital-generated data. Assuming that Boston is reasonably representative of the average medical data stored per citizen in the US, that would lead to an estimated growth rate for hospital data across the US of 30 Petabytes per year (100MB * 300 Million people). As this only covers hospital-based data, I then multiplied by 4 to estimate the total US data growth rate, taking account of all US healthcare data stored outside of the hospital setting (physician offices, freestanding imaging centers, nursing homes, claims data and pharmaceutical trials data), which I felt might bring the total to four times the amount of data held within a hospital setting. That gives roughly 120 Petabytes per year. I then multiplied that by a factor of 10, which I took to represent the broader world of new healthcare data being generated each year outside of the US. My sense is that the multiple of 10 is probably too small, and it might be quite reasonable to multiply by closer to 20. That leaves us with a growth rate in global healthcare data of between 1.2 and 2.4 Exabytes per year.
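The chain of multipliers is easier to follow, and to challenge, when written out step by step. This is a rough sketch using decimal units (1 MB = 10^6 bytes, 1 Petabyte = 10^15 bytes, 1 Exabyte = 10^18 bytes); the variable names are illustrative and every multiplier is one of the guesses described above.

```python
# Step-by-step growth-rate estimate using the post's multipliers.
# Assumption: decimal units (1 MB = 10**6, 1 PB = 10**15, 1 EB = 10**18 bytes).
MB = 10**6
PB = 10**15
EB = 10**18

growth_per_patient = 100 * MB    # hospital data growth per patient per year
us_population = 300 * 10**6      # ~300 million people
non_hospital_multiplier = 4      # scales hospital data up to all US healthcare data
global_multiplier_low = 10       # US -> world, the post's lower guess
global_multiplier_high = 20      # US -> world, the post's upper guess

us_hospital_growth = growth_per_patient * us_population
us_total_growth = us_hospital_growth * non_hospital_multiplier
global_growth_low = us_total_growth * global_multiplier_low
global_growth_high = us_total_growth * global_multiplier_high

print(f"US hospital growth: {us_hospital_growth / PB:.0f} PB/year")
print(f"US total growth:    {us_total_growth / PB:.0f} PB/year")
print(f"Global growth:      {global_growth_low / EB:.1f} to "
      f"{global_growth_high / EB:.1f} EB/year")
```

Because the final figure is a straight product of the inputs, doubling any one multiplier doubles the result.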

In summary, my estimates are that the global size of “Big Data” in Healthcare stands at roughly 150 Exabytes in 2011, increasing at a rate between 1.2 and 2.4 Exabytes per year.

What do you think? Can you come up with a more accurate number?

tags: big data, CHAI, Cool Technology, health analytics, healthcare IT, information overload


  1. Greg
    Posted December 7, 2011 at 2:13 pm | Permalink

    What is a "PACS"?

    • Graham Hughes, MD
      Posted December 7, 2011 at 3:15 pm | Permalink

      PACS is an acronym for Picture Archiving and Communication System, a system for the storage and dissemination of digital medical images.

  2. Tushar
    Posted April 5, 2012 at 2:48 pm | Permalink

    Do you have any updated stats on the amount of healthcare data? Also, have you done this kind of study over time?

    • Graham Hughes, MD
      Posted May 5, 2012 at 3:02 am | Permalink

      The estimates I made around size and growth rates were really meant primarily to stimulate discussion and, as stated in the post, are based on a bunch of assumptions and not on any specific external research. They may be way off target. I'd love to see if anyone else has conducted a more rigorous analysis and to see how their results compare to my estimates.

  • About this blog

    Welcome to the SAS Health and Life Sciences blog. We explore how the health care ecosystem – providers, payers, pharmaceutical firms, regulators and consumers – can collaboratively use information and analytics to transform health quality, cost and outcomes.
