At every healthcare security conference I’ve been to in the last few years, at least one speaker has a slide in their presentation deck with a few data breach figures designed to elicit a collective audience gasp. Being a security geek, a decidedly antisocial vocation characterized by skepticism, suspicion, and conspiracy delusions, I was compelled by my particular brand of insanity to cozy up to the source data firsthand, and to get the most up-to-date figures available.

Fortunately, the Office for Civil Rights (OCR), an arm of the U.S. Department of Health and Human Services (HHS), tracks healthcare data breaches of electronic protected health information (ePHI) affecting 500 or more patient records, as prescribed by HIPAA/HITECH. OCR provides an interactive online tool for examining ePHI exposure incidents, and the source data is available for download.

Massaging the Data

As with most data sets, there’s a bit of inconsistency and ambiguity in the systems, so instead of jumping right to the denouement, you need some background on how I teased the data into line. Perhaps it will help you if you decide to take a run at the data yourself, or at least it will explain why not everyone comes up with the same results given the same source data.

  • I’ve ignored the covered entity’s, well…entity. No need to insult the injured. But I did remove a few obvious duplicates: same covered entity (CE) name, same reporting date, same number of records. There were fewer than half a dozen of these.
  • The OCR tracks Types of Breaches (e.g., theft, unauthorized access/disclosure, hacking/IT incident), as well as Location of Breached Information (e.g., laptop, paper, other portable electronic device), which I refer to as Media Containing ePHI for clarity. The OCR provides these as two separate fields, each containing one or more entries separated by commas, which isn’t that useful for statistical analysis. I created individual fields for each entry and separated them. First, though, I had to resolve a number of misspellings and inconsistencies in entry names (e.g., Loss, Improper Disposal vs Loss/Improper Disposal). I also renamed some of the properties and aggregated others where it made sense; for example, I consolidated “Other (Backup Tapes)” and “Other (Backup Disks)” into simply Backup Media.
  • Some of the records, only 14% (ish), have associated comments, or a narrative that provides details about the incident. In some cases I changed the Type of Breach and/or Location of Breached Information based on the narrative.
  • If you add up all the Types of Breaches or Media Containing ePHI, they both exceed the total incidents. This is because many of the breaches have multiple classifications attached to them. Some seem to be catch-alls. For example, “Computer” appears to be a generic category for lost or stolen media, but may be coupled with “Laptop” or “Network Server”. It’s unclear from those incidents without an accompanying narrative whether multiple systems were involved (a laptop stolen as well as a network server hacked) or whether a single system was given multiple classifications.
  • The comments tell the story of many incidents involving theft of computers left in cars or brought home or to a remote office, like a business associate’s lab. These are not called out in the original classification so I created properties to capture these incidents.
  • The majority of incidents have only one date associated with them; however, later in the data set, there is a date range. To normalize the chronology, I used the later date, which is presumably the date of discovery, for grouping incidents by year.
  • The latest sane date in the data I used is October 12, 2013. Interestingly, the last record, for “Multiple Health Plans”, which I interpret as a generic identifier for multiple health plans (although there may be a healthcare payer with that name…), is dated in the future: December 7, 2013. I left this last record in the data set even though it’s not particularly significant, with only 1,368 breached records, categorized as both theft and loss of paper media.
  • The current year has not yet come to a close. Consequently, I projected the total number of incidents and records based on the current run rate. The detailed results (the number of incidents of improper disposal, for example) were not extrapolated.
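The cleanup steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure used for the figures below: the column names (name, date, records, breach_type, location), the M/D/YYYY date format, the sample rows, and the choice of “Loss/Improper Disposal” as the canonical spelling are all hypothetical.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

# Hypothetical alias map: variant spelling -> canonical label.
# A real map would be built by inspecting the downloaded data.
ALIASES = {
    "Loss, Improper Disposal": "Loss/Improper Disposal",
    "Other (Backup Tapes)": "Backup Media",
    "Other (Backup Disks)": "Backup Media",
}

def split_labels(cell):
    """Split a comma-separated cell into a set of normalized labels."""
    text = " ".join(cell.split())  # collapse stray whitespace
    # Apply aliases first, because some variants contain commas themselves.
    for variant, canonical in ALIASES.items():
        text = re.sub(re.escape(variant), canonical, text, flags=re.IGNORECASE)
    return {part.strip() for part in text.split(",") if part.strip()}

def later_date(cell):
    """Return the later date from a single date or a 'start - end' range."""
    found = re.findall(r"\d{1,2}/\d{1,2}/\d{4}", cell)
    return max(datetime.strptime(d, "%m/%d/%Y") for d in found)

def dedupe(rows):
    """Drop rows sharing the same entity name, date, and record count."""
    seen, unique = set(), []
    for row in rows:
        key = (row["name"].strip().lower(), row["date"].strip(), row["records"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up sample rows, for illustration only.
SAMPLE = """name,date,records,breach_type,location
Clinic A,3/1/2010,1200,"Theft, Loss",Laptop
Clinic A,3/1/2010,1200,"Theft, Loss",Laptop
Lab B,9/5/2011 - 10/2/2011,800,"Loss, Improper Disposal",Paper
Plan C,1/15/2012,5000,Loss,"Other (Backup Tapes), Other (Backup Disks)"
"""

rows = dedupe(list(csv.DictReader(io.StringIO(SAMPLE))))
type_counts, media_counts, by_year = Counter(), Counter(), Counter()
for row in rows:
    type_counts.update(split_labels(row["breach_type"]))
    media_counts.update(split_labels(row["location"]))
    by_year[later_date(row["date"]).year] += 1

print(len(rows), dict(type_counts), dict(media_counts), dict(by_year))
```

Because each incident can carry several labels, the per-type counts in a tally like this sum to more than the number of incidents, which is the effect described above.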

The Results

  • 24 million (plus a bit): the number of ePHI records that have been or will be compromised between 2009 and the projected end of 2013
  • 730 incidents were reported in the same period (also projected for 2013)
  • The number of incidents dropped significantly from 2010 to 2011 and has been going down slowly since, except between 2011 and 2012, when it stayed relatively flat. However, the average number of records per incident has fluctuated significantly, with about 40K records per incident in 2013 and 2011, 17K in 2012, and 25K in 2010.

  • Theft is by far the most common type of breach, including hospital and office burglaries, and laptops stolen from offices and cars. Unauthorized access or disclosure comes in a distant second, with fewer than half as many incidents as theft.
  • Hacking doesn’t figure prominently in breach incidents, with fewer than 20 incidents per year, and only 10 in 2013 so far.
  • At least 14 incidents were related to employees and contractors leaving media containing ePHI in vehicles which were broken into. That figure is likely higher as it’s not a property tracked by OCR.
  • Similarly, postal mail was a prominent medium for inadvertently disclosing ePHI before 2012, but no incidents appear since. The incidents provide lessons in paying attention to detail. Some include sending the wrong patient information to recipients or inserting a list of patients and associated private information into a group mailing, and in one case ePHI was printed on the external mailing label. Additionally, backup media was mailed in a few instances and never reached its destination or was addressed to the wrong recipient.
  • For positive trends, loss of ePHI on portable devices has declined steadily since 2010, and is currently at 8 incidents for 2013. What’s being classified simply as ‘computer’ for media has also steadily declined, and is currently at 0 for 2013; however, that may be due to better classification, as computer seems to be a catch-all.
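The projected 2013 totals in the results above rest on a run-rate extrapolation. A minimal sketch of one way to do that follows, assuming a simple linear scaling by days elapsed; the year-to-date figures are hypothetical placeholders rather than values from the OCR data, and the function name is my own.

```python
from datetime import date

def project_full_year(count_so_far, as_of):
    """Scale a year-to-date count to a full year at the current run rate."""
    start = date(as_of.year, 1, 1)
    end = date(as_of.year + 1, 1, 1)
    elapsed_days = (as_of - start).days + 1  # include the as-of day itself
    total_days = (end - start).days          # 365, or 366 in a leap year
    return count_so_far * total_days / elapsed_days

# Hypothetical year-to-date figures, NOT taken from the OCR data.
incidents_ytd = 100
records_ytd = 4_000_000
as_of = date(2013, 10, 12)  # the latest date used in the data set

projected_incidents = project_full_year(incidents_ytd, as_of)
projected_records = project_full_year(records_ytd, as_of)

print(round(projected_incidents), round(projected_records))
# Both totals scale by the same factor, so records per incident is unchanged.
print(round(projected_records / projected_incidents))
```

A linear run rate treats incidents as evenly spread across the year, which reporting lag or a single very large breach can easily upset, so the projected totals are best read as rough estimates.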


After spending a few hours becoming intimate with the data, I’m left with the feeling that the healthcare industry is making progress in some areas, but is overall struggling to clot the wound bleeding patient data. If you’ve done your own analysis and come up with different results, I’d love to hear what they are and how you arrived at your conclusions.

Stay safe, my friends.
