At every healthcare security conference I’ve been to in the last few years, at least one speaker has a slide with a few data breach figures designed to elicit a collective audience gasp. Being a security geek, a decidedly antisocial vocation characterized by skepticism, suspicion, and conspiracy delusions, I was compelled by my particular brand of insanity to cozy up to the source data firsthand, and to get the most up-to-date figures available.
Fortunately, the Office for Civil Rights (OCR), an arm of the U.S. Department of Health and Human Services (HHS), tracks healthcare breaches of electronic protected health information (ePHI) involving 500 or more patient records, as prescribed by HIPAA/HITECH. OCR provides an interactive online tool for examining ePHI exposure incidents, and the source data is available for download.
Massaging the Data
As with most data sets, there’s a bit of inconsistency and ambiguity in the systems, so instead of jumping right to the denouement, you need some background on how I teased the data into line. Perhaps it will help you if you decide to take a run at the data yourself, or at least it will explain why not everyone comes up with the same results given the same source data.
- I’ve ignored the covered entity’s, well…entity. No need to insult the injured. But I did remove a few obvious duplicates: same covered entity (CE) name, same reporting date, same number of records. There were fewer than half a dozen of these.
- The OCR tracks Types of Breaches (e.g., theft, unauthorized access/disclosure, hacking/IT incident), as well as Location of Breached Information (e.g., laptop, paper, other portable electronic device), which I refer to as Media Containing ePHI for clarity. The OCR provides these as two separate fields, each containing one or more entries separated by commas, which isn’t that useful for statistical analysis. I separated the entries and created an individual field for each one. First, though, I had to resolve a number of misspellings and inconsistencies in entry names (e.g., Loss, Improper Disposal vs Loss/Improper Disposal). I also renamed some of the properties and aggregated others where it made sense. For example, I consolidated “Other (Backup Tapes)” and “Other (Backup Disks)” into simply Backup Media.
- Some of the records, only 14% (ish), have associated comments, or a narrative that provides details about the incident. In some cases I changed the Type of Breach and/or Location of Breached Information based on the narrative.
- If you add up all the Types of Breaches or Media Containing ePHI, they both exceed the total incidents. This is because many of the breaches have multiple classifications attached to them. Some seem to be catch-alls. For example, “Computer” appears to be a generic category for lost or stolen medium, but may be coupled with “Laptop” or “Network Server”. It’s unclear from those incidents without an accompanying narrative whether there were multiple systems involved—a laptop stolen as well as a network server hacked—or it’s multiple classifications for a single system.
- The comments tell the story of many incidents involving theft of computers left in cars or brought home or to a remote office, like a business associate’s lab. These are not called out in the original classification so I created properties to capture these incidents.
- The majority of incidents have only one date associated with them; however, later in the data set, there is a date range. To normalize the chronology, I used the later date, which is presumably the date of discovery, for grouping incidents by year.
- The latest sane date in the data I used is October 12, 2013. Interestingly, the last record, for “Multiple Health Plans”, which I interpret as a generic identifier for multiple health plans (although there may be a healthcare payer with that name…), is dated in the future: December 7, 2013. I left this last record in the data set even though it’s not particularly significant, with only 1,368 breached records, categorized as both theft and loss of paper media.
- The current year has not yet come to a close. Consequently, I projected the total number of incidents and records for 2013 based on the current run rate. However, the detailed results (the number of incidents of improper disposal, for example) were not extrapolated.
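The cleanup steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the process actually used: the field names (`entity`, `reported`, `records`, `media`, `dates`) and the sample rows are hypothetical, the case-insensitive name match is an added assumption, and only the Backup Media consolidation comes from the description above.

```python
from datetime import date

# Consolidation named in the text; other spelling fixes would be added here.
ALIASES = {
    "Other (Backup Tapes)": "Backup Media",
    "Other (Backup Disks)": "Backup Media",
}

def split_labels(raw):
    """Split a comma-separated field into sorted, unique, normalized labels."""
    labels = set()
    for part in raw.split(","):
        label = part.strip()
        if label:
            labels.add(ALIASES.get(label, label))
    return sorted(labels)

def dedupe(rows):
    """Keep the first row for each (entity name, report date, record count)."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: names are compared ignoring case and surrounding spaces.
        key = (row["entity"].strip().casefold(), row["reported"], row["records"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def incident_year(breach_dates):
    """Group by the later date when an incident carries a date range."""
    return max(breach_dates).year

# Hypothetical sample rows, not taken from the OCR data.
rows = [
    {"entity": "Example Clinic", "reported": date(2012, 3, 1), "records": 1200,
     "media": "Laptop, Other (Backup Tapes)",
     "dates": [date(2011, 12, 20), date(2012, 1, 5)]},
    {"entity": "example clinic ", "reported": date(2012, 3, 1), "records": 1200,
     "media": "Laptop, Other (Backup Tapes)",
     "dates": [date(2011, 12, 20), date(2012, 1, 5)]},
    {"entity": "Sample Health Plan", "reported": date(2013, 6, 10), "records": 800,
     "media": "Paper", "dates": [date(2013, 5, 2)]},
]

clean = dedupe(rows)
for row in clean:
    row["media_list"] = split_labels(row["media"])
    row["year"] = incident_year(row["dates"])
```

Because one incident can carry several labels, counting labels across `media_list` will add up to more than the number of incidents, which matches the caveat about totals noted earlier.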
- 24 million (plus a bit): the number of ePHI records that have been or will be compromised between 2009 and the projected end of 2013
- 730 incidents were reported in the same period (also projected for 2013)
- The number of incidents dropped significantly from 2010 to 2011 and has been going down slowly since, except between 2011 and 2012, when it stayed relatively flat. However, the average number of records per incident has fluctuated significantly: about 40K records per incident in 2011 and 2013, 17K in 2012, and 25K in 2010.
- Theft is by far the most common type of breach, including hospital and office burglaries, and laptops stolen from offices and cars. Unauthorized access or disclosure comes in a distant second, with fewer than half as many incidents as theft.
- Hacking doesn’t figure prominently in breach incidents, with fewer than 20 incidents per year, and only 10 in 2013 so far.
- At least 14 incidents were related to employees and contractors leaving media containing ePHI in vehicles that were broken into. That figure is likely higher, as it’s not a property tracked by OCR.
- Similarly, postal mail was a prominent medium for inadvertently disclosing ePHI before 2012, but no incidents appear since. The incidents provide lessons in paying attention to detail. Some involved sending the wrong patient information to recipients or inserting a list of patients and associated private information into a group mailing, and in one case ePHI was printed on the external mailing label. Additionally, backup media was mailed in a few instances and either never reached its destination or was addressed to the wrong recipient.
- For positive trends, loss of ePHI on portable devices has declined steadily since 2010, and currently stands at 8 incidents for 2013. What’s being classified simply as “computer” for media has also steadily declined, and currently stands at 0 for 2013. However, that may be due to better classification, as “computer” seems to be a catch-all.
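For anyone wanting to reproduce the projected totals and the records-per-incident averages above, here is a minimal Python sketch. It assumes the run rate is scaled linearly by the fraction of the year elapsed, which is only one plausible reading of "run rate"; the function names are hypothetical, and no figures from the OCR data are hard-coded.

```python
from datetime import date

def project_full_year(count_so_far, as_of):
    """Scale a year-to-date count linearly to a full calendar year."""
    start = date(as_of.year, 1, 1)
    next_start = date(as_of.year + 1, 1, 1)
    elapsed_days = (as_of - start).days + 1   # count the as-of day itself
    total_days = (next_start - start).days    # 365, or 366 in a leap year
    return count_so_far * total_days / elapsed_days

def records_per_incident(records, incidents):
    """Average records per incident; 0.0 when there are no incidents."""
    return records / incidents if incidents else 0.0

# The latest date used in the data set, per the notes above.
as_of = date(2013, 10, 12)

# Hypothetical year-to-date counts, for illustration only.
projected_incidents = project_full_year(100, as_of)
projected_records = project_full_year(2_000_000, as_of)
average = records_per_incident(projected_records, projected_incidents)
```

Since both totals are scaled by the same factor, the average for the projected year equals the year-to-date average; only the totals change. A single very large breach late in the year would throw off a linear projection of records far more than one of incidents.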
After spending a few hours becoming intimate with the data, I’m left with the feeling that the healthcare industry is making progress in some areas, but is overall struggling to clot the wound bleeding patient data. If you’ve done your own analysis and come up with different results, I’d love to hear what they are and how you arrived at your conclusions.
Stay safe, my friends.