Co-authored by David A. Valovcin and Enrique Gutierrez-Alvarez.
Data classification is a trending topic within the security world these days — and for a good reason. As a critical step in any comprehensive data protection program, classification takes on even greater importance today in the context of regulatory compliance mandates and debates over data privacy.
So, what are the classification basics, challenges and best practices?
Data Classification 101
Before we delve into the nuances of data classification, let’s lay out a standard definition. Classification, as it pertains to data security, involves three primary parts:
- Parsing structured and unstructured data;
- Identifying which of that data matches a predefined set of patterns or keywords; and
- Assigning labels to data based on levels of sensitivity or business value.
Classification functions as a second step in analyzing your data resources after discovery, which is the process of identifying where data resides. But why do we care about classification? In short: Organizations cannot protect their data if they don’t know what type of data exists, its value to the organization and who can use it.
Classification enables you to assign an identity to your data, which helps you understand how to treat it. This understanding is especially important for sensitive or regulated data that requires specific layers of protection. Classification can also help hone a more mature security program by identifying which subsets of data warrant activity monitoring, and, just as usefully, which data you don't need to focus on protecting.
However, there are different ways to think about classification. The process can serve many purposes: security, compliance, data integrity, business usage, technology operations and more. Here, we are most concerned with the compliance and security use cases. Why? Because you can't protect what you don't know you have.
Classification is a key piece of the data-protection puzzle because the way a particular data set is classified in a security context can dictate how it’s protected or controlled via various policies or rules an organization might create to function as the backbone of their security program.
What Are the Common Approaches to Data Classification?
Today, there are several different approaches to classification on the market. Some organizations opt for manual classification, but this quickly becomes time-consuming and operationally complex. Others use catalog-based search methods, which scan metadata or column names and assign labels to the data within those columns based on predefined policies.
The challenge with this approach is that the classification result is only as accurate as the column name. For example, if you have telephone numbers in a column titled "Column A" and Social Security numbers in a column titled "Column B," the engine would not pick this up, leaving your sensitive data unclassified and potentially exposed.
Building upon this approach, there are solutions available that combine catalog-based search with data sampling matched against regular expressions. Because a richer set of rules and expressions is considered, this approach yields higher accuracy. Both automated methods are highly preferable to manual classification efforts.
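To make the hybrid idea concrete, here is a minimal sketch of a classifier that checks both the column name (catalog-based search) and a sample of the column's values (data sampling with a regular expression). The function names, keyword list, pattern and threshold are all illustrative assumptions, not the workings of any particular product.

```python
import re

# Illustrative column-name keywords for catalog-based search (assumption).
CATALOG_HINTS = {"phone", "telephone", "tel"}

# Simplified U.S. phone-number pattern for data sampling (assumption).
VALUE_PATTERN = re.compile(r"^1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def classify_column(name, sample_values, threshold=0.8):
    """Label a column 'phone_number' if its name OR its sampled data matches."""
    # Catalog-based check: does the column name contain a known keyword?
    if any(hint in name.lower() for hint in CATALOG_HINTS):
        return "phone_number"
    # Data-sampling check: do enough sampled values match the pattern?
    if sample_values:
        hits = sum(1 for v in sample_values if VALUE_PATTERN.match(v))
        if hits / len(sample_values) >= threshold:
            return "phone_number"
    return "unclassified"

# A column misleadingly named "Column A" is still classified correctly,
# because its sampled values match the pattern.
print(classify_column("Column A", ["800-356-9377", "212-555-0147"]))  # phone_number
```

Note that the data-sampling check is what rescues the mislabeled "Column A" case that defeats a purely catalog-based engine.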
Despite the methods available, however, many organizations today still struggle with accuracy in data classification and integrating classification into a holistic and informed security program.
Why might this be?
What Are the Challenges With Classification?
Classification can be vastly improved by leveraging technologies that automate the process, but this also introduces risk due to data-pattern complexities that machines may miss.
For example, you’d think finding a U.S. telephone number would be easy — just look for a one preceding a 10-digit string of random numbers, right? Wrong. What if dashes or spaces separate the numbers? What if instead of 1-800-356-9377, the number is stored as 1-800-FLOWERS?
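A short sketch shows why the naive "one plus 10 digits" rule breaks down, and how a classifier might cope: normalize letters to their standard telephone-keypad digits and strip separators before matching. The helper names and the final pattern are assumptions for illustration.

```python
import re

# Standard telephone keypad letter-to-digit mapping.
KEYPAD = {c: d for d, letters in {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}.items() for c in letters}

def normalize(number):
    """Map vanity letters to keypad digits, then strip separators."""
    digits = "".join(KEYPAD.get(c.upper(), c) for c in number)
    return re.sub(r"[-. ()]", "", digits)

# After normalization, one simple pattern covers all the variants.
US_PHONE = re.compile(r"^1?\d{10}$")

for raw in ["1-800-356-9377", "1 800 356 9377", "1-800-FLOWERS"]:
    print(raw, "->", normalize(raw), bool(US_PHONE.match(normalize(raw))))
```

All three spellings, including 1-800-FLOWERS, normalize to the same digit string, so a single pattern can classify them consistently.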
Classification clearly needs to be streamlined with technology — but it also needs context provided by humans. To remove as much risk as possible from the overall equation, organizations should look for the most advanced classification technology available that enables parsing through potentially sensitive datasets in a scalable way while still being able to identify complex patterns.
What Is Next-Generation Data Classification?
Ideally, technology that supports this advanced classification would be able to:
- Search beyond metadata or column names;
- Match against multiple patterns for the same classification — not just one. For example, it could know that a string of numbers that looks like “1-800-356-9377” and a combination of numbers and letters that looks like “1-800-FLOWERS” could both be classified as “phone numbers”;
- Be taught more rules and patterns over time to build out a more robust classification library for particular functions, such as government- or industry-mandated compliance regulations; and
- Scan in a way that is quick, scalable and nondisruptive to performance.
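The first three capabilities above can be sketched as a rule library in which each label maps to a list of patterns and new rules can be registered over time. Everything here, labels, regexes and the registration function, is a hypothetical illustration, not the design of any specific engine.

```python
import re

# Each classification label maps to MULTIPLE patterns, not just one.
RULES = {
    "phone_number": [
        re.compile(r"^1-\d{3}-\d{3}-\d{4}$"),   # e.g. 1-800-356-9377
        re.compile(r"^1-\d{3}-[A-Z]{7}$"),      # e.g. 1-800-FLOWERS
    ],
    "us_ssn": [re.compile(r"^\d{3}-\d{2}-\d{4}$")],
}

def register_rule(label, pattern):
    """Teach the library a new rule over time, e.g. for a new mandate."""
    RULES.setdefault(label, []).append(re.compile(pattern))

def classify(value):
    """Return every label whose pattern list matches the value."""
    return [label for label, pats in RULES.items()
            if any(p.match(value) for p in pats)]

# Both spellings resolve to the same classification.
print(classify("1-800-356-9377"))  # ['phone_number']
print(classify("1-800-FLOWERS"))   # ['phone_number']

# Later, grow the library (the regex below is a made-up placeholder,
# not the real German pension insurance number format).
register_rule("de_pension_number", r"^\d{2} \d{6} [A-Z] \d{3}$")
```

The key design point is that a label owns a growing list of patterns, so matching "more formats" never requires changing the label itself.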
Some technologies enable this next-generation data classification. IBM Security Guardium Analyzer, for example, leverages elements of System T, a technology developed by IBM Research. How? By extracting all of the data from a table, crawling it, applying taxonomy, conducting dictionary lookups and finding patterns that have been identified as personal or sensitive data.
Rules for this kind of classification can be more expressive, which improves accuracy. This allows for a granular look at data matched against a granular, specific set of patterns, all of which occurs rapidly and without negatively impacting database performance.
While having technology like this at your fingertips provides immense opportunity, you still need humans to make the rules that the technology uses, because classification results are only as good as the rules they are built on. Humans provide the context and understanding that enable the development of classification patterns in the first place.
Today, IBM Security is working internally to develop an extensive library of classification patterns for IBM Security Guardium Analyzer that can help it identify a broad array of sensitive data, from U.S. Social Security numbers and telephone numbers from numerous countries to German pension insurance numbers and more. Because System T can incorporate new rules over time, it allows people from all over the world, people who are experts in identifying data patterns, to work together to improve the classification engine. Classification, in other words, becomes a team effort.