Co-authored by David A. Valovcin and Enrique Gutierrez-Alvarez.

Data classification is a trending topic within the security world these days — and for a good reason. As a critical step in any comprehensive data protection program, classification takes on even greater importance today in the context of regulatory compliance mandates and debates over data privacy.

So, what are the classification basics, challenges and best practices?

Data Classification 101

Before we delve into the nuances of data classification, let’s lay out a standard definition. Classification, as it pertains to data security, involves three primary parts (a short code sketch follows the list):

  1. Parsing structured and unstructured data;
  2. Identifying which of that data matches a predefined set of patterns or keywords; and
  3. Assigning labels to data based on levels of sensitivity or business value.
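
Here is a minimal sketch of those three steps, assuming hypothetical patterns and label names; a real classifier would draw on a far larger rule set:

    import re

    # Hypothetical pattern library: classification name -> (pattern, sensitivity label).
    PATTERNS = {
        "US_SSN": (re.compile(r"^\d{3}-\d{2}-\d{4}$"), "Restricted"),
        "US_PHONE": (re.compile(r"^1-\d{3}-\d{3}-\d{4}$"), "Internal"),
    }

    def classify(values):
        # Step 2 and 3: match each value against the patterns, then attach a label.
        labels = []
        for value in values:
            for name, (pattern, label) in PATTERNS.items():
                if pattern.match(value.strip()):
                    labels.append((value, name, label))
        return labels

    # Step 1 (parsing structured or unstructured sources into values) is assumed
    # to have happened upstream.
    print(classify(["078-05-1120", "1-800-356-9377"]))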

Classification functions as a second step in analyzing your data resources after discovery, which is the process of identifying where data resides. But why do we care about classification? In short: Organizations cannot protect their data if they don’t know what type of data exists, its value to the organization and who can use it.

Classification enables you to assign an identity to your data, which helps you understand how to treat it. This understanding is especially important for sensitive or regulated data that requires specific layers of protection. Classification can also help you build a more mature security program by identifying specific subsets of data to prioritize for activity monitoring, and by showing you what you don’t need to focus on protecting.

However, there are different ways to think about classification. The process can be leveraged for security, compliance, data integrity, business usage, technology and more. But we are most concerned with classification for the compliance and security use cases. Why? Because you can’t protect what you don’t know you have.

Classification is a key piece of the data-protection puzzle. The way a particular data set is classified in a security context can dictate how it is protected or controlled via the policies and rules an organization creates to serve as the backbone of its security program.

What Are the Common Approaches to Data Classification?

Today, there are several different approaches to classification on the market. Some organizations opt to undertake manual classification efforts, but this can become incredibly time-consuming and operationally complex. There are also catalog-based search methods, which essentially scan metadata or column names and assign labels to the data in those columns based on predefined policies.

The challenge with this approach is that if the column name is not accurate, the classification result will not be accurate either. For example, if you have telephone numbers in a column titled “Column A” and Social Security numbers in a column titled “Column B,” the engine would not pick this up, leaving your sensitive data unclassified and potentially exposed.
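
To make that failure mode concrete, here is a small sketch of a purely catalog-based check; the policy map and column names are hypothetical:

    # Catalog-based search: labels come from column names alone.
    POLICY = {"phone": "PHONE_NUMBER", "ssn": "US_SSN"}

    def classify_by_catalog(columns):
        results = {}
        for column_name in columns:
            # Match predefined keywords against the column name (metadata only).
            matches = [label for key, label in POLICY.items()
                       if key in column_name.lower()]
            results[column_name] = matches or ["UNCLASSIFIED"]
        return results

    # "Column A" holds phone numbers and "Column B" holds Social Security
    # numbers, but the names reveal nothing, so both stay unclassified.
    print(classify_by_catalog(["Column A", "Column B", "customer_phone"]))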

Building upon this approach, there are solutions available that leverage catalog-based search and data sampling search with regular expressions simultaneously. This approach results in a richer set of rules and expressions considered, which contributes to higher accuracy. Both of the previously stated methods leverage automation and are highly preferable to manual classification efforts.
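
A sketch of what combining the two might look like, again with illustrative patterns and a made-up sampling threshold:

    import re

    # Column-name keywords plus value-level regular expressions.
    VALUE_PATTERNS = {
        "PHONE_NUMBER": re.compile(r"^1[-. ]?\d{3}[-. ]?\d{3}[-. ]?\d{4}$"),
        "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    }
    NAME_KEYWORDS = {"phone": "PHONE_NUMBER", "ssn": "US_SSN"}

    def classify_column(column_name, sample_values, threshold=0.8):
        for key, label in NAME_KEYWORDS.items():
            if key in column_name.lower():
                return label                  # catalog hit on metadata
        for label, pattern in VALUE_PATTERNS.items():
            hits = sum(bool(pattern.match(v)) for v in sample_values)
            if hits / len(sample_values) >= threshold:
                return label                  # sampling hit on the data itself
        return "UNCLASSIFIED"

    # The badly named column is now caught by sampling its values.
    print(classify_column("Column A", ["1-800-356-9377", "1 212 555 0100"]))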

Despite the methods available, however, many organizations today still struggle with accuracy in data classification and integrating classification into a holistic and informed security program.

Why might this be?

What Are the Challenges With Classification?

Classification can be vastly improved by leveraging technologies that automate the process, but this also introduces risk due to data-pattern complexities that machines may miss.

For example, you’d think finding a U.S. telephone number would be easy — just look for a one preceding a 10-digit string of random numbers, right? Wrong. What if dashes or spaces separate the numbers? What if instead of 1-800-356-9377, the number is stored as 1-800-FLOWERS?
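
A small sketch of how normalization can bridge those variations: the keypad letter-to-digit mapping is the standard telephone keypad, while the regular expression and helper function are illustrative:

    import re

    # Standard phone keypad: letters to digits.
    KEYPAD = {c: d for d, letters in {
        "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
        "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
    }.items() for c in letters}

    PHONE = re.compile(r"^1\d{10}$")

    def normalize(raw):
        # Translate keypad letters to digits, then strip separators.
        digits = "".join(KEYPAD.get(c.upper(), c) for c in raw)
        return re.sub(r"[-. ()]", "", digits)

    for number in ["1-800-356-9377", "1-800-FLOWERS", "1 (800) 356 9377"]:
        print(number, "->", bool(PHONE.match(normalize(number))))

Note that 1-800-FLOWERS normalizes to exactly 1-800-356-9377, which is why both forms should land under the same classification.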

Classification clearly needs to be streamlined with technology — but it also needs context provided by humans. To remove as much risk as possible from the overall equation, organizations should look for the most advanced classification technology available that enables parsing through potentially sensitive datasets in a scalable way while still being able to identify complex patterns.

What Is Next-Generation Data Classification?

Ideally, technology that supports this advanced classification would be able to (see the sketch after this list):

  • Search beyond metadata or column names;
  • Match against multiple patterns for the same classification — not just one. For example, it could know that a string of numbers that looks like “1-800-356-9377” and a combination of numbers and letters that looks like “1-800-FLOWERS” could both be classified as “phone numbers”;
  • Be taught more rules and patterns over time to build out a more robust classification library for particular functions, such as government- or industry-mandated compliance regulations; and
  • Scan in a way that is quick, scalable and nondisruptive to performance.
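
A rough sketch of what an extensible rule library covering the first three points might look like; the class and pattern names are hypothetical:

    import re

    # Each classification can hold several patterns, and new rules can be
    # registered over time as the library grows.
    class RuleLibrary:
        def __init__(self):
            self.rules = {}

        def register(self, classification, pattern):
            self.rules.setdefault(classification, []).append(re.compile(pattern))

        def classify(self, value):
            return [name for name, patterns in self.rules.items()
                    if any(p.match(value) for p in patterns)]

    library = RuleLibrary()
    # Two patterns for the same classification: numeric and vanity forms.
    library.register("phone_number", r"^1-\d{3}-\d{3}-\d{4}$")
    library.register("phone_number", r"^1-\d{3}-[A-Z]{7}$")
    # Taught later, e.g., for a new compliance mandate.
    library.register("us_ssn", r"^\d{3}-\d{2}-\d{4}$")

    print(library.classify("1-800-FLOWERS"))   # ['phone_number']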

Some technologies enable this next-generation data classification, such as IBM Security Guardium Analyzer, which leverages elements of System T, a technology developed by IBM Research. How? By extracting all of the data from a table, crawling it, applying a taxonomy, conducting dictionary lookups and finding patterns that have been identified as personal or sensitive data.
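
The following is a loose, hypothetical sketch of the general shape of such a pipeline; it is not System T’s or Guardium Analyzer’s actual implementation, and every name and pattern in it is invented for illustration:

    import re

    TAXONOMY = {"identifier": ["us_ssn"], "contact": ["phone_number"]}
    DICTIONARY = {"ssn", "social security", "phone", "telephone"}
    PATTERNS = {
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone_number": re.compile(r"\b1-\d{3}-\d{3}-\d{4}\b"),
    }

    def scan_table(rows):
        findings = []
        for row in rows:                          # crawl the extracted rows
            for cell in row:
                text = str(cell)
                # Dictionary lookup for contextual hints around the value.
                hints = [w for w in DICTIONARY if w in text.lower()]
                for name, pattern in PATTERNS.items():
                    if pattern.search(text):
                        # Map the matched pattern back to its taxonomy category.
                        category = next(c for c, names in TAXONOMY.items()
                                        if name in names)
                        findings.append((text, name, category, hints))
        return findings

    print(scan_table([["SSN: 078-05-1120"], ["call 1-800-356-9377"]]))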

Rules for this kind of classification can be more expressive, which improves accuracy. The result is a more granular look at the data, matched against a more granular and specific set of patterns, all happening rapidly and without negatively impacting database performance.

While having technology like this at your fingertips provides immense opportunity, you still need humans to write the rules the technology uses, because the classification results are only as good as the rules they are built on. Humans provide the context and understanding that enable the development of classification patterns in the first place.

Today, IBM Security is working internally to develop an extensive library of classification patterns for the IBM Security Guardium Analyzer that can help it identify a broad array of sensitive data, from U.S. Social Security numbers and telephone numbers from numerous countries to German pension insurance numbers and more. Because System T can incorporate new rules over time, experts in identifying data patterns from all over the world can work together to improve the classification engine. This makes classification a team effort.

