Co-authored by David A. Valovcin and Enrique Gutierrez-Alvarez.

Data classification is a trending topic within the security world these days — and for good reason. As a critical step in any comprehensive data protection program, classification takes on even greater importance today in the context of regulatory compliance mandates and debates over data privacy.

So, what are the classification basics, challenges and best practices?

Data Classification 101

Before we delve into the nuances of data classification, let’s lay out a standard definition. Classification, as it pertains to data security, involves three primary parts (a minimal code sketch of all three follows the list):

  1. Parsing structured and unstructured data;
  2. Identifying which of that data matches a predefined set of patterns or keywords; and
  3. Assigning labels to data based on levels of sensitivity or business value.
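
To make these three parts concrete, here is a minimal Python sketch. The table, the patterns and the 0.8 match threshold are illustrative assumptions, not a real rule set or any product’s engine:

```python
import re

# Part 1: parse structured data -- here, a table is just columns mapped to values.
table = {
    "contact": ["212-555-0147", "646-555-0199"],
    "notes": ["call after 5pm", "prefers email"],
}

# Part 2: a predefined set of patterns to match against (illustrative only).
PATTERNS = {
    "phone_number": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

# Part 3: assign a label when enough of a column's values match a pattern.
def classify(columns, patterns, threshold=0.8):
    labels = {}
    for name, values in columns.items():
        for label, pattern in patterns.items():
            hits = sum(1 for v in values if pattern.match(v))
            if values and hits / len(values) >= threshold:
                labels[name] = label
    return labels

print(classify(table, PATTERNS))  # {'contact': 'phone_number'}
```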

Classification functions as the second step in analyzing your data resources, coming after discovery, which is the process of identifying where data resides. But why do we care about classification? In short: organizations cannot protect their data if they don’t know what type of data exists, what value it holds for the organization and who can use it.

Classification enables you to assign an identity to your data, which helps you understand how to treat it. This understanding is especially important for sensitive or regulated data that requires specific layers of protection. Classification can also help build a more mature security program by identifying specific subsets of data that warrant activity monitoring — and, just as useful, by showing you what you don’t need to focus on protecting.

However, there are different ways to think about classification. The process can serve security, compliance, data integrity, business usage, technology and other purposes, but we are most concerned here with the compliance and security use cases. Why? Because you can’t protect what you don’t know you have.

Classification is a key piece of the data protection puzzle because the way a particular data set is classified in a security context can dictate how it is protected or controlled by the policies and rules that form the backbone of an organization’s security program.

What Are the Common Approaches to Data Classification?

Today, there are several different approaches to classification on the market. Some organizations opt to undertake manual classification efforts, but this can become incredibly time-consuming and operationally complex. There are also catalog-based search methods, which essentially scan metadata or column names and assign labels to the data within those columns based on predefined policies.

The challenge with this approach is that if the column name is not accurate, the classification result will not be accurate either. For example, if you have telephone numbers in a column titled “Column A” and Social Security numbers in a column titled “Column B,” the engine would not pick this up, leaving your sensitive data unclassified and potentially exposed.
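
Here is a rough sketch of how catalog-based matching works and why it fails in that scenario; the keyword rules are assumptions made up for illustration:

```python
# Catalog-based search: labels come from column names alone (illustrative rules).
NAME_RULES = {
    "phone": "phone_number",
    "ssn": "us_ssn",
    "social": "us_ssn",
}

def classify_by_name(column_name):
    lowered = column_name.lower()
    for keyword, label in NAME_RULES.items():
        if keyword in lowered:
            return label
    return None  # unclassified

print(classify_by_name("home_phone"))  # phone_number
print(classify_by_name("Column A"))    # None -- phone data goes unnoticed
print(classify_by_name("Column B"))    # None -- SSNs go unnoticed
```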

Building upon this approach, there are solutions that combine catalog-based search with data sampling, applying regular expressions to the sampled values themselves. Considering this richer set of rules and expressions contributes to higher accuracy. Both methods leverage automation and are highly preferable to manual classification efforts.
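
A hedged sketch of such a hybrid approach, assuming simple keyword rules for the catalog pass and a made-up 0.8 threshold for the sampling fallback:

```python
import re

NAME_HINTS = {"phone": "phone_number", "ssn": "us_ssn"}
VALUE_RULES = {
    "phone_number": re.compile(r"^1?[-. ]?\d{3}[-. ]?\d{3}[-. ]?\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_hybrid(column_name, sample_values, threshold=0.8):
    # Cheap pass first: catalog-style lookup on the column name.
    for keyword, label in NAME_HINTS.items():
        if keyword in column_name.lower():
            return label
    # Fallback: sample the values themselves with regular expressions.
    for label, pattern in VALUE_RULES.items():
        hits = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and hits / len(sample_values) >= threshold:
            return label
    return None

# "Column A" carries no hint in its name, but its values do.
print(classify_hybrid("Column A", ["1-800-356-9377", "212-555-0147"]))
# -> phone_number
```

The name check stays cheap, and the more expensive sampling pass only runs when the name gives no signal.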

Despite the methods available, however, many organizations today still struggle with accuracy in data classification and with integrating classification into a holistic, informed security program.

Why might this be?

What Are the Challenges With Classification?

Classification can be vastly improved by leveraging technologies that automate the process, but this also introduces risk due to data-pattern complexities that machines may miss.

For example, you’d think finding a U.S. telephone number would be easy: just look for a one preceding a 10-digit string, right? Wrong. What if dashes or spaces separate the digits? What if, instead of 1-800-356-9377, the number is stored as 1-800-FLOWERS?
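
The sketch below shows how quickly the naive rule breaks down, along with one common workaround: strip the separators and map vanity letters to their keypad digits. The patterns here are illustrative, not complete:

```python
import re

# The naive rule: a 1 followed by ten digits.
NAIVE = re.compile(r"^1\d{10}$")

# Map vanity letters to their keypad digits (ABC -> 2, DEF -> 3, ...).
KEYPAD = str.maketrans("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                       "22233344455566677778889999")

def normalize(raw):
    # Strip common separators, then translate any vanity letters.
    cleaned = re.sub(r"[-. ()]", "", raw.upper())
    return cleaned.translate(KEYPAD)

for raw in ["18003569377", "1-800-356-9377",
            "1 800 356 9377", "1-800-FLOWERS"]:
    print(f"{raw:>16}  raw match: {bool(NAIVE.match(raw))}  "
          f"normalized match: {bool(NAIVE.match(normalize(raw)))}")
```

Only the first string matches the naive rule as stored; after normalization, all four resolve to the same number and match.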

Classification clearly needs to be streamlined with technology — but it also needs context provided by humans. To remove as much risk as possible from the overall equation, organizations should look for classification technology that can parse potentially sensitive datasets in a scalable way while still identifying complex patterns.

What Is Next-Generation Data Classification?

Ideally, technology that supports this advanced classification would be able to:

  • Search beyond metadata or column names;
  • Match against multiple patterns for the same classification — not just one. For example, it could know that a string of numbers like “1-800-356-9377” and a combination of numbers and letters like “1-800-FLOWERS” can both be classified as “phone numbers” (see the registry sketch after this list);
  • Be taught more rules and patterns over time to build out a more robust classification library for particular needs, such as government- or industry-mandated compliance regulations; and
  • Scan in a way that is quick, scalable and nondisruptive to performance.
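
A minimal sketch of what such a teachable, multi-pattern rule registry could look like; the class and the rules shown are hypothetical, not any product’s API:

```python
import re

class ClassificationRegistry:
    """Holds many patterns per label and accepts new rules over time."""

    def __init__(self):
        self.rules = {}  # label -> list of compiled patterns

    def register(self, label, pattern):
        self.rules.setdefault(label, []).append(re.compile(pattern))

    def classify(self, value):
        return [label for label, patterns in self.rules.items()
                if any(p.match(value) for p in patterns)]

registry = ClassificationRegistry()
# Two different patterns, one classification.
registry.register("phone_number", r"^1-\d{3}-\d{3}-\d{4}$")
registry.register("phone_number", r"^1-\d{3}-[A-Z]{7}$")
# Rules can keep arriving as compliance needs grow.
registry.register("us_ssn", r"^\d{3}-\d{2}-\d{4}$")

print(registry.classify("1-800-356-9377"))  # ['phone_number']
print(registry.classify("1-800-FLOWERS"))   # ['phone_number']
```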

Some technologies enable this next-generation data classification, such as IBM Security Guardium Analyzer, which leverages elements of System T, a technology developed by IBM Research. How? By extracting the data from a table, crawling it, applying a taxonomy, conducting dictionary lookups and finding patterns that have been identified as personal or sensitive data.
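
The sketch below is not System T’s actual API; it only illustrates the described flow (extract the rows, crawl every cell, do a dictionary lookup, match patterns) with made-up dictionary entries and an assumed findings format:

```python
import re

# Dictionary lookup: known sensitive terms (illustrative entries only).
FIRST_NAMES = {"david", "enrique", "maria"}
# Pattern match: structured identifiers.
PATTERNS = {"us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$")}

def crawl_table(rows):
    """Walk every cell, applying dictionary lookups and pattern matches."""
    findings = []
    for row_idx, row in enumerate(rows):
        for col, value in row.items():
            if value.lower() in FIRST_NAMES:
                findings.append((row_idx, col, "first_name"))
            for label, pattern in PATTERNS.items():
                if pattern.match(value):
                    findings.append((row_idx, col, label))
    return findings

rows = [{"name": "Maria", "id": "123-45-6789"},
        {"name": "unknown", "id": "n/a"}]
print(crawl_table(rows))
# [(0, 'name', 'first_name'), (0, 'id', 'us_ssn')]
```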

Rules for this kind of classification can be more expressive, which improves accuracy. It allows a more granular look at the data, matched against a more granular and specific set of patterns — and all of this happens rapidly, without negatively impacting database performance.

While having technology like this at your fingertips provides immense opportunity, you still need humans to make the rules the technology uses, because classification results are only as good as the rules they are built on. Humans provide the context and understanding that enable the development of classification patterns in the first place.

Today, IBM Security is working internally to develop an extensive library of classification patterns for the IBM Security Guardium Analyzer that can help it identify a broad array of sensitive data — from U.S. Social Security numbers and telephone numbers from numerous countries to German pension insurance numbers and more. Because System T can incorporate new rules over time, it allows people from all over the world — people who are experts in identifying data patterns — to work together on improving the classification engine. In other words, classification becomes a team effort.

