What Exactly Does the Rise of Hadoop and NoSQL Systems Mean for Data Protection and Security?

A Cambrian Explosion of Data and Data Stores

In the world of databases, we’ve been pretty happy for the last 30+ years or so with our relational database systems (RDBMSs) to handle our core data processing needs such as ERP, payment processing, and other transactional processes – these systems are often referred to as Systems of Record. And those of us involved in information security have a pretty good idea of what needs to be done to help secure and protect that data.

But our world is changing. As more and more new development focuses on the social web and on analytics of high volumes and wide varieties of data, many new data systems, such as the class of data stores known as “NoSQL” and Hadoop-based systems, are being created and are evolving to meet the challenges of this new era of computing, including a strong trend toward moving more and more storage and processing on the cloud.

I submit that those who are responsible for data security and privacy must learn about these new systems and their security implications and work in partnership with the application teams who will be using them. By working together and understanding the security posture of these systems, appropriate measures can be built in from the beginning to mitigate security risks.

In this and subsequent blog entries, I will describe the reasons for the rise of this new class of data systems, go into detail on a few of them, and describe some security mitigation measures, such as data activity monitoring and encryption, to help protect data stored in and retrieved from these systems.

The rise of new data systems

I’m married to a biologist, and when I think about what’s been happening in the world of databases and data processing platforms, the Cambrian Explosion seems like an apt analogy, if perhaps a bit exaggerated. Part of the reason the RDBMSs have survived so long and continue to thrive is the fact that RDBMS vendors (and the open source community) are adept at evolving their technology and their standards to support the changing requirements around them. The technology and language have evolved over time to support complex schemas for analysis on large amounts of data, and to support new data types, such as XML and user-defined types that map to programming objects. They’ve evolved their infrastructures to handle larger and larger amounts of data in an efficient manner. In an evolutionary sense, the RDBMS is a strong and adaptable technology and will be around for the foreseeable future because of that.

Be that as it may, there are evolutionary niches that are rapidly being filled by new “life forms”. These new life forms, by which I mean new data systems, in case you didn’t get my clever analogy, are competing with or complementing RDBMSs.

And there are a lot of them. The Cambrian Explosion refers to a period roughly 530 million years ago (give or take a day) when life “exploded” into an incredible diversity of life forms. Things were bopping along pretty quietly, then some pretty different-looking life forms came onto the scene. As an example, take a look at this interesting creature from the deep, known as Hallucigenia, because, well, it looks like something you would see in a very bad dream.

Hallucigenia is an example of one of the strange life forms from the Cambrian

NoSQL and Hadoop

Back to our own somewhat less hallucinogenic but changing data processing world…. In the world of data systems, most of these new systems are commonly categorized under the name NoSQL systems. The nomenclature of “NoSQL” is somewhat controversial since it is defining what it is not, rather than what it is. It gets even more complicated because some of these systems do support a SQL-like interface. In fact, perhaps one of the most significant recent activities in the NoSQL world is SQL (structured query language).

With this in mind, the NoSQL community has tried to evolve the meaning of NoSQL to mean “not only SQL,” which refers to a wide variety of databases and data stores that have moved away from the relational data model. These systems are not only used for Big Data – they support many different use cases that are not necessarily analytical and do not necessarily rely on huge volumes. These NoSQL systems are generally categorized by their data models:

  • Document store – Example is MongoDB
  • Graph store – Example is Neo4j
  • Column store – Example is HBase
  • Key-value store – Example is Riak

These stores can be further classified by whether they can run in memory, such as Memcached, which is a key-value, in-memory data store.
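
To give a feel for how the four data models listed above differ, here is a rough sketch that holds the same made-up customer in four shapes using plain Python structures. Nothing here talks to MongoDB, Neo4j, HBase, or Riak; the record, the keys, and the helper function are all hypothetical and only meant to illustrate the shape of the data.

```python
import json

# Document store: one self-describing, nested record per entity.
document = {"_id": "cust-1", "name": "Pat", "orders": [{"sku": "A7", "qty": 2}]}

# Key-value store: an opaque value that is looked up by a single key.
key_value = {"cust-1": b'{"name": "Pat"}'}

# Column store: row key -> column family -> column -> value.
column = {"cust-1": {"profile": {"name": "Pat"}, "orders": {"A7": "2"}}}

# Graph store: nodes plus typed relationships (edges) between them.
graph = {
    "nodes": {
        "cust-1": {"label": "Customer", "name": "Pat"},
        "sku-A7": {"label": "Product"},
    },
    "edges": [("cust-1", "ORDERED", "sku-A7", {"qty": 2})],
}


def name_from_each_model() -> list:
    """Read the customer's name back out of each representation."""
    return [
        document["name"],
        json.loads(key_value["cust-1"])["name"],  # the store never looks inside the value
        column["cust-1"]["profile"]["name"],
        graph["nodes"]["cust-1"]["name"],
    ]


if __name__ == "__main__":
    print(name_from_each_model())
```

From a security point of view, the thing to notice is that the sensitive field (the name) lives in a different place in each model, so a monitoring or masking rule written for one shape does not automatically carry over to another.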

One can also put Hadoop in this category or consider it separately. Hadoop is more of a framework for processing huge volumes of a wide variety of data using a distributed file system called the Hadoop Distributed File System (HDFS) and a processing framework called MapReduce, which basically refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
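
To make the two phases concrete, here is a minimal sketch of the classic word-count example in plain Python. It is not Hadoop code and it runs in a single process; the function names are my own, and the grouping step in the middle is something the Hadoop framework normally does for you between map and reduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: break each input line down into (key, value) tuples, here (word, 1)."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the values by key (done by the framework in a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine each key's values into a smaller set of tuples."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on the given lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data big security", "data security"]))
```

In a real cluster the map and reduce functions run in parallel on many nodes, which is exactly why a single job can touch a very large amount of data in one pass.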

It really can get complicated because there are several relational data warehousing products that are now allowing MapReduce processing as well – this runtime is often referred to as in-database MapReduce.

Why is the explosion happening?

What change has led to this diversity of new data systems? Why now? One reason is that everything is happening so fast. The cyber world is changing and evolving as quickly as developers can push out a new app or new site – we call this the era of social-mobile-cloud. Facebook and Twitter have changed our world permanently with the vast amounts of data they process and the speed at which they must process it. Developers have to work with speed and agility, and many of these new use cases don’t necessarily require the rigid data integrity protections, such as fixed schemas, that relational databases offer. Fixed schemas are the best at protecting data integrity, but the unfortunate side effect is that changing them can take a long time by Web-development standards, requiring the help of a DBA and a sometimes lengthy change control process.

Developers who work on these new types of data applications may have no patience for such processes. They also may be more productive working with data models that are more aligned with web programming. This may mean sacrificing data integrity for the ability to insert a wide variety of data structures without requiring database schema changes. It may mean sacrificing transactional guarantees (two-phase commit, for example) for speed. These new systems are referred to as Systems of Engagement, and a modern-era information management platform is composed of Systems of Record integrated with Systems of Engagement.

Will these new systems go extinct?

Many of the life forms that evolved during the Cambrian Explosion did go extinct. That’s why we don’t have Hallucigenia around any more to give us bad dreams. And I can’t predict which data systems will survive our fickle tech world. But there is evidence that, overall, there is a place for some of these systems to survive longer term. Jeff Kelley from Wikibon, who tracks market share for NoSQL systems, writes that the demand for Hadoop and NoSQL software and related products will grow along with the volume of data, although there could be a bit of a lag until more people with relevant skills evolve to fill the gap to manage these new systems.

The Wikibon report also states that enterprises that were reluctant to invest in Hadoop and NoSQL technology one or two years ago are beginning to do so, including companies in industries with significant compliance, regulatory, and security requirements such as financial services and healthcare. I know this from my own experience – the development of data activity monitoring software for MongoDB was driven by a financial services client who was interested in rolling MongoDB technology out to more areas of the enterprise that are driven by regulation and require strict security controls.

Another data point is the growing attendance at the NoSQL conference in San Jose, California in 2013, which crossed 1,000 attendees for the first time. It was my first year there (we were co-presenting with the MongoDB team on MongoDB data protection), and the conference had many attendees from very large companies, mainly architects and people from research organizations. And there were a number of topics around security and governance as well.

Finally, our enterprise clients from typical Fortune 500 companies are starting to use Hadoop and NoSQL. For example, I have heard personally of clients using MongoDB, Cassandra, Riak, Neo4j, and Hadoop (various distributions).

I also want to point out that although there are some systems that are bubbling to the top in terms of market share and venture capital funding, there are no clear 1 or 2 or 3 market leaders. This can complicate the world for enterprises looking to make strategic decisions in this space. It also complicates the world of managing and securing these systems as there are no standards as of yet to help vendors build interoperable tools and security systems.

What about data security and protection?

From a data security and protection perspective, RDBMSs are a highly attractive target for cybercrooks because of the value of the data they hold and because they have been around so long that hackers have had plenty of time to find ways to break into them. The passage of time has also allowed RDBMSs to develop protections against this natural enemy, including rich systems of privilege management, authentication, and access controls, and even in-database capabilities for encryption and masking. And outside the database, a whole ecosystem of database activity monitoring (DAM), data encryption, vulnerability assessment, and data privacy software has evolved to help organizations manage risk in cross-vendor environments while maintaining separation of duties.

NoSQL and Hadoop-based systems are quite young and, as with the early days of relational, have evolved to solve business or technical problems independent of security concerns. Many of these systems were designed to appeal to developers and startups who were interested in rapid development and a low purchase cost. Many of the vendors who brought these systems to market are now more interested in infiltrating the enterprise and are well aware that adoption is dependent on building in more security features, such as Kerberos authentication and granular roles and privileges, to help alleviate the fear, uncertainty, and doubt that will hinder their adoption. Some of these vendors are partnering with other vendors, such as IBM, so that they can present a more complete data security story to their potential enterprise clients.

As an example, Tony Baer, a principal analyst from Ovum, recently blogged on ZDNet about Hadoop security and how vendors such as IBM are working to fill the void left by the open source projects. He calls it The Odd Couple: Hadoop and Data Security.

Firewalling isn’t the (only) answer

The advice offered by some NoSQL database vendors up until recently, and still even now, is that the database should be completely firewalled off from the rest of the world or be very tightly bound to a particular application. As you probably know better than I, that is really not an answer. There are just too many ways to get to the data. For example, you might think that since there is no SQL there is no reason to worry about SQL injection attacks. But some NoSQL systems allow JavaScript to be run on the server, and there is definitely a potential for injection that way. And I’ve already heard of an internal breach caused by a MapReduce job that exposed personally identifiable information.
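
Here is a small sketch of what injection can look like when there is no SQL in sight. It only builds query structures in plain Python and never contacts a database. `$where` (a server-side JavaScript predicate) and `$ne` are real MongoDB query operators; the function names, the `username` field, and the attack string are hypothetical and chosen just for illustration.

```python
def unsafe_where_clause(username: str) -> dict:
    """Unsafe: builds a server-side JavaScript predicate by string concatenation."""
    return {"$where": "this.username == '" + username + "'"}


def safe_filter(username) -> dict:
    """Safer: pass the value as data and reject anything that is not a plain string."""
    if not isinstance(username, str):
        # A JSON request body could supply {"$ne": None} here instead of a string,
        # which would turn an equality test into "match every user".
        raise ValueError("username must be a string")
    return {"username": username}


if __name__ == "__main__":
    attack = "' || '1' == '1"
    # The attacker's quote characters close the string literal and append a
    # condition that is always true.
    print(unsafe_where_clause(attack)["$where"])
    # prints: this.username == '' || '1' == '1'
    print(safe_filter("alice"))
```

The pattern is the same one we know from SQL injection: untrusted input ends up being interpreted as code or as query operators rather than as a value, and a firewall does nothing to stop it when the request arrives through the legitimate application.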

With NoSQL systems, there is also the potential for “schema injection”. Because there is no enforced schema, it is possible to dynamically insert new fields that include any arbitrary information, which could definitely violate the integrity of your data. For example, imagine setting the value of the ‘Gender’ field to “Canada”. For this reason, organizations that use some of these newer systems may have to build more of their own data security and integrity controls. And as always, we must be on guard against intentional or accidental privileged user activity that can cause embarrassing or dangerous data breaches.
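
As one illustration of building your own integrity control, an application could check every document against an allowlist of fields and values before writing it. The sketch below is plain Python with made-up field names and rules (only ‘Gender’ comes from the example above); it is an assumption about how such a check might look, not a feature of any particular NoSQL product.

```python
from typing import Any, Callable, Dict, List

# Hypothetical rules: each allowed field maps to a check on its value.
ALLOWED_FIELDS: Dict[str, Callable[[Any], bool]] = {
    "Name": lambda v: isinstance(v, str) and 0 < len(v) <= 100,
    "Gender": lambda v: isinstance(v, str)
    and v in {"Female", "Male", "Other", "Undisclosed"},
}


def validate_document(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the document is acceptable."""
    problems = []
    for field, value in doc.items():
        check = ALLOWED_FIELDS.get(field)
        if check is None:
            problems.append(f"unexpected field: {field}")
        elif not check(value):
            problems.append(f"invalid value for {field}: {value!r}")
    return problems


if __name__ == "__main__":
    print(validate_document({"Name": "Pat", "Gender": "Canada", "IsAdmin": True}))
    # prints: ["invalid value for Gender: 'Canada'", 'unexpected field: IsAdmin']
```

A check like this has to sit in front of every write path to be useful, which is one more reason for security and application teams to work together early.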

Additional compensation required – sensitive data discovery and connection pooling

Those of you who work closely with data security probably have about a million ideas swirling through your head now about other capabilities that are not currently provided by the NoSQL vendors or even by data security vendors. For example, how do you track applications that access the NoSQL database?  In the RDBMS world, we still have issues with pooled database connections and ensuring that we get the correct identifying information to track back to a particular user. It’s likely this will be an issue here as well unless some way of compensating for this is built into the application.
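
One way an application might compensate is to refuse to run any data operation without knowing the real end user and to record that identity next to the shared account that the connection pool actually logs in with. The sketch below assumes exactly that; the class, its fields, and the stand-in `execute` callable are hypothetical, and a real deployment would send the record to a monitoring system rather than keep it in a list.

```python
import datetime
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class AuditedStore:
    """Wraps a data-access callable so every call records the real end user."""

    service_account: str  # the shared identity the pooled connection uses
    execute: Callable[[str, Dict[str, Any]], Any]  # stand-in for the real driver call
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def run(self, end_user: str, operation: str, params: Dict[str, Any]) -> Any:
        if not end_user:
            raise ValueError("end_user is required for every operation")
        self.audit_log.append({
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "db_user": self.service_account,
            "end_user": end_user,
            "operation": operation,
            # Record parameter names only, so the log itself holds no sensitive values.
            "param_keys": sorted(params),
        })
        return self.execute(operation, params)


if __name__ == "__main__":
    store = AuditedStore("app_pool_user", lambda op, params: f"ran {op}")
    print(store.run("alice", "find_orders", {"customer_id": 42}))
    print(store.audit_log[0]["db_user"], store.audit_log[0]["end_user"])
```

Without something like this, all the data store ever sees is the pooled account, and tracking an access back to a particular user becomes guesswork.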

For relational databases, the ability to discover sensitive data that already exists, so that you know what to protect, is provided by vendors such as IBM. There is a known gap for this with NoSQL and Hadoop. It will take time for this to evolve, but corporations need to start a process to compensate for this until commercially available tools come to market.
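
As a stopgap, a simple home-grown scan can at least flag fields that look sensitive. The sketch below walks a nested document and matches string values against two illustrative patterns. The patterns, the names, and the sample document are my own assumptions; pattern matching like this produces both false positives and misses, so treat it as a starting point and not as a substitute for a real discovery tool.

```python
import re
from typing import Any, Iterator, Tuple

# Illustrative patterns only; a real scan would need many more, plus validation.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_sensitive(doc: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Walk a nested document and yield (field_path, pattern_name) for each match."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from find_sensitive(value, f"{path}.{key}" if path else str(key))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from find_sensitive(value, f"{path}[{index}]")
    elif isinstance(doc, str):
        for name, pattern in PATTERNS.items():
            if pattern.search(doc):
                yield (path, name)


if __name__ == "__main__":
    sample = {
        "user": {"note": "SSN 123-45-6789 on file"},
        "contacts": ["pat@example.com"],
    }
    print(list(find_sensitive(sample)))
    # prints: [('user.note', 'us_ssn'), ('contacts[0]', 'email')]
```

Because these stores have no fixed schema, the scan has to look at values and not only at field names, and it has to be rerun as new kinds of documents arrive.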

What do you think?

The trend I see with our clients is that more and more organizations will be using Hadoop-based systems and NoSQL systems to store sensitive data. Whether sensitive data is stored there accidentally or on purpose is almost beside the point, but you can almost bet that a lot of sensitive information is accidentally being sucked into the Hadoop cluster. What is your organization doing about it? I’m really interested in hearing what’s happening out there in the ‘real world’.

  • Are the various silos of your organization talking to each other about these new systems and what usage of them entails from a strategic business growth and information security perspective?
  • How does ‘cloud’ affect your IT organization and people that run the DBMSs? One can go to Amazon today and stand up an instance without even talking to the IT or security organizations. Do you have systems in place to facilitate communication and alignment?
  • Are your IT and Security organizations more closely aligning with the business in this changing world, or are they in danger of being bypassed in this equation because resources have “evolved” to the point of tremendous automation for some of the more complicated tasks?

Let me know. I would love to ‘evolve’ my perspective by hearing from you.

Kathryn Zeidenstein

Technology Evangelist and Community Advocate, IBM Security Guardium

Kathryn Zeidenstein is a technology evangelist and community advocate for IBM Security Guardium data protection solutions, based out of the Silicon Valley Lab in San Jose, California. Responsible for producing content to build skills and raise awareness for Guardium technologies, she has published several articles and presented at many conferences. She also runs the Guardium Virtual User Group and is responsible for community building for Guardium.