The Best of Both from SQL and LDAP
- Michel Prompt
- March 7, 2013
- Founder's Corner
- 8 MIN READ
Last week, we discussed Ian Glazer’s video on killing identity in order to save it. And this week I want to follow up on some of the questions I raised at the end of my post:
- What makes hierarchies, along with graphs, so important (think contextual relationships).
- How SQL databases can capture information without redundancy, but still need a materialization layer (think cache) to deliver contextual views at the speeds required for security and context-aware applications (think mobile).
- Why virtualizing both identity and context is key to breathing new life into IdM, while respecting your current identity investments (think Radiant).
These can be regrouped into two topics: The first is identity representation and how we should record it, which is a problem of structure and storage. What is the ideal directory structure and storage for identity, and how could we better integrate this lifecycle with the larger field of data management? Or, as Ian stresses, how could IdM be a part of IT, instead of apart from it? The second is architecture and real-life deployment, or how in practice we could and should manage the identity lifecycle in a world of silos. I’d argue that the current fragmented identity infrastructure landscape requires a more rational solution, one that would involve what we at Radiant call a federated identity service, based around a virtualization layer for identity and context.
Today, we’ll be considering the structure and storage side of this equation: how to build the most modern, flexible, and future-oriented way of representing, storing, and sharing essential identity information.
The Rise (and Fall) of the Comma: On SQL, LDAP, and Graphs
As I mentioned in my previous post, although Ian wants to “kill the comma”—and don’t get me wrong, I agree with most of his critiques of the system—I believe there’s more life left in LDAP. To make my case, let’s take a trip back in time, because those who do not know their history are doomed to repeat yesterday’s mistakes—and miss yesterday’s lessons.
Now, we all know that in terms of programming languages, there’s been much evolution since the sixties/seventies, including C, C++, logic programming, functional programming, Java, aspect-oriented programming, Erlang, Scala, and many more, with each language offering its own strengths, specializations, and limitations. But we have not seen the same exuberant dynamism in the world of databases. In the earliest days, hierarchical systems emerged, then some forms of the network database, but from the seventies on, SQL has dominated the landscape. Sure, there were some attempts at object-oriented databases in the late eighties and early nineties, but these were specialized niche products that never cracked the mainstream.
Given this history, we know that around the year 2000 or so, SQL remained the workhorse, the predominant database system in enterprises everywhere. So why did we see the emergence of LDAP, itself a simplification of X.500, a kind of “object-oriented” hierarchical database? Why did a bunch of very smart people sitting on a standards committee commissioned by many organizations—including the largest telcos in the world—bypass SQL standards to implement a look-up mechanism based on a hard-coded hierarchy? Why did they create a distributed hierarchical database for email addresses and other information about people (including pictures and preferences—hello, Facebook!), which later evolved into a repository for PKI and a source of authentication for internal networks?
Could it be that the hierarchy had some relevance that they weren’t getting from the relational model alone? I believe the much-maligned comma, or rather the DN (and don’t you love the weird jargon from the X.500 world?), still has something to tell us, even now, as identity environments grow increasingly fragmented and demand goes mobile and reaches into the cloud. So let’s look at why, where, and how LDAP is still very relevant for today’s identity infrastructures. But first we need a detour through SQL.
A Thought Experiment: Trees Inside of Tables
Imagine that we could implement a directory model—a complete LDAP schema and directory tree—in a set of SQL tables following best data management practices (there’s a quick sketch of what that might look like after this list). Such an implementation would offer several major benefits:
- The information kept in the system would be extremely flexible. You could modify and generate as many directory trees as you need from your LDAP schema. You could also generate all the graphs you need from these schemas, including graphs containing loops, which are no longer strictly trees.
- The information in your system would be non-redundant and easy to maintain, following the normalization rules of the relational model. This is a key advantage of the relational system and one of the biggest weaknesses of the hierarchical and graph/network models.
- Your identity data would be managed like all other essential business data. Basing your identity infrastructure on SQL technology also means that your IdM is now “a part” of classic data management, and your DB team is working in their most comfortable environment.
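To make the thought experiment a little more concrete, here is a minimal sketch in Python with SQLite. The schema is purely hypothetical: an entry table whose parent link encodes the directory tree, and an attribute table for the multi-valued LDAP attributes, plus a small function that rebuilds a DN by walking the parent links. The table names, columns, and sample data are all illustrative, not a prescription for how such a store should be designed.

```python
import sqlite3

# Hypothetical normalized layout: one row per directory entry, with a
# parent link that encodes the tree, and a separate table for the
# (possibly multi-valued) LDAP attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES entry(id),  -- NULL for the root
    rdn       TEXT NOT NULL                  -- e.g. 'ou=people' or 'cn=Ada Lovelace'
);
CREATE TABLE attribute (
    entry_id  INTEGER REFERENCES entry(id),
    name      TEXT NOT NULL,
    value     TEXT NOT NULL
);
""")

# A tiny tree: dc=example -> ou=people -> cn=Ada Lovelace
conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", [
    (1, None, "dc=example"),
    (2, 1,    "ou=people"),
    (3, 2,    "cn=Ada Lovelace"),
])
conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)", [
    (3, "mail", "ada@example.com"),
    (3, "telephoneNumber", "+1 555 0100"),
])

def dn(entry_id):
    """Rebuild a DN by walking the parent links up to the root."""
    parts = []
    while entry_id is not None:
        rdn, entry_id = conn.execute(
            "SELECT rdn, parent_id FROM entry WHERE id = ?", (entry_id,)
        ).fetchone()
        parts.append(rdn)
    return ",".join(parts)

print(dn(3))  # cn=Ada Lovelace,ou=people,dc=example

# Attribute value-based look-up against the same tables:
mail = conn.execute(
    "SELECT value FROM attribute WHERE entry_id = 3 AND name = 'mail'"
).fetchone()[0]
print(mail)   # ada@example.com
```

Nothing here is redundant: the tree shape lives only in the parent links, and each attribute value is stored exactly once, which is precisely the normalization benefit described above.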
Given these known advantages of SQL, why did the committee described above decide to implement the X.500 data model (and hence LDAP) as a specialized hierarchical “pseudo object-oriented” database? Because of one of the prime advantages of the hierarchical model: fast look-up (or “queries”) and fast attribute-value-based search. (And here’s where the link with Hadoop, HDFS, and other big data “NoSQL” implementations will become apparent for the advanced reader.)
The Dark Secret of SQL: It Takes Forever to Navigate Complex Hierarchies and Graphs
You see, the best system to record and update data supporting the so-called ACID properties is also terrible at answering very complex queries. The beauty of the relational model is that it can capture any relationship (whether in graphs or trees) and organize data so it’s captured only once and the updates are consistent, with any change in the real world reflected faithfully in the database. Once recorded in SQL, any graph, hierarchy, or relation between entities and data can be re-generated. As a result, there is no question that the relational system cannot answer. But there is also an important hidden cost: while you always get an answer to your SQL query, receiving that answer can take an incredibly long time. Isn’t it ironic that the world’s leading provider of SQL databases is named “Oracle”? Because, just as at Delphi, when it comes to graphs, hierarchies, or rich contextual data, your questions will always be answered—at some stage, when the fates are ready to tell you.
Pictures Tell the Story: Graph/Trees as a Recursive Relation
Still not convinced? Let’s take a look at how we might represent the “graph” of a person and all his or her friends in the relational world. In the following diagrams, we can see the flow of the representation from the most abstract entity level into a complete instantiation at the network or record level.
At the entity (or model) level, we see a recursive relationship in a relational store, where the recursive “friend of” link loops on the person structure. This very abstract and elegant model can generate both graphs and trees, proving my point that “a relational database can be the mother of all trees and graphs.” One way it does that is by representing a tree or a graph as a recursive relationship, where the entity loops on itself.
Implemented at the table level, we see that each row is a person. We can use this table to anchor and encode all the relationships between people. (Of course, we didn’t expand all the duplicate rows for the purpose of illustration.)
At the relational level, when you navigate those rows from picture 2, we see how we can generate an infinite number of trees and graphs from a relational table. This is a tremendous capability—and one that we could and should leverage in the IdM of tomorrow. But, as I mentioned, it comes at quite a cost. While the relational model can represent any tree or graph you want, every time you have to navigate those links, you are joining across nodes—and that’s a very expensive operation. So we can represent any tree by a relationship, but while there’s no problem recording that tree, navigating through the tree will be very slow. That’s just a fact of SQL: you can record and update everything in a really robust system, but queries that navigate those structures take forever. In fact, when it comes to context and graph navigation, “SQL” stands for “Slow Query Language.” (Luckily, there’s a way around the slow-down—stay tuned for my next post.)
And finally, at the network level, we see that in a complex table with lots of people and links, you’ll get to an image similar to this one, which I borrowed from Mark Dixon’s excellent response to Ian’s video. So you can imagine the performance in a complex network such as Facebook.
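To see that recursive “friend of” relation in concrete form, here is a minimal sketch in Python with SQLite: a person table, a self-referencing friend_of table, and a recursive query that walks the links. Every level of depth is effectively one more self-join over the same table, which is exactly where the cost piles up as the network grows. The table and column names, and the sample data, are made up for the illustration.

```python
import sqlite3

# Hypothetical self-referencing relation: one row per person, and a
# "friend_of" table whose two columns both point back at person.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE friend_of (
    person_id INTEGER REFERENCES person(id),
    friend_id INTEGER REFERENCES person(id)
);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edsger"), (4, "Barbara")])
conn.executemany("INSERT INTO friend_of VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4)])

# Walking the graph means one more self-join per level of depth; the
# recursive query below simply spells out that repeated join for us.
rows = conn.execute("""
WITH RECURSIVE reachable(id, depth) AS (
    SELECT friend_id, 1 FROM friend_of WHERE person_id = 1
    UNION
    SELECT f.friend_id, r.depth + 1
    FROM friend_of f JOIN reachable r ON f.person_id = r.id
)
SELECT p.name, r.depth FROM reachable r JOIN person p ON p.id = r.id
ORDER BY r.depth
""").fetchall()
print(rows)  # [('Grace', 1), ('Edsger', 2), ('Barbara', 3)]
```

With four people and three links this is instant; with millions of people, dense links, and deep traversals, each extra hop is another pass over the relationship table, and that is the slow-down the pictures are getting at.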
Everything Old is New Again: What About the Graph Database?
Okay, so we see how SQL can deliver all the trees and graphs we need—but at a snail’s pace. So should we try something new? Ian raises the issue of a specialized hierarchical or graph database that would be optimized for faster navigation/search. Let’s say we kill hierarchies and replace them with graphs, for a full network, which is a lot richer. Now we have fast access, but it comes with two huge problems:
- Trading one specialized storage for another: Implementing a new array of graph databases means we still have a specialized storage that sits apart from the SQL mainstream, which is the source of reference for most of your identity data anyway. Basically, we just trade the LDAP ghetto for a graph ghetto.
- Reconciling the system across multiple nodes: A graph database does not follow the normalization rules of the relational model, which make it simple to maintain a single version of the truth. While graph databases are very fast to navigate, it doesn’t take advanced math to see that information updates are hell in a network with umpteen nodes like the one we see in picture 4, where duplicate users may play many different roles. Imagine trying to maintain consistency when a person’s phone number changes and that update must be made at all n aliases across the network (see the quick sketch after this list). The clean model of SQL, where each entity is only recorded once, starts to look pretty good, right?
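A toy sketch makes the reconciliation problem tangible. The structures and names below are made up for the example: in a denormalized graph the same person appears as n alias nodes, each carrying its own copy of an attribute, so one phone-number change means n writes (and any missed write leaves the network contradicting itself); in the normalized model it is a single update.

```python
# Denormalized "graph": every alias node carries its own copy of the attribute.
graph_nodes = {
    "alias-1": {"person": "Ada", "phone": "+1 555 0100"},
    "alias-2": {"person": "Ada", "phone": "+1 555 0100"},
    "alias-3": {"person": "Ada", "phone": "+1 555 0100"},
}
for node in graph_nodes.values():       # n writes; miss one and the
    if node["person"] == "Ada":         # network contradicts itself
        node["phone"] = "+1 555 0199"

# Normalized model: the attribute lives in exactly one place.
persons = {"Ada": {"phone": "+1 555 0100"}}
persons["Ada"]["phone"] = "+1 555 0199"  # one write, one version of the truth
```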
So do we rely on the old workhorse SQL, a flexible and well-understood “update and record” engine that’s mainstream for most of your applications? One that contains much of the information we need about identities and their diverse contextual attributes, but which is slow at tying them together in a contextual way? Or do we invest in a new set of specialized databases supporting hierarchies and graphs that are fast at queries (just like LDAP!), but not great at write and update—and devoid of all the information that’s been captured by your mainstream applications?
In my third post in response to the Glazer video, we’ll look at the practical side of this equation: how you can make identity work, given legacy investments and growing demands. I’ll focus on the fact that our existing infrastructure already has the two players we need—SQL and LDAP—and show what other player is needed (think cache-enabled virtualization) to make them work together to deliver all the trees, graphs, and—wait for it—context we could ever need.