Introduction
SQL has been the mainstay language for building software applications for over five decades. SQL itself was based on a paper written by E.F. Codd of IBM Research in 1970. One of the main reasons for SQL to survive so long is that it has evolved with time, changed to handle newer use cases, provided a common interface (relational) which is easy to understand and model real world entities.
However in the new world data has grown in leaps and bounds and just plain old SQL had a hard time processing data across terabytes of data spread across geographical data centers and having highly concurrent access to data. This has led to the evolution of SQL to Distributed SQL which had its underpinnings in Distributed Databases.
As per the definition in Wikipedia:
A distributed SQL database is a single relational database which replicates data across multiple servers. Distributed SQL databases are strongly consistent and most support consistency across racks, data centers, and wide area networks including cloud availability zones and cloud geographic zones.
NoSQL databases came up to address the scalability challenges legacy SQL databases were facing but could not provide ACID (atomicity, consistency, isolation, durability) guarantee.
Distributed SQL databases are solving that problem by providing distributed ACID transaction capability.
Origin
Most modern distributed SQL databases have their origin and inspiration from Google’s Spanner Database that was published in a paper in 2013.
Spanner is a multiversion, globally distributed and synchronously replicated database that can span millions of machines across hundreds of data centers and handle trillions of database records. It is the first database system to distribute data on a global scale and support externally consistent distributed transactions.
Google’s Advertising backend transitioned out of MySQL to Spanner and is one of the largest internal customers for Spanner.
The paper on Spanner provides a good walk through of the design and architecture of Spanner.
Reference Database
Yugabyte is one of the new kids on the block and is challenging established players. The best part of this database is that it is an open source distributed SQL database.
From their github repository we find their definition:
YugabyteDB is a high-performance, cloud-native distributed SQL database that aims to support all PostgreSQL features. It is best to fit for cloud-native OLTP (i.e. real-time, business-critical) applications that need absolute data correctness and require at least one of the following: scalability, high tolerance to failures, or globally-distributed deployments.
Yugabyte’s transaction design is based on Google's Spanner. It supports strong consistency of writes by using cluster wide distributed ACID transactions.
It allows distributed SQL query execution by executing across multiple database nodes simultaneously. Incoming queries are accepted by any node in the cluster which then requests other nodes to process their portion of the query in a way that latency is minimized by reducing intra-node data transfer. The node that accepted the query aggregates the results from all nodes that executed the distributed SQL query and sends back the results to the client.
It automatically distributes data and indexes across multiple nodes of the cluster so that no single node becomes a bottleneck. It commits write operations synchronously across multiple nodes in the cluster and thus provides consistent replication of data.
One of the promising features is support for distributed ACID transactions, a single transaction is executed across multiple nodes by using 2 Phase Commit protocol.
Competition
Yugabyte has put up a feature comparison of the various commercial distributed SQL databases which gives a good idea of how the competition is lined up.
Conclusion
As the requirement to store, access and make data available in real time grows, Distributed SQL databases provide the hope to scale up and provide data consistency across multiple concurrent transactions. Research in this area is hot and many new players are coming up with good solutions to address current challenges.
Another key challenge is to make these databases cloud native so that commodity hardware can be used to store and process massive amounts of data.
In the next article we will look at the state of art technology for cloud based distributed SQL databases along with serverless options.