Thursday, 10 December 2009

Pyrrho and the Cloud

A number of writers, considering the poor support for transactions and locking, have concluded that cloud storage might not be suitable as a data store for most line-of-business applications (Chantry, 2009; Abadi, 2009; Wei et al., 2009). Brantner et al. (2009), reviewing the support for databases in the cloud, note that although some services are quite powerful, they might not directly satisfy the requirements for a given task. In particular, storage providers usually give only relaxed consistency guarantees, and storage services are “orders of magnitude” slower than locally attached disks. Abadi (2009) discusses the limitations of the “simple databases” offered by cloud services such as Amazon and Yahoo, noting that they simply relax the ACID requirements for transactions and do not guarantee consistency.
The different cloud service providers differ most of all in their choice of implementation language. Microsoft’s Windows Azure platform, based on .NET and its language set, notably C#, provides three sorts of storage: Blobs, Tables, and Queues, which are accessible either from Web Roles (running under IIS) or from Worker Roles (running under Azure). Azure hosted services can also be accessed using WCF and REST, but so far I haven’t been able to access Azure Storage by this route. Distressingly, the web pages generated to access Azure storage using the November 2009 samples seem to have a lot of hidden authorisation machinery, and I haven’t been able to build my own client applications to consume Azure Storage data. It is all still clearly under development, and is only due for release in 2010.
Still, I am sure that before too long we will be able to use WCF Streaming from Azure storage, so I am assuming this in the following design ideas.
Many authors draw attention to the CAP theorem, whereby a shared-data system can choose at most two of the three properties: consistency, availability, and tolerance to partitions. DBMS designers usually insist on consistency at all costs. For this reason, we will end up with a solution that retains a transaction master server for each database, while taking advantage of the cloud for data replication and transaction routing. The paper by Wei et al. (2009) presents essentially the same solution as the one proposed here.

Pyrrho’s Binary Storage Architecture

Pyrrho operates a layered architecture described in the documentation using the diagram shown. In this diagram the bottom two layers represent durable storage (e.g. file storage on the hard disk). It is natural to focus on using Cloud storage as an alternative form of durable storage, and the concept of a logical file consisting of binary segments, all but the last of which will never require updating, makes Pyrrho particularly attractive for such a design.

In fact, the original design of Pyrrho envisaged completely immutable file data, where each committed transaction would simply append new durable data to the file.
The current physical design of these two layers is not, however, ideally suited to use with the cloud. In the first place, there is a tamper-proofing end-of-file marker which is overwritten by the next transaction commit. Secondly, the segment size used on the hard disk is measured in gigabytes, which is too large for current internet implementations. Finally, there is exactly one disk write for each transaction commit.
Taking this last point first, all of Pyrrho’s competitors, when finally committing a transaction to disk storage, write to many places in the disk file, so that multiple disk writes are required. Measurements of disk traffic show that under heavy transaction loading, Pyrrho uses fewer disk writes than commercial DBMSs by a factor of around 70. Nevertheless, the default transaction mode is that each SQL statement autocommits, so that there is approximately one disk update for each INSERT/UPDATE/DELETE statement. If cloud storage is used, there will be one server round trip for each storage change: this strongly motivates the use of explicit BEGIN TRANSACTION and COMMIT if many changes are to be made, as this groups associated modifications into a single batch (as sketched below). An obvious first step towards implementation is to modify Pyrrho’s autocommit policy so that by default, transactions are committed when a connection is closed rather than at the completion of each statement. This will have the effect of appending to cloud storage in larger chunks, and seems to point to a natural solution for the segment size issue mentioned in the second point.
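As a minimal sketch of the batching point, assuming Pyrrho’s client library exposes the standard ADO.NET interfaces (an assumption here, not something established above; the table and column names are also made up), grouping many changes inside one explicit transaction means a single commit, and hence a single append to storage:

// Sketch only: explicit transaction instead of per-statement autocommit.
using System.Data;

class BatchExample
{
    public static void InsertMany(IDbConnection conn)
    {
        conn.Open();
        IDbTransaction tr = conn.BeginTransaction();   // explicit BEGIN TRANSACTION
        try
        {
            for (int i = 0; i < 1000; i++)
            {
                IDbCommand cmd = conn.CreateCommand();
                cmd.Transaction = tr;
                cmd.CommandText = "insert into Orders(Id,Qty) values (" + i + ",1)";
                cmd.ExecuteNonQuery();                 // no durable write yet
            }
            tr.Commit();                               // one commit: one append to storage
        }
        catch
        {
            tr.Rollback();
            throw;
        }
        finally { conn.Close(); }
    }
}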
For the segment size of cloud storage, there is obviously a balance to be struck between the practical size of a binary transmission and the number of connection requests managed by the network. Assuming an efficient streaming implementation, which we will come to in a moment, the natural thing to do would be to store the data from each transaction in a new segment to be committed to cloud storage.
This leaves the question of the end-of-file marker, which we no longer need. Its purpose was to ensure that the contents of the entire data file had not been changed since they were committed. Instead, we simply store the same marker data in the cloud storage along with the associated segment: the marker can be used to check the entire data file up to that point.
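A hedged sketch of these two ideas together (the class names are invented here, and the hash-chaining scheme is an assumption about how the marker might be carried over, not a description of Pyrrho’s actual code): each commit produces a new, never-rewritten segment, and the tamper-proofing marker becomes a checksum stored alongside the segment, chained from segment to segment so that the latest marker checks the whole logical file.

// Hypothetical sketch: one immutable segment per committed transaction.
using System;
using System.Security.Cryptography;

class Segment
{
    public long Start;       // position of this segment in the logical data file
    public byte[] Data;      // the Physicals written by one transaction
    public byte[] Marker;    // hash covering the logical file up to the end of this segment
}

class SegmentWriter
{
    byte[] runningHash = new byte[0];

    public Segment Append(long start, byte[] data)
    {
        using (SHA1 sha = SHA1.Create())
        {
            byte[] buf = new byte[runningHash.Length + data.Length];
            Buffer.BlockCopy(runningHash, 0, buf, 0, runningHash.Length);
            Buffer.BlockCopy(data, 0, buf, runningHash.Length, data.Length);
            runningHash = sha.ComputeHash(buf);        // previous marker + new data
        }
        return new Segment { Start = start, Data = data, Marker = runningHash };
    }
}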

Following this step, the Pyrrho DBMS server can operate using local (disk) storage and remote (cloud) storage. So far, this model could be conveniently hosted by a simple REST/bytestream service such as Amazon S3. Older parts of the database can be replicated without loss of consistency. We consider next how we can ensure that the data is up-to-date.
Transaction Serialisation
Pyrrho’s transaction design is actually very well suited to sharing cloud storage between database engines, because its transaction serialisation handling is performed at the level of the binary records (“Physical”s) that are written to the durable storage when a transaction commits. Pyrrho uses optimistic locking, which is also very well suited to use with cloud storage, since it would be a bad idea to have an exclusive lock on the Cloud service.
For a proper understanding of the mechanism, it is necessary first to mention that the tree and queue data structures managed in the Logical Database layer have the property that all their elements are immutable: any change results in a new head structure, and following links through new components of the data structure usually leads to components of the original structure.
So, at the start of a transaction, new headers for the database(s) involved are created, which initially lead to the current committed state: if a few transactions are in progress, there will be several versions of the database(s), mostly coinciding (on unchanged portions). The changes associated with the transaction each have a Physical which will be written to durable storage along with a descriptor of the transaction itself.
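The effect can be illustrated with a toy immutable tree (illustration only, not Pyrrho’s classes): an insertion builds a new head and new nodes only along the path of the change, while every other node is shared with the previous version, so several transaction “versions” of a database can coexist cheaply.

// Toy persistent binary search tree: Add returns a new head, rebuilding only
// the path to the change and sharing all other nodes with the old version.
class Node
{
    public readonly int Key;
    public readonly Node Left, Right;
    public Node(int key, Node left, Node right) { Key = key; Left = left; Right = right; }

    public static Node Add(Node t, int key)
    {
        if (t == null) return new Node(key, null, null);
        if (key < t.Key) return new Node(t.Key, Add(t.Left, key), t.Right);  // right subtree shared
        if (key > t.Key) return new Node(t.Key, t.Left, Add(t.Right, key));  // left subtree shared
        return t;                                                            // no change: everything shared
    }
}
// var committed = Node.Add(Node.Add(null, 2), 5);   -- the current committed state
// var inTransaction = Node.Add(committed, 3);       -- a new head; committed is untouched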
Whether such ongoing transactions conflict can be determined by examining just these Physicals. For example, two updates to the same row of a table will conflict, an insert in a table will conflict with dropping the table, etc. This means that transaction serialisation can be determined at the Physical Records level of the database.
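A sketch of the kind of test involved (the Physical subtypes and fields below are simplified stand-ins for Pyrrho’s actual record types): two proposed Physicals conflict if, for example, they update the same row, or one inserts into a table that the other drops, and this can be decided from the defining positions they mention.

// Simplified stand-ins for Pyrrho's Physical records (illustration only).
abstract class Physical { }
class Update : Physical { public long RowDefPos; }      // change to an existing row
class Record : Physical { public long TableDefPos; }    // insert of a new row into a table
class Drop : Physical { public long ObjectDefPos; }     // dropping a table, column, ...

class ConflictCheck
{
    public static bool Conflicts(Physical a, Physical b)
    {
        return OneWay(a, b) || OneWay(b, a);
    }

    static bool OneWay(Physical a, Physical b)
    {
        Update u1 = a as Update, u2 = b as Update;
        if (u1 != null && u2 != null) return u1.RowDefPos == u2.RowDefPos;    // same row updated twice
        Record r = a as Record; Drop d = b as Drop;
        if (r != null && d != null) return r.TableDefPos == d.ObjectDefPos;   // insert vs dropping the table
        return false;
    }
}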
There are two subtleties to the resulting design, which apply equally to local and remote storage. The first relates to relocation of Physicals, the second to transaction conflicts where reads conflict with changes.
The relocation issue arises because Pyrrho uses the (immutable) defining addresses of data in Physical records, so that, for example, a Column will refer to its parent Table by giving its defining address rather than its (mutable) name. For this reason a group of Physicals in a transaction are very likely to refer to each other using their positions in the data file. While transactions are in progress, nobody can tell which transaction will commit first, so the positions of Physicals in the data file are generally not known until the Commit succeeds. The physical layer of the DBMS therefore relocates the transaction to its final place in the data file.
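A sketch of the relocation step (field names hypothetical): while a transaction is being built, its Physicals refer to one another by provisional positions; once the commit is accepted, the server knows the real end of the data file and rewrites those references through a simple mapping.

// Illustration only: provisional positions are mapped to final file positions at commit.
using System.Collections.Generic;

class RelocatablePhysical
{
    public long ProvisionalPos;                          // position assumed during the transaction
    public List<long> References = new List<long>();     // positions of Physicals it refers to

    public void Relocate(Dictionary<long, long> map)
    {
        ProvisionalPos = map[ProvisionalPos];
        for (int i = 0; i < References.Count; i++)
        {
            long final;
            // only references to Physicals in this same transaction are remapped;
            // references to already-committed data keep their fixed positions
            if (map.TryGetValue(References[i], out final)) References[i] = final;
        }
    }
}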
The read-conflict issue is rather pedantic. Pyrrho uses a very strict concept of transaction isolation, and notes all data that a transaction reads. If any of this data changes before the transaction commits, Pyrrho will roll back the transaction. For example, if a transaction computes a value to insert in the database, it should be invalidated by changes to any values used in the computation. Pyrrho calls these ReadConstraints.
Similarly, referential constraints (called IndexConstraints in the code) must also be checked again at Commit time, in case another transaction has inserted a record with a conflicting unique key, or deleted a record that our transaction intends to refer to.
For this reason, the data used in the Commit process consists of the list of proposed Physicals together with sets of ReadConstraints and IndexConstraints. For a remote database, it is this collection of data, and not the final binary data, that must be sent to the remote master server for processing.
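So the unit shipped to the transaction master looks roughly like the following (the type and field names are invented here for illustration; in practice this data is serialised over the wire rather than passed as objects):

// Hypothetical shape of a commit request to the transaction master.
using System.Collections.Generic;

class CommitRequest
{
    public string Database;
    public long ReadFrom;                                        // end of file when the transaction began
    public List<byte[]> Proposed = new List<byte[]>();           // serialised Physicals to append
    public List<long> ReadConstraints = new List<long>();        // defining positions of data read
    public List<long> IndexConstraints = new List<long>();       // keys/uniqueness checks to repeat
}
// The master accepts the request only if nothing committed since ReadFrom
// conflicts with Proposed, ReadConstraints or IndexConstraints; otherwise the
// transaction is rolled back and the client can retry.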

Unfortunately, we are left with the situation that serialisation must ultimately be decided by a single DBMS server. The most we can use the cloud for in transaction management is as an enterprise service bus, routing transaction streams to the transaction master. To ensure that we have read the most up-to-date version, we simply call commit: if the data is not up to date, our transaction will fail, and re-running the query will get the new value.
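In client terms this amounts to a simple retry loop (a sketch only, again assuming an ADO.NET-style connection rather than any specific Pyrrho API): read, then commit; a conflict on commit means the data has changed, so run the query again.

// Sketch: validate that what we read is current by committing; retry on conflict.
using System.Data;

class ReadLatestExample
{
    public static object ReadLatest(IDbConnection conn, string query)
    {
        for (; ; )
        {
            IDbTransaction tr = conn.BeginTransaction();
            IDbCommand cmd = conn.CreateCommand();
            cmd.Transaction = tr;
            cmd.CommandText = query;
            object value = cmd.ExecuteScalar();
            try
            {
                tr.Commit();          // succeeds only if what we read is still current
                return value;
            }
            catch
            {
                // transaction conflict: another server committed first, so retry
            }
        }
    }
}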
The Data Stream
The upper layers of the DBMS still need to construct and maintain the logical database for query processing, joins, etc., and for this purpose, as with local databases, the complete data stream of Physicals must be read at some stage. For a large database this is a time-consuming process (roughly 1 minute per 200MB), so it should only be carried out when the DBMS does a cold start. Thus the DBMS itself is too stateful to be placed in the cloud.
On the other hand, once the DBMS is up to date with all of the Physicals that have been committed to the local or remote data file, it is a relatively easy matter to check for transactions committed by other servers, at the start of a transaction and again at transaction commit. The protocol for obtaining this Physical data must be able to deal with both small and large volumes of data, and take into account that Physicals can vary greatly in size. A binary large object (blob), for example, is committed in a single Physical.
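The catch-up step can be pictured like this (names hypothetical; the point is only that the server asks storage for whatever lies between the position it has already processed and the current end of the logical file):

// Illustration only: bring the logical database up to date with Physicals
// committed by other servers since this server last looked at the data file.
interface IPhysicalSource
{
    long Length { get; }                   // current end of the logical data file
    byte[] Read(long from, long to);       // the bytes (Physicals) in [from, to)
}

class Synchroniser
{
    public long ProcessedTo;               // how far the logical database has been built

    public void CatchUp(IPhysicalSource source)
    {
        long end = source.Length;
        if (end > ProcessedTo)
        {
            byte[] newPhysicals = source.Read(ProcessedTo, end);
            // ... parse newPhysicals and apply them to the logical database ...
            ProcessedTo = end;
        }
    }
}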
Pyrrho already has excellent mechanisms for serialising such diverse binary data, in the AsyncStream class. This streaming mechanism can be used directly to deal with the Physical data from cloud storage. It is important to be able to use streamed responses from the cloud service, though, as it would be unacceptably slow to make repeated connections to download such data piecemeal. The AsyncStream class is also a useful mechanism for uploading to the cloud service the data described in the last section. It also supports a mechanism for reporting any exceptions thrown by the cloud service, which would most likely be transaction conflicts.
Nevertheless, a local initialisation device (such as an SD card) containing a copy of a database up to a certain date would seem an obvious way to save a lot of time, particularly if the remote database is often used.
Distributed Databases
Pyrrho maintains a single logical file for each database, which as we have seen might comprise a number of segments. This has the advantage that ordinary operating system utilities can be used for backing up, moving or renaming databases.
Pyrrho supports multi-database connections, where one DBMS can combine data from several databases in response to a query. A two-phase commit protocol must be used to ensure that all of the participating databases succeed in committing the data, or all roll it back. Pyrrho does not support multi-database relationships: it is not possible to have a foreign key relationship to a table in another database, since references to another database would be vulnerable to the sorts of file operations just mentioned. Whenever a multi-database connection is used, Pyrrho checks that all distributed transaction commits recorded in the databases are reflected in all of the databases in the connection.
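The coordination can be sketched as follows (the interfaces are invented for illustration and this is not Pyrrho’s actual protocol code): each participating database first prepares, and only if every participant votes yes does the coordinator tell them all to commit.

// Two-phase commit sketch for a multi-database connection (illustration only).
using System.Collections.Generic;

interface IParticipant
{
    bool Prepare(string txnId);    // phase 1: record the intent durably, vote yes or no
    void Commit(string txnId);     // phase 2: make the changes permanent
    void Rollback(string txnId);   // phase 2: undo the prepared work
}

class Coordinator
{
    public static bool TwoPhaseCommit(string txnId, IList<IParticipant> databases)
    {
        foreach (IParticipant db in databases)
            if (!db.Prepare(txnId))
            {
                foreach (IParticipant d in databases) d.Rollback(txnId);
                return false;      // someone voted no: everybody rolls back
            }
        foreach (IParticipant db in databases) db.Commit(txnId);
        return true;               // everyone prepared: everybody commits
    }
}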
In the simplest cases the DBMS has at its disposal the logical database structure for all of the databases involved in the connection, either because the data comes from local durable files under its control, or because it has been synchronised from another DBMS or the cloud service. In the latter case, there may be a local file or device that contains a copy of the data for initialisation purposes. Another DBMS is involved in cluster computing scenarios, where only one member of the cluster is given write access to the data file for each database, but all members maintain all the logical databases. This mechanism can be used effectively to partition data among servers, so that transaction masters for each partition of the data collaborate to ensure the ACID properties of the transaction.
Remote Databases
With the introduction of cloud storage, and with the increasing use of smaller devices as servers, it becomes important to extend the storage modes available for remote data, to include the case where, although the local server needs to know the schema for all databases in a connection, it does not have row or index information for remote databases. The logical databases all need to exist somewhere, of course: in this case the remote database needs to be managed by a remote DBMS, not just by the cloud service. During query processing, the local server constructs a query for the remote DBMS which includes join, projection and selection aspects to minimise the amount of traffic required, and it is only when there is local data to be combined with the resulting rowset(s) that the local DBMS actually needs to do much work.
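For example, if a query joins a local table to a remote one, the local server would push selection and projection down to the remote DBMS and only join the returned rows locally. A rough sketch of building the pushed-down query (string handling only; the names are made up):

// Illustration: push selection and projection to the remote DBMS so that only
// the rows and columns actually needed cross the network.
class PushDownExample
{
    public static string RemoteQuery(string remoteTable, string[] neededColumns, string selection)
    {
        string sql = "select " + string.Join(",", neededColumns) + " from " + remoteTable;
        if (!string.IsNullOrEmpty(selection))
            sql += " where " + selection;
        return sql;   // e.g. select Id,Price from Orders where Price>100
    }
}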

It will be seen that this is a rather different matter from simply using cloud storage. Which approach is preferable will depend on whether the memory of the local device is capable of holding the whole of the remote logical database, and on how frequently consultation of the remote database is really required.
As a result of these developments, in the next edition of the Pyrrho DBMS, any server can have a configuration file which specifies which databases are available from specific servers or cloud storage (and how to access them), and the cloud service needs to refer all transactions to their transaction masters. In addition, the configuration file should specify, for remote databases, whether the local database maintains the full logical database, and if so, whether there is a local initialisation file.
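To make this concrete, such a configuration file might look something like the following (the format, element names and values are purely illustrative and are not part of any released Pyrrho version):

<PyrrhoConfiguration>
  <!-- illustrative only: where each database lives and how to reach it -->
  <Database name="Sales" role="TransactionMaster" storage="local" file="Sales"/>
  <Database name="Stock" role="Replica" storage="cloud"
            url="https://example.blob.core.windows.net/pyrrho/Stock"
            fullLogicalDatabase="true" initialisationFile="Stock.seed"/>
  <Database name="Accounts" role="Remote" server="accounts.example.com:5433"
            fullLogicalDatabase="false"/>
</PyrrhoConfiguration>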
“Eventual Consistency”
It seems reasonable for read-only transactions to appeal to the eventual consistency of cloud computing services. A client can always COMMIT a read-only transaction if they wish to be warned about a recent update that might have invalidated their work; but this is of limited value since a further update can always occur before they commit their next transaction. It is best for the reasons given above to transact reads with their resulting updates. For simplicity we have stayed with a single transaction master per data item, although a majority voting (Byzantine) system would be theoretically possible.
Conclusion
This paper has outlined a manifesto for a future version of Pyrrho DBMS supporting cloud storage and the use of small devices as local DBMS servers. We agree with other published studies that cloud storage on its own would violate several key database principles and so remains unsuitable for transactional data management. It is pleasing that this approach is corroborated by the work reported in Wei et al. (2009).
References
Abadi, Daniel (2009): Data Management in the Cloud: Limitations and Opportunities, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 32, pp. 3-12.
Brantner, Matthias; Florescu, Daniela; Graf, David; Kossmann, Donald; Kraska, Tim (2009): Building a Database in the Cloud, Technical Report, ETH Zurich, www.dbiz.ethz.ch
Chantry, Darryl (2009): Mapping Applications to the Cloud, The Architecture Journal, 19, pp. 2-9 (Microsoft, www.architecturejournal.net)
Crowe, Malcolm (2007): The Pyrrho Database Management System, Computing and Information Systems Technical Reports No. 38 (University of the West of Scotland), www.pyrrhodb.com
Wei, Zhou; Pierre, Guillaume; Chi, Chi-Hung (2009): Scalable Transactions for Web Applications in the Cloud, Proc. 15th International Euro-Par Conference on Parallel Processing, LNCS 5704, pp. 442-453 (Springer)