Saturday, 28 December 2013

"Database as a Service" and the CAP Theorem

DBaaS is difficult on most platforms for several reasons:

1.       Cloud providers charge by the amount of use. Databases need a lot of disk space so incur charges. If the database engine caches a lot of things in memory, then the server instance needs to run the whole time, and this also costs a lot f money. Finally, if there is a privileged single instance (such as a transaction master for a dataset) then this is also a restriction.

2.       Cloud “database” systems typically do not support ACID principles. They add timestamps etc so that eventual consistency is guaranteed at the expense of durability (there will be lost updates). Also, eventual consistency means that users in telephone contact with each other may see differences in data values in New York and Glasgow that take time to be resolved.

For these reasons Pyrrho does not claim to work with cloud providers. It requires a transaction master and uses memory at least for indexes.
The title refers to Eric Brewer's famous theorem that in a distributed system you cannot have all three of Consistency, Availability and Partition Tolerance. As with many inconvenient truths, many people have tried to pretend they have a workaround.
Assuming we want C for consistency, the P of CAP is “partition tolerance”, which means tolerating when the network is broken (partitioned into two or more fragments). This sort of partition is not the same as horizontal or vertical partitioning of databases. If a client cannot contact the transaction master, no transactions can be committed, so on part of the network the database will not be available (the A).

What Pyrrho does offer in the direction of DBaaS is distributed and (horizontally) partitioned databases. Each horizontal partition is its own transaction master (a replica of a horizontal partition will not be). The most (network) partition tolerant design is where the only distributed transactions are either read-only or for schema changes. In that case you have a good deal of (network) partition tolerance:

1.       When any (horizontal) partition comes up, it checks with its parent in the (horizontal) partitioning scheme.

2.       To view data from anywhere connect to any partition: if your query attempts to access a partition that is offline, there will be an error.

3.       For updates to a given partition, you just connect to that partition (this is not a distributed transaction).

For example, if the horizontal partitioning was by country, few users would notice disruption to network traffic between countries. Pyrrho does not prevent more complex and fragile database designs. In this way we still have global consistency and ACID, and have partial A and P .


Monday, 25 November 2013

Distributed and Partitioned DB Tutorials

Today's release of PyrrhoDB v5.0 comes with a full account of the distributed and partitioned tutorials described in August 2013 postings on this blog.

Those postings have been updated slightly to match the new version. The main difference is that reconfiguration and repartition of databases are now transactioned. You can say begin transaction before you start either process and then no changes are made to disks on any of the servers involved until the transaction is committed. Recall that Pyrrho is very fussy about transaction isolation, so that while a transaction is ongoing the connection that started the transaction is the only participant that can examine the progress being made.

The advantage of doing this is more than theoretical. Defining a partition includes specifying a set of conditions for including records in the new partition, and autocommits during this process would not be helpful.

The new version comes with a much more robust 3-phase distributed transaction protocol than was provided in previous versions, and there are slight differences.

For the purpose of explaining the internal operation of Pyrrho for distributed and partitioned data, a tutorial mode -T has been implemented that exposes all of the server-server protocols and commit steps. Sections have been added to the tutorials to explain some of what is going on. The Pyrrho manual and the SourceIntro document in the open source distribution have been updated with full details. I am happy to explain the internals to anyone who is interested and plan to add more comments to the source code.

Future developments for Pyrrho will develop these facilities a little further, offering behaviour closer to scatter-gather (Hadoop). In connection with a related project at UWS, I am also planning to provide internal support for a BSON data type. As usual, what distinguished Pyrrho from other database initiatives is that all databases are consistent and relational, support full SQL and optimistic concurrency.

Thursday, 15 August 2013

Pyrrho DBMS architecture

With v5.0 Pyrrho has enhanced internal support for distributed, partitioned and remote databases. The architectural model outlined in 2009 Pyrrho and the Cloud is still correct, and the diagram from that article still applies:

The bottom two blocks are referred to in the source code as Level 1 (datafiles). Level 2 (physical records) is where transaction serialization occurs, Level 3 is for database objects, and Level 4 is where the SQL mechanisms, rowsets, OLAP, multi-database connections etc occur. Since approximately v4.5, the strong data type mechanism operates in level 2, and this now facilitates binary server-server communications.

With v5.0 the asynchronous client-server implementation (AsyncStream) is adapted for server-server communications, and applies for communication at the different levels in the following way:
  • At Level 1 server-server communications support the notion of remote storage
  • At Level 2 we use server-server comms for remote transaction master
  • At Level 3 we support database partitioning, where the asynchronous comms enable traversal of partitioned table segments
  • At Level 4 server-server comms will support mobile clients accessing remote databases
The configuration database uses the partitioning mechanism at level 3 to implement its peer-to-peer nature, with subclasses of the LocalTransaction class for handling side effects such as data transfer when partitions are moved between servers.

The v5.0 implementation of partitioning has brought a subtle and important change to Pyrrho's internal operation. Up to now objects in data and schema records have uniformly been identified using their defining or actual addresses in the transaction log files. However, from v5.0, schema objects are always identified by places in the base database's transaction log, and a special physical record (Partition) is used to wrap and import these into the partition's log. These Partition records are created and pushed to the partition as needed whenever a communication channel is opened between them. (They should probably be transacted in the partition under the identity of whoever made the schema change to the base database: but currently they sit outside of the transaction mechanism, as they are not part of the current transaction. I would like such changes to be asynchronous as I don't require all partitions to be online for a schema change.)

I plan to add more comments to the source code to explain things and make the structure clearer. There are some attempts to explain the internal workings of Pyrrho in the SourceIntro.doc and Classes spreadsheet in the distribution.

Wednesday, 14 August 2013

Partitioned Database Implementation

If a database becomes too large or too busy for the server(s) on which it is provided, an option is to partition it to create a “distributed database” (Ceri and Pelagatti, 1984). When tables are partitioned horizontally by selecting ranges of values of fields in one or more columns of a single table, the result is nowadays called “shared-nothing replication”. Table partitioning can be combined with other distribution mechanisms such as ordinary database replication, additional query processors and remote clients. A (distributed) configuration database is used to specify the relationship between databases, partition files, and the partitioning mechanism, as described in Configuring Database Servers.

All operations on partitioned tables are effected by SQL statements sent to the base database or any of the partitions, so that the partitioning is transparent to clients. Operations on a single, local partition will generally be faster than those involving multiple or remote partitions: in particular the use of cross-partition constraints or triggers should be discouraged. Schema changes can only be made on the base database.

When table partitions are used, the complete database consists of the base database and a set of database partitions. A database partition contains a copy of the schema obtained from the base database, and records for its table partitions. A database partition can contain named segments from a number of partitioned tables. Each named segment is stored in its entirety in a named partition is defined by a set of specifying a range of values for columns. The columns do not have to be part of an index for the table (the table does not even need a primary key, and in specifying the range, either or both of the maximum and minimum values can be included or not. Segments do not overlap, and any data not in any segment will remain in the base database.

The server hosting the partition is the transaction master for that partition: this allows updates to be made to the rest of the database even when one partition is offline. Indexes are also partitioned, with each partition holding indexes for the data it contains. Cross-partition operations will be distributed transactions, and by default the server will try to verify coherence of the participating versions as the database is loaded. A good partitioning scheme for a particular scenario will minimise the number of distributed transactions that result.

Partitions are created by adding a row to the _Partition table on the base database server (see section 8.2). The base server creates or connects to the partition database using the credentials provided, creates the schema for the partition, and moves any existing partition data.

Certain types of partition changes can be carried out on a live database by means of transacted groups of changes to the Partition table: a new segment, moving a segment, splitting a segment, deleting a segment (data in the partition is moved to the base database).

In partitions, schema information uses defining positions in the base database. Each partition maintains a binary copy of the entire schema in Partition records in the database: a further Partition record is supplied by the base server if the schema has changed on the next commit of data to the partition.

When the partition information changes this probably involves multiple rows in the _Partition table, so these should occur in transacted groups, with validation (non-overlapping) enforced at the end of each group, when we should see what records need to transfer between partitions or to/from the base database. Table segments should move in their entirety. We support four situations (for at most one base database at a time):

a.       A new partition (or new table or segment for an existing partition) has been added without any changes to other partitions, examine the base database for records for it.

(AddRecords only: this is NewPartition())

b.       One or more configuration rows for a partition P has been deleted. No updates to other rows: records affected will move to the base database. (Deletion of records for P: this is OldPartition())

c.       Changes to other partitions are only to receive tables or segments from P. (Changes to partition names from a single P: this is OldPartition())

d.       Split of a single TableSegment: no records need to move. (change to ValueBounds, and Add another)

It should be possible to design any change in partitioning as a sequence of these four steps. Note that movement of records happens after the validation step, so the order of operations only has to match one of the above.

For implementing constraints, there is an extra stage of checking at the base database level (e.g. uniqueness across all partitions). We will add a PartitonedIndex class to help with this, and a service to look up (a set of) entries in a local index.

A simple example

Two servers A  and B both running, each in its own empty folder (no configuration databases).

0: No _ databases. Create database D on server A and create table E, with some sample data:

                E: (F int, G char). (1,’One’),(2,’Two’).

1: On A, use pyrrhocmd _ to autocreate the configuration database, and define a new partition D2 on B containing a single table segment E2:

Insert into “_Partition” values (‘D2’, 'B', ’me’, ’password’, ‘D’, 'A', ‘password’, ’E’,’E2’, ’F’, 2, 2, true, true)

The credentials supplied must be sufficient to create or open database D2 on server B and create objects in D2.

2. Inspect the logfiles to see what has been done. There should be a configuration database on A with the above information, but on B we see an entry in _Database for a database called D2, and an entry in the _Client table for the base server as a client of B.

On A, the log for D should show the deletion of one record in E as part of a distributed transaction involving _ and D2.

On B, the log for D2 on B should show the creation of E with the same owner as in A, the partition information for E, and the addition of a single record.

3. From A, verify that the table E still shows both records.

4. From A insert some more records in E: (3,’Three’) (2,’Deux’). Check both show up in E, and verify from the logfiles that one new record each has been added to databases D and D2.

5. On A, delete the partition by deleting the single partition record in _ .

See a full version of this tutorial, here.

Configuring Database Servers

The new version of Pyrrho has a completely approach to server configuration, based on a peer-to-peer database. In this article, we look at the configuration of distributed databases, addressing the "Pyrrho and the Cloud" scenario from 2009, and return to issues of partitioned databases in a forthcoming article.

Each server has its own folder for holding databases (although full pathnames can be specified for databases). The configuration database (if any) is found in this folder and is called _ ,. It contains tables with names starting with _. The basic schema of _ is hardwired in the server engine so that _ is created the first time the server owner alters a configuration table.

By default the configuration tables are empty: databases that are not mentioned in the configuration tables are local databases, and we know about no other servers. In this way we already have an API for configuration (namely SQL/ADO.NET) and we conform to Codd’s rule that SQL should be the DDL for the DBMS.

Configuration tables holding server data are automatically partitioned so that each _ database only holds information for data managed by that server instance, and is the transaction master for its own data. Thus any change to properties of remote servers is handled as a remote update – the Pyrrho protocol already supports this. It turns out that the provision for verification of distributed transactions is exactly what is required to guarantee consistency of the information held by different servers.

When the server starts up it should load the _ database if present. What will be new is a more elegant handling of the situations where servers are unavailable: obviously this should mean that their databases and partitions are unavailable. Code exists in the server already for deferring distributed transaction verification and this is exactly what is needed for if and when the remote server comes online later. See also “Database dependency” below.

As of v 5.0, we will use the following schema for _:

_Client: (Server, Client, User, Password). We locally store entries for which Server=us. This table is maintained automatically: don’t update it. Passwords use Pyrrho’s password type so are not readable. The user and password information enables access to the configuration database on the client.

_Database: (Name, Server, Password, ServerRole, Remote, RemoteUser, RemotePassword). (Remote, Server) references _Client. We locally store entries for which Server=us. Passwords use Pyrrho’s password type so are not readable, but they must be supplied when this table is updated to enable server-server communication. The remoteuser and remotepassword information enable server-server access to the named database on the remote server. The password information enables access to the configuration database on the server.

_Partition: (Base, BaseServer, BasePassword, Table, PartitionName, PartitionServer PartitionServerConfigUser, PartitionServerConfigPassword, SegmentName, Column, MinValue, MaxValue, IncludeMin, IncludeMax). Table and Column are positions in the base database: we understand that the partition database has appropriate entries in its schema and a mapping from the base positions. We locally store entries where BaseServer=us.  If you create such an entry, Pyrrho automatically defines the table in the partition database, and makes a Partition1 record in the partition database that gives the server details. If the PartitionServer field is null, the partition is marked as dropped. The partition user and password are for server-server access from the base database: the base server password is a security precaution: it must match the base server’s account. The PartName field enables a number of tableparts to be specified as being in the partition: each part can specify a number of column conditions.

Remember that _ is for distributed configuration information, and is not affected by connection-specific things such as creating or accessing non-configured databases. The current server’s serverRole for non-configured databases will be 7 (for local databases: Query+Master+Storage) and 0 (remote), but they are not entered in the configuration tables. No server is obliged to have complete information.

In the _Database table, Remote is null if ServerRole has bit 1 (Master), and otherwise references cause distributed transactions: e.g. a server providing query and/or storage only will usually list the master server as remote. If no master server is configured the database is read-only (this is not an error). However, at most one server can have the master role.

To change configuration information, the owner of the server account simply connects to the local _ database and modifies it. The information about other servers is read-only, although selects, joins etc show the overall information to the extent required for server-server communication (server roles and partitions). When changes are committed, affected databases are unloaded and a background process is started to reload them. Pyrrho does try to notify affected servers, using the same mechanism as for any update to a distributed database: this will be an area for continuing improvements.

Note, however, that the server passwords in the configuration tables are for enabling access from one server to another, using standard operating system security, for the accounts under which the servers have been configured to run. The partition base credentials enable access to the base database of the partition being configured. On a single server or in a single domain, it is quite likely that a single account and password will be used in all or most of the tables.

As a special case, the entry in the _Database table for a partition will have all 3 server roles and the remote information will be for the base server.

A simple configuration example

Three servers A  , B,  C  initially with no configuration databases. Database D on server A has a table E(F int). We will configure these servers so that both A and C serve queries for D, but C will access storage for D on server B, while A will retain its roles of Query+Master+Storage. 

This will show in the _ database as follows:

_Client table


 _Database table


As we get there we can keep track of developments using Pyrrho's system and log tables. Log tables show the contents of physical database files (so are available when the server has Query+Storage role).

Step 1: Create _ on B and configure storage for database D on A.

 pyrrhocmd –h:B _

SQL> [insert into "_Database" values('D','B',4,passwd,'A',user,passwd)]

The right user name (in Windows Domain\User form) and password should be supplied, but the password won’t be displayed. These two steps have the side effect of getting A to create a _ database and fill in the  _Client entry for B and _Database entry for D, so if A is offline this step should fail. This also creates a D.pfl file in B’s folder.

At this stage the Log$ files for _ on A and _ on B show different Records, but table “_Database” and table “_Client” will agree on the two servers, showing that the configuration database is shared.  

Step 2: Create _ on C and configure the Query ServerRole for database D on B.

pyrrhocmd –h:C _

SQL> [insert into "_Database" values('D','C',2,passwd,'B',user,passwd)]

Pyrrho discovers A's master role. The _Database table shows as above on server B at this stage, (A and C show 2 rows each) although “Sys$Table” and table “Log$” on each server shows just one record in _DatabaseTable E shows in D on A and C,  and E should be listed in Sys$Table on D but the Log$ table on D should not be visible since there is no D.pfl file in C’s folder. div>

This completes the configuration. On C delete an entry in E, and check for expected behaviour.  On C, alter table E to make F a primary key. Check that the Sys$Index tables on both A and C show the change.  We can read information from database D on C even if A is offline. We can’t change anything though.

See a full version of this tutorial, with full explanations of the protocols used,  here .

Thursday, 31 January 2013

Optimising enumerations

Like most database engines, at the heart of Pyrrho is an implementation of B-Trees (not binary trees). In Pyrrho this is implemented by a generic abstract class ATree <K,V> for a tree of key-value pairs. There are lots of subclasses of ATree, some of which implement weakly-ordered or multilevel indexes.

The ATree method GetRowEnumerator returns a SlotEnumerator <K,V> that traverses the pairs of the tree in key order. There are two versions of this method, one of which supplies a Key for matching. This will enumerate all pairs where the key matches the given one. Now for a strongly-ordered tree (no key duplicates) the resulting enumeration will have 1 or zero entries (a TrivalEnumerator or an EmptyEnumerator) provided the key supplied will be a constant. By constant is meant "will not change during result-enumeration of any current query".

This is a very subtle and important point: Pyrrho uses partial evaluation so that a Column for values such as integers, shows just the current value, but this can change when an enumerator moves to the next row. Such values are obviously not constant, and so if the Key value supplied to GetRowEnumerator was such a value, while it would still be true that in each case there is either one or zero matching pairs in the tree, we need to check to find out which.

On the other hand, it is such an important optimisation to be able to replace an enumerator with a trivial or empty enumerator that it seems worth adding some machinery to the database engine to keep track of which expressions are constant. The illustration shows a code fragment from the database engine.

As a result of these considerations many structures (e.g. Column, TypedValue and all their subclasses) have an extra field or property with a name such as isConstant to speed up this determination.

Since the key K might be something very simple such as long or string, the IsConstant() method used in the illustration needs to be defined as an extension method. To my relief I find that Debian Squeeze supports the use of C# extension methods so henceforth Open-source Pyrrho OSP has moved back up to .NET 3.5. For Windows of course we currently use .NET 4.

Needless to say the above changes resulted in about 600 changes to the Pyrrho sources, and it is possible that some mistakes will need fixing. I have been doing quite a lot of testing and will continue to do so. For the next while there will be updates of Pyrrho roughly weekly.

Monday, 28 January 2013

Multisets and Arrays

This week's update makes a number of small alterations to Pyrrho's syntax, mostly to bring it better in line with the SQL standard. Specifically, array and multiset initialisation (if not using a subquery) uses [] instead of (). Subqueries are always introduced with a leading parenthesis as in (SELECT .., and are used in predicates such as IN, EXISTS, UNIQUE. The Open Source distribution has been moved back to .NET v2.0 so that it works with a larger set of Linux distributions.

Monday, 21 January 2013

Charting enhancements

This week's version of Pyrrho 4.8 has modified slightly the output flags that can be added to tables and columns for automatic charting. Table metadata can include Pie, Line, Points, Histogram. Column metadata can include X, Y and Caption. Legend can be usefully added to Pie and Histogram: if legend is not specified Pyrrho writes the captions directly into the chart. The colours are selected automatically. An example is shown here.
In this example the values in the table are added explicitly: in a real application this would be a presentation database and the table would be filled by some business analytics process. Feedback is always welcome on features of this sort.

Monday, 7 January 2013

Version 4.8 released

This brings standard SQL diagnostics management. Standard SQL exception handling includes three kinds of handlers for errors: CONTINUE, EXIT and UNDO. Pyrrho honours not_found if it is handled but otherwise does not report an error condition if DML operations do not affect any rows. Pyrrho's strict interpretation of correct transaction behaviour is supported by better prompts in the command line interface and a prohibition on using COMMIT inside a SQL routine (ROLLBACK is allowed but aborts the routine). The meaning of UNDO is very interesting: if this handler is activated, all of the database changes performed so far in the scope of the handler declaration are undone. This is a great improvement on ROLLBACK or EXIT as it allows for subtle and sophisticated error handling. Exceptions arising from most of Pyrrho's SQLSTATEs can be handled: you cannot handle such errors as damaged files, bad syntax, or attempts to defeat the transaction mechanism. In addition the keywords not_found and sqlexception etc are supported, and it is permissible to use your own signals (the usual rules for identifiers apply). The diagnostics area in the SQL standard is interesting. There is a set of keywords, including message_text, some of which have prescribed default values for each exception. You can override these using SET assignments in a SIGNAL statement, and collect their values by using GET DIAGNOSTICS, which allows assignment of such data identified by keywords to variables that you specify. In addition, a roundup of improvements results in better behaviour of such things as structured types. The philosophy nowadays is that client software should not need to know about how data types are implemented in the server, but that (as in XML or JSON) should simply give a full and readable description of structured data. The Pyrrho engine uses strong data types, so the server describes each part of the data by giving the nominal data type and the actual data subtype if it is different. For example, the actual type might include semantic information or metadata. Such actual types are visible to SQL and the user when they are explicitly assigned to the value using the RDF/OWL ^^ suffix or the SQL TREAT function. Unfortunately some previous cross-platform support for other APIs is now lost, although many of the interfaces in ADO.NET and JPA are still supported. For example, there seems to be no way for .NET's DataTable to support (nested) structured types, and so DataAdapters and some of the more complex JPA interfaces have been dropped in this edition. The venerable PyrrhoMgr tool is a casualty of this change. I still rather like the PyrrhoSQL tool but it needs someone other than me to improve its usability.