Sunday, 29 December 2013

MongoDB and the Document standard type

The Document Type

A new feature of Pyrrho (from version 5.1) is the Document standard type. It is planned that this should be used in so-called Big Data applications of Pyrrho. The Document type is inspired by MongoDB and is very useful for ad hoc data. Documents contain strongly typed named fields, but without a schema: the only field that every Document is required to have is called _id, and this is supposed to be unique per document: it is generated for you if you do not supply it.
Documents can be provided in SQL using Json syntax: Pyrrho's SQL is extended to allow Json objects to be written directly as in
select * from a where b={'C':23}

Note that field names are always case-sensitive. Document fields can be embedded documents, arrays or regular expressions: for example a regular expression /a?b.*c/i can be written without quotes inside a Json object. As discussed further below, the equals sign here only tests for the fields mentioned in the document literal.
The client library also supports the Document type: PyrrhoDocument has conversions to and from Json, Bson and byte[].

Indexing Documents

Any table can have columns that contain Documents, and fields inside documents can be accessed from SQL using the a.b.c syntax. The usual SQL2011 case-sensitivity rules apply, so this selector will obtain values such as 23 from the above table.  In fact the above simple query can be written
select * from a where b.c=23

Document field selectors can also be used where SQL allows column lists:
select b.c, d from a

create table f(g char, h document, primary key(g, h.i as int))

though here we note that Pyrrho needs to be told what type the field I has. Extra document indexes can be specified as is usual in SQL using unique or references . Such indexes add restrictions to the creation of new documents, as fields used in index keys must not be null.
These new features also work well for other forms of structured data.


MongoDB has $ operators for use in creating templates for queries and in updates, and these are also available in Pyrrho, and provide alternative ways of writing queries in ordinary data. For example  the query
select * from t where x>100

(even where table t contains no Document columns) can be rewritten using a document literal as
select * from t where x={'$gt': 100}

Such a constant equality-match condition can be used very efficiently on remote data.

Document Updates

Documents are almost never replaced in their entirety. Instead document fields are modified using templates that contain $ operators, and Pyrrho's transaction log contains only the update templates used: the actual binary value of the document is maintained in memory.

Saturday, 28 December 2013

"Database as a Service" and the CAP Theorem

DBaaS is difficult on most platforms for several reasons:

1.       Cloud providers charge by the amount of use. Databases need a lot of disk space so incur charges. If the database engine caches a lot of things in memory, then the server instance needs to run the whole time, and this also costs a lot f money. Finally, if there is a privileged single instance (such as a transaction master for a dataset) then this is also a restriction.

2.       Cloud “database” systems typically do not support ACID principles. They add timestamps etc so that eventual consistency is guaranteed at the expense of durability (there will be lost updates). Also, eventual consistency means that users in telephone contact with each other may see differences in data values in New York and Glasgow that take time to be resolved.

For these reasons Pyrrho does not claim to work with cloud providers. It requires a transaction master and uses memory at least for indexes.
The title refers to Eric Brewer's famous theorem that in a distributed system you cannot have all three of Consistency, Availability and Partition Tolerance. As with many inconvenient truths, many people have tried to pretend they have a workaround.
Assuming we want C for consistency, the P of CAP is “partition tolerance”, which means tolerating when the network is broken (partitioned into two or more fragments). This sort of partition is not the same as horizontal or vertical partitioning of databases. If a client cannot contact the transaction master, no transactions can be committed, so on part of the network the database will not be available (the A).

What Pyrrho does offer in the direction of DBaaS is distributed and (horizontally) partitioned databases. Each horizontal partition is its own transaction master (a replica of a horizontal partition will not be). The most (network) partition tolerant design is where the only distributed transactions are either read-only or for schema changes. In that case you have a good deal of (network) partition tolerance:

1.       When any (horizontal) partition comes up, it checks with its parent in the (horizontal) partitioning scheme.

2.       To view data from anywhere connect to any partition: if your query attempts to access a partition that is offline, there will be an error.

3.       For updates to a given partition, you just connect to that partition (this is not a distributed transaction).

For example, if the horizontal partitioning was by country, few users would notice disruption to network traffic between countries. Pyrrho does not prevent more complex and fragile database designs. In this way we still have global consistency and ACID, and have partial A and P .


Monday, 25 November 2013

Distributed and Partitioned DB Tutorials

Today's release of PyrrhoDB v5.0 comes with a full account of the distributed and partitioned tutorials described in August 2013 postings on this blog.

Those postings have been updated slightly to match the new version. The main difference is that reconfiguration and repartition of databases are now transactioned. You can say begin transaction before you start either process and then no changes are made to disks on any of the servers involved until the transaction is committed. Recall that Pyrrho is very fussy about transaction isolation, so that while a transaction is ongoing the connection that started the transaction is the only participant that can examine the progress being made.

The advantage of doing this is more than theoretical. Defining a partition includes specifying a set of conditions for including records in the new partition, and autocommits during this process would not be helpful.

The new version comes with a much more robust 3-phase distributed transaction protocol than was provided in previous versions, and there are slight differences.

For the purpose of explaining the internal operation of Pyrrho for distributed and partitioned data, a tutorial mode -T has been implemented that exposes all of the server-server protocols and commit steps. Sections have been added to the tutorials to explain some of what is going on. The Pyrrho manual and the SourceIntro document in the open source distribution have been updated with full details. I am happy to explain the internals to anyone who is interested and plan to add more comments to the source code.

Future developments for Pyrrho will develop these facilities a little further, offering behaviour closer to scatter-gather (Hadoop). In connection with a related project at UWS, I am also planning to provide internal support for a BSON data type. As usual, what distinguished Pyrrho from other database initiatives is that all databases are consistent and relational, support full SQL and optimistic concurrency.

Thursday, 15 August 2013

Pyrrho DBMS architecture

With v5.0 Pyrrho has enhanced internal support for distributed, partitioned and remote databases. The architectural model outlined in 2009 Pyrrho and the Cloud is still correct, and the diagram from that article still applies:

The bottom two blocks are referred to in the source code as Level 1 (datafiles). Level 2 (physical records) is where transaction serialization occurs, Level 3 is for database objects, and Level 4 is where the SQL mechanisms, rowsets, OLAP, multi-database connections etc occur. Since approximately v4.5, the strong data type mechanism operates in level 2, and this now facilitates binary server-server communications.

With v5.0 the asynchronous client-server implementation (AsyncStream) is adapted for server-server communications, and applies for communication at the different levels in the following way:
  • At Level 1 server-server communications support the notion of remote storage
  • At Level 2 we use server-server comms for remote transaction master
  • At Level 3 we support database partitioning, where the asynchronous comms enable traversal of partitioned table segments
  • At Level 4 server-server comms will support mobile clients accessing remote databases
The configuration database uses the partitioning mechanism at level 3 to implement its peer-to-peer nature, with subclasses of the LocalTransaction class for handling side effects such as data transfer when partitions are moved between servers.

The v5.0 implementation of partitioning has brought a subtle and important change to Pyrrho's internal operation. Up to now objects in data and schema records have uniformly been identified using their defining or actual addresses in the transaction log files. However, from v5.0, schema objects are always identified by places in the base database's transaction log, and a special physical record (Partition) is used to wrap and import these into the partition's log. These Partition records are created and pushed to the partition as needed whenever a communication channel is opened between them. (They should probably be transacted in the partition under the identity of whoever made the schema change to the base database: but currently they sit outside of the transaction mechanism, as they are not part of the current transaction. I would like such changes to be asynchronous as I don't require all partitions to be online for a schema change.)

I plan to add more comments to the source code to explain things and make the structure clearer. There are some attempts to explain the internal workings of Pyrrho in the SourceIntro.doc and Classes spreadsheet in the distribution.

Wednesday, 14 August 2013

Partitioned Database Implementation

If a database becomes too large or too busy for the server(s) on which it is provided, an option is to partition it to create a “distributed database” (Ceri and Pelagatti, 1984). When tables are partitioned horizontally by selecting ranges of values of fields in one or more columns of a single table, the result is nowadays called “shared-nothing replication”. Table partitioning can be combined with other distribution mechanisms such as ordinary database replication, additional query processors and remote clients. A (distributed) configuration database is used to specify the relationship between databases, partition files, and the partitioning mechanism, as described in Configuring Database Servers.

All operations on partitioned tables are effected by SQL statements sent to the base database or any of the partitions, so that the partitioning is transparent to clients. Operations on a single, local partition will generally be faster than those involving multiple or remote partitions: in particular the use of cross-partition constraints or triggers should be discouraged. Schema changes can only be made on the base database.

When table partitions are used, the complete database consists of the base database and a set of database partitions. A database partition contains a copy of the schema obtained from the base database, and records for its table partitions. A database partition can contain named segments from a number of partitioned tables. Each named segment is stored in its entirety in a named partition is defined by a set of specifying a range of values for columns. The columns do not have to be part of an index for the table (the table does not even need a primary key, and in specifying the range, either or both of the maximum and minimum values can be included or not. Segments do not overlap, and any data not in any segment will remain in the base database.

The server hosting the partition is the transaction master for that partition: this allows updates to be made to the rest of the database even when one partition is offline. Indexes are also partitioned, with each partition holding indexes for the data it contains. Cross-partition operations will be distributed transactions, and by default the server will try to verify coherence of the participating versions as the database is loaded. A good partitioning scheme for a particular scenario will minimise the number of distributed transactions that result.

Partitions are created by adding a row to the _Partition table on the base database server (see section 8.2). The base server creates or connects to the partition database using the credentials provided, creates the schema for the partition, and moves any existing partition data.

Certain types of partition changes can be carried out on a live database by means of transacted groups of changes to the Partition table: a new segment, moving a segment, splitting a segment, deleting a segment (data in the partition is moved to the base database).

In partitions, schema information uses defining positions in the base database. Each partition maintains a binary copy of the entire schema in Partition records in the database: a further Partition record is supplied by the base server if the schema has changed on the next commit of data to the partition.

When the partition information changes this probably involves multiple rows in the _Partition table, so these should occur in transacted groups, with validation (non-overlapping) enforced at the end of each group, when we should see what records need to transfer between partitions or to/from the base database. Table segments should move in their entirety. We support four situations (for at most one base database at a time):

a.       A new partition (or new table or segment for an existing partition) has been added without any changes to other partitions, examine the base database for records for it.

(AddRecords only: this is NewPartition())

b.       One or more configuration rows for a partition P has been deleted. No updates to other rows: records affected will move to the base database. (Deletion of records for P: this is OldPartition())

c.       Changes to other partitions are only to receive tables or segments from P. (Changes to partition names from a single P: this is OldPartition())

d.       Split of a single TableSegment: no records need to move. (change to ValueBounds, and Add another)

It should be possible to design any change in partitioning as a sequence of these four steps. Note that movement of records happens after the validation step, so the order of operations only has to match one of the above.

For implementing constraints, there is an extra stage of checking at the base database level (e.g. uniqueness across all partitions). We will add a PartitonedIndex class to help with this, and a service to look up (a set of) entries in a local index.

A simple example

Two servers A  and B both running, each in its own empty folder (no configuration databases).

0: No _ databases. Create database D on server A and create table E, with some sample data:

                E: (F int, G char). (1,’One’),(2,’Two’).

1: On A, use pyrrhocmd _ to autocreate the configuration database, and define a new partition D2 on B containing a single table segment E2:

Insert into “_Partition” values (‘D2’, 'B', ’me’, ’password’, ‘D’, 'A', ‘password’, ’E’,’E2’, ’F’, 2, 2, true, true)

The credentials supplied must be sufficient to create or open database D2 on server B and create objects in D2.

2. Inspect the logfiles to see what has been done. There should be a configuration database on A with the above information, but on B we see an entry in _Database for a database called D2, and an entry in the _Client table for the base server as a client of B.

On A, the log for D should show the deletion of one record in E as part of a distributed transaction involving _ and D2.

On B, the log for D2 on B should show the creation of E with the same owner as in A, the partition information for E, and the addition of a single record.

3. From A, verify that the table E still shows both records.

4. From A insert some more records in E: (3,’Three’) (2,’Deux’). Check both show up in E, and verify from the logfiles that one new record each has been added to databases D and D2.

5. On A, delete the partition by deleting the single partition record in _ .

See a full version of this tutorial, here.

Configuring Database Servers

The new version of Pyrrho has a completely approach to server configuration, based on a peer-to-peer database. In this article, we look at the configuration of distributed databases, addressing the "Pyrrho and the Cloud" scenario from 2009, and return to issues of partitioned databases in a forthcoming article.

Each server has its own folder for holding databases (although full pathnames can be specified for databases). The configuration database (if any) is found in this folder and is called _ ,. It contains tables with names starting with _. The basic schema of _ is hardwired in the server engine so that _ is created the first time the server owner alters a configuration table.

By default the configuration tables are empty: databases that are not mentioned in the configuration tables are local databases, and we know about no other servers. In this way we already have an API for configuration (namely SQL/ADO.NET) and we conform to Codd’s rule that SQL should be the DDL for the DBMS.

Configuration tables holding server data are automatically partitioned so that each _ database only holds information for data managed by that server instance, and is the transaction master for its own data. Thus any change to properties of remote servers is handled as a remote update – the Pyrrho protocol already supports this. It turns out that the provision for verification of distributed transactions is exactly what is required to guarantee consistency of the information held by different servers.

When the server starts up it should load the _ database if present. What will be new is a more elegant handling of the situations where servers are unavailable: obviously this should mean that their databases and partitions are unavailable. Code exists in the server already for deferring distributed transaction verification and this is exactly what is needed for if and when the remote server comes online later. See also “Database dependency” below.

As of v 5.0, we will use the following schema for _:

_Client: (Server, Client, User, Password). We locally store entries for which Server=us. This table is maintained automatically: don’t update it. Passwords use Pyrrho’s password type so are not readable. The user and password information enables access to the configuration database on the client.

_Database: (Name, Server, Password, ServerRole, Remote, RemoteUser, RemotePassword). (Remote, Server) references _Client. We locally store entries for which Server=us. Passwords use Pyrrho’s password type so are not readable, but they must be supplied when this table is updated to enable server-server communication. The remoteuser and remotepassword information enable server-server access to the named database on the remote server. The password information enables access to the configuration database on the server.

_Partition: (Base, BaseServer, BasePassword, Table, PartitionName, PartitionServer PartitionServerConfigUser, PartitionServerConfigPassword, SegmentName, Column, MinValue, MaxValue, IncludeMin, IncludeMax). Table and Column are positions in the base database: we understand that the partition database has appropriate entries in its schema and a mapping from the base positions. We locally store entries where BaseServer=us.  If you create such an entry, Pyrrho automatically defines the table in the partition database, and makes a Partition1 record in the partition database that gives the server details. If the PartitionServer field is null, the partition is marked as dropped. The partition user and password are for server-server access from the base database: the base server password is a security precaution: it must match the base server’s account. The PartName field enables a number of tableparts to be specified as being in the partition: each part can specify a number of column conditions.

Remember that _ is for distributed configuration information, and is not affected by connection-specific things such as creating or accessing non-configured databases. The current server’s serverRole for non-configured databases will be 7 (for local databases: Query+Master+Storage) and 0 (remote), but they are not entered in the configuration tables. No server is obliged to have complete information.

In the _Database table, Remote is null if ServerRole has bit 1 (Master), and otherwise references cause distributed transactions: e.g. a server providing query and/or storage only will usually list the master server as remote. If no master server is configured the database is read-only (this is not an error). However, at most one server can have the master role.

To change configuration information, the owner of the server account simply connects to the local _ database and modifies it. The information about other servers is read-only, although selects, joins etc show the overall information to the extent required for server-server communication (server roles and partitions). When changes are committed, affected databases are unloaded and a background process is started to reload them. Pyrrho does try to notify affected servers, using the same mechanism as for any update to a distributed database: this will be an area for continuing improvements.

Note, however, that the server passwords in the configuration tables are for enabling access from one server to another, using standard operating system security, for the accounts under which the servers have been configured to run. The partition base credentials enable access to the base database of the partition being configured. On a single server or in a single domain, it is quite likely that a single account and password will be used in all or most of the tables.

As a special case, the entry in the _Database table for a partition will have all 3 server roles and the remote information will be for the base server.

A simple configuration example

Three servers A  , B,  C  initially with no configuration databases. Database D on server A has a table E(F int). We will configure these servers so that both A and C serve queries for D, but C will access storage for D on server B, while A will retain its roles of Query+Master+Storage. 

This will show in the _ database as follows:

_Client table


 _Database table


As we get there we can keep track of developments using Pyrrho's system and log tables. Log tables show the contents of physical database files (so are available when the server has Query+Storage role).

Step 1: Create _ on B and configure storage for database D on A.

 pyrrhocmd –h:B _

SQL> [insert into "_Database" values('D','B',4,passwd,'A',user,passwd)]

The right user name (in Windows Domain\User form) and password should be supplied, but the password won’t be displayed. These two steps have the side effect of getting A to create a _ database and fill in the  _Client entry for B and _Database entry for D, so if A is offline this step should fail. This also creates a D.pfl file in B’s folder.

At this stage the Log$ files for _ on A and _ on B show different Records, but table “_Database” and table “_Client” will agree on the two servers, showing that the configuration database is shared.  

Step 2: Create _ on C and configure the Query ServerRole for database D on B.

pyrrhocmd –h:C _

SQL> [insert into "_Database" values('D','C',2,passwd,'B',user,passwd)]

Pyrrho discovers A's master role. The _Database table shows as above on server B at this stage, (A and C show 2 rows each) although “Sys$Table” and table “Log$” on each server shows just one record in _DatabaseTable E shows in D on A and C,  and E should be listed in Sys$Table on D but the Log$ table on D should not be visible since there is no D.pfl file in C’s folder. div>

This completes the configuration. On C delete an entry in E, and check for expected behaviour.  On C, alter table E to make F a primary key. Check that the Sys$Index tables on both A and C show the change.  We can read information from database D on C even if A is offline. We can’t change anything though.

See a full version of this tutorial, with full explanations of the protocols used,  here .

Thursday, 31 January 2013

Optimising enumerations

Like most database engines, at the heart of Pyrrho is an implementation of B-Trees (not binary trees). In Pyrrho this is implemented by a generic abstract class ATree <K,V> for a tree of key-value pairs. There are lots of subclasses of ATree, some of which implement weakly-ordered or multilevel indexes.

The ATree method GetRowEnumerator returns a SlotEnumerator <K,V> that traverses the pairs of the tree in key order. There are two versions of this method, one of which supplies a Key for matching. This will enumerate all pairs where the key matches the given one. Now for a strongly-ordered tree (no key duplicates) the resulting enumeration will have 1 or zero entries (a TrivalEnumerator or an EmptyEnumerator) provided the key supplied will be a constant. By constant is meant "will not change during result-enumeration of any current query".

This is a very subtle and important point: Pyrrho uses partial evaluation so that a Column for values such as integers, shows just the current value, but this can change when an enumerator moves to the next row. Such values are obviously not constant, and so if the Key value supplied to GetRowEnumerator was such a value, while it would still be true that in each case there is either one or zero matching pairs in the tree, we need to check to find out which.

On the other hand, it is such an important optimisation to be able to replace an enumerator with a trivial or empty enumerator that it seems worth adding some machinery to the database engine to keep track of which expressions are constant. The illustration shows a code fragment from the database engine.

As a result of these considerations many structures (e.g. Column, TypedValue and all their subclasses) have an extra field or property with a name such as isConstant to speed up this determination.

Since the key K might be something very simple such as long or string, the IsConstant() method used in the illustration needs to be defined as an extension method. To my relief I find that Debian Squeeze supports the use of C# extension methods so henceforth Open-source Pyrrho OSP has moved back up to .NET 3.5. For Windows of course we currently use .NET 4.

Needless to say the above changes resulted in about 600 changes to the Pyrrho sources, and it is possible that some mistakes will need fixing. I have been doing quite a lot of testing and will continue to do so. For the next while there will be updates of Pyrrho roughly weekly.