Monday, 12 January 2015

The MongoDB service

It is noticeable that many companies are now using MongoDB documents alongside traditional SQL databases. But it is a very bad idea to require connections to different DBMS in a single application since transactional consistency is made even more difficult if not impossible. Pyrrho already allows a single connection to involve several databases, and MongoDB has a similar facility. But if an application is going to work with both database types it is really essential to have a single connection to both the SQL and no-SQL databases. Pyrrho has added support for MongoDB documents (and accessibility to these documents via SQL) for at least academic interest, by making DOCUMENT (and DOCARRAY) into primitive types in the same way that SQL added support for XML. Pyrrho’s SQL syntax in chapter 7 of this manual allows {} to delimit Json documents (and, where the context allows, [] to delimit Json arrays). Within such documents the Mongo operators have their usual meanings.

By default Pyrrho starts up a MongoDB-like wire protocol service on port 27017, so that the mongo shell can work with Pyrrho although the database file formats are completely different. In application programming where SQL tables and documents are both needed, it is better to use the above SQL extensions on a single TCP connection rather than opening a second connection on port 27017. The mongo shell’s document collections are implemented as database tables with a single column “doc” of type DOCUMENT, although SQL can access more general layouts. Comparison of documents has been hacked so that WHERE “doc”= condition in SQL behaves similarly to a Mongo find() (the resulting concept of equality of documents is admittedly strange but works well).

When working with such mixed systems the biggest surprise is around case-sensitivity. Json collection and field names are case sensitive, while SQL by default converts unquoted identifiers to upper case.

The MongoDB documentation for update contains the following example {
  $set: { status: "D" },
  $inc: { quantity: 2 }
Here are complete examples for updates using Pyrrho DB. The first uses the mongo shell to access Pyrrho, the second, the Pyrrhocmd client:
C:\Program Files\MongoDB 2.6 Standard\bin>mongo
MongoDB shell version: 2.6.6
connecting to: test
:0> db.runCommand({insert:"orders",inserts: [{product:"coffee",status:"E",quantity:1}]
{ "ok" : true, "n" : 1 }
:0> db.orders.find()
{ "product" : "coffee", "status" : "E", "quantity" : 1, "_id" : ObjectId("0000000039150006003d1909") }
:0> db.runCommand({update:"orders",updates:
[{query: {product:"coffee"},update:{$set:{status:"D"},$inc:{quantity:2}}}]
{ "ok" : true, "n" : 1 }
:0> db.orders.find()
{ "product" : "coffee", "status" : "D", "quantity" : 3, "_id" : ObjectId("0000000039150006003d1909") }

Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. All rights reserved.

C:\PyrrhoDB\OSP\OSP>pyrrhocmd test
SQL> update "orders" set "doc"={$set: {status:"G"},$inc: {quantity: 4}} where "doc"={product:"coffee"}
1 records affected

SQL> table "orders"
|{"product": "coffee", "status": "G", "quantity": 7, "_id": "0000000039150004007

Wednesday, 7 January 2015

Version 5.2 Beta

Since early April 2014 I have been working on this new version of Pyrrho DBMS. The open source edition of this version is available today in beta form for download, together with a set of unit tests.
The new version retains the same SQL language and file format as previous versions, so should be fully backwards compatible with databases created by them.
But the query processing system has been completely rewritten to be more robust for supporting developments such as MongoDB, and enhancements planned for the HTTP/REST service. I will start working on these soon.
The comments in the code also need to be completed and revised.
The new version is better able to process some standard SQL syntax (I give some examples below), and also some innovations such as the following:
  • FROM STATIC: similar to the use of "DUAL" in other DBMS, this helps meet the requirement for the FROM keyword in the SQL standard while allowing the evaluation of expressions, e.g. SELECT SESSION_ROLE FROM STATIC.
  • The ability to index on fields within structured types, e.g.
create type mrw as (c int)
create table e(f mrw, primary key(f.c)
  • The ability to rename database objects with automatic consequential modification to stored procedures, constraints, views etc that reference the modified objects.
One of the main motivations for the new version was to get triggers working fully according to the SQL 2011 standard. Example:

create trigger sdai instead of delete on a referencing old table as ot for each statement begin atomic insert into c (select b,c from ot) end
create trigger riab before insert on a referencing new as nr for each row begin atomic set nr.c=nr.b+3; update b set tot=tot+nr.b end
create trigger ruab before update on a referencing old as mr new as nr for each row begin atomic update b set tot=tot-mr.b+nr.b; set nr.d='changed' end

Other examples from the test suite:
select bb,dd from aa order by bb+cc
select a[b] from (select array(bb,cc,dd) as a,ee as b from aa)
select array(select bb from aa order by cc)[1] from static

Some examples withdrawn from the test suite include

select bb as a,(select max(bb) from aa t where>a) from aa
insert into d (select 17,(select a from b where c=1) from static)
These look like SQL, but I don't think they are valid. In the first, left-to-right processing of SQL as required by the standard means the type of bb is unknown until we reach "from aa". In the second I believe SQL requires the column names to be correct for INSERT SELECT: the test could be improved by inserting the VALUES keyword. I'll be interested in comments about these tests. 

Sunday, 29 December 2013

MongoDB and the Document standard type

The Document Type

A new feature of Pyrrho (from version 5.1) is the Document standard type. It is planned that this should be used in so-called Big Data applications of Pyrrho. The Document type is inspired by MongoDB and is very useful for ad hoc data. Documents contain strongly typed named fields, but without a schema: the only field that every Document is required to have is called _id, and this is supposed to be unique per document: it is generated for you if you do not supply it.
Documents can be provided in SQL using Json syntax: Pyrrho's SQL is extended to allow Json objects to be written directly as in
select * from a where b={'C':23}

Note that field names are always case-sensitive. Document fields can be embedded documents, arrays or regular expressions: for example a regular expression /a?b.*c/i can be written without quotes inside a Json object. As discussed further below, the equals sign here only tests for the fields mentioned in the document literal.
The client library also supports the Document type: PyrrhoDocument has conversions to and from Json, Bson and byte[].

Indexing Documents

Any table can have columns that contain Documents, and fields inside documents can be accessed from SQL using the a.b.c syntax. The usual SQL2011 case-sensitivity rules apply, so this selector will obtain values such as 23 from the above table.  In fact the above simple query can be written
select * from a where b.c=23

Document field selectors can also be used where SQL allows column lists:
select b.c, d from a

create table f(g char, h document, primary key(g, h.i as int))

though here we note that Pyrrho needs to be told what type the field I has. Extra document indexes can be specified as is usual in SQL using unique or references . Such indexes add restrictions to the creation of new documents, as fields used in index keys must not be null.
These new features also work well for other forms of structured data.


MongoDB has $ operators for use in creating templates for queries and in updates, and these are also available in Pyrrho, and provide alternative ways of writing queries in ordinary data. For example  the query
select * from t where x>100

(even where table t contains no Document columns) can be rewritten using a document literal as
select * from t where x={'$gt': 100}

Such a constant equality-match condition can be used very efficiently on remote data.

Document Updates

Documents are almost never replaced in their entirety. Instead document fields are modified using templates that contain $ operators, and Pyrrho's transaction log contains only the update templates used: the actual binary value of the document is maintained in memory.

Saturday, 28 December 2013

"Database as a Service" and the CAP Theorem

DBaaS is difficult on most platforms for several reasons:

1.       Cloud providers charge by the amount of use. Databases need a lot of disk space so incur charges. If the database engine caches a lot of things in memory, then the server instance needs to run the whole time, and this also costs a lot f money. Finally, if there is a privileged single instance (such as a transaction master for a dataset) then this is also a restriction.

2.       Cloud “database” systems typically do not support ACID principles. They add timestamps etc so that eventual consistency is guaranteed at the expense of durability (there will be lost updates). Also, eventual consistency means that users in telephone contact with each other may see differences in data values in New York and Glasgow that take time to be resolved.

For these reasons Pyrrho does not claim to work with cloud providers. It requires a transaction master and uses memory at least for indexes.
The title refers to Eric Brewer's famous theorem that in a distributed system you cannot have all three of Consistency, Availability and Partition Tolerance. As with many inconvenient truths, many people have tried to pretend they have a workaround.
Assuming we want C for consistency, the P of CAP is “partition tolerance”, which means tolerating when the network is broken (partitioned into two or more fragments). This sort of partition is not the same as horizontal or vertical partitioning of databases. If a client cannot contact the transaction master, no transactions can be committed, so on part of the network the database will not be available (the A).

What Pyrrho does offer in the direction of DBaaS is distributed and (horizontally) partitioned databases. Each horizontal partition is its own transaction master (a replica of a horizontal partition will not be). The most (network) partition tolerant design is where the only distributed transactions are either read-only or for schema changes. In that case you have a good deal of (network) partition tolerance:

1.       When any (horizontal) partition comes up, it checks with its parent in the (horizontal) partitioning scheme.

2.       To view data from anywhere connect to any partition: if your query attempts to access a partition that is offline, there will be an error.

3.       For updates to a given partition, you just connect to that partition (this is not a distributed transaction).

For example, if the horizontal partitioning was by country, few users would notice disruption to network traffic between countries. Pyrrho does not prevent more complex and fragile database designs. In this way we still have global consistency and ACID, and have partial A and P .


Monday, 25 November 2013

Distributed and Partitioned DB Tutorials

Today's release of PyrrhoDB v5.0 comes with a full account of the distributed and partitioned tutorials described in August 2013 postings on this blog.

Those postings have been updated slightly to match the new version. The main difference is that reconfiguration and repartition of databases are now transactioned. You can say begin transaction before you start either process and then no changes are made to disks on any of the servers involved until the transaction is committed. Recall that Pyrrho is very fussy about transaction isolation, so that while a transaction is ongoing the connection that started the transaction is the only participant that can examine the progress being made.

The advantage of doing this is more than theoretical. Defining a partition includes specifying a set of conditions for including records in the new partition, and autocommits during this process would not be helpful.

The new version comes with a much more robust 3-phase distributed transaction protocol than was provided in previous versions, and there are slight differences.

For the purpose of explaining the internal operation of Pyrrho for distributed and partitioned data, a tutorial mode -T has been implemented that exposes all of the server-server protocols and commit steps. Sections have been added to the tutorials to explain some of what is going on. The Pyrrho manual and the SourceIntro document in the open source distribution have been updated with full details. I am happy to explain the internals to anyone who is interested and plan to add more comments to the source code.

Future developments for Pyrrho will develop these facilities a little further, offering behaviour closer to scatter-gather (Hadoop). In connection with a related project at UWS, I am also planning to provide internal support for a BSON data type. As usual, what distinguished Pyrrho from other database initiatives is that all databases are consistent and relational, support full SQL and optimistic concurrency.

Thursday, 15 August 2013

Pyrrho DBMS architecture

With v5.0 Pyrrho has enhanced internal support for distributed, partitioned and remote databases. The architectural model outlined in 2009 Pyrrho and the Cloud is still correct, and the diagram from that article still applies:

The bottom two blocks are referred to in the source code as Level 1 (datafiles). Level 2 (physical records) is where transaction serialization occurs, Level 3 is for database objects, and Level 4 is where the SQL mechanisms, rowsets, OLAP, multi-database connections etc occur. Since approximately v4.5, the strong data type mechanism operates in level 2, and this now facilitates binary server-server communications.

With v5.0 the asynchronous client-server implementation (AsyncStream) is adapted for server-server communications, and applies for communication at the different levels in the following way:
  • At Level 1 server-server communications support the notion of remote storage
  • At Level 2 we use server-server comms for remote transaction master
  • At Level 3 we support database partitioning, where the asynchronous comms enable traversal of partitioned table segments
  • At Level 4 server-server comms will support mobile clients accessing remote databases
The configuration database uses the partitioning mechanism at level 3 to implement its peer-to-peer nature, with subclasses of the LocalTransaction class for handling side effects such as data transfer when partitions are moved between servers.

The v5.0 implementation of partitioning has brought a subtle and important change to Pyrrho's internal operation. Up to now objects in data and schema records have uniformly been identified using their defining or actual addresses in the transaction log files. However, from v5.0, schema objects are always identified by places in the base database's transaction log, and a special physical record (Partition) is used to wrap and import these into the partition's log. These Partition records are created and pushed to the partition as needed whenever a communication channel is opened between them. (They should probably be transacted in the partition under the identity of whoever made the schema change to the base database: but currently they sit outside of the transaction mechanism, as they are not part of the current transaction. I would like such changes to be asynchronous as I don't require all partitions to be online for a schema change.)

I plan to add more comments to the source code to explain things and make the structure clearer. There are some attempts to explain the internal workings of Pyrrho in the SourceIntro.doc and Classes spreadsheet in the distribution.

Wednesday, 14 August 2013

Partitioned Database Implementation

If a database becomes too large or too busy for the server(s) on which it is provided, an option is to partition it to create a “distributed database” (Ceri and Pelagatti, 1984). When tables are partitioned horizontally by selecting ranges of values of fields in one or more columns of a single table, the result is nowadays called “shared-nothing replication”. Table partitioning can be combined with other distribution mechanisms such as ordinary database replication, additional query processors and remote clients. A (distributed) configuration database is used to specify the relationship between databases, partition files, and the partitioning mechanism, as described in Configuring Database Servers.

All operations on partitioned tables are effected by SQL statements sent to the base database or any of the partitions, so that the partitioning is transparent to clients. Operations on a single, local partition will generally be faster than those involving multiple or remote partitions: in particular the use of cross-partition constraints or triggers should be discouraged. Schema changes can only be made on the base database.

When table partitions are used, the complete database consists of the base database and a set of database partitions. A database partition contains a copy of the schema obtained from the base database, and records for its table partitions. A database partition can contain named segments from a number of partitioned tables. Each named segment is stored in its entirety in a named partition is defined by a set of specifying a range of values for columns. The columns do not have to be part of an index for the table (the table does not even need a primary key, and in specifying the range, either or both of the maximum and minimum values can be included or not. Segments do not overlap, and any data not in any segment will remain in the base database.

The server hosting the partition is the transaction master for that partition: this allows updates to be made to the rest of the database even when one partition is offline. Indexes are also partitioned, with each partition holding indexes for the data it contains. Cross-partition operations will be distributed transactions, and by default the server will try to verify coherence of the participating versions as the database is loaded. A good partitioning scheme for a particular scenario will minimise the number of distributed transactions that result.

Partitions are created by adding a row to the _Partition table on the base database server (see section 8.2). The base server creates or connects to the partition database using the credentials provided, creates the schema for the partition, and moves any existing partition data.

Certain types of partition changes can be carried out on a live database by means of transacted groups of changes to the Partition table: a new segment, moving a segment, splitting a segment, deleting a segment (data in the partition is moved to the base database).

In partitions, schema information uses defining positions in the base database. Each partition maintains a binary copy of the entire schema in Partition records in the database: a further Partition record is supplied by the base server if the schema has changed on the next commit of data to the partition.

When the partition information changes this probably involves multiple rows in the _Partition table, so these should occur in transacted groups, with validation (non-overlapping) enforced at the end of each group, when we should see what records need to transfer between partitions or to/from the base database. Table segments should move in their entirety. We support four situations (for at most one base database at a time):

a.       A new partition (or new table or segment for an existing partition) has been added without any changes to other partitions, examine the base database for records for it.

(AddRecords only: this is NewPartition())

b.       One or more configuration rows for a partition P has been deleted. No updates to other rows: records affected will move to the base database. (Deletion of records for P: this is OldPartition())

c.       Changes to other partitions are only to receive tables or segments from P. (Changes to partition names from a single P: this is OldPartition())

d.       Split of a single TableSegment: no records need to move. (change to ValueBounds, and Add another)

It should be possible to design any change in partitioning as a sequence of these four steps. Note that movement of records happens after the validation step, so the order of operations only has to match one of the above.

For implementing constraints, there is an extra stage of checking at the base database level (e.g. uniqueness across all partitions). We will add a PartitonedIndex class to help with this, and a service to look up (a set of) entries in a local index.

A simple example

Two servers A  and B both running, each in its own empty folder (no configuration databases).

0: No _ databases. Create database D on server A and create table E, with some sample data:

                E: (F int, G char). (1,’One’),(2,’Two’).

1: On A, use pyrrhocmd _ to autocreate the configuration database, and define a new partition D2 on B containing a single table segment E2:

Insert into “_Partition” values (‘D2’, 'B', ’me’, ’password’, ‘D’, 'A', ‘password’, ’E’,’E2’, ’F’, 2, 2, true, true)

The credentials supplied must be sufficient to create or open database D2 on server B and create objects in D2.

2. Inspect the logfiles to see what has been done. There should be a configuration database on A with the above information, but on B we see an entry in _Database for a database called D2, and an entry in the _Client table for the base server as a client of B.

On A, the log for D should show the deletion of one record in E as part of a distributed transaction involving _ and D2.

On B, the log for D2 on B should show the creation of E with the same owner as in A, the partition information for E, and the addition of a single record.

3. From A, verify that the table E still shows both records.

4. From A insert some more records in E: (3,’Three’) (2,’Deux’). Check both show up in E, and verify from the logfiles that one new record each has been added to databases D and D2.

5. On A, delete the partition by deleting the single partition record in _ .

See a full version of this tutorial, here.