Thursday 15 August 2013

Pyrrho DBMS architecture

With v5.0 Pyrrho has enhanced internal support for distributed, partitioned and remote databases. The architectural model outlined in the 2009 article Pyrrho and the Cloud is still correct, and the diagram from that article still applies.

The bottom two blocks are referred to in the source code as Level 1 (datafiles). Level 2 (physical records) is where transaction serialization occurs, Level 3 is for database objects, and Level 4 is where the SQL mechanisms, rowsets, OLAP, multi-database connections etc. operate. Since approximately v4.5, the strong data type mechanism has operated at Level 2, and this now facilitates binary server-server communications.

With v5.0 the asynchronous client-server implementation (AsyncStream) is adapted for server-server communications, and applies for communication at the different levels in the following way:
  • At Level 1 server-server communications support the notion of remote storage
  • At Level 2 we use server-server comms for remote transaction master
  • At Level 3 we support database partitioning, where the asynchronous comms enable traversal of partitioned table segments
  • At Level 4 server-server comms will support mobile clients accessing remote databases
The configuration database uses the partitioning mechanism at level 3 to implement its peer-to-peer nature, with subclasses of the LocalTransaction class for handling side effects such as data transfer when partitions are moved between servers.

The v5.0 implementation of partitioning has brought a subtle and important change to Pyrrho's internal operation. Up to now, objects in data and schema records have uniformly been identified using their defining or actual addresses in the transaction log files. However, from v5.0, schema objects are always identified by their positions in the base database's transaction log, and a special physical record (Partition) is used to wrap and import these into the partition's log. These Partition records are created and pushed to the partition as needed whenever a communication channel is opened between them. (They should probably be transacted in the partition under the identity of whoever made the schema change to the base database, but currently they sit outside the transaction mechanism, as they are not part of the current transaction. I would like such changes to be asynchronous, as I don't require all partitions to be online for a schema change.)

I plan to add more comments to the source code to explain things and make the structure clearer. There are some attempts to explain the internal workings of Pyrrho in the SourceIntro.doc and Classes spreadsheet in the distribution.

Wednesday 14 August 2013

Partitioned Database Implementation

If a database becomes too large or too busy for the server(s) on which it is provided, an option is to partition it to create a "distributed database" (Ceri and Pelagatti, 1984). When tables are partitioned horizontally by selecting ranges of values in one or more columns of a single table, the result is nowadays called "shared-nothing replication". Table partitioning can be combined with other distribution mechanisms such as ordinary database replication, additional query processors and remote clients. A (distributed) configuration database is used to specify the relationship between databases, partition files, and the partitioning mechanism, as described in Configuring Database Servers below.

All operations on partitioned tables are effected by SQL statements sent to the base database or any of the partitions, so that the partitioning is transparent to clients. Operations on a single, local partition will generally be faster than those involving multiple or remote partitions: in particular the use of cross-partition constraints or triggers should be discouraged. Schema changes can only be made on the base database.

When table partitions are used, the complete database consists of the base database and a set of database partitions. A database partition contains a copy of the schema obtained from the base database, and records for its table partitions. A database partition can contain named segments from a number of partitioned tables. Each named segment is stored in its entirety in a single partition, and is defined by specifying a range of values for one or more columns. The columns do not have to be part of an index for the table (the table does not even need a primary key), and in specifying the range, either or both of the maximum and minimum values can be included or excluded. Segments do not overlap, and any data not in any segment remains in the base database.

The server hosting the partition is the transaction master for that partition: this allows updates to be made to the rest of the database even when one partition is offline. Indexes are also partitioned, with each partition holding indexes for the data it contains. Cross-partition operations will be distributed transactions, and by default the server will try to verify coherence of the participating versions as the database is loaded. A good partitioning scheme for a particular scenario will minimise the number of distributed transactions that result.

Partitions are created by adding a row to the _Partition table on the base database server (see section 8.2). The base server creates or connects to the partition database using the credentials provided, creates the schema for the partition, and moves any existing partition data.

Certain types of partition change can be carried out on a live database by means of transacted groups of changes to the _Partition table: adding a new segment, moving a segment, splitting a segment, or deleting a segment (whose data is then moved back to the base database).

In partitions, schema information uses defining positions in the base database. Each partition maintains a binary copy of the entire schema in Partition records in the database: if the schema has changed, a further Partition record is supplied by the base server on the next commit of data to the partition.

When the partition information changes, this usually involves multiple rows in the _Partition table, so the changes should occur in transacted groups, with validation (non-overlap of segments) enforced at the end of each group; at that point we can see which records need to transfer between partitions or to/from the base database. Table segments should move in their entirety. We support four situations (for at most one base database at a time):

a. A new partition (or a new table or segment for an existing partition) has been added without any changes to other partitions: examine the base database for records for it. (AddRecords only: this is NewPartition())

b. One or more configuration rows for a partition P have been deleted, with no updates to other rows: the affected records will move to the base database. (Deletion of records for P: this is OldPartition())

c. Changes to other partitions are only to receive tables or segments from P. (Changes to partition names from a single P: this is OldPartition())

d. Split of a single TableSegment: no records need to move. (Change to ValueBounds, and add another)

It should be possible to design any change in partitioning as a sequence of these four steps. Note that movement of records happens after the validation step, so the operations in a transacted group only have to match one of the cases above.
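As an illustration of case (d), splitting a segment might be expressed as a transacted group of two statements against the _Partition table. The column names below follow the _Partition schema described in Configuring Database Servers further down this page; the partition, segment and credential values are purely hypothetical:

    -- hypothetical: segment S1 of table E currently covers F from 0 to 100; split it at F = 50
    update "_Partition" set "MaxValue" = 50
      where "PartitionName" = 'P1' and "SegmentName" = 'S1'
    insert into "_Partition" ("Base", "BaseServer", "BasePassword", "Table", "PartitionName",
        "PartitionServer", "PartitionServerConfigUser", "PartitionServerConfigPassword",
        "SegmentName", "Column", "MinValue", "MaxValue", "IncludeMin", "IncludeMax")
      values ('D', 'A', 'password', 'E', 'P1', 'B', 'me', 'password',
              'S2', 'F', 50, 100, false, true)
    -- committing the group triggers the non-overlap validation; in this case no records need to move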

For implementing constraints, there is an extra stage of checking at the base database level (e.g. uniqueness across all partitions). We will add a PartitionedIndex class to help with this, and a service to look up (a set of) entries in a local index.

A simple example

Two servers A and B, both running, each in its own empty folder (no configuration databases).

0: No _ databases. Create database D on server A and create table E, with some sample data:

                E: (F int, G char). (1,’One’),(2,’Two’).
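On A this might look like the following pyrrhocmd session (a sketch: the statement forms are assumed rather than taken from the original tutorial):

    pyrrhocmd D
    SQL> [create table E(F int, G char)]
    SQL> [insert into E values (1, 'One')]
    SQL> [insert into E values (2, 'Two')]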

1: On A, use pyrrhocmd _ to autocreate the configuration database, and define a new partition D2 on B containing a single table segment E2:

Insert into "_Partition" values ('D2', 'B', 'me', 'password', 'D', 'A', 'password', 'E', 'E2', 'F', 2, 2, true, true)

The credentials supplied must be sufficient to create or open database D2 on server B and create objects in D2.

2. Inspect the logfiles to see what has been done. There should be a configuration database on A with the above information, while on B we see an entry in _Database for a database called D2, and an entry in the _Client table for the base server as a client of B.

On A, the log for D should show the deletion of one record in E as part of a distributed transaction involving _ and D2.

On B, the log for D2 should show the creation of E with the same owner as in A, the partition information for E, and the addition of a single record.

3. From A, verify that the table E still shows both records.

4. From A insert some more records in E: (3,’Three’), (2,’Deux’). Check that both show up in E, and verify from the logfiles that one new record each has been added to databases D and D2.
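Connected to D on A, these are ordinary SQL inserts:

    SQL> [insert into E values (3, 'Three')]
    SQL> [insert into E values (2, 'Deux')]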

5. On A, delete the partition by deleting the single partition record in _.
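Again this is just SQL against the configuration database; the column name comes from the _Partition schema described in Configuring Database Servers below, and the predicate is only a sketch:

    pyrrhocmd _
    SQL> [delete from "_Partition" where "PartitionName" = 'D2']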

See a full version of this tutorial here.

Configuring Database Servers

The new version of Pyrrho has a completely new approach to server configuration, based on a peer-to-peer database. In this article, we look at the configuration of distributed databases, addressing the "Pyrrho and the Cloud" scenario from 2009, and return to issues of partitioned databases in a forthcoming article.

Each server has its own folder for holding databases (although full pathnames can be specified for databases). The configuration database (if any) is found in this folder and is called _. It contains tables with names starting with _. The basic schema of _ is hardwired in the server engine, so that _ is created the first time the server owner alters a configuration table.

By default the configuration tables are empty: databases that are not mentioned in the configuration tables are local databases, and we know about no other servers. In this way we already have an API for configuration (namely SQL/ADO.NET) and we conform to Codd’s rule that SQL should be the DDL for the DBMS.
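For example, the server owner can inspect the local configuration simply by connecting to _ with pyrrhocmd (the table shorthand here is the same one used in the tutorial steps below):

    pyrrhocmd _
    SQL> table "_Database"
    SQL> table "_Partition"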

Configuration tables holding server data are automatically partitioned so that each _ database only holds information for data managed by that server instance, and is the transaction master for its own data. Thus any change to properties of remote servers is handled as a remote update – the Pyrrho protocol already supports this. It turns out that the provision for verification of distributed transactions is exactly what is required to guarantee consistency of the information held by different servers.

When the server starts up it should load the _ database if present. What will be new is a more elegant handling of the situations where servers are unavailable: obviously this should mean that their databases and partitions are unavailable. Code exists in the server already for deferring distributed transaction verification, and this is exactly what is needed if and when the remote server comes online later. See also “Database dependency” below.

As of v5.0, we will use the following schema for _:

_Client: (Server, Client, User, Password). We locally store entries for which Server=us. This table is maintained automatically: don’t update it. Passwords use Pyrrho’s password type so are not readable. The user and password information enables access to the configuration database on the client.

_Database: (Name, Server, ServerRole, Password, Remote, RemoteUser, RemotePassword). (Remote, Server) references _Client. We locally store entries for which Server=us. Passwords use Pyrrho’s password type so are not readable, but they must be supplied when this table is updated to enable server-server communication. The RemoteUser and RemotePassword information enables server-server access to the named database on the remote server. The Password information enables access to the configuration database on the server.

_Partition: (Base, BaseServer, BasePassword, Table, PartitionName, PartitionServer, PartitionServerConfigUser, PartitionServerConfigPassword, SegmentName, Column, MinValue, MaxValue, IncludeMin, IncludeMax). Table and Column are positions in the base database: we understand that the partition database has appropriate entries in its schema and a mapping from the base positions. We locally store entries where BaseServer=us. If you create such an entry, Pyrrho automatically defines the table in the partition database, and makes a Partition1 record in the partition database that gives the server details. If the PartitionServer field is null, the partition is marked as dropped. The partition user and password are for server-server access from the base database; the base server password is a security precaution: it must match the base server’s account. The SegmentName field enables a number of table segments to be specified as being in the partition: each segment can specify a number of column conditions.

Remember that _ is for distributed configuration information, and is not affected by connection-specific things such as creating or accessing non-configured databases. The current server’s ServerRole for non-configured databases will be 7 for local databases (Query+Master+Storage) or 0 for remote ones, but these are not entered in the configuration tables. No server is obliged to have complete information.

In the _Database table, Remote is null if ServerRole has bit 1 (Master); otherwise the Remote reference gives rise to distributed transactions: e.g. a server providing query and/or storage only will usually list the master server as remote. If no master server is configured the database is read-only (this is not an error). However, at most one server can have the master role.
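Putting the role values used in this article together, ServerRole reads as a bit mask: 1 = Master, 2 = Query, 4 = Storage, so 7 is an ordinary local database, while a storage-only server has role 4 and a query-only server role 2 (as in the example below). A quick way to review the assignments is:

    pyrrhocmd _
    SQL> [select "Name", "Server", "ServerRole", "Remote" from "_Database"]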

To change configuration information, the owner of the server account simply connects to the local _ database and modifies it. The information about other servers is read-only, although selects, joins etc show the overall information to the extent required for server-server communication (server roles and partitions). When changes are committed, affected databases are unloaded and a background process is started to reload them. Pyrrho does try to notify affected servers, using the same mechanism as for any update to a distributed database: this will be an area for continuing improvements.
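For example, giving a server an extra role for a database it already serves might look like this (a purely hypothetical change, using the names from the example below and the bit values noted above, where 6 = Query + Storage):

    pyrrhocmd _
    SQL> [update "_Database" set "ServerRole" = 6 where "Name" = 'D' and "Server" = 'C']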

Note, however, that the server passwords in the configuration tables are for enabling access from one server to another, using standard operating system security, for the accounts under which the servers have been configured to run. The partition base credentials enable access to the base database of the partition being configured. On a single server or in a single domain, it is quite likely that a single account and password will be used in all or most of the tables.

As a special case, the entry in the _Database table for a partition will have all 3 server roles and the remote information will be for the base server.

A simple configuration example


Three servers A, B, C, initially with no configuration databases. Database D on server A has a table E(F int). We will configure these servers so that both A and C serve queries for D, but C will access storage for D on server B, while A will retain its roles of Query+Master+Storage.

This will show in the _ database as follows:

_Client table

Server    Client    User    Password
A         B         M\me    *******
B         C         M\me    *******

 _Database table

Name    Server    ServerRole    Password    Remote    RemoteUser    RemotePassword
D       A         7
D       B         4             *******     A         M\me          *******
D       C         2             *******     B         M\me          *******

As we go along we can keep track of developments using Pyrrho's system and log tables. Log tables show the contents of physical database files (so they are available when the server has the Query+Storage role).
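For instance, at any stage we can connect to D on a server that holds its data file and look at these tables:

    pyrrhocmd -h:A D
    SQL> table "Sys$Table"
    SQL> table "Log$"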

Step 1: Create _ on B and configure B to provide storage for database D, whose master is on A.

pyrrhocmd -h:B _

SQL> [insert into "_Database" values('D','B',4,passwd,'A',user,passwd)]

The right user name (in Windows Domain\User form) and password should be supplied, but the password won’t be displayed. These two commands have the side effect of getting A to create a _ database and fill in the _Client entry for B and the _Database entry for D, so if A is offline this step will fail. This also creates a D.pfl file in B’s folder.

At this stage the Log$ tables for _ on A and on B show different Records, but table “_Database” and table “_Client” will agree on the two servers, showing that the configuration database is shared.

Step 2: Create _ on C and configure the Query ServerRole for database D on B.

pyrrhocmd -h:C _

SQL> [insert into "_Database" values('D','C',2,passwd,'B',user,passwd)]

Pyrrho discovers A's master role. The _Database table shows as above on server B at this stage (A and C show 2 rows each), although “Sys$Table” and table “Log$” on each server show just one record in _Database. Table E shows in D on A and C, and E should be listed in Sys$Table on D, but the Log$ table for D should not be visible on C since there is no D.pfl file in C’s folder.

This completes the configuration. On C delete an entry in E, and check for expected behaviour. On C, alter table E to make F a primary key. Check that the Sys$Index tables on both A and C show the change. We can read information from database D on C even if A is offline. We can’t change anything though.
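In pyrrhocmd terms the checks might look like this (a sketch: the row chosen for deletion is hypothetical, and the alter statement assumes the standard SQL form):

    pyrrhocmd -h:C D
    SQL> [delete from E where F = 1]
    SQL> [alter table E add primary key (F)]
    SQL> table "Sys$Index"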

See a full version of this tutorial, with full explanations of the protocols used, here.