Friday 3 September 2021

Implementing Shareable Data Structures

 The concept of shareable data structures first appeared in this blog in 2018, and there has been a software repository in Github since that time. Several demonstration programs and PyrrhoV7alpha source code can also be found there. In the nature of things, the concept has evolved.

The use of shareable data structures in fundamental to the implementation of Pyrrho V7. The following documentation is based on part of a tutorial given by me and Fritz Laux at DBKDA 2021.

These have a recursive definition: a data structure is shareable iff it is immutable, and all fields are in turn shareable. In all programming languages, values of primitive data types are shareable. In Hava, Pythone and C#, strings are also shareable. Unfortunately, in any programming language that allows data structures to have have subclasses, other than declaring a class sealed or final. there is no way of preventing a subclass from adding a mutable field. If we want to ensure our data structures are shareable, we need to declare them as internal to our project. Then at least we control what subclasses exist.

In a programming language based on references, such as Java, or C#, we can make all fields in our structure final, or readonly. Then any reference can be safely shared, and all fields might as well be public (unless there are confidentiality issues).

If all of the fields are, in turn, also known to be immutable, then there is no need to clone or copy fields: copying a pointer to the structure itself gives read-only access to the whole thing.

For example, if the Database class is shareable, and b is a database, then a:=b is a shareable copy of the whole database (we have just copied a single 64-bit pointer). Such an assignment (or snapshot) is also guaranteed to be thread-safe.

Pointers to shareable structures are never updated but can be replaced when we have a new version to reference. If this new version has some changed fields, it is perfectly fine to continue to use the same pointers for all the unchanged fields

Shareable tree structures: adding a node

When we add a field located deep in a shareable structure (e.g. a node to a shareable tree structure), we will need to construct a single new node at each level back to the top of the structure. But the magic is that all the other nodes remain shared between the old and new versions.


The above picture (from Krijnen and Mertens, Mathematics Centre, Amsterdam, 1987) shows a tree with 7 leaves (that is, a tree of size 7), and updating (that is, replacing) one leaf node has resulted in just 2 new inner nodes being added to the tree. This makes shareable B-Trees into extremely efficient storage structures.

In tests, (such as DictDemo, in the github repository) we see that for a suitable minimum number of child nodes in the B-Tree, the number of new nodes required for a single update to a B-Tree of size N is O(logN), and experimentally, this means that for each tenfold increase in N, the number of new nodes per operation roughly doubles.

Note that we also get a new root node every time (this avoids wearing out flash memory).

The choice of programming language

It is helpful if the programming language supports:

  •        readonly directives (Java has final)
  •        Generics (Java has these)
  •        Customizing operators such as +=  (not Java)

o    Because a+=b is safer than Add(a,b)

o    Easy to forget to use the return value a=Add(a,b)

  •        Implies strong static typing (so not Java)

o    Many languages have “type erasure”

  •          Also useful to have all references nullable

So I prefer C#, which now has been around for 19 years. Java and Python have been with us for over 30 years. 

However, as mentioned above, C# provides no syntax for requiring a class to be shareable: specifically, there is no way of requiring a subclass of a shareable class to be shareable. It will cease to be shareable if it has even one mutable field.

Shareable database objects

What data structures in the DBMS can be made shareable?

  •       Database itself, and its subclass, Transaction.
  •       Database Objects such as Table, Index, TableColumn, Procedure, Domain, Trigger, Check, View, Role
  •       Processing objects such as Query, Executable, RowSet, and their many subclasses;
  •       Cursor and most of its subclasses.
  •       TypedValue and all its subclasses

All of these can be made shareable. But many things cannot be made shareable:

Context and Activation cannot be made shareable because in processing expressions we so often have intermediate values.

Also, something needs to access system non-shareable structures such as FileStreams, HttpRequest.

And Physical and its subclasses are used for preparing objects for the database file, so cursors that examine logs are not shareable.

An implementation library: first steps

A fundamental building block in many DBMS implementation is the B-tree. In Pyrrho BTree<K,V> is a sort of unbalanced B-tree. It has a += operator to add a (key,value) pair, and a -= operator to remove a key.

BList<V> is a subscriptable subclass where K is int. It is much slower than BTree, because it partially reindexes the list starting from 0 on insertion and deletion.

Both of these structures can be traversed up and down using shareable helper classes ABookmark<K,V> and methods First(), Last(). The ABookmark class implements key(). value(), Next() and Previous().

These classes and their subclasses are used throughout the DBMS implementation. They are shareable provided K and V are shareable. If the classes K and V are not themselves shareable, for example if one or both is the object class, a tree will be shareable provided all of its contents (the nodes actually added) are shareable. At least, it is easy to ensure that all public constructors only have shareable parameters.

For convenience, Pyrrho uses a potentially non-shareable base class Basis, whose only field is a BTree<long,object> called mem. It has an abstract method New which can be used to implement the += and -= as effectively covariant operators on subclasses, and these can be used to change the properties on a database (of course, by creating a new one).

DBObject and Database

DBObject is a subclass of Basis, with a public readonly field called defpos, of type long. The defpos (defining position) acts as a uid or unique identifier for every object in the database. As described in section 2.3 below, the range of values of long is divided into ranges for database objects depending on their lifetime: committed objects have the same lifetime as the database (e.g., the SQL delete statement can unlink one or more objects but they remain in the log). There are many subclasses of DBObject, described in this booklet.

Database is also a subclass of Basis, and it uses the mem field with positive uids to store all of database object it contains. (Entries in mem with negative uids are used for propert ies of the database, as described in section 3.5.2.) The Database class also has two static lists: databases, and dbfiles. The first is the set of databases indexed by database name, and the second is a corresponding this of File objects.

Then there is a method called Load() to build the database from the transaction log file, and Commit() which writes Physical record durably to the log.

Transaction and B-Tree

We can use the B-Tree concept to model how the transaction Commit process works. The Commit process was described in terms of transformations in the previous posting in this blog.

  1.       Suppose we have a database in any starting state D0.
  2. .       If we start a transaction T0 from this state, initially the T0 is a copy of D0 (i.e. equal pointers).
  3. .       As T0 is modified it becomes T1, which shares many nodes with D0, but not all.
  4. .       D0 also evolves as some other transaction has committed, so D1 has a new root and some new nodes.
  5. .       When T1 commits, we get a new database D2 incorporating the latest versions of everything.


No comments:

Post a Comment