The concept of shareable data structures first appeared in this blog in 2018, and there has been a software repository in Github since that time. Several demonstration programs and PyrrhoV7alpha source code can also be found there. In the nature of things, the concept has evolved.
The use of shareable data structures in fundamental to the implementation of Pyrrho V7. The following documentation is based on part of a tutorial given by me and Fritz Laux at DBKDA 2021.
These have a recursive definition: a data structure is shareable iff it is immutable, and all fields are in turn shareable. In all programming languages, values of primitive data types are shareable. In Hava, Pythone and C#, strings are also shareable. Unfortunately, in any programming language that allows data structures to have have subclasses, other than declaring a class sealed or final. there is no way of preventing a subclass from adding a mutable field. If we want to ensure our data structures are shareable, we need to declare them as internal to our project. Then at least we control what subclasses exist.
In a programming language based on
references, such as Java, or C#, we can make all fields in our structure final, or readonly. Then any reference can be safely shared,
and all fields might as well be public (unless there are confidentiality
issues).
If all of the fields are, in turn, also
known to be immutable, then there is no need to clone or copy fields: copying a
pointer to the structure itself gives read-only access to the whole thing.
For example, if the Database class is
shareable, and b is a database, then a:=b is a shareable copy of the whole
database (we have just copied a single 64-bit pointer). Such an assignment (or
snapshot) is also guaranteed to be thread-safe.
Pointers to shareable structures are never
updated but can be replaced when we have a new version to reference. If this
new version has some changed fields, it is perfectly fine to continue to use
the same pointers for all the unchanged fields
Shareable tree structures: adding a node
When
we add a field located deep in a shareable structure (e.g. a node to a
shareable tree structure), we will need to construct a single new node at each
level back to the top of the structure. But the magic is that all the other
nodes remain shared between the old and new versions.
The above picture (from Krijnen and Mertens, Mathematics Centre, Amsterdam, 1987) shows a tree with 7 leaves (that is, a tree of size 7), and updating (that is, replacing) one leaf node has resulted in just 2 new inner nodes being added to the tree. This makes shareable B-Trees into extremely efficient storage structures.
In tests, (such as DictDemo, in the github repository) we see that for a suitable minimum number of
child nodes in the B-Tree, the number of new nodes required for a single update
to a B-Tree of size N is O(logN), and experimentally, this means that for each
tenfold increase in N, the number of new nodes per operation roughly doubles.
Note that we also get a new root node every time (this
avoids wearing out flash memory).
The choice of programming language
It is helpful if the
programming language supports:
- readonly directives (Java has final)
- Generics (Java has these)
- Customizing operators such as += (not Java)
o
Because a+=b is
safer than Add(a,b)
o
Easy to forget to
use the return value a=Add(a,b)
- Implies strong static typing (so not Java)
o
Many languages
have “type erasure”
- Also useful to have all references nullable
So I prefer C#, which now has
been around for 19 years. Java and Python have been with us for over 30
years.
However, as mentioned above, C# provides no syntax
for requiring a class to be shareable: specifically, there is no way of
requiring a subclass of a shareable class to be shareable. It will cease to be
shareable if it has even one mutable field.
Shareable database objects
What data structures in the
DBMS can be made shareable?
- Database itself, and its subclass, Transaction.
- Database Objects such as Table, Index, TableColumn, Procedure, Domain, Trigger, Check, View, Role
- Processing objects such as Query, Executable, RowSet, and their many subclasses;
- Cursor and most of its subclasses.
- TypedValue and all its subclasses
All of these can be made
shareable. But many things cannot be made shareable:
Context and Activation cannot
be made shareable because in processing expressions we so often have
intermediate values.
Also, something needs to
access system non-shareable structures such as FileStreams, HttpRequest.
And Physical and its
subclasses are used for preparing objects for the database file, so cursors
that examine logs are not shareable.
An implementation library: first steps
A fundamental building block
in many DBMS implementation is the B-tree. In Pyrrho BTree<K,V> is a sort
of unbalanced B-tree. It has a += operator to add a (key,value) pair, and a -=
operator to remove a key.
BList<V> is a
subscriptable subclass where K is int. It is much slower than BTree, because it
partially reindexes the list starting from 0 on insertion and deletion.
Both of these structures can
be traversed up and down using shareable helper classes ABookmark<K,V>
and methods First(), Last(). The ABookmark class implements key(). value(),
Next() and Previous().
These classes and their
subclasses are used throughout the DBMS implementation. They are shareable
provided K and V are shareable. If the classes K and V are not themselves
shareable, for example if one or both is the object class,
a tree will be shareable provided all of its contents (the nodes actually
added) are shareable. At least, it is easy to ensure that all public
constructors only have shareable parameters.
For convenience, Pyrrho uses a
potentially non-shareable base class Basis, whose only field is a BTree<long,object> called mem. It has an
abstract method New which can be used to implement the += and -= as effectively
covariant operators on subclasses, and these can be used to change the
properties on a database (of course, by creating a new one).
DBObject and Database
DBObject is a subclass of Basis, with a public readonly field called defpos, of type
long. The defpos (defining position) acts as a uid or unique identifier for
every object in the database. As described in section 2.3 below, the range of
values of long is divided into ranges for database objects depending on their
lifetime: committed objects have the same lifetime as the database (e.g., the
SQL delete statement can unlink one or more objects but they remain in the
log). There are many subclasses of DBObject, described in this booklet.
Database is also a subclass of Basis, and it uses the mem field with positive
uids to store all of database object it contains. (Entries in mem with negative
uids are used for propert ies of the database, as described in section 3.5.2.)
The Database class also has two static lists: databases, and dbfiles. The first
is the set of databases indexed by database name, and the second is a
corresponding this of File objects.
Then there is a method called Load() to
build the database from the transaction log file, and Commit() which writes
Physical record durably to the log.
Transaction and B-Tree
We can use the
B-Tree concept to model how the transaction Commit process works. The Commit process was described in terms of transformations in the previous posting in this blog.
- Suppose we have a database in any starting state D0.
- . If we start a transaction T0 from this state, initially the T0 is a copy of D0 (i.e. equal pointers).
- . As T0 is modified it becomes T1, which shares many nodes with D0, but not all.
- . D0 also evolves as some other transaction has committed, so D1 has a new root and some new nodes.
- . When T1 commits, we get a new database D2 incorporating the latest versions of everything.
No comments:
Post a Comment