Saturday, 2 June 2018

More details about RESTViews


An introduction to RESTViews


The essential idea with RESTView is that the Pyrrho database allows definition of views where the data is held on remote DBMS(s): at present, the only options are Pyrrho and MySQL. The remote DBMS is accessible via SQL statements sent over HTTP with Json responses.

For MySQL, a simple client called RestifD (source posted on github.com/MalcolmCrowe/restif) provides this HTTP service. The HTTP access uses user/password combinations set up for this purpose within MySQL by the owners of the contributor databases. In the use cases considered here, where a query Q references a RESTView V, we assume that (a) materialising V by extract-transform-load is undesirable for some legal reason, and (b) we know nothing of the internal details of the contributor databases. A single remote select statement defines each RESTView: the agreement with a contributor does not provide for any complex protocol, so for any given Q we want at most one request to each contributor, compatible with the permissions the contributor has granted us, namely grant select on the RESTView columns.

Crucially, though, for any given Q we want to minimise the volume D of data transferred. We can consider how much data Q needs to compute its results, and rewrite the query to keep D as low as possible. Many queries (such as select * from V) obviously need all of the data. At the other extreme, if Q refers only to local data (no RESTViews), D is always zero, so all of this analysis is specific to the RESTView technology.

We will add a set of query-rewriting rules to the database engine that aim to reduce D by recursive analysis of Q and the views and tables it references. As the later sections of this document explain, some of these rules can be very simple, such as filtering by rows or columns of V, while others involve performing some aggregations remotely (an extreme case such as select count(*) from V needs only one row to be returned). In particular, we will study the interactions between grouped aggregations and joins. The analysis will in general be recursive, since views may be defined using aggregations and joins of other views and local tables.
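For instance, using the view V defined in the setup below, a query such as
select count(*) from V
can be satisfied by sending a single aggregated request of the form
select count(*) from t
to the contributor, so only one row crosses the network. (This is a sketch of the idea; the exact requests Pyrrho generates, including its generated column aliases, are shown in the Aggregations section below.)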

Any given Q might not be susceptible to such a reduction, or at least we may find that none of our rules help, so a possible outcome of any stage in the analysis is to decide to make no further changes. Since this is Pyrrho, its immutable data structures can retain previous successful stages of query rewriting if the next stage in the recursion is unable to make further progress.

Although in this document the examples are mostly very simple, we aim to present the analysis in such a way as to demonstrate the applicability of the rules to more complex cases. In other studies, such as the Sierra Leone example, queries can reference multiple stored queries (view definitions) and functions. For now RESTViews are only found in Pyrrho, but in principle there could be several stages where one RESTView is defined using other RESTViews. We also bear in mind that a query Q might involve joins of RESTViews, possibly drawing on the same remote database(s).

There are two types of RESTView, corresponding to whether the view has one single contributor or multiple remote databases (for examples, see the view definitions in the next section). In the simple exercises in this document, V is a RESTView with one contributor, and W has two. In the multiple-contributors case, the view definition always names a table that lists the contributors (the “using table”, VU here), making it a simple matter to manage the set of contributors.

Setting up the tests

The test cases considered in this document are set up as follows. Everything assumes that RestifD.exe is running (it uses port 8078 and needs no configuration), and the MySQL server is on the local machine.
In MySQL, at a command line
create database db
use db
create table T(E integer,F nvarchar(6))
insert into T values(3,'Three'),(6,'Six'),(4,'Vier'),(6,'Sechs')
grant all privileges on T to 'root'@'%' identified by 'admin'
create database dc
use dc
create table U(E integer,F varchar(7))
insert into U values(5,'Five'),(4,'Four'),(8,'Ate')
grant all privileges on U to 'root'@'%' identified by 'admin'
In Pyrrho: We assume OSPSvr.exe is running, with flags such as -V -E -H. Start up a command line with ospcmd rv. Then at the SQL> prompt:
create view V of (E int,F char) as get 'http://root:admin@localhost:8078/db/t'
create table VU (d char primary key, k int, u char)
insert into VU values ('B',4,'http://root:admin@localhost:8078/db/t')
insert into VU values ('C',1,'http://root:admin@localhost:8078/dc/u')
create view W of (E int, D char, K int, F char) as get using VU
create table M (e int primary key, n char, unique(n))
insert into M values (2,'Deux'),(3,'Trois'),(4,'Quatre')
insert into M values (5,'Cinq'),(6,'Six'),(7,'Sept')
These sample tables are of course very small, but we can still use them to show how the RESTView minimises the data returned. Much larger examples, including the "Sierra Leone" study, have demonstrated the effectiveness of these algorithms for large data sets, views defined in terms of other views (including RESTViews), aggregations, joins of views, and so on.
To see the internal workings of the database engine on the following examples, start up OSPSvr.exe with the flags mentioned above. For brevity in this blog, we do not show the results of these queries.

Simple retrieval (to be avoided)

table V
Unsurprisingly, simple queries result in simple requests to the remote databases:
http://root:admin@localhost:8078/db select E,F from t
--> 2 rows
table W
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows

Filters

Filters (where or having conditions) should be used to reduce the amount of data retrieved from the remote database. Simple cases are easy:
select * from V where E=3
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
For the more complex multi-database GET USING syntax we can also filter on columns of the using table (VU in this example):
Select * from w where k>2
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
In this case, the request only goes to one of the remote databases. Note that the values of the using-table columns D and K are supplied in the remote request, in case they are referred to in the select list of the query. Such filters can of course be combined, as the sketch below illustrates.
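For example, a query that filters both on the using-table column K and on the remote column E, such as
select * from w where k>2 and e=6
should again contact only the db contributor, with the E condition passed on remotely, giving a request along the lines of
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t where E=6
(This combined example is not from the transcript above; the request shown is only a sketch of what the rewriting rules would produce.)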

Aggregations

select count(e) from w
http://root:admin@localhost:8078/db select COUNT(E) as C_12 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
Internally, the local database engine computes SUM(C_12) from the returned rows.
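In other words, COUNT over the multi-contributor view becomes a sum of the contributors' partial counts. As a minimal sketch of the combining step, suppose the single rows returned by db and dc were collected into a working table R(C_12) (the table name is illustrative, not part of Pyrrho's internals):
select sum(C_12) from R
which yields the same result as count(e) evaluated over the whole of w.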
select max(f) from w having e>4
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
Again, the local engine computes max(C_33).
select sum(e),f from w group by f
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
Our trivial example shows no reduction in the number of rows. A larger example would of course show that the number of rows returned is determined by the number of distinct values of the grouped column f in each contributor database.
select count(*),k/2 as k2 from w group by k2
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
Other aggregation functions are converted to use sums and counts from the remote databases, for example:
Select avg(e) from w
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
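Locally, the average must then be reconstructed from these partial results: the partial sums are added, the partial counts are added, and the division is done last. A minimal sketch of the combining step, assuming the contributors' rows are gathered into an illustrative working table R(C_156, D_156):
select sum(C_156)/sum(D_156) from R
(Asking each contributor for its own AVG would not work, since averages of averages are wrong unless weighted by the counts.)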
The local database then computes the requested AVG as SUM/COUNT. Common subexpressions are also optimised:
select sum(e)*sum(e),d from w group by d
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows

More expression rewriting

In general, subexpressions may contain other SQL functions and a mixture of local and remote columns. The following examples illustrate some of the cases that arise.
select sum(e)+char_length(max(f)) from w
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows

select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows

select sum(e)+k as x,k from w group by k

http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows

Joins

One of the steps in constructing a RESTView request is to reduce the view to the columns needed for the given query. When the RESTView is used in a join, we need to ensure that columns needed for the join condition are added to the list of needed columns.
Select f,n from w natural join m
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows

If a grouped query uses a join whose non-remote factor has key J, an aggregation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by G together with J. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join. (A further sketch of this rule follows the next example.)
Select count(*) from w natural join m

http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows
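To illustrate the general rule with a grouping column as well, consider a query like
select sum(e), n from w natural join m group by n
Since the join column E links w to the local table m, each contributor could be asked for an aggregation grouped by E, along the lines of
select SUM(E) as S, E from t group by E
and the local engine would then join those partial results to m and re-aggregate by n. (This query is not in the transcript above, and the request shown is only a sketch; the alias S stands in for whatever alias Pyrrho would generate.)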

Friday, 11 May 2018

DBMS Support for Big Live Data

In 2016 I reported on an idea for Big Live Data, and included some facilities in Pyrrho to support it. The current version of Pyrrho (11 May 2018) has better support, and this blog post aims to explain the details of how it works. Basically, with Big Live Data it is not sensible to copy it all to a local database, for two reasons: (a) the copy will soon be out of date or will need constant updates, and (b) the data doesn't belong to us in the first place.

The Pyrrho database allows definition of views where the data is held on remote DBMS(s): at present the only options are Pyrrho and MySQL. The remote DBMS is accessible via SQL statements sent over HTTP with Json responses. For MySQL this HTTP service is implemented by a simple RestifD client (source posted on github.com/MalcolmCrowe/restif). The HTTP access uses the user/password combinations set up within MySQL.
The syntax for the view definition is basically CREATE VIEW ViewDef AS GET (url|USING table)
This means there are two types of RESTView, corresponding to whether there is one single contributor with a single url or multiple remote databases supplying data for the view.
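For example (these definitions are taken from the RESTView post above, and the URL is illustrative):
create view V of (E int, F char) as get 'http://root:admin@localhost:8078/db/t'
create view W of (E int, D char, K int, F char) as get using VU
where VU is a local table whose rows give the contributors' URLs together with the values of any extra columns they supply.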
The technical details can be found in the Pyrrho documentation (Pyrrho.pdf and SourceIntro.pdf in the distribution), and were presented at DBKDA 2017 in Barcelona by Fritz Laux. There is a PDF version of this blog post with full details of how to set up some simple examples, entitled "The May 2018 Version of Pyrrho".
When RESTViews are used, remote data is accessed in such a way as to minimise the amount of data transferred, and to avoid storing any of it on the local machine. In some ways this is the opposite of Extract-Transform-Load. We assume an agreement with the contributor of the data that we have SELECT permission on view(s) they have defined, and credentials to enable us to access these views on the remote DBMS. There is also an understanding that we may access these views on numerous occasions, but will minimise the amount of data we request.
This means that on our local DBMS (Pyrrho for now) we will apply filters and aggregations to requests for the remote data. We will analyse the results locally, and join them with local data using SQL. We will often work interactively in SQL, supported by automatic mechanisms that transform our SQL into requests (also in SQL) sent to the remote contributors. These requests will comply with the SELECT permissions we have been given on the remote view(s).
With most query optimisation mechanisms in a DBMS the goal is to get the data we need into memory, where processing is faster. But in the RESTView situation the aim of optimisation is to get as much aggregation and filtering as possible done on the remote system, so that as little data as possible comes to us. And of course we don't want to have to write a program every time we want something different. Ideally any query we wish to evaluate should result in at most one request being sent to each contributor.
The multiple-contributors version of RESTView assumes that all of the contributors provide data in similar format. Most probably all of them produce this data by extracting/transforming from their databases: so their base tables probably look nothing like the view they are providing. The last thing they, or we, want is to keep having to make changes to the facility they are giving us.
So we consider what opportunities we have for keeping everyone happy. There are opportunities for filtering the data vertically (fewer columns) and horizontally (fewer rows). MySQL will soon support window functions, so we can use these. We can filter and order by expressions, selecting maximum or minimum values. Aggregation over a set of contributors requires some automated support, as COUNT becomes a sum of counts from the different contributors, and AVG and STDDEV_POP also need to be implemented as combinations of partial sums and counts.
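As a sketch of what this involves, a request for STDDEV_POP(x) over the combined view can be answered if each contributor returns SUM(x), SUM(x*x) and COUNT(x). Writing SUMX, SUMXX and N for the totals of those partial results over all contributors, the local engine can compute
STDDEV_POP(x) = sqrt( SUMXX/N - (SUMX/N)*(SUMX/N) )
(The exact requests and column aliases Pyrrho generates are not shown here; this is only the arithmetic that makes the combination possible.)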
If we are joining the returned data with something and then doing aggregations, it is natural to want to transform these too, moving work into the RESTView. The query can be analysed to find which expressions and conditions can be computed using only remote data: this allows those conditions to be applied on the remote system. Conditions can be separated into parts, some applied locally and some remotely. Some aggregations can be done on the remote systems (provided we group by the keys of the joins we plan to make). Common subexpressions can be identified and computed remotely.
In the examples I have been working on, the SQL requests to the remote DBMS all look very different from the original view definition. In my opinion this validates the entire proposal to use simple HTTP/SQL/JSon for communication with remote DBMS, rather than an API.


Friday, 9 June 2017

Interworking with MySQL databases

Previous posts in this blog have discussed the use of HTTP for direct operations with the DBMS. With Pyrrho's support for RESTViews it becomes very important to have REST interfaces to other DBMS, and there are several products such as dreamfactory and restifydb that offer this.

From today the Pyrrho distribution includes a very simple solution for MySQL, called RestIfD , and controllers to support other DBMS will probably be added over time. RestIfD should be started up by an administrator on the same host as the MySQL instance, and left running. It requires no configuration, creates no files and uses very little memory. It provides a REST service on http://localhost:8803/REST/MySQL , and all requests should include an Authentication header giving the user and password for accessing MySQL.

To obtain a list of databases in Json format, use GET to http://localhost:8803/REST/MySQL . To send one or more commands to MySQL, place them in posted data to POST http://localhost:8803/REST/MySQL .

To obtain a list of tables in database db in Json format, use GET to http://localhost:8803/REST/MySQL/db . To send one or more commands for database db to MySQL, place them in posted data to POST http://localhost:8803/REST/MySQL/db . The group of commands is executed in a single transaction.
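For example, to run two statements against database db in a single transaction, we might POST to http://localhost:8803/REST/MySQL/db with posted data such as
insert into T values(9,'Nine')
update T set F='Neun' where E=9
(This uses the sample table T from the RESTView post above; how multiple commands are delimited in the posted data is not specified here, so treat this as a sketch.)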

To obtain the rows of table tb on database db, in Json format, use GET to http://localhost:8803/REST/MySQL/db/tb .

To filter these results using a where condition or document w, use GET to http://localhost:8803/REST/MySQL/db/tb/w . You can delete rows from the table by sending a DELETE request to a similar url.

To update a single row in table tb on database db, send it in Json format as posted data to POST http://localhost:8803/REST/MySQL/db/tb . The row to be updated will be identified by the key fields in the supplied data.

To add a single row to table tb in database db, send it in Json format as posted data to POST http://localhost:8803/REST/MySQL/db/tb .
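For example, to add a row to the sample table T in database db, the posted data for POST http://localhost:8803/REST/MySQL/db/t might be a Json object such as
{"E": 9, "F": "Nine"}
(The field-name casing and the exact Json layout expected by RestIfD are illustrative here.)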

Comments welcome.


Saturday, 3 June 2017

Pyrrho Version 6.0

The new version completes some of the changes announced in previous posts in this blog.

A: Open Source Version and Append Storage

Append storage has been a build option for some time in Pyrrho, and the new version uses it by default for the Open Source version. Unfortunately this means that databases built with the previous default version of OSPSvr (up to 5.7) cannot be used with the new default build (v.6), because they have an encrypted end-of-file marker that the new build does not expect.
Workarounds for users inconvenienced in this way include continuing to use v5.7 for now, manually stripping the last 5 bytes from the binary file, or rebuilding v6's OSPSvr without the APPEND flag. But see the next point.
By default, v.6 OSPSvr doesn't support the Distributed and Partitioned features either, but again these can be restored by removing the LOCAL build flag. The Pro version by default continues to use the end-of-file tamper-proof lock and the distributed/partitioned features. These have been enhanced as described below.

B: Using the latest version of C#

The codebase uses the 2017 version of C# with its shortcut implicit declarations e.g. if (x is SomeClass y), and .NET's clever new Task model for some purposes, including for Push, described next.

C: Push and Pull for update notifications

Distributed and Partitioned operation up to v5.5 was not very scalable, as transaction masters and base partitions were contacted at the start of every transaction by every replica and partition to check that the local copies were up to date. Some versions since v5.5 have contained bugs in this area. This is now remedied in v6.0 with the following interesting design, visible in the Open Source codebase and used in the default PyrrhoSvr build.
From v6.0, servers holding slave databases and partitions check that their copy information is up to date only on start-up, and thereafter receive Push notifications about any changes to database length. They retrieve the details of the changes when and if they need to. This mechanism works in a hierarchical way, so that the Push load is shared by intermediate servers where these have been configured. All this did complicate the implementation, since server-to-server communication channels need to be between threads, but it is worthwhile for theoretical reasons.
I will update the screenshots and documentation in the distributed and partitioned tutorials in the next days.

D: Enhancements to the HTTP/REST services

Previous posts in this blog have referred to ETags and RESTViews. A paper on these has now been published, thanks to Fritz Laux, Martti Laiho and Carolyn Begg. The practical effect of these features could be adopted much more widely than Pyrrho, and I hope to return to this aspect soon.
Basically, it would be great if all DBMS provided a really simple REST service, so that a simple GET would return a table (or other query results) in Json format together with an RFC 7232 ETag. Even better if such a view were also potentially updatable with PUT, POST and DELETE. If this were done, database applications could always use WebRequests instead of the clunky APIs: JDBC, ODBC, ADO.NET, JPA and so on. The authorisation arrangements for this should also be simple, preferably role based, and should use the If-Match HTTP header to check ETags.
This works already for Pyrrho (If-Match from v6), and there is more:
HTTP is stateless, so the simple steps described above don't provide a way of making several changes in a single HTTP call. But at least for a single database, ACID transactions can be supported by a simple POST of an SQL statement to the database (+role) URL, since SQL statements can be compound and can have exception handlers and so on. And the POST can be guarded by If-Match if the SQL statement depends on data previously obtained from the database, as in the sketch below.
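As a sketch (the URL, the SQL and the ETag value are all illustrative, borrowing the composite-database example from the March 2017 post below):
POST http://localhost:8180/A/A
If-Match: "an-etag-returned-by-an-earlier-GET"
update D set f='Joseph' where e=1
If the ETag no longer matches, the server rejects the statement, so the update only succeeds if the data it was based on is still current.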
It is definitely a step too far in my opinion to try to use these mechanisms for transactions that change more than one database.
As usual, any comments welcome, preferably to malcolm.crowe at uws.ac.uk (make sure the subject line of the email is relevant).

Thursday, 6 April 2017

Anonymous row types and monotonic functions

A previous blog post explained the usefulness of mutually inverse function pairs. It is also useful to know whether a function is monotonic. Many joins can be speeded up if the join condition uses a function that is known to be monotonic, since the table operands can then be automatically sorted before forming the join.
Most CASTs are monotonic, and Pyrrho now allows a user-defined function to be declared MONOTONIC through the use of metadata.
For example
create function f(b date) returns (int,int) monotonic return (extract(year from b),extract(month from b))
This function could be used to help join a table with a date primary key to one with a primary key containing year and month, using a join condition such as WHERE (y,m)=f(b), as in the sketch below.
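As a sketch, suppose we have tables like these (the names are illustrative, not from Pyrrho's test suite):
create table bd (b date primary key, c char)
create table ym (y int, m int, total int, primary key(y,m))
select c, total from bd cross join ym where (y,m) = f(b)
Because f is declared monotonic, the engine can sort bd by f(b) and ym by its primary key and then merge the two streams, rather than comparing every pair of rows.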
This example also illustrates two small innovations in the current version: (a) the return type of the function is an anonymous row type, and (b) no real distinction is now made in Pyrrho between ON conditions in inner joins and WHERE conditions in cross joins.
In recent versions of the SQL standard ON conditions can be any boolean expression, so by default Pyrrho combines them and decides which is faster, and whether an index can be used to speed things up.

Friday, 31 March 2017

Adapter functions, updatable views and joins

With the introduction of RESTViews it has become more interesting to support updatable views and joins, and more important to support adapter functions. The beta version 5.7 of Pyrrho makes some important contributions in these areas.
In the following sample, we model a situation where databases have columns that nearly match but do not quite do so. For example, suppose table A(B,C) has an integer primary key B, while table D(E,G) uses as its primary key E the corresponding string values, all prefixed with the letter H. It is easy to define a suitable adapter function F:

create function f(x int) returns char return 'H'|| cast(x as char)

and create a view of A that makes sense in D's database:

create view v as select f(b) as fb,c from a

D, however, wishes to be able to update table A, so would like this view to be updatable. Pyrrho allows the declaration of an inverse of F:

create function f1(a char) returns int inverts f return cast(substring(a from 1) as int)

where the phrase "inverts f" is parsed as metadata for the new function f1 (otherwise everything so far is ordinary standard SQL). This makes f and f1 into mutually inverse functions, that is, declaring f1 as an inverse for f also declares f as an inverse of f1. The machinery works for multiple parameters using row types for the return values.

This makes v updatable and we can write

insert into v values('H91','This is new')

We can also use such adapter functions in referential constraints, e.g. a slight extension to SQL in Pyrrho allows us to declare the above relationship between tables A and D as follows:

create table d (e char references a using f1,g int)

The following SQL sample code demonstrating the above ideas is supported by Pyrrho 5.7 as of today.

create function f(x int) returns char return 'H'|| cast(x as char)
[create function f1(a char) returns int inverts f return cast(substring(a from 1) as int)]
select f(45) from static
select f1('B67') from static
create table a(b int primary key,c char)
insert into a values(23,'Twenty3'),(42,'Forty2')
create view v as select f(b) as fb,c from a
select * from v
insert into v values('H91','This is new')
table a
create table d (e char references a using f1,g int)
insert into d values('H23',234),('H91',567)
create view w as select c,e from a left join d on b=f1(e)
delete from w where c='Twenty3'
table a


Friday, 17 March 2017

Composite Database example with details

This post gives more details of the example contained in the posting Composite Databases in this blog.

The following transcripts use the beta version 5.7 of Pyrrho dated 16 March 2017, and using localhost instead of servA, with servers A and B using folders \A and \B respectively. I have set a debugging –D flag on server A so that that we can see the use of RVVs and ETags.
We begin by setting up the databases on servers A and B:
Database A:
create table D (e int primary key, f char, g char)


insert into D values (1,'Joe','Soap'), (2,'Betty','Boop')

Database B:

create table H (e int primary key, k char, m int)
insert into H values (1,'Cleaner',12500), (2,'Manager',31400)
[create view W of (e int, f char, g char) as get
'http://localhost:8180/A/A/D']

create view V as select * from W natural join H

The square brackets here are added because of the embedded newline added in the formatting of the page.
After setting up the databases on A and B with B’s views defined, we see the transaction log contents for A and B. Some of the numbers shown will be used in RVV and ETags in what follows.



We see at position 381 that the URL http://localhost:8180/A/A/D has been provided in metadata for the view W. W was defined in position 366 in terms of the anonymous structure with columns E, F, G declared at position 290.

The use of position numbers instead of identifiers in the definition of view V at position 439 is a standard feature of Pyrrho to allow renaming of objects.
In the blog post we now have the following operations on database B:
select e,f,m,check from V where e=1

Note that in normal use there is no need to request the check pseudocolumn: it is here so we can show what is happening within the two databases. The database API uses it to implement the Versioned feature in client-side “database model” classes.
Here, the check value was requested explicitly in the SELECT statement, and shows that this row of the join uses a row from A with defining position 209 placed there in transaction 193 (see the log for A) and a row from B with defining position 208 arising from transaction 192 (see the log for B).
The debug information for server A shows the REST request from B to A, and the ETag it constructed.

The ETag consists of an RVV for the first row of the result (mentioned above), and a readCheck for the read operation carried out by A. This was a specific row in table D (position 69) with key (1) .
The next operation is
update v set f='Elizabeth' where e=2


We see there is now an updated ETag supplied by A showing the new transaction that has updated the record defined at 241.
Also check B’s view using the join:

The next operation is an Insert into the View/RestView/Join combination:
[insert into v(e,f,g,k,m)
values(3,'Fred','Smith','Janitor',22160)]



This time the readCheck information indicates it will conflict with any read operation on table 69.
And again verifying the view from B:


Finally, we try a deletion from the View/RestView/Join combination: