Saturday, 2 June 2018

More details about RESTViews


An introduction to RESTViews


The essential idea with RESTView is that the Pyrrho database allows definition of views where the data is held on remote DBMS(s): at present, the only options are Pyrrho and MySQL. The remote DBMS is accessible via SQL statements sent over HTTP with Json responses.

For MySQL a simple client called RestifD (source posted on github.com/MalcolmCrowe/restif) provides this HTTP service. The HTTP access provides the user/password combinations set up for this purpose within MySQL by the owners of contributor databases. In the use cases considered here, where a query Q references a RESTView V, we assume that (a) materialising V by Extract-transform-load is undesirable for some legal reason, and (b) we know nothing of the internal details of contributor databases. A single remote select statement defines each RESTView: the agreement with a contributor does not provide any complex protocols, so that for any given Q, we want at most one query to any contributor, compatible with the permissions granted to us by the contributor, namely grant select on the RESTView columns.

Crucially, though, for any given Q, we want to minimise the volume D of data transferred. We can consider how much data Q needs to compute its results, and we rewrite the query to keep D as low as possible. Obviously many such queries (such as the obvious select * from V) would need all of the data. At the other extreme, if Q only refers to local data (no RESTViews) D is always zero, so that all of this analysis is specific to the RESTView technology.

We will add a set of query-rewriting rules to the database engine aiming to reduce D by recursive analysis of Q and the views and tables it references. As the later sections of this document explain, some of these rules can be very simple, such as filtering by rows or columns of V, while others involve performing some aggregations remotely (extreme cases such as select count(*) from V needs only one row to be returned). In particular, we will study the interactions between grouped aggregations and joins. The analysis will in general be recursive, since views may be defined using aggregations and joins of other views and local tables.

Any given Q might not be susceptible to such a reduction, or at least we may find that none of our rules help, so that a possible outcome of any stage in the analysis might be to decide not to make further changes. Since this is Pyrrho, its immutable data structures can retain previous successful stages of query rewriting, if the next stage in the recursion is unable to make further progress.

Although in this document the example are mostly very simple, we aim to present the analysis in such a way as to demonstrate the applicability of the rules to more complex cases. In other studies, such as the Sierra Leone example, queries can reference multiple stored queries (view definitions) and functions. For now RESTViews are only found in Pyrrho, but in principle we could have several stages where one RESTView is defined using other RESTViews. We also bear in mind that a query Q might involve joins of RESTViews possibly from the same remote database(s).

There are two types of RESTView corresponding to whether the view has one single contributor or multiple remote databases (for examples, see the text in red below). In the simple exercises in this document, V is a RESTview with one contributor, and W has two. In the multiple-contributors case, the view definition always includes a list of contributors (the “using table”, VU here) making it a simple matter to manage the list of contributors.

Setting up the tests

The test cases considered in this document are set up as follows. Everything assumes that RestifD.exe is running (it uses port 8078 and needs no configuration), and the MySQL server is on the local machine.
In MySQL, at a command line
create database db
use db
create table T(E integer,F nvarchar(6))
insert into T values(3,'Three'),(6,'Six'),(4,'Vier'),(6,'Sechs')
grant all privileges on T to 'root'@'%' identified by 'admin'
create database dc
use dc
create table U(E integer,F varchar(7))
insert into U values(5,'Five'),(4,'Four'),(8,'Ate')
grant all privileges on U to 'root'@'%' identified by 'admin'
In Pyrrho: We assume OSPSvr.exe is running, with flags such as –V -E -H. Start up a command line with ospcmd rv. Then at the SQL> prompt:
create view V of (E int,F char) as get 'http://root:admin@localhost:8078/db/t'
create table VU (d char primary key, k int, u char)
insert into VU values ('B',4,'http://root:admin@localhost:8078/db/t')
insert into VU values ('C',1,'http://root:admin@localhost:8078/dc/u')
create view W of (E int, D char, K int, F char) as get using VU
create table M (e int primary key, n char, unique(n))
insert into M values (2,'Deux'),(3,'Trois'),(4,'Quatre')
insert into M values (5,'Cinq'),(6,'Six'),(7,'Sept')
These sample tables are of course very small. But we can still use them to show how the RESTView minimises the data returned. Much larger examples inclouding the "Sierra Leone" study have demonstrated the effectiveness of the algorithms here for large data sets, views defined in terms of other views including RESTViews, aggregations, joins of views, etc. 
To see the internal workings of the database engine on the following examples, start up OSPSvr.exe with the flags mentioned above. For brevity in this blog, we does not show the results of these queries.

Simple retrieval (to be avoided)

table V
Unsurprisingly, simple queries result in the simple requests to the remote databases
http://root:admin@localhost:8078/db select E,F from t
--> 2 rows
table W
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows

Filters

Filters (where or having conditions) should be used to reduce the amount of data retrieved from the remote database. Simple cases are easy:
select * from V where E=3
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
For the more complex multi-database GET USING syntax we can also filter on columns of the using table (VU in this example):
Select * from w where k>2
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
In this case, the request only goes to one of the remote databases. Note that the value of E is supplied to the remote database, in case it is referred to in the select list of the query. Such filters can of course be combined.

Aggregations

select count(e) from w
http://root:admin@localhost:8078/db select COUNT(E) as C_12 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
Internally, the local database engine computes SUM(C_12) from the returned rows.
select max(f) from w having e>4
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
Again, the local engine computes max(C_33) .
select sum(e),f from w group by f
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
Our trivial example shows no reduction in the number of rows. A larger example would show of course that the number of rows returned in determined by the number of distinct values of the grouped column f in that database.
select count(*),k/2 as k2 from w group by k2
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
Other aggregation functions are converted to use sums and counts from the remote databases, for example:
Select avg(e) from w
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
The local database computes the request for AVG as SUM/COUNT. Common subexpressions are also optimised:
select sum(e)*sum(e),d from w group by d
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows

More expression rewriting

In general, subexpressions will contain other SQL functions and a mixture of local and remote columns. The following examples illustrate the number of cases that arise.
select sum(e)+char_length(max(f)) from w
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows

select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows

select sum(e)+k as x,k from w group by k

http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows

Joins

One of the steps in constructing a RESTView is to reduce the view to the columns needed for the given query. When the RESTView is used in a join, we need to ensure that columns needed for the joinCondition are added to the list of needed columns.
Select f,n from w natural join m
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows

If a grouped query uses a join whose non-remote factor has key J, an aggregation operation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by GJ. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join.
Select count(*) from w natural join m

http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows