An introduction to RESTViews
The essential idea with RESTView
is that the Pyrrho database allows definition of views where the data is held
on remote DBMS(s): at present, the only options are Pyrrho and MySQL. The
remote DBMS is accessible via SQL statements sent over HTTP with Json
responses.
For MySQL a simple client called RestifD
(source posted on github.com/MalcolmCrowe/restif) provides this HTTP service.
The HTTP access provides the user/password combinations set up for this purpose
within MySQL by the owners of contributor databases. In the use cases
considered here, where a query Q references a RESTView V, we assume that (a)
materialising V by Extract-transform-load is undesirable for some legal reason,
and (b) we know nothing of the internal details of contributor databases. A
single remote select statement defines each RESTView: the agreement with a
contributor does not provide any complex protocols, so that for any given Q, we
want at most one query to any contributor, compatible with the permissions
granted to us by the contributor, namely grant select on the RESTView columns.
Crucially, though, for any given
Q, we want to minimise the volume D of data transferred. We can consider how
much data Q needs to compute its results, and we rewrite the query to keep D as
low as possible. Obviously many such queries (such as the obvious select * from
V) would need all of the data. At the other extreme, if Q only refers to local
data (no RESTViews) D is always zero, so that all of this analysis is specific
to the RESTView technology.
We will add a set of
query-rewriting rules to the database engine aiming to reduce D by recursive
analysis of Q and the views and tables it references. As the later sections of
this document explain, some of these rules can be very simple, such as
filtering by rows or columns of V, while others involve performing some
aggregations remotely (extreme cases such as select count(*) from V needs only
one row to be returned). In particular, we will study the interactions between
grouped aggregations and joins. The analysis will in general be recursive,
since views may be defined using aggregations and joins of other views and
local tables.
Any given Q might not be
susceptible to such a reduction, or at least we may find that none of our rules
help, so that a possible outcome of any stage in the analysis might be to
decide not to make further changes. Since this is Pyrrho, its immutable data
structures can retain previous successful stages of query rewriting, if the
next stage in the recursion is unable to make further progress.
Although in this document the
example are mostly very simple, we aim to present the analysis in such a way as
to demonstrate the applicability of the rules to more complex cases. In other
studies, such as the Sierra Leone example, queries can reference multiple
stored queries (view definitions) and functions. For now RESTViews are only
found in Pyrrho, but in principle we could have several stages where one
RESTView is defined using other RESTViews. We also bear in mind that a query Q
might involve joins of RESTViews possibly from the same remote database(s).
There are two types of RESTView
corresponding to whether the view has one single contributor or multiple remote
databases (for examples, see the text in red below). In the simple exercises in this document, V is a RESTview with one
contributor, and W has two. In the multiple-contributors case, the view
definition always includes a list of contributors (the “using table”, VU here)
making it a simple matter to manage the list of contributors.
Setting up the tests
The test cases considered in this
document are set up as follows. Everything assumes that RestifD.exe is running
(it uses port 8078 and needs no configuration), and the MySQL server is on the
local machine.
In MySQL, at a command line
create database db
use db
create table T(E integer,F nvarchar(6))
insert into T values(3,'Three'),(6,'Six'),(4,'Vier'),(6,'Sechs')
grant all privileges on T to 'root'@'%' identified by 'admin'
create database dc
use dc
create table U(E integer,F varchar(7))
insert into U values(5,'Five'),(4,'Four'),(8,'Ate')
grant all privileges on U to 'root'@'%' identified by 'admin'
In Pyrrho: We assume OSPSvr.exe
is running, with flags such as –V -E -H. Start up a command line with ospcmd
rv. Then at the SQL> prompt:
create view V of (E int,F char) as get
'http://root:admin@localhost:8078/db/t'
create table VU (d char primary key, k int, u char)
insert into VU values ('B',4,'http://root:admin@localhost:8078/db/t')
insert into VU values ('C',1,'http://root:admin@localhost:8078/dc/u')
create view W of (E int, D char, K int, F char) as get using VU
create table M (e int primary key, n char, unique(n))
insert into M values (2,'Deux'),(3,'Trois'),(4,'Quatre')
insert into M values (5,'Cinq'),(6,'Six'),(7,'Sept')
These sample tables are of course very small. But we can still use them to show how the RESTView minimises the data returned. Much larger examples inclouding the "Sierra Leone" study have demonstrated the effectiveness of the algorithms here for large data sets, views defined in terms of other views including RESTViews, aggregations, joins of views, etc.
To see the internal workings of the database engine on the following examples, start up OSPSvr.exe with the flags mentioned above. For brevity in this blog, we does not show the results of these queries.
Simple retrieval (to be avoided)
table
V
Unsurprisingly, simple queries result in the simple requests to the remote databases
http://root:admin@localhost:8078/db select E,F from t--> 2 rows
table W
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows
Filters
Filters (where or having conditions) should be used to reduce the amount of data retrieved from the remote database. Simple cases are easy:
select * from V where E=3
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
For the more complex multi-database GET USING syntax we can also filter on columns of the using table (VU in this example):
Select * from w where k>2
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
In this case, the request only goes to one of the remote databases. Note that the value of E is supplied to the remote database, in case it is referred to in the select list of the query. Such filters can of course be combined.
Aggregations
select count(e) from w
http://root:admin@localhost:8078/db select COUNT(E) as C_12 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
Internally, the local database engine computes SUM(C_12) from the returned rows.
select max(f) from w having e>4
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
Again, the local engine computes max(C_33) .
select sum(e),f from w group by f
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
Our trivial example shows no reduction in the number of rows. A larger example would show of course that the number of rows returned in determined by the number of distinct values of the grouped column f in that database.
select count(*),k/2 as k2 from w group by k2
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
Other aggregation functions are converted to use sums and counts from the remote databases, for example:
Select avg(e) from w
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
The local database computes the request for AVG as SUM/COUNT. Common subexpressions are also optimised:
select sum(e)*sum(e),d from w group by d
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows
More expression rewriting
In general, subexpressions will contain other SQL functions and a mixture of local and remote columns. The following examples illustrate the number of cases that arise.
select sum(e)+char_length(max(f)) from w
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows
select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows
select sum(e)+k as x,k from w group by k
http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows
select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows
select sum(e)+k as x,k from w group by k
http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows
Joins
One of the steps in constructing a RESTView is to reduce the view to the columns needed for the given query. When the RESTView is used in a join, we need to ensure that columns needed for the joinCondition are added to the list of needed columns.
Select f,n from w natural join m
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows
If a grouped query uses a join whose non-remote factor has key J, an aggregation operation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by GJ. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join.
Select count(*) from w natural join m
http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows
If a grouped query uses a join whose non-remote factor has key J, an aggregation operation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by GJ. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join.
Select count(*) from w natural join m
http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows