Pyrrho now provides a free-to-use simulation of Bell-LaPadula security, similar to the US Department of Defense Orange Book. The basic idea is well-explained in Wikipedia and popular database textbooks and there corrent implementations of this security model in Oracle 18.
A feature of the system is that a table can contain rows with different security classifications, and users will be able to see a subset of these detemined by their security clearance. Users can create or modify data that matches their clearance.
The basic idea is that there are security levels D,C,B and A, and security is managed by the database owner (the security administrator SA). The SA can give users a security clearance, and can give a classification based on these levels to objects such as tables and rows in the database.
By default all access to tables is then subject to security clearance, but the SA can limit enforcement in a table to any combination of read, insert, update and delete. In addition the SA can specify two sets of identifiers (Pyrrho calls these groups and references), so that security clearance and classification can be enhanced using subsets of these identifiers.
To whet your appetite, here is a simple test script [as revised 5 January 2019]:
A. Logged in as database owner
1. Starting with empty database "mac"
create table A(B int,C char)
create table D(E char primary key) security level D groups Army Navy references Defence scope read
create table F(G char primary key,H char security level C)
revoke "mac" from public
2. Create some users with and without clearance. (On Windows prefix the user names with Domain\)
grant "mac" to "Student"
grant "mac" to "Fred"
grant security level B groups Army references Defence Cyber to "Student"
3. Add some rows with and without classification
insert into A values(2,'Two')
insert into A values(3,'Three') security level C
insert into D values('Test')
insert into F values('MI6','sis.gov.uk')
4. Check we can see two rows in A, one row in D and two columns in F
table A
table D
table F
B. Logged in as Fred
5. Check we can only see one row in A, one column in F, and nothing in D
table A
table D
table F
6. Check we can add a row in A, D and F
insert into A values(4,'Four')
insert into D values('Fred wrote this')
insert into F values('UWS')
C. Logged in as Student
7. Check we can see three rows in A, two rows in D and two columns in F
table A
table D
table F
8. Check we can' only make changes in table D (enforcement in D is only for read)
update A set c = 'No' where b=2
update A set c = 'No' where b=3
update A set c = 'No' where b=4
update D set E='Fred?' where E<>'Test'
update F set H='www.sis.gov.uk' where G='MI6'
update F set H='www.uws.ac.uk' where G='UWS'
9. Check we can add and update rows in all three tables
insert into A values(5,'Fiv')
update A set c='Five' where b=5
insert into D values('Another')
insert into F values('BBC','bbc.co.uk')
update F set H='www.bbc.co.uk' where G='BBC'
10. Check we can see our rows and changes
table A
table D
table F
D. Logged in as Fred
11. Check Fred can't see the new rows
table A
table D
table F
E. Logged in as database owner
12. Check all tables including the security information
select B,C,security from A
select E,security from D
select G,H,security from F
select * from A where security=level c
update A set security=level C where security=level B
update F set security=level D where G='BBC'
table "Sys$Classification"
F. Logged in as Student
13. Check we can still see our row in A
select * from a where b=5
14. Check we can no longer update our rows in A or F
delete from A where b=5
update F set H='bbc.com' where G='BBC'
G. Logged in as Fred
15. Check we can see the row about the BBC
H. Logged in as database owner
16. Check that auditing has been happening
table "Sys$Audit"
Full details of Pyrrho's implementation are in the manual at section 3.4.2 and the syntax pages on pyrrhodb.com have been updated to include the syntax extensions. The SA is able to manage all of this with the help of a set of system tables such as Sys$Classification. A commercial vendor such as Oracle provides many tools to assist in this process.
The implementation in Pyrrho is new and no doubt will evolve over the next weeks. Comments please to malcolm.crowe@uws.ac.uk . Also follow me @MalcolmCrowe #PyrrhoDBMS .
(Update 5 Jan 2019: the manual has been updated to remove an incorrect assertion about changes to security clearance. Such changes can have no effect on ongoing transactions.)
Thursday, 13 December 2018
Friday, 23 November 2018
Recording access to sensitive data
Today there is considerable
interest in access auditing, and a requirement in some jurisdictions for companies
to record use of sensitive data.
Pyrrho already indelibly records changes to all data together with the user/role and time of changes, and watches reading of data during transactions to construct simple constraints on a transaction being committed.
As a matter of fact, few experts agree with this feature, since it means the inclusion of a read operation in a transaction will prevent the transaction being committed a concurrent transaction commits a modification of any of the data that has been read. Pyrrho enforces this approach however, because the user whose transaction is prevented from committing is the same one who (presumably deliberately) included the read operation in the transaction.
The present discussion also considers read operations but in a different way. Here we want to distinguish sensitive data (say at the data type level) and immediately record access to it (by anyone other than the database owner), whether or not the current operation is ever committed.
These features [updated: 27 November] are available in Pyrrho version 6.3. The audit record is merged into the transaction log and a system table gives access to all the details.
From the manual:
Sec 1.5:
From the manual:
Sec 1.5:
Version
6.3 adds support for “sensitive” data, for which any access is auditable. Columns,
domains and types can be declared SENSITIVE[1].
Sensitive values are not assignment-compatible with anything that is not
sensitive, and there is a sensitive property inherited by any object that
contains a sensitive data type. This means for example that the sum of
sensitive data is still sensitive. The transaction log will contain a record of
every access to sensitive values (apart from by the database owner), even if the
transaction is rolled back. These details are visible in the Sys$Audit system
table (see section 8.3.1).
Sec 7.4:
Type = (StandardType
| DefinedType | Domain_id | Type_id | REF’(‘TableReference’)’) [UriType]
[SENSITIVE] .
Sec 8.3.1:
8.3.1 Sys$Audit
Field
|
DataType
|
Description
|
Pos
|
Char
|
The location of this access
record in the transaction log
|
User
|
Char
|
The defining position of the
accessing user
|
Table
|
Char
|
The defining position of the
sensitive table or view object
|
Timestamp
|
Int
|
The time of the access in
ticks
|
Audit
records are only for committed sensitive data. Entries come from physical Audit
records, and are added immediately on access (do not wait for transaction
commit).
8.3.2 Sys$AuditKey
Field
|
DataType
|
Description
|
Pos
|
Char
|
The location of the access
record in the transaction log
|
Seq
|
Int
|
The ordinal position of the
key (0 based)
|
Col
|
Char
|
The defining position of the
key column
|
Key
|
Char
|
A string representation of
the key value at this position
|
Key
information for audit records comes from the filters used to access a sensitive
object. For example, if a record is inserted in a table, there is no applicable
filter, the audit record will apply to the whole table, and there will be no
key information here.
Comments by email are welcome.
[1] SENSITIVE is a
reserved word in SQL that normally applies to cursor sensitivity. The usage in
Pyrrho described here is quite different, and the keyword comes at the end of a
type clause (see section 7.4).
Wednesday, 21 November 2018
Rethinking Shareable Data Structures
I have been recently developing a set of data structures following on from Okasaki's Purely Functional Data Structures. There is a lot to be gained by using immutable data structures, that is, where all the fields are public readonly in C# or public final in Java. Strings in C# and Java already have this property and it turns out to be remarkably easy to develop all the usual data structure types with this property. Something of the kind had already started to happen in Pyrrho: many of Pyrrho's data structures are immutable, and Pyrrho uses Bookmarks instead of Iterators. Specifically, the benefits (as with strings) are
So the time seemed ripe for a serious approach to #ShareableDataStructures and the fruits of these labours are emerging at github.com . Eventually the classes will be rich enough to implement a DBMS, and the plan is to implement everything in C# and Java, and then Python later. Efforts in the DBMS direction are currently called #StrongDBMS . A lot will depend on the performance of the TPCC benchmark.
It is natural to ask what this might mean for Pyrrho. It does seem like a natural evolution (Pyrrho 7.0 maybe), but some of the Pyrrho code would be a real nuisance to transform. Time will tell.
- a snapshot is obtained by a simple assignment, so rollback is a breeze
- structures can be modified while a traversal continues with the previous state#
- they are thread-safe and safe to pass as a parameter in C# or Java, so that
- these data structures never need to be locked
So the time seemed ripe for a serious approach to #ShareableDataStructures and the fruits of these labours are emerging at github.com . Eventually the classes will be rich enough to implement a DBMS, and the plan is to implement everything in C# and Java, and then Python later. Efforts in the DBMS direction are currently called #StrongDBMS . A lot will depend on the performance of the TPCC benchmark.
It is natural to ask what this might mean for Pyrrho. It does seem like a natural evolution (Pyrrho 7.0 maybe), but some of the Pyrrho code would be a real nuisance to transform. Time will tell.
Tuesday, 2 October 2018
Window Functions
A basic set of window functions are supported in PyrrhoDBMS. The full set of OLAP functions as defined in ISO 9075 used to be supported and can be again if there is demand: email me if you think this would be useful or if you find any issues (my direct email at UWS is in the pyrrhodb.com download).
Saturday, 11 August 2018
Latest version of Pyrrho 6.2
This version fixes a number of long-standing bugs in Pyrrho, including the following significant areas:
- Exit handlers
- Check constraints
- Triggers
- OSPStudio
- Row count
The Java Connector (PyrrhoJC) is also being fixed for Java SE, and more changes are expected.
In addition, some syntax error messages have been made more helpful. Many thanks for those who have submitted performance reports - I have tried your patience by taking so long to fix them.
In addition, some syntax error messages have been made more helpful. Many thanks for those who have submitted performance reports - I have tried your patience by taking so long to fix them.
Please send me details of any problems you find (malcolm.crowe@uws.ac.uk).
Thursday, 19 July 2018
RESTViews: the "Sierra Leone" example
A previous post on this blog (More details about RESTViews) showed the range of HTTP queries that could be automatically generated by Pyrrho from a simple View definition. This post takes extends the examples, using a set of sample MySQL databases constructed for the purpose by Fritz Laux.
A number of steps are required to set up the MYSQL databases and allow remote SELECT connections to the view, so these will not detailed here. The important point is that once the remote views have been defined to a Pyrrho database, a large set of SQL queries (with filters, aggregations and joins to other local tables) are supported using automatic creation of the appropriate SQL requests to the remote MySQL databases. The local Pyrrho database is not modified during any of these tests as none of the views are materialised and no schema changes are made. The data requested (volume of data transferred from the remote databases) is minimised as shown in these blog posts. Full technical details of the mechanism are given in the Pyrrho distribution at http://pyrrhodb.com, which contains a document "The Sierra Leone example" giving more details than are given here.
In this example, the remote view is defined in Pyrrho as follows (the syntax uses an SQL extension that Pyrrho calls RESTView):
[create view patientdataSL of ("hospID" char,"hospDesc" char,"patientID" int,"rCode" int,"address" char, "birthdate" date,"admission" date,"ICD10" char,"diagnosis" char,"treatments" char) as get using hospitalssl]
where hospitalssl is the following small table:
The first column is a key to identify the remote contributor database, at the url shown.
The remote contributor databases supply views for the given credentials containing the remaining columns in the above view definition. All three databases are in a MySQL database on localhost, and of course in a real example they would be on different hosts. The HTTP listener on port 8078 is the HTTP interface restif available at GitHub - this interface needs to run on the same server as the MySQL database but has no privileges or configuration of its own. It simply accepts one or more SQL statements from and returns SQL data in Json format.
With this setup, no further programming is required to access data from the remote views, as Pyrrho generates any MySQL queries that may be required. Further local views can be defined as indicated below and the code below includesjoins aith a local table called deomgraphicSL.
Consider the following SQL statements and the corresponding SQL queries sent to the remote contributors (Pyrrho request is in black, MySQL generated query for one remote hospital in blue):
[select * from patientdataSL where "patientID" = 10015]
select 'CMH' as hospID,'Choithram Memorial Hospital' as hospDesc,patientID,rCode,address,birthdate,admission,ICD10,diagnosis,treatments from patientrecordcmh where patientID=10015
select count("patientID") from patientdataSL
select COUNT(patientID) as C_298 from patientrecordcmh
This request of course returns a single row from each hospital with its number of patients and Pyrrho gives the total of these counts as 30003.
select count(*) from patientdatasl where icd10='N73.4'
select COUNT(*) as C_323 from patientrecordcmh where ICD10='N73.4'
This time the count is 3.
select "patientID" from patientdataSL where icd10='N73.4'
select patientID from patientrecordcmh where ICD10='N73.4'
Three patientIDs are returned 0 from the first hospital, 2 from the second, and 1 from the third.
[select count( *) as "numPatients", "rCode", count( *)*1.0 / "totalinhabitants" as "percentageOfPopulation" from patientdataSL natural join demographicSL where "rCode" < 1010 group by "rCode","totalinhabitants" order by "percentageOfPopulation" desc]
select COUNT(*) as C_275,rCode from patientrecordcmh group by rCode having rCode < 1010
This fetches 9 rows from each hospital, and results in a table of 9 rows sorted by proportionOfPopulation.
[create view patientSummary as
select "rCode",
extract(year from "admission")-extract(year from "birthdate") as "age",
extract(year from "admission") as "adyear",
icd10,
count(*) as patients
from patientDataSL
group by "rCode", "age", "adyear", icd10]
select * from patientsummary where icd10='N73.4'
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_856 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
This gives the rCode, age and admission year for the three patients mentioned above.
create view patientAnalysis as
select * from patientSummary natural join demographicSL
select * from patientanalysis where icd10='N73.4'
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_903 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
[select "rCode", icd10, "age", patients*100.0/"under10" as percentage from patientAnalysis where ICD10='N17.1' and "age"<10]
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_972 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
These adds to the previous details some demographic information about the regions involved. The generated SQL is similar (Pyrrho currently does not pass on the condition on the calculated field age).
[select coalesce(young.icd10,older.icd10) as "icd10", youngPat, oldPat from
(select icd10, sum(patients) as youngPat from patientsummary where "age"<10 and="" icd10="">'Z87' group by icd10) young
full join
(select icd10, sum(patients) as oldPat from patientsummary where "age">50 and icd10>'Z87' group by icd10) older
on young.icd10=older.icd10 order by youngPat desc]10>
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_253 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10>'Z87'
Each half of this join retrieves 100, 97, and 70 records from the three hospitals, and groups the results into a table with 56 rows. Also the descending order is honoured. The conversions and processing are very fast.
A number of steps are required to set up the MYSQL databases and allow remote SELECT connections to the view, so these will not detailed here. The important point is that once the remote views have been defined to a Pyrrho database, a large set of SQL queries (with filters, aggregations and joins to other local tables) are supported using automatic creation of the appropriate SQL requests to the remote MySQL databases. The local Pyrrho database is not modified during any of these tests as none of the views are materialised and no schema changes are made. The data requested (volume of data transferred from the remote databases) is minimised as shown in these blog posts. Full technical details of the mechanism are given in the Pyrrho distribution at http://pyrrhodb.com, which contains a document "The Sierra Leone example" giving more details than are given here.
In this example, the remote view is defined in Pyrrho as follows (the syntax uses an SQL extension that Pyrrho calls RESTView):
[create view patientdataSL of ("hospID" char,"hospDesc" char,"patientID" int,"rCode" int,"address" char, "birthdate" date,"admission" date,"ICD10" char,"diagnosis" char,"treatments" char) as get using hospitalssl]
where hospitalssl is the following small table:
hospID | hospDesc | url |
---|---|---|
'CMH' | 'Choithram Memorial Hospital' | 'http://root:admin@localhost:8078/sleone/patientrecordcmh' |
'HS' | 'Holy Spirit Hospital' | 'http://root:admin@localhost:8078/sleone/patientrecordholyspirit' |
'MABESSS' | 'St John of God Catholic (Mabesseneh) Hospital' | 'http://root:admin@localhost:8078/sleone/patientrecordmabesss' |
The remote contributor databases supply views for the given credentials containing the remaining columns in the above view definition. All three databases are in a MySQL database on localhost, and of course in a real example they would be on different hosts. The HTTP listener on port 8078 is the HTTP interface restif available at GitHub - this interface needs to run on the same server as the MySQL database but has no privileges or configuration of its own. It simply accepts one or more SQL statements from and returns SQL data in Json format.
With this setup, no further programming is required to access data from the remote views, as Pyrrho generates any MySQL queries that may be required. Further local views can be defined as indicated below and the code below includesjoins aith a local table called deomgraphicSL.
Consider the following SQL statements and the corresponding SQL queries sent to the remote contributors (Pyrrho request is in black, MySQL generated query for one remote hospital in blue):
[select * from patientdataSL where "patientID" = 10015]
select 'CMH' as hospID,'Choithram Memorial Hospital' as hospDesc,patientID,rCode,address,birthdate,admission,ICD10,diagnosis,treatments from patientrecordcmh where patientID=10015
This request of course returns a single row from each hospital with its number of patients and Pyrrho gives the total of these counts as 30003.
select COUNT(*) as C_323 from patientrecordcmh where ICD10='N73.4'
This time the count is 3.
select patientID from patientrecordcmh where ICD10='N73.4'
Three patientIDs are returned 0 from the first hospital, 2 from the second, and 1 from the third.
select COUNT(*) as C_275,rCode from patientrecordcmh group by rCode having rCode < 1010
This fetches 9 rows from each hospital, and results in a table of 9 rows sorted by proportionOfPopulation.
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_856 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
This gives the rCode, age and admission year for the three patients mentioned above.
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_903 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,(YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_972 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10='N73.4'
select rCode as rCode,(YEAR(admission)-YEAR(birthdate)) as age,YEAR(admission) as adyear,ICD10 as ICD10,COUNT(*) as C_253 from patientrecordcmh group by rCode,age,adyear,ICD10 having ICD10>'Z87'
Saturday, 2 June 2018
More details about RESTViews
An introduction to RESTViews
The essential idea with RESTView
is that the Pyrrho database allows definition of views where the data is held
on remote DBMS(s): at present, the only options are Pyrrho and MySQL. The
remote DBMS is accessible via SQL statements sent over HTTP with Json
responses.
For MySQL a simple client called RestifD
(source posted on github.com/MalcolmCrowe/restif) provides this HTTP service.
The HTTP access provides the user/password combinations set up for this purpose
within MySQL by the owners of contributor databases. In the use cases
considered here, where a query Q references a RESTView V, we assume that (a)
materialising V by Extract-transform-load is undesirable for some legal reason,
and (b) we know nothing of the internal details of contributor databases. A
single remote select statement defines each RESTView: the agreement with a
contributor does not provide any complex protocols, so that for any given Q, we
want at most one query to any contributor, compatible with the permissions
granted to us by the contributor, namely grant select on the RESTView columns.
Crucially, though, for any given
Q, we want to minimise the volume D of data transferred. We can consider how
much data Q needs to compute its results, and we rewrite the query to keep D as
low as possible. Obviously many such queries (such as the obvious select * from
V) would need all of the data. At the other extreme, if Q only refers to local
data (no RESTViews) D is always zero, so that all of this analysis is specific
to the RESTView technology.
We will add a set of
query-rewriting rules to the database engine aiming to reduce D by recursive
analysis of Q and the views and tables it references. As the later sections of
this document explain, some of these rules can be very simple, such as
filtering by rows or columns of V, while others involve performing some
aggregations remotely (extreme cases such as select count(*) from V needs only
one row to be returned). In particular, we will study the interactions between
grouped aggregations and joins. The analysis will in general be recursive,
since views may be defined using aggregations and joins of other views and
local tables.
Any given Q might not be
susceptible to such a reduction, or at least we may find that none of our rules
help, so that a possible outcome of any stage in the analysis might be to
decide not to make further changes. Since this is Pyrrho, its immutable data
structures can retain previous successful stages of query rewriting, if the
next stage in the recursion is unable to make further progress.
Although in this document the
example are mostly very simple, we aim to present the analysis in such a way as
to demonstrate the applicability of the rules to more complex cases. In other
studies, such as the Sierra Leone example, queries can reference multiple
stored queries (view definitions) and functions. For now RESTViews are only
found in Pyrrho, but in principle we could have several stages where one
RESTView is defined using other RESTViews. We also bear in mind that a query Q
might involve joins of RESTViews possibly from the same remote database(s).
There are two types of RESTView
corresponding to whether the view has one single contributor or multiple remote
databases (for examples, see the text in red below). In the simple exercises in this document, V is a RESTview with one
contributor, and W has two. In the multiple-contributors case, the view
definition always includes a list of contributors (the “using table”, VU here)
making it a simple matter to manage the list of contributors.
Setting up the tests
The test cases considered in this
document are set up as follows. Everything assumes that RestifD.exe is running
(it uses port 8078 and needs no configuration), and the MySQL server is on the
local machine.
In MySQL, at a command line
create database db
use db
create table T(E integer,F nvarchar(6))
insert into T values(3,'Three'),(6,'Six'),(4,'Vier'),(6,'Sechs')
grant all privileges on T to 'root'@'%' identified by 'admin'
create database dc
use dc
create table U(E integer,F varchar(7))
insert into U values(5,'Five'),(4,'Four'),(8,'Ate')
grant all privileges on U to 'root'@'%' identified by 'admin'
In Pyrrho: We assume OSPSvr.exe
is running, with flags such as –V -E -H. Start up a command line with ospcmd
rv. Then at the SQL> prompt:
create view V of (E int,F char) as get
'http://root:admin@localhost:8078/db/t'
create table VU (d char primary key, k int, u char)
insert into VU values ('B',4,'http://root:admin@localhost:8078/db/t')
insert into VU values ('C',1,'http://root:admin@localhost:8078/dc/u')
create view W of (E int, D char, K int, F char) as get using VU
create table M (e int primary key, n char, unique(n))
insert into M values (2,'Deux'),(3,'Trois'),(4,'Quatre')
insert into M values (5,'Cinq'),(6,'Six'),(7,'Sept')
These sample tables are of course very small. But we can still use them to show how the RESTView minimises the data returned. Much larger examples inclouding the "Sierra Leone" study have demonstrated the effectiveness of the algorithms here for large data sets, views defined in terms of other views including RESTViews, aggregations, joins of views, etc.
To see the internal workings of the database engine on the following examples, start up OSPSvr.exe with the flags mentioned above. For brevity in this blog, we does not show the results of these queries.
Simple retrieval (to be avoided)
table
V
Unsurprisingly, simple queries result in the simple requests to the remote databases
http://root:admin@localhost:8078/db select E,F from t--> 2 rows
table W
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows
--> 4 rows
http://root:admin@localhost:8078/dc select E,'C' as D,1 as K,F from u
--> 3 rows
Filters
Filters (where or having conditions) should be used to reduce the amount of data retrieved from the remote database. Simple cases are easy:
select * from V where E=3
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
http://root:admin@localhost:8078/db select E,F from t where E=3
--> 1 rows
For the more complex multi-database GET USING syntax we can also filter on columns of the using table (VU in this example):
Select * from w where k>2
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
http://root:admin@localhost:8078/db select E,'B' as D,4 as K,F from t
--> 2 rows
In this case, the request only goes to one of the remote databases. Note that the value of E is supplied to the remote database, in case it is referred to in the select list of the query. Such filters can of course be combined.
Aggregations
select count(e) from w
http://root:admin@localhost:8078/db select COUNT(E) as C_12 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(E) as C_12 from u
--> 1 rows
Internally, the local database engine computes SUM(C_12) from the returned rows.
select max(f) from w having e>4
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
http://root:admin@localhost:8078/db select MAX(F) as C_33 from t where E>4
--> 1 rows
http://root:admin@localhost:8078/dc select MAX(F) as C_33 from u where E>4
--> 1 rows
Again, the local engine computes max(C_33) .
select sum(e),f from w group by f
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
http://root:admin@localhost:8078/db select SUM(E) as C_74,F from t group by F
--> 2 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_74,F from u group by F
--> 3 rows
Our trivial example shows no reduction in the number of rows. A larger example would show of course that the number of rows returned in determined by the number of distinct values of the grouped column f in that database.
select count(*),k/2 as k2 from w group by k2
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
http://root:admin@localhost:8078/db select COUNT(*) as C_11,(4/2) as K2 from t
--> 1 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_11,(1/2) as K2 from u
--> 1 rows
Other aggregation functions are converted to use sums and counts from the remote databases, for example:
Select avg(e) from w
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_156,COUNT(E) as D_156 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_156,COUNT(E) as D_156 from u
--> 1 rows
The local database computes the request for AVG as SUM/COUNT. Common subexpressions are also optimised:
select sum(e)*sum(e),d from w group by d
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_12,'B' as D from t group by D
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,'C' as D from u group by D
--> 1 rows
More expression rewriting
In general, subexpressions will contain other SQL functions and a mixture of local and remote columns. The following examples illustrate the number of cases that arise.
select sum(e)+char_length(max(f)) from w
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows
select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows
select sum(e)+k as x,k from w group by k
http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows
http://root:admin@localhost:8078/db select SUM(E) as C_12,MAX(F) as C_14 from t
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,MAX(F) as C_14 from u
--> 1 rows
select count(*),sum(e)+ char_length(max(f)) as x from w group by x
http://root:admin@localhost:8078/db select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from t group by E,F
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_54,SUM(E) as C_56,MAX(F) as C_58 from u group by E,F
--> 3 rows
select sum(e)+k as x,k from w group by k
http://root:admin@localhost:8078/db select SUM(E) as C_12,4 as K from t group by K
--> 1 rows
http://root:admin@localhost:8078/dc select SUM(E) as C_12,1 as K from u group by K
--> 1 rows
Joins
One of the steps in constructing a RESTView is to reduce the view to the columns needed for the given query. When the RESTView is used in a join, we need to ensure that columns needed for the joinCondition are added to the list of needed columns.
Select f,n from w natural join m
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows
If a grouped query uses a join whose non-remote factor has key J, an aggregation operation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by GJ. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join.
Select count(*) from w natural join m
http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows
http://root:admin@localhost:8078/db select F,E from t
--> 2 rows
http://root:admin@localhost:8078/dc select F,E from u
--> 3 rows
If a grouped query uses a join whose non-remote factor has key J, an aggregation operation grouped by G can be shared with the factors of the join by applying a similar aggregation grouped by GJ. In this example, the join column is the primary key of both factors. If there is no primary key defined, then all columns are needed to form the join.
Select count(*) from w natural join m
http://root:admin@localhost:8078/db select COUNT(*) as C_178,E from t group by E
--> 2 rows
http://root:admin@localhost:8078/dc select COUNT(*) as C_178,E from u group by E
--> 3 rows
Friday, 11 May 2018
DBMS Support for Big Live Data
In 2016 I reported on an idea for Big Live Data, and included some facilities in Pyrrho to support it. The current version of Pyrrho (11 May 2018) has better support and this blog post aims to explain the details of how it works. Basically, with Big Live Data it is not sensible to copy it all to a local database, for two reasons: (a) the copy will some be out of date or will need constant updates (b) the data doesn't belong to us in the first place.
The Pyrrho database allows definition of views where the data is held
on remote DBMS(s): at present the only options are Pyrrho and MySQL. The remote
DBMS is accessible via SQL statements sent over HTTP with Json responses. For MySQL this HTTP service is implemented by a simple RestifD client (source posted on github.com/MalcolmCrowe/restif). The HTTP access provides the user/password combination set up within MySQL.
The syntax for the view definition is basically CREATE VIEW ViewDef AS GET (url|USING table)
This means there two types of RESTView
corresponding to whether there is one single contributor with a single url or multiple remote
databases supplying data for the view.
The technical details can be
found in the Pyrrho documentation (Pyrrho.pdf and SourceIntro.pdf in the
distribution, and were presented at DBKDA 2017 in Barcelona by Fritz Laux. There is a PDF version of this blog post with full details of how to set up some simple examples, entitled "The May 2018 Version of Pyrrho".
When RESTViews are used, remote data is accessed in such a way as to minimised the amount of data transferred, and avoid storing any of it in one the local machine. In some ways this is the opposite of Extract-Transform-Load. We assume an agreement with the contributor of the data that we have SELECT permission on view(s) they have defined, and credentials to enable us to access this view on the remote DBMS. There is also an understanding that we may access this view on numerous occasions but minimising the amount of data we request.
This means that on our local DBMS (Pyrrho for now) we will apply filters and aggregations to requests for the remote data. We will analyse such results locally, and join it with local data using SQL. We will often work interactively in SQL, and we will be facilitated by automatic mechanisms that transform our SQL to contain embedded requests (in SQL) that will be sent to the remote contributor. These requests will comply with the SELECT permissions we have been given on the remote view(s).
With most query optimisation mechanisms in DBMS the goal is to get the data we need into memory where processing is faster. But in the RESTView situation the aim of optimisation is to get as much aggregation and filtering as possible done on the remote system, so that as little data as possible comes to us. And of course we don't want to have the write a program every time we want something different. Ideally any query we wish to evaluate should result in at most one query being sent to each contributor.
The multiple-contributors version of RESTView assumes that all of the contributors provide data in similar format. Most probably all of them produce this data by extracting/transforming from their databases: so their base tables probably look nothing like the view they are providing. The last thing they, or we, want is to keep having to make changes to the facility they are giving us.
So we consider what opportunities we have for keeping everyone happy. There are opportunities for filtering the data vertically (fewer columns) and horizontally (fewer rows). MySQL will soon support window functions, so we can use these. We can filter and order by expressions, selecting maximum or minimum values. Aggregations over a set of contributors requires some automated support as COUNT becomes a sum of counts from the different contributors, and AVG and STDDEV_POP also need to be implemented as a combination of partial sums.
If we are joining the returned data with something, and then doing aggregations, it is natural to want to transforms these too, moving work into the RESTView. The query can be analused to find which expressions and conditions can be computed using only remote data: this allows the conditions to be applied on the remote system. Conditions can be separated into disjunctions and some applied locally, some remotely. Some aggregations can be done on the remote systems (provided we group by keys of joins we plan to make). Common subexpressions can identified and remotely computed.
In the examples I have been working on, the SQL requests to the remote DBMS all look very different from the original view definition. In my opinion this validates the entire proposal to use simple HTTP/SQL/JSon for communication with remote DBMS, rather than an API.
Subscribe to:
Posts (Atom)