Friday 11 May 2018

DBMS Support for Big Live Data

In 2016 I reported on an idea for Big Live Data, and included some facilities in Pyrrho to support it. The current version of Pyrrho (11 May 2018) has better support, and this blog post aims to explain the details of how it works. Basically, with Big Live Data it is not sensible to copy it all to a local database, for two reasons: (a) the copy will soon be out of date or will need constant updates, and (b) the data doesn't belong to us in the first place.

The Pyrrho database allows definition of views where the data is held on remote DBMS(s): at present the only options are Pyrrho and MySQL. The remote DBMS is accessed via SQL statements sent over HTTP, with Json responses. For MySQL this HTTP service is implemented by a simple RestifD client (source posted on github.com/MalcolmCrowe/restif). The HTTP access uses the user/password combination set up within MySQL.
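Illustratively (the exact wire format is described in the documentation mentioned below; the url, headers and layout here are only indicative), such an exchange might look like

   POST /db/role HTTP/1.1
   Authorization: Basic <credentials>
   Content-Type: text/plain

   select F, count(E) from RemoteV group by F

with a Json response along the lines of

   [{"F": "a", "COUNT": 21}, {"F": "b", "COUNT": 9}]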
The syntax for the view definition is basically

   CREATE VIEW ViewDef AS GET (url|USING table)
This means there are two types of RESTView, corresponding to whether there is a single contributor with a single url, or multiple remote databases supplying data for the view.
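For example (the names and url here are purely illustrative, and the column list in the OF clause must of course match what the remote view actually supplies):

   -- single contributor: the url identifies the remote view
   CREATE VIEW V OF (E INT, F CHAR) AS GET 'http://example.com:8180/db/role/RemoteV'

   -- multiple contributors: a local table VU holds the contributors' urls
   CREATE TABLE VU (url CHAR)
   CREATE VIEW W OF (E INT, F CHAR) AS GET USING VU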
The technical details can be found in the Pyrrho documentation (Pyrrho.pdf and SourceIntro.pdf in the distribution), and were presented at DBKDA 2017 in Barcelona by Fritz Laux. There is a PDF version of this blog post with full details of how to set up some simple examples, entitled "The May 2018 Version of Pyrrho".
When RESTViews are used, remote data is accessed in such a way as to minimise the amount of data transferred, and to avoid storing any of it on the local machine. In some ways this is the opposite of Extract-Transform-Load. We assume an agreement with the contributor of the data that we have SELECT permission on view(s) they have defined, and credentials that enable us to access these views on the remote DBMS. There is also an understanding that we may access these views on numerous occasions, while minimising the amount of data we request.
This means that on our local DBMS (Pyrrho for now) we will apply filters and aggregations to requests for the remote data. We will analyse the results locally, and join them with local data using SQL. We will often work interactively in SQL, helped by automatic mechanisms that transform our SQL so that it contains embedded requests (in SQL) to be sent to the remote contributor. These requests will comply with the SELECT permissions we have been given on the remote view(s).
With most query optimisation mechanisms in a DBMS, the goal is to get the data we need into memory, where processing is faster. But in the RESTView situation the aim of optimisation is to get as much aggregation and filtering as possible done on the remote system, so that as little data as possible comes to us. And of course we don't want to have to write a program every time we want something different. Ideally, any query we wish to evaluate should result in at most one request being sent to each contributor.
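As an illustration of the kind of transformation involved (using the invented names from the example above): a local query such as

   SELECT F, COUNT(E) FROM V WHERE E > 10 GROUP BY F

should not fetch the whole of V. Instead the filter and the aggregation move into the request, so that what is sent to the contributor is essentially

   select F, count(E) from RemoteV where E > 10 group by F

and only the small grouped result crosses the network.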
The multiple-contributors version of RESTView assumes that all of the contributors provide data in a similar format. Most probably all of them produce this data by extracting/transforming it from their databases: so their base tables probably look nothing like the view they are providing. The last thing they, or we, want is to keep having to make changes to the facility they are giving us.
So we consider what opportunities we have for keeping everyone happy. There are opportunities for filtering the data vertically (fewer columns) and horizontally (fewer rows). MySQL will soon support window functions, so we can use these. We can filter and order by expressions, selecting maximum or minimum values. Aggregations over a set of contributors require some automated support, as COUNT becomes a sum of counts from the different contributors, and AVG and STDDEV_POP also need to be implemented as a combination of partial sums.
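To make the aggregation point concrete, here is the standard bookkeeping (not specific to Pyrrho): if each contributor returns its partial results n = COUNT(X), s = SUM(X) and q = SUM(X*X), the combined values over all contributors are

   COUNT(X) = SUM(n)
   SUM(X) = SUM(s)
   AVG(X) = SUM(s)/SUM(n)
   STDDEV_POP(X) = SQRT(SUM(q)/SUM(n) - (SUM(s)/SUM(n))^2)

so only three numbers are needed from each contributor, however many rows each one holds.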
If we are joining the returned data with something, and then doing aggregations, it is natural to want to transform these too, moving work into the RESTView. The query can be analysed to find which expressions and conditions can be computed using only remote data: this allows those conditions to be applied on the remote system. Conditions can be separated into their component conjuncts, some applied locally and some remotely. Some aggregations can be done on the remote systems (provided we group by the keys of the joins we plan to make). Common subexpressions can be identified and computed remotely.
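For instance (names invented again, and assuming F is a key of the local table L), a query such as

   SELECT L.G, SUM(V.E) FROM V INNER JOIN L ON V.F = L.F GROUP BY L.G

can be served by asking each contributor for

   select F, sum(E) from RemoteV group by F

Because the remote data is grouped by the join key F, the local system can join these partial results to L, add together the partial sums from the different contributors, and regroup by G to complete the query.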
In the examples I have been working on, the SQL requests to the remote DBMS all look very different from the original view definition. In my opinion this validates the entire proposal to use simple HTTP/SQL/Json for communication with the remote DBMS, rather than an API.