Gillius's Programming

RealDB

Master's project by Jason Winnebeck - Current Status | Download | Using RealDB | Progress Reports

Committee

Abstract

RealDB (RDB) will be a real-time embeddable database system for data streams. It is a non-relational database that is based on a changing value of a stream of sequential, timestamped data. To support this environment, the system will have a low overhead and deliver high performance; the trade-off is a narrower approach and fewer guarantees than a traditional relational database management system (RDBMS) would deliver.

Summary

To summarize the main topics on what this work is to accomplish:

For more details, please see the latest proposal.

Current Status / Schedule

The RealDB project proposal is finished. The final proposal can be found here (SVN revision 81, 7-Aug-2008).

Milestone 1
First design phase completed, creation of object model and some stub functionality and tests. Setup of environments and compilers such as GCJ, and prototypes for high-risk code.
Milestone 2
Creation of maintenance tools to create an empty database on disk, and simple storage implementation (single stream, no or incomplete space management)
Milestone 3
Completion of database metadata and functionality required for writing, including space management but excluding compressed data storage
Milestone 4
Completed research and implementation of compressed data storage and gathering, reconstruction algorithms, and read functionality including APIs and query tool
Milestone 5
Completion of proof-of-concept use for RealDB
Milestone 6
Design and implementation of RealDB version of performance tool and completed design for solving problem using other solutions
Milestone 7
Completion of performance tool versions for MySQL InnoDB, MySQL MyISAM, and Apache Derby

The project schedule, from the final proposal (status as of 1-Sept-2008):

Target Planned Completed Percent Complete
Preproposal 2006-10-17 2006-10-17 100%
Preproposal Presentation 2006-10-17 2006-10-17 100%
Proposal Approved 2008-07-31 2008-08-11 100%
Milestone 1 2008-08-31 2008-08-28 100%
Milestone 2 2008-08-31 2008-09-29 100% (r123)
Milestone 3 2008-09-08 2009-01-29 100% (r199)
Milestone 4 2008-09-22 2010-02-28 100%
Milestone 5 2008-10-20   0%
Milestone 6 2008-11-10    
Milestone 7 2008-11-24    
Report 2008-12-15    
Defense 2009-01-15    

Download

Latest Release: RealDB-2010-02-28-r318 (milestone 4), released on February 28, 2010.

Javadocs and be browsed online here. You may want to read on how to use RealDB.

License

RealDB Project - Copyright (c) 2008-2010 by Jason Winnebeck

All Rights Reserved.

Notes

My eventual goal is to release the project as open-source, but currently I have not settled on which license. Therefore for now the license is "all rights reserved" -- no copying, distribution, or derivative works are permitted with this release.

Dependencies

For compiling and running unit tests, the following libraries are used:

For running RealDB, the only dependency is the Antlr runtime jar.

Using RealDB

This section is very minimalistic and will be expanded in the future.

Running RealDB

Download the full release of RealDB; no compilation is needed. There are currently three ways to use RealDB:

  1. Run the RealDB-demo.jar in a GUI environment by double-clicking it (if your OS environment has an association for executable jars).
  2. Run the RealDB-cli.jar on the command line by executing java -jar RealDB-cli.jar in the directory where you uncompressed RealDB.
  3. Reference RealDB-core.jar in your own Java project and use the API calls to read and write data.

Compiling RealDB

To compile RealDB yourself, the most compatible method is to use the realdb.xml build file, which should work with even very old Ant versions. This method was used to generate the full distribution.

ant -file realdb.xml

Currently this method can not run unit tests. You can use the NetBeans-generated build files if you have Ant 1.7.1 or later under the build.xml file. You can use NetBeans IDE or IntelliJ IDE with the IPR project file (Community Edition is OK) to also compile and run all unit tests. The IntelliJ project files are the most maintained and up-to-date versions.

RealDB Definition Language

The RealDB Definition Language (RDL) is used to define RealDB databases. Here is an example RDL file:

SET blockSize     = 2048
SET fileSize      = 204800
SET maxStreams    = 3
SET dataBlockSize = 2

CREATE STREAM Test WITH ID 1 {
  value float NULL //will use SampledAlgorithm by default
}

CREATE STREAM CarSnapshots WITH ID 2 {
  rpm float WITH CODEC DeadbandAlgorithm PARAMS (deadband=50.0),
  speed float WITH CODEC DeadbandAlgorithm PARAMS (deadband=5),
  passengers uint8 WITH CODEC StepAlgorithm,
  driving boolean WITH CODEC StepAlgorithm
}

This section will be expanded more later. Brief descriptions of the elements in this file:

Creating Streams

RealDB Concepts

DataStream

A DataStream is a time-ordered sequence of Records with a fixed format -- an ordered list of elements.

Record

A single entry in a DataStream, has a timestamp and can be a "discontinuity" marker.

Element

A specific field in the Record with a known name and type.

StreamInterval

A reconstructed continuous interval of a DataStream that comprises of 1 or more original (possibly dropped) records.

Discontinuities

A stream can be in a "discontinuity" state. This is different from elements individually being null and even all elements being null at the same time. The actual meaning can be defined by the user. For example a null could be "not applicable," or "collected but not present", whereas a discontinuity could signifiy ranges where the system was turned off and not collecting data at all. When a discontinuity is written, the stream is said to remain in that state until the next non-discontinuity record is written.

Timestamps

All records in RealDB are timestamped, and written in ascending order of time. The timestamp is a 63 bit unsigned integer. RealDB does not assume any particular unit or meaning for time other than it being linear (i.e. distance between 5 and 10 is the same as between 50 and 55), leaving the user to define both the unit (i.e. seconds versus milliseconds) and the epoch (the meaning of time 0).

Command Line Tool

The current command line tool supports the following commands. Use the help command to get more information:

bulkLoad Bulk loads data into a stream from a tab-separated values file.
create Creates a new RealDB Database
describe Shows information about a database.
help Prints out a list of all commands or detailed help on a single command
intervals Reads intervals in a given time range on an element within a stream.
read Reads all raw records or a range of raw records from a single stream and outputs a tab-delimited output with the first row as the column headers.
summary Reads intervals in a given time range on an element within a stream, and summarizes them into a single report.

Progress Reports

28-Feb-2010

Milestone 4 release. I took a little extra time from my previous report to make the import/export file formats more robust to handle discontinuities, and I realized that I missed some API support for discontinuities in StreamIntervals.

Refer to the instructions above for how to download and run RealDB. For this release, you can run these commands:

create examples/test.rdl test.rdb
describe test.rdb
bulkLoad test.rdb 1 examples/test.txt
bulkLoad test.rdb 2 examples/CarSnapshots.txt
read test.rdb 1 10 20
read test.rdb 2
intervals test.rdb 2 rpm 50 70
summary test.rdb 2 passengers

To get the output below:

% java -jar RealDB-cli.jar
RealDB Tool Interactive mode
This version is extremely experimental and the commands will change in the next release.
You can also invoke this program with arguments to run a single command non-interactively.
Use 'help' command to get a list of commands, 'exit' to end
> create examples/test.rdl test.rdb
Created database at test.rdb
> describe test.rdb
Database /home/stu4/s30/jpw9607/realdb/realdb-2010-02-28-r318/test.rdb:
  Block size     : 2048
  Database size  : 100 blocks (204800 bytes)
  Data block size: 2
Database has 2 streams:
Test (1), with 1 elements:
  0 value (Float, optional) with codec SampledAlgorithm
Stream contains records from times -1 to -1
CarSnapshots (2), with 4 elements:
  0 rpm (Float) with codec DeadbandAlgorithm
  1 speed (Float) with codec DeadbandAlgorithm
  2 passengers (UInt8) with codec StepAlgorithm
  3 driving (Boolean) with codec StepAlgorithm
Stream contains records from times -1 to -1
> bulkLoad test.rdb 1 examples/test.txt
Wrote 35 records into stream Test (ID 1)
> bulkLoad test.rdb 2 examples/CarSnapshots.txt
Wrote 30 records into stream CarSnapshots (ID 2)
> read test.rdb 1 10 20
    stream       time      value
         1         10       null
         1         11       11.0
         1         12       12.0
         1         13       13.0
         1         14       14.0
         1         15       15.0
         1         --         --
         1         17       null
         1         18       18.0
         1         19       19.0
         1         20       20.0

Read 11 records.
> read test.rdb 2
    stream       time        rpm      speed passengers    driving
         2          5        0.0        0.0          0      false
         2         20        0.0        0.0          1      false
         2         35        0.0        0.0          2      false
         2         45        0.0        0.0          3      false
         2         50      500.0        0.0          3       true
         2         65      600.0        0.0          3       true
         2         70      950.0        5.0          3       true
         2         75     1250.0       10.0          3       true
         2         80     1500.0       20.0          3       true
         2         85      700.0       25.0          2       true
         2         90      850.0       30.0          2       true
         2         95     1000.0       35.0          2       true
         2        100     1200.0       40.0          2       true
         2        105     1400.0       45.0          2       true
         2        120        0.0        0.0          2      false
         2        135        0.0        0.0          0      false

Read 16 records.
> intervals test.rdb 2 rpm 50 70
    stream    element      start        end        min        avg        max   integral
         2        rpm         50         65      500.0      500.0      500.0     7500.0
         2        rpm         65         70      600.0      600.0      600.0     3000.0
         2        rpm         70         75      950.0      950.0      950.0     4750.0

Read 3 intervals.
> summary test.rdb 2 passengers
    stream    element      count      start        end        min        avg        max   integral
         2 passengers         16          5        135        0.0 0.25384615384615383        3.0      255.0

 

23-Feb-2010

I am almost completely done with milestone 4, and a release will be coming this week. Since the last progress report, I have completed all of the "next steps":

I'm just polishing up a few things for the release:

Here is an example RDL that I will be using for this run:

SET blockSize     = 2048
SET fileSize      = 204800
SET maxStreams    = 3
SET dataBlockSize = 2

CREATE STREAM Test WITH ID 1 {
  value float //will use SampledAlgorithm by default
}

CREATE STREAM CarSnapshots WITH ID 2 {
  rpm float WITH CODEC DeadbandAlgorithm PARAMS (deadband=50.0),
  speed float WITH CODEC DeadbandAlgorithm PARAMS (deadband=5),
  passengers uint8 WITH CODEC StepAlgorithm,
  driving boolean WITH CODEC StepAlgorithm
}

There is one caveat to the completion of milestone 4: the current transaction handling, while functioning perfectly (as far as I know), does not meet the scalability requirements in my proposal as the current implementation is naive. It is linear time on the size of the database, while it should be linear time on the uncommitted transactions, up to the maximum user-specified transaction log size.

The command line tool (org.gillius.realdb.tools.cli.RealDBTool) will provide at least the following commands:

22-Nov-2009

Currently I am working on completing milestone 4, which really has caused me to come up with a full solution to the issue of treating the database as a database of continuous signals/streams rather than a series of points. I did not encounter much challenges in defining a data model to represent a "stream interval" over a time range. But then, I faced a dilemma about what kind of classes of interpolation ("compression") methods to support -- what is the minimum that I need in my API to support everything that I want. There were two major problems with my original concept:

  1. The compression algorithms were able to store additional data with the fields to be able to reconstruct intervals later, for example a "slope" with a linear compression. However, I had no way of representing these hidden data elements in the "raw" data model (preventing copying/duplication/movement of data), and the physical data format would be tied to the interpolation algorithm used (and could not be updated or changed).
  2. The compression algorithms worked only by being given the "current" point and are asked if they wish to drop or write it. However, having to make an interpolation decision with only a single point works only for 0 order interpolation.

For awhile each adjustment to that design I tried would be unworkable in some way. I didn't want this to hold up my project any longer especially since I was trying to really focus on reliability and performance than complex interpolation methods. But then I finally had a breakthrough after reviewing some typical methods. Many reasonable methods didn't really require anything but the points, but they do sometimes require to know about points coming ahead in the stream. Those that do take more parameters than the control points themselves are either too complicated, I think (NURBS), can be set to, reasonable, "generic" values (Hermite). I realized that this is true only after some research on interpolation methods, the most concise and useful I found is at Paul Bourke's site. So I made two fundamental changes in my design, based on the above problems, and with the new knowledge of what the algorithms would really need:

  1. Remove ability for codecs to save custom fields in the records. Now, the stored format of the record is exactly the same as the measured format from the user. This allows verbatim copying between databases if needed without involving the codec. The codec will need to reconstruct stream intervals using only the measured values.
  2. Allow the codecs to see a finite number (defined by them at configuration time) of adjacent samples before and after the sample they are outputting/reconstructing.

The first change also allowed the separation of the serialization code from the compression code, which significantly simplified the design by separating the responsibilities better as well as allowing the compression code to be potentially used outside of the RealDB context. The new model for both compression and reconstruction is a stream transformation function. The "look ahead" and "look behind" features are enabled by the design by allowing the stream transformation function to be delayed. For example see the table of interactions below for a theoretical transformation of a stream with 5 samples to a stream with 3:

Input Output
2.0 null
3.0 null
4.0 2.0
4.0 null
4.5 4.0
null null
null 4.5

The compression algorithm delays the output by 2 samples (because it wants to see 2 samples ahead to determine if a sample is needed), and preserves the first and last samples.

It is expected that 0 order interpolation (deadbands, always sample) need 0 look-ahead, first order needs 1 look-ahead, and higher order like cubic or hermite need 2 or more look-ahead points. Until I thought of the better design, I was ready to say that RealDB would only support nothing more complicated than linear interpolation under the theory that anything else would be too slow. However, looking at the implementation, cubic and/or Hermite don't involve a high number of complex operations.

The other thought I had about methods like Cubic and Hermite that work on 4 control points is that the complexity to find the best points could be something like NP-complete, and I almost ruled it out on that fact, until I thought that it is conceivable to have O(n) methods that have a chance of being quite good even though not perfect. I don't intend on researching these techniques in this project -- I only want to design a framework that could support such a technique were one to design it. The example technique I used to prove to myself it could be possible to be practically useful in terms of performance is this simple shell of an heuristic point-selection algorithm for a cubic/Hermite interpolation function:

  1. Know the last 2 points output by the algorithm
  2. Look at the m+2 most recent points in the stream (the finite sized m keeps this algorithm linear time with respect to the samples in the stream, complexity O(n*m))
  3. Consider the last 2 output points and the 2 most recent input points. Form the interpolation based on these 4 points and check the error of the interpolation on the m (middle) points.
    1. If the interpolation's error is under the threshold, drop the 3rd most recent point.
    2. If the error is out of threshold, output the 3rd and 2nd most recent points (and the most recent point becomes part of the "m" set for the next period).
  4. If the size of the "m" buffer is full, then output the 2 most recent points.

There would be some handling of edge cases in the above that I left out for simplicity, to express the point simply that such an algorithm is feasible and could be successful in eliminating a significant portion of points.

Current Progress

Both steps above have finished tested (via unit testing).

Next Steps

29-Jun-2009

In the last 4 months, there was development and fixing work leading up to a graphical demonstration program that works on physical disks; demonstrated on hard drive and removable flash. The demo produced at the middle of May would occasionally fail to create the database, which was surprising given the automated testing I had done.

I sat out to create an automated test based on the demo and was able to reproduce the problem in some cases. There were two bugs:

  1. When advancing the head (oldest end) of the index block "queue", I write a flag "headAdvanced" to the header that denotes that the head pointer points to the actual "real" block, and not to consider the backup block. The bug is that when writing to the "alternate" head, I don't clear the "headAdvanced" flag -- if the database fails immediately after, on reloading the old version of the head is used, causing previously removed data blocks to appear at the start of a stream, which usually leads to the same data block being referenced by two indices (since the DB is always full and transfers blocks in a rolling queue).
  2. There was an error in calculating the length of the queue when the head is physically after the tail (wrap-around): the return from the function was 1 too large. This effectively caused the data index iterator to traverse the "tail" (newest) block of the queue twice.

I wondered, why didn't my tests where I test failures at every possible I/O operation detect these bugs? Because I chose parameters such that a single index block was sufficient to hold all of the indices of a stream; a supported but degenerate case where head == tail and never moves.

I fixed the two bugs and I have yet to get the demo to fail again. I updated the original tests and ensured that they failed with the old code. However, with the new code they get much farther in the test, but still fail. So, I am still finding some bugs. These tests are needed not just because it's virtually impossible to duplicate all of the write faults in a real situation, but will come in critical use when I start "optimizing" some of the naive code just to get things to "work," in order to meet the complexity goals for the project (i.e. turn O(n) recovery into O(1), for example).

1-Feb-2009

Updated the milestone 3 acceptance test framework to extend to n streams, where each stream skips every x records. This allows for streams to write between 50% to 100% of the full dataset. I haven't figured out how to "test" the performance aspects -- the test looks for the literal guarantees made by RealDB. Because RealDB is allowed to reclaim space (deleting records) technically as long as there are some records and they are in order and the same records that I've written, the test can pass. I've written some statements to print that give me a "heuristic" way to check that the test is reasonable. Here is an example of the results of the full (non-corrupting) run of an 8 stream test with skips = 0, 5, 2, 3, 0, 6, 50, 90. What this means is that stream 1 skips no records (100%), stream 3 skips 1 out of every 2 records (50%) and stream 7 skips 1 out of every 50 records (98%).

Total operations performed: 19623
Stream Test0 (ID 0) has 12999 records from 37001 to 50000
Stream Test1 (ID 1) has 10399 records from 37001 to 49999
Stream Test2 (ID 2) has 6499 records from 37001 to 49999
Stream Test3 (ID 3) has 8667 records from 37000 to 50000
Stream Test4 (ID 4) has 13000 records from 37000 to 50000
Stream Test5 (ID 5) has 10833 records from 37000 to 50000
Stream Test6 (ID 6) has 12739 records from 37001 to 49999
Stream Test7 (ID 7) has 12856 records from 37000 to 50000

Note that the time ranges are fairly close together -- the deletion algorithm always picks the oldest block, which is supposed to try to keep the tail of each stream together. The fact that the streams each contain proportionally different records is also a good sign. However, it's hard to write a JUnit assertion for this since I haven't defined the precise tolerance -- because of flushing and block boundaries, the stream that outputs half of the records of another will likely not have exactly half (but close -- within a block's worth).

Every 250 iterations I output the statistics again. Below is an example of the statistics on iteration 10250 (meaning that after the 10,250th I/O operation it throws an exception and, if it was a write, changes the target block or blocks to random data).

Beginning iteration 10250
Stream Test0 (ID 0) has 12999 records from 13501 to 26500
Stream Test1 (ID 1) has 10460 records from 13424 to 26499
Stream Test2 (ID 2) has 6499 records from 13501 to 26499
Stream Test3 (ID 3) has 8666 records from 13501 to 26500
Stream Test4 (ID 4) has 12999 records from 13500 to 26499
Stream Test5 (ID 5) has 10909 records from 13408 to 26499
Stream Test6 (ID 6) has 12890 records from 13346 to 26499
Stream Test7 (ID 7) has 12854 records from 13501 to 26499

We can see from this example that the tails are pretty close together and so are the ratios, which gives a good impression that the transaction and recovery handling is not just passing the strict requirements but also is performing reasonably. An additional piece of evidence is that the total number of records is close to the non-failing example.

Milestone 3 Caveats

In my last status update I forgot to mention that not everything is clean -- there are some limitations:

29-Jan-2009

Summary: Although 2 months have passed without a progress report, a lot of work has been done. The version on 25-Nov was not capable of space management (reclaiming the oldest block to write new data), nor was it capable of handling corruption. 39 revisions of code later, RealDB now has space management and is capable of handling file storage failures at any point. Tentatively I am calling milestone 3 completed, but I want to clean up a few loose ends.

Features/Details of work

Milestone 3 Acceptance Test

As an acceptance test to confirm completion of milestone 3, created a JUnit-based automated test similar to the one in milestone 2. The key component in this test is the usage of the FailingBlockFile, which allows throwing an IOException after x calls into the API and, if the call was a write, overwrites the entire block (or set of blocks if a multi-block call) with random data. The steps in this test are as follows:

  1. Create a database with an RDL file input and pointed to an in-memory BlockFile implementation (ByteArrayBlockFile)
  2. Wrap the file with a ProfilingBlockFile and run the test, and record the number of I/O operations performed (7969 in my test):
    1. Write 200,000 unique, ordered, records into a single stream. The database is small enough that 200,000 records is enough to fill up the whole database many times.
    2. Every 500 records, perform a manual flush operation.
    3. Close the database and reload it (creating a wholly new set of objects)
    4. After the write is complete, re-read the database ensure that the database contains a set of records that is entirely in order, and includes records in sequence up to the last flushed or written record. For example, if we write records 0 to 1345, then the last record flushed is 1000. If, for example, the database contains records 567 to 1000, or 950 to 1340, those are considered passing tests. If the records are in a different order or data before the last flush is lost (355 to 900 for example), the test fails.
  3. For every integer X from 1 to the number of operations performed:
    1. Rerun the test, this time with a FailingBlockFile that fails after X operations are performed, overwriting the target block(s) with random data if a write fails.
    2. Skip to step 3 of the test when a failure occurs: reload the database and re-read all records. The database should reload without exceptions (recover in 100% of cases), and not have lost any records from before the last flush, and all records must be in order and contain the correct data.

The test runs in 461 seconds on my machine to loop over failures in each of the 7969 operations. It is important to note that the 7969 operations even includes the operations to actually create the database from an empty file, so the test also tests for failures when creating the database. However, RealDB does not guarantee that you can reload such a database. If the database throws IOException before it is built, the test does not attempt to read any data and passes.

Transactions

Unfortunately I had to end up implementing a lightweight transaction system in RealDB. I wanted initially to stay away from transactions and rely more on "checkpoints", which scale with the number of flush operations. However, to adhere to RealDB's guarantees, I only have to do transactions on the two data index operations: Moving a block from the free pool to an index and moving a block from one index to another (or to itself).

The key feature in the transaction system to improve the performance is to not only let multiple transactions be pending at the same time, but also allow the operations to be scheduled for a different time. Each transaction is a list of operations represented as object instances implementing java.io.Flushable. The first operation of a transaction always is to open the transaction by writing it to disk. The add operation returns a Flushable object itself of a task that must be performed if the given operation is performed. This ensures that operations are performed always in order, which is essential for allowing transaction recovery.

An example might be a transaction that requires the steps A, B, C, D, E. A coordinator gets a request to perform an operation, which is to be done as a single transaction. It schedules A, B, C, D, and E to occur in order, although none of the operations actually occur yet, through calls like "addBlock( int block, Transaction tx );" If commit is called directly on the transaction, all 5 operations will occur at that time, in order. But of course there is more than one piece of data in every block in RealDB. That necessarily means that if a block is committed on an index, for example, a whole number of transactions that may be affected.

Let's say that D is the operation to add the item to the index. The coordinator calls index.addBlock( block, tx ). The index calls tx.addAction( D ) and that method returns back an object that performs {A,B,C} in order. If index.flush() is called, say because the client wants to commit some other operation "F", this might mean that D is about to be written to disk. Therefore, before the flush occurs, the index calls its object, which performs {A,B,C}, then it performs D directly. A Transaction remebers what is already done, so if later a piece of code asks for C to happen, it looks to run {A,B,C} and sees that those are already completed, and therefore skips those steps.

I am still currently concerned somewhat about circular dependencies, so I will have to make sure that what can be a transaction is managed properly and if a circular dependency could happen, it forces an early commit of part of a transaction. In fact, this is what happens on a smaller scale when a transaction occurs to transfer a block from the end of a stream to the front of that same stream. To properly support transaction recovery, I have to make sure that the removal happens before the add -- and because the head and tail of the queue could rest in the same physical block, it is possible that the block cannot be removed before it is added. It would be possible to perform this operation entirely without a Transaction, but that would require the coordinator to have a too deep of an understanding of the DataIndex implementation.

Future Plans

I would like to try to solidify that I've met milestone 3 by testing a multiple stream database and testing on real flash card hardware and creating faults by actually pulling the drive out of the socket as it is writing. I have a high confidence that both situations should work because:

  1. The multiple-stream transaction handling is actually simpler than single-stream, because in the single-stream case I have to move a block from a stream to itself in one transaction, which requires extra effort to track, especially when the head and tail of the stream are in the same physical block.
  2. I have a high confidence that this will work on real hardware because I tested failures at each of the 7969 possible failure points on the in-memory fail in the previous test.

I would also like to extend the test a bit to check to make sure that no free blocks are unaccounted for. The database could "leak blocks" but still pass in the scenarios above.

25-Nov-2008

Next Steps to complete milestone 3:

28-Oct-2008

12-Oct-2008

29-Sept-2008

2-Sept-2008

1-Sept-2008

18-Aug-2008

7-Aug-2008

24-July-2008

23-July-2008

17-July-2008

13-July-2008

12-June-2008

27-May-2008

20-May-2008