RealDB
Master's project by Jason Winnebeck
Committee
Abstract
RealDB (RDB) will be a real-time, embeddable database system for data streams.
It is a non-relational database that models each stream as a value changing
over time, recorded as sequential, timestamped data. To suit this environment,
the system will have low overhead and deliver high performance; the trade-off
is a narrower scope and fewer guarantees than a traditional relational database
management system (RDBMS) would deliver.
Summary
The main goals of this work are:
- Create a storage engine for live data streams that performs
better than traditional relational database systems, by making applicable
assumptions and tradeoffs for the specific situation.
  - Faster: constant-time insertion and deletion of records, with respect
  to the size of the database
  - Faster: logarithmic lookup time for data within a data stream, with
  linear time to read records
  - Smaller: tightly compacted dataset with little or no indexing required,
  because data is assumed to arrive in order
- Represent data streams as a continuous value over time, rather than as
a set of discrete samples or rows
  - Smaller: reduces the amount of data stored by allowing reconstruction
  of the signal from a reduced data set
  - Smarter: allows calculation of averages, integrals, etc. over timespans,
  based on the reconstruction.
For more details, please see the latest proposal.
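The lookup and reconstruction goals above can be illustrated with a minimal sketch. It is not the RealDB design: the class name `StreamSketch`, the in-memory parallel arrays, and the choice of linear interpolation and the trapezoid rule are all assumptions made for illustration. It shows why in-order timestamps allow a logarithmic lookup with no separate index, and how an average over a timespan can be computed from the reconstructed signal.

```java
import java.util.Arrays;

/** Hypothetical sketch: one stream held as parallel arrays sorted by timestamp. */
public class StreamSketch {
    private final long[] times;    // strictly increasing timestamps
    private final double[] values; // sample recorded at each timestamp

    public StreamSketch(long[] times, double[] values) {
        if (times.length != values.length || times.length == 0) {
            throw new IllegalArgumentException("need matching, non-empty arrays");
        }
        this.times = times.clone();
        this.values = values.clone();
    }

    /** Reconstructed value at time t; linear interpolation is an assumption. */
    public double valueAt(long t) {
        if (t < times[0] || t > times[times.length - 1]) {
            throw new IllegalArgumentException("time outside stream range");
        }
        int i = Arrays.binarySearch(times, t); // O(log n): data is already in order
        if (i >= 0) {
            return values[i];                  // exact sample exists
        }
        int hi = -i - 1;                       // first sample after t
        int lo = hi - 1;                       // last sample before t
        double frac = (double) (t - times[lo]) / (times[hi] - times[lo]);
        return values[lo] + frac * (values[hi] - values[lo]);
    }

    /** Integral of the reconstructed signal over [from, to] (trapezoid rule). */
    public double integral(long from, long to) {
        if (from > to) {
            throw new IllegalArgumentException("from must not exceed to");
        }
        double sum = 0.0;
        long prevT = from;
        double prevV = valueAt(from);
        int i = Arrays.binarySearch(times, from);
        int next = (i >= 0) ? i + 1 : -i - 1;  // first sample strictly after 'from'
        for (; next < times.length && times[next] < to; next++) {
            sum += (times[next] - prevT) * (prevV + values[next]) / 2.0;
            prevT = times[next];
            prevV = values[next];
        }
        sum += (to - prevT) * (prevV + valueAt(to)) / 2.0; // final partial segment
        return sum;
    }

    /** Time-weighted average of the reconstructed signal over [from, to]. */
    public double average(long from, long to) {
        return (from == to) ? valueAt(from) : integral(from, to) / (to - from);
    }
}
```

With samples (0, 0.0), (10, 10.0), (20, 0.0), a query such as `valueAt(5)` falls between two stored samples and is answered from the reconstruction rather than from a stored row.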
Current Status
- The RealDB project proposal is finished. The final proposal is SVN
revision 81 (7-Aug-2008).
- Prototyping the highest level of the code design, with the goal of making
sure that each concept is represented and has some place in the design.
- Developing the physical DB structure and an object model to represent it.
Schedule
- Milestone 1
  - First design phase completed: creation of object model and some stub functionality
  and tests. Setup of environments and compilers such as GCJ, and prototypes
  for high-risk code.
- Milestone 2
  - Creation of maintenance tools to create an empty database on disk, and
  simple storage implementation (single stream, no or incomplete space management)
- Milestone 3
  - Completion of database metadata and functionality required for writing,
  including space management but excluding compressed data storage
- Milestone 4
  - Completed research and implementation of compressed data storage and gathering,
  reconstruction algorithms, and read functionality including APIs and query
  tool
- Milestone 5
  - Completion of proof-of-concept use for RealDB
- Milestone 6
  - Design and implementation of the RealDB version of the performance tool, and
  completed design for solving the problem using other solutions
- Milestone 7
  - Completion of performance tool versions for MySQL InnoDB, MySQL MyISAM,
  and Apache Derby
The project schedule, from the final proposal:
| Target | Planned | Completed | Percent Complete |
| Preproposal | 2006-10-17 | 2006-10-17 | 100% |
| Preproposal Presentation | 2006-10-17 | 2006-10-17 | 100% |
| Proposal Approved | 2008-07-31 | 2008-08-11 | 100% |
| Milestone 1 | 2008-08-31 | | 95% |
| Milestone 2 | 2008-08-31 | | 15% |
| Milestone 3 | 2008-09-08 | | |
| Milestone 4 | 2008-09-22 | | |
| Milestone 5 | 2008-10-20 | | |
| Milestone 6 | 2008-11-10 | | |
| Milestone 7 | 2008-11-24 | | |
| Report | 2008-12-15 | | |
| Defense | 2009-01-15 | | |
Progress Reports
18-Aug-2008
- Significant development of metadata objects and unit tests for creating
and loading the MetadataSection. It may still need some rework, but it is
getting there.
- Creation of an in-memory BlockFile implementation for testing.
- Design for the data index section worked out on paper, including the
details of handling safe commits.
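To show why an in-memory block store is convenient for unit tests, here is a minimal sketch. The project's actual BlockFile API is not published on this page, so the `BlockFile` interface below and all of its method names are hypothetical; only the idea of swapping a disk-backed implementation for a memory-backed one comes from the entry above.

```java
import java.util.Arrays;

public class MemoryBlockFileSketch {
    /** Hypothetical block-level interface; the real BlockFile API may differ. */
    public interface BlockFile {
        int blockSize();
        int blockCount();
        byte[] readBlock(int index);
        void writeBlock(int index, byte[] data);
    }

    /** Fixed-size block store kept entirely in memory, for tests only. */
    public static final class MemoryBlockFile implements BlockFile {
        private final int blockSize;
        private final byte[][] blocks;

        public MemoryBlockFile(int blockSize, int blockCount) {
            if (blockSize <= 0 || blockCount <= 0) {
                throw new IllegalArgumentException("sizes must be positive");
            }
            this.blockSize = blockSize;
            this.blocks = new byte[blockCount][blockSize]; // starts zero-filled
        }

        public int blockSize() { return blockSize; }

        public int blockCount() { return blocks.length; }

        public byte[] readBlock(int index) {
            return blocks[index].clone(); // copy so callers cannot alter stored data
        }

        public void writeBlock(int index, byte[] data) {
            if (data.length != blockSize) {
                throw new IllegalArgumentException("data must be exactly one block");
            }
            System.arraycopy(data, 0, blocks[index], 0, blockSize);
        }
    }

    public static void main(String[] args) {
        BlockFile file = new MemoryBlockFile(4, 3);
        file.writeBlock(1, new byte[] {1, 2, 3, 4});
        System.out.println(Arrays.toString(file.readBlock(1)));
    }
}
```

Because tests depend only on the interface, the same test code could in principle run against a disk-backed implementation later.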
7-Aug-2008
- Updated proposal (SVN revision 81): adjusted the schedule for milestones
1 and 2 slightly, and made notes about improving references in the final
report.
- Wrote out on paper a DB structure, breaking it up into four sections:
file header, metadata section, data index section, and data section. Started
object model to construct these and some code to write them on disk.
- Performed the final round of testing on Linux again to confirm the earlier
results, including operation on the raw partition. Currently only the Sun
JVM is capable of using raw partitions (results log).
24-July-2008
- I think I have a handle on the corruption. My thought is that when writing
synchronously, most of the time is actually spent not writing, so it was
just very hard to cause corruption by pulling out the card. Whenever
I did cause corruption, I always got 128 KB blocks written out. Because
I am writing random data, I can't confirm whether some block management
system is giving the appearance of fail-safe 128 KB blocks or whether the
blocks are being filled with random content. Either way, I now feel confident
enough to continue with the project: there is hardware with an upper bound
on the amount of corruption, and it won't corrupt any previously written
blocks (results log). My next step is probably to confirm that everything
works in Linux, including on a raw partition, and then move on with the
actual design.
23-July-2008
- Updated proposal (revision 72) to alphabetize
references and updated schedule.
17-July-2008
- Improvements and more tests with CorruptionTest experimental code
  - Still trying to quantify and qualify my assertions about whether
  corruption occurs as I expected. Still not producing corruption in Windows,
  somehow even with 1 MB buffers. I really don't think that I am using
  the checksum improperly.
  - Improvements to the file layout produced by CorruptionTest: added
  block numbers and timestamps to each block, random data, and a CRC32 checksum.
  - Improvements to the error detection and reporting, to check
  for more possible fault conditions that might otherwise look like success.
- Results:
  - July 13 (Windows): tried 256-byte block size, saw different caching
  behavior. Could not cause corruption.
  - July 14 (Windows): the test using timestamps in the blocks either worked
  (usually) or produced a drive that Windows wouldn't read (once); still
  could not get a corrupted file.
  - July 16 (Windows): using "rwd" synchronous writing mode now, tried
  256-byte block sizes. Could not corrupt the disk. I could get a file
  partially written, but any data that I wrote came out perfectly, i.e. no
  blocks with garbage, either 100% old content or 100% new content in each
  block. This even happened with 64 KB and 1 MB block sizes, which I can't
  believe!
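The block layout described in this entry (block number, timestamp, random data, CRC32 checksum) can be sketched as follows. This is not the CorruptionTest source, which is not published here; the class name, field order, and 8-byte field widths are assumptions. The calls used are standard Java: `java.util.zip.CRC32` for the checksum and `RandomAccessFile` opened in "rwd" mode, which asks that each update to the file's content be written synchronously to the underlying storage device.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.zip.CRC32;

public class ChecksummedBlockSketch {
    static final int HEADER = 8 + 8; // block number + timestamp (assumed widths)
    static final int TRAILER = 8;    // CRC32 value stored as a long

    /** Builds one block: number, timestamp, random payload, then the checksum. */
    public static byte[] encode(long blockNumber, long timestamp, int blockSize, Random rnd) {
        if (blockSize <= HEADER + TRAILER) {
            throw new IllegalArgumentException("block too small for header and checksum");
        }
        byte[] payload = new byte[blockSize - HEADER - TRAILER];
        rnd.nextBytes(payload);
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        buf.putLong(blockNumber).putLong(timestamp).put(payload);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, blockSize - TRAILER); // checksum covers all but the trailer
        buf.putLong(crc.getValue());
        return buf.array();
    }

    /** True when the stored checksum matches the block's contents. */
    public static boolean isIntact(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length - TRAILER);
        long stored = ByteBuffer.wrap(block).getLong(block.length - TRAILER);
        return stored == crc.getValue();
    }

    /** Writes 'count' blocks in order using synchronous "rwd" mode. */
    public static void writeBlocks(File file, int count, int blockSize) throws IOException {
        Random rnd = new Random();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rwd")) {
            for (long n = 0; n < count; n++) {
                raf.seek(n * blockSize);
                raf.write(encode(n, System.currentTimeMillis(), blockSize, rnd));
            }
        }
    }

    /** Counts blocks whose checksum fails or whose block number is out of place. */
    public static int countDamaged(File file, int blockSize) throws IOException {
        int damaged = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long blocks = raf.length() / blockSize;
            byte[] block = new byte[blockSize];
            for (long n = 0; n < blocks; n++) {
                raf.seek(n * blockSize);
                raf.readFully(block);
                boolean numberOk = ByteBuffer.wrap(block).getLong(0) == n;
                if (!isIntact(block) || !numberOk) {
                    damaged++;
                }
            }
        }
        return damaged;
    }
}
```

A block that is 100% old content or 100% new content passes `isIntact`, while a block mixing the two would be expected to fail the checksum, which is the distinction the results above rely on.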
13-July-2008
- Earlier work from June:
- Installation and set up of development environments (Sun J2SE 6 Windows
XP/Linux and Ubuntu Linux 8.04 with GCJ 4.2.3)
- Test accessing/creating/writing fixed size files in all three environments,
everything works except that GCJ 4.2.3 is not able to properly open
special block device files on Linux (i.e. /dev/hdb1).
- Creation of initial layout of classes and packages with some very preliminary
prototyping of some core interfaces; only low-level BlockFile API implemented.
Expectation that this will change as the design progresses.
- Creation of CorruptionTest class that uses BlockFile API to perform various
file IO operations to further test accessing flash hardware and test how
corruption results in power-loss scenarios and plugging in/out cards while
writing. These tests were done on a fixed-size file on a FAT16-formatted
compact flash card.
- Results of the tests done so far with Windows XP: could not get corruption
to occur in a handful of trials, except once, which produced a card that,
when I attempted to read the file, caused the hardware to shut off (or the
driver in XP cut off the USB port). Windows does not perform much caching
on the write side.
- Results of the tests done so far with Linux: corruption did occur, but
multiple blocks were corrupted, which was not what I expected. Linux also
performs heavy caching of both writes and reads.
- Future work: follow-up research to further understand current results
- Improve code to provide more details on what is corrupted and how
rather than just detecting that it happened and in what quantity.
- Try to use "synchronous" writing mode (as provided by Java
APIs), and possibly sync/flush in Linux
- Determine the effects of different block sizes
- Perform the tests (Linux only) on raw partitions, without file systems
- No source code yet until proposal is finalized and license/IP issues are
sorted out.
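One of the future-work items above is trying synchronous writes and an explicit sync/flush. As a hedged sketch of the explicit-flush option in standard Java, the code below writes a block through a `FileChannel` and then calls `force(false)`, which asks that the file's content be pushed to the storage device before returning. The class and method names are hypothetical, and whether a given operating system, driver, or flash card honors the request is exactly what such tests would need to find out.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ForcedWriteSketch {
    /** Creates an empty temporary file for experiments. */
    public static File newTempFile() {
        try {
            return File.createTempFile("forced-write", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one block at 'offset' and returns only after force() completes. */
    public static void writeBlockAndForce(File file, long offset, byte[] block) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            ByteBuffer buf = ByteBuffer.wrap(block);
            long pos = offset;
            while (buf.hasRemaining()) {
                pos += channel.write(buf, pos); // positional write; may be partial
            }
            channel.force(false); // false: content only, metadata not required
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads 'length' bytes back from 'offset' for verification. */
    public static byte[] readBlock(File file, long offset, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] block = new byte[length];
            raf.seek(offset);
            raf.readFully(block);
            return block;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Opening the file in plain "rw" mode and forcing after each block keeps the flush point explicit, so the block size can be varied independently of when the flush happens.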
12-June-2008
27-May-2008
- Updated draft proposal (revision 39)
- Write proof-of-concept section.
- Rewrite of schedule section, break out into milestones, and better
date estimates.
- Place for signatures on front page
20-May-2008
All content on this web site is copyright © 1998-2006
by Jason Winnebeck, unless otherwise noted.