% fortune -ae paul murphy

SMP/CMT, ZFS, and RDBMS internals

Look inside one of Sun's current "Niagara" or "CoolThreads" machines and what you'll see is an early market product reflecting a deep commitment to migrating what we naively think of as SMP infrastructure onto the same piece of silicon as the processors. Thus one key reason an eight core T2000 at 1.2GHz is a lot faster than a 24 processor UltraSPARC II machine at 400MHz, while using much less power, is essentially that data transfers happen less frequently, cover shorter distances, and do so with more parallelism.

Look inside ZFS and you'll see the traditional filesystem rethought for hardware environments in which disks are much larger and much faster, and are used with machines that have lots of I/O bandwidth and cheap memory.

Both the Oracle RDBMS and Sybase ASE work quite well on Niagara machines, whether configured to use ZFS or a more traditional file system with a volume management overlay. Thus loading old code, or continuing with old ideas, works in this environment.

It's possible, however, to make old code run better in the Niagara/ZFS world by making some minor changes in sysadmin/DBA thinking. Specifically:

  1. dropping the raw disk idea makes sense;

    In traditional systems, raw disk was always faster (and more predictable) for RDBMS I/O than either direct I/O (essentially cooked files with the file buffer size set to zero) or standard I/O. That isn't true with ZFS: it already bypasses traditional Unix file system buffering, so faking raw I/O on top of it just adds overhead (a minimal sketch of the ZFS alternative follows this list).

  2. changing RAID configurations makes sense too.

    With traditional raw I/O or file system support, putting a CPU and memory on the disk pack to offload read/write buffering and the overheads associated with disk striping and mirroring made sense. With ZFS/CMT it doesn't, because doing so adds cost and points of failure without producing significant gains in either performance or reliability.

    Instead you now add memory to the host computer and configure each available JBOD as two RAID-Z pools - preferably on separate controllers, as in the second sketch after this list. That way there's no intermediate RAID controller to choke transfers between your application and the buffered data, while ZFS handles I/O balancing, striping, and recovery with no extra effort, or risk of error, on your part. Then, when I/O eventually starts to bottleneck, you just add another split JBOD: two more controllers and two more storage pools.
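
For point 1, here's a minimal sketch of what dropping raw disk looks like in practice. The pool name (dbpool, created in the next sketch), the dataset names, and the 8K database block size are my illustrative assumptions, not anything Sun or the database vendors dictate:

    # dataset for datafiles, with its record size matched to the
    # assumed 8K database block size to avoid read-modify-write cycles
    zfs create dbpool/oradata
    zfs set recordsize=8k dbpool/oradata

    # redo and archive logs are written sequentially, so a separate
    # dataset left at the default record size is fine
    zfs create dbpool/oralogs

    # then point the database at ordinary files in these datasets
    # rather than at raw devices or a volume-manager layer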
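
For point 2, a sketch of the split-JBOD layout. The controller and disk names (c1t0d0 and so on) are hypothetical, and single-parity RAID-Z is assumed; mirroring or double parity is the same idea with different vdev arguments:

    # half the JBOD, on one controller, as one RAID-Z pool
    zpool create dbpool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

    # the other half, on the second controller, as a second pool
    zpool create dbpool2 raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

    # when I/O eventually bottlenecks, repeat the pattern: another
    # JBOD, two more controllers, two more pools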

Notice that neither of these adaptations, like a number of others likely to emerge with experience, requires changes to application or database code.

Now if you're a developer this is good news, because it means that your product will continue to work as Sun moves the architecture forward. On the other hand, the software business can be competitive, so developers have to be wondering what they can do to improve on just letting existing designs go along for the ride.

One option I'm thinking about - but don't know enough to really evaluate yet - would be to take advantage of both the SMP-on-a-chip hardware and the ZFS/Solaris 10 software to drive RDBMS architectural change.

Right now parallelism in query processing stages like parsing and validation is usually handled by spinning off a lightweight thread for each query - but SMP/CMT should make it possible to do steps like the actual parsing of a single query in parallel. If each query can be processed that much faster, the effect would be to automatically serialize most transactions - thus sidestepping most of the complexities imposed by the need to protect transaction integrity in environments supporting multiple concurrent transactions.

In a similar-but-different way, the open source nature of ZFS should make it possible to build a better RDBMS - in other words, to explore writing an SQL interpreter directly on top of ZFS under both Linux and Solaris.

Neither of these would be trivial to pull off, and in the end you'd just have another RDBMS engine; so why bother?

Because existing RDBMS implementations work on Niagara/ZFS, but aren't internally designed to match its balance between memory, I/O, and processing. In other words, because bringing the software into alignment with the hardware just might offer enormous speed and cost benefits.

OK, I may be dreaming in technicolor, but taking file management, atomicity control, back-up, and rollback management out of the RDBMS design problem should yield order of magnitude speed improvements - meaning that a $100K Sun pod with a 32GB Niagara and dual 2TB JBODs on pairs of Sun's Dual Channel 4Gb FC PCI-E and PCI-X cards could handle Fortune 5000 class jobs that now need an E20K - at over $1.5 million.

If that's even remotely right, it signals a tremendous opportunity for a new RDBMS product to arrive - because new developers wouldn't have a backward compatibility problem (except with respect to SQL), could expect to achieve benchmark-blasting speed, and could then expect dramatically lower maintenance and related costs "going forward", because both enabling toolsets, ZFS and SMP/CMT, are open specifications that are Sun's job, not theirs, to maintain.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specialising in Unix and Unix-related management issues.