In Memory Computing Blog Series
Part 1: In Memory Computing - A Walk Down Memory Lane
In the series’ previous post we took a walk down memory lane and reviewed the evolution of In Memory Computing (IMC) over the last few decades. In this blog post we will deep dive into a couple of specific implementations of IMC to better understand both the innovations that have happened to date and their limitations.
The Buffer Cache
Buffer caches, also sometimes referred to as page caches, have existed in operating systems and databases for decades now. A buffer cache can be described a piece of RAM set aside for caching frequently accessed data so that one can avoid disk accesses. Performance, of course, improves the more one can leverage the buffer cache for data accesses as opposed to going to disk.
Buffer caches are read-only caches. In other words only read accesses can potentially benefit from a buffer cache. In contrast, writes must always commit to the underlying datastore because RAM is a volatile medium and an outage will result in the data residing in the buffer cache getting lost forever. As you can imagine this can become a huge deal for an ACID compliant database. ☺
The diagrams below depict how the buffer cache works for both reads and writes.
For reads, the first access must always go to the underlying datastore whereas subsequent accesses can be satisfied from the buffer cache
For writes we always have to involve the underlying datastore. The buffer cache provides no benefits.
The fact that the buffer cache only helps reads is severely limiting. OLTP applications, for example, have a good percentage of write operations. These writes will track the performance of the underlying datastore and will most probably mask any benefits the reads are reaping from the buffer cache.
Secondly, sizing the buffer cache is not trivial. Size the buffer cache too small and it is of no material benefit. Size it too large and you end up wasting memory. The irony of it all, of course, is that in virtualized environments the server resources are shared across VM and are managed by the infrastructure team. Yet, sizing the buffer cache is an application specific requirement as opposed to an infrastructure requirement. One therefore ends up doing this with no idea of what other applications are running on the host or sometimes even without an awareness of how much RAM is present in the server.
API based in memory computing
Another common way to leverage IMC is to leverage libraries or API within one’s applications. These libraries or API allow the application to use RAM as a cache. A prevalent example for this is memcached. You can learn more about memcached here. Here’s is an example of how one would use memcached to leverage RAM as a cache:
foo = memcached_get("foo:" . foo_id)
return foo if defined foo
foo = fetch_foo_from_database(foo_id)
memcached_set("foo:" . foo_id, foo)
This example shows you how you must rewrite your application to incorporate the API at the appropriate places. Any change to the API or the logic of your application means a rewrite.
Here is a similar example from SQL.
CREATE TABLE FOO (…..) WITH (MEMORY_OPTIMIZED=ON);
In the SQL example note that you, as a user, need to specify which tables need to be cached in memory. You need to do this when the table is created. One presumes that if the table already exists you will need to use the ALTER TABLE command. As you know DDL changes aren’t trivial. Also, how should a user decide which tables to keep in RAM and which not to? And does such static, schema based definitions work in today’s dynamic world? Note also that it is unclear if these SQL extensions are ANSI SQL compliant.
In both cases enabling IMC means making fundamental changes to the application. It also means sometimes unnecessarily upgrading your database or your libraries to a version that supports the API you need. And, finally, it also means that each time the API behavior changes you need to revisit your application.
In Memory Appliances
More recently ISV have started shipping appliances that run their applications in memory. The timing for these appliances makes sense given servers with large amounts of DRAM are becoming commonplace. These appliances typically use RAM as a read cache only. In the section on buffer caches we’ve already discussed the limitations of read only caches.
Moreover since many of these appliances run ACID compliant databases consistency is a key requirement. This means that writes must be logged in non-volatile media in order to guarantee consistency. If there are enough writes in the system then overall performance will begin to track the performance of the non-volatile media as opposed to RAM. The use of flash in place of mechanical drives mitigates this to some extent but if one is going to buy a purpose built appliance for in memory computing it seems unfortunate that one can only be guaranteed flash performance instead.
Most of these appliances also face tremendous startup times. They usually work by pulling all of their data into RAM during startup so that it can be cached. Instrad of using an ‘on demand’ approach they preload the data into RAM. This means that startup times are exorbitantly high when data sizes are large.
Finally, from an operational perspective rolling in a purpose built appliance usually means more silos and more specialized monitoring and management that takes away from the tremendous value that virtualization brings to the table.
IMC as it exists today put the onus on the end user/application owner to make tweaks for storage performance when in fact the problem lies within the infrastructure. The result is tactical and proprietary optimizations that either make sub-optimal use of the infrastructure or simply break altogether over time. Wouldn’t the right way instead be to fix the storage performance problem, via in memory computing, at the infrastructure level? After all, storage performance is an infrastructure service to begin with. For more information, check out http://www.pernixdata.com