Archive for May, 2010

VMWare Needs a New File System

I can’t be the first person to say this, but just in case I am, here’s what I’m thinking.

I don’t think VMWare can continue on their current path, focused so heavily on the cloud, without a new approach to storage.  Why?  Simple: cost and scale.  By scale I don’t mean 10, 100 or even 500 nodes, but thousands of nodes.  Current solutions, block storage with VMFS or NAS with NFS, rely on costly external systems from the likes of NetApp, HP, EMC and others, which adds complexity and cost and limits how far you can scale.  I don’t say this lightly; as someone who has actually designed a cloud computing offering on VMWare, I’ve seen the limitations first hand.

What I’m proposing is a replacement for VMFS, a file system that isn’t tied to the traditional SAN/NAS approach to storage.  What would this look like?  Many of the very servers that run VMWare already have local storage, anywhere from a single drive to massive 32+ drive internal SAS arrays.  The problem with using internal storage today is that none of the other hosts can see the VMDKs, so VMotion and the like don’t work.  What I’m proposing is pooling all of those drives and spindles into a large storage cluster that every VMWare ESX host can access.

To be clear, I’m not talking about simply running a VM on each VMWare host as a means to this end, but rather having VMWare natively make use of all this storage within ESX itself.  In much the same way VMFS is layered on top of a LUN, the local storage in standard servers could be clustered together across a 10Gb network to form one large pool.  This addresses both cost and scale.  Cost, because adding 4+ 1TB drives to a typical 1U server isn’t very expensive.  Scale, because with this approach every time you add an ESX host you’re adding storage, CPU and networking, not just CPU and networking.
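As a rough back-of-the-envelope sketch of that scale argument (the drive count, drive size and replication factor below are my own illustrative assumptions, not vendor figures), the usable pool grows linearly with the number of hosts:

```python
# Back-of-the-envelope: how aggregate storage scales with ESX host count.
# All figures below are illustrative assumptions, not vendor numbers.

DRIVES_PER_HOST = 4          # e.g. four local drives added to a 1U server
DRIVE_CAPACITY_TB = 1.0      # 1 TB per drive
REPLICATION_FACTOR = 2       # each block kept on two hosts for redundancy

def usable_capacity_tb(hosts: int) -> float:
    """Raw local capacity across the cluster, divided by the replica count."""
    raw = hosts * DRIVES_PER_HOST * DRIVE_CAPACITY_TB
    return raw / REPLICATION_FACTOR

for hosts in (10, 100, 1000):
    print(f"{hosts:5d} hosts -> {usable_capacity_tb(hosts):8.1f} TB usable")
```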

What would this file system look like?  Perhaps like IBM’s GPFS, or Hadoop’s HDFS (I’m not a fan of the single NameNode, but that’s a different blog post), or something completely new.  I believe something completely new would provide more flexibility than bending one of these off-the-shelf solutions to fit, but the general approach would be the same.  This isn’t pie in the sky; it’s how IBM runs its large supercomputers, and if it can be done for those, it can be done for this.

Each VMWare host becomes part of the greater storage cluster, not at the VM level but natively within ESX itself.  Think of the VMDKs as objects: writes are replicated across the data center, and VMotion works in much the same way as it does with VMFS on a traditional LUN today.  Or, even better, have VMWare provide the means for storage companies such as IBM, HP, EMC and others to provide their own “File System Plugin” – storage becomes software, in the same way that servers, firewalls and network switches are now software thanks to virtualization.  Virtualize your storage on your virtualization platform, not externally.
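To make the plugin idea concrete, here’s a minimal sketch of what such a contract might look like.  Every class and method name here is invented for illustration; nothing below is an actual VMWare API.

```python
# Hypothetical sketch of a pluggable storage layer for the hypervisor.
# None of these names are a real VMWare API; they are invented for illustration.
from abc import ABC, abstractmethod

class StoragePlugin(ABC):
    """Contract a storage vendor could implement so the hypervisor can treat
    VMDKs as objects replicated across the hosts' local disks."""

    @abstractmethod
    def write_block(self, vmdk_id: str, offset: int, data: bytes) -> None:
        """Persist a block locally and replicate it to peer hosts."""

    @abstractmethod
    def read_block(self, vmdk_id: str, offset: int, length: int) -> bytes:
        """Return a block, served from a local or remote replica."""

    @abstractmethod
    def locate_replicas(self, vmdk_id: str) -> list[str]:
        """List the hosts holding replicas, so VMotion can pick a target."""
```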

Take this one step further and you could have different ESX hosts with different storage types – some nodes with SAS, some with SATA, and still others with SSD.  It would even be possible to have non-storage nodes that contribute nothing to the overall storage in the cluster but provide additional CPU to run VMs, or dedicated storage nodes that don’t run VMs at all, though that’s not as ideal to me as having every node contribute some storage.

Another step up in the stack, VMs could be assigned to “storage types”, so that a database VM could be placed on storage of the “SAS” type, or on mixed tiers such as SSD/SAS or SAS/SATA, giving VMWare a native ILM approach.  VMotion, DRS and the rest would all become aware of each VM’s storage needs; VMWare would know what storage each VM is provisioned on and the performance that storage is delivering to it.  Blocks could be stored on multiple nodes depending on redundancy requirements.  Have an important database?  Keep 3+ copies distributed across the cluster.  A simple web server?  Perhaps keep 2 copies, or even just 1.
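A minimal placement sketch of that idea, assuming each host advertises a storage type and each VM declares a required type and replica count.  The host names, types and selection policy are all hypothetical, just to show the shape of it:

```python
# Minimal replica-placement sketch: pick hosts whose storage type matches
# the VM's requirement, up to the requested copy count.
# Host names, types and the first-fit policy are illustrative assumptions.

HOSTS = {
    "esx01": "SAS",
    "esx02": "SAS",
    "esx03": "SAS",
    "esx04": "SATA",
    "esx05": "SSD",
    "esx06": None,   # compute-only node contributing no storage
}

def place_replicas(storage_type: str, copies: int) -> list[str]:
    """Pick hosts whose local storage matches the requested type."""
    candidates = [host for host, kind in HOSTS.items() if kind == storage_type]
    if len(candidates) < copies:
        raise RuntimeError(f"only {len(candidates)} {storage_type} hosts for {copies} copies")
    return candidates[:copies]

# An important database VM: three copies on SAS-backed hosts.
print(place_replicas("SAS", 3))    # ['esx01', 'esx02', 'esx03']
# A simple web server: a single copy may be enough.
print(place_replicas("SATA", 1))   # ['esx04']
```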

Extend this to the data center level: replicate data to other data centers, with your “active” file system in your production data center and your “backup” file system in another data center hundreds of kilometers away.  You’re not replicating the entire file system on a schedule; you’re replicating the block writes to the VMDKs in the clustered file system.   Taken to the extreme, this would let you run an application in any data center at any time with little more than a VMotion to the other site.  It’s no longer about having a “production data center” and a “DR data center”, but about running the apps in the data center best suited to the given workload, or perhaps the one that currently costs less per kWh.
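One way to picture that block-level replication between sites is the toy sketch below.  It is only a sketch under simplifying assumptions – acknowledgement, ordering, back-pressure and failure recovery are all glossed over, and the names are invented:

```python
# Toy sketch of replicating VMDK block writes to a second data center.
# Writes commit locally and are shipped asynchronously; error handling,
# ordering and back-pressure are deliberately omitted.
from collections import deque

class SiteReplicator:
    def __init__(self) -> None:
        self.local_store: dict[tuple[str, int], bytes] = {}
        self.pending: deque = deque()

    def write(self, vmdk_id: str, offset: int, data: bytes) -> None:
        """Commit a block locally, then queue it for the remote site."""
        self.local_store[(vmdk_id, offset)] = data
        self.pending.append((vmdk_id, offset, data))

    def drain_to(self, remote: "SiteReplicator") -> None:
        """Ship queued block writes to the backup data center."""
        while self.pending:
            vmdk_id, offset, data = self.pending.popleft()
            remote.local_store[(vmdk_id, offset)] = data

production = SiteReplicator()
backup = SiteReplicator()
production.write("db01.vmdk", 0, b"block-0")
production.drain_to(backup)
print(backup.local_store[("db01.vmdk", 0)])   # the backup site now has the block
```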

EMC’s recent announcement of VPLEX achieves some of what I’m after, but it’s yet another box (or boxes) that isn’t directly part of the VM infrastructure.  From what I’ve read it also seems to be an FC solution, so again it doesn’t address the complexity, cost and scalability issues inherent in an FC deployment.  Scaling FC to thousands of nodes isn’t practical for many reasons; a single clustered network storage option would address that and more.

Perhaps I’m dreaming, but I think this is completely doable; VMWare just has to realize it’s needed.
