High-performance computing

Hypertable Podcast

Zvents software architect, Doug Judd, discussed a short code example from soon be released open-source distributed data storage system, Hypertable, with LinuxWorld. In the podcast (Download: mp3 / iTunes) he explains the basics and high performance of Hypertable which is designed to support applications requiring maximum performance, scalability, and reliability.

Since Hypertable is starting from the ground up, also discussed are the system’s goals, strengths, and limitations with some insight to the state of the art in project infrastructure, including git for revision control, cmake for build, and Google Code for issue tracking.

Listen now: mp3 / iTunes


High performance computing hypertable linuxworld open source podcast
, , , ,

If you enjoyed this post, make sure you subscribe to my RSS feed!

Comments

The Google File System (and How It Can Be Improved)

Over the past several years, Google has built and deployed large scale distributed software infrastructure for the purpose of storing and processing the massive amount of data that they collect. Three of these systems are fundamental and underpin their entire business. They are: The Google File System, Map-Reduce, and Bigtable. They’ve described these systems in detail in three separate papers (see http://research.google.com/pubs/papers.html).

The GFS was designed to reliably store a massive amount of data (i.e. petabytes) in such a way that allows it to be efficiently processed. To achieve this goal in the most cost effective way possible, the GFS was designed to be run on large clusters of commodity hardware (i.e. the commodity PC and gigabit ethernet). The following list highlights some of the important features of GFS:

  • Global namespace. GFS provides a global namespace in the sense that any machine or process can connect to it and see the whole thing. It also runs entirely in user space and doesn’t require system administrator privileges to connect to it. NFS, on the other hand, requires you to mount many “islands” of filesystem into your own local filesystem requiring system administrator privileges to do so.
  • High Availability. Given the unreliable nature of commodity hardware, GFS has been designed to constantly check for and react to machine failures. High data availability is achieved through inter-machine replication. GFS breaks each file into chunks (default is 64MB) and each chunk is replicated across some number of machines (default is 3). Unlike RAID, it can withstand all types of machine failures (e.g. Power supply, memory, network ports) as opposed to just disk failures. Chunks are also replicated inter-rack to protect against correlated failures caused by failures in a network switch or power circuit.
  • Efficient processing. By splitting data into 64MB chunks and distributing the chunks across a large cluster of machines, the data can be processed efficiently in parallel. This is achieved by pushing processing jobs to the machines that store the chunks and running the computation at the storage nodes in parallel. This has enormous benefit since the computation requires almost no network I/O to read the data that it is processing (see Map-reduce).
  • Atomic record append. Traditional Unix filesystems support the ability to open a file in O_APPEND mode, which instructs the system to append all writes to the end of the file. Unfortunately, most implementations have a race condition where concurrent appends can conflict with one another causing one to overwrite the other. GFS supports true atomic append where the client specifies only the data to be appended and the system will append it at least once, atomically, returning the offset of where it was written, back to the client. This operation is used heavily at Google and allows files to be used as producer/consumer queues with multiple producers.
  • Snapshot. A file or directory tree can be copied almost instantaneously with the snapshot operation. To achieve this fast copy, the systems copies only the metadata and marks all of the underlying chunks copy-on-write. Only if a chunk gets modified will a new copy get created with the modification, otherwise the chunks are shared between the original files and the snapshotted files.

There are several aspects of the GFS design that could be improved given the context of Google’s usage of GFS today and the other distributed systems that have been built up around it.

The first improvement would be to get rid of the random write and support only record append. The GFS paper repeatedly emphasizes that the write workload Google typically sees consists of large streaming writes. Small random writes are almost non-existent. Since the paper was published, Google has developed a database system called Bigtable that sits on top of GFS. It is designed to efficiently handle small random updates by caching them in memory and periodically spilling data sequentially to the GFS. The number of applications that require small random updates, but where Bigtable is inappropriate, is effectively zero. By eliminating this feature, a considerable amount of complexity gets dropped from the system. Reduction of complexity in a software system generally leads to better quality and maintainability.

The next place where I see room for improvement is the fixed (64MB) chunk size. Most applications operate on fixed or variable sized records. If the file system is made aware of these record boundaries, then it could vary the size of its chunks to accomodate whole records only. With a fixed 64MB chunk size the last record in a chunk usually gets truncated, causing the application to read data from multiple locations to assemble the whole record. This isn’t such a big deal in many applications, but can cause scaling problems in the Map-reduce system. Map-reduce achieves much of its efficiency by pushing computation out to where the data is physically stored and running it locally. If, for every chunk, the last record in the chunk is truncated, then the system must fetch the other part of the record to process it, which often involves communicating with another node on the network. At scale and under load, every additional network round-trip negatively impacts performance. If the file system takes into account application record boundaries, then it can make more intelligent placement decisions and therefore reduce overall network load.

The last place in GFS where I feel there could be an improvement is in its data consistency model. Under certain conditions where there are concurrent modifications at the same location, records can get duplicated, truncated, and/or padding (e.g. dead space) can get written into the file. GFS leaves it up to the application to write self-validating and self-identifying records to guard against these situations. This just does not seem right. It places a big onus on the application. I suspect that if GFS moved to a record append only model, that it would be much easier to provide a fully consistent data model.

There are several efforts underway to build open source implementations of this distributed computing infrastructure. Zvents has been building an implementation of a Bigtable-like distributed database called Hypertable. Look for an announcement coming on this soon.

- Doug Judd


High performance computing

If you enjoyed this post, make sure you subscribe to my RSS feed!

Comments (1)