when no ordering is needed. To ensure these writes are all durable, the application waits for these writes to be acknowledged and then issues the cache flush command. When this command is acknowledged, further writes can be issued with the guarantee that they will be ordered after the previous writes.

3.2 Caching Policies

Our caching design is motivated by a simple principle: the caching device should provide exactly the same semantics as the physical device, as described in Section 3.1. This approach has two advantages. First, applications running above the caching device get the same reliability guarantees as they expect from the physical device, without requiring any modifications. Second, the caching policies can be simpler and more efficient, because they need to make only the minimal guarantees provided by the physical device.

In this section, we present our write-back flush and write-back persist policies. Both policies essentially implement the semantics of the write barrier. The flush policy handles destructive failures, in which the flash device may not be available after a client failure, while the persist policy handles recoverable failures, in which the flash device is available after a client failure. The choice between these policies in a given environment depends on the type of failure that the storage administrator anticipates.

3.2.1 Write-back Flush

The write-back flush policy implements barrier semantics on a cache flush request by flushing dirty data on the flash device to storage. The flush process sends these blocks to storage, issues a write barrier on the storage side, and then acknowledges the command to the client.

Similar to write-through, the write-back flush policy does not require any recovery after failure. Any dirty data cached on either flash or storage since the last barrier may be lost, but it is not expected to be durable anyway. As a result, this policy is resilient to destructive failure.

The main advantage of this approach over write-through caching is that writes have lower latency, because dirty blocks can be flushed to storage asynchronously. Moreover, it provides stronger guarantees than vanilla write-through caching because it handles storage failures as well (see Section 2). Compared to write-back caching, barrier requests take longer with this policy, which primarily affects sync-heavy workloads.

3.2.2 Write-back Persist

The write-back persist policy implements barrier semantics on a cache flush request by atomically flushing dirty cache metadata to the flash device. The dirty file-system blocks are already cached on the flash device, but we also need to atomically persist the cache metadata in client memory to flash, to ensure that blocks can be found on the flash device after a client or storage failure. This metadata contains mappings from block locations on storage to block locations on the flash device, helping to find blocks that are cached on the device.

The write-back persist policy assumes that the flash device is available after failure. During recovery, the cache metadata is read from flash into client memory. The atomic flush operation at each barrier ensures that the metadata provides a consistent state of the data in the cache, as of the time the last barrier was issued.

The main advantage of write-back persist is that its performance is close to that of ordinary write-back caching. Some latency is added to barrier requests, but persisting the cache metadata to the flash device has low overhead, given the fairly small amount of metadata needed for typical cache sizes. The drawback is that destructive failure cannot be tolerated, because large amounts of dirty data may be cached on the flash device, similar to write-back caching. Furthermore, if the client fails permanently, then recovery time may be significant because it involves moving data from flash, either over an out-of-band channel (e.g., a live CD) or by physically moving the device to another client.
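The following sketch (illustrative only; the helper functions are hypothetical stand-ins for the cache internals described in Section 4) outlines how a barrier request is handled under each policy:

/* Sketch of barrier (cache flush) handling under the two policies.
 * The helpers below are hypothetical placeholders, not real interfaces. */
#include <stdio.h>

static void flush_dirty_blocks_to_storage(void) { puts("flush dirty cached blocks to storage"); }
static void storage_barrier(void)               { puts("issue write barrier to storage and wait"); }
static void persist_dirty_map_to_flash(void)    { puts("atomically persist dirty map to flash"); }
static void ack_barrier_to_client(void)         { puts("acknowledge barrier to client"); }

/* Write-back flush: dirty data must reach storage before the barrier is
 * acknowledged, so nothing is lost even if the flash device is destroyed. */
static void barrier_writeback_flush(void)
{
    flush_dirty_blocks_to_storage();
    storage_barrier();
    ack_barrier_to_client();
}

/* Write-back persist: dirty data stays on flash; only the cache metadata is
 * made durable. (The real system also makes blocks already flushed to storage
 * durable, concurrently and atomically with the metadata persist; the steps
 * are shown sequentially here for clarity.) */
static void barrier_writeback_persist(void)
{
    storage_barrier();
    persist_dirty_map_to_flash();
    ack_barrier_to_client();
}

int main(void)
{
    puts("-- write-back flush --");
    barrier_writeback_flush();
    puts("-- write-back persist --");
    barrier_writeback_persist();
    return 0;
}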
Table 1 provides a comparison of the different caching policies. The caching policies are shown in increasing order of performance, with write-through being the slowest and write-back being the fastest caching policy. The write-back flush policy provides the same guarantees as write-through, with low write latency, but with increased barrier latency. The write-back persist policy provides performance close to write-back, but unlike the write-back flush policy, it does not handle destructive failure.

4 Design of the Caching System

We now describe the design of our caching system that supports the write-back flush and persist policies. We first present the basic operation of our system, followed by our design for storing the cache metadata and the allocation information. Last, we describe our flushing policy.

4.1 Basic Operation

Our block-level caching system maintains mapping information for each block that is cached on the flash device. This map takes a storage block number as key, and helps find the corresponding block in the cache. We also maintain block allocation and eviction information for all blocks on the flash device. In addition, we use a flush thread to write dirty data blocks on the flash device back to storage. The mapping and allocation operations are shown in Table 2. As we discuss in Section 4.2.2, we separate the mappings for clean and dirty blocks into two maps, called the clean and dirty maps.

Table 3 shows the pseudocode for processing IO requests in our system, using the mapping and allocation operations. On a read request, we use the IO request block number (bnr) to find the cached block. On a cache miss, we read the block from storage, allocate a block on flash, write the block contents there, and then insert a mapping entry for the newly cached block in the clean map. On a write request, instead of overwriting a cached block, we allocate a block on flash, write the block contents there, and insert a mapping entry in the dirty map. This no-overwrite approach allows writes to occur concurrently with flushing: a write is not blocked while a previous version of the block is being flushed. Mappings are updated only after writes are acknowledged, to maintain barrier semantics (see Section 3.1).

We avoid delaying read and write requests when the cache is full (which it will be, after it is warm) by evicting only clean blocks from the cache, using the standard LRU replacement policy. The clean blocks are maintained in a separate clean LRU list to speed up eviction. We ensure that clean blocks are always available by limiting the number of dirty blocks in the cache to a max_dirty_blocks threshold value. Once the cache hits this threshold, we fall back to write-through caching.

The flush thread writes dirty blocks to storage in the background. It uses asynchronous IO to batch blocks, and writes them proactively so that write requests avoid hitting the max_dirty_blocks threshold. The dirty map always refers to the latest version of a block, so only the last version is flushed when a block has been written multiple times. After a block is flushed, it is moved from the dirty map to the clean map. The flush thread waits when the number of dirty blocks reaches a low threshold value, unless it is signaled by the write-back flush policy to flush all dirty blocks (not shown in Table 3).

The write-back flush and write-back persist policies are implemented on a barrier request. The flush policy writes the dirty blocks to storage and waits for them to become durable by issuing a barrier request to storage. The persist policy makes the current blocks on storage durable and persists the dirty map on the flash device, performing the two operations concurrently and atomically.

These policies share much of the caching functionality. Next, we describe the key differences in the mapping and allocation operations for the two policies.

4.2 Mapping Information

The mapping information allows translating the storage block numbers in IO requests to the block numbers of the cached blocks on flash. We store this mapping information in a BTree structure in memory because it enables fast lookup, and it can be persisted efficiently on flash.

4.2.1 Write-back Flush

The mapping information is kept solely in client memory for the write-back flush policy, because the cache contents are not needed after a client failure, as explained in Section 3.2.1. On a client restart, the flash cache is empty and the mapping information is repopulated on IO requests. Cache warming can help reduce this impact [23].
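The read and write paths described above, together with the clean/dirty map separation, can be summarized with the following sketch (Tables 2 and 3 are not reproduced here, so the names are hypothetical reconstructions from the text; the real system uses BTrees rather than arrays and runs the flush thread concurrently):

/* Single-threaded sketch of request processing with separate clean and dirty
 * maps. Maps are plain arrays indexed by the storage block number (bnr). */
#include <stdio.h>

#define NBLOCKS     16       /* storage blocks tracked in this toy example */
#define NOT_CACHED  (-1)

static int clean_map[NBLOCKS];   /* storage bnr -> flash block (clean copy) */
static int dirty_map[NBLOCKS];   /* storage bnr -> flash block (dirty copy) */
static int next_flash_block;     /* trivial allocator for flash blocks      */

static int alloc_flash_block(void) { return next_flash_block++; }

/* Read: serve from the cache if mapped (the dirty map holds the latest
 * version); on a miss, read from storage, cache on flash, insert a clean
 * mapping. */
static int handle_read(int bnr)
{
    if (dirty_map[bnr] != NOT_CACHED) return dirty_map[bnr];
    if (clean_map[bnr] != NOT_CACHED) return clean_map[bnr];
    int fb = alloc_flash_block();        /* read from storage, write to flash */
    clean_map[bnr] = fb;
    return fb;
}

/* Write: no-overwrite -- always allocate a fresh flash block, so flushing a
 * previous version never blocks the write; the mapping is updated only after
 * the flash write is acknowledged. */
static int handle_write(int bnr)
{
    int fb = alloc_flash_block();        /* write the new contents to flash */
    dirty_map[bnr] = fb;                 /* inserted after the write is acked */
    return fb;
}

/* Flush thread: once a dirty block reaches storage, its mapping moves from
 * the dirty map to the clean map. */
static void flush_one(int bnr)
{
    if (dirty_map[bnr] == NOT_CACHED) return;
    clean_map[bnr] = dirty_map[bnr];
    dirty_map[bnr] = NOT_CACHED;
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) clean_map[i] = dirty_map[i] = NOT_CACHED;
    printf("read  bnr 3 -> flash block %d (miss, cached clean)\n", handle_read(3));
    printf("write bnr 3 -> flash block %d (new version, dirty)\n", handle_write(3));
    flush_one(3);
    printf("read  bnr 3 -> flash block %d (clean after flush)\n", handle_read(3));
    return 0;
}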
4.2.2 Write-back Persist

The mapping information needs to be persisted atomically for the write-back persist policy, as explained in Section 3.2.2. This persist_map operation is performed on a barrier, as shown in Table 3. We implement atomic persist by using a copy-on-write BTree, similar to the approach used in the Btrfs file system [17].

Only the dirty mappings need to be persisted to ensure consistency and durability, since the clean blocks are already safe and can be retrieved from storage following a client restart. This option reduces barrier latency because the clean mappings do not have to be persisted. However, persisting all mappings may be beneficial because a client can restart with a warm cache.

We have chosen to persist only the dirty mapping information. To do so, we keep two separate BTrees, called the clean map and the dirty map, for the clean and dirty mappings. The dirty map is persisted on a barrier request; we call this the persisted dirty map. Compared to persisting all mappings using a single map, this separation benefits both read-heavy and random-write workloads. In both cases, the dirty map will remain relatively small compared to the single map, which would either be large due to many clean mappings, or would have dirty mappings spread across the map, requiring many blocks to be persisted. An additional benefit is that recovery, which needs to read the persisted dirty map, is sped up.

When the flush thread writes a dirty block to storage, we move its mapping from the dirty map to the clean map, as shown in Table 3. This in-memory operation updates the dirty map, which is persisted at the next barrier.

4.3 Allocation Information

We use a bitmap to record block allocation information on the flash device. The bitmap indicates the blocks that are currently in use, either for the mapping metadata or the cached data blocks.

We do not persist the allocation bitmap to flash for several reasons. First, the bitmap does not need to be persisted at all for the write-back flush policy since the cache starts empty after client failure, as discussed in Section 4.2.1. Second, we separate the clean and dirty mapping information and only persist the dirty map for the write-back persist policy. As a result, we would also need to separate the clean and dirty allocation information and only persist the dirty allocation information to ensure consistency of the mapping and allocation information during recovery. Since we read in the dirty map during recovery anyway, which allows us to rebuild the allocation bitmap, this added complexity is not needed.

In the write-back persist policy, the cache blocks that are referenced in the persisted dirty map cannot be evicted even if they are clean, or else corruption may occur after recovery. For example, suppose that Block
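Two of the points above can be made concrete with a short sketch (hypothetical names; the real maps are copy-on-write BTrees): the allocation bitmap is rebuilt from the persisted dirty map during recovery rather than being persisted itself, and a flash block referenced by the persisted dirty map is treated as non-evictable even when it is clean.

/* Sketch of recovery and the eviction constraint for write-back persist.
 * All names and structures here are illustrative placeholders. */
#include <stdbool.h>
#include <stdio.h>

#define FLASH_BLOCKS 32

struct mapping { int storage_bnr; int flash_block; };

/* Stand-in for the dirty map as last persisted to flash at a barrier. */
static const struct mapping persisted_dirty_map[] = {
    { 100, 4 }, { 215, 9 }, { 7, 11 },
};
#define N_PERSISTED (sizeof(persisted_dirty_map) / sizeof(persisted_dirty_map[0]))

static bool allocated[FLASH_BLOCKS];     /* in-memory allocation bitmap */

/* Normal operation: a flash block may be evicted only if it is clean and not
 * referenced by the persisted dirty map; otherwise a recovery that reads the
 * persisted map could resolve a storage block to a reused flash block. */
static bool can_evict(int flash_block, bool clean)
{
    if (!clean)
        return false;
    for (size_t i = 0; i < N_PERSISTED; i++)
        if (persisted_dirty_map[i].flash_block == flash_block)
            return false;
    return true;
}

/* Recovery: reading the persisted dirty map is enough to rebuild the
 * allocation bitmap, so the bitmap itself is never persisted. */
static void recover(void)
{
    for (size_t i = 0; i < N_PERSISTED; i++)
        allocated[persisted_dirty_map[i].flash_block] = true;
}

int main(void)
{
    recover();
    printf("flash block 9: allocated=%d, evictable when clean=%d\n",
           (int)allocated[9], (int)can_evict(9, true));
    printf("flash block 5: allocated=%d, evictable when clean=%d\n",
           (int)allocated[5], (int)can_evict(5, true));
    return 0;
}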
for read-heavy workloads. For webserver, the caching policies increase IOPS by more than 2X compared to no-cache because the workload fits in the cache. We found that webserver saturates the flash bandwidth, and hence write-back persist performs slightly worse because it needs to persist its mapping table to flash on barriers. In contrast, webserver-large has a low cache hit rate. Thus, it issues many reads to storage, making the storage a bottleneck, and so the no-cache policy performs as well as any caching policy.

The ms_nfs workloads create many dirty blocks and then free them, causing cache pollution and many cache misses, because the caching layer is unaware of freed blocks. Nonetheless, our write-back policies perform well, with write-back persist performing comparably to vanilla write-back because the workloads are write-heavy (84% and 58% write requests in ms_nfs-small and ms_nfs, respectively). With ms_nfs-small, write-through caching performs slightly worse than no-caching because the cache misses require filling the cache; however, the difference is within the error bars. With the larger ms_nfs, storage becomes a bottleneck, and hence performance decreases for all workloads. However, this workload has a smaller ratio of write requests than ms_nfs-small, and so the performance benefits of write-back caching over write-through caching are smaller.

In ms_nfs-small, the cache hit rate for write-back flush is 52%, while the hit rate for write-back persist is 79%, accounting for the difference in their performance. For write-back flush, the high barrier latency (due to flushing dirty blocks back to storage) causes file-system journal commits to be delayed, since the next transaction cannot commit until the previous one has completed. For ms_nfs-small, only 13 transactions were committed during the 20-minute run. This delay increases the epoch size (we observed a maximum of 4.6GB, with 2GB on average), which leads to dirty blocks occupying a large fraction of the flash cache. However, read requests tend to be for clean blocks, since recently written dirty blocks are more likely to be in the buffer cache, and the reduced number of clean blocks in the flash cache leads to a lower hit rate. We see a similar effect in the ms_nfs workload, although the hit rates are lower in both cases (45% for write-back flush vs. 60% for write-back persist).

Varmail is the most demanding workload for our policies. It issues fsync() calls frequently and waits for them to finish, making the workload sensitive to barrier latency. Write-back persist issues synchronous writes to flash and hence has significant overheads compared to the write-back policy. However, persist still performs much better than the other policies that issue synchronous writes to storage. The write-back flush policy performs worse than the write-through policy by 7%, because in our current implementation, the flush thread always performs an additional block read from flash to flush the block to storage.

Contrary to our expectation, the write-back consistent policy, which does not provide durability, performs worse than the write-back flush policy for the ms_nfs workloads. These workloads quickly consume all available epochs because each file-system commit issues a barrier that starts a new epoch, but the barriers themselves are not held up by flushing. When no epochs are available, all writes are blocked. We observed that epochs do not become large (as with write-back flush), but writes are frequently blocked due to having no available epochs. Increasing the number of epochs did not significantly improve performance. In contrast, the write-back flush policy delays barriers, but due to this delay it does not run out of epochs. This result suggests that the delay introduced by barrier-based flushing provides a natural control mechanism for avoiding other limits due to cache size, number of epochs, etc.
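The epoch-exhaustion behavior described in the last paragraph can be illustrated with a toy model (hypothetical names; the policy's actual bookkeeping is simplified here): each barrier opens a new epoch, an epoch is retired only once its blocks reach storage, and writes stall whenever every epoch slot is in use.

/* Toy model of epoch exhaustion: barriers open epochs faster than background
 * flushing retires them, so writes must wait for a free epoch slot. */
#include <stdio.h>

#define MAX_EPOCHS 4

static int open_epochs;   /* epochs whose blocks have not yet reached storage */

/* A barrier starts a new epoch immediately; it fails if no slot is free. */
static int issue_barrier(void)
{
    if (open_epochs == MAX_EPOCHS)
        return -1;             /* no epoch available: writes would block */
    open_epochs++;
    return 0;
}

/* Background flushing finished the oldest epoch's blocks. */
static void retire_oldest_epoch(void)
{
    if (open_epochs > 0)
        open_epochs--;
}

int main(void)
{
    /* Frequent file-system commits issue barriers faster than epochs retire,
     * so writes stall even though each epoch stays small. */
    for (int commit = 1; commit <= 6; commit++) {
        if (issue_barrier() == 0) {
            printf("commit %d: new epoch opened (%d in flight)\n", commit, open_epochs);
        } else {
            printf("commit %d: writes blocked until an epoch retires\n", commit);
            retire_oldest_epoch();
            (void)issue_barrier();
        }
    }
    return 0;
}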