
The Google File System


RANA NIRAV
07-CEM-84, Computer Engineering Department
Sardar Vallabhbhai Patel Institute of Technology
Vasad, Gujarat
Rana_nirav2007@yahoo.com

1. ABSTRACT
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The Google File System has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real-world use.

2. INTRODUCTION
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast-growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.


Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our extensibility. For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

3. DESIGN OVERVIEW

3.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.


3.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
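To make this interface concrete, the following is a minimal sketch of what such a client-facing surface could look like in Python. The class and method names (GFSClient, GFSFile, record_append, and so on) are illustrative assumptions for exposition only; the paper does not publish the actual client library.

# A minimal sketch of the client-facing operations described above; the method
# names and signatures are illustrative assumptions, not the real GFS API.
class GFSFile:
    """Opaque per-file handle (a pathname plus cached chunk metadata)."""
    def __init__(self, path: str):
        self.path = path


class GFSClient:
    """Hypothetical client handle for a GFS cluster."""

    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> GFSFile: ...
    def close(self, f: GFSFile) -> None: ...
    def read(self, f: GFSFile, offset: int, length: int) -> bytes: ...
    def write(self, f: GFSFile, offset: int, data: bytes) -> None: ...

    # GFS-specific extensions described in the paper:
    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Copy a file or directory tree at low cost."""

    def record_append(self, f: GFSFile, record: bytes) -> int:
        """Append atomically at least once; return the offset GFS chose."""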
3.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Figure 1: GFS architecture. The application links against the GFS client, which sends control messages (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations) in reply; data messages flow directly between the client and the chunkservers. The master holds the file namespace (e.g. /foo/bar), sends instructions to the chunkservers, and collects chunkserver state.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.
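The metadata described above (a namespace of pathnames, a file-to-chunk mapping keyed by 64 bit chunk handles, and per-chunk replica locations) could be organized roughly as in the following sketch. The structure and field names are assumptions for illustration; only the 64 MB chunk size, the 64 bit handle, the chunk version number, and the default of three replicas come from the text.

# Illustrative sketch of the master's in-memory bookkeeping implied by the text.
from dataclasses import dataclass, field
from typing import Dict, List

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
DEFAULT_REPLICATION = 3         # three replicas by default


@dataclass
class ChunkInfo:
    handle: int                                          # immutable, globally unique 64-bit id
    version: int = 0                                     # used to detect stale replicas
    locations: List[str] = field(default_factory=list)   # chunkserver addresses


@dataclass
class FileInfo:
    chunk_handles: List[int] = field(default_factory=list)  # ordered by chunk index
    replication: int = DEFAULT_REPLICATION


class MasterMetadata:
    def __init__(self) -> None:
        self.files: Dict[str, FileInfo] = {}      # "/foo/bar" -> FileInfo
        self.chunks: Dict[int, ChunkInfo] = {}    # chunk handle -> ChunkInfo
        self._next_handle = 1

    def create_chunk(self, path: str) -> ChunkInfo:
        """Assign a fresh handle when a file grows into a new chunk."""
        info = ChunkInfo(handle=self._next_handle)
        self._next_handle += 1
        self.files.setdefault(path, FileInfo()).chunk_handles.append(info.handle)
        self.chunks[info.handle] = info
        return info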
3.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.


Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
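This read interaction can be summarized in a short client-side sketch, assuming hypothetical RPC stubs lookup_chunk on the master and read_chunk on a chunkserver replica:

# A sketch of the client-side read path described above. The master and
# chunkserver RPC stubs (lookup_chunk, read_chunk) are hypothetical.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size


class ReadPathClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (file_name, chunk_index) -> (chunk_handle, replica_locations)

    def read(self, file_name: str, offset: int, length: int) -> bytes:
        # 1. Translate (file name, byte offset) into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)

        # 2. Ask the master only on a cache miss; it replies with the chunk
        #    handle and replica locations (possibly for following chunks too).
        if key not in self.cache:
            handle, locations = self.master.lookup_chunk(file_name, chunk_index)
            self.cache[key] = (handle, locations)
        handle, locations = self.cache[key]

        # 3. Contact one replica (ideally the closest) with the chunk handle
        #    and a byte range within that chunk; no further master involvement.
        start_in_chunk = offset % CHUNK_SIZE
        length = min(length, CHUNK_SIZE - start_in_chunk)  # stay inside one chunk for this sketch
        replica = locations[0]
        return replica.read_chunk(handle, start_in_chunk, length)

The cache keyed by (file name, chunk index) is what keeps the master off the data path for repeated reads of the same chunk.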
3.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size has several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially. However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
3.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
3.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
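A quick worked check of the memory figures quoted above, taking the bound of 64 bytes of chunk metadata per 64 MB chunk from the text; the 1 PB cluster size below is an assumed example, not a number from the paper:

# Worked example of the figures above: less than 64 bytes of chunk metadata
# per 64 MB chunk. The 1 PB of stored file data is an illustrative assumption.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
METADATA_PER_CHUNK = 64                # upper bound, bytes

stored_bytes = 1 * 1024**5             # 1 PB of file data (before replication)
chunks = stored_bytes // CHUNK_SIZE    # 16,777,216 chunks
metadata_bytes = chunks * METADATA_PER_CHUNK

print(f"{chunks:,} chunks -> at most {metadata_bytes / 1024**3:.1f} GB of master memory")
# 16,777,216 chunks -> at most 1.0 GB of master memory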


3.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often. Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
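A minimal sketch of this design decision: the master simply treats each startup report or HeartBeat as the authoritative list of chunks a chunkserver holds. The message shapes and names below are assumptions for illustration.

# Sketch of how the master could rebuild chunk locations from chunkserver
# reports instead of persisting them, as described above.
from typing import Dict, List, Set


class ChunkLocationDirectory:
    def __init__(self) -> None:
        # chunk handle -> set of chunkserver addresses currently holding it
        self.locations: Dict[int, Set[str]] = {}

    def handle_report(self, server: str, held_chunks: List[int]) -> None:
        """Process a startup report or HeartBeat listing the chunks a server holds.

        The chunkserver has the final word about its own disks, so its report
        simply overwrites whatever the master previously believed about it.
        """
        for holders in self.locations.values():
            holders.discard(server)                       # drop stale entries for this server
        for handle in held_chunks:
            self.locations.setdefault(handle, set()).add(server)

    def handle_server_lost(self, server: str) -> None:
        """Forget a chunkserver that has stopped answering HeartBeats."""
        for holders in self.locations.values():
            holders.discard(server)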
3.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created. Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after pushing the corresponding log record to disk both locally and remotely. The master batches several log records together before pushing, thereby reducing the impact of pushing and replication on overall system throughput.
The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely. Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
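The logging and checkpointing discipline described above can be sketched as follows. The JSON record format, file layout, and replicate() hook are illustrative assumptions; the sketch also assumes, for simplicity, that the log file being replayed is the one started right after the checkpoint was taken.

# Sketch of the discipline above: a mutation is acknowledged only after its log
# record reaches local disk and the remote replicas, and recovery loads the
# latest checkpoint and then replays the subsequent log records.
import json
import os


class OperationLog:
    def __init__(self, log_path: str, replicas):
        self.log_path = log_path
        self.replicas = replicas          # remote log replicas (stubs)
        self.log = open(log_path, "a", encoding="utf-8")

    def append(self, record: dict) -> None:
        line = json.dumps(record) + "\n"
        self.log.write(line)
        self.log.flush()
        os.fsync(self.log.fileno())       # push to local disk...
        for replica in self.replicas:
            replica.replicate(line)       # ...and to the remote machines
        # Only now may the master respond to the client.

    @staticmethod
    def recover(checkpoint_path, log_path, load_checkpoint, apply_record) -> None:
        """Load the latest complete checkpoint, then replay the log that follows it."""
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path, encoding="utf-8") as f:
                load_checkpoint(json.load(f))
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as f:
                for line in f:
                    apply_record(json.loads(line))

The real master also batches several records per flush to reduce the cost of pushing and replication; the sketch flushes one record at a time for clarity.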
3.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS's guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.
3.7.1 Guarantees by GFS
File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master: namespace locking guarantees atomicity and correctness; the master's operation log defines a global total order of these operations. The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result.

Table 1: File region state after mutation.
                          Write                       Record Append
Serial success            defined                     defined interspersed with inconsistent
Concurrent successes      consistent but undefined    defined interspersed with inconsistent
Failure                   inconsistent                inconsistent

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.


Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the "record") to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing. (In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down. Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity. Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming. Once a problem surfaces, the data is restored from valid replicas as soon as possible. A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.
3.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's perspective.
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append's append-at-least-once semantics preserves each writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
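The self-validating, self-identifying records described above can be sketched with a simple framing of length, unique record id, and checksum. This framing is an assumption for illustration, not the record I/O library used at Google; it only shows how a reader can skip padding and torn fragments and filter out at-least-once duplicates.

# Each record carries a length, a unique id, and a CRC32 of its payload, so a
# reader can discard padding/fragments (checksum mismatch) and duplicates (id).
import struct
import zlib

HEADER = struct.Struct("<IQI")  # payload length, record id, CRC32 of payload


def encode_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload


def decode_records(blob: bytes):
    """Yield valid (record_id, payload) pairs, skipping padding, torn fragments,
    and duplicate ids (which record append may legally produce)."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(blob):
        length, record_id, checksum = HEADER.unpack_from(blob, pos)
        start, end = pos + HEADER.size, pos + HEADER.size + length
        payload = blob[start:end]
        # Zero-length frames are treated as padding in this sketch.
        if 0 < length == len(payload) and zlib.crc32(payload) == checksum:
            if record_id not in seen:          # drop at-least-once duplicates
                seen.add(record_id)
                yield record_id, payload
            pos = end                          # valid record: jump past it
        else:
            pos += 1                           # padding/garbage: resynchronize byte by byte
    # A production reader would also cap record sizes and bound resync scans.


# Example: GFS-inserted padding and a duplicated record are tolerated.
blob = encode_record(1, b"alpha") + b"\x00" * 7 + encode_record(1, b"alpha") + encode_record(2, b"beta")
assert [p for _, p in decode_records(blob)] == [b"alpha", b"beta"]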
4. FAULT TOLERANCE AND DIAGNOSIS
One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.
4.1 High Availability
Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.


4.1.1 Fast Recovery
Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry.
4.1.2 Chunk Replication
As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification. Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read-only storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.
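A simplified sketch of the cloning decision described above: when a chunk's live replica count falls below its goal, the master picks a source replica and a target chunkserver and issues a clone instruction. The placement policy here is deliberately naive and hypothetical; the real master uses its global knowledge of racks, disk usage, and load.

# Sketch of the master's re-replication planning implied by the text.
from typing import Dict, List, Set

DEFAULT_REPLICATION = 3


def plan_re_replication(
    chunk_locations: Dict[int, Set[str]],      # chunk handle -> live replica servers
    replication_goal: Dict[int, int],          # per-chunk goal (default three)
    live_servers: List[str],
):
    """Yield (chunk_handle, source_server, target_server) clone instructions."""
    for handle, holders in chunk_locations.items():
        goal = replication_goal.get(handle, DEFAULT_REPLICATION)
        missing = goal - len(holders)
        if missing <= 0 or not holders:
            continue
        source = min(holders)                  # any valid replica can serve as the source
        candidates = [s for s in live_servers if s not in holders]
        for target in candidates[:missing]:
            yield handle, source, target


# Example: chunk 7 has lost one of its three replicas.
plan = list(plan_re_replication({7: {"cs1", "cs2"}}, {}, ["cs1", "cs2", "cs3", "cs4"]))
assert plan == [(7, "cs1", "cs3")]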
4.1.3 Master Replication
The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been pushed to disk locally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally. When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.
Moreover, "shadow" masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be stale within short windows is file metadata, like directory contents or access control information. To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary's decisions to create and delete replicas.


4.2 Data Integrity
Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data. For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Therefore chunkservers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.
Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.
Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read. In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
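The per-chunk checksum bookkeeping described above can be sketched as follows, using zlib's CRC32 as a stand-in 32 bit checksum. Only the 64 KB block size and the 32 bit checksum width come from the text; the in-memory layout is an assumption, and where the real chunkserver updates the last partial block's checksum incrementally, the sketch simply recomputes the affected blocks.

# Per-chunk checksums over 64 KB blocks: verified on reads that overlap them,
# and redone only for the last partial block plus new blocks on an append.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks


class ChunkChecksums:
    def __init__(self) -> None:
        self.data = bytearray()      # chunk contents (stand-in for the Linux file)
        self.checksums = []          # one CRC32 per 64 KB block

    def append(self, payload: bytes) -> None:
        """Recompute the last partial block's checksum, then add checksums for new blocks."""
        self.data.extend(payload)
        first_dirty_block = len(self.checksums) - 1 if self.checksums else 0
        del self.checksums[first_dirty_block:]
        for start in range(first_dirty_block * BLOCK_SIZE, len(self.data), BLOCK_SIZE):
            block = self.data[start:start + BLOCK_SIZE]
            self.checksums.append(zlib.crc32(block))

    def verify_read(self, offset: int, length: int) -> bytes:
        """Check every block overlapping [offset, offset+length) before returning any data."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for index in range(first, last + 1):
            block = self.data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != self.checksums[index]:
                raise IOError(f"checksum mismatch in block {index}; report to master")
        return bytes(self.data[offset:offset + length])


# Usage: an append only touches the trailing blocks' checksums.
chunk = ChunkChecksums()
chunk.append(b"x" * 70_000)        # spans one full block and one partial block
chunk.append(b"y" * 10_000)        # only the partial block's checksum is redone
assert chunk.verify_read(65_000, 1_000) == b"x" * 1_000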
4.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep these logs around as far as space permits.
The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis. The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. The most recent events are also kept in memory and available for continuous online monitoring.

5. CONCLUSIONS
The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unique setting, many may apply to data processing tasks of a similar magnitude and cost consciousness. We started by reexamining traditional file system assumptions in light of our current and anticipated application workloads and technological environment. Our observations have led to radically different points in the design space. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.
Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery. Chunk replication allows us to tolerate chunkserver failures. The frequency of these failures motivated a novel online repair mechanism that regularly and transparently repairs the damage and compensates for lost replicas as soon as possible. Additionally, we use checksumming to detect data corruption at the disk or IDE subsystem level, which becomes all too common given the number of disks in the system.
Our design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. We achieve this by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. Master involvement in common operations is minimized by a large chunk size and by chunk leases, which delegate authority to primary replicas in data mutations. This makes possible a simple, centralized master that does not become a bottleneck.
We believe that improvements in our networking stack will lift the current limitation on the write throughput seen by an individual client. GFS has successfully met our storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.




