A Disk RAID array Volume Manager for Disk

responses that do not represent true discrepancies (e.g.,
directory entries returned in different orders), and then
assists the user with visualizing the “problem” RPCs.
Off-line post-processing is useful for reducing on-line
overheads as well as allowing the user to refine comparison
rules without losing data from the real environment
(since the log is a filtered trace).
4.2 State synchronization
The synchronization module updates the SUT to enable
useful comparisons. Doing so requires making the SUT’s
internal state match the reference server’s to the point
that the two servers’ responses to a given RPC could be
expected to match. Fortunately, NFSv3 RPCs generally
manipulate only one or two file objects (regular files, directories,
or links), so some useful comparisons can be
made long before the entire file system is copied to the
reference server.
Synchronizing an object requires establishing a point
within the stream of requests where comparison could
begin. Then, as long as RPCs affecting that object are
handled in the same order by both servers, it will remain
synchronized. The lifetime of an object can be viewed
as a sequence of states, each representing the object as it
exists between two modifications. Synchronizing an object,
then, amounts to replicating one such state from the
reference server to the SUT.
Performing synchronization offline (i.e., when the reference
server is not being used by any clients) would
be straightforward. But, one of our goals is the ability
to insert a SUT into a live environment at runtime.
This requires dealing with object changes that are concurrent
with the synchronization process. The desire not
to disrupt client activity precludes blocking requests to
an object that is being synchronized. The simplest solution
would be to restart synchronization of an object if a
modification RPC is sent to the reference server before it
completes. But, this could lead to unacceptably slow and
inefficient synchronization of large, frequently-modified
objects. Instead, our synchronization mechanism tracks
changes to objects that are being synchronized. RPCs are
sent to the reference server as usual, but are also saved in
a changeset for later replay against the SUT.
Figure 3 illustrates synchronization in the presence of
write concurrency. The state S1 is first copied from the
reference server to the SUT. While this copy is taking
place, a write (Wr1) arrives and is sent to the reference
server. Wr1 is not duplicated to the SUT until the copy of
S1 completes. Instead, it is recorded at the Tee. When the
copy of S1 completes, a new write, Wr1’, is constructed
based on Wr1 and sent to the SUT. Since no further concurrent
changes need to be replayed, the object is marked
reference
object
lifetime
SUT
object
lifetime
S2
Wr1
Copy S1
S1
Wr1’
Time
S1 S2
Figure 3: Synchronization with a concurrent write. The top
series of states depicts a part of the lifetime of an object on the reference
server. The bottom series of states depicts the corresponding object on
the SUT. Horizontal arrows are requests executed on a server (reference
or SUT), and diagonal arrows are full object copies. Synchronization
begins with copying state S1 onto the SUT. During the copy of S1, write
Wr1 changes the object on the reference server. At the completion of
the copy of S1, the objects are again out of synchronization. Wr1’ is
the write constructed from the buffered version of Wr1 and replayed on
the SUT.
synchronized and all subsequent requests referencing it
are eligible for duplication and comparison.
Even after initial synchronization, concurrent and overlapping
updates (e.g., Wr1 and Wr2 in Figure 4) can
cause a file object to become unsynchronized. Two requests
are deemed overlapping if they both affect the
same state. Two requests are deemed concurrent if the
second one arrives at the relay before the first one’s response.
This definition of concurrency accounts for both
network reordering and server reordering. Since the Tee
has no reliable way to determine the order in which concurrent
requests are executed on the reference server, any
state affected by both Wr1 and Wr2 is indeterminate.
Resynchronizing the object requires re-copying the affected
state from the reference server to the SUT. Since
overlapping concurrency is rare, our Tee simply marks
the object unsynchronized and repeats the process entirely.
The remainder of this section provides details regarding
synchronization of files and directories, and describes
some synchronization ordering enhancements that allow
comparisons to start more quickly.
Regular file synchronization: A regular file’s state is
its data and its attributes. Synchronizing a regular file
takes place in three steps. First, a small unit of data and
the file’s attributes are read from the reference server and
written to the SUT. If a client RPC affects the object during
this initial step, the step is repeated. This establishes
a point in time for beginning the changeset. Second, the
remaining data is copied. Third, any changeset entries
are replayed.
A file’s changeset is a list of attribute changes and
written-to extents. A bounded amount of the written data
reference
object
lifetime
SUT
object
lifetime
Time
X
S2
S2
Copy S2
S1
S1
Wr1, Wr2
Figure 4: Re-synchronizing after write concurrency. The example
begins with a synchronized object, which has state S1 on both
servers. When concurrent writes are observed (Wr1 and Wr2 in this
example), the Tee has no way of knowing their execution order at the
reference server. As a consequence, it cannot know the resulting reference
server state. So, it must mark the object as unsynchronized and
repeat synchronization.
is cached. If more data was written, it must be read from
the reference server to replay changes. As the changeset
is updated, by RPCs to reference server, overlapping extents
are coalesced to reduce the work of replaying them;
so, for example, two writes to the same block will result
in a single write to the SUT during the third step of file
synchronization.
Directory synchronization: A directory’s state is its
attributes and the name and type of each of its children.
5 This definition of state allows a directory to be
synchronized regardless of whether its children are synchronized.
This simplifies the tracking of a directory’s
synchronization status and allows the comparison of responses
to directory-related requests well before the children
are synchronized.
Synchronizing a directory is done by creating missing
directory entries and removing extraneous ones. Hard
links are created as necessary (i.e., when previously discovered
file handles are found). As each unsynchronized
child is encountered, it is enqueued for synchronization.
When updates occur during synchronization,
a directory’s changeset will include new attribute values
and two lists: entries to be created and entries to be removed.
Each list entry stores the name, file handle, and
type for a particular directory entry.
Synchronization ordering: By default, the synchronization
process begins with the root directory. Each unknown
entry of a directory is added to the list of files to
be synchronized. In this way, the synchronization process
works its way through the entire reference file system.
One design goal is to begin making comparisons as
5File type is not normally considered to be part of a directory’s contents.
We make this departure to facilitate the synchronization process.
During comparison, file type is a property of the file, not of the parent
directory.
quickly as possible. To accomplish this, our Tee synchronizes
the most popular objects first. The Tee maintains
a weighted moving average of access frequency for each
object it knows about, identifying accesses by inspecting
the responses to lookup and create operations. These
quantities are used to prioritize the synchronization list.
Because an object cannot be created until its parent directory
exists on the SUT, access frequency updates are
propagated from an object back to the file system root.
4.3 Comparison
The comparison module compares responses to RPC requests
on synchronized objects. The overall comparison
functionality proceeds in two phases: on-line and postprocessed.
The on-line comparisons are performed at
runtime, by the Tee’s comparison module, and any nonmatching
responses (both responses in their entirety) are
logged together with the associated RPC request. The
logged information allows post-processing to eliminate
false non-matches (usually with more detailed examination)
and to help the user to explore valid non-matches in
detail.
Most bitwise-comparable fields are compared on-line.
Such fields include file data, file names, soft link contents,
access control fields (e.g., modes and owner IDs),
and object types. Loosely-comparable fields include
time values and directory contents. The former are compared
on-line, while the latter (in our implementation)
are compared on-line and then post-processed.
Directory contents require special treatment, when comparison
fails, because of the looseness of the NFS protocol.
Servers are not required to return entries in any
particular order, and they are not required to return any
particular number of entries in a single response to a
READDIR or READDIRPLUS RPC request. Thus, entries
may be differently-ordered and differently-spread
across multiple responses. In fact, only when the Tee
observes complete listings from both servers can some
non-matches be definitively declared. Rather than deal
with all of the resulting corner cases on-line, we log the
observed information and leave it for the post-processor.
The post-processor can link multiple RPC requests iterating
through the same directory by the observed file handles
and cookie values. It filters log entries that cannot
be definitively compared and that do not represent mismatches
once reordering and differing response boundaries
are accounted for.
4.4 Implementation
We implemented our Tee in C++ on Linux. We used the
State Threads user-level thread library. The relay runs