Figure 6: Mapping of a database table with 16 attributes onto an Atropos logical volume with 4 disks.
Fetching any subset of attributes for a given page
(record range) is thus a simple matter of issuing the corresponding
number of I/Os, each accessing a contiguous
region of the VLBN space mapped to a contiguous region
on the disk. If several I/Os fall onto stripe units mapped
to the same disk, the internal disk scheduler optimizes
their execution by minimizing positioning times.
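To make this concrete, the following minimal sketch (in Python, with hypothetical names; not the actual Atropos interface) shows how a database could turn a request for a subset of attributes into one contiguous-VLBN I/O per attribute, assuming for illustration that each attribute of a page occupies its own equally sized, contiguous chunk of the page's VLBN range:

    # Hypothetical sketch: one I/O per requested attribute, each covering a
    # contiguous VLBN range. Assumes attribute i of the page occupies the i-th
    # equally sized chunk of the page's VLBN range (an illustrative layout,
    # not necessarily the exact mapping of Figure 6).
    def attribute_ios(page_base_vlbn, blocks_per_attribute, wanted_attributes):
        return [(page_base_vlbn + a * blocks_per_attribute, blocks_per_attribute)
                for a in sorted(wanted_attributes)]

    # Fetch attributes 2, 7, and 11 of the page at VLBN 10976, 686 blocks each.
    for start, length in attribute_ios(10976, 686, {2, 7, 11}):
        print(f"read VLBNs {start}..{start + length - 1}")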
5 Evaluation
This section evaluates the performance of Atropos. First,
it quantifies the efficiencies of sequential, semi-sequential
and random accesses and shows the impact of disk
trends on the layout parameter values. For all access
patterns, Atropos achieves performance comparable or superior to that of conventional disk array data organizations. Second, trace replay experiments of a TPC-H workload on the Atropos implementation show the benefit
of matching the stripe-unit size to the exact track
size and exposing it to applications. Third, the benefits
of Atropos’s data organizations are shown for (a subset
of) queries from the TPC-H benchmark running on the
Shore database storage manager [3] and our Atropos implementation.
5.1 Experimental setup
The experiments were performed on a four-way
500 MHz Pentium III machine with 1 GB of memory
running Linux kernel v. 2.4.7 of the RedHat 7.1 distribution.
The machine was equipped with two Adaptec Ultra160
Wide SCSI adapters on two separate PCI buses, each
controlling two 36 GB Maxtor Atlas 10K III disks.
5.2 Quantifying access efficiency
Traditional striped layouts of data across disks in a
RAID group offer efficient (sequential) access along one
major. The efficiency of accessing data along the other
major is much lower, essentially involving several random
accesses. Atropos’s quadrangle layout, on the other
hand, offers streaming efficiency for sequential accesses
and much higher efficiency for the other-major access.
We define “access efficiency” as the fraction of total access
time spent reading/writing data from/to the media.
The access efficiency is reduced by activities other than
data transfer, including seeks, rotational latencies, and
track switches. The efficiencies and response times described
in this subsection are for a single disk. With p
disks comprising a logical volume, each disk can experience
the same efficiency while accessing data in parallel.
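The metric can be stated as a one-line formula; the sketch below uses illustrative values only, not measurements:

    # "Access efficiency": the fraction of total access time spent transferring
    # data to/from the media. Seeks, rotational latency, and track switches
    # only enlarge the denominator.
    def access_efficiency(media_ms, seek_ms=0.0, rot_latency_ms=0.0, switch_ms=0.0):
        return media_ms / (media_ms + seek_ms + rot_latency_ms + switch_ms)

    # Illustrative: reading one full 343 KB track takes one 6 ms revolution;
    # with a 2.46 ms average seek and rotational latency avoided by
    # track-aligned access, the efficiency is roughly 0.7.
    print(access_efficiency(media_ms=6.0, seek_ms=2.46))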
5.2.1 Efficient access in both majors
Figure 7 graphs the access efficiency of the quadrangle
layout as a function of I/O size. It shows two important
features of the Atropos design. First, accessing data in
the primary-major order (line 1) matches the best-possible efficiency
of track-based access with traxtents. Second, the
efficiency of the other-major order access (line 2)
is much higher than that of the same type of access with the
Naïve layout of conventional disk arrays (line 3), thanks
to semi-sequential access.
The data in the graph was obtained by measuring
the response times of requests issued to randomly chosen
DLBNs, aligned on track boundaries, within the Atlas
10K III’s outer-most zone (686 sectors or 343 KB
per track). The average seek time in the first zone is
2.46 ms. The drop in the primary-major access efficiency
at the 343 KB mark is due to rotational latency
and an additional track switch incurred for I/Os larger
than the track size, when using a single disk.
The I/O size for the other-major access with the quadrangle
layout is the product of quadrangle depth, d, and
the number of consecutive DLBNs, b, accessed on each
track. For d = 4, a request for S sectors is split into four
I/Os of S/4 DLBNs each. For this access in the Naïve layout
(line 3), servicing these requests includes one seek
and some rotational latency for each of the four b-sized
I/Os, which are “randomly” located on each of the four
consecutive tracks.
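A small sketch of this splitting (illustrative only; the mapping of each piece to a physical track is handled by Atropos):

    # Split an other-major request of S sectors into d per-track I/Os of
    # b = S / d consecutive DLBNs each (one I/O per consecutive track).
    def split_other_major(S, d):
        assert S % d == 0, "illustration assumes S is a multiple of d"
        b = S // d
        return [(track, b) for track in range(d)]  # (relative track, DLBNs to read)

    print(split_other_major(S=104, d=4))  # four I/Os of 26 DLBNs each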
The efficiency of semi-sequential quadrangle access
(line 2) with I/O sizes below 124 KB is only slightly
lower than the efficiency of track-based access
with traxtents. Past this point, which corresponds to the
one-revolution constraint, the efficiency increases at a
slower rate, eventually surpassing the efficiency value at
the 124 KB mark. However, this increase in efficiency
comes at a cost of increased request latency; the access
will now require multiple revolutions to service. The
continuing increase in efficiency past the 124 KB mark
is due to amortizing the cost of a seek by larger data
transfer. Recall that each request includes an initial seek.
Figure 7: Comparison of access efficiencies for the Maxtor Atlas 10K III as a function of I/O size [KB]. Line 1: Quadrangle (d=4), primary-major access; line 2: Quadrangle (d=4), other-major access; line 3: Naive (d=4), other-major access; the fourth curve shows track-based access (traxtents). The maximal streaming efficiency, i.e., without seeks, for this disk is 0.82 (computed by Equation 6 in Appendix A).
5.2.2 Random accesses in the other-major
Figure 8 compares access times for a random 8 KB
chunk of data with different data layouts. The goal is to
understand the cost of accessing data in the other-major
order (e.g., row-major order access of the table in Figure
2). For context, a block accessed in the primary-major order has its
data mapped to consecutive LBNs. Such an access incurs
an average seek of 2.46 ms and an average rotational latency
of half a revolution, followed by an 8 KB media
access. The total response time of 5.93 ms is shown by
the bar labeled “Contiguous.”
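As a rough sanity check of that number (our arithmetic, using the 6 ms revolution and 686-sector track quoted elsewhere in this section; the remaining few tenths of a millisecond presumably reflect overheads, such as command processing and bus transfer, that the text does not break out):

    # Back-of-the-envelope components of the "Contiguous" access.
    avg_seek_ms = 2.46            # average seek in the first zone (Section 5.2.1)
    avg_rot_latency_ms = 6.0 / 2  # half of a 6 ms revolution, on average
    media_ms = (16 / 686) * 6.0   # 8 KB = 16 sectors out of a 686-sector track
    print(avg_seek_ms + avg_rot_latency_ms + media_ms)  # ~5.6 ms (reported: 5.93 ms)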
Accessing 8 KB of data randomly spread across noncontiguous
VLBNs (e.g., a single-row access in the Naïve
layout of Figure 2) incurs nearly half of a revolution of
rotational latency for each of the d accesses in addition
to the same initial seek. Such an access results in a large
response time, as shown by the bars labeled “Naive.”
Database systems using the DSM data layout decomposed
into d separate tables suffer this high penalty when
complete records are retrieved.
In contrast, with the quadrangle layout, an access in
the other-major incurs only a single seek and much less
total rotational latency than the access in the traditional
Naïve layout. This access still incurs one (for d = 2)
or three (for d = 4) track switches, which explains the
penalty of 16% and 38% relative to the best case.
5.2.3 Access performance analysis
Using parameters derived in Section 3.2 and the analytical
model described in Appendix A, we can express
the expected response time for a quadrangle access and
compare it with measurements taken from a real disk.
Figure 9 plots response times for quadrangle accesses
to the disk’s outer-most zone as a function of I/O request
size, S, and compares the values obtained from the analytic
model to measurements from a real disk. The close
match between these data sets demonstrates that Atropos
Figure 8: Comparison of response times for random access. Bars show the response time [ms] of a random 8 KB access for Contiguous/Atropos primary-major access, Atropos other-major access (d = 2 and d = 4), and Naive other-major access (d = 2 and d = 4).
can reliably determine proper values of the quadrangle layout parameters
analytically rather than empirically, which may be
time consuming. The data is shown for the Atlas 10K III
disk: N = 686, H = 139, and 6 ms revolution time.
The plotted response time does not include seek time;
adding it to the response time would simply shift the
lines up by an amount equivalent to the average seek
time. The total I/O request size, S, shown along the x-axis is determined as S = db. With d = 1, quadrangle access
reduces to normal disk access. Thus, the expected
response time grows from 3 to 6 ms. For d = 6, the
response time is at least 10.8 ms, even for the smallest
possible I/O size, because Dmax = 5 for the given disk.
The most prominent features of the graph are the
steps from the 6 ms to 10–12 ms regions. This abrupt
change in response time shows the importance of the
one-revolution constraint. If this constraint is violated
by an I/O size that is too large, the penalty in response
time is significant.
The data measured on the real disk (dashed lines in
Figure 9) match the predicted values. To directly compare
the two sets of data, the average seek value was
subtracted from the measured values. The small differences
occur because the model does not account for bus
transfer time, which does not proceed entirely in parallel
with media transfer.
5.2.4 Effect of disk characteristics
Figure 9 shows the relationship between quadrangle dimensions
and disk characteristics of one particular disk
with Dmax = 5. To determine how disk characteristics
affect the quadrangle layout, we use the analytic model
to study other disks. As shown in Table 2, the dimensions
of the quadrangles mapped to the disks’ outer-most
zones remain stable across different disks of the past
decade. The smaller Dmax for the Atlas 10K III is due
to an unfortunate track skew/head switch of H = 139. If
H = 136, Dmax = 6 and b = 1.
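To illustrate why H caps the quadrangle depth, the sketch below uses one plausible, simplified reading of the one-revolution constraint: the d-1 track switches plus at least one DLBN per track must fit within a single N-sector revolution. This is our assumption rather than the exact model of Section 3.2 and Appendix A, but it reproduces the Dmax values quoted here:

    # Assumed, simplified one-revolution check: (d - 1) * H + d <= N, i.e.,
    # d - 1 track switches of H sectors plus d single-DLBN accesses must fit
    # within one revolution of N sectors.
    def max_quadrangle_depth(N, H):
        d = 1
        while d * H + d + 1 <= N:  # does a depth of d + 1 still fit?
            d += 1
        return d

    print(max_quadrangle_depth(N=686, H=139))  # 5 (Atlas 10K III, as quoted)
    print(max_quadrangle_depth(N=686, H=136))  # 6 (the H = 136 case noted above)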

Figure 9: Response time of semi-sequential access. The plot shows the response time [ms] of quadrangle access to the Maxtor Atlas 10K III as a function of I/O size [KB], with measured and predicted curves for quadrangle depths d = 1 through d = 6.
Table 2 also shows that, with d set to Dmax, the number
of DLBNs, b, accessed at each disk track remains below
10, with the exception of the Atlas 10K III. The data reveals
another favorable trend: the small value of R (number
of DLBNs on each track not mapped to VLBNs) is
a modest capacity tradeoff for large performance gains.
With the exception of the Atlas 10K III disk, less than
1% of the total disk capacity would be wasted. For that
disk, the value is 1.5%.
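The capacity cost is easy to verify from Table 2: the wasted fraction per track is simply the R residual DLBNs divided by the N DLBNs on the track.

    # Wasted capacity for the Atlas 10K III: R = 10 residual DLBNs out of
    # N = 686 DLBNs per track.
    print(10 / 686)  # ~0.0146, i.e., the 1.5% figure quoted above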
5.3 Track-sized stripe units
We now evaluate the benefits of one Atropos feature
in isolation: achieving efficient sequential access by
matching stripe units to exact track boundaries and exposing their size to applications. To do so, we replayed block-level
I/O traces of the TPC-H benchmark, representing
a decision support system workload dominated by large
sequential I/O. The original traces were captured on an
IBM DB2 v. 7.2 system using 8 KB NSM pages and
running each of the 22 TPC-H queries separately. The
configuration specifics and the description of the trace
replay transformations are detailed elsewhere [20].
For the experiments described in the remainder of this
section, we used a single logical volume created from
four disks (p = 4) and placed it on the disks’ outermost
zone, giving it a total size of 35 GB. The quadrangle
layout was configured as RAID 0 with d = 4 and b = 1.
To simulate the effects of varying stripe unit size and
exposing its value to DB2, we modified the captured
traces by compressing back-to-back sequential accesses
to the same table or index into one large I/O. We then
split this large I/O into individual I/Os according to the
stripe unit size, preserving page boundaries.
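A minimal sketch of this two-step transformation (hypothetical helper and trace format; the actual replay tooling is described in [20]). Here "preserving page boundaries" is interpreted as rounding the split size down to a whole number of 8 KB pages:

    # Hypothetical sketch: coalesce back-to-back sequential accesses to the
    # same object, then split each merged run into stripe-unit-sized I/Os
    # aligned to whole 8 KB pages.
    PAGE_BLOCKS = 16  # 8 KB pages, 512-byte blocks

    def transform_trace(trace, stripe_unit_blocks):
        """trace: list of (object_id, start_block, num_blocks) in issue order."""
        merged = []
        for obj, start, length in trace:
            prev = merged[-1] if merged else None
            if prev and prev[0] == obj and prev[1] + prev[2] == start:
                prev[2] += length                 # extend the sequential run
            else:
                merged.append([obj, start, length])

        split_size = (stripe_unit_blocks // PAGE_BLOCKS) * PAGE_BLOCKS
        ios = []
        for obj, start, length in merged:
            for off in range(0, length, split_size):
                ios.append((obj, start + off, min(split_size, length - off)))
        return ios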
To simulate traditional disk arrays with a (relatively
small) hard-coded stripe unit size, we set the stripe unit
size to 64 KB (128 blocks) and called this base case scenario
64K-RAID. To simulate systems that approximate
track size, but do not take advantage of disk characteristics, we set the stripe unit to 256 KB (512 blocks) and called this scenario Approximate-RAID. By taking advantage of automatically extracted explicit disk characteristics, Atropos can set the stripe unit size to the exact track size of 343 KB; we called this scenario Atropos-RAID. For all experiments, we used the Atropos LVM configured for RAID 0 (i.e., d = 1, with w set to 128, 512, and 686 blocks, respectively) and I/O sizes matching the stripe unit sizes.

Disk                   Year  Dmax   b   R
HP C2247               1992    7    1   0
IBM Ultrastar 18 ES    1998    7    5   0
Quantum Atlas 10K      1999    6    2   0
Seagate Cheetah X15    2000    6    4   2
Seagate Cheetah 36ES   2001    6    7   3
Maxtor Atlas 10K III   2002    5   26  10
Seagate Cheetah 73LP   2002    7    8   2

Table 2: Quadrangle parameters across disk generations. For each disk, the amount of space not utilized due to the R residual DLBNs is less than 1%, with the exception of the Atlas 10K III, where it is 1.5%.
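For reference, the w values above follow directly from the 512-byte block size (686 blocks being the 343 KB track noted in Section 5.2.1):

    # Block counts for the three stripe-unit settings, assuming 512-byte blocks.
    BLOCK_BYTES = 512
    print(64 * 1024 // BLOCK_BYTES)    # 128 blocks -> 64K-RAID
    print(256 * 1024 // BLOCK_BYTES)   # 512 blocks -> Approximate-RAID
    print(686 * BLOCK_BYTES // 1024)   # 343 KB     -> Atropos-RAID (one full track)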
The resulting I/O times of the 22 TPC-H queries are
shown in Figure 10. The graph shows the speedup of
the Approximate-RAID and Atropos-RAID scenarios relative
to the base case scenario 64K-RAID. The results
are in agreement with the expectations of sequential access
efficiencies in Figure 7. The larger, more efficient
I/Os of the Approximate-RAID and Atropos-RAID result
in the observed speedup. The exact full-track access of
Atropos-RAID provides an additional 2%–23% benefit.
Some fraction of the query I/Os are not sequential or
stripe-unit sized (e.g., less than 1% for query 1 vs. 93%
and 99% for queries 14 and 17, respectively). These
differences explain why some queries are sped up more
than others; Atropos does not significantly improve the
small random I/Os produced by index accesses.
The efficiency of Atropos-RAID is automatically
maintained with technology changes (e.g., increasing
numbers of sectors per track). While Approximate-RAID
must be manually set to approximate track size (provided
it is known in the first place), Atropos automatically
determines the correct value for its disks and sets
its stripe unit size accordingly, eliminating the error-prone
manual configuration process.
5.4 Two-dimensional data access
To quantify the benefits of both efficient sequential and
semi-sequential accesses, we used a TPC-H benchmark
run on the Shore database storage manager [3] with three
different layouts. The first layout, standard NSM, is optimized
for row-major access. The second layout, standard
DSM, vertically partitions data to optimize column-major
access. The third layout, here called AtroposDB,
uses the page layout described in Section 4, which can
Figure 10: TPC-H trace replay on Atropos. The plot shows, for each of the 22 TPC-H queries, the speedup of the Approximate-RAID and Atropos-RAID scenarios relative to the 64K-RAID base case.
take advantage of Atropos’s full set of features. Each
setup uses an 8 KB page size.
The results for TPC-H queries 1 and 6 are shown in
Figure 11.¹ These queries scan through the LINEITEM
table (the largest table of the TPC-H benchmark), which
consists of 16 attributes, and calculate statistics based
on a subset of six (Q1) and four (Q6) attributes. As expected,
the performance of DSM and AtroposDB is comparable,
since both storage models can efficiently scan
through the table, requesting data only for the desired
attributes. NSM, on the other hand, fetches pages that
contain full records (including the attributes not needed
by the queries), which results in the observed 2.5× to 4× worse performance. All scans were performed on
a 1 GB TPC-H installation, with the LINEITEM table
constituting about 700 MB.
Random record accesses to the LINEITEM table approximate
an OLTP workload behavior, which is dominated
by row-major access. For DSM, which is not suitable
for row-major access, this access involves 16 random
accesses to the four disks. For NSM, this access
involves a random seek and rotational latency of half a
revolution, followed by an 8 KB page access, resulting in
a run time of 5.76 s. For AtroposDB, this access includes
the same seek, but most rotational latency is eliminated,
thanks to semi-sequential access. It is, however, offset
by incurring additional track switches. With d=4, the
run time is 8.32 s; with d=2, it is 6.56 s. These results
are in accord with the random access results of Figure 8.
6 Related work
Atropos builds upon many ideas proposed in previous
research. Our contribution is in integrating them into
a novel disk array organization, cleanly extending the
storage interface to allow applications to exploit it, and
evaluating the result with a real implementation executing
benchmark database queries.
¹Of the four TPC-H queries implemented by the Shore release available to us, only Q1 and Q6 consist of a LINEITEM table scan that is dominated by I/O time.
Shore Database Storage Manager, runtime [s]:

                     TPC-H Query 1   TPC-H Query 6   1000 Random Records
NSM                       8.14            7.34              5.76
DSM                       2.68            1.89             18.69
AtroposDB (d=4)           2.70            1.88              8.32
AtroposDB (d=2)           2.77            1.83              6.56
Figure 11: Database access results. This figure shows the runtime
of TPC-H queries 1 and 6 for three different layouts. The NSM and
DSM layouts are optimized for row- and column-order access respectively,
trading off performance in the other-order. Atropos, on the other
hand, offers efficient execution of TPC-H queries as well as randompage
access in OLTP workloads (random access to 1000 records).
Atropos’s track-based striping is inspired by recent
work on track-aligned extents [22], which showed that
track-based access could provide significant benefits for
systems using a single disk. Most high-end systems use
disk arrays and LVMs, and Atropos allows those to utilize
track-based access.
Gorbatenko and Lilja [9] proposed a diagonal disk layout
of relational tables, allowing what we call semi-sequential
access. Atropos integrates such diagonal layout
with track-aligned extents to realize its support for
two-dimensional data structures with no penalty to the
primary access order.
Section 2.4 describes how modern databases address
the trade-off between row-major and column-major access
to database tables. Recently, Ramamurthy and DeWitt proposed that database systems should address this
trade-off by maintaining two copies of data, one laid out
in column-major order and the other in row-major order,
to ensure efficient accesses in both dimensions [15].
This approach, however, not only doubles the storage requirements,
but also makes updates difficult; they must
propagate to two copies that are laid out differently.
Atropos provides efficient accesses in both dimensions
with only one copy of the data.
Denehy et al. [7] proposed the ExRAID storage interface
that exposes some information about parallelism
and failure-isolation boundaries of a disk array. This
information was shown to allow application software
to control data placement and to dynamically balance
load across the independent devices in the disk array.
ExRAID exposed coarse boundaries that were essentially
entire volumes, which could be transparently
spread across multiple devices. Atropos, on the other
hand, defines a particular approach to spreading data
among underlying devices and then exposes information
about its specifics. Doing so allows applications to benefit
from storage-managed parallelism and redundancy,