A Disk RAID array Volume Manager for Disk

2.3 Access to 2D data structures.

Figure 2 uses a simple example to illustrate Atropos’s benefits to applications that require efficient access to two-dimensional structures in both dimensions and contrasts it with conventional striping in disk arrays. The example depicts a two-dimensional data structure (e.g., a table of a relational database) consisting of four columns [1; : : : ;4] and many rows [a; : : : ; z]. For simplicity, each element (e.g., a1) maps to a single LBN of the logical volume spanning two disks. To map this two-dimensional structure into a linear space of LBNs, conventional systems decide a priori which order (i.e., column- or row-major) is likely to be accessed most frequently [5]. In the example in Figure 2, a column-major access was chosen and hence the runs of [a1;b1; : : : ; z1], [a2;b2; : : : ; z2], [a3;b3; : : : ; z3], and [a4;b4; : : : ; z4] assigned to contiguous LBNs. The mapping of each element to the LBNs of the individual disks is depicted in Figure 2(b) in a layout called Na¨ıve. When accessing a column, the disk array uses (i) sequential access within each disk and (ii) parallel access to both disks, resulting in maximum efficiency. Accessing data in the other order (i.e., row-major), however, results in disk I/Os to disjoint LBNs. For the example in Figure 2(b), an access to row [a1;a2;a3;a4] requires four I/Os, each of which includes the high positioning cost for a small random request. The inefficiency of this access pattern stems from the lack of information in conventional systems; one column is blindly allocated after another within the LBN address space. Atropos supports efficient access in both orders with a new data organization, depicted in Figure 2(c). This layout maps columns such that their respective first row elements start on the same disk and enable efficient roworder access. This layout still achieves sequential, and hence efficient, column-major access, just like the Na¨ıve layout. Accessing the row [a1;a2;a3;a4], however, is much more efficient than with Na¨ıve. Instead of small random accesses, the row is now accessed semi-sequentially in (at most) one disk revolution, incurring much smaller positioning cost (i.e., eliminating all but the first seek and all rotational latency). Section 3 describes why this semi-sequential access is efficient. 2.4 Efficient access for database systems By mapping two-dimensional structures (e.g., large nonsparse matrices or database tables) into a linear LBN space without providing additional information to applications, efficient accesses in conventional storage systems are only possible in one of row- or column-major order. Database management systems (DBMS) thus predict the common order of access by a workload andchoose a layout optimized for that order, knowing that accesses along the other major axis will be inefficient.

In particular, online transaction processing (OLTP) workloads, which make updates to full records, favor efficient row-order access. On the other hand, decision support system (DSS) workloads often scan a subset of table columns and get better performance using an organization with efficient column-order access [15]. Without explicit support from the storage device, however, a DBMS system cannot efficiently support both workloads with one data organization. The different storage models (a.k.a. page layouts) employed by DBMSs trade the performance of row-major and column-major order accesses. The page layout prevalent in commercial DBMS, called the N-ary storage model (NSM), stores a fixed number of full records (all n attributes) in a single page (typically 8 KB). This page layout is optimized for OLTP workloads with rowmajor access and random I/Os. This layout is also efficient for scans of entire tables; the DBMS can sequentially scan one page after another. However, when only a subset of attributes is desired (e.g., the column-major access prevalent in DSS workloads), the DBMS must fetch full pages with all attributes, effectively reading the entire tableven though only a fraction of the data is needed.