
A Case for Heterogeneous Disk Arrays

T. Cortes and J. Labarta
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - Barcelona
http://www.ac.upc.es/hpc/
{toni,jesus}@ac.upc.es

Abstract
Heterogeneous disk arrays are becoming a common configuration at many sites, especially in storage area networks (SANs). As new disks have different characteristics than old ones, adding new disks or replacing old ones results in a heterogeneous disk array. Current solutions for this kind of array do not take advantage of the improved characteristics of the new disks. In this paper, we present a block-distribution algorithm that exploits these new characteristics and thus improves the performance and capacity of heterogeneous disk arrays compared to current solutions.

1 Introduction
Heterogeneous disk arrays are becoming (or will become in the near future) a common configuration at many sites. For example, whenever a component of a traditional array fails, it has to be replaced by a new one. The same thing happens when the capacity needs of a site grow and new disks have to be added to the array. In both cases, it will be difficult to buy the same disks as the ones in the original configuration [5], and thus newer disks will be added. This turns the array into a heterogeneous one, because it will be made of disks with different characteristics. This kind of heterogeneity becomes even more important in clusters of workstations or storage area networks (SANs) because, in these configurations, disks are quite loosely coupled, which simplifies the task of building arrays with different kinds of disks. Furthermore, heterogeneous disk arrays also help to build low-cost clusters, as all available hardware can be used to improve both the capacity and the performance of the storage system.
This work has been supported by the Spanish Ministry of Education (CICYT) under the TIC-95-0429 contract. The contents of this technical report have been published in the 1st IEEE International Conference on Cluster Computing (Cluster 2000).

Current systems do not take the differences between disks into account when handling this kind of disk array: all disks are treated as if they had the same capacity and performance. This is a problem because improvements in both the capacity and the response time of the heterogeneous array could be achieved if each disk were used according to its characteristics. In this work, we present a simple solution to this problem by proposing AdaptRaid0, a block-distribution algorithm that improves the performance and effective capacity of heterogeneous disk arrays compared to current solutions. This paper is organized in 6 sections. Section 2 presents the environment where our proposal should work. Section 3 gives an overview of the previous work on heterogeneous disk arrays. Section 4 describes AdaptRaid0, the new block-distribution policy presented in this paper. In Section 5 we evaluate the proposed algorithm. Finally, Section 6 summarizes the main contributions of this work.

2 Target Environment
In this work, we focus on how to take advantage of heterogeneity in RAID level 0. This kind of disk array is widely used in high-performance environments because it is the configuration that obtains the best performance [2]. The method presented in this paper has been designed to work on any kind of disk array (hardware or software, tightly or loosely coupled, etc.), although we will only evaluate the behavior of an array made of network-attached disks in a storage area network (SAN), which seems to be a very promising approach. This approach becomes even more interesting when merged with a cluster of workstations, which is also becoming a popular configuration. Finally, we will focus our attention on scientific and general-purpose applications because multimedia environments, and their special assumptions, have already been addressed [4, 8, 12].

3 Related Work
Some other projects have already addressed the same problem, but they have always been focused on multimedia systems (especially video and audio servers). The work done by Santos and Muntz [8] proposed a random distribution with replication to improve the short- and long-term load balance. In a similar line, Zimmermann proposed a data-placement policy based on the creation of logical disks composed of fractions or combinations of several physical disks [12]. Finally, Dan and Sitaram proposed using the fast disks to place hot data while the less active data is located on the slow disks [4]. The main difference from our approach is that all these projects were targeted at multimedia systems, while we want a solution for general-purpose and scientific environments. Due to their focus on multimedia, they could make assumptions such as very large disk blocks (1 Mbyte), reads being much more important than writes, and sustained bandwidth being the main objective as opposed to best effort. These assumptions are not valid in our environment, where blocks are only a few Kbytes in size, writes are as important as reads, and sustained bandwidth is not as important as a fast response time. The only work, as far as we know, that deals with this problem in a non-multimedia environment has been implemented in Linux. In this system, any of the disks in the array can be built by putting several disks together [10]. Each disk stores part of the blocks assigned to this virtual disk. This solution is very simple but limited, and it does not completely solve either the performance or the capacity problem. Slow disks are always used, and we will see that this is not the best solution to the problem. Furthermore, it is quite difficult to find a combination of two disks that has exactly the same capacity as the larger disks. There have also been other projects that dealt with a heterogeneous set of disks, but their objective was to propose new architectures using different disks for different tasks. Along this line we could mention HP AutoRAID [11] and DCD [6]. In our work, we do not try to decide which is the best hardware and then buy it; we want to deal with already existing devices, whichever they are. This allows us to build high-performance and high-capacity storage systems at a low cost.

4 AdaptRaid0: A New Block-Distribution Policy

4.1 Disk Arrays and Parallelism

One of the main objectives of a disk array is to offer a high bandwidth by exploiting data-access parallelism. A first kind of parallelism is achieved within a single request. In this case, all disks work together to fulfill a single request and thus the time spent transferring data from the magnetic surface is divided by the number of disks. This kind of parallelism makes sense when requests are large. If requests are small, the portion of the time spent transferring the data is so small that the parallelism obtained does not improve the response time significantly. A second kind of parallelism occurs when several requests, which can be handled by different disks, are served simultaneously. This kind of parallelism makes sense when requests are small. If requests are large, they will use all disks and the parallelism between requests will decrease significantly. When none of these two kinds of parallelism can be exploited, the disk array hardly offers any performance benefit over a single disk.

4.2 Intuitive Idea


As we have already mentioned, replacing an old disk with a new one or adding new disks to an old array are two common scenarios. In both cases, the new disks are usually larger and faster than the old ones [5]. For this reason, we start by assuming that faster disks are also larger, although we will drop this assumption at the end of this section. The intuitive idea is to place more data blocks on the larger disks than on the smaller ones. This makes sense because larger disks are also faster, and thus they can serve more blocks per unit of time. Following this idea, we propose to use all the disks (as in a regular RAID0) for as many lines as blocks fit in the smallest disk. Once the smallest disk is full, we use the rest of the disks as if we had a disk array with D-1 disks. This distribution continues until all disks are full of data. A side effect of this distribution is that the system may have lines with different lengths. For instance, if the array has D disks where F of them are fast, the array will have lines with D blocks, but it will also have lines with F blocks. Nevertheless, we will show that this effect is not a problem. Figure 1 presents the distribution of blocks in a four-disk array where each disk has a different capacity.
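The following sketch is only an illustration of this intuitive distribution, not the authors' implementation; the function name and the representation of the layout are our own assumptions, and per-disk capacities are given in blocks.

    # Minimal sketch of the intuitive distribution (Section 4.2).
    def intuitive_layout(capacities_in_blocks):
        """Return layout[d] = list of logical block numbers stored on disk d."""
        disks = range(len(capacities_in_blocks))
        remaining = list(capacities_in_blocks)
        layout = [[] for _ in disks]
        next_block = 0
        # Stripe over every disk that still has free blocks; each time the
        # smallest remaining disk fills up, the lines get one block shorter.
        while any(remaining[d] > 0 for d in disks):
            line = [d for d in disks if remaining[d] > 0]
            for d in line:                 # one line: one block per usable disk
                layout[d].append(next_block)
                next_block += 1
                remaining[d] -= 1
        return layout

    # The four-disk array of Figure 1 (capacities of 6, 5, 4 and 2 blocks):
    print(intuitive_layout([6, 5, 4, 2]))
    # [[0, 4, 8, 11, 14, 16], [1, 5, 9, 12, 15], [2, 6, 10, 13], [3, 7]]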

4.3 Allowing More Parallelism


Examining the algorithm so far, we observe that all long lines are placed in the lower portion of the disks while all short lines appear in the higher portion. If we have a set of sequential requests in the higher part of the array, only a portion of the disks is used and the possible parallelism is limited. For this reason, distributing the location of long and short lines all over the array will improve the behavior of the array.


          Disk 0   Disk 1   Disk 2   Disk 3
Line 0       0        1        2        3
Line 1       4        5        6        7
Line 2       8        9       10
Line 3      11       12       13
Line 4      14       15
Line 5      16

Figure 1. Distribution of data blocks according to the intuitive version.

To make this distribution, we introduce the concept of a pattern of lines. The algorithm assumes, for a moment, that the disks are smaller (but with the same proportions in size) and distributes the blocks in this smaller array. This distribution becomes the pattern that is repeated until all disks are full. The resulting distribution has the same number of lines as the previous version of the algorithm. Furthermore, each disk also holds the same number of blocks as in the previous version. The only difference is that short and long lines are distributed all over the array, which was our objective. With this solution, we can see Figure 1 as a pattern that can be repeated on disks thousands of times larger than the ones presented.
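As an illustration only (again a sketch under assumed names, reusing intuitive_layout from the previous sketch), the pattern can be built on the scaled-down capacities and then tiled as many times as it fits on the real disks:

    # Sketch of the "pattern of lines" idea (Section 4.3): build the layout for
    # scaled-down disks once and repeat it, shifting the block numbers by the
    # number of blocks contained in one pattern.
    def pattern_layout(pattern_capacities, repetitions):
        blocks_per_pattern = sum(pattern_capacities)
        base = intuitive_layout(pattern_capacities)   # sketch from Section 4.2
        layout = [[] for _ in pattern_capacities]
        for r in range(repetitions):
            offset = r * blocks_per_pattern
            for d, blocks in enumerate(base):
                layout[d].extend(b + offset for b in blocks)
        return layout

    # Two repetitions of the Figure 1 pattern reproduce the layout of Figure 2:
    print(pattern_layout([6, 5, 4, 2], repetitions=2))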

          Disk 0   Disk 1   Disk 2   Disk 3
Line 0       0        1        2        3
Line 1       4        5        6        7
Line 2       8        9       10
Line 3      11       12       13
Line 4      14       15
Line 5      16
Line 6      17       18       19       20
Line 7      21       22       23       24
Line 8      25       26       27
Line 9      28       29       30
Line 10     31       32
Line 11     33

Figure 2. Block distribution using the final algorithm (LIP = 6; the first two repetitions of the pattern are shown).

4.4 Generalizing the Solution


So far, we have presented an algorithm that works perfectly under the assumption that the size of disks and their performance grow at the same pace, but this is not usually the case [5]. For this reason, we want to generalize the algorithm in order to make it usable in any environment. If we examine the algorithm, we can see that there are two main ideas that can be parameterized. The first one is the number of blocks we place on each disk. So far, we assumed that all blocks in a disk were used. Now, we want to add a parameter to the algorithm that defines the proportion of blocks that are placed on each disk. The utilization factor (UF), which is defined on a per-disk basis, is a number between 0 and 1 that defines the relation between the number of blocks placed on each disk. The disk with the most blocks always has a UF of 1 and the rest of the disks have a UF related to the number of blocks they use compared to the most loaded one. For instance, if a disk has a UF of 0.5, it means that it stores half the number of blocks of the most loaded one. This parameter allows the system administrator to decide the load of the disks. We could set values that reflect the size of the disks, as we have assumed so far, or we could also find values that reflect the performance of the disks instead of their capacity.

The second parameter is the number of lines in the pattern (LIP). The number of lines in the pattern indicates how well distributed the different kinds of lines are along the array. This parameter equals the number of blocks the largest disk has in a pattern. Nevertheless, we should keep in mind that smaller disks will participate in fewer than LIP lines. Figure 2 presents a graphic example of how blocks are distributed in the first two repetitions of the pattern when LIP = 6 and the UF values are proportional to the disk capacities of Figure 1. Remember that the picture only shows the first two repetitions of the pattern. A sketch of this parameterization is given at the end of this section.

Fast but small disks: a special case

The current algorithm can be used with any kind of disks. Nevertheless, it does not make much sense if the fastest disks are also significantly smaller. In this case, a better use for these disks would be to keep hot data, as proposed by Dan and Sitaram [4].
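As a hedged sketch only (the function name and the rounding rule are our assumptions, not taken from the paper), the UF and LIP parameters can be turned into per-disk pattern capacities and fed to the pattern construction shown earlier:

    # Sketch of the generalized algorithm (Section 4.4): each disk contributes
    # roughly UF * LIP blocks to one pattern; the disk with UF = 1 contributes
    # exactly LIP blocks. Reuses pattern_layout() from the previous sketch.
    def pattern_capacities(uf, lip):
        return [int(round(u * lip)) for u in uf]

    # Example loosely based on the 8-disk experiments of Section 5: UF = 1 for
    # the fast disks, UF = 0.46 for the slow ones, and LIP = 100.
    uf = [1.0, 1.0, 1.0, 1.0, 0.46, 0.46, 0.46, 0.46]
    caps = pattern_capacities(uf, lip=100)   # [100, 100, 100, 100, 46, 46, 46, 46]
    layout = pattern_layout(caps, repetitions=1)
    print([len(blocks) for blocks in layout])

Setting the UF values from the disk capacities reproduces the behavior of the previous sections, while setting them from measured disk performance biases the load towards the faster disks, as discussed above.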

5 Performance Results
5.1 Methodology
Simulation and Environment Issues

In order to perform this work, we have used HRaid [3], which is a storage-system simulator.

Table 1. Disk characteristics.

                      Fast disks (Cheetah 4LP)   Slow disks (Barracuda 4LP)
Disk size             4.339 Gbytes               2.061 Gbytes
Cache size            512 Kbytes                 128 Kbytes
Sector size           512 bytes                  512 bytes
Read/Write fence      64 Kbytes                  64 Kbytes
Prefetching           YES                        YES
Immediate report      YES                        YES
New-cmd overhead      1100 µs                    1100 µs
RPM                   10033                      7200
Track switch          800 µs                     800 µs
Limit                 600                        600
Short seek            a=1.55, b=0.155134         a=3.0, b=0.232702
Long seek             a=4.2458, b=0.001740       a=7.2814, b=0.002364

Table 2. Workload characteristics.

                         Reads                           Writes
Requests                 159208                          115044
Request size (average)   12.6 Kbytes                     12.4 Kbytes
Request size (mode)      8 Kbytes (86.3% of all reads)   8 Kbytes (69.7% of all writes)
All tests presented in this paper were performed simulating an array with a combination of slow and fast disks, for simplicity reasons. The model used for these disks is the one proposed by Ruemmler and Wilkes [7]. We have simulated two Seagate disks [9], and the most important characteristics of each disk (controller and drive) are presented in Table 1. The striping unit used is 512 bytes because it allows us to present the behavior of both inter-request and intra-request parallelism when using the real-world traces from HP. Nevertheless, we have also used larger block sizes such as 64 Kbytes and 128 Kbytes with very similar results. These disks and the hosts were connected through a Fast-Ethernet network (10 µs latency and 100 Mbits/s bandwidth). We simulated the contention of the network, but no protocol overhead was simulated. Finally, we have to keep in mind that the simulations only took the network and the disks (controller and drive) into account. The possible overhead of the requesting hosts was not simulated because it greatly depends on the implementation of the file system. The only file-system issue we simulated is that it can only handle 10 requests at a time. The remaining requests wait in a queue until one of the previous requests has been served.

Workload Issues

All the results presented in this paper have been obtained using a portion of the traces gathered by the Storage System Group at the HP Laboratories (Palo Alto) in 1999. These traces represent the load of a real general-purpose and scientific site. Some important data about these traces is summarized in Table 2. In this table, we present the number of requests studied as well as their size. Regarding the size, we present two measures: the average size and the mode. On the one hand, it is important to know the average size of the requests because RAIDs do not behave in the same way for small requests as for large ones. On the other hand, it is important to know the mode, which is the request size used most frequently. This will also be important to understand the behavior of the proposed algorithms. Besides this workload, we have also tested other traces, such as the HP traces in their 1992 version [1] and some synthetic ones. Although we only present the results obtained with the first set of traces (for brevity reasons), very similar results have been obtained in all our experiments.

Configurations Studied

As we mentioned in Section 4.1, we are interested in studying the parallelism between requests and what happens when no parallelism can be exploited. For this reason, we will examine two different configurations. The first one has 8 disks, which means that nearly no parallelism between requests will be observed (the average request uses all disks). The second configuration has 32 disks and allows this kind of parallelism, as most requests only use 16 disks. For simplicity, the configurations will always have all fast disks in the first positions and the slow ones in the last positions of the array. Finally, we have chosen a single LIP for all experiments that have the same number of disks, also for simplicity reasons. This value has been computed using our previous experience, and the results are LIP=100 when 8 disks are used and LIP=10 when 32 disks are used. Regarding the UF for each disk, we have used 1 for the large and fast disks and 0.46 for the slow and small ones (this relationship is the same as the one found between their capacities). We should keep in mind that the aim of this paper is to prove that AdaptRaid0 is a better solution than the ones currently being used, not to find the best possible parameters for the presented distribution algorithm.

Figure 3. Available capacity for the studied configurations (total capacity in Gbytes versus the number of fast disks in the array, for OnlyFast, RAID0 and AdaptRaid0).

Figure 4. Performance gain when queue delays are not taken into account (gain in % versus the number of fast disks, for reads and writes, comparing AdaptRaid0 vs. RAID0 and AdaptRaid0 vs. OnlyFast).

Reference Systems

We have compared AdaptRaid0 with the following two base configurations:

RAID0: This is the traditional RAID0 algorithm, and it uses all the disks. It is important to notice that this leads to fast disks being treated as if they were slow ones and that only a portion of their capacity is effectively used.

OnlyFast: This is also a traditional RAID0, but it only uses the fast disks (slow disks are ignored). The number of fast disks is the same as the number of fast disks in the heterogeneous configuration. This comparison will tell us whether it is better to throw the old disks away instead of using them.

5.2 Capacity Evaluation


As capacity is an important issue, we have drawn a graph with the effective capacity each configuration offers depending on the distribution algorithm used (Figure 3). We can see that AdaptRaid0 is the one that obtains the highest capacity. This happens because it knows how to take advantage of the capacity of all the disks in the array. The graph is for an 8-disk array, but a very similar one can be drawn for a 32-disk array.
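A small sketch, under our own reading of how each policy uses the disks (RAID0 is limited to the capacity of the smallest disk on every drive, OnlyFast discards the slow disks, AdaptRaid0 uses everything), reproduces the shape of Figure 3:

    # Effective capacity (Gbytes) of an 8-disk array built from F fast disks
    # (4.339 Gbytes) and 8-F slow disks (2.061 Gbytes), for the three policies
    # compared in Figure 3. These formulas are assumptions, not from the paper.
    FAST, SLOW, DISKS = 4.339, 2.061, 8

    def capacities(fast_disks):
        slow_disks = DISKS - fast_disks
        only_fast = fast_disks * FAST                       # slow disks are ignored
        raid0 = DISKS * (SLOW if slow_disks > 0 else FAST)  # every disk limited to the smallest
        adaptraid0 = fast_disks * FAST + slow_disks * SLOW  # all capacity is used
        return only_fast, raid0, adaptraid0

    for f in range(DISKS + 1):
        print(f, ["%.1f" % c for c in capacities(f)])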

5.3 Performance with 8-Disk Arrays


In this section, we present the performance results obtained by our block-distribution algorithm when 8-disk arrays are used. Figures 4 and 5 present the relative gain in response time obtained by AdaptRaid0 versus both reference systems. Figure 4 presents the results obtained if we start counting the time as soon as the file system starts to serve a request. This measure does not take into account the time spent in the file-system queue (remember that, at most, 10 operations were served at a time). This graph intends to show the behavior of a system that is not too loaded. Figure 5 presents the results for the same experiments but takes the waiting time into account. This graph intends to show the behavior of a highly loaded system.

If we focus our attention on Figure 4, we can see that our algorithm is significantly faster than the traditional RAID0, especially for write operations. Furthermore, the benefits grow as the number of fast disks increases. This happens because most operations use all disks and, in a traditional RAID0, they take as long as the slowest component. If we examine the comparison with OnlyFast, we can see that it is not easy to decide which algorithm is better. As we mentioned in Section 4.1, if a request is small, the gain obtained by serving it with several disks is very small. For this reason, having a smaller number of disks in OnlyFast does not increase the response time, especially when compared to the overhead of using more disks where some of them are slow ones. Nevertheless, our algorithm offers a higher capacity, which makes it a slightly better option.

If we focus on the second graph (Figure 5), we can see that the performance gains obtained over RAID0 are much higher. This happens due to the high load of the system. As the time needed to serve a request on a RAID0 is longer, the time in the queue also grows, and at a higher pace. The comparison with OnlyFast has also changed a little. As the waiting time is now taken into account, having fewer disks increases this time, and thus our algorithm performs better. This behavior holds as long as not too many fast disks are used, in which case OnlyFast has enough disks to reduce the waiting time.

Figure 5. Performance gain when queue delays are taken into account (gain in % versus the number of fast disks in the 8-disk arrays, for reads and writes, comparing AdaptRaid0 vs. RAID0 and AdaptRaid0 vs. OnlyFast).

Figure 6. Performance gain when queue delays are not taken into account (gain in % versus the number of fast disks in the 32-disk arrays, for reads and writes, comparing AdaptRaid0 vs. RAID0 and AdaptRaid0 vs. OnlyFast).



5.4 Performance with 32-Disk Arrays

In this section, we present the performance results obtained by our block-distribution algorithm in arrays with 32 disks. The same two graphs are presented in Figures 6 and 7. If we examine the results obtained when no waiting time is taken into account (Figure 6), we observe that RAID0 is now much closer to the performance of AdaptRaid0 than OnlyFast is. This happens due to the parallelism between requests. As more than one request can be served in parallel, the number of disks makes an important difference, and it is better to have 32 slow disks than only 16 fast ones. We can observe that, in both cases, a discontinuity appears when 16 fast disks are used, which is the number of disks needed by most requests. When this configuration is used in AdaptRaid0, many of the requests are handled by the fast disks while the slow ones remain quite idle. This reduces the parallelism between requests, while no such thing happens in RAID0. As the number of fast disks increases, parallelism between requests that only use fast disks starts to be possible and the gain starts increasing again. For this reason, the gain compared to RAID0 reaches a local minimum when there are 16 fast disks. On the other hand, when OnlyFast has more than 16 disks, it can start to increase its parallelism among requests and its performance gets much closer to AdaptRaid0. Finally, if we take the waiting time into account (Figure 7), the gains are again greater, and the same reason as in the previous section applies.

Figure 7. Performance gain when queue delays are taken into account (gain in % versus the number of fast disks in the 32-disk arrays, for reads and writes, comparing AdaptRaid0 vs. RAID0 and AdaptRaid0 vs. OnlyFast).

6 Conclusions

In this paper, we have presented AdaptRaid0, a block-distribution policy that takes full advantage of heterogeneous disk arrays. First, it achieves a significant performance improvement compared to the policies currently being used. We have also shown that the algorithm behaves well regardless of whether the parallelism between requests can be exploited or not. Second, it is able to use all the capacity available in all the disks.

Acknowledgments
We thank the Storage System Group at HP Laboratories (Palo Alto), and especially John Wilkes, for letting us use their 1999 disk traces and for their interesting comments. We are also grateful to the anonymous referees, whose comments helped us to improve the quality of the paper.

References
[1] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a distributed file system. In Proceedings of the 13th Symposium on Operating System Principles, pages 198-212. ACM Press, July 1991.
[2] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance and reliable secondary storage. ACM Computing Surveys, 26(2):145-185, 1994.


[3] T. Cortes and J. Labarta. HRaid: A flexible storage-system simulator. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages 772-778. CSREA Press, June 1999.
[4] A. Dan and D. Sitaram. An online video placement policy based on bandwidth to space ratio (BSR). In Proceedings of SIGMOD, pages 376-385, 1995.
[5] E. Grochowski and R. F. Hoyt. Future trends in hard disk drives. IEEE Transactions on Magnetics, 32(3), May 1996.
[6] Y. Hu and Q. Yang. A new hierarchical disk architecture. IEEE Micro, pages 64-75, November/December 1998.
[7] C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, pages 17-28, March 1994.
[8] J. R. Santos and R. Muntz. Performance analysis of the RIO multimedia storage system with heterogeneous disk configurations. In ACM Multimedia, pages 303-308, 1998.
[9] Seagate. Seagate web page. http://www.seagete.com, January 2000.
[10] L. Vepstas. Software-RAID HOWTO. http://www.linux.org/help/ldp/howto/Software-RAID-HOWTO.html, 1998.
[11] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 96-108. ACM Press, December 1995.
[12] R. Zimmermann. Continuous media placement and scheduling in heterogeneous disk storage systems. PhD thesis, University of Southern California, December 1998.
