T. Cortes and J. Labarta
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - Barcelona
http://www.ac.upc.es/hpc/ — {toni,jesus}@ac.upc.es
Abstract
Heterogeneous disk arrays are becoming a common configuration in many sites, and especially in storage area networks (SANs). As new disks have different characteristics than old ones, adding new disks or replacing old ones ends up producing a heterogeneous disk array. Current solutions for this kind of array do not take advantage of the improved characteristics of the new disks. In this paper, we present a block-distribution algorithm that takes advantage of these new characteristics and thus improves the performance and capacity of heterogeneous disk arrays compared to current solutions.
1 Introduction
Heterogeneous disk arrays are becoming (or will become in the near future) a common configuration in many sites. For example, whenever a component of a traditional array fails, it has to be replaced by a new one. The same thing happens when the capacity needs of a site grow and new disks have to be added to the array. In both cases, it will be difficult to buy the same disks as the ones in the original configuration [5], and thus newer disks will be added. This makes the array a heterogeneous one because it will be made of disks with different characteristics. This kind of heterogeneity becomes even more important in clusters of workstations or storage area networks (SANs) because, in these configurations, disks are quite loosely coupled. This simplifies the task of building arrays with different kinds of disks. Furthermore, heterogeneous disk arrays also help to build low-cost clusters, as all available hardware may be used to improve both the capacity and the performance of the storage system.
This work has been supported by the Spanish Ministry of Education (CICYT) under the TIC-95-0429 contract. The contents of this technical report have been published in the 1st IEEE International Conference on Cluster Computing (Cluster 2000).
To handle this kind of disk array, current systems do not take into account the differences between the disks: all disks are treated as if they had the same capacity and performance. This is a problem because improvements in both the capacity and the response time of the heterogeneous array could be achieved if each disk were used according to its characteristics. In this work, we present a simple solution to this problem by proposing AdaptRaid0, a block-distribution algorithm that improves the performance and effective capacity of heterogeneous disk arrays compared to current solutions. This paper is organized in 6 sections. Section 2 presents the environment where our proposal should work. Section 3 gives an overview of the previous work on heterogeneous disk arrays. Section 4 describes AdaptRaid0, the new block-distribution policy presented in this paper. In Section 5, we evaluate the proposed algorithm. Finally, Section 6 summarizes the main contributions of this work.
2 Target Environment
In this work, we focus on how to take advantage of heterogeneity in RAID level 0. This kind of disk array is widely used in high-performance environments because this configuration is the one that obtains the best performance [2]. The method presented in this paper has been designed to work on any kind of disk array (hardware or software, tightly or loosely coupled, etc.), although we will only evaluate the behavior of an array made of network-attached disks in a storage area network (SAN), which seems to be a very promising approach. This approach becomes even more interesting when merged with a cluster of workstations, which is also becoming a popular configuration. Finally, we will focus our attention on scientific and general-purpose applications because multimedia environments, and their special assumptions, have already been addressed [4, 8, 12].
3 Related Work
Some other projects have already addressed the same problem, but they have always been focused on multimedia systems (and especially video and audio servers). The work done by Santos and Muntz [8] proposed a random distribution with replication to improve the short- and long-term load balance. In a similar line, Zimmermann proposed a data-placement policy based on the creation of logical disks composed of fractions or combinations of several physical disks [12]. Finally, Dan and Sitaram proposed using fast disks to place hot data while the less active data would be located on the slow disks [4]. The main difference from our approach is that all these projects were targeted at multimedia systems, while we want a solution for general-purpose and scientific environments. Due to their focus on multimedia, they could make some assumptions such as that very large disk blocks (1 Mbyte) are used, that reads are much more important than writes, and that the main objective is to obtain a sustained bandwidth as opposed to best effort. These assumptions are not valid in our environment, where blocks are only a few Kbytes in size, writes are as important as reads, and sustained bandwidth is not as important as a faster response time. The only work, as far as we know, that deals with this problem in a non-multimedia environment has been implemented in Linux. In this system, any of the disks in the array can be built by putting several disks together [10]. Each disk will store part of the blocks assigned to this virtual disk. This solution is very simple but limited, and it completely solves neither the performance nor the capacity problem. Slow disks are always used, and we will see that this is not the best solution to the problem. Furthermore, it is quite difficult to find a combination of two disks that has exactly the same capacity as the larger disks.
There have also been other projects that dealt with a heterogeneous set of disks, but their objective was to propose new architectures using different disks for different tasks. Along this line we could mention the HP AutoRAID [11] and DCD [6]. In our work, we do not try to decide which is the best hardware and then buy it; we want to deal with already existing devices, whichever they are. This allows us to build high-performance and high-capacity storage systems at a low cost.
A first kind of parallelism is achieved within a single request. In this case, all disks work together to fulfill a single request, and thus the time spent transferring data from the magnetic surface is divided by the number of disks. This kind of parallelism makes sense when requests are large. If requests are small, the portion of the time spent transferring the data is so small that the parallelism obtained does not improve the response time significantly. A second kind of parallelism occurs when several requests, which can be handled by different disks, are served simultaneously. This kind of parallelism makes sense when requests are small. If requests are large, they will use all disks and the parallelism between requests will decrease significantly. When neither of these two kinds of parallelism can be exploited, the disk array hardly offers any performance benefit over a single disk.
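As a rough illustration of this trade-off, the number of disks a single request touches in a plain RAID0 can be estimated from the request size and the striping unit. This is a back-of-the-envelope sketch of our own (the function name is hypothetical, and request alignment is ignored):

```python
import math

def disks_touched(request_bytes, stripe_unit, num_disks):
    # Number of stripe units the request spans, capped at the array width.
    # Alignment is ignored, so this is a lower bound.
    return min(math.ceil(request_bytes / stripe_unit), num_disks)

# With the 512-byte striping unit and the 8-Kbyte requests that dominate
# the traces used later in the paper:
print(disks_touched(8 * 1024, 512, 8))    # 8: the request uses every disk,
                                          # leaving no inter-request parallelism
print(disks_touched(8 * 1024, 512, 32))   # 16: half the disks stay free
                                          # to serve other requests
```

This matches the two configurations evaluated in Section 5: with 8 disks a typical request occupies the whole array, while with 32 disks requests can proceed in parallel.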
Figure 1. Distribution of blocks in one pattern (4 disks, LIP = 6):

Line:    0   1   2   3   4   5
Disk 0:  0   4   8  11  14  16
Disk 1:  1   5   9  12  15
Disk 2:  2   6  10  13
Disk 3:  3   7
Figure 2. Distribution of blocks in the first two repetitions of the pattern:

Line:    0   1   2   3   4   5   6   7   8   9  10  11
Disk 0:  0   4   8  11  14  16  17  21  25  28  31  33
Disk 1:  1   5   9  12  15      18  22  26  29  32
Disk 2:  2   6  10  13          19  23  27  30
Disk 3:  3   7                  20  24
To make this distribution, we introduce the concept of a pattern of lines. The algorithm assumes, for a moment, that disks are smaller (but with the same proportions in size) and distributes the blocks in this smaller array. This distribution becomes the pattern, which is repeated until all disks are full. The resulting distribution has the same number of lines as the previous version of the algorithm. Furthermore, each disk also has the same number of blocks as in the previous version. The only difference is that short and long lines are distributed all over the array, which was our objective. With this solution, we can see Figure 1 as a pattern that can be repeated on disks thousands of times larger than the ones presented.
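The pattern-based placement just described can be sketched as follows. This is a minimal sketch under our own assumptions: `build_pattern` and `locate` are hypothetical names, and the per-pattern block counts (6, 5, 4 and 2) are read off Figure 1; the paper does not give this exact code.

```python
def build_pattern(blocks_in_pattern):
    """Map each logical block inside one pattern to a (disk, line) pair.

    Line l of the pattern is striped only over the disks that still have
    blocks left at that line, i.e. those with blocks_in_pattern[d] > l.
    Assumes disks are sorted so blocks_in_pattern is non-increasing.
    """
    lip = max(blocks_in_pattern)          # lines in the pattern (LIP)
    pattern = []
    for line in range(lip):
        for disk, n in enumerate(blocks_in_pattern):
            if n > line:                  # disk participates in this line
                pattern.append((disk, line))
    return pattern

def locate(logical_block, blocks_in_pattern):
    """Translate a logical block number into (disk, physical line)."""
    pattern = build_pattern(blocks_in_pattern)
    lip = max(blocks_in_pattern)
    repetition, offset = divmod(logical_block, len(pattern))
    disk, line = pattern[offset]
    return disk, repetition * lip + line

# Reproducing Figure 1: block 16 is the only block in the last, short line,
# and block 17 (Figure 2) starts the second repetition on disk 0.
print(locate(16, [6, 5, 4, 2]))   # (0, 5)
print(locate(17, [6, 5, 4, 2]))   # (0, 6)
```

The key point the sketch captures is that repeating a small pattern spreads short and long lines uniformly over the whole array, instead of leaving all short lines at the end.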
The first parameter is the utilization factor (UF), which specifies the proportion of blocks to be placed on each disk instead of the capacity. The second parameter is the number of lines in the pattern (LIP), which indicates how well distributed the different kinds of lines are along the array. This parameter equals the number of blocks the largest disk has in a pattern. Nevertheless, we should keep in mind that smaller disks will participate in fewer than LIP lines. Figure 2 presents a graphic example of how blocks are distributed in the first two repetitions of the pattern. Remember that the picture only shows the first two repetitions of the pattern.

Fast but small disks: a special case

The current algorithm can be used with any kind of disks. Nevertheless, it does not make much sense if the fastest disk is also significantly smaller. In this case, a better use for these disks would be to keep hot data, as proposed by Dan and Sitaram [4].
5 Performance Results
5.1 Methodology
Simulation and Environment Issues

In order to perform this work, we have used HRaid [3], a storage-system simulator.
Table 2. Workload characteristics.

                          Reads                          Writes
Requests                  159208                         115044
Request size (average)    12.6 Kbytes                    12.4 Kbytes
Request size (mode)       8 Kbytes (86.3% of all reads)  8 Kbytes (69.7% of all writes)
Table 1. Disk characteristics: disk size, cache size, sector size, read/write fence, prefetching, immediate report, new-command overhead, RPM, track switch, limit, short seek, and long seek.
All tests presented in this paper were performed simulating an array with a combination of slow and fast disks, for simplicity reasons. The model used for these disks is the one proposed by Ruemmler and Wilkes [7]. We have simulated two Seagate disks [9], and a list with the most important characteristics of each disk (controller and drive) is presented in Table 1. The striping unit used is 512 bytes because it allows us to present the behavior of both inter-request and intra-request parallelism when using the real-world traces from HP. Nevertheless, we have also used larger block sizes such as 64 KB and 128 KB, with very similar results. These disks and the hosts were connected through a Fast-Ethernet network (10 μs latency and 100 Mbits/s bandwidth). We simulated the contention of the network, but no protocol overhead was simulated. Finally, we have to keep in mind that in the simulations we only took the network and the disks (controller and drive) into account. The possible overhead of the requesting hosts was not simulated because it greatly depends on the implementation of the file system. The only file-system issue we simulated is that it can only handle 10 requests at a time; the remaining requests wait in a queue until one of the previous requests has been served.

Workload issues

All the results presented in this paper have been obtained using a portion of the traces gathered by the Storage System Group at the HP Laboratories (Palo Alto) in 1999. These traces represent the load of a real general-purpose and scientific site. Some important data about these traces is summarized in Table 2. In this table, we present the number
of requests studied as well as their size. Regarding the size, we have presented two measures: the average size and the mode. On the one hand, it is important to know the average size of the requests because RAIDs do not behave in the same way for small requests as for large ones. On the other hand, it is important to know the mode, which is the request size used most frequently. This will also be important to understand the behavior of the proposed algorithms. Besides this workload, we have also tested other traces, such as the HP traces in their 1992 version [1] and some synthetic ones. Although we will only present the results obtained with the first set of traces (for brevity), very similar results have been obtained in all our experiments.
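The simulated network parameters given above translate into a simple per-transfer cost. The sketch below is our own illustration (the function name is hypothetical); it models only the fixed latency plus serialization time, ignoring the network contention that the simulator does model:

```python
def transfer_time_us(payload_bytes, latency_us=10, bandwidth_bps=100e6):
    # One-way time for a payload on the simulated Fast-Ethernet link:
    # fixed latency plus serialization time. Protocol overhead is ignored,
    # as in the simulations; contention is also ignored in this sketch.
    return latency_us + payload_bytes * 8 / bandwidth_bps * 1e6

# An 8-Kbyte block takes roughly 665 microseconds to cross the network,
# so network time is non-negligible compared to disk service time.
print(round(transfer_time_us(8 * 1024)))  # 665
```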
Configurations Studied
As we mentioned in Section 4.1, we are interested in studying the parallelism between requests and what happens when no parallelism can be exploited. For this reason, we will examine two different configurations. The first one will have 8 disks, which means that nearly no parallelism between requests will be observed (the average request uses all disks). The second configuration will have 32 disks and will allow this kind of parallelism, as most requests only use 16 disks. For simplicity, the configurations will always have all fast disks in the first positions and the slow ones in the last positions of the array. Finally, we have chosen a single LIP for all experiments that have the same number of disks, also for simplicity reasons. This value has been computed using our previous experience, and the results are: LIP = 100 when 8 disks are used and LIP = 10 when 32 disks are used. Regarding the UF for each disk, we have used 1 for the large and fast disks and 0.46 for the slow and small ones (this ratio is the same as the one found in the capacities). We should keep in mind that the aim of this paper is to prove that AdaptRaid0 is a better solution than the ones currently being used, not to find the best possible parameters for the presented distribution algorithm.
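The parameter values above can be illustrated with a small helper. This is hypothetical code of our own: the paper does not give a formula for deriving per-pattern block counts from UF and LIP, so the proportional rounding below is an assumption, and the capacities used are illustrative, not the simulated disks' real values.

```python
def utilization_factors(capacities):
    # Normalize by the largest capacity, so the biggest (and, in this
    # paper's setup, fastest) disk gets UF = 1.
    largest = max(capacities)
    return [c / largest for c in capacities]

def blocks_per_pattern(ufs, lip):
    # Assumed derivation: the largest disk contributes LIP blocks per
    # pattern; the others contribute a share proportional to their UF.
    return [round(uf * lip) for uf in ufs]

# A slow disk with 46% of the capacity of the fast one gets UF = 0.46;
# with LIP = 100 it would hold 46 blocks per pattern against the fast
# disk's 100.
ufs = utilization_factors([100, 100, 46, 46])
print(ufs)                           # [1.0, 1.0, 0.46, 0.46]
print(blocks_per_pattern(ufs, 100))  # [100, 100, 46, 46]
```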
Reference Systems

We have compared AdaptRaid0 with two base configurations, RAID0 and OnlyFast, described below.

Figure 4. Performance gain when queue delays are not taken into account.

Figure 4 presents the results obtained if we start counting the time as soon as the file system starts to serve a request. This measure does not take into account the time spent in the file-system queue (remember that, at most, 10 operations were served at a time). This graph intends to show the behavior of a system that is not too loaded. Figure 5 presents the results for the same experiments but taking the waiting time into account. This graph intends to show the behavior of a highly loaded system. If we focus our attention on Figure 4, we can see that our algorithm is significantly faster than the traditional RAID0, especially for write operations. Furthermore, the benefits grow as the number of fast disks increases. This happens because most operations use all disks and, in a traditional RAID0, they take as long as the slowest component. If we examine the comparison with OnlyFast, we can see that it is not easy to decide which algorithm is better. As we mentioned in Section 4.1, if a request is small, the gain obtained by serving it with several disks is very small. For this reason, having a smaller number of disks in OnlyFast does not increase the response time, especially when compared to the overhead of using more disks when some of them are slow. Nevertheless, our algorithm offers a higher capacity, which makes it a slightly better option. If we focus on the second graph (Figure 5), we can see that the performance gains obtained over RAID0 are much higher. This happens due to the high load of the system. As the time needed to serve a request on a RAID0 is longer, the time in the queue also grows, and at a higher pace. The comparison with OnlyFast has also changed a little.
As the waiting time is taken into account now, having fewer disks increases this time, and thus our algorithm performs
RAID0: This is the traditional RAID0 algorithm, and it uses all the disks. It is important to notice that this leads to fast disks being treated as if they were slow ones and that only a portion of their capacity is effectively used.

OnlyFast: This is also a traditional RAID0, but it only uses the fast disks (slow disks are ignored). The number of fast disks is the same as the number of fast disks in the heterogeneous configuration. This comparison will give us an idea of whether it is better to throw the old disks away instead of using them.
Figure 5. Performance gain when queue delays are taken into account.
Figure 6. Performance gain when queue delays are not taken into account.
better. This behavior holds as long as not too many fast disks are used; otherwise, OnlyFast has enough disks to reduce the waiting time on its own.
6 Conclusions
In this paper, we have presented AdaptRaid0, a block-distribution policy that takes full advantage of heterogeneous disk arrays. First, it achieves a significant performance improvement compared to the policies currently being used. We have also shown that the algorithm behaves well regardless of whether the parallelism between requests can be exploited or not. Second, it is able to use all the capacity available in all the disks.
Acknowledgments
We thank the Storage System Group at HP Laboratories (Palo Alto), and especially John Wilkes, for letting us use their 1999 disk traces and for their interesting comments. We are also grateful to the anonymous referees, whose comments helped us improve the quality of the paper.
References
[1] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a distributed file system. In Proceedings of the 13th Symposium on Operating System Principles, pages 198-212. ACM Press, July 1991.
[2] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance and reliable secondary storage. ACM Computing Surveys, 26(2):145-185, 1994.
Figure 7. Performance gain when queue delays are taken into account.
[3] T. Cortes and J. Labarta. HRaid: A flexible storage-system simulator. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages 772-778. CSREA Press, June 1999.
[4] A. Dan and D. Sitaram. An online video placement policy based on bandwidth to space ratio (BSR). In Proceedings of SIGMOD, pages 376-385, 1995.
[5] E. Grochowski and R. F. Hoyt. Future trends in hard disk drives. IEEE Transactions on Magnetics, 32(3), May 1996.
[6] Y. Hu and Q. Yang. A new hierarchical disk architecture. IEEE Micro, pages 64-75, November/December 1998.
[7] C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, pages 17-28, March 1994.
[8] J. R. Santos and R. Muntz. Performance analysis of the RIO multimedia storage system with heterogeneous disk configurations. In ACM Multimedia, pages 303-308, 1998.
[9] Seagate. Seagate web page. http://www.seagate.com, January 2000.
[10] L. Vepstas. Software-RAID HOWTO. http://www.linux.org/help/ldp/howto/Software-RAID-HOWTO.html, 1998.
[11] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 96-108. ACM Press, December 1995.
[12] R. Zimmermann. Continuous media placement and scheduling in heterogeneous disk storage systems. PhD thesis, University of Southern California, December 1998.