
2013 IEEE Seventh International Symposium on Service-Oriented System Engineering

Skew-Aware Task Scheduling in Clouds


Dongsheng Li+, Yixing Chen+, Richard Hu Hai

+ National Lab for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, China
Raffles Business Institute, Singapore
dsli@nudt.edu.cn

Abstract: Data skew is an important reason for the emergence of stragglers in MapReduce-like cloud systems. In this paper, we propose a Skew-Aware Task Scheduling (SATS) mechanism for iterative applications in MapReduce-like systems. The mechanism utilizes the similarity of data distribution in adjacent iterations of iterative applications to reduce the straggler problem caused by data skew. It collects the data distribution information during the execution of tasks for the current iteration, and uses the information to guide data partitioning in tasks for the next iteration. We implement the mechanism in the HaLoop system and deploy it in a cluster. Experiments show that the proposed mechanism can deal with data skew and improve load balancing effectively.

Keywords: Data Skew; Task Scheduling; Cloud; Load Balancing

I. INTRODUCTION

Cloud computing has become a promising technology in recent years, and MapReduce is one of the most successful realizations of large-scale data-intensive cloud computing platforms [1]-[3]. MapReduce uses a simple data-parallel programming model with two basic operations, i.e., the Map and Reduce operations. Users can customize the Map function and the Reduce function according to the application requirements. Each Map task takes one piece of the input data and generates a set of intermediate key/value pairs using the Map function, which are shuffled to the Reduce tasks running the Reduce function. This programming model is simple but robust, and many large-scale data processing applications can be expressed in it. MapReduce-like systems can automatically schedule multiple Map and/or Reduce tasks over distributed machines in clouds. As the only synchronization step lies between the Map phase and the Reduce phase, tasks executing in the same phase run with high parallelism, which greatly enhances the concurrency and scalability of the system. Hadoop [4] and its variants (e.g., HaLoop [5] and Hadoop++ [6]) are typical MapReduce-like systems.

Since there is a synchronization step between the Map phase and the Reduce phase in MapReduce-like systems, one slow task in either phase may slow down the execution of the whole job. Such a slow task in the Map or Reduce phase is called a straggler. When stragglers appear, the execution time of the whole job increases and the resource utilization degrades. Recent studies [7], [8] show that data skew in the Map or Reduce phase has become one of the main causes of stragglers. In many scientific computing and data analysis applications, skew in the input data or the intermediate data can cause severe load imbalance. For example, PageRank [9], which is widely used in large-scale search engines, is a typical application executed in MapReduce-like systems. The PageRank application performs a link analysis that assigns a weight (rank) to each vertex/webpage in the webpage link graph by iteratively aggregating the weights of its inbound neighbors. Studies [7], [8], [18] have shown that the degree distributions of webpage link graphs are highly skewed and some vertices have a large number of incoming edges. Since MapReduce-like systems [4] use a random hash algorithm to partition the intermediate data to Reducer nodes, the nodes responsible for computing the weights of high-degree vertices may take more time to finish their tasks and thus become the stragglers of the system. The straggler problem caused by data skew has recently become an important research topic in MapReduce-like systems.

In this paper, we propose a Skew-Aware Task Scheduling (SATS) mechanism for MapReduce-like systems. The SATS mechanism is based on the observation that many applications in MapReduce-like systems are iterative computations [5], such as PageRank [9], machine learning applications, recursive relational queries and social network analysis. In iterative applications, data are processed iteratively until the computation satisfies a convergence or stopping condition, and each iteration may consist of one or multiple MapReduce jobs. There may be similarity between the data in two adjacent iterations, so the data distributions in the jobs of adjacent iterations may also be similar. If the data distribution can be acquired before the execution of a MapReduce job, we can partition the data properly onto the nodes in the system to improve load balancing. Based on this idea, the SATS mechanism is designed to utilize the similarity of data distribution in adjacent iterations to mitigate the straggler problem caused by data skew. It collects the data distribution information during task execution in the current iteration, and uses the information to guide the data partitioning in the next iteration. As data skew often occurs in the Reduce phase of MapReduce jobs, the SATS mechanism focuses on the straggler problem in the Reduce phase.

The main contributions of this paper are as follows. First, we design a skew-aware task scheduling mechanism, called SATS, to deal with the straggler problem caused by data skew in iterative applications in MapReduce-like systems. Second, we implement the SATS mechanism and build a prototype based on HaLoop [5], an open source MapReduce-like system. Finally, we perform comprehensive experiments to evaluate the SATS mechanism, and the experimental results show that SATS can improve load balancing effectively.

The rest of this paper is organized as follows. Section II discusses related work. Section III illustrates the design and implementation of the SATS mechanism. Section IV evaluates the mechanism by experiments. Section V presents the conclusion and future work.

II. RELATED WORK

A. MapReduce-like systems

MapReduce [1] is a popular data-parallel programming model for data-intensive cloud systems proposed by Google. Hadoop [4] is an open source implementation of the MapReduce model, including several sub-projects such as Hadoop Common and HDFS [3], [4]. Cloud computing systems using the MapReduce model are often called MapReduce-like systems. A MapReduce-like system divides the nodes in a cluster into the Master (i.e., the JobTracker) and the Slaves (i.e., the TaskTrackers); there is only one Master and many Slaves. The Master handles global work such as job and task scheduling, and the Slaves perform the work assigned by the Master, including the Map work and the Reduce work. When a Map task finishes, intermediate key/value pairs with the same key are assigned to one partition according to the data partitioning scheme. In the current version of Hadoop [4], the number of partitions is the same as the number of Reducer nodes, and each Reducer node processes the key/value pairs of one partition gathered from all distributed Mapper nodes. The SATS mechanism proposed in this paper modifies the data partitioning scheme to handle the straggler problem caused by data skew.

HaLoop [5] is a modified version of Hadoop for iterative applications, such as scientific computing and data analysis applications. HaLoop uses three caches, namely the Reducer Input Cache, the Reducer Output Cache and the Mapper Input Cache, to improve performance. The Reducer Input Cache stores the output of Map tasks, providing data for the next iteration. The Reducer Output Cache makes the fixpoint computation easier. The Mapper Input Cache improves the data locality of Map tasks. By using loop-aware task scheduling and the input/output caches, HaLoop can reduce the execution time of iterative applications remarkably. The proposed SATS mechanism is implemented in the HaLoop system, and it utilizes the similarity of the intermediate data of tasks in adjacent iterations to improve the load balancing of Reducer nodes.

B. Scheduling in MapReduce-like systems

Scheduling is an important research topic in MapReduce-like systems. There are several default job scheduling mechanisms in Hadoop, e.g., FIFO, the Capacity Scheduler and the Fair Scheduler [10]. Since Hadoop's schedulers might cause severe load imbalance and performance degradation in heterogeneous environments, the Longest Approximate Time to End (LATE) scheduler [11] was designed to handle the straggler problem in heterogeneous clusters by modifying the speculative execution policy, and it can reduce Hadoop's response time by a factor of 2.
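As a point of reference for the partitioning scheme discussed above, the default partitioner in Hadoop simply hashes each intermediate key into one of the Reduce partitions. The sketch below illustrates that behavior; it follows the newer org.apache.hadoop.mapreduce API (HaLoop itself builds on the older mapred API), so treat it as an illustration of the idea rather than the exact source.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of Hadoop's default hash partitioning: every key whose hash value
// is equal modulo the number of Reduce tasks lands in the same partition,
// regardless of how many values are attached to that key. Skewed keys
// therefore overload whichever Reducer their hash happens to select.
public class DefaultHashPartitionSketch extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    // Mask with Integer.MAX_VALUE to keep the result non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}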

Ganesh Ananthanarayanan et al. [12] classified the causes of the straggler problem into three categories: machine characteristics with varying capacity and reliability, network characteristics with varying bandwidths and congestion, and workload characteristics among tasks (e.g., imbalance caused by data skew). They proposed Mantri [12], a mechanism that monitors tasks and culls stragglers using cause- and resource-aware techniques, including restarting stragglers, network-aware placement of tasks and protecting the outputs of valuable tasks. With real-time progress reports, Mantri detects stragglers early in their lifetime and takes appropriate action based on their causes.

Data skew is a common phenomenon in many applications executed in MapReduce-like systems [7], [8], [13]-[15]. YongChul Kwon et al. [7] showed that scientific analytics applications that extract features from datasets exhibit significant computational skew. Jimmy Lin [8] observed that the straggler problem occurs in many MapReduce jobs and suggested that it is related to the data skew of the datasets. SkewReduce [7] statically optimizes the data partitioning according to user-defined cost functions, but it depends on domain knowledge from users and is limited to specific types of applications. SkewTune [13] is an automatic skew mitigation mechanism for user-defined MapReduce programs. When a node becomes idle, SkewTune identifies the task with the greatest expected remaining processing time and proactively repartitions the unprocessed input data of this straggling task. LEEN [14] schedules keys to the Reduce tasks based on cost models, and TopCluster [15] constructs a histogram of all Reduce keys to identify skewed keys. Overall, the above approaches are complementary to the proposed SATS mechanism, which is the first to utilize the similarity of the data in adjacent iterations of iterative applications to deal with data skew and improve load balancing in MapReduce-like systems.

III. SATS DESIGN

A. Mechanism Overview

The SATS mechanism is a runtime load-balancing mechanism that reduces the probability of stragglers caused by data skew in iterative applications. In the Reduce phase of the MapReduce framework, each Reducer node deals with a subset of the key/value pairs, so the data skew problem is essentially an unbalanced key distribution problem, i.e., some keys have many more corresponding key/value pairs than others. Moreover, the key/value pairs with the same key are handled by the same Reducer node. Thus the basic scheduling unit of the SATS mechanism is the set of key/value pairs sharing the same key.

In iterative applications, there often exists some similarity between the input data of two adjacent iterations, and the intermediate data may also have a similar distribution over key/value pairs. For example, in all iterations of the PageRank application, the graph dataset is the same and only the weights of the vertices change. The degree distribution of the vertices in the dataset never changes, so the data distribution of both the input data and the intermediate data of the MapReduce jobs is almost the same across iterations.


Therefore, the intermediate data distribution information on key/value pairs extracted from the jobs of the current iteration can be used to predict the data distribution in the next iteration. Based on this idea, the SATS mechanism is designed to utilize the similarity of data distribution in adjacent iterations to mitigate the straggler problem caused by data skew and enhance load balancing. The SATS mechanism collects the data distribution information on the intermediate key/value pairs generated by the Map tasks during the job execution in the current iteration, and utilizes the information to guide data partitioning to improve the load balancing of Reducer nodes in the next iteration. The components of the SATS mechanism in MapReduce-like systems are shown in Figure 1. The Map, Reduce and JobTracker components are the common components of MapReduce-like systems.

Figure 1. The components of the SATS mechanism in MapReduce-like systems

The SATS mechanism is implemented by three modules, i.e., the collector module, the controller module, and the balancer module. In MapReduce-like systems, a TaskTracker runs on each distributed node to execute Map or Reduce tasks. The collector module runs with the TaskTracker executing Reduce tasks and gathers the data distribution information of the intermediate key/value pairs in the MapReduce jobs. Each collector module transfers the gathered data distribution information to the balancer module. The balancer module works in the JobTracker subsystem; it gathers all the data distribution information from the distributed collectors, computes the global distribution of the intermediate key/value pairs, and then determines a data partitioning scheme for the jobs in the next iteration to deal with the data skew and improve the load balancing of Reducer nodes. The balancer module adopts the HLF algorithm, described later in subsection C, to calculate the data partitioning scheme. After the balancer module determines the data partitioning scheme, it notifies the scheme to the controller modules in the TaskTrackers that will run Map tasks in the next iteration. When the Map tasks in the next iteration generate the intermediate key/value pairs, they partition the key/value pairs according to this scheme instead of the default HashPartitioner scheme in Hadoop/HaLoop, and then shuffle them to the corresponding Reducer nodes, which mitigates the data skew and improves the load balancing of Reducer nodes.
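To make the data flow concrete, the following is a minimal sketch of what such a skew-aware partitioner could look like. It is not the SATS source code; the class name, the configuration key sats.partition.scheme and the scheme file format (one key and reducer index per line) are illustrative assumptions, and the default hash rule is kept as a fallback for keys that are missing from the scheme.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical skew-aware partitioner: it loads a key-to-reducer mapping
// (computed by the balancer module in the previous iteration) from a file
// in HDFS and uses it instead of pure hashing. Keys not covered by the
// scheme fall back to the default hash partitioning.
public class SkewAwarePartitionerSketch extends Partitioner<Text, Text>
    implements Configurable {

  private Configuration conf;
  private final Map<String, Integer> scheme = new HashMap<>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    String schemePath = conf.get("sats.partition.scheme"); // assumed key
    if (schemePath == null) {
      return; // no scheme yet (e.g., first iteration): behave like hashing
    }
    try {
      FileSystem fs = FileSystem.get(conf);
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(schemePath))))) {
        String line;
        while ((line = in.readLine()) != null) {
          // assumed format: "<key>\t<reducerIndex>"
          String[] parts = line.split("\t");
          if (parts.length == 2) {
            scheme.put(parts[0], Integer.parseInt(parts[1]));
          }
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("cannot read partition scheme", e);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    Integer reducer = scheme.get(key.toString());
    if (reducer != null && reducer < numReduceTasks) {
      return reducer;
    }
    // fallback for new keys that did not appear in the previous iteration
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In a usage sketch of this kind, the job driver of the next iteration would set sats.partition.scheme to the HDFS path written by the balancer and call job.setPartitionerClass(SkewAwarePartitionerSketch.class) before submitting the job.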

We implement these modules of the SATS mechanism in the HaLoop [5] system, and illustrate the details of these modules in the following subsections.

B. Collect the Data Distribution Information

In MapReduce-like systems, the intermediate data is generated in the form of key/value pairs, and the data with the same key are shuffled to one Reducer node. Therefore, the data distribution information consists of the keys generated and their weights, i.e., the number of key/value pairs associated with each key. The collector module runs in each TaskTracker that executes Reduce tasks on the distributed machines, and it counts the weights of the keys when the Reduce tasks execute on the local node. As there are many distributed collector modules in the system, they need to send the data distribution information, in the form of keys and their weights, to the balancer module on the Master node where the JobTracker runs.

There are several ways to transfer the data distribution information from the distributed collector modules to the JobTracker in MapReduce-like systems. As there are periodic heartbeat messages between the JobTracker and the TaskTrackers, the heartbeat messages could piggyback the data distribution information, or the information could be transferred from the TaskTracker to the JobTracker directly when needed. However, these implementations would require rewriting or modifying the communication mechanisms of MapReduce-like systems, and they may affect the communication performance of the system. We therefore adopt a simple and light-weight method to transfer the information. As MapReduce-like systems usually use the HDFS [3], [4] distributed file system, each collector module writes the data distribution information of its local Reduce tasks into the HDFS file system. In the source code of the HaLoop system, the intermediate data are exposed as a Java Iterator, so the collector module reads the key/value pairs from the Iterator and writes the local data distribution information (i.e., keys and their weights) into a specific directory in the HDFS file system.
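As a concrete illustration of what the collector described above could do, the sketch below counts key weights on a Reducer node and writes them to a per-task file in HDFS. It is a simplified stand-alone sketch rather than the SATS implementation; the output directory /sats/distribution, the per-task file naming and the tab-separated format are assumptions made here for illustration.

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical collector for one Reduce task: it accumulates the weight of
// each key (the number of values shuffled to that key) and dumps the local
// distribution to HDFS so the balancer can aggregate it later.
public class DistributionCollectorSketch {

  private final Map<String, Long> keyWeights = new HashMap<>();

  // Called once per reduce() invocation with the key and its values.
  public void record(String key, Iterator<?> values) {
    long count = 0;
    while (values.hasNext()) {   // counting consumes the iterator here;
      values.next();             // the real collector would count while the
      count++;                   // Reduce function processes the values
    }
    keyWeights.merge(key, count, Long::sum);
  }

  // Write "<key>\t<weight>" lines to an assumed per-task file in HDFS.
  public void flush(Configuration conf, String taskId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/sats/distribution/" + taskId); // assumed directory
    try (PrintWriter w = new PrintWriter(new OutputStreamWriter(fs.create(out, true)))) {
      for (Map.Entry<String, Long> e : keyWeights.entrySet()) {
        w.println(e.getKey() + "\t" + e.getValue());
      }
    }
  }
}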

C. Determine the Data Partitioning Scheme

To get the global data distribution information, the balancer module needs to sum up all the data distribution information reported by the collector modules in the TaskTrackers that run Reduce tasks. As each collector module writes its local data distribution information into the HDFS file system, the balancer module can read the data distribution information reported from the various Reducer nodes from the specific directory in HDFS, and then compute the global data distribution of the key/value pairs.

After gathering the global data distribution of the key/value pairs, the balancer module determines a data partitioning scheme to assign the keys to the Reducer nodes. The default HashPartitioner scheme in MapReduce-like systems assigns the keys to Reducer nodes randomly, and thus might cause load imbalance among Reducer nodes, since the weights of the keys are skewed in many applications. Based on the weights of the keys gathered, the balancer module in the SATS mechanism uses a skew-aware data partitioning scheme, called HLF (Heaviest Load First), to improve the load balancing of Reducer nodes and deal with the data skew problem in the next iteration of the application. The HLF algorithm is a variant of the classic LPT (Longest Processing Time) scheduling algorithm [16], which has asymptotic complexity O(n log n) in the worst case, where n is the number of tasks to be assigned, and whose makespan is no more than about 133% (4/3) of the optimal makespan. The LPT algorithm assumes that the execution times of all tasks are known in advance, while the HLF algorithm relaxes this assumption to fit the environment of MapReduce-like systems. LPT assigns the task with the longest processing time first, and HLF assigns the keys with the heaviest weights to Reducer nodes first. Because the computational time is often proportional to the size of the intermediate data (i.e., the weights of the keys) in MapReduce-like systems, the keys with the heaviest weights need the longest computational time.

HLF(S, N)
// S: the set of keys associated with their weights
// N: the number of Reducer nodes
1   for i = 1 to N do
2       L[i] ← 0;              // L[i] is the current load of Reducer node i
3   R ← SortByWeight(S);       // sort the keys in decreasing order of their weights
4   while R <> null do
5       k ← FetchHead(R);      // fetch the key k from the head of R; its weight is the heaviest in R
6       R ← R - {k};
7       r ← FetchMinLoad(L);   // fetch the Reducer node r whose current load is the lightest among all Reducer nodes
8       AssignTask(r, k);      // assign the key k to Reducer node r
9       L[r] ← L[r] + w(k);    // increase the load of node r by the weight w(k) of key k
10  endwhile
Figure 2. The Pseudocode of the HLF algorithm
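For readers who prefer code to the pseudocode in Figure 2, a compact Java sketch of the HLF greedy assignment is given below. It assumes the global distribution has already been aggregated into a key-to-weight map and, like the discussion that follows, treats all Reducer nodes as homogeneous; the class and method names are illustrative, not taken from the SATS source. A min-heap over the reducer loads yields the O(n log n) behavior mentioned above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative implementation of the HLF (Heaviest Load First) assignment:
// keys are sorted by decreasing weight and each key is assigned to the
// currently least-loaded Reducer node.
public class HlfSketch {

  /** Returns a mapping from key to reducer index (0 .. numReducers-1). */
  public static Map<String, Integer> assign(Map<String, Long> keyWeights, int numReducers) {
    // Sort keys in decreasing order of weight (step 3 of the pseudocode).
    List<Map.Entry<String, Long>> keys = new ArrayList<>(keyWeights.entrySet());
    keys.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

    // Min-heap of (load, reducer index); the head is always the lightest node.
    PriorityQueue<long[]> loads = new PriorityQueue<>(Comparator.comparingLong(a -> a[0]));
    for (int i = 0; i < numReducers; i++) {
      loads.add(new long[] {0L, i});
    }

    Map<String, Integer> scheme = new HashMap<>();
    for (Map.Entry<String, Long> e : keys) {
      long[] lightest = loads.poll();            // reducer with the minimum load (step 7)
      scheme.put(e.getKey(), (int) lightest[1]); // assign the heaviest remaining key to it (step 8)
      lightest[0] += e.getValue();               // increase its load by the key weight (step 9)
      loads.add(lightest);
    }
    return scheme;
  }
}

In a usage sketch of this kind, the balancer would call HlfSketch.assign(globalWeights, numReduceTasks) after merging the per-task distribution files, and then serialize the returned map into the scheme file read by the partitioner of the next iteration.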

The pseudocode of the HLF algorithm is shown in Figure 2. We assume that all the Reducer nodes in the system are homogeneous and that their initial loads are 0. If the Reducer nodes are heterogeneous, the HLF algorithm can easily be adapted according to the capacities of the Reducer nodes. The keys are sorted in decreasing order of their weights and, in that order, each key is assigned to the Reducer node that currently has the largest idle capacity. Once a key is assigned to a node, the load of the node is increased by the weight of the key.

D. Configure the Data Partitioner

In MapReduce-like systems, the partitioning of the intermediate key/value pairs is determined by the configuration files of the job and is implemented by the Partitioner class in the source code of Hadoop/HaLoop. Generally, the Partitioner used by the Map tasks has to be determined before the job is submitted. In HaLoop, which is designed to handle iterative applications, one iteration might consist of several sub-jobs, and their configurations of the Partitioner are stored in a data structure in iteration order. When a job begins to run, the JobTracker reads the configuration from this data structure to configure the Partitioner, and at the end of the jobs of one iteration, the JobTracker writes a new configuration for the next iteration into the data structure. This approach in HaLoop provides an opportunity to configure the Partitioner dynamically, and the controller module in the SATS mechanism uses it to configure the Partitioner of each iteration dynamically.

Since the balancer module that calculates the data partitioning scheme runs in the centralized JobTracker, it needs to notify the scheme to the controller modules on the distributed Mapper nodes of the next iteration. The balancer module also uses the HDFS file system to achieve this: it writes the data partitioning scheme into a specific file in HDFS, and the controller modules on the distributed Mapper nodes read the file when needed. The configuration of the Partitioner class has to be dynamic in the SATS mechanism, since the data partitioning scheme calculated by the balancer module changes constantly as the application executes iteratively. After the balancer module works out the data partitioning scheme, all the Map tasks that will run in the next iteration are notified by the controller module to store the new configuration (mainly the configuration of the Partitioner).


When these Map tasks are activated, they read the configuration from the data structure and construct new Partitioners to reduce the load imbalance and deal with the data skew problem. The precision of the collector module and the similarity between the data in two adjacent iterations determine the effect of the load balancing.

There remain two problems about the data partitioning in the SATS mechanism. The first problem is how to partition new keys that appear in a Map task of the next iteration. Such keys do not exist in the previous iteration, so their partitioning is not calculated by the balancer module. The SATS mechanism uses the default data partitioning method in HaLoop, i.e., the random hash partitioning, to deal with these new keys. The second problem is the coordination between the SATS mechanism and the input cache mechanism in HaLoop. The Reducer Input Cache in HaLoop caches some output of the Map tasks, which is used as input data of the Reduce tasks in the next iteration. These data might not be sampled by the collector module, so the balancer module might not know of their existence, which could negatively influence the data partitioning scheme. In practice, this bypass mechanism of HaLoop does not affect the data partitioning in the balancer module much. The Reducer Input Cache in HaLoop caches values rather than key/value pairs, so the caches do not affect the shuffle process. When a Reducer node gets a key and its values, it queries whether values of that key are cached locally. If so, the two Iterators holding these values are joined together and the results are submitted to the Reduce function. Thus the caches in HaLoop can simply be accounted for as an increase in the weight of the corresponding key.

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup

We implement the SATS mechanism in the HaLoop system [5] and deploy our prototype system in a cluster with 14 nodes. All the nodes are physical machines with 8-core Intel Xeon 2.00 GHz CPUs. The node working as the JobTracker, Namenode and SecondaryNamenode has 2.5 GB of memory; the other 13 nodes, working as TaskTrackers and Datanodes, have 4 GB of memory each. All nodes have 137 GB of SCSI disk storage and are connected by a 1 Gbps Ethernet network. The operating system on these nodes is Debian Linux with kernel version 2.6.32-3, and the JDK version is jdk-6u12-linux-i586.

We evaluate the performance of the PageRank and Descendant Query [5] applications on the prototype system, using the three real-world datasets [17] shown in Table I. The PageRank application views the Internet as a directed graph, with web pages as vertices and hyperlinks as edges. The rank value represents the importance of a web page: the higher the rank value, the more important the web page. PageRank assigns a rank to each vertex/webpage in the graph by iteratively aggregating the ranks of its inbound hyperlink neighbors. PageRank is an iterative application: it starts from initial rank values and updates them until the ranks converge. The Descendant Query is an algorithm to query neighbor vertices within a few hops, and it is used in many queries on social networks. The first round of the query retrieves the direct neighbors of a vertex, then the neighbors of these direct neighbors, and so on.

TABLE I. EXPERIMENTAL DATASETS

Name              Size     Description
as-skitter        142MB    The Internet topology graph in 2005
soc-LiveJournal1  1.00GB   The friendship network in the LiveJournal community
wiki-Talk         63.3MB   The communication network of the registered users in Wikipedia

Many studies [18] have pointed out that a long-tail degree distribution is common in many real-world graphs, such as Internet topology graphs, Web graphs and social networks. A long-tail degree distribution means that the degrees of a few vertices are much higher than those of the other vertices. We use WordCount, a MapReduce job, to calculate the degrees of the vertices in the datasets. The degree distribution and other properties of the datasets are shown in Table II. From Table II, it can be seen that the vertex degrees of these datasets are highly skewed. The high-degree vertices with degree larger than 500 account for only about 0.1% of all the vertices in the wiki-Talk dataset, and for about 0.14% and 0.31% in the as-skitter and soc-LiveJournal1 datasets, respectively.

TABLE II. PROPERTIES OF THE DATASETS

Datasets          Num of vertices  Vertices with degree > 500  Highest degree  Lowest degree
as-skitter        1696415          2380                        35455           1
soc-LiveJournal1  4847571          15106                       22889           1
wiki-Talk         2394385          2416                        100032          1
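The degree counting mentioned above is essentially a WordCount over edge endpoints. The following is a minimal sketch of such a job, under the assumption that each input line holds one edge as "source<TAB>target" (the SNAP edge-list format); the class names are illustrative and this is not the code used in the paper.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordCount-style degree counting: the mapper emits (vertex, 1) for the
// target endpoint of every edge, and the reducer sums the ones to obtain
// the in-degree of each vertex. Emitting both endpoints would give the
// total degree instead.
public class DegreeCountSketch {

  public static class EdgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text vertex = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String s = line.toString().trim();
      if (s.isEmpty() || s.startsWith("#")) {
        return;                     // skip comment lines in SNAP edge lists
      }
      String[] endpoints = s.split("\\s+");
      if (endpoints.length >= 2) {
        vertex.set(endpoints[1]);   // count incoming edges of the target vertex
        ctx.write(vertex, ONE);
      }
    }
  }

  public static class DegreeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text vertex, Iterable<IntWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      int degree = 0;
      for (IntWritable one : ones) {
        degree += one.get();
      }
      ctx.write(vertex, new IntWritable(degree));
    }
  }
}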

B. Experimental Results

We evaluate the SATS mechanism with the PageRank and Descendant Query applications on the three datasets. The PageRank application [5] consists of three MapReduce jobs: PageRank-Count, PageRank-Initialize and PageRank-Loop. The PageRank-Loop job is an iterative job with two steps, namely PageRank-Join and PageRank-Aggregate, and the experimental results of the PageRank-Loop job are recorded. The default HashPartitioner is used in the first two steps of the job. The Descendant Query application [5] consists of two jobs, the IterativeJoin job and the AggregateDeltaRelation job. The IterativeJoin job is an iterative job with two steps, namely Descendant Join and Descendant Duplicate Elimination. We run the PageRank and Descendant Query applications on these datasets, and record the load distribution of the Reducer nodes and the job execution time. The PageRank application is executed for 20 iterations with the wiki-Talk and as-skitter datasets, using 13 Reducer nodes, and the Descendant Query application is executed for 5 iterations with the wiki-Talk and soc-LiveJournal1 datasets, using 7 Reducer nodes.


In MapReduce-like systems, the job execution time in the Map or Reduce phase is determined by the slowest node, i.e., the node with the heaviest load. Thus we use a parameter called load ratio, defined as the maximum load divided by the average load of the Reducer nodes in an iteration, to evaluate the load balancing of the Reducer nodes. Figures 3 and 4 show the values of the load ratio in various iterations of the PageRank and Descendant Query applications with the wiki-Talk dataset, respectively. In the figures, Original refers to the results of HaLoop without modification, and SATS refers to the results of the prototype system implementing the SATS mechanism. From Figures 3 and 4, it can be seen that the SATS mechanism effectively improves the load balancing of Reducer nodes.
Figure 3. The Load Ratio of Reducer nodes in the PageRank application (y-axis: load ratio; x-axis: iterations; curves: Original and SATS)

Figure 4. The Load Ratio of Reducer nodes in the Descendant Query application (y-axis: load ratio; x-axis: iterations; curves: Original and SATS)

V. CONCLUSION

In this paper, we design a skew-aware task scheduling mechanism, named SATS, for iterative applications in MapReduce-like systems. The SATS mechanism utilizes the similarity of data distribution in adjacent iterations of iterative applications to reduce the straggler problem caused by skew in the intermediate data. The mechanism collects information about the data distribution, and uses the information to guide the data partitioning in the next iteration. We implement the SATS mechanism based on HaLoop and deploy the prototype system in a cluster. Experiments show that the mechanism can improve the load balancing effectively. In future work, we will optimize the implementation of the prototype system and reduce the execution cost of the SATS mechanism by using sampling mechanisms.

ACKNOWLEDGMENT

This work is sponsored in part by the National Natural Science Foundation of China under Grant No. 61222205, the National Basic Research Program of China (973) under Grant No. 2011CB302600, and the Foundation for the Author of National Excellent Doctoral Dissertation of PR China (FANEDD) under Grant No. 200953.

REFERENCES

[1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
[2] Xicheng Lu, Huaimin Wang, Ji Wang, Jie Xu, Dongsheng Li. Internet-based Virtual Computing Environment: Beyond the Data Center as a Computer. Future Generation Computer Systems, 29(1):309-322, 2013.
[3] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. Proc. of SOSP '03, 2003.
[4] Apache Hadoop Project. http://hadoop.apache.org/.
[5] Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. Proc. of VLDB '10, 2010.
[6] J. Dittrich, J. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Proc. of the VLDB Endowment, 3(1), 2010.
[7] YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome Rolia. Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions. Proc. of the ACM Symposium on Cloud Computing, 2010.
[8] Jimmy Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. Proc. of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009.
[9] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. of WWW '98, 1998.
[10] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job Scheduling for Multi-User MapReduce Clusters. Technical Report UCB/EECS-2009-55, University of California at Berkeley, April 2009.
[11] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica. Improving MapReduce Performance in Heterogeneous Environments. Proc. of OSDI '08, 2008.
[12] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. Proc. of OSDI '10, 2010.
[13] YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome A. Rolia. SkewTune: Mitigating Skew in MapReduce Applications. Proc. of SIGMOD '12, 2012.
[14] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi. LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud. Proc. of CloudCom, 2010.
[15] B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load Balancing in MapReduce Based on Scalable Cardinality Estimates. Proc. of ICDE '12, 2012.
[16] T. Gonzalez, O. H. Ibarra, S. Sahni. Bounds for LPT Schedules on Uniform Processors. SIAM Journal on Computing, 6:155-166, 1977.
[17] Stanford Large Network Dataset Collection. http://snap.stanford.edu/data/index.html.
[18] Jure Leskovec, Jon Kleinberg, Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. Proc. of ACM SIGKDD '05, 2005.