Flow Scheduling: An Efficient Scheduling Method for MapReduce Framework in the Decoupled Architecture


Hadoop is a popular implementation of the MapReduce programming model for data processing. We first compare different Hadoop models and discuss their advantages and limitations. The traditional Hadoop system is scalable because each machine serves as both a computation node and a storage node. However, this principle imposes a strong constraint on system design and does not fit enterprise and cloud applications well, since they require computation and storage nodes to be decoupled. Naive Hadoop implementations may perform poorly in this setting because they are designed to preserve data locality, which does not exist in the decoupled model. In this paper, we propose a flow scheduling method that eliminates factors that degrade processing performance. We model the cost of task assignment based on the penalty of violating flow demand and convert this problem to a network optimization problem. We have implemented Flow Scheduler for Hadoop, and the experimental results show that it can maximize the processing flow rate while improving system throughput by up to 30%. More interestingly, our flow scheduling method yields smoother task execution times, which suggests that it can eliminate stragglers caused by resource contention.
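To illustrate how a task assignment problem can be cast as a network optimization problem, the following is a minimal sketch and not the Flow Scheduler implementation itself. It assumes a hypothetical penalty matrix `penalty[t][n]` (the cost of running task `t` on compute node `n`) and a hypothetical per-node slot count `slots[n]`. The cost model used in the paper is not reproduced here. The sketch solves the resulting minimum-cost flow problem with a generic successive-shortest-path algorithm.

```python
from collections import deque


class MinCostFlow:
    """Generic min-cost max-flow via successive shortest paths (SPFA)."""

    def __init__(self, n):
        self.n = n
        # Each edge is [to, residual_capacity, cost, index_of_reverse_edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            dist[s] = 0
            prev = [None] * self.n  # (parent vertex, edge index) on the path
            in_queue = [False] * self.n
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev[t] is None:  # no augmenting path left
                return flow, cost
            # Find the bottleneck capacity along the shortest path.
            push = float("inf")
            v = t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            # Augment along the path and update residual capacities.
            v = t
            while v != s:
                u, i = prev[v]
                self.graph[u][i][1] -= push
                self.graph[v][self.graph[u][i][3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def assign_tasks(penalty, slots):
    """Assign each task to a node, minimizing the total (assumed) penalty."""
    num_tasks, num_nodes = len(penalty), len(slots)
    source, sink = 0, num_tasks + num_nodes + 1
    mcf = MinCostFlow(sink + 1)
    for t in range(num_tasks):
        mcf.add_edge(source, 1 + t, 1, 0)  # each task is scheduled once
        for n in range(num_nodes):
            mcf.add_edge(1 + t, 1 + num_tasks + n, 1, penalty[t][n])
    for n in range(num_nodes):
        mcf.add_edge(1 + num_tasks + n, sink, slots[n], 0)  # slot limit
    _, total_cost = mcf.solve(source, sink)
    assignment = {}
    for t in range(num_tasks):
        for v, cap, _, _ in mcf.graph[1 + t]:
            # A saturated task-to-node edge carries that task's unit of flow.
            if 1 + num_tasks <= v <= num_tasks + num_nodes and cap == 0:
                assignment[t] = v - 1 - num_tasks
    return assignment, total_cost
```

For example, `assign_tasks([[1, 5], [2, 1], [1, 4]], [1, 2])` places three tasks on two nodes with one and two free slots respectively, and returns the mapping from task index to node index together with the total penalty.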

In the Technical Report