Performance improvement of mapreduce process by promoting deep data locality

Document Type

Conference Proceeding

Publication Title

Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016


Institute of Electrical and Electronics Engineers Inc.


MapReduce has been widely used in many data science applications. Excessive data transfer has been observed to degrade its performance. To reduce the amount of data transferred, MapReduce exploits data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been exploited only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), in which the data is prearranged to maximize locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are placed randomly on any available slave nodes, requiring RLM (Rack-Local Map) blocks to be copied. LNBPP, in contrast, places the blocks so as to avoid RLMs, reducing the block copying time. Containers without RLMs have more consistent execution times, and when assigned to individual cores on a multicore node, they collectively finish a job faster than containers under DBPP. LNBPP also rearranges the blocks onto a smaller number of nodes (hence Limited Node), which reduces the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test results show that the execution times of Map and Shuffle improve by up to 33% and 44%, respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations in DBPP with customer review data from TripAdvisor and measure performance by executing the TeraSort benchmark with various sizes of data. We then compare the performance of LNBPP with that of DBPP. © 2016 IEEE.
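
The contrast between random block placement and placement on a limited set of nodes can be illustrated with a toy model. The sketch below is not the authors' algorithm or Hadoop's actual placement code. It assumes a single replica per block, treats any block above a node's even share as one that would have to be processed on another node (a stand-in for an RLM), and uses hypothetical function names chosen only for illustration.

```python
import random
from collections import Counter


def random_placement(num_blocks, nodes, rng):
    """DBPP-like toy: each block lands on a uniformly random node (one replica)."""
    return [rng.choice(nodes) for _ in range(num_blocks)]


def limited_node_placement(num_blocks, nodes, limit):
    """LNBPP-like toy: spread blocks round-robin over the first `limit` nodes."""
    chosen = nodes[:limit]
    return [chosen[i % len(chosen)] for i in range(num_blocks)]


def count_non_local_maps(placement, active_nodes):
    """Count blocks that exceed a node's even share.

    Assumption: every active node processes at most ceil(blocks / nodes)
    blocks, so any surplus on one node must be copied to another node.
    """
    share = -(-len(placement) // len(active_nodes))  # ceiling division
    per_node = Counter(placement)
    return sum(max(0, per_node[node] - share) for node in active_nodes)


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(8)]
    rng = random.Random(0)  # fixed seed so the run is repeatable

    scattered = random_placement(64, nodes, rng)
    limited = limited_node_placement(64, nodes, limit=4)

    print("random placement, non-local maps:", count_non_local_maps(scattered, nodes))
    print("limited placement, non-local maps:", count_non_local_maps(limited, nodes[:4]))
```

In this model the round-robin placement never exceeds the even share, so it reports no non-local maps, while random placement usually leaves some nodes overloaded. Real clusters add replication, rack topology, and scheduler behavior that this sketch ignores.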


Block Placement; Deep Data Locality; Hadoop; HDFS; MapReduce; Performance Analysis; Shuffle; YARN


