Physical Therapy Faculty Research

Performance improvement of MapReduce process by promoting deep data locality

Szu-Ping Lee, University of Nevada, Las Vegas
Juyeon Jo, University of Nevada, Las Vegas
Y Kim, University of Nevada, Las Vegas

Document Type

Conference Proceeding

Publication Date

1-1-2016

Publication Title

Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016

Publisher

Institute of Electrical and Electronics Engineers Inc.

First page number:

292

Last page number:

301

Abstract

MapReduce has been widely used in many data science applications. It has been observed that excessive data transfer has a negative impact on its performance. To reduce the amount of data transfer, MapReduce utilizes data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been utilized only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Toward achieving stronger DDL, we introduce a new block placement paradigm called Limited Node Block Placement Policy (LNBPP). Under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave nodes, which requires copying RLM (Rack-Local Map) blocks. LNBPP, in contrast, places the blocks so as to avoid RLMs, reducing the block copying time. Containers without RLMs have a more consistent execution time and, when assigned to individual cores on a multicore node, collectively finish a job faster than containers under DBPP. LNBPP also rearranges the blocks onto a smaller number of nodes (hence Limited Node), which reduces the data transfer time between nodes. These strategies bring a significant performance improvement in Map and Shuffle. Our test results show that the execution times of Map and Shuffle improve by up to 33% and 44%, respectively. In this paper, we describe the MapReduce workflow in Hadoop with a simple computational model and introduce the current research directions in each step. We analyze the block placement status and RLM locations under DBPP using customer review data from TripAdvisor, and we measure performance by executing the TeraSort benchmark with various data sizes. We then compare the performance of LNBPP with that of DBPP. © 2016 IEEE.
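
The contrast between the two placement policies can be illustrated with a toy model. The sketch below is not the paper's algorithm or its computational model; it is a minimal illustration under stated assumptions: no replication, every node that holds data runs an equal share of map tasks, and any block stored on a node beyond that share must be copied elsewhere (counted here as a non-local map). The function names `random_placement`, `limited_node_placement`, and `count_non_local_maps` are hypothetical.

```python
import random
from collections import Counter


def random_placement(num_blocks, nodes, rng):
    """Toy stand-in for DBPP: each block lands on a uniformly random node.

    Assumption: replication is ignored, so each block has one location.
    """
    return [rng.choice(nodes) for _ in range(num_blocks)]


def limited_node_placement(num_blocks, nodes, limit):
    """Toy stand-in for LNBPP: spread blocks round-robin over `limit` nodes."""
    chosen = nodes[:limit]
    return [chosen[i % limit] for i in range(num_blocks)]


def count_non_local_maps(placement, nodes_used):
    """Count blocks that must be copied to another node before mapping.

    Assumption: each node in `nodes_used` runs an equal share (quota) of the
    map tasks, so blocks stored on a node beyond its quota are non-local.
    """
    per_node = Counter(placement)
    quota = len(placement) // len(nodes_used)
    return sum(max(0, per_node[node] - quota) for node in nodes_used)


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(8)]
    rng = random.Random(0)  # fixed seed so repeated runs agree

    scattered = random_placement(64, nodes, rng)
    limited = limited_node_placement(64, nodes, limit=4)

    print("random placement, non-local maps:",
          count_non_local_maps(scattered, nodes))
    print("limited-node placement, non-local maps:",
          count_non_local_maps(limited, nodes[:4]))
    print("nodes holding data (limited):", len(set(limited)))
```

In this toy model the limited-node placement yields zero non-local maps by construction, because the blocks are spread evenly over the nodes that run the tasks. It says nothing about the measured 33% and 44% improvements, which come from the authors' experiments on a real Hadoop cluster.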

Keywords

Block Placement; Deep Data Locality; Hadoop; HDFS; MapReduce; Performance Analysis; Shuffle; YARN

Language

English

Repository Citation

Lee, S., Jo, J., Kim, Y. (2016). Performance improvement of MapReduce process by promoting deep data locality. Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016, 292-301. Institute of Electrical and Electronics Engineers Inc.
http://dx.doi.org/10.1109/DSAA.2016.38

