Key Based Deep Data Locality on Hadoop

Document Type

Conference Proceeding

Publication Date


Publication Title

2018 IEEE International Conference on Big Data (Big Data)

Publisher Location

Seattle, WA

First page number:


Last page number:



Apache Hadoop is a widely used framework for big data processing. Most research on improving the speed of big data analysis builds on Hadoop modules such as Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop Yet Another Resource Negotiator (YARN), and Hadoop MapReduce. This paper focuses on data locality in HDFS and MapReduce to improve performance. The input data is divided into several blocks and stored in HDFS. Each block holds several key-value pairs in the map stage. This paper uses the keys in each block to build key-based Deep Data Locality (DDL). MapReduce with key-based DDL eliminates some steps in the map, shuffle, and reduce stages, improving MapReduce performance. We tested the performance of MapReduce with block-based DDL and with key-based DDL against default MapReduce. In our tests, MapReduce with key-based DDL is 28% faster than default MapReduce and 15.4% faster than MapReduce with block-based DDL. Additionally, key-based DDL can be combined with other data locality methods to further improve Hadoop; combining key-based DDL with block-based DDL improves Hadoop performance by up to 52.5%. The paper also introduces a simulator for testing the performance of MapReduce on Hadoop with data locality methods applied. The simulator displays the performance of each MapReduce stage as a graph. In the simulator, key-based DDL can be combined with other data locality research to find optimal performance for various data types and node statuses.
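As a rough illustration of the idea (a sketch, not the authors' implementation), key-based data locality can be thought of as placing each reduce key's work on the node that already stores the most records for that key, so that those records need no network transfer during the shuffle stage. The node and key names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical cluster state: each node stores HDFS blocks, and each
# block yields several key-value pairs in the map stage (as in the paper).
blocks_on_node = {
    "node1": [("apple", 1), ("apple", 1), ("banana", 1)],
    "node2": [("banana", 1), ("cherry", 1), ("apple", 1)],
    "node3": [("cherry", 1), ("cherry", 1), ("banana", 1)],
}

def assign_reducers(blocks_on_node):
    """Place each key's reducer on the node holding the most records
    for that key, minimizing data moved in the shuffle stage."""
    counts = defaultdict(lambda: defaultdict(int))  # key -> node -> record count
    for node, pairs in blocks_on_node.items():
        for key, _value in pairs:
            counts[key][node] += 1
    # For each key, pick the node with the largest local record count.
    return {key: max(nodes, key=nodes.get) for key, nodes in counts.items()}

placement = assign_reducers(blocks_on_node)
print(placement)
```

With the sample data above, "apple" is reduced on node1 and "cherry" on node3, since those nodes already hold the majority of each key's records; only the minority of records must cross the network.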


Hadoop; MapReduce; Data locality; HDFS; Big data


Computer Sciences
