Survey of Data Locality in Apache Hadoop

Document Type

Conference Proceeding

Publication Date


Publication Title

IEEE/ACIS International Conference on Big Data, Cloud Computing, and Data Science Engineering


ACIS International

First page number:


Last page number:



One of the key challenges in big data technology is the velocity at which data is processed. Hadoop, an open-source software framework, is the dominant technology supporting big data analytics, so researchers have tried to improve the performance of Hadoop systems. One line of Hadoop performance research is data locality, which has recently received attention as a way to increase Hadoop performance. Using current Hadoop software, researchers can investigate data locality through the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and other features. Data locality research has the potential to increase the performance of big data processing through scheduling, data placement, framework, and service approaches. Here we introduce data locality in the Hadoop system, including the data-local, rack-local, and off-rack levels. We survey data locality research in areas such as scheduling, data placement, networking, partition/key, and frameworks. We categorized prior research by MapReduce step and found that some of this research overlapped several MapReduce steps. We also graphed the data locality research to identify trends. This analysis showed that the effects differ depending on the application. Specifically, the number of taskers and the data locations affected the performance of MapReduce. We also ran the TeraSort benchmark and WordCount on CloudLab and in a physical environment to show the effect of data locality in Hadoop.
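The three locality levels named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or from Hadoop's source code: the `Node` type, the `locality_level` function, and the host and rack names are all hypothetical, and the sketch assumes each node is described only by a host name and a rack identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a cluster node (illustration only)."""
    host: str
    rack: str


def locality_level(task_node, replica_nodes):
    """Classify a task placement relative to the replicas of its input block."""
    # Data-local: the task runs on a node that stores a replica of the block.
    if any(r.host == task_node.host for r in replica_nodes):
        return "data-local"
    # Rack-local: no replica on the node itself, but one in the same rack.
    if any(r.rack == task_node.rack for r in replica_nodes):
        return "rack-local"
    # Off-rack: every replica sits in a different rack.
    return "off-rack"


# Assumed example layout: three replicas spread over two racks.
replicas = [Node("n1", "/rack1"), Node("n2", "/rack1"), Node("n5", "/rack2")]

for candidate in (Node("n1", "/rack1"), Node("n3", "/rack1"), Node("n9", "/rack3")):
    print(candidate.host, locality_level(candidate, replicas))
```

A locality-aware scheduler would generally prefer the first level over the second and the second over the third, since each step down tends to move more input data across the network.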


Cloud computing; Data analysis; Distributed databases; Data locality; MapReduce; Yet another resource negotiator; YARN; Hadoop distributed file system; HDFS


Computer Sciences | Databases and Information Systems | Physical Sciences and Mathematics


