Survey of Data Locality in Apache Hadoop
Document Type
Conference Proceeding
Publication Date
10-31-2019
Publication Title
IEEE/ACIS International Conference on Big Data, Cloud Computing, and Data Science Engineering
Publisher
ACIS International
First page number
46
Last page number
53
Abstract
One of the key challenges in big data technology is the velocity at which data is processed. Hadoop, an open-source software framework, is the dominant technology for big data analytics, so researchers have tried to improve the performance of the Hadoop system. One area of Hadoop performance research is data locality, which has recently received attention as a way to increase Hadoop performance. Using current Hadoop software, researchers can investigate data locality through the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and other features. Data locality research has the potential to improve the performance of big data processing through scheduling, data placement, frameworks, and services. Here we introduce data locality in the Hadoop system, including its three levels: data-local, rack-local, and off-rack. We studied data locality research in areas such as scheduling, data placement, networking, partitioning/keys, and frameworks. We categorized prior research by MapReduce step and found that some of this research spans more than one step. We also graphed the data locality research to identify trends. This analysis showed different effects depending on the application. Specifically, the number of taskers and the data locations affected MapReduce performance. We also ran the TeraSort benchmark and WordCount on CloudLab and in a physical environment to show the effect of data locality in Hadoop.
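As a rough illustration of the three locality levels named in the abstract, the sketch below classifies a task placement relative to the nodes holding a block's replicas. It is a minimal example and not code from the paper or from Hadoop itself: the function name `classify_locality`, the node names, and the `rack_of` mapping are all hypothetical.

```python
from enum import Enum


class Locality(Enum):
    DATA_LOCAL = "data-local"   # task runs on a node that stores the block
    RACK_LOCAL = "rack-local"   # task runs in the same rack as a replica
    OFF_RACK = "off-rack"       # every replica is in a different rack


def classify_locality(task_node, replica_nodes, rack_of):
    """Classify where a task runs relative to its input block.

    task_node:     name of the node running the task (hypothetical naming)
    replica_nodes: names of the nodes holding a replica of the block
    rack_of:       dict mapping node name -> rack name (assumed topology)
    """
    if task_node in replica_nodes:
        return Locality.DATA_LOCAL
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return Locality.RACK_LOCAL
    return Locality.OFF_RACK


# Hypothetical two-rack cluster with one block replicated on n1 and n3.
rack_of = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b", "n4": "rack-c"}
replicas = ["n1", "n3"]

for node in ("n1", "n2", "n4"):
    print(node, classify_locality(node, replicas, rack_of).value)
```

In this assumed topology, a task on `n1` reads its block from local disk, a task on `n2` fetches it from a neighbor in the same rack, and a task on `n4` must pull it across racks.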
Keywords
Cloud computing; Data analysis; Distributed databases; Data locality; MapReduce; Yet another resource negotiator; YARN; Hadoop distributed file system; HDFS
Disciplines
Computer Sciences | Databases and Information Systems | Physical Sciences and Mathematics
Language
English
Repository Citation
Lee, S., Jo, J., Kim, Y. (2019). Survey of Data Locality in Apache Hadoop. IEEE/ACIS International Conference on Big Data, Cloud Computing, and Data Science Engineering, 46-53. ACIS International.
http://dx.doi.org/10.1109/BCD.2019.8885148