Key Based Deep Data Locality on Hadoop
Document Type
Conference Proceeding
Publication Date
12-10-2018
Publication Title
2018 IEEE International Conference on Big Data (Big Data)
Publisher Location
Seattle, WA
First page number:
3889
Last page number:
3898
Abstract
Apache Hadoop is a well-known framework for big data science. Most research on speeding up big data analysis builds on Hadoop modules such as Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop Yet Another Resource Negotiator (YARN), and Hadoop MapReduce. This paper focuses on data locality in HDFS and MapReduce to improve performance. The input data is divided into several blocks and stored in HDFS. Each block contains several key-value pairs in the map stage. The paper uses the keys in each block to build key-based Deep Data Locality (DDL). MapReduce with key-based DDL reduces some steps in the map, shuffle, and reduce stages to improve MapReduce performance. We tested the performance of MapReduce with block-based DDL and with key-based DDL and compared both with default MapReduce. According to the tests, MapReduce with key-based DDL is 28% faster than default MapReduce and 15.4% faster than MapReduce with block-based DDL. Additionally, key-based DDL can be combined with other data locality methods to improve Hadoop. Combining key-based DDL and block-based DDL improves Hadoop performance by up to 52.5%. The paper also introduces a simulator for testing the performance of MapReduce with data locality methods applied on Hadoop. The simulator displays the performance of each MapReduce stage as a graph. In the simulator, key-based DDL can be combined with other data locality research to obtain optimized performance for various data types and node statuses.
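The idea that placing records by key can cut shuffle traffic can be illustrated with a toy model. The sketch below is not the paper's algorithm or simulator; it is a minimal illustration under stated assumptions. The partitioner `reducer_node`, the two placement functions, and the sample records are all hypothetical, and "cost" here is simply the number of records that sit on a different node from the reducer responsible for their key.

```python
from collections import defaultdict

def reducer_node(key, num_nodes):
    # Hypothetical deterministic partitioner: character-code sum modulo node count.
    return sum(ord(c) for c in key) % num_nodes

def remote_shuffle_records(placement, num_nodes):
    # Count records stored on a node other than the one running their key's reducer.
    return sum(
        1
        for node, records in placement.items()
        for key, _ in records
        if reducer_node(key, num_nodes) != node
    )

def round_robin_placement(records, num_nodes):
    # Key-agnostic placement: records are spread over nodes in arrival order.
    placement = defaultdict(list)
    for i, record in enumerate(records):
        placement[i % num_nodes].append(record)
    return placement

def key_based_placement(records, num_nodes):
    # Key-aware placement (assumption): store each record on its reducer's node.
    placement = defaultdict(list)
    for key, value in records:
        placement[reducer_node(key, num_nodes)].append((key, value))
    return placement

NUM_NODES = 3
records = [("a", 1), ("b", 1), ("c", 1), ("a", 1),
           ("b", 1), ("c", 1), ("d", 1), ("a", 1)]

default_remote = remote_shuffle_records(
    round_robin_placement(records, NUM_NODES), NUM_NODES)
keyed_remote = remote_shuffle_records(
    key_based_placement(records, NUM_NODES), NUM_NODES)

print("remote records, key-agnostic placement:", default_remote)
print("remote records, key-based placement:", keyed_remote)
```

In this toy model the key-based placement leaves no records to move between nodes, because every record already sits where its reducer runs. Real HDFS block placement and MapReduce scheduling involve replication, block sizes, and load balancing that this sketch ignores.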
Keywords
Hadoop; MapReduce; Data locality; HDFS; Big data
Disciplines
Computer Sciences
Language
English
Repository Citation
Lee, S., Jo, J., Kim, Y. (2018). Key Based Deep Data Locality on Hadoop. 2018 IEEE International Conference on Big Data (Big Data), 3889-3898. Seattle, WA.
http://dx.doi.org/10.1109/BigData.2018.8621885