Data analysis performance comparison between single-mode and multi-mode
Nowadays, a large volume of data is generated and stored at data centers, universities, and portals. Such data can be processed using single-mode tools such as R or multi-mode tools such as Hadoop. This research compares the performance of those tools on various types and sizes of data. Two algorithms were used for the comparison: Pearson correlation, to analyze the relationship between text data, and Image Similarity MapReduce (ISMR), to analyze picture data. All data was obtained from the Nevada Research Data Center (NRDC). We analyzed text data with R and Maria DB for single mode, and with RHadoop and MapReduce for multi-mode. In our experiments with 3 GB of text data, single mode consistently outperformed multi-mode by a factor of 4 or more. With image data, single mode outperformed multi-mode up to about 2,000 images (∼8 GB), after which multi-mode started outperforming single mode. At 10,000 images (∼40 GB), multi-mode outperformed single mode by a factor of 4. We learned that, while Hadoop is useful for processing large data, it is not efficient for handling small data. Single-mode tools such as R or MATLAB are more cost-effective for handling small data, up to a point. The threshold size for choosing between single mode and multi-mode is rather subjective, however, and needs to be decided based on the type of data, the cost and performance of individual machines, and the cost of development and maintenance. Copyright ISCA, SEDE 2016.
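For readers unfamiliar with the text-data workload, the following is a minimal single-machine sketch of the Pearson correlation coefficient for two equal-length numeric series. It is written in Python for illustration only; the study itself used R and RHadoop, and the function name `pearson` and the sample values are assumptions, not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two numeric sequences.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up sample values, used only to exercise the function.
    print(pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]))
```

In a MapReduce setting, the same quantity can be assembled from per-partition partial sums, which is one reason the computation can be distributed; whether the paper used this decomposition is not stated in the abstract.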
Big data; Hadoop; MapReduce; Maria DB; Pearson correlation; R; RHadoop
Data analysis performance comparison between single-mode and multi-mode. In S. Dascalu, F.C. Harris, Y. Shi (Eds.),
25th International Conference on Software Engineering and Data Engineering, SEDE 2016
The International Society for Computers and Their Applications (ISCA).