Paper Title: Apache Hadoop for large-scale data processing using machine learning techniques
Authors: Nidaa Ghalib Ali, Mohanaed Ajmi Falih, Ali Ajmi Falih
Corresponding Author: Nidaa Ghalib Ali (inb.nedaa10@atu.edu.iq), Iraq
Abstract
As Big Data grows in volume and variety, more advanced processing technology is needed. This paper discusses Volume, Variety, and Velocity, known as the 3Vs of Big Data, along with Valence and Veracity. As organizations struggle with these complexities, Apache Spark emerges as a technology that can overcome the limitations of Hadoop MapReduce and enable real-time analytics. The study evaluates the K-Nearest Neighbors (KNN) algorithm on structured data, Decision Tree regression on unstructured data, and logistic regression on semi-structured data. The algorithms performed well on structured data; however, all the models failed to predict unstructured data. An examination of framework performance also demonstrates the computational efficiency of Apache Hadoop and Apache Spark, with Spark outperforming Hadoop in processing speed across all data types and algorithms. As a result, Big Data requires advanced analytical tools. Apache Spark is a modern, high-performance data processing framework that enables organizations to manage Big Data in real time.
Keywords
Big Data, Hadoop, Spark, Machine learning
Cite:
Ghalib Ali, N., Ajmi Falih, M., & Ajmi Falih, A. (2026). Apache Hadoop for large-scale data processing using machine learning techniques. Future Technology, 5(3), 128–138. Retrieved from https://fupubco.com/futech/article/view/762