A Comprehensive Review on Data Science Frameworks for Big Data Analytics
DOI:
https://doi.org/10.62671/perfect.v3i1.217
Keywords:
Big Data, Data Science Frameworks, Hadoop, Spark, Real-Time Analytics
Abstract
Big data analytics has become essential for deriving insights from large and complex data across industries. This review discusses major data science frameworks, including Apache Hadoop, Spark, Flink, and Storm, covering their architectures, capabilities, and relative advantages for batch and real-time processing. It also presents the main challenges that affect framework efficiency, including scalability, latency, data heterogeneity, security, and operational complexity. Finally, emerging trends such as AI adoption, cloud-native architectures, real-time streaming, and intelligent automation are discussed to illustrate how the landscape is changing. The review offers an in-depth account of big data frameworks and how they enable effective analytics.
License
Copyright (c) 2026 Hassan Raza, Tsendayush Erdenetsogt, A Singh, Mazhar Farooq, Muhammad Mohsin Kabeer, Muhammad Shahrukh Aslam (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



