Lalmohan Behera, Vishnu Vardhan Reddy Chilukoori, 2024. "End-to-End Data Pipelines: Redefining the Architecture of Data Engineering in Cloud Environments", ESP International Journal of Advancements in Science & Technology (ESP-IJAST), Volume 2, Issue 4: 26-33.
Data pipelines are a critical component of contemporary data engineering, responsible for accessing, processing, and delivering data to an organisation's applications. In cloud environments, where efficiency, scalability, and flexibility are essential, the need for high-quality, reliable, and efficient data pipeline architectures has grown. This paper examines the considerations involved in selecting an appropriate end-to-end data pipeline, emphasising architectural strategies that support scale, resilience, and real-time computation. Modern large-scale applications generate enormous volumes of data; to handle them, reliable pipelines for different use cases can be built with cloud-native technologies such as Apache Kafka and Spark, together with cloud services such as AWS Glue and Google Dataflow. The work reviews and analyses current methods and practices and reports on their effectiveness for different applications. Particular attention is paid to metadata, which governs data flow so as to minimise the time between input and output, maximise dependability, and minimise cost. Guidelines for using adaptive cloud-based pipelines for real-time and batch processing are derived from the empirically computed metrics and the case studies presented in the paper. The study offers practical findings and recommendations for rethinking how data pipelines are built and managed, with the aim of shortening data-to-insight times in cloud-dominated business landscapes.
Keywords: Data Pipelines, Cloud Computing, Data Engineering, Scalability, Real-time Processing, Fault Tolerance.