Author: Abd El-Aal, Sara Moustafa Mohamed. / Title: A Machine Learning Approach for Predicting Execution Time of Spark Jobs

Title

A Machine Learning Approach for Predicting Execution Time of Spark Jobs

Author

Abd El-Aal, Sara Moustafa Mohamed.

Preparation committee

Researcher / Sara Moustafa Mohamed Abd El-Aal

sara.abdelaal4@alex-eng.edu.eg

Supervisor / Mohamed Abdel Hamid Ismail

drmaismail@gmail.com

Supervisor / Iman Ibrahim El-Ghandour

ielghand@yahoo.com

Examiner / Magdy Hussein Nagi

magdy.nagi@ieee.org

Examiner / Amani Anwar Saad

Subject

Computer Engineering.

Publication date

2018.

Number of pages

52 p.

Language

English

Degree

Master's

Specialization

Engineering (miscellaneous)

Date of approval

13/10/2018

Place of approval

Alexandria University - Faculty of Engineering - Computer and Systems Engineering

Abstract

The rapid growth of data, and the demand to analyze it in order to predict information or answer queries, are the main motivations for developing a variety of distributed computing frameworks. One such framework is Hadoop, an open-source project for processing large distributed datasets in parallel using MapReduce. Many vendors, such as Microsoft, Amazon, Google, IBM and Oracle, offer Hadoop deployments in the cloud. Each vendor sets its own cloud pricing, so predicting workload execution time helps in estimating cloud cost and selecting an offering that fits a given budget. However, Hadoop is inefficient for interactive data mining and iterative algorithms: because mappers and reducers are stateless, data must be stored on disk between iterations.

Spark is one of the frameworks developed recently to address these drawbacks of Hadoop MapReduce. It has gained growing attention in the past couple of years as an in-memory cloud computing platform, and it supports various types of workloads, such as SQL queries and machine learning applications. Many enterprises currently use Spark for its fast in-memory processing of large-scale data. Speeding up execution in Spark is also an important problem for many real-time applications. This can be achieved by improving the scheduling approaches employed by Spark, optimizing the execution plans Spark generates for various applications, and selecting the best cluster configuration for an input workload. A first step for all of these optimization approaches is to predict the execution time of an input Spark application.

In this thesis, we present a new platform that predicts with high accuracy the execution time of SQL queries and machine learning applications executed by Spark. We evaluate the platform by measuring the accuracy of its execution-time predictions for various types of Spark jobs, including TPC-H queries and machine learning classification and clustering applications. The evaluation experiments show that the platform predicts the execution time of Spark jobs with an average accuracy of 90% for SQL queries and 75% for machine learning jobs.
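
The abstract notes that a predicted execution time can be used to estimate cloud cost and to choose an offering that fits a budget. The following is a minimal sketch of that idea only, not the thesis's platform: the `Offer` type, the per-node hourly pricing model, the vendor names, and all numbers are assumptions made up for illustration, and the predicted runtimes are supplied by hand rather than produced by any model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Offer:
    """Hypothetical cloud offering; the fields and prices are illustrative."""
    vendor: str
    nodes: int
    price_per_node_hour: float  # currency units per node per hour (assumed model)


def estimated_cost(offer: Offer, predicted_seconds: float) -> float:
    """Estimate cost as nodes * hourly price * predicted runtime in hours."""
    return offer.nodes * offer.price_per_node_hour * predicted_seconds / 3600.0


def cheapest_within_budget(
    offers: List[Offer],
    predicted_seconds: Dict[str, float],
    budget: float,
) -> Optional[Offer]:
    """Return the lowest-cost offer whose estimated cost fits the budget.

    predicted_seconds maps each vendor name to the runtime predicted for the
    workload on that offer. Returns None when no offer is affordable.
    """
    costs = [(estimated_cost(o, predicted_seconds[o.vendor]), o) for o in offers]
    affordable = [(cost, o) for cost, o in costs if cost <= budget]
    if not affordable:
        return None
    return min(affordable, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Made-up offers and hand-written runtime predictions, for illustration.
    offers = [Offer("vendor_a", 4, 0.5), Offer("vendor_b", 8, 0.2)]
    predictions = {"vendor_a": 1800.0, "vendor_b": 3600.0}
    choice = cheapest_within_budget(offers, predictions, budget=1.5)
    print(choice.vendor if choice else "no offer fits the budget")
```

Under this assumed pricing model, an error in the predicted runtime carries over proportionally to the cost estimate, which is one reason prediction accuracy matters for budget-driven selection.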