Qubole-SageMaker Integration Guide (AWS)

Introduction

Qubole is a cloud-native platform for self-service AI, machine learning, and analytics. It removes the complexity and reduces the cost of managing data, so data teams can focus on business outcomes rather than infrastructure management.

Qubole analyzes and learns from usage patterns through a combination of heuristics and machine learning to automate platform management. It provides insights and recommendations that improve performance, reduce cost, and increase the reliability of big data workloads. Qubole gives you the flexibility to access, configure, and monitor your big data clusters in the cloud of your choice. You can query data through the web-based console in the programming language of your choice, build integrated products using the REST API, use the SDK to build applications with Qubole, and connect to third-party BI tools through ODBC/JDBC connectors.

Amazon SageMaker is a fully managed platform that gives developers and data scientists an easy way to build, train, and deploy machine learning models at any scale. Amazon SageMaker removes the barriers that typically slow down developers who want to use machine learning. The SageMaker and Qubole integration allows enterprise users to use Qubole Notebooks and Qubole Spark to explore, clean, and prepare data in the format required by machine learning (ML) algorithms.

Users can read their Amazon S3 data into Qubole Spark DataFrames, use Qubole Notebooks to prepare the data, and start model training using an estimator from the SageMaker Spark library. This initiates ML training in SageMaker, builds the model, and creates the endpoint that hosts that model. The diagram below illustrates this workflow:

[Diagram: prepare_data_and_start_ML.png]
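The workflow above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the exact code used by the integration: it assumes the SageMaker Spark library (`sagemaker_pyspark`) and its JARs are already available on the Qubole Spark cluster, that the notebook provides a `spark` session, and that the training data is stored in LibSVM format. The helper `training_settings`, the function `train_kmeans`, the choice of the K-Means estimator, and the `ml.m4.xlarge` instance type are all hypothetical choices made for this example.

```python
def training_settings(role_arn, k, feature_dim, instance_type="ml.m4.xlarge"):
    """Validate and collect the settings for one training run (hypothetical helper).

    Checking inputs up front avoids launching a billable SageMaker job
    with values that are obviously wrong.
    """
    if not role_arn:
        raise ValueError("role_arn must be a non-empty IAM role ARN")
    if k < 2:
        raise ValueError("k must be at least 2")
    if feature_dim < 1:
        raise ValueError("feature_dim must be positive")
    return {
        "role_arn": role_arn,
        "k": k,
        "feature_dim": feature_dim,
        "instance_type": instance_type,
    }


def train_kmeans(spark, s3_path, settings):
    """Read data from S3 into a Spark DataFrame and start training in SageMaker.

    The SageMaker imports are deferred so the module can be loaded even
    where the library is not installed.
    """
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

    # Load LibSVM data from S3; this yields "label" and "features" columns.
    training_data = (
        spark.read.format("libsvm")
        .option("numFeatures", str(settings["feature_dim"]))
        .load(s3_path)
    )

    estimator = KMeansSageMakerEstimator(
        sagemakerRole=IAMRole(settings["role_arn"]),
        trainingInstanceType=settings["instance_type"],
        trainingInstanceCount=1,
        endpointInstanceType=settings["instance_type"],
        endpointInitialInstanceCount=1,
    )
    estimator.setK(settings["k"])
    estimator.setFeatureDim(settings["feature_dim"])

    # fit() runs the training job in SageMaker and returns a model
    # that is backed by a hosted endpoint.
    return estimator.fit(training_data)
```

In a notebook paragraph, this might be called as `model = train_kmeans(spark, "s3a://<your-bucket>/<path>", training_settings(role_arn, k=10, feature_dim=784))`, after which `model.transform(df)` sends rows to the endpoint for inference. The bucket path and role ARN are placeholders you would replace with your own values.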

Alternatively, you can extend SageMaker's data processing capabilities by connecting a SageMaker notebook instance to Qubole.

In this setup, you use Spark to process and prepare data at scale, and you reduce compute cost by taking advantage of Qubole’s Adaptive Serverless Platform Architecture to prepare and clean data prior to ML training. The diagram below illustrates this setup:

[Diagram: preparing_data_in_qubole.png]