Qubole Clusters

Informatica BDM supports a subset of cluster types, node types, and other features that are part of Qubole.

Qubole Cluster Types

Qubole allows you to create several types of clusters. BDM integration supports the Hadoop Qubole cluster type.

Hadoop Qubole Cluster Node Types

Qubole uses Amazon EC2 instances for cluster nodes. Each Hadoop Qubole cluster consists of a Coordinator node and at least one worker or task node, each of which runs a Hadoop distribution. Cluster nodes are of the following types:

Coordinator Node

Manages the cluster by running software components which coordinate the distribution of data and tasks among worker nodes for processing. The Coordinator node tracks the status of tasks and monitors the health of the cluster.

Worker Node

Worker nodes are used for computing tasks. The worker nodes have software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on the cluster.

Hadoop Qubole Cluster Data Processing Frameworks

The data processing framework layer is the engine used to process and analyze data. Many frameworks run on YARN or have their own resource management. Informatica refers to these data processing frameworks as run-time engines. The engine you choose depends on processing needs, such as batch, interactive, in-memory, or streaming. Your choice of run-time engine affects the languages and interfaces on the application layer, which is the layer used to interact with the data you want to process. Big Data Management can run jobs on the following run-time engines:

Spark

Apache Spark is a cluster computing framework and programming model for processing big data workloads. Like Hadoop MapReduce, Spark is an open-source, distributed processing system, but it uses directed acyclic graphs for execution plans and caches datasets in memory. Spark also supports interactive query modules such as Spark SQL.

Note

By default, Qubole uses Spark version 2.1.1. The integration with Big Data Management requires Spark 2.2.1.
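The following is a minimal PySpark sketch that illustrates the Spark behavior described above: a dataset is cached in memory and then queried interactively with Spark SQL. It is not specific to Big Data Management or Qubole, and the data and column names are illustrative only.

```python
# Minimal PySpark sketch: cache a DataFrame in memory and query it with Spark SQL.
# The data and column names here are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Build a small DataFrame; in practice this would come from HDFS, S3, or another source.
orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "games", 59.99), (3, "books", 7.25)],
    ["order_id", "category", "amount"],
)

# Persist the dataset in memory so repeated queries reuse the cached data.
orders.cache()

# Register a temporary view and run an interactive Spark SQL query against it.
orders.createOrReplaceTempView("orders")
totals = spark.sql(
    "SELECT category, SUM(amount) AS total FROM orders GROUP BY category"
)
totals.show()

spark.stop()
```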

Blaze

Blaze is Informatica’s proprietary data processing engine. It integrates with YARN to provide intelligent data pipelining, job partitioning, job recovery, and scalability, and it is optimized for high-performance, scalable data processing using Informatica’s cluster-aware data integration technology.

AWS Resources

Qubole uses AWS resources for the following elements:

EC2 Instances

Qubole uses Amazon EC2 virtual machines to host the Qubole server and web client as well as Qubole cluster nodes. When you configure Qubole integration with Big Data Management, you select from the available EC2 node types. The node types available as Qubole cluster nodes are the same instance types that are listed in the AWS documentation.
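As a point of reference, the following sketch shows how you might list the EC2 instance types that AWS offers by using the AWS SDK for Python (boto3). It is not part of the Qubole or Big Data Management configuration itself, and it assumes that AWS credentials and a default region are already configured.

```python
# Sketch: list EC2 instance types with the AWS SDK for Python (boto3).
# This only illustrates how to inspect available AWS node types; it is not part
# of the Qubole or Big Data Management configuration. Assumes AWS credentials
# and a default region are already configured.
import boto3

ec2 = boto3.client("ec2")

# Page through all instance types and print the name, vCPU count, and memory.
paginator = ec2.get_paginator("describe_instance_types")
for page in paginator.paginate():
    for itype in page["InstanceTypes"]:
        name = itype["InstanceType"]
        vcpus = itype["VCpuInfo"]["DefaultVCpus"]
        mem_gib = itype["MemoryInfo"]["SizeInMiB"] / 1024
        print(f"{name}: {vcpus} vCPUs, {mem_gib:.1f} GiB memory")
```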

Cluster Resource Management

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data. By default, the Hadoop Qubole cluster uses YARN, which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks.
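For illustration, the following minimal sketch reads cluster-wide resource metrics from the standard YARN ResourceManager REST API. The ResourceManager address and port are placeholders, and the sketch assumes network access to the node where the ResourceManager web UI is exposed.

```python
# Sketch: read cluster-wide resource metrics from the YARN ResourceManager REST API.
# The ResourceManager address below is a placeholder; point it at the host where
# the ResourceManager web UI is exposed for your cluster.
import json
import urllib.request

RM_METRICS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"

with urllib.request.urlopen(RM_METRICS_URL) as response:
    metrics = json.load(response)["clusterMetrics"]

print("Active nodes: ", metrics["activeNodes"])
print("Running apps: ", metrics["appsRunning"])
print("Allocated MB: ", metrics["allocatedMB"])
print("Available MB: ", metrics["availableMB"])
```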