Accessing Data Stores through Spark Clusters

Qubole supports accessing data stores through Spark clusters by adding a catalog parameter when creating a data store through a REST API request. Create a DbTap and Edit a DbTap describe the catalog parameter.

For Spark, the catalog parameter supports a specific set of database types for data stores; the examples in this section use MySQL.

Note

To access data stores, you must have read or update permission on the Data Connections resource. For more information about resources, see Resources, Actions, and What they Mean. This feature is supported on Spark 2.3.2 and on Spark 2.4.0 and later versions. It is not enabled for all users by default; create a ticket with Qubole Support to enable it on the QDS account.

To access a data store through a Spark cluster, perform these steps:

  1. Create a ticket with Qubole Support to enable this feature.

  2. Add the catalog parameter to the data store configuration. Create a DbTap and Edit a DbTap describe the catalog parameter.

  3. Access the data store by using its JDBC URL, username, and password.

The QuboleDBTap class and its companion object have been copied from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap for Spark 2.0.0 and later versions.

com.qubole.QuboleDBTap is still maintained for backward compatibility with existing versions of Spark. However, Qubole strongly recommends migrating from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap, because support for com.qubole.QuboleDBTap will be removed starting with Spark 2.3.0. QuboleDBTap and its methods can be used only by importing org.apache.spark.sql.qubole.QuboleDBTap.
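Because the class and its companion object were copied without API changes, migrating is usually just a matter of updating the import. For example:

import org.apache.spark.sql.qubole.QuboleDBTap // Spark 2.0.0 and later
// import com.qubole.QuboleDBTap              // deprecated; removed starting with Spark 2.3.0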

The following example shows how to register tables and query information through the QuboleDBTap API:

import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using Spark 1.6.x, use: import com.qubole.QuboleDBTap
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "catalog-name-created-during-create-dbtap" // See step 2 above
val databaseName = "database-name-created-during-create-dbtap" // See step 2 above
val quboleDBTap = QuboleDBTap.get(catalogName, sqlContext)
// List of tables to include; supports regex pattern matching
val includes = List[String]()
// List of tables to exclude; supports regex pattern matching
val excludes = List[String]()
quboleDBTap.registerTables(databaseName, includes, excludes)

val tableName = "mysql-tablename"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show

// When you are finished with the quboleDBTap object, unregister its tables
quboleDBTap.unregister()
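Because the includes and excludes lists support regex pattern matching, you can register a subset of tables rather than the whole database. A minimal sketch (the table-name patterns are hypothetical):

// Register only tables whose names start with "orders",
// except any table ending in "_staging" (hypothetical patterns)
val includes = List("orders.*")
val excludes = List(".*_staging")
quboleDBTap.registerTables(databaseName, includes, excludes)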

The following example shows how to create a short-lived DBTap object for a Spark session directly, without using the REST APIs as in the previous example:

import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using Spark 1.6.x, use: import com.qubole.QuboleDBTap
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "any-catalog-name"
val hostName = "<mysql-hostname>"
val databaseType = "mysql"
val jdbcUrl = s"jdbc:$databaseType://$hostName/"
val username = "<username>"
val password = "<password>"
val quboleDBTap = new QuboleDBTap(catalogName, jdbcUrl, username, password, sqlContext)
// List of tables to include; supports regex pattern matching
val includes = List[String]()
// List of tables to exclude; supports regex pattern matching
val excludes = List[String]()
val databaseName = "<mysql-databasename>"
quboleDBTap.registerTables(databaseName, includes, excludes)

val tableName = "<mysql-tablename>"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show

// When you are finished with the quboleDBTap object, unregister its tables
quboleDBTap.unregister()
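The examples above construct a HiveContext, which is deprecated in Spark 2.x in favor of SparkSession. On Spark 2.0.0 and later you can instead reuse the SQLContext wrapped by the existing session; a minimal sketch, assuming the predefined spark SparkSession that spark-shell and notebooks provide, and that QuboleDBTap.get accepts any SQLContext (HiveContext is a subclass):

import org.apache.spark.sql.qubole.QuboleDBTap

// Reuse the SQLContext exposed by the existing SparkSession
// instead of constructing a new HiveContext
val sqlContext = spark.sqlContext
val quboleDBTap = QuboleDBTap.get("catalog-name-created-during-create-dbtap", sqlContext)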