Accessing Data Stores through Spark Clusters
Qubole supports accessing data stores through Spark clusters by adding a catalog parameter while creating a data store through a REST API request. Create a DbTap and Edit a DbTap describe the catalog parameter.
For Spark, the catalog parameter supports the following database types for data stores:
MySQL (Scala)
Redshift (Scala and Python). For more information about the Spark Redshift connector, see Connecting to Redshift Data Source from Spark.
Note
To access data stores, you must have the read or update permission on the Data Connections resource. For more information about the resources, see Resources, Actions, and What they Mean. This feature is supported on Spark 2.3.2, and on Spark 2.4.0 and later versions. It is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account.
To access a data store through a Spark cluster, perform these steps:
1. Create a ticket with Qubole Support to enable this feature.
2. Add the catalog parameter in the data store configuration (a sketch of such a request follows these steps). Create a DbTap and Edit a DbTap describe the catalog parameter.
3. Access the data store by using its JDBC URL, username, and password.
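As a rough illustration of step 2, the following sketch posts a data store definition carrying the catalog parameter to the Create a DbTap API. The endpoint URL, the X-AUTH-TOKEN header, the QUBOLE_API_TOKEN environment variable, and the payload field names (including catalog_name standing in for the catalog parameter) are assumptions; confirm them against Create a DbTap and Edit a DbTap before use.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

// Assumed environment variable holding the QDS API token.
val apiToken = sys.env("QUBOLE_API_TOKEN")
// Assumed Create a DbTap endpoint; verify against the Create a DbTap documentation.
val endpoint = "https://api.qubole.com/api/v1.2/db_taps"

// Assumed payload field names; "catalog_name" stands in for the catalog parameter described above.
val payload =
  """{
    |  "db_type": "mysql",
    |  "db_name": "<mysql-databasename>",
    |  "db_host": "<mysql-hostname>",
    |  "db_user": "<username>",
    |  "db_passwd": "<password>",
    |  "catalog_name": "catalog-name-created-during-create-dbtap"
    |}""".stripMargin

val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("X-AUTH-TOKEN", apiToken)
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)

val writer = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
writer.write(payload)
writer.close()

// A 2xx response indicates the data store was created with its catalog parameter.
println(s"Create DbTap returned HTTP ${conn.getResponseCode}")

The catalog name and database name supplied in this request are the values referenced as catalogName and databaseName in the registration example below.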
The QuboleDBTap class and companion object have been copied from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap for Spark 2.0.0 and later versions. com.qubole.QuboleDBTap is still maintained for backward compatibility with existing versions of Spark. However, Qubole strongly recommends migrating from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap, as support for com.qubole.QuboleDBTap will be removed starting from Spark 2.3.0. QuboleDBTap and its methods can only be used by importing org.apache.spark.sql.qubole.QuboleDBTap.
The following example shows how to register tables and query information through the API:
import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using Spark 1.6.x, use: import com.qubole.QuboleDBTap
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "catalog-name-created-during-create-dbtap" //See step 2 above
val databaseName = "database-name-created-during-create-dbtap" //See step 2 above
val quboleDBTap = QuboleDBTap.get(s"$catalogName",sqlContext)
//list of tables included, supports regex pattern matching
val includes = List()
//list of tables excluded, supports regex pattern matching
val excludes = List()
quboleDBTap.registerTables(s"$databaseName", includes, excludes)
val tableName = "mysql-tablename"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show
//On completion of using the quboleDBTap object
quboleDBTap.unregister()
The following example shows how to create a short-lived DBTap object for a Spark session directly, without using the REST APIs as in the previous example:
import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using Spark 1.6.x, use: import com.qubole.QuboleDBTap.
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "any-catalog-name"
val hostName = "<mysql-hostname>"
val databaseType = "mysql"
val jdbcUrl = s"jdbc:$databaseType://$hostName/"
val username = "<username>"
val password = "<password>"
val quboleDBTap = new QuboleDBTap(catalogName, jdbcUrl, username, password, sqlContext)
//list of tables included, supports regex pattern matching
val includes = List()
//list of tables excluded, supports regex pattern matching
val excludes = List()
val databaseName = "<mysql-databasename>"
quboleDBTap.registerTables(s"$databaseName", includes, excludes)
val tableName = "<mysql-tablename>"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show
//On completion of using the quboleDBTap object
quboleDBTap.unregister()