Connecting to MySQL and JDBC Sources using Presto Clusters

Note

Presto is currently supported on the AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

You can use the MySQL connector to query and create tables in an external MySQL database, and to share data between two different tools such as MySQL and Hive.

Connecting to MySQL Sources using Presto Clusters

To connect to a MySQL source using a Presto cluster, configure a MySQL connector by adding a catalog properties file in etc/catalog. You can do this in the QDS UI when adding or editing a Presto cluster. For example, add the following in the Override Presto Configuration text box (see Configuring a Presto Cluster for more information).

catalog/mysqlcatalog.properties:
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=root
connection-password=secret

Adding the above properties in the Presto cluster settings creates a new file, mysqlcatalog.properties, in etc/catalog when the cluster comes up.

In addition, add the following in the same text box:

config.properties:
datasources=jmx,hive,mysqlcatalog

Note

The datasources configuration is not required for configuring connectors in Presto 317; it is described in Data Sources Properties.

Start or restart the cluster to apply the configuration.
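
Once the cluster is up, you can verify that the catalog has been created. For example, running the following query on the cluster should list mysqlcatalog along with the other configured catalogs, such as jmx and hive:

    SHOW CATALOGS;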

Querying MySQL

You can query a MySQL database as follows:

  1. The MySQL connector exposes a schema for every MySQL database. You can run the following command to see the available MySQL databases.

    SHOW SCHEMAS FROM mysqlcatalog;

  2. You can see the tables in a MySQL database by running the SHOW TABLES command. For example, to see the tables in a database named users, run the following command:

    SHOW TABLES FROM mysqlcatalog.users;

  3. To access a table from the MySQL database, run a SELECT query. For example, to access a permanentusers table in the users database, run:

    SELECT * FROM mysqlcatalog.users.permanentusers;
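
Because the Hive and MySQL catalogs are available side by side on the same cluster, you can also copy data between them with a single query, which is one way to share data between MySQL and Hive. The following sketch assumes a Hive table hive.default.signups whose column types map cleanly to MySQL types; the schema and table names are illustrative:

    CREATE TABLE mysqlcatalog.users.signups_copy AS
    SELECT * FROM hive.default.signups;

Similarly, an INSERT INTO ... SELECT statement appends rows to an existing MySQL table through the connector.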

Connecting to JDBC Sources using Presto Clusters

In data analytics, integrating data from multiple sources is a common problem: dimensional data such as user information resides in a relational database such as MySQL or PostgreSQL, while large semi-structured data such as clickstream data resides in Cloud object storage. You can use the Qubole Scheduler to periodically re-import the data into a database, but this helps only when the database does not change very often. When the database changes frequently, you can use Hive storage handlers to plug in other live data sources; a storage handler for JDBC-based databases suits this purpose well. You can create external Hive tables and map them to a database table. A query to the external Hive table is then rerouted to the underlying database table.

To use a storage handler to integrate data from multiple sources, perform the following steps:

  1. Build a Hive storage handler by using the code available on GitHub under the Apache License. The code is compatible with Apache Hive and Hadoop 2. The README file provides instructions for building the storage handler. Alternatively, you can use the Qubole storage handler JAR available in the public bucket.

  2. After building the storage handler JAR, or using the Qubole storage handler JAR, connect to a database by adding the JAR and creating an external Hive table with specific table properties (TBLPROPERTIES). TBLPROPERTIES carries information such as the JDBC driver class, hostname, username, password, and table names. The following code snippet shows how to add the storage handler JAR and create an external Hive table with the required TBLPROPERTIES.

    ADD JAR <scheme>paid-qubole/jars/jdbchandler/qubole-hive-jdbc-handler.jar;
    DROP TABLE IF EXISTS HiveTable;
    CREATE EXTERNAL TABLE HiveTable(
                                     id INT,
                                     id_double DOUBLE,
                                     names STRING,
                                     test INT
                                    )
    STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
    TBLPROPERTIES (
                   'mapred.jdbc.driver.class'='com.mysql.jdbc.Driver',
                   'mapred.jdbc.url'='jdbc:mysql://localhost:3306/rstore',
                   'mapred.jdbc.username'='-',
                   'mapred.jdbc.input.table.name'='JDBCTable',
                   'mapred.jdbc.output.table.name'='JDBCTable',
                   'mapred.jdbc.password'='-'
                  );
    

<scheme> is the Cloud-specific URI scheme: s3:// for AWS; wasb[s]://, adl://, or abfs[s]:// for Azure.
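
Once the external table is created, a SELECT against it is rerouted to the underlying JDBCTable table in MySQL, so the query reads live data rather than an imported copy. For example, using the columns defined above:

    SELECT id, names
    FROM HiveTable
    WHERE test > 0;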