Configuring the AWS Glue Sync Agent

Qubole supports using the AWS Glue Data Catalog sync agent with QDS clusters to synchronize metadata changes from the Hive metastore to the AWS Glue Data Catalog. The sync agent is supported on Hive versions 2.1.1 and 2.3. It also works with Presto and Spark clusters because the Hive metastore handles the synchronization.

When you sync the AWS Glue Data Catalog with the Hive metastore, metadata operations from the Hive metastore are replicated in the AWS Glue Data Catalog, although the Hive metastore remains the only source of truth. Metadata operations from the AWS Glue Data Catalog are not replicated to the Hive metastore.

Note

This feature is not available by default. Create a ticket with Qubole Support to enable it on the Qubole account.

Prerequisites

  • You need AWS IAM Role-based authentication to access the AWS Glue Data Catalog. Configuring the Qubole Data Service describes how to set up an IAM Role-based QDS account.
  • Contact Qubole Support to enable the Glue sync agent on the QDS account.
  • The Glue sync agent is only applicable to queries run on HiveServer2.
  • The Hive Metastore Server/Hive client must run on Java 8. Ensure that this is enabled by Qubole Support.
  • If you enable HiveServer2 on the cluster, then ensure that it runs on Java 8. If it is not enabled, contact Qubole Support to enable it.
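
The Java 8 requirement can be checked on a node before you contact Qubole Support. The following Python sketch is one way to do that and is not a Qubole-provided tool. The function names are hypothetical, and the check inspects whichever java binary is first on the PATH, which is assumed (not guaranteed) to be the one that the Hive Metastore Server and HiveServer2 use.

```python
import re
import subprocess


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report versions as 1.x (for example 1.8.0_252);
    # Java 9 and later put the major version first (for example 11.0.2).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def local_java_major() -> int:
    """Run `java -version` for the java binary on the PATH and parse its report."""
    # `java -version` writes its report to stderr, so both streams are checked.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr + result.stdout)
```

Calling local_java_major() on a cluster node returns 8 when that java binary is Java 8.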

Configuring the Glue Sync Agent with QDS Clusters

Note

Currently, Qubole supports configuring the Glue sync agent through a node bootstrap.

Perform these steps to use the Glue sync agent with QDS clusters:

  1. Download the node bootstrap script file from Nodebootstrap for Glue Sync Agent and add the script to the cluster’s node bootstrap.

  2. The node bootstrap script also includes the required dependencies, which are copied along with the installer.

  3. Note down the existing S3 bucket that stores the results related to the Hive metastore, or create a new S3 bucket and note down its name.

  4. Create an IAM policy that grants the instance the relevant permissions for AWS CloudWatch, because the agent internally writes events to AWS CloudWatch, and add the bucket name (noted down in step 3) to the policy. The IAM policy should have these policy elements. Replace <my-bucket> with the bucket name from step 3.

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Action": [
                 "glue:CreateDatabase",
                 "glue:DeleteDatabase",
                 "glue:GetDatabase",
                 "glue:GetDatabases",
                 "glue:UpdateDatabase",
                 "glue:CreateTable",
                 "glue:DeleteTable",
                 "glue:BatchDeleteTable",
                 "glue:UpdateTable",
                 "glue:GetTable",
                 "glue:GetTables",
                 "glue:BatchCreatePartition",
                 "glue:CreatePartition",
                 "glue:DeletePartition",
                 "glue:BatchDeletePartition",
                 "glue:UpdatePartition",
                 "glue:GetPartition",
                 "glue:GetPartitions",
                 "glue:BatchGetPartition"
             ],
             "Resource": [
                 "*"
             ]
         },
         {
             "Sid": "VisualEditor0",
             "Effect": "Allow",
             "Action": [
                 "athena:*",
                 "logs:CreateLogGroup"
             ],
             "Resource": "*"
         },
         {
             "Sid": "VisualEditor1",
             "Effect": "Allow",
             "Action": [
                 "s3:GetBucketLocation",
                 "s3:GetObject",
                 "s3:ListBucket",
                 "s3:ListBucketMultipartUploads",
                 "s3:ListMultipartUploadParts",
                 "s3:AbortMultipartUpload",
                 "s3:CreateBucket",
                 "s3:PutObject"
             ],
             "Resource": [
                 "arn:aws:s3:::<my-bucket>",
                 "arn:aws:s3:::<my-bucket>/*"
             ]
         },
         {
             "Sid": "VisualEditor2",
             "Effect": "Allow",
             "Action": [
                 "logs:CreateLogStream",
                 "logs:DescribeLogGroups",
                 "logs:DescribeLogStreams",
                 "logs:PutLogEvents"
             ],
             "Resource": "arn:aws:logs:*:*:log-group:HIVE_METADATA_SYNC:*:*"
         }
     ]
    }
    
  5. Attach the above policy to the IAM Role that the Qubole account uses for authentication.

  6. Add these keys in the Hive Settings > Override Hive Configuration section under the Advanced Configuration tab of the Clusters page:

    • hive.metastore.event.listeners: Set it to com.amazonaws.services.glue.catalog.HiveGlueCatalogSyncAgent.
    • glue.catalog.athena.jdbc.url: The URL used to connect to Athena. The default URL is jdbc:awsathena://athena.us-east-1.amazonaws.com:443.
    • glue.catalog.athena.s3.staging.dir: The bucket and prefix used to store Athena’s query results.
    • glue.catalog.dropTableIfExists: An optional key that controls whether an existing table is dropped and re-created. Its default value is true.
    • glue.catalog.createMissingDB: An optional key that controls whether databases are created if they do not exist. Its default value is true.
    • glue.catalog.athena.suppressAllDropEvents: It prevents propagation of DropTable and DropPartition events to the remote environment.
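
The keys in step 6 can be assembled programmatically so that the same values are reused across clusters. The following Python sketch makes several assumptions that are not confirmed by this page: the function names are hypothetical, the URL for a region other than us-east-1 is assumed to follow the pattern of the default URL, the staging directory value is an illustrative s3:// path, and the override field is assumed to accept one key=value pair per line. Optional keys are written only when a value is passed, so the agent's own defaults apply otherwise.

```python
from typing import Dict, Optional

LISTENER_CLASS = "com.amazonaws.services.glue.catalog.HiveGlueCatalogSyncAgent"


def glue_sync_overrides(
    staging_dir: str,
    region: str = "us-east-1",
    drop_table_if_exists: Optional[bool] = None,
    create_missing_db: Optional[bool] = None,
    suppress_all_drop_events: Optional[bool] = None,
) -> Dict[str, str]:
    """Build the Hive override keys listed in step 6."""
    overrides = {
        "hive.metastore.event.listeners": LISTENER_CLASS,
        # Assumption: other regions follow the same pattern as the default URL.
        "glue.catalog.athena.jdbc.url": f"jdbc:awsathena://athena.{region}.amazonaws.com:443",
        "glue.catalog.athena.s3.staging.dir": staging_dir,
    }
    optional = {
        "glue.catalog.dropTableIfExists": drop_table_if_exists,
        "glue.catalog.createMissingDB": create_missing_db,
        "glue.catalog.athena.suppressAllDropEvents": suppress_all_drop_events,
    }
    for key, value in optional.items():
        # Skip keys that were not set so the agent's defaults stay in effect.
        if value is not None:
            overrides[key] = "true" if value else "false"
    return overrides


def as_override_lines(overrides: Dict[str, str]) -> str:
    """Render the keys as key=value lines (assumed format of the override field)."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())
```

For example, as_override_lines(glue_sync_overrides("s3://example-bucket/athena-results/")) produces three lines, one for each of the first three keys in step 6, where example-bucket stands in for the bucket noted down in step 3.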

For more information, see AWS Glue Sync Agent.
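
For the IAM policy in step 4, the <my-bucket> placeholder can be filled in with a short script instead of by hand. The following Python sketch shows one way to do that under stated assumptions: render_policy is a hypothetical helper, and the embedded template is abbreviated to the S3 statement with a shortened action list purely for illustration. In practice, the full policy text from step 4 would be passed in.

```python
import json

PLACEHOLDER = "<my-bucket>"

# Abbreviated template for illustration only; it holds just the S3 statement
# with a shortened action list. Use the full policy from step 4 in practice.
S3_ONLY_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::<my-bucket>", "arn:aws:s3:::<my-bucket>/*"]
    }
  ]
}
"""


def render_policy(template_text: str, bucket: str) -> dict:
    """Replace the bucket placeholder and parse the result to confirm it is valid JSON."""
    # Expect a bare bucket name, not an s3:// URL, an ARN, or a path.
    if not bucket or any(ch in bucket for ch in "/:<>"):
        raise ValueError("expected a bare bucket name")
    return json.loads(template_text.replace(PLACEHOLDER, bucket))
```

Passing the result to json.dumps(policy, indent=2) gives policy text with the bucket name filled in, ready to paste into the IAM console.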

Limitations

Here are a few limitations of using the AWS Glue Data Catalog sync agent:

  • The Glue sync agent is only applicable to queries run on HiveServer2.
  • HiveServer2 and the Hive client must run on Java 8.