Getting Started with Qubole on AWS

The world of Big Data is all around us. Transactions, sensors, social media, mobile devices, wearables, and a host of other sources are generating datasets of unprecedented volume, velocity and variety. Hadoop, Spark, Presto and the surrounding ecosystem of tools are lauded for their ability to handle massive volumes of structured and unstructured data, but the software is not easy to manage or use. If you are just starting out in that world, Qubole Data Service (QDS) could be for you.

You do not have to stand up physical servers, virtual machines or even cloud instances. These tasks still happen, but QDS provisions and manages the resources on your behalf, automatically scaling your Big Data clusters in AWS Elastic Compute Cloud (EC2) based on compute needs. Because it is cloud-native, QDS offers many advantages, including the following:

  • QDS decouples compute and storage, enabling users to take advantage of the increased elasticity of compute.
  • Ephemeral clusters minimize cost by running only when needed and provide administrators the ability to segregate work and meet SLA/SLO requirements.
  • It uses cloud object storage for persistent data storage.

To help get you started using Qubole Business Edition, refer to this introductory guide for setting up QDS secure access to Amazon Web Services (AWS).

Before you can use Qubole Data Service (QDS), you must create an account. You can do this by signing up on Qubole’s website, www.qubole.com.

Pre-Requisites

To be successful at using the Qubole Business Edition, you must have the following:

  • AWS Account (if you do not have one, you can sign up for a free account).

    Note

    While Qubole can help you dramatically decrease your AWS spend, you are ultimately responsible for the infrastructure costs managed by Qubole on your behalf.

  • Qubole Business Edition Account (if you do not have one you can sign up for free here).

AWS Set Up for Qubole

To get started with QDS, you need two things from your AWS account:

  • An AWS S3 Bucket for a default storage location
  • An AWS Identity and Access Management Role (with Qubole provided roles and policies)

This section steps through how to create the S3 bucket and the IAM Role.

Once you have these, the IAM Role is used in both the Storage and Compute settings sections in QDS. The S3 bucket that you create is required to complete the Storage settings section.

If you do not have an AWS account yet, no problem: you can sign up for a free AWS account here.

Setting up Amazon Simple Storage Service (S3)

First, you need some S3 storage, one of the services provided in your AWS account. S3 is storage for the Internet. You can use S3 to store, secure, and retrieve your data at any time, from anywhere on the web. You can accomplish these tasks using the simple and intuitive web interface of the AWS Management Console.

To launch the AWS S3 Management Console, click here.

Creating an S3 Bucket

From the AWS S3 Management Console, you are presented with a screen to create a bucket. If you have created buckets in the past, they will show up here. In the example below, this is the first bucket created. Select the blue Get Started button or the blue Create bucket button.

../../_images/CreateS3bucket.png

As you can see from the dialog below, there are four steps that let you set properties, permissions, and so on. For our purposes, you can just choose a name (qdsexample in this case) and select Create to create the bucket with default settings.

../../_images/CreateBucketName.png

After creating a bucket, you should see this screen.

../../_images/NewBucket.png

Note

If you have given your bucket a name other than the example provided in this guide, copy down the name you provided as you will need this in a later step.
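If you do choose your own bucket name, it may help to check it against the basic S3 naming rules before typing it into the console. The following is a minimal sketch, not part of QDS or AWS tooling: the function name is_valid_bucket_name is hypothetical, and it covers only the core rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starts and ends with a letter or digit; no adjacent dots; not shaped like an IP address). AWS applies a few further restrictions that are not checked here.

```python
import ipaddress
import re

# Core S3 bucket naming rules: 3-63 characters, lowercase letters, digits,
# dots and hyphens only, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic S3 bucket naming checks."""
    if _BUCKET_RE.fullmatch(name) is None:
        return False
    if ".." in name:  # adjacent dots are not allowed
        return False
    try:
        ipaddress.IPv4Address(name)  # names shaped like 192.168.5.4 are rejected
    except ValueError:
        return True
    return False


for candidate in ("qdsexample", "QDS_Example", "192.168.5.4"):
    print(candidate, is_valid_bucket_name(candidate))
```

The example name used in this guide, qdsexample, passes these checks; names with uppercase letters or underscores do not.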

IAM Overview (IAM Roles vs IAM Keys)

Now it is time to set up your IAM roles. QDS offers two access modes using AWS Identity and Access Management (IAM):

  • IAM Keys
  • IAM Roles*

IAM Keys are simple and give QDS broad access. While convenient, such broad access can be a security risk. Hence, as a best practice Qubole recommends IAM Roles. IAM Roles are more granular. An IAM role is similar to a user, in that it is an AWS identity with permission policies that determine what the identity can and cannot do in AWS.

Important

The scope of this document is limited to IAM Roles only. IAM Roles are recommended as a best practice. IAM Keys, while allowed, are discouraged.

IAM Roles and Policies

The following steps configure robust, granular security for your QDS environment. Here is what you do next:

Create an IAM Service Role

A service role is a role that an AWS service assumes to perform actions on your behalf. After creating the role, create the policies that will empower this role.

Go to the Roles section of AWS Identity and Access Management by clicking here:

https://console.aws.amazon.com/iam/home#/roles

To create the role, click Create Role (blue button).

You have the option here to choose an AWS Trusted Entity. The default is AWS Service, and that is the one you want. Underneath there is a list of AWS Services.

Click EC2.

Then click:

EC2: Allows EC2 instances to call AWS services on your behalf.

Click Next: Permissions (blue button).

Click Next: Review (blue button).

Choose a role name. QDS has a 20-character requirement, so for this example, type in QuboleQDS_BusinessEdition (or the role name of your choice). It can be anything as long as it contains AWS valid characters and meets the 20-character QDS requirement.

Click Create role (blue button).

Note

If you choose to name your role something other than the example provided here, make sure to copy down this name for use later on.
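For orientation, choosing AWS Service and EC2 as the trusted entity results in a role whose trust policy lets the EC2 service assume it. The sketch below shows what that standard trust document looks like and checks a role name against the characters IAM accepts (letters, digits and + = , . @ _ -, up to 64 characters). The helper names are hypothetical, and the QDS-specific 20-character requirement mentioned above is not checked here.

```python
import json
import re

# IAM role names: 1-64 characters from letters, digits and + = , . @ _ -
_ROLE_NAME_RE = re.compile(r"[A-Za-z0-9+=,.@_-]{1,64}")


def is_valid_role_name(name: str) -> bool:
    """Return True if `name` uses only characters IAM allows in role names."""
    return _ROLE_NAME_RE.fullmatch(name) is not None


def ec2_trust_policy() -> dict:
    """Standard trust policy that lets the EC2 service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


role_name = "QuboleQDS_BusinessEdition"  # example name used in this guide
print(role_name, is_valid_role_name(role_name))
print(json.dumps(ec2_trust_policy(), indent=2))
```

This trust document is replaced later in the guide, when the trust relationship is updated for cross-account access.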

Configure Your AWS Policies

The next four steps involve creating and configuring granular security policies in AWS. Qubole provides the policies in the Qubole documentation. The process will involve copying and pasting the configuration text (it’s in a format known as JSON) into your AWS console. Some of these will require editing to match your AWS setup.
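Because each policy is pasted and then edited by hand, a quick local sanity check can catch a stray comma or a missing field before the AWS console does. The following is a rough sketch under stated assumptions: check_policy_text is a hypothetical helper, the sample document is purely illustrative and is not one of the Qubole-provided policies, and the check is stricter than IAM itself (for example, it does not accept NotAction or NotResource).

```python
import json


def check_policy_text(text: str) -> list:
    """Return a list of problems found in a pasted IAM policy document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    statements = doc.get("Statement")
    if isinstance(statements, dict):  # a single statement is also allowed
        statements = [statements]
    if not isinstance(statements, list) or not statements:
        return ['missing or empty "Statement"']

    problems = []
    for index, statement in enumerate(statements):
        if not isinstance(statement, dict):
            problems.append(f"statement {index} is not a JSON object")
            continue
        for key in ("Effect", "Action", "Resource"):
            if key not in statement:
                problems.append(f'statement {index} is missing "{key}"')
    return problems


# Illustrative document only; use the policy text from the Qubole documentation.
sample = """{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": ["*"]}
  ]
}"""
print(check_policy_text(sample))
```

An empty list means the basic structure looks right; it does not mean the policy grants what QDS needs.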

Configure Your AWS EC2 Policy

Here, delegate a minimal amount of EC2 permissions so that QDS can manage EC2 instances on your behalf.

Go to the Policies section of AWS Identity and Access Management by clicking here:

https://console.aws.amazon.com/iam/home#/policies

Click Create Policy (blue button).

On the Create Policy page, you will see tabs for Visual editor and JSON. Click the JSON tab.

You see four lines of configuration code that need to be replaced by functional configuration code that can be found on the Qubole Documentation site.

Open a new browser tab (or window if you prefer) and go to: http://docs.qubole.com/en/latest/faqs/general-questions/policy-use-qubole-use-my-iam-credentials.html

Scroll down until you see the section titled:

Sample 1 - This sample is a simpler AWS policy for EC2 settings and it does:

There is a block of configuration code that looks similar to the template in the AWS Console, but has more settings. Copy that and paste it into the JSON tab in the AWS Console, replacing what is already in there. When you are done, it should look similar to this.

../../_images/EC2SettingsPolicy.png

Click Review Policy (blue button).

On the Review Policy page, select the Name text field to name the policy. Name the policy: QuboleQDSforEC2

To create the policy, click Create Policy (blue button).

Configure Your AWS S3 Policy

Here, delegate a minimal amount of S3 permissions so that QDS can manage S3 storage on your behalf.

You should still be on the Policies page of AWS Identity and Access Management. Here is the link if you have wandered: https://console.aws.amazon.com/iam/home#/policies.

Click Create Policy (blue button). On the Create Policy page, you will see tabs for Visual editor and JSON.

Click the JSON tab.

Like the previous step, you see four lines of configuration code that will need to be replaced by functional configuration code that can be found on the Qubole Documentation site that was previously opened.

Here it is again if you need to open it in a new browser tab or window:

http://docs.qubole.com/en/latest/faqs/general-questions/policy-use-qubole-use-my-iam-credentials.html

Scroll down until you see the section titled:

Here is a sample IAM policy for creating an AWS S3 policy.

Like the previous step, there is a block of configuration code that looks similar to the template in the AWS Console, but has more settings. Copy that and paste it into the JSON tab in the AWS Console, replacing what is currently in there.

After you have copied and pasted the configuration code, you also need to make an edit, since the configuration information in the document has a generic reference to the S3 bucket path. You will see it noted as <bucketpath>.

Replace <bucketpath> with the name of the S3 bucket that you created in Creating an S3 Bucket. If you used the same naming convention as provided in the guide, it would be named qdsexample.

When you are done, it should look similar to this:

../../_images/S3Policy.png
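The placeholder edit can also be done programmatically, which avoids missing an occurrence when <bucketpath> appears more than once. This is a minimal sketch: fill_bucket_path is a hypothetical helper, and the template below is an illustrative stand-in, not the policy from the Qubole documentation. Paste the real policy text in its place.

```python
import json

PLACEHOLDER = "<bucketpath>"


def fill_bucket_path(policy_text: str, bucket: str) -> str:
    """Replace every <bucketpath> placeholder and confirm the result is JSON."""
    if PLACEHOLDER not in policy_text:
        raise ValueError("policy text contains no <bucketpath> placeholder")
    filled = policy_text.replace(PLACEHOLDER, bucket)
    json.loads(filled)  # raises json.JSONDecodeError if the edit broke the JSON
    return filled


# Illustrative template only, NOT the Qubole-provided S3 policy.
template = """{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::<bucketpath>/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<bucketpath>"]
    }
  ]
}"""

filled = fill_bucket_path(template, "qdsexample")
print(filled)
```

The printed text can then be pasted into the JSON tab in place of the hand-edited version.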

Click Review Policy (blue button).

On the Review Policy page, select the Name text field to name the policy. Name the policy: QuboleQDSforS3.

To create the policy, click Create Policy (blue button).

Configure Your Cross Account Policy

Here, retrieve information about the specified instance profile and grant permissions to pass a role to an AWS service.

Return to the Policies page of AWS Identity and Access Management. Here is the link for your convenience: https://console.aws.amazon.com/iam/home#/policies.

Click Create Policy (blue button). On the Create Policy page, you see tabs for Visual editor and JSON.

Click the JSON tab.

As in the previous steps, you see four lines of configuration code that need to be replaced by functional configuration code that can be found on the Qubole Documentation site that was previously opened.

Here it is again if you need to open it in a new browser tab or window (unless of course you still have the tab or window open from the previous steps):

http://docs.qubole.com/en/latest/faqs/general-questions/policy-use-qubole-use-my-iam-credentials.html

Scroll down until you see the section titled:

Sample Policy for IAM Roles

Here is a sample policy for a cross-account IAM role. See Creating a Cross-account IAM Role for QDS for more information.

As before, there is a block of configuration code that looks like the template in the AWS Console, but has more settings. Copy that and paste it into the JSON tab in the AWS Console, replacing what is already in there.

After you have copied and pasted the configuration code, you also need to make two edits, since the configuration information in the document has generic references. The two edits replace the two Resource values with your own, as explained in the steps that follow.

In another browser tab or window, go to the Roles section of AWS Identity and Access Management by clicking here: https://console.aws.amazon.com/iam/home#/roles.

Click your Role name (that is, QuboleQDS_BusinessEdition or the unique name you created in Create an IAM Service Role) at the bottom of the page. This takes you to a Summary description of your role. There are two Summary items that you need to copy.

  • Instance Profile ARNs
  • Role ARN

Copy the Instance Profile ARN from the Summary and replace "Resource": "arn:aws:iam:: arn_number :instanceprofile/quboleeducationrole".

Copy the Role ARN from the Summary and replace "Resource": "arn:aws:iam:: arn_number :role/quboleeducationrole"

When you are done, it should look similar to this:

../../_images/CrossIAMRolePolicy.png

Note

Note that there must be no spaces in either of the ARNs in the Resource parameter in the example above.
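Stray spaces are easy to introduce when copying ARNs from the console, so a small format check can be useful before saving the policy. The sketch below assumes the standard AWS format for IAM ARNs (a 12-digit account ID followed by role/ or instance-profile/ and the name); parse_iam_arn is a hypothetical helper and the account ID shown is a placeholder. Always paste the values exactly as they appear in the role Summary.

```python
import re

# Standard IAM ARN shape: arn:aws:iam::<12-digit account>:<kind>/<name or path>
_IAM_ARN_RE = re.compile(
    r"arn:aws:iam::([0-9]{12}):(role|instance-profile)/([A-Za-z0-9+=,.@_/-]+)"
)


def parse_iam_arn(arn: str) -> dict:
    """Split a role or instance profile ARN into its parts, or raise ValueError."""
    match = _IAM_ARN_RE.fullmatch(arn)
    if match is None:
        raise ValueError(f"not a well-formed IAM ARN (check for spaces): {arn!r}")
    account_id, kind, name = match.groups()
    return {"account_id": account_id, "kind": kind, "name": name}


# Placeholder account ID; the role name is the example used in this guide.
role_arn = "arn:aws:iam::123456789012:role/QuboleQDS_BusinessEdition"
profile_arn = "arn:aws:iam::123456789012:instance-profile/QuboleQDS_BusinessEdition"

role = parse_iam_arn(role_arn)
profile = parse_iam_arn(profile_arn)
print(role)
print(profile)
print("same account:", role["account_id"] == profile["account_id"])
```

An ARN containing spaces, such as one copied with padding around the account ID, fails the check with a ValueError.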

Click Review Policy (blue button).

On the Review Policy page, select the Name text field to name the policy.

Name the policy: QuboleQDSCrossAccount.

To create the policy, click Create Policy (blue button).

Gather Account Settings From Your Qubole Account

At this point, you need to gather some information from your Qubole account to finish the AWS set up.

Log into your Qubole account. If you do not have an account yet, you can sign up for one; Signing up on QDS describes the steps. If you already have an account, log into QDS and go to Recording the Trusted Credentials from Account Settings.

Signing up on QDS

Perform the following steps to sign up as a new QDS user:

  1. Go to https://www.qubole.com/products/pricing/.

  2. Click SIGN UP ON AWS.

    The following screen is displayed.

    ../../_images/BusinessEditionSignUp.png

    Note

    For a first-time user, you can sign up for QDS Business Edition. The details of the Business Edition are displayed in the Signup page as illustrated above.

    Provide the required information and click SIGN UP WITH EMAIL. Proceed to Step 3.

    You can also sign up with SAML or Google. If you click Sign up with Google or Sign up with SAML, it immediately logs you into the QDS Homepage.

  3. After you click SIGN UP WITH EMAIL, a text field for email appears. Enter your Email ID. Enter the Answer to the question.

  4. Qubole sends an email containing an activation code to the email ID that you provided while signing up. You can confirm the account either by clicking the link in the email or by copying the activation code and pasting it into the signup window. Set a password in the user activation page as shown in this figure.

    ../../_images/UserActivation1.png

  5. After submitting the activation code, you are logged into the QDS Homepage for the first time as shown in this figure.

    ../../_images/landing_page0.png

    Note

    Qubole now supports only HTTPS. All HTTP requests are now redirected to HTTPS. This is aimed at better security for Qubole users.

    Note

    With the launch of Business Edition, there is no free trial period and no AWS IAM Credentials. Hence, you must set AWS IAM Keys in Storage Settings and Compute Settings in Control Panel > Account Settings.

    The Qubole Homepage provides the following information:

    • The top half of the homepage contains four banners that contain information about the QDS UI and its salient features. This area is dynamic, and Qubole plans to enhance it in the near future.
    • Number of commands running in the current account.
    • Number of clusters running in the current account.
    • Number of active schedules in the current account.
    • Recent commands that were run.
    • Recent notebooks that were used.
    • On the right side of the page, Resources, Video Tutorials, and Running Clusters are displayed.

    Click see all to see the complete information of the homepage.

    Note

    Clicking the Qubole logo on the QDS UI displays the Qubole homepage.

    After you log into the Qubole account for the first time and run at least four commands, the first banner goes away. From then on, the landing page contains only three banners in the top half.

Recording the Trusted Credentials from Account Settings

Go to the Account Settings menu that is located under the Control Panel.

In the Account Settings area, look for the section labeled Access Mode (Keys/IAM Roles). Select IAM Role if it is not already selected.

../../_images/AccessSettings-IAMRoles.png

The Trusted Principal AWS Account ID and External ID (though blurred here for privacy) are set by QDS automatically. Record these for use in the step Update the Role with Policies and Trust Relationship.

Once recorded, you can return to your AWS account.

Update the Role with Policies and Trust Relationship

Here, use the AWS Security Token Service (AWS STS) to create and provide trusted users with security credentials that can control access to your AWS resources.

Go to the Roles section of AWS Identity and Access Management, if you do not already have this open from the previous step:

https://console.aws.amazon.com/iam/home#/roles

You will see the IAM Service Role that you created in a previous step. In this example the role name is QuboleQDS_BusinessEdition. Select the role name (not the check box beside it).

Under the Permissions tab, there should be an information dialog with the following text:

Get started with Permissions

This role does not have any permissions yet. Get started by attaching one or more policies to this role.

Click Attach Policy (blue button).

To add the policies you have created to this role, use the Filter field. In the text field, type in Qubole and you should see these three policies.

  • QuboleQDSCrossAccount
  • QuboleQDSforEC2
  • QuboleQDSforS3

Next to each of the policies, there is a check box. Select the check boxes next to all three policies that you created, and then click Attach Policy (blue button).

Now click the Trust relationships tab (next to the Permissions tab).

Click Edit trust relationship (blue button).

Like the previous steps, you will see a standard configuration (on the Edit Trust Relationship page, under the title Policy Document) that needs to be replaced by functional configuration code that can be found on the Qubole Documentation site that was previously opened.

Here it is again if you need to open it in a new browser tab or window:

http://docs.qubole.com/en/latest/faqs/general-questions/policy-use-qubole-use-my-iam-credentials.html

Look for the configuration code underneath the heading:

Here is a sample policy to update trust relationships of a cross-account IAM role.

Copy the block of configuration code from the Qubole Documentation site reference above and paste it into the Policy Document replacing what is already in there.

After you have copied and pasted the configuration code, you also need to make two edits, since the configuration information in the document has generic references. The two edits are the quboleawsaccountid and the externalid.

The first edit is to replace the generic quboleawsaccountid with the Trusted Principal AWS Account ID that you recorded in Recording the Trusted Credentials from Account Settings (under Gather Account Settings From Your Qubole Account).

The second edit is to replace the externalid with the External ID that you recorded in the same step.

When you are done, it should look similar to this:

../../_images/TrustRelationship.png

Note

Note that there must be no spaces in the ARN Resource parameter in the example above.
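To illustrate the shape of the edited trust relationship, the sketch below builds a cross-account trust policy from the two recorded values using the standard AWS pattern: a trusted account principal plus an sts:ExternalId condition. It is an assumption that the Qubole sample follows this pattern exactly, so treat the Qubole documentation as the authoritative text. The function name and both example values are hypothetical placeholders.

```python
import json
import re


def cross_account_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build a trust policy allowing one AWS account to assume the role,
    but only when it presents the expected external ID."""
    if re.fullmatch(r"[0-9]{12}", trusted_account_id) is None:
        raise ValueError("AWS account IDs are exactly 12 digits, with no spaces")
    if not external_id or external_id != external_id.strip():
        raise ValueError("external ID must be non-empty with no surrounding spaces")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values; use the ones shown in your QDS Account Settings.
policy = cross_account_trust_policy("123456789012", "EXAMPLEEXTERNALID")
print(json.dumps(policy, indent=2))
```

The external ID condition is what prevents another party from asking the trusted account to assume your role on their behalf, which is why both values must be copied exactly.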

Click Update trust policy (blue button).

Finalize Your Account Set Up in Qubole

At this point, you need to finalize your account settings so you are ready to get started running queries in Qubole.

In your AWS account, go to the Roles section of AWS Identity and Access Management:

https://console.aws.amazon.com/iam/home#/roles

At the bottom of the page, you will find the role that you created. In this example, it was QuboleQDS_BusinessEdition.

Click QuboleQDS_BusinessEdition.

Copy the Role ARN from the top of the web page as shown in the screenshot below.

../../_images/BusinessEditionSuccess.png

In a new tab, return to your Qubole account. Once logged in, you should see the QDS Home screen.

../../_images/QDSHomeTab.png

Go to the Account Settings menu that is located under the Control Panel.

In the Account Settings menu, look for the section labeled Access Mode (Keys/IAM Roles). At this point, IAM Role should already be selected (you did this in the step Gather Account Settings From Your Qubole Account).

Paste the Role ARN into the corresponding text field and press the Save button under the section labeled Access Mode (Keys/IAM Roles).

You should get a confirmation dialog that says:
The Account Details have been updated successfully.

Press OK.

That’s it! Your account is set up and you are ready to start running your first query.

Running a Hive Query and Extracting Sample Rows and Analyzing Data

After setting up AWS access using IAM Keys or an IAM Role, perform the following steps:

  1. Navigate to the Analyze page and click the Compose button. A command editor is displayed on the right side with a list of commands. By default, Hive Query is selected.

    ../../_images/command-editor.jpg

  2. To run a Hive query, ensure that Hive Query is selected from the drop-down list. In the Compose editor window, type a simple query. For example:

    show tables;
    
  3. Click Run to execute the query.

  4. The Results are shown in the Results tab.

    Note

    If the query is successful, the Log tab shows the status of the query as OK and displays the time taken to run the query. Also, next to the Query, a green dot indicates that the query Succeeded. You can also click the History tab to see the query status.

  5. To execute another query, click Compose. This clears the command window.

  6. Now type and execute any other query in the Compose editor window. For example:

    select * from default_qubole_memetracker limit 10;
    
  7. Click Run to execute the query.

    Note

    The query takes a little time if a large amount of data has to be fetched.

  8. To analyze the data, for example to find the total number of rows in a table corresponding to August 2008, submit the following query:

    select count(*) from default_qubole_memetracker where month="2008-08";
    

    Note

    This query is more complex than the previous queries and requires additional resources. In the background, Qubole Data Service provisions a Hadoop cluster.

    This can take a couple of minutes. When the query is being processed, the status of the query is shown with a spinning circle that indicates that it is in progress. Once it is processed successfully, the query result is displayed in the Results tab.

Congratulations! You have just executed your first Hive query on the Qubole Data Service. If you need help getting on board, create a ticket with Qubole Support, and the support team will get back to you.