Using QDS Package Management

Qubole provides an Environments UI for managing Python and R packages in Spark applications. In addition, Qubole automatically attaches an environment to an Airflow cluster that is configured with Python version 3.5.

QDS package manager provides:

  • R and Python version selection through the UI when creating an environment.
  • An environment preloaded with the default Anaconda packages. You can install additional Python and R packages through the Environments UI.
  • Distributed installation of packages on a running Spark application or Airflow workflow.

In the Control Panel, the Environments tab acts as the package manager.

Note

This feature is not available by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Use the Environments tab for the tasks described in the sections below.

The Package Management Environment API provides REST APIs to create, edit, clone, and view an environment, and to attach a cluster to an environment.
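
For example, here is a minimal sketch of creating an environment through the REST API using Python's requests library. The endpoint path and payload field names below are assumptions for illustration; confirm them against the Package Management Environment API reference before use.

    # Hypothetical sketch: create a package management environment through the
    # QDS REST API. The endpoint path and payload field names are assumptions;
    # check the Package Management Environment API reference for the actual call.
    import requests

    API_TOKEN = "<your QDS API token>"
    BASE_URL = "https://api.qubole.com/api/v1.2"  # adjust for your QDS endpoint

    payload = {
        "name": "analytics-env",                            # environment name
        "description": "Python 3.5 environment for Spark",  # description
        "python_version": "3.5",                            # assumed field name
        "r_version": "3.3",                                 # assumed field name
    }

    response = requests.post(
        BASE_URL + "/package/environments",                 # assumed endpoint path
        json=payload,
        headers={"X-AUTH-TOKEN": API_TOKEN, "Accept": "application/json"},
    )
    response.raise_for_status()
    print(response.json())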

Creating an Environment

Navigate to the Control Panel and in the Environments tab, click New to create a new environment. The dialog appears as shown here.

../../_images/Environment.png

To create an environment, perform these steps:

  1. Add a name for the environment.
  2. Add a description of the environment.
  3. Select the Python version. Python 2.7 is the default; Python 3.5 is the other supported version.
  4. Select the R version. Currently, only R version 3.3 is supported on Qubole.
  5. Click Create after filling in the above fields. Click Cancel if you do not want to create a new environment.

Once you click Create, a new environment is created and displayed as shown in this example.

../../_images/NewEnv.png

A newly created environment contains the Anaconda distribution of Python and R packages by default, along with a set of pre-installed Python and R packages. Click See list of pre-installed packages to view them. Viewing the List of Pre-installed Python and R Packages also provides the list of pre-installed Python and R packages.

When you create a new Spark cluster, a package environment is automatically created and attached to the cluster. This behavior is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account. The same behavior applies to an Airflow 1.8.2 cluster with Python 3.5.

You can also edit/clone an environment as described in Editing an Environment and Cloning an Environment.

Attaching a Cluster to an Environment

Click Edit next to Cluster Attached to attach the environment to a cluster. After you click Edit, a drop-down list of the available Spark clusters and Airflow clusters (if any) in the account appears, as shown here.

../../_images/AttachClustertoEnv.png

You can attach a cluster only when it is down and not active. You can attach only one cluster to an environment.

Select the cluster that you want to attach to that specific environment and click Attach Cluster.

In addition to Spark clusters, you can attach an environment with Python 3.5 to an Airflow cluster that uses Airflow version 1.8.2, but only if the cluster's earlier environment has been detached from it. For more information, see Configuring an Airflow Cluster.

To detach a cluster from an environment, click the delete icon next to the cluster. You can detach a cluster only when it is down and not active/running. If you detach the environment from an Airflow 1.8.2 cluster with Python 3.5, you must attach another Python 3.5 environment to get the Airflow cluster running; otherwise, the cluster does not start.
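
Attaching a cluster can also be scripted against the REST API. The sketch below is hypothetical; both the endpoint and the cluster_id field name are assumptions, so verify them in the Package Management Environment API reference.

    # Hypothetical sketch: attach a cluster (which must be down) to an
    # environment through the QDS REST API. The endpoint and field names are
    # assumptions; verify them in the Package Management Environment API reference.
    import requests

    API_TOKEN = "<your QDS API token>"
    BASE_URL = "https://api.qubole.com/api/v1.2"
    ENV_ID = 123        # placeholder environment ID
    CLUSTER_ID = 456    # placeholder cluster ID; the cluster must be down

    response = requests.put(
        BASE_URL + "/package/environments/" + str(ENV_ID),  # assumed endpoint
        json={"cluster_id": CLUSTER_ID},                     # assumed field name
        headers={"X-AUTH-TOKEN": API_TOKEN, "Accept": "application/json"},
    )
    response.raise_for_status()
    print(response.json())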

Adding a Python or R Package

A newly created environment contains the Anaconda distribution of R and Python packages by default. An environment also supports the conda-forge channel, which provides additional packages. Questions about Package Management provides answers to questions related to adding packages.

To add Python or R packages, click Add next to Packages in a specific environment. The Add Packages dialog appears as shown here.

../../_images/AddPackages.png

Perform these steps:

  1. By default, the Source shows Py Packages. To install an R package, choose R Packages as the source from the list.

  2. Adding a package supports two input modes: Simple and Advanced. Simple is the default mode; add the name of the package in the Name field.

    As you type the name, an autocomplete list appears from which you can select the package; specifying a version is optional, as shown here.

    ../../_images/SimpleModePackage.png ../../_images/SimpleModePackage2.png

    If you just mention the package name, then the latest version of the package is installed.

    If you choose the Advanced mode, it also shows suggestions; as you start typing the package name, an autocomplete list appears, as shown here.

    ../../_images/AutoCompletePackage.png

    In the Advanced mode, you can add multiple package names as a comma-separated list. You can also specify a particular version of a package, for example, numpy==1.1. To downgrade a package, specify the version number to which you want to downgrade. If you specify only the package name, the latest version of the package is installed.

    After adding the name, click Add. The package is added, with its status first shown as Installing, as shown here.

    ../../_images/Package-Initial.png

    After a while, the status changes to Installed, as shown here. (A quick way to verify the installation from a notebook is sketched after these steps.)

    ../../_images/Package-Success.png
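
Once a package shows Installed, you can confirm from a notebook on the attached cluster that the interpreter resolves it. A minimal sketch, using numpy purely as an example package name:

    # Run in a %pyspark notebook paragraph on a cluster attached to the
    # environment. Confirms that a package installed through the Environments
    # UI is importable; replace "numpy" with the package you installed.
    import importlib

    pkg = importlib.import_module("numpy")
    print("Imported", pkg.__name__, "from", pkg.__file__)
    print("Version:", getattr(pkg, "__version__", "unknown"))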

Removing a Python or R Package

To remove a Python or R package from an environment, click the delete icon next to that package. Here is an example that shows the icon next to an installed package.

../../_images/Package-Success.png

Editing an Environment

You can edit an existing environment. In the left navigation bar, hover over a specific environment to see a gear (settings) icon. Click the icon to see these options.

../../_images/EnvironSettings.png

Click Edit and you can see the dialog as shown here.

../../_images/EditEnv.png

You can edit the name and description of an environment. After changing the name and/or description, click Edit. You can click Cancel if you do not want to edit the environment.

Cloning an Environment

When you want to use the same environment on a different cluster, clone it and attach the clone to that cluster (an environment can be attached to only one cluster). In the left navigation bar, hover over a specific environment to see a gear (settings) icon. Click the icon to see these options.

../../_images/EnvironSettings.png

Click Clone and you can see the dialog as shown here.

../../_images/CloneEnv.png

By default, the Name field is populated with the original name plus a -clone suffix, that is, <environment name>-clone. You can retain that name or change it. You can also change the description, but you cannot change the application versions. After making your changes, click Clone. You can click Cancel if you do not want to clone the environment.

Managing Permissions of an Environment

You can set permissions for an environment. By default, all users in a Qubole account have read access to the environment, but you can change this access. Permissions set here override the environment access granted at the account level in the Control Panel. If you are part of the system-admin group or any group that has full access to the Environments and Packages resource, you can manage permissions. For more information, see Managing Roles.

Set Object Policy for a Package Management Environment describes how to set the permissions through the REST API.

A system-admin and the owner can manage the permissions of an environment by default. Perform the following steps to manage an environment’s permissions:

  1. Click the gear icon next to the environment and click Manage Permissions from the list of options (displayed here).

    ../../_images/EnvironSettings.png
  2. The dialog to manage permissions for a specific environment is displayed as shown in the following figure.

    ../../_images/ManagePerm-PM.png
  3. You can set the following environment-level permissions for a user or a group:

    • Read: Set it if you want to change a user/group’s read access to this specific environment.
    • Update: Set it if you want a user/group to have write privileges on this specific environment.
    • Delete: Set it if you want a user/group to be able to delete this specific environment.
    • Manage: Set it if you want a user/group to be able to grant and manage access for other users/groups to this specific environment.
  4. You can add any number of permissions to the environment by clicking Add Permission.

  5. You can click the delete icon against a permission to delete it.

  6. Click Save to apply the permissions to the user/group. Click Cancel to go back to the previous tab.

Deleting an Environment

You can delete an environment. In the left navigation bar, hover over a specific environment to see a gear (settings) icon. Click the icon to see these options.

../../_images/EnvironSettings.png

Click Delete to remove the environment.

Migrating Existing Interpreters to use the Package Management

Even after a Spark cluster is attached to an environment, existing Spark interpreters in the notebook keep using the system/virtualenv Python and system R. To use the environment, change the following interpreter properties in the existing interpreter so that it uses the Anaconda-specific Python and R:

  • Set zeppelin.R.cmd to cluster_env_default_r.
  • Set zeppelin.pyspark.python to cluster_env_default_py.

The interpreter automatically restarts after its properties change.
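
To confirm that an existing interpreter has picked up the environment after the property change, you can check which Python the driver and the executors are running. A minimal sketch for a %pyspark paragraph (sc is the SparkContext that the notebook provides):

    # Run in a %pyspark notebook paragraph after updating the interpreter
    # properties. Both the driver and executor paths should point to the
    # environment's Anaconda Python rather than the system/virtualenv Python.
    import sys

    print("Driver Python:", sys.executable)

    executor_pythons = (
        sc.parallelize(range(2), 2)
          .map(lambda _: __import__("sys").executable)
          .distinct()
          .collect()
    )
    print("Executor Python(s):", executor_pythons)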

However, on a new Spark cluster (not a cloned cluster) that is attached to an environment, the default Spark interpreter is already set to the Anaconda-specific Python and R, that is, cluster_env_default_py and cluster_env_default_r. Similarly, a new interpreter on an existing cluster uses the Anaconda-specific Python and R.

Note

After a cluster is detached from an environment, the Spark interpreter (existing or new) falls back to system/virtualenv Python and system R.

Limitations of the Package Management Feature

The package management feature has the following limitation:

  • After a Python package version upgrade or downgrade, the changed version is reflected only after the Spark interpreter is restarted.
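
For example, a quick way to check whether the interpreter has picked up the new version (numpy is used here purely as an illustration) is to run the paragraph below; if it still reports the old version, restart the Spark interpreter and run it again.

    # Run in a %pyspark notebook paragraph. If the printed version is still the
    # old one after an upgrade/downgrade through the Environments UI, restart
    # the Spark interpreter and re-run this paragraph. (numpy is only an example.)
    import numpy

    print("numpy version seen by the interpreter:", numpy.__version__)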