RStudio for Running Distributed R Jobs (AWS)

Qubole supports RStudio Pro to run distributed R jobs using sparklyr. RStudio is an integrated development environment for R, a programming language for statistical computing and graphics. RStudio is supported on clusters running Spark 2.2 and later versions.

Note

RStudio is a Beta feature and is not enabled for all users by default. To enable it on a QDS account, create a ticket with Qubole Support.

When RStudio is enabled, QDS creates a working home directory (workspace) for each user and also stores it on S3. You can use sparklyr to run queries on Hive tables. All packages installed through package management are available in RStudio, and you can add more packages for RStudio from the Environments page. For more information, see Package Management.
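As a rough illustration of querying a Hive table through sparklyr, the following sketch builds a Spark SQL statement and runs it with DBI::dbGetQuery(), which accepts a sparklyr connection. The table default.web_logs, the column status, and the helper names build_count_query and count_by_column are hypothetical, and master = "yarn" is an assumption; use the connection settings that your cluster requires.

```r
# Sketch only: the table name, column name, and helper names are hypothetical.

# Build a Spark SQL query that counts rows per distinct value of one column.
build_count_query <- function(table_name, column, limit = 10L) {
  sprintf(
    "SELECT %s, COUNT(*) AS n FROM %s GROUP BY %s ORDER BY n DESC LIMIT %d",
    column, table_name, column, as.integer(limit)
  )
}

# Run the query on the cluster and return the result as an R data frame.
# `sc` is a connection returned by sparklyr::spark_connect().
count_by_column <- function(sc, table_name, column, limit = 10L) {
  DBI::dbGetQuery(sc, build_count_query(table_name, column, limit))
}

# Usage (requires a running Spark cluster):
# sc <- sparklyr::spark_connect(master = "yarn")  # assumption: adjust for your cluster
# count_by_column(sc, "default.web_logs", "status")
# sparklyr::spark_disconnect(sc)
```

Only the aggregated result is returned to the R session, so the Hive table itself is never copied into the workspace.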

Launching RStudio

  1. Navigate to the Clusters page.

  2. Select a Spark cluster that is running on Spark 2.2 or any later version.

  3. Navigate to Resources, and select RStudio.

    The RStudio interface opens in a separate tab as shown in the following figure.

    [Figure: RStudio home page]

For detailed information about using RStudio Pro, see RStudio Reference Documentation.

Limitations

  • RStudio is supported only with the new package management. Packages installed with install.packages() are lost when the cluster restarts, so use package management to retain them.
  • The user workspace is intended for storing code and relevant metadata. Storing large files or datasets in it might degrade performance. Store such files separately on S3.
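One way to follow the workspace limitation above is to read large datasets directly from S3 into Spark rather than copying them into the workspace. The sketch below assumes a CSV file at a hypothetical S3 location and an s3:// style URI; the helper names is_s3_uri and count_s3_csv_rows are illustrative and not part of QDS or sparklyr.

```r
# Sketch only: the bucket, path, and helper names are hypothetical.

# Return TRUE when a path looks like an S3 URI (s3://, s3a://, or s3n://).
is_s3_uri <- function(path) {
  grepl("^s3[an]?://", path)
}

# Register a CSV file stored on S3 as a Spark table and count its rows.
# memory = FALSE avoids caching the whole dataset in cluster memory.
count_s3_csv_rows <- function(sc, s3_path, name = "events") {
  stopifnot(is_s3_uri(s3_path))
  events <- sparklyr::spark_read_csv(
    sc, name = name, path = s3_path, header = TRUE, memory = FALSE
  )
  sparklyr::sdf_nrow(events)
}

# Usage (requires a running Spark cluster):
# sc <- sparklyr::spark_connect(master = "yarn")  # assumption: adjust for your cluster
# count_s3_csv_rows(sc, "s3://example-bucket/data/events.csv")
```

The data stays on S3 and in Spark; only the row count comes back to the R session.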

The following demo video shows how to use RStudio with a Spark cluster.