Understanding the Qubole Folders in the Default Location on S3 (AWS)

Each Qubole account comes with a default location configured on S3, where cluster instance details, logs, engine information, and other data are maintained. You can change the default location in Control Panel > Account Settings, as described in Storage Settings with IAM Roles. (The same steps apply to storage settings with IAM Keys.)

Qubole writes data into the corresponding folders in the default location on S3. QDS has write access to the following folders:

  • <number>: It is the folder for a cluster instance and contains the details of that cluster instance.

  • airflow: It contains the Airflow cluster logs and the task and process logs.

  • logs: It contains the logs, sorted into sub-folders named after the engines and components, such as hive, hadoop, and presto.

    Note

    The log for a particular Hive query is available at <Default location>/cluster_inst_id/<cmd_id>.log.gz.

  • packages: It contains the environment and package details of the Package Management feature, stored as a compressed file.

  • qubole_folder_<number>: It is maintained only in older accounts, not in new accounts. It was the folder that held the notebook details.

  • qubole_pig_scripts: It contains the details of Pig scripts.

  • scripts: It temporarily stores the bootstrap scripts and command scripts, which the engines pick up later to process commands. This folder is mainly used for Hadoop and Shell commands.

  • tmp: It stores the logs and results of each command’s execution. The logs and results are maintained in files named according to the <date/Account ID/Commands> naming convention.

  • warehouse: It contains the Hive metastore data.

  • asterix: It contains the asterix application logs.

  • zeppelin: It contains the notebook and dashboard data.

  • jupyter: It contains the Jupyter notebooks and settings data.

  • packages_v2: It contains the files generated by v2 of Package Management.

  • rstudio-workspaces: It contains the RStudio workspace for all RStudio users of the account.
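
As an illustration of the folder layout described above, the following is a minimal Python sketch that maps a top-level folder name to its purpose and builds the Hive query log path from the note. The function names (describe_folder, hive_query_log_path) and the example bucket are hypothetical and not part of QDS. The sketch also assumes that <number> consists only of digits, which the list above does not state explicitly.

```python
import re

# Folder names and summaries taken from the list above.
KNOWN_FOLDERS = {
    "airflow": "Airflow cluster, task, and process logs",
    "logs": "logs in sub-folders per engine or component",
    "packages": "Package Management environment and package details",
    "qubole_pig_scripts": "Pig script details",
    "scripts": "temporarily stored bootstrap and command scripts",
    "tmp": "logs and results of each command's execution",
    "warehouse": "Hive metastore data",
    "asterix": "asterix application logs",
    "zeppelin": "notebook and dashboard data",
    "jupyter": "Jupyter notebooks and settings data",
    "packages_v2": "files generated by v2 of Package Management",
    "rstudio-workspaces": "RStudio workspaces of the account's users",
}


def describe_folder(name):
    """Return a short description for a top-level folder name."""
    name = name.strip("/")
    if name in KNOWN_FOLDERS:
        return KNOWN_FOLDERS[name]
    # Assumption: <number> is a run of digits.
    if re.fullmatch(r"\d+", name):
        return "cluster instance details"
    if re.fullmatch(r"qubole_folder_\d+", name):
        return "notebook details (older accounts only)"
    return "unknown"


def hive_query_log_path(default_location, cluster_inst_id, cmd_id):
    """Build the log path following the pattern given in the note."""
    base = default_location.rstrip("/")
    return f"{base}/{cluster_inst_id}/{cmd_id}.log.gz"


if __name__ == "__main__":
    # "s3://example-bucket/qubole" is a placeholder default location.
    print(describe_folder("zeppelin"))
    print(hive_query_log_path("s3://example-bucket/qubole", 678, 91011))
```

The sketch works on names and strings only; listing the actual folders would require an S3 client with read access to the default location.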