Understanding the Qubole Folders in the Default Location on S3 (AWS)

Each Qubole account comes with a default location (DefLoc) configured on S3, where cluster instance details, logs, engine information, and other data are maintained. You can set the default location in Control Panel > Account Settings as described in Storage Settings with IAM Roles.
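
To see which of the folders described below exist in your account, you can list the top-level prefixes under DefLoc with the AWS SDK. The following is a minimal sketch using boto3; the bucket name and prefix are hypothetical placeholders for the DefLoc configured in your Account Settings.

    # A minimal sketch, assuming boto3 is configured with credentials that can
    # read the account's default location. The bucket name and prefix below are
    # hypothetical; substitute the DefLoc shown in Account Settings.
    import boto3

    DEFLOC_BUCKET = "my-qubole-defloc"   # hypothetical bucket
    DEFLOC_PREFIX = ""                   # hypothetical; DefLoc may be a sub-prefix

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=DEFLOC_BUCKET, Prefix=DEFLOC_PREFIX, Delimiter="/")

    # Each CommonPrefix is one top-level DefLoc folder (logs/, tmp/, zeppelin/, ...).
    for cp in resp.get("CommonPrefixes", []):
        print(cp["Prefix"])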

Warning

Qubole does not recommend changing the default location after the account has been in use for some time. If the default location is changed without copying or moving the contents to the new location, you immediately lose the logs and temporary results of previously run commands, notebooks, and dashboards. Copying the data to a new location can take a long time, as DefLoc can grow quite large over time.
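
If you do decide to change the default location, copy the existing contents to the new location first. Below is a minimal sketch of a server-side copy using boto3; the bucket names are hypothetical placeholders, and for a large DefLoc a bulk tool such as aws s3 sync or S3 Batch Operations is more practical.

    # A minimal sketch, assuming boto3; bucket names and prefixes are hypothetical.
    import boto3

    SRC_BUCKET, SRC_PREFIX = "old-defloc", ""   # hypothetical old DefLoc
    DST_BUCKET, DST_PREFIX = "new-defloc", ""   # hypothetical new DefLoc

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Server-side copy: the object bytes never leave S3.
            s3.copy(
                {"Bucket": SRC_BUCKET, "Key": key},
                DST_BUCKET,
                DST_PREFIX + key[len(SRC_PREFIX):],
            )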

Qubole writes data into the corresponding folders in the default location on S3. These are the folders to which QDS has write access:

  • <number>: This is the cluster instance folder, which contains the cluster instance details. For example, the instances folder contains:
    • spark, which contains these three sub-folders:
      • conf: It contains the interpreter.json file. The Zeppelin server backs up this file, as it contains the properties of each interpreter and the interpreter binding information of each notebook/dashboard created on that cluster.
      • dashboard: It contains the note.json file for each dashboard on that cluster.
      • notebook: It contains the note.json file for each Zeppelin notebook on that cluster.
  • <account_id>/<cluster_tag>/user_data: It contains the cluster configurations. The files are created when the user data file (UDF) exceeds the maximum size allowed by AWS; AWS supports a maximum UDF size of 16 KB.
  • airflow: It contains the Airflow cluster logs, task logs, and process logs.
    • /<cluster-ID>/dag_logs: This folder contains DAG logs in Airflow. Do not delete it unless you want to delete the logs of previous DAG runs.
    • /<cluster-ID>/process_logs: This folder stores the logs of Airflow processes such as the Webserver and the Scheduler.
  • logs: It contains logs sorted into sub-folders named after the engines and components, such as hive, hadoop, and presto. If the contents of the logs folder are deleted after some period of time, the logs for commands run before that time are no longer available.
    • /logs/asterix/zeppelin: This contains logs for the asterix application, which is created for each cluster for offline edit support. One app ID corresponds to one cluster, and a new application instance is created every time the application starts. A single application can have multiple Pods, and the log location is different for each Pod. The log location for each Pod has this format: logs/asterix/zeppelin/<asterix_app_id>/<asterix_app_inst_id>/<Pod_NAME>. It contains snapshots of the Zeppelin notebooks and dashboards for a particular application instance, as well as the Zeppelin server logs.
    • /logs/asterix/jupyter: This contains logs for the asterix application created for Jupyter. A single application can have multiple Pods, and the log location is different for each Pod. The log location for each Pod has this format: logs/asterix/jupyter/<asterix_app_id>/<asterix_app_inst_id>/<Pod_NAME>.
    • /logs/hadoop: This folder contains YARN and Spark information, including:
      • /<cluster-inst-id>/<node_identifier>: This folder contains logs of various services/daemons run on a given node.
      • /<cluster-id>/<cluster-inst-id>/app-logs: This folder contains YARN container results/logs in the TFile format.
      • /<cluster-id>/<cluster-inst-id>/mr-history: This folder contains job-configuration and job-history files for MapReduce applications.
      • /<cluster-id>/<cluster-inst-id>/<node_ip>: This folder contains logs associated with various services run on the coordinator node and worker nodes.
      • /<cluster-id>/<cluster-inst-id>/sparkeventlogs: This folder contains Spark event files of Spark applications.
      • /<cluster-id>/<cluster-inst-id>/timeline-history/: This folder contains the history data for Tez applications. The Application Timeline Server (ATS) uses this folder.
    • /logs/hive: This folder is obsolete.
    • /logs/query_logs/hive/<cluster_inst_id>/<cmdId>.log.gz: For Hive queries that use Hive version 2.1.1 or later and run on the coordinator node, the corresponding per-query Hive logs are written here.
    • /logs/presto: This folder contains one directory for each Presto cluster:
      • /<cluster_id>: This folder contains one directory for each cluster instance:
        • /<cluster_start_time in YYYY-MM-DD_hh-mm-ss>: This folder helps you figure out the log location of a cluster instance by the cluster start timestamp instead of the cluster instance_id. It also helps in finding the logs for a failed command when you have the timestamp of that command: from the contents of DEFLOC/logs/presto/cluster_id/, you can easily find the cluster instance that ran the command by finding the highest timestamp lower than the command timestamp (see the sketch after this list). This folder contains one directory for the coordinator and one for each IP address used across worker nodes in the cluster instance:
          • /master: This folder contains logs from the coordinator node. It also contains the queryinfo sub-directory, which holds the content used on the QueryTracker UI page for queries.
          • /<IP address 1>: This folder contains the start times of the worker nodes with that IP address, as illustrated here:
            • /<start time for worker node 1 with <IP address 1> in YYYY-MM-DD_hh-mm-ss>
            • /<start time for worker node 2 with <IP address 1> in YYYY-MM-DD_hh-mm-ss>
          • /<IP address 2>: Similar to the folder above, this folder contains the start times of the nodes with another IP address:
            • /<start time for worker node 3 with <IP address 2> in YYYY-MM-DD_hh-mm-ss>
            • /<start time for worker node 4 with <IP address 2> in YYYY-MM-DD_hh-mm-ss>
  • long_commands: It contains commands/queries that are larger than 65 KB. If a user submits a command/query greater than 65 KB in size, it is written into the corresponding engine folder such as hadoop, hive, or spark. For example, long_commands/spark contains Spark commands, long_commands/hive contains Hive queries, and long_commands/presto contains Presto queries.
  • packages: It contains the environment and package details of the Package Management feature stored as a compressed file.
  • qubole_folder_<number>: This folder is maintained only in older accounts, not in new accounts. It was used to store the notebook details.
  • qubole_pig_scripts: It contains the details of Pig scripts.
  • scripts: Bootstrap scripts and command scripts are stored here temporarily; the engines later pick them up for processing commands and so on. This folder is mainly used for Hadoop and Shell commands.
  • tmp: It stores the logs and results of each command’s execution. The logs and results are maintained in files named with the <date/Account ID/Commands> naming convention. If you delete this data, you cannot see or download the logs and results of the commands that you have executed.
  • warehouse: It contains the Hive metastore data.
  • zeppelin: It contains the notebooks and dashboards data.
  • jupyter: It contains the Jupyter notebooks and settings data.
  • packages_v2: It contains the files generated by v2 of Package Management.
  • rstudio-workspaces: It contains the RStudio workspace for all RStudio users of the account.
  • qubole_bi: It contains the folders related to Cost Explorer.
  • resolved-macros: Spark supports macro substitution in script files by reading the content of the file. After the macros are resolved, the script file is uploaded back to this folder ($DEFLOC/resolved-macros).
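
As mentioned under /logs/presto above, you can locate the cluster instance that ran a failed command by comparing start timestamps. The following is a minimal sketch using boto3; the bucket name, cluster ID, and command timestamp are hypothetical placeholders.

    # A minimal sketch, assuming boto3. The bucket, cluster ID, and command
    # timestamp are hypothetical; instance directories are named
    # YYYY-MM-DD_hh-mm-ss, per the /logs/presto layout described above.
    from datetime import datetime
    import boto3

    DEFLOC_BUCKET = "my-qubole-defloc"                # hypothetical bucket
    PRESTO_PREFIX = "logs/presto/1234/"               # logs/presto/<cluster_id>/
    command_time = datetime(2021, 7, 15, 10, 30, 0)   # when the command ran

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=DEFLOC_BUCKET, Prefix=PRESTO_PREFIX, Delimiter="/")

    starts = []
    for cp in resp.get("CommonPrefixes", []):
        name = cp["Prefix"][len(PRESTO_PREFIX):].rstrip("/")
        try:
            starts.append((datetime.strptime(name, "%Y-%m-%d_%H-%M-%S"), cp["Prefix"]))
        except ValueError:
            pass  # skip entries that are not timestamp directories

    # The instance that ran the command is the latest one started before it.
    candidates = [(t, p) for t, p in starts if t <= command_time]
    if candidates:
        print("Instance logs are under:", max(candidates)[1])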

Default Location Folders You Should Never Delete

You must not delete these folders from the default location ($DEFLOC):

  • RStudio workspaces: $DEFLOC/rstudio-workspaces/
  • MLFlow: $DEFLOC/mlflow/
  • Jupyter: $DEFLOC/jupyter
  • Zeppelin notebooks and dashboards: $DEFLOC/zeppelin/
  • Zeppelin interpreter.json: $DEFLOC/<cluster_id>/spark/conf/
  • Old Package Management: $DEFLOC/packages/
  • New Package Management: $DEFLOC/packages_v2/
  • Cost Explorer: $DEFLOC/qubole_bi/
  • DAG logs: $DEFLOC/airflow/<Cluster_ID>/dag_logs/. It contains logs for DAG runs in Airflow. Do not delete this folder unless you want to delete the logs of previous DAG runs.
  • Hadoop logs: $DEFLOC/logs/hadoop. You should not delete data from this folder, as it contains YARN container results/logs. The exception is $DEFLOC/logs/hadoop/<cluster-inst-id>/<node_identifier>, which you can delete; it contains the logs of various services/daemons run on a given node.
  • Long commands: $DEFLOC/long_commands. Deleting the data from this S3 location breaks the user experience for long commands: the command text becomes unavailable on the UI.
  • Tmp folder: $DEFLOC/tmp. Deleting the data from this S3 location makes the logs and results of the commands that you have executed unavailable.
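
If you run periodic cleanups of the default location, make sure your tooling skips the folders listed above. The following is a minimal sketch using boto3, assuming DefLoc sits at the root of the bucket; the bucket name and retention period are hypothetical, and per-cluster paths such as $DEFLOC/<cluster_id>/spark/conf/ need separate handling.

    # A minimal sketch, assuming boto3 and a DefLoc at the bucket root. The
    # bucket name and retention period are hypothetical; adapt PROTECTED to
    # your account (for example, add per-cluster <cluster_id>/spark/conf/ paths).
    from datetime import datetime, timedelta, timezone
    import boto3

    DEFLOC_BUCKET = "my-qubole-defloc"   # hypothetical bucket
    RETENTION = timedelta(days=90)       # hypothetical retention period

    # Prefixes from the list above that must never be deleted. The whole
    # airflow/ tree is kept here to be conservative about dag_logs/.
    PROTECTED = (
        "rstudio-workspaces/", "mlflow/", "jupyter/", "zeppelin/",
        "packages/", "packages_v2/", "qubole_bi/", "airflow/",
        "logs/hadoop/", "long_commands/", "tmp/",
    )

    s3 = boto3.client("s3")
    cutoff = datetime.now(timezone.utc) - RETENTION
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=DEFLOC_BUCKET):
        for obj in page.get("Contents", []):
            if obj["Key"].startswith(PROTECTED) or obj["LastModified"] >= cutoff:
                continue  # keep protected folders and recent objects
            s3.delete_object(Bucket=DEFLOC_BUCKET, Key=obj["Key"])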