6. Should I have one large auto-scaling cluster or multiple smaller clusters?

This question comes up often among QDS users. All clusters under an account share the same Hive metastore and the storage credentials used to access data in cloud storage, so the same commands and queries can be run regardless of the cluster configuration. Some trade-offs are listed below:

  • Sharing a single large auto-scaling cluster allows efficient usage of compute resources. For example:

    • An application, App1, provisions a cluster but finishes with 30 minutes left before the cluster reaches the hour mark and is terminated.
    • Another application, App2, now needs to be run. If App2 uses the same cluster, it can use paid-for compute resources that would otherwise go to waste.

    A careful Fair Scheduler configuration on a shared cluster can provide responsive behavior even when there are multiple users.
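
The App1/App2 example above can be worked through numerically. The sketch below assumes that compute is billed in whole hours (as the "hour mark" in the example suggests) and that App2 starts right after App1 on the shared cluster. The 25-minute runtime for App2 and all function names are hypothetical, chosen only for illustration.

```python
import math

# Assumption: compute is billed in whole hours, as in the example above.
BILLING_UNIT_MIN = 60


def paid_minutes(busy_minutes: int) -> int:
    """Minutes billed for a cluster that is busy for busy_minutes, then terminated."""
    return math.ceil(busy_minutes / BILLING_UNIT_MIN) * BILLING_UNIT_MIN


def idle_paid_minutes(busy_minutes: int) -> int:
    """Paid-for minutes in which the cluster does no work."""
    return paid_minutes(busy_minutes) - busy_minutes


app1_minutes = 30  # App1 finishes with 30 minutes left in the hour
app2_minutes = 25  # hypothetical runtime for App2

# Each application on its own cluster: two separate billing hours.
separate_cost = paid_minutes(app1_minutes) + paid_minutes(app2_minutes)

# App2 reuses App1's cluster and runs in the remaining paid-for time.
shared_cost = paid_minutes(app1_minutes + app2_minutes)

print(f"separate clusters: {separate_cost} paid minutes")
print(f"shared cluster:    {shared_cost} paid minutes")
```

Under these assumptions, the shared cluster absorbs App2 into time that was already paid for, while separate clusters pay for a second hour that is mostly idle.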

  • Multiple clusters allow configurations optimized for different workloads

    This is part of the reason why QDS has different types of clusters for different engines (such as Hadoop and Spark). Memory-intensive applications benefit from high-memory instances, while compute-intensive applications benefit from a different instance type.

    Multiple clusters are also preferable when, for example, different data sets reside in different regions: each cluster can then run close to the data it processes.

  • Multiple clusters lead to higher isolation

    Although QDS uses frameworks such as the Hadoop Fair Scheduler to arbitrate access to a shared cluster, contention is still possible; separate clusters avoid it. This matters most for production jobs with stringent SLAs. Running them on a separate cluster is the safer option, but it is more expensive.

  • Efficiency gains from a shared cluster depend on the type of job

    For example, a small job that runs every 15 minutes does not gain much by sharing compute resources with larger, bursty jobs. Because such jobs are often SLA-driven as well, it is better to run them on a separate cluster.
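
The trade-offs above can be summarized as a small decision rule. This is only a sketch of the reasoning in this answer, not a QDS feature: the `Job` fields, the cluster labels, and the default region are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    region: str                   # region where the job's data resides
    has_strict_sla: bool = False  # e.g. a production job with a stringent SLA


def pick_cluster(job: Job, shared_region: str = "us-east-1") -> str:
    """Return a hypothetical cluster label for a job, following the trade-offs above."""
    # SLA-driven jobs get isolation, at a higher cost.
    if job.has_strict_sla:
        return "sla-dedicated"
    # Jobs whose data lives elsewhere run on a cluster close to that data.
    if job.region != shared_region:
        return f"regional-{job.region}"
    # Everything else shares one auto-scaling cluster to use compute efficiently.
    return "shared-autoscaling"


jobs = [
    Job("adhoc-report", "us-east-1"),
    Job("eu-etl", "eu-west-1"),
    Job("every-15-min-feed", "us-east-1", has_strict_sla=True),
]
for job in jobs:
    print(job.name, "->", pick_cluster(job))
```

The order of the checks reflects one possible priority (isolation first, then data locality, then efficiency); a different order may suit other workloads.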