Using the Spill to Disk Mechanism
Presto supports offloading intermediate operation results to disk for memory-intensive operations. This is
called the Spill to Disk mechanism. It enables the execution of queries that would otherwise fail because their memory requirements
exceed the maximum memory per node limit (defined by
query.max-memory-per-node). It is a best-effort mechanism that
increases the chances of success for queries with high memory requirements, but it does not guarantee that all
memory-intensive queries succeed. For more information, see Spill to Disk.
Qubole recommends using the Spill to Disk mechanism from Presto version 0.208 onward. For more information, see the following sections.
Enabling Spill to Disk Mechanism on a Presto Cluster
You can enable the Spill to Disk mechanism for a Presto cluster through the cluster configuration overrides as illustrated below.
config.properties:
experimental.spiller-spill-path=<path to the directory that will be used to write the spilled data>
experimental.spill-enabled=true
experimental.max-spill-per-node=250GB
experimental.query-max-spill-per-node=100GB
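The override lines above can also be generated and sanity-checked before they are pasted into the cluster configuration. The following is a minimal Python sketch, not a Qubole or Presto tool: the helper names `parse_size` and `render_spill_overrides` are hypothetical, the size units are assumed to be binary multiples, and the check that the per-query limit does not exceed the per-node limit is a common-sense assumption rather than a documented Presto rule.

```python
import re

# Multipliers for the size suffixes accepted by this sketch (binary units assumed).
_UNITS = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a string such as '250GB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def render_spill_overrides(spill_path, max_spill_per_node, query_max_spill_per_node):
    """Return config.properties lines for the spill settings shown above."""
    # Assumption: a single query should not be allowed to spill more than the node as a whole.
    if parse_size(query_max_spill_per_node) > parse_size(max_spill_per_node):
        raise ValueError("per-query spill limit exceeds the per-node spill limit")
    return "\n".join([
        f"experimental.spiller-spill-path={spill_path}",
        "experimental.spill-enabled=true",
        f"experimental.max-spill-per-node={max_spill_per_node}",
        f"experimental.query-max-spill-per-node={query_max_spill_per_node}",
    ])


# Example values taken from this page; adjust them to the disk space on your workers.
print(render_spill_overrides("/media/ephemeral0/presto/spill_dir", "250GB", "100GB"))
```

The printed lines go under config.properties in the cluster overrides.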
Enabling Spill to Disk Mechanism on a Session
To enable the Spill to Disk mechanism at the query level, set the following session property:
set session spill_enabled = true
Spill to Disk works only with local disks on worker nodes; it does not work with cloud object storage (for example, S3).
You must set the directory to which the spilled data is written, whether you enable the Spill to Disk mechanism at the cluster level or at the session level. Even when spilling is enabled at the query level, the location is set through the cluster-level configuration property shown here:
config.properties:
experimental.spiller-spill-path=<path to the directory that will be used to write the spilled data>
For more information on the cluster-level configuration, see Enabling Spill to Disk Mechanism on a Presto Cluster.
Spill Path on the Local Disk
Starting with Presto version 0.208, the spill path, that is, the location on the disk where
intermediate operation results are offloaded, has a default value. With the current configuration, the default location used for spilling is
on worker nodes at
/media/ephemeral0/presto/spill_dir. The default directory makes it easy to enable
Spill to Disk on a session, or to enable it at the cluster level for all queries by passing it as a Presto override.
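Because spilling depends on a local directory on each worker node, it can help to confirm that the directory is present and writable before enabling the feature. The following is a minimal Python sketch of such a check. The function name `check_spill_dir` is hypothetical, and the check runs as the current user, which may differ from the user that the Presto server runs as.

```python
import os
import tempfile


def check_spill_dir(path):
    """Return a list of problems found with a candidate spill directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


# Demonstration on a temporary directory so the sketch runs anywhere; on a worker
# node you would pass the spill path, for example /media/ephemeral0/presto/spill_dir.
with tempfile.TemporaryDirectory() as demo_dir:
    print(check_spill_dir(demo_dir))                           # a fresh temp dir should pass
    print(check_spill_dir(os.path.join(demo_dir, "missing")))  # reports the missing directory
```

An empty list means that no problem was detected by this sketch; it does not prove that Presto itself can write to the directory.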
Configuring the Maximum Spill Per Node
You should configure the
experimental.max-spill-per-node property (size for maximum spill per node) by considering
the free disk space on
Here is a sample command to check the disk space on
/media/ephemeral0 along with its output.
[[email protected]<ip-address> ~]# df -ah
Filesystem      Size  Used Avail Use% Mounted on
proc               0     0     0    - /proc
sysfs              0     0     0    - /sys
/dev/xvda1       59G   34G   26G  58% /
devtmpfs         61G  3.9M   61G   1% /dev
devpts             0     0     0    - /dev/pts
tmpfs            61G     0   61G   0% /dev/shm
none               0     0     0    - /proc/sys/fs/binfmt_misc
/dev/xvdaa      296G   71M  293G   1% /media/ephemeral0
/dev/xvdp       197G   68M  195G   1% /media/ebs3
/dev/xvdo       197G   64M  195G   1% /media/ebs2
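The free space reported by df can be turned into a candidate value for experimental.max-spill-per-node. The following is a minimal Python sketch, assuming that you want to leave part of the free space unused as headroom. The function names and the 20% default headroom are illustrative assumptions, not Qubole recommendations.

```python
import shutil


def suggest_max_spill_gb(free_bytes, headroom_fraction=0.2):
    """Return a whole-GB spill limit that leaves headroom_fraction of the free space unused."""
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    usable_bytes = free_bytes * (1 - headroom_fraction)
    return int(usable_bytes // 1024**3)


def suggest_for_mount(mount_point, headroom_fraction=0.2):
    """Read the free space on mount_point and format the property line."""
    free_bytes = shutil.disk_usage(mount_point).free
    limit_gb = suggest_max_spill_gb(free_bytes, headroom_fraction)
    return f"experimental.max-spill-per-node={limit_gb}GB"


# "/" is used so the sketch runs anywhere; on a worker node the mount point
# from the df output above would be /media/ephemeral0.
print(suggest_for_mount("/"))
```

With the 293G shown as available on /media/ephemeral0 and the assumed 20% headroom, `suggest_max_spill_gb(293 * 1024**3)` returns 234.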
When RubiX is enabled, the
hadoop.cache.data.fullness.percentage override defines the
maximum amount of disk space that RubiX can use. Its default value is 80%. So, with RubiX enabled, you should define the
experimental.max-spill-per-node property by considering the value of