Introduction
Presto is an open source distributed SQL query engine developed by Facebook. Presto is used for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto was designed and written completely for interactive analytics and approaches the speed of commercial data warehouses. Facebook uses Presto for interactive queries against several internal data stores including its 300PB data warehouse. Over 1,000 Facebook employees use Presto everyday to run more than 30,000 queries that in total scan over a petabyte each per day. Learn more at prestosql.io.
The execution model of Presto is fundamentally different from Hive or MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after the other. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries. For more information, see Presto’s architecture.
Note
Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.
Sample Use Case
Qubole’s Presto-as-a-Service is primarily intended for Data Analysts who need to translate business questions into SQL queries. Since the questions are often ad-hoc, there is some trial and error involved; arriving at the final results may involve a series of SQL queries. By reducing the response time of these queries, the platform can reduce the time to insight and greatly benefit the business.
The typical use case involves a few 10GB-100TB tables in the Cloud. Tables are generally partitioned by date or other attributes. Analyst queries pick a few partitions at a time, typically span a week to a month of data, and involve WHERE clauses. Queries may involve a JOIN with a smaller table, and contain aggregate functions and GROUP-BY clauses.