Understanding Hive and Hadoop Security (Azure)

Qubole is built on the open source components Hadoop, Hive, Spark and Presto, so by default it adopts the standard security model of each tool. Because those default models have gaps, Qubole has been developing its own security model to enhance the basic security. The end result is a platform that is more secure than the default open source components and than most other commodity offerings.

Qubole is a multi-platform cloud service that utilises the advanced security features of each platform. Standard security features such as Virtual Networks, security roles, secure key access, SSH and endpoint security are utilised by default. For more details about platform security, see the Azure or AWS security pages. The data stored in the QDS servers or in the unified Hive Metastore is encrypted by default.

../_images/Hive-Security-Azure.png

Hadoop Security

On Azure (as the cloud provider), Qubole supports Hadoop 2, which utilises YARN for better performance. It also reduces the amount of inter-node traffic and the writing and reading of intermediate data sets to external storage.

../_images/HadoopYarn.png

As the above figure (Figure 2) shows, there is inter-node communication among clients, resource managers and nodes. All processing within Hadoop on Qubole happens in the Hadoop cluster inside the Virtual Network: the ResourceManager (RM), NodeManagers (NM) and ApplicationMasters (AM) all run within the same VNet.

../_images/HadoopPhases.png

For job execution there may be one or more shuffle, copy, sort and reduce phases. Currently, data transferred between Hadoop nodes is not encrypted (encryption of data in transit between Hadoop nodes is a new feature in Hadoop 2.9). Qubole has developed an enhancement that adds this in-transit encryption to Hadoop 2.8, both between Hadoop nodes and between the nodes and the QDS server. This enhancement will be available shortly.
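To make the idea of wire encryption between nodes concrete, here is a minimal sketch that audits a configuration for three property names documented by upstream Apache Hadoop. This page does not say which properties the Qubole enhancement sets, so the property list, the function name `missing_wire_encryption` and the dictionary input are all assumptions made for illustration.

```python
# Property names come from upstream Apache Hadoop documentation.
# Assumption: the values below are the ones that turn wire encryption on;
# this page does not state which of them QDS clusters use.
WIRE_ENCRYPTION_SETTINGS = {
    "hadoop.rpc.protection": "privacy",       # RPC traffic (core-site.xml)
    "dfs.encrypt.data.transfer": "true",      # HDFS block transfer (hdfs-site.xml)
    "mapreduce.shuffle.ssl.enabled": "true",  # shuffle phase (mapred-site.xml)
}


def missing_wire_encryption(config):
    """Return the property names in `config` that do not enable wire encryption.

    `config` is a plain dict of property name to value (hypothetical input;
    a real cluster keeps these values in its *-site.xml files).
    """
    missing = []
    for name, expected in WIRE_ENCRYPTION_SETTINGS.items():
        actual = str(config.get(name, "")).strip().lower()
        if actual != expected:
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = {"dfs.encrypt.data.transfer": "true"}
    print(missing_wire_encryption(sample))
```

An empty result means all three settings are present with the expected values; it says nothing about whether certificates or keys are configured correctly.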

../_images/QuboleAzureSecurityModel.png

Encryption of data at rest is a feature that needs to be supported by the underlying cloud platform’s storage. In Azure, all data at rest is encrypted using Storage Service Encryption (SSE) by default (as of September 2017). For data in transit to Azure storage, Qubole already offers encrypted connections to Azure Data Lake Storage (ADLS) using TLS, and plans to add encryption to Azure Blob Storage (ABS) in the near future. For ephemeral HDFS storage (HDFS on the Hadoop nodes, used for disk spill and temporary storage), data is encrypted using block device encryption (dm-crypt with LUKS mapping).
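As a rough illustration of the difference between TLS and plaintext access to Azure storage, the sketch below classifies a storage URI by its scheme. It assumes the URI schemes used by the upstream Hadoop Azure connectors (`adl` for ADLS, `wasbs` for Blob Storage over TLS, `wasb` for Blob Storage without TLS); the function name `uses_tls` and the example account names are hypothetical, and this page does not state which schemes QDS itself uses.

```python
from urllib.parse import urlparse

# Assumption: scheme names follow the upstream Hadoop Azure connectors.
TLS_SCHEMES = {"adl", "wasbs"}    # connections made over TLS
PLAINTEXT_SCHEMES = {"wasb"}      # Blob Storage without TLS


def uses_tls(uri):
    """Return True if the storage URI scheme implies a TLS connection.

    Raises ValueError for schemes this sketch does not know about.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme in TLS_SCHEMES:
        return True
    if scheme in PLAINTEXT_SCHEMES:
        return False
    raise ValueError(f"unrecognised storage scheme: {scheme!r}")


if __name__ == "__main__":
    # Hypothetical account and container names.
    for uri in (
        "adl://exampleaccount.azuredatalakestore.net/data",
        "wasbs://container@exampleaccount.blob.core.windows.net/data",
        "wasb://container@exampleaccount.blob.core.windows.net/data",
    ):
        print(uri, uses_tls(uri))
```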

Hive Security

Here is a HiveServer2 architecture diagram.

../_images/HiveServer2Architecture.png

Hive utilises the Hadoop security model for query execution, so all the Hadoop security described above also applies to Hive queries. When utilising HiveServer2 (HS2) with Hive, users interface with HS2 either through the Qubole servers (for example, the QDS Analyze page) or directly through a Business Intelligence (BI) tool (ODBC/JDBC).

Communication through the QDS servers to HS2 is encrypted by default, and encryption from a BI tool to HS2 is also supported. This is another security feature developed by Qubole to make QDS more secure than the Hadoop default model.
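For a BI tool that connects over JDBC, encryption is typically requested in the connection string. The sketch below builds such a string using the `ssl`, `sslTrustStore` and `trustStorePassword` session variables of the open source Hive JDBC driver. The helper name `hs2_jdbc_url`, the host name and the truststore path are hypothetical, port 10000 is the stock HS2 default rather than a documented QDS value, and the actual endpoint for a QDS cluster is not given on this page.

```python
def hs2_jdbc_url(host, port=10000, database="default",
                 truststore=None, truststore_password=None):
    """Build a Hive JDBC URL that asks HiveServer2 for an SSL/TLS connection.

    Assumption: the server accepts the open source Hive JDBC driver's
    `ssl`, `sslTrustStore` and `trustStorePassword` session variables.
    """
    url = f"jdbc:hive2://{host}:{port}/{database};ssl=true"
    if truststore:
        # Path to a Java truststore holding the server certificate.
        url += f";sslTrustStore={truststore}"
    if truststore_password:
        url += f";trustStorePassword={truststore_password}"
    return url


if __name__ == "__main__":
    # Hypothetical host and truststore path.
    print(hs2_jdbc_url("hs2.example.com", truststore="/etc/ssl/hs2.jks"))
```

Keeping the truststore arguments optional lets the same helper serve clients whose JVM already trusts the server certificate.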