SQL Authorization through Apache Ranger in Spark
Spark on Qubole supports granular data access authorization of Hive tables and views using Apache Ranger. In addition to table-level authorization, Spark supports other Apache Ranger features such as column- and row-level access control and column masking. This section provides a detailed overview of the features supported in Spark and describes how to configure Apache Ranger with Spark.
For more information about Apache Ranger, see Apache Ranger Documentation.
Note
This feature is not enabled for all users by default. You can enable this feature from the Control Panel >> Account Features page.
For more information about enabling features, see Managing Account Features.
Supported Ranger features
SQL Standard Authorization
The following authorization features are supported (an illustrative sketch follows this list):
Authorization at all levels: Table, Schema, and Column.
All Ranger policy constructs for authorization: Allow-Conditions, Deny-Conditions, Resource-Inclusion, and Resource-Exclusion.
User Groups defined in Ranger are supported.
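A minimal sketch of table-level authorization behavior; the tables and user are hypothetical:
-- Assume a Ranger policy grants user A access to Employee but not to Payroll.
select * from Employee;   -- succeeds for user A
select * from Payroll;    -- fails for user A with an authorization error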
Row-level Filtering
In Ranger, administrators can set row-level policies that allow or deny users or groups access to specific rows in a table.
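A minimal sketch of row-level filtering at query time; the table, filter expression, and user are hypothetical:
-- Assume a Ranger row-filter policy on Employee sets the filter
-- expression salary < 100000 for user A.
select * from Employee;
-- For user A, the filter is applied implicitly, as if the query were:
select * from Employee where salary < 100000;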
Data Masking Policies
For datasets with sensitive information, Ranger provides options to mask certain columns in a table, such as Redact, Hash, Nullify, and Custom. Administrators can configure a policy with Mask Conditions to mask the sensitive data for a particular set of users or groups. Spark on Qubole currently supports Hash Masking.
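A minimal sketch of the effect of Hash Masking; the table, column, and user are hypothetical:
-- Assume a Ranger masking policy applies Hash masking to
-- Employee.salary for user A.
select id, name, salary from Employee;
-- For user A, each salary value is returned as a hash of the original
-- value; users not covered by the policy see the actual salaries.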
For more information about data masking, see Use cases: data masking.
Configuring Spark to use Ranger for Authorization
Before you begin configuring your Ranger plugin, note down the following information:
Ranger URL. This is the endpoint of the Ranger Admin that serves your authorization policies. Ensure that this endpoint is accessible from the Spark cluster. Example: http://localhost:6080
Credentials. The credentials provide access to the Ranger Admin. They are used to communicate with the Ranger Admin to fetch policies and user-group information. Example: user: admin, password: admin.
Service Name. This is the name of the service in the Ranger Admin that holds the policies you want to apply to your catalog. You must specify the name of the service using the Spark configuration spark.sql.qubole.ranger.service.name. You can set this configuration through Spark overrides at the cluster level (spark-defaults.conf: spark.sql.qubole.ranger.service.name hiveService1) or pass it as a job-level configuration (--conf spark.sql.qubole.ranger.service.name=hiveService1). The default value of the configuration is hive. Example: serviceName: hive.
Ranger user group cache expiry. The cache expiry time for Ranger user-group information, in seconds. By default, this value is set to 30 seconds. A higher value reduces calls to Ranger, but the user-group information can become stale between fetches. You can set this configuration through cluster overrides with the Spark configuration spark.sql.qubole.ranger.userGroupCacheExpirySecs. A combined example is shown after this list.
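For example, a cluster-level Spark override (spark-defaults.conf) that sets both configurations could look like the following; the service name hiveService1 and the 60-second expiry are illustrative values:
spark.sql.qubole.ranger.service.name hiveService1
spark.sql.qubole.ranger.userGroupCacheExpirySecs 60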
Creating Ranger Configuration Files
You should create the Ranger configuration files ranger-configs.properties, ranger-<serviceName>-security.xml, and ranger-<serviceName>-audit.xml by using the Ranger properties mentioned above.
The ranger-configs.properties file should contain the following information:
# Credentials used to communicate with the Ranger Admin
ranger.username=admin
ranger.password=admin
The ranger-<serviceName>-security.xml file should contain the following information:
<configuration>
  <!-- Name of the service defined on the Ranger Admin UI -->
  <property>
    <name>ranger.plugin.<plugin-name>.service.name</name>
    <value><service-name></value>
  </property>
  <!-- How often policies are polled from the Ranger Admin, in milliseconds -->
  <property>
    <name>ranger.plugin.<plugin-name>.policy.pollIntervalMs</name>
    <value>5000</value>
  </property>
  <!-- Endpoint of the Ranger Admin that serves the policies -->
  <property>
    <name>ranger.service.store.rest.url</name>
    <value>http://myRangerAdmin.myOrg.com:6080</value>
  </property>
  <property>
    <name>ranger.plugin.<plugin-name>.policy.rest.url</name>
    <value>http://myRangerAdmin.myOrg.com:6080</value>
  </property>
</configuration>
Note
plugin-name is the name of the plug-in, such as spark or hive. service-name is the name of the service, policy, or repository defined on the Ranger Admin UI.
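For example, with plugin-name hive and service-name hiveService1, the first property in the file above would read:
<property>
  <name>ranger.plugin.hive.service.name</name>
  <value>hiveService1</value>
</property>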
The ranger-<serviceName>-audit.xml file should contain the following information:
<configuration>
  <!-- Master switch for Ranger auditing -->
  <property>
    <name>xasecure.audit.is.enabled</name>
    <value>true</value>
  </property>
  <!-- Enable writing audit events to Solr -->
  <property>
    <name>xasecure.audit.solr.is.enabled</name>
    <value>true</value>
  </property>
  <!-- Maximum number of audit events queued before a flush -->
  <property>
    <name>xasecure.audit.solr.async.max.queue.size</name>
    <value>10</value>
  </property>
  <!-- Maximum interval between flushes, in milliseconds -->
  <property>
    <name>xasecure.audit.solr.async.max.flush.interval.ms</name>
    <value>1000</value>
  </property>
  <!-- Solr collection that stores the Ranger audit events -->
  <property>
    <name>xasecure.audit.solr.solr_url</name>
    <value>http://localhost:6083/solr/ranger_audits</value>
  </property>
</configuration>
If auditing is not required, set xasecure.audit.is.enabled and xasecure.audit.solr.is.enabled to false.
After you create the above-mentioned files, add the following node bootstrap to the cluster to copy the files to /usr/lib/spark/conf/ on all nodes of the cluster.
# Locations of the Ranger configuration files created above
RANGER_PROPERTIES_LOC=s3://<location>/ranger-configs.properties
RANGER_SECURITY_LOC=s3://<location>/ranger-<serviceName>-security.xml
RANGER_AUDIT_LOC=s3://<location>/ranger-<serviceName>-audit.xml
# Fetch the Qubole-provided bootstrap script and run it in the current
# shell so that it can read the variables set above
/usr/lib/hadoop2/bin/hadoop dfs -get s3://paid-qubole/spark-ranger/rangerBootstrap.sh
source rangerBootstrap.sh
Known Limitations
HTTPS mode of Ranger Admin is not yet supported.
Spark on Qubole currently supports only Hash Masking; the other masking types (Redact, Nullify, and Custom) are not supported.
Any SQL expression added as a row filter must be supported by Spark. Administrators should verify filters independently before adding them to Ranger (see the example after this list).
In the case of Column-Level Access Policies, a Select * statement in Spark returns all the columns that are accessible to the current user, whereas the same Select * statement leads to an access control exception in Hive or Presto when running with Ranger. Example: Consider a table with the definition Employee(id, name, salary), where user A is given access only to the columns (id, name). Running select * from Employee in Spark returns all rows with only the columns (id, name). The same query throws an Authorization Exception in Hive/Presto.
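As noted in the row-filter limitation above, one simple way to verify a filter expression is to run it as a WHERE clause in Spark before adding it to the Ranger policy; the table and expression here are hypothetical:
-- If this query parses and runs in Spark, the filter expression
-- salary < 100000 is a valid candidate for a Ranger row filter.
select * from Employee where salary < 100000;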