SQL Authorization through Apache Ranger in Spark

Spark on Qubole supports granular data access authorization of Hive tables and views using Apache Ranger. In addition to table-level authorization, Spark supports other Apache Ranger features, such as column-level and row-level access control and column masking. This section describes the Ranger features supported in Spark and explains how to configure Apache Ranger with Spark.

For more information about Apache Ranger, see Apache Ranger Documentation.

Note

This feature is not enabled for all users by default. You can enable this feature from the Control Panel >> Account Features page.

For more information about enabling features, see Managing Account Features.

Supported Ranger features

SQL Standard Authorization

The following authorization features are supported:
- Authorization at all levels: table, schema, and column.
- All Ranger policy constructs for authorization: Allow-Conditions, Deny-Conditions, Resource-Inclusion, and Resource-Exclusion.
- User groups defined in Ranger.

Row-level Filtering

In Ranger, administrators can set row-level policies that allow or deny specific users or groups access to certain rows in a table.
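As a rough illustration of the effect of a row-level filter, the sketch below uses Python's built-in sqlite3 module (not Spark or Ranger) to replace a table reference with a filtered subquery. The table sales, the column region, the filter expression, and the helper apply_row_filter are all hypothetical; this is a conceptual model, not the rewrite that Spark on Qubole performs internally.

```python
import sqlite3


def apply_row_filter(table: str, row_filter: str) -> str:
    """Return a filtered subquery that stands in for the table.

    Conceptual model only (hypothetical helper): the user's query
    reads from this subquery, so rows that fail the filter are
    never visible to it.
    """
    return f"(SELECT * FROM {table} WHERE {row_filter}) AS {table}"


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 30.0)],
)

# Hypothetical policy: this user may see only rows where region = 'EU'.
filtered = apply_row_filter("sales", "region = 'EU'")
visible_rows = conn.execute(
    f"SELECT id, amount FROM {filtered} ORDER BY id"
).fetchall()
print(visible_rows)
```

Running the filter expression on its own as a WHERE clause, as done here, is also a simple way to check that the expression is valid SQL before adding it to a policy.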

Data Masking Policies

For datasets with sensitive information, Ranger provides options to mask certain columns in a table, such as Redact, Hash, Nullify, and Custom. Administrators can configure a policy with Mask Conditions to mask the sensitive data for a particular set of users or groups. Of these options, Spark on Qubole currently supports only Hash masking.

For more information about data masking, see Use cases: data masking.
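To make the idea of Hash masking concrete, the following sketch replaces a sensitive column value with a hex digest. It is an illustration only: the function hash_mask, the sample rows, and the choice of SHA-256 are assumptions, and the exact hash function applied by Spark on Qubole is not specified in this section.

```python
import hashlib
from typing import Optional


def hash_mask(value: Optional[str]) -> Optional[str]:
    """Replace a value with a hex digest (SHA-256 is an assumption)."""
    if value is None:
        # Keep NULLs as NULLs instead of hashing them.
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical rows of (name, ssn); only the ssn column is masked.
rows = [("alice", "123-45-6789"), ("bob", "987-65-4321")]
masked = [(name, hash_mask(ssn)) for name, ssn in rows]

for name, ssn_hash in masked:
    print(name, ssn_hash)
```

Because the same input always produces the same digest, masked values can still be grouped or joined on, while the original values stay hidden from the masked users.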

Configuring Spark to use Ranger for Authorization

Before you begin configuring your Ranger plugin, note down the following information:

- Ranger URL. The endpoint of the Ranger Admin that serves your authorization policies. Ensure that this endpoint is accessible from the Spark cluster. Example: http://localhost:6080
- Credentials. The credentials that provide access to the Ranger Admin. They are used to communicate with Ranger Admin to fetch policies and user-group information. Example: user: admin, password: admin.
- Service Name. The name of the service in Ranger Admin that holds the policies you want to apply to your catalog. Specify the service name with the Spark configuration spark.sql.qubole.ranger.service.name. You can set this configuration at the cluster level through Spark overrides (spark-defaults.conf:spark.sql.qubole.ranger.service.name hiveService1) or pass it as a job-level configuration (--conf spark.sql.qubole.ranger.service.name=hiveService1). The default value is hive. Example: serviceName: hive.
- Ranger user group cache expiry. The cache expiry time, in seconds, for Ranger user group information. The default value is 30 seconds. A higher value reduces calls to Ranger, but the user group information can become stale between fetches. You can set this configuration through cluster overrides with the Spark configuration spark.sql.qubole.ranger.userGroupCacheExpirySecs.
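The two Spark configurations named above can be collected into spark-defaults.conf style lines (key, a space, then value). The sketch below is one hedged way to generate them; the helper ranger_spark_overrides and its validation rule (a positive expiry) are assumptions for illustration, while the two configuration keys and their defaults come from this section.

```python
from typing import List

# Configuration keys taken from this section.
SERVICE_NAME_KEY = "spark.sql.qubole.ranger.service.name"
CACHE_EXPIRY_KEY = "spark.sql.qubole.ranger.userGroupCacheExpirySecs"


def ranger_spark_overrides(service_name: str = "hive",
                           cache_expiry_secs: int = 30) -> List[str]:
    """Build spark-defaults.conf style lines for the Ranger settings.

    Hypothetical helper; the defaults mirror the documented ones.
    """
    if not service_name:
        raise ValueError("service_name must not be empty")
    if cache_expiry_secs <= 0:
        # Assumption: a non-positive expiry is treated as a mistake.
        raise ValueError("cache_expiry_secs must be positive")
    return [
        f"{SERVICE_NAME_KEY} {service_name}",
        f"{CACHE_EXPIRY_KEY} {cache_expiry_secs}",
    ]


# Example: a service named hiveService1 with a 60-second group cache.
override_lines = ranger_spark_overrides("hiveService1", 60)
print("\n".join(override_lines))
```

The printed lines can then be pasted into the cluster's Spark overrides. A longer expiry such as 60 seconds trades fewer calls to Ranger for a longer window in which group membership changes go unnoticed.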

Creating Ranger Configuration Files

Create the Ranger configuration files ranger-configs.properties, ranger-<serviceName>-security.xml, and ranger-<serviceName>-audit.xml using the Ranger information that you noted above.

The ranger-configs.properties file should contain the following information:

ranger.username=admin
ranger.password=admin

The ranger-<serviceName>-security.xml file should contain the following information:

<configuration>
    <property>
        <name>ranger.plugin.<serviceName>.service.name</name>
        <value><service-name></value>
    </property>
    <property>
        <name>ranger.plugin.<plugin-name>.policy.pollIntervalMs</name>
        <value>5000</value>
    </property>
    <property>
        <name>ranger.service.store.rest.url</name>
        <value>http://myRangerAdmin.myOrg.com:6080</value>
    </property>
    <property>
        <name>ranger.plugin.<plugin-name>.policy.rest.url</name>
        <value>http://myRangerAdmin.myOrg.com:6080</value>
    </property>
</configuration>

Note

plugin-name is the name of the plug-in, such as spark or hive. service-name is the name of the service, policy, or repository defined in the Ranger Admin UI.
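If you prefer to generate the security file rather than edit it by hand, the template above can be filled in programmatically. The following is a minimal sketch that follows the template literally; the function build_security_xml and its parameter names are hypothetical, and the sample values (hive, the Ranger Admin URL, the 5000 ms poll interval) are taken from the examples in this section.

```python
import xml.etree.ElementTree as ET


def build_security_xml(service_name: str, plugin_name: str,
                       ranger_url: str, poll_interval_ms: int = 5000) -> str:
    """Render the security configuration as an XML string.

    Hypothetical helper: property names follow the template in this
    section, with the placeholders replaced by the given values.
    """
    properties = {
        f"ranger.plugin.{service_name}.service.name": service_name,
        f"ranger.plugin.{plugin_name}.policy.pollIntervalMs": str(poll_interval_ms),
        "ranger.service.store.rest.url": ranger_url,
        f"ranger.plugin.{plugin_name}.policy.rest.url": ranger_url,
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):
        # Pretty-printing is available on Python 3.9 and later.
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


xml_text = build_security_xml(
    service_name="hive",
    plugin_name="hive",
    ranger_url="http://myRangerAdmin.myOrg.com:6080",
)
# Save this text as ranger-<serviceName>-security.xml.
print(xml_text)
```

Building the document with an XML library instead of string concatenation ensures that values containing characters such as & are escaped correctly.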

The ranger-<serviceName>-audit.xml file should contain the following information:

<configuration>
    <property>
        <name>xasecure.audit.is.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>xasecure.audit.solr.is.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>xasecure.audit.solr.async.max.queue.size</name>
        <value>10</value>
    </property>
    <property>
        <name>xasecure.audit.solr.async.max.flush.interval.ms</name>
        <value>1000</value>
    </property>
    <property>
        <name>xasecure.audit.solr.solr_url</name>
        <value>http://localhost:6083/solr/ranger_audits</value>
    </property>
</configuration>

If auditing is not required, set xasecure.audit.is.enabled and xasecure.audit.solr.is.enabled to false.

After you create these files, add the following commands to the node bootstrap of the cluster to copy the files to /usr/lib/spark/conf/ on all nodes of the cluster.

RANGER_PROPERTIES_LOC=s3://<location>/ranger-configs.properties
RANGER_SECURITY_LOC=s3://<location>/ranger-<serviceName>-security.xml
RANGER_AUDIT_LOC=s3://<location>/ranger-<serviceName>-audit.xml

/usr/lib/hadoop2/bin/hadoop dfs -get s3://paid-qubole/spark-ranger/rangerBootstrap.sh

source rangerBootstrap.sh

Known Limitations

- HTTPS mode of Ranger Admin is not yet supported.
- Of the Ranger masking options, Spark on Qubole currently supports only Hash masking.
- Every SQL expression added as a row filter must be supported by Spark. Administrators should verify filters independently before adding them to Ranger.
- With column-level access policies, a SELECT * statement in Spark returns all the columns that are accessible to the current user. The same SELECT * statement leads to an access control exception in Hive or Presto when running with Ranger. Example: Consider a table defined as Employee(id, name, salary), where user A is given access only to the columns (id, name). Running select * from Employee in Spark returns all rows with the columns (id, name). The same query throws an authorization exception in Hive or Presto.
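The difference in SELECT * behavior can be modeled in a few lines. The sketch below is a conceptual illustration of the Employee example above, not the actual logic of Spark, Hive, or Presto; the function names and the AuthorizationError class are hypothetical.

```python
from typing import List, Set


class AuthorizationError(Exception):
    """Hypothetical stand-in for an engine's access control exception."""


def expand_star_spark(table_columns: List[str], accessible: Set[str]) -> List[str]:
    """Model of Spark on Qubole: SELECT * keeps only accessible columns."""
    return [col for col in table_columns if col in accessible]


def expand_star_strict(table_columns: List[str], accessible: Set[str]) -> List[str]:
    """Model of Hive or Presto: SELECT * fails if any column is denied."""
    denied = [col for col in table_columns if col not in accessible]
    if denied:
        raise AuthorizationError(f"no access to columns: {denied}")
    return list(table_columns)


# Employee(id, name, salary); user A may read only id and name.
employee_columns = ["id", "name", "salary"]
user_a_access = {"id", "name"}

spark_columns = expand_star_spark(employee_columns, user_a_access)
print(spark_columns)

try:
    expand_star_strict(employee_columns, user_a_access)
    strict_result = "allowed"
except AuthorizationError as err:
    strict_result = f"denied: {err}"
print(strict_result)
```

A practical consequence is that the result schema of select * in Spark can differ between users, so jobs that depend on a fixed set of columns are safer when they list the columns explicitly.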