Mongo Backed Tables

QDS allows you to create Hive tables that point to a MongoDB collection. When such a table is queried, QDS launches a MapReduce job that fetches the data from the collection, and Hive does further processing. Qubole’s implementation of the connector is based on the code at https://github.com/mongodb/mongo-hadoop.

The following is an example of an SQL statement which points a Hive table to the a MongoDB collection.

CREATE EXTERNAL TABLE mongotest (city string, pop int, state string)
STORED BY "com.mongodb.hadoop.hive.MongoHiveStorageHandler"
WITH serdeproperties ("qbol.mongo.input.format"="true")
tblproperties("mongo.input.uri" = "mongodb://<userid>:<password>@<hostname>.mongolab.com:43207/test.zips");

This points the table mongotest to a Mongo collection, and zips and stores the test DB in a Mongolab-hosted instance. Once this table is created, you can query it like an ordinary Hive table.

To get Mongo tables working, add the following setting:

set mongo.input.split.create_input_splits=false

You can now use queries such as the following:

select state, sum(pop) as pop from mongotest group by state

You can also use extract data out of Mongo and store it in the Cloud as a normal Hive table.

To limit the load on the Mongo database QDS limits the number of mappers that can connect to each database. By default, this number is set to 4. That is, at the most 4 simultaneous connections are made per MapReduce job to the Mongo DB. To change this, add the following setting:

set mongo.mapper.count=<n>;

where <n> is the number of mappers you want to allow.

Similarly, to connect MapReduce jobs to the read replicas instead of the coordinator, add the following setting:

set mongo.input.split.allow_read_from_secondaries=true;

MapReduce will now use read replicas whenever possible.