Configuring Map Join Options in Hive

Map join is a Hive feature that is used to speed up Hive queries. It lets a table to be loaded into memory so that a join could be performed within a mapper without using a Map/Reduce step. If queries frequently depend on small table joins, using map joins speed up queries’ execution. Map join is a type of join where a smaller table is loaded in memory and the join is done in the map phase of the MapReduce job. As no reducers are necessary, map joins are way faster than the regular joins.

In Qubole Hive, the mapjoin options are enabled by default/have default values.

Here are the Hive map join options:

  • hive.auto.convert.join: By default, this option is set to true. When it is enabled, during joins, when a table with a size less than 25 MB (hive.mapjoin.smalltable.filesize) is found, the joins are converted to map-based joins.

  • hive.auto.convert.join.noconditionaltask: When three or more tables are involved in the join condition. Using hive.auto.convert.join, Hive generates three or more map-side joins with an assumption that all tables are of smaller size. Using hive.auto.convert.join.noconditionaltask, you can combine three or more map-side joins into a single map-side join if size of n-1 table is less than 10 MB. (This rule is defined by hive.auto.convert.join.noconditionaltask.size.)

Outer joins are not always converted to map joins, which are as described below:

  • Full outer joins are never converted to map-side joins.

  • A left-outer join are converted to a map join only if the right table that is to the right side of the join conditions, is lesser than 25 MB in size.

  • Similarly, a right-outer join is converted to a map join only if the left table size is lesser than 25 MB.