Avro Tables

Qubole supports creating Hive tables against data in Avro format.

Getting Avro schema from a file

If you have an Avro file, you can extract its schema with Avro tools. Download avro-tools-1.7.4.jar and run the following command to print the schema. The schema goes into the SERDEPROPERTIES clause of the DDL statement, as the value of avro.schema.literal.

$ java -jar avro-tools-1.7.4.jar getschema episodes.avro
{
  "type" : "record",
  "name" : "episodes",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "title",
    "type" : "string",
    "doc"  : "episode title"
  }, {
    "name" : "air_date",
    "type" : "string",
    "doc"  : "initial date"
  }, {
    "name" : "doctor",
    "type" : "int",
    "doc"  : "main actor playing the Doctor in episode"
  } ]
}

DDL Statement (AWS Example)

The following DDL statement creates an external Hive table called episodes over the Avro data. Once the table exists, you can query it just like any other Hive table.

CREATE EXTERNAL TABLE episodes
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
  "type" : "record",
  "name" : "episodes",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "title",
    "type" : "string",
    "doc" : "episode title"
  }, {
    "name" : "air_date",
    "type" : "string",
    "doc" : "initial date"
  }, {
    "name" : "doctor",
    "type" : "int",
    "doc" : "main actor playing the Doctor in episode"
  } ]
}
')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://public-qubole/datasets/avro/episodes'
;
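
Pasting the schema into the DDL by hand is error-prone for larger schemas. As a minimal sketch, the statement above could instead be generated from a saved schema file. The function name make_avro_ddl and the file names episodes.avsc and create_episodes.hql are hypothetical and not part of Qubole or Avro tools; the sketch also assumes the schema contains no single quotes, since one would end the SQL string literal early.

```shell
# make_avro_ddl SCHEMA_FILE TABLE LOCATION   (hypothetical helper)
# Prints a CREATE EXTERNAL TABLE statement with the contents of SCHEMA_FILE
# embedded as the avro.schema.literal value.
make_avro_ddl() {
  schema_file=$1
  table=$2
  location=$3

  # Refuse to continue if the schema file cannot be read.
  if [ ! -r "$schema_file" ]; then
    echo "error: cannot read $schema_file" >&2
    return 1
  fi

  # A single quote inside the schema would terminate the SQL string literal.
  if grep -q "'" "$schema_file"; then
    echo "error: $schema_file contains a single quote" >&2
    return 1
  fi

  # The unquoted heredoc expands the variables and inlines the schema text.
  cat <<EOF
CREATE EXTERNAL TABLE $table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
$(cat "$schema_file")
')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '$location'
;
EOF
}

# Example usage (file names are placeholders):
#   java -jar avro-tools-1.7.4.jar getschema episodes.avro > episodes.avsc
#   make_avro_ddl episodes.avsc episodes s3://public-qubole/datasets/avro/episodes > create_episodes.hql
```

The generated file contains the same statement as the hand-written example, so it can be submitted the same way. Keeping the schema in its own file means the DDL can be regenerated whenever the schema changes.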