Schemas and Formats

Arrow Schema

The basic JSON structure for an Arrow schema is:

{
  "fields": [<field-object>, ..],
  "metadata": [<key-value-object>, ..]
}
  • The fields array must contain at least one field object.

  • The metadata array can contain zero or more key-value objects.
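As a minimal sketch of this top-level structure, the following Python builds a schema with one field and one metadata entry and checks the two constraints listed above. The helper name `check_schema_shape`, the field name `id`, and the metadata key `source` are illustrative assumptions, not part of any API.

```python
import json


def check_schema_shape(schema: dict) -> None:
    """Raise ValueError if the top-level structure breaks the rules above."""
    fields = schema.get("fields")
    if not isinstance(fields, list) or len(fields) < 1:
        raise ValueError("'fields' must be an array with at least one field object")
    # Assumption: a missing 'metadata' key is treated like an empty array.
    metadata = schema.get("metadata", [])
    if not isinstance(metadata, list):
        raise ValueError("'metadata' must be an array of key-value objects")


schema = {
    "fields": [
        {"name": "id", "nullable": False, "type": {"name": "utf8"}},
    ],
    "metadata": [
        {"key": "source", "value": "example"},
    ],
}

check_schema_shape(schema)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```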

fields

Each JSON object in the fields array defines an Arrow column data type.

There are two general categories of an Arrow column data type:

  • Primitive

  • Nested

Currently, we only support Primitive field objects.

The general structure of the field object is:

{
  "name": "someFieldName",
  "nullable": <true|false>,
  "type": {
    "name": <field-type>,
    ..
  }
}

All the properties in a field object are required.
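A hedged sketch of how the required properties of a field object could be checked before a schema is submitted. The function `check_field_object` is a hypothetical helper written for this illustration; it covers only the general structure, not the type-specific properties.

```python
def check_field_object(field: dict) -> None:
    """Check the general field object structure: name, nullable, and type.name."""
    for prop in ("name", "nullable", "type"):
        if prop not in field:
            raise ValueError(f"field object is missing required property '{prop}'")
    if not isinstance(field["name"], str):
        raise ValueError("'name' must be a string")
    if not isinstance(field["nullable"], bool):
        raise ValueError("'nullable' must be true or false")
    type_obj = field["type"]
    if not isinstance(type_obj, dict) or "name" not in type_obj:
        raise ValueError("'type' must be an object with a 'name' property")


# A signed 32-bit integer column that may contain nulls.
age_field = {
    "name": "age",
    "nullable": True,
    "type": {"name": "int", "bitWidth": 32, "isSigned": True},
}
check_field_object(age_field)
```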

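The following field object types are supported: null, utf8, largeutf8, binary, largebinary, bool, int, floatingpoint, fixedsizebinary, decimal, date, time, timestamp, and duration.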

Consult the Arrow Schema specification for more details on each data type.

The rest of the JSON object structure is specific to each field type:

null

Property          Type     Allowed value(s)
.type.name        String   null

utf8

Property          Type     Allowed value(s)
.type.name        String   utf8

largeutf8

Property          Type     Allowed value(s)
.type.name        String   largeutf8

binary

Property          Type     Allowed value(s)
.type.name        String   binary

largebinary

Property          Type     Allowed value(s)
.type.name        String   largebinary

bool

Property          Type     Allowed value(s)
.type.name        String   bool

int

Property          Type     Allowed value(s)
.type.name        String   int
.type.bitWidth    Number   8, 16, 32, 64, 128
.type.isSigned    Boolean  true, false

floatingpoint

Property          Type     Allowed value(s)
.type.name        String   floatingpoint
.type.precision   String   HALF, SINGLE, DOUBLE

fixedsizebinary

Property          Type     Allowed value(s)
.type.name        String   fixedsizebinary
.type.byteWidth   Number   (Integer, minimum: 1)

decimal

Property          Type     Allowed value(s)
.type.name        String   decimal
.type.precision   Number   (Integer, minimum: 1)
.type.scale       Number   (Integer, minimum: 1)
.type.bitWidth    Number   8, 16, 32, 64, 128
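In Arrow, precision is the total number of decimal digits a value may hold and scale is the number of those digits that follow the decimal point. The sketch below illustrates how the two interact by converting a Python `Decimal` into the unscaled integer that a decimal column with a given precision and scale would hold. The helper `to_unscaled` is hypothetical and exists only for this illustration.

```python
from decimal import Decimal


def to_unscaled(value: Decimal, precision: int, scale: int) -> int:
    """Return value * 10**scale as an integer, enforcing precision and scale."""
    scaled = value.scaleb(scale)  # shift the decimal point right by `scale` digits
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    unscaled = int(scaled)
    if len(str(abs(unscaled))) > precision:
        raise ValueError(f"{value} needs more than {precision} digits")
    return unscaled


# With precision 5 and scale 2, 123.45 is held as the integer 12345.
print(to_unscaled(Decimal("123.45"), precision=5, scale=2))
```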

date

Property          Type     Allowed value(s)
.type.name        String   date
.type.unit        String   DAY, MILLISECOND

time

Property          Type     Allowed value(s)
.type.name        String   time
.type.unit        String   SECOND, MILLISECOND, MICROSECOND, NANOSECOND
.type.bitWidth    Number   8, 16, 32, 64, 128, 256, 512

timestamp

Property Type Allowed value(s)

.type.name        String   timestamp
.type.unit        String   SECOND, MILLISECOND, MICROSECOND, NANOSECOND
.type.timezone    String   See: Arrow Schema specification

duration

Property          Type     Allowed value(s)
.type.name        String   duration
.type.unit        String   SECOND, MILLISECOND, MICROSECOND, NANOSECOND
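The type-specific properties listed above lend themselves to a simple lookup-based check. The sketch below covers only a subset of the types (int, floatingpoint, date, and duration) and only their enumerated properties; the `ALLOWED` table and `check_type_properties` function are hypothetical names, with the allowed values copied from the listings above.

```python
TIME_UNITS = ("SECOND", "MILLISECOND", "MICROSECOND", "NANOSECOND")

# Enumerated properties per type name, taken from the listings above.
ALLOWED = {
    "int": {"bitWidth": (8, 16, 32, 64, 128)},
    "floatingpoint": {"precision": ("HALF", "SINGLE", "DOUBLE")},
    "date": {"unit": ("DAY", "MILLISECOND")},
    "duration": {"unit": TIME_UNITS},
}


def check_type_properties(type_obj: dict) -> None:
    """Raise ValueError if an enumerated property is missing or not allowed."""
    rules = ALLOWED.get(type_obj.get("name"), {})  # unknown types are not checked
    for prop, allowed in rules.items():
        if type_obj.get(prop) not in allowed:
            raise ValueError(
                f"type '{type_obj['name']}': '{prop}' must be one of {allowed}"
            )


check_type_properties({"name": "int", "bitWidth": 64, "isSigned": False})
check_type_properties({"name": "duration", "unit": "MICROSECOND"})
```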

metadata

Each key-value object represents a user-defined association of a key string with a value string:

{
  "key": "myKey",
  "value": "someValue"
}

Required Properties:

  • key

  • value
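Because the metadata is an array of key-value objects rather than a single JSON object, it can be convenient to convert to and from an ordinary mapping. A minimal sketch, assuming hypothetical helper names `to_metadata` and `from_metadata`:

```python
def to_metadata(pairs: dict) -> list:
    """Turn {"k": "v"} into [{"key": "k", "value": "v"}]."""
    return [{"key": str(k), "value": str(v)} for k, v in pairs.items()]


def from_metadata(entries: list) -> dict:
    """Turn the array form back into a mapping, checking required properties."""
    result = {}
    for entry in entries:
        if "key" not in entry or "value" not in entry:
            raise ValueError("each key-value object requires 'key' and 'value'")
        result[entry["key"]] = entry["value"]
    return result


metadata = to_metadata({"myKey": "someValue"})
print(metadata)
```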

Avro Schema

The Apache Avro specification details the structure of an Avro schema. Both primitive and complex types are supported.

The Avro transformation does not currently support the proprietary Confluent wire format.
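For context, the Confluent wire format prefixes each Avro payload with a zero magic byte followed by a 4-byte big-endian schema ID. The sketch below shows one way such framing could be recognised and split off before the payload reaches a plain Avro decoder. The function name `split_confluent_frame` is hypothetical, and the check is only a heuristic, since an unframed Avro payload can also begin with a zero byte.

```python
import struct


def split_confluent_frame(message: bytes):
    """Return (schema_id, avro_payload) for a Confluent-framed message."""
    if len(message) < 5 or message[0] != 0:
        raise ValueError("message does not start with the Confluent framing header")
    # Bytes 1..4 hold the schema registry ID as an unsigned big-endian integer.
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]


framed = b"\x00" + struct.pack(">I", 42) + b"\x10John Doe"
schema_id, payload = split_confluent_frame(framed)
print(schema_id, payload)
```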

Primitive Types

  • null: no value

  • boolean: a binary value

  • int: 32-bit signed integer

  • long: 64-bit signed integer

  • float: single-precision (32-bit) IEEE 754 floating-point number

  • double: double-precision (64-bit) IEEE 754 floating-point number

  • bytes: sequence of 8-bit unsigned bytes

  • string: Unicode character sequence
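To make the integer ranges concrete, the following sketch picks an Avro primitive type name for a Python value. The function `avro_primitive_for` is a hypothetical helper; mapping every Python `float` to `double` is an assumption made here because Python floats are 64-bit.

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def avro_primitive_for(value) -> str:
    """Return the name of an Avro primitive type that can hold `value`."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        if INT_MIN <= value <= INT_MAX:
            return "int"
        if LONG_MIN <= value <= LONG_MAX:
            return "long"
        raise ValueError("integer does not fit in a 64-bit signed long")
    if isinstance(value, float):
        return "double"  # assumption: Python floats are 64-bit
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no Avro primitive type for {type(value).__name__}")


print(avro_primitive_for(58), avro_primitive_for(2**31))
```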

Complex Types

  • Records

  • Enums

  • Arrays

  • Maps

  • Unions

  • Fixed

Avro Schema Example

{
  "type": "record",
  "name": "person",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "pets",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "pet",
          "fields": [
            {
              "name": "name",
              "type": "string"
            },
            {
              "name": "type",
              "type": {
                "type": "enum",
                "name": "AnimalType",
                "symbols": [
                  "Cat",
                  "Dog"
                ]
              }
            }
          ]
        }
      }
    }
  ]
}

The schema above can be used to serialise the following JSON document into a binary Avro representation:

{
  "name": "John Doe",
  "age": 58,
  "pets": [
    {
      "name": "Milo",
      "type": "Dog"
    },
    {
      "name": "Nero",
      "type": "Cat"
    }
  ]
}
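To illustrate what the binary representation looks like, here is a hand-rolled sketch of the Avro binary encoding rules for just the types used in this example (string, int, record, array, and enum). It is not the encoder used by the transformation and is no substitute for an Avro library; the function names are hypothetical. It produces the raw binary encoding only, without an object container file header.

```python
import json

SCHEMA = json.loads("""
{"type": "record", "name": "person", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age", "type": "int"},
  {"name": "pets", "type": {"type": "array", "items": {
    "type": "record", "name": "pet", "fields": [
      {"name": "name", "type": "string"},
      {"name": "type", "type": {"type": "enum", "name": "AnimalType",
                                "symbols": ["Cat", "Dog"]}}]}}}]}
""")

PERSON = {
    "name": "John Doe",
    "age": 58,
    "pets": [{"name": "Milo", "type": "Dog"}, {"name": "Nero", "type": "Cat"}],
}


def zigzag_varint(n: int) -> bytes:
    """Encode an int/long as a zig-zag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small numbers
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode(schema, value) -> bytes:
    """Encode `value` according to `schema` (subset of Avro types only)."""
    if isinstance(schema, str):
        if schema in ("int", "long"):
            return zigzag_varint(value)
        if schema == "string":
            data = value.encode("utf-8")
            return zigzag_varint(len(data)) + data  # length prefix, then bytes
        raise NotImplementedError(schema)
    kind = schema["type"]
    if kind == "record":  # fields in schema order, with no separators
        return b"".join(encode(f["type"], value[f["name"]]) for f in schema["fields"])
    if kind == "array":  # one block of items, then a zero count as terminator
        if not value:
            return zigzag_varint(0)
        items = b"".join(encode(schema["items"], v) for v in value)
        return zigzag_varint(len(value)) + items + zigzag_varint(0)
    if kind == "enum":  # position of the symbol in the symbols list
        return zigzag_varint(schema["symbols"].index(value))
    raise NotImplementedError(kind)


encoded = encode(SCHEMA, PERSON)
print(len(encoded), encoded.hex())
```

The encoded record carries no field names or type tags, which is why the same schema is needed to read the data back.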