writeToArrow

Processor for writing data in the Arrow format using the Arrow Flight RPC protocol.

This processor requires a schema, which is referenced by the schemaId property. The schema must be formatted as a JSON document, as described in the Arrow Schema section.

For reference, see the Arrow data format Version 1.4 and the schema specification.

This processor consumes data in JSON Compliant format (JSON objects, or JSON lists for batch writes) or payloads that can be converted to JSON Compliant format. When necessary, the processor converts the data to the respective Arrow data types.
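The two payload shapes can be illustrated with a short Python sketch. The helper name `to_records`, the field names, and the assumption that a batch is a list of JSON objects are illustrative only and are not part of the processor's API.

```python
import json


def to_records(payload: str) -> list[dict]:
    """Normalize a JSON payload to a list of records (hypothetical helper)."""
    data = json.loads(payload)
    if isinstance(data, dict):
        return [data]  # single write: one JSON object
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data  # batch write: a JSON list (assumed here to hold objects)
    raise ValueError("payload must be a JSON object or a list of JSON objects")


# Hypothetical payloads; the field names are made up for illustration.
single = to_records('{"sensor": "a1", "value": 21.5}')
batch = to_records('[{"sensor": "a1", "value": 21.5}, {"sensor": "b2", "value": 19.0}]')
print(len(single), len(batch))
```

A single object is treated as a batch of one, so downstream conversion logic only has to handle one shape.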

Data conversion has the following specifics:

  • All conversions to byte arrays are done via intermediate conversion to strings.

  • During conversion to numeric data types, overflows are not allowed. A value of a larger type can be converted to a smaller type only if the smaller type can represent the exact same numeric value.

  • Rounding of floating-point numbers during conversion is not permitted.

  • When converting to bit, a numeric representation of 0 (integer or string) or false unsets the bit. Any other value (a nonzero numeric or true) sets the bit.

  • To set the Arrow Date type, the incoming value must be a string or an integer. A string representation of a date should follow the "yyyy-MM-dd" format (also known as the "full-date" format in RFC 3339: Date and Time on the Internet: Timestamps).

  • To set the Arrow Time type, the incoming value must be a string or an integer. A string representation of a time should follow the "HH:mm:ss.n" format (also known as the "partial-time" format in RFC 3339: Date and Time on the Internet: Timestamps).

  • To set the Arrow Timestamp type, the incoming value must be a string or an integer. A string representation of a timestamp should follow the "yyyy-MM-dd'T'HH:mm:ss.n" pattern, which combines the "full-date" and "partial-time" formats from RFC 3339: Date and Time on the Internet: Timestamps. The Unix epoch is used as the fixed epoch for Arrow Timestamp. Time zone offsets and time zone suffixes within the incoming payload are not supported; the time zone is taken from the schema. If the relevant field in the schema doesn’t specify a time zone, UTC is used, and the value is calculated accordingly. If a time zone is specified, it is taken into account when calculating the timestamp value.

  • To set the Arrow Duration type, the incoming value must be a string or an integer. A string representation of a duration should follow the ISO-8601 based "PnDTnHnMn.nS" pattern. Period components other than days (years, months, weeks) are not allowed because they cannot be converted without approximation (e.g., there’s no fixed number of days in a month).
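The conversion rules above can be approximated in Python as follows. This is a minimal sketch and not the processor's implementation: all function names are hypothetical, nanoseconds are assumed as the target unit (the actual unit comes from the schema field), and the schema's time zone is passed in as a `tzinfo` object.

```python
import re
from datetime import date, datetime, timedelta, timezone, tzinfo
from decimal import Decimal

NANOS = 10**9
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,9}))?")
DURATION = re.compile(
    r"P(?:(\d+)D)?(?:T(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)(?:\.(\d{1,9}))?S)?)?"
)


def to_int_exact(value, bits: int) -> int:
    """Signed integer conversion that rejects rounding and overflow."""
    number = Decimal(str(value))
    if number != number.to_integral_value():
        raise ValueError(f"{value!r} cannot be converted without rounding")
    if not -(1 << (bits - 1)) <= number < (1 << (bits - 1)):
        raise ValueError(f"{value!r} overflows a {bits}-bit integer")
    return int(number)


def to_bit(value) -> bool:
    """0 (integer or string) or false unsets the bit; other values set it."""
    if isinstance(value, bool):
        return value
    return Decimal(str(value)) != 0  # non-numeric strings raise here


def date_to_days(text: str) -> int:
    """'yyyy-MM-dd' to days since the Unix epoch."""
    return (date.fromisoformat(text) - date(1970, 1, 1)).days


def time_to_nanos(text: str) -> int:
    """'HH:mm:ss.n' to nanoseconds since midnight."""
    match = TIME.fullmatch(text)
    if not match:
        raise ValueError(f"not a partial-time value: {text!r}")
    hours, minutes, seconds = (int(group) for group in match.groups()[:3])
    if hours > 23 or minutes > 59 or seconds > 59:
        raise ValueError(f"time component out of range: {text!r}")
    fraction = int((match.group(4) or "").ljust(9, "0"))
    return ((hours * 60 + minutes) * 60 + seconds) * NANOS + fraction


def timestamp_to_nanos(text: str, zone: tzinfo = timezone.utc) -> int:
    """Zone-less 'yyyy-MM-ddTHH:mm:ss.n' to nanoseconds since the Unix epoch.

    `zone` stands in for the schema's time zone; offsets in the text are rejected.
    """
    day, _, clock = text.partition("T")
    seconds, fraction = divmod(time_to_nanos(clock), NANOS)
    local = datetime.combine(date.fromisoformat(day), datetime.min.time())
    aware = (local + timedelta(seconds=seconds)).replace(tzinfo=zone)
    return (aware - EPOCH) // timedelta(seconds=1) * NANOS + fraction


def duration_to_nanos(text: str) -> int:
    """'PnDTnHnMn.nS' to nanoseconds; years, months, and weeks are rejected."""
    match = DURATION.fullmatch(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups()[:4])
    fraction = int((match.group(5) or "").ljust(9, "0"))
    return (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * NANOS + fraction
```

For example, `to_int_exact(128, 8)` and `duration_to_nanos("P1M")` both raise `ValueError`, while `timestamp_to_nanos("1970-01-01T02:00:00", timezone(timedelta(hours=2)))` yields 0 because the local time is interpreted in the supplied zone.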

Currently unsupported Arrow Data Types:

  • Utf8View

  • BinaryView

  • Map

  • List

  • ListView

  • Struct

  • Union

  • Interval

  • Extension
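A schema can be checked against this list before it is deployed. The sketch below assumes a simplified, hypothetical mapping of field names to Arrow type names; the real JSON schema layout is described in the Arrow Schema section and is not reproduced here.

```python
# Type names copied from the list of currently unsupported Arrow data types.
UNSUPPORTED_TYPES = frozenset(
    {"Utf8View", "BinaryView", "Map", "List", "ListView",
     "Struct", "Union", "Interval", "Extension"}
)


def find_unsupported(field_types: dict[str, str]) -> dict[str, str]:
    """Return the fields whose Arrow type the processor cannot write."""
    return {
        name: type_name
        for name, type_name in field_types.items()
        if type_name in UNSUPPORTED_TYPES
    }


# Hypothetical field-to-type mapping, made up for illustration.
fields = {"id": "Int", "tags": "List", "created": "Timestamp"}
print(find_unsupported(fields))
```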

Properties

Name Summary

address

URI of the Flight RPC service to call. Currently only grpc+tcp scheme is supported. Required.

dataPath

Path to the data in the Arrow instance. Required.

schemaId

ID of the Arrow schema following the pattern <type>:<resourceId>:<revision>. Set <revision> to latest to retrieve the latest revision. For example: ArrowSchema:my-arrow-schema:latest. The schema referenced by this ID is a JSON representation of the Arrow schema and relies on data format Version 1.4. Required.

failOnUnknownField

Whether to fail processing if an unknown field is encountered in the payload. Defaults to false (i.e., unknown fields are ignored).

retainPayload

Whether to retain the payload after writing it to the Arrow instance, making it available for processing in downstream processors. Defaults to false (i.e., the payload is removed after it is written to Arrow).

connectionTimeoutMillis

The timeout of the HTTP client socket connection. Optional. Configuring the connection timeout is not supported in the current version.

receiveTimeoutMillis

The socket timeout to wait for the first byte of response from the server. Optional.

authenticationConfigKey

Key from the server configuration used to look up the credentials needed to connect to the Arrow instance. Optional; no authentication is used by default.

name

Optional, descriptive name for the processor.

id

Required identifier of the processor, unique across all processors within the flow. Must be between 3 and 30 characters long; contain only lower and uppercase alphabetical characters (a-z and A-Z), numbers, dashes ("-"), and underscores ("_"); and start with an alphabetical character. In other words, it adheres to the regex pattern [a-zA-Z][a-zA-Z0-9_-]{2,29}.

exchangeProperties

Optional set of custom properties in a simple jdk-format that are added to the message exchange properties before processing the incoming payload. Any existing properties with the same name are replaced by the properties defined here.

retainPayloadOnFailure

Whether the incoming payload is available for error processing on failure. Defaults to false.
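The format constraints on address, schemaId, and id can be checked up front. The following Python sketch is illustrative only: the function names, the host name, and the port are assumptions, and the schemaId check assumes that none of the three parts contains a colon.

```python
import re
from urllib.parse import urlsplit

# Regex pattern quoted from the id property description.
ID_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]{2,29}")


def check_address(address: str) -> tuple:
    """Accept only grpc+tcp URIs and return (host, port)."""
    parts = urlsplit(address)
    if parts.scheme != "grpc+tcp":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return parts.hostname, parts.port


def parse_schema_id(schema_id: str) -> tuple:
    """Split '<type>:<resourceId>:<revision>' into its three parts."""
    parts = schema_id.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <type>:<resourceId>:<revision>: {schema_id!r}")
    return tuple(parts)


def is_valid_id(processor_id: str) -> bool:
    """True if the id is 3 to 30 characters and matches the documented pattern."""
    return ID_PATTERN.fullmatch(processor_id) is not None


# Hypothetical values; the host and port are made up for illustration.
print(check_address("grpc+tcp://flight.example.internal:8815"))
print(parse_schema_id("ArrowSchema:my-arrow-schema:latest"))
print(is_valid_id("write-arrow_1"), is_valid_id("1abc"))
```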

Sub-builders

Name Summary

externalSystemDetails

Strategy for describing the external system integration. Optional.

processingStrategy

Strategy for providing message processing hints to the server. The concurrencyMaxNo property should be set, because the underlying Arrow writer client is blocking.

circuitBreakerStrategy

Strategy for configuring the processor’s circuit breaker. Optional.

messageLoggingStrategy

Strategy for describing how a processor’s message is logged on the server.

payloadArchivingStrategy

Strategy for archiving payloads.

inboundTransformationStrategy

Strategy that customizes the conversion of an incoming payload by a processor (e.g., string to object). Should be used when the processor’s default conversion logic cannot be used.