parseExcel

Processor that parses an Excel spreadsheet file (i.e., XLSX and XLS files). This processor is highly configurable but will require almost no configuration to process regular Excel files.

Input

The processor only accepts binary input data in either MS Office Open XML format (2007+, typically suffixed with xlsx) or the legacy Excel 97-2003 format (BIFF, typically suffixed with xls).

Note that merged cells are parsed as one cell, so use caution if you use merged cells. If you have to use merged cells, make sure that the same columns are merged in each row.

Output

The output will always be a JSON compliant object. The exact format is based on the outputType property, which can be one of the following two values:

  1. Objects: Returns a list of objects with the column names as keys.

    Example: [{"name": "Liv", "Age": 45}, {"name": "Ben", "Age": 20}].

    Useful if you need to process the data further using processors like map and dbStatement, or if you need to deliver the result as a JSON file with named keys.

  2. Values: Returns a list of lists with only the values, no column names.

    Example: [["Liv", 45], ["Ben", 20]].

Selecting Data Range

Use the worksheetName property to select a worksheet to use. If omitted, the first worksheet is used. Within the selected worksheet, use cellRange to define the range to process. Cell range can be a named range (e.g., MyRangeName), closed range (C7:S115), column range (A:F), or open ended range (A1:S). Note that row ranges (e.g., 10:20) are not supported, despite being a legal range in Excel.

Naming and Selecting Columns

If you want to use other names for the columns than the ones in the header row, or there isn’t a header row, you can name the columns using the columns property. The order of the names must reflect the order of the columns.

Use the selectedColumns property to define which columns to keep. You can also reorder the columns using the same property. The order that the columns is listed in is the order in which they will be output. The order is only significant when the output type is Values (i.e., a list of list of values). When the output type is Objects, the output is a list of maps where the order of the keys is not predictable.

Data Types

How to parse each cell into a Kotlin data type is detected for each cell. If you want to enforce a certain data type per column, use the columnTypes property.

In practice, Excel uses mostly two data types: numeric and text. Dates are simply numbers plus a formatting pattern. To avoid issues with locale-based differences of date formatting, the parseExcel processor outputs all dates as type String in ISO-8601 compliant format without the timezone. For example: "2011-12-03T10:15:30". This is, of course, different from the format displayed in Excel.

Numbers are returned as type Double unless type Long is specified using the columnTypes property.

Please note that when using MS Excel, the Format Cell dialogue can show that a cell is formatted as a "Number" even though it’s actually text.

Formulas are ignored and their cached value returned with the respective data type.

Anything else is returned as a String.

Properties

Name Summary

outputType

Required FlatFileOutputType that must be set to either Objects or Values. If Objects, the output is a list of objects with the header names as keys. For example: [{"name": "Liv", "Age": 45}, {"name": "Ben", "Age": 20}]. If Values, the output is a list of value-lists. For example: [["Liv", 45], ["Ben", 20]].

hasHeaderRow

If true, the first row will not be included in the dataset but used to extract the column names of the data. The column name values will be parsed as strings. Optional and defaults to false.

columns

Comma separated list of column names for the data. Escape with a backslash if you need a comma as a part of the name (e.g., \,). If both this and hasHeaderRow are set, then the names defined here take precedence over the names in the file header. Optional. If not set, and hasHeaderRow is false and outputType is Objects, then the columns will fall back to default Excel column naming (e.g., A, B, C, …​, AA, AB, …​).

selectedColumns

Optional, comma separated list of columns to include in the dataset. Can not be used if excludedColumns is defined. If not set, all columns will be included, unless excludedColumns is defined. Escape with a backslash if you need a comma as a part of the name (e.g., \,). Non-selected columns will be discarded. Selected columns can also be rearranged by specifying the preferred order here. For example, given columns "a, b, c, d", specifying "d, b" would only include those two columns, in that order, in the output.

excludedColumns

Optional, comma separated list of columns not to include in the dataset. Can not be used if selectedColumns is defined. If not set, all columns will be included, unless selectedColumns is defined. Escape with a backslash if you need a comma as a part of the name (e.g., \,). Excluded columns will be discarded. For example, given columns "a, b, c, d", specifying "b, d" would filter out those columns, and only include column a and c.

maxNoOfLinesToRead

Sets the maximum number of lines to process (including the header, if it’s part of the range). Otherwise, the full range (or full file, if range is open) is processed. Superfluous rows will be ignored. Optional.

skipEmptyLines

Whether to skip lines with no data. Optional and defaults to true.

defaultForNull

The default value to use when a value is null. Default is applied after any other processing of the value, such as expression processing. Optional.

worksheetName

Worksheet name, case-insensitive. Optional. If not set, the first sheet is used.

columnTypes

Optional, comma separated list of column names and their enforced data types. Supported data types include String, Long, IsoDateString, Double, and Boolean. For example: "A=String,B=Long,C=Double". If not set, or if a column is not mentioned in the list, the data types are inferred from the cell type. If a cell is a formula, a cached calculated value is used. If this property specifies a column type, and any of the cells of the column cannot be parsed as that type, an exception is thrown.

Note that Excel uses only one type for numbers, regardless if they are integers or floating point numbers. Therefore, all numbers are returned as the Kotlin type Double by default. If you want to force parsing as Long, set the column as Long with this property.

Another caveat is that Excel doesn’t use Date types in saved files (even though the Office Open XML standard supports it). It uses a Numeric type for dates, together with information on how to format such a number (i.e., the formatting pattern). In cases where the processor doesn’t correctly detect a column value as a date, specify IsoDateString to force parsing the numeric value as a date. If omitted, the processor tries to auto-detect which values are dates based on the formatting pattern. If a value is parsed as date, it is output as an ISO-formatted String.

cellRange

Cell range. The following range types are supported:

  • Named ranges (e.g., "my_named_range")

  • Closed ranges (e.g., "C7:S115")

  • Open-ended ranges (e.g., "A1:S")

The range must be contiguous. It is not possible to combine multiple ranges, even if their union would be contiguous. Note that open-ended ranges (e.g., "A1:S") are not a format supported by either Excel or Office Open XML format; it is a custom format.

If hasHeaderRow is set, the first row in the range will be considered the header and parsed as strings.

If maxNoOfLinesToRead is set, then it takes precedence over the height of the range. For example, if cellRange=B2:E50, hasHeaderRow=true, and maxNoOfLinesToRead=10, then B2:E2 is considered the header, and B3:E11 is parsed as data. The rest is ignored.

name

Optional, descriptive name for the processor.

id

Required identifier of the processor, unique across all processors within the flow. Must be between 3 and 30 characters long; contain only lower and uppercase alphabetical characters (a-z and A-Z), numbers, dashes ("-"), and underscores ("_"); and start with an alphabetical character. In other words, it adheres to the regex pattern [a-zA-Z][a-zA-Z0-9_-]{2,29}.

exchangeProperties

Optional set of custom properties in a simple jdk-format, that are added to the message exchange properties before processing the incoming payload. Any existing properties with the same name will be replaced by properties defined here.

retainPayloadOnFailure

Whether the incoming payload is available for error processing on failure. Defaults to false.

Sub-builders

Name Summary

messageLoggingStrategy

Strategy for describing how a processor’s message is logged on the server.

payloadArchivingStrategy

Strategy for archiving payloads.

inboundTransformationStrategy

Strategy that customizes the conversion of an incoming payload by a processor (e.g., string to object). Should be used when the processor’s default conversion logic cannot be used.

Details

Limitations

Processing Excel is not recommended for large amounts of data. Also note that processing XLSX is significantly slower than XLS.