Data Models
Datalake is an accelerator specific to power utilities. When enabled, it can be accessed via the menu button in the top-left corner of the Utilihive Console.
Datalake favors generic data models and open-schema principles in order to serve a wide variety of use cases. The core data models are built on entity-attribute-value (EAV) principles, embrace Resource Description Framework (RDF) semantics, and are designed to keep persisted data compliant with common semantic technologies and data formats. The message envelope and the different interfaces also follow well-defined industry standards. By employing data models built on these design principles, Datalake seeks to enable machine understanding and ease integration to and from other systems.
Datalake also allows plugging in custom domain models (such as CIM profiles) and can project the persisted data onto these models at query time, enabling users and clients to consume their custom model directly over the APIs without further need for mappings and transformations.
All data models are fully documented by either OpenAPI specifications or GraphQL schemas, available through the corresponding APIs where they are exposed.
Base Data Structure for Time Series
All data entities in Datalake are defined with respect to time; in other words, as time series.
A series, defined as a sequence of observations indexed by time, is keyed on a source and a seriesId, the latter being a composite of entityType, attribute and entityId.
The source is a URI reference identifying the context in which the series of observations was produced, representing the logical partition of the series. It would typically include information about the system or application producing the data, but may also be used to represent multiple versions of the same series, for instance by using the fragment part of the URI.
The seriesId is analogous to a class instance, with the entity type, attribute and entity id representing the class name, field and object id, respectively. Hence, the entity type, attribute and entity id further specify what has been observed and by what particular entity or object.
A single observation row is thus uniquely keyed on source, seriesId and time, and is represented by a common data model.
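As a rough illustration of this keying, here is a minimal Kotlin sketch. The type names, property names and example values are hypothetical and only mirror the description above; the actual schemas are defined in the OpenAPI specification.

```kotlin
import java.net.URI
import java.time.Instant

// Hypothetical sketch: a seriesId combines entityType, attribute and entityId,
// analogous to a class name, a field and an object id.
data class SeriesId(
    val entityType: String,
    val attribute: String,
    val entityId: String,
)

// An observation row is uniquely keyed on source, seriesId and time.
data class ObservationKey(
    val source: URI,        // logical partition, e.g. the producing system
    val seriesId: SeriesId,
    val time: Instant,
)

// Illustrative values only; the URI fragment could be used to distinguish
// versions of the same series, as mentioned above.
val exampleKey = ObservationKey(
    source = URI("urn:example:scada#v2"),
    seriesId = SeriesId(entityType = "Meter", attribute = "energyConsumed", entityId = "meter-42"),
    time = Instant.parse("2024-01-01T00:00:00Z"),
)
```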
Derived Data Structures and Observation Types
Different modes of time series imply differences in data management and schema design. Datalake operates with the following observation types, each managed in separate tables:
- DataPoint - Standard time series consisting of a sequence of data points indexed by time (regular or irregular). Each data point may carry a value that is simple (univariate) or encodes multiple measurements (multivariate).
- CategoricalEvent - An event occurring at a point in time representing a categorical value, such as a particular alarm, notice or alert. For such observations, one would typically want to query e.g. histograms across category and time and perform roll-up aggregations at some TTL (instead of applying lifecycle policies for direct deletion).
- ValueStateChangeEvent - An event occurring at a point in time representing a change of state that is effective until the next state change occurs.
- LinkStateChangeEvent - An event occurring at a point in time representing a change of link state between two objects that is effective until the next link state change occurs.
- GeoStateChangeEvent - A change of geolocation state; a special case of ValueStateChangeEvent.
The observation types are represented with a union type and share the same table for persistence.
Datalake defines various formats for how a set of observations across different dimensions may be formulated. For further details, refer to the OpenAPI specification, which includes complete schemas for each type and all corresponding envelopes.
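As a sketch of how such a union might be modelled, here are hypothetical Kotlin shapes for each observation type; the field names and payload types are assumptions made for readability, and the authoritative definitions are the schemas in the OpenAPI specification.

```kotlin
import java.time.Instant

// Hypothetical union of the observation types listed above,
// modelled as a sealed interface.
sealed interface Observation {
    val time: Instant
}

data class DataPoint(
    override val time: Instant,
    val values: Map<String, Double>,  // univariate or multivariate payload
) : Observation

data class CategoricalEvent(
    override val time: Instant,
    val category: String,             // e.g. a particular alarm or alert
) : Observation

data class ValueStateChangeEvent(
    override val time: Instant,
    val value: String,                // effective until the next state change
) : Observation

data class LinkStateChangeEvent(
    override val time: Instant,
    val linkedEntityId: String,       // link state between two objects
    val linked: Boolean,
) : Observation

data class GeoStateChangeEvent(
    override val time: Instant,
    val latitude: Double,
    val longitude: Double,            // special case of a value state change
) : Observation
```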
Examples
Examples of various observations follow.
| source | entityType | attribute | entityId | Type | Description |
|---|---|---|---|---|---|
| | | | | Fixed-interval | Energy consumed on usage point. |
| | | | | Fixed-interval | Energy consumed on a meter using CIM reading type to describe the attribute. |
| | | | | Fixed-interval | 10-min rolling average over CPU consumption of a service. |
| | | | | Varying-interval | Daily CPU peak for a service. |
| | | | | | CPU threshold was reached for a service. |
| | | | | | CPU threshold changed for a service. |
Due to the nature of open-schema designs, many responsibilities around structure and convention are delegated to the domain model.
For instance, a series of 24-hour wind direction forecasts at 1000 m altitude for a given geohash cell may be represented by different combinations of source and seriesId:
| source | entityType | attribute | entityId |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
The set of queries to be performed against the time series is typically what dictates the exact model.
For instance, for querying vertical wind profiles across locations, it could be natural to treat geohash as the entity type, but for deriving wind momentum flux at any given altitude, it could be a better structure to operate on attributes that are independent of the altitude and pass in the altitude as a parameter (either through the entity type or the source), as sketched below.
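For example, the two alternatives could be keyed roughly as follows. Every name and value here is made up purely for illustration, reusing the hypothetical SeriesId shape from the earlier sketch.

```kotlin
import java.net.URI

// Hypothetical keyings of the same wind-direction forecast series.
data class SeriesId(val entityType: String, val attribute: String, val entityId: String)

// Option 1: geohash cell as the entity, altitude encoded in the attribute.
// Convenient when querying vertical wind profiles across locations.
val profileOrientedKey = SeriesId(
    entityType = "geohash",
    attribute = "windDirectionForecast24h_1000m",
    entityId = "u4pruyd",  // hypothetical geohash cell
)

// Option 2: an altitude-independent attribute, with the altitude carried
// elsewhere (here in the fragment of the source URI), convenient when the
// altitude is passed as a parameter, e.g. for deriving wind momentum flux.
val altitudeAsParameterKey = SeriesId(
    entityType = "geohash",
    attribute = "windDirectionForecast24h",
    entityId = "u4pruyd",
)
val altitudeAsParameterSource = URI("urn:example:weather-model#altitude=1000m")
```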
Following open-schema design principles enables full flexibility in how states are stored.
Let's consider two scenarios:
- The object and its state are static, meaning they are not expected to change over time. Storing the serialized JSON object as a single value is a good approach in this case.
| source | entityType | entityId | attribute | value | version | effectiveTime |
|---|---|---|---|---|---|---|
| | | | | | | |
| | | | | | | |
Updating even one of the object’s values would require updating the whole object, as its entire representation is stored as one value.
- In the other scenario, the state is expected to change over time. In such a case, normalizing the object's state and storing each value independently is more appropriate.
| source | entityType | entityId | attribute | value | version | effectiveTime |
|---|---|---|---|---|---|---|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
In the example above, the description changed over time, and only the changed state needed to be updated, since each state is stored separately as an independent value.
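As a rough illustration of the two storage approaches, here is a minimal Kotlin sketch; the object, attribute names, values and timestamps are hypothetical and not part of the Datalake schema.

```kotlin
import java.time.Instant

// Scenario 1: a static object serialized as JSON and stored under a single
// attribute. Updating any field requires rewriting the whole value.
val serializedObject = """{"name":"Substation A","description":"Primary substation"}"""

// Scenario 2: a changing object normalized into one attribute per field,
// each stored as an independent value with its own effective time.
data class StateValue(val attribute: String, val value: String, val effectiveTime: Instant)

val normalizedState = listOf(
    StateValue("name", "Substation A", Instant.parse("2023-01-01T00:00:00Z")),
    StateValue("description", "Primary substation", Instant.parse("2023-01-01T00:00:00Z")),
    // Only the changed attribute gets a new observation when the state changes:
    StateValue("description", "Primary substation, refurbished", Instant.parse("2024-06-01T00:00:00Z")),
)
```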
Domain Models
Datalake can be configured with a set of domain models and can generate API extensions for validating and querying data against these models. CIM profiles according to IEC 61970 and related specifications are especially well supported, and for these a richer set of Datalake features can be exploited.