Persistence
Datalake is an accelerator specific to power utilities. When enabled, it can be accessed via the menu button in the top-left corner of the Utilihive Console.
Datalake uses an object store for raw file storage and a columnar database for storing tabular data.
Object Store
All file objects, such as raw messages, archived or exported tabular data, workflow outputs, and backups, are stored in an object store.
The object store is not maintained by the Datalake service and must be provided by the platform. For installations on cloud platforms, Datalake provisions and uses the object store offered by the cloud provider.
Tabular Storage
Datalake stores tabular data in ClickHouse across two databases:
- Time series data derived from core data models is stored in the default database and is fully managed by Datalake. All tables and columns are documented through comments, which can be inspected as shown in the query after this list.
- Auxiliary data may be stored in a separate workspace database, allowing users to enrich and join time series data with custom datasets.
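Because the base schema is documented through comments, the available tables and columns can be inspected directly from ClickHouse's system tables. A minimal sketch, using the dataPoint table that appears in the query examples later on this page:

    -- List the columns and their documentation for one table in the default database
    SELECT name, type, comment
    FROM system.columns
    WHERE database = 'default'
      AND table = 'dataPoint'
    ORDER BY position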
Time Series State Tables
The state of each observation type is managed in a separate table with schema semantics for deduplication and soft-deleting entries.
For instance, when posting dataPoints on a source and seriesId that already contains data on the given time index, the new dataPoints replace the state of the previous ones.
The state is eventually consistent: queries that target fresh data and need to operate on the correct state must select the latest version of each row and filter out rows marked as deleted.
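For example, a query over fresh dataPoint data might collapse each key to its latest version and then drop soft-deleted rows. This is only a sketch: the version column and the exact key columns are assumptions and should be verified against the table comments.

    -- Keep only the latest version of each (source, seriesId, time) key,
    -- then filter out rows marked as deleted.
    SELECT source, seriesId, time, value
    FROM
    (
        SELECT
            source,
            seriesId,
            time,
            argMax(value, version)     AS value,
            argMax(isDeleted, version) AS isDeleted
        FROM dataPoint
        GROUP BY source, seriesId, time
    )
    WHERE isDeleted = false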
Each table may have its own set of derived tables (ClickHouse materialized views) for query optimization, or be further extended at query time by regular views.
History of All Data Mutations
All time series types are represented by a base observation type (as a sum/union type), and all data mutations are stored in the common observationHistory table.
This table is append-only and may serve as a basis for reporting, auditing, event sourcing and subscriptions.
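As an illustration, an audit report could count mutations per observation type per day directly from the history. The observationType and time column names below are assumptions.

    -- Hypothetical audit query: number of mutations per observation type per day
    SELECT
        observationType,
        toDate(time) AS day,
        count()      AS mutations
    FROM observationHistory
    GROUP BY observationType, day
    ORDER BY day, observationType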
Message Index
Metadata for all messages is inserted into messageHistory, which is also append-only and likewise a suitable source for reporting and auditing.
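For instance, a reporting query might list the most recent messages registered for a single source. The column names below are assumptions.

    -- Hypothetical reporting query: latest messages for one source
    SELECT *
    FROM messageHistory
    WHERE source = 'my-source'
    ORDER BY time DESC
    LIMIT 100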
Deletes
Internally, deletes are performed by querying and re-inserting rows subject to deletion, assigning isDeleted=true to indicate deletion.
The history thus includes the full set of observations that were deleted, enabling complete auditing and a simple mechanism for restoring deleted observations.
Deletes should typically not be used under normal operating conditions.
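Because a delete is recorded as re-inserted rows with isDeleted=true, the deleted observations remain visible in the history and can serve as the basis for a restore. A sketch, with insertedAt as a placeholder for whatever timestamp column records the mutation:

    -- Hypothetical query: observations soft-deleted during a given period
    SELECT *
    FROM observationHistory
    WHERE isDeleted = true
      AND insertedAt >= '2024-01-01'   -- insertedAt is a placeholder column name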
Nulled States
Another way of removing a state is setting it to null. Different types of state are nulled in different ways (a query sketch follows the list):
- Literal and compound values in valueState are nulled by setting value=null.
- Links in linkState are nulled by setting isLinked=false.
- Nulled geometries in geoState are represented by an empty array of points.
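A nulled state is still an ordinary row, so read queries typically have to exclude it explicitly. A sketch against the state tables named above; the remaining column names are assumptions, and the latest-version/soft-delete filtering described earlier still applies.

    -- Current literal/compound values, excluding nulled entries
    SELECT source, seriesId, time, value
    FROM valueState
    WHERE value IS NOT NULL

    -- Currently linked entries only
    SELECT *
    FROM linkState
    WHERE isLinked = true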
Data Lifecycles
The data lifecycle is controlled by the time-to-live (TTL) configuration of the Datalake instance. TTL is individually configured for each table. By default, schemas define no TTL and all data is kept indefinitely.
Datalake supports a hot/cold architecture, moving blocks of data to colder storage, such as an object store, as observations reach a certain age. This may be configured alongside the regular delete TTL, setting the lifecycle to hot → cold → delete.
Blocks of data residing in cold storage can still be read, but with obvious implications for performance and query latency.
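In ClickHouse terms, such a lifecycle typically corresponds to a table TTL that first moves parts to a cold volume of the storage policy and later deletes them. The table name, intervals, and volume name below are illustrative only; in Datalake this is driven by instance configuration rather than set manually.

    -- Illustrative only: move data to cold storage after 12 months, delete after 36 months
    ALTER TABLE dataPoint
        MODIFY TTL time + INTERVAL 12 MONTH TO VOLUME 'cold',
                   time + INTERVAL 36 MONTH DELETE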
Schema
Datalake maintains a base schema which defines the SQL interface of the service. The schema utilizes materialized views, effectively acting as insert triggers, to delegate and cascade inserts of observations.
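Conceptually, this works like a ClickHouse materialized view with a TO target: rows inserted into one table are transformed and appended to another. The sketch below is a simplified illustration and not the actual base schema.

    -- Illustrative only: forward inserted dataPoints into the common history table
    CREATE MATERIALIZED VIEW dataPointToHistory
    TO observationHistory
    AS
    SELECT
        'dataPoint' AS observationType,   -- placeholder discriminator column
        source,
        seriesId,
        time,
        value
    FROM dataPoint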
Sharding
The schema supports sharding. Individual observations are distributed based on a sharding key with all triggered inserts taking place locally, meaning that all data derived from an observation resides on the same shard.
The sharding key is random by default and may be overridden if the data distribution for the instance is well known. A custom sharding key gives direct and deterministic control over how data is distributed to each shard, which may be exploited for locality constraints at query time.
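In ClickHouse, this kind of distribution is commonly expressed through a Distributed table, where the sharding key defaults to a random expression and can be replaced with a deterministic one. The cluster, database, and table names below are placeholders.

    -- Illustrative only: random distribution (the default behaviour)
    CREATE TABLE dataPoint_dist AS dataPoint_local
        ENGINE = Distributed('my_cluster', 'default', 'dataPoint_local', rand())

    -- Illustrative only: deterministic distribution, co-locating all data for a source
    CREATE TABLE dataPoint_dist_by_source AS dataPoint_local
        ENGINE = Distributed('my_cluster', 'default', 'dataPoint_local', cityHash64(source))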
Schema Management
Database schemas are managed with Flyway and make use of Flyway placeholders populated from instance configuration.
Instances of Datalake may extend the base schema, such as adding another materialized view that performs incremental aggregations to reduce latency of a particular set of queries.
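As a sketch of such an extension, a versioned Flyway migration could add an incrementally aggregated hourly rollup next to the base schema. The migration name, the ${database} placeholder, and the table and column names are assumptions for illustration.

    -- V100__add_hourly_rollup.sql (hypothetical migration)
    CREATE TABLE ${database}.dataPointHourly
    (
        source   String,
        seriesId String,
        hour     DateTime,
        avgValue AggregateFunction(avg, Float64)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY (source, seriesId, hour);

    -- Incrementally maintain the rollup on every insert into dataPoint
    CREATE MATERIALIZED VIEW ${database}.dataPointHourlyMv
    TO ${database}.dataPointHourly
    AS
    SELECT
        source,
        seriesId,
        toStartOfHour(time) AS hour,
        avgState(value)     AS avgValue
    FROM ${database}.dataPoint
    GROUP BY source, seriesId, hour;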
User Schemas and Auxiliary Data
By default, a workspace database, along with database users of different privileges, is included in the deployment.
This provides a space for users to create and manage their own database objects and data, which may read, join, and augment database objects in the base schema.
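For example, a workspace user could maintain a small reference table and expose an enriched view over the base schema. The names below are illustrative only.

    -- Illustrative only: auxiliary data owned and maintained in the workspace database
    CREATE TABLE workspace.meterInfo
    (
        source String,
        region String
    )
    ENGINE = MergeTree
    ORDER BY source;

    -- Join base-schema time series with the auxiliary dataset
    CREATE VIEW workspace.dataPointWithRegion
    AS
    SELECT dp.source, dp.time, dp.value, mi.region
    FROM default.dataPoint AS dp
    LEFT JOIN workspace.meterInfo AS mi ON mi.source = dp.source;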
User schemas or custom extensions to the base schema must be maintained by the administrators of each respective instance, which includes staying in sync with any migrations to the base schema.
This is considered expert use and generally not recommended unless there are special requirements for the database.
It is also highly recommended to keep the datalake.clickhouse.schema configuration static, as changes may require manual intervention from database administrators.
Integration
Data from Utilihive Datalake can be queried using standard drivers and programming language libraries.
ClickHouse supports a SQL dialect close to ANSI SQL with extensions for online analytical processing. Standard drivers and client libraries are available to connect to ClickHouse and run read-only SQL queries.
ODBC
The ODBC driver is available for Windows, macOS, and Linux. Download and install the latest stable driver from the project’s Releases page.
After the driver has been installed, configure the DSN using a ClickHouse HTTPS endpoint. For example:
- Name: assign a name for the DSN
- Host: e.g., clickhouse.domain.com
- Port: e.g., 443
- Database: default (unless otherwise noted)
- User: username
- Password: password
- SSLMode: leave blank
- Other fields: leave blank
JDBC
Perform the following steps to query data using JDBC drivers:
- Download the latest release driver (*-all.jar) from the project’s Releases page. Save the *-all.jar file in a permanent location (e.g., c:\drivers\ on Windows).
- Configure the software that will run queries using a ClickHouse HTTPS endpoint. For example:
  - Class Path: path of the .jar file (e.g., c:\drivers\clickhouse-jdbc-0.6.0-all.jar)
  - Class Name: com.clickhouse.jdbc.ClickHouseDriver
  - Connection URL: set the URL with host and port (e.g., jdbc:ch:https://clickhouse.domain.com:443)
Python Language
Perform the following steps to query data using Python:
- Install the clickhouse-connect package with the following pip command:
    pip install clickhouse-connect
- Import the dependency into your Python script with the following statement:
    import clickhouse_connect
- Create a client and establish a connection to the ClickHouse HTTPS endpoint. For example:
    client = clickhouse_connect.get_client(
        username="<username>",
        password="<password>",
        host="clickhouse.domain.com",
        secure=True,
        port=443,
    )
- Run a query. For example:
    result = client.query("select source, time, value from dataPoint limit 5")
    print(result.result_rows)
R Language
ClickHouse’s HTTP interface can be directly used for SQL queries. Currently, the R library over native connections does not support SSL, and the library over the HTTP interface does not support a read-only mode.
- Start R and install the following packages from CRAN:
    install.packages("httr")
    install.packages("readr")
- Load the required libraries in the following manner:
    library(httr)
    library(readr)
- Set connection parameters using a ClickHouse HTTPS endpoint. For example:
    url <- "https://clickhouse.domain.com:443/query"
    user <- "username"
    pass <- "password"
- Format and run a query. For example:
    # Define a query
    query <- 'select source, time, value from reading limit 5'

    # Run the query
    r <- GET(url, authenticate(user, pass), query = list(query = query))

    # Convert response to data frame
    # Content-Type: text/tab-separated-values
    content <- content(r, "raw")
    result <- read_tsv(content, col_names = FALSE)

    # Set column names
    colnames(result) <- c('source', 'time', 'value')

    # Get data frame spec
    spec(result)

    # Get a single result
    result$value[1]
Other Languages
For information on integrating with other programming languages, check out the ClickHouse documentation on Interfaces and Client Libraries.