Persistence
Datalake is an accelerator specific to power utilities. When enabled, it can be accessed via the menu button in the top-left corner of the Utilihive Console.
Datalake uses an object store for raw file storage and a columnar database for storing tabular data.
Object Store
All file objects, such as raw messages, archived or exported tabular data, workflow outputs, and backups, are stored in an object store.
The object store is not maintained by the Datalake service and must be provided by the platform. For installations on cloud platforms, Datalake provisions and uses the object store offered by the cloud provider.
Tabular Storage
Datalake stores tabular data in ClickHouse across two databases:
- Time series data derived from core data models is stored in the default database and is fully managed by Datalake. All tables and columns are documented through comments, which can be inspected as shown in the query after this list.
- Auxiliary data may be stored in a separate workspace database, allowing users to enrich and join time series data with custom datasets.
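Because the base schema is documented through comments, the available tables and columns can be inspected directly from ClickHouse's system tables. A minimal sketch, using the dataPoint table that appears in the query examples later on this page:

    -- List the columns and their documentation for one table in the default database
    SELECT name, type, comment
    FROM system.columns
    WHERE database = 'default'
      AND table = 'dataPoint'
    ORDER BY position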
Time Series State Tables
The state of each observation type is managed in a separate table with schema semantics for deduplication and soft-deleting entries.
For instance, when posting dataPoints on a source and seriesId that already contains data on the given time index, the new dataPoints replace the state of the previous ones.
The state is eventually consistent: queries that target fresh data and need to operate on the correct state must select the latest version of each row and filter out rows marked as deleted.
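For example, a query over fresh dataPoint data might collapse each key to its latest version and then drop soft-deleted rows. This is only a sketch: the version column and the exact key columns are assumptions and should be verified against the table comments.

    -- Keep only the latest version of each (source, seriesId, time) key,
    -- then filter out rows marked as deleted.
    SELECT source, seriesId, time, value
    FROM
    (
        SELECT
            source,
            seriesId,
            time,
            argMax(value, version)     AS value,
            argMax(isDeleted, version) AS isDeleted
        FROM dataPoint
        GROUP BY source, seriesId, time
    )
    WHERE isDeleted = false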
Each table may have its own set of derived tables (ClickHouse materialized views) for query optimization, or be further extended at query time by regular views.
History of All Data Mutations
All time series types are represented by a base observation type (as a sum/union type), and all data mutations are stored in the common observationHistory table.
This table is append-only and may serve as a basis for reporting, auditing, event sourcing and subscriptions.
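As an illustration, an audit report could count mutations per observation type per day directly from the history. The observationType and time column names below are assumptions.

    -- Hypothetical audit query: number of mutations per observation type per day
    SELECT
        observationType,
        toDate(time) AS day,
        count()      AS mutations
    FROM observationHistory
    GROUP BY observationType, day
    ORDER BY day, observationType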
Message Index
Metadata for all messages is inserted into messageHistory, which is also append-only and likewise a suitable source for reporting and auditing.
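For instance, a reporting query might list the most recent messages registered for a single source. The column names below are assumptions.

    -- Hypothetical reporting query: latest messages for one source
    SELECT *
    FROM messageHistory
    WHERE source = 'my-source'
    ORDER BY time DESC
    LIMIT 100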
Deletes
Internally, deletes are performed by querying and re-inserting rows subject to deletion, assigning isDeleted=true to indicate deletion.
The history thus includes the full set of observations that were deleted, enabling complete auditing and a simple mechanism for restoring deleted observations.
Deletes should typically not be used under normal operating conditions.
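Because a delete is recorded as re-inserted rows with isDeleted=true, the deleted observations remain visible in the history and can serve as the basis for a restore. A sketch, with insertedAt as a placeholder for whatever timestamp column records the mutation:

    -- Hypothetical query: observations soft-deleted during a given period
    SELECT *
    FROM observationHistory
    WHERE isDeleted = true
      AND insertedAt >= '2024-01-01'   -- insertedAt is a placeholder column name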
Nulled States
Another way of removing a state is setting it to null. Different types of state are nulled in different ways (a query sketch follows the list):
- Literal and compound values in valueState are nulled by setting value=null.
- Links in linkState are nulled by setting isLinked=false.
- Nulled geometries in geoState are represented by an empty array of points.
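A nulled state is still an ordinary row, so read queries typically have to exclude it explicitly. A sketch against the state tables named above; the remaining column names are assumptions, and the latest-version/soft-delete filtering described earlier still applies.

    -- Current literal/compound values, excluding nulled entries
    SELECT source, seriesId, time, value
    FROM valueState
    WHERE value IS NOT NULL

    -- Currently linked entries only
    SELECT *
    FROM linkState
    WHERE isLinked = true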
Data Lifecycles
The data lifecycle is controlled by the time-to-live (TTL) configuration of the Datalake instance. TTL is individually configured for each table. By default, schemas define no TTL and all data is kept indefinitely.
Datalake supports a hot/cold architecture, moving blocks of data to colder storage, such as an object store, as observations reach a certain age. This may be configured alongside the regular delete TTL, setting the lifecycle to hot → cold → delete.
Blocks of data residing in cold storage can still be read, but with obvious implications for performance and query latency.
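In ClickHouse terms, such a lifecycle typically corresponds to a table TTL that first moves parts to a cold volume of the storage policy and later deletes them. The table name, intervals, and volume name below are illustrative only; in Datalake this is driven by instance configuration rather than set manually.

    -- Illustrative only: move data to cold storage after 12 months, delete after 36 months
    ALTER TABLE dataPoint
        MODIFY TTL time + INTERVAL 12 MONTH TO VOLUME 'cold',
                   time + INTERVAL 36 MONTH DELETE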
Schema
Datalake maintains a base schema which defines the SQL interface of the service. The schema utilizes materialized views, effectively acting as insert triggers, to delegate and cascade inserts of observations.
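Conceptually, this works like a ClickHouse materialized view with a TO target: rows inserted into one table are transformed and appended to another. The sketch below is a simplified illustration and not the actual base schema.

    -- Illustrative only: forward inserted dataPoints into the common history table
    CREATE MATERIALIZED VIEW dataPointToHistory
    TO observationHistory
    AS
    SELECT
        'dataPoint' AS observationType,   -- placeholder discriminator column
        source,
        seriesId,
        time,
        value
    FROM dataPoint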
Sharding
The schema supports sharding. Individual observations are distributed based on a sharding key with all triggered inserts taking place locally, meaning that all data derived from an observation resides on the same shard.
The sharding key is random by default and may be overridden if the data distribution for the instance is well known. A custom sharding key gives direct and deterministic control over how data is distributed to each shard, which may be exploited for locality constraints at query time.
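In ClickHouse, this kind of distribution is commonly expressed through a Distributed table, where the sharding key defaults to a random expression and can be replaced with a deterministic one. The cluster, database, and table names below are placeholders.

    -- Illustrative only: random distribution (the default behaviour)
    CREATE TABLE dataPoint_dist AS dataPoint_local
        ENGINE = Distributed('my_cluster', 'default', 'dataPoint_local', rand())

    -- Illustrative only: deterministic distribution, co-locating all data for a source
    CREATE TABLE dataPoint_dist_by_source AS dataPoint_local
        ENGINE = Distributed('my_cluster', 'default', 'dataPoint_local', cityHash64(source))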
Schema Management
Database schemas are managed with Flyway and make use of Flyway placeholders populated from instance configuration.
Instances of Datalake may extend the base schema, such as adding another materialized view that performs incremental aggregations to reduce latency of a particular set of queries.
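As a sketch of such an extension, a versioned Flyway migration could add an incrementally aggregated hourly rollup next to the base schema. The migration name, the ${database} placeholder, and the table and column names are assumptions for illustration.

    -- V100__add_hourly_rollup.sql (hypothetical migration)
    CREATE TABLE ${database}.dataPointHourly
    (
        source   String,
        seriesId String,
        hour     DateTime,
        avgValue AggregateFunction(avg, Float64)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY (source, seriesId, hour);

    -- Incrementally maintain the rollup on every insert into dataPoint
    CREATE MATERIALIZED VIEW ${database}.dataPointHourlyMv
    TO ${database}.dataPointHourly
    AS
    SELECT
        source,
        seriesId,
        toStartOfHour(time) AS hour,
        avgState(value)     AS avgValue
    FROM ${database}.dataPoint
    GROUP BY source, seriesId, hour;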
User Schemas and Auxiliary Data
By default, a workspace database, along with database users of different privileges, is included in the deployment.
This provides a space for users to create and manage their own database objects and data, which may read, join, and augment database objects in the base schema.
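For example, a workspace user could maintain a small reference table and expose an enriched view over the base schema. The names below are illustrative only.

    -- Illustrative only: auxiliary data owned and maintained in the workspace database
    CREATE TABLE workspace.meterInfo
    (
        source String,
        region String
    )
    ENGINE = MergeTree
    ORDER BY source;

    -- Join base-schema time series with the auxiliary dataset
    CREATE VIEW workspace.dataPointWithRegion
    AS
    SELECT dp.source, dp.time, dp.value, mi.region
    FROM default.dataPoint AS dp
    LEFT JOIN workspace.meterInfo AS mi ON mi.source = dp.source;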
User schemas or custom extensions to the base schema must be maintained by the administrators of each respective instance, which includes staying in sync with any migrations to the base schema.
This is considered expert use and generally not recommended unless there are special requirements for the database.
It is also highly recommended to keep the datalake.clickhouse.schema configuration static, as changes may require manual intervention from database administrators.
Integration
Data from Utilihive Datalake can be queried using standard drivers and programming language libraries.
ClickHouse supports a SQL dialect close to ANSI SQL with extensions for online analytical processing. Standard drivers and client libraries are available to connect to ClickHouse and run read-only SQL queries.
ODBC
The ODBC driver is available for Windows, macOS, and Linux. Download and install the latest stable driver from the project’s Releases page.
After the driver has been installed, configure the DSN using a ClickHouse HTTPS endpoint. For example:
- Name: assign a name for the DSN
- Host: e.g., clickhouse.domain.com
- Port: e.g., 443
- Database: default (unless otherwise noted)
- User: username
- Password: password
- SSLMode: leave blank
- Other fields: leave blank
JDBC
Perform the following steps to query data using JDBC drivers:
- Download the latest release driver (*-all.jar) from the project’s Releases page. Save the *-all.jar file in a permanent location (e.g., c:\drivers\ on Windows).
- Configure the software that will run queries using a ClickHouse HTTPS endpoint. For example:
  - Class Path: path of the .jar file (e.g., c:\drivers\clickhouse-jdbc-0.6.0-all.jar)
  - Class Name: com.clickhouse.jdbc.ClickHouseDriver
  - Connection URL: set the URL with host and port (e.g., jdbc:ch:https://clickhouse.domain.com:443)
Python Language
Perform the following steps to query data using Python:
- Install the clickhouse-connect package with the following pip command:
    pip install clickhouse-connect
- Import the dependency into your Python script with the following statement:
    import clickhouse_connect
- Create a client and establish a connection to the ClickHouse HTTPS endpoint. For example:
    client = clickhouse_connect.get_client(
        username="<username>",
        password="<password>",
        host="clickhouse.domain.com",
        secure=True,
        port=443,
    )
- Run a query. For example:
    result = client.query("select source, time, value from dataPoint limit 5")
    print(result.result_rows)
R Language
ClickHouse’s HTTP interface can be directly used for SQL queries. Currently, the R library over native connections does not support SSL, and the library over the HTTP interface does not support a read-only mode.
- Start R and install the following packages from CRAN:
    install.packages("httr")
    install.packages("readr")
- Load the required libraries in the following manner:
    library(httr)
    library(readr)
- Set connection parameters using a ClickHouse HTTPS endpoint. For example:
    url <- "https://clickhouse.domain.com:443/query"
    user <- "username"
    pass <- "password"
- Format and run a query. For example:
    # Define a query
    query <- 'select source, time, value from reading limit 5'

    # Run the query
    r <- GET(url, authenticate(user, pass), query = list(query = query))

    # Convert response to data frame
    # Content-Type: text/tab-separated-values
    content <- content(r, "raw")
    result <- read_tsv(content, col_names = FALSE)

    # Set column names
    colnames(result) <- c('source', 'time', 'value')

    # Get data frame spec
    spec(result)

    # Get a single result
    result$value[1]
Other Languages
For information on integrating with other programming languages, check out the ClickHouse documentation on Interfaces and Client Libraries.