Monitor Connect services

Services

Connect consists of several interdependent services that work together to provide the full functionality of the platform.

  • Flow Server - Manages and runs integration flows as the runtime engine.

  • Connect UI and Connect UI Configuration Server - Enable you to manage and monitor deployed integration flows.

  • Insights - Provides integration traces and metrics.

  • Identity and Identity Reconcilers - Manage flow access and enforce flow access authentication.

  • Resource Registry - Stores and serves integration resources such as secrets, schemas, and dynamic tables.

These Connect services are backed by other services:

  • PostgreSQL - Primary persistence provider for all Connect services.

  • MinIO - Side-channel for handling large messages.

  • OpenBao - Manages, stores, and distributes sensitive data for the resource registry.

  • Elasticsearch - Indexes logs and provides integration message trace data to insights.

  • VictoriaMetrics - Provides integration flow metrics.

For Connect to function properly, all services must be running and in a healthy state.

Connect architecture overview

Service Health and Resource Usage

Use the Connect Console and Grafana Dashboards to monitor the health of Connect services. If any service is degraded or unavailable, it can impact the overall health of Connect and its ability to process messages. As a result, the first step in troubleshooting is to verify the health of all services.

All Connect microservices are JVM-based, so memory usage is one of the most important resources to monitor. Memory-related issues often appear as pod restarts caused by Out of Memory (OOM) kills. If you notice a high restart count for any Connect service, investigate the root cause.
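As a sketch, you can spot restart-prone pods and confirm an OOM kill with kubectl; the namespace and pod names below are placeholders to adapt to your installation.

    # List Connect pods and their restart counts
    kubectl get pods -n <connect-namespace>

    # Check the last termination reason of a restarting pod; a reason of OOMKilled confirms an Out of Memory kill
    kubectl describe pod <pod-name> -n <connect-namespace> | grep -A 5 'Last State'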

Integration traffic is the main driver for resource usage across Connect services. Different aspects of the traffic will impact different services and resources. As traffic patterns change, resource usage and performance will also vary across services. High CPU usage can be expected when the system processes a large volume of messages. However, excessive memory usage is generally a sign of a problem and should be investigated. It is important to understand the type and volume of traffic the Connect cluster is expected to handle, and to configure flows accordingly by applying flow throttling parameters. You can monitor integration traffic over time by using the Connect Dashboard for inflight, buffered, and stashed messages.

In some scenarios, the volume of integration traffic may require increasing allocated memory, but this needs to be verified by analyzing the traffic. If a flow is misconfigured and consuming excessive memory, increasing memory allocation is unlikely to resolve the problem and may at best only mask the underlying issue.
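For example, before increasing a service's memory allocation you can compare its live usage against its configured requests and limits; kubectl top requires the Kubernetes metrics server, and the pod and namespace names below are placeholders.

    # Show current CPU and memory usage per pod (requires metrics-server)
    kubectl top pods -n <connect-namespace>

    # Show the configured resource requests and limits of a specific pod
    kubectl get pod <flow-server-pod> -n <connect-namespace> -o jsonpath='{.spec.containers[*].resources}'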

Of all Connect services, the Flow Server's resource usage is the most affected by integration traffic. Traffic also impacts the Flow Server's supporting services:

  • A high number of messages being processed directly affects PostgreSQL and Elasticsearch resources.

  • Large messages that exceed the side-channel threshold impact MinIO resources because they are written to the side-channel.

If any other Connect services are running out of memory or consistently using high CPU, contact Connect support.

GridOS Connect Container Debugging

Starting with Connect version 1.24.0, all Java-based services use Chainguard "distroless" base images, which means the containers have no shell to drop into when one is needed, for instance during a root cause analysis. If a shell is needed to debug a live service container, it is recommended to use kubectl debug, as shown in the sketch below.
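A minimal sketch of attaching an ephemeral debug container with a shell to a running pod follows; the pod name, target container name, namespace, and busybox image are placeholders to adapt to your deployment.

    # Attach an ephemeral container with a shell to the running pod
    kubectl debug -it <flow-server-pod> -n <connect-namespace> --image=busybox --target=<flow-server-container> -- sh

Where the container runtime supports it, the --target flag shares the target container's process namespace with the debug container, so you can inspect the service's processes from the debug shell.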

The Java-based Connect images are:

  • Flow Server

  • Identity

  • Insights

  • Resource Registry

The frontend services rely on NodeJS. In the near future, the frontend NodeJS services will also use Chainguard as their base container image.