Troubleshooting

This section helps you identify, diagnose, and resolve common issues in GridOS Connect.

1. Access

  • If you have authentication problems when accessing services, or if your session has timed out, sign in to Zitadel with the credentials configured in your identity provider (IdP), and then force-refresh the service web page.

  • If you have trouble accessing any service, delete all cookies, local storage, and cached data for the site. Alternatively, use the incognito (private) mode of your web browser.

2. Load Balancers and Reverse Proxies

Connect does not require any additional load balancer configuration. The load balancer works as long as it points to port 443 on the VMs, preserves the Host header in HTTP requests (for application load balancers), and all required DNS names (zitadel, service, console, api) point to the load balancer.
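The requirements above can be spot-checked from a workstation. The following is a minimal sketch, not a product tool: the <name>.<base-domain> hostname scheme, the helper names connect_hostnames and check_via_lb, and the example values are assumptions. It relies on curl's --resolve option to send each request to the load balancer address while keeping the original host name in the request.

```shell
# Hypothetical helpers; adjust the hostname scheme to your environment.

# Print the four DNS names listed above for a given base domain.
# Assumption: the names follow the pattern <name>.<base-domain>.
connect_hostnames() {
  local base_domain="$1"
  local name
  for name in zitadel service console api; do
    printf '%s.%s\n' "$name" "$base_domain"
  done
}

# Request https://<host>/ through the load balancer at <lb_ip> on port 443,
# keeping <host> as the HTTP Host header and TLS server name.
# Prints the HTTP status code (000 if the connection failed).
check_via_lb() {
  local host="$1" lb_ip="$2"
  curl --silent --insecure --output /dev/null \
       --max-time 10 \
       --write-out '%{http_code}\n' \
       --resolve "${host}:443:${lb_ip}" \
       "https://${host}/"
}

# Usage (example.com and 203.0.113.10 are placeholders):
#   for h in $(connect_hostnames example.com); do
#     echo "$h -> $(check_via_lb "$h" 203.0.113.10)"
#   done
```

A status of 000 for a name suggests that the load balancer is not reachable on port 443. An unexpected 404 may indicate that the Host header is not being preserved.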

3. Flow-traces/Logs

This issue produces the following error message on the Overview page of the GridOS Connect Console:

co.elastic.clients.elasticsearch._types.ElasticsearchException: [es/search] failed: [index_not_found_exception] no such index [flowserver-logs]

This is a known issue with the fluent-bit Operator: after new custom resources (clusterInput, clusterFilter, and clusterOutput) are added, fluent-bit cannot fetch flowserver logs.

If flowserver is running as expected and flows have been deployed, but no flow traces appear on the "Flow traces" tab of the GridOS Connect Console, fluent-bit is most likely unable to fetch the flowserver logs.
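Before restarting anything, it can help to confirm that both components are actually running. The following is a minimal sketch that reuses the namespaces and labels from the workaround commands in this section; the helper names and the KUBECTL override variable are hypothetical conveniences, not part of the product.

```shell
# Allow the kubectl binary to be overridden, for example for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# List the flowserver pods; all of them should be Running and Ready.
show_flowserver_pods() {
  "$KUBECTL" get pods -n foundation-env-default \
    -l app.kubernetes.io/name=flowserver -o wide
}

# List the fluent-bit pods; all of them should be Running and Ready.
show_fluent_bit_pods() {
  "$KUBECTL" get pods -n foundation-cluster-monitoring \
    -l app.kubernetes.io/name=fluent-bit -o wide
}

# Print the most recent fluent-bit log lines to look for input or output errors.
show_fluent_bit_logs() {
  "$KUBECTL" logs -n foundation-cluster-monitoring \
    -l app.kubernetes.io/name=fluent-bit --tail=50
}

# Usage:
#   show_flowserver_pods
#   show_fluent_bit_pods
#   show_fluent_bit_logs
```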

Workaround:

  • Restart the fluent-bit pods:

    kubectl delete po -n foundation-cluster-monitoring -l app.kubernetes.io/name=fluent-bit

  • Restart the flowserver pods:

    kubectl delete po -n foundation-env-default -l app.kubernetes.io/name=flowserver
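After deleting the pods, you may want to wait until their replacements are ready before checking the "Flow traces" tab again. The following is a minimal sketch using kubectl wait; the helper name wait_for_ready, the KUBECTL override variable, and the 180-second timeout are assumptions chosen for illustration. If the replacement pods have not been created yet, kubectl wait can report that no matching resources were found; in that case, run it again after a few seconds.

```shell
# Allow the kubectl binary to be overridden, for example for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every pod matching <label> in <namespace> reports Ready,
# or until the timeout (an arbitrary example value) expires.
wait_for_ready() {
  local namespace="$1" label="$2"
  "$KUBECTL" wait --for=condition=Ready pod \
    -n "$namespace" -l "$label" --timeout=180s
}

# Usage:
#   wait_for_ready foundation-cluster-monitoring app.kubernetes.io/name=fluent-bit
#   wait_for_ready foundation-env-default app.kubernetes.io/name=flowserver
```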

4. Prevent Log Duplication of Connect Flow Server

The default Foundation fluent-bit clusterOutput configuration leads to log duplication across Elasticsearch data streams. Each log event is stored multiple times, which increases storage usage and might impact query performance.

Specifically, for each log event indexed as a document in Elasticsearch from Connect flow-server or other Connect services, the following happens:

  • Two copies are stored in the logs data stream if the log event originates from any Connect service, including flow-server. This happens because the default Foundation fluent-bit configuration uses a catch-all clusterInput that collects logs from all containers, including Connect services, and sends them to the logs data stream in Elasticsearch.

  • One copy is stored in the flowserver-logs data stream if the log event originates from the Connect flow-server.

  • One copy is stored in the connect-services data stream if the log event originates from a Connect service other than flow-server.

As a result, each Connect-related log event can be stored up to three times.

Solution:
Connect flow-server logs only need to be retained in the flowserver-logs data stream, and logs from other Connect services only in the connect-services data stream. Exclude both from the logs data stream to remove the redundant copies and reduce storage usage.

The Foundation clusterInput exclusion prevents the catch-all Foundation log collector from reading Connect container logs, while the catch-all Elasticsearch clusterOutput update ensures Connect logs are not written to the default logs data stream.

Apply both changes below:

  • Exclude Connect logs from the default Foundation clusterInput:

    Make sure that the Foundation monitoring-apps Helm release in your environment has been redeployed with the following Foundation Helm overlay values. This configuration adds the "*connect-*" pattern to the extraExcludedContainers field of the catch-all Foundation clusterInput, which stops logs from all Connect containers, including flow-server, from being collected into the default logs data stream.

    ...
    fluent-bit:
      enabled: true
      fluentBit:
        input:
          extraExcludedContainers: "*connect-*"
    ...

  • Update the catch-all Elasticsearch clusterOutput:

    The following steps assume that the clusteroutput.fluentbit.fluent.io/elasticsearch resource has not been modified from the default Foundation configuration.

    1. Fetch the elasticsearch clusterOutput configuration

      kubectl get clusteroutputs elasticsearch -o yaml

    2. Preview the modified elasticsearch clusterOutput configuration (this command only prints the result and changes nothing in the cluster)

      kubectl get clusteroutput elasticsearch -o yaml \
        | yq e '
            del(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration") |
            del(.metadata.generation) |
            del(.metadata.resourceVersion) |
            del(.metadata.uid) |
            del(.metadata.creationTimestamp)
          ' - \
        | yq e '
            .spec |= (
              select(has("match")) |
              .matchRegex = "^(?!flowserver.*)(?!connect.*).*" |
              del(.match)
            )
          ' -
    3. Compare the outputs

      Compare the output from steps 1 and 2 and verify that it matches the Original and Modified configurations below. In the modified configuration, the match field is replaced with matchRegex.

      Original:

      apiVersion: fluentbit.fluent.io/v1alpha2
      kind: ClusterOutput
      metadata:
        annotations:
          meta.helm.sh/release-name: monitoring-apps
          meta.helm.sh/release-namespace: foundation-cluster-monitoring
        labels:
          app.kubernetes.io/managed-by: Helm
          fluentbit.fluent.io/enabled: "true"
        name: elasticsearch
      spec:
        es:
          generateID: true
          host: monitoring-apps-es-client
          index: logs
          logstashFormat: false
          port: 9200
          replaceDots: true
          suppressTypeName: "On"
          timeKey: '@timestamp'
          traceError: false
          traceOutput: false
          type: _doc
        match: '*'
        retry_limit: "False"

      Modified:

      apiVersion: fluentbit.fluent.io/v1alpha2
      kind: ClusterOutput
      metadata:
        annotations:
          meta.helm.sh/release-name: monitoring-apps
          meta.helm.sh/release-namespace: foundation-cluster-monitoring
        labels:
          app.kubernetes.io/managed-by: Helm
          fluentbit.fluent.io/enabled: "true"
        name: elasticsearch
      spec:
        es:
          generateID: true
          host: monitoring-apps-es-client
          index: logs
          logstashFormat: false
          port: 9200
          replaceDots: true
          suppressTypeName: "On"
          timeKey: '@timestamp'
          traceError: false
          traceOutput: false
          type: _doc
        matchRegex: ^(?!flowserver.*)(?!connect.*).*
        retry_limit: "False"

      If your output is consistent with the configurations above, run the following commands to exclude Connect-related tags from the elasticsearch clusterOutput.

    4. Create a YAML file for the modified clusterOutput resource

      kubectl get clusteroutput elasticsearch -o yaml \
        | yq e '
            del(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration") |
            del(.metadata.generation) |
            del(.metadata.resourceVersion) |
            del(.metadata.uid) |
            del(.metadata.creationTimestamp)
          ' - \
        | yq e '
            .spec |= (
              select(has("match")) |
              .matchRegex = "^(?!flowserver.*)(?!connect.*).*" |
              del(.match)
            )
          ' - \
        > modified-clusteroutput-es.yaml

    5. Delete the old clusterOutput resource

      kubectl delete clusteroutput elasticsearch

    6. Apply the modified clusterOutput resource

      kubectl apply -f modified-clusteroutput-es.yaml

    7. Restart fluent-bit pods

      kubectl delete po -n foundation-cluster-monitoring -l app.kubernetes.io/name=fluent-bit
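To see what the two patterns used above select, they can be tried against sample values locally. The following is a minimal sketch under stated assumptions: grep -P (PCRE, available in GNU grep) is assumed to evaluate the negative lookaheads the same way the Fluent Bit regex engine does, the "*connect-*" pattern is assumed to follow shell-style wildcard semantics, and the helper names, tags, and container names are hypothetical.

```shell
# Return success if <tag> would still be written by the catch-all output,
# that is, if it matches the matchRegex value used in the steps above.
routed_to_logs() {
  printf '%s\n' "$1" | grep -qP '^(?!flowserver.*)(?!connect.*).*'
}

# Return success if <container> matches the *connect-* exclusion pattern.
excluded_by_input() {
  case "$1" in
    *connect-*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage (all sample values are made up):
#   routed_to_logs kube.example        && echo "written to logs"
#   routed_to_logs flowserver.example  || echo "excluded from logs"
#   excluded_by_input my-connect-agent && echo "not collected by catch-all input"
#
# To confirm that the change was applied in the cluster:
#   kubectl get clusteroutput elasticsearch -o yaml | yq e '.spec.matchRegex' -
```

The regular expression rejects any tag that starts with flowserver or connect and accepts everything else, so only non-Connect logs continue to reach the logs data stream.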

5. Connect-PostgreSQL Fails to Install

If you install Connect on a Foundation version earlier than 25R01, the connect-postgresql Helm chart fails with the following error:

Error: UPGRADE FAILED: cannot patch "connect-postgresql" with kind postgresql: postgresql.acid.zalan.do "connect-postgresql" is invalid: spec.postgresql.version: Unsupported value: "17": supported values: "10", "11", "12", "13", "14", "15"

Resolution Options:

  • Upgrade Foundation to 25R01+

    Upgrade your environment to Foundation version 25R01 or later, and then run apply-helmfile to redeploy GridOS connect-postgresql with the latest supported version.

  • Downgrade the connect-postgresql chart

    Edit helmfile.yaml.gotmpl in the root directory of the Helm Deployment Template. Set the connect-postgresql chart version to 0.5.0, and then run apply-helmfile to redeploy GridOS connect-postgresql.

...
- name: connect-postgresql
  <<: *defaults
  chart: connect-helm/connect-postgresql
  version: 0.5.0
  values:
  - ./connect-postgresql/values.yaml
  # - ./offline-image-overrides.yaml
...
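After redeploying, the installed chart version can be checked with Helm. The following is a minimal sketch, assuming the Helm release is named connect-postgresql as in the helmfile entry above; the helper name and the HELM override variable are hypothetical.

```shell
# Allow the helm binary to be overridden, for example for a dry run.
HELM="${HELM:-helm}"

# List the connect-postgresql release across all namespaces.
# After the downgrade, the CHART column should read connect-postgresql-0.5.0.
show_postgresql_release() {
  "$HELM" list --all-namespaces --filter '^connect-postgresql$'
}

# Usage:
#   show_postgresql_release
```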

6. connect-identityreconciler Pod Fails to Start Due to Renamed APISIX Secret

Connect release 1.20.0 includes updates to support changes in APISIX secrets introduced in Foundation version 25r09.

Symptom: The connect-identityreconciler pod may fail to start and log one of the following errors:

  • When deploying Connect from scratch:

    MountVolume.SetUp failed for volume "secrets": secret "connect-identityreconciler-apisix-api-token" not found

  • When upgrading Connect:

    Getting upstream 'connect-flowserver-service' failed with status: 401 Unauthorized

Root Cause: In Foundation version 25r09, an APISIX secret was renamed. Connect relies on this secret, so if Connect and Foundation are on mismatched versions, the secret name does not align. This can cause the pod to fail at startup or result in service requests being rejected.

Compatibility Overview: The following matrix shows the required actions based on the versions of Foundation and Connect:

Foundation version | Connect version   | Action
-------------------+-------------------+---------------------------------------------------
25r09 or later     | 1.20.0 or later   | No action required; the setup is fully compatible.
Older than 25r09   | 1.20.0 or later   | Apply the override described in Solution.
25r09 or later     | Older than 1.20.0 | Apply the override described in Solution.

Solution:

In connect/values.yaml, add the override that matches your version combination under the identityreconciler block:

  • Foundation version older than 25r09 and Connect version 1.20.0 or later:

    # Override for Foundation older than 25r09 and Connect 1.20.0+
    secretManager:
      secrets:
        apisix-api-token:
          name: apisix-control-plane-api-token

  • Foundation version 25r09 or later and Connect version older than 1.20.0:

    # Override for Foundation 25r09+ and Connect version older than 1.20.0
    secretManager:
      secrets:
        apisix-api-token:
          name: apisix-control-plane-token
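The compatibility matrix can also be expressed as a small decision helper. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and it assumes that Foundation versions follow the 25r09-style naming used above and that GNU sort -V orders both version schemes correctly. The combination of an older Foundation and an older Connect is not listed in the matrix, so the sketch reports it as "unlisted" rather than guessing.

```shell
# Return success if version $1 is strictly older than version $2.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print the APISIX secret name to use in the override, "none" if no
# override is needed, or "unlisted" if the matrix does not cover the case.
# Arguments: <foundation-version> <connect-version>, for example: 25r09 1.20.0
apisix_secret_override() {
  local foundation connect old_foundation=no old_connect=no
  foundation="$(printf '%s' "$1" | tr 'A-Z' 'a-z')"
  connect="$2"
  if version_lt "$foundation" 25r09; then old_foundation=yes; fi
  if version_lt "$connect" 1.20.0; then old_connect=yes; fi

  if [ "$old_foundation" = yes ] && [ "$old_connect" = yes ]; then
    echo unlisted
  elif [ "$old_foundation" = yes ]; then
    echo apisix-control-plane-api-token
  elif [ "$old_connect" = yes ]; then
    echo apisix-control-plane-token
  else
    echo none
  fi
}

# Usage:
#   apisix_secret_override 25r01 1.20.0
```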