The AIOps agent is installed on a client Kubernetes cluster to collect data from the system and send it to AIOps central.
Installation
Prerequisites
Compatibility matrix
aiops-agent | Kubernetes 1.23 | Kubernetes 1.24 |
---|---|---|
v1.5.6 | ✓ | ✓ |
v1.5.7 | ✓ | ✓ |
We do our best to test compatibility; nevertheless, the agent may not work as expected due to:
- breaking changes between minor versions of Kubernetes,
- vendor-specific configurations or modifications.
The matrix should be treated as best effort. In case of problems contact our support.
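Before installing, it can help to compare the cluster's version against the matrix. The sketch below is only an illustration: the function name `is_tested_k8s_version` is hypothetical, and the accepted minor versions are copied from the table above.

```shell
#!/bin/sh
# Returns 0 when the given Kubernetes version belongs to a minor release
# listed in the compatibility matrix (1.23 or 1.24), 1 otherwise.
# The function name is illustrative and not part of the AIOps agent.
is_tested_k8s_version() {
  case "$1" in
    v1.23|v1.24|1.23|1.24) return 0 ;;
    v1.23.*|v1.24.*|1.23.*|1.24.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage, assuming kubectl and jq are installed:
#   version="$(kubectl version -o json | jq -r '.serverVersion.gitVersion')"
#   is_tested_k8s_version "$version" || echo "untested Kubernetes version: $version"
```

A passing check only means the minor version appears in the matrix; vendor-specific modifications can still cause problems.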
Obtaining Client Secret
To obtain a Client Secret, contact our Support to join the early participation program.
Service mesh
A service mesh is required for AIOps to work. Go to the Service mesh section of our integrations documentation for details on the supported service mesh technologies.
Helm
To install the AIOps agent in its default configuration, issue the following helm command with CLIENT_ID, CLIENT_SECRET, and TENANT replaced with the provided values:
helm upgrade --install --create-namespace aiops-agent https://storage.googleapis.com/public-charts/aiops-agent-1.5.7.tgz -n aiops \
--set global.clientId=CLIENT_ID \
--set global.clientSecret=CLIENT_SECRET \
--set global.tenant=TENANT \
--set global.site=SITE \
--set tags.linkerd=true
By default, the agent is installed in the aiops namespace. Use the -n switch to change it.
The SITE parameter is optional.
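The same parameters can also be kept in a values file instead of being passed as individual --set flags. The following is a minimal sketch, assuming the dotted parameter names map to nested YAML keys as Helm normally does; the file name `aiops-values.yaml` and the function name `install_aiops_agent` are illustrative.

```shell
#!/bin/sh
# Sketch: write the install parameters to a values file.
# CLIENT_ID, CLIENT_SECRET, TENANT and SITE default to placeholders and
# must be replaced with the provided values (site is optional).
VALUES_FILE="${VALUES_FILE:-aiops-values.yaml}"

cat > "$VALUES_FILE" <<EOF
global:
  clientId: "${CLIENT_ID:-CLIENT_ID}"
  clientSecret: "${CLIENT_SECRET:-CLIENT_SECRET}"
  tenant: "${TENANT:-TENANT}"
  site: "${SITE:-SITE}"
tags:
  linkerd: true
EOF

# Installs or upgrades the release using the generated file.
# Defined here but not called automatically.
install_aiops_agent() {
  helm upgrade --install --create-namespace aiops-agent \
    https://storage.googleapis.com/public-charts/aiops-agent-1.5.7.tgz \
    -n aiops -f "$VALUES_FILE"
}
```

Calling `install_aiops_agent` afterwards should be equivalent to the command above. Because the file contains the client secret, it should be stored with care.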
Authorization
To send data to the co.brick AIOps central system, the agent must authenticate using the OAuth2 standard with the clientId and clientSecret provided by the co.brick AIOps team. To obtain credentials, contact the co.brick AIOps team. All co.brick AIOps elements use the same credentials, which are configured in the .global section of values.yaml.
The AIOps agent also supports providing the clientSecret as a file. The secret must be created (or already available) before the agent is deployed. For example:
kubectl create secret generic grafana-agent-secret --from-literal=client_secret=EXAMPLE_CLIENT_SECRET -n aiops
Additionally, the secret name must be added to global.SecretName in values.yaml. For more information, see values.yaml.
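If the client secret is already stored in a local file, the Kubernetes secret can be created from that file rather than from a literal, which keeps the value out of the shell history. This is a sketch only: the function name `create_client_secret` and the file path are hypothetical, and it assumes the aiops namespace already exists (for example, created with kubectl create namespace aiops).

```shell
#!/bin/sh
# Creates the grafana-agent-secret secret from a local file.
# $1: path to a file containing only the client secret (hypothetical path).
create_client_secret() {
  secret_file="$1"
  # Refuse to continue when the file is missing or empty.
  if [ ! -s "$secret_file" ]; then
    echo "secret file '$secret_file' is missing or empty" >&2
    return 1
  fi
  # --from-file=KEY=PATH stores the file content under the key client_secret.
  kubectl create secret generic grafana-agent-secret \
    --from-file=client_secret="$secret_file" -n aiops
}

# Possible usage:
#   create_client_secret ./client_secret.txt
```

The secret name still has to be referenced through global.SecretName as described above.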
Network Policy
If enabled (networkPolicy.create=true), aiops-agent installs a deny-all policy and a policy that allows metrics-compressor to collect metrics from the services.
Data persistence
The aiops-agent uses write-ahead logging (WAL) to mitigate data loss when sending data to the cloud system. If data transmission fails for any reason, the aiops-agent stores the data in the pod's storage, where it is retained for 10 minutes by default. Because no Persistent Volumes (PV) are used, this storage is volatile: the WAL does not survive pod restarts, and any data it holds is lost when the pod restarts.
Scraping metrics
At co.brick, we understand the importance of metrics in monitoring and managing your Kubernetes applications. Our co.brick observe service is designed to automatically gather a set of default metrics that are crucial for maintaining the health and performance of your applications.
Our team of experts has carefully selected these default metrics based on their relevance and usefulness in improving DevOps processes. These metrics provide valuable insights into various aspects of your applications, such as resource usage, error rates, request latencies, and more. They allow you to monitor your applications' performance, identify potential issues early, and make informed decisions about scaling and resource allocation.
By automatically collecting these metrics, co.brick observe saves you the time and effort of having to configure and manage your own metrics collection. You can focus on what matters most - developing and deploying your applications - while we take care of the monitoring.
Advanced options
The following parameters are configurable in helm:
Parameter | Default value | Description |
---|---|---|
tags.linkerd | true | Enables Linkerd. |
tags.dataplane-v2 | false | Enables dataplane-v2 integration. |
tags.cilium | false | Enables Cilium integration. |
networkPolicy.create | false | Creates a network policy in the installation namespace. |
grafanaAgent.globalScrapeInterval | 15s | How frequently Grafana Agent scrapes metrics. |
global.tenant | None | Tenant name. |
global.site | None | Site name (e.g. stage, prod). |
global.clientId | None | Client Id. |
global.clientSecret | None | Client Secret. |
global.compressionEnabled | false | Whether metrics are sent compressed. |
hubble-metrics-exporter.hubbleRelay | aiops-agent-hubble-gke-exporter | URL of deployed hubble-relay, required only for dataplane-v2. |
hubble-metrics-exporter.hubbleRelay | 80 | Port of deployed hubble-relay, required only for dataplane-v2. |
global.scrapeConfigFile | | Allows injecting a custom Prometheus scrape config. |
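As an illustration of how these parameters might be overridden on an existing installation, the sketch below writes a small overrides file and defines a helper that applies it. The chosen values are examples rather than recommendations, and the file name `aiops-overrides.yaml` and function name `apply_overrides` are hypothetical. It assumes the release was installed as aiops-agent in the aiops namespace.

```shell
#!/bin/sh
# Sketch: override selected defaults from the parameter table.
# Dotted parameter names become nested YAML keys.
OVERRIDES_FILE="${OVERRIDES_FILE:-aiops-overrides.yaml}"

cat > "$OVERRIDES_FILE" <<'EOF'
networkPolicy:
  create: true
grafanaAgent:
  globalScrapeInterval: 30s
global:
  compressionEnabled: true
EOF

# Applies the overrides while keeping previously set values
# (client id, secret, tenant). Defined here but not called automatically.
apply_overrides() {
  helm upgrade aiops-agent \
    https://storage.googleapis.com/public-charts/aiops-agent-1.5.7.tgz \
    -n aiops --reuse-values -f "$OVERRIDES_FILE"
}
```

The --reuse-values flag keeps the values from the previous release, so only the keys in the overrides file change.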
Components of AIOps agent
Visit AIOps agent project page for more details of AIOps helm chart.
Grafana Agent
The backbone of aiops-agent is Grafana Agent. For more information about it see: Grafana Agent Repository
It is the main component for metrics collection and acts as a Prometheus remote_write client. It is a subset of Prometheus without any querying or local storage, using the same service discovery, relabeling, WAL, and remote_write code found in Prometheus.
This agent allows co.brick AIOps to collect metrics from client clusters with minimal impact on existing infrastructure.
In addition to the co.brick AIOps agent, two collectors are also deployed. They allow AIOps to collect more data about clusters.
Promtail
For more information about promtail see: Promtail documentation
Promtail is an agent that ships the contents of local logs to the co.brick AIOps central system. It is usually deployed to every machine that runs applications that need to be monitored.
It primarily:
- Discovers services from which logs are collected
- Attaches labels to log streams
- Pushes them to the co.brick AIOps central system.
Node Exporter
Node Exporter is a tool from the Prometheus project that collects system-level metrics from the host machine and exposes them for scraping by aiops-agent. These metrics include CPU usage, memory usage, disk I/O, network statistics, and many others.
The Node Exporter is deployed as a daemonset in a Kubernetes cluster, which ensures that an instance of Node Exporter runs on each node in the cluster. This makes it possible to monitor the resource usage and performance of each individual node.
By default, Node Exporter is disabled to give flexibility when deploying aiops-agent. See the deployment section for details about enabling this component.
Topology Registry
The Topology Registry is a key part of the AIOps agent. It precisely tracks and reports changes in the Kubernetes cluster topology. This includes when nodes, pods, containers, services, or ingresses are added, changed, or removed. This detailed information about the cluster's topology is very important for the AIOps agent. It helps to fully understand the state and structure of the Kubernetes cluster. This understanding is needed for monitoring the health and performance of applications on the cluster, spotting any unusual activity, and making informed decisions to optimize how resources are used and improve application performance.
Log Compressor
A custom component responsible for compressing Promtail logs before they are shipped to the target cluster. It acts as a sidecar for Promtail.
Metrics Compressor
A custom component responsible for compressing Prometheus metrics before they are shipped to the target cluster. It acts as a sidecar for the Grafana Agent.
Collectors Monitor
A custom component responsible for self-monitoring of all required agents, their health, and the general Kubernetes cluster configuration. The Kubernetes check is based on Popeye: the output of the Popeye CronJob is parsed and filtered, and the selected issues are sent to the AIOps central system.
hubble-gke-exporter (dataplane-v2)
Proxies dataplane-v2 data.
hubble-metrics-exporter (dataplane-v2)
Generates metrics from dataplane-v2 data.
Popeye (Optional)
The AIOps agent installs Popeye on the cluster automatically as a CronJob. AIOps uses it to detect potential issues with resources and configurations. Visit the project page for detailed instructions on how Popeye is installed.
Node Exporter (Optional)
In addition, aiops-agent can be configured to install node-exporter on the cluster as a daemonset. This tool is a valuable addition for monitoring and understanding system metrics, exposing a wide variety of hardware and kernel-related metrics.
By default, the Node Exporter is disabled to provide the flexibility to decide whether you want to use it in your environment. To enable it during the installation of the aiops-agent, use the following flag:
--set node-exporter.enabled=true
The Node Exporter requires access to three specific directories on the host system:
- The /proc directory is a pseudo-filesystem which provides an interface to kernel data structures. It is often referred to as a process information pseudo-file system. It doesn't contain 'real' files but runtime system information (e.g., system memory, devices mounted, hardware configuration, etc.).
- The /sys directory, also known as sysfs, is a pseudo-file system that holds information about the system's hardware components and drivers. It provides detailed information about various hardware devices and subsystems.
- The / directory, also known as the root directory, is the top-level directory in the filesystem hierarchy. Access to this directory allows Node Exporter to gather metrics from various other directories that fall under the root directory.
Node Exporter needs access to these directories because it collects system and server metrics from these locations and exposes them on a Prometheus scrape endpoint. This allows aiops-agent to scrape these metrics and send them to the AIOps central system.
Keep in mind that these directories are accessed in read-only mode: Node Exporter can read data from them but cannot modify any data within them. This is a security measure to prevent unintended changes to your system.
Permissions
The AIOps agent uses Kubernetes Role-Based Access Control (RBAC), and each component has its own ClusterRole. By default, all of the components have read-only permissions.
Troubleshooting
Get all error-level logs of the AIOps agent by issuing the following command in a terminal (adjust the namespace if a custom one was used to install the agent):
kubectl -n aiops logs daemonset/aiops-agent | grep level=err
Problem | Reason | Action |
---|---|---|
HTTP status 404 Not Found | The configuration may be invalid. | Re-run agent deployment script with correct parameters. |
HTTP status 401 Unauthorized | OAuth2 authorization fails. | Check CLIENT_ID, CLIENT_SECRET, and TENANT provided in the installation script. |
HTTP status 429 Too Many Requests: ingestion rate limit exceeded | You may have exceeded your current active series limits. | Contact our Support to extend the limits. |
In all other cases contact us for direct support.
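The table above can be turned into a quick triage helper. The sketch below is an assumption-laden illustration: the function name `suggest_action` is hypothetical, and it assumes the HTTP status text from the table appears verbatim in the error-level log lines, which may not match the agent's actual log format.

```shell
#!/bin/sh
# Reads agent logs on stdin, keeps error-level lines, and prints a count
# per suggested action based on the HTTP status text found in each line.
suggest_action() {
  grep 'level=err' | while IFS= read -r line; do
    case "$line" in
      *"404 Not Found"*)         echo "404: re-run the deployment with correct parameters" ;;
      *"401 Unauthorized"*)      echo "401: check CLIENT_ID, CLIENT_SECRET, and TENANT" ;;
      *"429 Too Many Requests"*) echo "429: contact Support to extend the limits" ;;
      *)                         echo "other: contact us for direct support" ;;
    esac
  done | sort | uniq -c
}

# Possible usage (adjust namespace and workload name to your installation):
#   kubectl -n aiops logs daemonset/aiops-agent | suggest_action
```

Each output line shows how many error lines matched a given row of the table, which helps decide which action to take first.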
Uninstalling
Adjust the commands if a custom namespace was used to install the agent.
Helm
Uninstall the AIOps agent from your cluster by passing the release name used during installation:
helm -n aiops uninstall aiops-agent
Kustomize
Delete namespace used by AIOps agent:
kubectl delete ns aiops
Data erasure and tenant removal
To close your subscription, contact us directly. During the termination process the following will happen:
- Tenant will be removed.
- OAuth2 client and secret will be removed.
- Frontend users will be removed.
- Data cleanup will remove:
  - Incidents.
  - Alerts.
  - Topology information.
  - Logs.
At this point no more data will be collected from the system.
Some data may not be erased immediately due to retention policies or existing technical limitations. That includes:
- Logs: retention policy of 15 weeks.
- Metrics: data may be kept indefinitely.