From dc238fed4d3534572c4bb8fb94dcef042360d1b3 Mon Sep 17 00:00:00 2001 From: Mateo Lelong Date: Fri, 24 Apr 2026 13:28:45 +0200 Subject: [PATCH] feat: add kata_containers integration (dashboard, dataflows, manifest ...) Signed-off-by: Mateo Lelong --- kata_containers/CHANGELOG.md | 9 + kata_containers/README.md | 131 ++++ .../assets/configuration/spec.yaml | 45 ++ .../dashboards/kata_containers_dashboard.json | 706 ++++++++++++++++++ kata_containers/assets/dataflows.yaml | 5 + kata_containers/assets/service_checks.json | 17 + kata_containers/manifest.json | 63 ++ kata_containers/metadata.csv | 81 ++ 8 files changed, 1057 insertions(+) create mode 100644 kata_containers/CHANGELOG.md create mode 100644 kata_containers/README.md create mode 100644 kata_containers/assets/configuration/spec.yaml create mode 100644 kata_containers/assets/dashboards/kata_containers_dashboard.json create mode 100644 kata_containers/assets/dataflows.yaml create mode 100644 kata_containers/assets/service_checks.json create mode 100644 kata_containers/manifest.json create mode 100644 kata_containers/metadata.csv diff --git a/kata_containers/CHANGELOG.md b/kata_containers/CHANGELOG.md new file mode 100644 index 0000000000000..4be60dd61626b --- /dev/null +++ b/kata_containers/CHANGELOG.md @@ -0,0 +1,9 @@ +# CHANGELOG - Kata Containers + + + +## 1.0.0 / 2026-04-24 + +***Added***: + +* Initial release of the Kata Containers integration. diff --git a/kata_containers/README.md b/kata_containers/README.md new file mode 100644 index 0000000000000..b974668939515 --- /dev/null +++ b/kata_containers/README.md @@ -0,0 +1,131 @@ +# Agent Check: Kata Containers + +## Overview + +This check collects metrics from [Kata Containers][1], a secure container runtime that runs each workload inside a lightweight virtual machine (VM) for hardware-enforced isolation. + +The check is a long-running Go corecheck built into the Datadog Agent. 
It automatically discovers running Kata sandboxes by scanning the sandbox storage paths for shim Unix sockets, scrapes Prometheus metrics from each shim directly, and enriches the resulting metrics with Kubernetes orchestrator tags from the Datadog tagger. + +With the Datadog Kata Containers integration, you can: + +- Track the CPU and memory overhead introduced by the Kata VM infrastructure per sandbox. +- Monitor the health and resource usage of the containerd shim v2 (`containerd-shim-kata-v2`) for each running sandbox. +- Observe guest OS metrics (CPU, memory, disk, network) from inside each VM. +- Monitor hypervisor resource usage per sandbox. +- Alert on anomalous goroutine growth, high agent RPC latency, and elevated file descriptor counts. + +**Minimum Agent version:** 7.79.0 + +## Setup + +### Installation + +The Kata Containers check is built into the [Datadog Agent][2]. No additional installation is required on your server. + +### Prerequisites + +Kata Containers must be installed and running on the host. The check discovers sandboxes by looking for `shim-monitor.sock` files under the configured `sandbox_storage_paths` (default: `/run/vc/sbs` and `/run/kata`). + +### Configuration + +The check runs automatically with default settings. No configuration is required unless you need to override the default sandbox storage paths or customize label handling. + +To configure the check, create or edit `kata_containers.d/conf.yaml` in the `conf.d/` folder at the root of your Agent's configuration directory: + +```yaml +instances: + - sandbox_storage_paths: + - /host/run/vc/sbs + - /host/run/kata + # rename_labels: + # version: go_version + # exclude_labels: [] + # tags: [] +``` + +**Note:** On Kubernetes, the Agent requires access to the host paths where Kata stores its sandbox sockets. 
Mount the relevant host directories into the Agent pod: + +```yaml +volumeMounts: + - name: kata-run + mountPath: /host/run/vc + readOnly: true + - name: kata-run-alt + mountPath: /host/run/kata + readOnly: true +volumes: + - name: kata-run + hostPath: + path: /run/vc + - name: kata-run-alt + hostPath: + path: /run/kata +``` + +### Tag enrichment + +For each sandbox, the check resolves the associated container IDs from the Datadog workloadmeta store and queries the Datadog tagger at `OrchestratorCardinality` to retrieve Kubernetes orchestrator tags. This means all per-sandbox metrics are automatically tagged with `kube_namespace`, `pod_name`, `cluster_name`, and other orchestrator-level tags alongside `sandbox_id`. + +### Validation + +[Run the Agent's `status` subcommand][3] and look for `kata_containers` under the Checks section. + +## Data Collected + +### Metrics + +See [metadata.csv][4] for a list of metrics provided by this check. Metrics are grouped as follows: + +| Group | Description | +|---|---| +| `kata.shim.*` | Shim process metrics scraped from each sandbox socket (`containerd-shim-kata-v2`) | +| `kata.go.*` | Go runtime metrics exposed by the shim (goroutines, GC, memory) | +| `kata.guest.*` | Guest OS metrics proxied by the shim (CPU, memory, disk, network) | +| `kata.hypervisor.*` | Hypervisor process resource usage per sandbox | +| `kata.agent.*` | Kata agent metrics proxied by the shim from inside the VM | +| `kata.firecracker.*` | Firecracker VMM-specific metrics (only when using the Firecracker hypervisor) | + +`kata.running_shim_count` is emitted once per check run and reflects the total number of discovered sandboxes on the node. + +All per-sandbox metrics carry a `sandbox_id` tag. Prometheus labels from the shim metrics (such as `item`, `cpu`, `disk`, `interface`, `action`) are mapped directly to Datadog tags. The `version` label is renamed to `go_version` by default. + +### Events + +The Kata Containers integration does not emit any events. 
+ +### Service Checks + +See [service_checks.json][5] for a list of service checks provided by this integration. + +**`kata_containers.openmetrics.health`**: Returns `CRITICAL` if the Agent fails to connect to or parse metrics from a sandbox shim socket, otherwise returns `OK`. Grouped by `sandbox_id`. + +## Troubleshooting + +### No sandboxes discovered + +The check scans `sandbox_storage_paths` for directories containing a `shim-monitor.sock` file. If no sandboxes are found: + +- Verify that Kata Containers sandboxes are running: `ls /run/vc/sbs/` or `ls /run/kata/` +- On Kubernetes, ensure the host paths are mounted into the Agent pod. +- Check that the Agent process has read access to the socket files. + +### Metrics missing Kubernetes tags + +Tag enrichment requires the Datadog workloadmeta store to have resolved the container-to-sandbox mapping. This mapping is updated on container lifecycle events from the container runtime. If tags are missing, verify that the Agent has access to the container runtime socket (containerd or CRI-O). + +Need help? Contact [Datadog support][6]. 
+ +## Further Reading + +- [Kata Containers official documentation][1] +- [Kata 2.0 Metrics design document][7] +- [Kata Containers architecture][8] + +[1]: https://katacontainers.io/ +[2]: /account/settings/agent/latest +[3]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information +[4]: https://github.com/DataDog/integrations-core/blob/master/kata_containers/metadata.csv +[5]: https://github.com/DataDog/integrations-core/blob/master/kata_containers/assets/service_checks.json +[6]: https://docs.datadoghq.com/help/ +[7]: https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-2-0-metrics.md +[8]: https://github.com/kata-containers/kata-containers/blob/main/docs/design/architecture/README.md diff --git a/kata_containers/assets/configuration/spec.yaml b/kata_containers/assets/configuration/spec.yaml new file mode 100644 index 0000000000000..ee9a308b26e4f --- /dev/null +++ b/kata_containers/assets/configuration/spec.yaml @@ -0,0 +1,45 @@ +name: Kata Containers +files: +- name: kata_containers.yaml + options: + - template: init_config + options: + - template: init_config/default + - template: instances + options: + - name: sandbox_storage_paths + description: | + List of directories where Kata Containers stores sandbox runtime state. + The check scans each directory for subdirectories containing a + shim-monitor.sock file and scrapes Prometheus metrics from each. + value: + type: array + items: + type: string + example: + - /host/run/vc/sbs + - /host/run/kata + - name: rename_labels + description: | + Map of Prometheus label names to rename when building Datadog tags. + Applied to all metrics collected from shim sockets. + value: + type: object + example: + version: go_version + - name: exclude_labels + description: | + List of Prometheus label names to drop when building Datadog tags. 
+ value: + type: array + items: + type: string + example: [] + - name: tags + description: | + List of tags to attach to all metrics emitted by this check instance. + value: + type: array + items: + type: string + example: [] diff --git a/kata_containers/assets/dashboards/kata_containers_dashboard.json b/kata_containers/assets/dashboards/kata_containers_dashboard.json new file mode 100644 index 0000000000000..41c47a5bcad7c --- /dev/null +++ b/kata_containers/assets/dashboards/kata_containers_dashboard.json @@ -0,0 +1,706 @@ +{ + "title": "Kata Containers", + "description": "## Kata Containers\n\nPer-sandbox observability for Kata Containers, a secure container runtime using lightweight VMs as isolation boundaries. All metrics are scoped to `sandbox_id` and enriched with Kubernetes orchestrator tags.\n\n**Further reading:**\n- [Kata Containers Documentation](https://katacontainers.io/docs/)\n- [Kata 2.0 Metrics Reference](https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-2-0-metrics.md)\n- [Datadog Kata Containers Integration](https://docs.datadoghq.com/integrations/kata_containers/)", + "widgets": [ + { + "id": 1001, + "definition": { + "title": "Fleet Overview", + "show_title": true, + "type": "group", + "layout_type": "ordered", + "widgets": [ + { + "id": 1002, + "definition": { + "type": "note", + "content": "## Kata Containers\n\nEach sandbox runs inside a lightweight VM. 
The corecheck scrapes each shim's Unix socket directly and enriches metrics with Kubernetes orchestrator tags (`cluster_name`, `kube_namespace`, `pod_name`).\n\n**Further reading:**\n- [Kata Containers Documentation](https://katacontainers.io/docs/)\n- [Kata 2.0 Metrics Reference](https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-2-0-metrics.md)", + "background_color": "white", + "font_size": "14", + "text_align": "left", + "vertical_align": "top", + "show_tick": false, + "tick_pos": "50%", + "tick_edge": "left", + "has_padding": true + }, + "layout": { "x": 0, "y": 0, "width": 4, "height": 4 } + }, + { + "id": 1003, + "definition": { + "title": "Running Sandboxes", + "title_size": "16", + "title_align": "left", + "type": "query_value", + "requests": [ + { + "response_format": "scalar", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "sum:kata.running_shim_count{$cluster_name,$node_name}", + "aggregator": "last" + } + ], + "formulas": [{ "formula": "query1" }] + } + ], + "autoscale": true, + "precision": 0, + "timeseries_background": { "type": "area", "yaxis": { "include_zero": true } } + }, + "layout": { "x": 4, "y": 0, "width": 2, "height": 2 } + }, + { + "id": 1004, + "definition": { + "title": "Total Pod CPU Overhead", + "title_size": "16", + "title_align": "left", + "type": "query_value", + "requests": [ + { + "response_format": "scalar", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "sum:kata.shim.pod.overhead.cpu{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id}", + "aggregator": "last" + } + ], + "formulas": [{ "formula": "query1" }] + } + ], + "autoscale": true, + "precision": 2, + "timeseries_background": { "type": "area", "yaxis": { "include_zero": true } } + }, + "layout": { "x": 6, "y": 0, "width": 3, "height": 2 } + }, + { + "id": 1005, + "definition": { + "title": "Total Pod Memory Overhead", + "title_size": "16", + "title_align": "left", + "type": 
"query_value", + "requests": [ + { + "response_format": "scalar", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "sum:kata.shim.pod.overhead.memory.in.bytes{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id}", + "aggregator": "last" + } + ], + "formulas": [{ "formula": "query1" }] + } + ], + "autoscale": true, + "precision": 2, + "timeseries_background": { "type": "area", "yaxis": { "include_zero": true } } + }, + "layout": { "x": 9, "y": 0, "width": 3, "height": 2 } + }, + { + "id": 1006, + "definition": { + "title": "Running Sandboxes Over Time", + "title_size": "16", + "title_align": "left", + "show_legend": false, + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "sum:kata.running_shim_count{$cluster_name,$node_name} by {host}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "dog_classic", "line_type": "solid", "line_width": "normal" }, + "display_type": "area" + } + ] + }, + "layout": { "x": 4, "y": 2, "width": 8, "height": 2 } + } + ] + }, + "layout": { "x": 0, "y": 0, "width": 12, "height": 5 } + }, + { + "id": 2001, + "definition": { + "title": "Pod Overhead", + "background_color": "vivid_green", + "show_title": true, + "type": "group", + "layout_type": "ordered", + "widgets": [ + { + "id": 2002, + "definition": { + "type": "note", + "content": "CPU and memory cost of the Kata VM infrastructure (hypervisor + agent) per sandbox, beyond the workload itself. 
Tagged: `sandbox_id`, `pod_name`, `kube_namespace`, `cluster_name`.", + "background_color": "green", + "font_size": "14", + "text_align": "center", + "vertical_align": "center", + "show_tick": false, + "tick_pos": "50%", + "tick_edge": "left", + "has_padding": true + }, + "layout": { "x": 0, "y": 0, "width": 12, "height": 1 } + }, + { + "id": 2003, + "definition": { + "title": "Pod CPU Overhead by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.shim.pod.overhead.cpu{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "green", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 1, "width": 6, "height": 3 } + }, + { + "id": 2004, + "definition": { + "title": "Pod Memory Overhead by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.shim.pod.overhead.memory.in.bytes{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "green", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ], + "yaxis": { "include_zero": true } + }, + "layout": { "x": 6, "y": 1, "width": 6, "height": 3 } + } + ] + }, + "layout": { "x": 0, "y": 5, "width": 12, "height": 5 } + }, + { + "id": 3001, + "definition": { + "title": "Shim", + "background_color": "vivid_purple", + 
"show_title": true, + "type": "group", + "layout_type": "ordered", + "widgets": [ + { + "id": 3002, + "definition": { + "type": "note", + "content": "`containerd-shim-kata-v2` metrics scraped directly from each sandbox's Unix socket. One shim per sandbox. Tagged: `sandbox_id`, `pod_name`, `kube_namespace`, `cluster_name`.", + "background_color": "purple", + "font_size": "14", + "text_align": "center", + "vertical_align": "center", + "show_tick": false, + "tick_pos": "50%", + "tick_edge": "left", + "has_padding": true + }, + "layout": { "x": 0, "y": 0, "width": 12, "height": 1 } + }, + { + "id": 3003, + "definition": { + "title": "Shim Open FDs by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.shim.fds{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "purple", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 1, "width": 4, "height": 3 } + }, + { + "id": 3004, + "definition": { + "title": "Shim Threads by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.shim.threads{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "purple", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 4, "y": 1, "width": 4, 
"height": 3 } + }, + { + "id": 3005, + "definition": { + "title": "Go Goroutines by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.go.goroutines{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "purple", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 8, "y": 1, "width": 4, "height": 3 } + }, + { + "id": 3006, + "definition": { + "title": "Agent RPC Avg Latency by Sandbox (ms)", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "sum_rate", + "query": "avg:kata.shim.agent.rpc.durations.histogram.milliseconds.sum{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,action}.as_rate()" + }, + { + "data_source": "metrics", + "name": "count_rate", + "query": "avg:kata.shim.agent.rpc.durations.histogram.milliseconds.count{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,action}.as_rate()" + } + ], + "formulas": [ + { + "formula": "sum_rate / count_rate", + "alias": "avg latency (ms)" + } + ], + "style": { "palette": "purple", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 4, "width": 6, "height": 3 } + }, + { + "id": 3007, + "definition": { + "title": "Shim RPC Avg Latency by Sandbox (ms)", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": 
["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "sum_rate", + "query": "avg:kata.shim.rpc.durations.histogram.milliseconds.sum{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,action}.as_rate()" + }, + { + "data_source": "metrics", + "name": "count_rate", + "query": "avg:kata.shim.rpc.durations.histogram.milliseconds.count{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,action}.as_rate()" + } + ], + "formulas": [ + { + "formula": "sum_rate / count_rate", + "alias": "avg latency (ms)" + } + ], + "style": { "palette": "purple", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 6, "y": 4, "width": 6, "height": 3 } + } + ] + }, + "layout": { "x": 0, "y": 10, "width": 12, "height": 8 } + }, + { + "id": 4001, + "definition": { + "title": "Guest OS", + "background_color": "vivid_blue", + "show_title": true, + "type": "group", + "layout_type": "ordered", + "widgets": [ + { + "id": 4002, + "definition": { + "type": "note", + "content": "Metrics from inside the guest VM, proxied by the shim. 
Tagged: `sandbox_id`, `pod_name`, `kube_namespace`, `cluster_name`, plus `cpu`, `disk`, or `interface`.", + "background_color": "blue", + "font_size": "14", + "text_align": "center", + "vertical_align": "center", + "show_tick": false, + "tick_pos": "50%", + "tick_edge": "left", + "has_padding": true + }, + "layout": { "x": 0, "y": 0, "width": 12, "height": 1 } + }, + { + "id": 4003, + "definition": { + "title": "Guest CPU Time by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.guest.cpu.time{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,cpu,item}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "cool", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 1, "width": 6, "height": 3 } + }, + { + "id": 4004, + "definition": { + "title": "Guest Memory by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.guest.meminfo{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,item}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "cool", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ], + "yaxis": { "include_zero": true } + }, + "layout": { "x": 6, "y": 1, "width": 6, "height": 3 } + }, + { + "id": 4005, + "definition": { + "title": "Guest Disk Stats by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": 
"auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.guest.diskstat{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,disk,item}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "cool", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 4, "width": 6, "height": 3 } + }, + { + "id": 4006, + "definition": { + "title": "Guest Network Stats by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.guest.netdev.stat{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,interface,item}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "cool", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 6, "y": 4, "width": 6, "height": 3 } + } + ] + }, + "layout": { "x": 0, "y": 18, "width": 12, "height": 8 } + }, + { + "id": 5001, + "definition": { + "title": "Hypervisor", + "background_color": "vivid_yellow", + "show_title": true, + "type": "group", + "layout_type": "ordered", + "widgets": [ + { + "id": 5002, + "definition": { + "type": "note", + "content": "Hypervisor process (QEMU, Firecracker, etc.) resource usage per sandbox. 
Tagged: `sandbox_id`, `pod_name`, `kube_namespace`, `cluster_name`.", + "background_color": "yellow", + "font_size": "14", + "text_align": "center", + "vertical_align": "center", + "show_tick": false, + "tick_pos": "50%", + "tick_edge": "left", + "has_padding": true + }, + "layout": { "x": 0, "y": 0, "width": 12, "height": 1 } + }, + { + "id": 5003, + "definition": { + "title": "Hypervisor Open FDs by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.hypervisor.fds{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "yellow", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 0, "y": 1, "width": 4, "height": 3 } + }, + { + "id": 5004, + "definition": { + "title": "Hypervisor Threads by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", "value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.hypervisor.threads{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,pod_name}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "yellow", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 4, "y": 1, "width": 4, "height": 3 } + }, + { + "id": 5005, + "definition": { + "title": "Hypervisor IO Stats by Sandbox", + "title_size": "16", + "title_align": "left", + "show_legend": true, + "legend_layout": "auto", + "legend_columns": ["avg", "max", 
"value"], + "type": "timeseries", + "requests": [ + { + "response_format": "timeseries", + "queries": [ + { + "data_source": "metrics", + "name": "query1", + "query": "avg:kata.hypervisor.io.stat{$cluster_name,$node_name,$namespace,$pod_name,$sandbox_id} by {sandbox_id,item}" + } + ], + "formulas": [{ "formula": "query1" }], + "style": { "palette": "yellow", "line_type": "solid", "line_width": "normal" }, + "display_type": "line" + } + ] + }, + "layout": { "x": 8, "y": 1, "width": 4, "height": 3 } + } + ] + }, + "layout": { "x": 0, "y": 26, "width": 12, "height": 5 } + } + ], + "template_variables": [ + { + "name": "cluster_name", + "prefix": "cluster_name", + "available_values": [], + "default": "*" + }, + { + "name": "node_name", + "prefix": "host", + "available_values": [], + "default": "*" + }, + { + "name": "namespace", + "prefix": "kube_namespace", + "available_values": [], + "default": "*" + }, + { + "name": "pod_name", + "prefix": "pod_name", + "available_values": [], + "default": "*" + }, + { + "name": "sandbox_id", + "prefix": "sandbox_id", + "available_values": [], + "default": "*" + } + ], + "layout_type": "ordered", + "notify_list": [], + "reflow_type": "fixed" +} diff --git a/kata_containers/assets/dataflows.yaml b/kata_containers/assets/dataflows.yaml new file mode 100644 index 0000000000000..76da5a012ee17 --- /dev/null +++ b/kata_containers/assets/dataflows.yaml @@ -0,0 +1,5 @@ +provides: + - id: kata-containers-metrics + always_on: true + data_type: metrics + direction: inbound diff --git a/kata_containers/assets/service_checks.json b/kata_containers/assets/service_checks.json new file mode 100644 index 0000000000000..7216553fe6fbc --- /dev/null +++ b/kata_containers/assets/service_checks.json @@ -0,0 +1,17 @@ +[ + { + "agent_version": "7.79.0", + "integration": "Kata Containers", + "check": "kata_containers.openmetrics.health", + "statuses": [ + "ok", + "critical" + ], + "groups": [ + "host", + "endpoint" + ], + "name": "Kata Containers 
OpenMetrics endpoint health", + "description": "Returns `CRITICAL` if the Agent fails to connect to or parse metrics from a sandbox shim socket, otherwise returns `OK`." + } +] diff --git a/kata_containers/manifest.json b/kata_containers/manifest.json new file mode 100644 index 0000000000000..91ea65cb41299 --- /dev/null +++ b/kata_containers/manifest.json @@ -0,0 +1,63 @@ +{ + "manifest_version": "2.0.0", + "app_uuid": "4e277530-ae7d-4e54-ab2f-f47fb4b6132a", + "app_id": "kata_containers", + "owner": "agent-integrations", + "display_on_public_website": true, + "tile": { + "overview": "README.md#Overview", + "configuration": "README.md#Setup", + "support": "README.md#Support", + "changelog": "CHANGELOG.md", + "description": "Collect metrics from Kata Containers, a secure container runtime using lightweight virtual machines.", + "title": "Kata Containers", + "media": [], + "classifier_tags": [ + "Supported OS::Linux", + "Category::Containers", + "Category::Kubernetes", + "Offering::Integration", + "Submitted Data Type::Metrics" + ] + }, + "assets": { + "integration": { + "auto_install": true, + "source_type_id": 39847291, + "source_type_name": "Kata Containers", + "configuration": { + "spec": "assets/configuration/spec.yaml" + }, + "events": { + "creates_events": false + }, + "metrics": { + "prefix": "kata.", + "check": "kata.running_shim_count", + "metadata_path": "metadata.csv" + }, + "service_checks": { + "metadata_path": "assets/service_checks.json" + }, + "process_signatures": [ + "kata-monitor", + "kata-runtime", + "containerd-shim-kata-v2" + ] + }, + "logs": { + "source": "kata_containers" + }, + "dashboards": { + "Kata Containers Overview": "assets/dashboards/kata_containers_dashboard.json" + }, + "monitors": { + } + }, + "author": { + "support_email": "help@datadoghq.com", + "name": "Datadog", + "homepage": "https://www.datadoghq.com", + "sales_email": "info@datadoghq.com" + } +} diff --git a/kata_containers/metadata.csv b/kata_containers/metadata.csv new file 
mode 100644 index 0000000000000..0e1275d3e1b57 --- /dev/null +++ b/kata_containers/metadata.csv @@ -0,0 +1,81 @@ +metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name,curated_metric +kata.running_shim_count,gauge,,,,Number of running Kata Containers sandboxes on the node.,0,kata_containers,running sandboxes,T +kata.agent.io.stat,gauge,,,,Agent process IO stat.,0,kata_containers,, +kata.agent.proc.stat,gauge,,,,Agent process stat.,0,kata_containers,, +kata.agent.proc.status,gauge,,,,Agent process status.,0,kata_containers,, +kata.agent.scrape.count,count,,,,Number of times the agent metrics endpoint has been scraped.,0,kata_containers,, +kata.agent.total.rss,gauge,,byte,,Agent process total RSS size.,0,kata_containers,, +kata.agent.total.time,gauge,,second,,Agent process total CPU time.,0,kata_containers,, +kata.agent.total.vm,gauge,,byte,,Agent process total virtual memory size.,0,kata_containers,, +kata.firecracker.api.server,gauge,,,,Metrics related to the Firecracker internal API server.,0,kata_containers,, +kata.firecracker.block,gauge,,,,Block device metrics for Firecracker.,0,kata_containers,, +kata.firecracker.get.api.requests,gauge,,,,GET API request metrics for Firecracker.,0,kata_containers,, +kata.firecracker.i8042,gauge,,,,Metrics specific to the Firecracker i8042 device.,0,kata_containers,, +kata.firecracker.latencies.us,gauge,,microsecond,,Performance metrics related to Firecracker snapshots.,0,kata_containers,, +kata.firecracker.logger,gauge,,,,Metrics for the Firecracker logging subsystem.,0,kata_containers,, +kata.firecracker.mmds,gauge,,,,Metrics for the Firecracker MMDS functionality.,0,kata_containers,, +kata.firecracker.net,gauge,,,,Network-related metrics for Firecracker.,0,kata_containers,, +kata.firecracker.patch.api.requests,gauge,,,,PATCH API request metrics for Firecracker.,0,kata_containers,, +kata.firecracker.put.api.requests,gauge,,,,PUT API request metrics for 
Firecracker.,0,kata_containers,, +kata.firecracker.rtc,gauge,,,,Metrics specific to the Firecracker RTC device.,0,kata_containers,, +kata.firecracker.seccomp,gauge,,,,Metrics for Firecracker seccomp filtering.,0,kata_containers,, +kata.firecracker.signals,gauge,,,,Metrics related to Firecracker signals.,0,kata_containers,, +kata.firecracker.uart,gauge,,,,Metrics specific to the Firecracker UART device.,0,kata_containers,, +kata.firecracker.vcpu,gauge,,,,Metrics specific to Firecracker VCPUs.,0,kata_containers,, +kata.firecracker.vmm,gauge,,,,Metrics specific to the Firecracker virtual machine manager.,0,kata_containers,, +kata.firecracker.vsock,gauge,,,,VSOCK-related metrics for Firecracker.,0,kata_containers,, +kata.go.gc.duration.seconds,gauge,,second,,Summary of the pause duration of garbage collection cycles in the kata shim.,0,kata_containers,, +kata.go.goroutines,gauge,,,,Number of goroutines in the kata shim.,0,kata_containers,, +kata.go.info,gauge,,,,Information about the Go environment running the kata shim.,0,kata_containers,, +kata.go.memstats.alloc.bytes,gauge,,byte,,Heap bytes allocated and still in use by the kata shim.,0,kata_containers,, +kata.go.memstats.alloc.bytes.total,count,,byte,,"Total bytes allocated by the kata shim, even if freed.",0,kata_containers,, +kata.go.memstats.buck.hash.sys.bytes,gauge,,byte,,Profiling bucket hash table bytes for the kata shim.,0,kata_containers,, +kata.go.memstats.frees.total,count,,,,Total number of heap object frees in the kata shim.,0,kata_containers,, +kata.go.memstats.gc.cpu.fraction,gauge,,,,Fraction of CPU time used by GC in the kata shim.,0,kata_containers,, +kata.go.memstats.gc.sys.bytes,gauge,,byte,,GC system metadata bytes in the kata shim.,0,kata_containers,, +kata.go.memstats.heap.alloc.bytes,gauge,,byte,,Heap bytes allocated and in use by the kata shim.,0,kata_containers,, +kata.go.memstats.heap.idle.bytes,gauge,,byte,,Heap bytes waiting to be used in the kata shim.,0,kata_containers,,
+kata.go.memstats.heap.inuse.bytes,gauge,,byte,,Heap bytes in use by the kata shim.,0,kata_containers,, +kata.go.memstats.heap.objects,gauge,,,,Number of allocated heap objects in the kata shim.,0,kata_containers,, +kata.go.memstats.heap.released.bytes,gauge,,byte,,Heap bytes released to the OS by the kata shim.,0,kata_containers,, +kata.go.memstats.heap.sys.bytes,gauge,,byte,,Heap bytes obtained from the system by the kata shim.,0,kata_containers,, +kata.go.memstats.last.gc.time.seconds,gauge,,second,,Time of the last GC in the kata shim since the Unix epoch.,0,kata_containers,, +kata.go.memstats.lookups.total,count,,,,Total pointer lookups in the kata shim.,0,kata_containers,, +kata.go.memstats.mallocs.total,count,,,,Total heap object allocations in the kata shim.,0,kata_containers,, +kata.go.memstats.mcache.inuse.bytes,gauge,,byte,,Bytes in use by mcache structures in the kata shim.,0,kata_containers,, +kata.go.memstats.mcache.sys.bytes,gauge,,byte,,Bytes for mcache structures from system in the kata shim.,0,kata_containers,, +kata.go.memstats.mspan.inuse.bytes,gauge,,byte,,Bytes in use by mspan structures in the kata shim.,0,kata_containers,, +kata.go.memstats.mspan.sys.bytes,gauge,,byte,,Bytes for mspan structures from system in the kata shim.,0,kata_containers,, +kata.go.memstats.next.gc.bytes,gauge,,byte,,Heap bytes at which next GC will trigger in the kata shim.,0,kata_containers,, +kata.go.memstats.other.sys.bytes,gauge,,byte,,Bytes used for other system allocations in the kata shim.,0,kata_containers,, +kata.go.memstats.stack.inuse.bytes,gauge,,byte,,Stack allocator bytes in use by the kata shim.,0,kata_containers,, +kata.go.memstats.stack.sys.bytes,gauge,,byte,,Stack allocator bytes from system in the kata shim.,0,kata_containers,, +kata.go.memstats.sys.bytes,gauge,,byte,,Total bytes obtained from the system by the kata shim.,0,kata_containers,, +kata.go.threads,gauge,,,,Number of OS threads created by the kata shim.,0,kata_containers,, 
+kata.guest.cpu.time,gauge,,,,Guest CPU stat.,0,kata_containers,, +kata.guest.diskstat,gauge,,,,Disk stats in the guest system.,0,kata_containers,, +kata.guest.load,gauge,,,,Guest system load average.,0,kata_containers,, +kata.guest.meminfo,gauge,,,,Statistics about memory usage in the guest system.,0,kata_containers,, +kata.guest.netdev.stat,gauge,,,,Guest network device stats.,0,kata_containers,, +kata.guest.tasks,gauge,,,,Number of tasks in the guest system.,0,kata_containers,, +kata.guest.vm.stat,gauge,,,,Guest virtual memory stat.,0,kata_containers,, +kata.hypervisor.fds,gauge,,,,Number of open file descriptors for the hypervisor process.,0,kata_containers,, +kata.hypervisor.io.stat,gauge,,,,Hypervisor process IO statistics.,0,kata_containers,, +kata.hypervisor.netdev,gauge,,,,Hypervisor network device statistics.,0,kata_containers,, +kata.hypervisor.proc.stat,gauge,,,,Hypervisor process statistics.,0,kata_containers,, +kata.hypervisor.proc.status,gauge,,,,Hypervisor process status.,0,kata_containers,, +kata.hypervisor.threads,gauge,,,,Number of threads in the hypervisor process.,0,kata_containers,, +kata.shim.agent.rpc.durations.histogram.milliseconds.bucket,gauge,,millisecond,,RPC latency histogram bucket for agent calls from the shim.,0,kata_containers,, +kata.shim.agent.rpc.durations.histogram.milliseconds.count,gauge,,,,Total number of agent RPC calls observed.,0,kata_containers,, +kata.shim.agent.rpc.durations.histogram.milliseconds.sum,gauge,,millisecond,,Total duration of agent RPC calls.,0,kata_containers,, +kata.shim.fds,gauge,,,,Number of open file descriptors for the kata containerd shim v2.,0,kata_containers,, +kata.shim.io.stat,gauge,,,,Kata containerd shim v2 process IO statistics.,0,kata_containers,, +kata.shim.netdev,gauge,,,,Kata containerd shim v2 network device statistics.,0,kata_containers,, +kata.shim.pod.overhead.cpu,gauge,,,,CPU overhead attributable to the Kata pod (VM and agent).,0,kata_containers,pod CPU overhead,T 
+kata.shim.pod.overhead.memory.in.bytes,gauge,,byte,,Memory overhead attributable to the Kata pod (VM and agent).,0,kata_containers,pod memory overhead,T +kata.shim.proc.stat,gauge,,,,Kata containerd shim v2 process statistics.,0,kata_containers,, +kata.shim.proc.status,gauge,,,,Kata containerd shim v2 process status.,0,kata_containers,, +kata.shim.rpc.durations.histogram.milliseconds.bucket,gauge,,millisecond,,Shim RPC latency histogram bucket.,0,kata_containers,, +kata.shim.rpc.durations.histogram.milliseconds.count,gauge,,,,Total number of shim RPC calls observed.,0,kata_containers,, +kata.shim.rpc.durations.histogram.milliseconds.sum,gauge,,millisecond,,Total duration of shim RPC calls.,0,kata_containers,, +kata.shim.threads,gauge,,,,Number of threads in the kata containerd shim v2 process.,0,kata_containers,,