Use the API Server auditing feature of a cluster for security O&M - Container Compute Service

Auditing in the API server records requests to the Kubernetes API and the results of those requests. Alibaba Cloud Container Service for Kubernetes (ACK) provides API server audit logs. These logs help cluster administrators determine who performed which operation on which resource and when. You can use the logs to trace the history of cluster operations and troubleshoot cluster issues. This reduces the workload for cluster security operations and maintenance (O&M).

Step 1: Enable the API Server auditing feature for the cluster

When you create an ACK cluster, Use Log Service is selected by default to enable the API Server auditing feature. If this feature is not enabled, follow these steps to enable it.

Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its ID. In the left-side navigation pane of the cluster details page, choose Security > Cluster Auditing.

If cluster logging or cluster auditing is not enabled, follow the on-screen instructions to select a Simple Log Service (SLS) project and enable the feature.

Important

Make sure that the following SLS resource quotas in your account are not exceeded. Otherwise, you cannot enable the cluster auditing feature.

The quota on the number of SLS projects.
The quota on the number of Logstores per SLS project.
The quota on the number of dashboards per SLS project.

For more information about SLS quotas and how to adjust them, see Adjust resource quotas.

Step 2: View audit reports

Important

Do not modify the audit reports. To customize audit reports, you can create new reports in the Simple Log Service console.

ACK clusters have four built-in audit log reports: Audit Center Overview, Resource Operation Overview, Resource Operation Details, and Kubernetes CVE Security Risks. On the Cluster Audit page, you can select dimensions, such as namespace and Resource Access Management (RAM) user, to filter audit events and obtain the following information from the reports.

After you obtain the results, you can click the icon in the upper-right corner of a specific area to perform more operations. For example, you can view the chart for a specific area in full screen or preview the query statement for the area.

Audit Center Overview

The Audit Center Overview report displays an overview of events in the ACK cluster and provides details about important events. Important events include RAM user operations, public network access, command execution, resource deletion, access to secrets, and Kubernetes CVE security risks.

Resource Operation Overview

The Resource Operation Overview report displays statistics about operations on common computing, network, storage, and access control resources in the ACK cluster. These operations include create, update, delete, and access. The resources include the following:

Computing resources: deployments, StatefulSets, jobs, CronJobs, pods, and DaemonSets.
Network resources: services and Ingresses.
Storage resources: ConfigMaps, secrets, and PersistentVolumeClaims.
Access control resources: Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings.

Resource Operation Details

This report displays a detailed list of operations on a specific resource type in the ACK cluster. You must select or enter a resource type to perform a real-time query. The report shows the total number of events for each resource operation type, the namespace distribution, the success rate, the time-series trend, and a detailed list of operations.

Note

To view operations related to CustomResourceDefinition (CRD) resources that are registered in Kubernetes or other unlisted resources, you can enter the plural form of the resource name. For example, if the CRD resource is AliyunLogConfig, enter AliyunLogConfigs.

Kubernetes CVE Security Risks

This report displays potential Kubernetes CVE security risks in the cluster. You can select or enter a sub-account ID, which is the RAM user ID, to perform a real-time query. The report then displays the Kubernetes CVE security risks for the specified account. For more information about CVE details and solutions, see [CVE Security] Vulnerability Fix Announcement.

(Optional) Step 3: View detailed log records

To run custom queries or analyze audit logs, you can go to the Simple Log Service console to view detailed log records.

Note

By default, the data in the Logstore that corresponds to the API server audit logs of an ACK cluster is retained for 30 days. To change the retention period, see Manage a Logstore.

Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane of the cluster details page, click Cluster Information.
Click the Basic Information tab. In the Cluster Resources section, click the project ID next to Log Service Project. Then, in the project list, click the Logstore named audit-${clusterid}.
During cluster creation, a Logstore named audit-${clusterid} is automatically created in the specified SLS project.
Important
Indexes are configured for the audit Logstore by default. Do not modify the indexes. Otherwise, the reports may become invalid.
Enter a query statement in the search box, specify a time range for the query, such as the last 15 minutes, and then click Query/Analysis to view the results.
The following list describes common ways to search audit logs:
- To query the operation records of a RAM user, enter the RAM user ID and click Query/Analysis.
- To query the operations on a resource, enter the name of a computing, network, storage, or access control resource in the cluster and click Query/Analysis.
- To filter out the operations of system components, enter NOT user.username: node NOT user.username: serviceaccount NOT user.username: apiserver NOT user.username: kube-scheduler NOT user.username: kube-controller-manager and click Query/Analysis.
For more information about query and statistical methods, see Query and analysis methods for Simple Log Service.
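The system-component filter in the last search example can be approximated locally, for example when you spot-check exported audit events. The following Python code is a minimal sketch, not the SLS implementation: it assumes that a username counts as a system component when any of its colon-separated parts equals one of the terms from the query, whereas the actual matching in SLS depends on the index and tokenizer configuration of the Logstore. The sample events and the helper names are hypothetical.

```python
# Terms copied from the NOT clauses of the query above.
SYSTEM_TERMS = {
    "node",
    "serviceaccount",
    "apiserver",
    "kube-scheduler",
    "kube-controller-manager",
}


def is_system_component(username: str) -> bool:
    """Return True if any ':'-separated part of the username is a system term.

    Assumption: splitting on ':' only approximates how SLS tokenizes the field.
    """
    return any(part in SYSTEM_TERMS for part in username.split(":"))


def user_operations(events):
    """Keep only events whose user.username does not look like a system component."""
    return [e for e in events if not is_system_component(e["user"]["username"])]


# Hypothetical sample events that carry only the field this sketch needs.
events = [
    {"user": {"username": "system:node:worker-1"}},
    {"user": {"username": "system:serviceaccount:kube-system:coredns"}},
    {"user": {"username": "system:kube-scheduler"}},
    {"user": {"username": "kubernetes-admin"}},
    {"user": {"username": "2000000000000001"}},
]

kept = [e["user"]["username"] for e in user_operations(events)]
print(kept)
```

In this sketch, only the administrator and the RAM user ID remain after filtering.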

(Optional) Step 4: Configure alerts

If you require real-time alerts for operations on specific resources, you can use the alerting feature of Simple Log Service. Supported notification methods include DingTalk chatbots, custom webhooks, and the Notification Center. For more information, see Quickly set log-based alerts.

Example 1: Alert on command execution in a container

A company has strict restrictions on the use of its Kubernetes clusters and prohibits users from logging on to containers or running commands in them. If a user runs a command, an alert must be sent immediately. The alert information must include the container that the user logged on to, the command that was run, the operator, the event ID, the time, and the source IP address.

Query statement:

verb: create and objectRef.subresource: exec and stage: ResponseStarted |
SELECT auditID as "Event ID",
  date_format(from_unixtime(__time__), '%Y-%m-%d %T') as "Operation Time",
  regexp_extract("requestURI", '([^\?]*)/exec\?.*', 1) as "Resource",
  regexp_extract("requestURI", '\?(.*)', 1) as "Command",
  "responseStatus.code" as "Status Code",
  CASE
    WHEN "user.username" != 'kubernetes-admin' then "user.username"
    WHEN "user.username" = 'kubernetes-admin' and regexp_like("annotations.authorization.k8s.io/reason", 'RoleBinding') then regexp_extract("annotations.authorization.k8s.io/reason", ' to User "(\w+)"', 1)
    ELSE 'kubernetes-admin'
  END as "Operator Account",
  CASE WHEN json_array_length(sourceIPs) = 1 then json_format(json_array_get(sourceIPs, 0)) ELSE sourceIPs END as "Source Address"
order by "Operation Time" desc
limit 10000

Conditional expression: Event ID =~ ".*".
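To see what the two regexp_extract calls in the query return, you can apply the same patterns to a sample request URI. The following Python code is a minimal sketch that uses the standard re module in place of the SLS functions. The sample URI, the function name, and the extra step that decodes the command parameters with parse_qs are illustrative assumptions and are not part of the query.

```python
import re
from urllib.parse import parse_qs


def describe_exec(request_uri: str):
    """Split an exec request URI the way the two patterns in the query do."""
    # Pattern for the "Resource" column: everything before "/exec?".
    resource_match = re.search(r'([^\?]*)/exec\?.*', request_uri)
    # Pattern for the "Command" column: everything after the first "?".
    command_match = re.search(r'\?(.*)', request_uri)

    resource = resource_match.group(1) if resource_match else None
    raw_query = command_match.group(1) if command_match else None
    # Extra step (not in the query): decode the repeated "command" parameters.
    argv = parse_qs(raw_query).get("command", []) if raw_query else []
    return resource, raw_query, argv


# Hypothetical request URI of an exec call into the pod "web-0".
uri = "/api/v1/namespaces/default/pods/web-0/exec?command=ls&command=-l&container=app&stdout=true"
resource, raw_query, argv = describe_exec(uri)
print(resource)
print(raw_query)
print(argv)
```

The first value identifies the pod that the user logged on to, and the second value is the raw query string that the alert reports as the command.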

Example 2: Alert on failed public network access to the API server

A cluster has public network access enabled. To prevent malicious attacks, you must monitor the number of access attempts and the failure rate. When the number of access attempts reaches a specific threshold, such as 10, and the failure rate is higher than a specific threshold, such as 50%, an alert must be sent immediately. The alert information must include the region of the user's IP address, the source IP address, and whether the IP address is high-risk.

Query statement:

* | select ip as "Source Address", total as "Access Count", round(rate * 100, 2) as "Failure Rate %", failCount as "Illegal Access Count",
  CASE when security_check_ip(ip) = 1 then 'yes' else 'no' end as "Is High-Risk IP",
  ip_to_country(ip) as "Country", ip_to_province(ip) as "Province", ip_to_city(ip) as "City", ip_to_provider(ip) as "Carrier"
from (
  select CASE WHEN json_array_length(sourceIPs) = 1 then json_format(json_array_get(sourceIPs, 0)) ELSE sourceIPs END as ip,
    count(1) as total,
    sum(CASE WHEN "responseStatus.code" < 400 then 0 ELSE 1 END) * 1.0 / count(1) as rate,
    count_if("responseStatus.code" = 403) as failCount
  from log group by ip limit 10000
)
where ip_to_domain(ip) != 'intranet' and ip not LIKE '%,%' and not try(is_subnet_of('<Your subnet IP address>'))
ORDER by "Access Count" desc limit 10000

Conditional expression: Source Address =~ ".*".
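The per-IP aggregation in this query, combined with the thresholds from the scenario, can be illustrated with a small Python sketch. It is an approximation under stated assumptions: the check against three private CIDR blocks stands in for ip_to_domain(ip) != 'intranet', the thresholds of 10 requests and 50% are the example values from the scenario, the sample events are hypothetical, and the high-risk and geolocation lookups are omitted.

```python
import ipaddress
from collections import defaultdict

# Assumption: these private ranges stand in for the "intranet" check in the query.
INTRANET = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_intranet(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTRANET)


def suspicious_sources(events, min_total=10, min_failure_rate=0.5):
    """Aggregate requests per source IP and return the IPs that exceed both thresholds."""
    stats = defaultdict(lambda: {"total": 0, "failed": 0, "forbidden": 0})
    for event in events:
        ips = event["sourceIPs"]
        # Like the query, ignore entries with several IPs and intranet sources.
        if len(ips) != 1 or is_intranet(ips[0]):
            continue
        code = event["responseStatus"]["code"]
        entry = stats[ips[0]]
        entry["total"] += 1
        entry["failed"] += int(code >= 400)      # any response that is not < 400
        entry["forbidden"] += int(code == 403)   # the "Illegal Access Count" column
    result = [
        {
            "ip": ip,
            "total": s["total"],
            "failure_rate": round(s["failed"] / s["total"] * 100, 2),
            "forbidden": s["forbidden"],
        }
        for ip, s in stats.items()
        if s["total"] >= min_total and s["failed"] / s["total"] > min_failure_rate
    ]
    return sorted(result, key=lambda row: row["total"], reverse=True)


def make(ip, code, count):
    """Build hypothetical audit events that carry only the fields used here."""
    return [{"sourceIPs": [ip], "responseStatus": {"code": code}} for _ in range(count)]


events = (
    make("203.0.113.7", 403, 9) + make("203.0.113.7", 200, 3)
    + make("198.51.100.20", 200, 11) + make("198.51.100.20", 401, 1)
    + make("10.0.0.5", 403, 20)
)
alerts = suspicious_sources(events)
print(alerts)
```

Only the first address is reported in this sample: it has 12 requests with a 75% failure rate, the second address fails too rarely, and the third is an intranet address.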

Related operations

Change Log Project

To migrate the API server audit log data of a cluster to another SLS project, you can use the Change SLS Project feature.

Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its ID. In the left-side navigation pane of the cluster details page, choose Security > Cluster Auditing.
In the upper-right corner of the Cluster Audit page, click Change Log Service Project to migrate the cluster audit log data to a different SLS project.

Disable the API Server auditing feature for the cluster

If you no longer require the API Server auditing feature, you can disable it.

Log on to the ACS console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its ID. In the left-side navigation pane of the cluster details page, choose Security > Cluster Auditing.
In the upper-right corner of the Cluster Audit page, click Disable Cluster Audit to disable the auditing feature for the cluster.

Use a third-party logging solution in an ACS cluster

We recommend that you use Alibaba Cloud Simple Log Service (SLS) to record ACK cluster audit logs. However, if you want to use a third-party logging service, you can choose not to use SLS when you create the cluster. You can then connect to a different logging solution to collect and retrieve audit logs.

Reference: Introduction to the API Server audit configuration for ACK clusters

When you create an ACK cluster and configure its components, Use Log Service is selected by default in the console. This enables the API Server auditing feature, which collects event data based on an audit policy and writes the data to the backend.

Audit policy

An audit policy defines the configuration of the auditing feature and the rules for collecting requests. The rules for collecting event logs vary based on the audit level. The following audit levels are available.

Audit Level	Log collection rule
None	Events that match the rule are not collected.
Metadata	Collects the metadata of requests, such as user information and timestamps, but does not collect the request or response body.
Request	Collects the metadata and body of requests, but does not collect the response body. This does not apply to non-resource requests.
RequestResponse	Collects the metadata, request body, and response body. This does not apply to non-resource requests.

To apply an audit policy, save it as a YAML file and pass the file path to the API server as a startup parameter by using the --audit-policy-file command-line flag. The following code block shows a sample YAML file for an audit policy.


apiVersion: audit.k8s.io/v1 # Required. The value is audit.k8s.io/v1 for clusters of Kubernetes v1.24 or later, and audit.k8s.io/v1beta1 for clusters of earlier versions.
kind: Policy
# Do not generate audit events for requests at the RequestReceived stage.
omitStages:
  - "RequestReceived"
rules:
  # The following types of requests are frequent and have low potential risks. We recommend that you set the level to None to skip auditing.
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
      - group: "" # core
        resources: ["endpoints", "services"]
  - level: None
    users: ["system:unsecured"]
    namespaces: ["kube-system"]
    verbs: ["get"]
    resources:
      - group: "" # core
        resources: ["configmaps"]
  - level: None
    users: ["kubelet"] # legacy kubelet identity
    verbs: ["get"]
    resources:
      - group: "" # core
        resources: ["nodes"]
  - level: None
    userGroups: ["system:nodes"]
    verbs: ["get"]
    resources:
      - group: "" # core
        resources: ["nodes"]
  - level: None
    users:
      - system:kube-controller-manager
      - system:kube-scheduler
      - system:serviceaccount:kube-system:endpoint-controller
    verbs: ["get", "update"]
    namespaces: ["kube-system"]
    resources:
      - group: "" # core
        resources: ["endpoints"]
  - level: None
    users: ["system:apiserver"]
    verbs: ["get"]
    resources:
      - group: "" # core
        resources: ["namespaces"]
  # For read-only URLs, such as /healthz*, /version, and /swagger*, set the level to None to skip auditing.
  - level: None
    nonResourceURLs:
      - /healthz*
      - /version
      - /swagger*
  # Set the level to None for events to skip auditing.
  - level: None
    resources: 
      - group: "" # core
        resources: ["events"]
  # For interfaces such as Secrets, ConfigMaps, and TokenReviews that may contain sensitive information or binary files, set the level to Metadata.
  - level: Metadata
    resources:
      - group: "" # core
        resources: ["secrets", "configmaps"]
      - group: authentication.k8s.io
        resources: ["tokenreviews"]
  # Read requests (get, list, and watch) may return large amounts of data. Set the level to Request so that the response body is not collected.
  - level: Request
    verbs: ["get", "list", "watch"]
    resources:
      - group: "" # core
      - group: "admissionregistration.k8s.io"
      - group: "apps"
      - group: "authentication.k8s.io"
      - group: "authorization.k8s.io"
      - group: "autoscaling"
      - group: "batch"
      - group: "certificates.k8s.io"
      - group: "extensions"
      - group: "networking.k8s.io"
      - group: "policy"
      - group: "rbac.authorization.k8s.io"
      - group: "settings.k8s.io"
      - group: "storage.k8s.io"
  # For all other requests to known Kubernetes APIs, the level is set to RequestResponse to record the request and response bodies.
  - level: RequestResponse
    resources:
      - group: "" # core
      - group: "admissionregistration.k8s.io"
      - group: "apps"
      - group: "authentication.k8s.io"
      - group: "authorization.k8s.io"
      - group: "autoscaling"
      - group: "batch"
      - group: "certificates.k8s.io"
      - group: "extensions"
      - group: "networking.k8s.io"
      - group: "policy"
      - group: "rbac.authorization.k8s.io"
      - group: "settings.k8s.io"
      - group: "storage.k8s.io"
  # For all other requests, the level is set to Metadata by default.
  - level: Metadata

Note

Audit events are not generated at the RequestReceived stage, that is, immediately after a request is received. Recording starts only after the response header is sent.

The system does not audit redundant kube-proxy watch requests, GET requests for nodes from the kubelet and system:nodes, endpoint operations performed by kube components in the kube-system namespace, or GET requests for namespaces from the API server.

For sensitive interfaces such as authentication, rbac, certificates, autoscaling, and storage, what the system records depends on the operation type: the request body for read operations (get, list, and watch), and both the request and response bodies for write operations.
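Kubernetes evaluates the rules of an audit policy in order, and the first rule that matches a request determines its audit level. The following Python code is a simplified sketch of that first-match behavior, not the API server implementation. It uses a shortened subset of the sample policy, ignores nonResourceURLs and omitStages, and the request dictionaries and function names are hypothetical.

```python
def rule_matches(rule, request):
    """Check one rule; a field that is absent from the rule matches everything."""
    if "users" in rule and request["user"] not in rule["users"]:
        return False
    if "userGroups" in rule and not set(rule["userGroups"]) & set(request.get("groups", [])):
        return False
    if "verbs" in rule and request["verb"] not in rule["verbs"]:
        return False
    if "namespaces" in rule and request.get("namespace") not in rule["namespaces"]:
        return False
    if "resources" in rule:
        # A group entry without a resource list covers every resource in that group.
        return any(
            spec["group"] == request["group"]
            and (not spec.get("resources") or request["resource"] in spec["resources"])
            for spec in rule["resources"]
        )
    return True


def audit_level(rules, request):
    """Return the level of the first matching rule, or "None" if no rule matches."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule["level"]
    return "None"


# Shortened subset of the sample policy, in the same order.
RULES = [
    {"level": "None", "users": ["system:kube-proxy"], "verbs": ["watch"],
     "resources": [{"group": "", "resources": ["endpoints", "services"]}]},
    {"level": "None", "resources": [{"group": "", "resources": ["events"]}]},
    {"level": "Metadata", "resources": [{"group": "", "resources": ["secrets", "configmaps"]}]},
    {"level": "Request", "verbs": ["get", "list", "watch"],
     "resources": [{"group": ""}, {"group": "apps"}]},
    {"level": "RequestResponse", "resources": [{"group": ""}, {"group": "apps"}]},
    {"level": "Metadata"},
]


def request(user, verb, group, resource):
    return {"user": user, "verb": verb, "group": group, "resource": resource}


print(audit_level(RULES, request("alice", "get", "", "secrets")))
print(audit_level(RULES, request("alice", "list", "", "pods")))
print(audit_level(RULES, request("alice", "create", "apps", "deployments")))
```

Because the order matters, the Metadata rule for secrets must come before the broader Request and RequestResponse rules. Otherwise, the secret contents would be captured in the audit log.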

Audit backend

After audit events are collected, they are stored in the log backend file system. The log files are in standard JSON format.
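If you collect these files with your own tooling, each audit event can be reduced to the fields that answer who did what and when. The following Python code is a minimal sketch that assumes the file contains one JSON event per line, which is how the Kubernetes log backend writes events. The sample event is hypothetical and shortened, and the summarize function is an illustrative helper.

```python
import json

# Hypothetical, shortened audit event in the audit.k8s.io/v1 Event format.
SAMPLE_LINE = json.dumps({
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "auditID": "11111111-2222-3333-4444-555555555555",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/secrets/db-password",
    "verb": "get",
    "user": {"username": "kubernetes-admin", "groups": ["system:masters"]},
    "sourceIPs": ["203.0.113.7"],
    "objectRef": {"resource": "secrets", "namespace": "default", "name": "db-password"},
    "responseStatus": {"code": 200},
    "requestReceivedTimestamp": "2024-01-01T08:00:00.000000Z",
})


def summarize(line: str) -> dict:
    """Extract the who/what/when fields from one audit log line."""
    event = json.loads(line)
    ref = event.get("objectRef") or {}        # absent for non-resource requests
    status = event.get("responseStatus") or {}
    return {
        "who": event["user"]["username"],
        "verb": event["verb"],
        "resource": ref.get("resource"),
        "namespace": ref.get("namespace"),
        "name": ref.get("name"),
        "when": event.get("requestReceivedTimestamp"),
        "code": status.get("code"),
    }


summary = summarize(SAMPLE_LINE)
print(summary)
```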