Notes on latency jitter and intermittent connection drops caused by the iptables INVALID DROP rule in K8s

Field notes from third-line support

Environment prerequisites

  • kube-proxy is deployed and runs in iptables mode
  • Optional: Calico as the CNI

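Possible triggers & resulting symptoms

  • An overlay Pod communicating with a service outside the cluster
  • Underlay and overlay networks communicating with each other (outbound path over the overlay, return path over the underlay, i.e. asymmetric routing)
  • conntrack saturation?

Any of these can produce occasional latency spikes or intermittent connection drops.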

The KUBE-FORWARD chain that kube-proxy maintains in the filter table contains the rule -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP:

[root@gzu-prd ~]# iptables -L KUBE-FORWARD --line -nv
Chain KUBE-FORWARD (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID
2        4   240 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
3    11412   33M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

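This rule drops any traffic that connection tracking marks as INVALID, and in this kube-proxy version (v1.18.20) the behavior cannot be disabled through configuration (short of patching the code and recompiling).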

The connection tracking state of TCP flows can be inspected with conntrack -L or cat /proc/net/nf_conntrack (look for flags such as [UNREPLIED]).
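As a rough first check, the per-CPU statistics printed by conntrack -S include an invalid= counter, and the kernel exposes the current and maximum table size under /proc/sys/net/netfilter. The sketch below only parses those values; the function names sum_conntrack_invalid and conntrack_usage_pct are made up for illustration, and the counter is cumulative and node-wide, so it hints at the problem rather than proving that forwarded Pod traffic is affected.

```shell
# Sum the invalid= counters from `conntrack -S` output read on stdin.
# (Hypothetical helper name; the output is one line per CPU.)
sum_conntrack_invalid() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^invalid=/) { split($i, kv, "="); total += kv[2] }
  } END { print total + 0 }'
}

# Print conntrack table usage as a percentage.
# $1 = current entry count, $2 = table maximum.
conntrack_usage_pct() {
  awk -v c="$1" -v m="$2" 'BEGIN { printf "%.1f\n", (c / m) * 100 }'
}

# Usage on a node (needs root and conntrack-tools):
#   conntrack -S | sum_conntrack_invalid
#   conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A usage percentage close to 100 would support the conntrack saturation hypothesis listed above; a steadily growing invalid total is worth correlating with the iptables DROP counters.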

kube-proxy bluntly flushes its iptables rules whenever endpoints change, so simply inserting an ACCEPT rule into KUBE-FORWARD does not work: the rule would be wiped on the next sync.


Likewise, in the filter-table chains maintained by Calico, almost every cali-fw-cali**** chain also contains the rule -m conntrack --ctstate INVALID -j DROP:

[root@gzu-prd ~]# iptables-save -t filter|grep INVALID
-A cali-fw-cali02fca994756 -m comment --comment "cali:Zgj-5PhkyRyRGc5v" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali091fd1acd82 -m comment --comment "cali:vySNraYuHVkcwzZC" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali0945b5ec7e6 -m comment --comment "cali:YpO6T4K2fN2biMqp" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali09725d6075c -m comment --comment "cali:3Q23jKsPGkXWWHjs" -m conntrack --ctstate INVALID -j DROP

In Calico's case, however, this behavior can be turned off with the FELIX_DISABLECONNTRACKINVALIDCHECK environment variable.
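One way to apply such a Felix setting is kubectl set env on the calico-node DaemonSet. The wrapper below is a minimal sketch under assumptions: a manifest-based install where the DaemonSet is named calico-node and lives in kube-system (operator-managed installs usually differ), and a helper name, set_felix_env, invented for this note.

```shell
# Set a Felix environment variable on the calico-node DaemonSet.
# Assumption: manifest-based install, DaemonSet "calico-node" in kube-system.
# $1 = NAME=value, $2 = namespace (optional, defaults to kube-system)
set_felix_env() {
  kubectl -n "${2:-kube-system}" set env daemonset/calico-node "$1"
}

# Usage (changes the pod template, so calico-node pods get rolled):
#   set_felix_env FELIX_DISABLECONNTRACKINVALIDCHECK=true
```

Because this modifies the DaemonSet's pod template, expect calico-node to restart node by node according to its update strategy.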

To tell whether a node is actually affected, the iptables hit counters are one means of observation:

[root@gzu-prd ~]# iptables -w 3 -L  --line -nv|grep DROP|sort -rn -k 2|head -n 10
2    19020  773K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:kRQn4VHUEHOpigCm */ ctstate INVALID
2    15617  937K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:DTf_pGZFWLZaqlg8 */ ctstate INVALID
2     7068  283K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:HGKygSKf4SfkbRyf */ ctstate INVALID
2     3845  154K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:t5nJs-UfMTVjRtBI */ ctstate INVALID
2     2312  139K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:h3VJGUlERuK34Tcz */ ctstate INVALID
2     2115  110K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:dTQ4mHZc378Z1e33 */ ctstate INVALID
2     1828  110K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:kp1Tzme9aWaPgdKP */ ctstate INVALID
2     1556 62240 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:VaeGtNK_681jKlg9 */ ctstate INVALID
2     1330 69160 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:meQqPUz96UN62T8l */ ctstate INVALID
2     1025 53300 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:mIIn1Wh34t2SZwbR */ ctstate INVALID
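The grep in the pipeline above matches every DROP rule, not only the INVALID ones, and includes rules that never fired. A slightly stricter filter could look like the sketch below; top_invalid_drops is a hypothetical helper name, and it assumes the column layout produced by --line-numbers (num, pkts, bytes, target). Adding -x to the iptables call prints exact counters instead of abbreviated ones such as 773K, which keeps the numeric sort reliable.

```shell
# Read `iptables -L --line-numbers -nvx` output on stdin and print the
# INVALID DROP rules that have matched at least one packet, busiest first.
# $1 = number of rows to show (optional, defaults to 10)
top_invalid_drops() {
  awk '$4 == "DROP" && /ctstate INVALID/ && $2 > 0' \
    | sort -rn -k2,2 \
    | head -n "${1:-10}"
}

# Usage on a node (needs root):
#   iptables -w 3 -L --line-numbers -nvx | top_invalid_drops 10
```

Running it twice a few minutes apart and comparing the pkts column shows whether drops are still happening or are only historical.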

To avoid the problem without changing kube-proxy or calico-node parameters, a crude but simple option is to deploy a DaemonSet in the cluster:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: iptables-conntrack-hacker
  namespace: kube-system
  labels:
    app: iptables-conntrack-hacker
spec:
  selector:
    matchLabels:
      app: iptables-conntrack-hacker
  template:
    metadata:
      name: iptables-conntrack-hacker
      labels:
        app: iptables-conntrack-hacker
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: ''
      containers:
        - name: iptables-conntrack-hacker
          image: 'your-registry-address/kube-system/kube-proxy:v1.18.20'
          command:
            - /bin/sh
            - '-ce'
            - |
              export TZ=Asia/Shanghai;
              echo "$(date) Container started...";
              echo "Current iptables rule state:"
              iptables -w 10 -L --line -nv|grep INVALID || true
              while (true)
              do
                iptables -w 15 -C FORWARD -m conntrack --ctstate INVALID -m comment --comment "To avoid invalid tcp traffic dropped by kube-proxy or calico" -j ACCEPT || \
                (iptables -w 10 -I FORWARD -m conntrack --ctstate INVALID -m comment --comment "To avoid invalid tcp traffic dropped by kube-proxy or calico" -j ACCEPT && echo "$(date) Adding iptables rules ...");
                sleep 60
              done
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 64Mi
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  revisionHistoryLimit: 5

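The container in this DaemonSet manipulates the host's iptables directly: it loops forever, checks once every 60 seconds whether the INVALID ACCEPT rule exists in the FORWARD chain, and crudely inserts it at the top if it is missing.

If you want the rule restored faster after something removes it, shorten the sleep so the check runs every 10 - 30 seconds.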

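Note that this DaemonSet comes with one more precondition: if the overlay CNI is Calico, confirm that calico-node's iptables chain insert mode is append.

The FELIX_CHAININSERTMODE environment variable must be set to Append; otherwise the jump to the cali-FORWARD chain is inserted at the very top of the FORWARD chain, and the INVALID ACCEPT rule never takes effect.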
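Whether the ordering actually came out right can be checked from the rule listing of the FORWARD chain. The following is a sketch only: invalid_accept_precedes_calico is an invented name, and it assumes the rule text printed by iptables -S contains --ctstate INVALID together with -j ACCEPT, and that Calico's jump appears as -j cali-FORWARD.

```shell
# Read `iptables -S FORWARD` output on stdin and succeed (exit 0) only if
# an INVALID ACCEPT rule exists and appears before the first jump to
# cali-FORWARD (or if there is no cali-FORWARD jump at all).
invalid_accept_precedes_calico() {
  awk '
    /--ctstate INVALID/ && /-j ACCEPT/ && !accept_line { accept_line = NR }
    /-j cali-FORWARD/ && !calico_line { calico_line = NR }
    END { exit !(accept_line && (!calico_line || accept_line < calico_line)) }
  '
}

# Usage on a node (needs root):
#   iptables -w 10 -S FORWARD | invalid_accept_precedes_calico \
#     && echo "ordering OK" || echo "ACCEPT rule missing or shadowed"
```

If the check fails on a Calico node even though the DaemonSet is running, the chain insert mode is the first thing to look at.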

Related

kube-proxy(v1.18.20) code: https://github.com/kubernetes/kubernetes/blob/1f3e19b7beb1cc0110255668c4238ed63dadb7ad/pkg/proxy/iptables/proxier.go#L1503-L1511

calico v3.16 config(FELIX_DISABLECONNTRACKINVALIDCHECK): https://docs.tigera.io/archive/v3.16/reference/felix/configuration

github issue