Using nftables to suppress the Linux kernel's own packet handling for an AF_PACKET SOCK_RAW socket

Problem

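I wrote a protocol stack on top of raw sockets and tried to run a topology like the following inside Docker:

client <-client_router_network-> router <-router_server_network-> server

What I got, however, was a setup in which sending a single ICMP packet brought back four.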

The Docker networks looked like this:

d4e7c59005dc   brstack_client_router   bridge    local
d4203e5fc61a   brstack_router_server   bridge    local

and the following behavior was observed.

client <-> router
03:21:41.914019 IP 192.168.30.10 > 192.168.31.10: ICMP echo request, id 0, seq 65, length 13
03:21:41.914336 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 65, length 13
03:21:41.914521 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 65, length 13
03:21:41.914532 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 65, length 13
03:21:41.914715 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 65, length 13

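For one ICMP echo request with seq 65, four ICMP echo replies with the same seq 65 came back.

The capture below is from the link between router and server. The request has already been doubled on the way, while the server returns exactly as many replies as the requests it received.

router <-> server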
03:22:39.919549 IP 192.168.30.10 > 192.168.31.10: ICMP echo request, id 0, seq 94, length 13
03:22:39.919794 IP 192.168.30.10 > 192.168.31.10: ICMP echo request, id 0, seq 94, length 13
03:22:39.919850 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 94, length 13
03:22:39.920096 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 94, length 13

Cause

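The main cause is that Docker enables IPv4 forwarding (net.ipv4.ip_forward=1), so the Linux kernel automatically forwards ICMP packets addressed to other hosts. The copy that my raw-socket stack processes and forwards and the copy that the kernel forwards add up: packets are doubled in the client -> server direction and doubled again in the server -> client direction, which gives four replies in the end.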

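Before trying the solution below, I also tried setting net.ipv4.icmp_echo_ignore_all: 1 under sysctls in docker-compose.yml. Judging from the behavior, the kernel did ignore the echo requests that reached the server, but forwarding was not affected at all. (Without this setting, the server's kernel would presumably have replied as well, doubling the count once more to about eight echo replies. I suspect that is the correct expectation, but I had not noticed it at the time of the experiment.)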
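The multiplication described above can be checked with a small counting model. This is only a sketch: `reply_count` and its two parameters are names made up for this illustration, and it assumes every forwarder on the router relays every packet exactly once in each direction.

```python
def reply_count(forwarders_on_router: int, responders_on_server: int) -> int:
    """Echo replies the client sees for one echo request (simplified model).

    forwarders_on_router: independent things forwarding on the router
                          (kernel forwarding and the raw-socket stack -> 2)
    responders_on_server: independent things answering echo requests on the server
                          (raw-socket stack only -> 1, plus the kernel -> 2)
    """
    # Each forwarder relays the single request once.
    requests_at_server = forwarders_on_router
    # Each responder answers every request it sees.
    replies_from_server = requests_at_server * responders_on_server
    # On the way back, each forwarder relays every reply once more.
    return replies_from_server * forwarders_on_router


if __name__ == "__main__":
    print(reply_count(2, 1))  # observed case: 4
    print(reply_count(2, 2))  # kernel on the server also replying: 8
    print(reply_count(1, 1))  # intended behavior: 1
```

With two forwarders and one responder the model gives the four replies seen in the capture, and with the server's kernel also replying it gives the eight guessed at above.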

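I also tried setting net.ipv4.forwarding: 0, but the container then failed to start with the error below. The message suggests that no sysctl with that name exists (the usual key for IPv4 forwarding is net.ipv4.ip_forward), so this attempt never actually disabled forwarding.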

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open /proc/sys/net/ipv4/forwarding: no such file or directory

Solution

Ordinary network traffic, meaning packets handled by the kernel's own stack, can be blocked with nftables. https://knowledge.sakura.ad.jp/22636/

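Following that article, I wrote the nftables ruleset below and applied it on all three of client, router and server. With it, the OS's ICMP forwarding and automatic replies were successfully blocked. While I was at it, I also suppressed the kernel's ARP handling.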

table inet inet_table {
    chain input {
        type filter hook input priority filter + 1; policy drop;
    }

    chain forward {
        type filter hook forward priority filter + 1; policy drop;
    }

    chain output {
        type filter hook output priority filter + 1; policy drop;
    }
}
table arp arp_table {
    chain input {
        type filter hook input priority filter + 1; policy drop;
    }

    chain output {
        type filter hook output priority filter + 1; policy drop;
    }
}

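The input chain uses policy drop to stop the OS IP stack from processing IP packets addressed to the host itself, the forward chain stops it from forwarding IP packets addressed to other hosts, and the output chain stops it from sending packets that originate from the host.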

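Note that when net.ipv4.icmp_echo_ignore_all: 1 was not set and the ruleset contained only the forward chain, the server's kernel still answered the requests it received, so replies between router and server were doubled as shown below. That is why I added the input and output chains as well. (The main motivation is consistency: everything is controlled through nft.)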

$ sudo tcpdump -i br-d4203e5fc61a
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-d4203e5fc61a, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:27:58.456736 IP 192.168.30.10 > 192.168.31.10: ICMP echo request, id 0, seq 34, length 13
16:27:58.456771 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 34, length 13
16:27:58.457042 IP 192.168.31.10 > 192.168.30.10: ICMP echo reply, id 0, seq 34, length 13
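The roles of the three chains, and why a forward chain alone was not enough on the server, can be sketched as follows. This is a deliberately simplified model that ignores prerouting, postrouting, broadcast and multicast. `hook_for` and `kernel_handles` are hypothetical helper names, and the router addresses are made up for the example (only 192.168.30.10 and 192.168.31.10 appear in the captures).

```python
import ipaddress
from typing import Iterable, Set


def hook_for(dst: str, local_addrs: Iterable[str], locally_generated: bool) -> str:
    """Return which filter hook (input / forward / output) a packet traverses."""
    if locally_generated:
        return "output"   # sent by the host's own stack
    local = {ipaddress.ip_address(a) for a in local_addrs}
    if ipaddress.ip_address(dst) in local:
        return "input"    # addressed to this host
    return "forward"      # addressed to someone else


def kernel_handles(hook: str, drop_policy_chains: Set[str]) -> bool:
    """True if no policy-drop chain sits on this hook, so the kernel still acts."""
    return hook not in drop_policy_chains


SERVER = ["192.168.31.10"]
ROUTER = ["192.168.30.1", "192.168.31.1"]  # hypothetical router addresses

if __name__ == "__main__":
    # Echo request from the client as seen on the router and on the server.
    print(hook_for("192.168.31.10", ROUTER, False))  # forward
    print(hook_for("192.168.31.10", SERVER, False))  # input
    # With only a forward chain, the server's kernel still sees the request.
    print(kernel_handles("input", {"forward"}))      # True
```

In this model the echo request hits the input hook on the server, which a forward-only ruleset leaves open. That matches the doubled replies in the capture above.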

This may also help later, when I implement TCP and the like, to stop the kernel from sending RSTs (untested).

Bonus (a bit long)

On a related note, I found the following Q&A, which says that netfilter has no effect on raw sockets.

https://serverfault.com/questions/1097938/does-iptables-rules-have-control-over-raw-socket-packets

So I looked at the Linux kernel source to guess why netfilter does not affect socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)).

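The IPv4 entry point, which is also where netfilter's prerouting hook is attached, appears to be the following. Since input and forward presumably come after prerouting, this looks like the earliest point at which netfilter is applied to IPv4 packets. https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/ipv4/ip_input.c#L558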

/*
 * IP receive entry point
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
	   struct net_device *orig_dev)
{
	struct net *net = dev_net(dev);

	skb = ip_rcv_core(skb, net);
	if (skb == NULL)
		return NET_RX_DROP;

	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
		       net, NULL, skb, dev, NULL,
		       ip_rcv_finish);
}

This ip_rcv is registered in a packet_type as follows: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/ipv4/af_inet.c#L1886

static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
	.list_func = ip_list_rcv,
};

and ip_rcv is probably invoked from somewhere around this code: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5843

deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
			       &orig_dev->ptype_specific);

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L2422

static inline void deliver_ptype_list_skb(struct sk_buff *skb,
					  struct packet_type **pt,
					  struct net_device *orig_dev,
					  __be16 type,
					  struct list_head *ptype_list)
{
	struct packet_type *ptype, *pt_prev = *pt;

	list_for_each_entry_rcu(ptype, ptype_list, list) {
		if (ptype->type != type)
			continue;
		if (pt_prev)
			deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
	*pt = pt_prev;
}
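The loop in deliver_ptype_list_skb has a slightly unusual shape: a matching handler is not called immediately but remembered in pt_prev, and it is only delivered once the next matching handler is found. Below is a minimal Python model of that control flow. Handler names are plain strings, `delivered` stands in for calls to deliver_skb, and the function name `deliver_ptype_list` is invented for this sketch.

```python
from typing import List, Optional, Tuple

ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806

# (ethertype, handler name); the names are labels only.
Handler = Tuple[int, str]


def deliver_ptype_list(pkt_type: int,
                       pt_prev: Optional[str],
                       ptype_list: List[Handler],
                       delivered: List[str]) -> Optional[str]:
    """Model of the pt_prev pattern: deliver the previous match, keep the latest pending."""
    for ptype, name in ptype_list:
        if ptype != pkt_type:
            continue                    # handler registered for another ethertype
        if pt_prev is not None:
            delivered.append(pt_prev)   # corresponds to deliver_skb(skb, pt_prev, orig_dev)
        pt_prev = name                  # the newest match stays pending
    return pt_prev                      # the caller decides what to do with the last one


if __name__ == "__main__":
    delivered: List[str] = []
    handlers = [(ETH_P_IP, "ip_rcv"), (ETH_P_ARP, "arp_rcv"), (ETH_P_IP, "other_ip_handler")]
    pending = deliver_ptype_list(ETH_P_IP, None, handlers, delivered)
    print(delivered, pending)  # ['ip_rcv'] other_ip_handler
```

The point for this article is only that a handler such as ip_rcv is selected by matching ptype->type against the packet's ethertype.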

The AF_PACKET receive handler, on the other hand, is the following: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/packet/af_packet.c#L2168

/*
 * This function makes lazy skb cloning in hope that most of packets
 * are discarded by BPF.
 *
 * Note tricky part: we DO mangle shared skb! skb->data, skb->len
 * and skb->cb are mangled. It works because (and until) packets
 * falling here are owned by current CPU. Output packets are cloned
 * by dev_queue_xmit_nit(), input packets are processed by net_bh
 * sequentially, so that if we return skb to original state on exit,
 * we will not harm anyone.
 */

static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
		      struct packet_type *pt, struct net_device *orig_dev)
{

The receive hook is set up here: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/packet/af_packet.c#L3442

	po->prot_hook.func = packet_rcv;

	if (sock->type == SOCK_PACKET)
		po->prot_hook.func = packet_rcv_spkt;

	po->prot_hook.af_packet_priv = sk;
	po->prot_hook.af_packet_net = sock_net(sk);

	if (proto) {
		po->prot_hook.type = proto;
		__register_prot_hook(sk);
	}

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/packet/af_packet.c#L344

/* __register_prot_hook must be invoked through register_prot_hook
 * or from a context in which asynchronous accesses to the packet
 * socket is not possible (packet_create()).
 */
static void __register_prot_hook(struct sock *sk)
{
	struct packet_sock *po = pkt_sk(sk);

	if (!packet_sock_flag(po, PACKET_SOCK_RUNNING)) {
		if (po->fanout)
			__fanout_link(sk, po);
		else
			dev_add_pack(&po->prot_hook);

		sock_hold(sk);
		packet_sock_flag_set(po, PACKET_SOCK_RUNNING, 1);
	}
}

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L603

/**
 *	dev_add_pack - add packet handler
 *	@pt: packet type declaration
 *
 *	Add a protocol handler to the networking stack. The passed &packet_type
 *	is linked into kernel lists and may not be freed until it has been
 *	removed from the kernel lists.
 *
 *	This call does not sleep therefore it can not
 *	guarantee all CPU's that are in middle of receiving packets
 *	will see the new packet type (until the next received packet).
 */

void dev_add_pack(struct packet_type *pt)
{
	struct list_head *head = ptype_head(pt);

	if (WARN_ON_ONCE(!head))
		return;

	spin_lock(&ptype_lock);
	list_add_rcu(&pt->list, head);
	spin_unlock(&ptype_lock);
}
EXPORT_SYMBOL(dev_add_pack);

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L573

/*
 *	Add a protocol ID to the list. Now that the input handler is
 *	smarter we can dispense with all the messy stuff that used to be
 *	here.
 *
 *	BEWARE!!! Protocol handlers, mangling input packets,
 *	MUST BE last in hash buckets and checking protocol handlers
 *	MUST start from promiscuous ptype_all chain in net_bh.
 *	It is true now, do not change it.
 *	Explanation follows: if protocol handler, mangling packet, will
 *	be the first on list, it is not able to sense, that packet
 *	is cloned and should be copied-on-write, so that it will
 *	change it and subsequent readers will get broken packet.
 *							--ANK (980803)
 */

static inline struct list_head *ptype_head(const struct packet_type *pt)
{
	if (pt->type == htons(ETH_P_ALL)) {
		if (!pt->af_packet_net && !pt->dev)
			return NULL;

		return pt->dev ? &pt->dev->ptype_all :
				 &pt->af_packet_net->ptype_all;
	}

	if (pt->dev)
		return &pt->dev->ptype_specific;

	return pt->af_packet_net ? &pt->af_packet_net->ptype_specific :
				 &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
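The list selection in ptype_head can be modelled in a few lines of Python to see where a given packet_type ends up. This is a sketch under assumptions: lists are represented by descriptive strings, the ethertype is kept in host byte order for readability, and PTYPE_HASH_MASK = 15 is my assumption about the kernel headers rather than something shown in the quoted code.

```python
from dataclasses import dataclass
from typing import Optional

ETH_P_ALL = 0x0003
ETH_P_IP = 0x0800
PTYPE_HASH_MASK = 15  # assumption: PTYPE_HASH_SIZE is 16


@dataclass
class PacketType:
    ethertype: int                        # host byte order in this model
    dev: Optional[str] = None             # bound device, if any
    af_packet_net: Optional[str] = None   # owning network namespace, if any


def ptype_head(pt: PacketType) -> Optional[str]:
    """Return a label for the list the handler would be linked into."""
    if pt.ethertype == ETH_P_ALL:
        if pt.af_packet_net is None and pt.dev is None:
            return None
        return f"{pt.dev}.ptype_all" if pt.dev else f"{pt.af_packet_net}.ptype_all"
    if pt.dev:
        return f"{pt.dev}.ptype_specific"
    if pt.af_packet_net:
        return f"{pt.af_packet_net}.ptype_specific"
    return f"ptype_base[{pt.ethertype & PTYPE_HASH_MASK}]"


if __name__ == "__main__":
    # An ETH_P_ALL packet socket bound to a device vs. an unbound one.
    print(ptype_head(PacketType(ETH_P_ALL, dev="eth0", af_packet_net="netns")))
    print(ptype_head(PacketType(ETH_P_ALL, af_packet_net="netns")))
    # A protocol handler such as ip_packet_type has neither dev nor af_packet_net.
    print(ptype_head(PacketType(ETH_P_IP)))
```

In this model an ETH_P_ALL packet socket always lands on a ptype_all list, while ip_packet_type lands in the ptype_base hash table.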

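For an ETH_P_ALL socket the hook therefore ends up on a ptype_all list: pt->dev->ptype_all when the socket is bound to a device, otherwise the per-namespace pt->af_packet_net->ptype_all. Packets are probably delivered to those lists around the code below, before ip_rcv is reached, so I infer that the netfilter processing is skipped entirely for the packet socket. https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5727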

	list_for_each_entry_rcu(ptype, &dev_net_rcu(skb->dev)->ptype_all,
				list) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}

	list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}

Note that all of this only looks at the input side. I do not know how the output side behaves, but I suspect netfilter is bypassed there in a similar way. (Not investigated enough.)

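While digging further I read the nftables wiki, which says that the NETDEV ingress hook comes first of all. I thought that looking at it would tell me whether netfilter can stop packets from reaching a raw socket after all, so I investigated. https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks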

The ingress hook itself is probably run from here: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/include/linux/netfilter_netdev.h#L31

	nf_hook_state_init(&state, NF_NETDEV_INGRESS,
			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
			   dev_net(skb->dev), NULL);

That code is on the path called from nf_ingress: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5652

static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
			     int *ret, struct net_device *orig_dev)
{
	if (nf_hook_ingress_active(skb)) {
		int ingress_retval;

		if (*pt_prev) {
			*ret = deliver_skb(skb, *pt_prev, orig_dev);
			*pt_prev = NULL;
		}

		rcu_read_lock();
		ingress_retval = nf_hook_ingress(skb);
		rcu_read_unlock();
		return ingress_retval;
	}
	return 0;
}

and nf_ingress in turn is called here: https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5747

#ifdef CONFIG_NET_INGRESS
	if (static_branch_unlikely(&ingress_needed_key)) {
		bool another = false;

		nf_skip_egress(skb, true);
		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
					 &another);
		if (another)
			goto another_round;
		if (!skb)
			goto out;

		nf_skip_egress(skb, false);
		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
			goto out;
	}
#endif

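Looking at the order, the packet is delivered to ptype_all first and only then checked by nf_ingress. In other words, netfilter can be bypassed here too.

However, I then noticed that generic XDP processing takes place at an even earlier stage.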

generic XDP https://yunazuno.hatenablog.com/entry/2017/06/12/094101

(Only at the very end did I realize that the Wikipedia page on XDP would have made this obvious from the start.) https://en.wikipedia.org/wiki/Express_Data_Path

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5698

if (static_branch_unlikely(&generic_xdp_needed_key)) {
    int ret2;

    migrate_disable();
    ret2 = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog),
                    &skb);
    migrate_enable();

    if (ret2 != XDP_PASS) {
        ret = NET_RX_DROP;
        goto out;
    }
}

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5338

int do_xdp_generic(const struct bpf_prog *xdp_prog, struct sk_buff **pskb)
{

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5258

static u32 netif_receive_generic_xdp(struct sk_buff **pskb,
				     struct xdp_buff *xdp,
				     const struct bpf_prog *xdp_prog)
{

https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/net/core/dev.c#L5133

u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
			     const struct bpf_prog *xdp_prog)
{

I eventually reached the place where the eBPF program is actually run, but going any further is too deep a rabbit hole, so I will stop here for now. https://github.com/torvalds/linux/blob/1a33418a69cc801d48c59d7d803af5c9cd291be2/include/net/xdp.h#L646

static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
					    struct xdp_buff *xdp)
{
	/* Driver XDP hooks are invoked within a single NAPI poll cycle and thus
	 * under local_bh_disable(), which provides the needed RCU protection
	 * for accessing map entries.
	 */
	u32 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp));

	if (static_branch_unlikely(&bpf_master_redirect_enabled_key)) {
		if (act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev))
			act = xdp_master_redirect(xdp);
	}

	return act;
}

So the conclusion is: if you do not want packets to be handed to AF_PACKET sockets either, use XDP.
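The ordering read out of the source in this section can be summarized as a tiny simulation. It is a sketch of the order as inferred here from reading the code, not something verified by debugging. The stage names and the `receive` function are made up for the illustration, and only generic XDP is modelled.

```python
from typing import List


def receive(xdp_drop: bool = False,
            netdev_ingress_drop: bool = False,
            inet_drop: bool = False) -> List[str]:
    """Return which consumers get to see one incoming IPv4 packet (simplified)."""
    seen: List[str] = []

    # 1. Generic XDP runs first; anything other than XDP_PASS ends processing.
    if xdp_drop:
        return seen

    # 2. ptype_all taps (an ETH_P_ALL AF_PACKET socket, tcpdump) get their copy.
    seen.append("af_packet_socket")

    # 3. The netdev-family ingress hook runs after the taps.
    if netdev_ingress_drop:
        return seen

    # 4. The protocol handler for the ethertype (ip_rcv) is called and
    #    enters the inet-family hooks (prerouting, then input or forward).
    seen.append("ip_rcv")
    if inet_drop:
        return seen

    # 5. The kernel's own IP stack acts on the packet (reply, forward, ...).
    seen.append("kernel_ip_stack")
    return seen


if __name__ == "__main__":
    print(receive(inet_drop=True))            # ['af_packet_socket', 'ip_rcv']
    print(receive(netdev_ingress_drop=True))  # ['af_packet_socket']
    print(receive(xdp_drop=True))             # []
```

Only the XDP drop keeps the packet away from the AF_PACKET socket in this model. That is the property the inet ruleset in the solution relies on: the kernel stays silent while the raw-socket stack still receives everything.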

(I have not confirmed any of this with a debugger, so parts of it may be wrong. If you know this area and spot a mistake, please let me know.)

Addendum (5/18): I later looked into the transmit side and found that, even with AF_PACKET, outgoing packets can be suppressed by using the egress hook of netfilter's netdev family.

https://github.com/torvalds/linux/blob/205b2bd7939cc126f445ce3010af22858c18ef1f/net/packet/af_packet.c#L273

static int packet_xmit(const struct packet_sock *po, struct sk_buff *skb)
{
	if (!packet_sock_flag(po, PACKET_SOCK_QDISC_BYPASS))
		return dev_queue_xmit(skb);

#ifdef CONFIG_NETFILTER_EGRESS
	if (nf_hook_egress_active()) {
		skb = nf_hook_direct_egress(skb);
		if (!skb)
			return NET_XMIT_DROP;
	}
#endif
	return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
}
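The branching in packet_xmit can be restated as a small decision function. This sketch models only the quoted function and assumes CONFIG_NETFILTER_EGRESS is enabled. What dev_queue_xmit does internally is outside the quoted snippet and is not modelled. The Python function and its boolean parameters are illustrative names.

```python
def packet_xmit(qdisc_bypass: bool, egress_hook_active: bool, egress_accepts: bool) -> str:
    """Return the label of the path taken by the quoted packet_xmit() (simplified)."""
    if not qdisc_bypass:
        # Normal path: hand the skb to the regular transmit queue.
        return "dev_queue_xmit"
    if egress_hook_active and not egress_accepts:
        # nf_hook_direct_egress() returned NULL: the egress hook dropped the packet.
        return "NET_XMIT_DROP"
    # Bypass path: transmit directly on the picked queue.
    return "dev_direct_xmit"


if __name__ == "__main__":
    print(packet_xmit(qdisc_bypass=True, egress_hook_active=True, egress_accepts=False))
    print(packet_xmit(qdisc_bypass=True, egress_hook_active=False, egress_accepts=True))
    print(packet_xmit(qdisc_bypass=False, egress_hook_active=True, egress_accepts=False))
```

The first case shows the part that matters here: even when the socket bypasses the qdisc layer, the egress hook is still consulted before the packet is sent.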
