2017-06-28

pivot_root できる条件

ふとしたきっかけで man 2 pivot_root の制限に疑問を持ったので、雑にカーネルのコードを読んでみたエントリです。かなり雑にみただけなので間違いの指摘を歓迎します。というか指摘を受けるために書いたようなもの😅

pivot_root の使われ方

コンテナを起動して、コンテナイメージの root をコンテナの root に設定する際、chroot を抜けられないように権限を制御したりしながら chroot を使ったりすることがあります。

一方で、LXC や Docker では pivot_root が使われます。chroot は比較的簡単に使えるのに対して、pivot_root はいくつか制限があります。

man 2 pivot_root すると、その制限について説明があります。

new_root および put_old には以下の制限がある:

ディレクトリでなければならない。

new_root と put_old は現在の root と同じファイルシステムにあってはならない。

put_old は new_root 以下になければならない。すなわち put_old を差す文字列に 1 個以上の ../ を付けることによって new_root と同じディレクトリが得られなければならない。

他のファイルシステムが put_old にマウントされていてはならない。

https://linuxjm.osdn.jp/html/LDP_man-pages/man2/pivot_root.2.html

(「差す」は「指す」の Typo ?)

しかし、この説明は少し微妙です。例えば、LXC ではコンテナイメージの root を、例えば /usr/lib/lxc/rootfs にバインドマウントして、そこに pivot_root します。

バインドマウントですので、言ってみれば new_root は新たにマウントされたファイルシステムとも言えますが、同じファイルシステム上のディレクトリとも言えます。

他に、

 998        /* change into new root fs */
 999        if (fchdir(newroot)) {
1000                SYSERROR("can't chdir to new rootfs '%s'", rootfs);
1001                goto fail;
1002        }
1003
1004        /* pivot_root into our new root fs */
1005        if (pivot_root(".", ".")) {
1006                SYSERROR("pivot_root syscall failed");
1007                goto fail;
1008        }
1009
1010        /*
1011         * at this point the old-root is mounted on top of our new-root
1012         * To unmounted it we must not be chdir'd into it, so escape back
1013         * to old-root
1014         */
1015        if (fchdir(oldroot) < 0) {
1016                SYSERROR("Error entering oldroot");
1017                goto fail;
1018        }
1019        if (umount2(".", MNT_DETACH) < 0) {
1020                SYSERROR("Error detaching old root");
1021                goto fail;
1022        }
1023
1024        if (fchdir(newroot) < 0) {
1025                SYSERROR("Error re-entering newroot");
1026                goto fail;
1027        }

(lxc-2.0.8時点のsrc/lxc/conf.c)

このあたりですね。新しく root としたいコンテナの root (newroot) に移動したあと、pivot_root(".", ".") として、以前の root である oldroot も newroot と同じディレクトリにマウントしてしまいます。その後 oldroot をアンマウントしています。これは「すなわち put_old を指す文字列に 1 個以上の ../ を付けることによって new_root と同じディレクトリが得られなければならない」ではないようにも思えてしまいます。

なので、実際どうなのかカーネルのコードをみてみました。

カーネルのコメント

pivot_root システムコールは fs/namespace.c 内にあります。手元にはなぜか 4.1.15 のソースコードがあるので、それでみると 2941 行目付近からが実装です。

実はここのコメントにも詳細な説明があります。これを読めば一件落着! かも。

/*
 * pivot_root Semantics:
 * Moves the root file system of the current process to the directory put_old,
 * makes new_root as the new root file system of the current process, and sets
 * root/cwd of all processes which had them on the current root to new_root.
 *
 * Restrictions:
 * The new_root and put_old must be directories, and  must not be on the
 * same file  system as the current process root. The put_old  must  be
 * underneath new_root,  i.e. adding a non-zero number of /.. to the string
 * pointed to by put_old must yield the same directory as new_root. No other
 * file system may be mounted on put_old. After all, new_root is a mountpoint.
 *
 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
 * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives
 * in this situation.
 *
 * Notes:
 *  - we don't move root/cwd if they are not at the root (reason: if something
 *    cared enough to change them, it's probably wrong to force them elsewhere)
 *  - it's okay to pick a root that isn't the root of a file system, e.g.
 *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
 *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
 *    first.
 */

私の超（＝ヒドい）訳を。

/*
 * pivot_root のセマンティクス:
 * カレントプロセスのルートファイルシステムを put_old ディレクトリへ移
 *  動させ、new_root をカレントプロセスの新しいルートファイルシステムに
 *  します。そして、現在のルートを使用しているすべてのプロセスのルート
 *  とカレントワーキングディレクトリを新しいルートに設定します
 *
 * 制限:
 * new_root と put_old はディレクトリでなくてはなりません。そして、
 * 現在のプロセスのルートと同じファイルシステム上にあってはなりません。
 * put_old は new_root の下になくてはなりません。すなわち、put_old
 * が指す文字列に 0 個以外の /.. を追加すると new_root と同じディレク
 * トリにならなくてはいけません。他のファイルシステムが put_old にマ
 * ウントされていてはいけません。結局、new_root はマウントポイントです。
 *
 * さらに、カレントの root が 'rootfs' (initramfs) となることはできま
 * せん。この場合の代替策は
 * Documentation/filesystems/ramfs-rootfs-initramfs.txt をご覧ください。
 *
 * 注意:
 *  - root にいない場合は、root/cwd に移動しません (理由: 十分に注意し
 *    て、それらを変更するのであれば、別の場所を強制するのはたぶん間違
 *    いでしょう)
 *  - ファイルシステムの root でない root を選択することができます。例
 *    えば、/nfs がマウントポイントである /nfs/my_root を選択できます。
 *    マウントポイントでなくてはなりませんので、mount --bind
 *    /nfs/my_root /nfs/my_root を最初に実行しておく必要があるかもしれ
 *    ません
 */

これでほぼ解決でしょうか :-)

カーネルコード

ですが、カーネルのコードを追って、どのような条件が設定されていうのかを確認しておきましょう。

2966SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
2967                const char __user *, put_old)
2968{
2969        struct path new, old, parent_path, root_parent, root;
2970        struct mount *new_mnt, *root_mnt, *old_mnt;
2971        struct mountpoint *old_mp, *root_mp;
2972        int error;

文字列で与えられたパスから struct path を取得します。ここで new_root、put_old がディレクトリであるかどうかのチェックをしているようです。また、カレントプロセスの root のパス (= root) を取得します。

2977        error = user_path_dir(new_root, &new);
  :(snip)
2981        error = user_path_dir(put_old, &old);
  :(snip)
2989        get_fs_root(current->fs, &root);

それぞれのパスから、struct mount を取得します。

2996        new_mnt = real_mount(new.mnt);
2997        root_mnt = real_mount(root.mnt);
2998        old_mnt = real_mount(old.mnt);

shared mount 以外

まず最初の条件、

2999        if (IS_MNT_SHARED(old_mnt) ||
3000                IS_MNT_SHARED(new_mnt->mnt_parent) ||
3001                IS_MNT_SHARED(root_mnt->mnt_parent))

new_mnt と root_mnt は struct mount で、そのメンバ mnt_parent はマウントの親子関係がある場合、要は他のマウント配下にマウントポイントがあり、そこにマウントされているような場合に、その親マウントを示すメンバです (たぶん)。(struct mount)

以前の root の移動先のマウント (put_old)、新しい root (new_root)およびカレントプロセスの root がマウントされている親のマウント (ファイルシステム) が shared マウントであってはいけません。

新しいrootのマウントが現プロセスと同じマウント名前空間

ふたつめの条件、

 if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
        goto out4;

check_mnt は、次のように引数 mnt で指定したマウントの名前空間とカレントプロセスのマウント名前空間が等しいかどうかをチェックしています。

 771static inline int check_mnt(struct mount *mnt)
 772{
 773        return mnt->mnt_ns == current->nsproxy->mnt_ns;
 774}

つまり、新しい root のマウントは、カレントプロセスのマウント名前空間に属していなければなりません。

カレントプロセスのrootと、new, oldが異なるマウント

同じファイルシステムでグルグル循環してはいけませんので、

3011        if (new_mnt == root_mnt || old_mnt == root_mnt)
3012                goto out4; /* loop, on the same file system  */

新しい root の mount 構造体と、カレントプロセスの root の mount 構造体が同じ、つまり同じマウント (ファイルシステム) である
古い root の mount 構造体と、カレントプロセスの root の mount 構造体が同じ、つまり同じマウント (ファイルシステム) である

このような場合はエラーとなります。old_mnt と root_mnt は同じじゃないの? と一瞬思ってしまうかもしれませんが、元の root を、新しい root 以下の put_old にマウントする際のマウントを表しています。例えば、マウントポイントが違いますので構造体のインスタンスは別ですね。

そもそもここが同じだと、pivot_root でルートを移動することになりません。

カレントプロセスのrootはマウントポイント

root.mnt->mnt_rootで、カレントプロセスの root マウント (ファイルシステム) の root の dentry を求めています。これがカレントプロセスの root の dentry と異なっている場合はエラーになります。

3014        if (root.mnt->mnt_root != root.dentry)
3015                goto out4; /* not a mountpoint */

ややこしいですが、カレントプロセスの root ディレクトリがマウントポイントと異なっていてはいけません。これは、カレントプロセスが chroot でマウントポイント以外を root としている場合でしょう (たぶん)。

カレントプロセスの root は attach されている

ここはちょっとよくわからないのですが、カレントプロセスの root がマウントの親子関係のツリー内にいるかどうかをチェックしています。

3016        if (!mnt_has_parent(root_mnt))
3017                goto out4; /* not attached */

mnt_has_parent は fs/mount.h 内にあります。

static inline int mnt_has_parent(struct mount *mnt)
{
    return mnt != mnt->mnt_parent;
}

指定した mnt と、自身のメンバである親マウントを表す mnt->mnt_parent が異なっている場合、つまり親マウントがあるかどうかをチェックしています。(構造体の初期化時点で mnt->mnt_parent = mnt という処理があるので、ちゃんと親子関係がなければ mnt == mnt->mnt_parent となるはずです、たぶん)

システム起動時の root を含め、きちんとマウントされていれば、親マウントは存在するんだと思います。というのは、カーネルパラメータで指定した root を起動時にマウントする際、do_move_mountを通りますが、この中でも同じようなチェックをしていて、エラーになったら root がマウントできないはずです (たぶん)。

このあたりで、元の root のマウントを mnt_parent に入れているような処理があります (たぶん、do_mount_move→attach_mnt→mnt_set_mountpointあたりの流れ)。

新しい root はマウントポイントで attach されている

3019        if (new.mnt->mnt_root != new.dentry)
3020                goto out4; /* not a mountpoint */
3021        if (!mnt_has_parent(new_mnt))
3022                goto out4; /* not attached */

先の説明のカレントプロセスのチェックと同じですね。新しい root マウントの root ディレクトリはマウントポイントで、きちんとマウントの親子関係の中に入っている必要があります。

old は new (新たな root) 配下

3023        /* make sure we can reach put_old from new_root */
3024        if (!is_path_reachable(old_mnt, old.dentry, &new))
3025                goto out4;

is_path_reachable は関数名から想像はつきますが、

2916/*
2917 * Return true if path is reachable from root
2918 *
2919 * namespace_sem or mount_lock is held
2920 */
2921bool is_path_reachable(struct mount *mnt, struct dentry *dentry,
2922                         const struct path *root)
2923{
2924        while (&mnt->mnt != root->mnt && mnt_has_parent(mnt)) {
2925                dentry = mnt->mnt_mountpoint;
2926                mnt = mnt->mnt_parent;
2927        }
2928        return &mnt->mnt == root->mnt && is_subdir(dentry, root->dentry);
2929}

という関数です。つまり、

old と new が異なるマウントで old に親マウントがある場合には、old の親マウントを参照

という処理を終えた後に条件判定をしていますので、

old の親 (マウント) が new
old (のマウントポイント) が new (のマウントポイント) のサブディレクトリ

という条件になります。ちなみに is_subdir はこんな。

3302/**
3303 * is_subdir - is new dentry a subdirectory of old_dentry
3304 * @new_dentry: new dentry
3305 * @old_dentry: old dentry
3306 *
3307 * Returns 1 if new_dentry is a subdirectory of the parent (at any depth).
3308 * Returns 0 otherwise.
3309 * Caller must ensure that "new_dentry" is pinned before calling is_subdir()3310 */
3311
3312int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
3313{
3314        int result;
3315        unsigned seq;
3316
3317        if (new_dentry == old_dentry)
3318                return 1;

(fs/dcache.cのis_subdir付近付近)

3317行目で、指定されているふたつのディレクトリが同じでも 1 を返してるので、「"/.." をつけて」とかは説明につけなくても良いのでは?

new はカレントプロセス root 配下

3026        /* make certain new is below the root */
3027        if (!is_path_reachable(new_mnt, new.dentry, &root))
3028                goto out4;

これは上 (old は new 配下) と同じですね。

まとめ

ここまできて、man やコード中のコメントをみても、わかったようなわかってないような気分になるのは、用語の曖昧さと気づきました。

ファイルシステム: マウントポイントにマウントされるマウントの情報と含まれるツリー (=`mount`構造体)

とするとすっきりします。つまり

new_root と put_old は、man 2 pivot_root にあったように:

ディレクトリでなければならない
現在の root ファイルシステムと同じファイルシステム であってはならない
put_old は new_root 以下になければならない
put_old に他のファイルシステムがマウントされていてはいけない

加えて、

new_root はカレントプロセスの root ファイルシステムの root 以下になければならない
カレントプロセスの root ファイルシステムの root はマウントポイントでなければならない
- chroot でマウントポイントから root が移動していてはいけない
new_root の root ディレクトリはマウントポイントでなければならない
old_put にマウントされるファイルシステム、new_mnt の親ファイルシステム、カレントプロセスの root ファイルシステムの親ファイルシステムが shared マウントであってはならない

こんな感じでしょうか。

ファイルシステムの実体 (配下のディレクトリやファイル) は判定には関係なく、マウントそのものとマウントポイントがキーとなります。新しい root はマウントポイントであれば良いので、bind マウントでも良いということになりますね。

pivot_root の条件のところだけ追って力尽きたので、実際のマウントやら移動の部分の解説はありません😅

2017-05-23

Linux 4.11 での cgroup 関連の話題

cgroup Linux Kernel

Linux 4.11 で cgroup に動きがありました。と言ってもしばらく新機能追えてないので、これまでも色々変更されているかも?

追加された事自体を忘れてしまいそうなのでメモしておくだけで、詳しく調べるわけではありません。

rdma controller

rdma コントローラというコントローラが新たに追加されています。RDMA は “Remote Direct Memory Access” ですか。知りませんでした。

Added rdma cgroup controller that does accounting, limit enforcement on rdma/IB resources.

とのことですから、Infiniband 関係でリソース制限を行うために追加されたのでしょうか。

rdmacg: Added rdma cgroup controller

Drop the matching uid requirement on migration for cgroup v2

cgroup v2 での、root 以外のユーザによる cgroup 操作の制限を一部外しましょう、というもののようですね。cgroup v2 では別に制限かかっているから、これがなくても OK みたいな。

cgroup: drop the matching uid requirement on migration for cgroup v2

slub: make sysfs directories for memcg sub-caches optional

slub: make sysfs directories for memcg sub-caches optional

2017-03-27

cgroup 初期化時の v1 の処理だけ抜き出して追ってみる

cgroup Linux Kernel

cgroup 初期化のメモ。間違っている可能性大なので信用しないでください。何度も同じところを繰り返し見てるので忘れないようにメモです。

サブシステム (cgroup_subsys) と各グループのサブシステムの状態 (cgroup_subsys_state) とそれらのセット (css_set) の関係とかわかってないと以下を見てもわけわからんかも。自分用のメモで、間違いあるかもしれません。

start_kernel からまず cgroup_init_early が呼び出された後、cgroup_init が呼び出される。

   492 asmlinkage __visible void __init start_kernel(void)
    :(snip)
   511         cgroup_init_early();
    :(snip)
   661         cgroup_init();

cgroup_init_early

  4960         RCU_INIT_POINTER(init_task.cgroups, &init_css_set);

init_task の css_set 型のメンバ cgroups を init_css_set で初期化します。init_css_set は cgroup.c で定義されています。

  4962         for_each_subsys(ss, i) {

これは静的に定義されているサブシステム分要素を持つ cgroup_subsys 型の cgroup_subsys[] 変数 (配列) 分ループをまわすものでした (cgroup の SUBSYS マクロ参照)。ss には i 番目のサブシステムが入ります。

  4963                 WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
  4964                      "invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p name:id=%d:%s\n",
  4965                      i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free,
  4966                      ss->id, ss->name);

まずは

サブシステムに必須のメソッドである css_alloc/free が存在しているかのチェック
サブシステム名とIDがまだ構造体のメンバに設定されていないチェック

を行います。

  4967                 WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN,
  4968                      "cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]);

次にサブシステム名 (cgroup_subsys_name[i] に入ってます) の名前が長すぎないかチェックしています。

  4969 
  4970                 ss->id = i;
  4971                 ss->name = cgroup_subsys_name[i];

サブシステム (変数の各メンバ) に、ID と名前を設定します。(各サブシステムで cgroup_subsys を定義する際には定義されていないメンバ)

  4973                 if (ss->early_init)
  4974                         cgroup_init_subsys(ss, true);

そしてサブシステムの early_init を行うように 1 が設定されている時は cgroup_init_subsys を呼び出します。

cgroup_init_subsys を見てみると、

  4895 static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
  4896 {
   :(snip)
  4906         /* Create the root cgroup state for this subsystem */
  4907         ss->root = &cgrp_dfl_root;
  4908         css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));

と ss->root には v2 の root が放り込まれます。つまり cgroup_init_subsys は v2 の処理を行うだけです。

cgroup_init

cgroup_init はちょっと長いのですが、これも v2 の処理を無視するとやっていることはシンプルです。

  4992         BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

まずはこれ。cgroup を操作するために各グループに現れるファイルがあります。このファイルを cftype という構造体で定義します。

このファイルのうち、サブシステムの処理とは直接関係ない、cgroup 自体の操作に関連する cgroup コアで扱うファイル群があり、その定義が cftype の配列として静的に定義されています。それが cgroup_legacy_base_files 変数です。これを cgroup_init_cftypes 関数に渡して、ファイル操作の定義を行います (多分)。

  4996         /* Add init_css_set to the hash table */
  4997         key = css_set_hash(init_css_set.subsys);
  4998         hash_add(css_set_table, &init_css_set.hlist, key);

システム上の css_set はすべて css_set_table というハッシュテーブルで管理されますので、それに init_css_set を登録します。css_set 内のサブシステムから計算される値を使ってハッシュのキー (key) を生成し、使用します。とは言っても init_css_set って v2 用じゃ…

そしてまたサブシステム分ループ。ループの最初に以下のような処理がありますが、

  5005                 if (ss->early_init) {
  5006                         struct cgroup_subsys_state *css =
  5007                                 init_css_set.subsys[ss->id];
  5008 
  5009                         css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
  5010                                                    GFP_KERNEL);
  5011                         BUG_ON(css->id < 0);
  5012                 } else {
  5013                         cgroup_init_subsys(ss, false);
  5014                 }

ここですが、cgroup_init_early で初期化したサブシステムは ID を確保、されなかったサブシステムは cgroup_init_subsys で初期化します。とは言っても、いずれも v2 処理。

その後の条件分岐、

  5035                 if (ss->dfl_cftypes == ss->legacy_cftypes) {
  5036                         WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
  5037                 } else {
  5038                         WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
  5039                         WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
  5040                 }

if 文 true の条件は特別な起動オプションを付けないと満たさない開発目的の条件なので無視。通常は false になるので、v2 の設定を行った (cgroup_add_dfl_cftypes呼び出し) 後の 5039 行目だけが v1 の処理でしょう、と思えますが、実は通常はここは何もしません。この理由は後で。

  5042                 if (ss->bind)
  5043                         ss->bind(init_css_set.subsys[ssid]);
  5044         }

その後、サブシステムに bind 関数が設定されている場合はそれを実行。

  5046         err = sysfs_create_mount_point(fs_kobj, "cgroup");

sysfs にマウントポイントとなるディレクトリを作成 (/sys/fs/cgroup)。

  5050         err = register_filesystem(&cgroup_fs_type);

cgroup というファイルシステムを登録。

  5056         proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);

proc に “cgroups” というファイルを作成。で終わりです。

cgroup_add_legacy_cftypes で何もしない理由

cgroup_add_legacy_cftypes で v1 っぽい処理を呼び出しているにも関わらず「通常は何もしません」と書いた理由を追ってみます。

cgroup_add_legacy_cftypes

cgroup_add_legacy_cftypes にサブシステムと cgroup_subsys 構造体型の各サブシステムの変数で定義されている legacy_cftypes を渡します。legacy_cftypes は各サブシステムで v1 のグループに出現させるファイルが各サブシステムごとに定義されています。

cgroup_add_legacy_cftypes 関数を見てみます。

  3313 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
  3314 {
    :(snip)
  3322         if (!cgroup_legacy_files_on_dfl ||
  3323             ss->dfl_cftypes != ss->legacy_cftypes) {
  3324                 for (cft = cfts; cft && cft->name[0] != '\0'; cft++)
  3325                         cft->flags |= __CFTYPE_NOT_ON_DFL;
  3326         }
  3327 
  3328         return cgroup_add_cftypes(ss, cfts);
  3329 }

通常は (開発目的のフラグがオン、またはデフォルトの cftypes が v1 の場合 (これも通常はない) でなければ)、各 cftype 型の変数のメンバ flags に “v2 階層には表示させない” というフラグを設定したのち、cgroup_add_cftypes を呼び出します。

cgroup_add_cftypes

cgroup_add_cftypes 関数

  3263 static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
  3264 {
   :(snip)
  3273         ret = cgroup_init_cftypes(ss, cfts);
   :(snip)
  3280         ret = cgroup_apply_cftypes(cfts, true);

cgroup_init_cftypes で cft->ss = ss という cftype 構造体の ss にサブシステムを入れている部分があります。そして cgroup_apply_cftypes 関数を呼びます。

cgroup_apply_cftypes

  3138 static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
  3139 {
  3141         struct cgroup_subsys *ss = cfts[0].ss;
  3142         struct cgroup *root = &ss->root->cgrp;

ss には cfts[0].ss を入れてますが、これは cgroup_init_cftypes でサブシステムそのものを入れました。root 変数にはサブシステムの root が入りますが、これは cgroup_init_subsys で v2 の root が入っていたはず。

  3155                 ret = cgroup_addrm_files(cgrp, cfts, is_add);

cgroup_addrm_files

そして cgroup_addrm_files 呼び出し。

  3105 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
  3106                               bool is_add)
  3107 {
   :(snip)
  3113         for (cft = cfts; cft->name[0] != '\0'; cft++) {
  3114                 /* does cft->flags tell us to skip this file on @cgrp? */
  3115                 if ((cft->flags & __CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp))
  3116                         continue;
  3117                 if ((cft->flags & __CFTYPE_NOT_ON_DFL) && cgroup_on_dfl(cgrp))
  3118                         continue;
   :(snip)

というループがあり、

現在の cgrp は v2 の root の cgroup が入っているはず (cgroup_apply_cftypes の処理参照)
cft->flags には __CFTYPE_NOT_ON_DFL が放り込まれていたはず (cgroup_add_legacy_cftypes の処理参照)

というわけで、ここはずっと continue で何もしないままループを抜けます。ループを抜けなければ cgroup_add_file で cgroup 用のファイルを追加するのですが、何もしないということです。

cgroup の SUBSYS マクロ

cgroup Linux Kernel

Linux カーネルの cgroup 関連のコードのお話。

cgroup_subsys.h というヘッダがあって、cpuset の部分だけ抜き出すと

#if IS_ENABLED(CONFIG_CPUSETS)
SUBSYS(cpuset)
#endif

という風に SUBSYS マクロの中に cpuset のようなサブシステム名を与えているだけのヘッダファイルがあります。

この SUBSYS マクロは cgroup.h と cgroup.c で計 4 回ほど定義されています。

まず最初に、cgroup.h で

#define SUBSYS(_x) _x ## _cgrp_id,

と定義されています。サブシステム名と _cgrp_id という文字列を連結します。

直後に enum の定義があり、

enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
        CGROUP_SUBSYS_COUNT,
};

上記のように、そこで cgroup_subsys.h が include されているので、さきほどの SUBSYS(cpuset) が cpuset_cgrp_id という風に展開されます。

つまり、

enum cgroup_subsys_id {
        cpuset_cgrp_id,

となります。cgroup_subsys.h には、他に存在するサブシステムが (kernel の config で有効になっている場合に) SUBSYS(サブシステム名) という風に定義されているので、(サブシステム名)_cgrp_id という値をサブシステム分持つ enum が定義されます。そして、最後が CGROUP_SUBSYS_COUNT となるわけですね。つまり、

enum cgroup_subsys_id {
        cpuset_cgrp_id,
        cpu_cgrp_id,
        cpuacct_cgrp_id,
        blkio_cgrp_id,
        memory_cgrp_id,
        devices_cgrp_id,
        freezer_cgrp_id,
        net_cls_cgrp_id,
        perf_event_cgrp_id,
        net_prio_cgrp_id,
        hugetlb_cgrp_id,
        debug_cgrp_id,
        CGROUP_SUBSYS_COUNT,
};

という enum が完成します。直後に

#undef SUBSYS

と SUBSYS は undef されているので、ここで上記の enum を作って役割は終わりです。この CGROUP_SUBSYS_COUNT はサブシステム分ループする場合に使えますね。例えばこんなふうに

#define for_each_subsys(ss, ssid)                   \
   for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT &&      \
        (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)

そして、サブシステムの配列などで、それぞれのサブシステムの ID を表す (サブシステム名)_cgrp_id という値が利用できます。あとは配列の領域確保のときに使われています。

もう 1 か所、cgroup.h に SUBSYS の定義があります。

#define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
#include <linux/cgroup_subsys.h>
#undef SUBSYS

サブシステム名を _cgrp_subsys と連結して、extern struct cgroup_subsys の後に書くので、例えば cpuset なら

extern struct cgroup_subsys cpuset_cgrp_subsys;

となります。cpuset_cgrp_subsys 変数は kernel/cpuset.c に定義がありますので、それの extern 宣言になりますね。同様に、有効になっている各サブシステムの cgroup_subsys 構造体ごとに宣言されます。

kernel/cgroup.c にも 2 ヶ所ほどあります。

/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
static struct cgroup_subsys *cgroup_subsys[] = {
#include <linux/cgroup_subsys.h>
};
#undef SUBSYS

最初に紹介した、サブシステム名と _cgrp_id を連結した enum の id が使われていますね。これも同様に

[cpuset_cgrp_id] = &cpuset_cgrp_subsys,

という行が生成されますね。つまり cgroup_subsys という配列の (サブシステム名)_cgrp_id 番目は (サブシステム名)_cgrp_subsys へのポインタということです。

これは先に紹介した for_each_subsys マクロで使われていましたね。

さいごに kernel/cgroup.c にあるのが、サブシステム名を命名するマクロです。

/* array of cgroup subsystem names */
#define SUBSYS(_x) [_x ## _cgrp_id] = #_x,
static const char *cgroup_subsys_name[] = {
#include <linux/cgroup_subsys.h>
};
#undef SUBSYS

これもさきほどと同様の処理でサブシステム名の配列を作っています。配列の (サブシステム名)_cgrp_id 番目は (サブシステム名)、つまり SUBSYS マクロに指定した文字列となります。

[cpuset_cgrp_id] = cpuset,

つまり、カーネルが使うサブシステムに関わる構造体の配列とか ID とか名前の配列なんかは、カーネルの config 時 (make config とか make menuconfig とか) の時点で定義され、コンパイル時点で作成されるってことですね。

いやー、さすがカーネル開発するような人は賢いなあ (←言ってる人アホっぽい)

2017-03-04

cgroup のデフォルトルート cgrp_dfl_root

cgroup Linux Kernel

このエントリはほぼ個人的なメモで、色々唐突です。

cgroupのコアは kernel/cgroup.c に色々処理があります。

その中に「デフォルトヒエラルキ（のルート）」という変数があります。

4.1 kernel のコードです。

/*
 * The default hierarchy, reserved for the subsystems that are otherwise
 * unattached - it never has more than a single cgroup, and all tasks are
 * part of that cgroup.
 */
struct cgroup_root cgrp_dfl_root;

cgroup の初期化処理は start_kernel から呼び出される cgroup_init_early と cgroup_init で行われますが、cgrp_dfl_root は cgroup_init で初期化されているようです。

int __init cgroup_init(void)
{
	struct cgroup_subsys *ss;
	unsigned long key;
	int ssid, err;

	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

	mutex_lock(&cgroup_mutex);

	/* Add init_css_set to the hash table */
	key = css_set_hash(init_css_set.subsys);
	hash_add(css_set_table, &init_css_set.hlist, key);

	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));

ここで cgroup_setup_root に cgrp_dfl_root が渡されます。この中で渡された root (つまり cgrp_dfl_root) の初期化処理がなされていきます。

ここで root = cgrp_dfl_root の場合は cgroup_dfl_base_files

static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
{

    : (snip)

	if (root == &cgrp_dfl_root)
		base_files = cgroup_dfl_base_files;
	else
		base_files = cgroup_legacy_base_files;

    : (snip)

	ret = cgroup_addrm_files(root_cgrp, base_files, true);

ここで設定した base_files を cgroup_addrm_files に渡して、cgroup コアが各 cgroup (つまり cgroup を表す各ディレクトリ) に作成するファイルを設定します。

cgroup_dfl_base_files ってのは

/* cgroup core interface files for the default hierarchy */
static struct cftype cgroup_dfl_base_files[] = {
	{
		.name = "cgroup.procs",
		.seq_start = cgroup_pidlist_start,
		.seq_next = cgroup_pidlist_next,
		.seq_stop = cgroup_pidlist_stop,
		.seq_show = cgroup_pidlist_show,
		.private = CGROUP_FILE_PROCS,
		.write = cgroup_procs_write,
		.mode = S_IRUGO | S_IWUSR,
	},
	{
		.name = "cgroup.controllers",
		.flags = CFTYPE_ONLY_ON_ROOT,
		.seq_show = cgroup_root_controllers_show,
	},
	{
		.name = "cgroup.controllers",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_controllers_show,
	},
	{
		.name = "cgroup.subtree_control",
		.seq_show = cgroup_subtree_control_show,
		.write = cgroup_subtree_control_write,
	},
	{
		.name = "cgroup.populated",
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = cgroup_populated_show,
	},
	{ }	/* terminate */
};

という内容ですが、ここで .name で指定されているファイルって cgroup-v2 の各 cgroup (ディレクトリ) に作成されるファイルです。

つまり start_kernel -> cgroup_init で起動時に設定される「デフォルトのヒエラルキ（のルート）」は v2 ?

確かに v2 は単一階層構造なのでここで初期化しておけば、あとはそれを利用すれば良いのに対して、v1 はいくつでもマウントできるので後で設定してもいいのかな? と思えますが、もしかして今後は v2 をデフォルトにしたいという考えの現れ?

3.15 kernel では

ちなみに v2 のコードがまだそれほど入っていないころの 3.15 だと、cgroup_init 内で cgroup_setup_root が呼ばれているのは同じですが、cgroup_setup_root は

static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
{

    :(snip)

	ret = cgroup_addrm_files(root_cgrp, cgroup_base_files, true);

と条件文はなくて直接 cgroup_base_files が渡されており、cgroup_base_files は

static struct cftype cgroup_base_files[] = {
	{
		.name = "cgroup.procs",
    : (snip)

	},
	{
		.name = "cgroup.clone_children",
		.flags = CFTYPE_INSANE,
    : (snip0)
	},
	{
		.name = "cgroup.sane_behavior",
		.flags = CFTYPE_ONLY_ON_ROOT,
    : (snip)
	},

	/*
	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
	 * don't exist if sane_behavior.  If you're depending on these, be
	 * prepared to be burned.
	 */
	{
		.name = "tasks",
		.flags = CFTYPE_INSANE,		/* use "procs" instead */
    : (snip)

	},
	{
		.name = "notify_on_release",
		.flags = CFTYPE_INSANE,
    : (snip)

	},
	{
		.name = "release_agent",
		.flags = CFTYPE_INSANE | CFTYPE_ONLY_ON_ROOT,

    : (snip)

	},
	{ }	/* terminate */
};

というふうに v1 のファイルは flags に CFTYPE_INSANE ってのが指定されています。追加の際に v1, v2 を判定して追加するかどうかを決めています。

static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
			      bool is_add)
{

    : (snip)

	for (cft = cfts; cft->name[0] != '\0'; cft++) {
		/* does cft->flags tell us to skip this file on @cgrp? */
		if ((cft->flags & CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp))
			continue;
		if ((cft->flags & CFTYPE_INSANE) && cgroup_sane_behavior(cgrp))
			continue;
		if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent)
			continue;
		if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
			continue;

		if (is_add) {
			ret = cgroup_add_file(cgrp, cft);
    : (snip)
}

という風に v1, v2 混在したようなコードになってます。

TenForward

技術ブログ。はてなダイアリーから移転しました

pivot_root できる条件

pivot_root の使われ方

カーネルのコメント

カーネルコード

shared mount 以外

新しいrootのマウントが現プロセスと同じマウント名前空間

カレントプロセスのrootと、new, oldが異なるマウント

カレントプロセスのrootはマウントポイント

カレントプロセスの root は attach されている

新しい root はマウントポイントで attach されている

old は new (新たな root) 配下

new はカレントプロセス root 配下

まとめ

Linux 4.11 での cgroup 関連の話題

rdma controller

Drop the matching uid requirement on migration for cgroup v2

slub: make sysfs directories for memcg sub-caches optional

cgroup 初期化時の v1 の処理だけ抜き出して追ってみる

cgroup_init_early

cgroup_init

cgroup_add_legacy_cftypes で何もしない理由

cgroup_add_legacy_cftypes

cgroup_add_cftypes

cgroup_apply_cftypes

cgroup_addrm_files

関連

cgroup の SUBSYS マクロ

cgroup のデフォルトルート cgrp_dfl_root

3.15 kernel では