Linux Scheduling Problems with P-cores and E-cores on Intel CPUs
Starting with 12th-gen Alder Lake, Intel introduced its hybrid P-core/E-core architecture. I recently ran into the following problem: on Linux, a single-threaded workload sometimes runs on a P-core and sometimes on an E-core (even when the P-cores are idle). When I hit this I was completely baffled, because my impression was that this class of problem had been discussed to death when ADL first came out and should long since have been fixed; after all, I am running kernel 6.1.
So far, the first clue is that the problem only occurs under virtualization (our stack is based on ACRN, https://github.com/projectacrn); on bare metal it does not reproduce.
I took a quick first pass through the scheduler side of the kernel code. My first suspect was select_task_rq_fair(), but after a careful read it does exactly what its CFS comment says: it simply picks the idlest CPU and puts the task on that runqueue.
# kernel/sched/fair.c
/*
 * select_task_rq_fair: Select target runqueue for the waking task in domains
 * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
 * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
 *
 * Balances load by selecting the idlest CPU in the idlest group, or under
 * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
 *
 * Returns the target CPU number.
 */
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
Next I tried turning load balancing off in both the VM and on bare metal (via the cpuset knob below). It turned out that even on bare metal, an E-core could then be chosen to run the task. This shows two things:
- select_task_rq_fair() itself will happily pick an E-core.
- On bare metal, it is CPU load balancing that eventually moves the task onto a P-core, which is why the symptom never shows up there.
echo 0 > /sys/fs/cgroup/cpuset/[cgroup_name]/sched_load_balance
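To watch where the task actually lands, a trivial single-threaded spinner that keeps reporting its current CPU is enough. A minimal sketch (sched_getcpu() is glibc; mapping CPU numbers to core types is platform-specific):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	for (;;) {
		/* Stay busy so the task is always runnable, then report
		 * which logical CPU the scheduler has placed us on. */
		for (volatile long i = 0; i < 500000000L; i++)
			;
		printf("running on cpu %d\n", sched_getcpu());
		fflush(stdout);
	}
}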
Now I knew the problem was tied to load balancing, so the next step was figuring out how the kernel tells P-cores and E-cores apart. The answer turned out to be a component called ITMT. The name, Intel Turbo Boost Max Technology 3.0, sounds like it has little to do with the scheduler, but it really is what hands out CPU scheduling priorities.
# arch/x86/kernel/itmt.c
/**
 * sched_set_itmt_core_prio() - Set CPU priority based on ITMT
 * @prio: Priority of cpu core
 * @core_cpu: The cpu number associated with the core
 *
 * The pstate driver will find out the max boost frequency
 * and call this function to set a priority proportional
 * to the max boost frequency. CPU with higher boost
 * frequency will receive higher priority.
 *
 * No need to rebuild sched domain after updating
 * the CPU priorities. The sched domains have no
 * dependency on CPU priorities.
 */
void sched_set_itmt_core_prio(int prio, int core_cpu)
{
	int cpu, i = 1;

	for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
		int smt_prio;

		/*
		 * Ensure that the siblings are moved to the end
		 * of the priority chain and only used when
		 * all other high priority cpus are out of capacity.
		 */
		smt_prio = prio * smp_num_siblings / (i * i);
		per_cpu(sched_core_priority, cpu) = smt_prio;
		i++;
	}
}
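To get a feel for the formula, here is a replay of that loop with made-up highest_perf values: a hypothetical P-core with prio 68 and two SMT siblings, and an E-core with prio 45 and a single thread (only the ordering matters, not the numbers):

#include <stdio.h>

/* Replays smt_prio = prio * smp_num_siblings / (i * i) for each
 * hardware thread of a core, using hypothetical priorities. */
static void show(const char *name, int prio, int siblings)
{
	for (int i = 1; i <= siblings; i++)
		printf("%s thread %d: smt_prio = %d\n",
		       name, i, prio * siblings / (i * i));
}

int main(void)
{
	show("p-core", 68, 2);	/* 136, 34 */
	show("e-core", 45, 1);	/* 45 */
	return 0;
}

So a P-core's primary thread (136) outranks an E-core (45), while the P-core's second hyperthread (34) ranks below both, which is exactly the "siblings are moved to the end of the priority chain" behavior described in the comment.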
The following command shows whether ITMT is enabled:
cat /proc/sys/kernel/sched_itmt_enabled
On my problematic setup this sysctl node did not exist at all, so ITMT clearly had something to do with it. Reading on: ITMT is initialized from the intel_pstate driver (so first make sure your cpufreq driver is intel_pstate). The driver fetches each core's highest_perf from CPPC and then assigns the core priority based on that value. In my environment it died at line 357.
# drivers/cpufreq/intel_pstate.c
351 static void intel_pstate_set_itmt_prio(int cpu)
352 {
353 struct cppc_perf_caps cppc_perf;
354 static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
355 int ret;
356
357 ret = cppc_get_perf_caps(cpu, &cppc_perf);
358 if (ret)
359 return;
360
361 /*
362 * On some systems with overclocking enabled, CPPC.highest_perf is hardcoded to 0xff.
363 * In this case we can't use CPPC.highest_perf to enable ITMT.
364 * In this case we can look at MSR_HWP_CAPABILITIES bits [8:0] to decide.
365 */
366 if (cppc_perf.highest_perf == CPPC_MAX_PERF)
367 cppc_perf.highest_perf = HWP_HIGHEST_PERF(READ_ONCE(all_cpu_data[cpu]->hwp_cap_cached));
368
...
374 sched_set_itmt_core_prio(cppc_perf.highest_perf, cpu);
375
376 if (max_highest_perf <= min_highest_perf) {
377 if (cppc_perf.highest_perf > max_highest_perf)
378 max_highest_perf = cppc_perf.highest_perf;
379
380 if (cppc_perf.highest_perf < min_highest_perf)
381 min_highest_perf = cppc_perf.highest_perf;
382
383 if (max_highest_perf > min_highest_perf) {
...
390 schedule_work(&sched_itmt_work);
391 }
392 }
393 }
Tracing further down, CPPC initialization had failed:
# drivers/acpi/cppc_acpi.c
int acpi_cppc_processor_probe(struct acpi_processor *pr)
{
	struct acpi_buffer output = {ACPI_ALLOCATE_BUFFER, NULL};
	union acpi_object *out_obj, *cpc_obj;
	struct cpc_desc *cpc_ptr;
	struct cpc_reg *gas_t;
	struct device *cpu_dev;
	acpi_handle handle = pr->handle;
	unsigned int num_ent, i, cpc_rev;
	int pcc_subspace_id = -1;
	acpi_status status;
	int ret = -ENODATA;

	if (!osc_sb_cppc2_support_acked) {
		pr_debug("CPPC v2 _OSC not acked\n");
		if (!cpc_supported_by_cpu())
			return -ENODEV;
	}
	...
At this point the problem is fairly clear. ITMT is what tells the scheduler about CPU scheduling priorities. Because ITMT was broken, the scheduler could not tell which CPUs were P-cores and which were E-cores, so placement went wrong. ITMT failed to initialize because it depends on CPPC, and in my environment CPPC initialization failed because it relies on the hypervisor answering the ACPI _OSC query, which is not implemented yet. As a workaround, the only option for now is to force cpc_supported_by_cpu() to return true. Or skip fetching highest_perf from CPPC entirely and take it from MSR_HWP_CAPABILITIES instead; that should work too?
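For reference, a minimal sketch of the first workaround: hack cpc_supported_by_cpu() in drivers/acpi/cppc_acpi.c to lie until the hypervisor learns to answer the _OSC query (a local hack, obviously not upstreamable):

# drivers/acpi/cppc_acpi.c
static bool cpc_supported_by_cpu(void)
{
	/*
	 * WORKAROUND: our hypervisor does not handle the ACPI _OSC
	 * query yet, so osc_sb_cppc2_support_acked stays false and
	 * acpi_cppc_processor_probe() bails out with -ENODEV. Claim
	 * CPPC support unconditionally so the probe can proceed.
	 */
	return true;
}

The proper fix is still to implement the _OSC handshake in the hypervisor, or to derive highest_perf from MSR_HWP_CAPABILITIES the way line 367 above already does for the overclocking case.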