
Why a Display Power-Key Callback Ended in a Suspend-Time Kernel Panic

Symptom

The device froze and eventually hit a kernel panic during suspend:

[  189.052980][ T5068] Unable to handle kernel paging request at virtual address 00046ffca9037bf9
[  189.052991][ T5068] Mem abort info:
[  189.052997][ T5068]   ESR = 0x0000000096000004
[  189.053005][ T5068]   EC = 0x25: DABT (current EL), IL = 32 bits
[  189.053013][ T5068]   SET = 0, FnV = 0
[  189.053020][ T5068]   EA = 0, S1PTW = 0
[  189.053027][ T5068]   FSC = 0x04: level 0 translation fault
[  189.053035][ T5068] Data abort info:
[  189.053039][ T5068]   ISV = 0, ISS = 0x00000004
[  189.053045][ T5068]   CM = 0, WnR = 0
[  189.053053][ T5068] [00046ffca9037bf9] address between user and kernel address ranges
[  189.053064][ T5068] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  189.053311][ T5068] Dumping ftrace buffer:
[  189.053331][ T5068]    (ftrace buffer empty)
[  189.055391][ T5068] CPU: 1 PID: 5068 Comm: binder:1027_3 Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
[  189.055405][ T5068] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
[  189.055412][ T5068] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  189.055426][ T5068] pc : dpm_complete+0x128/0x44c
[  189.055451][ T5068] lr : dpm_complete+0x114/0x44c
[  189.055462][ T5068] sp : ffffffc0243fbb40
[  189.055468][ T5068] x29: ffffffc0243fbb60 x28: ffffff8035d52580 x27: ffffffc00a1fc000
[  189.055489][ T5068] x26: ffffffc00a1fc210 x25: ffffffc0243fbb48 x24: ffffff8093e724a0
[  189.055508][ T5068] x23: ffffff8093e72518 x22: ffffff8093e72400 x21: ffffffc0092f0ae9
[  189.055527][ T5068] x20: ffffffc00a1fc1c0 x19: 0000000000000010 x18: ffffffc022c2d078
[  189.055545][ T5068] x17: 000000007b71745f x16: 000000007b71745f x15: ffffff8179342180
[  189.055564][ T5068] x14: 0000000000000010 x13: ffffffc0082809d4 x12: ffffffc00939e698
[  189.055582][ T5068] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffc00a0c7000
[  189.055600][ T5068] x8 : a9046ffca9037bfd x7 : 3a4d50006574656c x6 : 0000101a1e00090b
[  189.055619][ T5068] x5 : 0b09001e1a100000 x4 : 0000008000000000 x3 : ffffff8056d3a9c8
[  189.055637][ T5068] x2 : 00000000ffff93a3 x1 : 0000000000000000 x0 : ffffff8093e72400
[  189.055657][ T5068] Call trace:
[  189.055663][ T5068]  dpm_complete+0x128/0x44c
[  189.055677][ T5068]  suspend_devices_and_enter+0x894/0xc04
[  189.055698][ T5068]  pm_suspend+0x330/0x694
[  189.055711][ T5068]  state_store+0x104/0x1c8
[  189.055724][ T5068]  kobj_attr_store+0x30/0x48
[  189.055747][ T5068]  sysfs_kf_write+0x54/0x6c
[  189.055769][ T5068]  kernfs_fop_write_iter+0x104/0x1a4
[  189.055789][ T5068]  vfs_write+0x244/0x2e0
[  189.055805][ T5068]  ksys_write+0x78/0xe8
[  189.055816][ T5068]  __arm64_sys_write+0x1c/0x2c
[  189.055829][ T5068]  invoke_syscall+0x58/0x114
[  189.055845][ T5068]  el0_svc_common+0xb4/0xfc
[  189.055857][ T5068]  do_el0_svc+0x24/0x84
[  189.055867][ T5068]  el0_svc+0x2c/0x90
[  189.055884][ T5068]  el0t_64_sync_handler+0x68/0xb4
[  189.055897][ T5068]  el0t_64_sync+0x1a4/0x1a8
[  189.055920][ T5068] Code: b40002a8 f9400508 b40003e8 aa1603e0 (b85fc110)
[  189.055933][ T5068] ---[ end trace 0000000000000000 ]---
[  189.169167][ T5068] Kernel panic - not syncing: Oops: Fatal exception

The call trace shows a write to the sysfs power state file (state_store) entering pm_suspend, and the fault itself lands inside dpm_complete, called from suspend_devices_and_enter.
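The reported fault address can be tied back to the register dump with a little arithmetic. The sketch below assumes that the instruction word in parentheses on the Code: line, b85fc110, decodes as a 32-bit load from x8 - 4, and that the reported address has its top byte cleared. The function names are made up for illustration; the constants are copied from the oops.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the oops above. */
#define OOPS_X8       UINT64_C(0xa9046ffca9037bfd)
#define OOPS_FAULT_VA UINT64_C(0x00046ffca9037bf9)

/*
 * Assumption: the faulting instruction loads 32 bits from [x8 - 4],
 * and the address printed in the oops has its top byte cleared.
 */
uint64_t reported_fault_address(uint64_t x8)
{
    uint64_t accessed = x8 - 4;                      /* address the load touched */
    return accessed & UINT64_C(0x00ffffffffffffff);  /* drop the top byte */
}

/* The valid kernel pointers in this dump (x0, x22, sp, ...) all start with 0xffffff. */
int looks_like_kernel_pointer(uint64_t value)
{
    return (value >> 40) == UINT64_C(0xffffff);
}
```

Under those assumptions, x8 reproduces the fault address exactly, and x8 itself does not look like a kernel pointer, which fits a pointer loaded from memory that no longer holds what the code expects.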

First pass: where the fault shows up

[Figure: module trace]

The crash happens while the PM core is walking the device list one device at a time. The problematic device is disp_feature/disp-DSI-0.

[Figure: disp_feature issue]

During the suspend flow, disp-DSI-0 appears to be referencing a class object that looks as if it had already been unregistered.

First issue: display initialization was triggered from the power-key IRQ path

Looking through dmesg, two threads were running parts of initialization at nearly the same time:

  • Around 7.0x seconds, thread T615 reached mi_display_pwrkey_callback_set
  • Around 7.04 seconds, thread T710 handled the power-key IRQ
  • Around 7.40 seconds, T710 initialized mi_disp_core and mi_disp_log
  • Around 7.45 seconds, T675 initialized mi_disp_core and mi_disp_log again, but returned early because they were already initialized
  • Around 7.45 seconds, T675 initialized mi_disp_feature

That reveals the first clear design problem: display initialization was being pulled in by the power-key interrupt callback instead of following the normal display initialization path. That behavior is unsafe and needs to be corrected at the source.
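One way to picture the fix is a userspace model in which initialization is serialized behind a lock and the power-key callback only looks the state up, never creating it. This is a minimal sketch, not the driver's code: feature_model, feature_peek, powerkey_callback_model and the other names are hypothetical, and a pthread mutex stands in for whatever lock the driver would use.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct disp_feature; not the driver's type. */
struct feature_model { int ready; };

static struct feature_model feature_storage;
static struct feature_model *g_feature;   /* models g_disp_feature */
static int init_runs;                     /* times the real setup ran */
static pthread_mutex_t feature_lock = PTHREAD_MUTEX_INITIALIZER;

/* Normal init path: check and publish under one lock. */
int feature_init_model(void)
{
    pthread_mutex_lock(&feature_lock);
    if (!g_feature) {
        feature_storage.ready = 1;
        init_runs++;
        g_feature = &feature_storage;     /* publish only when fully set up */
    }
    pthread_mutex_unlock(&feature_lock);
    return 0;
}

/* Lookup for callbacks: returns what exists and never initializes. */
struct feature_model *feature_peek(void)
{
    struct feature_model *f;

    pthread_mutex_lock(&feature_lock);
    f = g_feature;
    pthread_mutex_unlock(&feature_lock);
    return f;
}

/* Models the power-key callback: bail out if the display is not ready. */
int powerkey_callback_model(void)
{
    return feature_peek() ? 0 : -ENODEV;
}

int feature_init_count(void)
{
    int n;

    pthread_mutex_lock(&feature_lock);
    n = init_runs;
    pthread_mutex_unlock(&feature_lock);
    return n;
}

static void *init_thread(void *unused)
{
    (void)unused;
    feature_init_model();
    return NULL;
}

/* Two threads enter init together; returns how often setup actually ran. */
int race_two_inits(void)
{
    pthread_t t[2];
    int started = 0;

    for (int i = 0; i < 2; i++)
        if (pthread_create(&t[started], NULL, init_thread, NULL) == 0)
            started++;
    assert(started == 2);
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return feature_init_count();
}
```

In this model an early key press simply gets -ENODEV, and two concurrent callers of the init path still run the setup only once.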

Second issue: duplicate sysfs creation kicked off cleanup on partially shared state

The next clue is the duplicate sysfs node creation:

    Line 4538: [    7.456376][  T710] sysfs: cannot create duplicate filename '/devices/virtual/mi_display/disp_feature'
    Line 4549: [    7.467624][  T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
    Line 4559: [    7.485547][  T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
    Line 4560: [    7.485552][  T710] Call trace:
    Line 4561: [    7.485555][  T710]  dump_backtrace+0xf4/0x11c
    Line 4562: [    7.485569][  T710]  show_stack+0x18/0x24
    Line 4563: [    7.485573][  T710]  dump_stack_lvl+0x60/0x90
    Line 4564: [    7.485580][  T710]  sysfs_create_dir_ns+0xf0/0x150
    Line 4565: [    7.485588][  T710]  kobject_add_internal+0x228/0x478
    Line 4566: [    7.485595][  T710]  kobject_add+0x94/0x10c
    Line 4567: [    7.485600][  T710]  device_add+0x144/0x618
    Line 4568: [    7.485607][  T710]  device_create_groups_vargs+0xcc/0x12c
    Line 4570: [    7.499011][  T710]  device_create+0x58/0x80
    Line 4571: [    7.499017][  T710]  mi_disp_feature_init+0xdc/0x20c [msm_drm]
    Line 4573: [    7.510902][  T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
    Line 4575: [    7.522143][  T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
    Line 4577: [    7.537274][  T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
    Line 4578: [    7.537302][  T710]  irq_thread_fn+0x44/0xa4
    Line 4579: [    7.537315][  T710]  irq_thread+0x164/0x290
    Line 4580: [    7.537320][  T710]  kthread+0x10c/0x154
    Line 4581: [    7.537328][  T710]  ret_from_fork+0x10/0x20
    Line 4583: [    7.547231][  T710] kobject_add_internal failed for disp_feature with -EEXIST, don't try to register things with the same name in the same directory.
    Line 4588: [    7.559217][  T710] [mi_disp:mi_disp_feature_init [msm_drm]] [E]create device failed for disp_feature
    Line 4591: [    7.572531][  T710] ------------[ cut here ]------------
    Line 4593: [    7.584887][  T710] remove_proc_entry: removing non-empty directory '/proc/mi_display', leaking at least 'mipi_rw_prim'
    Line 4594: [    7.584917][  T710] WARNING: CPU: 1 PID: 710 at fs/proc/generic.c:720 remove_proc_entry+0x1e0/0x1ec
    Line 4595: [    7.584935][  T710] Modules linked in: rmnet_wlan(OE) rmnet_offload(OE) rmnet_perf(OE) rmnet_shs(OE) rmnet_perf_tether(OE) rmnet_core(OE) gauge_iio(E) ipanetm(OE)
    Line 4625: [    7.672205][  T710] CPU: 1 PID: 710 Comm: irq/135-pm8941_ Tainted: G        WC OE      6.1.118-android14-11-maybe-dirty #1
    Line 4626: [    7.672211][  T710] Hardware name: Qualcomm Technologies, Inc. Spring QRD (DT)
    Line 4627: [    7.672214][  T710] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    Line 4628: [    7.672219][  T710] pc : remove_proc_entry+0x1e0/0x1ec
    Line 4629: [    7.672234][  T710] lr : remove_proc_entry+0x1e0/0x1ec
    Line 4630: [    7.672240][  T710] sp : ffffffc00b4a3c60
    Line 4631: [    7.672242][  T710] x29: ffffffc00b4a3c80 x28: 0000000000000000 x27: 00000000ffffffff
    Line 4632: [    7.672250][  T710] x26: 0000000000000001 x25: ffffffc00a1a4580 x24: 000000000000000a
    Line 4633: [    7.672256][  T710] x23: 000000000000000a x22: ffffffc009318048 x21: ffffff804c52b180
    Line 4634: [    7.672263][  T710] x20: ffffff804c52b22c x19: ffffff804c52b200 x18: ffffffc00aafd048
    Line 4635: [    7.672269][  T710] x17: 0000000000000015 x16: 00000000000000a4 x15: ffffffc00902ec88
    Line 4636: [    7.672276][  T710] x14: 0000000000000001 x13: 000000000000004e x12: 0000000000000018
    Line 4637: [    7.672282][  T710] x11: 00000000ffffffff
    Line 4640: [    7.687628][  T710]  x10: ffffffc00a09eb5c x9 : 67aa0542b3522000
    Line 4641: [    7.687638][  T710] x8 : 67aa0542b3522000 x7 : 656c20746120676e x6 : 0000000000000027
    Line 4642: [    7.687644][  T710] x5 : ffffff8179154234 x4 : ffffffc0093675d5 x3 : ffff0a00ffffff04
    Line 4643: [    7.687651][  T710] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000063
    Line 4644: [    7.687658][  T710] Call trace:
    Line 4645: [    7.687663][  T710]  remove_proc_entry+0x1e0/0x1ec
    Line 4646: [    7.687673][  T710]  mi_disp_core_deinit+0x34/0x60 [msm_drm]
    Line 4653: [    7.705247][  T710]  mi_disp_feature_init+0x16c/0x20c [msm_drm]
    Line 4663: [    7.722296][  T710]  mi_get_disp_feature+0x20/0x40 [msm_drm]
    Line 4669: [    7.739086][  T710]  mi_display_powerkey_callback+0x18/0x80 [msm_drm]
    Line 4671: [    7.762509][  T710]  pm8941_pwrkey_irq+0x1e8/0x330 [pm8941_pwrkey]
    Line 4672: [    7.762528][  T710]  irq_thread_fn+0x44/0xa4
    Line 4673: [    7.762539][  T710]  irq_thread+0x164/0x290
    Line 4674: [    7.762544][  T710]  kthread+0x10c/0x154
    Line 4675: [    7.762550][  T710]  ret_from_fork+0x10/0x20
    Line 4677: [    7.784476][  T710] ---[ end trace 0000000000000000 ]---
    Line 4678: [    7.784632][  T710] [mi_disp:mi_display_powerkey_callback [msm_drm]] [E]invalid dsi_display or dsi_panel ptr

The key point here is that pm8941_pwrkey_irq eventually drives mi_disp_core_deinit, whose code is:

void mi_disp_core_deinit(void)
{
    if (!g_disp_core)
        return;
    debugfs_remove_recursive(g_disp_core->debugfs_dir);
    remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
    class_destroy(g_disp_core->class);
    kfree(g_disp_core);
    g_disp_core = NULL;     // sets g_disp_core to NULL, but ...
}

This destroys g_disp_core->class, frees g_disp_core, and then sets g_disp_core to NULL.

A few details matter here:

  1. class_destroy removes the class object.
  2. kfree(g_disp_core) does not zero the old memory immediately; it marks that memory as free and available for reuse.
  3. g_disp_core = NULL only updates that specific pointer variable. It does not automatically invalidate every other pointer that still refers to data inside the freed object.
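The third point is easy to reproduce outside the kernel. The following is a small userspace model, with hypothetical type and function names, where the class is a static object and a registered flag stands in for class_destroy(), so nothing is actually freed and the demo stays well defined. The field is named cls rather than class purely for portability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the roles mirror the listing. */
struct class_model   { int registered; };
struct core_model    { struct class_model *cls; };
struct feature_model { struct class_model *cls; };

static struct class_model the_class = { .registered = 1 };
static struct core_model core_storage = { .cls = &the_class };

struct core_model *g_core = &core_storage;            /* models g_disp_core */
struct feature_model feature = { .cls = &the_class }; /* a copy, like df->class */

/* Models mi_disp_core_deinit without freeing any memory. */
void core_deinit_model(void)
{
    if (!g_core)
        return;
    g_core->cls->registered = 0;   /* stands in for class_destroy() */
    g_core = NULL;                 /* clears this one variable only */
}

/* Returns 1 if the feature still points at a class that was destroyed. */
int feature_class_is_stale(void)
{
    return feature.cls != NULL && !feature.cls->registered;
}
```

After core_deinit_model() runs, g_core is NULL but feature.cls still holds the old address. In the real driver that address refers to memory that has been handed back to the allocator.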

To see how this path is reached, the relevant caller is mi_disp_feature_init:

int mi_disp_feature_init(void)
{
    int ret = 0;
    struct disp_feature *df = NULL;
    struct disp_core *disp_core = NULL;
    int i;

    ret = mi_disp_core_init();
    if (ret < 0)
        return -ENODEV;

    mi_disp_log_init();

    disp_core = mi_get_disp_core();
    if (!disp_core)
        return -ENODEV;

    if (g_disp_feature) {
        DISP_INFO("mi disp_feature already initialized, return!\n");
        return 0;
    }

    df = kzalloc(sizeof(struct disp_feature), GFP_KERNEL);
    if (!df) {
        DISP_ERROR("can not allocate Buffer\n");
        ret = -ENOMEM;
        goto err_core_deinit;
    }

    ret = mi_disp_cdev_register(DISP_FEATURE_DEVICE_NAME,
                &disp_feature_fops, &df->cdev);
    if (ret < 0) {
        DISP_ERROR("cdev register failed for %s\n", DISP_FEATURE_DEVICE_NAME);
        goto err_alloc_mem;
    }

    df->dev_id = df->cdev->dev;
    df->class = disp_core->class;                     /* disp_core->class is copied into disp_feature */
    df->pdev = device_create(df->class, NULL, df->dev_id, df, DISP_FEATURE_DEVICE_NAME);
    if (IS_ERR(df->pdev)) {
        DISP_ERROR("create device failed for %s\n", DISP_FEATURE_DEVICE_NAME);    /* this is the error that was logged */
        ret = -ENODEV;
        goto err_cdev_register;
    }

    df->version = MI_DISP_FEATURE_VERSION;
    for (i = MI_DISP_PRIMARY; i < MI_DISP_MAX; i++) {
        df->d_display[i].dev = NULL;
        df->d_display[i].display = NULL;
        df->d_display[i].disp_id = MI_DISP_MAX;
        df->d_display[i].intf_type = MI_INTF_MAX;
        mutex_init(&df->d_display[i].mutex_lock);
    }
    INIT_LIST_HEAD(&df->client_list);
    spin_lock_init(&df->client_spinlock);

    g_disp_feature = df;                                /* on first init, the allocated df pointer is stored in the global */

    DISP_INFO("mi disp_feature driver initialized!\n");

    if (hwconf_init() < 0) {
        DISP_ERROR("can not initialize hwconf.\n");
    }

    return 0;

err_cdev_register:                                      /* execution jumps here */
    mi_disp_cdev_unregister(df->cdev);
err_alloc_mem:
    kfree(df);
err_core_deinit:
    mi_disp_core_deinit();     /* the core teardown happens here */
    return ret;
}

Once device_create() fails because of the duplicate node, execution drops into the error path:

  • unregister cdev
  • free df
  • deinit display core

That is where shared state begins to drift out of sync.
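One possible way to keep a failing init from tearing down state that another caller is still using is to reference-count the core. The sketch below is a userspace model under that assumption: core_get, core_put and feature_init_model are hypothetical names, calloc/free stand in for the kernel allocator, and locking is left out for brevity even though real code would need it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct disp_core. */
struct core_model { int refs; };

static struct core_model *g_core;   /* models g_disp_core */

/* Create the core on first use, otherwise take another reference. */
int core_get(void)
{
    if (!g_core) {
        g_core = calloc(1, sizeof(*g_core));
        if (!g_core)
            return -ENOMEM;
    }
    g_core->refs++;
    return 0;
}

/* Drop one reference; the core is destroyed only when nobody uses it. */
void core_put(void)
{
    if (!g_core)
        return;
    assert(g_core->refs > 0);
    if (--g_core->refs == 0) {
        free(g_core);
        g_core = NULL;
    }
}

int core_is_alive(void)
{
    return g_core != NULL;
}

/*
 * Models mi_disp_feature_init. simulate_failure plays the role of the
 * device_create() failure; the error path drops only its own reference.
 */
int feature_init_model(int simulate_failure)
{
    int ret = core_get();

    if (ret < 0)
        return ret;
    if (simulate_failure) {
        core_put();
        return -ENODEV;
    }
    return 0;   /* the reference is kept until the feature is torn down */
}
```

With this shape, a second caller that fails after the first one succeeded leaves the core alive, because the first caller's reference is still outstanding.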

Third issue: setting a local parameter to NULL does not fix the caller’s pointer

The unregister helper is also misleading:

void mi_disp_cdev_unregister(struct cdev *cdev)
{
    unregister_chrdev_region(cdev->dev, 1);
    cdev_del(cdev);
    cdev = NULL;
}

This is another concrete bug in the code path.

cdev is only a local function parameter. Setting that local variable to NULL does not change the caller’s actual pointer. So df->cdev remains non-NULL after the function returns.

That can be verified directly by checking g_disp_feature->cdev, which still retains the old value.

[Figure: g_disp_feature cdev still present]

The same conclusion is visible at the assembly level:

[Figure: assembly confirmation]

From the register behavior, x0 carries the incoming cdev value, then the function saves it into x19 immediately. The subsequent operations work from the saved register, and there is no mechanism here that would propagate a NULL assignment back to the original caller-owned pointer. So by the time the function returns, the external pointer has not been cleared.
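The by-value behavior described above can be reproduced in a few lines of plain C. In this sketch cdev_model and both helper names are hypothetical, and a registered flag replaces the real unregister_chrdev_region() and cdev_del() calls so that nothing is freed. The second helper shows one common alternative: take the address of the caller's pointer so the helper can clear it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cdev. */
struct cdev_model { int registered; };

/* Same shape as the listing: the NULL store only changes the local copy. */
void unregister_by_value(struct cdev_model *cdev)
{
    cdev->registered = 0;   /* stands in for the real unregister calls */
    cdev = NULL;            /* no effect on the caller's pointer */
    (void)cdev;
}

/* Taking the address of the caller's pointer lets the helper clear it. */
void unregister_and_clear(struct cdev_model **cdevp)
{
    if (!cdevp || !*cdevp)
        return;
    (*cdevp)->registered = 0;
    *cdevp = NULL;
}
```

Called as unregister_and_clear(&df->cdev), the second form would leave df->cdev as NULL on return, while the first form leaves the caller holding the old address.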

Fourth issue: freed memory was still reachable through another pointer

There is another dangerous part in the error path:

goto err_cdev_register
    err_cdev_register:                                      /* execution jumps here */
        mi_disp_cdev_unregister(df->cdev);   /* unregisters the cdev */
    err_alloc_mem:
        kfree(df);                                             /* marks df's memory as free */
    err_core_deinit:
        mi_disp_core_deinit();                        /* the core teardown happens here */

After kfree(df), the code sets neither df nor g_disp_feature to NULL. That is highly error-prone.

The important distinction is this:

  • df and g_disp_feature point to the same memory block
  • but they are still different pointer variables

Freeing df only marks the underlying memory as reusable. If some other path later reuses that memory, both df and g_disp_feature still hold the stale address. Any later dereference can become a use-after-free and trigger exactly the kind of invalid access seen during suspend.

What actually caused the suspend-time crash

Several problem points showed up in this flow, but the crash itself was ultimately tied to stale class state.

This assignment is the critical link:

    df->class = disp_core->class;                     /* disp_core->class is copied into disp_feature */

And later, core teardown does this:

void mi_disp_core_deinit(void)
{
    if (!g_disp_core)
        return;
    debugfs_remove_recursive(g_disp_core->debugfs_dir);
    remove_proc_entry(MI_DISPLAY_PROCFS_DIR, NULL);
    class_destroy(g_disp_core->class);
    kfree(g_disp_core);
    g_disp_core = NULL;     /* sets g_disp_core to NULL, but the disp_core->class value is not wiped; df->class still points at it */
}

g_disp_core is cleared, but the class pointer already copied into df->class is not synchronized with that teardown. In other words, the class gets destroyed, yet another structure still believes it owns a valid class pointer.

When suspend later runs, the flow behaves as if that class still exists. From the memory view, the entire class object is already corrupted or reused, which strongly suggests the freed block has been taken over by something else. At that point, touching it during the suspend completion path leads straight to the paging fault and panic.

Final takeaway

The initialization path being callable from the power-key IRQ was the first structural problem. The error handling around cdev and df made the state even more fragile. But the direct reason for the final crash was that class destruction was not reflected into the pointers still held by display feature state.

To make this path safe, both g_disp_core and g_disp_feature need proper invalidation and synchronization after teardown, instead of leaving stale pointers behind for suspend to trip over.