When I enter the command adb shell "cat /d/shrinker", the system crashes with 100% after dumping the following kernel message as below.
[ 761.636711] Unable to handle kernel paging request at virtual address f38a9a84
[ 761.645048] pgd = e8074000
[ 761.649800] [f38a9a84] *pgd=a0721811, *pte=00000000, *ppte=00000000
[ 761.658106] Internal error: Oops: 7 [#1] PREEMPT SMP ARM
[ 761.665481] Modules linked in:
[ 761.670605] CPU: 6 PID: 6978 Comm: sh Tainted: G W 3.18.31-g5b0528e-dirty #12
[ 761.680852] task: eb5b8000 ti: d228e000 task.ti: d228e000
[ 761.688315] PC is at do_raw_spin_lock+0x8/0x17c
[ 761.694912] LR is at list_lru_count_node+0x14/0x60
//--- skip ---
[ 761.885719] [<c016a650>] (do_raw_spin_lock) from [<c02057f4>] (list_lru_count_node+0x14/0x60)
[ 761.896307] [<c02057f4>] (list_lru_count_node) from [<c022dffc>] (super_cache_count+0x40/0x94)
[ 761.906982] [<c022dffc>] (super_cache_count) from [<c01f2708>] (debug_shrinker_show+0x44/0x7c)
[ 761.917658] [<c01f2708>] (debug_shrinker_show) from [<c024a56c>] (seq_read+0x1a4/0x3f0)
[ 761.927726] [<c024a56c>] (seq_read) from [<c022c87c>] (vfs_read+0x88/0x138)
[ 761.936755] [<c022c87c>] (vfs_read) from [<c022cda8>] (SyS_read+0x3c/0x7c)
[ 761.945695] [<c022cda8>] (SyS_read) from [<c0106840>] (ret_fast_syscall+0x0/0x38)
[ 761.955242] Code: e12fff1e deaf1eed e92d47f0 e1a08000 (e5902004)
After loading coredump using T32 program, we can find out that the data abort is triggered inside debug_spin_lock_before() function during sanity-check procedure of spinlock where list_lru_count_node() is attempting to acquire.
-000|do_DataAbort(addr = 170, fsr = 3247439880, regs = 0xEC4901C0)
-001|__dabt_svc(asm)
-->|exception
-002|debug_spin_lock_before(inline) //<<-- data abort
-002|do_raw_spin_lock(lock = 0xF38A9A80)
-003|list_lru_count_node(?, ?)
-004|super_cache_count(shrink = 0xEBC47DE0, sc = 0xD228FEF8)
-005|debug_shrinker_show(s = 0xE7CF63C0, ?)
-006|seq_read(file = 0xE805CA00, buf = 0xB3246000, size = 4096, ppos = 0xD228FF80)
-007|vfs_read(file = 0xFFFFFFFF, buf = 0x0, ?, pos = 0xD228FF80)
-008|SYSC_read(inline)
-008|sys_read(?, buf = -1289461760, count = 4096)
-009|ret_fast_syscall(asm)
Upon acquiring spinlock, all members of raw_spinlock_t(0xF38A9A80) have been scribbled.
It is necessary to examine the variable which contains spinlock as member.
(register raw_spinlock_t *) [R0] lock = 0xF38A9A80 -> (
(arch_spinlock_t) [NSD:0xF38A9A80] raw_lock = (
(u32) [NSD:0xF38A9A80] slock = 0x6F632F34,
(struct __raw_tickets) [NSD:0xF38A9A80] tickets = (
(u16) [NSD:0xF38A9A80] owner = 0x2F34, //<<--
(u16) [NSD:0xF38A9A82] next = 0x6F63)), //<<--
(unsigned int) [NSD:0xF38A9A84] magic = 0x6E6F6D6D //<<--
(unsigned int) [NSD:0xF38A9A88] owner_cpu = 0x6F6F675F,
(void *) [NSD:0xF38A9A8C] owner = 0x5F656C67))
The address of spin_lock is passed from &nlru->lock which is responsible for saving &lru->node[nid].
51unsigned long
52list_lru_count_node(struct list_lru *lru, int nid)
53{
54 unsigned long count = 0;
55 struct list_lru_node *nlru = &lru->node[nid]; <<--
56
57 spin_lock(&nlru->lock);
The list_lru_count_node() is called inside super_cache_count() [128 line.].
108static unsigned long super_cache_count(struct shrinker *shrink,
109 struct shrink_control *sc)
110{
111 struct super_block *sb;
112 long total_objects = 0;
113
114 sb = container_of(shrink, struct super_block, s_shrink);
//snip
124 if (sb->s_op && sb->s_op->nr_cached_objects)
125 total_objects = sb->s_op->nr_cached_objects(sb,
126 sc->nid);
127
128 total_objects += list_lru_count_node(&sb->s_dentry_lru, //<<--
129 sc->nid);
Tracing back to the caller function of super_cache_count() using T32 program, we can find out that debug_shrinker_show() calls super_cache_count().
Here, the second parameter in a call to super_cache_count() is &sc defined as local variable whose memory location belongs to stack area, which looks very *odd*. odd debug information.
205static int debug_shrinker_show(struct seq_file *s, void *unused)
206{
207 struct shrinker *shrinker;
208 struct shrink_control sc;
209
210 sc.gfp_mask = -1;
211 sc.nr_to_scan = 0;
212
213 down_read(&shrinker_rwsem);
214 list_for_each_entry(shrinker, &shrinker_list, list) {
215 int num_objs;
216
217 num_objs = shrinker->count_objects(shrinker, &sc); //<<-- shrinker->count_objects = super_cache_count
Here, let me wrap up the call trace flow as follows.
1. shrinker->count_objects(shrinker, &sc); // <<-- super_cache_count
2. list_lru_count_node(&sb->s_dentry_lru, sc->nid); // The second parameter sc->nid contains garbage value(since it is defined as local variable without initialization)
3. list_lru_count_node(struct list_lru *lru, int nid) // <<-- nid(sc->nid): garbage value
3.1 struct list_lru_node *nlru = &lru->node[nid]; // &lru->node[nid]; // access lru->node from nid(garbage value)
3.2 spin_lock(&nlru->lock); // &nlru->lock: garbage address
Solution
After visiting linux kernel community and google web-site, we have found that google already generated revert commit
https://chromium-review.googlesource.com/c/345895/
The commit message below seems to contain identical debugging signature what we analyzed(shrink_control.nid is used but not initialzed)
Commit log
ANDROID: Revert "mm: vmscan: Add a debug file for shrinkers"
Kernel panic when type "cat /sys/kernel/debug/shrinker"
Unable to handle kernel paging request at virtual address 0af37d40
pgd = d4dec000
[0af37d40] *pgd=00000000
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[<c0bb8f24>] (_raw_spin_lock) from [<c020aa08>] (list_lru_count_one+0x14/0x28)
[<c020aa08>] (list_lru_count_one) from [<c02309a8>] (super_cache_count+0x40/0xa0)
[<c02309a8>] (super_cache_count) from [<c01f6ab0>] (debug_shrinker_show+0x50/0x90)
[<c01f6ab0>] (debug_shrinker_show) from [<c024fa5c>] (seq_read+0x1ec/0x48c)
[<c024fa5c>] (seq_read) from [<c022e8f8>] (__vfs_read+0x20/0xd0)
[<c022e8f8>] (__vfs_read) from [<c022f0d0>] (vfs_read+0x7c/0x104)
[<c022f0d0>] (vfs_read) from [<c022f974>] (SyS_read+0x44/0x9c)
[<c022f974>] (SyS_read) from [<c0107580>] (ret_fast_syscall+0x0/0x3c)
Code: e1a04000 e3a00001 ebd66b39 f594f000 (e1943f9f)
---[ end trace 60c74014a63a9688 ]---
Kernel panic - not syncing: Fatal exception
shrink_control.nid is used but not initialzed, same for
shrink_control.memcg.
This reverts commit b0e7a582b2264cdf75874dcd8df915b6b4427755.
Reported-by: Xiaowen Liu <xiaowen.liu@freescale.com>
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
(cherry picked from https://android.googlesource.com/kernel/common android-4.4
commit ad95c12f66df9efae04b15d5c4d0d0ba56ab2620)
BUG=chromium:612862
TEST=build and boot on a few devices
'Kernel Crash Case-Studies' 카테고리의 다른 글
[KernelCrash] panic due to voltage droop in the specific device (0) | 2019.03.09 |
---|---|
[Liunx][Kernel] Abort at __list_del_entry() inside process_one_work() (0) | 2019.03.09 |
[KernelCrash] Abort at tty_wakeup() due to port_tty(null) (0) | 2019.03.09 |
[KernelCrash] Abort at 0xecb29f00(defective device) (0) | 2019.03.09 |
[KernelCrash] Abort at mmc_wait_data_done() due to race (0) | 2019.03.09 |