Kernel Crash Case-Studies

[KernelCrash] Abort at rmqueue_bulk() due to page.lru->next corruption

A kernel panic occurs on a defective device with the following call trace.

crash> bt -I C01002D8 -S  E7AABC08 0xE1804200

PID: 2285   TASK: e1804200  CPU: 5   COMMAND: "python"

bt: WARNING:  stack address:0xe7aabd80, program counter:0xc0ee5b60

 #0 [<c01002d8>] (do_DataAbort) from [<c010ad58>]

    pc : [<c01d7308>]    lr : [<c01d72ec>]    psr: 60020193

    sp : e7aabcf8  ip : c193e69c  fp : edf34bf4

    r10: 00000000  r9 : 0000001f  r8 : 00000002

    r7 : c1938280  r6 : c1938200  r5 : 00000010  r4 : ef4bddb4

    r3 : ef4bddb4  r2 : 00000100  r1 : 00000000  r0 : ef4bdda0

    Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM

 #1 [<c010ad58>] (__dabt_svc) from [<c01d72ec>]

 #2 [<c01d7308>] (rmqueue_bulk.constprop.11) from [<c01d7540>]  //<<-- kernel panic

 #3 [<c01d7540>] (get_page_from_freelist) from [<c01d79c4>]

 #4 [<c01d79c4>] (__alloc_pages_nodemask) from [<c01f7bf4>]

 #5 [<c01f7bf4>] (handle_mm_fault) from [<c011525c>]

 #6 [<c011525c>] (do_page_fault) from [<c01002d8>]

 #7 [<c01002d8>] (do_DataAbort) from [<c010b03c>]


The data abort is raised because page.lru.next, loaded into R2, holds the invalid address 0x100. The faulting instruction stores to [R2 + 4], so it targets address 0x104.

0xc01d72f4 <rmqueue_bulk.constprop.11+0x58>:    cmp     r10, #0

0xc01d72f8 <rmqueue_bulk.constprop.11+0x5c>:    add     r3, r0, #20

0xc01d72fc <rmqueue_bulk.constprop.11+0x60>:    ldreq   r2, [r4]

0xc01d7300 <rmqueue_bulk.constprop.11+0x64>:    ldrne   r2, [r4, #4]

0xc01d7304 <rmqueue_bulk.constprop.11+0x68>:    strne   r3, [r4, #4]

0xc01d7308 <rmqueue_bulk.constprop.11+0x6c>:    streq   r3, [r2, #4]  //<<-- data abort
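To double-check this reading of the disassembly, the six instructions can be replayed by hand. The sketch below is a minimal Python model, not an ARM emulator. The register values come from the abort dump, the two memory words are the page.lru fields that crash reports for the page at 0xEF4BDDA0, and the function and variable names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def run_window(regs, mem):
    """Replay rmqueue_bulk+0x58 .. +0x6c; return (registers, address of the final store)."""
    r = dict(regs)
    m = dict(mem)
    z = (r["r10"] == 0)                        # cmp   r10, #0
    r["r3"] = (r["r0"] + 20) & MASK32          # add   r3, r0, #20
    if z:
        r["r2"] = m[r["r4"]]                   # ldreq r2, [r4]
    else:
        r["r2"] = m[(r["r4"] + 4) & MASK32]    # ldrne r2, [r4, #4]
        m[(r["r4"] + 4) & MASK32] = r["r3"]    # strne r3, [r4, #4]
    # streq r3, [r2, #4] only executes when Z is set
    store_addr = ((r["r2"] + 4) & MASK32) if z else None
    return r, store_addr

# Register values taken from the abort dump.
regs = {"r0": 0xEF4BDDA0, "r4": 0xEF4BDDB4, "r10": 0x0}

# page.lru of the page at 0xEF4BDDA0 sits at offset 20, i.e. at 0xEF4BDDB4.
mem = {0xEF4BDDB4: 0x100,   # lru.next
       0xEF4BDDB8: 0x200}   # lru.prev

after, store_addr = run_window(regs, mem)
print("r2 =", hex(after["r2"]), "r3 =", hex(after["r3"]))
print("faulting store address =", hex(store_addr))
```

With R10 == 0 the EQ-conditional instructions run, so R2 is loaded from [R4] and the final store goes to R2 + 4 = 0x104. The model also reproduces R3 = 0xef4bddb4, which matches R3 in the dump and is the same address as R4.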

crash> struct page.lru  0xEF4BDDA0  -px

    lru = {

      next = 0x100,  //<<--

      prev = 0x200

    }
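The pair 0x100 / 0x200 is worth a second look. In recent mainline kernels, include/linux/poison.h defines LIST_POISON1 and LIST_POISON2 as 0x100 and 0x200 plus POISON_POINTER_DELTA, and list_del() writes them into the entry it removes. Whether the kernel in this dump uses the same constants is an assumption, since the source tree is not shown here. Under that assumption, a small check like the following (helper name is hypothetical) tells poisoned entries from ordinary ones.

```python
# Assumed values, as in recent mainline include/linux/poison.h.
# Older kernels used different constants, so verify against the actual tree.
LIST_POISON1 = 0x100
LIST_POISON2 = 0x200

def looks_poisoned(next_ptr, prev_ptr, delta=0):
    """True if (next, prev) match the list poison pair, shifted by POISON_POINTER_DELTA."""
    return next_ptr == LIST_POISON1 + delta and prev_ptr == LIST_POISON2 + delta

# page.lru of the page at 0xEF4BDDA0, as printed by crash.
lru = {"next": 0x100, "prev": 0x200}
poisoned = looks_poisoned(lru["next"], lru["prev"])
print("page.lru looks like a list_del()'d entry:", poisoned)
```

If the assumption holds, the entry looks like one that was already removed from a list and then reached through a stale pointer. That observation alone does not say what corrupted the list.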


Code review shows that the page belongs to the pcp (per-CPU page frame cache), which the buddy system uses for order-0 pages. rmqueue_bulk() links pages onto the pcp list as follows.

static int rmqueue_bulk(struct zone *zone, unsigned int order,
                        unsigned long count, struct list_head *list,
                        int migratetype, bool cold)
{
        int i;

        spin_lock(&zone->lock);
        for (i = 0; i < count; ++i) {
                struct page *page;

                //snip

                if (likely(!cold))
                        list_add(&page->lru, list);  //<<--
                else
                        list_add_tail(&page->lru, list);

                //snip


To find the pcp address for CPU5, the following commands are used.

crash> p contig_page_data.node_zones[1].pageset

$5 = (struct per_cpu_pageset *) 0xc177ebdc


crash> struct per_cpu_pages  EDF34BDC

struct per_cpu_pages {

  count = 0x1,

  high = 0xba,

  batch = 0x1f,

  lists = {{

      next = 0xef51fc74,  //<<--MIGRATE_UNMOVABLE

      prev = 0xef51fc74

    }, {

      next = 0xedf34bf0, //<<--MIGRATE_RECLAIMABLE

      prev = 0xedf34bf0

    }, {

      next = 0xef4bdcd4,//<<--MIGRATE_MOVABLE

      prev = 0xef4bddf4

    }, {

      next = 0xedf34c00, //<<--MIGRATE_PCPTYPES

      prev = 0xedf34c00

    }}

}


(where) 0xEDF34BDC = 0xc177ebdc + 0x2c7b6000, i.e. the pageset pointer plus the per-CPU offset of CPU5:

crash> p  __per_cpu_offset[5]

$7 = 0x2c7b6000
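The address arithmetic can be verified with a few lines of Python. This is only a sketch: the helper names are made up, and the field layout of struct per_cpu_pages (three 4-byte integers followed by 8-byte list heads) is an assumption for a 32-bit ARM build, not something read from the kernel source of this dump.

```python
def per_cpu_ptr(base, cpu_offset, bits=32):
    """Add a per-CPU offset to a per-CPU base pointer, wrapping at the pointer width."""
    return (base + cpu_offset) & ((1 << bits) - 1)

def pcp_list_head(pcp_addr, index):
    """Address of lists[index], assuming count/high/batch are 4 bytes each
    and struct list_head is 8 bytes (two 32-bit pointers)."""
    return pcp_addr + 3 * 4 + index * 8

pageset = 0xC177EBDC        # contig_page_data.node_zones[1].pageset
cpu5_offset = 0x2C7B6000    # __per_cpu_offset[5]

pcp = per_cpu_ptr(pageset, cpu5_offset)
print("per_cpu_pages for CPU5:", hex(pcp))
for i in range(4):
    print("lists[%d] at %s" % (i, hex(pcp_list_head(pcp, i))))
```

The result is 0xedf34bdc, the address passed to crash. Under the assumed layout, lists[1] and lists[3] land at 0xedf34bf0 and 0xedf34c00, which are exactly the next/prev values printed for those two lists. An empty list head points to itself, so this is a cheap sanity check that the computed address is right.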


The list that starts at 0xef4bdcd4 (the MIGRATE_MOVABLE list above) turns out to be corrupted: walking it ends at the invalid address 0x100.

crash> list 0x0 0xef4bdcd4

ef4bdcd4

ef4bdcf4

ef4bdd14

ef4bdd34

ef4bdd54

ef4bdd74

ef4bddb4

100

(where) the last valid entry, ef4bddb4, is the address held in R4 and R3 at the time of the abort:

 #0 [<c01002d8>] (do_DataAbort) from [<c010ad58>]

    pc : [<c01d7308>]    lr : [<c01d72ec>]    psr: 60020193

    sp : e7aabcf8  ip : c193e69c  fp : edf34bf4

    r10: 00000000  r9 : 0000001f  r8 : 00000002

    r7 : c1938280  r6 : c1938200  r5 : 00000010  r4 : ef4bddb4

    r3 : ef4bddb4  r2 : 00000100  r1 : 00000000  r0 : ef4bdda0

    Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM

 #1 [<c010ad58>] (__dabt_svc) from [<c01d72ec>]

 #2 [<c01d7308>] (rmqueue_bulk.constprop.11) from [<c01d7540>]  //<<-- kernel panic

 #3 [<c01d7540>] (get_page_from_freelist) from [<c01d79c4>]
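The list walk can also be checked mechanically. The sketch below takes the node addresses exactly as crash printed them, stops at the first address that cannot be a kernel pointer, and reports the distance between consecutive nodes. The 4 KiB page size and the helper names are assumptions for illustration. The lru offset of 20 comes from the `add r3, r0, #20` instruction.

```python
# Node addresses as printed by `crash> list 0x0 0xef4bdcd4`.
nodes = [0xEF4BDCD4, 0xEF4BDCF4, 0xEF4BDD14, 0xEF4BDD34,
         0xEF4BDD54, 0xEF4BDD74, 0xEF4BDDB4, 0x100]

LRU_OFFSET = 20      # offset of lru inside struct page, from the disassembly
PAGE_SIZE = 0x1000   # assumption: 4 KiB pages

def is_plausible(addr):
    """Addresses inside the first page are never valid kernel list nodes."""
    return addr >= PAGE_SIZE

def analyze(walk):
    """Split a list walk into valid nodes, their strides, and the first bad address."""
    valid, first_bad = [], None
    for addr in walk:
        if not is_plausible(addr):
            first_bad = addr
            break
        valid.append(addr)
    strides = [b - a for a, b in zip(valid, valid[1:])]
    return valid, strides, first_bad

valid, strides, first_bad = analyze(nodes)
last_page = valid[-1] - LRU_OFFSET

print("strides:", [hex(s) for s in strides])
print("first bad address:", hex(first_bad))
print("struct page of last valid node:", hex(last_page))
```

On this data the walk advances by 0x20 for five steps (presumably sizeof(struct page) in this build), then by 0x40 from ef4bdd74 to ef4bddb4, and then reaches 0x100. The last valid node, ef4bddb4, maps back to the page at 0xef4bdda0, which is the page whose lru was inspected earlier.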


After the device was reassembled with a different PMIC, the crash no longer occurred.