본문 바로가기

Core BSP 분석/커널 트러블슈팅

[Linux][Kernel] data abort @tty_wakeup - 리눅스 커널

#커널 크래시 디버깅 및 TroubleShooting
 
USB와 TTY Driver 관련한 커널 패닉 디버깅 정보를 좀 공유하고자 해요.
 
아래 Signature로 커널 패닉이 발생했는데요. __irq_usr -> gic_handle_irq 코드 흐름으로 보아 유저 공간에서 실행 중인 프로세스로 IRQ가 Trigger되었다는 정보를 확인할 수 있어요.
바꾸어 말하면 아래 콜스택은 IRQ Context라고 할 수 있죠. 문제가 발생한 보드는 Mediatek MT6755랍니다.
[<c001df40>] do_page_fault+0x338/0x3f8 
[<c0008544>] do_DataAbort+0x38/0x98 
[<c0015058>] __dabt_svc+0x38/0x60 
[<c031a5a0>] tty_wakeup+0xc/0x64 
[<c06f82ec>] gs_start_io+0x94/0xf4 
[<c06f8698>] gserial_connect+0xe0/0x180
[<c06f7578>] acm_set_alt+0x88/0x1a8 
[<c06f3c5c>] composite_setup+0xd34/0x1520 
[<c070d154>] android_setup+0x1f4/0x1fc 
[<c03fc6e8>] forward_to_driver+0x64/0x100 
[<c03fd418>] musb_g_ep0_irq+0x7d8/0x1c18 
[<c03fb094>] musb_interrupt+0x94/0xc78 
[<c04024f0>] generic_interrupt+0xc34/0x1218 
[<c009b020>] handle_irq_event_percpu+0xe0/0x2e4
[<c009b268>] handle_irq_event+0x44/0x64
[<c009df54>] handle_fasteoi_irq+0xe8/0x1a4 
[<c009a7dc>] __handle_domain_irq+0x104/0x264 
[<c0008668>] gic_handle_irq+0x2c/0x64 
[<c00153a8>] __irq_usr+0x48/0x60
 
비슷한 시기에 nVidia Tegra4i SoC에서 아래 커널 로그와 함께 크래시가 발생했어요.
Tegra4i에서 정의한 OTG IRQ를 처리하는 함수만 다르지 처리되는 함수 흐름은 똑같네요.
[71653.259161 08-15 11:27:39.779] android_work: did not send uevent (0 0   (null))
[71653.290526 08-15 11:27:39.811] android_work: sent uevent USB_STATE=CONNECTED
[71653.292844 08-15 11:27:39.813] android_work: sent uevent USB_STATE=DISCONNECTED
[71653.365187 08-15 11:27:39.885] android_usb gadget: high speed config #1: android
[71653.365758 08-15 11:27:39.885] Unable to handle kernel NULL pointer dereference at virtual address 000000a8
[71653.365775 08-15 11:27:39.885] pgd = c0004000
[71653.365785 08-15 11:27:39.885] [000000a8] *pgd=00000000
[71653.365805 08-15 11:27:39.885] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[71653.365841 08-15 11:27:39.885] Modules linked in: bcmdhd(O)
[71653.365864 08-15 11:27:39.885] CPU: 0    Tainted: G           O  (3.4.65-g000d394 #1)
[71653.365891 08-15 11:27:39.885] PC is at tty_wakeup+0x14/0x80
[71653.365910 08-15 11:27:39.885] LR is at gs_start_io+0x90/0xe8
[71653.365922 08-15 11:27:39.885] pc : [<c03b2fe8>]    lr : [<c04a2520>]    psr: 20030193
[71653.365928 08-15 11:27:39.885] sp : d2b2db10  ip : d2b2db28  fp : d2b2db24
[71653.365944 08-15 11:27:39.885] r10: e29ba318  r9 : e29ba324  r8 : 00000000
[71653.365955 08-15 11:27:39.885] r7 : e29ba2f4  r6 : e16290a8  r5 : e29ba2e8  r4 : e29ba2c0
[71653.365967 08-15 11:27:39.885] r3 : 0000000a  r2 : 00000000  r1 : 60030193  r0 : 00000000
[71653.365984 08-15 11:27:39.885] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[71653.365996 08-15 11:27:39.885] Control: 10c5387d  Table: 8567c04a  DAC: 00000015
// ... 생략...
[71653.369827 08-15 11:27:39.885] [<c03b2fe8>] (tty_wakeup+0x14/0x80) from [<c04a2520>] (gs_start_io+0x90/0xe8)
[71653.369848 08-15 11:27:39.885] [<c04a2520>] (gs_start_io+0x90/0xe8) from [<c04a2f24>] (gserial_connect+0xdc/0x164)
[71653.369865 08-15 11:27:39.885] [<c04a2f24>] (gserial_connect+0xdc/0x164) from [<c04a3440>] (acm_set_alt+0x90/0x1ac)
[71653.369883 08-15 11:27:39.885] [<c04a3440>] (acm_set_alt+0x90/0x1ac) from [<c04995b4>] (composite_setup+0x244/0xf80)
[71653.369899 08-15 11:27:39.885] [<c04995b4>] (composite_setup+0x244/0xf80) from [<c049a3b0>] (android_setup+0xc0/0x154)
[71653.369921 08-15 11:27:39.885] [<c049a3b0>] (android_setup+0xc0/0x154) from [<c0487d44>] (tegra_udc_irq+0xbd4/0x11d0)
[71653.369944 08-15 11:27:39.885] [<c0487d44>] (tegra_udc_irq+0xbd4/0x11d0) from [<c00e64a8>] (handle_irq_event_percpu+0x88/0x2ec)
[71653.369962 08-15 11:27:39.885] [<c00e64a8>] (handle_irq_event_percpu+0x88/0x2ec) from [<c00e6758>] (handle_irq_event+0x4c/0x6c)
[71653.369980 08-15 11:27:39.885] [<c00e6758>] (handle_irq_event+0x4c/0x6c) from [<c00e96ec>] (handle_fasteoi_irq+0xcc/0x174)
[71653.369998 08-15 11:27:39.885] [<c00e96ec>] (handle_fasteoi_irq+0xcc/0x174) from [<c00e5c28>] (generic_handle_irq+0x3c/0x50)
[71653.370022 08-15 11:27:39.885] [<c00e5c28>] (generic_handle_irq+0x3c/0x50) from [<c000f9c4>] (handle_IRQ+0x5c/0xbc)
[71653.370040 08-15 11:27:39.885] [<c000f9c4>] (handle_IRQ+0x5c/0xbc) from [<c0008504>] (gic_handle_irq+0x34/0x68)
[71653.370057 08-15 11:27:39.885] [<c0008504>] (gic_handle_irq+0x34/0x68) from [<c000ec00>] (__irq_svc+0x40/0x70)
[71653.370070 08-15 11:27:39.885] Exception stack(0xd2b2dd80 to 0xd2b2ddc8)
 
이번에는 모바일계의 맞형 퀄컴님의 MSM8917 보드에서 똑같이 크래시가 발생했어요.
커널 로그는 아래와 같은데 똑같이 tty_wakeup() 함수에서 NULL pointer dereference로 크래시가 발생했네요.
[   51.610867 / 01-01 11:22:44.521][1] cfg80211: Invalid regulatory domain detected:
[   52.471153 / 01-01 11:22:45.381][0] Unable to handle kernel NULL pointer dereference at virtual address 000001cc
[   52.471172 / 01-01 11:22:45.381][0] pgd = d04f4000
[   52.471183 / 01-01 11:22:45.381][0] [000001cc] *pgd=00000000
[   52.471198 / 01-01 11:22:45.381][0] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   52.471209 / 01-01 11:22:45.381][0] Modules linked in: texfat(PO)
[   52.471228 / 01-01 11:22:45.381][0] CPU: 0 PID: 1026 Comm: rild Tainted: P        W  O   3.18.31-gadfe554 #1
[   52.471240 / 01-01 11:22:45.381][0] task: e8ffb3c0 ti: d076e000 task.ti: d076e000
[   52.471260 / 01-01 11:22:45.381][0] PC is at tty_wakeup+0x8/0x58
[   52.471275 / 01-01 11:22:45.381][0] LR is at gs_start_io+0xb8/0xfc
[   52.471288 / 01-01 11:22:45.381][0] pc : [<c04f534c>]    lr : [<c06f364c>]    psr: 20000193
[   52.471288 / 01-01 11:22:45.381][0] sp : d076fd70  ip : 0000000b  fp : ec6ebc80
[   52.471305 / 01-01 11:22:45.381][0] r10: e97b5d84  r9 : e97b5d90  r8 : 00000000
[   52.471316 / 01-01 11:22:45.381][0] r7 : c42a63b0  r6 : e97b5d58  r5 : e97b5d64  r4 : 00000000
[   52.471328 / 01-01 11:22:45.381][0] r3 : e885e400  r2 : 00000000  r1 : 00320031  r0 : 00000000
//..
[   52.471712 / 01-01 11:22:45.381][0] [<c04f534c>] (tty_wakeup) from [<c06f364c>] (gs_start_io+0xb8/0xfc)
[   52.471729 / 01-01 11:22:45.381][0] [<c06f364c>] (gs_start_io) from [<c06f37d4>] (gserial_connect+0x144/0x1b0)
[   52.471745 / 01-01 11:22:45.381][0] [<c06f37d4>] (gserial_connect) from [<c06f22ec>] (acm_set_alt+0x1e8/0x248)
[   52.471760 / 01-01 11:22:45.381][0] [<c06f22ec>] (acm_set_alt) from [<c06efa1c>] (composite_setup+0x1448/0x159c)
[   52.471778 / 01-01 11:22:45.381][0] [<c06efa1c>] (composite_setup) from [<c0725390>] (android_setup+0x124/0x1ac)
[   52.471796 / 01-01 11:22:45.381][0] [<c0725390>] (android_setup) from [<c074967c>] (msm_udc_irq+0xd78/0xfb0)
[   52.471813 / 01-01 11:22:45.381][0] [<c074967c>] (msm_udc_irq) from [<c016f7e4>] (handle_irq_event_percpu+0xb4/0x2ac)
[   52.471829 / 01-01 11:22:45.381][0] [<c016f7e4>] (handle_irq_event_percpu) from [<c016fa18>] (handle_irq_event+0x3c/0x5c)
[   52.471845 / 01-01 11:22:45.381][0] [<c016fa18>] (handle_irq_event) from [<c01727b8>] (handle_fasteoi_irq+0xb0/0x128)
[   52.471863 / 01-01 11:22:45.381][0] [<c01727b8>] (handle_fasteoi_irq) from [<c016ef6c>] (generic_handle_irq+0x20/0x30)
[   52.471880 / 01-01 11:22:45.381][0] [<c016ef6c>] (generic_handle_irq) from [<c016f230>] (__handle_domain_irq+0x7c/0xa0)
[   52.471897 / 01-01 11:22:45.381][0] [<c016f230>] (__handle_domain_irq) from [<c01005c4>] (gic_handle_irq+0x48/0x78)
[   52.471916 / 01-01 11:22:45.381][0] [<c01005c4>] (gic_handle_irq) from [<c0f3a004>] (__irq_usr+0x44/0x60)
 
3개의 다른 SoC에서 다른 콜스택으로 호출되어 결국 똑같은 함수에서 Data Abort가 발생한 게 신기하죠?
 
이제 덤프를 분석해야 할 시간이에요.
 
커널 패닉이 일어난 코드를 열여서 확인해보니, 
tty_wakeup() 함수로 전달되는 파라미터 struct tty_struct *tty 가 NULL이라 데이터 Abort가 발생한 것처럼 보이는데요.
void tty_wakeup(struct tty_struct *tty)
{
    struct tty_ldisc *ld;
    if (test_bit(TTY_DO_WRITE_WAKEUP, &tty->flags)) { // <<-- kernel panic
        ld = tty_ldisc_ref(tty);
 
struct tty_struct *tty 파라미터를 어떤 함수가 호출하는지 점검할 필요가 있어 보여요.
그럼 tty_wakeup() 함수를 어느 함수가 호출했는지 찾아가 보면, gs_start_io() 함수에서 port->port.tty 파라미터를 전달하고 있군요. port->port.tty 변수가 NULL인 것이 문제죠.
(struct gs_port *)0xE97B5C00 = 0xE97B5C00 -> (
    port = (
      buf = (head = 0xE97B5C40, work = (data = (counter = 0xFFFFFFE0), entry = (
      tty = 0x0,  // <<--
      itty = 0x0,
      ops = 0x0,
      lock = (rlock = (raw_lock = (slock = 0x0, tickets = (owner = 0x0, next = 0
      blocked_open = 0x0,
 
해당 코드는 아래와 같아요.      
static int gs_start_io(struct gs_port *port)
{
// .. skip ..
        /* unblock any pending writes into our circular buffer */
         if (started) {
                 tty_wakeup(port->port.tty);  //<<--
         } else {
                 gs_free_requests(ep, head, &port->read_allocated);
                 gs_free_requests(port->port_usb->in, &port->write_pool,
                         &port->write_allocated);
                 status = -EIO;
         }
 
그런데 커널 패닉이 일어나기 전 커널 로그를 아주 자세히 관찰해보면, 아래 커널 로그를 확인할 수가 있어요.
gs_close() 함수가 호출되었다는 정보를 확인할 수 있는데요.
[ 6285.082083 / 06-01 11:43:40.318] -(4)[981:mdl_sock_host]gs_close: ttyGS1 (d3681900,d7e1ca80) ...
 
커널 로그에 위 로그에 매핑되는 코드를 찾아가 보면 gs_close() 함수를 확인할 수 있어요.
이 함수에서 port->port.tty 변수를 NULL로 설정하는군요.
static void gs_close(struct tty_struct *tty, struct file *file)
{
struct gs_port *port = tty->driver_data;
struct gserial *gser;
// ..skip..
if (gser == NULL)
gs_buf_free(&port->port_write_buf);
else
gs_buf_clear(&port->port_write_buf);
 
tty->driver_data = NULL;
port->port.tty = NULL; //<<--
 
정리하면, USB단에서 USB OTG IRQ가 떠서 RX 패킷을 받아서 tty layer로 넘겨주려고 tty port를 wakeup 하는 동작에서 tty port가 close 되었군요.
아래 조건으로 race condition이 발생했음을 알 수 있어요.
CPU0                                         CPU1
-gs_close
  port->port.tty = NULL 설정
                                             IRQ Trigger
                                             gs_start_io+0x94/0xf4
                                               tty_wakeup+0xc/0x64
                                                  커널 패닉
 
그럼 아래 패치를 적용해서 평소 어떤 흐름으로, gs_close() 함수가 호출되는지 확인해볼까요?
diff --git a/drivers/usb/gadget/function/u_serial.c b/drivers/usb/gadget/function/u_serial.c
index d0b519d..d5cc9f1 100644
--- a/drivers/usb/gadget/function/u_serial.c
+++ b/drivers/usb/gadget/function/u_serial.c
@@ -789,6 +789,10 @@ static int gs_start_io(struct gs_port *port)
 
        if (!port->port_usb)
                return -EIO;
+
+       show_stack(NULL, NULL);
+       printk("[+][flash][%s][%s,%d]: caller: (%pS)\n", current->comm, __func__,__LINE__,  (void *)__builtin_return_address(0));
+
        /* unblock any pending writes into our circular buffer */
        if (started) {
@@ -982,6 +986,9 @@ static void gs_close(struct tty_struct *tty, struct file *file)
        tty->driver_data = NULL;
        port->port.tty = NULL;
 
+       show_stack(NULL, NULL);
+       printk("[+][flash][%s][%s,%d]: caller: (%pS)\n", current->comm, __func__,__LINE__,  (void *)__builtin_return_address(0));
+
        port->openclose = false;
 
        pr_debug("gs_close: ttyGS%d (%pK,%pK) done!\n",
 
아래와 같은 커널 로그를 확인할 수 있어요. 
tty driver를 등록한 프로세스가 종료될 때 tty_release -> gs_close 순서로 함수가 호출되고 있네요.
<6>[  172.692835 / 01-06 19:00:22.975][5] [<c0016f78>] (unwind_backtrace) from [<c0013720>] (show_stack+0x20/0x24)
<6>[  172.692877 / 01-06 19:00:22.975][5] [<c0013720>] (show_stack) from [<c067a3b8>] (gs_close+0x1c8/0x258)
<6>[  172.692910 / 01-06 19:00:22.975][5] [<c067a3b8>] (gs_close) from [<c045d16c>] (tty_release+0x148/0x4cc)
<6>[  172.692941 / 01-06 19:00:22.975][5] [<c045d16c>] (tty_release) from [<c015b708>] (__fput+0x104/0x1e4)
<6>[  172.692969 / 01-06 19:00:22.975][5] [<c015b708>] (__fput) from [<c015b85c>] (____fput+0x18/0x1c)
<6>[  172.693004 / 01-06 19:00:22.975][5] [<c015b85c>] (____fput) from [<c0049240>] (task_work_run+0xc4/0xf0)
<6>[  172.693038 / 01-06 19:00:22.975][5] [<c0049240>] (task_work_run) from [<c0012f40>] (do_work_pending+0xb8/0xd0)
<6>[  172.693069 / 01-06 19:00:22.975][5] [<c0012f40>] (do_work_pending) from [<c000f320>] (work_pending+0xc/0x20)
<6>[  172.693092 / 01-06 19:00:22.975][5] [+][flash][atd][gs_close,990]: caller: (tty_release+0x148/0x4cc)
 
그럼 어떻게 예외 처리 코드를 추가하면 좋을까요?
아래와 같은 패치 코드를 작성하여 예외처리 루틴을 추가했더니 이 현상은 사라졌어요. port->port.tty이 null 포인터가 아닐 때만 tty_wakeup(port->port.tty); 함수를 호출하는거죠.
diff --git a/drivers/usb/gadget/function/u_serial.c b/drivers/usb/gadget/function/u_serial.c
index 2610e22..4c28a25 100644
--- a/drivers/usb/gadget/function/u_serial.c
+++ b/drivers/usb/gadget/function/u_serial.c
@@ -790,7 +790,11 @@ static int gs_start_io(struct gs_port *port)
        if (!port->port_usb)
                return -EIO;
        /* unblock any pending writes into our circular buffer */
-        if (started) {
+       if (started && port->port.tty) {
                tty_wakeup(port->port.tty);
        } else {
                gs_free_requests(ep, head, &port->read_allocated);
 
# Reference: For more information on 'Linux Kernel';
 
디버깅을 통해 배우는 리눅스 커널의 구조와 원리. 1
 
디버깅을 통해 배우는 리눅스 커널의 구조와 원리. 2