[Kernel][Crash][0320] Workqueue lockup (1)

#Kernel crash debugging and troubleshooting

A kernel crash occurred. Let's start by reading through the kernel log from just before the crash.

1 [73669.590105 / 12-01 13:26:44.379][0] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 115s!
2 [73669.592865 / 12-01 13:26:44.379][2] kernel BUG at home001/austindh.kim/src/kernel/workqueue.c:5381!
3 [73669.592905 / 12-01 13:26:44.379][2] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
4 [73669.592922 / 12-01 13:26:44.379][0] Modules linked in: texfat(PO) snd_soc_sdm845
5 [73669.593061 / 12-01 13:26:44.379][2] CPU: 2 PID: 1416 Comm: lowi-server Tainted: P W O 4.9.60+ #1
6 [73669.593080 / 12-01 13:26:44.379][2] Hardware name: Qualcomm Technologies, Inc. SDM845 V2 MTP (DT)
7 [73669.593094 / 12-01 13:26:44.379][2] task: ffffffc5e9db5580 task.stack: ffffffc5e9d60000
8 [73669.593132 / 12-01 13:26:44.379][2] PC is at wq_watchdog_timer_fn+0x188/0x18c
9 [73669.593145 / 12-01 13:26:44.379][2] LR is at wq_watchdog_timer_fn+0x188/0x18c

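The first log line carries the key information: work queued on the worker pool for CPU2 (node=0) has not been processed by any worker thread for 115 seconds.

A worker pool is represented by the struct worker_pool data structure. Each CPU has two of them: a normal-priority pool (nice=0) and a high-priority pool.

BUG: workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 115s!

Next, look at lines 2 and 8 of the log.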

2 [73669.592865 / 12-01 13:26:44.379][2] kernel BUG at home001/src/kernel/workqueue.c:5381!

8 [73669.593132 / 12-01 13:26:44.379][2] PC is at wq_watchdog_timer_fn+0x188/0x18c

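Now it is time to analyze the code that executed BUG(), because that is how we find out exactly why the kernel crashed.

So let's analyze the wq_watchdog_timer_fn function.

Before we do, note that the kernel log already gave us a very useful hint just before the crash:

a timeout occurred in the worker pool that belongs to CPU2. Keep this in mind while reviewing the function.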

1  static void wq_watchdog_timer_fn(unsigned long data)
2  {
3      unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
4      bool lockup_detected = false;
5      struct worker_pool *pool;
6      int pi;

8      if (!thresh)
9          return;

11     rcu_read_lock();

13     for_each_pool(pool, pi) {
14         unsigned long pool_ts, touched, ts;

16         if (list_empty(&pool->worklist))
17             continue;

19         /* get the latest of pool and touched timestamps */
20         pool_ts = READ_ONCE(pool->watchdog_ts);
21         touched = READ_ONCE(wq_watchdog_touched);

23         if (time_after(pool_ts, touched))
24             ts = pool_ts;
25         else
26             ts = touched;

28         if (pool->cpu >= 0) {
29             unsigned long cpu_touched =
30                 READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
31                           pool->cpu));
32             if (time_after(cpu_touched, ts))
33                 ts = cpu_touched;
34         }

36         /* did we stall? */
37         if (time_after(jiffies, ts + thresh)) {
38             lockup_detected = true;
39             pr_emerg("BUG: workqueue lockup - pool");
40             pr_cont_pool_info(pool);
41             pr_cont(" stuck for %us!\n",
42                 jiffies_to_msecs(jiffies - pool_ts) / 1000);
43         }
44     }

46     rcu_read_unlock();

48     if (lockup_detected) {
49         show_workqueue_state();
50         BUG();
51     }

Start with lines 13 and 14. for_each_pool is a for loop that visits every worker pool in the system.

13     for_each_pool(pool, pi) {
14         unsigned long pool_ts, touched, ts;

The kernel log says the problem is in the worker pool for CPU2, so that is the pool we need to inspect.

How do we locate the CPU2 worker pool in the core dump?

The for_each_pool macro shows that worker pools are looked up in an IDR variable named worker_pool_idr.

#define for_each_pool(pool, pi) \
    idr_for_each_entry(&worker_pool_idr, pool, pi) \
        if (({ assert_rcu_or_pool_mutex(); false; })) { } \
        else

Inspecting the worker_pool_idr global variable shows that the worker pool instance for CPU2 (the nice=0 pool) is at (*(worker_pool_idr.top)).ary[4].

(static struct idr) worker_pool_idr = (
  (struct idr_layer *) hint = 0xFFFFFFC6307B9308,
  (struct idr_layer *) top = 0xFFFFFFC6307B9308 -> (
    (int) prefix = 0x0,
    (int) layer = 0x0,
    (struct idr_layer * [256]) ary = (
      [0x0] = 0xFFFFFFC63A30B000, //<<--CPU0, nice=0 pool
      [0x1] = 0xFFFFFFC63A30B400, //<<--CPU0, highpri pool
      [0x2] = 0xFFFFFFC63A322000, //<<--CPU1, nice=0 pool
      [0x3] = 0xFFFFFFC63A322400, //<<--CPU1, highpri pool
      [0x4] = 0xFFFFFFC63A339000, //<<--CPU2, nice=0 pool
      [0x5] = 0xFFFFFFC63A339400, //<<--CPU2, highpri pool
      [0x6] = 0xFFFFFFC63A350000,
      [0x7] = 0xFFFFFFC63A350400,
      [0x8] = 0xFFFFFFC63A367000,
      [0x9] = 0xFFFFFFC63A367400,
      [0x0A] = 0xFFFFFFC63A37E000,
      [0x0B] = 0xFFFFFFC63A37E400,
      [0x0C] = 0xFFFFFFC63A395000,

Casting the address 0xFFFFFFC63A339000 to (struct worker_pool *) shows the following members.

1 (struct worker_pool *) (struct worker_pool*)0xFFFFFFC63A339000 = 0xFFFFFFC63A339000 -> (
2   (spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((u16) owner =
3   (int) cpu = 2 = 0x2,
4   (int) node = 0 = 0x0,
5   (int) id = 4 = 0x4,
6   (unsigned int) flags = 1 = 0x1,
7   (long unsigned int) watchdog_ts = 4302292746 = 0x00000001006FC70A, //<<--
8   (struct list_head) worklist = (
9     (struct list_head *) next = 0xFFFFFFC62B7DB6E8,
10    (struct list_head *) prev = 0xFFFFFF97D9308360),
11  (int) nr_workers = 3 = 0x3,
12  (int) nr_idle = 0 = 0x0,
13  (struct list_head) idle_list = ((struct list_head *) next = 0xFFFFFFC63A339048, (struct list_hea
14  (struct timer_list) idle_timer = ((struct hlist_node) entry = ((struct hlist_node *) next = 0xDE
15  (struct timer_list) mayday_timer = ((struct hlist_node) entry = ((struct hlist_node *) next = 0x
16  (struct hlist_head [64]) busy_hash = ([0] = ((struct hlist_node *) first = 0x0), [1] = ((struct
17  (struct worker *) manager = 0xFFFFFFC50C460C80,
18  (struct mutex) attach_mutex = ((atomic_t) count = ((int) counter = 1 = 0x1), (spinlock_t) wait_l
19  (struct list_head) workers = (
20    (struct list_head *) next = 0xFFFFFFC4DE83A7D0,
21    (struct list_head *) prev = 0xFFFFFFC4E26A30D0),
22  (struct completion *) detach_completion = 0x0,

Members 3, 4, and 6 above show that this worker pool belongs to cpu 2, its node is 0, and its flags value is 1.

3 (int) cpu = 2 = 0x2,
4 (int) node = 0 = 0x0,
6 (unsigned int) flags = 1 = 0x1,

This matches the log printed just before the crash.

BUG: workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 115s!

Next come lines 37 to 42, which detect the workqueue lockup.

This is the key code that explains why the kernel crashed.

37         if (time_after(jiffies, ts + thresh)) {
38             lockup_detected = true;
39             pr_emerg("BUG: workqueue lockup - pool");
40             pr_cont_pool_info(pool);
41             pr_cont(" stuck for %us!\n",
42                 jiffies_to_msecs(jiffies - pool_ts) / 1000);
43         }

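The time-handling code stands out here. To analyze it, we first need to look at the configuration this system runs with.

1. HZ

HZ is 100 because CONFIG_HZ is 100. This means jiffies is incremented 100 times per second.

2. jiffies

This system uses the ARM64 architecture. In the (preprocessed) linker script, jiffies is defined as an alias of jiffies_64, so both names refer to the same 64-bit counter.

Now let's read the kernel code together with the debugging information.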

37 if (time_after(jiffies, ts + thresh)) {

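Here the time_after macro appears. The name alone may look intimidating.

It looks like one of the Linux kernel's hard-to-read time-handling functions, but it is a very simple macro.

time_after(a, b) evaluates to true when a is later than b. The kernel implements the comparison as a signed difference so that it stays correct when the counter wraps around. Without wraparound, the condition on line 37 is equivalent to:

jiffies > (ts + thresh)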

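So we need the value of jiffies. The jiffies_64 variable in the dump is 4302304310.

jiffies_64 = 4302304310

This means (ts + thresh) was smaller than the jiffies value 4302304310. Only then does the body of the if statement run and print the following kernel log.

BUG: workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 115s!

Next we need the value of ts. The code below sets ts to the later of pool->watchdog_ts and wq_watchdog_touched (lines 28 to 34 additionally consider the per-CPU touched timestamp). The touched values are not shown in this dump, so this analysis takes ts to be pool->watchdog_ts.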

20         pool_ts = READ_ONCE(pool->watchdog_ts);
21         touched = READ_ONCE(wq_watchdog_touched);

23         if (time_after(pool_ts, touched))
24             ts = pool_ts;
25         else
26             ts = touched;

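watchdog_ts is a member of struct worker_pool.

Let's briefly look at what this variable means.

The current jiffies value is stored in struct worker_pool.watchdog_ts when a device driver queues a work item (struct work_struct) onto an empty worklist, and again when a worker thread runs and starts executing a work item. In other words, it records the last time the worker pool made progress.

The functions involved are __queue_work and worker_thread, so open the code when you have time.

In the core dump, this member is 4302292746.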

(struct worker_pool *) (struct worker_pool*)0xFFFFFFC63A339000 = 0xFFFFFFC63A339000 -> (
  (spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((u16) owner =
  (int) cpu = 2,
  (int) node = 0,
  (int) id = 4,
  (unsigned int) flags = 1,
  (long unsigned int) watchdog_ts = 4302292746,
  (struct list_head) worklist = ((struct list_head *) next = 0xFFFFFFC62B7DB6E8, (struct list_head

So far we have the values of jiffies and ts used on line 37 below, so next comes the thresh variable.

37         if (time_after(jiffies, ts + thresh)) {
38             lockup_detected = true;

thresh is a local variable, declared on line 3 as follows.

3      unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
4      bool lockup_detected = false;

In the core dump, wq_watchdog_thresh is 60.

wq_watchdog_thresh = 60

The following calculation shows that thresh is 6000.

6000 = 60 * 100 = wq_watchdog_thresh * HZ

What does wq_watchdog_thresh * HZ mean?

With HZ at 100, jiffies is incremented 100 times per second, and here that is multiplied by 60.

So thresh is the number of jiffies that elapse in one minute.

Pretty hard, isn't it? Analyzing a kernel crash properly requires a deep understanding of the Linux kernel as a whole.

Now let's get back to the kernel function we have been reviewing.

37         if (time_after(jiffies, ts + thresh)) {
38             lockup_detected = true;
39             pr_emerg("BUG: workqueue lockup - pool");
40             pr_cont_pool_info(pool);
41             pr_cont(" stuck for %us!\n",
42                 jiffies_to_msecs(jiffies - pool_ts) / 1000);
43         }

With the debugging information gathered so far, the condition jiffies > (ts + thresh) holds, which is why the body of the if statement ran.

4302304310 (jiffies) > 4302298746 (ts + thresh = 4302292746 + 6000)

(where: debugging information)

jiffies = 4302304310

ts = 4302292746 = struct worker_pool.watchdog_ts

thresh = 6000 = 60 * 100 = wq_watchdog_thresh * HZ

Now look at the next piece of code. It prints the error message "stuck for %us!" to the kernel log.

41             pr_cont(" stuck for %us!\n",
42                 jiffies_to_msecs(jiffies - pool_ts) / 1000);

This corresponds to the kernel log we have seen repeatedly: the pool was stuck for 115s.

workqueue lockup - pool cpus=2 node=0 flags=0x1 nice=0 stuck for 115s!

Let's see how the 115 seconds was calculated. It is the result of the jiffies_to_msecs call below, divided by 1000.

41             pr_cont(" stuck for %us!\n",
42                 jiffies_to_msecs(jiffies - pool_ts) / 1000);

jiffies_to_msecs converts a jiffies count to milliseconds. The values of jiffies and pool_ts are as follows.

jiffies = 4302304310

pool_ts = 4302292746

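What are these values in milliseconds? With HZ=100, one jiffy is 10 ms. The figures below match what jiffies_to_msecs, which returns a 32-bit unsigned int, yields for each value on its own (the full products are truncated to 32 bits).

jiffies (ms) = 73370140

pool_ts (ms) = 73254500

Their difference is 115640 ms, which is indeed 115 seconds. The kernel code subtracts first and converts afterwards, which gives the same result: 4302304310 - 4302292746 = 11564 jiffies = 115640 ms.

115640 = 73370140 - 73254500 = jiffies (ms) - pool_ts (ms)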

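Let's put all the debugging information together and summarize why this problem occurred.

The following is how a work item (struct work_struct) is processed.

1. A work item (&host->work, of type struct work_struct) is queued on a workqueue.

2. The call path is schedule_work -> queue_work -> queue_work_on -> __queue_work.

3. The worker pool for the CPU on which schedule_work() ran is looked up, and pool->watchdog_ts is updated with jiffies.

That is, the time at which __queue_work accessed the worker pool is recorded.

Then &host->work is added to pool->worklist.

4. A worker thread attached to the worker pool is woken up.

5. When the worker thread wakes up and runs, the worker_thread function updates pool->watchdog_ts with jiffies, the current time.

6. The queued work items (pool->worklist) are executed.

7. wq_watchdog_timer_fn detects the workqueue lockup and triggers the kernel crash.

In other words, work was queued through the schedule_work -> queue_work -> queue_work_on path,

and the watchdog checks whether that pending work has gone unprocessed for one minute. Here something went wrong in steps 4, 5, and 6, so the work was never executed properly.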

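Remember that thresh means one minute? It is the threshold used to check whether the worker pool has made progress within the last 60 seconds.

Take the sdmmc_request code as an example. It calls schedule_work to queue a work item. Note that &host->work, the argument passed to schedule_work(), is of type struct work_struct.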

static void sdmmc_request(struct mmc_host *mmc, struct mmc_request *mrq)
{
    struct realtek_pci_sdmmc *host = mmc_priv(mmc);
    struct mmc_data *data = mrq->data;

    mutex_lock(&host->host_mutex);
    host->mrq = mrq;
    mutex_unlock(&host->host_mutex);

    if (sd_rw_cmd(mrq->cmd) || sdio_extblock_cmd(mrq->cmd, data))
        host->using_cookie = sd_pre_dma_transfer(host, data, false);

    schedule_work(&host->work);
}

This concludes the first pass at why the workqueue lockup led to a kernel crash. Next time we need to dig further into what actually caused the lockup. Stay tuned for the next seminar session.

