Linux Kernel panic issue: How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem

by blackMOREOps 12 years ago

17 comments

This guide shows how to fix hung_task_timeout_secs and blocked for more than 120 seconds problem in Linux. A panic may occur as a result of a hardware failure or a software bug in the operating system. In many cases, the operating system is capable of continued operation after an error has occurred. However, the system is in an unstable state and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and facilitate diagnosis of the error and, in usual cases, restart. After recompiling a kernel binary image from source code, a kernel panic during booting the resulting kernel is a common problem if the kernel was not correctly configured, compiled or installed. Add-on hardware or malfunctioning RAM could also be sources of fatal kernel errors during start up, due to incompatibility with the OS or a missing device driver. A kernel may also go into panic() if it is unable to locate a root file system. During the final stages of kernel userspace initialization, a panic is typically triggered if the spawning of init fails, as the system would then be unusable.

Fix hung_task_timeout_secs and blocked for more than 120 seconds problem

Table of Contents

Background

My server became unresponsive today (around 15:38hrs)

I’ve collected following logs that shows Memory and CPU usage and narrowed down /var/log/messages.

After doing a hard reboot, it came back online but I was unable to access it via VNC or SSH.

VNC connection showed an error (many errors but all contained /proc/sys/kernel/hung_task_timeout_secs“)

INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
 Not tainted 2.6.32-431.11.2.el6.x86_64 #1
 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Step by step troubleshooting data and logs

Check Memory usage

Following log shows server memory usage

someuser@servercore [/var/log]# sar -r

15:00:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
15:20:01 476604 1396772 74.56 110140 707116 1201652 30.64
15:30:02 526240 1347136 71.91 110412 710536 1165148 29.71

15:55:53 LINUX RESTART

16:00:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
16:10:01 517168 1356208 72.39 136040 588964 1196724 30.52
16:20:01 510580 1362796 72.75 137488 596560 1191664 30.39

As you can see, it’s not that high and I had plenty of free Memory.

Check CPU usage

Following log shows CPU usage.

someuser@servercore [/var/log]# sar -u
15:00:01 CPU %user %nice %system %iowait %steal %idle
15:20:01 all 6.01 0.04 1.74 1.59 0.14 90.48
15:30:02 all 4.97 0.04 1.54 7.87 0.15 85.44
Average: all 7.20 0.06 2.19 2.69 0.26 87.60

15:55:53 LINUX RESTART

16:00:01 CPU %user %nice %system %iowait %steal %idle
16:10:01 all 9.13 0.04 2.78 6.98 0.31 80.76
16:20:01 all 4.21 0.04 1.39 3.49 0.15 90.73

Again, CPU wasn’t at 100%. This is now getting annoying that I can’t explain why I am getting into s**tstorm for nothing.

Let’s check //var/log/messages to find all the error logs related this this kernel panic

Check Kernel Panic Logs

Now I am getting somewhere …

someuser@servercore [/var/log]# grep 'Aug 22 15' messages | grep -v Firewall | grep -v blackmore | grep -v operational | grep -v ec2
Aug 22 15:38:05 servercore kernel: INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
Aug 22 15:38:05 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:05 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:05 servercore kernel: jbd2/vda3-8 D 0000000000000000 0 250 2 0x00000000
Aug 22 15:38:06 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:06 servercore kernel: Call Trace:
Aug 22 15:38:06 servercore kernel: INFO: task rs:main Q:Reg:1035 blocked for more than 120 seconds.
Aug 22 15:38:06 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:06 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:06 servercore kernel: rs:main Q:Reg D 0000000000000000 0 1035 1 0x00000080
Aug 22 15:38:06 servercore kernel: Call Trace:
Aug 22 15:38:06 servercore kernel: INFO: task queueprocd - qu:1793 blocked for more than 120 seconds.
Aug 22 15:38:06 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:06 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:06 servercore kernel: queueprocd - D 0000000000000000 0 1793 1 0x00000080
Aug 22 15:38:06 servercore kernel: Call Trace:
Aug 22 15:38:06 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:06 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:06 servercore kernel: Call Trace:
Aug 22 15:38:06 servercore kernel: INFO: task httpd:30439 blocked for more than 120 seconds.
Aug 22 15:38:06 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:07 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:07 servercore kernel: httpd D 0000000000000000 0 30439 2223 0x00000080
Aug 22 15:38:07 servercore kernel: Call Trace:
Aug 22 15:38:11 servercore kernel: INFO: task httpd:30482 blocked for more than 120 seconds.
Aug 22 15:38:11 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:38:11 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:38:11 servercore kernel: httpd D 0000000000000000 0 30482 2223 0x00000080
Aug 22 15:38:11 servercore kernel: Call Trace:
Aug 22 15:39:54 servercore kernel: INFO: task jbd2/vda3-8:250 blocked for more than 120 seconds.
Aug 22 15:39:54 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:39:54 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:39:54 servercore kernel: jbd2/vda3-8 D 0000000000000000 0 250 2 0x00000000
Aug 22 15:39:54 servercore kernel: Call Trace:
Aug 22 15:39:54 servercore kernel: INFO: task flush-253:0:263 blocked for more than 120 seconds.
Aug 22 15:39:54 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:39:54 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:39:54 servercore kernel: flush-253:0 D 0000000000000000 0 263 2 0x00000000
Aug 22 15:39:54 servercore kernel: Call Trace:
Aug 22 15:39:56 servercore kernel: INFO: task rs:main Q:Reg:1035 blocked for more than 120 seconds.
Aug 22 15:39:56 servercore kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Aug 22 15:39:56 servercore kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 15:39:56 servercore kernel: rs:main Q:Reg D 0000000000000000 0 1035 1 0x00000080
Aug 22 15:39:56 servercore kernel: Call Trace:
Aug 22 15:42:11 servercore kernel: Clocksource tsc unstable (delta = -8589964877 ns)

15:55:53 LINUX RESTART

As you can see all the errors contained “echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.” and “blocked for more than 120 seconds” somewhere.

Now let’s fix this problem once and for all..

Solution for hung_task_timeout_secs

Explanation

By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing 120 seconds. As IO subsystem responds slowly and more requests are served, System Memory gets filled up resulting in the above error, thus serving HTTP requests.

Testing

I tested this theory with the following:
Change vm.dirty_ratio and vm.dirty_backgroud_ratio

someuser@servercore [/home/someuser]$ sudo sysctl -w vm.dirty_ratio=10
someuser@servercore [/home/someuser]$ sudo sysctl -w vm.dirty_background_ratio=5

Commit Change

someuser@servercore [/home/someuser]# sudo sysctl -p

Make it permanent

When the server seemed more stable and no Kernel/Swap/Memory Panic for a week, I edited /etc/sysctl.conf file to make these permanent after reboot.

someuser@servercore [/home/someuser]$ sudo vi /etc/sysctl.conf

ADD 2 lines at the bottom

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Save and exit.

someuser@servercore [/home/someuser]$ sudo reboot

That’s it. I never had this issue .. ever again..

Hope someone find this info useful.

Reference

Ronny Egners Blog – INFO: task blocked for more than 120 seconds.

How to check loaded and compiled modules in Apache in Linux?

17 comments

hdaz October 8, 2014 - 8:16 am

This can be shortened a little be interesting to see the time of this v the multiple grep’s
grep ‘Aug 22 15’ messages | egrep -vi “Firewall | blackmore | operational | ec2”

matthew June 9, 2015 - 9:38 pm

The sysctl -p command you use in ‘testing’ to commit the changes actually causes the values in the sysctl.conf file to be read back… so it’s not committing the changes, it’s removing them!
It would be correct to use sysctl -p after modifying the sysctl.conf file to make the changes permanent, there is no need to reboot the system!

bancaldo October 2, 2015 - 2:28 am

the /etc/sysctl.conf trick works for me. After a week no more problems during shutdown.
Thank You.

Evgenii Lepikhin December 22, 2015 - 4:59 am

For flushing out this data to disk this there is a time limit of 120 seconds by default.

This is incorrect interpretation of the error message. “… blocked for more than 120 seconds” means process didn’t leave uninterruptible sleep after this mark. This state can be caused by waiting for disk IO, by vfork() and many other cases.

matt July 2, 2016 - 2:11 am

Thank you! This solved my issues!

Areeb Yasir July 5, 2016 - 6:14 am

Very useful information, I have found there are usually two reasons for the blocked task issue
#1) Disk failure or faults
#2) Your issue here and I have run into it myself a few times (it can be common on a LiveCD if you are doing lots of IO and makes it seems like a network or disk problem when it is really not). I will try editing the kernel parameters manually next time I run into this issue.

Thanks

lelivreurdecroquettes.com October 8, 2016 - 6:48 pm

Ⅰ love reading thrօugh a post that will mаke men and women think.
Aⅼsо, tһank you foг allowing me to сomment!

Vanderson January 14, 2017 - 12:16 am

Thank you !!!

Problem solved !!

Arnold April 26, 2017 - 2:22 pm

Hi, I edited the /etc/sysctl.conf. Then reboot. It says. Can’t connect to default. Skipping. Shutting down managed connections……………. How long will it take?

changtonn May 15, 2017 - 5:09 pm

Thank you very much, it’s useful

Haidee May 25, 2017 - 5:56 pm

Thanks a Lot . Problem solved 🙂

lundgren@byu.net August 15, 2017 - 3:38 am

Thank you!!

Thomas de Graaf August 18, 2017 - 8:49 pm

Works great. Thank you!

خرید سرور مجازی December 19, 2017 - 1:21 pm

it helped figuring out our problem
but that was related to CGI process and we managed to solve the main issue
Thanks

John Holcomb May 18, 2019 - 7:45 am

still great post, thanks!

xmnx April 2, 2020 - 7:50 pm

Problem solved on my Rsyslog server in production ^^
Thanks!

Jan Dohnal November 15, 2020 - 5:27 am

Hi, do you change sysctl configuration on Virtual machine, or on Host machine? A have same problem with Ubuntu (Libvirt and Apache virtual machine) 🙁

Leave your solution or comment to help others. Comment don't need registration or real email, so feel free.Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Random Posts

Useful Links

Newsletter

Linux Kernel panic issue: How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem

Background

Step by step troubleshooting data and logs

Check Memory usage

Check CPU usage

Check Kernel Panic Logs

Solution for hung_task_timeout_secs

Explanation

Testing

Make it permanent

Reference

How to check loaded and compiled modules in Apache in Linux?

Find number of unique IP’s and active connections to Web server

You may also like

17 comments

Leave your solution or comment to help others. Comment don't need registration or real email, so feel free.Cancel reply

Random Posts

Useful Links

Newsletter

Discover more from blackMORE Ops