A number of us who are running caudium on 7.6.50 have been experiencing what I'll describe as "lockups" on a pretty frequent basis. Normally, the watchdog will kill the process and restart things, but I wanted to figure out what was causing the problem. I managed to catch the hangup on my server a few minutes ago, and this is what I get:
Describing all threads: Thread 1: Thread.Thread(3351712).Queue: Thread.Thread(3351712).Queue(0)->wait (_static_modules.Builtin()->mutex_key()) /usr/local/pike/7.6.50/lib/modules/Thread.pmod:278: Thread.Thread (3351712).Queue(0)->read() base_server/caudium.pike:585: /opt/caudium/server/base_server/caudium ()->handler_thread(2)
Thread 2: Thread.Thread(3351712).Queue: Thread.Thread(3351712).Queue(0)->wait (_static_modules.Builtin()->mutex_key()) /usr/local/pike/7.6.50/lib/modules/Thread.pmod:278: Thread.Thread (3351712).Queue(0)->read() base_server/caudium.pike:585: /opt/caudium/server/base_server/caudium ()->handler_thread(3)
Thread 3: Thread.Thread(3351712).Queue: Thread.Thread(3351712).Queue(0)->wait (_static_modules.Builtin()->mutex_key()) /usr/local/pike/7.6.50/lib/modules/Thread.pmod:278: Thread.Thread (3351712).Queue(0)->read() base_server/caudium.pike:585: /opt/caudium/server/base_server/caudium ()->handler_thread(4)
Thread 4: _static_modules.Builtin()->thread_id: Thread.Thread(1)->backtrace() base_server/caudium.pike:3493: /opt/caudium/server/base_server/caudium ()->describe_all_threads() Stdio.File: Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)->set_write_callback(Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)$ /usr/local/pike/7.6.50/lib/modules/Stdio.pmod/module.pmod:1097: Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)-
set_write_callback(SSL.ssl$
/usr/local/pike/7.6.50/lib/modules/SSL.pmod/sslfile.pike:1321: SSL.sslfile(Fd(103))->direct_write() /usr/local/pike/7.6.50/lib/modules/SSL.pmod/sslfile.pike:516: SSL.sslfile(Fd(103))->close(UNDEFINED,UNDEFINED) protocols/http.pike:812: /opt/caudium/server/protocols/ssl3()->end (0,UNDEFINED) _static_modules.Builtin()->Backend: Pike.Backend(0)->`()(0) /usr/local/pike/7.6.50/lib/master.pike:2703: master()->_main(({"/usr/ local/bin/pike","-DENABLE_THREADS","-DCAUDIUM","-DCAUDIUM_CACHE","- DROXEN","-Ietc/$
Thread 5: Thread.Thread(3351712).Queue: Thread.Thread(3351712).Queue(0)->wait (_static_modules.Builtin()->mutex_key()) /usr/local/pike/7.6.50/lib/modules/Thread.pmod:278: Thread.Thread (3351712).Queue(0)->read() base_server/caudium.pike:585: /opt/caudium/server/base_server/caudium ()->handler_thread(0)
Thread 6: Thread.Thread(3351712).Queue: Thread.Thread(3351712).Queue(0)->wait (_static_modules.Builtin()->mutex_key()) /usr/local/pike/7.6.50/lib/modules/Thread.pmod:278: Thread.Thread (3351712).Queue(0)->read() base_server/caudium.pike:585: /opt/caudium/server/base_server/caudium ()->handler_thread(1)
The hangup appears to be:
Thread.Queue()->read()
I also did a truss to see what the threads were up to, and here's what I got:
bash-2.05$ sudo truss -p 26891 /1: ioctl(105, DP_POLL, 0xFFBFE8B8) (sleeping...) /2: lwp_park(0x00000000, 1) (sleeping...) /3: lwp_park(0x00000000, 0) (sleeping...) /4: lwp_park(0x00000000, 1) (sleeping...) /5: lwp_park(0x00000000, 0) (sleeping...) /6: lwp_park(0x00000000, 1) (sleeping...) /1: Received signal #5, SIGTRAP, in ioctl() [caught]
This is Solaris 9/SPARC... any suggestions on how I might be able to troubleshoot the problem further?
Bill
Describing all threads: Thread 1: Thread 2: Thread 3:
Idle handler threads.
Thread 4: _static_modules.Builtin()->thread_id: Thread.Thread(1)->backtrace() base_server/caudium.pike:3493: /opt/caudium/server/base_server/caudium ()->describe_all_threads() Stdio.File: Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)->set_write_callback(Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)$
The probable cause.
Thread 5: Thread 6:
Idle handler threads.
The hangup appears to be:
Thread.Queue()->read()
Not quite...
I also did a truss to see what the threads were up to, and here's what I got:
bash-2.05$ sudo truss -p 26891 /1: ioctl(105, DP_POLL, 0xFFBFE8B8) (sleeping...)
The above hints that the problem lies in the backend.
/2: lwp_park(0x00000000, 1) (sleeping...) /3: lwp_park(0x00000000, 0) (sleeping...) /4: lwp_park(0x00000000, 1) (sleeping...) /5: lwp_park(0x00000000, 0) (sleeping...) /6: lwp_park(0x00000000, 1) (sleeping...)
A few cond_wait()s.
This is Solaris 9/SPARC... any suggestions on how I might be able to troubleshoot the problem further?
The problem looks like the one fixed in Pike 7.7's backend.cmod 1.174:
| revision 1.174 | date: 2005/04/13 14:16:43; author: grubba; state: Exp; lines: +45 -4 | Fixed deadlock in BACKEND_USES_DEVPOLL mode for POLL_DEVICE_SET_EVENTS() hanging when the backend thread is in MY_POLL(). | Fixes performance problems with Roxen on Solaris.
That fix may not have been backported to Pike 7.6.
Bill
Thread 4: _static_modules.Builtin()->thread_id: Thread.Thread(1)->backtrace() base_server/caudium.pike:3493: /opt/caudium/server/base_server/caudium ()->describe_all_threads() Stdio.File: Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)->set_write_callback(Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)$
The probable cause.
Say what? That's the thread serving his request for a thread dump, is it not?
Yes, and? Getting the backtrace is simply a matter of which thread that first calls check_threads_etc().
Minor thought lapse. I imagined for a minute that the dump was served over https. :-)
I also did a truss to see what the threads were up to, and here's what I got:
bash-2.05$ sudo truss -p 26891 /1: ioctl(105, DP_POLL, 0xFFBFE8B8) (sleeping...) /2: lwp_park(0x00000000, 1) (sleeping...) /3: lwp_park(0x00000000, 0) (sleeping...) /4: lwp_park(0x00000000, 1) (sleeping...) /5: lwp_park(0x00000000, 0) (sleeping...) /6: lwp_park(0x00000000, 1) (sleeping...)
On closer examination all of the above look normal; but note that hanged backends usually unstick if a signal is received.
Bill
The above hints that the problem lies in the backend.
That's what I originally thought...
The problem looks like the one fixed in Pike 7.7's backend.cmod 1.174:
| revision 1.174 | date: 2005/04/13 14:16:43; author: grubba; state: Exp; lines: +45 -4 | Fixed deadlock in BACKEND_USES_DEVPOLL mode for POLL_DEVICE_SET_EVENTS() hanging when the backend thread is in MY_POLL(). | Fixes performance problems with Roxen on Solaris.
That fix may not have been backported to Pike 7.6.
Actually, I backported that fix a few months ago, which did fix problems with SSL, but it doesn't seem to fix this problem. The main thread does wake up on signals, but none of the handler threads will break out of their read() on the thread queue.
Bill
I've been looking into this further, and it doesn't seem like pike itself has locked up; there are 5 worker threads in a thread queue, and one thread (the "main" thread) that's handling background tasks. The main thread is working just fine. It just seems like at some point, the thread queue seems to stop working. Any suggestions about how to track the problem? It seems to be occurring several times a week, so it's got to be a reasonably common situation.
Bill
Describing all threads: Thread 1: Thread 2: Thread 3:
Idle handler threads.
Thread 4: _static_modules.Builtin()->thread_id: Thread.Thread(1)->backtrace() base_server/caudium.pike:3493: /opt/caudium/server/base_server/caudium ()->describe_all_threads() Stdio.File: Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)->set_write_callback(Stdio.File("socket", "63.85.143.88 1198", 777 /* fd=103 */)$
The probable cause.
Thread 5: Thread 6:
Idle handler threads.
The hangup appears to be:
Thread.Queue()->read()
pike-devel@lists.lysator.liu.se