I’m running builds of 8.0 to make sure we don’t have any major test failures, and I’ve run into a few problems so far. I’ll put them in separate emails so they are more manageable. If anyone can offer any assistance, that would be most appreciated. I can supply any info needed, up to getting you a logon to the systems in question.
First up: socktest.pike hangs on macOS 10.12 and newer. macOS 10.11 and earlier do not have this problem, and I haven’t tried running an older binary on a newer OS. The call to gc() in finish() never returns, and according to LLDB:
(lldb) thread list
Process 4746 stopped
* thread #1: tid = 0xa7cc2d, 0x00007fff781f922a libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff781f922a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff781f976c libsystem_kernel.dylib`mach_msg + 60
    frame #2: 0x00007fff781fd05e libsystem_kernel.dylib`clock_get_time + 85
    frame #3: 0x0000000105974667 pike`mach_clock_get_time at rusage.c:633:7
    frame #4: 0x00000001058e713b pike`do_gc(ignored_UNUSED=<unavailable>, explicit_call=<unavailable>) at gc.c:3507:24
    frame #5: 0x00000001059ee723 pike`f_gc(args=<unavailable>) at builtin_functions.c:5126:11
    frame #6: 0x00000001063607b7
    frame #7: 0x00000001058a365a pike`mega_apply [inlined] eval_instruction(pc=<unavailable>) at interpret.c:1711:5
    frame #8: 0x00000001058a3658 pike`mega_apply(type=<unavailable>, args=<unavailable>, arg1=<unavailable>, arg2=<unavailable>) at interpret.c:2695
    frame #9: 0x000000010589c72d pike`apply_svalue(s=<unavailable>, args=<unavailable>) at interpret.c:3158:5
    frame #10: 0x0000000105a31769 pike`got_fd_event(box=0x0000000105de5008, event=937461904) at file.c:368:5
    frame #11: 0x00000001058c901f pike`backend_call_active_callbacks(fd_list=0x00007ffeea373ae8, me_UNUSED=<unavailable>) at backend.cmod:2349:6
    frame #12: 0x00000001058c4839 pike`pdb_low_backend_once(pdb=0x00007fef37e086d0, timeout=0x00007ffeea373fa8) at backend.cmod:4137:11
    frame #13: 0x00000001058c4aec pike`f_PollDeviceBackend_cq__backtick_28_29(args=1) at backend.cmod:4315:5
    frame #14: 0x00000001058a1cbc pike`low_mega_apply(type=APPLY_SVALUE, args=1, arg1=<unavailable>, arg2=<unavailable>) at apply_low.h:221:2
    frame #15: 0x00000001058a2753 pike`jump_opcode_F_CALL_FUNCTION_AND_POP at interpret_functions.h:2452:1
    frame #16: 0x00000001061d4348
Some other information I discovered looking into this:
I tried adding a sleep() to the child process so that I could examine the process with dtrace, but the sleep() never returned. sleep() works fine when run as pike -e 'sleep(5);'. If I disable the fork(), the test runs successfully.
The following test case demonstrates the problem. The sleep() can be exchanged for gc() and it also hangs.
int main() {
  object pid;
  if (mixed err = catch { pid = fork(); }) {
    werror("fork() failed\n");
  } else if (pid) {
    int res = pid->wait();
    werror("child exited.\n");
    return 0;
  }

  werror("child\n");
  sleep(2);
  werror("slept\n");
  return 0;
}
bin/pike test2.pike
child
I don’t quite understand why clock_get_time() would hang like that unless there was some sort of problem with the mach clock service across processes, though it wouldn’t surprise me if that were a problem. What is also interesting is that clock_gettime() is available in 10.12 and newer. According to the manpage, this is POSIX compliant and provides CLOCK_MONOTONIC, which is what is used on some other systems. There is a problem in that _POSIX_MONOTONIC_CLOCK is set to -1, which seems to contradict the man page. Not sure if it makes sense to try that instead, or re-initialize the clock service after the fork?
Bill
On Nov 19, 2020, at 2:43 PM, H William Welliver william@welliver.org wrote:
…
So, a little more experimentation and it appears that my hunch was correct: mach ports are invalid in the child process, and a call to init_mach_clock() following the fork() seems to restore order. What’s the best approach to make that happen? I see that atfork_child_callback gets called in the child process after fork… is that the approved approach?
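For illustration, the general pattern of re-acquiring the clock port in the child can be sketched in C as below. This is not Pike's actual code: it uses the standard pthread_atfork() hook as a stand-in for Pike's own atfork_child_callback mechanism, and example_init_clock(), example_atfork_child(), example_setup() and example_fork_and_count() are hypothetical names. The Mach calls are only compiled on macOS; elsewhere the sketch just counts initializations so the control flow can still be exercised.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifdef __APPLE__
#include <mach/mach.h>
#include <mach/clock.h>
/* Mach port for the clock service; a copy inherited across fork() is not usable. */
static clock_serv_t example_clock_port;
#endif

/* Number of times the clock was (re)initialized in the current process. */
static int clock_inits = 0;

/* Hypothetical stand-in for init_mach_clock(): acquire a fresh clock port. */
static void example_init_clock(void)
{
#ifdef __APPLE__
  host_get_clock_service(mach_host_self(), SYSTEM_CLOCK, &example_clock_port);
#endif
  clock_inits++;
}

/* Runs only in the child, immediately after fork() returns there. */
static void example_atfork_child(void)
{
  example_init_clock();
}

/* Call once at startup: initialize the clock and register the child hook.
 * Returns 0 on success, as pthread_atfork() does. */
static int example_setup(void)
{
  example_init_clock();
  return pthread_atfork(NULL, NULL, example_atfork_child);
}

/* Forks, and reports the child's initialization count through its exit
 * status. Returns -1 if fork() or waitpid() fails. */
static int example_fork_and_count(void)
{
  int status = 0;
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(clock_inits);
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After example_setup(), the parent has initialized the clock once; a child created with fork() runs the hook and initializes it a second time, while the parent's state is left alone.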
Bill
This has now been implemented in 8.1, and socktest.pike no longer hangs on macOS 11.1. I can backport this, and other macOS fixes, to 8.0 once they get some more testing.
pike-devel@lists.lysator.liu.se