I've become aware of an issue where a multithreaded pike application on Linux causes very frequent context switches (around 17000 per second). It's about five threads that all do a lot of data structure manipulation in pike code.
What could be the cause of this? Is th_yield called way too often? Or is it possible that the thread lib is so stupid that it schedules all threads that hang on the interpreter lock in round-robin fashion?
Looks like the fallback if gethrtime doesn't exist is a bit arbitrary:
static void check_threads(struct callback *cb, void *arg, void *arg2)
{
#ifdef HAVE_GETHRTIME
  static long long last_;
  if( gethrtime()-last_ < 50000000 ) /* 0.05s slice */
    return;
  last_ = gethrtime();
#else
  static int div_;
  if(div_++ & 255) return;
#endif
This only works if check_threads is called at approximately even time intervals, which probably isn't true at all.
I'd like to add a fallback to clock():
#elif defined(USE_CLOCK_FOR_SLICES)
  if (clock() - thread_start_clock < (clock_t) (CLOCKS_PER_SEC / 20))
    return;
#else
where thread_start_clock is set to clock() in SWAP_IN_THREADS (since clock() measures the thread local cpu time).
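Concretely, something like this (just a sketch; the exact spot for the assignment is whatever place corresponds to a thread taking over the interpreter):

#include <time.h>   /* clock(), clock_t, CLOCKS_PER_SEC */

#ifdef USE_CLOCK_FOR_SLICES
/* CPU time consumed by the current thread at the point where it
   grabbed the interpreter lock. */
static clock_t thread_start_clock = 0;
#endif

/* ...and in SWAP_IN_THREADS, i.e. when a thread takes over the
   interpreter, reset the slice start: */
#ifdef USE_CLOCK_FOR_SLICES
  thread_start_clock = clock();
#endif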
Thoughts? I know that clock() has bad precision, but I think it should be adequate down to the 1/20 sec resolution used here.
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-05 16:06: Subject: Frequent context switches
I've become aware of an issue where a multithreaded pike application on Linux causes very frequent context switches (around 17000 per second). It's about five threads that all do a lot of data structure manipulation in pike code.
What could be the cause of this? Is th_yield called way too often? Or is it possible that the thread lib is so stupid that it schedules all threads that hang on the interpreter lock in round-robin fashion?
/ Martin Stjernholm, Roxen IS
clock() is a system call which takes a small but not insignificant amount of cpu time. The reason for the div_ construction is that check_threads is called a *lot*. Even using up a few cycles in this function creates measurable slowdowns.
My suggestion would be to make the code look something like:
static int divisor;
static int context_switches;
static float cps=50.0;
static float last_time=-1.0;

void check_threads()
{
  static int counter=0;
  if(counter--) return;
  counter=divisor;
  context_switches++;
  ...
}
And then, in some code which is called more seldom:
float t=float_time();
if(last_time != -1.0)
{
  float s=10.0; /* Average time */
  float frac=1.0-1.0/s;

  cps = cps * pow(frac, t - last_time) + context_switches/s;
}
context_switches=0;
last_time=t;

if(cps < 20) divisor -= divisor / 5;
if(cps > 100) divisor += divisor / 3;
This way, it will adapt automatically, not use much cpu, and you can fall back to gettimeofday() if you like, because the adaptation function is not called as often.
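float_time() above isn't anything special, by the way; it could be a plain gettimeofday() wrapper along these lines (a sketch, not existing code):

#include <sys/time.h>

/* Wall-clock time in seconds, with microsecond resolution. */
static double float_time(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}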
As far as I'm concerned, it doesn't matter if time slices are uniform in length.
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2002-09-05 18:20: Subject: Frequent context switches
Looks like the fallback if gethrtime doesn't exist is a bit arbitrary:
static void check_threads(struct callback *cb, void *arg, void *arg2)
{
#ifdef HAVE_GETHRTIME
  static long long last_;
  if( gethrtime()-last_ < 50000000 ) /* 0.05s slice */
    return;
  last_ = gethrtime();
#else
  static int div_;
  if(div_++ & 255) return;
#endif
This only works if check_threads is called at approximately even time intervals, which probably isn't true at all.
I'd like to add a fallback to clock():
#elif defined(USE_CLOCK_FOR_SLICES)
  if (clock() - thread_start_clock < (clock_t) (CLOCKS_PER_SEC / 20))
    return;
#else
where thread_start_clock is set to clock() in SWAP_IN_THREADS (since clock() measures the thread local cpu time).
Thoughts? I know that clock() has bad precision, but I think it should be adequate down to the 1/20 sec resolution used here.
/ Martin Stjernholm, Roxen IS
This seems to be some sort of decaying average algorithm. Wouldn't it be necessary for the number of calls to check_threads per second to vary in a somewhat continuous way for it to work? I.e. isn't it possible that the program shifts its working set from one that causes 100 k calls/sec to one that only causes 1 k calls/sec? Then this could produce time slices of several seconds before it readjusts. Sure, it's not important that the time slices are exactly uniform, but that would probably be too erratic.
What about this crude but obvious solution:
static int div_;
if(div_++ & 255)
  return;
if (clock() - thread_start_clock < (clock_t) (CLOCKS_PER_SEC / 20))
  return;
It would only guard against very short time slices. (Ideally one would like to put a cap on long ones too, but I guess there's no real risk that the function gets called as seldom as about 100 times/sec.)
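Put together with the earlier snippets, the fallback chain in check_threads would then look roughly like this (only a sketch combining the pieces above):

#ifdef HAVE_GETHRTIME
  static long long last_;
  if( gethrtime()-last_ < 50000000 ) /* 0.05s slice */
    return;
  last_ = gethrtime();
#elif defined(USE_CLOCK_FOR_SLICES)
  /* Cheap counter first, so clock() itself only runs once per 256 calls. */
  static int div_;
  if(div_++ & 255) return;
  if (clock() - thread_start_clock < (clock_t) (CLOCKS_PER_SEC / 20))
    return;
#else
  static int div_;
  if(div_++ & 255) return;
#endif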
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-06 00:24: Subject: Frequent context switches
clock() is a system call which takes a small but not insignificant amount of cpu time. The reason for the div_ construction is that check_threads is called a *lot*. Even using up a few cycles in this function creates measurable slowdowns.
My suggestion would be to make the code look something like:
static int divisor;
static int context_switches;
static float cps=50.0;
static float last_time=-1.0;

void check_threads()
{
  static int counter=0;
  if(counter--) return;
  counter=divisor;
  context_switches++;
  ...
}
And then, in some code which is called more seldom:
float t=float_time();
if(last_time != -1.0)
{
  float s=10.0; /* Average time */
  float frac=1.0-1.0/s;

  cps = cps * pow(frac, t - last_time) + context_switches/s;
}
context_switches=0;
last_time=t;

if(cps < 20) divisor -= divisor / 5;
if(cps > 100) divisor += divisor / 3;
This way, it will adapt automatically, not use much cpu, and you can fall back to gettimeofday() if you like, because the adaptation function is not called as often.
As far as I'm concerned, it doesn't matter if time slices are uniform in length.
/ Fredrik (Naranek) Hubinette (Real Build Master)
It's better than the previous version, but I suspect it would still slow things down noticeably. However, I think some measurements would be in order to know for sure. (On various platforms, of course.)
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2002-09-06 16:59: Subject: Frequent context switches
This seems to be some sort of decaying average algorithm. Wouldn't it be necessary for the number of calls to check_threads per second to vary in a somewhat continuous way for it to work? I.e. isn't it possible that the program shifts its working set from one that causes 100 k calls/sec to one that only causes 1 k calls/sec? Then this could produce time slices of several seconds before it readjusts. Sure, it's not important that the time slices are exactly uniform, but that would probably be too erratic.
What about this crude but obvious solution:
static int div_;
if(div_++ & 255)
  return;
if (clock() - thread_start_clock < (clock_t) (CLOCKS_PER_SEC / 20))
  return;
It would only guard against very short time slices. (Ideally one would like to put a cap on long ones too, but I guess there's no real risk that the function gets called as seldom as about 100 times/sec.)
/ Martin Stjernholm, Roxen IS
Compared to a context switch it ought to be negligible. So an alternative way could perhaps be to measure the divisor at startup, and also have a counter of active threads so the function doesn't do anything if there's only one thread running. (It could perhaps even be removed from evaluator_callbacks altogether then.)
Btw, is gethrtime always cheap? Wouldn't it be wise to put that too behind the div_ check?
I've investigated the reason for the frequent calls a bit more now, and I think there are some suspect check_threads_etc() calls in apply_low.h. E.g. the following program generates 4 million calls per second to check_threads on my system:
int gronk() {return 17;}

void work()
{
  while (1) gronk();
}

int main()
{
  for (int i = 0; i < 5; i++)
    Thread.thread_create (work);
  sleep (10);
}
The reason for the many calls is the pike function call in the loop. Every pike function call and most lfun calls cause a call to check_threads, while many other call variants do not. Is there a reason for this? I'm considering replacing all the check_threads_etc() calls in apply_low.h with a single fast_check_threads_etc(6) at the top.
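(For reference: the point of fast_check_threads_etc(N) is to put a cheap counter in front of the real check so that only a small fraction of the calls fall through to it. Schematically it amounts to something like the following; this is just an illustration of the idea, not the actual Pike macro.)

/* Schematic counterpart of fast_check_threads_etc(N): a per-call-site
   counter gates the real check so it only runs roughly once every
   2^N calls.  Illustration only, not the real macro. */
#define my_fast_check_threads_etc(N) do {        \
    static int counter_;                         \
    if(!(counter_++ & ((1 << (N)) - 1)))         \
      check_threads_etc();                       \
  } while(0)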
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-10 01:01: Subject: Frequent context switches
It's better than the previous version, but I suspect it would still slow things down noticeably. However, I think some measurements would be in order to know for sure. (On various platforms, of course.)
/ Fredrik (Naranek) Hubinette (Real Build Master)
Well, the general idea is that any call that can cause recursion or iteration should at least do check_threads *sometimes*. However, beware that it is really easy to cause loooong delays (several seconds) between calls to check_threads_etc() if you use fast_check_threads instead. If I remember correctly, I first replaced all check_threads_etc() with fast_check_threads_etc(), but that caused problems for some applications...
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2002-09-10 02:02: Subject: Frequent context switches
Compared to a context switch it ought to be negligible. So an alternative way could perhaps be to measure the divisor at startup, and also have a counter of active threads so the function doesn't do anything if there's only one thread running. (It could perhaps even be removed from evaluator_callbacks altogether then.)
Btw, is gethrtime always cheap? Wouldn't it be wise to put that too behind the div_ check?
I've investigated the reason for the frequent calls a bit more now, and I think there are some suspect check_threads_etc() calls in apply_low.h. E.g. the following program generates 4 million calls per second to check_threads on my system:
int gronk() {return 17;}

void work()
{
  while (1) gronk();
}

int main()
{
  for (int i = 0; i < 5; i++)
    Thread.thread_create (work);
  sleep (10);
}
The reason for the many calls is the pike function call in the loop. Every pike function call and most lfun calls cause a call to check_threads, while many other call variants do not. Is there a reason for this? I'm considering replacing all the check_threads_etc() calls in apply_low.h with a single fast_check_threads_etc(6) at the top.
/ Martin Stjernholm, Roxen IS
It'd be interesting to see how those applications worked. I discussed it a bit with Grubba and we reasoned like this:
If there's a single fast_check_threads_etc() in mega_apply there would be long delays only if mega_apply is called seldom. It's hard to see any way that would happen if it's pike code that does the work, so the only case would be long C function calls, e.g. doing indices() on mappings and things like that. So one way would be to use check_threads_etc before every C function call but only, say, fast_check_threads_etc(8) before Pike function calls (only to catch the mostly theoretical cases where a significant amount of time is spent in Pike functions that only call each other and not any C functions at all).
Unfortunately most C functions are faster and more frequent than Pike functions so it'd still be very uneven. The good solution would be to use fast_check_threads_etc() on C functions too and then add check_threads_etc() inside the C functions when they are about to start on some bigger task. We might not want to rely on that though, so the question then becomes: How big can the ratio of slow to fast C function calls become? We couldn't imagine any reasonable cases where it becomes very big, so a fast_check_threads_etc(4) in front of C function calls seems to be on the safe side.
So the conclusion is to have fast_check_threads_etc(4) for C function calls and fast_check_threads_etc(8) for Pike calls. Do you think that's reasonably safe?
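To be concrete, the placement we have in mind is roughly this (a sketch with made-up case names; the real dispatch in apply_low.h is of course structured differently):

/* Sketch only -- the enum and function are hypothetical, and
   fast_check_threads_etc/check_threads_etc are assumed to be declared. */
enum call_kind { CALL_C_FUNCTION, CALL_PIKE_FUNCTION };

static void apply_dispatch_sketch(enum call_kind kind)
{
  switch(kind)
  {
    case CALL_C_FUNCTION:
      /* C functions can run for a long time on their own,
         so check a bit more often here. */
      fast_check_threads_etc(4);
      /* ... dispatch to the C function ... */
      break;

    case CALL_PIKE_FUNCTION:
      /* Mainly to catch the mostly theoretical case of Pike functions
         that only call each other and never any C code. */
      fast_check_threads_etc(8);
      /* ... set up the Pike frame and enter the interpreter ... */
      break;
  }
}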
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-10 08:11: Subject: Frequent context switches
Well, the general idea is that any call that can cause recursion or iteration should at least do check_threads *sometimes*. However, beware that it is really easy to cause loooong delays (several seconds) between calls to check_threads_etc() if you use fast_check_threads instead. If I remember correctly, I first replaced all check_threads_etc() with fast_check_threads_etc(), but that caused problems for some applications...
/ Fredrik (Naranek) Hubinette (Real Build Master)
Seems reasonable. However, you may want to test whether fast_check_threads_etc(4) actually gives any speed gain at all, considering that check_threads_etc() is a pretty speedy call.
As always, measuring is key.
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2002-09-10 15:54: Subject: Frequent context switches
It'd be interesting to see how those applications worked. I discussed it a bit with Grubba and we reasoned like this:
If there's a single fast_check_threads_etc() in mega_apply there would be long delays only if mega_apply is called seldom. It's hard to see any way that would happen if it's pike code that does the work, so the only case would be long C function calls, e.g. doing indices() on mappings and things like that. So one way would be to use check_threads_etc before every C function call but only, say, fast_check_threads_etc(8) before Pike function calls (only to catch the mostly theoretical cases where a significant amount of time is spent in Pike functions that only call each other and not any C functions at all).
Unfortunately most C functions are faster and more frequent than Pike functions so it'd still be very uneven. The good solution would be to use fast_check_threads_etc() on C functions too and then add check_threads_etc() inside the C functions when they are about to start on some bigger task. We might not want to rely on that though, so the question then becomes: How big can the ratio of slow to fast C function calls become? We couldn't imagine any reasonable cases where it becomes very big, so a fast_check_threads_etc(4) in front of C function calls seems to be on the safe side.
So the conclusion is to have fast_check_threads_etc(4) for C function calls and fast_check_threads_etc(8) for Pike calls. Do you think that's reasonably safe?
/ Martin Stjernholm, Roxen IS
I've made a couple of observations:
o clock() on Linux is about 30% slower than gethrtime on Solaris (reasonable since clock() measures thread local cpu time while gethrtime measures real time). clock() on Solaris is 50% slower than gethrtime. My conclusion: Try to use gethrtime if it exists, with fallback to clock. Since their speed is on the same order of magnitude one should be equally cautious about calling either of them.
o Many of the calls to C functions are not made through mega_apply; some of them become special opcodes and others are made through the "call builtin" opcodes, which don't have any check_threads_etc calls. So it might not be so reasonable after all to assume that there always is a low ratio of slow C function calls in mega_apply.
o Even only a fast_check_threads_etc(1) gives a speed improvement compared to check_threads_etc(). That's regardless of whether there's a clock/gethrtime check in check_threads or not.
o If there's no working yield then it's easy to get starvation since a context switch rarely happens by itself in the unlocked window in check_threads. That means that it could be disastrous to have the once-every-256 divisor check or a clock/gethrtime check which short-circuits the unlocked window. I've checked in a patch that disables them completely if th_yield() doesn't expand to a function.
(I discovered this by accident; there was a bug in pike_threadlib.h that blocked the fallback to thr_yield if the POSIX thread lib is chosen and there's no pthread_yield. This happened on at least Solaris 7.)
o Using clock or gethrtime to cover the yield call gives a comparatively small speed improvement. E.g. on sol7-x86 I got a test loop 42% faster by limiting the number of context switches from 41000 to 20 per second. With the simple divisor check in place first, the gap shrinks to 230 vs. 20 context switches per second, which translates to only about a 10% speed increase.
I expected the clock/gethrtime calls to be a lot cheaper than a context switch, but that's apparently not the case. Perhaps it's some effect in my small test case that causes the context switches to be uncharacteristically fast, I don't know. Anyway, the clock/gethrtime check gives some speed improvement so it should be used unless the divisor is tuned better. But I think that'd be too risky since there's no telling how much the call rate to check_threads can vary.
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-11 06:33: Subject: Frequent context switches
Seems reasonable. However, you may want to test whether fast_check_threads_etc(4) actually gives any speed gain at all, considering that check_threads_etc() is a pretty speedy call.
As always, measuring is key.
/ Fredrik (Naranek) Hubinette (Real Build Master)
Very interesting. However, I don't see what th_yield() has to do with anything. Unlocking a mutex that someone else is waiting for should automatically yield in a correct thread implementation.
/ Fredrik (Naranek) Hubinette (Real Build Master)
Previous text:
2002-09-14 03:47: Subject: Frequent context switches
I've made a couple of observations:
o clock() on Linux is about 30% slower than gethrtime on Solaris (reasonable since clock() measures thread local cpu time while gethrtime measures real time). clock() on Solaris is 50% slower than gethrtime. My conclusion: Try to use gethrtime if it exists, with fallback to clock. Since their speed is on the same order of magnitude one should be equally cautious about calling either of them.
o Many of the calls to C functions are not made through mega_apply; some of them become special opcodes and others are made through the "call builtin" opcodes, which don't have any check_threads_etc calls. So it might not be so reasonable after all to assume that there always is a low ratio of slow C function calls in mega_apply.
o Even only a fast_check_threads_etc(1) gives a speed improvement compared to check_threads_etc(). That's regardless of whether there's a clock/gethrtime check in check_threads or not.
o If there's no working yield then it's easy to get starvation since a context switch rarely happens by itself in the unlocked window in check_threads. That means that it could be disastrous to have the once-every-256 divisor check or a clock/gethrtime check which short-circuits the unlocked window. I've checked in a patch that disables them completely if th_yield() doesn't expand to a function.
(I discovered this by accident; there was a bug in pike_threadlib.h that blocked the fallback to thr_yield if the POSIX thread lib is chosen and there's no pthread_yield. This happened on at least Solaris 7.)
o Using clock or gethrtime to cover the yield call gives a comparatively small speed improvement. E.g. on sol7-x86 I got a test loop 42% faster by limiting the number of context switches from 41000 to 20 per second. With the simple divisor check in place first, the gap shrinks to 230 vs. 20 context switches per second, which translates to only about a 10% speed increase.
I expected the clock/gethrtime calls to be a lot cheaper than a context switch, but that's apparently not the case. Perhaps it's some effect in my small test case that causes the context switches to be uncharacteristically fast, I don't know. Anyway, the clock/gethrtime check gives some speed improvement so it should be used unless the divisor is tuned better. But I think that'd be too risky since there's no telling how much the call rate to check_threads can vary.
/ Martin Stjernholm, Roxen IS
Not necessarily. The current thread may have as much right to run as the waiting thread, in which case continuing to run would also be correct behaviour. Only when a running thread itself waits or it has run out of allocated time is a thread switch needed for correctness.
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2002-09-15 12:18: Subject: Frequent context switches
Very interesting. However, I don't see what th_yield() has to do with anything. Unlocking a mutex that someone else is waiting for should automatically yield in a correct thread implementation.
/ Fredrik (Naranek) Hubinette (Real Build Master)
A good thread implementation should let the current thread continue if it can and hasn't run its full time slice. In such implementations it's probably better not to use a yield call together with various tricks to construct sane time slices. I think I'll try that on Linux and Solaris 7 and see what it gives.
/ Martin Stjernholm, Roxen IS
Previous text:
2002-09-15 12:18: Subject: Frequent context switches
Very interesting. However, I don't see what th_yield() has to do with anything. Unlocking a mutex that someone else is waiting for should automatically yield in a correct thread implementation.
/ Fredrik (Naranek) Hubinette (Real Build Master)
pike-devel@lists.lysator.liu.se