In checking the Windows build I notice:
--------
Doing tests in modules/Image/testsuite (1480 tests, pid 44356)
Cond 1477:1:Index 'FIXED' not present in module Gz.
Cond 1477:1:Indexed module was: Gz.
Conditional 1477 (line 827) failed.
Doing tests in modules/Mysql/testsuite (0 tests, pid 45592)
--------
The testsuite of Gz seems to be ok, though. Any takers?
Stephen R. van den Berg wrote:
In checking the Windows build I notice:
Doing tests in modules/Image/testsuite (1480 tests, pid 44356)
Cond 1477:1:Index 'FIXED' not present in module Gz.
Cond 1477:1:Indexed module was: Gz.
Conditional 1477 (line 827) failed.
Doing tests in modules/Mysql/testsuite (0 tests, pid 45592)
I finally fixed those.
In checking the testsuite results, I notice two typical unresolved failure cases:
a. For the latest Solaris versions:
[00:46:09] Testing 80 sockets.
[00:46:09] Testing 84 sockets.
Connect failed: Address in use. Waiting for better times.
[00:46:10] Succeeded after 1 seconds.
[00:46:31] No callbacks for 20 seconds!
[00:46:31] 16 open fds:
[00:46:31] 0 - 11, 13, 15 - 16, 21
Judging by the fact that it doesn't happen all the time, it seems to be a race condition of some kind. Did anyone chase this one and have more information? Do I have enough quota on the git-pike Solaris account to try and hunt this one down?
b. For the newer FreeBSD versions:
[00:00:23] Socket test
[00:00:23] Forking... ok.
[00:00:23] Doing simple tests.
[00:00:23] Testing dup & assign.
[00:00:23] Testing accept.
Failed to read complete data, errno=54, "Connection reset by peer".
[00:00:23] 10:Input buffer: 8192 bytes.
[00:00:23] 10:Expected data: 28266 bytes.
[00:00:23] Child failed with errcode 1
[00:00:23] Running in parent...
[00:00:23] Doing simple tests.
[00:00:23] Testing dup & assign.
[00:00:23] Testing accept.
Failed to read complete data, errno=54, "Connection reset by peer".
[00:00:23] 10:Input buffer: 8192 bytes.
[00:00:23] 10:Expected data: 28266 bytes.
Since I don't have access to a FreeBSD machine, I can only speculate. It seems like this could be an OS error, or an unexpected race condition which (due to the fact that it's on localhost) triggers all the time. Did anyone chase this one and have more information? Who is providing the FreeBSD xenofarm machines, any chance I or anyone else could get temporary access there to have a stab at this?
Do I have enough quota on the git-pike Solaris account to try and hunt this one down?
You are limited to 750000 OS threads in that instance, but other than that there aren't any limits you should hit besides what the underlying machine offers in CPU, RAM and disk. Please don't fill up the disk though. Something is bound to handle that badly.
Peter Bortas @ Pike developers forum wrote:
Do I have enough quota on the git-pike Solaris account to try and hunt this one down?
You are limited to 750000 OS threads in that instance, but other than that there aren't any limits you should hit besides what the underlying machine offers in CPU, RAM and disk. Please don't fill up the disk though. Something is bound to handle that badly.
Could you install SUNWlibm on eureka-git? Compiling pike without a /usr/include/math.h proves difficult.
a. For the latest Solaris versions:
[00:46:09] Testing 80 sockets.
[00:46:09] Testing 84 sockets.
Connect failed: Address in use. Waiting for better times.
[00:46:10] Succeeded after 1 seconds.
b. For the newer FreeBSD versions:
[00:00:23] Testing accept.
Failed to read complete data, errno=54, "Connection reset by peer".
[00:00:23] 10:Input buffer: 8192 bytes.
[00:00:23] 10:Expected data: 28266 bytes.
Found and fixed both problems, I think. The solution involved slightly more code than I expected, then again, I eliminated quite some cruft from the Pike binary in the process (not to worry, the cruft is still there for AmigaOS fans). The testsuite now works on Linux and Solaris. I don't have any other way to test it really except by checking it in. I'll be doing so in a few minutes. Please review the changes.
N.B. The changes in socktest are necessary because the previous behaviour in case of errors and recovery was completely bogus.
[...]
Found and fixed both problems, I think. The solution involved slightly more code than I expected, then again, I eliminated quite some cruft from the Pike binary in the process (not to worry, the cruft is still there for AmigaOS fans).
Great!
The testsuite now works on Linux and Solaris. I don't have any other way to test it really except by checking it in. I'll be doing so in a few minutes. Please review the changes.
Will do.
Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
The testsuite now works on Linux and Solaris. I don't have any other way to test it really except by checking it in. I'll be doing so in a few minutes. Please review the changes.
Ok, the game is afoot.
I would appreciate some last fine-tuning of the patch I made to the socket tests. I kind of lost the overview in the object-oriented spaghetti code (or better yet, I never got a clear insight into what exactly happens when and where) in socktest.pike.
The idea is that after trying to acquire more sockets in the face of a persistent EADDRINUSE for 30 seconds, the testing code punts and silently drops the socket (or perhaps we could print a warning), so that the test does not depend too heavily on system resources.
I tried to emulate what is supposed to happen to drop the socket successfully, but apparently I didn't succeed completely (I still see some timeouts while running it on Solaris, although not every failure results in a timeout anymore).
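The punting rule described above can be sketched in C roughly as follows. This is only an illustration of the decision logic, not the code in socktest.pike: the names `next_step`, `after_connect_failure` and `ADDR_IN_USE_BUDGET_SECONDS` are hypothetical, and the 30-second budget is simply the figure mentioned in the message.

```c
#include <errno.h>

/* Hypothetical budget: how long to keep retrying on EADDRINUSE. */
#define ADDR_IN_USE_BUDGET_SECONDS 30.0

/* Hypothetical outcomes for one failed connect attempt. */
enum next_step {
    STEP_RETRY,        /* wait a little and try again */
    STEP_DROP_SOCKET,  /* give up on this socket, keep the test running */
    STEP_FAIL          /* unrelated error: report it as a test failure */
};

/* Decide what to do after a connect attempt failed with errno value `err`,
 * given how many seconds have passed since the first attempt. */
enum next_step after_connect_failure(int err, double elapsed_seconds)
{
    if (err != EADDRINUSE)
        return STEP_FAIL;
    if (elapsed_seconds >= ADDR_IN_USE_BUDGET_SECONDS)
        return STEP_DROP_SOCKET;
    return STEP_RETRY;
}
```

A caller would invoke this after each failed attempt and, on `STEP_DROP_SOCKET`, also release whatever bookkeeping the test keeps for that socket, which appears to be the part that still leads to timeouts on Solaris.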
Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
The testsuite now works on Linux and Solaris. I don't have any other way to test it really except by checking it in. I'll be doing so in a few minutes. Please review the changes.
Ok, the game is afoot.
I would appreciate some last fine-tuning of the patch I made to the socket tests. I kind of lost the overview in the object-oriented spaghetti code (or better yet, I never got a clear insight into what exactly happens when and where) in socktest.pike.
I agree that the socktest.pike code isn't the most readable... One of the reasons is probably that it has survived since before Pike 0.4 (i.e. since before the module system).
The idea is that after trying to acquire more sockets in the face of a persistent EADDRINUSE for 30 seconds, the testing code punts and silently drops the socket (or perhaps we could print a warning), so that the test does not depend too heavily on system resources.
Sounds reasonable.
I tried to emulate what is supposed to happen to drop the socket successfully, but apparently I didn't succeed completely (I still see some timeouts while running it on Solaris, although not every failure results in a timeout anymore).
I've reenabled socketpair_ultra in a few more cases (for systems where UNIX_SOCKETS_WORKS_WITH_SHUTDOWN hasn't been set).
I would appreciate some last fine-tuning of the patch I made to the socket tests. I kind of lost the overview in the object-oriented spaghetti code (or better yet, I never got a clear insight into what exactly happens when and where) in socktest.pike.
I agree that the socktest.pike code isn't the most readable... One of the reasons is probably that it has survived since before Pike 0.4 (i.e. since before the module system).
I see that neither you nor anyone else has changed it yet. If nobody understands it well enough to fix it further, it might be faster to rewrite it from scratch.
I tried to emulate what is supposed to happen to drop the socket successfully, but apparently I didn't succeed completely (I still see some timeouts while running it on Solaris, although not every failure results in a timeout anymore).
I've reenabled socketpair_ultra in a few more cases (for systems where UNIX_SOCKETS_WORKS_WITH_SHUTDOWN hasn't been set).
That's fine, as long as it doesn't appear in the implementations I use (mostly Linux), and it doesn't; I just checked. My fingers itched to sanitise the socketpair_ultra and my_socketpair functions, but since they're now excluded from the Linux implementation of Pike, my itch is gone :-).
With respect to the FreeBSD testsuite failures, I suspect they're the result of a failing write_oob in which the send() system call returns an ECONNRESET for some reason. It seems a bit silly though to insert debugging statements in the main Pike CVS just to be able to try and pinpoint the problem on FreeBSD via Pikefarm feedback. Anyone willing to provide a temporary login to a FreeBSD machine to troubleshoot this?
In the last episode (Feb 19), Stephen R. van den Berg said:
With respect to the FreeBSD testsuite failures, I suspect they're the result of a failing write_oob in which the send() system call returns an ECONNRESET for some reason. It seems a bit silly though to insert debugging statements in the main Pike CVS just to be able to try and pinpoint the problem on FreeBSD via Pikefarm feedback. Anyone willing to provide a temporary login to a FreeBSD machine to troubleshoot this?
I should be able to create a VM that you can log into tomorrow (all the evoy.net pikefarm entries are mine).
Dan Nelson wrote:
In the last episode (Feb 19), Stephen R. van den Berg said:
With respect to the FreeBSD testsuite failures, I suspect they're the result of a failing write_oob in which the send() system call returns an ECONNRESET for some reason. It seems a bit silly though to insert
I should be able to create a VM that you can log into tomorrow (all the evoy.net pikefarm entries are mine).
The FreeBSD issue has been solved AFAICS, so thanks for the access, you can close it up again.
Henrik stated:
Fixed persistent typo in symbol UNIX_SOCKETS_WORKS_WITH_SHUTDOWN: WORK ==> WORKS
Not that it is such a big deal, but, "UNIX Sockets" is plural, so in proper English the verb should be "work", not "works". Or am I misinterpreting the intention?
I'm aware of this, which was the cause of the typo to begin with; unfortunately the person who wrote the configure test wasn't...
-- Sincerely, Stephen R. van den Berg.
I've reenabled socketpair_ultra in a few more cases (for systems where UNIX_SOCKETS_WORKS_WITH_SHUTDOWN hasn't been set).
Is the recent MacOSX breakage a result of this capability not properly being diagnosed in configure?
Stephen R. van den Berg wrote:
I've reenabled socketpair_ultra in a few more cases (for systems where UNIX_SOCKETS_WORKS_WITH_SHUTDOWN hasn't been set).
Is the recent MacOSX breakage a result of this capability not properly being diagnosed in configure?
Found and (sort of) fixed this.
Summary of the changes:
- Closing a socket one-way results in errors when still trying to get OOB data from it, i.e. the OS signals an error because it knows that OOB data cannot be sent anymore.
- At least Linux returns EOPNOTSUPP in this case.
- At least FreeBSD returns ECONNRESET in this case.
- MacOSX returns EOPNOTSUPP, with a value of errno == 102. However, with MACOSX_DEPLOYMENT_TARGET=10.3 set in smartlink, the compilation results in an EOPNOTSUPP definition of 45, which doesn't work if the kernel still returns 102.
Therefore I disabled the MACOSX_DEPLOYMENT_TARGET=10.3 macro. Not sure what negative impact this has. People with MacOSX experience, please look into it (I merely tested it on my Leopard MacBook).
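One way to picture the handling summarised above is a small C wrapper that treats the listed errno values as "no OOB data" instead of a hard failure. This is a sketch under assumptions, not the change that went into Pike: `oob_error_means_no_data` and `read_oob_tolerant` are hypothetical names, and only the two errno values named in the summary are covered.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: errno values that, per the summary above, merely
 * mean "no OOB data can arrive on this half-closed socket". */
int oob_error_means_no_data(int err)
{
    return err == EOPNOTSUPP   /* reported for Linux and MacOSX */
        || err == ECONNRESET;  /* reported for FreeBSD */
}

/* Hypothetical wrapper around recv(..., MSG_OOB): returns the number of
 * OOB bytes read, 0 if the error only means "no OOB data", or -1 with
 * errno set for any other failure. */
ssize_t read_oob_tolerant(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, MSG_OOB);
    if (n < 0 && oob_error_means_no_data(errno))
        return 0;
    return n;
}
```

Note that the comparison uses whatever value the headers give EOPNOTSUPP at compile time, which is exactly why a compiled-in 45 fails to match a kernel that returns 102. Treating ECONNRESET as harmless is also only reasonable for the OOB probe itself; a reset on the normal read path should still be reported.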
Therefore I disabled the MACOSX_DEPLOYMENT_TARGET=10.3 macro. Not sure what negative impact this has. People with MacOSX experience, please look into it (I merely tested it on my Leopard MacBook).
We're still building on 10.4 and 10.5 in various combinations of PPC32, x86 and x86_64 and I'd be concerned this change leads to regressions. If you look in src/configure.in you'll see why MACOSX_DEPLOYMENT_TARGET was introduced in the first place:
# 10.3 or newer take advantage of two-level namespaces to avoid
# symbol collisions for e.g. libjpeg which is referenced from both
# _Image_JPEG and _Image_TIFF. It requires MACOSX_DEPLOYMENT_TARGET
# which is initialized in smartlink to 10.3.
LDSHARED="$REALCC $CFLAGS -bundle -bind_at_load -undefined dynamic_lookup"
Have you checked whether those modules still work? Any change if you assign a more recent version number to the environment variable? I don't see a problem with discontinuing support for 10.3 specifically if that helps.
Jonas Walldén @ Pike developers forum wrote:
Therefore I disabled the MACOSX_DEPLOYMENT_TARGET=10.3 macro. Not sure what negative impact this has. People with MacOSX experience, please look into it (I merely tested it on my Leopard MacBook).
We're still building on 10.4 and 10.5 in various combinations of PPC32, x86 and x86_64 and I'd be concerned this change leads to
# 10.3 or newer take advantage of two-level namespaces to avoid
# symbol collisions for e.g. libjpeg which is referenced from both
# _Image_JPEG and _Image_TIFF. It requires MACOSX_DEPLOYMENT_TARGET
# which is initialized in smartlink to 10.3.
LDSHARED="$REALCC $CFLAGS -bundle -bind_at_load -undefined dynamic_lookup"
Have you checked whether those modules still work? Any change if you
Sorry, no. I'm not using the MacBook for anything other than a movable terminal. I just installed a minimal Pike tree to zoom in on the testsuite problems.
assign a more recent version number to the environment variable? I don't see a problem with discontinuing support for 10.3 specifically if that helps.
Well, on my Leopard MacBook, setting it to 10.4 exhibits the same problem, setting it to 10.5 solves it. Basically, this is only relevant during the compilation of file.c. So if the compilation of this file could be singled out, it might be solved as well.
So if I understand correctly the value of EOPNOTSUPP changed between 10.4 and 10.5. Is there no way to detect the correct value at runtime? It seems like coding one or the other value into the binary will prevent it from working on the "other" OS version...
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
So if I understand correctly the value of EOPNOTSUPP changed between 10.4 and 10.5. Is there no way to detect the correct value at runtime? It seems like coding one or the other value into the binary will prevent it from working on the "other" OS version...
Well, I wouldn't rush to that conclusion just yet. The only things I can vouch for are that on my MacBook running OSX Leopard:
- The kernel returns 102 for EOPNOTSUPP.
- The include files define it to be:

  MACOSX_DEPLOYMENT_TARGET      EOPNOTSUPP as defined by
  (environment for cpp)         the system include files
  undefined                     102
  10.3                          45
  10.4                          45
  10.5                          102
Looks from /usr/include/sys/cdefs.h that we need to set __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ to something lower than 1050 (for 10.5) to avoid __DARWIN_UNIX03.
...but AvailabilityMacros.h indicates that it should already be done through MACOSX_DEPLOYMENT_TARGET. Hmm, perhaps the latter isn't set in enough places during Pike builds?
Jonas Walldén @ Pike developers forum wrote:
Looks from /usr/include/sys/cdefs.h that we need to set __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ to something lower than 1050 (for 10.5) to avoid __DARWIN_UNIX03.
Actually, we *want* __DARWIN_UNIX03 to be set to 1.
Quoting from <sys/errno.h>:
#if __DARWIN_UNIX03 || defined(KERNEL)
/* This value is only discrete when compiling __DARWIN_UNIX03, or KERNEL */
#define EOPNOTSUPP 102 /* Operation not supported on socket */
#endif /* __DARWIN_UNIX03 || KERNEL */
At least on a MacOSX Leopard MacBook we do, because that one returns 102 from the kernel in case of a not-supported error.
What does it mean that the value is only "discrete" when compiling __DARWIN_UNIX03? What is the definition when __DARWIN_UNIX03 is not true? If the value changed to 102 in 10.5, then it makes sense to compile in the constant 102 only if you are going to support 10.5 and newer exclusively, which is what this #if seems to suggest.
Googling some on the code snippet you showed us, I found the other half:
#if !__DARWIN_UNIX03 && !defined(KERNEL)
/*
 * This is the same for binary and source copmpatability, unless compiling
 * the kernel itself, or compiling __DARWIN_UNIX03; if compiling for the
 * kernel, the correct value will be returned. If compiling non-POSIX
 * source, the kernel return value will be converted by a stub in libc, and
 * if compiling source with __DARWIN_UNIX03, the conversion in libc is not
 * done, and the caller gets the expected (discrete) value.
 */
#define EOPNOTSUPP ENOTSUP /* Operation not supported on socket */
#endif /* !__DARWIN_UNIX03 && !KERNEL */
So, it seems that what we are missing is the "stub in libc" that should convert 102 to 45 for us. Which I guess means we are linking wrong.
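To make the expected behaviour concrete, here is a tiny C model of the conversion the header comment attributes to a stub in libc. It is purely illustrative: the function and macro names are hypothetical, the values 102 and 45 are the ones reported in this thread for Leopard, and nothing here is Apple's actual implementation.

```c
/* Values taken from the discussion above (MacOSX Leopard). */
#define KERNEL_EOPNOTSUPP 102  /* what the 10.5 kernel returns */
#define LEGACY_EOPNOTSUPP  45  /* value seen by code built without UNIX03 */

/* Hypothetical model of the libc stub: callers compiled without
 * __DARWIN_UNIX03 should see 45 instead of the kernel's 102; every
 * other errno value passes through unchanged. */
int legacy_errno_view(int kernel_errno)
{
    return kernel_errno == KERNEL_EOPNOTSUPP ? LEGACY_EOPNOTSUPP
                                             : kernel_errno;
}
```

If a binary built with EOPNOTSUPP defined as 45 still observes 102 in errno, this mapping evidently did not run, which is consistent with the suspicion that the wrong libc entry points are being linked.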
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
Googling some on the code snippet you showed us, I found the other half:
 * source, the kernel return value will be converted by a stub in libc, and
 * if compiling source with __DARWIN_UNIX03, the conversion in libc is not
 * done, and the caller gets the expected (discrete) value.
Yes, I saw that piece of code as well on my MacBook.
So, it seems that what we are missing is the "stub in libc" that should convert 102 to 45 for us. Which I guess means we are linking wrong.
Sounds sensible. However, apart from using a MacBook, I'm not a Mac expert at all. So I can neither confirm nor deny this, and I definitely don't know how to fix it properly.
Actually, we *want* __DARWIN_UNIX03 to be set to 1.
At least on a MacOSX Leopard MacBook we do, because that one returns 102 from the kernel in case of a not-supported error.
But errno.h says there is a libc stub that converts it back to 45 if it's not set.
Anyway, I don't ask for builds performed on 10.5 to run on older systems (we always build on the minimum OS version supported for each OS X architecture). Instead I'm worried that changing smartlink.c will have an impact on two-level namespace handling in general and reintroduce the _Image_JPEG / _Image_TIFF problem.
Ideally this fiddling should take place in 7.9 and not jeopardize the stability of 7.8.
In the last episode (Feb 20), Stephen R. van den Berg said:
Stephen R. van den Berg wrote:
I've reenabled socketpair_ultra in a few more cases (for systems where UNIX_SOCKETS_WORKS_WITH_SHUTDOWN hasn't been set).
Is the recent MacOSX breakage a result of this capability not properly being diagnosed in configure?
Found and (sort of) fixed this.
Summary of the changes:
- Closing a socket one-way, results in errors when still trying to get OOB data from it. I.e. the OS signals an error because it knows that OOB data cannot be sent anymore.
Is this a kernel bug? Shutting down the socket for writes should still allow reads to complete. If the shutdown was for reads, then it's a pike bug that it's even trying to read from it.
Well, no. What happens is that the sending side writes the data (into the kernel/network buffers), then closes the descriptor for writing. At the receiving end we try to read OOB data, which returns with an ECONNRESET immediately; as we then try the normal read, we still receive all the data which was previously written into the buffers, and the normal read then rightfully ends with an EOF.
So, nothing is lost; the kernel is behaving all right. Then again, it is a bit confusing that the OOB read returns with ECONNRESET instead of simply returning 0 (i.e. EOF) in this case. I'm not fully up to speed on what POSIX says about OOB behaviour, so I can't tell you if this is something that should be fixed in the kernel or not. It might be worth the trouble to bring this (IMO non-intuitive behaviour) to the attention of the FreeBSD developers.
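The sequence described above (write, half-close, OOB probe, normal read, EOF) can be reproduced with a short C sketch. It rests on several assumptions: a local AF_UNIX socketpair stands in for the localhost TCP connection used by the testsuite, the function name `read_after_half_close` is hypothetical, and the errno reported by the OOB probe is platform-dependent, so the ECONNRESET seen on FreeBSD will not necessarily appear here.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: send `payload`, half-close the writer, probe for OOB
 * data on the reader, then drain the in-band data until EOF.
 * Returns the number of in-band bytes received, or -1 on an unexpected
 * failure. *oob_errno gets the errno of the OOB probe (0 if it succeeded). */
long read_after_half_close(const char *payload, int *oob_errno)
{
    int sv[2];
    char buf[256];
    size_t len = strlen(payload);
    long total = 0;
    ssize_t n;

    /* Assumption: a local AF_UNIX pair stands in for the TCP connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Writer side: queue the data, then close the descriptor for writing. */
    if (write(sv[0], payload, len) != (ssize_t)len ||
        shutdown(sv[0], SHUT_WR) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Reader side: the OOB probe may fail; which errno it reports
     * depends on the platform and socket type. */
    n = recv(sv[1], buf, sizeof buf, MSG_OOB);
    *oob_errno = (n < 0) ? errno : 0;

    /* The data written before the half-close is still delivered ... */
    while ((n = read(sv[1], buf, sizeof buf)) > 0)
        total += n;

    close(sv[0]);
    close(sv[1]);

    /* ... and the stream then ends with a clean EOF (n == 0). */
    return n == 0 ? total : -1;
}
```

Calling `read_after_half_close("hello, pike", &e)` should return the full payload length whatever value ends up in `e`, which is the point made above: the failed OOB probe loses nothing.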
pike-devel@lists.lysator.liu.se