That doesn't break hilfe either.
Only that exact order breaks:
| > string s = Stdio.read_bytes("sugar.jpg");
| > object i = Image.JPEG.decode(s);
| >> i;
| (1) Result: Image.Image( 300 x 300 /* 263.7Kb */)
| > Image.JPEG.encode(i);
| *** glibc detected *** /usr/local/bin/pike: double free or corruption (out): 0x0000000000b52860 ***
| Error while mapping shared library sections:
| ¹=ãȬàx¬<5`Ó 3Ú9ÿ: No such file or directory.
| ======= Backtrace: =========
| /lib/libc.so.6[0x2b5197a4f08a]
| /lib/libc.so.6(cfree+0x8c)[0x2b5197a52c1c]
| ...
Skipping the "i;" doesn't lead to a fault. Replacing the "i;" with werror("%O\n",i); stops it from breaking as well.
This bug goes away if you look at it the wrong way. :P
"i;" isn't necessary in my setup. I just added it to verify that I loaded my test data correctly.
I can't tell whether your crash is related to the problem I'm investigating. If you suspect duplicate symbols for the JPEG lib you can try to move _Image_TIFF.so away to avoid getting the second copy loaded.
Valgrind isn't of any help?
Valgrind gives a lot of false alarms in the Image module on 64 bit architectures, though. The problem is that gcc can generate a 64 bit read when rgb_group structs are read, and if that happens near the end of a malloced block then valgrind complains about reading outside addressable memory. I've got some half-baked patches to pad the malloced blocks more when --with-valgrind is used.
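To make the padding idea concrete, here is a minimal sketch. It is not the half-baked patch mentioned above: `px3`, `VG_PAD`, `padded_pixel_bytes` and `alloc_pixels` are made-up names, `px3` only stands in for a 3-byte pixel struct like rgb_group, and the pad of 8 bytes is an assumption sized to cover one 64-bit load that starts at the last pixel.

```c
#include <stdlib.h>
#include <stddef.h>

/* Stand-in for a 3-byte pixel struct such as rgb_group (hypothetical name). */
struct px3 { unsigned char r, g, b; };

/* Assumed pad: enough for one 8-byte load that starts at the last pixel. */
#define VG_PAD 8

/* Bytes to request for n pixels, including the trailing pad. */
static size_t padded_pixel_bytes(size_t n)
{
    return n * sizeof(struct px3) + VG_PAD;
}

/* Allocate n zeroed pixels; the pad is never handed out as pixel data. */
static struct px3 *alloc_pixels(size_t n)
{
    return calloc(1, padded_pixel_bytes(n));
}
```

With something like this a wide read of the final pixel stays inside the block, so valgrind has nothing to complain about. It does nothing for writes past the end.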
I did try valgrind before, but just compiling for valgrind removed the crash - at least I couldn't trigger it anymore.
How tiresome. Have you tried compiling without valgrind support and running it under valgrind anyway? You'll have to set up suppressions for all the false alarms then, though.
That seems like the GTK2-module leak I'm observing (_probably_ related to the list/tree widget). Running pike in valgrind or with dmalloc removes it totally...
Yep. Typically hard to trace stuff. :p
The only suspicious output I get from valgrind seems to be this:
==29136== Invalid write of size 1
==29136==    at 0x7EF175A: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEEFC9: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEDEF5: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEADFE: jpeg_write_scanlines (in /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7CE285F: image_jpeg_encode (image_jpeg.c:912)
==29136==    by 0x434907: low_mega_apply (apply_low.h:225)
==29136==    by 0x4375B3: eval_instruction (interpret_functions.h:2066)
==29136==    by 0x440CAA: catching_eval_instruction (interpret.c:2227)
==29136==    by 0x440177: eval_instruction (interpret_functions.h:1287)
==29136==    by 0x440D9F: mega_apply (interpret.c:2197)
==29136==    by 0x4DBF37: call_pike_initializers (object.c:337)
==29136==    by 0x4DEA0B: parent_clone_object (object.c:420)
==29136== Address 0x6199cf8 is 0 bytes after a block of size 8,192 alloc'd
==29136==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==29136==    by 0x7CDF7AE: my_init_destination (image_jpeg.c:249)
==29136==    by 0x7EEAF8C: jpeg_start_compress (in /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7CE25AA: image_jpeg_encode (image_jpeg.c:880)
==29136==    by 0x434907: low_mega_apply (apply_low.h:225)
==29136==    by 0x4375B3: eval_instruction (interpret_functions.h:2066)
==29136==    by 0x440CAA: catching_eval_instruction (interpret.c:2227)
==29136==    by 0x440177: eval_instruction (interpret_functions.h:1287)
==29136==    by 0x440D9F: mega_apply (interpret.c:2197)
==29136==    by 0x4DBF37: call_pike_initializers (object.c:337)
==29136==    by 0x4DEA0B: parent_clone_object (object.c:420)
==29136==    by 0x43546A: low_mega_apply (apply_low.h:238)
a few of those, then:
==29136== Invalid write of size 1
==29136==    at 0x7EF1588: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEEFC9: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEDEF5: (within /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7EEADFE: jpeg_write_scanlines (in /usr/lib/libjpeg.so.62.0.0)
==29136==    by 0x7CE285F: image_jpeg_encode (image_jpeg.c:912)
==29136==    by 0x434907: low_mega_apply (apply_low.h:225)
==29136==    by 0x4375B3: eval_instruction (interpret_functions.h:2066)
==29136==    by 0x440CAA: catching_eval_instruction (interpret.c:2227)
==29136==    by 0x440177: eval_instruction (interpret_functions.h:1287)
==29136==    by 0x440D9F: mega_apply (interpret.c:2197)
==29136==    by 0x4DBF37: call_pike_initializers (object.c:337)
==29136==    by 0x4DEA0B: parent_clone_object (object.c:420)
==29136== Address 0x619a30a is 674 bytes inside a block of size 12,960 free'd
==29136==    at 0x4C22B2E: free (vg_replace_malloc.c:323)
==29136==    by 0x4D41A2: really_free_mapping (mapping.c:277)
==29136==    by 0x476CE9: free_decode_data (encode.c:4795)
==29136==    by 0x482BCF: f_decode_value (encode.c:4975)
==29136==    by 0x436E6F: eval_instruction (interpret_functions.h:2301)
==29136==    by 0x440CAA: catching_eval_instruction (interpret.c:2227)
==29136==    by 0x440177: eval_instruction (interpret_functions.h:1287)
==29136==    by 0x440D9F: mega_apply (interpret.c:2197)
==29136==    by 0x4E1664: object_index_no_free (object.c:1373)
==29136==    by 0x4384F2: eval_instruction (interpret_functions.h:1803)
==29136==    by 0x440D9F: mega_apply (interpret.c:2197)
==29136==    by 0x4E1664: object_index_no_free (object.c:1373)
--29136-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--29136-- si_code=80; Faulting address: 0x0; sp: 0x403469E40
Any ideas? Is Pike messing up the alloc table? :p
Hmm, the jpeg_write_scanlines() call is made inside a THREADS_ALLOW loop. Are you running many JPEG operations in parallel? Maybe the internals of that library aren't thread-safe.
As far as I can tell from my hilfe input, I'm only running one. ;)
The internals of the library are given from the programs using it, as far as I can tell. You feed it a superstructure (struct jpeg_compress_struct).
Increasing the default buffer size (it's supposed to ask for more if needed) at least stopped it from triggering the bug. :p
-#define DEFAULT_BUF_SIZE 8192
+#define DEFAULT_BUF_SIZE 81920
Probably not the correct solution though.
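For reference, the "ask for more if needed" scheme looks roughly like the sketch below. This is a stand-in and not the code in image_jpeg.c, and `struct out_buf` is not the libjpeg destination manager: every name here is hypothetical, and only the 8192 figure is taken from the diff above.

```c
#include <stdlib.h>
#include <stddef.h>

#define BUF_STEP 8192  /* same figure as the original DEFAULT_BUF_SIZE */

/* Hypothetical stand-in for an output destination; not the libjpeg struct. */
struct out_buf {
    unsigned char *base;   /* start of the allocation */
    unsigned char *next;   /* next byte to write */
    size_t free_bytes;     /* bytes left before the buffer is full */
    size_t cap;            /* total allocated bytes */
};

/* Set up an empty buffer of BUF_STEP bytes. Returns 0 on success. */
static int out_init(struct out_buf *o)
{
    o->base = malloc(BUF_STEP);
    if (!o->base) return -1;
    o->next = o->base;
    o->free_bytes = o->cap = BUF_STEP;
    return 0;
}

/* Called when the buffer is full: grow by BUF_STEP and re-point the writer. */
static int out_grow(struct out_buf *o)
{
    size_t used = (size_t)(o->next - o->base);
    unsigned char *p = realloc(o->base, o->cap + BUF_STEP);
    if (!p) return -1;
    o->base = p;
    o->cap += BUF_STEP;
    o->next = p + used;
    o->free_bytes = o->cap - used;
    return 0;
}

/* Write one byte, growing first if no room is left. */
static int out_put(struct out_buf *o, unsigned char c)
{
    if (o->free_bytes == 0 && out_grow(o) != 0) return -1;
    *o->next++ = c;
    o->free_bytes--;
    return 0;
}
```

The whole scheme depends on the writer and the grow callback agreeing on `free_bytes`. If the writer sees a larger count than the buffer really has, it never asks for more and simply runs off the end, and a bigger initial buffer only hides that for small outputs.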
Yeah, a threading bug sounds less likely.
Anyway, the first "is 0 bytes after a block" sounds like what Mast described earlier and pretty harmless compared to the last one. Google tells me one can run gdb and valgrind together; try stopping at the last error, focus on the frame for image_jpeg_encode and see what pointers get passed to jpeg_write_scanlines. It would be interesting to see if they are reasonable blocks that can be traced to a struct image * or if the Pike internals are messed up.
If they are ok I'd suspect the library itself (especially if you say that most images can't trigger the bug), and the next step would perhaps be to compile your own with debug symbols and optimizations off.
No, it's not the case I described. That was only reading a bit past the end. In this case it writes, and that is worse.
I just realized that too, but for another reason. In your case it was about a 64-bit access that spilled over a logical boundary (but presumably within the rounded-up size used by malloc) but here it's a single byte being written past a power-of-2 boundary.
Another important clue is that fixing the first problem also made the later fatal crash go away. Apparently those tiny writes 1 byte off are enough to corrupt the malloc structures.
Since the #define and the corresponding malloc() are in the Pike source, we might be able to call malloc(DEFAULT_BUF_SIZE + 17) (and similarly for realloc), but who knows whether libjpeg will access even greater offsets for other input?
I've now tracked down and fixed the bug. Ironically it's Mirar that caused it (granted a long time ago) by "#define unsigned int size_t" which isn't valid on a 64-bit machine. I believe it could have caused overwriting of as much memory as the resulting JPEG image occupied outside of the initial buffer size.
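A sketch of one way a narrowed size_t can go wrong on a 64-bit machine. This is a guess at the mechanism, not something read off the actual fix: `lib_view`, `app_view` and `store_narrow_read_wide` are hypothetical names, and the two structs only mimic a "pointer plus byte counter" pair as seen by a library built with the real size_t and by a caller that thinks size_t is unsigned int.

```c
#include <stddef.h>
#include <string.h>

/* How a library built with the real size_t lays out the counter field. */
struct lib_view { unsigned char *next; size_t free_bytes; };

/* How a caller sees the same struct if size_t was redefined to unsigned int. */
struct app_view { unsigned char *next; unsigned int free_bytes; };

/*
 * Store `avail` through the caller's narrow view into memory that held
 * stale 0xFF bytes, then read the counter back through the library's view.
 */
static size_t store_narrow_read_wide(unsigned int avail)
{
    unsigned char raw[sizeof(struct lib_view)];
    size_t seen;

    memset(raw, 0xFF, sizeof raw);  /* pretend the memory held stale data */
    memcpy(raw + offsetof(struct app_view, free_bytes), &avail, sizeof avail);
    memcpy(&seen, raw + offsetof(struct lib_view, free_bytes), sizeof seen);
    return seen;
}
```

Where size_t is wider than unsigned int, the narrow store only touches part of the field, so the library can end up seeing a much larger "free bytes" count than the caller meant to give it. Where the two types have the same size the value round-trips unchanged, which would explain why nobody noticed on 32-bit.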
But then I started my Roxen 5.0 to verify and get this nice present:
Post-padding overwritten for block at 0x108e8fa20 (size 801)!
**Block: 0x108e8fa20 Type: string Refs: 1
**size_shift: 0, len: 768, hash: 25c37b5a76df6cc5
** "
$$#'''((("...
Stack at allocation:
| 0 pike 0x0000000000153997 debug_malloc + 119
| 0 pike 0x0000000000153fad debug_xalloc + 29
| 0 pike 0x00000000001c8879 debug_begin_shared_string + 89
| 0 _Image_GIF.so 0x00000000067e67ba image_gif_header_block + 730
| 0 _Image_GIF.so 0x00000000067e9a12 _image_gif_encode + 2546
| 0 pike 0x000000000001d9dd low_mega_apply + 5053
| 0 pike 0x0000000000041bbc low_mega_apply + 152988
| 0 pike 0x00000000000406cc low_mega_apply + 147628
| 0 pike 0x000000000004d135 low_mega_apply + 199445
| 0 pike 0x000000000004f295 mega_apply + 501
| 0 pike 0x00000000001d9960 new_thread_func + 944
| 0 libSystem.B.dylib 0x0000000081400913 _pthread_start + 316
| 0 libSystem.B.dylib 0x00000000814007d5 thread_start + 13
Locations that handled 0x108e8fa20: (gc generation: 2/2 gc pass: 0/0)
*** /home/jonasw/pike/7.8/src/stralloc.c:628 xalloc (1 times) !*!
*** /home/jonasw/pike/7.8/src/pike_memory.c:287 malloc (1 times) !*!
*** /home/jonasw/pike/7.8/src/modules/_Image_GIF/image_gif.c:282 (1 times) !*!
*******************
: Start script terminating.
: Shutting down MySQL..
: Start script terminated.
Not the best way to end the week... :-(
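For anyone unfamiliar with what a "post-padding overwritten" check amounts to, here is the general idea as a sketch. It is not how dmalloc implements it: `guarded_alloc`, `guard_intact`, the guard length and the fill byte are all made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  16    /* assumed guard size */
#define GUARD_BYTE 0x55  /* assumed fill pattern */

/* Allocate n usable bytes followed by GUARD_LEN guard bytes. */
static void *guarded_alloc(size_t n)
{
    unsigned char *p = malloc(n + GUARD_LEN);
    if (!p) return NULL;
    memset(p + n, GUARD_BYTE, GUARD_LEN);
    return p;
}

/* Return 1 if the guard after the n usable bytes is intact, 0 if overwritten. */
static int guard_intact(const void *ptr, size_t n)
{
    const unsigned char *g = (const unsigned char *)ptr + n;
    size_t i;
    for (i = 0; i < GUARD_LEN; i++)
        if (g[i] != GUARD_BYTE) return 0;
    return 1;
}
```

A check like this only fires when someone looks at the guard, typically at free time, so the report points at the victim block and its allocation stack rather than at the code that did the stray write.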
Speaking of which,
[http://pike.ida.liu.se/generated/pikefarm/7.8/46_46/verifylog.txt]
| Doing tests in testsuite (11196 tests)
|
| test: failed to load "/home/[...]/pike/7.8.20/lib/modules/GSSAPI.so": load_module("/home/[...]/pike/7.8.20/lib/modules/GSSAPI.so") failed: libgssapi_krb5.so.2: failed to map segment from shared object: Cannot allocate memory
|
|
| test: failed to load "/home/[...]/pike/7.8.20/lib/modules/_Image_JPEG.so": load_module("/home/[...]/pike/7.8.20/lib/modules/_Image_JPEG.so") failed: /home/[...]/pike/7.8.20/lib/modules/_Image_JPEG.so: failed to map segment from shared object: Cannot allocate memory
|
| ...
|
| test: failed to load "/home/[...]/pike/7.8.20/lib/modules/___GTK2.so": load_module("/home/[...]/pike/7.8.20/lib/modules/___GTK2.so") failed: /home/[...]/pike/7.8.20/lib/modules/___GTK2.so: failed to map segment from shared object: Cannot allocate memory
|
| Fatal: out of memory.
Doesn't seem very good. I didn't see this when I ran make verify manually. How do I debug it?
xenofarm/client.sh sets a couple of limits. Try running verify with the same limits (data segment and virtual memory size should be the relevant ones).
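If you would rather apply the limits from a small wrapper than from the shell, something like the sketch below would do. I haven't checked which limits or values client.sh actually sets, so treat the choice of resources and any numbers you pass in as placeholders; `set_soft_limit` is a made-up helper around the standard getrlimit()/setrlimit() calls.

```c
#include <sys/resource.h>

/*
 * Lower the soft limit for `resource` to `bytes`, leaving the hard limit
 * alone. Returns 0 on success, -1 on failure.
 */
static int set_soft_limit(int resource, rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) return -1;
    /* The soft limit may not exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;
    rl.rlim_cur = bytes;
    return setrlimit(resource, &rl);
}
```

Calling it with RLIMIT_DATA and RLIMIT_AS before exec'ing the verify run would cover the data segment and virtual memory size. In a shell, `ulimit -d` and `ulimit -v` do the same job.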
Ok, found and fixed that one as well. But of course fate deals me yet another one:
#0 debug_fatal (fmt=0x10039ae30 "really_free_memloc got invalid pointer %p\n") at /home/jonasw/pike/7.8/src/error.c:632
#1 0x000000010014df83 in really_free_memloc (d=0x114727ac0) at /home/jonasw/pike/7.8/src/pike_memory.c:1358
#2 0x000000010014fa88 in really_free_memhdr (d=0x107d4a940) at /home/jonasw/pike/7.8/src/pike_memory.c:1608
#3 0x000000010014fdc7 in remove_memhdr (ptr=<value temporarily unavailable, due to optimizations>) at /home/jonasw/pike/7.8/src/pike_memory.c:1608
#4 0x0000000100151eac in dmalloc_unregister (p=0x107d4a940, already_gone=-1626928576) at /home/jonasw/pike/7.8/src/pike_memory.c:2089
#5 0x0000000100017f04 in alloc_catch_context () at /home/jonasw/pike/7.8/src/interpret.c:1082
#6 0x0000000100046485 in eval_instruction_without_debug (pc=0x1005a7800 "") at interpret_functions.h:1287
[...]
Anyone else using dmalloc and stressing the 7.8 code?
Time for a follow-up. This bug is most likely caused by dmalloc itself not being thread-safe in its handling of internal structures. A hacked version that Grubba and I put together got rid of the problem temporarily, but I leave it to the dmalloc experts to develop a long-term solution.
pike-devel@lists.lysator.liu.se