This is in a heavily multithreaded application:
Thread 1475 "pike" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff574c700 (LWP 14689)]
ba_alloc (a=a@entry=0x55555575bf20 <object_allocator>) at /var/src/roxen/81pike/src/block_allocator.c:267
267     p->h.used++;
(gdb) l
262
263     ptr = p->h.first;
264     PIKE_MEMPOOL_ALLOC(a, ptr, a->l.block_size);
265     PIKE_MEM_RW_RANGE(ptr, sizeof(struct ba_block_header));
266
267     p->h.used++;
268
269     #ifdef PIKE_DEBUG
270     ba_check_ptr(a, a->alloc, ptr, NULL, __LINE__);
271     #endif
(gdb) p p
$1 = (struct ba_page *) 0x555555d298c0
(gdb) p *p
$2 = {h = {first = 0x555555d35ae000, used = 681, flags = 0}}
(gdb) p ptr
$3 = (struct ba_block_header *) 0x555555d35ae000
(gdb) p *ptr
Cannot access memory at address 0x555555d35ae000
(gdb) where
#0  ba_alloc (a=a@entry=0x55555575bf20 <object_allocator>) at /var/src/roxen/81pike/src/block_allocator.c:267
#1  0x000055555560156f in alloc_object () at /var/src/roxen/81pike/src/object.c:135
#2  low_clone (p=0x5555557d2f70) at /var/src/roxen/81pike/src/object.c:135
#3  0x0000555555601ad2 in debug_clone_object (p=<optimized out>, args=args@entry=0) at /var/src/roxen/81pike/src/object.c:408
#4  0x000055555564b55c in f_mutex_trylock (args=1) at /var/src/roxen/81pike/src/threads.c:2333
#5  0x0000555555585a9c in low_mega_apply (type=<optimized out>, type@entry=APPLY_STACK, args=1, arg1=<optimized out>, arg1@entry=0x0, arg2=arg2@entry=0x0) at /var/src/roxen/81pike/src/apply_low.h:225
#6  0x000055555558624e in jump_opcode_F_CALL_FUNCTION () at /var/src/roxen/81pike/src/interpret_functions.h:2422
#7  0x00007ffff6a4c55c in ?? ()
#8  0x00007ffff574bd41 in ?? ()
#9  0x0000000000000001 in ?? ()
#10 0x00007fffffffe03e in ?? ()
#11 0x00007fffffffe03f in ?? ()
#12 0x00007ffff574c700 in ?? ()
#13 0x00005555559af8e0 in ?? ()
#14 0x0000000000000000 in ?? ()
Any ideas? I can explore more structures; I still have the session open.
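For readers following the session above, here is a minimal sketch of a plausibility check one could apply to the free-list head before dereferencing it. Everything in it is an assumption made for illustration: the helper name `ba_ptr_plausible`, the 16-byte page header (an 8-byte `first` plus 4-byte `used` and `flags`), and the 80-byte block size and 1024-block page, which are borrowed from the debug logs later in this thread and may not match this build. Under those assumptions the logged `first = 0x555555d35ae000` is far outside the page, while the same value shifted right by 8 bits lands exactly on a block boundary inside it. That may or may not be a coincidence.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only: an 8-byte `first` pointer plus
 * 4-byte `used` and `flags` fields, i.e. a 16-byte page header. */
#define PAGE_HEADER_SIZE 16u

/* Returns true if `ptr` could be the start of a block in the page at
 * `page`, given `blocks` blocks of `block_size` bytes each.
 * Hypothetical helper; not part of Pike's block allocator. */
bool ba_ptr_plausible(uint64_t page, uint64_t ptr,
                      uint64_t block_size, uint64_t blocks)
{
    uint64_t base = page + PAGE_HEADER_SIZE;
    if (block_size == 0 || ptr < base)
        return false;
    uint64_t offset = ptr - base;
    /* Must lie inside the page and on a block boundary. */
    return offset < block_size * blocks && offset % block_size == 0;
}
```

Calling `ba_ptr_plausible(0x555555d298c0, 0x555555d35ae000, 80, 1024)` with the values from the gdb session rejects the pointer; the addresses are passed as plain integers, so nothing is dereferenced.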
Stephen R. van den Berg wrote:
> This is in a heavily multithreaded application:
> I can explore more structures; I still have the session open.
It appears to be nondeterministic: sometimes the problem occurs a few seconds after application start, sometimes it takes more than 10 minutes before it hits.
The test application in this case is a database filled with about 10,000 records of 200 bytes each on average. I then run 10 threads querying the same database over the same single file descriptor using the pgsql driver. The pgsql driver is just fine with this, but Pike sometimes segfaults. Do note that the pgsql driver does not contain any C modules.
By querying the whole database in 10 simultaneous threads, you get a wonderfully chaotic interleaved datastream from the database to Pike which takes (on average) more than 10 seconds of runtime until the whole dataset has been transferred.
Due to scheduling at the database side, the actual interleave order of the data for all streams/threads is largely unpredictable.
Could you reproduce this when compiled with --with-debug? The block allocator then has additional checks that might help debug this.
Arne
On 2019-11-01 09:50, Stephen R. van den Berg wrote:
> This is in a heavily multithreaded application:
> [...]
> Any ideas? I can explore more structures; I still have the session open.
This issue is most likely related to MutexKeys being lost in the heap, which then get released (a while later) by the garbage collector. I fixed this problem in a newer commit; with that fix, the problem no longer seems to occur.
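To illustrate what a "lost" key means for the lock, here is a minimal single-threaded sketch of a key object whose lifetime controls a mutex. The names `sketch_mutex`, `sketch_key`, `sketch_trylock` and `sketch_key_unref` are hypothetical and this is not Pike's implementation; it only models the idea that the mutex stays held until the last reference to its key is dropped, so a stray reference keeps it locked until something (such as a garbage collector) finally releases it.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not Pike's actual mutex or MutexKey structs. */
struct sketch_mutex { bool locked; };
struct sketch_key { int refs; struct sketch_mutex *m; };

/* Non-blocking lock attempt: returns a key holding one reference,
 * or NULL if the mutex is already held or allocation fails. */
struct sketch_key *sketch_trylock(struct sketch_mutex *m)
{
    if (m->locked)
        return NULL;
    struct sketch_key *k = malloc(sizeof *k);
    if (!k)
        return NULL;
    k->refs = 1;
    k->m = m;
    m->locked = true;
    return k;
}

/* Drops one reference; the mutex is released only when the last
 * reference goes away. */
void sketch_key_unref(struct sketch_key *k)
{
    if (--k->refs == 0) {
        k->m->locked = false;
        free(k);
    }
}
```

If some other holder bumps `refs` and never drops it, a later `sketch_trylock` on the same mutex keeps returning NULL even after the original owner has called `sketch_key_unref`.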
Arne Goedeke wrote:
> Could you reproduce this when compiled with --with-debug? The block allocator then has additional checks that might help debug this.
--with-rtldebug gives:
Svalue to object without references.
**Block: 0x55ed99884420 Type: object Refs: -1719122832
**Program id: 3
**The object is fake.
**The object is destructed but program found from id.
*******************
/var/src/roxen/81pike/src/object.c:1656: Fatal error: Svalue to object without references.
Program flags: 0x002f

Reference table:
####: Flags Inherit Identifier
0: 33 0 0 _fun -:1 Offset: 0x00000000
1: 33 0 1 oprog -:1 Offset: 0x00000010
2: 10 0 2 args -:1 Offset: 0x00000018
3: 33 0 3 prog -:1 Offset: 0x00000020
4: 0 0 4 fun -:1 Offset: 0xffffffff00060005
5: 1 0 5 `->fun -:1 Cfun: 0x55ed97c149e0
6: 1 0 6 `->fun= -:1 Cfun: 0x55ed97c20700
7: 10 0 7 _is_type -:1 Cfun: 0x55ed97c15a30
8: 1 0 8 fill_in_file_and_line -:1 Cfun: 0x55ed97c29d80
9: 0 0 9 filename -:1 Offset: 0xffffffffffff000a
10: 1 0 10 `filename -:1 Cfun: 0x55ed97c17380
11: 0 0 11 line -:1 Offset: 0xffffffffffff000c
12: 1 0 12 `line -:1 Cfun: 0x55ed97c17560
13: 11 0 13 _sprintf -:1 Cfun: 0x55ed97c2db70
14: 11 0 14 _sizeof -:1 Cfun: 0x55ed97c27a90
15: 11 0 15 `[] -:1 Cfun: 0x55ed97c209c0
16: 11 0 16 `[]= -:1 Cfun: 0x55ed97c21180

Identifier index table:
####: Index Name
0: 7 _is_type
1: 2 args
2: 4 fun
3: 9 filename
4: 11 line

Inherit table:
####: Level prog_id id_level storage_offs par_id par_offs par_obj_id id_ref_offs
0: 0 11 0 0 -1 -18 -1 0

Identifier table:
####: Flags Offset Type Name
0: 0 0 251 "_fun" -:1
1: 0 16 13 "oprog" -:1
2: 0 24 8 "args" -:1
3: 0 32 13 "prog" -:1
4: 0 -4294574075 32 "fun" -:1
5: 2 94478941637088 12 "`->fun" -:1
6: 2 94478941685504 12 "`->fun=" -:1
7: 2 94478941641264 12 "_is_type" -:1
8: 2 94478941724032 12 "fill_in_file_and_line" -:1
9: 0 -65526 32 "filename" -:1
10: 2 94478941647744 12 "`filename" -:1
11: 0 -65524 32 "line" -:1
12: 2 94478941648224 12 "`line" -:1
13: 2 94478941739888 12 "_sprintf" -:1
14: 2 94478941715088 12 "_sizeof" -:1
15: 2 94478941686208 12 "`[]" -:1
16: 2 94478941688192 12 "`[]=" -:1

Variable table:
####: Index
0: 0
1: 1
2: 2
3: 3
4: 4
5: 9
6: 11

Constant table:
####: Type Raw

String table:
####: Value
0: [0x55ed9919e7f0]"src/builtin.cmod"(16 characters)
1: [0x55ed99155268]"-"(1 characters)

LFUN table:
LFUN Ref# Name
20: 0015 `[]
21: 0016 `[]=
24: 0014 _sizeof
39: 0007 _is_type
40: 0013 _sprintf

Linenumber table:
Filename: String #0
0: 3134

Identifier reference index 74 out of range 0..16
Aborted
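A side observation on the `Refs: -1719122832` value in the dump above, offered as arithmetic rather than a diagnosis: reinterpreted as unsigned, it equals the low 32 bits of the block address 0x55ed99884420 plus 0x50. The 0x50 block size is an assumption taken from the block ranges in the Free-List corruption logs elsewhere in this thread, and `low32_as_signed` is a hypothetical helper. If the reference counter really occupies the first four bytes of the block (also an assumption), this is what one would read when those bytes hold a pointer to the neighbouring block instead of a count. Nothing in the thread confirms that this is what happened.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the low 32 bits of a 64-bit address as a signed 32-bit
 * integer, the way a 4-byte counter overlapping the low half of an
 * 8-byte pointer would read on a little-endian machine. */
int32_t low32_as_signed(uint64_t addr)
{
    uint32_t low = (uint32_t)(addr & 0xffffffffu);
    int32_t out;
    /* memcpy avoids an implementation-defined unsigned-to-signed cast. */
    memcpy(&out, &low, sizeof out);
    return out;
}
```

With the logged block address, `low32_as_signed(0x55ed99884420 + 0x50)` yields -1719122832, the same number the dump prints as the reference count.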
A run straight from within gdb gives this:
SEGFAULT:
(gdb) l
177
178     INIT
179     {
180       struct std_cs_stor *s = THIS;
181
182       s->retain = NULL;
183       s->replace = NULL;
184
185       init_string_builder(&s->strbuild,0);
186     }
(gdb) p s
$15 = (struct cq__Charset_Std_CS_struct *) 0x0
(gdb) where
#0  init_cq__Charset_Std_CS_struct () at /var/src/roxen/81pike/src/modules/_Charset/charsetmod.cmod:182
#1  cq__Charset_Std_CS_event_handler (ev=<optimized out>) at /var/src/roxen/81pike/src/modules/_Charset/charsetmod.cmod:206
#2  0x0000555555666fce in call_c_initializers (o=o@entry=0x555555ab69c0) at /var/src/roxen/81pike/src/object.c:295
#3  0x00005555556676e9 in debug_clone_object (p=<optimized out>, args=args@entry=0) at /var/src/roxen/81pike/src/object.c:415
#4  0x00005555555aaa2e in low_mega_apply (type=<optimized out>, type@entry=APPLY_STACK, args=0, arg1=0x7ffff79a0830, arg1@entry=0x0, arg2=arg2@entry=0x0) at /var/src/roxen/81pike/src/interpret.c:2740
#5  0x00005555555ac50e in jump_opcode_F_CALL_FUNCTION () at /var/src/roxen/81pike/src/interpret_functions.h:2422
#6  0x00007ffff73c3575 in ?? ()
#7  0x00005555559ab5d6 in ?? ()
#8  0x00007ffff7722074 in ?? ()
#9  0x0000000000000000 in ?? ()
Stephen R. van den Berg wrote:
> A run straight from within gdb gives this:
Tried it again; this time it hit after less than a second of runtime:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
init_cq__Charset_Std_CS_struct () at /var/src/roxen/81pike/src/modules/_Charset/charsetmod.cmod:182
182       s->retain = NULL;
Stephen R. van den Berg wrote:
> This issue is most likely related to MutexKeys being lost in the heap, which then get released (a while later) by the garbage collector. I fixed this problem in a newer commit; with that fix, the problem no longer seems to occur.
> --with-rtldebug gives:
A new run gives:
page: 0x55c71d610df0 used: 753/1024 last: 0x55c71d624db0 p+offset: 0x55c71d624db0
page: 0x55c71d33fb30 used: 428/512 last: 0x55c71d349af0 p+offset: 0x55c71d349af0
page: 0x55c71d0c4f80 used: 46/256 last: 0x55c71d0c9f40 p+offset: 0x55c71d0c9f40
page: 0x55c71cefed70 used: 115/128 last: 0x55c71cf01530 p+offset: 0x55c71cf01530
In block 0x55c71d61cf70:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x55c71d61e04f is inside block [0x55c71d61e000 , 0x55c71d61e050)
page: 0x55c71d610df0 used: 754/1024 last: 0x55c71d624db0 p+offset: 0x55c71d624db0
page: 0x55c71d33fb30 used: 428/512 last: 0x55c71d349af0 p+offset: 0x55c71d349af0
page: 0x55c71d0c4f80 used: 46/256 last: 0x55c71d0c9f40 p+offset: 0x55c71d0c9f40
page: 0x55c71cefed70 used: 115/128 last: 0x55c71cf01530 p+offset: 0x55c71cf01530
In block 0x55c71d61cf70:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x55c71d61e04f is inside block [0x55c71d61e000 , 0x55c71d61e050)
Aborted
Not quite sure yet what to do with this.
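In the log above, the offending list pointer is exactly one less than the end of the block it falls in (0x55c71d61e04f versus 0x55c71d61e050). One hypothetical way to arrive at such a value, not confirmed anywhere in this thread, is a 32-bit counter decrement landing on the first word of a block that currently holds a free-list pointer. The made-up helper `decrement_low32` below models that on a little-endian 64-bit machine, where the first four bytes of the word are its low half; it is a sketch of the arithmetic only.

```c
#include <stdint.h>

/* Models a 32-bit decrement applied to the low half of a 64-bit word,
 * leaving the high half untouched (no borrow across the halves). */
uint64_t decrement_low32(uint64_t word)
{
    uint32_t low = (uint32_t)(word & 0xffffffffu);
    low -= 1u;  /* unsigned arithmetic wraps, so this is well defined */
    return (word & ~(uint64_t)0xffffffffu) | low;
}
```

Feeding it the block boundary from the log, `decrement_low32(0x55c71d61e050)`, gives 0x55c71d61e04f, the list pointer reported in the fatal error.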
Stephen R. van den Berg wrote:
> Stephen R. van den Berg wrote:
>> This issue is most likely related to MutexKeys being lost in the heap, which then get released (a while later) by the garbage collector. I fixed this problem in a newer commit; with that fix, the problem no longer seems to occur.
> --with-rtldebug gives:
And another:
page: 0x56420b0fde40 used: 750/1024 last: 0x56420b111e00 p+offset: 0x56420b111e00
page: 0x56420ae239f0 used: 427/512 last: 0x56420ae2d9b0 p+offset: 0x56420ae2d9b0
page: 0x56420aba7b30 used: 46/256 last: 0x56420abacaf0 p+offset: 0x56420abacaf0
page: 0x56420a9e2d70 used: 115/128 last: 0x56420a9e5530 p+offset: 0x56420a9e5530
In block 0x56420b10a240:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x56420b10c3ff is inside block [0x56420b10c3b0 , 0x56420b10c400)
page: 0x56420b0fde40 used: 751/1024 last: 0x56420b111e00 p+offset: 0x56420b111e00
page: 0x56420ae239f0 used: 427/512 last: 0x56420ae2d9b0 p+offset: 0x56420ae2d9b0
page: 0x56420aba7b30 used: 46/256 last: 0x56420abacaf0 p+offset: 0x56420abacaf0
page: 0x56420a9e2d70 used: 115/128 last: 0x56420a9e5530 p+offset: 0x56420a9e5530
In block 0x56420b10a240:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x56420b10c3ff is inside block [0x56420b10c3b0 , 0x56420b10c400)
Aborted
Any ideas how to tackle this?
Stephen R. van den Berg wrote:
> Stephen R. van den Berg wrote:
>> Stephen R. van den Berg wrote:
>>> This issue is most likely related to MutexKeys being lost in the heap, which then get released (a while later) by the garbage collector. I fixed this problem in a newer commit; with that fix, the problem no longer seems to occur.
>> --with-rtldebug gives:
And another one (the used ratios seem to be quite similar between different runs; the time it takes until it breaks varies wildly):
page: 0x55945d2334c0 used: 745/1024 last: 0x55945d247480 p+offset: 0x55945d247480
page: 0x55945cf60bb0 used: 428/512 last: 0x55945cf6ab70 p+offset: 0x55945cf6ab70
page: 0x55945cd41100 used: 46/256 last: 0x55945cd460c0 p+offset: 0x55945cd460c0
page: 0x55945cb1ad70 used: 115/128 last: 0x55945cb1d530 p+offset: 0x55945cb1d530
In block 0x55945d241ee0:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x55945d23edcf is inside block [0x55945d23ed80 , 0x55945d23edd0)
page: 0x55945d2334c0 used: 746/1024 last: 0x55945d247480 p+offset: 0x55945d247480
page: 0x55945cf60bb0 used: 428/512 last: 0x55945cf6ab70 p+offset: 0x55945cf6ab70
page: 0x55945cd41100 used: 46/256 last: 0x55945cd460c0 p+offset: 0x55945cd460c0
page: 0x55945cb1ad70 used: 115/128 last: 0x55945cb1d530 p+offset: 0x55945cb1d530
In block 0x55945d241ee0:
/var/src/roxen/81pike/src/block_allocator.c:282: Fatal error: Free-List corruption. List pointer 0x55945d23edcf is inside block [0x55945d23ed80 , 0x55945d23edd0)
Aborted
pike-devel@lists.lysator.liu.se