Martin Stjernholm, Roxen IS @ Pike developers forum 10353@lyskom.lysator.liu.se wrote:
I think you need to isolate how the Parser.HTML object is being set up to be able to debug that (callbacks, flags, etc). I assume it's not the Roxen RXML parser we're talking about here.
We are indeed talking about Pike's Parser.HTML. Here's a standalone Pike program that triggers the problem:
----8<----8<----8<----8<----
#!/usr/bin/env pike

int main(int argc, array(string) argv)
{
  object my_parser = Parser.HTML();
  my_parser->_set_entity_callback(entity_callback);

  string to_parse = "<a href=\"mailto:&foobar;\"></a>";
  string foo = my_parser->finish(to_parse)->read();

  werror("%O\n", foo);

  return 0;
}

int|string entity_callback(object parser, string entity, object id, mixed ... extra)
{
  if(entity=="&foobar;") {
    // return string ending with XML entity will crash:
    // " " will crash
    // " " won't crash
    return " ";
  }
  return 0;
}
---->8---->8---->8---->8----
Here's what I get:
----8<----8<----8<----8<----
$ gdb /sw/bin/pike7.6
GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:17:43 GMT 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc-apple-darwin"...Reading symbols for shared libraries ..... done

(gdb) run parser.pike
Starting program: /sw/bin/pike7.6 parser.pike
Reading symbols for shared libraries ................... done
Reading symbols for shared libraries .. done
Reading symbols for shared libraries . done

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000008
0x007a7fcc in scan_forward (feed=0x111ad68, c=0, destp=0xbfffec78, d_p=0xbfffec7c, look_for=0x7b621c, num_look_for=2)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c:1672
1672 /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c: No such file or directory.
    in /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c
(gdb) bt
#0 0x007a7fcc in scan_forward (feed=0x111ad68, c=0, destp=0xbfffec78, d_p=0xbfffec7c, look_for=0x7b621c, num_look_for=2)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c:1672
#1 0x007ad9dc in scan_forward_arg (this=0x111ca50, feed=0x2, c=0, destp=0xbfffec78, d_p=0xbfffec7c, what=SCAN_ARG_ENT_BREAK, finished=1, quote=0xbfffec90)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c:2051
#2 0x007b3448 in try_feed (finished=17935720)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c:3537
#3 0x007b4278 in html_finish (args=-1073746824)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/modules/Parser/html.c:3936
#4 0x000168e4 in low_mega_apply (type=3221220472, args=-1073746824, arg1=0x0, arg2=0xbfffec78)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/apply_low.h:214
#5 0x0001888c in jump_opcode_F_CALL_OTHER (arg1=-1073746820)
    at /home/peter/hack/Pike/7.6-distmaker/src/interpret_functions.h:1957
#6 0x00750390 in ?? ()
#7 0x00019738 in o_catch (pc=0x5ea360)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/interpret.c:2051
#8 0x00019814 in jump_opcode_F_CATCH ()
    at /home/peter/hack/Pike/7.6-distmaker/src/interpret_functions.h:1239
#9 0x005ea354 in ?? ()
#10 0x0001708c in mega_apply (type=17935720, args=0, arg1=0xbfffec78, arg2=0x7b621c)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/interpret.c:2006
#11 0x0009384c in main (argc=2, argv=0xbffff808)
    at /sw/src/fink.build/pike7.6-7.6.112-10/Pike-v7.6.112/src/main.c:841
---->8---->8---->8---->8----
Thank you. Fixed in 7.4 and later.
A workaround is perhaps to do
return ({" "});
instead. That way you don't get the entity reparsed and save another call to your entity callback.
The reparsing behavior that you get when returning strings is imho a bit odd. The RXML parser in Roxen always returns arrays to avoid it (which could explain why this bug has remained undiscovered for so long).
It *is* a bit odd, but it used to be the default behaviour, so it was made like that for compatibility reasons, iirc. These things are old... :)
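For reference, the array-returning workaround discussed above could look roughly like the sketch below. It is only an illustration, not the fix that went into Pike: the replacement text "&amp;" is an assumed value, picked because it ends with an entity (the case reported to crash when returned as a plain string), and the callback signature is simplified compared to the test program.

----8<----8<----8<----8<----
#!/usr/bin/env pike

// Illustrative callback (assumed replacement text): the result is wrapped
// in an array, so Parser.HTML takes it as finished output and does not
// feed it back into the parser.
array(string)|int entity_callback(Parser.HTML parser, string entity,
                                  mixed ... extra)
{
  if (entity == "&foobar;")
    return ({ "&amp;" });  // array result: inserted as-is, not reparsed
  return 0;                // zero: not handled by this callback
}

int main(int argc, array(string) argv)
{
  Parser.HTML my_parser = Parser.HTML();
  my_parser->_set_entity_callback(entity_callback);

  string to_parse = "<a href=\"mailto:&foobar;\"></a>";
  werror("%O\n", my_parser->finish(to_parse)->read());

  return 0;
}
---->8---->8---->8---->8----

Because the array result is never reparsed, the callback is not called a second time for the entity that the replacement itself contains.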
pike-devel@lists.lysator.liu.se