Cool, that did indeed fix those benchmarks. On the other hand, two of my three real-world XSLT tests dropped 2% (and one gained as much), but perhaps that's just a difference in CPU cache use, alignment or similar.
If the tests use widestring, it might very well be because the longest wide short string is now half as long as it used to be (counted in characters), so more strings are allocated using malloc/free instead of the short string allocator.
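To make the arithmetic concrete, here is a rough sketch of that effect. It assumes the short string limit is now measured in bytes and that a wide character takes 2 bytes. The 32-byte threshold and the helper names are made up for illustration and are not taken from the actual allocator.

```cpp
#include <cstddef>

// Hypothetical threshold: the real short string limit is not stated in this thread.
constexpr std::size_t kShortStringBytes = 32;

// Longest string (in characters) that still fits the short string path
// when the limit is counted in bytes rather than characters.
constexpr std::size_t max_short_chars(std::size_t char_size) {
    return kShortStringBytes / char_size;
}

// True if a string of `length` characters would be served by the
// short string allocator; false means it falls back to malloc/free.
constexpr bool uses_short_allocator(std::size_t length, std::size_t char_size) {
    return length * char_size <= kShortStringBytes;
}

// Assumed character sizes: 1 byte for narrow, 2 bytes for wide.
constexpr std::size_t kNarrowCharSize = 1;
constexpr std::size_t kWideCharSize = 2;

// With a byte-based limit, the wide limit is half the narrow one.
static_assert(max_short_chars(kWideCharSize) * 2 == max_short_chars(kNarrowCharSize),
              "wide short strings hold half as many characters");
```

Under these assumptions a 20-character string stays on the short path when narrow (20 bytes) but goes to malloc/free when wide (40 bytes), which is the kind of shift that could show up as a small slowdown in string-heavy tests.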