It turns out that string_to_utf8(), along with several other functions such as string replace, uses the accessors generic_extract() and index_shared_string() in stralloc.c, which in my case (ppc970/gcc 3.3) aren't inlined.
Currently the inner loop looks at the string shift and computes the address in a switch statement for every character. Add to this the overhead of the function call itself and it's easy to see the potential for optimization. At least on ppc970 I also get costly loop misalignments, but that can be solved by having the configure script choose optimizer flags dynamically instead of using a ppc750 baseline.
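To make the difference concrete, here is a minimal sketch of the two loop shapes. It is not the actual code from stralloc.c: struct demo_string, demo_index(), utf8_len_char() and both utf8_length_*() functions are hypothetical names invented for illustration, and the UTF-8 length rule is simplified to at most 4 bytes per character. The first loop goes through a per-character accessor that switches on the shift each time. The second switches once and then runs a tight loop over a typed pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a shared string; NOT the real struct. */
struct demo_string {
    int shift;          /* 0: 8-bit, 1: 16-bit, 2: 32-bit characters */
    size_t len;         /* length in characters */
    const void *data;   /* must be suitably aligned for the given shift */
};

/* Per-character accessor: the switch on shift runs for every character. */
static uint32_t demo_index(const struct demo_string *s, size_t i)
{
    switch (s->shift) {
    case 0:  return ((const uint8_t  *)s->data)[i];
    case 1:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

/* Simplified UTF-8 byte count for one character (assumes max 4 bytes). */
static size_t utf8_len_char(uint32_t c)
{
    if (c < 0x80)    return 1;
    if (c < 0x800)   return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* Slow shape: accessor call plus switch inside the inner loop. */
size_t utf8_length_slow(const struct demo_string *s)
{
    size_t i, n = 0;
    for (i = 0; i < s->len; i++)
        n += utf8_len_char(demo_index(s, i));
    return n;
}

/* One specialized inner loop per character width. */
#define COUNT_LOOP(TYPE) do {                     \
        const TYPE *p = (const TYPE *)s->data;    \
        for (i = 0; i < s->len; i++)              \
            n += utf8_len_char(p[i]);             \
    } while (0)

/* Fast shape: the switch is hoisted out of the inner loop. */
size_t utf8_length_fast(const struct demo_string *s)
{
    size_t i, n = 0;
    assert(s->shift >= 0 && s->shift <= 2);
    switch (s->shift) {
    case 0:  COUNT_LOOP(uint8_t);  break;
    case 1:  COUNT_LOOP(uint16_t); break;
    default: COUNT_LOOP(uint32_t); break;
    }
    return n;
}
```

Both functions return the same result; only the placement of the switch differs. Whether a given compiler already turns the first form into the second depends on whether it inlines the accessor and hoists the switch, which is exactly what doesn't happen in the case described above.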
With some brute-force hacking I got speedups of 2-2.5x (reaching about 170 MB/sec for 7-bit input), but I'm still trying to work out whether inlining, alignment changes or compiler version/flags are responsible for that.