However, I still can't entirely shake the notion that we're overdoing it here. Maybe we could simply make the preprocessor and compiler grok UTF-8 directly and get rid of the special casing. All compiler input processing would revert to being 8-bit only.
Converting everything to UTF-8 before preprocessing would work, yes, provided it is then converted back to Unicode before tokenization.
The alternative (handling UTF-8 in the tokenizer) is needlessly messy.
Define name/argument handling would be the only thing that would need to be altered in cpp to handle UTF-8.
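Roughly, that flow would look something like this (only a sketch; encode_to_utf8, run_8bit_cpp and decode_from_utf8 are made-up placeholder names, not the real entry points):

    /* Sketch of the "convert around the preprocessor" idea.  The helper
     * names below are hypothetical placeholders, not the real functions. */
    struct pike_string;
    struct pike_string *encode_to_utf8(struct pike_string *);   /* hypothetical */
    struct pike_string *run_8bit_cpp(struct pike_string *);     /* hypothetical */
    struct pike_string *decode_from_utf8(struct pike_string *); /* hypothetical */

    struct pike_string *preprocess(struct pike_string *wide_input)
    {
      /* Narrow the (possibly 16- or 32-bit wide) input to UTF-8 so the
       * existing 8-bit-only preprocessor can run unchanged. */
      struct pike_string *narrow = encode_to_utf8(wide_input);
      struct pike_string *cpp_output = run_8bit_cpp(narrow);

      /* Widen again before the tokenizer sees the result. */
      return decode_from_utf8(cpp_output);
    }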
Then again, just switching data[i] to IND(i) or similar, and having that be defined as index_shared_string(data,i) (or, to break with conventions in the code, not using a macro at all and instead calling the function directly), is actually significantly easier than adding UTF-8 support to the preprocessor.
It is, however, bound to be somewhat slower in most cases, but I do not really think the difference matters at all, considering everything else we are doing in there.
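For comparison, the indexing change would amount to something along these lines (again only a rough sketch; the exact surrounding code will differ):

    /* Rough sketch of the indexing change.  "data" stands for whatever the
     * preprocessor currently indexes, and the old code simply read data[i]. */
    #define IND(i) index_shared_string(data, (i))

    /* Old, 8-bit-only path:  c = data[i];
     * Width-agnostic path:   c = IND(i);
     * index_shared_string() looks up the character at position i whatever
     * the width (8, 16 or 32 bits) of the shared string is. */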
-- Per Hedbor