I've been doing some PCB work lately, exploring various MCUs, and discovered that ARM MCUs are quite affordable these days.
How feasible would an embedded Pike be, kind of in the same spirit as this MicroPython: https://www.kickstarter.com/projects/214379695/micro-python-python-for-micro...
Stephen R. van den Berg wrote:
as this MicroPython: https://www.kickstarter.com/projects/214379695/micro-python-python-for-micro...
You can get a full Linux OS to run on ARM system-on-chip boards, including Pike. The catch is that they're not really embedded systems so much as single-board computers.
I've considered what would be required to strip away the OS, and the answer is that you lose a lot of what you'd want to use Pike for anyhow: no filesystem, no threads, no built-in networking. Basically, C but without memory management and with better types. But really, most true MCUs are so space constrained that I'm not sure it's worth the effort (which I suppose perhaps depends on the goal).
I'd be interested to hear opinions on this from others, though.
Bill
Bill Welliver wrote:
You can get a full Linux OS to run on ARM system-on-chip boards, including Pike. The catch is that they're not really embedded systems so much as single-board computers.
Those are still too large/expensive (for the applications I have in mind).
I've considered what would be required to strip away the OS, and the answer is that you lose a lot of what you'd want to use Pike for anyhow: no filesystem, no threads, no built-in networking. Basically, C but without memory management and with better types. But really, most true MCUs are so space constrained that I'm not sure it's worth the effort (which I suppose perhaps depends on the goal).
Well, the hardware that micropython targets should be able to support an embedded Pike too. The obvious benefit is somewhat faster development, which is a plus in an embedded system, because debugging is more complicated on such a system.
Bill Welliver wrote:
C but without memory management and with better types. But really, most true MCUs are so space constrained that I'm not sure it's worth the effort (which I suppose perhaps depends on the goal).
I have to admit, though, that the most cost-effective embedded systems have something on the order of 8 to 128KB of flash and 4 to 32KB of RAM, which in reality is too small to run Pike (I'm guessing, though). So maybe you're right, and the highest-volume target market is out of reach. Then again, MCU capacities are slowly rising and costs are coming down, so there will come a point where it makes sense.
So, I suppose it all depends on what you're looking to do. In small quantities, a minimum parts BOM that will run a standard build of Pike will probably be $20. The lesser ARM components that have the horsepower to run Pike but are incapable of running a full OS would probably run about half that.
The really affordable MCUs, such as AVRs and PICs, are probably just too slow to be useful.
In the case of micropython, I suspect that the micropython interpreter is the firmware flashed to the MCU and that they're using some sort of flash card to store the libraries and script to run, which sort of defeats my vision of a simple (in terms of part/circuit complexity) solution. I'm not sure that there's much way around that, though, because the master alone is bigger than the total flash available for a lot of MCUs.
Still, I think it's an interesting idea. Maybe a very early version of ulpc might be a good starting point...
Bill Welliver wrote:
The really affordable MCUs, such as AVRs and PICs, are probably just too slow to be useful.
Well, actually, forget the AVRs and PICs. One can get ARM MCUs with everything onboard (except USB) for EUR 0.55, which is about the same price as, or cheaper than, most AVRs and PICs. The cheapest ARM part that also does USB is something like EUR 2.10 (way cheaper than an ATxmega). So 8-bit MCUs have become obsolete as far as I am concerned.
In the case of micropython, I suspect that the micropython interpreter is the firmware flashed to the MCU and that they're using some sort of flash card to store the libraries and script to run,
They put the interpreter and libs in flash, but they have 1MB of flash for that. In the case of an MCU, though, the number of libraries required is close to none.
But, like I said, the 1MB flash is larger than I'm willing to put in most small MCU designs.
From the kickstarter site:
-----------------------------cut here-----------------------------
Micro Python is a complete rewrite, from scratch, of the Python scripting language. It is written in clean, ANSI C and includes a complete parser, compiler, virtual machine, runtime system, garbage collector and support libraries to run on a microcontroller. The compiler can compile to byte code or native machine code, selectable per function using a function decorator. It also supports inline assembler. All compilation happens on the chip, so there is no need for any software on your PC.
Micro Python currently supports 32-bit ARM processors with the Thumb v2 instruction set, such as the Cortex-M range used in low-cost microcontrollers. It has been tested on an STM32F405 chip.
Micro Python has the following features:
- Full implementation of the Python 3 grammar (but not yet all of Python's standard libraries).
- Implements a lexer, parser, compiler, virtual machine and runtime.
- Can execute files, and also has a command line interface (a read-evaluate-print-loop, or REPL).
- Python code is compiled to a compressed byte code that runs on the built-in virtual machine.
- Memory usage is minimised by storing objects in efficient ways. Integers that fit in 31-bits do not allocate an object on the heap, and so require memory only on the stack.
- Using Python decorators, functions can be optionally compiled to native machine code, which takes more memory but runs around 2 times faster than byte code. Such functions still implement the complete Python language.
- A function can also be optionally compiled to use native machine integers as numbers, instead of Python objects. Such code runs at close to the speed of an equivalent C function, and can still be called from Python, and can still call Python. These functions can be used to perform time-critical procedures, such as interrupts.
- An implementation of inline assembler allows complete access to the underlying machine. Inline assembler functions can be called from Python as though they were a normal function.
- Memory is managed using a simple and fast mark-sweep garbage collector. It takes less than 4ms to perform a full collection. A lot of functions can be written to use no heap memory at all and therefore require no garbage collection.

Tell me about the Micro Python board...
The Micro Python board is an electronics development board that runs Micro Python, and is based on the STM32F405 microcontroller. This microcontroller is one of the more powerful ones available, and was chosen so that Micro Python could run at its full potential. The microcontroller is clocked at 168MHz and has 1MiB flash and 192KiB RAM, which is plenty for writing complex Python scripts. The board measures 33x40 mm and is pictured below.
-----------------------------cut here-----------------------------
It's certainly true that AVR and friends aren't powerful enough to run a pike interpreter, but they're also much simpler, smaller and also use a fraction of the power that any ARM would use. I'm surprised that you'd be able to get an ARM for less than an AVR... you must have better sources than I.
The chip that they're targeting for this project runs about $12 in quantities less than 100, and they're selling the board for about $25. ST sells a Discovery board using the same chip as the micropython developer board, but without the card reader, for $14. You can put an RTOS like NuttX on it. If Pike ran under that, you might be in business without having to resort to a "micro Pike".
bill
Bill Welliver wrote:
It's certainly true that AVR and friends aren't powerful enough to run a pike interpreter, but they're also much simpler, smaller and also use a fraction of the power that any ARM would use. I'm
Well, I admit you caught me off guard with the power usage. I checked it just now, comparing an ATtiny461A with the MKE05Z8VTG4:
- Running with all peripherals inactive, the ATtiny uses about 3.6mA at 5V and 8MHz; the ARM uses about 3.5mA at 5V and 12MHz.
- In power-down mode the ATtiny uses 4uA at 3V; in stop mode the ARM uses 1.9uA at 3V.
Doesn't look like ARM is at a disadvantage there.
surprised that you'd be able to get an ARM for less than an AVR... you must have better sources than I.
Well, the ARM ATtiny rival would be this one (8KB Flash, 1KB RAM): http://nl.farnell.com/freescale-semiconductor/mke04z8vtg4/mcu-32bit-cortex-m... I can't come up with any convincing reasons to use an ATtiny instead.
And if you want USB, try MKL26Z128VLH4 (128KB flash, 16KB RAM): http://nl.farnell.com/jsp/search/productdetail.jsp?sku=2360679
The chip that they're targeting for this project runs about $12 in quantities less than 100, and they're selling the board for about $25. ST sells a Discovery board using the same chip as the micropython developer board, but without the card reader, for $14. You can put an RTOS like NuttX on it. If Pike ran under that, you might be in business without having to resort to a "micro Pike".
Well, this time I actually looked for a 1MB-flash ARM, and I found this MK22FN1M0VLL12 (1MB flash, 128KB RAM, EUR 6.50): http://nl.farnell.com/freescale-semiconductor/mk22fn1m0vll12/mcu-32bit-corte...
I'm actually amazed at the low price; it's about three times the price of the 128KB version. Which, admittedly, makes Pike on an embedded ARM a possibility again. I just checked out NuttX; it looks nice, but judging from the functionality it probably requires a fairly high clock rate to achieve anything meaningful (and that increases power consumption roughly linearly).
Glancing at Pike I notice that:
- The i386 compiled 32 bit binary is roughly 2MB text size and 866KB BSS size.
- program.o contains an unexpectedly large BSS segment of 306KB, why?
- cpp.o is 153KB vs. language.o being 131KB. It is rather unexpected that the cpp module is larger than the actual language parser. Why don't we use yacc for the preprocessor too; or is it difficult to capture the rules in yacc for the preprocessor?
- Which is not to say that an embedded Pike could do without the preprocessor, and possibly even without the language parser (it would mean that everything would need to be precompiled).
Well, the ARM ATtiny rival would be this one (8KB Flash, 1KB RAM): http://nl.farnell.com/freescale-semiconductor/mke04z8vtg4/mcu-32bit-cortex-m... I can't come up with any convincing reasons to use an ATtiny instead.
mke04z8vtg4: EUR 6.91, 100 pins
attiny10: EUR 0.62, 6 pins
Apart from the factor 10+ in price difference, I know which one I'd rather solder...
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
Well, the ARM ATtiny rival would be this one (8KB Flash, 1KB RAM): http://nl.farnell.com/freescale-semiconductor/mke04z8vtg4/mcu-32bit-cortex-m... I can't come up with any convincing reasons to use an ATtiny instead.
mke04z8vtg4: EUR 6.91, 100 pins
attiny10: EUR 0.62, 6 pins
Apart from the factor 10+ in price difference, I know which one I'd rather solder...
Are we looking at the same part? The link above goes to the Farnell part, which costs EUR 0.59 in quantities of 10 and up.
As to the soldering, the part has 16 pins, albeit in a TSSOP package. You need to get used to it, but in practice soldering it is not really slower than a 16-pin DIP part.
Sorry, I must have mixed up the two links. That was the 1MB flash variant.
Still, more expensive and more pins to solder. The ATtiny costs EUR 0.52 in 10-up quantities. :-)
On Thu, 8 May 2014, Stephen R. van den Berg wrote:
Glancing at Pike I notice that:
- The i386 compiled 32 bit binary is roughly 2MB text size and 866KB BSS size.
- program.o contains an unexpectedly large BSS segment of 306KB, why?
Not sure why it shows up in program.o, but that might be the type_stack defined in pike_types.c. I guess it could be replaced by something that resizes on demand.
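For illustration, a minimal sketch of what "resizes on demand" could mean here (this is not the actual type_stack code; the names are made up):

  #include <stdlib.h>

  /* Hypothetical growable stack instead of a fixed-size array in BSS.
     Grows geometrically the first time it overflows; costs nothing
     until it is actually used. */
  struct growstack {
    void **data;
    size_t used, alloc;
  };

  static int gs_push(struct growstack *s, void *item)
  {
    if (s->used == s->alloc) {
      size_t n = s->alloc ? s->alloc * 2 : 64;
      void **p = realloc(s->data, n * sizeof *p);
      if (!p) return -1;                 /* out of memory */
      s->data = p;
      s->alloc = n;
    }
    s->data[s->used++] = item;
    return 0;
  }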
- cpp.o is 153KB vs. language.o being 131KB. It is rather unexpected that the cpp module is larger than the actual language parser. Why don't we use yacc for the preprocessor too; or is it difficult to capture the rules in yacc for the preprocessor?
The preprocessor is included 3 times (once for each string width). A similar thing happens in sscanf (for all possible combinations of string widths).
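For readers who haven't seen it, the pattern behind those three copies looks roughly like this (a schematic sketch, not the real preprocessor.h; whether it is done with a macro or by including the same header three times, the effect on code size is the same):

  #include <stddef.h>

  /* The same body gets compiled once per character width, so it ends
     up in the binary three times. */
  #define MAKE_FIND(NAME, WCHAR)                                      \
    static ptrdiff_t NAME(const WCHAR *str, ptrdiff_t len, WCHAR ch)  \
    {                                                                 \
      ptrdiff_t i;                                                    \
      for (i = 0; i < len; i++)                                       \
        if (str[i] == ch) return i;                                   \
      return -1;                                                      \
    }

  MAKE_FIND(find_char0, unsigned char)   /* 8-bit strings  */
  MAKE_FIND(find_char1, unsigned short)  /* 16-bit strings */
  MAKE_FIND(find_char2, unsigned int)    /* 32-bit strings */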
One other culprit for code size is sprintf, due to the use of huge macros for slow paths. This is one of several places where reducing the code size by de-inlining would also help performance on big CPUs (or at least not make it slower).
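A rough illustration of the de-inlining being suggested (hypothetical names, not the actual sprintf internals): a big slow-path macro that gets expanded at every call site becomes a single out-of-line function, and each former expansion becomes a plain call.

  #include <stdio.h>

  /* Before: expanded in full wherever it is used. */
  #define EMIT_PADDING(out, n, pad)       \
    do {                                  \
      long i_;                            \
      for (i_ = 0; i_ < (n); i_++)        \
        fputc((pad), (out));              \
    } while (0)

  /* After: one shared copy of the slow path. */
  static void emit_padding(FILE *out, long n, int pad)
  {
    long i;
    for (i = 0; i < n; i++)
      fputc(pad, out);
  }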
arne
p.s.
nm -S pike | cut -d ' ' -f 2- | sort -r | less
Arne Goedeke wrote:
- cpp.o is 153KB vs. language.o being 131KB. It is rather unexpected that the cpp module is larger than the actual language parser. Why don't we use yacc for the preprocessor too; or is it difficult to capture the rules in yacc for the preprocessor?
The preprocessor is included 3 times (once for each string width).
So Pike actually supports source code delivered in UTF character encoding? Does that make sense? Why not limit the preprocessor and source compiling to 8 bit only?
A similar thing happens in sscanf (for all possible combinations of string widths).
One other culprit for code size is sprintf, due to the use of huge macros for slow paths. This is one of several places where reducing the code size by de-inlining would also help performance on big CPUs (or at least not make it slower).
I'd think so too. Increases cache hits, at least.
So Pike actually supports source code delivered in UTF character encoding? Does that make sense? Why not limit the preprocessor and source compiling to 8 bit only?
Pike supports source code in any character encoding. This is then converted to UCS internally, so that one character is always one character, and the character codes are well defined.
And as to why non-ASCII source is permitted; it allows you to use non-ASCII characters in string literals, variable names etc.
On Thu, 8 May 2014, Stephen R. van den Berg wrote:
One other culprit for code size is sprintf, due to the use of huge macros for slow paths. This is one of several places where reducing the code size by de-inlining would also help performance on big CPUs (or at least not make it slower).
I'd think so too. Increases cache hits, at least.
I have a branch lying around which does some de-inlining in sprintf. It reduces code size by half. There was no real speed advantage, but I only ran some micro benchmarks on i7, so I assume there was no cache pressure.
arne
On Thu, 8 May 2014, Arne Goedeke wrote:
I have a branch lying around which does some de-inlining in sprintf. It reduces code size by half. There was no real speed advantage, but I only ran some micro benchmarks on i7, so I assume there was no cache pressure.
I pushed those sprintf de-inlining changes into a new branch 'arne/slim'. That patch alone makes sprintf about 75% smaller. During my train ride yesterday I wrote another patch that changes the generation of sscanf functions. Instead of having one function for each string width of both the format and the input, it uses the PCHARP accessor functions for the format. I have not run any benchmarks to see how much slower it is, but it saves about 60% of code size there.
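For anyone not familiar with PCHARP, the underlying idea is a (pointer, shift) pair that lets one code path read characters of any width, so the format handling no longer has to be compiled once per width. A rough sketch of the concept (not the actual Pike API):

  #include <stddef.h>

  /* Width-generic string pointer: shift selects 8-, 16- or 32-bit
     characters, and a single accessor handles all three. */
  struct pcharp_sketch {
    const void *ptr;
    int shift;                       /* 0, 1 or 2 */
  };

  static unsigned int pcharp_index(struct pcharp_sketch s, ptrdiff_t i)
  {
    switch (s.shift) {
      case 0:  return ((const unsigned char  *)s.ptr)[i];
      case 1:  return ((const unsigned short *)s.ptr)[i];
      default: return ((const unsigned int   *)s.ptr)[i];
    }
  }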
arne
When looking at code size for different functions, I noticed that gcc with -O3 generates horribly large code for the file_open* functions in modules/_Stdio/file.c. Adding ATTRIBUTE((optimize("Os"))) to those saves about 15 kB. For file_open it's 5 kB vs 1 kB. With -O3 gcc ends up generating an impressive 32 calls to open64...
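For reference, the attribute in question looks roughly like this when applied directly (GCC-specific; in Pike it is wrapped in the ATTRIBUTE macro, and the function below is just a made-up stand-in):

  /* Ask gcc to optimize this one function for size even though the
     rest of the file is compiled with -O3. */
  __attribute__((optimize("Os")))
  static int open_with_fallbacks(const char *path, int flags)
  {
    (void)path; (void)flags;   /* body elided in this sketch */
    return -1;
  }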
arne
On Sat, 10 May 2014, Arne Goedeke wrote:
On Thu, 8 May 2014, Arne Goedeke wrote:
I have a branch lying around which does some de-inlining in sprintf. It reduces code size by half. There was no real speed advantage, but I only ran some micro benchmarks on i7, so I assume there was no cache pressure.
I pushed those sprintf de-inlining changes into a new branch 'arne/slim'. That patch alone makes sprintf about 75% smaller. During my train ride yesterday I wrote another patch that changes the generation of sscanf functions. Instead of having one function for each string width of both the format and the input, it uses the PCHARP accessor functions for the format. I have not run any benchmarks to see how much slower it is, but it saves about 60% of code size there.
arne
I would like to merge arne/slim into 8.0. It reduces text size with -O3 by about 100k. Most changes are probably harmless, however one disables support for decoding programs with old style encoding and adds a configure argument to turn it back on. I assume that part of the decoder is not normally used, is that correct?
Arne
I think it looks good. Have you done any benchmarking on the relevant parts?
I benchmarked the sprintf changes, and there is really no difference. I think in theory on CPUs with small caches performance should improve, but that's just a guess. The changes in sscanf make it slightly slower. Depending on the benchmark the old inlined code is about 3% faster. However, I only looked at the tag removal benchmarks.
Arne
On Thu, 15 May 2014, Martin Nilsson (Opera Mini - AFK!) @ Pike (-) developers forum wrote:
I think it looks good. Have you done any benchmarking on the relevant parts?
So the slower sscanf would be my concern then, at least the 0_0 case.
I guess a reasonable thing would be to have a 0_0 variant and one for all other combinations using PCHARP, but that would make the macros even worse..
On Thu, 15 May 2014, Martin Nilsson (Opera Mini - AFK!) @ Pike (-) developers forum wrote:
So the slower sscanf would be my concern then, at least the 0_0 case.
I merged all the changes except for those sscanf modifications. I added the %-F support, though. I also removed the old-style encoding completely, without a configure argument to reactivate it.
Arne
On Thu, 15 May 2014, Martin Nilsson (Opera Mini - AFK!) @ Pike (-) developers forum wrote:
So the slower sscanf would be my concern then, at least the 0_0 case.
Arne Goedeke wrote:
I benchmarked the sprintf changes, and there is really no difference. I think in theory on CPUs with small caches performance should improve, but that's just a guess.
Actually, even on a processor with a large enough cache, there will be a noticeable difference if multiple processes are running; it will allow better hit rates in the cache despite context switches. This is, obviously, notoriously difficult to benchmark.
Arne Goedeke wrote:
The changes in sscanf make it slightly slower. Depending on the benchmark the old inlined code is about 3% faster. However, I only looked at the tag removal benchmarks.
What are the typical sscanf operations in those benchmarks? (Or where is the benchmark code?).
The test cases are
while (sscanf(in,"%s<%*s>%s",tmp,in))
and
array_sscanf(data,"%{%s<%*s>%}%{%s%}")
so it's not testing very many formats. The benchmark code is the one run by pike -x benchmark, so you can have a look there.
arne
I would like to merge arne/slim into 8.0. It reduces text size with -O3 by about 100k. Most changes are probably harmless, however one disables support for decoding programs with old style encoding and adds a configure argument to turn it back on. I assume that part of the decoder is not normally used, is that correct?
I believe no one has used OLD_PIKE_ENCODE_PROGRAM since about 2003-02-24, when mast changed it to not be the default, so I don't think there'll be any problems with removing it.
Personally I think it could be removed entirely. It's not like it's all that common for people to try to decode .o-files generated by other versions of pike at all.
Speaking about encode: How about putting the encode code for programs in a separate .so file, that is only loaded when dumping programs?
Likewise, decoding programs could at least be moved to a separate function to make the decode_value function smaller (usually gcc generates better code for small functions. Usually.)
I know it would be somewhat cumbersome, but it's the /by far/ biggest part of the code.
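As a very rough sketch of the "separate .so, loaded on demand" idea (illustrative only: the library name and symbol below are made up, and this is not how Pike's build is actually arranged):

  #include <dlfcn.h>
  #include <stdio.h>

  typedef int (*encode_program_fn)(const void *prog, void *out);

  /* Map the program-encoding code in only the first time someone
     actually dumps a program. */
  static encode_program_fn get_program_encoder(void)
  {
    static encode_program_fn fn;
    if (!fn) {
      void *h = dlopen("pike_encode_program.so", RTLD_NOW | RTLD_LOCAL);
      if (!h) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return NULL;
      }
      fn = (encode_program_fn)dlsym(h, "low_encode_program");
    }
    return fn;
  }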
Per Hedbor () @ Pike (-) developers forum wrote:
Personally I think it could be removed entirely. It's not like it's all that common for people to try to decode .o-files generated by other versions of pike at all.
Quite. Simply check out an old version of Pike, and decode with that.
Speaking about encode: How about putting the encode code for programs in a separate .so file, that is only loaded when dumping programs?
Sounds very good. Would help the embedded Pike solution a lot.
function to make the decode_value function smaller (usually gcc generates better code for small functions. Usually.)
Have you checked/noticed this? It shouldn't matter, actually. The last time I checked the innards of gcc was around 1996, so my data is a bit dated, but back then it didn't matter. gcc's lifetime analysis for local variables is quite good (provided you don't use pointers). When you use lots of pointers, it gets a bit hairy, though.
Glancing at Pike I notice that:
- The i386 compiled 32 bit binary is roughly 2MB text size and 866KB BSS size.
- program.o contains an unexpectedly large BSS segment of 306KB, why?
Not sure why it shows up in program.o, but that might be the type_stack defined in pike_types.c. I guess it could be replaced by something that resizes on demand.
There is a function lookup cache as well, that probably should be replaced with a mapping internally.
- cpp.o is 153KB vs. language.o being 131KB. It is rather unexpected that the cpp module is larger than the actual language parser. Why don't we use yacc for the preprocessor too; or is it difficult to capture the rules in yacc for the preprocessor?
The preprocessor is included 3 times (once for each string width). A similar thing happens in sscanf (for all possible combinations of string widths).
There is also no point in including it three times, since you can happily use a slow version of cpp with one function handling all string widths. Per Hedbor worked on that a week ago, with very good results in trimming down the size (just getting rid of all the inlined code does wonders for both size and speed). The code is, however, very brittle, and in the end he ended up with strange bugs, so he decided to start over at some later point and do it in small incremental changes.
The same is true of the lexer, by the way: there does not really _need_ to be three versions; simply use normal string indexing code. It's not _that_ slow, and generally speaking you do not spend most of your time compiling; optimizing that case specifically to this extent seems rather excessive.
Even without inlining the amount of code duplication in preprocessor.h is rather amazing, by the way. Just look for all the code skipping quoted newlines.
This could be moved to a pre-pass, making all other parsing both easier to read and smaller, code-wise.
But it takes time.
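A sketch of what such a pre-pass could look like: one linear scan that drops backslash-newline pairs so that none of the later stages has to care about them (hypothetical helper, not the actual code; a real version would also have to keep the reported line numbers in sync):

  #include <stddef.h>

  /* Remove quoted newlines in place and return the new length. */
  static size_t strip_quoted_newlines(char *buf, size_t len)
  {
    size_t r = 0, w = 0;
    while (r < len) {
      if (buf[r] == '\\' && r + 1 < len && buf[r + 1] == '\n') {
        r += 2;                      /* skip the backslash-newline pair */
      } else {
        buf[w++] = buf[r++];
      }
    }
    return w;
  }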
For reference, the pike compiler would actually fit on a DOS box:
   text    data     bss
 195650       0     312   cpp.o
 142355       0     148   language.o
  88963       0     720   las.o
  88252       4    8264   peep.o
  49165       0       0   lex.o
  48132      84      68   pikecode.o
  42776      72   18932   docode.o
-------
   640k  <-- A few bytes less than..
"Should be enough for anyone", right? :)
Per Hedbor () @ Pike (-) developers forum wrote:
The same is true of the lexer, by the way: there does not really _need_ to be three versions; simply use normal string indexing code. It's not _that_ slow, and generally speaking you do not spend most of your time compiling; optimizing that case specifically to this extent seems rather excessive.
Yes, quite.
However, I still can't entirely shake the notion that we're overdoing it here. Maybe we could simply make the preprocessor and compiler grok UTF8 directly and get rid of the special casing. All compiler input processing would return back to 8-bit only. And if someone were audacious enough to keep Unicode Pike source files on disk, then a quick Unicode-to-UTF8 conversion pass would do the trick quite nicely.
grok UTF8 directly
The input is not necessarily UTF8, and the output is definitely not. So your proposal is to make two conversions instead of one. Not necessarily a problem, but it seems a bit convoluted, especially since the results of parsing would need to be converted individually (i.e. each string literal, each symbol, etc.).
However, I still can't entirely shake the notion that we're overdoing it here. Maybe we could simply make the preprocessor and compiler grok UTF8 directly and get rid of the special casing. All compiler input processing would return back to 8-bit only.
Converting everything to utf8 before preprocessing would work, yes, if it is then converted back to unicode before the tokenization.
The alternative (handling utf-8 in the tokenizer) is needlessly messy.
Define name/argument handling would be the only thing that needs to be altered in cpp to handle utf-8.
Then again, just switching data[i] to IND(i) or similar, and having that be defined as index_shared_string(data,i) (or, to break with conventions in the code, not using a macro at all and instead just using the function directly) is actually significantly easier than adding utf-8 support to the preprocessor.
It is however bound to be somewhat slower in most cases. But I do not really think the difference matters at all, considering everything else we are doing in there.
-- Per Hedbor
I have a cpp.o version on x86_64 that is 48Kb (normally it's about 200k), which is better, albeit still rather large.
That involved changing the input to a pike-string (size /= 3), and changing all the macros to functions (most are automatically inlined by gcc, compiling with -Os saves a few Kb more).
I cannot for the life of me get it to produce the correct output, though. Line numbers change, and things are generally shaky.
The code is extremely sensitive to change; I think I would have to do it over from scratch, changing at most a few lines at a time and running the testsuite on it constantly.