I've been working on a multi-cpu design for Pike during the last couple of months, and now I think it's time to get some criticism on it.
It's not finished, and unfortunately it's a bit long already. I still hope the people around here can find the time to read it through and mull a bit on it, because this hack is going to be rather large. I've tried to explain things briefly so that no knowledge of the pike core internals should be required - most of it is hopefully understandable for anyone who has written some C module code.
The goal is not only to make pike run reasonably on 4-8 cores. It's rather to completely avoid inherent hotspots in the interpreter, so that the level of parallelism only depends on the pike program. I.e. this should scale to 100+ cpu's.
Very briefly, the approach is to divide all data into thread local things that need no locking, and an arbitrary number of "lock spaces". A "lock space" is a monitor lock that controls access to any amount of data - it is up to the programmer to divide the shared data into suitable pieces that are locked independently. Some structures (e.g. arrays and possibly mappings) are lock-free. Refcounting is selectively - or perhaps even completely - disabled to avoid hotspots.
I'll attach the whole thing as a comment to this message. I've also checked it in as multi-cpu.txt in a new branch "multi-cpu" in srb's pikex repository (lyskom people see 16711234).
Btw, merry Christmas! ;)
Multi-cpu support in Pike
-------------------------
This is a draft spec for how to implement multi-cpu support in Pike. The intention is that it gets extended along the way as more issues get ironed out. Discussions take place in "Pike dev" in LysKOM or pike-devel@lists.lysator.liu.se.
Initial draft created 8 Nov 2008 by Martin Stjernholm.
Background and goals
Pike supports multiple threads, but like many other high-level languages it only allows one thread at a time to access the data structures. This means that the utilization of multi-cpu and multi-core systems remains low, even though there are some modules that can do isolated computational tasks in parallel (e.g. the Image module).
It is the so-called "interpreter lock" that must be locked to access any reference variable (i.e. everything except floats and native integers). This lock is held by default in essentially all C code and is explicitly unlocked in a region by the THREADS_ALLOW/THREADS_DISALLOW macros. On the pike level, the lock is always held - no pike variable can be accessed and no pike function can be called otherwise.
The purpose of the multi-cpu support is to rectify this. The design goals are, in order of importance:
1. Pike threads should be able to execute pike code concurrently on multiple cpus as long as they only modify thread local pike data and read a shared pool of static data (i.e. the pike programs, modules and constants).
2. There should be as few internal hot spots as possible (preferably none) when pike code is executed concurrently. Care must be taken to avoid internal synchronization, or updates of shared data that would cause "cache line ping-pong" between cpus.
3. The concurrency should be transparent on the pike level. Pike code should still be able to access shared data without locking and without risking low-level inconsistencies. (So Thread.Mutex etc would still be necessary to achieve higher level synchronization.)
4. There should be tools on the pike level to allow further performance tuning, e.g. lock-free queues, concurrent access hash tables, and the possibility to lock different regions of shared data separately. These tools should be designed so that they are easy to slot into existing code with few changes.
5. There should be tools to monitor and debug concurrency. It should be possible to make assertions that certain objects aren't shared, and that certain access patterns don't cause thread synchronization. This is especially important if goal (3) is realized, since the pike code by itself won't show what is shared and what is thread local.
6. C modules should continue to work without source level modification (but likely without allowing any kind of concurrency).
Note that even if goal (3) is accomplished, this is no miracle cure that would make all multithreaded pike programs run with optimal efficiency on multiple cpus. One could expect better concurrency in old code without adaptations, but it could still be hampered considerably by e.g. frequent updates to shared data. Concurrency is a problem that must be taken into account on all levels.
Other languages
Perl: All data is thread local by default. Data can be explicitly shared, in which case Perl ensures internal consistency. Every shared variable is apparently locked individually. Referencing a thread local variable from a shared one causes the thread to die. See perlthrtut(1).
Python: Afaik it's the same state of affairs as Pike.
Solution overview
The basic approach is to divide all data into thread local and shared:
o Thread local data is everything that is accessible to one thread only, i.e. there are no references to anything in it from shared data or from any other thread. This is typically data that the current thread has created itself and only references from the stack. The thread can access its local data without locking.
o Shared data is everything that is accessible from more than one thread. Access to it is synchronized using a global read/write lock, the so-called "global lock". I.e. this lock can either be locked for reading by many threads, or be locked by a single thread for writing. Locking the global lock for writing is the same as locking the interpreter lock in current pikes. (This single lock is refined later - see issue "Lock spaces".)
o There is also a special case where data can be "disowned", i.e. not shared and not local in any thread. This is used in e.g. Thread.Queue for the objects that are in transit between threads. Disowned data cannot have arbitrary references to it - it must always be under the control of some object that in some way ensures consistency. (Garbage could be made disowned since it by definition no longer is accessible from anywhere, but of course it is always better to clean it up instead.)
+--------+           +---------------------+     Direct      +--------+
|        |<-- refs --| Thread 1 local data |<- - access - - -|        |
|        |           +---------------------+                 | Thread |
|        |                                                   |   1    |
|        |<- - - - Access through global lock only - - - - - |        |
| Shared |                                                   +--------+
|  data  |
|        |           +---------------------+     Direct      +--------+
|        |<-- refs --| Thread 2 local data |<- - access - - -|        |
|        |           +---------------------+                 | Thread |
|        |                                                   |   2    |
|        |<- - - - Access through global lock only - - - - - |        |
+--------+                                                   +--------+

                             ... etc ...
The principal use case for this model is that threads can do most of their work with local data and read access to the shared data, and comparatively seldom require the global write lock to update the shared data. Not every shared thing has its own lock, since that would cause excessive lock overhead.
Note that the shared data is typically the same as the data referenced from the common environment (i.e. the "global data").
Also note that the current object (this) always is shared in pike modules, so a thread cannot assume free access to it. In other pike classes it would often be shared too, but it is still important to utilize the situation when it is thread local.
A thread local thing, and all the things it references directly or indirectly, automatically becomes shared whenever it gets referenced from a shared thing.
A shared thing never automatically becomes thread local, but there is a function to explicitly "take" it. It would first have to make sure there are no references to it from shared or other thread local things. Thread.Queue has a special case so that if a thread local thing with no other refs is enqueued, it is disowned by the current thread, and later becomes thread local in the thread that dequeues it.
Issue: Lock spaces
Having a single global read/write lock for all shared data could become a bottleneck. Thus there is a need for shared data with locks separate from the global lock. Things that share a common lock are called a "lock space", and it is always possible to look up the lock that governs any given thing (see issue "Memory object structure").
A special global lock space, which corresponds to the shared data discussed above, is created on startup. All others have to be created explicitly.
The intended use case for lock spaces is a "moderately large" collection of things: Too large and you get outlocking problems, too small and the lock overhead (both execution- and memorywise) gets prohibitive. A typical lock space could be a RAM cache consisting of a mapping and all its content.
Many different varieties of lock space locks can be considered, e.g. a simple exclusive access mutex lock or a read/write lock, priority locks, locks that ensure fairness, etc. Therefore different (C-level) implementations should be allowed.
One important characteristic of lock space locks is whether they are implicit or explicit:
Implicit locks are locked internally, without intervention on the pike level. The lock duration is unspecified; locks are only acquired to ensure internal consistency. All low level data access functions check whether the lock space for the accessed thing is locked already. If it isn't then the lock is acquired automatically. All implicit locks have a well defined lock order (by pointer comparison), and since they only are taken to guarantee internal consistency, an access function can always free a lock to ensure correct order (see also issue "Lock space locking").
Explicit locks are exposed to the pike level and must be locked in a similar way to Thread.Mutex. If a low level data access function encounters an explicit lock that isn't locked, it throws an error. Thus it is left to the pike programmer to avoid deadlocks, but the pike core won't cause any by itself. Since the pike core keeps track of which lock governs which thing, it ensures that no lock violating access occurs, which is a valuable aid to ensure correctness.
One can also consider a variant with a read/write lock space lock that is implicit for read but explicit for write, thus combining atomic pike-level updates with the convenience of implicit locking for read access.
The scope of a lock space lock is (at least) the state inside all the things it contains, but not the set of things itself, i.e. things might be added to a lock space without holding a write lock (provided the memory structure allows it). Removing a thing from a lock space always requires the write lock, since that is necessary to ensure that a lock actually governs a thing for as long as it is held (regardless of whether it's held for reading or writing).
FIXME: Allow removing garbage from a lock space without the write lock?
See also issues "Memory object structure" and "Lock space locking" for more details.
Issue: Memory object structure
Of concern are the refcounted memory objects known to the gc. They are called "things", to avoid confusion with "objects" which are the structs for pike objects.
There are three types of things:
o First class things with ref counter, lock space pointer, and double-linked list pointers (to be able to visit all things in memory, regardless of other references). Most pike visible types are first class things. The exceptions are ints and floats, which are passed by value, and strings and types.
o Second class things with ref counter and lock space pointer but no double-linked list pointers. These are always reached through pointers from one or more first class things. It's the job of the visit functions for those first class things to ensure that the gc visits these, thus they don't need the double-linked list pointers. Only strings and types are likely to be of this type.
o Third class things contain only a ref counter. They are similar to second class except that their lock spaces are implicit from the referencing things, which means all those things must always be in the same lock space.
Thread local things could have NULL as lock space pointer, but as a debug measure they could also point to the thread object so that it's possible to detect bugs with a thread accessing things local to another thread.
Before the multi-cpu architecture, all first class things are linked into the same global double-linked lists (one for each type: array, mapping, multiset, object, and program). This gets split into one set of double-linked lists for each thread and for each lock space. That allows things to be added to and removed from a thread or lock space without requiring other locks (a lock-free double-linked list is apparently difficult to accomplish). It also allows the gc to do garbage collection locally in each thread and in each lock space (although cyclic structures over several lock spaces won't be freed that way).
A global lock-free hash table (see issue "Lock-free hash table") is used to keep track of all lock space lock objects, and hence all things they contain in their double-linked lists.
  +----------+                +----------+
  | Thread 1 |                | Thread 2 |
  +----------+                +----------+
    //    \                     //    \
   O       O                   O       O
    \     //                    \     //
     O   O --.                   O   O
     |       |                   |
    ref     ref                 ref
     |       |                   |
     v       v                   v
     O ----> O       O           O       O <----> O
    // \    // \    // \        // \    // \     // \
   O    O  O    O  O    O      O    O  O    O   O    O
    \  //   \  //   \  //       \  //   \  //    \  //
  +--------------+  +--------------+  +--------------+
  | Lock space 1 |  | Lock space 2 |  | Lock space 3 |
  +--------------+  +--------------+  +--------------+
         ^_____            ^                ____^
               |           |               |
  +------------+-|-+------+-|-+------------+-|-+-------------------
  |              X |        X |              X |    ...
  +--------------+-+--------+-+--------------+-+-------------------
Figure 2: "Space Invaders". The O's represent things, and the \ and // represent the double-linked lists. Some examples of references between things are included, and at the bottom is the global hash table with pointers to all lock spaces.
Accessing a lock space lock structure from the global hash table requires a hazard pointer (c.f. issue "Hazard pointers"). Accessing it from a thing is safe if the thread controls at least one ref to the thing, because a lock space has to be empty to delete the lock space lock struct.
Issue: Lock space lock semantics
There are three types of locks:
o A read-safe lock ensures only that the data is consistent, not that it stays constant. This allows lock-free updates in things where possible (which could include arrays, mappings, and maybe even multisets and objects of selected classes).
o A read-constant lock ensures both consistency and constantness (i.e. what usually is assumed for a read-only lock).
o A write lock ensures complete exclusive access. The owning thread can modify the data, and it can assume no other changes occur to it (barring refcounters - see below). The owning thread can also, for a limited time, leave the data in an inconsistent state. This is however still limited by the calls to check_threads(), which means that the state must be consistent again every time the evaluator callbacks are run. See issue "Emulating the interpreter lock".
Allowing lock-free updates is attractive, so the standard read/write lock that governs the global lock space will probably be multiple read-safe/single write.
An exception to the lock semantics above is the reference counters in refcounted things (c.f. issue "Refcounting and shared data"). A ref to a thing can always be added or removed if it is certain that the thing cannot asynchronously disappear. That means:
o Refcount changes must always be atomic, even when a write lock is held.

o The refcount may be incremented or decremented when any kind of read lock is held.

o The refcount may be incremented or decremented without any kind of lock at all, provided the same thread already holds at least one other ref to the same thing. This means another thread might hold a write lock, but it still won't free the thing since the refcount never can reach zero.

o A thing may be freed if its refcount is zero and a write lock is held.
FIXME: Whether or not to free a thing if its refcount is zero and only some kind of read lock is held is tricky. To allow that it's necessary to have an atomic-decrement-and-get instruction (can be emulated with CAS, though) to ensure no other thread is decrementing it and reaching zero at the same time. Lock-free linked lists are also necessary to make unlinking possible. Barring that, we need to figure out a policy for scheduling frees of things reaching refcount zero during read locks.
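To illustrate, here's a minimal C sketch of the free-during-read-lock problem; atomic_dec_and_get() and schedule_free() are made-up names, and the CAS emulation uses a gcc builtin just for the example:

  /* A thing with an atomically maintained refcount. */
  struct thing { int refs; /* ... */ };

  void schedule_free(struct thing *t);  /* hypothetical: defer to the gc */

  /* Atomic decrement returning the new value, emulated with CAS. */
  static inline int atomic_dec_and_get(int *refs)
  {
    int old;
    do {
      old = *refs;
    } while (!__sync_bool_compare_and_swap(refs, old, old - 1));
    return old - 1;
  }

  void free_ref_under_read_lock(struct thing *t)
  {
    /* Exactly one thread can see the transition to zero, so no two
       threads both decide to free t. But unlinking t from its
       double-linked list isn't possible under a read lock, so the
       actual free has to be deferred. */
    if (atomic_dec_and_get(&t->refs) == 0)
      schedule_free(t);
  }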
Issue: Lock space locking
Assuming that a thread already controls at least one ref to a thing (so it won't be freed asynchronously), this is the locking process before accessing it:
1. Read the lock space pointer. If it's NULL then the thing is thread local and nothing more needs to be done.

2. Address an array containing the pointers to the lock spaces that are already locked by the thread.

3. Search for the lock space pointer in the array. If present then nothing more needs to be done.

4. Lock the lock space lock as appropriate. Note that this can imply that other implicit locks that are held are unlocked to ensure correct lock order (see issue "Lock spaces"). Then it's added to the array.
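A rough C sketch of this process (all names and types here are assumptions, not an actual API):

  #define MAX_HELD 16

  struct lock_space;                  /* opaque lock space lock struct */
  struct thing { struct lock_space *lock_space; /* ... */ };

  /* Assumed to handle the lock order internally, possibly by releasing
     other implicit locks (see issue "Lock spaces"). */
  void lock_space_lock(struct lock_space *ls);

  static __thread struct lock_space *held[MAX_HELD]; /* step 2 */
  static __thread int num_held;

  void lock_for_access(struct thing *t)
  {
    int i;
    struct lock_space *ls = t->lock_space;
    if (!ls) return;                  /* step 1: thread local thing */
    for (i = 0; i < num_held; i++)    /* step 3: linear search */
      if (held[i] == ls) return;      /*         already locked */
    lock_space_lock(ls);              /* step 4: lock it... */
    held[num_held++] = ls;            /* ...and record it (a full array
                                         could be handled by releasing
                                         implicit locks in it) */
  }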
A thread typically won't hold more than a few locks at any time (less than ten or so), so a plain array and linear search should perform well. For quickest possible access the array should be a static thread local variable (c.f. issue "Thread local storage"). If the array gets full, implicit locks in it can be released automatically to make space. Still, a system where more arrays can be allocated and chained on would perhaps be prudent to avoid the theoretical possibility of running out of space for locked locks.
"Controlling" a ref means either to add one "for the stack", or ensuring a lock on a thing that holds a ref. Note that implicit locks might be released in step 4, so unless the thread controls a ref to the referring thing too, it might no longer exist afterwards, and hence the thing itself might be gone.
Since implicit locks can be released (almost) at will, they are open for performance tuning: Too long lock durations and they'll outlock other threads, too short and the locking overhead becomes more significant. As a starting point, it seems reasonable to release them at every evaluator callback call (i.e. at approximately every pike function call and return).
Issue: Refcounting and shared data
Using the traditional refcounting on shared data could easily produce hotspots: Some strings, shared constants, and the object instances for pike modules are often accessed from many threads, so their refcounts would be changed frequently from different processors.
E.g. making a single function call in a pike module requires the refcount of the module object to be increased during the call since there is a new reference from a pike_frame. The refcounters in the module objects for commonly used modules like Stdio.pmod/module.pmod could easily become hotspots.
Atomic increments and decrements are not enough to overcome this - the memory must not be changed at all to avoid slow synchronizations between cpu local caches.
Observation: Refcounters become hotspots primarily in globally accessible shared data, which for the most part has a long lifetime (i.e. programs, module objects, and constants). Otoh, they are most valuable in short-lived data (shared or not), which would produce lots of garbage if they were to be reaped by the gc instead.
Following this observation, the problem with refcounter hotspots can to a large degree be mitigated by simply turning off refcounting in the large body of practically static data in the shared runtime environment.
A good way to do that is to extend the resolver in the master to mark all programs it compiles, their constants, and the module objects, so that refcounting of them is disabled. To do this, there has to be a function similar to Pike.count_memory that can walk through a structure recursively and mark everything in it. When those things lose their refs, they will always become garbage that only is freed by the gc.
Question: Is there data that is missed with this approach?
A disabled refcounter is recognized by a negative value, and is flagged by setting the topmost two bits to one and the rest to zero, i.e. a value in the middle of the negative range. That way, in case there is code that steps the refcounter, it stays negative. (Such code is still bad for performance and should be fixed, though.)
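A sketch of that encoding for a 32-bit refcounter (names made up for the example):

  #include <stdint.h>

  /* Topmost two bits one, the rest zero: a value in the middle of the
     negative range, so stray increments or decrements keep it negative. */
  #define REFS_DISABLED ((int32_t) 0xC0000000)

  static inline void disable_refs(int32_t *refs) { *refs = REFS_DISABLED; }
  static inline int refs_disabled(int32_t refs) { return refs < 0; }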
Disabling refcounting requires the gc to operate differently; see issue "Garbage collection and external references".
Another alternative is to actually cease to use refcounting altogether and instead use a generational garbage collector that can reap the newest data quickly and frequently. Lock space local garbage collection will also help here.
Issue: Strings
Strings are unique in Pike. This property is hard to keep if threads have local string pools, since a thread local string might become shared at any moment, and thus would need to be moved. Therefore the string hash table remains global, and lock congestion is avoided with some concurrent access hash table implementation. See issue "Lock-free hash table".
Lock-free is a good start, but the hash function must also provide a good even distribution to avoid hotspots. Pike currently uses an in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a more widespread and better studied alternative should be considered. There seem to be few that are below O(n) (which DO_HASHMEM is), though.
Issue: Types
Like strings, types are globally unique and always shared in Pike. That means lock-free access to them is desirable, and it should also be doable fairly easily since they are constant (except for the refcounts which can be updated atomically). Otoh it's probably not as vital as for strings since types typically only are built during compilation.
Types are more or less always part of global shared data. That suggests they should have their refcounts disabled most of the time (see issue "Refcounting and shared data"). But again, since types typically only get built during compilation, their refcounts probably won't become hotspots anyway. So it looks like they could be exempt from that rule.
Issue: Shared mapping and multiset data blocks
An interesting issue is whether things like mapping/multiset data blocks should be second or third class things (c.f. issue "Memory object structure"). If they're third class it means copy-on-write behavior doesn't work across lock spaces. If they're second class it means additional overhead handling the lock spaces of the mapping data blocks, and if a mapping data is shared between lock spaces then it has to be in some third lock space of its own, or in the global lock space, neither of which would be very good.
So it doesn't look like there's a better way than to botch copy-on-write in this case.
Issue: Emulating the interpreter lock
For compatibility with old C modules, and for the _disable_threads function, it is necessary to retain a complete lock like the current interpreter lock. It has to lock the global area for writing, and also stop all access to all lock spaces, since the thread local data might refer to any lock space.
This lock is implemented as a read/write lock, which normally is held permanently for reading by all threads. Only when a thread is waiting to acquire the compat interpreter lock is it released as each thread goes into check_threads().
This lock cannot wait for explicit lock space locks to be released. Thus it can override the assumption that a lock space is safe from tampering by holding a write lock on it. Still, it's only available from the C level (with the exception of _disable_threads) so the situation is not any different from the way the interpreter lock overrides Thread.Mutex today.
Issue: Function calls
A lock on an object is almost always necessary before calling a function in it. Therefore the central apply function (mega_apply) must ensure an appropriate lock is taken. Which kind of lock (read-safe/read-constant/write - see issue "Lock space lock semantics") depends on what the function wants to do. Therefore all object functions are extended with flags for this.
The best default is probably read-safe. Flags for no locking (for the few special cases where the implementations actually are completely lock-free) and for compat-interpreter-lock-locking should probably exist as well. A compat-interpreter-lock flag is also necessary for global functions that don't have a "this" object (aka efuns).
Having the required locking declared this way also alleviates each function from the burden of doing the locking to access the current storage, and it allows future compiler optimizations to minimize lock operations.
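For illustration, the flags could look something like this (hypothetical names):

  /* Locking required by a function, declared when it is added. The
     lock types follow issue "Lock space lock semantics". */
  #define FN_LOCK_READ_SAFE   0x01  /* default: consistent, may change */
  #define FN_LOCK_READ_CONST  0x02  /* consistent and constant */
  #define FN_LOCK_WRITE       0x04  /* exclusive access */
  #define FN_LOCK_NONE        0x08  /* implementation is lock-free */
  #define FN_LOCK_COMPAT      0x10  /* takes the compat interpreter lock */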
Issue: Exceptions
"Forgotten" locks after exceptions shouldn't be a problem: Explicit locks are handled just like today (i.e. it's up to the pike programmer), and implicit locks can safely be released when an exception is thrown.
One case requires attention: An old-style function that requires the compat interpreter lock might catch an error. In that case the error system has to ensure that lock is reacquired.
Issue: C module interface
A new add_function variant will probably be added for new-style functions. It takes bits for the flags discussed for issue "Function calls". New-style functions can only assume free access to the current storage according to those flags; everything else must be locked (through a new set of macros/functions).
Accessor functions for data types (e.g. add_shared_strings, mapping_lookup, and object_index_no_free) handle the necessary locking internally. They will only assume that the thing is safe, i.e. that the caller ensures the current thread controls at least one ref.
THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in new-style functions.
There will be new GC callbacks for walking module global pointers to things (see issue "Garbage collection and external references").
Issue: C module compatibility
Ref issue "Emulating the interpreter lock".
Ref issue "Garbage collection and external references".
Issue: Garbage collection and external references
The current gc design is that there is an initial "check" pass that determines external references by counting all internal references, and then for each thing subtract it from its refcount. If the result isn't zero then there are external references (e.g. from global C variables or from the C stack) and the thing is not garbage.
Since refcounting can be disabled in some objects (see issue "Refcounting and shared data"), this approach no longer works; the gc has to be changed to find external references some other way:
References from global C variables are few, so they can be dealt with by requiring C modules and the core parts to provide callbacks that let the gc walk through them (see issue "C module interface"). This is however not compatible with old C modules.
References from C stacks are common, and it is infeasible to require callbacks that keep track of them. The gc instead has to scan the C stacks for the threads and treat any aligned machine word containing an apparently valid pointer to a known thing as an external reference. This is the common approach used by standalone gc libraries that don't require application support. For reference, here is one such garbage collector, written in C++: http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/... Its source is here: http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/
The same approach is also necessary to cope with old C modules (see issue "C module compatibility"), but since global C level pointers are few, it might not be mandatory to get this working.
Btw, using this approach to find external refs should be considerably more efficient than the old "check" pass, even if C stacks are scanned wholesale.
Issue: Local garbage collection
Each thread periodically invokes a gc that only looks for garbage in the local data of that thread. This can naturally be done without disturbing the other threads. It follows that this gc also can be disabled on a per-thread basis. This is a reason for keeping thread local data in separate double-linked lists (see issue "Memory object structure").
Similarly, if gc statistics are added to each lock space, they could also be gc'd for internal garbage at appropriate times when they get write locked by some thread. That might be interesting since known cyclic structures could then be put in lock spaces of their own and be gc'd efficiently without a global gc. Note that a global gc is still required to clean up cycles with things in more than one lock space.
Issue: Global pike level caches
Issue: Thread.Queue
A lock-free implementation should be used. The things in the queue are typically disowned to allow them to become thread local in the reading thread.
Issue: False sharing
False sharing occurs when thread local things used frequently by different threads are next to each other so that they share the same cache line. Thus the cpu caches might force frequent resynchronization of the cache line even though there is no apparent hotspot problem on the C level.
This can be a problem in particular for all the block_alloc pools containing small structs. Using thread local pools is seldom a workable solution since most thread local structs might become shared later on.
One way to avoid it is to add padding (and alignment). Cache line sizes are usually 64 bytes or less (at least for Intel ia32). That should be small enough to make this viable in many cases.
FIXME: Check cache line sizes on the other important architectures.
Worth noting that the problem is greatest for the frequently changed ref counters at the start of each thing, so the most important thing is to keep ref counters separated. I.e. things larger than a cache line can probably be packed without padding.
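E.g. something like this sketch (gcc syntax, 64-byte lines assumed):

  #define CACHE_LINE 64

  /* Give the hot refcounter a line of its own and align the struct so
     two things never straddle the same line. */
  struct padded_thing {
    int refs;
    char pad[CACHE_LINE - sizeof(int)];
    /* ... the rest of the thing, changed less often ... */
  } __attribute__((aligned(CACHE_LINE)));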
Another way is to move things when they get shared, but that is pretty complicated and slow.
Issue: Malloc and block_alloc
Standard OS mallocs are usually locking. Bundling a lock-free one could be important. FIXME: Survey free implementations.
Block_alloc is a simple homebrew memory manager used in several different places to allocate fixed-size blocks. The block_alloc pools are often shared, so they must allow efficient concurrent access. With a modern malloc, it is possible that the need for block_alloc is gone, or perhaps the malloc lib has builtin support for fixed-size pools. Making a lock-free implementation is nontrivial, so the homebrew ought to be ditched in any case.
A problem with ditching block_alloc is that there is some code that walks through all allocated blocks in a pool, and also avoids garbage by freeing the whole pool altogether. FIXME: Investigate alternatives here.
See also issue "False sharing".
Issue: The compiler
Issue: Foreign thread visits
JVM threads..
Issue: Pike security system
It is possible that keeping the pike security system intact would complicate the implementation, and even if it was kept intact a lot of testing would be required before one can be confident that it really works (and there are currently very few tests for it in the test suite).
Also, the security system isn't used at all to my (mast's) knowledge, and it is not even compiled in by default (has to be enabled with a configure flag).
All this leads to the conclusion that it is easiest to ignore the security system altogether, and if possible leave it as it is with the option to get it working later.
Issue: Contention-free counters
There is probably a need for contention-free counters in several different areas. It should be possible to update them from several threads in parallel without synchronization. Querying the current count is always approximate since it can be changing simultaneously in other threads. However, the thread's own local count is always accurate.
They should be separated from the blocks they apply to, to avoid cache line invalidation of those blocks.
To accomplish that, a generic tool somewhat similar to block_alloc is created that allocates one or more counter blocks for each thread. In these blocks indexes are allocated, so a counter is defined by the same index into all the thread local counter blocks.
Each thread can then modify its own counters without locking, and it typically has its own counter blocks in the local cache while the corresponding main memory is marked invalid. To query a counter, a thread would need to read the blocks for all other threads.
This means that these counters are efficient for updates but less so for queries. However, since queries always are approximate, it is possible to cache them for some time (e.g. 1 ms). Each thread would need its own cache though, since the local count cannot be cached.
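A minimal sketch of the idea, with a fixed-size thread table instead of dynamically allocated counter blocks:

  #define MAX_THREADS  128
  #define MAX_COUNTERS 64
  #define CACHE_LINE   64

  /* One block per thread, aligned so no two threads' blocks share a
     cache line. A counter is the same index in every block. */
  struct counter_block {
    long counts[MAX_COUNTERS];
  } __attribute__((aligned(CACHE_LINE)));

  static struct counter_block blocks[MAX_THREADS];

  /* Update: touches only the calling thread's block - no locking. */
  static inline void counter_add(int my_thread, int counter, long delta)
  {
    blocks[my_thread].counts[counter] += delta;
  }

  /* Query: sums over all threads' blocks. Approximate, since other
     threads may be updating concurrently. */
  static long counter_query(int counter)
  {
    int t;
    long sum = 0;
    for (t = 0; t < MAX_THREADS; t++)
      sum += blocks[t].counts[counter];
    return sum;
  }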
It should be lock-free for allocating and freeing counters, and preferably also for starting and stopping threads (c.f. issue "Foreign thread visits"). In both cases the freeing steps represent a race problem - see issue "Hazard pointers". To free counters, the counter index would constitute the hazard pointer.
Issue: Lock-free hash table
A good lock-free hash table implementation is necessary. A promising one is http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html. It requires a CAS (Compare And Swap) instruction to work, but that shouldn't be a problem. The java implementation (http://sourceforge.net/projects/high-scale-lib) is Public Domain. In the comments there is talk about efforts to make a C version.
It supports (through putIfAbsent) the uniqueness requirement for strings, i.e. if several threads try to add the same string (at different addresses) then all will end up with the same string pointer afterwards.
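Reduced to a single hash bucket, the CAS race that gives string uniqueness could look like this sketch (the real table must of course also handle collisions and lock-free resizing):

  #include <stddef.h>

  struct pike_string;                     /* opaque here */
  struct bucket { struct pike_string *str; };

  /* All threads racing to insert equal strings end up returning the
     same pointer: the winner's copy. */
  struct pike_string *intern(struct bucket *b, struct pike_string *s)
  {
    for (;;) {
      struct pike_string *cur = b->str;
      if (cur) return cur;                /* already present, or lost */
      if (__sync_bool_compare_and_swap(&b->str, NULL, s))
        return s;                         /* won: s is the unique copy */
    }
  }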
The java implementation relies on the gc to free up the old hash tables after resize. We don't have that convenience, but the problem is still solvable; see issue "Hazard pointers".
Issue: Hazard pointers
A problem with most lock-free algorithms is how to know no other thread is accessing a block that is about to be freed. Another is the ABA problem which can occur when a block is freed and immediately allocated again (common for block_alloc).
Hazard pointers are a good way to solve these problems without leaving the blocks to the garbage collector (see http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf). So a generic hazard pointer tool is necessary.
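The core of the technique, sketched with gcc builtins:

  /* One hazard slot per thread; in reality the slots live in a shared
     structure so that a reclaiming thread can scan all of them. */
  static __thread void *volatile my_hazard;

  void *hazard_acquire(void *volatile *src)
  {
    void *p;
    do {
      p = *src;
      my_hazard = p;              /* publish before dereferencing */
      __sync_synchronize();       /* make the publish globally visible */
    } while (p != *src);          /* re-validate: still reachable */
    return p;
  }

  void hazard_release(void) { my_hazard = 0; }

  /* A thread that retires a block may only free it after scanning all
     threads' hazard slots and finding no reference to it. */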
Note however that a more difficult variant of the ABA problem still can occur when the block cannot be freed after leaving the data structure. (In the canonical example with a lock-free stack - see e.g. "ABA problem" in Wikipedia - consider the case when A is a thing that continues to live on and actually gets pushed back.) The only reliable way to cope with that is probably to use wrappers.
Issue: Thread local storage
Implementation would be considerably simpler if working TLS can be assumed on the C level, through the __thread keyword (or __declspec(thread) in Visual C++). A survey of the support for TLS in common compilers and OS'es is needed to decide whether this is a workable assumption (see the wrapper sketch after the list):
o GCC: __thread is supported. Source: Wikipedia. FIXME: Check from which version.
o Visual C++: __declspec(thread) is supported. Source: Wikipedia. FIXME: Check from which version.
o Intel C compiler: Support exists. Source: Wikipedia. FIXME: Check from which version.
o Sun C compiler: Support exists. Source: Wikipedia. FIXME: Check from which version.
o Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works for dynamic libs. C.f. http://people.redhat.com/drepper/tls.pdf. FIXME: Check from which version of glibc and kernel (if relevant).
o Windows (i386, x86_64): TLS is supported but does not always work in dll's loaded using LoadLibrary (which means all dynamic modules in pike). C.f. http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx. According to Wikipedia this is fixed in Vista and Server 2008 (FIXME: verify). In any case, TLS is still usable in the pike core.
o MacOS X: FIXME: Check this.
o Solaris: FIXME: Check this.
o *BSD: FIXME: Check this.
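The wrapper sketch referred to above - a single macro over the two keyword variants:

  #ifdef _MSC_VER
  #  define PIKE_TLS __declspec(thread)   /* Visual C++ */
  #else
  #  define PIKE_TLS __thread             /* gcc, icc, sun cc */
  #endif

  /* E.g. the held-locks array in issue "Lock space locking": */
  PIKE_TLS int num_held_locks;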
Issue: Platform specific primitives
Some low-level primitives, such as CAS and fences, are necessary to build the various lock-free tools. A third-party library would be useful.
An effort to make a standardized library is here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html (C level interface at the end). It apparently lacks implementation, though.
Required operations:
CAS(address, old_value, new_value)
  Compare-and-set: Atomically sets *address to new_value iff its current value is old_value. Needed for 32-bit variables, and on 64-bit systems also for 64-bit variables.

ATOMIC_INC(address)
ATOMIC_DEC(address)
  Increments/decrements *address atomically. Can be simulated with CAS. 32-bit version necessary, 64-bit version would be nice.

LFENCE()
  Load fence: All memory reads in the thread before this point are guaranteed to be done (i.e. be globally visible) before any following it.

SFENCE()
  Store fence: All memory writes in the thread before this point are guaranteed to be done before any following it.

MFENCE()
  Memory fence: Both load and store fence at the same time. (On many architectures this is implied by CAS etc, but we shouldn't assume that.)
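For example, on gcc (4.1 and later) the required operations could map onto the __sync builtins; the separate load/store fences have no __sync counterpart and would need per-architecture assembler:

  #define CAS(addr, old_val, new_val) \
    __sync_bool_compare_and_swap((addr), (old_val), (new_val))
  #define ATOMIC_INC(addr) __sync_add_and_fetch((addr), 1)
  #define ATOMIC_DEC(addr) __sync_sub_and_fetch((addr), 1)
  #define MFENCE()         __sync_synchronize()

  /* LFENCE/SFENCE: on x86 e.g.
     #define LFENCE() __asm__ __volatile__ ("lfence" ::: "memory")
     #define SFENCE() __asm__ __volatile__ ("sfence" ::: "memory") */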
The following operations are uncertain - still not known if they're useful and supported enough to be required, or if it's better to do without them:
CASW(address, old_value_low, old_value_high, new_value_low, new_value_high)
  A compare-and-set that works on a double pointer size area. Supported on more modern x86 and x86_64 processors (c.f. http://en.wikipedia.org/wiki/Compare-and-swap#Extensions).
FIXME: More..
Survey of platform support:
o Windows/Visual Studio: Got "Interlocked Variable Access": http://msdn.microsoft.com/en-us/library/ms684122.aspx
o FIXME: More..
Various links
Pragmatic nonblocking synchronization for real-time systems
http://www.usenix.org/publications/library/proceedings/usenix01/full_papers/...

DCAS is not a silver bullet for nonblocking algorithm design
http://portal.acm.org/citation.cfm?id=1007945
On Wed, Dec 24, 2008 at 2:50 AM, Martin Stjernholm, Roxen IS @ Pike developers forum 10353@lyskom.lysator.liu.se wrote:
Issue: Pike security system
[...]
Also, the security system isn't used at all to my (mast's) knowledge, and it is not even compiled in by default (has to be enabled with a configure flag).
All this leads to the conclusion that it is easiest to ignore the security system altogether, and if possible leave it as it is with the option to get it working later.
I do not use it currently, but I have always kept in mind that I could, and planned to eventually look at it and then shield user supplied plugins (once we start supporting them at all) etc. in ppp ("chat server", although it is more a framework and intends to transport anything, not only chat messages). I do not know anything specific about the security system at all, but I programmed LPC (LDMud-flavour) before resorting to Pike and therefore I assume that the Pike security system is somewhat like that, or at least that you can achieve the same goals with it. Of course, my impression may be terribly wrong. On the other hand, most of what I would need (or even: most of what can be done with the security system) can be achieved by using a CompilationHandler and a little getting-your-hands-dirty effort, at least that's what I suspect, so it is probably not a too big loss. I'd need to know more about it to give a well thought through opinion, but the first thing I heard about the Pike security system is that it is sparsely documented. Or is it?
I don't know much about the security system, and I can't imagine there's any decent amount of docs on it. I _believe_ it's made up of basically two parts: One is a set of security bits that can be set to block access to the OS (filesystem, sockets, and so on) in the C level glue, and the other is a system to block function calls between different zones. The latter is much like the lock spaces in my proposal, so I guess that part would probably be best reimplemented using special kinds of lock space locks in the future.
On the other hand, most of what I would need (or even: most of what can be done with the security system) can be achieved by using a CompilationHandler and a little getting-your-hands-dirty-effort, at least that's what I suspect and so it is probably not a too big loss.
Yes, I think that's a workable alternative, although in that case you'd be doing a lot of the work (lots of function wrappers etc) that the security system would be doing for you otherwise.
On Sat, Dec 27, 2008 at 09:00:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
On the other hand, most of what I would need (or even: most of what can be done with the security system) can be achieved by using a CompilationHandler and a little getting-your-hands-dirty-effort
Yes, I think that's a workable alternative, although in that case you'd be doing a lot of the work (lots of function wrappers etc) that the security system would be doing for you otherwise.
that kind of work is done in sTeam, so if someone needs an implementation :-)
greetings, martin.
Just a quick idea.
For refcounted things it is possible to have the counting partially delegated to a local scope from the shared scope. A shadow-thing is allocated in the local scope, increasing the shared counter and setting the local counter to 1. Any new references can be added to the shadow-thing without modifying the shared thing. O(n) comparison can be maintained if the item pointer (str for strings) rather than the svalue is used.
It's tricky to get that to work: When making a reference to the shared thing from another shared thing, the shadow has to be bypassed correctly. That adds more tests and different code paths in many places.
I'm investigating another somewhat similar approach that temporarily records refcount changes thread locally, and then combines it with thread local gc: http://www.cs.technion.ac.il/%7Eerez/Papers/refcount.pdf. For starters, afaics references from the stack are never counted; instead the stack is scanned. And a stack scanning gc appears to be necessary anyway. It's a long paper, but it looks promising.
I've studied that refcount paper and some other recent gc algorithms. http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2006/PHD/PHD-2006... combines a number of gc algorithms into something that's a whole lot more efficient than what pike currently uses. Some highlights:
o The reference counts aren't updated for references on the stack. The stacks are scanned when the gc runs instead. This saves many many updates, and it also simplifies C level programming a lot. Only refcounts between things on the heap are counted.
o The refcounts are only updated when the gc runs. This saves a lot of the remaining updates since if a pointer starts with value p_0 and then changes to p_1, then p_2, p_3, ..., and lastly to p_n at the next gc, then only p_0->refs needs to be decremented and p_n->refs needs to be incremented - all the refcount changes for the other things it has pointed to cancel out.
o The above is accomplished by thread local logging, to make the old p_0 value available to the gc at the next run (see the sketch after this list). This means it scales well with many cpu's.
o A generational gc uses refcounting only for old things in the heap. New things, which are typically very short-lived, aren't refcounted at all but instead gc'ed using a mark-and-sweep collector. This is shown to be more effective for short-lived data, and it handles cyclic structures without any extra effort.
o By using refcounting on old data, the gc only needs to give attention to refcounts that get down to zero. This means the heap can scale to any size without affecting the gc run time, as opposed to using a mark-and-sweep collector on the whole heap. The gc time scales only with the amount of _change_ in the heap.
o Cyclic structures in the old refcounted data are handled incrementally, using the fact that a cyclic structure can only occur when a refcounter is decremented to a value greater than zero. Those things can therefore be tracked and cycle checked in the background. The gc uses several different methods to weed out false alarms before doing actual cycle checks.
o The gc runs entirely in its own thread. It only needs to stop the working threads for a very short time to scan stacks etc, and they can be stopped one at a time.
o Since all the things that need gc attention are logged on change, there's no need for the double-linked lists.
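Here's a sketch of the write barrier that the thread local logging implies (all names hypothetical):

  struct thing;

  int slot_logged(struct thing **slot);   /* e.g. a per-slot dirty bit */
  void log_old_value(struct thing **slot, struct thing *old); /* TLS log */

  /* Replaces plain pointer assignment in heap things: the first change
     of a slot between two gc runs saves the old value (p_0); later
     changes (p_1 ... p_n) just overwrite and cancel out. */
  static inline void assign_heap_ptr(struct thing **slot, struct thing *val)
  {
    if (!slot_logged(slot))
      log_old_value(slot, *slot);
    *slot = val;
  }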
All the above taken together puts this gc in a whole different league compared to the one pike currently uses, so it'd be very nice indeed to replace the old one, and it would simplify a lot of things in the multi-cpu scenario. There are however some significant drawbacks:
o Noncyclic things are no longer destructed and freed immediately. In particular, it means code like this no longer would work:
void foo()
{
  Thread.MutexKey my_lock = my_mutex->lock();
  ... do some work ...
  // my_lock falls out of scope here when the function exits, so
  // the lock is released right away.
}
Mutex locks like the above would need something like a synchronized statement to work well (the locks must be known to the compiler so that they can be destructed properly also when exceptions are thrown - using explicit catch{} and rethrow everywhere would be too cumbersome).
There's also code that opens files and sockets etc and expects them to be automatically closed again through this method. (That practice has been shown to be bug prone, though, so in our sources many of those places have been fixed over time.)
Anyway, bottom line is that it would need pike level fixing. I think the compiler can be changed to produce warnings for most occurrences.
And there might be more immediate-destruct tricks out there too..
o The ubiquitous foo->refs would no longer be useful to anyone other than the gc. This means the _refs() efun becomes meaningless. Afaik it's used occasionally for some destruct tricks, but I think weak refs can be used instead in most cases.
o _next(), _prev() and next_object() would also be defunct. It's a minor problem - other methods to let pike code walk the heap can be added.
o The C module interface becomes more seriously incompatible.
For starters, all code that fiddles with refcounts on the stack becomes meaningless. I think most of it can be taken care of by changing the various macros, e.g. ref_push_string would just be an alias for push_string, and pop_stack would simply decrement Pike_sp.
More problematic is that all code that changes pointers in heap objects would have to use new macros/functions. Above all, this includes all C modules that fiddle with pointers in their current storage (e.g. through a THIS macro or similar). Changes would be trivial but widespread, and I can't think of any way to keep source level compatibility with old modules.
So what do you think?
Per and I discussed your initial writeup and from our side we fear that we would have to explicitly free() stuff to get acceptable memory performance on server software.
Memory is currently released very quickly when an object reaches 0 refs. In high-throughput applications (such as a loadbalancer with a few thousand requests/second) it's very critical that data doesn't stay around more than a few ms, or the memory usage will increase rather rapidly.
The current gc, at least, is _way_ too slow at freeing things for it to be a suitable solution - the loadbalancer in question is actually likely to run out of address space or run the gc very often.
Not to mention that it can take up to a second (or even several) for the gc to do its job at times, which is not really acceptable as long as it blocks the whole server, but that would be fixed with an asynchronous version.
The loadbalancer and download servers for Opera Mini actually had gc() disabled totally since it sometimes hung the service for 10+ seconds, long enough to cause things to time out, but that seems to work better in newer versions of pike, so it's now enabled again.
You shouldn't mix up the current gc with the issue of whether or not to free things immediately. The current gc is incredibly inefficient compared to any reasonably modern gc algorithm.
To begin with it's not mark-and-sweep, but check-mark-and-sweep, i.e. there's an entire extra pass over the whole heap just to determine the external refs.
Aside from that, a decent mark-and-sweep gc doesn't go over the whole heap like the naive implementation in Pike does. Rather a tri-color scheme is used to avoid data that hasn't been touched since the previous gc run (see http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)#Basic_alg...).
So in any case, we should toss the current gc. It's just dog slow. It's a museum piece.
Using refcounts to free things immediately actually does not improve performance. The research paper shows that delaying the gc a bit can improve performance by as much as 40% in high-concurrency server apps. All that refcount twiddling to get immediate frees is a sizable performance hog on short-lived data; mark-and-sweep on it (and _only_ on that new data) is a lot more efficient. The gc would run much more often, of course, but it would do _way_ less work in each run. Remember that the gc work would only be proportional to the amount of change in the heap, not the size of it.
Aside from that, there are also arguments that freeing many things at the same time lessens fragmentation and improves locality.
The thing is, at least opera mini and most webservers are not really highly concurrent inside a single process; instead multiple processes (2-4 per core, say) are run.
So from the point of view of the server it's actually rather singlethreaded.
That's of course partly because it's harder to write highly efficient perfectly scaling multithreaded applications. Especially in pike. As in, in pike/perl/python it's currently impossible, in C/java/C# it's very hard.
But consider the 32-cores/128 threads case, basically any shared data can cause a rather severe scalability problem, since any mutex can really mess things up if 127 threads are waiting for it.
And even if we are not waiting, simply reading the mutex and the cache-coherency traffic needed for that can slow things down.
This means that you can often just as well use multiple processes with a SHM chunk (or simply use some kind of RPC).
I guess what I am saying is that there should not be a _too_ large penalty for the (rather common) more-or-less-singlethreaded case.
Yes, that the threads in both our server cases are mostly independent is a key factor why a multi-cpu capable pike is a worthwhile improvement.
Even in these cases there are clear benefits by offloading the freeing to another thread (which presumably can run on another cpu most of the time). Consider the work needed to handle a thing allocated on the heap:
1. Allocate the thing.
2. Write references to it on the stack and in other things.
3. Increment and decrement its refcounter.
4. Free it.
With mark-and-sweep gc, item 3 disappears completely and 4 is offloaded to a different cpu. This obviously saves time, even if the thing is completely thread local.
But consider the 32-cores/128 threads case, basically any shared data can cause a rather severe scalability problem, since any mutex can really mess things up if 127 threads are waiting for it.
Yes, that's why I consider any such lock unacceptable, and that's why lock-free (and preferably also memory fence free) algorithms are so sexy. The pike core must not have any lock or other hotspot that every thread is forced to access with any significant frequency. There might be such things in the OS (which we can't do anything about), and there might be such things in the pike app, but that's up to the pike app to solve - the core can only provide adequate tools to do it.
Another issue is that refcounts are currently used to detect when it is safe to perform destructive operations.
Good point.
I was thinking of a way to keep the immediate-destruct semantic: Threads could do a "micro gc" on their own thread local data on each evaluator callback call (i.e. a bit like the current destruct_objects_to_destruct calls). These micro gc's would have to run very quickly though, which probably rules out the generational gc approach with mark-and-sweep for young data (refcount-based garbing remains basically equally efficient regardless of how often it runs, while mark-and-sweep does not).
This would mean that the immediate-destruct semantic works as long as the data is thread local, which is true in the mutex-lock-on-the-stack scenario. It's also true in most cases when e.g. arrays are built on the stack using +=. However, cases like
my_map[key] += ({another_value});
would not be destructive on the array values if my_map is shared. But that case is hopeless anyway in a multi-cpu world since the array value always can get refs from other threads asynchronously (prohibiting that would require locking that would be much worse).
To allow destructive updates in such cases, it'd be necessary to introduce some kind of new construct so that the pike programmer explicitly can allow (but not always expect) destructive updates regardless of extra refs.
However, ditching mark-and-sweep for young data comes at a cost. The paper I linked to has measured that a purely refcounting collector is 20-30% slower when the number of concurrent threads gets above 3 on a 4 cpu box (see pages 80-81). This slowdown is measured over the total throughput of a benchmark, so it's not just "the gc itself".
Note that this is a comparison between two gc's where the only difference is the mark-and-sweep for young data - the purely refcounting collector in this case is still a whole lot more efficient than the current one in pike due to the drastically lowered refcount update frequency. I haven't seen any comparisons between the delayed-update refcount gc and an immediate-update like pike currently uses, but I suspect that the difference is substantial there already.
So the choice we're considering here is either keeping the immediate-destruct semantic and some single-ref destructive optimizations, at a cost of (conservatively speaking) at least 15% overall performance in high-concurrency server apps, or giving them up for that performance. I don't think the single-ref destructive optimizations can weigh up that performance hit (and in the longer run they can be achieved anyway with new language constructs). Still, keeping the immediate-destruct semantic is worth something from a compatibility view.
Turns out the single-ref-destructive-update optimizations are more difficult to accomplish, even with a micro-gc. The problem is that refs from the stacks aren't counted.
To explain, let's start with the case where we want to keep the optimization pike currently does:
array a = ({0});
for (int i = 1; i < 1000; i++)
  a += ({i});
(To make the example simpler, I start out with the array ({0}) instead of the empty array ({}) since the empty array always is shared.) Here pike can be destructive on the array in a since there are no other refs to it, and hence it can overallocate it and append new elements to the end efficiently.
With the micro-gc, we can tell in `+= that there are no refs to the array from anywhere but the stack. Now consider this case:
array a = ({0});
array b = a;
for (int i = 1; i < 1000; i++)
  a += ({i});
After this, one of course expects a = ({0, 1, 2, ...}) and b = ({0}). But since we don't count refs from the stack then we don't see the second ref to the array and might therefore think it's safe to be destructive, but that would make b = ({0, 1, 2, ...}) too.
So to use the single-ref-destructive-update optimizations we have to continue to count refs from the stacks. Or at least the pike svalue stack to cope with the situation above, but there might be situations when the same occurs with refs from the C stack too.
This is a problem, because I'm convinced there is a _big_ performance boost in not counting stack refs. So I wonder, is there a way to cope?
One alternative is to extend the array type to explicitly allow adding elements to the end destructively. It could perhaps look like this:
a[sizeof (a)] = i;
I.e. an indexed assignment to one step past the end. The rationale is that []= already is destructive, and this is the same sort of assignment, even though a new slot has to be created in the process. It doesn't imply that arbitrary elements past the end or before the beginning can be created the same way.
I'd like an extension like that anyway, because destructive appends are useful regardless of the amount of refs (and whether the array is thread local or not). It's also nice to be explicit about it rather than relying on an optimization.
The problem is of course that old pike code that previously was O(n) suddenly becomes O(n^2) until it's updated. It doesn't misbehave though.
Of course, there are many more single-ref-destructive-update optimizations, but this is by far the most common for which there currently is no way to be explicitly destructive.
Btw, I've updated lots of things in the design document according to the last week's discussions (here and IRL). The latest version is always in multi-cpu.txt in srb's pikex git repository:
git clone git://pike-git.lysator.liu.se/pikex
I can repost the major changes here for those who don't want to use git, but there are a couple more things I'd like to add first.
On Tue, Jan 20, 2009 at 11:15:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
array a = ({0}); for (int i = 1; i < 1000; i++) a += ({i}); (To make the example simpler, I start out with the array ({0}) instead of the empty array ({}) since the empty array always is shared.)
is that actually making code faster or is it only to make the logic of the problem you are discussing simpler?
One alternative is to extend the array type to explicitly allow adding elements to the end destructively. It could perhaps look like this: a[sizeof (a)] = i;
what about possible code that expects this operation to fail?
mixed error = catch { a[x] = i; };
if (error) write("ooops, we reached the end\n");
greetings, martin.
On Tue, Jan 20, 2009 at 11:15:02PM +0000, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
array a = ({0}); for (int i = 1; i < 1000; i++) a += ({i}); (To make the example simpler, I start out with the array ({0}) instead of the empty array ({}) since the empty array always is shared.)
is that actually making code faster or is it only to make the logic of the problem you are discussing simpler?
Here it is simply to point out the exact problem.
One alternative is to extend the array type to explicitly allow adding elements to the end destructively. It could perhaps look like this: a[sizeof (a)] = i;
what about possible code that expects this operation to fail?
mixed error = catch{ a[x] = i; }; if (error) write("ooops, we reached the end\n");
Imho, that is an academic problem. I've never seen code like that in a real program. But if it's really something we want to solve, it could probably be handled by requiring a new/special syntax for adding elements destructively at the end of an array.
is that actually making code faster or is it only to make the logic of the problem you are discussing simpler?
The latter. Just to avoid complicating the example with an irrelevant special case.
what about possible code that expects this operation to fail?
mixed error = catch{ a[x] = i; }; if (error) write("ooops, we reached the end\n");
Yes, that'd be a compatibility problem too, strictly speaking. I don't think it's very significant, so if #pike 7.8 solves it, that'd be enough imho. Do you expect it to be a real problem?
Anyway, Per pointed out in a personal response another much more prevalent case, namely string concatenation:
s += "foo";
Considering this, it's almost a requirement to figure out a way to keep the single-ref-destructive-update optimizations for thread local things referenced from the stack.
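To spell out why, with a made-up but typical loop: without the optimization, every iteration below copies the whole string built so far, turning the loop from O(n) into O(n^2).

string s = "";
for (int i = 0; i < 1000; i++)
  s += "foo";  // stays O(n) overall only if += may extend s in place
               // when s has no other refs; otherwise each += copies s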
I'm toying with an idea to introduce a single bit for synchronous refcounting: Consider a flag MULTI_REF which is set when a thing gets its second ref (regardless of whether that ref is from the stack or elsewhere). Single-ref-destructive-update optimizations would then only be done if it's cleared. It'd only be possible to clear it from the gc, and it wouldn't be applicable if the thing is shared.
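A rough sketch of how that might look on the C level - all names here are made up for illustration and are not actual Pike internals:

#define MULTI_REF 0x01  /* set on the second ref, cleared only by the gc */

struct thing {
  unsigned flags;
  /* ... type, data, etc ... */
};

/* Called wherever a pointer to t is copied, whether to the stack or
 * to the heap. Cheap: no refcount update, just an idempotent bit set. */
static inline void note_extra_ref (struct thing *t)
{
  t->flags |= MULTI_REF;
}

/* E.g. `+=' could check this (together with a thread local check,
 * omitted here) to decide whether a destructive update is safe. */
static inline int single_ref_p (struct thing *t)
{
  return !(t->flags & MULTI_REF);
}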
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
o Noncyclic things are no longer destructed and freed immediately. In particular, it means code like this no longer would work:
void foo()
{
  Thread.MutexKey my_lock = my_mutex->lock();
  ... do some work ...
  // my_lock falls out of scope here when the function exits, so
  // the lock is released right away.
}
Why would the actual freeing of memory need to be synchronous with the running of the destructors? Ideally I'd expect the destructors to be called just like in C++: whenever the variables go out of scope. The actual freeing of the space to return it to the heap can be delayed until the gc finds it.
This can be solved at the compiler level, without changes in Pike applications. It also supports the principle of least surprise, which is always a good thing.
This would immediately get rid of all the dangling file descriptors everywhere.
Why would the actual freeing of memory need to be synchronous with the running of the destructors?
It doesn't, but with the proposed gc the pike runtime wouldn't have any earlier chance to discover the garbage data.
Ideally I'd expect the destructors to be called just like in C++: whenever the variables go out of scope.
The problem is more complicated in Pike since everything is allocated on the heap. In C and C++ you can allocate a struct on the stack, so it's naturally freed when the stack is popped (with the risk of dangling pointers instead). There's no way to do that in Pike - the mutex lock is allocated on the heap and only a reference to it is on the stack.
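A minimal C illustration of the difference:

/* d lives in foo's stack frame, so its storage is reclaimed
 * automatically when foo returns - no gc involved, but a pointer
 * kept past the return would dangle. */
struct data { int x; };

static void use (struct data *d) { d->x = 1; }

void foo (void)
{
  struct data d;  /* allocated in the stack frame, not on the heap */
  use (&d);
}                 /* d's storage is gone here */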
This can be solved at the compiler level, without changes in Pike applications. It also supports the principle of least surprise, which is always a good thing.
In very trivial cases the compiler can determine that the object doesn't get refs from elsewhere, e.g. (note that x itself is never passed to anything here):
void foo() { Foo x = Foo(); bar (y); }
However, if you do just
void foo() { Foo x = Foo(); bar (x); }
or
void foo() { Foo x = Foo(); x->beep(); }
then the compiler can no longer know if x has grown more references during the bar or beep calls. For that to work, the compiler would have to analyze what every function does with this_object() and all its arguments. It's probably solvable, but it would require a fair bit of work in the compiler, and every C level function would have to declare whether or not it adds external refs to the current storage and every argument.
Martin Stjernholm, Roxen IS @ Pike developers forum wrote:
In very trivial cases the compiler can determine that the object doesn't get refs from elsewhere, e.g:
void foo() { Foo x = Foo(); bar (y);
However, if you do just
void foo() { Foo x = Foo(); bar (x);
then the compiler can no longer know if x has grown more references during the bar or beep calls. For that to work, the compiler would
I see. You're right, of course. In that case I'd indeed vote for some kind of micro-gc at the end of every scope which tries to gc all (and only) the locally declared variables that are leaving scope. It would make the invocations of the destructors more predictable and timely, and might help keep memory usage within bounds in rapid allocation/release cycles.
Well, as I've said elsewhere, the immediate-destructs probably come at a considerable price (compared to the more efficient alternative, not compared to the current implementation). It isn't possible to combine a micro-gc of only the things referenced by the stack with the mark-and-sweep approach - the point of mark-and-sweep is precisely to save the work of continuously tracking the refs to things.
Still, fixing the generational and mark-and-sweep bits would be additional work anyway, so I guess we could start with only the delayed-update refcount gc and set this issue aside until all the other things start to work a bit.
Anyway, I'd like to put more attention on the other serious issue, namely the C module interface compatibility problem. Basically every line that does something like this:
THIS->my_thing = some_thing;
where some_thing is a pointer to a string, array, or whatever, would have to be changed into some kind of macro/function call:
set_ptr (THIS, my_thing, some_thing);
This is because the key to making the delayed refcount updates and the on-the-fly gc operation work is to intercept every pointer change in the heap and do the occasionally required logging of the old value.
I can't see any way around that. Thus all existing C modules would be pretty thoroughly hosed until someone goes through them. That hurts, but otoh keeping the current refcount handling would be a very serious problem too.
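For concreteness, roughly what such a macro might expand to - gc_note_old_ptr is a made-up name, and the details of when logging actually has to happen are glossed over:

/* Write barrier sketch for the delayed refcount gc: log the pointer
 * being overwritten so the gc can later reconstruct the ref changes,
 * then do the assignment. */
void gc_note_old_ptr (void *old);

#define set_ptr(OBJ, FIELD, VAL) do {     \
    gc_note_old_ptr ((OBJ)->FIELD);       \
    (OBJ)->FIELD = (VAL);                 \
  } while (0)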
/.../ so I guess we could start with only the delayed-update refcount gc and set this issue aside until all the other things start to work a bit.
Realized that it unfortunately isn't that easy - scanning the stacks on each function exit would still be required, and that'd probably be too slow.