New ICI stuff, March 2002

Well, I had a week off to work on ICI. Here is a summary of the changes. Unfortunately, at time of writing I still have some testing to do, and there is some porting work to do, and I haven't checked it in. Check-in will be fairly soon. Beware that the head of development will be unproven for a while. Comments to the ICI mail list welcome of course.

Highlights

I have moved the major version number to 4, as external C interfaces will (at least) require re-compilation. Possibly some small changes as well.
 

Multi-threading

I've added native OS based multi-threading to ICI. This was tricky to add without slowing things down, but I managed it. The model is very simple, and treats all ICI data and objects as a single shared resource. So ICI-bound programs will happily multi-thread, but not take advantage of multiple processors. However, time spent in I/O or in suitably heavy computing tasks outside of ICI is truly independent. It has minimal impact on the C side of things.

The language support for this is:

    exec = thread(callable, args...)
starts a new thread that calls callable (e.g. a function or method), passing it args. Returns an execution context. The thread runs until callable returns.
    critsect statement
executes statement indivisibly with respect to other threads. E.g.:
    critsect ++shared_counter;
    critsect
    {
        item = shared_list;
        shared_list = item.next;
    }
More importantly:
    waitfor (wait-condition; wait-object) statement
waits till wait-condition is true, sleeping until wait-object is woken up before each re-test of the condition. Once wait-condition is true, statement is executed. All of this is done under a critical section except for the actual sleep on the wait-object. For example, suppose jobs is an array to which things to do get added occasionally in some other thread.
    waitfor (nels(jobs) > 0; jobs)
        job = rpop(jobs);             /* rpop? see below. */
Some other thread might have code:
    push(jobs, job);
    wakeup(jobs);
The wakeup function wakes up all threads waiting on the given object, which makes them re-evaluate their wait condition. The object can be anything: an integer, a string, an array (as in this example). For example, a wakeup is done on the execution context object of a thread when it exits.
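
For readers who think in C threading terms, the waitfor / wakeup pair reads much like a condition-variable loop under a mutex. The following is only a rough analogy using POSIX threads, not a description of how ICI implements it. The names job_queue, queue_init, queue_put and queue_take are made up for this sketch.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_JOBS 16

/* Hypothetical job queue for illustration; not an ICI structure. */
struct job_queue {
    pthread_mutex_t lock;           /* plays the role of the critical section */
    pthread_cond_t  wake;           /* plays the role of the wait-object */
    int             jobs[MAX_JOBS];
    int             njobs;
};

void queue_init(struct job_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->wake, NULL);
    q->njobs = 0;
}

/* Roughly: push(jobs, job); wakeup(jobs);  Returns 1 if the job was added. */
int queue_put(struct job_queue *q, int job)
{
    int ok = 0;

    pthread_mutex_lock(&q->lock);
    if (q->njobs < MAX_JOBS) {
        q->jobs[q->njobs++] = job;
        ok = 1;
    }
    /* Wake every waiter so each one re-tests its own condition. */
    pthread_cond_broadcast(&q->wake);
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Roughly: waitfor (nels(jobs) > 0; jobs) job = rpop(jobs); */
int queue_take(struct job_queue *q)
{
    int job, i;

    pthread_mutex_lock(&q->lock);
    while (q->njobs == 0)
        pthread_cond_wait(&q->wake, &q->lock);  /* lock is dropped only while asleep */
    assert(q->njobs > 0);
    job = q->jobs[0];                           /* take from the front */
    for (i = 1; i < q->njobs; ++i)              /* simple shift; fine for a sketch */
        q->jobs[i - 1] = q->jobs[i];
    --q->njobs;
    pthread_mutex_unlock(&q->lock);
    return job;
}
```

As with waitfor, the condition is re-tested after every wakeup, so a spurious or unrelated wakeup just sends the waiter back to sleep.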

It was tricky to add multiple execution contexts to the execution engine without slowing things down. Adding a single indirection in the top-of-stack references added 10..20% to the execution time of some programs. In the end I devised a method that didn't involve any additional indirection. I also reduced the use of macros in this area (which I think makes it clearer and easier to debug).

On the C side there is not much impact. Current C code should not notice any difference - it will just run with the global mutex taken and be indivisible with respect to other ICI threads. You can call ici_leave() to release the mutex. It gives you a pointer that you pass to ici_enter() to re-acquire the mutex. For example:

    {
        exec_t      *x;

        x = ici_leave();
        ...read file or something...
        ici_enter(x);
    }
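
To make the leave/enter idea concrete, here is a minimal sketch of a single global lock with a "current execution context" that is handed back on leave and restored on enter. It is an assumption about the general shape only. The names demo_exec, demo_start, demo_leave and demo_enter are hypothetical and are not ICI's functions or types.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an execution context. */
struct demo_exec { int id; };

static pthread_mutex_t   demo_big_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_exec *demo_current = NULL;   /* meaningful only with the lock held */

/* A thread starts running interpreter code: take the lock, become current. */
void demo_start(struct demo_exec *x)
{
    pthread_mutex_lock(&demo_big_lock);
    demo_current = x;
}

/* Give up the lock before blocking work; the caller keeps the returned context. */
struct demo_exec *demo_leave(void)
{
    struct demo_exec *x = demo_current;

    assert(x != NULL);          /* caller must currently hold the lock */
    demo_current = NULL;
    pthread_mutex_unlock(&demo_big_lock);
    return x;
}

/* Re-take the lock and restore the caller's context as current. */
void demo_enter(struct demo_exec *x)
{
    pthread_mutex_lock(&demo_big_lock);
    demo_current = x;
}
```

Between demo_leave() and demo_enter() the thread must not touch any interpreter data, since another thread may be running with the lock held.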

Arrays are now queues, not just stacks

Arrays can now be efficiently pushed and popped at the front as well as the end. The new functions rpush() and rpop() achieve this. I've always felt the lack of a good list type, but after coding an "ordered set", I realised that it would be much more efficient and probably even more useful if arrays became growable circular buffers, rather than just growable stacks as they have been before now. But anyone who has looked at the parser/compiler/execution engine will realise that arrays-as-stacks are critical to performance, so I couldn't afford to introduce any overhead.

To solve this, internally arrays conceptually have two forms: pure stacks, and the general case of being a queue. As long as no rpush or rpop operations have been done, they have the same internal operation they always did. All the important stacks used internally have this property. Once an rpush or rpop has been done, an extra pointer (newly introduced into arrays) may differ from the base, and the array is now a circular buffer. In the general case you have to use new rules for accessing the contents of arrays from C. This was probably the hardest change to make. Arrays are used everywhere.
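
The circular-buffer idea can be sketched in a few lines of C. This is a simplified illustration over plain ints, not ICI's array implementation. The struct ring and the ring_* functions are hypothetical names. Note that as long as nothing is added or removed at the front, head stays at 0 and the buffer behaves exactly like a simple stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable circular buffer of ints. */
struct ring {
    int    *base;   /* start of the allocation */
    size_t  cap;    /* slots allocated */
    size_t  head;   /* index of the first element; 0 while used as a pure stack */
    size_t  len;    /* elements in use */
};

void ring_init(struct ring *r) { r->base = NULL; r->cap = r->head = r->len = 0; }
void ring_free(struct ring *r) { free(r->base); ring_init(r); }

/* Make room for one more element, unwrapping the contents into a new block. */
static int ring_grow(struct ring *r)
{
    size_t ncap, i;
    int *n;

    if (r->len < r->cap)
        return 0;
    ncap = r->cap ? r->cap * 2 : 4;
    if ((n = malloc(ncap * sizeof *n)) == NULL)
        return -1;
    for (i = 0; i < r->len; ++i)
        n[i] = r->base[(r->head + i) % r->cap];
    free(r->base);
    r->base = n;
    r->cap = ncap;
    r->head = 0;
    return 0;
}

/* Add at the end (like push). Returns 0 on success, -1 on failure. */
int ring_push(struct ring *r, int v)
{
    if (ring_grow(r) != 0) return -1;
    r->base[(r->head + r->len) % r->cap] = v;
    ++r->len;
    return 0;
}

/* Remove from the end (like pop). Returns -1 if empty. */
int ring_pop(struct ring *r, int *out)
{
    if (r->len == 0) return -1;
    --r->len;
    *out = r->base[(r->head + r->len) % r->cap];
    return 0;
}

/* Add at the front (like rpush). */
int ring_rpush(struct ring *r, int v)
{
    if (ring_grow(r) != 0) return -1;
    r->head = (r->head + r->cap - 1) % r->cap;   /* step back, wrapping */
    r->base[r->head] = v;
    ++r->len;
    return 0;
}

/* Remove from the front (like rpop). Returns -1 if empty. */
int ring_rpop(struct ring *r, int *out)
{
    if (r->len == 0) return -1;
    *out = r->base[r->head];
    r->head = (r->head + 1) % r->cap;
    --r->len;
    return 0;
}
```

The cost of generality shows up in the index arithmetic: every access in the general case needs the wrap, which is why code that cannot know an array's history has to go through accessors rather than raw pointers.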

rpush() and rpop() seem like a small additional feature. But after using it just a few times I think it was worth it. Lots of things become easier to do.

Documentation upgrade

I have given the documentation a major upgrade. I'm a long way from finished, but it's heaps better than it was. On the understanding that you realise this is still work-in-progress, and various chapters just finish in the middle, I've put a snap-shot at http://www.zeta.org.au/~timl/ici.pdf . Chapter 3 - Language Reference and chapter 5 - Core language functions, are almost finished.

Allowing more than just structs to have supers

This has no visible effect on the language. But it means that in C you can make objects that have supers. This means you can make intrinsic OO objects that you can subclass and everything in complete generality.

Object header reduced from 8 bytes to 4

I have reduced the universal header on objects from 2 x 32 bit words to 1 x 32 bit word. This sounds like a major change, but it was actually not too bad. The net change on execution time is neutral or slightly improved. Basically I eliminated the type pointer. To get the type pointer now, it uses a small integer to index an array of pointers. The down-side is you have to call a function to register a new type to get your small int (which you place in the object header when you make a new object of that type). This in itself is not a big percentage improvement on memory use, but when combined with the next thing...
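
The type-code scheme described above can be sketched as follows. This is an illustration under assumptions, not the real ICI header: the field layout and the names demo_type, demo_object, demo_register_type and demo_typeof are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_TYPES 64

/* Hypothetical type descriptor. */
struct demo_type { const char *name; };

/* Hypothetical one-word object header; the field layout is assumed. */
struct demo_object {
    unsigned char  tcode;   /* small int index into demo_types[] */
    unsigned char  flags;
    unsigned short nrefs;
};

static struct demo_type *demo_types[DEMO_MAX_TYPES];
static int               demo_ntypes = 0;

/* Register a type once; returns its small int code, or -1 if the table is full. */
int demo_register_type(struct demo_type *t)
{
    if (demo_ntypes >= DEMO_MAX_TYPES)
        return -1;
    demo_types[demo_ntypes] = t;
    return demo_ntypes++;
}

/* Recover the type pointer from the header by indexing the table. */
struct demo_type *demo_typeof(const struct demo_object *o)
{
    assert(o->tcode < demo_ntypes);
    return demo_types[o->tcode];
}
```

A module would call the register function once at load time, remember the returned code, and store it in the header of each object it creates.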

Dense allocation of small objects

ICI has long maintained the principle that each object is independently malloced from the underlying system. But the overheads on this are, I decided, unjustifiable on hosted environments. The new scheme allocates chunks (currently 1024 bytes) which are dense arrays of small objects. There are no boundary words. So ints now consume just 8 bytes of memory each (previously 24+ bytes in real terms). The technique for eliminating boundary words, and keeping the fast free lists, is to have alloc/free routines where the caller is required to tell the free how much memory it asked for on the alloc. Thus we have:
    x = ici_talloc(type);
    ...
    ici_tfree(x, type);
and
    x = ici_nalloc(size);
    ...
    ici_nfree(x, size);
98% of the time this is really easy because the place you free the data knows exactly how big it is. For the occasions where this is not convenient, you can use the complete malloc/free equivalent:
    x = ici_alloc(size);
    ...
    ici_free(x);
Unfortunately, in making this change I have lost all that beautiful debug support that was put into the old allocator. I might go back and try to retro-fit it sometime.

The net effect of this is an improvement in CPU time and memory usage. The improved CPU time comes mostly (I think) from a large reduction in total memory bandwidth into the processor cache, especially on garbage collection. (Before, objects were so big they would just about fill a whole cache line by themselves. Now a whole bunch come in together.) A "small" object is one less than or equal to 64 bytes. Things over 64 bytes go straight to malloc without further memory overhead.
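
Here is a minimal sketch of the sized alloc/free technique, to show why telling the free routine the size removes the need for boundary words. It is not the ICI allocator. The 1024 byte chunk and 64 byte small-object limit come from the text above, but the 8 byte rounding grain and the names demo_nalloc and demo_nfree are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK_SIZE 1024     /* chunk size mentioned in the text */
#define SMALL_MAX  64       /* "small" object limit mentioned in the text */
#define GRAIN      8        /* assumed rounding grain; must hold a pointer */

static void  *free_lists[SMALL_MAX / GRAIN + 1];   /* one list per rounded size */
static char  *chunk_pos = NULL;                    /* next unused byte in current chunk */
static size_t chunk_left = 0;

void *demo_nalloc(size_t size)
{
    size_t slot, rounded;
    void *p;

    if (size == 0)
        size = 1;
    if (size > SMALL_MAX)
        return malloc(size);            /* big things go straight to malloc */
    slot = (size + GRAIN - 1) / GRAIN;
    rounded = slot * GRAIN;
    if ((p = free_lists[slot]) != NULL) {
        free_lists[slot] = *(void **)p; /* pop the fast free list */
        return p;
    }
    if (chunk_left < rounded) {
        /* Start a new chunk; the tail of the old one is abandoned in this sketch. */
        if ((chunk_pos = malloc(CHUNK_SIZE)) == NULL) {
            chunk_left = 0;
            return NULL;
        }
        chunk_left = CHUNK_SIZE;
    }
    p = chunk_pos;                      /* carve densely: no header, no trailer */
    chunk_pos += rounded;
    chunk_left -= rounded;
    return p;
}

void demo_nfree(void *p, size_t size)
{
    size_t slot;

    if (p == NULL)
        return;
    if (size == 0)
        size = 1;
    if (size > SMALL_MAX) {
        free(p);
        return;
    }
    /* The caller-supplied size picks the list; nothing is stored beside the object. */
    slot = (size + GRAIN - 1) / GRAIN;
    *(void **)p = free_lists[slot];
    free_lists[slot] = p;
}
```

Because freed small objects only ever go back onto a free list, chunks in this sketch are never returned to malloc, which mirrors the down-side discussed in the text.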

The down-side of this change (and the reason I didn't do it years ago) is that the dense allocation can't be freed until you shut the interpreter down with ici_uninit(). But I figured that in most applications you want ICI to go faster and have lower peak memory usage, rather than have it reduce malloc heap usage between tasks.

Improving the coherence of the struct-lookup lookaside mechanism

The struct-lookup lookaside mechanism is the process whereby strings record by-pass information to avoid actual searching of struct hash tables and chasing up the super chain. There is a universal serial number that can be incremented to invalidate all such by-pass records. It is now incremented a lot less often, which means the lookaside works more often, which makes things faster.
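
The general shape of a version-stamped lookaside can be sketched like this. It is an illustration only. The change log names the real serial number ici_vsver, but everything here (demo_vsver, demo_key, demo_scope, demo_lookup, and exactly what is cached) is a hypothetical simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global serial number; bumping it invalidates every cache entry. */
static unsigned long demo_vsver = 1;
static int           demo_slow_lookups = 0;     /* counts real searches, for demonstration */

/* Hypothetical scope: a tiny name table with a super link. */
struct demo_scope {
    const char        *names[4];
    int                values[4];
    int                n;
    struct demo_scope *super;
};

/* Hypothetical key (stands in for a string object) carrying the by-pass record. */
struct demo_key {
    const char              *name;
    unsigned long            cached_ver;    /* demo_vsver when the record was filled; 0 = never */
    const struct demo_scope *cached_scope;  /* scope the lookup started from */
    int                     *cached_slot;   /* where the value was found */
};

/* The expensive path: search this scope, then chase up the super chain. */
static int *slow_lookup(struct demo_scope *s, const char *name)
{
    int i;

    ++demo_slow_lookups;
    for (; s != NULL; s = s->super)
        for (i = 0; i < s->n; ++i)
            if (strcmp(s->names[i], name) == 0)
                return &s->values[i];
    return NULL;
}

/* Use the by-pass record if it is current for this scope, else refill it. */
int *demo_lookup(struct demo_scope *s, struct demo_key *k)
{
    if (k->cached_ver == demo_vsver && k->cached_scope == s)
        return k->cached_slot;
    k->cached_slot = slow_lookup(s, k->name);
    k->cached_scope = s;
    k->cached_ver = demo_vsver;
    return k->cached_slot;
}

/* Anything that might make a record stale bumps the serial number. */
void demo_invalidate_all(void)
{
    ++demo_vsver;
}
```

The fewer operations that need to call the invalidate routine, the more often the fast path is taken, which is the improvement described above.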

Other speed-ups

There are a few other speed-ups that are a bit too technical to go into here. Calls from C to ICI are faster.
 

Change log

Here is the change log so far...
*       Added an ici_pcre() function to avoid exposing internals
        of PCRE in the ici.h include file.

*       Changed the definition of the struct lookup look-aside cache
        stored in strings. It used to apply only to variables. But
        now it applies to all struct lookups. That meant it could be
        cleaned up and the number of times it is invalidated (by
        incrementing ici_vsver) greatly reduced. This makes a good
        improvement in execution speed.

*       Generalised the super mechanism. Objects that want to support
        a super (still only structs in the core language) use the
        new type objwsup_t (object-with-super) instead of object_t
        as their header. This includes the super pointer. They must
        then also set the O_SUPER flag in their header. They must also
        support some extra fetch/assign functions.
        
        There are quite a few places where struct_t types became
        objwsup_t types as a consequence of this.

*       Removed the version number from the naming of auto-loading
        ICI modules (but not native code modules). Thus, for example,
        the version 3 startup file was called:
        
           ici3core.ici
           
        but it will now be called
        
           icicore.ici
        
        I think the ICI language (as opposed to its internal APIs)
        is sufficiently stable that it is not really required.
        I found it an unnecessary inconvenience.

*       Changed chkbuf() to ici_chkbuf().

*       Added a new basic type "handle". This is not accessible from the
        core language, but C code can use it to return generic references
        to C data objects. It supports a super pointer, so C code
        can associate a class (i.e. a struct full of intrinsic methods)
        with it to allow it to be used as an OO object (which can also
        identify it as an object of the expected type when passed back
        from ICI code). It also allows a type name to be associated
        with the handle which will appear in diagnostics.

*       Changed 'error' to 'ici_error'.

*       Changed ici_evaluate() to use a catch object on its C stack
        as its frame marker on its ICI execution stack. This avoids
        an object allocation on each ici_evaluate call. The arguments
        to ici_evaluate have changed slightly as a consequence.

*       Removed syscall functions from the core. They will only
        be accessible through the sys module in future.

*       Changed the allocation routines to allocate small objects
        densely (no boundary words) out of larger chunks. The
        technique for eliminating boundary words, and keeping the
        fast free lists, is to have alloc/free routines where
        the caller is required to tell the free how much memory
        it asked for on the alloc. Thus we have:

            x = ici_talloc(type);
            ...
            ici_tfree(x, type);

        and

            x = ici_nalloc(size);
            ...
            ici_nfree(x, size);

        98% of the time this is really easy because the place you
        free the data knows exactly how big it is. For the occasions
        where this is not convenient, you can use the complete
        malloc/free equivalent:

            x = ici_alloc(size);
            ...
            ici_free(x);

*       Added a small array (32) of pre-generated small ints to allow
        a quick check and use of these very common numbers.

*       Changed the internal ICI calling convention. The call operator
        object used to store the number of actual parameters to a
        function. Now a separate int is pushed onto the operand
        stack. There is now just a single static call operator.
        Because ints are now heavily optimised, a call from C to ICI
        now, typically, does no allocation until it's in the main
        execution engine.

*       Moved the lib curses based text window feature out of the core.
        Will put it in an extension module soon.

*       Changed new_array() to take an int argument being the initial
        number of slots for the array to have. The caller can assume
        that that many items can be pushed on. Use 0 for the default
        value.

*       Changed arrays so that they can be efficiently push()ed and
        pop()ed at *both* ends. Thus they can be used to form efficient
        queues. Although apparently a small feature, queues are something
        that I've always felt were important and missing from ICI.
        However this was a *big* change (much harder than the
        object header change). The parser and execution engine rely
        heavily on arrays for their efficiency. To prevent an impact
        on them we distinguish arrays that have never been used as a
        queue (never had the new functions rpush() or rpop() done on
        them) from the general case. Virgin arrays are referred to as
        stacks and have all the old semantics. But in the general case
        arrays are now growable circular buffers. Where you don't know
        the origin or history of an array, you must assume the general
        case and use some new knowledge, functions and macros to access it.

*       Removed the feature of binary << that allowed "array << int".
        This has been flagged for removal in the documentation for a
        long time, and became difficult to support.

*       Changed the universal object header(!) from 2 x 32 bit
        words to 1 x 32 bit word. Theoretically this is a huge
        change, but it was actually pretty easy. It requires
        recompilation of external modules, and some changes to their
        source. Basically the type is now completely indicated by
        the small int o_tcode field of the header. To find a pointer
        to the type structure you must index an array of pointers to
        them. Use ici_typeof(o) for this. Types must now register their
        type_t structure to obtain their small int type code, which
        they should remember and use when making new objects. After this
        change, the next release will move to version 4 to keep extension
        modules with the new smaller objects separate.
        
        The overall effect on CPU time seems to be neutral or a
        slight improvement.

*       Added multi-threading. This is based on native machine threads,
        but the whole mass of ICI objects and static data is gated
        through a single mutex. So it works fine except threads competing
        for the ICI execution engine will not take advantage of multiple
        processors (but they will if they spend their time in functions
        that release the mutex while running).
        
        It was a little difficult to achieve this without slowing things
        down. Introducing a single extra indirection in top-of-stack
        accesses (the obvious way) adds up to 20% to the execution time
        of some programs. But I managed it.
        
        New language constructs "waitfor (expr; obj) stmt" and
        "critsect stmt" have been added. As well as the "wakeup(obj)"
        and "sleep(num)" functions. All I/O routines in the core
        release the mutex around the low level I/O - except the parser.
        (See new documentation.)
        
        In the process, o_top, x_top and v_top macros got removed.
        Use ici_os.a_top instead.

*       The ICI Technical Description has been updated to
        FrameMaker 6 format and split into separate chapters;
        each in a separate source file.

*       Removed the obsolete function ici_op_offsq() and operator
        o_offsq from array.c, ici.def, and fwd.h.