This repository (https://github.com/nikitadanilov/usched) contains a simple experimental implementation of
coroutines that is an alternative to the well-known
"stackless" and "stackful" methods.
The term "coroutine" gradually grew to mean a mechanism where a computation,
which in this context means a chain of nested function calls, can "block" or
"yield" so that the top-most caller can proceed and the computation can later be
resumed at the blocking point with the chain of intermediate function activation
frames preserved.
Prototypical uses of coroutines are lightweight support for potentially blocking
operations (user interaction, IO, networking) and
generators, which produce multiple values (see the same fringe problem).
There are two common coroutine implementation methods:

- a stackful coroutine runs on a separate stack. When a stackful coroutine
  blocks, it performs a usual context switch. Historically "coroutines" meant
  stackful coroutines. Stackful coroutines are basically little more than usual
  threads, and so they can be kernel (supported by the operating system) or
  user-space (implemented by a user-space library, also known as green
  threads), preemptive or cooperative;

- a stackless coroutine does not use any stack when blocked. In a typical
  implementation, instead of using a normal function activation frame on the
  stack, the coroutine uses a special activation frame allocated in the heap so
  that it can outlive its caller. Using a heap-allocated frame to store all local
  variables lends itself naturally to compiler support, but some people are known
  to implement stackless coroutines manually via a combination of
  pre-processing, libraries and tricks much worse than Duff's device.
Stackful and stackless are by no means the only possibilities. One of the
earliest languages to feature generators, CLU (distribution), ran
generators on the caller's stack.
usched is in some sense intermediate between stackful and stackless: its
coroutines do not use stack when blocked, nor do they allocate individual
activation frames in the heap.
The following is copied with some abbreviations from
usched.c.
Overview
usched: A simple dispatcher for cooperative user-space threads.
A typical implementation of user-space threads allocates a separate stack for
each thread when the thread is created and then dispatches threads (as
decided by the scheduler) through some context switching mechanism, for
example, longjmp().

In usched all threads (represented by struct ustack) are executed on the same
"native" stack. When a thread is about to block (usched_block()), a memory
buffer for the stack used by this thread is allocated and the stack is copied
to the buffer. After that the part of the stack used by the blocking thread
is discarded (by longjmp()-ing to the base of the stack) and a new thread is
selected. The stack of the selected thread is restored from its buffer and
the thread is resumed by longjmp()-ing to the usched_block() that blocked it.

The focus of this implementation is simplicity: the total size of usched.[ch]
is less than 120 LOC, as measured by SLOCCount.
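To make this concrete, below is a minimal, hypothetical sketch of the blocking
half of this idea in plain C with setjmp()/longjmp() and memcpy(). It is not
the actual usched code: struct toy_thread, toy_block() and the dispatcher
jmp_buf are invented names, and the real bookkeeping (including the restore
path and the pad trick) lives in usched.c and is illustrated in the diagrams
below.

```c
/* Hypothetical sketch of the blocking half of the idea; NOT the actual
 * usched code. Assumes the stack grows toward lower addresses, as usched
 * itself does. */
#include <setjmp.h>
#include <stdlib.h>
#include <string.h>

static jmp_buf dispatcher;      /* set by the dispatcher loop via setjmp() */

struct toy_thread {
        jmp_buf  cont;          /* resumption point inside toy_block()        */
        char    *bottom;        /* stack address recorded when the thread was
                                   launched (its highest used address)        */
        char    *top;           /* lowest stack address used when blocking    */
        void    *buf;           /* heap buffer holding the saved stack
                                   (NULL until the first block)               */
        size_t   used;          /* number of bytes saved in buf               */
};

void toy_block(struct toy_thread *t)
{
        char here;              /* approximates the current stack pointer */

        if (setjmp(t->cont) == 0) {
                /* Save the part of the native stack used by this thread. */
                t->top  = &here;
                t->used = t->bottom - t->top;
                t->buf  = realloc(t->buf, t->used);
                memcpy(t->buf, t->top, t->used);
                /* Discard this thread's portion of the stack and return to
                 * the dispatcher loop, which picks the next thread to run. */
                longjmp(dispatcher, 1);
        }
        /* Resumed: the dispatcher has copied buf back into [top, bottom)
         * and longjmp()-ed to t->cont, so execution continues here and
         * toy_block() returns to its caller as if nothing happened. */
}
```

The key point is that the blocked thread's portion of the native stack is
copied out and then simply abandoned by the longjmp() back to the dispatcher.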
Advantages:

- no need to allocate the maximal possible stack at thread initialisation: the
  stack buffer is allocated as needed. It is also possible to free the buffer
  when the thread is resumed (not currently implemented);
- a thread that doesn't block has zero overhead: it is executed as a native
  function call (through a function pointer) without any context switching;
- because the threads are executed on the stack of the same native underlying
  thread, native synchronisation primitives (mutices, etc.) work, although the
  threads share the underlying TLS. Of course, one cannot use native primitives
  to synchronise between usched threads running on the same native thread.
Disadvantages:

- stack copying introduces overhead (memcpy()) in each context switch;
- because stacks are moved around, addresses on a thread stack are only valid
  while the thread is running. This invalidates certain common programming
  idioms: other threads and the heap cannot store pointers into the stacks, at
  least not into the stacks of blocked threads. Note that the Go language, and
  probably other run-times, maintains a similar invariant.
Usage
usched is only a dispatcher and not a scheduler: it blocks and resumes
threads but

- it does not keep track of threads (specifically, allocation and freeing of
  struct ustack instances is done elsewhere),
- it implements no scheduling policies.

These things are left to the user, together with stack buffer allocation and
freeing. The user supplies 3 call-backs:
- usched::s_next(): the scheduling function. This call-back returns the next
  thread to execute. This can be either a new (never before executed) thread
  initialised with ustack_init(), or it can be a blocked thread. The user must
  keep track of blocked and runnable threads, presumably by providing wrappers
  to ustack_init() and usched_block() that record thread state changes. It
  is up to usched::s_next() to block and wait for events if there are no
  runnable threads and all threads are waiting for something;
- usched::s_alloc(): allocates a new stack buffer of at least the specified
  size. The user has full control over stack buffer allocation. It is possible
  to pre-allocate the buffer when the thread is initialised (reducing the cost
  of usched_block()), to cache buffers, etc.;
- usched::s_free(): frees a previously allocated stack buffer.
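As an illustration, here is a hypothetical sketch of such user-side
bookkeeping: a trivial FIFO run-queue plus the three call-backs. The real
call-back signatures are declared in usched.h and may differ; every signature
and helper below (runq_put(), my_next(), my_alloc(), my_free()) is an
assumption made for the example.

```c
/* Hypothetical sketch of user-side bookkeeping: a FIFO run-queue and the
 * three call-backs. The real call-back signatures are in usched.h and may
 * differ; the ones below are assumptions for illustration only. */
#include <stdlib.h>

struct usched;                        /* defined in usched.h */
struct ustack;                        /* defined in usched.h */

struct rnode {
        struct ustack *r_thread;
        struct rnode  *r_next;
};

static struct rnode *runq_head;
static struct rnode *runq_tail;

/* Called by the user's wrappers around ustack_init() and usched_block()
 * whenever a thread becomes runnable. */
static void runq_put(struct ustack *u)
{
        struct rnode *n = malloc(sizeof *n);

        n->r_thread = u;
        n->r_next   = NULL;
        if (runq_tail != NULL)
                runq_tail->r_next = n;
        else
                runq_head = n;
        runq_tail = n;
}

/* Assumed shape of usched::s_next(): pop the next runnable thread, or
 * return NULL (a real scheduler would wait for an external event here). */
static struct ustack *my_next(struct usched *s)
{
        struct rnode  *n = runq_head;
        struct ustack *u;

        (void)s;
        if (n == NULL)
                return NULL;
        runq_head = n->r_next;
        if (runq_head == NULL)
                runq_tail = NULL;
        u = n->r_thread;
        free(n);
        return u;
}

/* Assumed shapes of usched::s_alloc() and usched::s_free(): plain malloc()
 * and free(); a real user might pre-allocate or cache buffers instead. */
static void *my_alloc(struct usched *s, size_t size)
{
        (void)s;
        return malloc(size);
}

static void my_free(struct usched *s, void *buf, size_t size)
{
        (void)s;
        (void)size;
        free(buf);
}
```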
rr.h and rr.c provide a simple "round-robin" scheduler implementing all the
call-backs. Use it carefully: it was only tested with the rmain.c benchmark.
Pictures!
The following diagrams show stack management by usched. The stack grows from
right to left.
At the entrance to the dispatcher loop, usched_run(S):
usched_run()
----------------------------------------------+--------------+-------+
| buf | anchor | ... |
----------------------------------------------+--------------+-------+
^
|
sp = S->s_buf
A new (never before executed) thread U is selected by S->s_next(), and
launch() calls the thread startup function U->u_f():
U->u_f() launch() usched_run()
-----------------------------+---------+-----+--------------+-------+
| | pad | buf | anchor | ... |
-----------------------------+---------+-----+--------------+-------+
^ ^
| |
sp U->u_bottom
The thread executes as usual on the stack, until it blocks by calling
usched_block():
usched_block() bar() U->u_f() launch() usched_run()
----------+------+-----+-----+---------+-----+--------------+-------+
| here | ... | | | pad | buf | anchor | ... |
----------+------+-----+-----+---------+-----+--------------+-------+
^ ^ ^
| +-- sp = U->u_cont |
| U->u_bottom
U->u_top
The stack from U->u_top to U->u_bottom is copied into the stack buffer
U->u_stack, and control returns to usched_run() by longjmp(S->s_buf):
usched_run()
----------------------------------------------+--------------+-------+
| buf | anchor | ... |
----------------------------------------------+--------------+-------+
^
|
sp = S->s_buf
Next, suppose S->s_next() selects a previously blocked thread V ready to be
resumed. usched_run() calls cont(V).
cont() usched_run()
----------------------------------------+-----+--------------+-------+
| | buf | anchor | ... |
----------------------------------------+-----+--------------+-------+
^
|
sp
cont() copies the stack from the buffer into the [V->u_top, V->u_bottom]
range. It is important that this memcpy() operation does not overwrite
cont()'s own stack frame; this is why the pad[] array is needed in launch():
it advances V->u_bottom and gives cont() some space to operate.
usched_block() foo() V->u_f() cont() usched_run()
---------+------+-----+-----+--------+--+-----+--------------+-------+
| here | ... | | | | | buf | anchor | ... |
---------+------+-----+-----+--------+--+-----+--------------+-------+
^ ^ ^ ^
| +-- V->u_cont | +-- sp
| |
V->u_top V->u_bottom
Then cont() longjmp()-s to V->u_cont, restoring V's execution context:
usched_block() foo() V->u_f() cont() usched_run()
---------+------+-----+-----+--------+--+-----+--------------+-------+
| here | ... | | | | | buf | anchor | ... |
---------+------+-----+-----+--------+--+-----+--------------+-------+
^
+-- sp = V->u_cont
V continues its execution as if it returned from usched_block().
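Continuing the hypothetical toy sketch from the Overview (again, not the
actual usched code), the resume path corresponding to cont() could look
roughly like this:

```c
/* Hypothetical counterpart of cont() for the toy sketch above; uses the
 * struct toy_thread defined there. Like usched itself, it relies on the
 * stated build assumptions (no optimisations, no stack protector). */
#include <setjmp.h>
#include <string.h>

void toy_resume(struct toy_thread *t)
{
        /* Restore the saved stack into its original location. The pad in
         * the launcher ensured that [t->top, t->bottom) lies strictly below
         * this function's own frame, so the memcpy() cannot clobber it. */
        memcpy(t->top, t->buf, t->used);
        /* Jump back into toy_block(); the thread resumes as if that call
         * had just returned. */
        longjmp(t->cont, 1);
}
```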
Multiprocessing
By design, a single instance of struct usched cannot take advantage of
multiple processors, because all its threads are executing within a single
native thread. Multiple instances of struct usched can co-exist within a
single process address space, but a ustack thread created for one instance
cannot be migrated to another. One possible strategy to add support for
multiple processors is to create multiple instances of struct usched and
schedule them (that is, schedule the threads running the respective
usched_run()-s) to processors via pthread_setaffinity_np() or similar. See
rr.c for a simplistic implementation.
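As a rough sketch of that strategy (not the rr.c code), one native pthread
could be created per struct usched instance and pinned to its own processor.
The usched_run() call taking a struct usched * matches the usage shown above;
everything else below is an assumption, and the code is Linux-specific.

```c
/* Hypothetical sketch: one struct usched instance per native thread, each
 * pinned to its own CPU. Linux-specific (pthread_setaffinity_np()). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include "usched.h"              /* struct usched, usched_run() */

static void *worker(void *arg)
{
        usched_run(arg);         /* dispatch the threads of this instance */
        return NULL;
}

/* Start nr dispatchers for the array of instances s[0..nr-1]. */
void start_dispatchers(struct usched *s, int nr)
{
        for (int i = 0; i < nr; i++) {
                pthread_t tid;
                cpu_set_t cpus;

                pthread_create(&tid, NULL, worker, &s[i]);
                CPU_ZERO(&cpus);
                CPU_SET(i, &cpus);                /* pin instance i to CPU i */
                pthread_setaffinity_np(tid, sizeof cpus, &cpus);
        }
}
```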
Current limitations
- the stack is assumed to grow toward lower addresses. This is easy to fix,
  if necessary;
- the implementation is not signal-safe. Fixing this can be as easy as
  replacing the *jmp() calls with their sig*jmp() counterparts. At the moment,
  signal-based code, like the gperf -lprofiler library, would most likely crash
  usched;
- usched.c must be compiled without optimisations and with the
  -fno-stack-protector option (gcc); see the example compile command after
  this list;
- usched threads are cooperative: a thread will continue to run until it
  completes or blocks. Adding preemption (via signal-based timers) is
  relatively easy; the actual preemption decision will be relegated to the
  external "scheduler" via a new usched::s_preempt() call-back invoked from
  a signal handler.
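For example, a compile command along these lines (gcc assumed, as above):

$ gcc -O0 -fno-stack-protector -c usched.c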
Notes
Midori seems to use a similar method: a coroutine (called an activity there)
starts on the native stack. If it needs to block, frames are allocated in the
heap (this requires compiler support) and filled in from the stack; the
coroutine runs in these heap-allocated frames when resumed.
Benchmarks
usched was benchmarked against a few stackful (Go, pthreads) and stackless
(Rust, C++ coroutines) implementations. A couple of caveats:

- all benchmarking in general is subject to the reservations voiced by
  Hippocrates and usually translated (with a complete reversal of the meaning)
  as ars longa, vita brevis, which means: "the art [of the doctor or tester]
  takes a long time to learn, but the life of a student is brief,
  symptoms are vague, chances of success are doubtful";
- the author is much less than fluent in all the languages and frameworks used
  in the benchmarking. It is possible that some of the benchmarking code is
  inefficient or just outright wrong. Any comments are appreciated.
The benchmark tries to measure the efficiency of coroutine switching. It creates
R cycles, N coroutines each. Each cycle performs M rounds, where each round
consists of sending a message across the cycle: a particular coroutine (selected
depending on the round number) sends the message to its right neighbour, all
other coroutines relay the message received from the left to the right, the
round completes when the originator receives the message after it passed through
the entire cycle.
If N == 2, the benchmark is R pairs of processes, ping-ponging M messages within
each pair.
Some benchmarks support two additional parameters: D (additional space in bytes,
artificially consumed by each coroutine in its frame) and P (the number of
native threads used to schedule the coroutines).
The benchmark creates N*R coroutines and sends a total of N*R*M messages,
the latter being proportional to the number of coroutine switches.
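To make the message-passing structure concrete, here is a small, hypothetical
sketch of a single cycle (R == 1) written with plain pthreads and POSIX
semaphores. It is not the source of any benchmark listed below; each of the N
threads waits on its own semaphore and posts to its right neighbour's.

```c
/* Hypothetical sketch of a single cycle (R == 1): N relays pass a message
 * around the ring M times. Not the code of any benchmark below; pthreads
 * and POSIX semaphores are used only to make the structure explicit. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

enum { N = 8, M = 1000 };               /* coroutines per cycle, rounds */

static sem_t     sem[N];
static pthread_t thr[N];

static void *relay(void *arg)
{
        int self = (int)(long)arg;

        for (int round = 0; round < M; round++) {
                if (self == round % N)                   /* originator this round */
                        sem_post(&sem[(self + 1) % N]);  /* send to the right     */
                sem_wait(&sem[self]);                    /* receive from the left */
                if (self != round % N)                   /* relays pass it on     */
                        sem_post(&sem[(self + 1) % N]);
        }
        return NULL;
}

int main(void)
{
        for (int i = 0; i < N; i++)
                sem_init(&sem[i], 0, 0);
        for (int i = 0; i < N; i++)
                pthread_create(&thr[i], NULL, relay, (void *)(long)i);
        for (int i = 0; i < N; i++)
                pthread_join(thr[i], NULL);
        printf("%d messages passed\n", N * M);
        return 0;
}
```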
bench.sh runs
all implementations with the same N, R and M
parameters. graph.py
plots the results.
POSIX. Source: pmain.c, binary: pmain. Pthreads-based stackful implementation
in C. Uses default thread attributes. pmain.c also contains an emulation of
unnamed POSIX semaphores for Darwin. Plot label: "P". This benchmark crashes
with "pmain: pthread_create: Resource temporarily unavailable" for large
values of N*R.
Go. Source:
gmain.go,
binary: gmain. The code is straightforward (it was a pleasure to write). P is
supported via runtime.GOMAXPROCS(). "GO1T" shows the results for a single
native thread; "GO" shows the results without the restriction on the number of
threads.
Rust. Source: cycle/src/main.rs, binary: cycle/target/release/cycle. Stackless
implementation using Rust's built-in async/.await. Label: "R". It is
single-threaded (I haven't figured out how to distribute coroutines to
multiple executors), so it should be compared with GO1T, C++1T and U1T.
Instead of fighting the Rust borrow checker, I used "unsafe" and shared data
structures between multiple coroutines, much like the other benchmarks do.
C++. Source: c++main.cpp, binary: c++main. The current state of coroutine
support in C++ is unclear. Is everybody supposed to use the <coroutine>
interfaces directly, or one of the mutually incompatible libraries that
provide easier-to-use interfaces on top of <coroutine>? This benchmark uses
Lewis Baker's cppcoro (specifically, Andreas Buhr's fork). Labels: "C++" and
"C++1T" (for single-threaded results).
usched. Source: rmain.c, binary: rmain. Based on usched.[ch] and rr.[ch]. This
is our main interest, so we test a few combinations of parameters:

- "U": the default configuration, a round-robin scheduler over 16 native threads;
- "U1K": 1000 bytes of additional stack space for each coroutine;
- "U10K": 10000 bytes of additional stack space;
- "U1T": round-robin over 1 native thread;
- "U1TS": round-robin over 1 native thread with the pthread locking in rr.c
  compiled out (the -DSINGLE_THREAD compilation option, a separate binary
  rmain.1t);
- Update. "UL": uses the "local" scheduler ll.c. All coroutines within a cycle
  are assigned to the same native thread, so that scheduling between them
  requires no locking. This demonstrates very high throughput (comparable to
  C++), but unfortunately I do not have time right now to re-do all the
  measurements consistently. Binary: lmain.
bench.sh runs all benchmarks with N == 2 (message ping-pong) and N == 8. Raw
results are in results.linux. In the graphs, the horizontal axis is the number
of coroutines (N*R, logarithmic) and the vertical axis is operations (N*R*M)
per second.

Environment: Linux VM, 16 processors, 16GB of memory. Kernel: 4.18.0 (Rocky
Linux).
| coroutines (N*R) | 16 | 32 | 64 | 400 | 800 | 4000 | 8000 | 40000 | 80000 | 400000 | 800000 | 4000000 | 8000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| POSIX | 1.76 | 3.46 | 6.39 | 14.58 | 14.85 | 14.70 | 13.63 | 9.87 | 8.02 | 0.00 | 0.00 | 0.00 | 0.01 |
| GO | 4.14 | 5.62 | 7.77 | 36.74 | 41.64 | 49.72 | 48.24 | 37.24 | 43.06 | 46.31 | 46.22 | 46.09 | 45.95 |
| GO1T | 4.38 | 4.30 | 4.27 | 4.11 | 3.81 | 3.53 | 3.40 | 3.33 | 3.43 | 3.99 | 3.98 | 3.95 | 3.86 |
| RUST | 9.48 | 8.71 | 8.69 | 8.64 | 8.53 | 7.85 | 6.59 | 4.32 | 3.80 | 3.63 | 3.63 | 3.83 | 3.90 |
| U | 17.24 | 17.27 | 17.30 | 25.77 | 29.99 | 71.68 | 77.32 | 78.92 | 77.98 | 80.88 | 82.09 | 83.66 | 82.15 |
| U1K | 16.21 | 16.29 | 16.35 | 25.38 | 28.41 | 69.92 | 75.76 | 74.31 | 73.65 | 76.69 | 76.75 | 75.84 | 76.56 |
| U10K | 9.04 | 8.96 | 9.09 | 20.38 | 21.69 | 58.13 | 60.95 | 59.66 | 60.50 | 61.32 | 61.71 | 62.06 | 62.72 |
| U1T | 17.37 | 17.31 | 17.35 | 17.35 | 17.36 | 17.27 | 17.29 | 17.14 | 17.06 | 16.91 | 16.91 | 16.91 | 16.87 |
| C++ | 49.87 | 67.85 | 74.94 | 73.91 | 73.04 | 62.48 | 59.15 | 57.23 | 56.48 | 55.35 | 55.44 | 54.02 | 53.61 |
| C++1T | 97.03 | 97.38 | 96.82 | 96.06 | 96.58 | 95.78 | 94.83 | 89.83 | 86.09 | 80.48 | 79.37 | 77.04 | 77.48 |
| U1TS | 49.53 | 49.76 | 49.83 | 50.16 | 49.93 | 48.88 | 49.75 | 48.75 | 47.99 | 46.48 | 46.25 | 45.99 | 46.12 |
| UL | 76.03 | 116.63 | 160.72 | 169.74 | 169.99 | 171.57 | 170.32 | 165.82 | 169.43 | 174.32 | 171.55 | 169.48 | 170.04 |
(N == 8)

A few notes:

- As mentioned above, the pthreads-based solution crashes at around 50K threads.
- Most single-threaded versions ("GO1T", "R" and "U1T") are as stable as a
  corpse's body temperature. Rust cools off completely at about 500K
  coroutines. Single-threaded C++ ("C++1T"), on the other hand, is the most
  performant solution over almost the entire measured range; only for coroutine
  counts higher than 500K does "U" overtake it.
- It is interesting that a very simple and unoptimised usched fares so well
  against the heavily optimised C++ and Go run-times. (Again, see the
  reservations about the benchmarking.)
- Rust is disappointing: one would hope to get better results from a rich type
  system combined with compiler support.
| coroutines (N*R) | 4 | 8 | 16 | 100 | 200 | 1000 | 2000 | 10000 | 20000 | 100000 | 200000 | 1000000 | 2000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| POSIX | 0.56 | 0.97 | 1.84 | 6.36 | 6.88 | 7.78 | 7.82 | 7.58 | 7.15 | 5.34 | 0.00 | 0.00 | 0.00 |
| GO | 7.40 | 11.03 | 19.23 | 40.44 | 45.79 | 51.81 | 52.87 | 52.77 | 53.15 | 53.62 | 53.22 | 55.77 | 56.82 |
| GO1T | 4.54 | 4.55 | 4.53 | 4.53 | 4.53 | 4.52 | 4.52 | 4.50 | 4.47 | 4.36 | 4.31 | 4.26 | 4.26 |
| RUST | 5.68 | 5.75 | 5.75 | 4.74 | 4.74 | 4.62 | 4.46 | 4.13 | 3.70 | 2.81 | 2.77 | 2.76 | 2.73 |
| U | 11.22 | 11.27 | 11.26 | 11.30 | 7.91 | 24.66 | 38.72 | 35.67 | 40.60 | 41.18 | 42.06 | 42.96 | 42.74 |
| U1K | 9.64 | 9.62 | 9.65 | 9.67 | 7.61 | 22.14 | 34.38 | 31.70 | 34.54 | 34.56 | 34.59 | 35.47 | 35.56 |
| U10K | 4.43 | 4.62 | 4.50 | 4.25 | 5.02 | 15.79 | 26.18 | 25.33 | 27.60 | 27.62 | 27.63 | 27.72 | 28.16 |
| U1T | 11.24 | 11.29 | 11.34 | 11.26 | 11.32 | 11.30 | 11.28 | 11.28 | 11.22 | 11.19 | 11.15 | 11.13 | 11.15 |
| C++ | 46.33 | 46.30 | 63.38 | 114.30 | 117.05 | 114.12 | 111.36 | 101.32 | 100.13 | 84.30 | 78.53 | 72.77 | 71.00 |
| C++1T | 96.56 | 96.03 | 96.37 | 95.97 | 95.49 | 95.68 | 94.94 | 92.95 | 91.23 | 83.55 | 80.33 | 77.22 | 76.22 |
| U1TS | 19.59 | 19.66 | 19.80 | 19.87 | 19.89 | 19.86 | 19.82 | 19.72 | 19.66 | 19.51 | 19.45 | 19.33 | 19.37 |
| UL | 12.19 | 23.48 | 50.39 | 65.71 | 67.22 | 69.17 | 70.01 | 70.09 | 69.36 | 69.28 | 69.43 | 68.83 | 68.00 |
(N == 2)

- First, note that the scale is different on the vertical axis.
- The single-threaded benchmarks display roughly the same behaviour (exactly
  the same in the "C++1T" case) as with N == 8.
- Go is somewhat better. Perhaps its scheduler is optimised for the message
  ping-pong usual in channel-based concurrency models?
- The usched variants are much worse (50% worse for "U") than with N == 8.
- Rust is disappointing.
To reproduce:
$ # install libraries and everything, then...
$ make
$ while : ;do ./bench.sh | tee -a results; sleep 5 ;done # collect enough results, this might take long...
^C
$ grep -h '^ *[2N],' results | python3 graph.py c2.svg > c2-table.md # create plot for N == 2
$ grep -h '^ *[8N],' results | python3 graph.py c8.svg > c8-table.md # create plot for N == 8
Conclusion
Overall, the results are surprisingly good. The difference between "U1T" and
"U1TS" indicates that the locking in rr.c affects performance significantly,
and affects it even more with multiple native threads, when locks are
contended across processors. I'll try to produce a more efficient (perhaps
lockless) version of a scheduler as the next step.