futex2¶
- Author
André Almeida <andrealmeid@collabora.com>
futex, or fast user mutex, is a set of syscalls to allow userspace to create performant synchronization mechanisms, such as mutexes, semaphores and conditional variables in userspace. C standard libraries, like glibc, uses it as a means to implement more high level interfaces like pthreads.
The interface¶
uAPI functions¶
-
long
sys_futex_wait
(void __user *uaddr, u64 val, unsigned int flags, struct __kernel_timespec __user *timo)¶ Wait on a futex address if (*uaddr) == val
Parameters
void __user * uaddr
User address of futex
u64 val
Expected value of futex
unsigned int flags
Specify the size of futex and the clockid
struct __kernel_timespec __user * timo
Optional absolute timeout.
Description
The user thread is put to sleep, waiting for a futex_wake() at uaddr, if the value at *uaddr is the same as val (otherwise, the syscall returns immediately with -EAGAIN).
Returns 0 on success, error code otherwise.
-
long
sys_futex_waitv
(struct futex_waitv __user *waiters, unsigned int nr_futexes, unsigned int flags, struct __kernel_timespec __user *timo)¶ Wait on a list of futexes
Parameters
struct futex_waitv __user * waiters
List of futexes to wait on
unsigned int nr_futexes
Length of futexv
unsigned int flags
Flag for timeout (monotonic/realtime)
struct __kernel_timespec __user * timo
Optional absolute timeout.
Description
Given an array of struct futex_waitv
, wait on each uaddr. The thread wakes
if a futex_wake() is performed at any uaddr. The syscall returns immediately
if any waiter has *uaddr != val. *timo is an optional timeout value for the
operation. Each waiter has individual flags. The flags argument for the
syscall should be used solely for specifying the timeout as realtime, if
needed. Flags for shared futexes, sizes, etc. should be used on the
individual flags of each waiter.
Returns the array index of one of the awaken futexes. There’s no given information of how many were awakened, or any particular attribute of it (if it’s the first awakened, if it is of the smaller index…).
-
long
sys_futex_wake
(void __user *uaddr, unsigned int nr_wake, unsigned int flags)¶ Wake a number of futexes waiting on an address
Parameters
void __user * uaddr
Address of futex to be woken up
unsigned int nr_wake
Number of futexes waiting in uaddr to be woken up
unsigned int flags
Flags for size and shared
Description
Wake nr_wake threads waiting at uaddr.
Returns the number of woken threads on success, error code otherwise.
-
long
sys_futex_requeue
(struct futex_requeue __user *uaddr1, struct futex_requeue __user *uaddr2, unsigned int nr_wake, unsigned int nr_requeue, u64 cmpval, unsigned int flags)¶ Wake futexes at uaddr1 and requeue from uaddr1 to uaddr2
Parameters
struct futex_requeue __user * uaddr1
Address of futexes to be waken/dequeued
struct futex_requeue __user * uaddr2
Address for the futexes to be enqueued
unsigned int nr_wake
Number of futexes waiting in uaddr1 to be woken up
unsigned int nr_requeue
Number of futexes to be requeued from uaddr1 to uaddr2
u64 cmpval
Expected value at uaddr1
unsigned int flags
Reserved flags arg for requeue operation expansion. Must be 0.
Description
If (uaddr1->uaddr == cmpval), wake at uaddr1->uaddr a nr_wake number of
waiters and then, remove a number of nr_requeue waiters at uaddr1->uaddr
and add then to uaddr2->uaddr list. Each uaddr has its own set of flags,
that must be defined at struct futex_requeue
(such as size, shared, NUMA).
Return the number of the woken futexes + the number of requeued ones on success, error code otherwise.
uAPI structures¶
-
struct
futex_waitv
¶ A waiter for vectorized wait
Definition
struct futex_waitv {
__u64 val;
void __user *uaddr;
unsigned int flags;
};
Members
val
Expected value at uaddr
uaddr
User address to wait on
flags
Flags for this waiter
-
struct
futex_requeue
¶ Define an address and its flags for requeue operation
Definition
struct futex_requeue {
void __user *uaddr;
unsigned int flags;
};
Members
uaddr
User address of one of the requeue arguments
flags
Flags for this address
-
struct
futex_wait_block
¶ Block of futexes to be waited for
Definition
struct futex_wait_block {
__u32 __user *uaddr;
__u32 val;
__u32 bitset;
};
Members
uaddr
User address of the futex
val
Futex value expected by userspace
bitset
Bitset for the optional bitmasked wakeup
The flag
argument¶
The flag is used to specify the size of the futex word (FUTEX_[8, 16, 32, 64]). It’s mandatory to define one, since there’s no default size.
By default, the timeout uses a monotonic clock, but can be used as a realtime one by using the FUTEX_REALTIME_CLOCK flag.
By default, futexes are of the private type, that means that this user address
will be accessed by threads that share the same memory region. This allows for
some internal optimizations, so they are faster. However, if the address needs
to be shared with different processes (like using mmap()
or shm()
), they
need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to set that.
By default, the operation has no NUMA-awareness, meaning that the user can’t choose the memory node where the kernel side futex data will be stored. The user can choose the node where it wants to operate by setting the FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, 32 or 64):
struct futexX_numa {
__uX value;
__sX hint;
};
This structure should be passed at the void *uaddr
of futex functions. The
address of the structure will be used to be waited on/waken on, and the
value
will be compared to val
as usual. The hint
member is used to
define which node the futex will use. When waiting, the futex will be
registered on a kernel-side table stored on that node; when waking, the futex
will be searched for on that given table. That means that there’s no redundancy
between tables, and the wrong hint
value will lead to undesired behavior.
Userspace is responsible for dealing with node migrations issues that may
occur. hint
can range from [0, MAX_NUMA_NODES), for specifying a node, or
-1, to use the same node the current process is using.
When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a global table on allocated on the first node.
The timo
argument¶
As per the Y2038 work done in the kernel, new interfaces shouldn’t add timeout
options known to be buggy. Given that, timo
should be a 64-bit timeout at
all platforms, using an absolute timeout value.
Implementation¶
The internal implementation follows a similar design to the original futex. Given that we want to replicate the same external behavior of current futex, this should be somewhat expected.
Waiting¶
For the wait operations, they are all treated as if you want to wait on N futexes, so the path for futex_wait and futex_waitv is the basically the same. For both syscalls, the first step is to prepare an internal list for the list of futexes to wait for (using struct futexv_head). For futex_wait() calls, this list will have a single object.
We have a hash table, where waiters register themselves before sleeping. Then
the wake function checks this table looking for waiters at uaddr. The hash
bucket to be used is determined by a struct futex_key
, that stores information
to uniquely identify an address from a given process. Given the huge address
space, there’ll be hash collisions, so we store information to be later used on
collision treatment.
First, for every futex we want to wait on, we check if (*uaddr == val
).
This check is done holding the bucket lock, so we are correctly serialized with
any futex_wake() calls. If any waiter fails the check above, we dequeue all
futexes. The check (*uaddr == val
) can fail for two reasons:
The values are different, and we return -EAGAIN. However, if while dequeueing we found that some futexes were awakened, we prioritize this and return success.
When trying to access the user address, we do so with page faults disabled because we are holding a bucket’s spin lock (and can’t sleep while holding a spin lock). If there’s an error, it might be a page fault, or an invalid address. We release the lock, dequeue everyone (because it’s illegal to sleep while there are futexes enqueued, we could lose wakeups) and try again with page fault enabled. If we succeed, this means that the address is valid, but we need to do all the work again. For serialization reasons, we need to have the spin lock when getting the user value. Additionally, for shared futexes, we also need to recalculate the hash, since the underlying mapping mechanisms could have changed when dealing with page fault. If, even with page fault enabled, we can’t access the address, it means it’s an invalid user address, and we return -EFAULT. For this case, we prioritize the error, even if some futexes were awaken.
If the check is OK, they are enqueued on a linked list in our bucket, and proceed to the next one. If all waiters succeed, we put the thread to sleep until a futex_wake() call, timeout expires or we get a signal. After waking up, we dequeue everyone, and check if some futex was awakened. This dequeue is done by iteratively walking at each element of struct futex_head list.
All enqueuing/dequeuing operations requires to hold the bucket lock, to avoid racing while modifying the list.
Waking¶
We get the bucket that’s storing the waiters at uaddr, and wake the required number of waiters, checking for hash collision.
There’s an optimization that makes futex_wake() not take the bucket lock if there’s no one to be woken on that bucket. It checks an atomic counter that each bucket has, if it says 0, then the syscall exits. In order for this to work, the waiter thread increases it before taking the lock, so the wake thread will correctly see that there’s someone waiting and will continue the path to take the bucket lock. To get the correct serialization, the waiter issues a memory barrier after increasing the bucket counter and the waker issues a memory barrier before checking it.
Requeuing¶
The requeue path first checks for each struct futex_requeue
and their flags.
Then, it will compare the expected value with the one at uaddr1::uaddr.
Following the same serialization explained at Waking, we increase the atomic
counter for the bucket of uaddr2 before taking the lock. We need to have both
buckets locks at same time so we don’t race with other futex operation. To
ensure the locks are taken in the same order for all threads (and thus avoiding
deadlocks), every requeue operation takes the “smaller” bucket first, when
comparing both addresses.
If the compare with user value succeeds, we proceed by waking nr_wake
futexes, and then requeuing nr_requeue
from bucket of uaddr1 to the uaddr2.
This consists in a simple list deletion/addition and replacing the old futex key
with the new one.
Futex keys¶
There are two types of futexes: private and shared ones. The private are futexes meant to be used by threads that share the same memory space, are easier to be uniquely identified and thus can have some performance optimization. The elements for identifying one are: the start address of the page where the address is, the address offset within the page and the current->mm pointer.
Now, for uniquely identifying a shared futex:
If the page containing the user address is an anonymous page, we can just use the same data used for private futexes (the start address of the page, the address offset within the page and the current->mm pointer); that will be enough for uniquely identifying such futex. We also set one bit at the key to differentiate if a private futex is used on the same address (mixing shared and private calls does not work).
If the page is file-backed, current->mm maybe isn’t the same one for every user of this futex, so we need to use other data: the page->index, a UUID for the struct inode and the offset within the page.
Note that members of futex_key don’t have any particular meaning after they are part of the struct - they are just bytes to identify a futex. Given that, we don’t need to use a particular name or type that matches the original data, we only need to care about the bitsize of each component and make both private and shared fit in the same memory space.
Source code documentation¶
-
struct
futex_key
¶ Components to build unique key for a futex
Definition
struct futex_key {
u64 pointer;
unsigned long index;
unsigned long offset;
};
Members
pointer
Pointer to current->mm or inode’s UUID for file backed futexes
index
Start address of the page containing futex or index of the page
offset
Address offset of uaddr in a page
-
struct
futex_waiter
¶ List entry for a waiter
Definition
struct futex_waiter {
void __user *uaddr;
struct futex_key key;
struct list_head list;
u64 val;
unsigned int flags;
struct futex_bucket *bucket;
unsigned int index;
};
Members
uaddr
Virtual address of userspace futex
key
Information that uniquely identify a futex
list
List node struct
val
Expected value for this waiter
flags
Flags
bucket
Pointer to the bucket for this waiter
index
Index of waiter in futexv list
-
struct
futex_waiter_head
¶ List of futexes to be waited
Definition
struct futex_waiter_head {
struct task_struct *task;
bool hint;
struct futex_waiter objects[0];
};
Members
task
Task to be awaken
hint
Was someone on this list awakened?
objects
List of futexes
-
struct
futex_bucket
¶ A bucket of futex’s hash table
Definition
struct futex_bucket {
atomic_t waiters;
spinlock_t lock;
struct list_head list;
};
Members
waiters
Number of waiters in the bucket
lock
Bucket lock
list
List of waiters on this bucket
Parameters
struct inode *inode
inode to get UUID
Description
Generate a machine wide unique identifier for this inode.
This relies on u64 not wrapping in the life-time of the machine; which with 1ns resolution means almost 585 years.
This further relies on the fact that a well formed program will not unmap the file while it has a (shared) futex waiting on it. This mapping will have a file reference which pins the mount and inode.
If for some reason an inode gets evicted and read back in again, it will get a new sequence number and will _NOT_ match, even though it is the exact same file.
It is important that match_futex()
will never have a false-positive, esp.
for PI futexes that can mess up the state. The above argues that false-negatives
are only possible for malformed programs.
Return
UUID for the given inode
Get a key for a shared futex
Parameters
uintptr_t address
Futex memory address
struct mm_struct *mm
Current process mm_struct pointer
struct futex_key *key
Key struct to be filled
Return
0 on success, error code otherwise
-
struct futex_bucket *
futex_get_bucket
(void __user *uaddr, struct futex_key *key, bool shared)¶ Check if the user address is valid, prepare internal data and calculate the hash
Parameters
void __user *uaddr
futex user address
struct futex_key *key
data that uniquely identifies a futex
bool shared
is this a shared futex?
Description
For private futexes, each uaddr will be unique for a given mm_struct, and it
won’t be freed for the life time of the process. For shared futexes, check
futex_get_shared_key()
.
Return
address of bucket on success, error code otherwise
-
int
futex_get_user
(u32 *uval, u32 __user *uaddr)¶ Get the userspace value on this address
Parameters
u32 *uval
variable to store the value
u32 __user *uaddr
userspace address
Description
Check the comment at futex_enqueue()
for more information.
-
int
futex_setup_time
(struct __kernel_timespec __user *timo, struct hrtimer_sleeper *timeout, unsigned int flags)¶ Prepare the timeout mechanism and start it.
Parameters
struct __kernel_timespec __user *timo
Timeout value from userspace
struct hrtimer_sleeper *timeout
Pointer to hrtimer handler
unsigned int flags
Flags from userspace, to decide which clockid to use
Return
0 on success, error code otherwise
-
int
futex_dequeue_multiple
(struct futex_waiter_head *futexv, unsigned int nr)¶ Remove multiple futexes from hash table
Parameters
struct futex_waiter_head *futexv
list of waiters
unsigned int nr
number of futexes to be removed
Description
This function is used if (a) something went wrong while enqueuing, and we need to undo our work (then nr <= nr_futexes) or (b) we woke up, and thus need to remove every waiter, check if some was indeed woken and return. Before removing a waiter, we check if it’s on the list, since we have no clue who have been waken.
Return
- -1
If no futex was woken during the removal
0>= - At least one futex was found woken, index of the last one
-
int
futex_enqueue
(struct futex_waiter_head *futexv, unsigned int nr_futexes, int *awakened)¶ Check the value and enqueue a futex on a wait list
Parameters
struct futex_waiter_head *futexv
List of futexes
unsigned int nr_futexes
Number of futexes in the list
int *awakened
If a futex was awakened during enqueueing, store the index here
Description
Get the value from the userspace address and compares with the expected one.
Getting the value from user futex address:
Since we are in a hurry, we use a spin lock and we can’t sleep. Try to get the value with page fault disabled (when enable, we might sleep).
If we fail, we aren’t sure if the address is invalid or is just a
page fault. Then, release the lock (so we can sleep) and try to get
the value with page fault enabled. In order to trigger a page fault
handling, we just call __get_user()
again. If we sleep with enqueued
futexes, we might miss a wake, so dequeue everything before sleeping.
If get_user succeeds, this mean that the address is valid and we do the work again. Since we just handled the page fault, the page is likely pinned in memory and we should be luckier this time and be able to get the value. If we fail anyway, we will try again.
If even with page faults enabled we get and error, this means that the address is not valid and we return from the syscall.
If we got an unexpected value or need to treat a page fault and realized that a futex was awakened, we can priority this and return success.
In success, enqueue the futex in the correct bucket
Return
1 - We were awake in the process and nothing is enqueued
0 - Everything is enqueued and we are ready to sleep
0< - Something went wrong, nothing is enqueued, return error code
-
int
__futex_waitv
(struct futex_waiter_head *futexv, unsigned int nr_futexes, struct __kernel_timespec __user *timo, unsigned int flags)¶ Enqueue the list of futexes and wait to be woken
Parameters
struct futex_waiter_head *futexv
List of futexes to wait
unsigned int nr_futexes
Length of futexv
struct __kernel_timespec __user *timo
Timeout
unsigned int flags
Timeout flags
Return
0 >= - Hint of which futex woke us
0 < - Error code
-
int
compat_futex_parse_waitv
(struct futex_waiter_head *futexv, struct compat_futex_waitv __user *uwaitv, unsigned int nr_futexes)¶ Parse a waitv array from userspace
Parameters
struct futex_waiter_head *futexv
Kernel side list of waiters to be filled
struct compat_futex_waitv __user *uwaitv
Userspace list to be parsed
unsigned int nr_futexes
Length of futexv
Return
Error code on failure, pointer to a prepared futexv otherwise
-
int
futex_parse_waitv
(struct futex_waiter_head *futexv, struct futex_waitv __user *uwaitv, unsigned int nr_futexes)¶ Parse a waitv array from userspace
Parameters
struct futex_waiter_head *futexv
Kernel side list of waiters to be filled
struct futex_waitv __user *uwaitv
Userspace list to be parsed
unsigned int nr_futexes
Length of futexv
Return
Error code on failure, pointer to a prepared futexv otherwise
-
struct futex_waiter_head *
futex_get_parent
(uintptr_t waiter, unsigned int index)¶ For a given futex in a futexv list, get a pointer to the futexv
Parameters
uintptr_t waiter
Address of futex in the list
unsigned int index
Index of futex in the list
Return
A pointer to its futexv struct
-
void
futex_mark_wake
(struct futex_waiter *waiter, struct futex_bucket *bucket, struct wake_q_head *wake_q)¶ Find the task to be wake and add it in wake queue
Parameters
struct futex_waiter *waiter
Waiter to be wake
struct futex_bucket *bucket
Bucket to be decremented
struct wake_q_head *wake_q
Wake queue to insert the task
-
int
futex_parse_requeue
(struct futex_requeue *rq, struct futex_requeue __user *uaddr, bool *shared)¶ Copy a user
struct futex_requeue
and check it’s flags
Parameters
struct futex_requeue *rq
Kernel struct
struct futex_requeue __user *uaddr
Address of user struct
bool *shared
Out parameter, defines if this is a shared futex
Return
0 on success, error code otherwise