runtime: distinguish two kind of mutexes #13716

gasche · 2025-01-08T09:59:32Z

Summary

This PR distinguishes two kinds of mutexes:

- 'runtime' mutexes are for blocking critical sections which do not access the runtime.
- 'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system.

This refactoring comes from the discussions in #13227. It tries to avoid a class of bugs where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlocks.

More details

The runtime has a caml_plat_lock_blocking function that takes a mutex in the obvious way. This function should be used very carefully, because it blocks a domain without transferring control to its backup thread or otherwise listening to STW interruptions, and it can easily cause deadlocks if the critical section itself contains an STW poll point. In #13063, @gadmm introduced a different mutex-taking function, caml_plat_lock_non_blocking, which releases the domain lock when it needs to block and should be used in any critical section that could be long or needs to use the runtime system.
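To make the difference between the two locking styles concrete, here is a minimal toy model using plain pthreads. It is not the runtime's implementation: every `toy_*` name is hypothetical, and `toy_domain_lock` is only a stand-in for the domain lock, assumed here to be a simple mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for the domain lock (an assumption, not the real runtime structure). */
static pthread_mutex_t toy_domain_lock = PTHREAD_MUTEX_INITIALIZER;
/* Per-thread flag: does the current thread hold the toy domain lock? */
static _Thread_local bool toy_holds_domain_lock = false;

static void toy_acquire_domain_lock(void) {
  int rc = pthread_mutex_lock(&toy_domain_lock);
  assert(rc == 0); (void)rc;
  toy_holds_domain_lock = true;
}

static void toy_release_domain_lock(void) {
  assert(toy_holds_domain_lock);
  toy_holds_domain_lock = false;
  int rc = pthread_mutex_unlock(&toy_domain_lock);
  assert(rc == 0); (void)rc;
}

/* Blocking style: wait on m in place. If the caller holds the toy domain
   lock, it keeps holding it for the whole wait. */
static void toy_lock_blocking(pthread_mutex_t *m) {
  int rc = pthread_mutex_lock(m);
  assert(rc == 0); (void)rc;
}

/* Non-blocking style: try the fast path first; if the mutex is contended,
   give up the toy domain lock while waiting, then take it back. */
static void toy_lock_non_blocking(pthread_mutex_t *m) {
  assert(toy_holds_domain_lock);
  if (pthread_mutex_trylock(m) == 0) return; /* uncontended: nothing to release */
  toy_release_domain_lock();                 /* let other threads make progress */
  int rc = pthread_mutex_lock(m);
  assert(rc == 0); (void)rc;
  toy_acquire_domain_lock();                 /* resume with both locks held */
}
```

In this model, a thread waiting in `toy_lock_non_blocking` does not hold the toy domain lock, so whoever owns `m` can still acquire it and finish its critical section. A thread waiting in `toy_lock_blocking` gives up nothing while it waits.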

We are still learning what a correct usage discipline for these two functions is (on Monday I temporarily introduced a bug in trunk, which was detected and reported on Tuesday morning by @jmid in #13713 and fixed on Tuesday evening by @gadmm in #13714). In #13714 we realized that it is incorrect to mix uses of lock_blocking and lock_non_blocking on the same mutex -- except in very specific use-cases that do not currently occur in the runtime. The current PR proposes to separate the two APIs so that there is no risk of making this mistake again in the future. The goal is to have a system that is simpler to reason about and to use correctly for non-experts such as myself.
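One way to picture "separating the two APIs" is to give each kind of mutex its own C type, so that a mutex created for one discipline cannot be passed to the functions of the other. The sketch below only illustrates that idea: the `toy_*` type and function names are hypothetical and are not the names used in this PR, and the release of the runtime lock on the slow path is marked by a comment rather than implemented.

```c
#include <assert.h>
#include <pthread.h>

/* Two distinct wrapper types around the same OS primitive (hypothetical names). */
typedef struct { pthread_mutex_t os; } toy_runtime_mutex;
typedef struct { pthread_mutex_t os; } toy_mutator_mutex;

#define TOY_RUNTIME_MUTEX_INIT { PTHREAD_MUTEX_INITIALIZER }
#define TOY_MUTATOR_MUTEX_INIT { PTHREAD_MUTEX_INITIALIZER }

/* Runtime mutexes: the caller blocks in place, so the critical section
   must be short and must not access the runtime. */
static void toy_runtime_lock(toy_runtime_mutex *m) {
  int rc = pthread_mutex_lock(&m->os);
  assert(rc == 0); (void)rc;
}

static void toy_runtime_unlock(toy_runtime_mutex *m) {
  int rc = pthread_mutex_unlock(&m->os);
  assert(rc == 0); (void)rc;
}

/* Mutator mutexes: the critical section may use the runtime. */
static void toy_mutator_lock(toy_mutator_mutex *m) {
  if (pthread_mutex_trylock(&m->os) == 0) return; /* fast path */
  /* Slow path: a real implementation would release the runtime lock here
     and re-acquire it after the wait; this sketch simply waits. */
  int rc = pthread_mutex_lock(&m->os);
  assert(rc == 0); (void)rc;
}

static void toy_mutator_unlock(toy_mutator_mutex *m) {
  int rc = pthread_mutex_unlock(&m->os);
  assert(rc == 0); (void)rc;
}
```

With this layout, a call such as `toy_mutator_lock(&some_runtime_mutex)` passes a pointer of an incompatible type, which C compilers diagnose. The mixed usage described above then shows up at compile time rather than as a deadlock at run time.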

gasche · 2025-01-08T10:18:34Z

(I intend to enable multicoretests CI on this PR. It should only be a refactoring that does not change the runtime behavior, but I wrote new code for mutator mutexes instead of just moving code around, so there may be mistakes. I will first wait for the usual CI to be green: I have only tested it lightly locally, and there is no need to unleash automated testing yet.)


gasche · 2025-01-08T10:24:15Z

Note: I am afraid there is some overlap and potential conflicts between this PR and #13416, as the PR exploits the definition of sync_mutex in sync_posix.h as a pointer to a caml_plat_mutex. There is not much I can do in particular about this (I don't have the expertise to review #13416 to help it move forward), but I tried to not touch sync_posix.h to minimize conflicts.

gasche mentioned this pull request Jan 8, 2025

Audit and fix caml_plat_lock_blocking usage #13227

Merged

gasche added the runtime-system label Jan 8, 2025

gasche force-pushed the caml_plat_lock_non_blocking5 branch from 476e163 to b839f44 Compare January 8, 2025 10:22

dra27 assigned gasche Jan 8, 2025
