Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: distinguish two kind of mutexes #13716

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

gasche
Copy link
Member

@gasche gasche commented Jan 8, 2025

Summary

This PR distinguishes two kind of mutexes:

  • 'runtime' mutexes are for blocking critical sections which do not access the runtime.

  • 'mutator' mutexes are for non-blocking critical sections (blocking on a mutex releases the runtime lock) which may access the runtime system.

This refactoring comes from the discussions in #13227, it tries to avoid a class of bug where the same mutex is used in both blocking and non-blocking fashion, resulting in subtle deadlock situations.

More details

The runtime has a caml_plat_lock_blocking function that takes a mutex in the obvious way. This function should be used very carefully, because it in blocks a domain without transferring control to its backup thread or otherwise listening to STW interruptions, and it can easily cause deadlocks if the critical section itself contains an STW poll point. In #13063, @gadmm introduced a different mutex-taking function, caml_plat_lock_non_blocking that releases the domain lock when it needs to block, and should be used in any critical section that could be long or needs to use the runtime system.

We are still learning about what's a correct usage discipline for these two functions (on Monday I temporarily introduced a bug in trunk, detected reported on Tuesday morning by @jmid in #13713 and fixed on Tuesday evening by @gadmm in #13714 ). In #13714 we realized that it is incorrect to mix uses of lock_blocking and lock_non_blocking on the same mutex -- except in very specific use-cases that are not currently used in the runtime. The current PR proposes to separate the two APIs so that there is no risk to make this mistake again in the future. The goal is to have a system that is simpler to reason about and to use correctly for non-experts such as myself.

@gasche
Copy link
Member Author

gasche commented Jan 8, 2025

(I intend to enable multicoretests CI on this PR -- it should only be a refactoring that does not change the runtime behavior, but I wrote new code for mutator mutexes instead of just moving code around, so there may be mistakes. But I will wait first for the usual CI to be green, I have only tested it lightly locally and there is no need to unleash automated testing yet.)

- 'runtime' mutexes are for blocking critical sections which do not
   access the runtime.

- 'mutator' mutexes are for non-blocking critical sections (blocking on
   a mutex releases the runtime lock) which may access the runtime
   system.

This refactoring comes from the discussions in ocaml#13227, it tries to avoid
a class of bug where the same mutex is used in both blocking and
non-blocking fashion, resulting in subtle deadlock situations.
@gasche gasche force-pushed the caml_plat_lock_non_blocking5 branch from 476e163 to b839f44 Compare January 8, 2025 10:22
@gasche
Copy link
Member Author

gasche commented Jan 8, 2025

Note: I am afraid there is some overlap and potential conflicts between this PR and #13416, as the PR exploits the definition of sync_mutex in sync_posix.h as a pointer to a caml_plat_mutex. There is not much I can do in particular about this (I don't have the expertise to review #13416 to help it move forward), but I tried to not touch sync_posix.h to minimize conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant