gh-149101: Implement PEP 788 (#149116)

47 files changed
encukou left a comment:
Thanks for adding these!
I'll send notes for Doc/ now; code review coming up.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
🤖 New build scheduled with the buildbot fleet by @ZeroIntensity for commit bc78c10 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F149116%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
For buildbots: the RHEL8 failures aren't relevant. Refleaks are worrying, though.
Never mind; main currently leaks (#149179).
encukou left a comment:
Today's part of the review
(If some comment doesn't make sense, it might be because I didn't read through everything yet.)
[function.PyInterpreterGuard_FromView]
added = '3.15'
This is missing from the PEP; I assume that's an oversight. Just update the PEP when you mark it Final, and ask SC to rubber-stamp it.
// For debugging purposes, we emit a fatal error if someone
// CTRL^C'ed the process.
I think it's nice to have this for users. Otherwise, developers debugging a stuck process will have to pkill every time they mess something up.
Or Ctrl+\ :)
At least limit this to KeyboardInterrupt? I really think this is not a good reason to bring down the entire process.
Do we have a built-in mechanism to only catch KeyboardInterrupt?
I really think this is not a good reason to bring down the entire process.
Well, if we simply don't allow this to be interrupted, then developers will just have to kill the process during debugging, which is a little annoying, but if the process is going to be killed anyway, why not make it easier for users to do that?
The alternative is to just stop waiting on guards if a CTRL^C occurs, but I think the process will likely segfault if we do that because threads are having their protection against finalization removed out from under them.
Do we have a built-in mechanism to only catch KeyboardInterrupt?
PyErr_ExceptionMatches(PyExc_KeyboardInterrupt)
Well, if we simply don't allow this to be interrupted, then developers will just have to kill the process during debugging, which is a little annoying, but if the process is going to be killed anyway, why not make it easier for users to do that?
Is the process going to be killed, though? We're finalizing the runtime here, not the process. Embedders can still Py_Initialize afterwards.
Is the process going to be killed, though? We're finalizing the runtime here, not the process. Embedders can still Py_Initialize afterwards.
This is for the case in which the developer forgot to call PyInterpreterGuard_Close. In that case, the process has to hang indefinitely; if it somehow doesn't, then there's a thread-safety issue, because the interpreter was somehow able to finalize while a guard was active.
Are you concerned about someone accidentally hitting this while they're shutting down the process? As a compromise, I could make it so that CTRL^C will only emit a fatal error if, say, 1 second has passed while waiting on guards. We could also make it only available in certain environments, like with a -X interrupt_interp_guards flag.
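The timed compromise could look roughly like this. This is a hedged Python-level sketch of the idea, not the actual C implementation; the function name and grace value are made up:

```python
import threading
import time

def wait_for_guards(done: threading.Event, grace: float = 1.0) -> None:
    """Wait for all guards to close; fail hard only after a grace period."""
    deadline = time.monotonic() + grace
    while not done.wait(timeout=0.05):
        if time.monotonic() >= deadline:
            # Stand-in for the proposed fatal error after ~1 second of
            # waiting on unclosed guards.
            raise RuntimeError("finalization blocked: a guard was left open")
```

With something like this, a Ctrl+C that lands during normal, fast finalization would never hit the fatal path.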
Isn't it also for the case where finalization is taking a long time? Perhaps it'll take 2 seconds to safely save some data before shutdown?
I added it solely for debugging unclosed guards. Unfortunately, there's not any way to distinguish between a long time and never here.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
For some reason, clangd thought it would be a good idea to automatically apply its ugly formatting style on save. It has since been uninstalled.
But this time, I didn't accidentally reformat all of pylifecycle.c!
// it is not necessarily guaranteed that countdown is zero). We use it
// to simply prevent the finalization from constantly spinning and
// atomically reading the countdown.
_PyEvent_Reset(&interp->finalization_guards.done);
I'd prefer if a threading expert checked that _PyEvent_Reset is a safe primitive.
But before pinging one: why is countdown necessary? Could this do with only the R/W lock, with guards as readers and finalization (which sets the Finalized flag) as a writer?
Resetting an event isn't a super common thing to do, but it is safe as long as there aren't any waiters (for reference, we do have Event.clear in the threading module).
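For reference, the Python-level threading.Event behaves the same way; clearing is fine when nothing is waiting (illustrative only, not the C _PyEvent code):

```python
import threading

ev = threading.Event()
ev.set()             # a guard release "notifies" the finalizing thread
assert ev.is_set()
ev.clear()           # safe here: there are no waiters at this point
assert not ev.is_set()
```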
It's likely possible to do this with only the RW lock, but it would require a lot of refactoring. During finalization, our current setup for "pre-finalization" tasks (guards, non-daemon threads, atexit callbacks, etc.) is as follows:

1. First, we try to run all of them sequentially. So, join is called for each thread, atexit callbacks are executed, and in this case, we wait on the PyEvent for guards.
2. Then, we make a stop-the-world pause to do an atomic check for the number of each remaining. We know other threads can't create more things during this period.
3. If any of them is non-zero (meaning that more tasks were created while the world was stopping), then the world is started and the cycle repeats.
4. Otherwise, the world is kept stopped, and then all non-main threads are deleted and thread state attachment is disallowed before the world is started again.

We need the countdown for step 2. If we don't track the number of guards, we can't know whether they're all cleaned up, so we don't know whether to run another cycle. With the current setup, the alternative without the countdown would involve disallowing the creation of new guards prematurely (before step 1), but that would mean threads can't protect against finalization in perfectly valid scenarios. We can't do it after step 4, because the world is stopped, so threads wouldn't be able to release their guards.
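A toy Python model of the countdown/event cycle described above might look like this (names mirror the discussion; this is an illustrative sketch, not the CPython implementation):

```python
import threading

class FinalizationGuards:
    """Toy model: guards bump a countdown; finalization waits for zero."""

    def __init__(self):
        self.lock = threading.Lock()       # stands in for the STW pause
        self.countdown = 0
        self.done = threading.Event()

    def acquire(self):
        with self.lock:
            self.countdown += 1

    def release(self):
        with self.lock:
            self.countdown -= 1
            if self.countdown == 0:
                self.done.set()            # notify the finalizing thread

    def wait_all_closed(self):
        while True:
            with self.lock:
                if self.countdown == 0:    # step 2's atomic check
                    return
                self.done.clear()          # reset before waiting again
            self.done.wait(timeout=0.05)   # step 1's wait; then re-check
```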
Do we need a _PyRWMutex_TryLock that fails if the lock is held?
Well, guess that's for after the beta :)
Do we need a _PyRWMutex_TryLock that fails if the lock is held?
What for? Interrupting the lock acquisition?
With that, this could be:

1. as before
2. We stop the world and try acquiring the write lock. We know other threads can't create more things during this period.
3. If acquiring failed (meaning that more tasks were created while the world was stopping), then the world is started and the cycle repeats.
4. as before
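A Python sketch of that try-acquire variant, with a plain lock standing in for the write side of the hypothetical _PyRWMutex_TryLock (all names here are illustrative):

```python
import threading

# Guards would hold the read side; finalization tries the write side.
# A plain Lock models only the write side, for illustration.
write_side = threading.Lock()

def stw_check() -> bool:
    """Step 2/3 of the cycle: return True if no guard is active."""
    if write_side.acquire(blocking=False):
        # Acquired: no guards were active, so keep the world stopped.
        return True
    # Failed: a guard was created while the world was stopping, so the
    # world is started and the cycle repeats.
    return False
```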
This doesn't seem safe to me. Isn't there a race here with make_pre_finalization_calls and try_acquire_interp_guard?

1. T1 checks finalization_guards.countdown > 0
2. T2 decrements, notifies, and resets
3. T1 waits on finalization_guards.done
What would be the race there, exactly? It's supposed to wait on finalization_guards.done multiple times.
Oh, I misunderstood your comment; I was looking at the finalization_guards.countdown read during the STW pause. I think I see the race now. We need to grab the lock before waiting. Edit: actually, we can just check finalization_guards.countdown > 0 while spinning.
PyThreadState_Swap(to_restore);
}
...
PyThreadState_Swap(to_restore);
PyThreadState_Swap doesn't need to happen twice, right?
Also, could you assert the result?
Good catch, we don't need the PyThreadState_Swap in the branch above.
Also, could you assert the result?
What result am I asserting?
PyThreadState_Swap should return tstate.
if (tstate->ensure.owned_guard != NULL) {
    PyInterpreterGuard_Close(tstate->ensure.owned_guard);
    tstate->ensure.owned_guard = NULL;
}
This looks fishy: tstate shouldn't be used after PyThreadState_Clear wipes it, and AFAIK owned_guard isn't locked here.
Technically, it's safe to access owned_guard as long as the thread state is only cleared, not deleted (because the memory is still valid). It's not really safe to release the guard before we've cleared the thread state because the guard needs to be there to protect finalizers.
Also, what do you mean by "locked"?
Going by the docs, PyThreadState_Clear will “Reset all information in a thread state object.”
I realize we can rely on undocumented behavior here, but still, it would be nice to “take over” owned_guard & delete_on_release (stash them in locals and unset the tstate's owned_guard) while everything is still attached.
what do you mean by "locked"?
I don't see what's preventing another thread's EnsureFromView from relying on the owned_guard just before it's set to NULL.
Ah, okay, I see where you're coming from now.
I realize we can rely on undocumented behavior here, but still, it would be nice to “take over” owned_guard & delete_on_release (stash them in locals and unset the tstate's owned_guard) while everything is still attached.
This sounds reasonable, though I'm a little uneasy about setting owned_guard to NULL in PyThreadState_Clear. (I can't think of a practical example where it would be an issue, but something feels a little off about it.)
Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)? We can always add it later (presumably after b1).
I don't see what's preventing another thread's EnsureFromView from relying on the owned_guard just before it's set to NULL.
Another thread shouldn't be touching this owned_guard at all (the field is local to the thread, not to the interpreter). The other thread will have its own guard associated with its own EnsureFromView call.
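The "take over before clearing" idea can be sketched in Python (toy stand-ins for PyThreadState, the guard, and PyThreadState_Clear; none of this is the actual code):

```python
class Guard:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class ThreadState:
    """Toy stand-in for PyThreadState, with only the field under discussion."""
    def __init__(self, guard):
        self.owned_guard = guard

def clear_thread_state(tstate):
    """Stand-in for PyThreadState_Clear: resets all information."""
    tstate.owned_guard = None

def finalize_thread(tstate):
    # Take over the guard while everything is still attached, so the
    # wiped field is never read afterwards.
    guard = tstate.owned_guard
    tstate.owned_guard = None
    clear_thread_state(tstate)   # finalizers still run under `guard`
    if guard is not None:
        guard.close()            # release only after the state is cleared
```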
Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)? We can always add it later (presumably after b1).
I got over my fear -- done in 25687af.
Ah, OK.
Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)?
Definitely.
Ideally, close the guard before the PyThreadState_Clear, though :)
Failing CI is because Ubuntu is down.
colesbury left a comment:
I've only briefly looked through the PR, but my thoughts so far are:
- This is a big change with plenty of opportunities for subtle bugs and deadlocks
- It doesn't seem wise to me to rush this into 3.15 at the end of the feature development cycle
static const PyThreadState _no_tstate_sentinel = {0};
#define NO_TSTATE_SENTINEL ((PyThreadState *)&_no_tstate_sentinel)
PyThreadState objects are large. We shouldn't be creating one here just for a sentinel return value.
if (attached_tstate != NULL && attached_tstate->interp == interp) {
    /* Yay! We already have an attached thread state that matches. */
    ++attached_tstate->ensure.counter;
    return NO_TSTATE_SENTINEL;
I feel like I'm missing something here. How is this supposed to be used by callers? Sometimes PyThreadState_Ensure will return a completely unusable (but not NULL) value? Like how are callers supposed to know if the PyThreadState* returned from this function is useful?
I didn't see any discussions about NO_TSTATE_SENTINEL in the PEP. I also didn't see it in the discussion forums, but that can be hard to follow/search so I might have missed the post.
Like how are callers supposed to know if the PyThreadState* returned from this function is useful?
They aren't supposed to, and that's by design. The PyThreadState * returned by Ensure/EnsureFromView is a handle to the previous thread state, which isn't relevant to the user. If they want to see the thread state that was just attached, they have to use PyThreadState_Get().
I also didn't see it in the discussion forums, but that can be hard to follow/search so I might have missed the post.
I think it was brought up at some point in the C API WG discussion. I do agree that it's a little hacky, but it doesn't seem like a huge deal to me either. Switching to int PyThreadState_Ensure(PyInterpreterGuard *guard, PyThreadState **out_ptr) would complicate API usage, so it seemed like a fair tradeoff.
So if I understand correctly, the API has a public return value that is by design useless?
It's in the PEP:
This function will return NULL to indicate a memory allocation failure, and otherwise return a pointer to the thread state that was previously attached (which might have been NULL, in which case a non-NULL sentinel value is returned instead to differentiate from failure – this means that this function will sometimes return an invalid PyThreadState pointer).
This function will return NULL to indicate a memory allocation failure, and otherwise return a pointer to the thread state that was previously attached.
I find that text very confusing. This isn't a pointer to a previously attached thread state!
We have a return value that is:
- Useless to the caller
- An existing type used in multiple other APIs! So callers should reasonably expect that they interoperate!
Doesn't this stand out as a giant API design red flag to you?
Yes, it's not ideal. The result is something you can only pass to PyThreadState_Release -- or compare with another PyThreadState*.
Hugo granted an exception to backport this to 3.15 if we don't make the b1 deadline (though I did tell him that if we do go over the deadline, it won't be by that long). Are you suggesting that we defer this to 3.16?

Yes.

You should take that up with the Steering Council, as the PEP was approved for 3.15.

I'm taking it up with the PEP author, who can choose to disagree with my opinion.

I'd like to aim for 3.15, at least for the time being. I think the majority of thread-safety bugs here are going to be rare and won't require large changes to the implementation. That said, if it turns out that we do find some fundamental flaws here, I am okay with deferring the PEP to 3.16.
Hugo has graciously given me permission to backport this if we don't make the May 5th deadline, but let's try to get this done in time!
I will write a full tutorial and migration guide once this is merged; I want to first make sure that this lands before the beta freeze.