RSDK-9440 Report machine `state` through `GetMachineStatus` #4616

benjirewis · 2024-12-10T20:57:39Z

RSDK-9440

Changes:

- Adds a new enumerated State field to robot.MachineStatus on both the server and client side
- Starts machines with a "minimal" config so the web service starts earlier, before reconfiguring with the full config
- Reports StateInitializing in robot.MachineStatus before the reconfigure with the full config occurs
- Reports StateRunning in robot.MachineStatus after the reconfigure with the full config occurs
- Exposes a SetInitializing method on robot.LocalRobot for the above two points to work
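
For readers skimming the thread, here is a minimal sketch of how an enumerated state could be derived from a single "initializing" flag. Only `StateInitializing`, `StateRunning`, and `MachineStatus` are names taken from this PR; the `MachineState` type, the `machine` struct, and `StateUnknown` are illustrative assumptions and not the actual RDK definitions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MachineState is a hypothetical stand-in for the enumerated state field.
type MachineState uint8

const (
	StateUnknown MachineState = iota // assumed zero value, not from the PR
	StateInitializing
	StateRunning
)

// String makes the state readable in logs and test failures.
func (s MachineState) String() string {
	switch s {
	case StateInitializing:
		return "initializing"
	case StateRunning:
		return "running"
	default:
		return "unknown"
	}
}

// MachineStatus is a trimmed-down, illustrative status payload.
type MachineStatus struct {
	State MachineState
}

// machine tracks whether the first full reconfigure has completed.
type machine struct {
	initializing atomic.Bool
}

// MachineStatus reports initializing until the flag is cleared.
func (m *machine) MachineStatus() MachineStatus {
	if m.initializing.Load() {
		return MachineStatus{State: StateInitializing}
	}
	return MachineStatus{State: StateRunning}
}

func main() {
	m := &machine{}
	m.initializing.Store(true)
	fmt.Println(m.MachineStatus().State) // before the full reconfigure
	m.initializing.Store(false)
	fmt.Println(m.MachineStatus().State) // after the full reconfigure
}
```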

Testing:

- Modifies server and client side MachineStatus tests to make assertions on State
- Basic integration test
- Checks for state running before returning from client.New when in testing
- Adds more injected machine status functions to client and client session tests

benjirewis

Still a WIP wrt testing; will leave in draft. These are my ideas so far, though.

cheukt · 2024-12-10T21:25:40Z

web/server/entrypoint.go

+	// and immediately start web service. We need the machine to be reachable
+	// through the web service ASAP, even if some resources take a long time to
+	// initially configure.
+	minimalProcessedConfig := &(*fullProcessedConfig)


maybe CopyPublicFields? might look less janky
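
One Go detail that seems relevant to the line under review: `&(*p)` evaluates to the same pointer as `p`, so it does not produce a copy. The sketch below uses a hypothetical `Config` struct (not the real `config.Config`) to contrast that aliasing with a shallow struct copy; whether a shallow copy is sufficient for the real config type is not something this sketch can answer.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down stand-in for a processed config.
type Config struct {
	Modules   []string
	Processes []string
}

// aliasOf returns &(*c), which is the same pointer as c: no copy is made.
func aliasOf(c *Config) *Config {
	return &(*c)
}

// shallowCopyOf copies the struct value into a new variable and returns
// the address of that new variable.
func shallowCopyOf(c *Config) *Config {
	cp := *c
	return &cp
}

func main() {
	full := &Config{Modules: []string{"my-module"}}

	// Clearing a field through the alias also clears it on the original.
	alias := aliasOf(full)
	alias.Modules = nil
	fmt.Println("after alias edit, modules in full:", len(full.Modules))

	// Clearing a field on the shallow copy leaves the original untouched.
	full.Modules = []string{"my-module"}
	cp := shallowCopyOf(full)
	cp.Modules = nil
	fmt.Println("after copy edit, modules in full:", len(full.Modules))
}
```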

cheukt · 2024-12-10T21:27:40Z

web/server/entrypoint.go

+	if err := web.RunWeb(ctx, myRobot, options, s.logger); err != nil {
+		return err
+	}
+	myRobot.Reconfigure(ctx, fullProcessedConfig)


is there a risk that the config watcher would call reconfigure before this reconfigure is called?

Yes; good call out. Discussed offline a bit, we should start the config watcher goroutine only after this reconfigure is called.

That does mean that we have the same behavior as today: when the robot is starting up (both in first robotimpl.New(minimalConfig) and myRobot.Reconfigure(fullProcessedConfig)) no new config changes will be seen. So, if a user messes up their config and accidentally starts a module that takes forever to start up, they will not be able to quickly remove that module from their config. Instead, they'll have to restart/shutdown their robot if they want to stop the initial construction. Once again, I don't think this is different from what we have currently, and, of course, viam-server is receptive to gRPC requests earlier with the changes in this PR.
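
A rough sketch of the ordering described above, assuming a hypothetical `robot` type that only records which config it was reconfigured with. The watcher goroutine is launched only after the initial full reconfigure returns, so an update that arrives during startup is applied afterwards rather than racing with it.

```go
package main

import (
	"fmt"
	"sync"
)

// robot is a hypothetical stand-in that records the order of reconfigures.
type robot struct {
	mu    sync.Mutex
	calls []string
}

// Reconfigure records the config source it was called with.
func (r *robot) Reconfigure(source string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.calls = append(r.calls, source)
}

// Calls returns a copy of the recorded reconfigure order.
func (r *robot) Calls() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.calls...)
}

// startup applies the full config first and only then starts the watcher
// goroutine, so no watched update can be applied before the initial one.
func startup(r *robot, updates <-chan string) *sync.WaitGroup {
	r.Reconfigure("initial-full-config")

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for cfg := range updates {
			r.Reconfigure(cfg)
		}
	}()
	return &wg
}

func main() {
	updates := make(chan string, 1)
	updates <- "cloud-update-1" // arrives while the machine is still starting
	close(updates)

	r := &robot{}
	startup(r, updates).Wait()
	fmt.Println(r.Calls())
}
```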

benjirewis · 2024-12-11T21:32:24Z

I broke a lot of tests that I'm presuming are expecting resources to be available as soon as the web service is available. Thinking about it.

cli/client_test.go

robot/client/client.go

dgottlieb · 2024-12-16T14:13:28Z

web/server/entrypoint.go

+	minimalProcessedConfig.Modules = nil
+	minimalProcessedConfig.Processes = nil
+
+	myRobot, err := robotimpl.New(ctx, minimalProcessedConfig, s.logger, robotOptions...)


Is this the best way of achieving what we want or the most expedient?

I'm fine with this as-is. And I'm kind of fine never coming back to think about this. But the whole "robot owns the web server" feels backwards.

There would be fewer states to consider if we could start a web service and register robots with it. And there'd be a small API that describes:

What state the robot is in (startup or running) and

which APIs are available, e.g:

just "GetMachineStatus" and maybe "ResourceNames"

but none of "SetPower"/other resource specific APIs

But in this PR we have 90-100 lines between this comment/robot.New and when the web service is started. That's a lot of lines to accidentally break our contract and add some blocking code.

> Is this the best way of achieving what we want or the most expedient?

Perhaps not, and I understand your argument there. I've introduced a slightly different/simpler mechanic for controlling the "initializing" value, so that might address some of your concerns here. I didn't go so far as starting a web service and registering robots with it (if I'm understanding what you're saying.)

dgottlieb · 2024-12-16T14:14:47Z

web/server/entrypoint.go

+	// Use `fullProcessedConfig` as the initial `oldCfg` for the config watcher
+	// goroutine, as we want incoming config changes to be compared to the full
+	// config.
+	oldCfg := fullProcessedConfig
 	utils.ManagedGo(func() {


It's probably about time this lambda gets its own function/name. I think a lot of my above concern goes away if these 60 lines of control flow keyword soup are hidden by default.

Great idea; working on it + will re-request review when done.

Done; maybe.

dgottlieb · 2024-12-16T14:17:08Z

web/server/entrypoint.go

@@ -479,7 +502,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
 	}()
 	defer cancel()

-	options, err := s.createWebOptions(processedConfig)
+	// Create initial web options with `minimalProcessedConfig`.
+	options, err := s.createWebOptions(minimalProcessedConfig)


I can't comment off-diff. The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).

Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting: is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?

To clarify, this is a question about existing behavior. I don't think this patch changed anything here.

> The goroutine spun off above will check for diff.NetworkEqual and if not, run myRobot.StartWeb (newline 490).

Correct; I believe there's a call to StopWeb before that happens, too. And then a Reconfigure after that StopWeb. All about handling network changes in the config.

> Just below this we call web.RunWeb. I'm not sure what the significance is between having different methods, StartWeb and RunWeb, but assuming that's not interesting

It's interesting having those two methods. StartWeb starts up the web service on the robot. RunWeb does that, but also waits on <-ctx.Done(), so it's a blocking call and represents the "main" program that "runs" when you call go run web/cmd/server/main.go.

> Is it possible for those two things to race? Are we guaranteed to end up with the right set of weboptions?

It depends what you mean by race. There is a lock on starting the web service, so I'm not sure we'd see a race manifest as an actual DATA RACE, but I think you are "right" in wondering about the "right set of weboptions." I'm not entirely sure, but I think RunWeb would run into an error if it tried to start the web service with an old set of options after the config watcher goroutine had started it already with a new set of options. So, my guess is we'd see an error from RunWeb and an inability to start the server in the event of the race you're describing.
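
The guess above can be modeled with a mutex-guarded start that refuses a second start. This is only a sketch of the suspected behavior under a hypothetical `webService` type and `errAlreadyRunning` error; it has not been checked against the real web service code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyRunning is a hypothetical error for a duplicate start.
var errAlreadyRunning = errors.New("web service already running")

// webService is a hypothetical stand-in guarded by a lock.
type webService struct {
	mu      sync.Mutex
	running bool
	options string
}

// Start is serialized by the mutex, so concurrent callers cannot corrupt
// state, but whichever caller arrives second gets an error back.
func (w *webService) Start(options string) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.running {
		return errAlreadyRunning
	}
	w.running = true
	w.options = options
	return nil
}

func main() {
	w := &webService{}
	// The config watcher wins and starts the service with new options.
	fmt.Println("watcher start:", w.Start("new-options"))
	// The later start with stale options fails instead of overwriting them.
	fmt.Println("stale start:", w.Start("old-options"))
	fmt.Println("options in effect:", w.options)
}
```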

dgottlieb · 2024-12-16T14:25:57Z

robot/impl/local_robot.go

@@ -498,6 +502,8 @@ func newWithResources(
 	}

 	successful = true
+	// Robot is "initializing" until first reconfigure after initial creation completes.
+	r.initializing.Store(true)


Technically there's a Reconfigure earlier on newcode 491 that sets this value to false. Not to mention the initializing value is initialized (ugh) to false. Two things:

I'm taking that it's important we exit this function with initializing set to true. But I would expect to see the setting up at the top near the constructor. Can we document that the placement here is intentional to avoid prior calls mucking with the state?

Is this function guaranteed to not start webserver and expose the GetMachineStatus API? If it can, it seems we might be setting initializing too late and may allow clients to observe an illegal transition of ready -> initializing.

Also -- line 428 (old) 432 (new) refers to the mod manager web server. Do we need to consider/provide guidelines for how module SDKs (which are -- in theory -- different from "application SDKs") use and perhaps expose GetMachineStatus?

I've modified the mechanics here slightly. You'll see there's a new robot.Option to start a robot in initializing mode. You can then use SetInitializing(false) to mark the robot as running. This means that only the code here in web/server/entrypoint.go is "special" with respect to initialization. All other calls to robotimpl.New will create robots that always return robot.StateRunning from MachineStatus.

I'm not sure that will address all your concerns here, and I'll think a bit harder about your module question before re-requesting review.
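
A minimal sketch of the mechanic described above, using the functional-options pattern. `WithInitializing` and `SetInitializing` are the names used in this PR; the `localRobot` struct, the `options` struct, `New`, and `State` are simplified assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// options holds construction-time settings (hypothetical).
type options struct {
	initializing bool
}

// Option mutates construction-time settings.
type Option func(*options)

// WithInitializing asks for the robot to start in the initializing state.
func WithInitializing() Option {
	return func(o *options) { o.initializing = true }
}

// localRobot is a trimmed-down stand-in for the real robot type.
type localRobot struct {
	initializing atomic.Bool
}

// New builds a robot; without WithInitializing it reports running at once.
func New(opts ...Option) *localRobot {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	r := &localRobot{}
	r.initializing.Store(o.initializing)
	return r
}

// SetInitializing lets the entrypoint mark the robot as running later.
func (r *localRobot) SetInitializing(v bool) {
	r.initializing.Store(v)
}

// State reports the state a status call would return in this sketch.
func (r *localRobot) State() string {
	if r.initializing.Load() {
		return "initializing"
	}
	return "running"
}

func main() {
	plain := New()
	fmt.Println("default robot:", plain.State())

	special := New(WithInitializing())
	fmt.Println("entrypoint robot:", special.State())
	special.SetInitializing(false) // after the full reconfigure completes
	fmt.Println("entrypoint robot:", special.State())
}
```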

> Do we need to consider/provide guidelines for how module SDKs (which are -- in theory -- different from "application SDKs") use and perhaps expose GetMachineStatus?

Slightly confused about your question here. The mod manager web server is started before any modules are added to the module manager, so it should always be the case that "module SDKs" (I'm reading that as Golang, Python, and C++ module libraries) should have the ability to connect back to the RDK through the mod manager web server before any module process has started. Were you suggesting that the module libraries should be using MachineStatus to check the status of the parent RDK, or that they should expose their own MachineStatus endpoint for some reason?

> Slightly confused about your question here. The mod manager web server is started before any modules are added to the module manager, so it should always be the case that "module SDKs" (I'm reading that as Golang, Python, and C++ module libraries) should have the ability to connect back to the RDK through the mod manager web server before any module process has started.

Given that, it sounds like if a module, as soon as it was possible, tried calling MachineStatus they'd get an initialized == false.

But it also sounds like in the current, pre-patch code, a module could actually call ResourceNames before a regular "network client" could? Because we wouldn't have started accepting connections yet? And the result of calling ResourceNames would be undefined as the robot hasn't necessarily done/completed its initial reconfigure yet? Just considering the "happy path" where the initial robot config is good.

And if that's true, modules already have to be "resilient" to talking with an "uninitialized" robot. And we would not expect to need module changes. Such as the testing changes to "wait by default" for a robot to be initialized.

> Given that, it sounds like if a module, as soon as it was possible, tried calling MachineStatus they'd get an initialized == false.

That sounds correct to me.

> But it also sounds like in the current, pre-patch code, a module could actually call ResourceNames before a regular "network client" could? Because we wouldn't have started accepting connections yet?

That sounds correct to me. In particular, the "module web server" of the RDK will be open for connections while the regular web server will not be. So, a module could connect via the former before a regular "network client" could connect via the latter.

> And the result of calling ResourceNames would be undefined as the robot hasn't necessarily done/completed its initial reconfigure yet?

That does not sound correct to me. If a module is able to call ResourceNames through the module server, it will see the current status of the resource graph in terms of available names. If some resources have already completed configuration, the module will see them through ResourceNames.

> And if that's true, modules already have to be "resilient" to talking with an "uninitialized" robot. And we would not expect to need module changes.

Modules do expect a certain guarantee around modular dependencies: if a modular resource A depends on another resource B, it is expected that, barring any inability to create B, B will be available and usable through ResourceByName within the constructor for A. I actually broke that guarantee and caused TestComplexModule to fail here.

We need the web service, which is "weakly" (it's actually, annoyingly, a fourth type of hardcoded-weak dependency that is not registered via WeakDependencies in resource registration) dependent on all resources to Reconfigure before the modular base builds such that the web service is aware of the two motors that the modular base depends on.

I "fixed" the test by updating weak dependents more often and more simply in completeConfig (see my incoming comment.)

dgottlieb · 2024-12-16T14:33:30Z

robot/impl/local_robot.go

+	// been closed above. This ensures processes are shutdown before any files
+	// are deleted they are using.
+	//
+	// If initializing, machine will be starting with no modules, but may


I'm not sure I understand this. Does anything actually go wrong if we don't guard this "cleanup" logic? Or are we just suggesting that making these calls would be "wasteful" no-ops?

The existing comment/first paragraph refers to "cleanup unused packages", so if we never started up with any, where are they coming from? Existing files on the file system from a previous start?

@cheukt had mentioned it would be good to guard the lines below based on initialization.

> Or are we just suggesting that making these calls would be "wasteful" no-ops?

That's my understanding, yep.

> Existing files on the file system from a previous start?

Also my understanding, yep.

If we don't guard this call, I suspect that we would clean up modules from the full config that aren't in the initial minimal config, so we would end up deleting modules that are about to be used. If offline, the robot would no longer be able to re-download the module and start up correctly.

I think I understand. Because we're literally trying to start the server with an empty config, this code will delete all the things unless we communicate/check the special "this isn't a real reconfigure step" initializing flag.

Maybe this is worth clarifying -- but when we were talking about "this feature should appear as if the robot was first configured as if it had no components, followed by a reconfigure with components": I can't tell if we went with implementing this that way because it was the best way to solve the problem? Or if we went down that path because the above statement was interpreted as "should literally be implemented as" a bit?
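
The guard being discussed can be sketched as a pure function: given what is on disk and what the current config references, decide what to delete, but delete nothing while the machine is still initializing with the minimal config. `packagesToRemove` and its parameters are hypothetical and are not the actual package manager API.

```go
package main

import (
	"fmt"
	"sort"
)

// packagesToRemove returns on-disk packages that the current config does
// not reference. While initializing, the config is the minimal one, so
// nothing is removed: everything on disk would otherwise look unused.
func packagesToRemove(onDisk []string, configured map[string]bool, initializing bool) []string {
	if initializing {
		return nil
	}
	var stale []string
	for _, p := range onDisk {
		if !configured[p] {
			stale = append(stale, p)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	onDisk := []string{"module-a", "module-b"}

	// Minimal config during startup references no modules at all.
	fmt.Println(packagesToRemove(onDisk, map[string]bool{}, true))

	// After the full reconfigure, only truly unreferenced packages go.
	full := map[string]bool{"module-a": true}
	fmt.Println(packagesToRemove(onDisk, full, false))
}
```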

dgottlieb · 2025-01-03T16:24:24Z

web/server/entrypoint_test.go

+
+	// Set value for `DoNotWaitForRunningEnvVar` to allow connecting to a
+	// still-initializing machine.
+	test.That(t, os.Setenv(client.DoNotWaitForRunningEnvVar, "true"), test.ShouldBeNil)


I've always heard about how it's unsafe to change environment variables -- particularly in a multi-threaded program. I don't think I understand the claim well enough to know whether there's a problem here. Instead I'll ask:

does this need to be an environment variable because we need to control processes outside of the test program?

Or would a global variable do just as well?

Seeing that this env variable lookup happens alongside a test.IsTesting(), I'm assuming the latter.

I could also see value in letting users disable the behavior with a client dial option. But I don't personally need to see that in this PR.

Great point; it's the latter, and I've switched to using a global atomic boolean.
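
For illustration, a package-level atomic boolean could replace the environment variable roughly as follows. The names `doNotWaitForRunning` and `shouldWaitForRunning` are hypothetical; the `isTesting` parameter stands in for the `test.IsTesting()` lookup mentioned in the review comment.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// doNotWaitForRunning is a hypothetical package-level flag that tests can
// flip without touching process-wide environment variables.
var doNotWaitForRunning atomic.Bool

// shouldWaitForRunning reports whether a new client should block until
// the machine reports running. It only applies in testing, and a test can
// opt out by setting the flag.
func shouldWaitForRunning(isTesting bool) bool {
	return isTesting && !doNotWaitForRunning.Load()
}

func main() {
	fmt.Println("testing, default:", shouldWaitForRunning(true))

	// A test that wants to reach a still-initializing machine opts out.
	doNotWaitForRunning.Store(true)
	fmt.Println("testing, opted out:", shouldWaitForRunning(true))
	doNotWaitForRunning.Store(false) // restore for other tests

	fmt.Println("not testing:", shouldWaitForRunning(false))
}
```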

dgottlieb · 2025-01-03T18:05:52Z

robot/client/client.go

+				if status.Code(err) == codes.Unimplemented {
+					break
+				}
+				return nil, multierr.Combine(err, rc.conn.Close())


I'm going to recommend ignoring the error from rc.conn.Close() and just returning the err from rc.MachineStatus instead.

I've been undoing this pattern as I've come across it. Especially when correctness doesn't require Close to be called. See: the webrtc connection bugs where a connection succeeded, but we disconnected from the peer because SendDone[1] to the signaling server failed.

[1] I claim SendDone is analogous to Close.

Gotcha that sounds good to me; thanks for pointing it out.

dgottlieb · 2025-01-03T18:08:08Z

robot/client/client_session_test.go

@@ -129,13 +132,11 @@ func TestClientSessionOptions(t *testing.T) {
 								Disable: true,
 							})))
 						}
-						roboClient, err := client.New(ctx, addr, logger, opts...)


client.New now calls MachineStatus in testing environments, which is not exempted from session creation.

I'm fine not touching this in the current PR. But are we currently aware of a reason why MachineStatus should not be exempt?

dgottlieb · 2025-01-03T18:14:26Z

web/server/entrypoint.go

+	minimalProcessedConfig.Processes = nil
+
+	// Start robot in an initializing state with minimal config.
+	robotOptions = append(robotOptions, robotimpl.WithInitializing())


What's the consequence of not having this line?

Do we go back to the old behavior of having to do an initial reconfigure before opening a server port?

Or do we still open a server port before configuring -- but we just lie to clients that the robot is in "MachineState.RUNNING"?

> Or do we still open a server port before configuring -- but we just lie to clients that the robot is in "MachineState.RUNNING"?

This one is the consequence; the robot would falsely report itself as running.

dgottlieb · 2025-01-03T18:20:39Z

robot/impl/local_robot.go

+	// been closed above. This ensures processes are shutdown before any files
+	// are deleted they are using.
+	//
+	// If initializing, machine will be starting with no modules, but may


I think I understand. Because we're literally trying to start the server with an empty config, this code will delete all the things unless we communicate/check the special "this isn't a real reconfigure step" initializing flag.

Maybe this is worth clarifying -- but when we were talking about "this feature should appear as if the robot was first configured as if it had no components, followed by a reconfigure with components": I can't tell if we went with implementing this that way because it was the best way to solve the problem? Or if we went down that path because the above statement was interpreted as "should literally be implemented as" a bit?

dgottlieb · 2025-01-03T18:25:48Z

web/server/entrypoint.go

+	// Once reconfigure with initial config is complete; set initializing to
+	// false. Robot is now fully running and can indicate this through the
+	// MachineStatus endpoint.
+	r.SetInitializing(false)


Because I see we had to expose a whole method for this on the LocalRobot API, I feel compelled to ask: would it be wrong to have localRobot.Reconfigure just set this value at the end?

newWithResources calls Reconfigure to construct the robot, so if we set initializing to false at the end of Reconfigure, then initializing would be incorrectly false before the Reconfigure above completes.

dgottlieb · 2025-01-03T20:10:03Z

@benjirewis and I talked offline about options regarding managing the initializing flag. Broadly, I'm fine with how the code is now, but @benjirewis was on board with taking a soft pass at some ideas for ensuring (in the future as developers continue to muck with the code) we never violate the invariant of having a configured robot, but also have initializing = true.

viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 10, 2024

benjirewis changed the title ~~RSDK-9440~~ RSDK-9440 Report machine state through GetMachineStatus Dec 10, 2024

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 10, 2024

benjirewis commented Dec 10, 2024

benjirewis requested review from dgottlieb and cheukt December 10, 2024 21:14

cheukt reviewed Dec 10, 2024

benjirewis mentioned this pull request Dec 11, 2024

DATA-3441 Update data export command #4596

Merged

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 11, 2024

benjirewis force-pushed the machine-state branch from febcd0b to a88c3fa Compare December 11, 2024 19:25

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 11, 2024

benjirewis force-pushed the machine-state branch from 533ae80 to a4d585d Compare December 13, 2024 17:28

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 13, 2024

benjirewis commented Dec 13, 2024

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 13, 2024

dgottlieb reviewed Dec 16, 2024

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Dec 16, 2024

benjirewis added 16 commits January 3, 2025 11:22

move out of initializing in setupLocalRobot

4948b13

add log lines for debugging 32 bit tests

d756396

move running check higher

fcf03b5

potentially simpler API

09b76fd

more missing injections

35f8f3d

fix more tests sigh

e299ab4

maybe this time

19f7940

actually use value from options

d68e6f4

make configWatcher a method instead of an anonymous function

2d8bc1b

put back lints

f997bbb

try to simplify updateWeakDependents check in completeConfig

f97feef

avoid CopyOnlyPublicFields

72bada6

initializing -> running test

fc67364

move redefinition of context above slow shutdown goroutine

18555b3

remove debug logs

e46aa9d

table drive and fix server GetMachineStatus test

7f0fffd

benjirewis force-pushed the machine-state branch from e6eb903 to 7f0fffd Compare January 3, 2025 16:33

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025

better table driving ?

da02e54

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025

dgottlieb reviewed Jan 3, 2025

dan comments

2c591ad

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025

fix client test

d5e484f

viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 6, 2025
