-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RSDK-9593] move ota to config monitor, config monitor restart hook #369
[RSDK-9593] move ota to config monitor, config monitor restart hook #369
Conversation
@@ -51,7 +51,7 @@ use super::server::{IncomingConnectionManager, WebRtcConfiguration}; | |||
use crate::common::provisioning::server::AsNetwork; | |||
|
|||
#[cfg(feature = "ota")] | |||
use crate::common::{credentials_storage::OtaMetadataStorage, ota}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left this here for now instead of shifting it to config_monitor.rs
. I like that it's part of ViamServerStorage
, especially if we remove the flag entirely to make it a builtin
// TODO(RSDK-9464): flush logs to app.viam before restarting | ||
esp_idf_svc::hal::reset::restart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shifts the responsibility of rebooting to whoever is calling ota::update()
while let Err(ota_err) = ota.update().await { | ||
log::error!("failed to update firmware: {}", ota_err); | ||
log::info!("retrying firmware update"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wasn't sure if I should put a delay here at all since there is one internally in update
for connection establishment already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
additionally, if the error is due to specific errors ConfigError
, InvalidImage
, etc, I'm not sure what the behavior should be.
we can
a) continue applying the other configs (might be risky if the new configs are only compatible with new firmware) this means changing my current implementation to add an extra check when the task first starts.
b) continue to block on applying the current update and surface an error, if so, we need to check for new configs between failures.
c) add options down the road for users to decide on the behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a) is the right move here. Consumers of the pre-packaged micro-rdk-server can see why the configs incompatible with the old firmware failed to function appropriately because the version on app will be the old version. Others can look at the logs (which is admittedly not ideal, but the solution to that is to change/revamp version reporting as we have previously discussed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No logical issues, but I've suggested some shifting of code
self.executor.clone(), | ||
) { | ||
Ok(mut ota) => { | ||
let curr_metadata = ota.stored_metadata().await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the logic here could go back to the OTA service and this could be reduced to a function call on the service? You would no longer need the stored_metadata
function and the pending_version
would no longer need to be pub(crate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. because this would run every 10 seconds when the config gets checked, I wanted to control how often logging gets propogated, otherwise we's see version up to date
all the time.
I've since moved a single log to when viam server starts to just log the current firmware once, and report only the update target when attempting ota
while let Err(ota_err) = ota.update().await { | ||
log::error!("failed to update firmware: {}", ota_err); | ||
log::info!("retrying firmware update"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a) is the right move here. Consumers of the pre-packaged micro-rdk-server can see why the configs incompatible with the old firmware failed to function appropriately because the version on app will be the old version. Others can look at the logs (which is admittedly not ideal, but the solution to that is to change/revamp version reporting as we have previously discussed)
} | ||
} | ||
|
||
if reboot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks cleaner to use a flag instead of fiddling with feature gates
micro-rdk/src/common/conn/viam.rs
Outdated
#[cfg(feature = "ota")] | ||
{ | ||
if self.storage.has_ota_metadata() { | ||
let metadata = self.storage.get_ota_metadata().unwrap_or_default(); | ||
log::info!("firmware version: {}", metadata.version); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this here avoids complicating config monitor's ota stuff to get the right logging behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with optional suggestion
micro-rdk/src/common/ota.rs
Outdated
@@ -256,31 +269,22 @@ impl<S: OtaMetadataStorage> OtaService<S> { | |||
pending_version, | |||
max_size, | |||
address, | |||
needs_reboot: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to have update
return a boolean as to whether an update was actually performed rather than store state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, provided the one logical issue is resolved
let (mut sender, conn) = loop { | ||
num_tries += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because of where the increment is, this ends up only trying a maximum of twice rather than thrice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right. should go at the bottom of the loop. fixed
micro-rdk/src/common/conn/viam.rs
Outdated
{ | ||
if self.storage.has_ota_metadata() { | ||
let metadata = self.storage.get_ota_metadata().unwrap_or_default(); | ||
log::info!("firmware version: {}", metadata.version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will log the empty string if there is no OTA metadata in NVS. I'm not sure that's the best outcome. Can we be more explicit?
micro-rdk/src/common/conn/viam.rs
Outdated
#[cfg(feature = "esp32")] | ||
let hook = || crate::esp32::esp_idf_svc::hal::reset::restart(); | ||
#[cfg(not(feature = "esp32"))] | ||
let hook = || std::process::exit(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Before we just called
std::process::exit
, but this goes back to an older style where on esp32 we dohal::reset::restart
. What was the motivation for that change. I think I know, but I'd like to be sure. - If this is to effect the change I suspect, I think I'd want to see the same change in the
RestartMonitor
. - A block nearly identical to this exists in
grpc::robot_shutdown
. Can we DRY this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the motivation was to get rid of the coredump we see everytime on config change restarts since I'm already modifying the config_monitor's behavior.
I mention it in the description about partially addressing RSDK-9594, but just in config monitor.
I can look into grpc::robot_shutdown
, but that depends if you want that done across the repo in RSDK-9594 or just here. Which would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per offline discussion, changing this back to just use std::process::exit
. the right changes will be done in RSDK-9594
micro-rdk/src/common/ota.rs
Outdated
@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> { | |||
} | |||
|
|||
impl<S: OtaMetadataStorage> OtaService<S> { | |||
pub(crate) async fn stored_metadata(&self) -> OtaMetadata { | |||
if !self.storage.has_ota_metadata() { | |||
log::info!("no ota metadata currently stored in NVS"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: OTA
micro-rdk/src/common/ota.rs
Outdated
@@ -172,6 +170,17 @@ pub(crate) struct OtaService<S: OtaMetadataStorage> { | |||
} | |||
|
|||
impl<S: OtaMetadataStorage> OtaService<S> { | |||
pub(crate) async fn stored_metadata(&self) -> OtaMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function really async
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't, but shouldn't it be? The underlying calls are essentially I/O, getting data from flash storage. But this is maybe is more of a "should our storage APIs be async?" question.
I'll remove it here though.
micro-rdk/src/common/ota.rs
Outdated
.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e)) | ||
.unwrap_or_default() | ||
}; | ||
pub(crate) async fn needs_update(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function really async
?
micro-rdk/src/common/ota.rs
Outdated
@@ -337,6 +342,7 @@ impl<S: OtaMetadataStorage> OtaService<S> { | |||
CONN_RETRY_SECS | |||
); | |||
Timer::after(Duration::from_secs(CONN_RETRY_SECS)).await; | |||
num_tries += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is there a way to frame this loop without an explicitly managed counter?
- If you really need the counter, I'd recommend incrementing at or near the top so it can't somehow become skipped later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried exploring different paths for a day or two:
- wrapping the connection initialization logic into its own function
- getting the return type right (even with generics) was a rabbit hole of satisfying the compiler and enabling an unstable/nightly features that are used to define a lower-level trait (something with Allocator if I remember right)
- declaring the variables outside the loop
- same issue as above
- using
for i in 0..CONN_RETRY
- can't use
break
in afor-loop
, only inloop
- can't use
I think I tried a couple other things, (like inspect,map_err, ok_and, etc) but it seemed like too much time spend on trying to avoid the messiness and have the correct behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move the increment to the top of the loop and set the range accordingly
micro-rdk/src/common/ota.rs
Outdated
let (mut sender, conn) = loop { | ||
if num_tries == NUM_RETRY_CONN { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a necessary part of this change? It seems independent of moving the driver of OTA checks to be the ConfigMonitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't limit the number of connection attempts, it will block further config changes. So the change is to still have a limited number of short-lived retry attempts in case there's initial wonkiness with the connection.
If there is any general wonkiness with establishing the connection, there isn't any guarantee that returning immediately and depending on the next invocation of ConfigMonitor (in 10sec) will resolve it.
three one-second retries seemed like a decent middle ground that we can iterate on if anything.
micro-rdk/src/common/ota.rs
Outdated
self.storage | ||
.get_ota_metadata() | ||
.inspect_err(|e| log::warn!("failed to get ota metadata from nvs: {}", e)) | ||
.unwrap_or_default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return a default value rather than an error?
@@ -27,11 +30,14 @@ where | |||
pub fn new( | |||
curr_config: Box<RobotConfig>, | |||
storage: Storage, | |||
#[cfg(feature = "ota")] executor: Executor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inconsistent with how we usually do cfg
things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
if reboot { | ||
// TODO(RSDK-9464): flush logs to app.viam before restarting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ticket would be another reason to DRY the shutdown handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in, have a global shutdown function that handles log-flushing?
This reverts commit c7b1b2e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok to merge for testing but not really a fan of the approach here. Having the OTA logic coupled to the config monitor feels clunky and complicated. It may lead to premature cancellation of the OTA process if connection to app.viam.com is lost. It also make any future changes to config monitor harder
@@ -27,11 +30,14 @@ where | |||
pub fn new( | |||
curr_config: Box<RobotConfig>, | |||
storage: Storage, | |||
#[cfg(feature = "ota")] executor: Executor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not ideal to have function parameters be dependent on features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if either a) we decide to revert back to a separate ota task or b) if OTA becomes default and we remove the feagure-gate, we could remove this.
.iter() | ||
.find(|&service| service.model == *ota::OTA_MODEL_TRIPLET) | ||
{ | ||
match ota::OtaService::from_config( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We end up instantiating a new OtaService and doing a version check each time the ConfigMonitor triggers that's not super ideal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made a ticket, though may be a non-issue if we revert back to separate ota task after testing. RSDK-9676
self.executor.clone(), | ||
) { | ||
Ok(mut ota) => match ota.update().await { | ||
Ok(needs_reboot) => reboot = needs_reboot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC at reboot the config retrieved offline overwrites the stored config, is there a particular reason for the ota to not reboot immediately on successful download?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not a likely scenario, but if app.viam becomes unreachable during the download period, at least it would be operating off of the new config at reboot.
we can iterate on it
This PR:
uses a.while let Err()
loop onota.update()
to ensure retries, blocking other config changes from being appliedConfigMonitor
(every 10sec by default) to retry failed OTA attempsaddresses ticket RSDK-9594 in the config monitor, haven't checked if this change needs to be made in other parts of the codebase.