Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

behavior change, now I get timeouts on ping #175

Open
schaefer01 opened this issue Aug 29, 2023 · 19 comments
Open

behavior change, now I get timeouts on ping #175

schaefer01 opened this issue Aug 29, 2023 · 19 comments

Comments

@schaefer01
Copy link

Hi,
I've got a v7 running on a red featherwing, loaded with minimqtt from the v7 bundle around december 2022
and a v8 running on a black featherwing, loaded with minimqtt from the v8 bundle

The v7 with an earlier version of minimqtt runs w/o ping timeouts if I'm waiting and doing nothing

The v8 does this, after a short period of time:
File "code.py", line 1776, in
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1007, in loop
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1030, in _wait_for_msg
File "adafruit_minimqtt/adafruit_minimqtt.py", line 1126, in _sock_exact_recv
MMQTTException: Unable to receive 1 bytes within 60 seconds.

The problem I see is that the exception, in response to a ping should be handled in the ping,
which has its own timeout handling code, which is never reached but even if it were,
would also raise an unhandled exception if it did.
Or in the loop that contains the ping and return an error code for the app to receive and
figure out what to do.

questions:

  1. why does the earlier version of minimqtt not have this problem?
  2. in any case, why aren't ping timeout exceptions handled and rather than return an exception
    thatis, return an error code that can be handled as to whether the app wants to restart or not,
    and be accurate in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout
@schaefer01
Copy link
Author

I put loop() in a try except block.
ping() exceptions at a 1 per minute rate, which makes be believe that ping is broken but ...

I don't know if this is an mqtt problem or a circuitpy v8 problem because both are different between my rp running v7 and an v7 minimqtt, and my rp running v8 and the latest minimqtt

@schaefer01
Copy link
Author

more information, I instrumented the code, the loop is supposed to exit after one second but when a ping occurs, the loop holds its context until the ping timeout.

my application's main loop has time sensitive actions and shouldn't be delayed for one minute for a bad ping

entering client.loop()
Client exception: Unable to receive 1 bytes within 60 seconds.
exited client.loop()
entering client.loop()
client rcs = [208]
exited client.loop()
entering client.loop()
Client exception: Unable to receive 1 bytes within 60 seconds.
exited client.loop()
entering client.loop()
client rcs = [208]
exited client.loop()
entering client.loop()

@schaefer01
Copy link
Author

Got it, Its not ping, instrumented the source available after 12/5/2023

in loop() the call to receive generates an exception from if nothing is received but its ok for nothing to be received.

_wait_for_msg() ->_sock_exact_recv(1)

@schaefer01
Copy link
Author

in the same function, _sock_exact_recv(), the timeout isn't the tiimeout but the keep alive

read_timeout = self.keep_alive
...
if self.get_monotonic_time() - stamp > read_timeout:

@schaefer01
Copy link
Author

problem solved with hack inside _wait_for_msg

note: first I instrumented the code to discover the path, then I shortened the timeout for the first receive
which is ok for the first time in and then lengthened the timout if something was received
this is a hack which allows me to keep going but isn't the right thing to do for the general case

note: keep alive is the wrong value for this function, keep_alive is for ping
the is a self receive timeout that is set but never used

    else:  # socketpool, esp32spi
        #print("2: self._sock_exact_recv(),keep_alive=",self.keep_alive)
        self.keep_alive = 1
        try:
            res = self._sock_exact_recv(1)
        except OSError as error:
            if error.errno in (errno.ETIMEDOUT, errno.EAGAIN):
                # raised by a socket timeout if 0 bytes were present
                return None
            raise MMQTTException from error
        else:
           self.keep_alive = 60

@schaefer01
Copy link
Author

schaefer01 commented Dec 7, 2023

the above is not optimum because it causes ping to fail, now adding a parameter in _wait_for_msg to indicate whether a response is definitely expected or might or might not be expected and set the first wait accordingly

@schaefer01
Copy link
Author

Solved it. Contact me if you want to know what I did.
The lower limit of timeout in the call loop() to not drop messages is 2 seconds

@vladak
Copy link
Contributor

vladak commented Jan 20, 2024

2. in any case, why aren't ping timeout exceptions handled and rather than return an exception
   thatis, return an error code that can be handled as to whether the app wants to restart or not,
   and be accurate  in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout

Here's my understanding: The trouble is that once the receive loop (common to all MQTT message processing, not just PING) starts reading data into the buffer and hits the receive timeout (modulo #189) that pretty much results in the MQTT session to be useless, because if a partial read was already done, portion of the MQTT message was lost with the buffer (it is local variable) and if the client tried to recover, it would get the remaining portion of the message that would be interpreted as garbage w.r.t. its current state, hence the exception. Maybe it could special case for receive timeout with empty buffer however it should still give an indication (different exception perhaps) to the consumer that the receive timeout was exceeded; possibly not worth the trouble as it usually means that the connection at TCP level is dead anyway.

@schaefer01
Copy link
Author

My belief was that the ping timeout was colliding with the "normal" timeout.
There is also a timeout between the first element received and the second element received.
I disabled the ping because I wanted a short timeout for the equivalent of polling (if ping is wanted, it should be sent outside of the polling loop) and I then extended the timeout between the first and second received elements else the function would exit too soon. That is first element received but 2nd element timed out (when the function should have waited longer).

@vladak
Copy link
Contributor

vladak commented Jan 22, 2024

My belief was that the ping timeout was colliding with the "normal" timeout. There is also a timeout between the first element received and the second element received. I disabled the ping because I wanted a short timeout for the equivalent of polling (if ping is wanted, it should be sent outside of the polling loop) and I then extended the timeout between the first and second received elements else the function would exit too soon. That is first element received but 2nd element timed out (when the function should have waited longer).

Could you expand on the timeout collision ?

Anyhow, with the fix for #198 , the PINGREQ will be sent only if there has been no outgoing message traffic to the broker for the duration of the keep alive timeout or longer (assuming the loop() is called often enough - could be further improved by making the check in various functions).

@schaefer01
Copy link
Author

The "collision" is an assumption on my part.
I was seeing timeouts that did not make sense with respect to the test code I was running.
The test code was a call to client.loop() with a 2 second timeout, where I modified the code to pass the passed-in timeout to _wait_for_msg() to be used not on the first call to _sock_exact_recv() but used on the second call to _sock_exact_recv() inside _wait_for_msg()

I did not instrument/diagnose the path that generated the timeout but when I disabled ping the unwanted
timeouts went away.

My assumption is that time passed so that the ping timeout was generated (even though ping may have or may have not been sent). And I'm not sure the code is "smart enough" to not invoke a ping in the middle of a mqtt transaction.

@SRMAX
Copy link

SRMAX commented Jan 25, 2024

2. in any case, why aren't ping timeout exceptions handled and rather than return an exception
   thatis, return an error code that can be handled as to whether the app wants to restart or not,
   and be accurate  in that the problem is a ping timeout rather than an "expected 1 byte(s)" timeout

Here's my understanding: The trouble is that once the receive loop (common to all MQTT message processing, not just PING) starts reading data into the buffer and hits the receive timeout (modulo #189) that pretty much results in the MQTT session to be useless, because if a partial read was already done, portion of the MQTT message was lost with the buffer (it is local variable) and if the client tried to recover, it would get the remaining portion of the message that would be interpreted as garbage w.r.t. its current state, hence the exception. Maybe it could special case for receive timeout with empty buffer however it should still give an indication (different exception perhaps) to the consumer that the receive timeout was exceeded; possibly not worth the trouble as it usually means that the connection at TCP level is dead anyway.

I agree an error code should ALWAYS be returned so the issue can be handled by the program as to what happens next. Whether it is a restart or a loop to wait until connection can be established. This just makes for very unreliable programs. If you ever want CircuitPython to be taken serious them escalate issues like this one to get it correctly. This is only one of several error issues that are not handled properly.

@schaefer01
Copy link
Author

While important people might be listening on this thead, if you want CircuitPython to be taken seriously, you really need to:

  1. buckle down and do the work to implement the secure socket layer so it can be on the Internet so you aren't upsetting the professional and hardworking IT folks that support the host organization of developers.
  2. make the wiznet api look more like unix/linux, including make it easier to access/set the mac address rather than DE.AD.BE.EF, again to not upset the professional and hardworking IT folks that support the host organization of developers.
    and not crash (or put in a plan to rapidly fix the problem) when there's a new update to macosx
  3. stop telling developers when there's a problem that you don't want to fix, that circuitpython is for hobbyists

@vladak
Copy link
Contributor

vladak commented Jan 25, 2024

I did not instrument/diagnose the path that generated the timeout but when I disabled ping the unwanted timeouts went away.

My assumption is that time passed so that the ping timeout was generated (even though ping may have or may have not been sent). And I'm not sure the code is "smart enough" to not invoke a ping in the middle of a mqtt transaction.

Currently the need to send PINGREQ is based on whether more than keep alive time has passed since sending the last PINGREQ (which is wrong but let's continue) and is done as a first thing in the loop(). Now, ping() itself has a _wait_for_msg() cycle, very similar to the cycle in loop() - might be the cause of the behavior you were seeing if you did not pass the timeout all the way to the cycle inside ping(). The cycle in ping() gets interrupted only when PINGRESP was received or the keep alive timeout has passed (which is wrong) . Feel free to grab the code from my 2 PRs and see if it makes any changes. Not sure it makes sense to check the need to send PINGREQ outside the loop() because it seems the rest of public functions send some sort of MQTT message to the broker which should reset the need for PINGREQ.

Also, thinking of this, it seems to me that the timeout specified in loop() should be higher than the socket timeout because this is the granularity of _sock_exact_recv().

@schaefer01
Copy link
Author

schaefer01 commented Jan 25, 2024

I have it working/hacked good enough to get by until the next circuitypython official release.

I found for the rp204, that minimum timeouts less than 2 seconds led to timeouts on completing a nominally good message/hanshake, which is, for my application, 2 seconds is good enough.

MQTT is just a step in a long chain of code of things that have to get done for my project. Too many things to do, too little time.

@vladak
Copy link
Contributor

vladak commented Jan 26, 2024

I have it working/hacked good enough to get by until the next circuitypython official release.

I found for the rp204, that minimum timeouts less than 2 seconds led to timeouts on completing a nominally good message/hanshake, which is, for my application, 2 seconds is good enough.

Do you explicitly set socket_timeout and recv_timeout when initializing the MQTT object ?

MQTT is just a step in a long chain of code of things that have to get done for my project. Too many things to do, too little time.

I understand. That said, it reminds me of https://xkcd.com/2347/

@schaefer01
Copy link
Author

I made sure the timeout parameter at the loop() API made it down the functional call chain to where it needed to be applied. I may or may not have used the existing variable names, I likely did not, to prevent timeout dependency side-effects similar to what I saw with PING.
re: xkcd, this is true for the social structure as well, also known as dejure and defacto.
I'm pretty sure Jon Postel's death by heart attack was caused by the addition of lawyers to the running of the Internet. https://en.wikipedia.org/wiki/Jon_Postel

@kevin-tritz
Copy link
Contributor

kevin-tritz commented Mar 21, 2024

FYI, in get_monotonic_time(), monotonic_ns()/1000000000 doesn't seem to maintain precision (at least for the RP2040 which I tested on) any better than straight up monotonic, so short timeout values (10-100ms) will start to run into errors within hours or few days of uptime.

@vladak
Copy link
Contributor

vladak commented Mar 21, 2024

FYI, in get_monotonic_time(), monotonic_ns()/1000000000 doesn't seem to maintain precision (at least for the RP2040 which I tested on) any better than straight up monotonic, so short timeout values (10-100ms) will start to run into errors within hours or few days of uptime.

Is this because of the limited range that can be represented in the resulting float ? Could you create a new issue (this one is already overloaded), please ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants