Add support for XPUB/PUB option ZMQ_XPUB_NODROP to combat message loss #1070
Comments
I've created a set of unit tests to verify that this problem exists. When I send messages and the XPUB socket reaches its HWM, it does not appear to report that its buffer is full (in the call to TrySendMultipartMessage) until "some time after" the fact, which causes message loss. I am going to try to replicate this unit test in the NetMQ project to demonstrate that the problem exists and isn't caused by our (reasonably thin) wrappers around NetMQSocket. As a workaround we've now bumped the HWMs to 2_000_000.
Are the unit tests for NetMQ in working order? I did a fresh checkout but get a ton of failures when trying to run the tests. Looking at the code, some things also don't make sense. E.g. in TcpListener there is an Assumes.NotNull(m_handle) right before the code that assigns to this variable. Commenting out this Assumes call allows a bunch more tests to pass, but there are still many failing ones, and I don't really want to modify this rather complex codebase without a test suite that passes before any edits. Is this the end of the road for NetMQ or am I doing something wrong?
#1073 fixes that assertion. The tests still fail for me in VS, though they have been passing on Ubuntu CI. We just need someone to spend some time understanding the failures.
Ok, glad to hear it. I was just surprised to see so many failing tests considering that there haven't been that many commits to the project. I may take a second look and see if I can help get the tests working, as I'd really like to help squash the HWM bug.
The socket option ZMQ_XPUB_NODROP (69) is supposed to toggle the behavior of the socket when SENDHWM is reached. If 0/false (the default), messages are silently dropped. If 1/true, sending a message will instead return an error.
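To make the two behaviours concrete, here is a minimal toy model in Python. It is not libzmq or NetMQ code: the `BoundedPipe` class, its `nodrop` flag, and the `send_all` helper are hypothetical names used only for illustration. It assumes a single outbound queue with a send high-water mark, and shows how a full queue either drops silently (while still reporting success) or reports the failure to the caller.

```python
from collections import deque


class BoundedPipe:
    """Toy stand-in for an outbound pipe with a send high-water mark (hypothetical)."""

    def __init__(self, hwm, nodrop=False):
        self.hwm = hwm          # maximum number of queued messages
        self.nodrop = nodrop    # models the requested NODROP toggle
        self.queue = deque()
        self.dropped = 0        # messages lost without the sender noticing

    def try_send(self, msg):
        """Return False only when the failure is reported to the caller."""
        if len(self.queue) < self.hwm:
            self.queue.append(msg)
            return True
        if self.nodrop:
            # Full and nodrop set: report the failure so the caller can retry.
            return False
        # Full and nodrop unset: drop silently, yet still report success.
        self.dropped += 1
        return True


def send_all(pipe, count):
    """Send `count` messages; return how many sends were reported as failed."""
    return sum(0 if pipe.try_send(i) else 1 for i in range(count))


default_pipe = BoundedPipe(hwm=3)
print(send_all(default_pipe, 5), default_pipe.dropped)  # 0 2

nodrop_pipe = BoundedPipe(hwm=3, nodrop=True)
print(send_all(nodrop_pipe, 5), nodrop_pipe.dropped)    # 2 0
```

In the default case the sender sees no failures even though two messages are gone, which matches the symptom of sequence gaps appearing only on the subscriber side. With the flag set, the same two sends are reported as failed instead, so the caller can back off and retry.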
We're using TrySendMultipartMessage to send the message, and our logic assumes that this returns false if the message couldn't be sent, but this seems to not always be the case (or we wouldn't be seeing lost messages). I checked the NetMQ (v4, master) source code, but was not able to understand what actually happens in NetMQ when the socket is full.
The motivation for requesting this is that we experience occasional message loss (not a slow joiner, but while the system is running) between publishers and subscribers. All subscribers always lose the same number of messages, so I'm thinking it must be a publisher problem. We're verifying sequence numbers (SNs), both when we send a message and when we receive it. The publisher code does not report SN gaps, but subscribers do. We've bumped the SENDHWM to 100_000 from the default 1_000 and also increased the buffer, but still have the problem when there are message spikes.
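The sequence-number check described above can be sketched as follows. This is an illustrative sketch, not the code from our system: the `find_gaps` function is a hypothetical name, and it assumes sequence numbers are integers that increase by exactly one per message.

```python
def find_gaps(sequence_numbers):
    """Return (first_missing, last_missing) ranges in an increasing SN stream."""
    gaps = []
    previous = None
    for sn in sequence_numbers:
        # A jump of more than one means the SNs in between never arrived.
        if previous is not None and sn > previous + 1:
            gaps.append((previous + 1, sn - 1))
        previous = sn
    return gaps


sent = [1, 2, 3, 4, 5, 6, 7, 8]   # what the publisher handed to the socket
received = [1, 2, 3, 6, 7, 8]     # what a subscriber actually saw

print(find_gaps(sent))      # []
print(find_gaps(received))  # [(4, 5)]
```

Running the same check on both sides is what points at the publisher socket: the stream is gap-free when handed to the socket, yet every subscriber reports the same missing range.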
We're going to try to bump the HWM even more, but it seems like a hacky workaround instead of a solution.