feat: add WebSockets metrics #2649
Seems like the WebRTC Firefox test is failing: https://github.com/libp2p/js-libp2p/actions/runs/10256017997/job/28381208851?pr=2649#step:5:2369, which shouldn't be related to these changes at all.
The Firefox thing is documented here - #2642 - unrelated to this PR, as you say.
I think the various "start" events might be unnecessary? That is, we care about successes vs error/abort, the sum of which should be equal to the starts, so we can derive that metric if we need it but it doesn't seem particularly useful on its own? For the
A good starting point apart from that, I think.
Comments above
The main thing I was thinking here was that with https://github.com/libp2p/js-libp2p-amino-dht-bootstrapper, I ran into some issues where dials were completely failing and being closed early and I couldn't find them in the logs or events. If we don't have an event for first-touch on the receiving end, we don't have an easy way to confirm that the error occurred at the networking/docker/etc layer above js-libp2p. On the other hand, if we do have a start event and nothing else (totally possible if the OS drops that socket for whatever reason), we can investigate there further. E.g. a lot of starts and no other events for the transport could indicate instability at the host layer. If we have ZERO starts, but we know we tried to dial it, we know libp2p never got the connection.
These may not be enough of a reason to emit all these metrics all the time, but that's what was in my mind while creating this PR. I can remove the various "starts" if you feel like they're still unnecessary. Let me know!
I will add a
This sounds a bit weird. If our handler isn't being invoked but the connection is accepted by the server we should see the number of active handles increase. If the connection succeeds we should see the "success" metric increase, if it errors or times out we should see the "error" or "aborted" metrics increase. There shouldn't be any other scenarios? If connections are not making it to the app it's probably a misconfiguration of ports in the docker file? I'd say take the "start"s out for now, we can always add them later if they become necessary.
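The accounting argument above can be sketched as follows. This is a minimal illustration, assuming every started attempt ends in exactly one of success, error or abort (the assumption the earlier comment questions). The `EventCounts` shape, the `deriveStarts` helper and the numbers are hypothetical and not part of the PR.

```typescript
// Hypothetical snapshot of terminal-outcome counters; the values are made up.
interface EventCounts {
  success: number
  error: number
  abort: number
}

// If each attempt reaches exactly one terminal outcome, the number of starts
// is the sum of the outcomes and does not need its own counter.
function deriveStarts (counts: EventCounts): number {
  return counts.success + counts.error + counts.abort
}

console.log(deriveStarts({ success: 7, error: 2, abort: 1 })) // prints 10
```

If a socket can be dropped before any terminal event fires, the sum undercounts the real number of starts, which is the gap a dedicated "start" counter would expose.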
remove:
* upgrade_start
* socket_open_start
* maconn_open_start
* maconn_close_start

add:
* socket_open_error
@achingbrain starts removed in
NOTE: added
added metricPrefix to `socket-to-conn.ts` and set in `listener.ts`

remove:
* maconn_abort - would have resulted in duplicate of `maconn_close_abort`
* maconn_open_success - No logic actually happens during maconn creation, so a failure there should be a build/compile/run error and not a metric.

add: N/A

change:
* socket_close_success -> maconn_socket_close_success
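A minimal sketch of how a metric prefix combines with a computed key when incrementing a counter group. `InMemoryCounterGroup`, `recordClose` and the `listener_` prefix value are hypothetical stand-ins for illustration; only the key name `maconn_socket_close_success` and the `metrics?.increment({ ... : true })` call shape come from the PR.

```typescript
// Hypothetical in-memory counter group: `true` adds one, a number adds that amount.
class InMemoryCounterGroup {
  readonly values: Record<string, number> = {}

  increment (labels: Record<string, number | boolean>): void {
    for (const [key, value] of Object.entries(labels)) {
      const amount = typeof value === 'number' ? value : (value ? 1 : 0)
      this.values[key] = (this.values[key] ?? 0) + amount
    }
  }
}

// Optional chaining keeps the call a no-op when metrics are not configured.
function recordClose (metrics: InMemoryCounterGroup | undefined, metricPrefix: string): void {
  metrics?.increment({ [`${metricPrefix}maconn_socket_close_success`]: true })
}

const group = new InMemoryCounterGroup()
recordClose(group, 'listener_') // 'listener_' is an assumed prefix value
recordClose(undefined, 'listener_') // does nothing
console.log(group.values)
```

Passing the prefix in from the caller lets the same connection wrapper report under different key names depending on where it was created.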
self review
  await maConn.close().catch(err => {
    this.log.error('inbound connection failed to close after upgrade failed', err)
  })
})
metrics?.increment({ socket_open_success: true })
this could fire after the `.catch` handler. should remove
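The concern in this comment can be illustrated with a small sketch. It is not the PR code: `failingUpgrade`, `recordUnconditionally` and `recordOnlyOnSuccess` are hypothetical names, and the upgrade step is assumed to always fail so the difference is visible.

```typescript
type Counts = Record<string, number>

function increment (counts: Counts, key: string): void {
  counts[key] = (counts[key] ?? 0) + 1
}

// Hypothetical upgrade step that always rejects.
async function failingUpgrade (): Promise<void> {
  throw new Error('upgrade failed')
}

// The success counter sits after the .catch() handler, so it still runs
// once the rejection has been handled.
async function recordUnconditionally (counts: Counts): Promise<void> {
  await failingUpgrade().catch(() => { increment(counts, 'upgrade_error') })
  increment(counts, 'socket_open_success')
}

// The success counter only runs when the upgrade promise resolves.
async function recordOnlyOnSuccess (counts: Counts): Promise<void> {
  await failingUpgrade()
    .then(() => { increment(counts, 'socket_open_success') })
    .catch(() => { increment(counts, 'upgrade_error') })
}
```

With a failing upgrade, the first variant records both an error and a success, while the second records only the error.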
@@ -91,6 +100,7 @@ export function socketToMaConn (stream: DuplexWebSocket, remoteAddr: Multiaddr,
if (maConn.timeline.close == null) {
  maConn.timeline.close = Date.now()
}
metrics?.increment({ [`${metricPrefix}maconn_socket_close_success`]: true })
we may not want this one? but based on the comment on lines 97-99, it seems like it could be useful
@@ -57,6 +63,7 @@ export function socketToMaConn (stream: DuplexWebSocket, remoteAddr: Multiaddr,
const { host, port } = maConn.remoteAddr.toOptions()
log('timeout closing stream to %s:%s after %dms, destroying it manually',
  host, port, Date.now() - start)
metrics?.increment({ [`${metricPrefix}maconn_close_abort`]: true })
I don't think the `this.abort(err)`, that calls `stream.destroy`, will trigger this. I'm trying to avoid duplicate `abort` and `error` events but might have missed something
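One way to guard against double counting, sketched under the assumption that a connection should report at most one terminal outcome. `createOutcomeRecorder` and the `Outcome` type are hypothetical helpers, not something the PR adds.

```typescript
type Outcome = 'success' | 'error' | 'abort'

// Hypothetical helper: records only the first terminal outcome for a connection
// and ignores any later ones, returning whether the call was counted.
function createOutcomeRecorder (counts: Record<string, number>): (outcome: Outcome) => boolean {
  let recorded = false

  return (outcome: Outcome): boolean => {
    if (recorded) {
      return false
    }

    recorded = true
    counts[outcome] = (counts[outcome] ?? 0) + 1
    return true
  }
}

const counts: Record<string, number> = {}
const record = createOutcomeRecorder(counts)
record('abort') // counted
record('error') // ignored, an outcome was already recorded
console.log(counts)
```

A guard like this hides the second event entirely, so it trades visibility for cleaner totals.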
const conn = await options.upgrader.upgradeOutbound(maConn, options)
this.metrics?.dialerEvents.increment({ upgrade_success: true })
do we need an `upgrade_error` somewhere?
Sure, makes sense.
I'm currently in the process of updating this PR to much more closely resemble TCP metrics, including adding a status metric.
FYI, take a look at the callouts in the PR description and let me know if you want me to add any event handlers. It looks like this transport hasn't been changed much since the migration to TypeScript, so maybe it could use an overhaul, but I wanted to keep things limited to only adding metrics.
self review. single callout comment about the `WebSocketListenerStatusCode` enum
  events: CounterGroup
}

enum WebSocketListenerStatusCode {
Note: I removed the `paused` status that the TCP transport had because it didn't seem to make sense here.
Listener status was added to the TCP transport as it can shut itself down when under attack and restart later on. This feature isn't in the WebSocket transport so I think we don't need this right now.
We may add it later I guess but let's do that when we do that.
I think if we follow TCP as a model then no. We can always add more later.
We also listen for the
Seems fine.
For completeness probably, and they may be useful in answering questions like "Is the socket_close_success in socket-to-conn.ts called even after an abort event?" but I would be fine without them.
I've pushed a couple of commits, I'd like to get this in.
I removed the socket status as per #2649 (comment)
I also remove the
The idea is, we listen for events on the socket and log metrics when they occur (same as the TCP transport). This is a bit different to the initial PR, which logged metrics after calling methods on the socket.
That is, the socket is closed when the
If we don't rely solely on the events for logging metrics, we could end up masking problems caused by misusing the stream API for whatever reason.
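The event-driven approach described above can be sketched roughly like this. `FakeSocket` and `attachMetrics` are hypothetical stand-ins written for this example, and the event names `error` and `close` are assumed; the real transport listens on an actual socket and reports to a counter group.

```typescript
type Handler = () => void

// Hypothetical stand-in for a socket with a minimal on()/emit() API.
class FakeSocket {
  private readonly handlers = new Map<string, Handler[]>()

  on (event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? []
    list.push(handler)
    this.handlers.set(event, list)
  }

  emit (event: string): void {
    for (const handler of this.handlers.get(event) ?? []) {
      handler()
    }
  }

  // Calling close() only requests the close; the 'close' event reports that it happened.
  close (): void {
    this.emit('close')
  }
}

// Metrics are recorded from event handlers, never directly after a method call.
function attachMetrics (socket: FakeSocket, counts: Record<string, number>): void {
  for (const event of ['error', 'close']) {
    socket.on(event, () => {
      counts[event] = (counts[event] ?? 0) + 1
    })
  }
}
```

Because the counters only move when an event fires, a `close()` call that never produces a `close` event shows up as a missing count rather than being recorded as a success.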
Title
feat: add WebSockets metrics
Description
Creates two new metric counter groups for WebSockets: `libp2p_websockets_dialer_events_total` and `libp2p_websockets_listener_events_total`.
The following new metrics should be added:
This aligns pretty closely to TCP metrics now:
Fixes #1915
Notes & open questions
Callouts
* `timeout` metric because we don't have a socket timeout event already.
* `timeout` metric for websockets because we don't have a socket timeout event already.
* `end` metric `close_` because we don't have general `abort/close/end` handlers already.
* `inbound_to_connection` metric for websockets because `socketToMaConn` doesn't throw.
* ``events.increment({ [`${this.addr} error`]`` for websockets because there is no existing `socket.on('error'` handler.

TODO based on comments:
* 24341d1 (#2649)

Some questions:
* `server.ts` instead of in `listener.ts`?
* Is the `socket_close_success` in `socket-to-conn.ts` called even after an abort event? we may want to move it if so.
* `error` events? i.e. in `this.server.on('error', (err: Error) => {`
* `${listeningAddrDetails?.transport}_${listeningAddrDetails?.host}:${listeningAddrDetails?.port}` because WebSocket listener may be listening on `/tls/ws` or `/wss` or `/ws` -- Do we need `.transport`?

Change checklist