Lots of connections and sessions > Trouble with WebSocket

Hi all,

We have an issue that might be related to our own code, but we're not sure.
Since this afternoon we couldn't connect to our Janus server over wss:// anymore. At some point it connected to the socket after about 20 seconds, but a little while later it just timed out.
We rebooted the whole server, just to disconnect everyone and clean up all the rooms.

CPU and memory usage were low, but the number of sessions quickly went up to about 10,000 and we noticed a peak of 35,000 connections.

Is this a point where Janus would ‘break’? We’re trying to fix the root cause, but meanwhile I’d also like to know if there’s a way to see why the WebSocket transport isn’t working.
The logs didn’t show any actual errors, so everything looked fine in our server monitoring.

I would really like to:

  1. See what goes wrong and why (the WebSockets are dead right now)
  2. Get alerted when this happens (health-check monitoring; see the sketch after this list)
  3. Know whether we can improve our server with some recommended configuration settings
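
For item 2, here is a minimal sketch of a health-check probe (not an official Janus tool): it creates and then destroys a throwaway session over the Janus HTTP API, and alerts if the server doesn't answer in time. It assumes the HTTP transport is enabled on the default port 8088 with base path /janus; the endpoint and the alerting hook are placeholders to adapt to your setup.

```python
# Minimal Janus health check: create a throwaway session over the HTTP API
# and treat a timeout or error as "Janus is unresponsive".
# Assumes the HTTP transport listens on port 8088 with base path /janus
# (the defaults); adjust JANUS_URL for your deployment.
import json
import sys
import urllib.request
import uuid

JANUS_URL = "http://127.0.0.1:8088/janus"  # placeholder endpoint, adjust
TIMEOUT_S = 5  # anything slower than this is already a problem


def janus_request(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
        return json.loads(resp.read())


def check() -> bool:
    try:
        # "create" allocates a session; a healthy Janus answers "success"
        reply = janus_request(
            JANUS_URL, {"janus": "create", "transaction": uuid.uuid4().hex}
        )
        if reply.get("janus") != "success":
            return False
        session_id = reply["data"]["id"]
        # Clean up, so the health check itself doesn't leak sessions
        janus_request(
            f"{JANUS_URL}/{session_id}",
            {"janus": "destroy", "transaction": uuid.uuid4().hex},
        )
        return True
    except Exception:
        return False


if __name__ == "__main__":
    if not check():
        # Hook your alerting (Slack webhook, PagerDuty, etc.) in here
        print(f"ALERT: Janus did not answer within {TIMEOUT_S}s")
        sys.exit(1)
    print("OK")
```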

Thanks in advance!

By “websocket isn’t working”, do you mean it doesn’t connect at all (the TCP connection fails, or the WebSocket upgrade of the HTTP connection fails), or that it does connect but messages you send are never answered? Those would be two entirely different problems.

If it’s the former, I’m not sure whether libwebsockets has configurable limits on how many connections it can handle (I’d doubt it). Make sure you’re using a proxy like nginx/httpd/haproxy to terminate secure WebSockets, and only configure plain WebSockets in Janus to keep it lighter. Also try enabling the HTTP transport on the side, and possibly the Admin API interface on both: if that doesn’t work either when WebSockets stop, then it’s probably not a WS-only problem.
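
To make that "try different transports" test concrete, here's a sketch that pings the Admin API over HTTP. It assumes the Admin API is enabled on the default port 7088 with path /admin, and uses the admin_secret from the sample configuration as a placeholder. If the Admin API still answers "pong" while WebSockets hang, the problem is likely transport-specific; if it hangs too, that points at the core.

```python
# Probe the Admin API: a healthy Janus core answers "ping" with "pong".
# Assumes the default Admin API endpoint (7088, /admin); replace the
# secret with whatever admin_secret you configured.
import json
import urllib.request
import uuid

ADMIN_URL = "http://127.0.0.1:7088/admin"
ADMIN_SECRET = "janusoverlord"  # placeholder, use your own admin_secret


def post(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())


reply = post(ADMIN_URL, {
    "janus": "ping",
    "transaction": uuid.uuid4().hex,
    "admin_secret": ADMIN_SECRET,
})
print("Admin API:", reply.get("janus"))  # expect "pong" if the core is alive
```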

If it’s the latter, you may have hit a deadlock somewhere in the core. If that happened, any attempt to do something with a new session may try to access the same deadlocked mutex, and so wait forever and never answer. Doing the same test with different transports and the Admin API, as mentioned above, can help pin down where the problem really is. If the Admin API works, you can use it to enable lock debugging, and then check the logs when new sessions are attempted to see if they hang on a certain mutex. If the Admin API is stuck too, you may want to enable lock debugging in the configuration file at startup (which will make the log size explode, but should help capture which mutex is locking and where/when).
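
For reference, here's a sketch of enabling lock debugging at runtime through the Admin API (same endpoint/secret assumptions as above; double-check the request fields against the Admin API documentation for your Janus version). The startup alternative, if the Admin API is stuck too, is debug_locks = true in janus.jcfg.

```python
# Enable mutex/lock debugging at runtime through the Admin API, so the logs
# start reporting lock/unlock activity; then try to create a new session
# and see which mutex it hangs on. Same endpoint/secret assumptions as above.
import json
import urllib.request
import uuid

ADMIN_URL = "http://127.0.0.1:7088/admin"
ADMIN_SECRET = "janusoverlord"  # placeholder, use your own admin_secret

payload = {
    "janus": "set_locking_debug",
    "debug": True,  # set back to False once you've captured the hang
    "transaction": uuid.uuid4().hex,
    "admin_secret": ADMIN_SECRET,
}
req = urllib.request.Request(
    ADMIN_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=5) as resp:
    print(json.loads(resp.read()))
```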
