Janus stops responding to STUN binding requests for several seconds

We have a Janus instance running on a VM with a public IP. It runs in a Docker container with host networking. The RTP port range is open to the world, so we don’t use any STUN servers. The client browser receives candidates with the Janus public IP and a port from the range. A peer-reflexive candidate is then used to send data to the browser.
Everything works, but sometimes the connection drops (both oniceconnectionstatechange and onconnectionstatechange report disconnected). This happens irregularly, a few minutes or tens of minutes after the session starts, a few times a day. Often the connection is restored after a few seconds.
We ran tshark on the server, and it shows that Janus stops responding to STUN binding requests. The requests arrive approximately every second and Janus normally responds to them, but then it suddenly stops: a few more requests arrive, and Janus takes around 10 s to respond.
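A minimal sketch of how such a capture can be checked for slow or unanswered binding requests. It assumes the STUN packets have already been exported from the capture as (timestamp in seconds, transaction ID, is-response) tuples. The function name, the tuple layout and the 1 s threshold are illustrative assumptions, not part of tshark or Janus.

```python
def stun_response_delays(rows, threshold=1.0):
    """Return requests that were answered late or not at all.

    rows: iterable of (timestamp_seconds, transaction_id, is_response),
          sorted by timestamp (hypothetical export format).
    Result: list of (transaction_id, request_time, delay), where delay is
            None if no response with that transaction ID was seen.
    """
    pending = {}  # transaction_id -> time of the first request
    slow = []
    for ts, txid, is_response in rows:
        if not is_response:
            # Keep the first timestamp so retransmits do not hide the delay.
            pending.setdefault(txid, ts)
        elif txid in pending:
            sent = pending.pop(txid)
            delay = ts - sent
            if delay > threshold:
                slow.append((txid, sent, delay))
    # Whatever is still pending never got a response in this capture.
    for txid, sent in pending.items():
        slow.append((txid, sent, None))
    return slow
```

Matching on the transaction ID rather than on packet order keeps the result correct when the browser sends further requests while earlier ones are still unanswered.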
We also enabled the ICE debug flag (nice_debug) and analyzed the Janus logs. They contain the following two lines:

(janus:1): libnice-DEBUG: 15:21:52.155: Agent 0x7fcdac0127d0 : scheduling triggered check with socket=0x7fcdac01e600 and remote cand=0x7fcda8002040.
(janus:1): libnice-DEBUG: 15:21:57.469: Agent 0x7fcdac0127d0 : Found a matching pair 0x7fcdb0002370 (1:4241747488) (SUCCEEDED) ...

We looked at the libnice source code, and the second line should be emitted immediately after the first; there are no blocking calls between them. In the log, however, the two lines are more than 5 seconds apart. Our monitoring also shows that the server’s CPU and memory were not overloaded.
When the connection is restored, the Janus log contains a few “Discarding too old outgoing packet” lines.
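A rough sketch for finding such stalls across the whole log, assuming every libnice debug line has the same layout as the two lines quoted above. The regular expression, the function name and the 1 s threshold are our own assumptions.

```python
import re
from datetime import datetime

# Matches lines such as:
# (janus:1): libnice-DEBUG: 15:21:52.155: Agent 0x7fcdac0127d0 : message
LINE_RE = re.compile(
    r"libnice-DEBUG: (\d{2}:\d{2}:\d{2}\.\d{3}): Agent (0x[0-9a-f]+) : (.*)"
)

def find_gaps(lines, min_gap=1.0):
    """Return (agent, gap_seconds, previous_message, next_message) for every
    pair of consecutive lines of the same agent at least min_gap apart.
    Timestamps carry no date, so a log spanning midnight is not handled."""
    last = {}  # agent -> (timestamp, message) of its most recent line
    gaps = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a libnice agent line
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        agent, message = m.group(2), m.group(3)
        if agent in last:
            gap = (ts - last[agent][0]).total_seconds()
            if gap >= min_gap:
                gaps.append((agent, round(gap, 3), last[agent][1], message))
        last[agent] = (ts, message)
    return gaps
```

Run on the two lines above, it reports a gap of 5.314 s for agent 0x7fcdac0127d0.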

Our Janus version is 1.1.3 with libnice 0.1.18. The host OS is CentOS 8, running on a VMware VM. The Janus Docker image is based on ubuntu:20.04.

Excerpt from janus.jcfg:

media: {
  ipv6 = false
  no_media_timer = 15
  slowlink_threshold = 10
  rtp_port_range = "50000-51000"
}

nat: {
  nice_debug = true
  full_trickle = false
  ice_lite = false
  nat_1_1_mapping = "... public ip ..."
}