Deadlock in SIP plugin

Hi all,
We have several production systems running Janus version 0.15.2 on CentOS 7.
Periodically we catch a deadlock in the SIP plugin module.
This can happen anywhere from a few hours to a few weeks after startup.
When it happens, all existing janus_sip_session instances expire and close, but new ones cannot be created.

According to the core dump, one thread is blocked on janus_mutex_lock(&master->mutex) in janus_sip_destroy_session():

Thread 24 (Thread 0x7f4566ff5700 (LWP 18689)):
#0 0x00007f45b5613e29 in syscall () from /lib64/libc.so.6
#1 0x00007f45b6fabf42 in g_mutex_lock_slowpath () from /lib64/libglib-2.0.so.0
#2 0x00007f45b00f520c in janus_sip_destroy_session (handle=0x7f45a4005730, error=&lt;optimized out&gt;) at plugins/janus_sip.c:2460
#3 0x0000000000448d4b in janus_ice_outgoing_traffic_handle (handle=0x7f45a40079a0, pkt=&lt;optimized out&gt;) at ice.c:4896
#4 0x000000000044be84 in janus_ice_outgoing_traffic_dispatch (source=0x7f45a40080b0, callback=&lt;optimized out&gt;, user_data=&lt;optimized out&gt;) at ice.c:492
#5 0x00007f45b6fdc119 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#6 0x00007f45b6fdc478 in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#7 0x00007f45b6fdc74a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#8 0x000000000043d370 in janus_ice_handle_thread (data=0x7f45a40079a0) at ice.c:1316
#9 0x00007f45b70035b0 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#10 0x00007f45b58f0ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f45b5619b0d in clone () from /lib64/libc.so.6

The other threads are waiting on janus_mutex_lock(&sessions_mutex) in janus_sip_destroy_session():

Thread 23 (Thread 0x7f4580ff9700 (LWP 18685)):
#0 0x00007f45b5613e29 in syscall () from /lib64/libc.so.6
#1 0x00007f45b6fabf42 in g_mutex_lock_slowpath () from /lib64/libglib-2.0.so.0
#2 0x00007f45b00f4eec in janus_sip_destroy_session (handle=0x7f45a4005790, error=0x7f4580ff88a0) at plugins/janus_sip.c:2425
#3 0x0000000000448d4b in janus_ice_outgoing_traffic_handle (handle=0x7f45a4003e50, pkt=&lt;optimized out&gt;) at ice.c:4896
#4 0x000000000044be84 in janus_ice_outgoing_traffic_dispatch (source=0x7f45a4005860, callback=&lt;optimized out&gt;, user_data=&lt;optimized out&gt;) at ice.c:492
#5 0x00007f45b6fdc119 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#6 0x00007f45b6fdc478 in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#7 0x00007f45b6fdc74a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#8 0x000000000043d370 in janus_ice_handle_thread (data=0x7f45a4003e50) at ice.c:1316
#9 0x00007f45b70035b0 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#10 0x00007f45b58f0ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f45b5619b0d in clone () from /lib64/libc.so.6
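
Putting the two traces together, the picture seems to be that thread 24 already holds sessions_mutex (taken around janus_sip.c:2425) and is stuck trying to lock master->mutex at janus_sip.c:2460, so every other handle that enters janus_sip_destroy_session() queues up behind sessions_mutex. A simplified sketch of that flow, using GMutex directly since that is what our build maps janus_mutex to (this is not the actual janus_sip.c code, just the suspected lock ordering):

#include <glib.h>

/* Illustrative only: the suspected flow inside janus_sip_destroy_session() */
typedef struct sip_session {
    struct sip_session *master;   /* set on helper sessions, points at the master */
    GMutex mutex;
} sip_session;

static GMutex sessions_mutex;

static void destroy_session(sip_session *session) {
    g_mutex_lock(&sessions_mutex);               /* ~janus_sip.c:2425 — Thread 23 queues here */
    /* ... mark the session as destroyed, remove it from the global table ... */
    if (session->master != NULL) {
        /* If the master was already freed without this pointer being cleared,
         * the lock below operates on freed memory and may never return,
         * while sessions_mutex is still held: every later destroy blocks too. */
        g_mutex_lock(&session->master->mutex);    /* ~janus_sip.c:2460 — Thread 24 stuck here */
        /* ... detach the helper from the master's helpers list ... */
        g_mutex_unlock(&session->master->mutex);
    }
    g_mutex_unlock(&sessions_mutex);
}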

The service does not crash. Only a restart helps.
Janus is compiled using GMutex (USE_PTHREAD_MUTEX is not defined).
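
For reference, the conditional that this define selects looks roughly like the following (illustrative sketch, not copied from mutex.h), which is why the blocked threads show up in g_mutex_lock_slowpath in the backtraces above:

#ifdef USE_PTHREAD_MUTEX
/* pthread-based implementation */
#include <pthread.h>
typedef pthread_mutex_t janus_mutex;
#define janus_mutex_lock(m)   pthread_mutex_lock(m)
#define janus_mutex_unlock(m) pthread_mutex_unlock(m)
#else
/* GLib-based implementation (what our build uses) */
#include <glib.h>
typedef GMutex janus_mutex;
#define janus_mutex_lock(m)   g_mutex_lock(m)
#define janus_mutex_unlock(m) g_mutex_unlock(m)
#endif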

According to GDB, the blocked thread #24 has an invalid session->master pointer in the helper session (possibly a use-after-free).
Q: Can there be a situation where the session->master pointer is not cleared in helper sessions after the master session is destroyed?
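
To make the question concrete, this is the kind of handling I would expect when a master session is torn down (hypothetical, simplified names, not the actual janus_sip.c code; in the real code this would also have to be coordinated with sessions_mutex or refcounts to avoid racing a concurrent destroy):

#include <glib.h>

typedef struct sip_session {
    struct sip_session *master;  /* helper -> master back-pointer */
    GList *helpers;              /* master -> list of helper sessions */
    GMutex mutex;
} sip_session;

/* Clear every helper's back-pointer before the master's memory is released,
 * so a later destroy on a helper never dereferences freed memory. */
static void master_teardown(sip_session *master) {
    g_mutex_lock(&master->mutex);
    for (GList *h = master->helpers; h != NULL; h = h->next) {
        sip_session *helper = (sip_session *)h->data;
        helper->master = NULL;   /* helper will now skip the master->mutex lock */
    }
    g_list_free(master->helpers);
    master->helpers = NULL;
    g_mutex_unlock(&master->mutex);
    /* only after this point would it be safe to free the master itself */
}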

If there are invalid pointers involved, you may want to compile with libasan support, just to check if we’re trying to access something after it was freed, and possibly why it happened (which may indicate some issues in refcount management in some scenarios).
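
For anyone who hasn’t used it, this is the kind of defect AddressSanitizer flags; a trivial standalone example (nothing to do with Janus), built with gcc -g -fsanitize=address, produces a heap-use-after-free report that shows the offending access plus the stacks where the block was allocated and freed:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof(int));
    *p = 42;
    free(p);
    printf("%d\n", *p);   /* invalid read: ASan reports heap-use-after-free here */
    return 0;
}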

In general, we do have a feature to debug locks, which basically adds a log line any time a lock is attempted and any time it’s released: that could help identify paths that could lead to a deadlock (e.g., by checking which lock was taken last but never released, and investigating from there). It also makes the logs MUCH more verbose, though, so it’s up to you to decide if you want to try that too.
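
Roughly, the idea behind that lock debugging is something like the following (hypothetical macros just to illustrate the log lines, not the actual Janus implementation):

#include <glib.h>
#include <stdio.h>

/* Wrap every lock/unlock so each attempt, acquisition and release is logged
 * with file/line; the last lock that was acquired but never released is then
 * the natural starting point for the investigation. */
#define DBG_MUTEX_LOCK(m) do { \
        printf("[lock] %s:%d trying   %s\n", __FILE__, __LINE__, #m); \
        g_mutex_lock(m); \
        printf("[lock] %s:%d acquired %s\n", __FILE__, __LINE__, #m); \
    } while (0)

#define DBG_MUTEX_UNLOCK(m) do { \
        g_mutex_unlock(m); \
        printf("[lock] %s:%d released %s\n", __FILE__, __LINE__, #m); \
    } while (0)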

Thank you for the response.
Unfortunately, I can’t reproduce this issue in my lab.
So far, I’m using a temporary workaround on 3 systems; no statistics yet.

Ack, please let us know if the workaround does the trick for you and we can consider pushing it upstream.