500 server error after 382 or 383 publishers

I’m stress testing our Janus deployment on AWS EC2. In the past few tests, about half failed after the 382nd publisher and the rest after the 383rd. On failure, the server returned 500 to new publishers, while the existing 382/383 publishers’ connections kept running fine. Janus’s log shows no errors. CPU, memory, and network were all well below their limits (~50% at the time of failure). I’ve done some troubleshooting (see below) but was unable to get past 383 publishers. Can anyone help?

Setup:

  • AWS EC2 c5n.xlarge (4 vCPU, 10.5G Memory, Up to 25 Gbps network bandwidth)
  • Each web (JavaScript) client joins its own unique VideoRoom and publishes 720p 30 FPS video to it.
  • Recording is enabled.
  • Janus version 1.1.3

I’ve tried:

  • Updated Janus to 1.1.3
  • Upgraded EC2 to c5n.9xlarge
  • Upgraded the EBS volume to the max IOPS and throughput allowed
  • In janus.jcfg, rtp_port_range = “20000-40000”
  • “ulimit -n” (open files) set to 1048576

I’ve also found some other settings to tweak but would like some help before running another large-scale test ($$), including:

  • Setting a higher/unlimited “ulimit -u” (max user processes)
  • Increasing rtp_port_range to “20000-60000”
  • Found that worker_connections in nginx.conf is set to 768, which is (coincidentally?) about twice the number of publishers I could reach. Maybe increase this?

Is this an issue with Janus (which I doubt), a hardware/EC2 limit, a configuration issue somewhere like nginx or Linux, or something else? Any help would be greatly appreciated!

You should check if the 500 comes from Janus or nginx. That said, I’d go for websockets rather than HTTP for signalling, since with HTTP you need multiple connections available, and one of them is always kept busy by long polls for each connected session.

Thanks, @lorenzo.

The 500 is from nginx. It happened when the backend server made contact with Janus to create the VideoRoom. So does that mean the issue is likely from nginx?

<html>\r\n<head><title>500 Internal Server Error</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>500 Internal Server Error</h1></center>\r\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n

As for websockets, I’m prioritizing websocket connections but keep REST as a backup. The web client initializes Janus as shown below. I can force it to use wss only to see if that makes a difference. Do you recommend using wss only and getting rid of http?

const janus = new Janus({
    destroyOnUnload: true,
    server: [socketUrl, restUrl],
    token: janusApiToken,
    success: () => {
        attachVideoRoomPlugin();
    },
    error: (error) => {
        // error
    },
    destroyed: () => {
        // destroyed
    }
})

I mean, nginx is sending you the 500 because it’s the frontend, but I was wondering if it was simply forwarding a 500 it got from Janus, or originating a 500 itself because, e.g., it couldn’t get to Janus for some reason.

In general we always recommend using WS because it’s just much more efficient (where WS is of course also proxied by nginx). If you’re using HTTP only as a fallback it should be fine, unless you’re getting the 500 from HTTP because you hit some limit with WS.
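A minimal sketch of how a WS-only test could be wired up, assuming the client keeps the same `socketUrl` and `restUrl` values used in the snippet above. The helper name `buildServerList`, the `wsOnly` flag, and the example URLs are hypothetical and not part of janus.js. The idea is simply that passing a single address as `server` leaves no HTTP fallback, while passing an array keeps it.

```javascript
// Hypothetical helper: decide which Janus API endpoints the client may use.
// A single string means WebSocket only (no HTTP fallback);
// an array lets janus.js try the addresses in order.
function buildServerList({ socketUrl, restUrl, wsOnly }) {
    if (wsOnly || !restUrl) {
        return socketUrl;
    }
    return [socketUrl, restUrl];
}

// Placeholder URLs for illustration only, not real endpoints.
const server = buildServerList({
    socketUrl: "wss://janus.example.com/ws",
    restUrl: "https://janus.example.com/janus",
    wsOnly: true,
});

// The result would then be passed as `server` to `new Janus({ ... })`.
console.log(server);
```

Running one load test with `wsOnly: true` and one with `wsOnly: false` may help tell whether the 500s only show up once clients fall back to HTTP.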

Turns out the error originates from nginx itself. Digging through the nginx log, I found “768 worker_connections are not enough while connecting to upstream”. Increasing worker_connections in nginx.conf fixed the issue.
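For anyone wondering why 768 lines up with roughly 383 publishers: nginx counts connections to proxied servers against worker_connections as well, not only client connections. The sketch below is a rough back-of-the-envelope estimate, assuming each proxied client holds two connections in a single worker (client to nginx, nginx to Janus). The function name and the `reserved` parameter are made up for illustration, and a real deployment with several worker processes may distribute connections differently.

```javascript
// Rough estimate of how many proxied clients one nginx worker can hold.
// Assumption: each proxied client uses two worker connections
// (client <-> nginx and nginx <-> upstream).
// `reserved` is a hypothetical allowance for connections used elsewhere.
function estimateProxiedClients(workerConnections, connectionsPerClient = 2, reserved = 0) {
    return Math.floor((workerConnections - reserved) / connectionsPerClient);
}

console.log(estimateProxiedClients(768));   // 384
console.log(estimateProxiedClients(4096));  // 2048
```

The observed ceiling of 382 or 383 sits just under the 384 estimate, which would be consistent with a few connections being taken up by something other than publishers, though that part is a guess.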
