Using AudioBridge's RTP + ffmpeg + OpenAI Whisper to achieve real-time transcription

Hello community,

We are using the AudioBridge and Streaming plugins for audio broadcasting. We now want to add real-time transcription to this setup.

The idea is to forward RTP from AudioBridge to a specific IP:port and process it there. I haven’t found any solution that can ingest RTP and transcribe it directly. If you know of such a pipeline, please share it.

One way is to split the audio into chunks of, say, 10 seconds and transcribe each chunk with OpenAI Whisper. So the idea is to set up the following pipeline:

AudioBridge → rtp_forward → ffmpeg → file chunks → OpenAI Whisper → Transcription

If anyone in the community has done real-time transcription, please share your ideas and any examples you have.

We did, and it’s not rocket science. If you can’t find anything that handles RTP (doubtful), write your own receiver: it’s fairly simple.
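To illustrate what "write your own" could involve, here is a minimal sketch of a receiver that strips the fixed RTP header (RFC 3550) and returns the payload. The function names `parse_rtp` and `listen`, and the port 5002, are assumptions for illustration; this is not the code the poster above used. It does not reorder packets or decode Opus, which a real pipeline would still need to handle.

```python
import socket
import struct


def parse_rtp(packet):
    """Split one RTP packet into (header dict, payload bytes)."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    has_padding = bool(b0 & 0x20)
    has_extension = bool(b0 & 0x10)
    csrc_count = b0 & 0x0F
    # Skip the optional CSRC list (4 bytes per entry).
    offset = 12 + 4 * csrc_count
    if has_extension:
        if len(packet) < offset + 4:
            raise ValueError("truncated header extension")
        # Extension length is given in 32-bit words, excluding its 4-byte header.
        ext_words = struct.unpack("!H", packet[offset + 2:offset + 4])[0]
        offset += 4 + 4 * ext_words
    end = len(packet)
    if has_padding:
        # The last byte holds the number of padding bytes, itself included.
        end -= packet[-1]
    if offset > end:
        raise ValueError("malformed RTP packet")
    header = {
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
    return header, packet[offset:end]


def listen(port=5002):
    """Print header fields for every RTP packet arriving on the UDP port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        data, _addr = sock.recvfrom(2048)
        header, payload = parse_rtp(data)
        print(header["sequence"], header["payload_type"], len(payload))
```

Calling `listen()` blocks and prints one line per packet; the payload bytes would then go to an Opus decoder before any transcription step.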

Thanks, Lorenzo. I am trying to save the incoming RTP from AudioBridge to a file with ffmpeg, using an SDP file as input. I am running the following command, but it is not working:

root@ip-192-168-11-151:/home/ubuntu/hemraj# ffmpeg -loglevel debug -protocol_whitelist file,crypto,udp,rtp -i test3.sdp -acodec libopus out.ogg
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Splitting the commandline.
Reading option '-loglevel' ... matched as option 'loglevel' (set logging level) with argument 'debug'.
Reading option '-protocol_whitelist' ... matched as AVOption 'protocol_whitelist' with argument 'file,crypto,udp,rtp'.
Reading option '-i' ... matched as input url with argument 'test3.sdp'.
Reading option '-acodec' ... matched as option 'acodec' (force audio codec ('copy' to copy stream)) with argument 'libopus'.
Reading option 'out.ogg' ... matched as output url.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option loglevel (set logging level) with argument debug.
Successfully parsed a group of options.
Parsing a group of options: input url test3.sdp.
Successfully parsed a group of options.
Opening an input file: test3.sdp.
[NULL @ 0x563fe9b14740] Opening 'test3.sdp' for reading
[sdp @ 0x563fe9b14740] Format sdp probed with size=2048 and score=50
[sdp @ 0x563fe9b14740] audio codec set to: opus
[sdp @ 0x563fe9b14740] audio samplerate set to: 48000
[sdp @ 0x563fe9b14740] audio channels set to: 2
[udp @ 0x563fe9b1ce00] end receive buffer size reported is 131072
[udp @ 0x563fe9b1c900] end receive buffer size reported is 131072
[sdp @ 0x563fe9b14740] setting jitter buffer size to 500
[sdp @ 0x563fe9b14740] Before avformat_find_stream_info() pos: 140 bytes read:140 seeks:0 nb_streams:1
[sdp @ 0x563fe9b14740] After avformat_find_stream_info() pos: 140 bytes read:140 seeks:0 frames:0
Input #0, sdp, from 'test3.sdp':
  Duration: N/A, bitrate: N/A
    Stream #0:0, 0, 1/48000: Audio: opus, 48000 Hz, stereo, fltp
Successfully opened the file.
Parsing a group of options: output url out.ogg.
Applying option acodec (force audio codec ('copy' to copy stream)) with argument libopus.
Successfully parsed a group of options.
Opening an output file: out.ogg.
[file @ 0x563fe9b70c40] Setting default whitelist 'file,crypto'
Successfully opened the file.
Stream mapping:
  Stream #0:0 -> #0:0 (opus (native) -> opus (libopus))
Press [q] to stop, [?] for help
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
test3.sdp: Connection timed out
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
detected 2 logical cores
[graph_0_in_0_0 @ 0x563fe9b9bb40] Setting 'time_base' to value '1/48000'
[graph_0_in_0_0 @ 0x563fe9b9bb40] Setting 'sample_rate' to value '48000'
[graph_0_in_0_0 @ 0x563fe9b9bb40] Setting 'sample_fmt' to value 'fltp'
[graph_0_in_0_0 @ 0x563fe9b9bb40] Setting 'channel_layout' to value '0x3'
[graph_0_in_0_0 @ 0x563fe9b9bb40] tb:1/48000 samplefmt:fltp samplerate:48000 chlayout:0x3
[format_out_0_0 @ 0x563fe9b9b980] Setting 'sample_fmts' to value 's16|flt'
[format_out_0_0 @ 0x563fe9b9b980] Setting 'sample_rates' to value '48000|24000|16000|12000|8000'
[format_out_0_0 @ 0x563fe9b9b980] auto-inserting filter 'auto_resampler_0' between the filter 'Parsed_anull_0' and the filter 'format_out_0_0'
[AVFilterGraph @ 0x563fe9b6db00] query_formats: 4 queried, 6 merged, 3 already done, 0 delayed
[auto_resampler_0 @ 0x563fe9b9ed00] picking flt out of 2 ref:fltp
[auto_resampler_0 @ 0x563fe9b9ed00] [SWR @ 0x563fe9b9f1c0] Using fltp internally between filters
[auto_resampler_0 @ 0x563fe9b9ed00] ch:2 chl:stereo fmt:fltp r:48000Hz -> ch:2 chl:stereo fmt:flt r:48000Hz
[libopus @ 0x563fe9b6f9c0] No bit rate set. Defaulting to 96000 bps.
Output #0, ogg, to 'out.ogg':
  Metadata:
    encoder         : Lavf58.29.100
    Stream #0:0, 0, 1/48000: Audio: opus (libopus), 48000 Hz, stereo, flt, delay 312, 96 kb/s
    Metadata:
      encoder         : Lavc58.54.100 libopus
[out_0_0 @ 0x563fe9b9cc40] EOF on sink link out_0_0:default.
No more output streams to write to, finishing.
size=       0kB time=00:00:00.00 bitrate=N/A speed=   0x    
video:0kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Input file #0 (test3.sdp):
  Input stream #0:0 (audio): 0 packets read (0 bytes); 0 frames decoded (0 samples); 
  Total: 0 packets (0 bytes) demuxed
Output file #0 (out.ogg):
  Output stream #0:0 (audio): 0 frames encoded (0 samples); 0 packets muxed (0 bytes); 
  Total: 0 packets (0 bytes) muxed
0 frames successfully decoded, 0 decoding errors
[AVIOContext @ 0x563fe9b70fc0] Statistics: 0 seeks, 2 writeouts
[AVIOContext @ 0x563fe9b1d540] Statistics: 140 bytes read, 0 seeks

Below is the SDP file that I use as input:

v=0
t=0 0
m=audio 5002 RTP/AVP 98
c=IN IP4 127.0.0.1
a=recvonly
a=rtpmap:98 opus/48000/2
a=fmtp:98 stereo=1; sprop-stereo=0; useinbandfec=1

I can confirm with tcpdump that I am receiving the stream on port 5002:

root@ip-192-168-11-151:/home/ubuntu/hemraj# tcpdump -nei ens5 udp port 5002
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes
05:58:04.785128 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 260: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 218
05:58:04.804801 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.825360 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.841332 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.861753 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.882107 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.904635 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.925390 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.945874 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.961279 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:04.981689 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.002160 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.022602 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.043032 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.063361 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.083887 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.105024 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.125397 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.145953 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.161258 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.183944 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.204458 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.225072 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.245433 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.265820 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15
05:58:05.281193 0e:97:8c:3c:75:5b > 0e:62:26:65:1f:e7, ethertype IPv4 (0x0800), length 57: 192.168.11.140.54821 > 192.168.11.151.5002: UDP, length 15

But it is not writing anything to the file. I am sure I am making some silly mistake. I would appreciate it if someone could help with this.

Never mind, I figured it out. I was setting one payload type in rtp_forward and a different one in the SDP, so ffmpeg never matched the incoming packets to the stream. I am now able to save the audio to files with 30-second rotation.
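A payload type mismatch like this can be checked before involving ffmpeg at all. The sketch below compares the payload type carried in a captured RTP packet with the ones listed on the SDP's `m=audio` line. The helper names (`sdp_payload_types`, `rtp_payload_type`, `payload_type_matches`) are hypothetical, and the packet bytes would have to come from your own capture or UDP socket.

```python
def sdp_payload_types(sdp_text):
    """Collect the payload types listed on every m=audio line of an SDP."""
    types = set()
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("m=audio"):
            # Format: m=<media> <port> <proto> <fmt> [<fmt> ...]
            fields = line.split()
            types.update(int(f) for f in fields[3:])
    return types


def rtp_payload_type(packet):
    """Read the 7-bit payload type from the second byte of an RTP packet."""
    if len(packet) < 2:
        raise ValueError("packet too short")
    return packet[1] & 0x7F


def payload_type_matches(sdp_text, packet):
    """Return True if the packet's payload type is declared in the SDP."""
    return rtp_payload_type(packet) in sdp_payload_types(sdp_text)


# The SDP from this thread declares payload type 98.
SDP = """v=0
t=0 0
m=audio 5002 RTP/AVP 98
c=IN IP4 127.0.0.1
a=recvonly
a=rtpmap:98 opus/48000/2
"""
```

If `payload_type_matches(SDP, packet)` returns False for packets arriving on the port, either the SDP or the payload type configured in rtp_forward needs to change so that the two agree.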

Below is the command:

ffmpeg -protocol_whitelist file,crypto,udp,rtp -acodec opus -i test3.sdp -acodec libopus -f segment -segment_atclocktime 1 -segment_time 30 -reset_timestamps 1 -strftime 1 out-%Y%m%dT%H%M%S.ogg
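For the last step of the pipeline (file chunks → Whisper), one possible approach is to poll the output directory and transcribe every segment except the newest one, which ffmpeg may still be writing. The sketch below assumes the `out-*.ogg` naming from the command above; `completed_segments` and `transcribe_new` are hypothetical helpers, and the `transcribe` argument is any function that takes a file path and returns text.

```python
from pathlib import Path


def completed_segments(directory, seen):
    """Return segment files that are finished and not yet processed."""
    # Timestamped names (out-YYYYmmddTHHMMSS.ogg) sort chronologically.
    files = sorted(Path(directory).glob("out-*.ogg"))
    # Skip the newest file: ffmpeg may still be writing to it.
    return [f for f in files[:-1] if f.name not in seen]


def transcribe_new(directory, seen, transcribe):
    """Run transcribe(path) on each new finished segment, oldest first."""
    results = []
    for path in completed_segments(directory, seen):
        results.append((path.name, transcribe(str(path))))
        seen.add(path.name)
    return results


# Example wiring, assuming the openai-whisper package is installed:
#
#   import time
#   import whisper
#   model = whisper.load_model("base")
#   seen = set()
#   while True:
#       for name, text in transcribe_new(
#               ".", seen, lambda p: model.transcribe(p)["text"]):
#           print(name, text)
#       time.sleep(5)
```

Keeping the transcription function as a parameter lets the same loop work with a local Whisper model or a hosted API. With 30-second segments the transcript will lag the audio by at least one segment length, so shorter segments reduce latency at the cost of less context per chunk.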