Today, everyone is talking about AI, or artificial intelligence, and how it will change the world over the next few decades. There are many use cases for it, including healthcare, transportation, security, and even task assistance.

AI is being applied to WebRTC and real-time communications in interesting ways. One use case is WebRTC error detection and forecasting, an approach used by callstats.io. Another, more obvious use case is object detection and segmentation for building things like Snapchat or Instagram filters.

In this blog post, we’ll go through the development process of a multiparty video chat with AI-powered background removal. I’ll talk about some options for achieving that and show you how to use image segmentation on the web and integrate TensorFlow.js into the Agora.io video chat demo.


First of all, I forked a working VanillaJS web app demo from the Agora GitHub repo here. After adding some small customizations, I was able to test it and play with it a bit. Below, I will go over the integration process. For steps on how to run or configure it further, you can follow the README.

There are several options for background removal:

  1. Using a single-color background and filtering it out, as in this Mozilla demo (see the sketch after this list). This would be the most common and cleanest approach.
  2. Using OpenCV for background subtraction.
  3. Using machine learning.
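For reference, option 1 boils down to a per-pixel color test on a canvas. Here is a simplified sketch; the element IDs and the green threshold are hypothetical, and a robust keyer would work in a better color space:

```javascript
// Chroma-key sketch: copy each video frame to a canvas and make every
// (near-)green pixel transparent. Assumes a <video id="video"> playing in
// front of a solid green background and a same-size <canvas id="canvas">.
const video = document.getElementById('video');
const canvas = document.getElementById('canvas');
const ctx = canvas.getContext('2d');

function chromaKeyFrame() {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
  for (let i = 0; i < frame.data.length; i += 4) {
    const [r, g, b] = [frame.data[i], frame.data[i + 1], frame.data[i + 2]];
    if (g > 100 && r < 100 && b < 100) frame.data[i + 3] = 0; // crude green test
  }
  ctx.putImageData(frame, 0, 0);
  requestAnimationFrame(chromaKeyFrame);
}

video.addEventListener('play', () => requestAnimationFrame(chromaKeyFrame));
```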

In this case, we chose to use machine learning with TensorFlow image segmentation.

Starting with TensorFlow background removal using semantic segmentation

Semantic segmentation is an ML technique that works by associating each pixel of an image with a class label, such as a person or a car. Since we want to focus on people, we will be segmenting people and hiding/covering the background. You can find more in-depth information on how this works here.

Although we could train our own model to detect people, we used an existing, best-performing semantic image segmentation model from Google: DeepLabv3.

Then we will convert it to TensorFlow.js by using:
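The exact invocation depends on your tensorflowjs_converter version. With a converter release that still supports frozen graphs, it looks roughly like this (the paths are placeholders, and SemanticPredictions is the DeepLab output node):

```bash
tensorflowjs_converter \
    --input_format=tf_frozen_model \
    --output_node_names='SemanticPredictions' \
    deeplabv3_mnv2_pascal_trainval/frozen_inference_graph.pb \
    ./web_model
```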

With the converted model, we will be able to play with it in JavaScript. We will need to declare the model with:
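A minimal sketch, assuming tf is available globally (or imported) and that the converted model.json is hosted at a placeholder URL:

```javascript
// Placeholder URL: point this at the model.json the converter produced.
const MODEL_URL = '/web_model/model.json';

let model;
async function loadModel() {
  // Loads the converted DeepLab graph model into the browser.
  model = await tf.loadGraphModel(MODEL_URL);
}
```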

Now we can use gUM (getUserMedia) to get the user's video and create a copy of it on a canvas. We will modify that canvas and mask its background.
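Roughly, the loop looks like the sketch below. The element IDs are hypothetical, the canvas is assumed to match the video's dimensions, ImageTensor and SemanticPredictions are the DeepLab node names (see Aleksandar's post, referenced below), and 15 is the "person" class in the PASCAL VOC label map:

```javascript
const video = document.getElementById('local-video');
const canvas = document.getElementById('masked-canvas');
const ctx = canvas.getContext('2d');

async function start() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = stream;
  await video.play();
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  requestAnimationFrame(renderFrame);
}

async function renderFrame() {
  // Segment the current frame: feed the ImageTensor input node and read
  // back the SemanticPredictions output node (one class label per pixel).
  const segmentation = tf.tidy(() =>
    model.execute(
      { ImageTensor: tf.browser.fromPixels(video).expandDims(0) },
      'SemanticPredictions'
    )
  );
  const labels = await segmentation.data();
  segmentation.dispose();

  // Draw the frame, then make every non-person pixel transparent.
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
  for (let i = 0; i < labels.length; i++) {
    if (labels[i] !== 15) frame.data[i * 4 + 3] = 0;
  }
  ctx.putImageData(frame, 0, 0);
  requestAnimationFrame(renderFrame);
}
```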

Example of a native WebRTC one to one communication with background removal

You can get the whole working code for the WebRTC p2p version here.

Special recognition to Aleksandar for his post, which is where I learned how to find the DeepLab model's input and output node names and got the code snippet for doing TensorFlow semantic segmentation in JavaScript.

Following up with the integration into the Agora multiparty video app

Now that we have a demo using native WebRTC, we just need to add this code to the Agora app. To do that, we have to make a few small changes.

First, on the front end, we need to include the TensorFlow.js dependency.
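In a VanillaJS app that can be as simple as a script tag (unversioned CDN URL shown for brevity; pin a version in practice):

```html
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
```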

The Agora SDK has a createStream function that will call gUM and pass the media along automatically. We need to modify the media before sending it out. To do this, we use:
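One way to do this, assuming a Web SDK version whose createStream accepts custom audioSource/videoSource tracks, is to capture the masked canvas and hand its track to the SDK instead of the camera (uid, canvas, and the enclosing async function come from the surrounding app code):

```javascript
// Keep the real microphone audio, but swap the camera video for the
// masked canvas captured at 30 fps.
const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
const canvasTrack = canvas.captureStream(30).getVideoTracks()[0];

const localStream = AgoraRTC.createStream({
  streamID: uid,
  audioSource: mic.getAudioTracks()[0],
  videoSource: canvasTrack,
});
```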

Finally, when the above actions are completed, we use:
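That is, we initialize the stream and publish it to the channel, callback-style (client is the AgoraRTC client created elsewhere in the demo):

```javascript
localStream.init(
  () => {
    localStream.play('local-video'); // render the local preview
    client.publish(localStream, (err) => {
      console.error('Failed to publish the stream', err);
    });
  },
  (err) => console.error('Failed to initialize the stream', err)
);
```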

Here’s an example of the end product:

Example using background removal in a one to one call on top of an Agora video call

Results

This proof of concept looks interesting. Being able to hide the background on the go, regardless of what it looks like, can be pretty useful.

From my tests using a Pixel and a Lenovo E550 with 8 GB of RAM, I saw that this is quite resource-intensive. It will heat up your device pretty quickly, and video calls will have additional latency and a reduced frame rate due to the delays in processing the video.

Ready to build something new?

At WebRTC.ventures, we can build customizable video applications with features such as recordings, transcriptions, image processing, and more. We have an experienced team ready and happy to help you out. Contact us today!

Please note: This is a republication of the original blog post published in May 2019.
