Setting up communication
Although the basis of WebRTC communication is peer-to-peer, the initial step of setting up this communication requires some sort of coordination. This is most commonly provided by a web server and/or a signaling server. It enables two or more WebRTC-capable devices, or peers, to find each other, exchange contact details, negotiate a session that defines how they will communicate, and then finally establish the direct peer-to-peer streams of media that flow between them.
The general flow
There is a wide range of scenarios, from single web page demos running on a single device to complex distributed multi-party conferencing with a combination of media relays and archiving services. To get started, we will focus on the most common flow, which covers two web browsers using WebRTC to set up a simple video call between them.
The following is a summary of this flow:
- Connect users
- Start signals
- Find candidates
- Negotiate media sessions
- Start RTCPeerConnection streams
Connect users
The very first step in this process is for the two users to connect in some way. The simplest option is for both users to visit the same website. This page can then identify each browser and connect both of them to a shared signaling server, using something like the WebSocket API. This type of web page often assigns a unique token that can be used to link the communication between these two browsers. You can think of this token as a room or conversation ID. In the http://apprtc.appspot.com demo described previously, the first user visits http://apprtc.appspot.com and is then provided with a unique URL that includes a new unique token. This first user then sends this unique URL to the second user, and once they both have this page open at the same time, the first step is complete.
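The token idea can be sketched in a few lines. This is only an illustration, not how the apprtc demo works: the page address, the room query parameter, and both function names are assumptions made for the example.

```javascript
// Generate a short random token to act as a room or conversation ID.
// Math.random() is fine for a demo, but it is not unguessable; a real
// service would use a cryptographically strong source instead.
function randomToken(length = 12) {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  let token = "";
  for (let i = 0; i < length; i++) {
    token += alphabet[Math.floor(Math.random() * alphabet.length)];
  }
  return token;
}

// Reuse the token already present in the page URL, or create a new one.
// The "room" query parameter is a hypothetical name chosen for this sketch.
function roomUrlFor(pageUrl) {
  const url = new URL(pageUrl);
  let token = url.searchParams.get("room");
  if (!token) {
    token = randomToken();
    url.searchParams.set("room", token);
  }
  return { token, url: url.toString() };
}
```

The first user would call roomUrlFor() with the plain page address and share the returned URL. When the second user opens that URL, the same call returns the existing token, so both browsers end up in the same room.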
Start signals
Now that both users have a shared token, they can exchange signaling messages to negotiate the setup of their WebRTC connection. In this context, "signaling messages" are simply any form of communication that helps these two browsers establish and control their WebRTC communication. The WebRTC standards don't define exactly how this has to be done. This is a benefit, because it leaves this part of the process open for innovation and evolution. It is also a challenge, as this uncertainty often confuses developers who are new to real-time communication in general. The apprtc demo described previously uses a combination of XHR and the Google AppEngine Channel API (https://developers.google.com/appengine/docs/python/channel/overview). This could just as easily be any other approach, such as XHR polling, Server-Sent Events (http://www.html5rocks.com/en/tutorials/eventsource/basics/), WebSockets (http://www.html5rocks.com/en/tutorials/websockets/basics/), or any combination of these that you feel comfortable with.
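Because the standards leave the message format open, every application has to pick its own. One possible shape, shown here purely as an assumption, is a small JSON envelope that carries the room token, a message type, and a payload. The type names and both function names are invented for this sketch.

```javascript
// Hypothetical message types for this sketch; WebRTC does not define them.
const SIGNAL_TYPES = ["offer", "answer", "candidate", "bye"];

// Wrap a payload in a JSON envelope that any text-based transport can carry.
function encodeSignal(room, type, payload) {
  if (!SIGNAL_TYPES.includes(type)) {
    throw new Error("Unknown signal type: " + type);
  }
  return JSON.stringify({ room, type, payload });
}

// Parse and sanity-check a message received from the signaling channel.
function decodeSignal(text) {
  const message = JSON.parse(text);
  if (typeof message.room !== "string" || !SIGNAL_TYPES.includes(message.type)) {
    throw new Error("Malformed signaling message");
  }
  return message;
}
```

Keeping the envelope independent of the transport means the same messages can travel over XHR, Server-Sent Events, or WebSockets without change.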
Find candidates
The next step is for the two browsers to exchange information about their networks and how they can be contacted. This process is commonly described as "finding candidates", and at the end each browser should be mapped to a directly accessible network interface and port. Each browser is likely to be sitting behind a router that may be using Network Address Translation (NAT) to connect the local network to the internet. Their routers may also impose firewall restrictions that block certain ports and incoming connections. Finding a way to connect through these types of routers is commonly known as NAT Traversal (http://en.wikipedia.org/wiki/NAT_traversal), and is critical for establishing WebRTC communication. A common way to achieve this is to use a Session Traversal Utilities for NAT (STUN) server (http://en.wikipedia.org/wiki/Session_Traversal_Utilities_for_NAT), which simply helps to identify how you can be contacted from the public internet and then returns this information in a useful form. A number of providers offer public STUN servers. The apprtc demo previously described uses one provided by Google.
If the addresses discovered through the STUN server do not provide a way to reach your browser from the public internet, you are left with no other option than to fall back to a solution that relays your media, such as a Traversal Using Relays around NAT (TURN) server (http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT). This effectively takes you back to a non-peer-to-peer architecture, but in some cases, where you are inside a particularly strict private network, this may be your only option.
Within WebRTC, this whole process is usually bound into a single Interactive Connectivity Establishment (ICE) framework (http://en.wikipedia.org/wiki/Interactive_Connectivity_Establishment) that handles connecting to a STUN server and then falling back to a TURN server where required.
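In the browser, the STUN and TURN servers are handed to the ICE machinery through the configuration object passed to RTCPeerConnection. The following is a minimal sketch of that step. The helper names are invented for the example, and any hostnames or credentials you pass in are your own; none are taken from the apprtc demo.

```javascript
// Build the configuration object passed to the RTCPeerConnection constructor.
// stunUrls is a list such as ["stun:stun.example.org:3478"] (placeholder host).
// turn is optional: { url, username, credential } for a relay you operate.
function buildIceConfiguration(stunUrls, turn) {
  const iceServers = stunUrls.map((url) => ({ urls: url }));
  if (turn) {
    // TURN servers normally require credentials; STUN servers do not.
    iceServers.push({
      urls: turn.url,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

// Browser only: create the connection and pass each candidate that ICE
// discovers to sendCandidate, which should forward it over your signaling
// channel to the other peer.
function createPeerConnection(configuration, sendCandidate) {
  const pc = new RTCPeerConnection(configuration);
  pc.onicecandidate = (event) => {
    // A null candidate marks the end of candidate gathering.
    if (event.candidate) {
      sendCandidate(event.candidate.toJSON());
    }
  };
  return pc;
}
```

The receiving browser would pass each candidate it gets from the signaling channel to its own connection's addIceCandidate() method.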
Negotiate media sessions
Now that both browsers know how to talk to each other, they must also agree on the type and format of media (for example, audio and video) they will exchange, including codec, resolution, bitrate, and so on. This is usually negotiated using an offer/answer-based model, built upon the Session Description Protocol (SDP) (http://en.wikipedia.org/wiki/Session_Description_Protocol). The IETF has defined this as the JavaScript Session Establishment Protocol (JSEP); for more information, visit http://tools.ietf.org/html/draft-ietf-rtcweb-jsep-00.
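The offer/answer exchange can be sketched as three small functions. This assumes the promise-based RTCPeerConnection methods found in current browsers (older implementations used callbacks), and a sendSignal function of your own that delivers a message to the other peer. The function names are invented for this example.

```javascript
// Caller: create an offer, apply it locally, then send it to the other peer.
async function sendOffer(pc, sendSignal) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendSignal({ type: pc.localDescription.type, sdp: pc.localDescription.sdp });
}

// Callee: apply the received offer, create an answer, and send it back.
async function answerOffer(pc, offer, sendSignal) {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  sendSignal({ type: pc.localDescription.type, sdp: pc.localDescription.sdp });
}

// Caller: apply the received answer, which completes the negotiation.
async function acceptAnswer(pc, answer) {
  await pc.setRemoteDescription(answer);
}
```

Each description is a plain object with a type ("offer" or "answer") and an sdp string, so it can be serialized as JSON and carried by whichever signaling channel you have chosen.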
Start RTCPeerConnection streams
Once this has all been completed, the browsers can finally start streaming media to each other, either directly through their peer-to-peer connections or via any media relay gateway they have fallen back to using.
At this stage, the browsers can continue to use the same signaling server to exchange the messages that control this WebRTC session. They can also use a specific type of WebRTC data channel to do this directly with each other.
Using WebSockets
The WebSocket API makes it easy for web developers to utilize bidirectional communication within their web applications. You simply create a new connection using the var connection = new WebSocket(url); constructor, and then create your own functions to handle when messages and errors are received. Sending a message is as simple as using the connection.send(message); method.
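Put together, a signaling client built on these calls might look like the following sketch. The attachSignaling helper, the JSON envelope, and the URL in the usage comment are assumptions for the example; only the WebSocket constructor, the onmessage and onerror handlers, and send() come from the WebSocket API itself.

```javascript
// Wire handlers onto an open WebSocket-like connection and return a function
// for sending signaling messages tagged with the given room token.
function attachSignaling(connection, room, onSignal) {
  connection.onmessage = (event) => {
    const message = JSON.parse(event.data);
    // Ignore anything that is not addressed to our room.
    if (message.room === room) {
      onSignal(message);
    }
  };
  connection.onerror = (event) => {
    console.error("Signaling connection error", event);
  };
  return function sendSignal(type, payload) {
    connection.send(JSON.stringify({ room, type, payload }));
  };
}

// Browser usage (the URL and room token are placeholders):
// var connection = new WebSocket("wss://signaling.example.com");
// connection.onopen = function () {
//   var sendSignal = attachSignaling(connection, "room-42", console.log);
//   sendSignal("bye", {});
// };
```

Passing the connection in, rather than creating it inside the helper, keeps the message handling separate from the choice of server and makes it easy to exercise without a network.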
The key benefit here is that the messaging is truly bidirectional, fast, and lightweight. This means the WebSocket server can send messages directly to your browser whenever it wants, and you receive them as soon as they are sent. There are none of the delays or constant network traffic found in the XHR polling or long-polling models, which makes this ideal for the sort of offer/answer signaling dance that's required to set up WebRTC communication.
The WebSocket server can then use the unique room or conversation token, previously described, to work out which of the WebSocket clients each message should be relayed to. In this manner, a single WebSocket server can support a very large number of clients. And since connection setup happens rarely and the messages themselves tend to be small, the server resources required are modest.
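The relaying logic on the server can be very small. The following sketch keeps it independent of any particular WebSocket library: a client is assumed to be any object with a send(text) method, and the RoomRelay class and its method names are invented for this example.

```javascript
// Track which clients belong to which room and relay messages between them.
class RoomRelay {
  constructor() {
    this.rooms = new Map(); // room token -> Set of connected clients
  }

  join(room, client) {
    if (!this.rooms.has(room)) {
      this.rooms.set(room, new Set());
    }
    this.rooms.get(room).add(client);
  }

  leave(room, client) {
    const members = this.rooms.get(room);
    if (!members) return;
    members.delete(client);
    // Drop empty rooms so the map does not grow without bound.
    if (members.size === 0) {
      this.rooms.delete(room);
    }
  }

  // Forward text to everyone in the room except the sender.
  // Returns the number of clients the message was delivered to.
  relay(room, sender, text) {
    const members = this.rooms.get(room) || [];
    let delivered = 0;
    for (const client of members) {
      if (client !== sender) {
        client.send(text);
        delivered++;
      }
    }
    return delivered;
  }
}
```

A server would call join() when a connection announces its room token, relay() for every message it receives, and leave() when the connection closes.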
There are WebSocket libraries available in almost all major programming languages, and since Node.js is based on JavaScript, it has become a popular choice for this type of implementation. Libraries such as socket.io (http://socket.io/) provide a great example of just how easy this approach can really be.
Other signaling options
Any approach that allows browsers to send and receive messages via a server can be used for WebRTC signaling.
The simplest model is to use the XHR API to send messages and to poll the server periodically to collect any new messages. This can be easily implemented by any web developer without any additional tools. However, it has a number of drawbacks. It has a built-in delay based on the frequency of each polling cycle. It is also a waste of bandwidth, as the polling cycle is repeated even when no messages are ready to be sent or received. But if you're looking for a good old-fashioned solution, then this is the one.
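A polling loop along these lines might look like the following sketch. It assumes a hypothetical server endpoint that returns a JSON array of the messages queued since the last request; the function names and the URL in the usage comment are also invented for the example.

```javascript
// Browser only: fetch a URL with XHR and resolve with the parsed JSON body.
function xhrGetJson(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(JSON.parse(xhr.responseText));
      } else {
        reject(new Error("HTTP " + xhr.status));
      }
    };
    xhr.onerror = () => reject(new Error("Network error"));
    xhr.send();
  });
}

// Run one polling cycle: collect any waiting messages and hand them over.
async function pollOnce(getJson, url, onMessage) {
  const messages = await getJson(url);
  for (const message of messages) {
    onMessage(message);
  }
  return messages.length;
}

// Repeat the cycle every intervalMs milliseconds. The request is made even
// when nothing is waiting, which is the bandwidth cost described above.
// Returns a function that stops the polling.
function startPolling(getJson, url, onMessage, intervalMs) {
  const timer = setInterval(() => {
    pollOnce(getJson, url, onMessage).catch((error) => {
      console.error("Poll failed", error);
    });
  }, intervalMs);
  return () => clearInterval(timer);
}

// Browser usage (placeholder URL):
// var stop = startPolling(xhrGetJson, "/messages?room=room-42", console.log, 2000);
```

With a two-second interval, a message can wait up to two seconds on the server before the browser sees it, which illustrates the built-in delay of this model.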
A slightly more refined approach based on polling is called long-polling. In this model, if the server doesn't have any new messages yet, it holds the request open until it does, relying on the HTTP/1.1 keep-alive mechanisms to keep the network connection alive. When the server has some new information, it just sends it down the wire to complete the request. In this case, the network overhead of the polling is reduced. But it is still an outdated and inefficient approach compared to more modern solutions such as WebSockets.
Server-Sent Events are another option. You establish a connection to the server using the var source = new EventSource(url); constructor, and then add listeners to that source object to handle messages sent by the server. This allows servers to send you messages directly, and you receive them as soon as they happen. But you are still left using a separate channel, such as XHR, to send your messages to the server, which means you are forced to manage and synchronize two separate channels. This combination does provide a useful solution that has been used in a number of WebRTC demonstration apps, but it does not have the same elegance as a truly bidirectional channel, such as WebSockets.
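The two-channel arrangement can be sketched as follows. The attachSseSignaling helper and the URLs in the usage comment are assumptions for the example; the source object is expected to behave like an EventSource, and post is any function of your own that sends text to the server.

```javascript
// Combine a receive-only event source with a separate send function so the
// rest of the application sees something close to one bidirectional channel.
function attachSseSignaling(source, post, onSignal) {
  // Receive channel: the server pushes messages down the event stream.
  source.onmessage = (event) => {
    onSignal(JSON.parse(event.data));
  };
  // Send channel: each outgoing message is a separate request to the server.
  return function sendSignal(message) {
    post(JSON.stringify(message));
  };
}

// Browser usage (placeholder URLs):
// var source = new EventSource("/events?room=room-42");
// var post = function (body) {
//   var xhr = new XMLHttpRequest();
//   xhr.open("POST", "/signal?room=room-42");
//   xhr.setRequestHeader("Content-Type", "application/json");
//   xhr.send(body);
// };
// var sendSignal = attachSseSignaling(source, post, console.log);
```

The server still has to match each posted message to the right event stream, which is the synchronization work that a single WebSocket connection avoids.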
There are all kinds of other creative ideas that could be used to facilitate the required signaling as well. But what we have covered are the most common options you will find being used.