1. Introduction

This document describes how the W3C Web Real-Time Communication (WebRTC) RTCPeerConnection interface [W3C.webrtc] is used to control the setup, management, and teardown of a multimedia session.

1.1 General Design of JSEP

WebRTC call setup has been designed to focus on controlling the media plane, leaving signaling-plane behavior up to the application as much as possible. The rationale is that different applications may prefer to use different protocols, such as the existing SIP call signaling protocol, or something custom to the particular application, perhaps for a novel use case. In this approach, the key information that needs to be exchanged is the multimedia session description, which specifies the transport and media configuration information necessary to establish the media plane.

With these considerations in mind, this document describes the JavaScript Session Establishment Protocol (JSEP), which allows for full control of the signaling state machine from JavaScript. JSEP assumes a model in which a JavaScript application executes inside a runtime containing WebRTC APIs (the "JSEP implementation"). The JSEP implementation is almost entirely divorced from the core signaling flow, which is instead handled by the JavaScript making use of two interfaces: (1) passing in local and remote session descriptions and (2) interacting with the Interactive Connectivity Establishment (ICE) state machine [RFC8445]. The combination of the JSEP implementation and the JavaScript application is referred to throughout this document as a "JSEP endpoint".

In this document, the use of JSEP is described as if it always occurs between two JSEP endpoints. Note, though, that in many cases it will actually be between a JSEP endpoint and some kind of server, such as a gateway or Multipoint Control Unit (MCU). This distinction is invisible to the JSEP endpoint; it just follows the instructions it is given via the API.

JSEP's handling of session descriptions is straightforward. Whenever an offer/answer exchange is needed, the initiating side creates an offer by calling a createOffer API. The application then uses that offer to set up its local configuration via the setLocalDescription API. The offer is finally sent off to the remote side over the application's preferred signaling mechanism (e.g., WebSockets); upon receipt of that offer, the remote party installs it using the setRemoteDescription API.

To complete the offer/answer exchange, the remote party uses the createAnswer API to generate an appropriate answer, applies it using the setLocalDescription API, and sends the answer back to the initiator over the signaling channel. When the initiator gets that answer, it installs it using the setRemoteDescription API, and initial setup is complete. This process can be repeated for additional offer/answer exchanges.
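
The exchange described above can be sketched as follows. This is an illustrative sketch only: SessionDesc and JsepPeer are simplified stand-in types that cover just the four methods named above (they are not the W3C definitions), and the Send callback is a hypothetical placeholder for whatever signaling channel the application uses.

```typescript
// Simplified stand-in for a session description (assumption: only the
// "offer" and "answer" types are modeled here).
interface SessionDesc {
  type: "offer" | "answer";
  sdp: string;
}

// Simplified stand-in for the subset of the peer connection API used below.
interface JsepPeer {
  createOffer(): Promise<SessionDesc>;
  createAnswer(): Promise<SessionDesc>;
  setLocalDescription(desc: SessionDesc): Promise<void>;
  setRemoteDescription(desc: SessionDesc): Promise<void>;
}

// Hypothetical signaling transport; JSEP leaves this to the application.
type Send = (desc: SessionDesc) => Promise<void>;

// Initiator: create the offer, apply it locally, then send it.
async function startCall(pc: JsepPeer, send: Send): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  await send(offer);
}

// Remote party: install the offer, generate and apply an answer, send it back.
async function handleOffer(pc: JsepPeer, offer: SessionDesc, send: Send): Promise<void> {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  await send(answer);
}

// Initiator: install the received answer, completing the exchange.
async function handleAnswer(pc: JsepPeer, answer: SessionDesc): Promise<void> {
  await pc.setRemoteDescription(answer);
}
```

The same three functions could be reused for later offer/answer exchanges, since the sequence of calls is the same each time.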

Regarding ICE [RFC8445], JSEP decouples the ICE state machine from the overall signaling state machine. The ICE state machine must remain in the JSEP implementation because only the implementation has the necessary knowledge of candidates and other transport information. Performing this separation provides additional flexibility in protocols that decouple session descriptions from transport. For instance, in traditional SIP, each offer or answer is self-contained, including both the session descriptions and the transport information. However, [RFC8840] allows SIP to be used with Trickle ICE [RFC8838], in which the session description can be sent immediately and the transport information can be sent when available. Sending transport information separately can allow for faster ICE and DTLS startup, since ICE checks can start as soon as any transport information is available rather than waiting for all of it. JSEP's decoupling of the ICE and signaling state machines allows it to accommodate either model.
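
A minimal sketch of the trickle model, in which transport information is forwarded as it becomes available rather than bundled into the session description. The onicecandidate handler and the addIceCandidate method follow the names used by the W3C API, but IceCandidateInfo and TricklePeer are simplified stand-in types, and the sendCandidate callback is a hypothetical placeholder for the application's signaling channel.

```typescript
// Simplified stand-in for an ICE candidate (assumption: only two fields).
interface IceCandidateInfo {
  candidate: string;
  sdpMid: string | null;
}

// Simplified stand-in for the ICE-related part of the peer connection API.
interface TricklePeer {
  onicecandidate: ((ev: { candidate: IceCandidateInfo | null }) => void) | null;
  addIceCandidate(candidate: IceCandidateInfo): Promise<void>;
}

// Local side: forward each candidate over signaling as soon as it is gathered.
// A null candidate indicates that gathering has finished; this sketch simply
// ignores it.
function trickleLocalCandidates(
  pc: TricklePeer,
  sendCandidate: (c: IceCandidateInfo) => void,
): void {
  pc.onicecandidate = (ev) => {
    if (ev.candidate !== null) {
      sendCandidate(ev.candidate);
    }
  };
}

// Remote side: hand each received candidate to the implementation, which owns
// the ICE state machine and can begin checks without waiting for the rest.
async function applyRemoteCandidate(pc: TricklePeer, c: IceCandidateInfo): Promise<void> {
  await pc.addIceCandidate(c);
}
```

An application using the self-contained model would instead wait until gathering finishes and send a single session description that already contains all candidates.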

Although it abstracts signaling, the JSEP approach requires that the application be aware of the signaling process. While the application does not need to understand the contents of session descriptions to set up a call, the application must call the right APIs at the right times, convert the session descriptions and ICE information into the defined messages of its chosen signaling protocol, and perform the reverse conversion on the messages it receives from the other side.
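
As an illustration of this conversion step, the sketch below maps session descriptions and ICE candidates onto a JSON message format and back. The Signal type, its field names, and the JSON encoding are all hypothetical choices made for this example; JSEP does not define a wire format, and an application using SIP or XMPP would substitute that protocol's own messages.

```typescript
// Hypothetical signaling messages for an application-defined protocol.
type Signal =
  | { kind: "description"; type: "offer" | "answer"; sdp: string }
  | { kind: "candidate"; candidate: string; sdpMid: string | null };

// Outbound conversion: API-level object to wire message.
function encodeSignal(signal: Signal): string {
  return JSON.stringify(signal);
}

// Inbound conversion: wire message back to an API-level object.
// Unknown or malformed messages are rejected rather than passed to the API.
function decodeSignal(wireText: string): Signal {
  const raw = JSON.parse(wireText) as unknown;
  if (typeof raw !== "object" || raw === null) {
    throw new Error("signaling message is not an object");
  }
  const fields = raw as Record<string, unknown>;
  const kind = fields["kind"];
  const descType = fields["type"];
  const sdp = fields["sdp"];
  const candidate = fields["candidate"];
  const sdpMid = fields["sdpMid"];

  if (kind === "description" && typeof sdp === "string") {
    if (descType === "offer") return { kind: "description", type: "offer", sdp };
    if (descType === "answer") return { kind: "description", type: "answer", sdp };
  }
  if (kind === "candidate" && typeof candidate === "string") {
    return { kind: "candidate", candidate, sdpMid: typeof sdpMid === "string" ? sdpMid : null };
  }
  throw new Error("unrecognized signaling message");
}
```

Note that the application treats the sdp string as opaque here: it only needs to carry it to the other side and pass it to the right API call.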

One way to make life easier for the application is to provide a JavaScript library that hides this complexity from the developer; said library would implement a given signaling protocol along with its state machine and serialization code, presenting a higher-level call-oriented interface to the application developer. For example, libraries exist to provide implementations of the SIP [RFC3261] and Extensible Messaging and Presence Protocol (XMPP) [RFC6120] signaling protocols atop the JSEP API. Thus, JSEP provides greater control for the experienced developer without forcing any additional complexity on the novice developer.

1.2 Other Approaches Considered

One approach that was considered instead of JSEP was to include a lightweight signaling protocol. Instead of providing session descriptions to the API, the API would produce and consume messages from this protocol. While this would have provided a higher-level API, it would have put more control of signaling within the JSEP implementation, forcing it to understand and handle concepts like signaling glare (see [RFC3264], Section 4).

A second approach that was considered but not chosen was to decouple the management of the media control objects from session descriptions, instead offering APIs that would control each component directly. This was rejected based on the argument that requiring exposure of this level of complexity to the application programmer would not be beneficial; it would (1) result in an API where even a simple example would require a significant amount of code to orchestrate all the needed interactions and (2) create a large API surface that would need to be agreed upon and documented. In addition, these API points could be called in any order, resulting in a more complex set of interactions with the media subsystem than the JSEP approach, which specifies how session descriptions are to be evaluated and applied.

One variation on JSEP that was considered was to keep the basic session-description-oriented API but to move the mechanism for generating offers and answers out of the JSEP implementation. Instead of providing createOffer/createAnswer methods within the implementation, this approach would expose a getCapabilities API, which would provide the application with the information it needed in order to generate its own session descriptions. This would increase the amount of work that the application needs to do: it would need to know how to generate session descriptions from capabilities and, especially, how to generate the correct answer from an arbitrary offer and the supported capabilities. While this could certainly be addressed by using a library like the one mentioned above, it would effectively force the use of such a library even for a simple example. Providing createOffer/createAnswer avoids this problem.

1.3 Contradiction regarding bundle-only "m=" sections

Since the approval of the WebRTC specification documents, the IETF has become aware of an inconsistency between the document specifying JSEP and the document specifying BUNDLE (this RFC and [RFC8843], respectively). Rather than delaying publication further to come to a resolution, the documents are being published as they were originally approved. The IETF intends to restart work on these technologies, and revised versions of these documents will be published as soon as a resolution becomes available.

The specific issue involves the handling of "m=" sections that are designated as bundle-only, as discussed in Section 4.1.1 of this RFC. Currently, there is divergence between JSEP and BUNDLE, as well as between these specifications and existing browser implementations:

  • JSEP prescribes that said "m=" sections should use port zero and add an "a=bundle-only" attribute in initial offers, but not in answers or subsequent offers.

  • BUNDLE prescribes that these "m=" sections should be marked as described in the previous point, but in all offers and answers.

  • Most current browsers do not mark any "m=" sections with port zero and instead use the same port for all bundled "m=" sections; some others follow the JSEP behavior.
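
To make the three behaviors above easier to compare, the following sketch reports, for each "m=" section in a session description, its port and whether it carries an "a=bundle-only" attribute. The function name, the result type, and the example SDP fragment are hypothetical and written for this illustration; the fragment is deliberately incomplete and is not taken from any implementation.

```typescript
interface MSectionInfo {
  media: string;       // e.g., "audio" or "video"
  port: number;        // port from the "m=" line
  bundleOnly: boolean; // true if the section contains "a=bundle-only"
}

// Scan the SDP line by line, starting a new record at each "m=" line and
// attributing any later "a=bundle-only" line to the most recent section.
function describeMSections(sdp: string): MSectionInfo[] {
  const sections: MSectionInfo[] = [];
  let current: MSectionInfo | undefined;
  for (const line of sdp.split(/\r?\n/)) {
    if (line.startsWith("m=")) {
      const fields = line.slice(2).split(" ");
      current = {
        media: fields[0] ?? "",
        port: Number(fields[1] ?? "NaN"),
        bundleOnly: false,
      };
      sections.push(current);
    } else if (line === "a=bundle-only" && current) {
      current.bundleOnly = true;
    }
  }
  return sections;
}

// Hypothetical fragment in which the video section is marked in the manner
// described in the first bullet: port zero plus "a=bundle-only".
const exampleOffer = [
  "v=0",
  "a=group:BUNDLE a1 v1",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=mid:a1",
  "m=video 0 UDP/TLS/RTP/SAVPF 96",
  "a=mid:v1",
  "a=bundle-only",
].join("\r\n");

const exampleSections = describeMSections(exampleOffer);
```

Under the browser behavior described in the last bullet, the same check would instead report identical nonzero ports for all bundled sections and no "a=bundle-only" attributes.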