3. Motivation

This section discusses the motivation and usage of the different video and media control messages. The video control messages have been under discussion for a long time, and a requirement document was drawn up [Basso]. That document has expired; however, we quote relevant sections of it to provide motivation and requirements.

3.1. Use Cases

There are a number of possible usages for the proposed feedback messages. Let us begin by looking through the use cases Basso et al. [Basso] proposed. Some of the use cases have been reformulated and comments have been added.

1. An RTP video mixer composes multiple encoded video sources into a single encoded video stream. Each time a video source is added, the RTP mixer needs to request a decoder refresh point from the video source, so as to start an uncorrupted prediction chain on the spatial area of the mixed picture occupied by the data from the new video source.

2. An RTP video mixer receives multiple encoded RTP video streams from conference participants, and dynamically selects one of the streams to be included in its output RTP stream. At the time of a bit stream change (determined through means such as voice activation or the user interface), the mixer requests a decoder refresh point from the remote source, in order to avoid using unrelated content as reference data for inter picture prediction. After requesting the decoder refresh point, the video mixer stops the delivery of the current RTP stream and monitors the RTP stream from the new source until it detects data belonging to the decoder refresh point. At that time, the RTP mixer starts forwarding the newly selected stream to the receiver(s).

3. An application needs to signal to the remote encoder that the desired trade-off between temporal and spatial resolution has changed. For example, one user may prefer a higher frame rate and a lower spatial quality, and another user may prefer the opposite. This choice is also highly content dependent. Many current video conferencing systems offer in the user interface a mechanism to make this selection, usually in the form of a slider. The mechanism is helpful in point-to-point, centralized multipoint and non-centralized multipoint uses.

4. Use case 4 of the Basso document applies only to Picture Loss Indication (PLI) as defined in AVPF [RFC4585] and is not reproduced here.

5. Use case 5 of the Basso document relates to a mechanism known as "freeze picture request". Sending freeze picture requests over a non-reliable forward RTCP channel has been identified as problematic. Therefore, no freeze picture request has been included in this memo, and the use case discussion is not reproduced here.

6. A video mixer dynamically selects one of the received video streams to be sent out to participants and tries to provide the highest bit rate possible to all participants, while minimizing stream trans-rating. One way of achieving this is to set up sessions with endpoints using the maximum bit rate accepted by each endpoint, and accepted by the call admission method used by the mixer. By means of commands that reduce the maximum media stream bit rate below what has been negotiated during session set up, the mixer can reduce the maximum bit rate sent by endpoints to the lowest of all the accepted bit rates. As the lowest accepted bit rate changes due to endpoints joining and leaving or due to network congestion, the mixer can adjust the limits at which endpoints can send their streams to match the new value. The mixer then requests a new maximum bit rate, which is equal to or less than the maximum bit rate negotiated at session setup for a specific media stream, and the remote endpoint can respond with the actual bit rate that it can support.
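
The stream-switching use case above (a mixer that selects one of several incoming streams) can be sketched as a small state machine. This is only an illustration under stated assumptions: the class name StreamSwitcher, its methods, and the is_refresh_point flag are hypothetical, and detecting a decoder refresh point in a real RTP stream is codec specific and not shown.

```python
from typing import List, Optional


class StreamSwitcher:
    """Hypothetical sketch of a mixer switching between incoming streams."""

    def __init__(self, active: str) -> None:
        self.active = active                  # source currently forwarded
        self.pending: Optional[str] = None    # source we are waiting on
        self.refresh_requests: List[str] = [] # sources asked for a refresh point

    def select(self, source: str) -> None:
        """Choose a new source and record a decoder refresh request for it."""
        if source == self.active and self.pending is None:
            return  # nothing to switch
        self.pending = source
        self.refresh_requests.append(source)

    def on_packet(self, source: str, is_refresh_point: bool) -> bool:
        """Return True if a packet from `source` should be forwarded."""
        if self.pending is not None:
            # Delivery of the old stream has stopped; watch the new one.
            if source == self.pending and is_refresh_point:
                self.active, self.pending = source, None
                return True
            return False
        return source == self.active
```

Forwarding resumes only once data belonging to the decoder refresh point arrives from the newly selected source, so receivers never decode inter-predicted data against unrelated reference pictures.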

The use cases drawn up by Basso et al. cover most applications we foresee. However, we would like to extend the list with two additional use cases:

7. Currently deployed congestion control algorithms (AIMD and TCP Friendly Rate Control (TFRC) [RFC3448]) probe for additional available capacity as long as there is something to send. With congestion control algorithms using packet loss as the indication for congestion, this probing generally results in reduced media quality (often to a point where the distortion is large enough to make the media unusable), due to packet loss and increased delay.

   In a number of deployment scenarios, especially cellular ones, the bottleneck link is often the last hop link. That cellular link also commonly has some type of QoS negotiation enabling the cellular device to learn the maximal bit rate available over this last hop. A media receiver behind this link can, in most (if not all) cases, calculate at least an upper bound for the bit rate available for each media stream it presently receives. How this is done is an implementation detail and not discussed herein. Indicating the maximum available bit rate to the transmitting party for the various media streams can be beneficial to prevent that party from probing for bandwidth for this stream in excess of a known hard limit. For cellular or other mobile devices, the known available bit rate for each stream (deduced from the link bit rate) can change quickly, due to handover to another transmission technology, QoS renegotiation due to congestion, etc. To enable minimal disruption of service, quick convergence is necessary, and therefore media path signaling is desirable.

8. Usually, in a point-to-point session, it is the media sender's responsibility to configure the media stream to stay within the limits of the available path bandwidth. However, in certain point-to-point video scenarios, it is advantageous to let the receiver restrict the maximum media bit rate even further. One example is the degradation of the rendering capability of the receiver (e.g., due to CPU resource shortage). In this case, the receiver may want to signal the sender to reduce the media bit rate to a level that can be handled. Another example is a receiver that wants to record the session. In this case, the receiver may want to limit the media rate to not exceed reliable write speeds to the storage device.

3.2. Using the Media Path

To support the use cases above, one can make use of the signaling channel (e.g., SIP) and re-negotiate the definition of the media streams. However, this has several disadvantages.

For one, re-negotiation of parameters via the signaling channel can be slow. In some control protocols (like H.323), the phases of breaking down an existing channel and setting up a new one are distinct, and a "gap" in the media playout can occur.

Second, the control channel makes use of different protocols (often TCP) than the media path (often UDP/RTP). In many topologies, the "path" of the signaling channel is distinct from the path of the media. If a middlebox such as a NAT or firewall is present, the control channel may not be able to react to the changes in the media path characteristics, or may not even be aware of the media path.

Third, using the signaling channel to re-negotiate the media parameters is often heavyweight.

Accordingly, it is advantageous to perform the control of the media in the media path, and in a way that is light-weight, and thus fast.

3.3. Using AVPF

For the feedback messages, we use the AVPF profile [RFC4585]. (See [RFC4585] for the rationale behind using RTCP for feedback messages.) AVPF provides valid RTCP packet types and modes of operation to transmit the feedback messages.

3.3.1. Reliability

AVPF does not provide built-in reliability. Acknowledgement packets, retransmission, and other reliability mechanisms are difficult to implement and use with multicast.

For the messages defined in this document, we have chosen a design that relies on the sender of the feedback message to monitor the stream it is receiving. If the feedback message was lost, or the media sender has not yet acted upon it, the sender of the feedback message (e.g., the media receiver) will not see the expected reaction in the form of a modified RTP stream. In that case, the feedback message sender should re-send the message. The interval for such retransmissions should respect the AVPF timing rules. Not every message needs this treatment: notifications simply do not require reliability, and for the remaining messages, reliability is achieved by repeating the message.
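
The monitor-and-repeat design described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function send_until_reaction and its callbacks send, reaction_seen, and wait are hypothetical names, and wait merely stands in for whatever interval the AVPF timing rules permit.

```python
from typing import Callable


def send_until_reaction(
    send: Callable[[], None],
    reaction_seen: Callable[[], bool],
    wait: Callable[[], None],
    max_attempts: int = 5,
) -> int:
    """Send a feedback message and repeat it until the received stream reacts.

    Returns the number of transmissions that were needed. Raises
    TimeoutError if no reaction is observed after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        wait()  # assumption: blocks for an AVPF-compliant interval
        if reaction_seen():
            return attempt
    raise TimeoutError("no reaction observed in the received stream")
```

The feedback sender needs no acknowledgement packets for this to work: the modified RTP stream itself serves as the confirmation.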

3.4. Multicast

The feedback messages defined in this document are primarily intended for point-to-point and centralized multipoint scenarios. Their use in non-centralized multicast scenarios is not prohibited, but the effects of the messages in such scenarios must be well understood.

In multicast, the media sender is sending the same bitstream to all receivers. If one receiver sends a request for a lower bit rate, or for an intra picture, satisfying this request affects all other receivers in the session. This may not be desirable.

In addition, AVPF requires that all RTCP packets in a multicast session are sent to all participants. This means that a feedback message sent by one receiver is seen by all other receivers. This may cause a flood of feedback messages if many receivers all send the same message at the same time. The AVPF timing rules are designed to prevent such floods, but they are not perfect.

3.5. Feedback Messages

This section provides a high-level description of the feedback messages defined in this specification. The formal definition of the messages is in section 4.

3.5.1. Full Intra Request Command

A Full Intra Request (FIR) command indicates to the media sender that it needs to send a decoder refresh point (for video, an Intra picture) as soon as possible. Use cases 1 and 2 in section 3.1 are the main drivers for this message.

The FIR message is similar to the Picture Loss Indication (PLI) message defined in [RFC4585]. However, there is a subtle difference. PLI is used when the receiver has lost some data and wants to restore the picture. The sender can choose to send an Intra picture, or it can use other means to restore the picture (e.g., if it knows what data the receiver has, it can send difference data). FIR, on the other hand, is a command that forces the sender to send a decoder refresh point. This is necessary when the receiver does not have any valid reference picture, for example, when switching streams in a mixer.

3.5.1.1. Reliability

The FIR message is a command. If it is lost, the video will stay corrupted (or blank). Therefore, the sender of the FIR message typically repeats the message until it sees a decoder refresh point in the received stream. The AVPF timing rules apply to these repetitions.

3.5.2. Temporal-Spatial Trade-off Request and Notification

The Temporal-Spatial Trade-off Request (TSTR) and Notification (TSTN) messages allow a media receiver to signal its preference for the trade-off between temporal resolution (frame rate) and spatial resolution (image quality). This addresses use case 3.

The trade-off is expressed as an integer value from 0 to 31, where 0 represents the highest spatial quality (and potentially lowest frame rate) and 31 represents the highest frame rate (and potentially lowest spatial quality).
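
How an encoder translates the index into concrete settings is left to the implementation. As one purely illustrative possibility, the sketch below maps the index linearly onto a frame rate range; the function name frame_rate_for_index and the min_fps and max_fps parameters are assumptions made for this example and are not defined by this document.

```python
def frame_rate_for_index(index: int, min_fps: float, max_fps: float) -> float:
    """Map a trade-off index (0..31) linearly onto [min_fps, max_fps].

    Index 0 favors spatial quality (lowest frame rate in this sketch);
    index 31 favors temporal resolution (highest frame rate).
    """
    if not 0 <= index <= 31:
        raise ValueError("trade-off index must be in the range 0..31")
    if min_fps > max_fps:
        raise ValueError("min_fps must not exceed max_fps")
    return min_fps + (max_fps - min_fps) * index / 31
```

A real encoder would likely combine such a mapping with its rate control, since the bits saved by lowering the frame rate are what make the higher spatial quality possible.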

The TSTR message is sent by a receiver to request a specific trade-off. The TSTN message is sent by the sender to notify the receivers of the current trade-off setting, or to acknowledge a TSTR.

3.5.2.1. Point-to-Point

In a point-to-point scenario, the receiver sends a TSTR. The sender is expected to adjust its encoding accordingly and to send a TSTN to confirm the setting it has chosen.

3.5.2.2. Point-to-Multipoint Using Multicast or Translators

In these scenarios, multiple receivers may send conflicting TSTR messages. The sender has to decide how to react. It could, for example, honor the request for the highest frame rate, or the highest spatial quality, or some average. The sender then sends a TSTN to inform all receivers of the actual setting.
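
The three example policies mentioned above can be sketched as follows. The function name resolve_tradeoff, the policy labels, and the use of SSRC-like integer keys are assumptions for illustration; the choice of policy remains entirely up to the sender.

```python
from typing import Mapping


def resolve_tradeoff(requests: Mapping[int, int], policy: str = "average") -> int:
    """Pick one trade-off index (0..31) from conflicting TSTR requests.

    `requests` maps a receiver identifier to its requested index.
    """
    if not requests:
        raise ValueError("no requests to resolve")
    values = list(requests.values())
    if any(not 0 <= v <= 31 for v in values):
        raise ValueError("trade-off index must be in the range 0..31")
    if policy == "highest_frame_rate":
        return max(values)  # 31 means highest frame rate
    if policy == "highest_spatial_quality":
        return min(values)  # 0 means highest spatial quality
    if policy == "average":
        return round(sum(values) / len(values))
    raise ValueError(f"unknown policy: {policy}")
```

Whatever value is chosen, it is the value the sender would report back in the TSTN, so that every receiver learns the setting actually in effect.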

3.5.2.3. Point-to-Multipoint Using RTP Mixer

A mixer can handle TSTR messages from multiple receivers and potentially generate different streams for different receivers, or aggregate the requests and send a single TSTR to the original sender.

3.5.2.4. Reliability

TSTR and TSTN are requests and notifications. If a TSTR is lost, the sender will not change the trade-off. The receiver can detect this (by not receiving a TSTN or seeing a change in the stream) and re-send the TSTR.

3.5.3. H.271 Video Back Channel Message

This message allows carrying ITU-T H.271 video back channel messages within RTCP. This is useful for established video codecs that use H.271 for feedback.

3.5.3.1. Reliability

Reliability for VBCM depends on the application. Some H.271 messages may require reliability, while others do not. The mechanism described in this document does not provide reliability at the RTCP level; it is up to the application to handle retransmissions if necessary.

3.5.4. Temporary Maximum Media Stream Bit Rate Request and Notification

The Temporary Maximum Media Stream Bit Rate Request (TMMBR, pronounced "timber") and Notification (TMMBN, pronounced "tim-ben") messages are used to control the bit rate of the media stream. This addresses use cases 6, 7, and 8.

TMMBR is a request from a receiver to a sender to limit the bit rate of the media stream to a certain value. TMMBN is a notification from the sender to the receiver(s) indicating the maximum bit rate that the sender is currently adhering to.

3.5.4.1. Behavior for Media Receivers Using TMMBR

A receiver estimates the maximum bit rate it can handle (e.g., based on link capacity) and sends a TMMBR message containing this value.

3.5.4.2. Algorithm for Establishing Current Limitations

A sender may receive TMMBR messages from multiple receivers. It needs to calculate the "bounding set" of these requests: in essence, it finds the minimum requested bit rate across all receivers (or at least the set of receivers it cares about) and limits its sending rate to that value. The algorithm describes how to maintain the state of received TMMBR messages and how to calculate the current limit.
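
The simplified view given above (track the latest request per receiver and apply the minimum) can be sketched as follows. The class TmmbrState and its method names are hypothetical. Note also that this sketch reduces each request to a single bit rate value, whereas the formal algorithm is more involved, for example because it also takes per-packet overhead into account.

```python
from typing import Dict


class TmmbrState:
    """Hypothetical per-stream state kept by a media sender."""

    def __init__(self, session_max_bps: int) -> None:
        # Upper limit negotiated at session setup (assumed to be known).
        self.session_max_bps = session_max_bps
        self._requests: Dict[int, int] = {}  # receiver id -> requested max

    def update(self, receiver: int, max_bps: int) -> None:
        """Record the most recent request from a receiver."""
        self._requests[receiver] = max_bps

    def remove(self, receiver: int) -> None:
        """Forget a receiver, e.g. after it has left the session."""
        self._requests.pop(receiver, None)

    def current_limit(self) -> int:
        """Lowest requested rate, never above the session maximum."""
        return min([self.session_max_bps, *self._requests.values()])
```

Keeping the per-receiver state, rather than only the current minimum, lets the limit rise again when the most restrictive receiver leaves or relaxes its request.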

3.5.4.3. Use of TMMBR in a Mixer-Based Multipoint Operation

A mixer acts as a receiver for the endpoints sending to it, and as a sender for the endpoints receiving from it. It can use TMMBR to limit the rate of incoming streams, and it must respect TMMBR messages received from endpoints it is sending to.
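
The mixer's dual role can be sketched as follows, assuming a mixer that forwards streams without trans-rating, so that the lowest limit among the endpoints it sends to applies to every incoming stream. The function name upstream_requests and both mappings are hypothetical.

```python
from typing import Dict, Mapping


def upstream_requests(
    downstream_limits: Mapping[str, int],
    negotiated_max: Mapping[str, int],
) -> Dict[str, int]:
    """Compute the bit rate a mixer would request from each sending endpoint.

    downstream_limits: limits received via TMMBR from endpoints the mixer
                       sends to.
    negotiated_max:    per-endpoint maximum negotiated at session setup.
    """
    if not downstream_limits:
        return dict(negotiated_max)  # nothing restricts the senders further
    lowest = min(downstream_limits.values())
    # A request never exceeds what was negotiated for that endpoint.
    return {ep: min(mx, lowest) for ep, mx in negotiated_max.items()}
```

For example, with downstream limits of 768, 384, and 1000 kbit/s, every sending endpoint would be asked to stay at or below 384 kbit/s, or below its own negotiated maximum if that is lower still.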

3.5.4.4. Use of TMMBR in Point-to-Multipoint Using Multicast or Translators

In multicast, the sender must respect the lowest requested bit rate from the group of receivers (or use some other policy).

3.5.4.5. Use of TMMBR in Point-to-Point Operation

This is the simplest case: the receiver requests a limit, and the sender respects it.

3.5.4.6. Reliability

TMMBR is a request. TMMBN is a notification that serves as an acknowledgement. If a receiver sends TMMBR and does not receive a corresponding TMMBN (or see a rate reduction), it re-sends the TMMBR.
