4. Application to Decoding Order Recovery in Layered Codecs

Packets in RTP flows are often predictively coded, with a receiver having to arrange the packets into a particular order before it can decode the media data. Depending on the payload format, the decoding order might be explicitly specified as a field in the RTP payload header, or the receiver might decode the packets in order of their RTP timestamps. If a layered encoding is used, where the media data is split across several RTP flows, then it is often necessary to exactly synchronise the RTP flows comprising the different layers before layers other than the base layer can be decoded. Examples of such layered encodings are H.264 SVC in NI-T mode [AVT-RTP-SVC] and MPEG surround multi-channel audio [RFC5691]. As described in Section 2, such synchronisation is possible in RTP, but can be difficult to perform rapidly. Below, we describe how the extensions defined in Section 3.3 can be used to synchronise layered flows, and provide a common timestamp-based decoding order.

4.1. In-Band Synchronisation for Decoding Order Recovery

When a layered, multi-description, or multi-view codec is used, with the different components of the media being transferred on separate RTP flows, the RTP sender SHOULD use periodic synchronous in-band delivery of synchronisation metadata to allow receivers to rapidly and accurately synchronise the separate components of the layered media flow. There are three parts to this:

The sender must negotiate the use of the RTP header extensions described in Section 3.3, and must periodically and synchronously insert such header extensions into all the RTP flows forming the separate components of the layered, multi-description, or multi-view flow.
Synchronous insertion requires that the sender insert these RTP header extensions into packets corresponding to exactly the same sampling instant in all the flows. Since the header extensions for each flow are inserted at exactly the same sampling instant, they will have identical NTP-format timestamps, hence allowing receivers to exactly align the RTP timestamps for the component flows. This may require the insertion of extra data packets into some of the component RTP flows, if some component flows contain packets for sampling instants that do not exist in other flows (for example, a layered video codec, where the layers have differing frame rates).
The frequency with which the sender inserts the header extensions will directly correspond to the synchronisation latency, with more frequent insertion leading to higher per-flow overheads, but lower synchronisation latency. It is RECOMMENDED that the sender insert the header extensions synchronously into all component RTP flows at least once per random access point of the media, but they MAY be inserted more often.

The sender MUST continue to send periodic RTCP reports including SR packets, and MUST ensure the RTP timestamp to NTP-format timestamp mapping in the RTCP SR packets is consistent with that used in the RTP header extensions. Receivers should use both the information contained in RTCP SR packets and the in-band mapping of RTP and NTP-format timestamps as input to the synchronisation process, but it is RECOMMENDED that receivers sanity check the mappings received and discard outliers, to provide robustness against invalid data (one might think it more likely that the RTCP SR mappings are invalid, since they are sent at irregular times and subject to skew, but the presence of broken RTP translators could also corrupt the timestamps in the RTP header extension; receivers need to cope with both types of failure).

4.2. Timestamp-Based Decoding Order Recovery

Once a receiver has synchronised the components of a layered, multi-description, or multi-view flow using the RTP header extensions as described in Section 4.1, it may then derive a decoding order based on the synchronised timestamps as follows (or it may use information in the RTP payload header to derive the decoding order, if present and desired).

There may be explicit dependencies between the component flows of a layered, multi-description, or multi-view flow. For example, it is common for layered flows to be arranged in a hierarchy, where flows from "higher" layers cannot be decoded until the corresponding data in "lower" layer flows has been received and decoded. If such a decoding hierarchy exists, it MUST be signalled out of band, for example using [RFC5583] when SDP signalling is used.

Each component RTP flow MUST contain packets corresponding to all the sampling instants of the RTP flows on which it depends. If such packets are not naturally present in the RTP flow, the sender MUST generate additional packets as necessary in order to satisfy this rule. The format of these packets depends on the payload format used. For H.264 SVC, the Empty Network Abstraction Layer (NAL) unit packet [AVT-RTP-SVC] should be used. Flows may also include packets corresponding to additional sampling instants that are not present in the flows on which they depend.

The receiver should decode the packets in all the component RTP flows as follows:

For each RTP packet in each flow, use the mapping contained in the RTP header extensions and RTCP SR packets to derive the NTP-format timestamp corresponding to its RTP timestamp.
Group together RTP data packets from all component flows that have identical calculated NTP-format timestamps.
Processing groups in order of ascending NTP-format timestamps, decode the RTP packets in each group according to the signalled RTP flow decoding hierarchy. That is, pass the RTP packet data from the flow on which all other flows depend to the decoder first, then that from the next dependent flow, and so on. The decoding order of the RTP flow hierarchy may be indicated by mechanisms defined in [RFC5583] or by some other means.

Note that the decoding order will not necessarily match the packet transmission order. The receiver will need to buffer packets for a codec-dependent amount of time in order for all necessary packets to arrive to allow decoding.

4.3. Example

The example shown in Figure 7 refers to three RTP flows A, B, and C, containing a layered, a multi-view, or a multi-description media stream. In the example, the dependency signalling as defined in [RFC5583] indicates that flow A is the lowest RTP flow. Flow B is the next higher RTP flow and depends on A. Flow C is the highest of the three RTP flows and depends on both A and B. A media coding structure is used that results in video access units (i.e., coded video frames) present in higher flows but not present in all lower flows. Flow A has the lowest frame rate. Flows B and C have the same frame rate, which is higher than that of Flow A. The figure shows the full video access units with their corresponding RTP timestamps "(x)". The video access units are already re-ordered according to their RTP sequence number order. The figure indicates the received video access unit part in decoding order within each RTP flow, as well as the associated NTP media timestamps ("TS[..]"). As shown in the figure, these timestamps may be derived using the NTP-format timestamp provided in the RTCP sender reports as indicated by the timestamp in "{x}", or derived directly from the NTP timestamp contained in the RTP header extensions as indicated by the timestamp in "<x>". Note that the timestamps are not in increasing order since, in this example, the decoding order is different from the output/presentation order.

The decoding order recovery process first advances to the video access unit parts associated with the first available synchronous insertion of the NTP timestamp into RTP header extensions at NTP media timestamp TS=[8]. The receiver starts in the highest RTP flow C and removes/ignores all preceding video access unit parts (in decoding order) to video access unit parts with TS=[8] in each of the de-jittering buffers of RTP flows A, B, and C. Then, starting from flow C, the first media timestamp available in decoding order (TS=[8]) is selected, and video access unit parts starting from RTP flow A, and flows B and C are placed in order of the RTP flow dependency as indicated by mechanisms defined in [RFC5583] (in the example for TS[8]: first flow B and then flow C into the video access unit AU(TS[8]) associated with NTP media timestamp TS=[8]). Then the next media timestamp TS=[6] (RTP timestamp=(4)) in order of appearance in the highest RTP flow C is processed, and the process described above is repeated. Note that there may be video access units with no video access unit parts present, e.g., in the lowest RTP flow A (see, e.g., TS=[5]). The decoding order recovery process could also be started after an RTP sender report containing the mapping between the RTP timestamp and the NTP-format timestamp (indicated as timestamps "(x){y}") has been received, assuming that there is no clock skew in the source used for the NTP-format timestamp generation.

   C:-(0)----(2)----(7)<8>--(5)----(4)----(6)-----(11)----(9){10}-
      |      |      |       |      |      |       |       |
   B:-(3)----(5)---(10)<8>--(8)----(7)----(9){7}--(14)----(12)----
                    |       |                     |       |
   A:---------------(3)<8>--(1)-------------------(7){12}-(5)-----

   ---------------------------------------decoding/transmission order->
   TS:[1]    [3]    [8]=<8> [6]    [5]    [7]    [12]    [10]


   Key:

   A, B, C                - RTP flows

   Integer values in "()" - video access unit with its RTP timestamp as
                            indicated in its RTP packet.

   "|"                    - indicates the corresponding parts of the
                            same video access unit AU(TS[..]) in the
                            RTP flows.

   Integer values in "[]" - NTP media timestamp TS, sampling time
                            as derived from the NTP timestamp
                            associated with the video access unit
                            AU(TS[..]), consisting of video access unit
                            parts in the flows above.

   Integer values in "<>" - NTP media timestamp TS as directly
                            taken from the NTP RTP header extensions.

   Integer values in "{}" - NTP media timestamp TS as provided in the
                            RTCP sender reports.

                 Figure 7: Example of a Layered RTP Stream

4.1. In-Band Synchronisation for Decoding Order Recovery​

4.2. Timestamp-Based Decoding Order Recovery​

4.3. Example​

4.1. In-Band Synchronisation for Decoding Order Recovery

4.2. Timestamp-Based Decoding Order Recovery

4.3. Example