1. Introduction

The Session Initiation Protocol (SIP) [RFC3261] and the Session Description Protocol (SDP) [RFC4566] are used to set up multimedia sessions or calls. SDP is also used to set up TCP [RFC4145] and additionally TCP/TLS connections for usage with media sessions [RFC4572]. The Real-time Transport Protocol (RTP) [RFC3550] is used to transmit real-time media on top of UDP and TCP [RFC4571]. Datagram TLS [RFC4347] was introduced to allow TLS functionality to be applied to datagram transport protocols, such as UDP and DCCP. This document provides guidelines on how to establish SRTP [RFC3711] security over UDP using an extension to DTLS (see [RFC5764]).

The goal of this work is to provide a key negotiation technique that allows encrypted communication between devices with no prior relationships. It also does not require the devices to trust every call signaling element that was involved in routing or session setup. This approach does not require any extra effort by end users and does not require deployment of certificates that are signed by a well- known certificate authority to all devices.

The media is transported over a mutually authenticated DTLS session where both sides have certificates. It is very important to note that certificates are being used purely as a carrier for the public keys of the peers. This is required because DTLS does not have a mode for carrying bare keys, but it is purely an issue of formatting. The certificates can be self-signed and completely self-generated. All major TLS stacks have the capability to generate such certificates on demand. However, third-party certificates MAY also be used if the peers have them (thus reducing the need to trust intermediaries). The certificate fingerprints are sent in SDP over SIP as part of the offer/answer exchange.

The fingerprint mechanism allows one side of the connection to verify that the certificate presented in the DTLS handshake matches the certificate used by the party in the signaling. However, this requires some form of integrity protection on the signaling. S/MIME signatures, as described in RFC 3261, or SIP Identity, as described in [RFC4474], provide the highest level of security because they are not susceptible to modification by malicious intermediaries. However, even hop-by-hop security, such as provided by SIPS, offers some protection against modification by attackers who are not in control of on-path signaling elements. Because DTLS-SRTP only requires message integrity and not confidentiality for the signaling, the number of elements that must have credentials and be trusted is significantly reduced. In particular, if RFC 4474 is used, only the Authentication Service need have a certificate and be trusted. Intermediate elements cannot undetectably modify the message and

therefore cannot mount a man-in-the-middle (MITM) attack. By comparison, because SDESCRIPTIONS [RFC4568] requires confidentiality for the signaling, all intermediate elements must be trusted.

This approach differs from previous attempts to secure media traffic where the authentication and key exchange protocol (e.g., Multimedia Internet KEYing (MIKEY) [RFC3830]) is piggybacked in the signaling message exchange. With DTLS-SRTP, establishing the protection of the media traffic between the endpoints is done by the media endpoints with only a cryptographic binding of the media keying to the SIP/SDP communication. It allows RTP and SIP to be used in the usual manner when there is no encrypted media.

In SIP, typically the caller sends an offer and the callee may subsequently send one-way media back to the caller before a SIP answer is received by the caller. The approach in this specification, where the media key negotiation is decoupled from the SIP signaling, allows the early media to be set up before the SIP answer is received while preserving the important security property of allowing the media sender to choose some of the keying material for the media. This also allows the media sessions to be changed, rekeyed, and otherwise modified after the initial SIP signaling without any additional SIP signaling.

Design decisions that influence the applicability of this specification are discussed in Section 3.