<?xml version="1.0" encoding="utf-8"?>
<!-- name="GENERATOR" content="github.com/mmarkdown/mmark Mmark Markdown Processor - mmark.miek.nl" -->
<rfc version="3" ipr="noModificationTrust200902" docName="draft-theo-hesp-06" submissionType="independent" category="info" xml:lang="en" xmlns:xi="http://www.w3.org/2001/XInclude" indexInclude="true">

<front>
<title abbrev="HESP">HESP - High Efficiency Streaming Protocol</title><seriesInfo value="draft-theo-hesp-06" status="informational" name="Informational"></seriesInfo>
<author role="editor" initials="P." surname="Speelmans" fullname="Pieter-Jan Speelmans"><organization>THEO Technologies</organization><address><postal><street></street>
<city>Leuven</city>
<country>Belgium</country>
</postal><email>pieter-jan.speelmans@theoplayer.com</email>
</address></author><date/>
<area>General</area>
<workgroup>Individual</workgroup>
<keyword>template</keyword>

<abstract>
<t>This document describes a protocol for delivering multimedia data, enabling ultra-low latency and fast channel change over HTTP networks. It specifies the data format of the files and the actions to be taken by the server (sender) and the clients (receivers) of the streams. It describes version 2 of this protocol.</t>
</abstract>

</front>

<middle>

<section anchor="introduction"><name>Introduction</name>
<t>Viewers are more demanding than ever, calling for a streaming protocol that combines ultra-low latency, fast zapping and cost-effective scalability.</t>
<t>HESP is an HTTP-based streaming approach that works with standard contribution feeds from (live) productions, standard (albeit specifically configured) encoders, an HESP-compliant packager, regular CDNs and an HESP-compliant player.</t>
<t>HESP offers sub-second latency, near real-time interactivity, fast startup and channel change times and cost-effective scalability up to millions of viewers.</t>
<t>The purpose of this document is to facilitate interoperability between HESP implementations by describing the media transmission protocol.</t>
<t>This document describes version 2 of the protocol.</t>
<t>The key words &quot;MUST&quot;, &quot;MUST NOT&quot;, &quot;REQUIRED&quot;, &quot;SHALL&quot;, &quot;SHALL NOT&quot;, &quot;SHOULD&quot;, &quot;SHOULD NOT&quot;, &quot;RECOMMENDED&quot;, &quot;MAY&quot;, and &quot;OPTIONAL&quot; in this document are to be interpreted as described in <xref target="RFC2119"></xref>.</t>
</section>

<section anchor="overview"><name>Overview</name>
<t>This section contains an overview of the HESP protocol and its building blocks.</t>

<section anchor="hesp-components"><name>HESP components</name>
<t>HESP follows a regular approach to online video streaming. The content is ingested and transcoded in different qualities. Each quality requires two streams. The encoded streams are packaged by an HESP packager and made available via an origin server. An HESP player requests the stream using HTTP requests. A CDN can be used.</t>
<figure anchor="hesp-chain"><name>HESP chain from source to playback
</name>
<artwork align="center" name="Start of stream"><![CDATA[ _________   ________   __________   ________   _____   ________
|         | |        | |          | |        | |     | |        |
|  Input  |\| xcoder |\| packager |\| origin |\| CDN |\|  HESP  |
| Streams |/|        |/|          |/|        |/|     |/| player |
|_________| |________| |__________| |________| |_____| |________|
]]>
</artwork>
</figure>
</section>

<section anchor="two-complementary-streams"><name>Two complementary streams</name>
<t>HESP is based on using two streams for each track: the Initialization Stream and the Continuation Stream. The encoder MUST ensure that the corresponding media data of both streams are issued with synchronized presentation timestamps. The packager MUST ensure this data remains in sync.</t>
<t>The Initialization Stream consists of Initialization Packets. Initialization Packets MUST be individually addressable. An Initialization Packet MUST contain an independent media sample, the reference to the segment and the position in the segment where the Continuation Stream can start. Since the packet starts with an independent sample, playback can start with any Initialization Packet.</t>
<t>Playback of the Continuation Stream can start immediately after an Initialization Packet, allowing for very fast channel start and switch times. This mechanism places referencing constraints on the Continuation Stream. When a reference is made from the Initialization Stream to a Continuation Stream, all data needed to render the sample with the subsequent presentation timestamp (PTS) MUST be available at the reference offset. In addition, the samples in the Initialization Stream and the samples in the Continuation Stream MUST be aligned. That is, the corresponding media samples of both streams MUST have the same PTS and MUST be made available at the same time. The Continuation Stream is addressed using byte-range requests. The Continuation Stream SHOULD be published in chunks in order to reduce end-to-end latency.</t>
<t>The player receives the Initialization Packet, initializes the decoder, puts the media data in the decoder buffer, and requests the subsequent media data from the Continuation Stream (using HTTP Range requests, starting at the range offset given by the Initialization Packet).</t>
<figure><name>Start of an HESP stream
</name>
<artwork><![CDATA[time sequence              T1   T2   T3   T4   T5   T6   T7   T8	
                         +----+----+----+----+----+----+----+----+
initialization stream    | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 |	
                         +----+----+----+----+----+----+----+----+	
                         +----+----+----+----+----+----+----+----+	
continuation stream      | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |	
                         +----+----+----+----+----+----+----+----+	
                                   +----+----+----+----+----+----+
playback buffer                    | I3 | C4 | C5 | C6 | C7 | C8 |	
                                   +----+----+----+----+----+----+
]]>
</artwork>
</figure>
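<t>As a rough illustration of this start-up step, the open-ended byte-range request could be sketched in Python as follows. The function name, the example offset and the choice to return a header dictionary are assumptions made for this sketch; only the use of an HTTP Range request starting at the offset given by the Initialization Packet comes from the text above.</t>
<sourcecode type="python"><![CDATA[
def continuation_range_header(offset: int) -> dict:
    """Build an open-ended HTTP Range header starting at `offset` bytes.

    `offset` is assumed to be the byte position signalled by the
    Initialization Packet (hypothetical parameter name).
    """
    if offset < 0:
        raise ValueError("byte offset must be non-negative")
    # "bytes=N-" asks for everything from byte N to the end of the resource.
    return {"Range": f"bytes={offset}-"}


# Illustrative value only: start reading the Continuation Stream at byte 18432.
headers = continuation_range_header(18432)
]]></sourcecode>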
</section>

<section anchor="hesp-object-model"><name>HESP object model</name>
<t>The object model follows the CMAF media object model <xref target="CMAF"></xref>.</t>
<t>A Track is used to contain media samples (audio or video) or metadata. It consists of a Continuation Stream and (except for metadata Tracks) an Initialization Stream.</t>
<t>The Initialization Stream consists of Initialization Packets, where each such packet MUST contain a CMAF Header and possibly a CMAF Fragment. Each Initialization Packet MUST be individually addressable through Sequence Numbers. It is further explained in <xref target="cha-initialization"></xref>.</t>
<t>The Continuation Stream consists of Continuation Segments. These Segments MUST consist of CMAF Chunks that can be sent individually to clients. It is further explained in <xref target="cha-continuation"></xref>.</t>
<t>A Switching Set groups together Tracks that contain the same content but with different encoding parameters (e.g., different resolution or different bitrate). A client is able to seamlessly switch between Tracks of a Switching Set as a result. If multiple Switching Sets contain differing content but are aligned in their timings (e.g., multiple view perspectives on the same performance or different languages of the same audio), we can consider them Aligned Switching Sets.</t>
<t>A Selection Set groups together multiple Switching Sets of the same media type. HESP currently allows 3 Selection Sets: audio, video and metadata Selection Sets.</t>
<t>A Presentation contains the Selection Sets for a given period of time. Multiple Presentations together form a continuous timeline of media content, even though each Presentation might have different content, encodings or timestamps. A Presentation can be, for example, an advertisement, a part of a show, or the first half of a game. The Presentation is the lowest granularity inside a Manifest. All Tracks of a Presentation MUST have media data for the full duration of the Presentation.</t>
<t>The Manifest informs clients of the aforementioned data structure and must be retrieved before any media data. The format of the Manifest is detailed in <xref target="cha-manifest"></xref>.</t>
<t>A new Manifest is not needed to request a new Segment. Segment addressing can happen automatically within a Presentation for an efficient and continuous delivery of the Continuation Stream. A Manifest only needs to be updated before the start of a new Presentation. This allows for low-frequency Manifest updates, though the update rate can be freely configured.</t>
<t>Since Presentations can be of unknown duration, a different mechanism is used to signal a new Presentation to a client. To that end, in-band metadata events are introduced in the Continuation Stream. Such an event can signal the client to retrieve a new Manifest when information about an upcoming Presentation becomes available.</t>
</section>

<section anchor="reference-flow"><name>Reference flow</name>
<figure anchor="fig-reference-flow"><name>HESP reference flow
</name>
<artwork><![CDATA[         +---------+                                   +---------+
         | Player  |                                   | Origin  |
         +---------+                                   +---------+
              |                                             |
 _____________|_____________________________________________|_____
! LOOP   / all presentations                                |     !
!_______/     |                                             |     !
!             |                                             |     !
!             | request Manifest file                       |     !
!             |-------------------------------------------->|     !
!             |                                             |     !
!             |                             Return Manifest |     !
!             |<--------------------------------------------|     !
!             |                                             |     !
!             | parse Manifest                              |     !
!             |---------------                              |     !
!             |              |                              |     !
!             |<--------------                              |     !
!             |                                             |     !
!             |                                             |     !
!             | Request Initialization Packet               |     !
!             |-------------------------------------------->|     !
!             |                                             |     !
!             |               Return Initialization Packet  |     !
!             |<--------------------------------------------|     !
!             |                                             |     !
!             | Initialize decode pipeline                  |     !
!             |---------------------------                  |     !
!             |                          |                  |     !
!             |<--------------------------                  |     !
!             |                                             |     !
!             | Parse position information                  |     !
!             | of the corresponding data                   |     !
!             | in the Continuation Stream                  |     !
!             |--------------------------                   |     !
!             |                          |                  |     !
!             |<-------------------------                   |     !
!             |                                             |     !
!             |                                             |     !
!             | Request Continuation Stream:                |     !
!             | (Segment n) byte-range [now, -)             |     !
!             |-------------------------------------------->|     !
!             |                                             |     !
!             |             Return CMAF Fragments until the |     !
!             |                          end of the Segment |     !
!             |<--------------------------------------------|     !
!   __________|_____________________________________________|__   !
!  ! LOOP   / all Segments of the Presentation              |  !  !
!  !_______/  |                                             |  !  !
!  !          |                                             |  !  !
!  !          | Request Continuation Stream:                |  !  !
!  !          | (Segment n+i) without byte-range            |  !  !
!  !          |-------------------------------------------->|  !  !
!  !          |                                             |  !  !
!  !          |             Return CMAF Fragments until the |  !  !
!  !          |                          end of the Segment |  !  !
!  !          |<--------------------------------------------|  !  !
!  !          |                                             |  !  !
!  !__________|_____________________________________________|__!  !
!             |                                             |     !
!_____________|_____________________________________________|_____!
              |                                             !
              |                                             !
]]>
</artwork>
</figure>
<t>Although the Manifest is shared, video, audio and metadata (e.g., subtitles) media data MUST be distributed separately. Each requires separate requests from the client.</t>
</section>
</section>

<section anchor="cha-manifest"><name>HESP Manifest</name>
<t>The first step for playback of an HESP stream is to fetch a Manifest. It contains information on the available Tracks and how to request content from every Track's Initialization and Continuation Streams.</t>

<section anchor="timestamps"><name>Timestamps</name>

<section anchor="manifest-timestamp-and-media-timestamp"><name>Manifest Timestamp and Media Timestamp</name>
<t>A distinction is made between Manifest Timestamps and Media Timestamps in the definitions below. Media Timestamps MUST represent Presentation timestamps as they are given by the media data itself. Manifest Timestamps MUST define those same Media Timestamps, but with an additional offset such that the timestamps of all Tracks of a Presentation are aligned. This offset MUST be given by the Manifest for each Track or Switching Set.</t>
<t>For example, consider the first Segment of a Track with Media Timestamps starting at 0 seconds. This Track is part of the second Presentation of an HESP stream. The first Presentation has run for a significant amount of time and the second Presentation follows immediately after the first Presentation ends. In this case, the Manifest Timestamps of the second Presentation must succeed the Manifest Timestamps of the first Presentation. As a result, the Media Timestamps and Manifest Timestamps of the second Presentation now differ. An offset must be given by the Manifest in order to inform clients about this difference between timestamp types for each Track of the second Presentation.</t>
<t>An example of this distinction is given in <xref target="use-case-timing-information"></xref>.</t>
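<t>The relation between the two timestamp types can be sketched in Python as follows. This is a minimal illustration, not a normative algorithm: the function names and the numeric values are assumptions, and the direction of the offset (added to a Manifest Timestamp to obtain a Media Timestamp) follows the description of the <tt>mediaTimeOffset</tt> field.</t>
<sourcecode type="python"><![CDATA[
from fractions import Fraction


def to_media_time(manifest_time: Fraction, media_time_offset: Fraction) -> Fraction:
    """Manifest Timestamp -> Media Timestamp (offset is added)."""
    return manifest_time + media_time_offset


def to_manifest_time(media_time: Fraction, media_time_offset: Fraction) -> Fraction:
    """Media Timestamp -> Manifest Timestamp (offset is subtracted)."""
    return media_time - media_time_offset


# Illustrative scenario: the second Presentation starts at Manifest time
# 3600 s, while the Media Timestamps of its Track start again at 0 s.
offset = Fraction(-3600)
media_start = to_media_time(Fraction(3600), offset)
manifest_start = to_manifest_time(Fraction(0), offset)
]]></sourcecode>
<t>Exact fractions are used here instead of floating-point numbers so that the conversion is lossless in both directions.</t>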
</section>

<section anchor="manifest-sequence-numbers"><name>Sequence Numbers</name>
<t>Each Initialization Packet belonging to a Track is given a Sequence Number. This is a simple identifier used by the client to retrieve specific Initialization Packets. The Sequence Number is a positive integer that MUST be increased by 1 for each subsequent Initialization Packet of the same Track. The first Initialization Packet of every Track in every Switching Set of a Presentation MUST contain the Manifest Timestamp of the start of that Presentation, that is, this timestamp lies within the interval between the first presentation time of that Initialization Packet and the sum of that presentation time and the durations of all frames in that Initialization Packet.</t>
</section>

<section anchor="calculating-the-sequence-number-of-an-initialization-packet"><name>Calculating the Sequence Number of an Initialization Packet</name>
<t>To calculate the Sequence Number of a chosen Initialization Packet, the Manifest provides the client multiple values. Each active Presentation MUST contain both its current Manifest Timestamp and, for each Track, the Sequence Number of the first Initialization Packet and the (constant) frame rate.</t>
<t>It is given that:</t>

<ul spacing="compact">
<li>the frame rate MUST be constant for each Track.</li>
<li>Initialization Packets MUST become available in real-time.</li>
<li>Sequence Numbers MUST be increased by 1 per Initialization Packet.</li>
<li>the Sequence Number and Manifest Timestamp of the first Initialization Packet for a Track are given by the Manifest.</li>
</ul>
<t>As a result, it is possible to derive Sequence Numbers for arbitrary Initialization Packets. By taking the initial Sequence Number of a Track, its associated Manifest Timestamp and the frame rate, it is possible to calculate the number of frames between the initial Sequence Number and any later Sequence Number of an Initialization Packet. For example, for a Video Track the full calculation can be performed as follows:
<tt>calculateSequenceNumber(timestamp) = floor((timestamp - presentation.timeBounds.startTime / presentation.timeBounds.scale) * (videoTrack.frameRate.value / videoTrack.frameRate.scale)) + videoTrack.startSequenceNumber</tt></t>
<t>For example, suppose the start Sequence Number is 34 at Manifest Timestamp <tt>00:00:01.360</tt> for a Track with a frame rate of 25 fps. To find the Sequence Number at Manifest Timestamp <tt>00:00:04.120</tt>, the time difference between both timestamps is multiplied by the frame rate and added to the start Sequence Number. This means that the Sequence Number for the provided Manifest Timestamp is <tt>(4.120s - 1.360s) * 25fps + 34 = 103</tt>.</t>
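<t>The calculation above can be sketched in Python as follows. The function and parameter names are hypothetical and merely mirror the Manifest fields used in the formula; timestamps are assumed to be expressed in seconds as exact fractions.</t>
<sourcecode type="python"><![CDATA[
from fractions import Fraction
from math import floor


def calculate_sequence_number(
    timestamp: Fraction,         # Manifest Timestamp in seconds
    start_time: int,             # presentation.timeBounds.startTime
    time_scale: int,             # presentation.timeBounds.scale
    frame_rate_value: int,       # track.frameRate.value
    frame_rate_scale: int,       # track.frameRate.scale
    start_sequence_number: int,  # track.startSequenceNumber
) -> int:
    """Derive the Sequence Number of the Initialization Packet at `timestamp`."""
    elapsed = Fraction(timestamp) - Fraction(start_time, time_scale)
    frames = elapsed * Fraction(frame_rate_value, frame_rate_scale)
    # Whole frames elapsed since the first Initialization Packet.
    return floor(frames) + start_sequence_number


# The worked example from the text: 25 fps, Sequence Number 34 at 1.360 s.
result = calculate_sequence_number(
    Fraction(4120, 1000), 1360, 1000, 25, 1, 34
)
]]></sourcecode>
<t>With the values of the worked example, <tt>result</tt> equals 103.</t>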
<t>In case the encoder suffers from drift and is not synchronized with a wall clock, a packager SHOULD compensate for this drift by creating a new Presentation.</t>
</section>
</section>

<section anchor="manifest-data-types"><name>Manifest data types</name>
<t>The Manifest MUST be formatted as a JSON file. All data types used in the information below MUST satisfy the JavaScript Object Notation specification <xref target="RFC8259"></xref>. In regard to the definition of <tt>number</tt>s within the JSON specification, their range and precision are implementation-dependent. It should be noted that for interoperability, <tt>number</tt>s SHOULD be representable using IEEE 754 double-precision numbers.</t>
<t>Some additional data types are introduced on top of the JSON specification to narrow the allowed values of some fields further.</t>

<section anchor="additional-json-data-types"><name>Additional JSON data types</name>

<section anchor="integer"><name>Integer</name>
<t>An <tt>integer</tt> (or <tt>int</tt>) is a subset of the <tt>number</tt> type defined in the JSON specification, limited to integer numbers. Concretely, following Section 6 of the JSON specification, a <tt>number</tt> MUST NOT have a fraction part to be considered an <tt>integer</tt>. Note that as mentioned above, <tt>integer</tt>s SHOULD be representable using IEEE 754 double-precision numbers. As a result, <tt>integer</tt>s SHOULD be within the [(-2**53)+1, (2**53)-1] range to make sure implementations agree on the exact numeric value.</t>
</section>

<section anchor="unsigned-integer"><name>Unsigned integer</name>
<t>An <tt>unsigned integer</tt> (or <tt>uint</tt>) is a subset of the <tt>integer</tt> type, only allowing for non-negative integer numbers. Concretely, following Section 6 of the JSON specification, an <tt>integer</tt> MUST NOT contain a minus to be considered an <tt>unsigned integer</tt>.</t>
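<t>As an illustration, the <tt>integer</tt> and <tt>unsigned integer</tt> rules can be checked on the raw JSON number token, as in the Python sketch below. The function names are hypothetical. Rejecting exponent notation (such as <tt>1e3</tt>) is a conservative assumption of this sketch; the text above only excludes a fraction part.</t>
<sourcecode type="python"><![CDATA[
import re

# JSON "int" production with an optional leading minus; no fraction part.
# Assumption of this sketch: exponent notation is rejected as well.
_INTEGER_TOKEN = re.compile(r"-?(0|[1-9][0-9]*)")

# Largest magnitude recommended for interoperability: (2**53) - 1.
MAX_INTEROPERABLE = 2**53 - 1


def is_integer_token(token: str) -> bool:
    """True if the raw JSON number token qualifies as an `integer`."""
    return _INTEGER_TOKEN.fullmatch(token) is not None


def is_unsigned_integer_token(token: str) -> bool:
    """True if the token is an `integer` that contains no minus sign."""
    return is_integer_token(token) and not token.startswith("-")


def is_interoperable_integer(token: str) -> bool:
    """True if the `integer` lies within [-(2**53)+1, (2**53)-1]."""
    return (
        is_integer_token(token)
        and -MAX_INTEROPERABLE <= int(token) <= MAX_INTEROPERABLE
    )
]]></sourcecode>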
</section>

<section anchor="enumeration"><name>Enumeration</name>
<t>An enumeration defines a list of constant strings, where a valid value of this type MUST equal one of these strings. The type is written as <tt>Enum(x, y, z)</tt>, where <tt>x</tt>, <tt>y</tt> and <tt>z</tt> are the constant strings allowed to be used.</t>
</section>

<section anchor="datetime"><name>DateTime</name>
<t>A DateTime MUST be formatted as a string in the format defined by ISO 8601 <xref target="ISO8601"></xref>:</t>

<artwork><![CDATA[YYYY-MM-DDThh:mm:ss.mmmTZD

where YYYY = four-digit year
      MM = two-digit month
      DD = two-digit day of the month
      hh = two digits of the hour (00 through 23)
      mm = two digits of a minute (00 through 59)
      ss = two digits of a second (00 through 59)
      mmm = three digits of a millisecond (000 through 999)
      TZD = time zone designator (Z or +hh:mm or -hh:mm)
]]>
</artwork>
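<t>A minimal Python sketch of parsing and producing this DateTime format is given below. The function names are hypothetical. The sketch assumes Python 3.7 or later, where the <tt>%z</tt> directive accepts both <tt>Z</tt> and offsets written as <tt>+hh:mm</tt>.</t>
<sourcecode type="python"><![CDATA[
import re
from datetime import datetime, timezone

# Strict shape check: exactly three millisecond digits and a time zone
# designator of the form Z, +hh:mm or -hh:mm.
_DATETIME_SHAPE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"\.[0-9]{3}(Z|[+-][0-9]{2}:[0-9]{2})"
)


def parse_datetime(text: str) -> datetime:
    """Parse a DateTime string into a timezone-aware datetime."""
    if _DATETIME_SHAPE.fullmatch(text) is None:
        raise ValueError(f"not a valid DateTime: {text!r}")
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def format_datetime(moment: datetime) -> str:
    """Format a timezone-aware datetime with millisecond precision."""
    if moment.tzinfo is None:
        raise ValueError("a time zone designator is required")
    # UTC is written as +00:00 here, which is a valid +hh:mm designator.
    return moment.isoformat(timespec="milliseconds")


parsed = parse_datetime("2024-05-01T12:30:15.250Z")
]]></sourcecode>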
</section>
</section>

<section anchor="manifesttype"><name>ManifestType</name>
<t>This data type represents the root of the Manifest. The structure consists of general playback information and a collection of Presentations.</t>
<t>The table below gives the possible fields. The &quot;Required?&quot; column indicates if a field is REQUIRED (Y) or OPTIONAL (N).</t>
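<t>The required fields of the table that follows, together with its conditional rules for live streams, can be checked mechanically on a parsed Manifest. The Python sketch below shows one possible approach; the function name and all example values are assumptions made for illustration and are not taken from a real stream.</t>
<sourcecode type="python"><![CDATA[
# Fields marked "Y" in the ManifestType table.
REQUIRED_FIELDS = (
    "availabilityDuration",
    "creationDate",
    "fallbackPollRate",
    "manifestVersion",
    "presentations",
    "streamType",
)

# Fields that MUST be available when streamType equals "live".
LIVE_REQUIRED_FIELDS = ("activePresentation", "currentTime")


def missing_manifest_fields(manifest: dict) -> list:
    """Return the names of mandatory top-level fields that are absent."""
    missing = [name for name in REQUIRED_FIELDS if name not in manifest]
    if manifest.get("streamType") == "live":
        missing += [n for n in LIVE_REQUIRED_FIELDS if n not in manifest]
    return missing


# Illustrative Manifest skeletons (values are placeholders).
vod_manifest = {
    "availabilityDuration": {"value": 0},
    "creationDate": "2024-05-01T12:30:15.250Z",
    "fallbackPollRate": 300,
    "manifestVersion": "2.0.0",
    "presentations": [],
    "streamType": "vod",
}
live_manifest = dict(vod_manifest, streamType="live")
]]></sourcecode>
<t>In this sketch, the VOD skeleton has no missing fields, whereas the live skeleton is reported as lacking <tt>activePresentation</tt> and <tt>currentTime</tt>.</t>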
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>availabilityDuration</td>
<td>ScaledValue</td>
<td>Y</td>
<td>The amount of time (in seconds) that live content MUST be kept available to be retrieved by a client. It defines an interval [live - n, live], where live is the current live point and n is <tt>availabilityDuration</tt>, within which content MUST be available to clients of a live stream. For VOD streams, this value MUST be ignored.</td>
</tr>

<tr>
<td>creationDate</td>
<td>DateTime</td>
<td>Y</td>
<td>The timestamp of the moment that this specific Manifest was created by the packager (in local packager time).</td>
</tr>

<tr>
<td>fallbackPollRate</td>
<td>uint</td>
<td>Y</td>
<td>The number of seconds a player SHOULD wait before polling for a new Manifest, if it has not requested one since retrieving the current Manifest.</td>
</tr>

<tr>
<td>manifestVersion</td>
<td>Enum(&quot;2.0.0&quot;)</td>
<td>Y</td>
<td>The version number of the Manifest.</td>
</tr>

<tr>
<td>presentations</td>
<td>Presentation<br />
Type[]</td>
<td>Y</td>
<td>A list of all Presentations that are currently known.</td>
</tr>

<tr>
<td>streamType</td>
<td>Enum(&quot;live&quot;, &quot;vod&quot;)</td>
<td>Y</td>
<td>Indicates whether the stream is a live stream (<tt>live</tt>) which does not have a known ending, or a video on demand stream (<tt>vod</tt>) with a known ending and duration.</td>
</tr>

<tr>
<td>activePresentation</td>
<td>string</td>
<td>N</td>
<td>The identifier of the currently active Presentation. This value MUST be available if <tt>streamType</tt> equals <tt>live</tt>, but MUST be ignored if <tt>streamType</tt> equals <tt>vod</tt>.</td>
</tr>

<tr>
<td>currentTime</td>
<td>ScaledValue</td>
<td>N</td>
<td>The most recent composition timestamp of any Track contained by this Manifest at the moment this Manifest is generated. It SHOULD be specified in Manifest Time, not in Media Time (as that could have different offsets from Track to Track). This value MUST be available if <tt>streamType</tt> equals <tt>live</tt>, but MUST be ignored if <tt>streamType</tt> equals <tt>vod</tt>.</td>
</tr>

<tr>
<td>contentBaseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL for all content requests relating to this Manifest. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>timeSource</td>
<td>TimeSource</td>
<td>N</td>
<td>A reference to a time server with which the packager is synced. This value MUST be ignored if <tt>streamType</tt> equals <tt>vod</tt>.</td>
</tr>
</tbody>
</table></section>

<section anchor="timesource"><name>TimeSource</name>
<t>A TimeSource is used to sync the packager and player internal clocks to make requests for data only once it is available.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>scheme</td>
<td>string</td>
<td>Y</td>
<td>A scheme identifier for the time source, defined as a valid <tt>@schemeIdURI</tt> in <xref target="DASH"></xref> section 5.8.4.11 Table 32.</td>
</tr>

<tr>
<td>value</td>
<td>string</td>
<td>Y</td>
<td>The information indicating where the time source can be found as indicated in <xref target="DASH"></xref> section 5.8.5.7.</td>
</tr>
</tbody>
</table></section>

<section anchor="scaledvalue"><name>ScaledValue</name>
<t>In order to avoid rounding issues introduced through floating-point numbers, this structure defines two integers. The total value is calculated by dividing <tt>value</tt> by <tt>scale</tt>.</t>
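<t>As an illustration, a ScaledValue can be converted into an exact rational number as in the Python sketch below. The function name and the example values are assumptions; rejecting a <tt>scale</tt> of zero is a choice of this sketch rather than a rule stated in this document.</t>
<sourcecode type="python"><![CDATA[
from fractions import Fraction


def scaled_value_to_fraction(scaled: dict) -> Fraction:
    """Return value / scale as an exact fraction (scale defaults to 1)."""
    scale = scaled.get("scale", 1)
    if scale == 0:
        # Assumption of this sketch: a zero scale is treated as invalid.
        raise ValueError("scale must not be zero")
    return Fraction(scaled["value"], scale)


# Illustrative values: a 30000/1001 frame rate and a plain integer value.
ntsc_rate = scaled_value_to_fraction({"value": 30000, "scale": 1001})
plain = scaled_value_to_fraction({"value": 25})
]]></sourcecode>
<t>Keeping the result as a fraction defers any rounding until a floating-point value is actually needed.</t>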
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>value</td>
<td>int</td>
<td>Y</td>
<td>The defined integer value.</td>
</tr>

<tr>
<td>scale</td>
<td>uint</td>
<td>N</td>
<td>If not defined, the scale SHALL equal 1.</td>
</tr>
</tbody>
</table></section>

<section anchor="presentationtype"><name>PresentationType</name>
<t>This is the type definition of a Presentation.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Presentation. It MUST be unique over all Presentations of this Manifest. It MUST NOT change over Manifest updates.</td>
</tr>

<tr>
<td>timeBounds</td>
<td>TimeBounds</td>
<td>Y</td>
<td>The time boundaries of this Presentation, in Manifest Time. The start time MUST be announced at least 2 seconds before the Presentation is active. The end time of an active Presentation MUST be available at least 2 seconds before the actual end of that Presentation. The <tt>startTime</tt> of the <tt>TimeBounds</tt> MUST be equal to the time corresponding to the <tt>startSequenceNumber</tt> of each <tt>Track</tt>.</td>
</tr>

<tr>
<td>audio</td>
<td>AudioSwitchingSetType[]</td>
<td>N</td>
<td>The audio Selection Set. It contains all audio Switching Sets of this Presentation. If not defined, this value SHALL equal an empty list.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Presentation. It is part of the content base URLs for all Switching Sets belonging to this Presentation. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>events</td>
<td>PresentationEventType[]</td>
<td>N</td>
<td>List of all currently available Presentation Events related to this Presentation. If not defined, this value SHALL equal an empty list.</td>
</tr>

<tr>
<td>metadata</td>
<td>MetadataSwitchingSetType[]</td>
<td>N</td>
<td>The metadata Selection Set. It contains all metadata Switching Sets of this Presentation. If not defined, this value SHALL equal an empty list.</td>
</tr>

<tr>
<td>video</td>
<td>VideoSwitchingSetType[]</td>
<td>N</td>
<td>The video Selection Set. It contains all video Switching Sets of this Presentation. If not defined, this value SHALL equal an empty list.</td>
</tr>
</tbody>
</table></section>

<section anchor="timebounds"><name>TimeBounds</name>
<t>A TimeBounds structure denotes a time interval with a start and end time. A start or end time may be undefined. The desired behavior in such a situation depends on the specific usage of the structure.</t>
<t>These boundaries are inclusive at the start time and exclusive at the end time. That is, if two TimeBounds need to be continuous, then the end time of the first one must equal the start time of the second.</t>
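<t>The half-open interval semantics can be sketched in Python as follows. The function names and example values are hypothetical. Treating an undefined start or end time as unbounded is an assumption of this sketch, since the desired behavior depends on the specific usage of the structure.</t>
<sourcecode type="python"><![CDATA[
from fractions import Fraction


def _seconds(ticks, scale):
    """Convert a tick count to seconds, or None if it is undefined."""
    return None if ticks is None else Fraction(ticks, scale)


def contains(bounds: dict, time_seconds) -> bool:
    """True if time_seconds lies in [startTime, endTime)."""
    scale = bounds.get("scale", 1)
    start = _seconds(bounds.get("startTime"), scale)
    end = _seconds(bounds.get("endTime"), scale)
    if start is not None and time_seconds < start:
        return False
    if end is not None and time_seconds >= end:
        return False  # the end time itself is excluded
    return True


def are_continuous(first: dict, second: dict) -> bool:
    """True if `second` starts exactly where `first` ends."""
    end = _seconds(first.get("endTime"), first.get("scale", 1))
    start = _seconds(second.get("startTime"), second.get("scale", 1))
    return end is not None and end == start


# Illustrative values: 0 s to 90 s (scale 1000), then 90 s to 150 s.
first = {"startTime": 0, "endTime": 90000, "scale": 1000}
second = {"startTime": 90, "endTime": 150}
]]></sourcecode>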
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>startTime</td>
<td>uint</td>
<td>N</td>
<td>This value denotes the start time in seconds when divided by the timescale.</td>
</tr>

<tr>
<td>endTime</td>
<td>uint</td>
<td>N</td>
<td>This value denotes the end time in seconds when divided by the timescale.</td>
</tr>

<tr>
<td>scale</td>
<td>uint</td>
<td>N</td>
<td>The timescale by which <tt>startTime</tt> and <tt>endTime</tt> are divided. If the timescale is not defined, it SHALL equal 1.</td>
</tr>
</tbody>
</table></section>

<section anchor="audioswitchingsettype"><name>AudioSwitchingSetType</name>
<t>This is the type definition of an audio Switching Set.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Switching Set. It MUST be unique within its Presentation.</td>
</tr>

<tr>
<td>language</td>
<td>string</td>
<td>Y</td>
<td>The language of all audio Tracks of this Switching Set. It MUST be specified here by its ISO 639-2 <xref target="ISO6392"></xref> code.</td>
</tr>

<tr>
<td>tracks</td>
<td>AudioTrackType[]</td>
<td>Y</td>
<td>The collection of all Tracks belonging to this Switching Set.</td>
</tr>

<tr>
<td>alignId</td>
<td>string</td>
<td>N</td>
<td>A unique identifier that SHOULD be set by all Switching Sets that are Aligned Switching Sets with each other.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Switching Set. It is part of the content base URLs for all Tracks belonging to this Switching Set. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>channels</td>
<td>uint</td>
<td>N</td>
<td>The audio channel configuration of all audio Tracks belonging to this Switching Set. It is defined as the total number of audio channels. This value is used for ABR selection by the player.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of all Tracks belonging to this Switching Set. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Switching Set. The pattern MUST include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>initializationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Initialization Packets belonging to this Switching Set. The pattern MUST include the <tt>{initId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Switching Set.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by Segments of this Switching Set. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>

<tr>
<td>mimeType</td>
<td>string</td>
<td>N</td>
<td>The MIME type of all content belonging to this Switching Set. It MUST be a valid audio MIME type. If not set, the MIME type of all content of this Switching Set SHALL equal <tt>audio/mp4</tt>.</td>
</tr>

<tr>
<td>protection</td>
<td>SwitchingSet<br />
Protection</td>
<td>N</td>
<td>The information related to content protection for all Tracks belonging to this Switching Set.</td>
</tr>

<tr>
<td>sampleRate</td>
<td>uint</td>
<td>N</td>
<td>The sample rate (in Hz) of all audio Tracks belonging to this Switching Set. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>samplesPerFrame</td>
<td>uint</td>
<td>N</td>
<td>The number of audio samples in one frame. If set, this MUST apply to each Track belonging to this Switching Set that does not define this value itself. If neither the Track nor the Switching Set has this value set, it SHALL equal 1024.</td>
</tr>
</tbody>
</table></section>

<section anchor="switchingsetprotection"><name>SwitchingSetProtection</name>
<t>The following fields are defined on the SwitchingSetProtection structure:</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td><tt>type</tt></td>
<td>Enum(&quot;cenc&quot;, &quot;cbcs&quot;)</td>
<td>Y</td>
<td>The protection scheme used to encrypt this Switching Set.</td>
</tr>

<tr>
<td><tt>systems</tt></td>
<td>SwitchingSetProtectionSystem[]</td>
<td>Y</td>
<td>Metadata about the DRM systems that can be used to play back this Switching Set. This list MUST contain at least one entry.</td>
</tr>
</tbody>
</table><t>More information on content protection can be found in <xref target="cha-content-protection"></xref>.</t>
</section>

<section anchor="switchingsetprotectionsystem"><name>SwitchingSetProtectionSystem</name>
<t>The following fields are defined on the SwitchingSetProtectionSystem structure:</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td><tt>pssh</tt></td>
<td>string</td>
<td>N</td>
<td>A Base64-encoded <tt>ProtectionSystemSpecificHeaderBox</tt>. If it is not defined in the Manifest, then such a box MUST be available in the Initialization Stream of each Track belonging to this Switching Set.</td>
</tr>

<tr>
<td><tt>schemeId</tt></td>
<td>string</td>
<td>Y</td>
<td>A UUID <xref target="UUID"></xref> that uniquely identifies the content protection system. It should match the System ID included in the <tt>pssh</tt> box of this protection system.</td>
</tr>
</tbody>
</table><t>Additional attributes, such as license acquisition URLs, authorization-related URLs, or specific initialization data (for example, a default key ID), MAY be defined for this structure, depending on the scheme identifier. The table below lists the additional attributes that MUST be included for a subset of System IDs. If a System ID is not listed here, additional attributes MAY still be defined by the author of the content protection system; their definition is out of scope of this document.</t>
<table>
<thead>
<tr>
<th>System ID</th>
<th>Attributes</th>
<th>Reference</th>
</tr>
</thead>

<tbody>
<tr>
<td>94ce86fb-07ff-4f43-adb8-93d2fa968ca2</td>
<td>FairPlayAttributes</td>
<td>Apple FairPlay Streaming</td>
</tr>
</tbody>
</table>
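<t>As a non-normative illustration, the rules of this section and of its FairPlayAttributes subsection could be applied to one parsed entry of the <tt>systems</tt> list as sketched below in Python. The function name <tt>read_protection_system</tt>, the use of a plain dictionary for the parsed JSON object, and the case-insensitive System ID comparison are assumptions made for this sketch; they are not defined by this document.</t>

<artwork><![CDATA[
```python
FAIRPLAY_SYSTEM_ID = "94ce86fb-07ff-4f43-adb8-93d2fa968ca2"


def read_protection_system(system):
    """Validate one SwitchingSetProtectionSystem entry and apply defaults.

    `system` is assumed to be the dictionary obtained by parsing the JSON
    object of one entry in the `systems` list (hypothetical representation).
    """
    if "schemeId" not in system:
        raise ValueError("schemeId is required")
    scheme_id = system["schemeId"]

    # Assumption: UUID strings are compared case-insensitively.
    if scheme_id.lower() != FAIRPLAY_SYSTEM_ID:
        # No additional attributes are listed for other System IDs;
        # pssh stays optional and may be absent (None).
        return {"schemeId": scheme_id, "pssh": system.get("pssh")}

    if "uri" not in system:
        raise ValueError("uri is required for FairPlay")

    # The pssh attribute is ignored for FairPlay, so it is not copied.
    return {
        "schemeId": scheme_id,
        "uri": system["uri"],
        "keyFormat": system.get("keyFormat", "identity"),
        "keyFormatVersions": system.get("keyFormatVersions", "1"),
        "iv": system.get("iv"),
    }
```
]]>
</artwork>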
<section anchor="fairplayattributes"><name>FairPlayAttributes</name>
<t>The following fields MUST be parsed from the SwitchingSetProtectionSystem structure when <tt>schemeId</tt> equals <tt>94ce86fb-07ff-4f43-adb8-93d2fa968ca2</tt>. More details about the requirements of these fields can be found in RFC 8216bis <xref target="I-D.pantos-hls-rfc8216bis"></xref>.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>uri</td>
<td>string</td>
<td>Y</td>
<td>A URI that specifies how to obtain the key.</td>
</tr>

<tr>
<td>keyFormat</td>
<td>string</td>
<td>N</td>
<td>Specifies how the key is represented in the resource identified by the <tt>uri</tt>. If not defined, this value SHALL equal <tt>identity</tt>.</td>
</tr>

<tr>
<td>keyFormatVersions</td>
<td>string</td>
<td>N</td>
<td>Indicates which version(s) of the key format this instance complies with. The value is a quoted-string containing one or more positive integers separated by the / character (for example, <tt>&quot;1&quot;</tt>, <tt>&quot;1/2&quot;</tt>, or <tt>&quot;1/2/5&quot;</tt>). If not defined, this value SHALL equal <tt>&quot;1&quot;</tt>.</td>
</tr>

<tr>
<td>iv</td>
<td>string</td>
<td>N</td>
<td>A hexadecimal-sequence that specifies a 128-bit unsigned integer Initialization Vector to be used with the key.</td>
</tr>

<tr>
<td>pssh</td>
<td>string</td>
<td>N</td>
<td>The <tt>pssh</tt> field of a SwitchingSetProtectionSystem MUST be ignored in the case of FairPlay content protection. Additionally, a <tt>ProtectionSystemSpecificHeaderBox</tt> MUST NOT be available in the Initialization Stream of any Track belonging to this Switching Set in this case.</td>
</tr>
</tbody>
</table></section>
</section>

<section anchor="videoswitchingsettype"><name>VideoSwitchingSetType</name>
<t>This is the type definition of a video Switching Set.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Switching Set. It MUST be unique within its Presentation.</td>
</tr>

<tr>
<td>tracks</td>
<td>VideoTrackType[]</td>
<td>Y</td>
<td>The collection of all Tracks belonging to this Switching Set.</td>
</tr>

<tr>
<td>alignId</td>
<td>string</td>
<td>N</td>
<td>A unique identifier that SHOULD be set by all Switching Sets that are Aligned Switching Sets with each other.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Switching Set. It is part of the content base URLs for all Tracks belonging to this Switching Set. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of all Tracks belonging to this Switching Set. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Switching Set. The pattern MUST include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>frameRate</td>
<td>ScaledValue</td>
<td>N</td>
<td>The frame rate of all video Tracks belonging to this Switching Set. If this is not defined, then every Track MUST set its frame rate separately.</td>
</tr>

<tr>
<td>initializationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Initialization Packets belonging to this Switching Set. The pattern MUST include the <tt>{initId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Switching Set.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by Segments of this Switching Set. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>

<tr>
<td>mimeType</td>
<td>string</td>
<td>N</td>
<td>The MIME type of all content belonging to this Switching Set. It MUST be a valid video MIME type. If not set, the MIME type of all content of this Switching Set SHALL equal <tt>video/mp4</tt>.</td>
</tr>

<tr>
<td>protection</td>
<td>SwitchingSet<br />
Protection</td>
<td>N</td>
<td>The information related to content protection for all Tracks belonging to this Switching Set.</td>
</tr>
</tbody>
</table></section>

<section anchor="metadataswitchingsettype"><name>MetadataSwitchingSetType</name>
<t>This is the type definition of a metadata Switching Set.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Switching Set. It MUST be unique within its Presentation.</td>
</tr>

<tr>
<td>mimeType</td>
<td>string</td>
<td>Y</td>
<td>The MIME type of all content belonging to this Switching Set.</td>
</tr>

<tr>
<td>tracks</td>
<td>MetadataTrackType[]</td>
<td>Y</td>
<td>The collection of all Tracks belonging to this Switching Set.</td>
</tr>

<tr>
<td>schemeId</td>
<td>string</td>
<td>Y</td>
<td>An identifier that denotes the type of metadata contained by this metadata Switching Set. Client behavior may differ depending on the vendor-specific identifier set here.</td>
</tr>

<tr>
<td>alignId</td>
<td>string</td>
<td>N</td>
<td>A unique identifier that SHOULD be set by all Switching Sets that are Aligned Switching Sets with each other.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Switching Set. It is part of the content base URLs for all Tracks belonging to this Switching Set. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of all Tracks belonging to this Switching Set. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>. If this Switching Set does not define it, then it must be defined on all its Tracks separately.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Switching Set. The pattern MUST include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined, it MUST be defined on all Tracks of this Switching Set separately.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Switching Set.</td>
</tr>

<tr>
<td>language</td>
<td>string</td>
<td>N</td>
<td>The language of all Tracks of the metadata Switching Set. It MUST be specified here by its ISO 639-2 <xref target="ISO6392"></xref> code.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by Segments of this Switching Set. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>
</tbody>
</table></section>

<section anchor="audiotracktype"><name>AudioTrackType</name>
<t>This is the type definition of an audio Track.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>bandwidth</td>
<td>uint</td>
<td>Y</td>
<td>The peak bitrate of this Track. It is denoted in bits per second. The measured bitrates of all Segments of the Track MUST NOT exceed this value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Track. It MUST be unique within its Switching Set.</td>
</tr>

<tr>
<td>segments</td>
<td>SegmentType[]</td>
<td>Y</td>
<td>The metadata of the Segments contained by this Track at the moment of Manifest creation. If a segment duration is not set, then each Segment MUST be announced by the Manifest before it is active. More information about Segment availability is given in <xref target="cont-segment-availability"></xref>.</td>
</tr>

<tr>
<td>startSegmentId</td>
<td>uint</td>
<td>N</td>
<td>The identifier of the first Segment in the Continuation Stream of this Track. If not defined, this value SHALL equal <tt>0</tt>.</td>
</tr>

<tr>
<td>startSequenceNumber</td>
<td>uint</td>
<td>N</td>
<td>The Sequence Number of the first Initialization Packet in this Track. It MUST denote the first Initialization Packet that was published in this Track. If not defined, this value SHALL equal <tt>0</tt>.</td>
</tr>

<tr>
<td>averageBandwidth</td>
<td>uint</td>
<td>N</td>
<td>The average bitrate of this Track, denoted in bits per second. It is expected that over a duration of 10 minutes, the average bitrate of this Track SHALL be within 5% of the given value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Track. It is part of the content base URLs used to request this Track's Initialization Packets and Continuation Segments. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>channels</td>
<td>integer</td>
<td>N</td>
<td>The audio channel configuration of this audio Track, defined as the total number of audio channels in the media data. This is used for ABR selection by the player.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of this Track. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Track. The pattern needs to include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Track.</td>
</tr>

<tr>
<td>initializationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Initialization Packets belonging to this media Track. The pattern needs to include the <tt>{initId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by segments of this Track. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>

<tr>
<td>sampleRate</td>
<td>uint</td>
<td>N</td>
<td>The sample rate (in Hz) of this audio Track. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>samplesPerFrame</td>
<td>uint</td>
<td>N</td>
<td>The number of audio samples in one frame of this audio Track. If neither the Track nor the Switching Set has this value set, it SHALL equal 1024.</td>
</tr>

<tr>
<td>segmentDuration</td>
<td>ScaledValue</td>
<td>N</td>
<td>The duration (in seconds) of each Segment contained by this Track. If this value is set, then every <tt>segmentDuration</tt> seconds, a new Segment SHOULD be available. More information about the availability is given in <xref target="cont-segment-availability"></xref>. If not set, then each Segment MUST be individually defined by the <tt>segments</tt> field.</td>
</tr>
</tbody>
</table><t>For any attribute that also exists on AudioSwitchingSetType, if both AudioSwitchingSetType and AudioTrackType have a value for it, then only the value set by AudioTrackType must be considered (except for the identifier and label).</t>
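<t>A minimal, non-normative Python sketch of this precedence rule is given below. It assumes that the Track and its Switching Set are available as dictionaries parsed from the JSON Manifest; the function name <tt>effective_attribute</tt> and its <tt>default</tt> parameter are hypothetical and only serve the illustration.</t>

<artwork><![CDATA[
```python
_MISSING = object()

# The identifier and label describe one object only and are never inherited.
NOT_INHERITED = {"id", "label"}


def effective_attribute(track, switching_set, name, default=_MISSING):
    """Return the value of `name` that applies to `track`.

    The value set on the Track wins; otherwise the value of the Switching
    Set is used; otherwise the caller-supplied default applies.
    """
    if name in track:
        return track[name]
    if name not in NOT_INHERITED and name in switching_set:
        return switching_set[name]
    if default is _MISSING:
        raise KeyError(f"{name} is set on neither the Track nor its Switching Set")
    return default


switching_set = {"id": "audio-main", "label": "English",
                 "codecs": "mp4a.40.2", "sampleRate": 48000}
track = {"id": "audio-128k", "bandwidth": 128000, "sampleRate": 44100}

sample_rate = effective_attribute(track, switching_set, "sampleRate")
codecs = effective_attribute(track, switching_set, "codecs")
samples_per_frame = effective_attribute(
    track, switching_set, "samplesPerFrame", default=1024)
```
]]>
</artwork>
<t>In this usage, the Track's own <tt>sampleRate</tt> is selected, <tt>codecs</tt> is taken from the Switching Set, and <tt>samplesPerFrame</tt> falls back to the default of 1024 given in the table above.</t>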
</section>

<section anchor="videotracktype"><name>VideoTrackType</name>
<t>This is the type definition of a video Track.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>bandwidth</td>
<td>uint</td>
<td>Y</td>
<td>The peak bitrate of this Track. It is denoted in bits per second. The measured bitrates of all Segments of the Track MUST NOT exceed this value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Track. It MUST be unique within its Switching Set.</td>
</tr>

<tr>
<td>resolution</td>
<td>Resolution</td>
<td>Y</td>
<td>The resolution of this video Track.</td>
</tr>

<tr>
<td>segments</td>
<td>SegmentType[]</td>
<td>Y</td>
<td>The metadata of the Segments contained by this Track at the moment of Manifest creation. If a segment duration is not set, then each Segment MUST be announced by the Manifest before it is active. More information about Segment availability is given in <xref target="cont-segment-availability"></xref>.</td>
</tr>

<tr>
<td>startSegmentId</td>
<td>uint</td>
<td>N</td>
<td>The identifier of the first Segment in the Continuation Stream of this Track. If not defined, this value SHALL equal <tt>0</tt>.</td>
</tr>

<tr>
<td>startSequenceNumber</td>
<td>uint</td>
<td>N</td>
<td>The Sequence Number of the first Initialization Packet in this Track. It MUST denote the first Initialization Packet that was published in this Track. If not defined, this value SHALL equal <tt>0</tt>.</td>
</tr>

<tr>
<td>averageBandwidth</td>
<td>uint</td>
<td>N</td>
<td>The average bitrate of this Track, denoted in bits per second. It is expected that over a duration of 10 minutes, the average bitrate of this Track SHALL be within 5% of the given value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Track. It is part of the content base URLs used to request this Track's Initialization Packets and Continuation Segments. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of this Track. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Track. The pattern needs to include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>frameRate</td>
<td>ScaledValue</td>
<td>N</td>
<td>The frame rate of this video Track. If it is not defined by the Switching Set, then it must be defined here.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Track.</td>
</tr>

<tr>
<td>initializationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Initialization Packets belonging to this media Track. The pattern needs to include the <tt>{initId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by segments of this Track. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>

<tr>
<td>segmentDuration</td>
<td>ScaledValue</td>
<td>N</td>
<td>The duration (in seconds) of each Segment contained by this Track. If this value is set, then every <tt>segmentDuration</tt> seconds, a new Segment SHOULD be available. More information about the availability is given in <xref target="cont-segment-availability"></xref>. If not set, then each Segment MUST be individually defined by the <tt>segments</tt> field.</td>
</tr>
</tbody>
</table><t>For any attribute that also exists on VideoSwitchingSetType, if both VideoSwitchingSetType and VideoTrackType have a value for it, then only the value set by VideoTrackType must be considered (except for the identifier and label).</t>
</section>

<section anchor="resolution"><name>Resolution</name>
<t>A Resolution contains the following elements:</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>width</td>
<td>uint</td>
<td>Y</td>
<td>The width of the picture.</td>
</tr>

<tr>
<td>height</td>
<td>uint</td>
<td>Y</td>
<td>The height of the picture.</td>
</tr>

<tr>
<td>sarWidth</td>
<td>uint</td>
<td>N</td>
<td>The width of the sample aspect ratio of the resolution. If it is not set, it SHALL equal 1.</td>
</tr>

<tr>
<td>sarHeight</td>
<td>uint</td>
<td>N</td>
<td>The height of the sample aspect ratio of the resolution. If it is not set, it SHALL equal 1.</td>
</tr>
</tbody>
</table><t>The display aspect ratio belonging to this Resolution can be calculated by using this sample aspect ratio width and height, and the picture width and height as follows:</t>

<artwork><![CDATA[dar = (darWidth, darHeight)
darWidth = sarWidth * width
darHeight = sarHeight * height
]]>
</artwork>
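<t>The calculation can be sketched in Python as follows. This is an illustrative sketch, not part of the protocol: the function name <tt>display_aspect_ratio</tt> is hypothetical, and reducing the ratio to lowest terms is an added convenience that this document does not require.</t>

<artwork><![CDATA[
```python
from math import gcd


def display_aspect_ratio(width, height, sar_width=1, sar_height=1):
    """Return the display aspect ratio as a reduced (darWidth, darHeight) pair.

    sar_width and sar_height default to 1, matching the defaults of the
    sarWidth and sarHeight attributes.
    """
    if min(width, height, sar_width, sar_height) <= 0:
        raise ValueError("all dimensions must be positive")
    dar_width = sar_width * width
    dar_height = sar_height * height
    # Added convenience: reduce the pair to lowest terms.
    divisor = gcd(dar_width, dar_height)
    return dar_width // divisor, dar_height // divisor
```
]]>
</artwork>
<t>For a picture of 1920 by 1080 with the default sample aspect ratio, this sketch returns <tt>(16, 9)</tt>.</t>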
</section>

<section anchor="manifest-metadata-track"><name>MetadataTrackType</name>
<t>This is the type definition of a metadata Track.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>The unique identifier for this Track. It MUST be unique within its Switching Set.</td>
</tr>

<tr>
<td>segments</td>
<td>SegmentType[]</td>
<td>Y</td>
<td>The metadata of the Segments that are contained by this Track at the moment of Manifest creation.<br />
 If a segment duration is not set, then each Segment MUST be announced by the Manifest before it is active. More information about Segment availability is given in <xref target="cont-segment-availability"></xref>.</td>
</tr>

<tr>
<td>startSegmentId</td>
<td>uint</td>
<td>N</td>
<td>The identifier of the first Segment in the Continuation Stream of this Track. If not defined, this value SHALL equal <tt>0</tt>.</td>
</tr>

<tr>
<td>averageBandwidth</td>
<td>uint</td>
<td>N</td>
<td>The average bitrate of this Track, denoted in bits per second. <br />
 It is expected that over a duration of 10 minutes, the average bitrate of this Track SHALL be within 5% of the given value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>bandwidth</td>
<td>uint</td>
<td>N</td>
<td>The peak bitrate of this Track. It is denoted in bits per second. The measured bitrates of all Segments of the Track MUST NOT exceed this value. This value is used to aid in ABR decisions by the player.</td>
</tr>

<tr>
<td>baseUrl</td>
<td>string</td>
<td>N</td>
<td>The base URL of this Track. It is part of the content base URLs used to request this Track's Continuation Segments. See <xref target="manifest-addressing-content"></xref> for more information.</td>
</tr>

<tr>
<td>codecs</td>
<td>string</td>
<td>N</td>
<td>The definition of the codec(s) necessary to render the content of this Track. It MUST follow the ISO File Format Name Space as defined by <xref target="RFC6381"></xref>.</td>
</tr>

<tr>
<td>continuationPattern</td>
<td>string</td>
<td>N</td>
<td>The URL pattern used to request Continuation Segments belonging to this Track. The pattern needs to include the <tt>{segmentId}</tt> string; see <xref target="manifest-addressing-content"></xref> for more information. If this is not defined here, then it MUST be defined by the Track's Switching Set.</td>
</tr>

<tr>
<td>label</td>
<td>string</td>
<td>N</td>
<td>A human-readable name for this Track.</td>
</tr>

<tr>
<td>mediaTimeOffset</td>
<td>ScaledValue</td>
<td>N</td>
<td>The offset to be added to Manifest Timestamps to calculate Media Timestamps contained by segments of this Track. If neither the Track nor the Switching Set has this value set, it SHALL equal 0.</td>
</tr>

<tr>
<td>segmentDuration</td>
<td>ScaledValue</td>
<td>N</td>
<td>The duration (in seconds) of each Segment contained by this Track. If this value is set, then every <tt>segmentDuration</tt> seconds, a new Segment SHOULD be available. More information about the availability is given in <xref target="cont-segment-availability"></xref>. If not set, then each Segment MUST be individually defined by the <tt>segments</tt> field.</td>
</tr>
</tbody>
</table></section>

<section anchor="segmenttype"><name>SegmentType</name>
<t>This is the type definition of a Segment.</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>id</td>
<td>uint</td>
<td>Y</td>
<td>The unique identifier for this Segment within the Track. This identifier MUST be incremented by 1 for every new Segment of the same Track.</td>
</tr>

<tr>
<td>timeBounds</td>
<td>TimeBounds</td>
<td>N</td>
<td>The time boundaries of this Segment. If this Segment's Track does not have a constant segment duration, then this value MUST be set, and the Segment's duration MUST be available at least 2 seconds before the actual end of that Segment.</td>
</tr>
</tbody>
</table></section>

<section anchor="manifest-presentation-event"><name>PresentationEventType</name>
<t>Timed metadata events can be embedded in the Manifest by using Presentation Events. Each PresentationEventType contains the following elements:</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Req?</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>data</td>
<td>string</td>
<td>Y</td>
<td>The event payload.</td>
</tr>

<tr>
<td>id</td>
<td>string</td>
<td>Y</td>
<td>A unique identifier for this event. It MUST be unique within this Presentation's events.</td>
</tr>

<tr>
<td>timeBounds</td>
<td>PresentationEventTimeBounds</td>
<td>Y</td>
<td>The time boundaries for the event, during which the event SHALL be active.</td>
</tr>

<tr>
<td>encoding</td>
<td>Enum(&quot;identity&quot;, &quot;base64&quot;, &quot;json&quot;)</td>
<td>N</td>
<td>The content encoding of the event data. <tt>identity</tt> signifies that the encoding is plaintext. <tt>base64</tt> signifies Base64 encoding. <tt>json</tt> signifies that the payload given by <tt>data</tt> is a valid JSON document. If this JSON document is not valid, this event MUST be ignored. If not set, the encoding SHALL default to <tt>identity</tt>.</td>
</tr>
</tbody>
</table></section>

<section anchor="presentationeventtimebounds"><name>PresentationEventTimeBounds</name>
<t>The time boundaries of a PresentationEventType are defined as follows:</t>
<table>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>startTimeOffset</td>
<td>uint</td>
<td>N</td>
<td>The scaled start time offset of the event, defined in seconds divided by the timescale. The actual start time of the event can be calculated by dividing this value by the scale and adding the resulting value to the start time of the Presentation. If it is not set, it SHALL equal 0.</td>
</tr>

<tr>
<td>duration</td>
<td>uint</td>
<td>N</td>
<td>The scaled duration of the event, defined in seconds divided by the timescale. If it is not set, it SHALL equal 0.</td>
</tr>

<tr>
<td>scale</td>
<td>uint</td>
<td>N</td>
<td>The timescale used for these time bounds. If it is not set, it SHALL be equal to 1.</td>
</tr>
</tbody>
</table></section>
</section>

<section anchor="manifest-requests"><name>Manifest requests</name>
<t>The Manifest SHOULD be available through an HTTP GET request. The URL of this Manifest can be personalized by the user.</t>
<table><name>Manifest Requests
</name>
<thead>
<tr>
<th>Request path</th>
<th>Method</th>
<th>Summary</th>
</tr>
</thead>

<tbody>
<tr>
<td>(chosen by the stream publisher)</td>
<td>GET</td>
<td>Retrieve the stream Manifest.</td>
</tr>
</tbody>
</table>
<section anchor="manifest-responses"><name>Manifest responses</name>
<table><name>Manifest Responses: Success
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status code</td>
<td>200</td>
</tr>

<tr>
<td>Content-Type</td>
<td>JSON (MIME type MUST equal <tt>application/vnd.theo.hesp+json</tt>)</td>
</tr>

<tr>
<td>Description</td>
<td>When the given Manifest exists on the server, the Manifest data is returned.</td>
</tr>
</tbody>
</table><table><name>Manifest Responses: Error
</name>
<thead>
<tr>
<th>Status code</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>404</td>
<td>The Manifest is not available for the given URL.</td>
</tr>
</tbody>
</table><t>If an error occurs, the player SHOULD attempt to retry the request. In case of consecutive unsuccessful requests, the media player SHOULD assume the content is unavailable and cease playback.</t>
</section>
</section>

<section anchor="manifest-addressing-content"><name>Addressing of content requests</name>
<t>Several elements of the Manifest contain a <tt>baseUrl</tt> attribute. These attributes are used to construct the URLs of media content (and metadata).</t>

<section anchor="content-request-url-resolution"><name>Content request URL resolution</name>
<t>A content request URL for a specific Track SHALL be constructed by applying relative resolution (as explained in Section 5.2 of RFC 3986 <xref target="RFC3986"></xref>) to each defined <tt>baseUrl</tt> or <tt>contentBaseUrl</tt> attribute relating to that Track, starting from the root of the Manifest. The URL of the Manifest SHALL be used as a base for the first resolution, after which the target URL of the previous resolution SHALL be used as the base URL of the next resolution.</t>
<t>This means that the content request URL for a Track's Initialization Stream is constructed as follows (in pseudocode):</t>

<artwork><![CDATA[T = manifestUrl
if isDefined(Manifest.contentBaseUrl):
      T = resolve(Manifest.contentBaseUrl, T)
if isDefined(Presentation.baseUrl):
      T = resolve(Presentation.baseUrl, T)
if isDefined(SwitchingSet.baseUrl):
      T = resolve(SwitchingSet.baseUrl, T)
if isDefined(Track.baseUrl):
      T = resolve(Track.baseUrl, T)
contentRequestURL = resolve(initializationPattern, T)
]]>
</artwork>
<t>where <tt>initializationPattern</tt> is the attribute given by either SwitchingSet or Track (with Track taking precedence if given by both) and <tt>resolve(R, B)</tt> is relative resolution, where R is a URI reference and B is a base URI.</t>
<t>The content request URL for the Track's Continuation Stream can be constructed in the same manner if the <tt>initializationPattern</tt> is replaced with the <tt>continuationPattern</tt>. Both URLs MUST be unique for each unique Track.</t>
<t>An example of content addressing is given in <xref target="use-case-content-addressing"></xref>.</t>
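<t>The pseudocode above maps closely onto Python's standard <tt>urllib.parse.urljoin</tt> function, which performs relative reference resolution. The sketch below is non-normative: the function name <tt>content_request_url</tt>, its parameters, and the example URLs are assumptions made for illustration. The caller is expected to pass the pattern that applies to the Track, that is, the Track's own pattern if defined and the Switching Set's pattern otherwise.</t>

<artwork><![CDATA[
```python
from urllib.parse import urljoin


def content_request_url(manifest_url, base_urls, pattern):
    """Build the content request URL for one Track.

    base_urls lists, in this order, Manifest.contentBaseUrl,
    Presentation.baseUrl, SwitchingSet.baseUrl and Track.baseUrl;
    an undefined attribute is passed as None and skipped.
    """
    target = manifest_url
    for base in base_urls:
        if base is not None:
            # The result of each resolution is the base of the next one.
            target = urljoin(target, base)
    return urljoin(target, pattern)


url = content_request_url(
    "https://example.com/hesp/manifest.json",
    [None, "presentation-1/", "video/", "720p/"],
    "init-{initId}.mp4",
)
```
]]>
</artwork>
<t>In this usage, <tt>url</tt> still contains the <tt>{initId}</tt> placeholder, which is substituted when the actual request is made.</t>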
</section>

<section anchor="requesting-using-an-identifier-and-the-content-request-url"><name>Requesting using an identifier and the content request URL</name>
<t>The content request URL MUST include the <tt>{initId}</tt> pattern for Initialization Stream URLs and the <tt>{segmentId}</tt> pattern for Continuation Stream URLs. These patterns are replaced in the actual content requests: <tt>{initId}</tt> is replaced with the requested Sequence Number of the Initialization Stream, and <tt>{segmentId}</tt> is replaced with the requested segment identifier of the Continuation Stream.</t>
<t>To add leading zeros to the identifier, the format suffix <tt>:0(n)d</tt> can be appended to <tt>initId</tt> or <tt>segmentId</tt> inside the pattern, where <tt>(n)</tt> is the minimum number of characters of the resulting identifier. An identifier that is already longer is not altered.</t>
<t>For example, for a Continuation Segment with identifier 100, the content request URL <tt>https://www.example.com/s-1/content-{segmentId:06d}.mp4</tt> resolves to <tt>https://www.example.com/s-1/content-000100.mp4</tt>, while the content request URL <tt>https://www.example.com/s-1/content-{segmentId:02d}.mp4</tt> resolves to <tt>https://www.example.com/s-1/content-100.mp4</tt>.</t>
<t>Further details about these requests are given in <xref target="cha-initialization"></xref> and <xref target="cha-continuation"></xref>.</t>
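<t>The placeholder substitution described in this section, including the optional <tt>:0(n)d</tt> padding, could be implemented along the following lines. This Python sketch is non-normative; the function name <tt>expand_pattern</tt> and the regular expression are assumptions chosen for illustration.</t>

<artwork><![CDATA[
```python
import re

# Matches {initId}, {segmentId} and their padded forms such as {segmentId:06d}.
_PLACEHOLDER = re.compile(r"\{(initId|segmentId)(?::0(\d+)d)?\}")


def expand_pattern(url, name, identifier):
    """Replace every `name` placeholder in `url` with `identifier`.

    `name` is "initId" or "segmentId"; `identifier` is a non-negative
    integer. Padding only adds leading zeros and never truncates.
    """
    def substitute(match):
        if match.group(1) != name:
            return match.group(0)  # leave the other placeholder untouched
        width = int(match.group(2) or 0)
        return str(identifier).zfill(width)

    return _PLACEHOLDER.sub(substitute, url)


padded = expand_pattern(
    "https://www.example.com/s-1/content-{segmentId:06d}.mp4", "segmentId", 100)
unpadded = expand_pattern(
    "https://www.example.com/s-1/content-{segmentId:02d}.mp4", "segmentId", 100)
```
]]>
</artwork>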
</section>
</section>

<section anchor="manifest-example"><name>Manifest example</name>
<t>A full Manifest example, together with information on content addressing and timing information, is given in <xref target="use-case-manifest"></xref>.</t>
</section>
</section>

<section anchor="cha-initialization"><name>Initialization Stream</name>
<t>The player uses the information in the HESP Manifest to fetch an Initialization Packet.</t>

<section anchor="initialization-stream-purpose"><name>Initialization Stream purpose</name>
<t>The Initialization Stream is a stream containing only independent samples. This stream is not regularly used by a single client nor completely streamed. The purpose of the Initialization Stream is to make available, upon request of the client, the separate independent frames packaged in an Initialization Packet. This way, a client has more control over the specific position of the stream where it wants to initiate playback. For example, it allows for more granularity when starting playback at the live edge, seeking a specific point in time, or switching to an alternative Track.</t>
<t>Additionally, the Initialization Packet contains information on the position of the following frame in the Continuation Stream to achieve regular media playback.</t>
</section>

<section anchor="initialization-packet-format"><name>Initialization Packet format</name>

<section anchor="video-initialization-packet"><name>Video Initialization Packet</name>
<t>A video Initialization Packet MUST contain:</t>

<ul>
<li><t>A CMAF header, i.e., all information required to initialize the media decoder, stored in ISO Base Media File Format <xref target="ISOBMFF"></xref> boxes as specified by CMAF <xref target="CMAF"></xref>. The minimal required contents are defined in <xref target="initialization-cmaf-header"></xref>.</t>
</li>
<li><t>An additional in-band metadata event containing extra HESP information, included in the aforementioned CMAF segment. The definition of such an event is denoted in <xref target="initialization-event-information"></xref>.</t>
</li>
<li><t>A CMAF segment, i.e., a group of one or more media samples, starting with at least one independent media sample, and timing information related to these samples, stored in ISOBMFF <xref target="ISOBMFF"></xref> boxes as specified by CMAF <xref target="CMAF"></xref>.</t>
</li>
</ul>
</section>

<section anchor="audio-initialization-packet"><name>Audio Initialization Packet</name>
<t>An audio Initialization Packet MUST contain:</t>

<ul>
<li><t>A CMAF header, i.e., all information required to initialize the media decoder, stored in ISO Base Media File Format <xref target="ISOBMFF"></xref> boxes as specified by CMAF <xref target="CMAF"></xref>. The minimal required contents are defined in <xref target="initialization-cmaf-header"></xref>.</t>
</li>
<li><t>An additional in-band metadata event containing extra HESP information, included in the aforementioned CMAF segment. The definition of such an event is given in <xref target="initialization-event-information"></xref>.</t>
</li>
</ul>
<t>It MAY also contain an independent (audio) media sample, as defined for the video Initialization Packet. However, this is strongly discouraged, as it increases storage costs for the provider without any significant advantage.</t>
</section>

<section anchor="constraint-on-media-information"><name>Constraint on media information</name>
<t>To avoid decoder anomalies, a HESP stream MUST satisfy the following constraint: when the media sample(s) contained in an Initialization Packet are concatenated with the samples contained in the corresponding Continuation Segment with the referenced <tt>index</tt>, starting at the referenced <tt>offset</tt>, the resulting bitstream should be compliant with the codec limitations.</t>
</section>

<section anchor="initialization-cmaf-header"><name>CMAF header</name>
<t>A CMAF header is defined in ISO/IEC 23000-19:2020 <xref target="CMAF"></xref> as: &quot;a sequence of CMAF constrained ISOBMFF boxes that do not reference any media samples (3.3.15), but are associated with a CMAF track (3.2.1) and necessary for the decoding of its CMAF fragments (3.1.1).&quot; As such, the header MUST NOT contain any samples.</t>
<t>This means that for the ISO Base Media File Format, at least the following boxes MUST be given:</t>

<ul>
<li><t>An <tt>ftyp</tt> box.</t>
</li>
<li><t>A <tt>moov</tt> box.</t>
</li>
<li><t>One or more tracks and accompanying boxes, though <tt>stbl</tt> and other related boxes cannot give information about any actual sample entries.</t>
</li>
</ul>
</section>

<section anchor="initialization-event-information"><name>Event message information</name>
<t>To pass information about the Continuation Stream, an in-band event in the form of an <tt>emsg</tt> box must be added to the Initialization Packet. The format of this box is defined in <xref target="metadata-event-init-data"></xref>. The <tt>message_data</tt> field MUST contain the first, and MAY contain the second, of the following two values (see <xref target="init-data-json-structure"></xref>):</t>

<ul>
<li><t>the Continuation Segment index (<tt>index</tt>), which defines the index of the Continuation Segment containing the next media sample.</t>
</li>
<li><t>the Continuation Segment byte offset (<tt>offset</tt>), which defines the position of the next media sample in the Continuation Segment. This field is OPTIONAL if the next media sample is located at the start of the segment, in which case a byte-range request is not needed.</t>
</li>
</ul>
<t>The <tt>message_data</tt> field of the <tt>emsg</tt> box could for example look like this:</t>
<t><tt>{&quot;index&quot;:1,&quot;offset&quot;:1234}</tt></t>
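<t>As a non-normative illustration, a client could read these two values as sketched below in Python. The function name <tt>parse_init_data</tt> is hypothetical. The sketch assumes that <tt>message_data</tt> has already been extracted from the <tt>emsg</tt> box and is UTF-8 encoded JSON, and it treats a missing <tt>offset</tt> as 0, i.e., the start of the Segment.</t>
<sourcecode type="python"><![CDATA[
```python
import json


def parse_init_data(message_data):
    """Return (index, offset) from the message_data of an initialization event.

    Hypothetical helper: accepts the payload as bytes or str.
    """
    if isinstance(message_data, bytes):
        message_data = message_data.decode("utf-8")  # assumption: UTF-8 JSON
    fields = json.loads(message_data)
    index = fields["index"]           # required: Continuation Segment index
    offset = fields.get("offset", 0)  # optional: byte offset, 0 when absent
    return index, offset
```
]]></sourcecode>
<t>An offset of 0 indicates that the Segment can be requested from its start, so a byte-range request is not needed.</t>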
</section>
</section>

<section anchor="initialization-stream-addressing"><name>Initialization Stream addressing</name>
<t><xref target="manifest-addressing-content"></xref> defines how to create the correct URLs and how to request Initialization Packets. The content request URL is used together with the desired Sequence Number to retrieve an Initialization Packet.</t>
<t>Additionally, <tt>{initId}</tt> can also be replaced with the string <tt>now</tt>. If this URL is requested, the most recently available Initialization Packet MUST be returned.</t>

<section anchor="initialization-stream-requests"><name>Initialization Stream requests</name>
<t>Initialization Packets MUST be available through a basic GET request. They MUST be retrievable in two ways:</t>
<table><name>Initialization Stream requests
</name>
<thead>
<tr>
<th>Request path</th>
<th>Method</th>
<th>Summary</th>
</tr>
</thead>

<tbody>
<tr>
<td>Initialization content request URL <br />
<tt>{initId}</tt> is replaced with the string <tt>now</tt></td>
<td>GET</td>
<td>Retrieve the most recent Initialization Packet from the given Track.</td>
</tr>

<tr>
<td>Initialization content request URL <br />
<tt>{initId}</tt> is replaced with a Sequence Number</td>
<td>GET</td>
<td>Retrieve the Initialization Packet with the given Sequence Number from the given Track.</td>
</tr>
</tbody>
</table></section>

<section anchor="initialization-stream-responses"><name>Initialization Stream responses</name>
<table><name>Initialization Stream responses: success
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status Code</td>
<td>200</td>
</tr>

<tr>
<td>Content-Type</td>
<td>MUST match the <tt>mimeType</tt> given by the Stream's Switching Set (see <xref target="cha-manifest"></xref> for more details)</td>
</tr>

<tr>
<td>Description</td>
<td>Return the requested Initialization Packet when it exists on the media server.</td>
</tr>
</tbody>
</table><table><name>Initialization Stream responses: error
</name>
<thead>
<tr>
<th>Status code</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>404</td>
<td>The Track does not exist on the media server, or the requested Initialization Packet does not exist on the media server.</td>
</tr>
</tbody>
</table></section>
</section>
</section>

<section anchor="cha-continuation"><name>Continuation Stream</name>
<t>The client uses the information given by the Manifest and the Initialization Packet to request the Continuation Stream. This low latency stream is used for the bulk of media playback of HESP. The Continuation Stream is an independently playable CMAF stream.</t>

<section anchor="continuation-stream-format"><name>Continuation Stream format</name>

<section anchor="media-content"><name>Media content</name>
<t>The Continuation Stream is packaged as a regular CMAF stream, albeit with some encoding constraints, depending on the chosen profile (see <xref target="app-profiles"></xref>).</t>
<t>Maximum Gain Profile</t>

<ul spacing="compact">
<li>Each sample MUST only contain references to the one sample preceding it.</li>
<li>Each CMAF chunk MUST contain at most one sample (for the lowest possible latency).</li>
<li>CMAF Segments can be long (durations of multiple minutes are possible).</li>
</ul>
<t>Compatibility Profile</t>

<ul spacing="compact">
<li>Regular sized CMAF Segments (recommendation: between 1 and 30 seconds)</li>
<li>Chunk sizes range between one sample and one sub-GOP</li>
<li>The following GOP structure MUST be followed: I B ... B P B ... B P B ... B P</li>
<li>The number of B frames MAY vary from 0 (same behavior as Maximum Gain Profile) to 4 (recommended maximum)</li>
<li>Each sub-GOP (B ... B P) MUST only reference one previous frame (allowing injection of keyframes)</li>
</ul>
</section>
</section>

<section anchor="cont-segment-availability"><name>Continuation Segment availability</name>
<t>Continuation Segments must together create a continuous stream of media data. There MUST NOT be gaps between the time boundaries of consecutive Continuation Segments.</t>
<t>There are two ways to signal Segment availability. In the first, the Track has a <tt>segmentDuration</tt> set in the Manifest, which represents the constant duration of each Segment. In this case, a client can derive on its own when to start requesting a new Segment. A new Segment SHOULD be published every n seconds, where n equals <tt>segmentDuration</tt>. This new Segment MUST become available within 100 milliseconds from that point. Drift is not allowed: if a Segment is only published at n+100ms, then the next Segment MUST be available at 2n+100ms at the latest.</t>
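<t>The timing rule for a constant <tt>segmentDuration</tt> can be illustrated with a small, non-normative calculation in Python. The names used here are hypothetical, and <tt>start_ms</tt> is an assumed reference point for the publication schedule that is not defined in this section. Times are integer milliseconds to avoid rounding errors.</t>
<sourcecode type="python"><![CDATA[
```python
AVAILABILITY_TOLERANCE_MS = 100  # a Segment must be available within 100 ms


def latest_availability_ms(start_ms, segment_duration_ms, k):
    """Latest time at which the Segment due at start_ms + k * n is available.

    Hypothetical helper: the tolerance is added to the nominal schedule,
    never to the previous Segment's actual publication time, so that
    delays do not accumulate (no drift).
    """
    nominal_ms = start_ms + k * segment_duration_ms
    return nominal_ms + AVAILABILITY_TOLERANCE_MS
```
]]></sourcecode>
<t>For a <tt>segmentDuration</tt> of 2 seconds and a schedule starting at 0, this gives 2100 ms for k = 1 and 4100 ms for k = 2, matching the n+100ms and 2n+100ms bounds.</t>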
<t>The duration of the last Segment of a Track MAY be shorter than this constant <tt>segmentDuration</tt>. As a new Manifest must be retrieved near the end of a Presentation, a client should be able to start requesting content of the new Presentation in a timely manner, regardless of the length of this last Segment.</t>
<t>The other option is for the Track to specify all its Segments outright in the <tt>segments</tt> field of the Manifest. As the Manifest does not get regularly retrieved, it is REQUIRED to signal a Manifest update (see <xref target="metadata-event-manifest-update"></xref>) at the end of each Segment. It is recommended only to use this option in the &quot;Maximum Gain Profile&quot;, as it can otherwise significantly increase the number of Manifest requests necessary.</t>
</section>

<section anchor="continuation-stream-addressing"><name>Continuation Stream addressing</name>
<t>To start playback of a Track, a client MUST request the Continuation Segments of the chosen Track(s) using the information given by the (already obtained) Manifest and Initialization Packet.</t>
<t>Requests for Continuation Segments should be made using HTTP GET requests. The first request, following an Initialization Packet, fetches the Segment indicated by the Initialization Packet. A byte-range header should be used to specify the range of data that needs to be requested. This information is also given by the Initialization Packet.</t>
<t>A client then automatically calculates the URL of the next Segment as it is indicated in the Manifest and requests the next Segment using a regular HTTP GET request.</t>

<section anchor="continuation-stream-urls"><name>Continuation Stream URLs</name>
<t><xref target="manifest-addressing-content"></xref> defines how to create the correct URLs and how to request a continuation segment. The content request URL is used together with the desired segment identifier to retrieve the Segment.</t>
</section>

<section anchor="continuation-stream-requests"><name>Continuation Stream requests</name>
<t>If a client needs to request a partial Continuation Segment, for example starting at the byte offset given by the metadata of an Initialization Packet, it MUST use one or more HTTP Range <xref target="RFC9110"></xref> Requests. The start and end of the range of each request MUST be defined.</t>
<t>Often, the total length of the Continuation Segment will not yet be known at the time of the request. If a client cannot define an accurate end value for the HTTP Range of a request, then 2^53 - 1 (9007199254740991) SHOULD be used.</t>
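<t>As a non-normative sketch in Python, the Range header for such a request could be built as follows. The function name <tt>continuation_range_header</tt> and its parameters are hypothetical. Only the <tt>bytes=first-last</tt> syntax of HTTP Range Requests and the fallback end value given above are taken from the specifications.</t>
<sourcecode type="python"><![CDATA[
```python
OPEN_ENDED_LAST_BYTE = 2**53 - 1  # 9007199254740991


def continuation_range_header(offset, last_byte=None):
    """Build the Range header for a partial Continuation Segment request.

    Hypothetical helper: `offset` is the first byte to fetch; `last_byte`
    is the last byte, or None when the Segment length is not yet known.
    """
    if offset < 0:
        raise ValueError("offset must not be negative")
    if last_byte is None:
        last_byte = OPEN_ENDED_LAST_BYTE  # end of the Segment is unknown
    if last_byte < offset:
        raise ValueError("last_byte must not precede offset")
    return {"Range": f"bytes={offset}-{last_byte}"}
```
]]></sourcecode>
<t>For the offset 1234 from the earlier initialization example, this produces <tt>Range: bytes=1234-9007199254740991</tt>.</t>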
<table><name>Continuation Stream requests
</name>
<thead>
<tr>
<th>Request path</th>
<th>Method</th>
<th>Summary</th>
</tr>
</thead>

<tbody>
<tr>
<td>Continuation content request URL <br />
<tt>{segmentId}</tt> is replaced with a segment identifier</td>
<td>GET</td>
<td>Retrieve the Continuation Segment with the requested identifier from the given Track.</td>
</tr>
</tbody>
</table></section>

<section anchor="continuation-stream-responses"><name>Continuation Stream responses</name>
<t>A distinction is made between how responses should be returned, depending on the version of the HTTP protocol used.</t>

<section anchor="http-1-1-successful-response"><name>HTTP/1.1 successful response</name>
<t>If an HTTP Range request is sent, then a 206 Partial Content response MUST be returned upon success. The response MUST use Chunked Transfer Coding <xref target="RFC9112"></xref> to ensure timely delivery of media data.</t>
<table><name>Continuation Stream responses: success (Range header is given)
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status Code</td>
<td>206</td>
</tr>

<tr>
<td>Content-Type</td>
<td>MUST match the <tt>mimeType</tt> given by the Stream's Switching Set (see <xref target="cha-manifest"></xref> for more details)</td>
</tr>

<tr>
<td>Transfer-Encoding</td>
<td>chunked</td>
</tr>

<tr>
<td>Description</td>
<td>The requested range of Segment data is returned from the server. Depending on the byte-range requested, the connection is kept open to retrieve live data.</td>
</tr>
</tbody>
</table><t>For requests without a Range header, a 200 OK response MUST be returned upon success.</t>
<table><name>Continuation Stream responses: success (Range header is not given)
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status Code</td>
<td>200</td>
</tr>

<tr>
<td>Content-Type</td>
<td>MUST match the <tt>mimeType</tt> given by the Stream's Switching Set (see <xref target="cha-manifest"></xref> for more details)</td>
</tr>

<tr>
<td>Transfer-Encoding</td>
<td>chunked</td>
</tr>

<tr>
<td>Description</td>
<td>The Segment data is returned from the server. Depending on the availability of the Segment, the connection is kept open to retrieve live data.</td>
</tr>
</tbody>
</table></section>

<section anchor="http-2-successful-response"><name>HTTP/2 successful response</name>
<t>HTTP/2 uses frame-based transmission and cannot use Chunked Transfer Coding. As such, this header is not given here.</t>
<table><name>Continuation Stream responses: success (Range header is given)
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status code</td>
<td>206</td>
</tr>

<tr>
<td>Content-Type</td>
<td>MUST match the <tt>mimeType</tt> given by the Stream's Switching Set (see <xref target="cha-manifest"></xref> for more details)</td>
</tr>

<tr>
<td>Description</td>
<td>The requested range of Segment data is returned from the server. Depending on the byte-range requested, the connection is kept open to retrieve live data.</td>
</tr>
</tbody>
</table><t>For requests without a Range header, a 200 OK response MUST be returned upon success.</t>
<table><name>Continuation Stream responses: success (Range header is not given)
</name>
<thead>
<tr>
<th>Success</th>
<th></th>
</tr>
</thead>

<tbody>
<tr>
<td>Status code</td>
<td>200</td>
</tr>

<tr>
<td>Content-Type</td>
<td>MUST match the <tt>mimeType</tt> given by the Stream's Switching Set (see <xref target="cha-manifest"></xref> for more details)</td>
</tr>

<tr>
<td>Description</td>
<td>The Segment data is returned from the server. Depending on the availability of the Segment, the connection is kept open to retrieve live data.</td>
</tr>
</tbody>
</table></section>

<section anchor="response-errors"><name>Response errors</name>
<table><name>Continuation Stream responses: error
</name>
<thead>
<tr>
<th>Status code</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>404</td>
<td>The Track does not exist on the media server, or the requested Segment does not exist on the media server.</td>
</tr>

<tr>
<td>416</td>
<td>The requested byte-range could not be fulfilled.</td>
</tr>
</tbody>
</table></section>
</section>
</section>
</section>

<section anchor="timed-metadata"><name>Timed metadata</name>
<t>HESP supports timed metadata through two methods: metadata Tracks for continuous, often segmented metadata and metadata events for sporadic metadata updates.</t>

<section anchor="metadata-tracks"><name>Metadata Tracks</name>
<t>Metadata Tracks function similarly to media Tracks, however, without the need for an Initialization Stream. The attributes for metadata Tracks in the Manifest can be found at <xref target="manifest-metadata-track"></xref>. The addressing happens similarly to the addressing of Continuation segments:
a <tt>continuationPattern</tt> is given for each metadata Track, either directly set on the Track or on the Switching Set of the Track. Each metadata Segment contains a numerical identifier that increments by one for each new Segment of the Track. This identifier, together with the addressing pattern, creates the URL used to retrieve the contents of the Segment.</t>
<t>The Manifest contains the duration of each metadata Segment, either stated individually per Segment or through the <tt>segmentDuration</tt> attribute set on the metadata Track. It is possible that chunked encoding is used here to ensure that the contents are delivered as soon as possible, but this is not a requirement. This can be useful for subtitles alongside live content, for example.</t>
</section>

<section anchor="metadata-events"><name>Metadata events</name>
<t>Metadata events can be used to signal information that does not become available at regular intervals. These events can either be transmitted in-band, in which case they MUST be added to the video or audio Continuation Streams, or out-of-band, in which case details about the metadata event MUST be available in the Manifest.</t>

<section anchor="in-band-events"><name>In-band events</name>
<t>In order to deliver events in-band, root-level Event Message (<tt>emsg</tt>) boxes MUST be added to ongoing media Continuation Streams. The definition of an <tt>emsg</tt> box can be found in the CMAF <xref target="CMAF"></xref> specification.</t>
<t>These boxes should be appended to each Track of a Presentation to ensure all viewers can receive such an event. It is recommended to use out-of-band events if the included data is significantly large.</t>
<t>The HESP specification defines a few in-band events that it leverages for client initialization and Manifest updates. The structure of these events is given below.</t>

<section anchor="metadata-event-init-data"><name>Initialization data</name>
<t>This event is appended to each Initialization Packet for the client to request the correct Continuation Segments. It is an <tt>emsg</tt> box with the following REQUIRED values:</t>
<table><name><tt>emsg</tt> box containing an initialization data event
</name>
<thead>
<tr>
<th>Attribute</th>
<th>Value</th>
</tr>
</thead>

<tbody>
<tr>
<td><tt>version</tt></td>
<td>0</td>
</tr>

<tr>
<td><tt>scheme_id_uri</tt></td>
<td>&quot;urn:theo:hesp:2020&quot;</td>
</tr>

<tr>
<td><tt>value</tt></td>
<td>&quot;initdata&quot;</td>
</tr>

<tr>
<td><tt>timescale</tt></td>
<td>MUST match the timescale of the Initialization Packet in case of video or MUST equal 1 otherwise.</td>
</tr>

<tr>
<td><tt>presentation_time_delta</tt></td>
<td>0</td>
</tr>

<tr>
<td><tt>event_duration</tt></td>
<td>MUST match the duration of the Initialization Packet in case of video or MUST equal 0 otherwise.</td>
</tr>

<tr>
<td><tt>id</tt></td>
<td>can be freely set (and MUST be ignored by the player)</td>
</tr>

<tr>
<td><tt>message_data</tt></td>
<td>MUST contain the data defined by <xref target="init-data-json-structure"></xref>, formatted as JSON</td>
</tr>
</tbody>
</table><table anchor="init-data-json-structure"><name><tt>message_data</tt> contents of an initialization data event </name>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>index</td>
<td>integer</td>
<td>Y</td>
<td>The Continuation Segment index (see <xref target="initialization-event-information"></xref>).</td>
</tr>

<tr>
<td>offset</td>
<td>integer</td>
<td>N</td>
<td>The Continuation Segment byte offset (see <xref target="initialization-event-information"></xref>). If not given, it SHALL be 0.</td>
</tr>
</tbody>
</table></section>

<section anchor="metadata-event-manifest-update"><name>Manifest update</name>
<t>This event is used to signal the client that a new Manifest must be retrieved. This can occur for many reasons, such as before a Presentation change, availability of new out-of-band metadata, etc. It is an <tt>emsg</tt> box with the following REQUIRED values:</t>
<table><name><tt>emsg</tt> box containing a Manifest update event
</name>
<thead>
<tr>
<th>Attribute</th>
<th>Value</th>
</tr>
</thead>

<tbody>
<tr>
<td><tt>version</tt></td>
<td>0</td>
</tr>

<tr>
<td><tt>scheme_id_uri</tt></td>
<td>&quot;urn:theo:hesp:2020&quot;</td>
</tr>

<tr>
<td><tt>value</tt></td>
<td>&quot;manifestupdate&quot;</td>
</tr>

<tr>
<td><tt>timescale</tt></td>
<td>1</td>
</tr>

<tr>
<td><tt>presentation_time_delta</tt></td>
<td>0</td>
</tr>

<tr>
<td><tt>event_duration</tt></td>
<td>0</td>
</tr>

<tr>
<td><tt>id</tt></td>
<td>MAY be freely set (and MUST be ignored by the player)</td>
</tr>

<tr>
<td><tt>message_data</tt></td>
<td>MUST contain the data defined by <xref target="manifest-update-json-structure"></xref>, formatted as JSON</td>
</tr>
</tbody>
</table><table anchor="manifest-update-json-structure"><name><tt>message_data</tt> contents of a Manifest update event </name>
<thead>
<tr>
<th>Attribute</th>
<th>Type</th>
<th>Required</th>
<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td>url</td>
<td>string</td>
<td>N</td>
<td>the URL of an alternative location of the Manifest</td>
</tr>
</tbody>
</table><t>If the event contains a URL, then the resulting Manifest request MUST be made to this address. Subsequent Manifest requests do not have this requirement.</t>
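<t>A client's choice of URL for the Manifest request triggered by this event could look like the following non-normative Python sketch. The function name <tt>next_manifest_url</tt> is hypothetical, and the sketch assumes that <tt>message_data</tt> is UTF-8 encoded JSON and that the <tt>url</tt> value can be used as given.</t>
<sourcecode type="python"><![CDATA[
```python
import json


def next_manifest_url(current_manifest_url, message_data):
    """Return the URL to use for the Manifest request triggered by the event.

    Hypothetical helper: only this one request uses the alternative
    location; later requests are not bound to it.
    """
    if isinstance(message_data, bytes):
        message_data = message_data.decode("utf-8")  # assumption: UTF-8 JSON
    fields = json.loads(message_data)
    # "url" is optional; without it the current Manifest URL is used.
    return fields.get("url", current_manifest_url)
```
]]></sourcecode>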
</section>
</section>

<section anchor="out-of-band-events"><name>Out-of-band events</name>
<t>Out-of-band events can be added to the Manifest through Presentation Events. The definition of a Presentation Event is given in <xref target="manifest-presentation-event"></xref>. The contents of the event data are not predefined. It is possible to include arbitrary data as Base64-encoded data, and plaintext data can be included as-is. If needed, the data can be a URL that needs to be resolved separately, but this is up to the publisher of this data.</t>
</section>
</section>
</section>

<section anchor="cha-content-protection"><name>Content protection</name>
<t>HESP has support for content protection. It allows DRM systems to be implemented through the common encryption standard.</t>

<section anchor="common-encryption-support"><name>Common encryption support</name>
<t>As HESP has the requirement for media to be structured as ISOBMFF <xref target="ISOBMFF"></xref>, common encryption <xref target="CENC"></xref> is to be used to encrypt that media content.</t>
<t>Common encryption specifies 4 protection schemes that may be used: <tt>cenc</tt>, <tt>cbc1</tt>, <tt>cens</tt> and <tt>cbcs</tt>. For HESP, either the AES-CBC subsample pattern encryption scheme (<tt>cbcs</tt>) or the AES-CTR full sample pattern encryption scheme (<tt>cenc</tt>) MUST be used.</t>
</section>

<section anchor="hesp-manifest"><name>HESP Manifest</name>
<t>The HESP Manifest has a <tt>SwitchingSetProtection</tt> structure to set up common encryption for audio and video Switching Sets. The definition of this structure can be found in Chapter 3. As a result of this being set on a Switching Set level:</t>

<ul spacing="compact">
<li>All Tracks belonging to such a Switching Set MUST be encrypted with the same content key. Aligned Switching Sets can be used to ensure that a client can still switch through Tracks of different Switching Sets.</li>
<li>Audio and video data SHOULD be encrypted with different content keys. This is a recommendation as both often have different encryption strength requirements.</li>
</ul>
<t>The <tt>SwitchingSetProtection</tt> structure MAY contain a <tt>ProtectionSystemSpecificHeaderBox</tt> (<tt>pssh</tt>), which can also be contained in the Initialization Stream. Note that this box MUST be given by at least one of these two options in order for a license request to be made. If both exist for a Switching Set, the <tt>pssh</tt> box from the Initialization Stream MUST be disregarded. This allows for more straightforward alterations of license information after a stream has been created or published.</t>
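<t>The precedence between the two <tt>pssh</tt> sources can be summarized by the following non-normative Python sketch. The function name <tt>select_pssh</tt> is hypothetical, and each argument is assumed to be the raw box data, or <tt>None</tt> when that source does not provide a box.</t>
<sourcecode type="python"><![CDATA[
```python
def select_pssh(manifest_pssh, init_stream_pssh):
    """Pick the pssh box to use for a license request.

    Hypothetical helper: the Manifest takes precedence; the box from the
    Initialization Stream is used only when the Manifest provides none.
    """
    if manifest_pssh is not None:
        return manifest_pssh
    if init_stream_pssh is not None:
        return init_stream_pssh
    raise ValueError("no pssh box available; a license request cannot be made")
```
]]></sourcecode>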
</section>

<section anchor="cmaf-box-structure"><name>CMAF box structure</name>
<t>The media of a protected stream needs to contain certain ISOBMFF boxes to be compliant with CENC (and CMAF). HESP requires the same boxes to be present in media streams. A brief overview of these boxes is given below. More information on this requirement can be found in the CENC <xref target="CENC"></xref> and CMAF <xref target="CMAF"></xref> specifications.</t>

<section anchor="initialization-stream"><name>Initialization Stream</name>
<t>Initialization Packets MUST contain the following boxes:</t>

<ul spacing="compact">
<li><tt>TrackEncryptionBox</tt> (<tt>tenc</tt>): This box contains default parameters regarding sample encryption.</li>
<li><tt>SchemeTypeBox</tt> (<tt>schm</tt>): This box identifies the protection scheme. <tt>scheme_type</tt> MUST be set to <tt>cbcs</tt> or <tt>cenc</tt>.</li>
<li><tt>ProtectionSystemSpecificHeaderBox</tt> (<tt>pssh</tt>): If the Manifest does not contain a <tt>pssh</tt> box that applies to this Track, then this box MUST be included in each Initialization Packet of each Track of the protected Switching Set. If the Manifest does contain a <tt>pssh</tt> box that applies to this Track, then this box MUST be disregarded.</li>
</ul>
<t>In order to signal that the Track is encrypted, the stream type MUST equal <tt>encv</tt> for video Tracks and <tt>enca</tt> for audio Tracks.</t>
<t>Additionally, video Initialization Packets contain a sample. If that sample is encrypted, then the same requirements for CMAF fragments of a Continuation Segment also apply here: a <tt>senc</tt> box MUST be included, <tt>pssh</tt>, <tt>saiz</tt> and <tt>saio</tt> boxes MAY be included. The following section contains more information about these requirements.</t>
</section>

<section anchor="continuation-stream"><name>Continuation Stream</name>
<t>CMAF fragments of a Continuation Segment MUST contain the following box:</t>

<ul spacing="compact">
<li><tt>SampleEncryptionBox</tt> (<tt>senc</tt>): This box is used to store initialization vector data and information on subsample encryption. As each chunk in HESP must currently contain at most one sample, <tt>sample_count</tt> SHALL always be 1.</li>
</ul>
<t>The following boxes MAY be included:</t>

<ul spacing="compact">
<li><tt>ProtectionSystemSpecificHeaderBox</tt> (<tt>pssh</tt>): If any updates need to be made to the underlying licensing system, a <tt>pssh</tt> box MAY be included.</li>
<li><tt>SampleAuxiliaryInformationSizesBox</tt> (<tt>saiz</tt>): This box is used to store the size of per-sample auxiliary information. It is only REQUIRED if such per-sample information exists.</li>
<li><tt>SampleAuxiliaryInformationOffsetsBox</tt> (<tt>saio</tt>): This box is used to store the offsets of per-sample auxiliary information. It is only REQUIRED if such per-sample information exists.</li>
</ul>
</section>
</section>
</section>

<section anchor="contributors"><name>Contributors</name>
<t>Significant contributions to the design of this protocol were made by Egon Okerman, Samie Beheydt, and Johan Vounckx.</t>
</section>

<section anchor="iana-considerations"><name>IANA Considerations</name>
<t>This memo requests that the following MIME type <xref target="RFC2046"></xref> be registered with the IANA:</t>
<blockquote><t>Type name: <tt>application</tt></t>
<t>Subtype name: <tt>vnd.theo.hesp+json</tt></t>
<t>Required parameters: (none)</t>
<t>Optional parameters: (none)</t>
<t>Encoding considerations: encoded as text.</t>
<t>Security considerations: See <xref target="cha-security-considerations"></xref>.</t>
<t>Compression: this media type does not employ compression.</t>
<t>Interoperability considerations: There are no byte-ordering issues, since files are 7- or 8-bit text. Applications could encounter unrecognized tags, which SHOULD be ignored.</t>
</blockquote></section>

<section anchor="cha-security-considerations"><name>Security Considerations</name>
<t>Since the protocol uses HTTP to transmit data, the regular HTTP security considerations apply. See Section 11 of RFC 9112 <xref target="RFC9112"></xref>.</t>
<t>Clients SHOULD take care when parsing files received from a server so that non-compliant files are rejected. Clients SHOULD range-check responses to prevent buffer overflows.  See also the Security Considerations section of RFC 3986 <xref target="RFC3986"></xref>. Clients SHOULD load resources identified by URI lazily to avoid contributing to denial-of-service attacks.</t>
<t>HTTP requests often include session state (&quot;cookies&quot;), which may contain private user data. Implementations MUST follow cookie restriction and expiry rules specified by RFC 6265 <xref target="RFC6265"></xref>. See also the Security Considerations section of RFC 6265, and RFC 2964 <xref target="RFC2964"></xref>.</t>
<t>Encryption keys are specified by URI. The delivery of these keys SHOULD be secured by a mechanism such as HTTP over TLS <xref target="RFC8446"></xref> (formerly SSL) in conjunction with a secure realm or a session cookie.</t>
</section>

</middle>

<back>
<references><name>References</name>
<references><name>Normative References</name>
<reference anchor="CENC" target="https://www.iso.org/standard/68042.html">
  <front>
    <title>Information technology - MPEG systems technologies - Part 7: Common encryption in ISO base media file format files</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2020" month="December"></date>
  </front>
</reference>
<reference anchor="CMAF" target="https://www.iso.org/standard/79106.html">
  <front>
    <title>Information technology - Multimedia application format (MPEG-A) - Part 19: Common media application format (CMAF) for segmented media</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2020" month="March"></date>
  </front>
</reference>
<reference anchor="DASH" target="https://www.iso.org/standard/79329.html">
  <front>
    <title>Information technology — Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2019" month="December"></date>
  </front>
</reference>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml3/reference.I-D.pantos-hls-rfc8216bis.xml"/>
<reference anchor="ISO6392" target="https://www.iso.org/standard/4767.html">
  <front>
    <title>Codes for the representation of names of languages - Part 2: Alpha-3 code</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="1998" month="November"></date>
  </front>
</reference>
<reference anchor="ISO8601" target="https://www.iso.org/standard/70907.html">
  <front>
    <title>Date and time - Representations for information interchange</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2019" month="February"></date>
  </front>
</reference>
<reference anchor="ISOBMFF" target="https://www.iso.org/standard/74428.html">
  <front>
    <title>Information technology - Coding of audio-visual objects - Part 12: ISO base media file format</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2020" month="December"></date>
  </front>
</reference>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2964.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3986.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6265.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6381.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8259.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8446.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9110.xml"/>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9112.xml"/>
<reference anchor="UUID" target="https://www.iso.org/standard/62795.html">
  <front>
    <title>Information technology - Procedures for the operation of object identifier registration authorities - Part 8: Generation of universally unique identifiers (UUIDs) and their use in object identifiers</title>
    <author>
      <organization>International Organization for Standardization</organization>
    </author>
    <date year="2014" month="August"></date>
  </front>
</reference>
</references>
<references><name>Informative References</name>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2046.xml"/>
</references>
</references>

<section anchor="example-usage"><name>Example usage</name>

<section anchor="use-case-manifest"><name>Manifest</name>
<t>For the initial step, the client retrieves a Manifest.</t>

<section anchor="retrieving-the-manifest"><name>Retrieving the Manifest</name>
<t>The URL of the Manifest is provided to the client out of band. The client retrieves the Manifest with a GET request. Suppose the Manifest is available at <tt>https://example.com/stream1/manifest.json</tt>. The client then sends the following request:</t>

<artwork><![CDATA[GET /stream1/manifest.json HTTP/1.1
Host: example.com
Accept: application/vnd.theo.hesp+json
]]>
</artwork>
<t>The server responds with the following headers:</t>

<artwork><![CDATA[HTTP/1.1 200 OK
Content-Type: application/vnd.theo.hesp+json; charset=utf-8
Content-Length: 6867
Date: Wed, 31 Mar 2021 08:00:00 GMT
]]>
</artwork>
<t>and the following body:</t>

<sourcecode type="json"><![CDATA[{
   "activePresentation":"1",
   "availabilityDuration":{
      "value":2400
   },
   "creationDate":"2021-03-31T08:00:00.000Z",
   "fallbackPollRate":300,
   "manifestVersion":"2.0.0",
   "streamType":"live",
   "currentTime":{
            "value":1134000000,
            "scale":90000
   },
   "presentations":[
      {
         "id":"0",
         "timeBounds":{
            "startTime":0,
            "endTime":972000000,
            "scale":90000
         },
         "audio":[
            {
               "id":"main-audio",
               "language":"eng",
               "baseUrl":"audio/",
               "channels":2,
               "codecs":"mp4a.40.2",
               "continuationPattern":"content-{segmentId}.mp4",
               "initializationPattern":"init-{initId}.mp4",
               "sampleRate":48000,
               "tracks":[
                  {
                     "id":"96kbps",
                     "averageBandwidth":96000,
                     "bandwidth":96000,
                     "baseUrl":"96k/",
                     "segmentDuration":{
                        "value":540000,
                        "scale":90000
                     },
                     "segments":[
                        {
                           "id":1799,
                           "timeBounds":{
                              "startTime":971460000,
                              "scale":90000
                           }
                        }
                     ]
                  }
               ]
            }
         ],
         "video":[
            {
               "id":"main-video",
               "baseUrl":"video/",
               "frameRate":{
                  "value":25
               },
               "continuationPattern":"content-{segmentId}.mp4",
               "initializationPattern":"init-{initId}.mp4",
               "tracks":[
                  {
                     "id":"720p",
                     "bandwidth":3000000,
                     "baseUrl":"720p/",
                     "codecs":"avc1.4d001f",
                     "resolution":{
                        "width":1280,
                        "height":720
                     },
                     "segmentDuration":{
                        "value":540000,
                        "scale":90000
                     },
                     "segments":[
                        {
                           "id":1799,
                           "timeBounds":{
                              "startTime":971460000,
                              "scale":90000
                           }
                        }
                     ]
                  }
               ]
            }
         ]
      },
      {
         "id":"1",
         "timeBounds":{
            "startTime":972000000,
            "scale":90000
         },
         "baseUrl":"https://otherexample.com/s2/",
         "audio":[
            {
               "id":"main-audio",
               "language":"eng",
               "baseUrl":"audio/",
               "channels":2,
               "codecs":"mp4a.40.2",
               "sampleRate":48000,
               "mediaTimeOffset":{
                  "value":-972000000,
                  "scale":90000
               },
               "tracks":[
                  {
                     "id":"128kbps",
                     "averageBandwidth":128000,
                     "bandwidth":128000,
                     "continuationPattern":
                        "128k-content-{segmentId}.mp4",
                     "initializationPattern":
                        "128k-init-{initId}.mp4",
                     "segmentDuration":{
                        "value":540000,
                        "scale":90000
                     },
                     "segments":[
                        {
                           "id":300,
                           "timeBounds":{
                              "startTime":1134000000,
                              "scale":90000
                           }
                        }
                     ]
                  }
               ]
            }
         ],
         "video":[
            {
               "id":"main-video",
               "baseUrl":"video/",
               "frameRate":{
                  "value":25
               },
               "mediaTimeOffset":{
                  "value":-972000000,
                  "scale":90000
               },
               "tracks":[
                  {
                     "id":"720p",
                     "bandwidth":3000000,
                     "codecs":"avc1.4d001f",
                     "continuationPattern":
                        "720p-content-{segmentId}.mp4",
                     "initializationPattern":
                        "720p-init-{initId}.mp4",
                     "resolution":{
                        "width":1280,
                        "height":720
                     },
                     "segmentDuration":{
                        "value":540000,
                        "scale":90000
                     },
                     "segments":[
                        {
                           "id":300,
                           "timeBounds":{
                              "startTime":1134000000,
                              "scale":90000
                           }
                        }
                     ]
                  },
                  {
                     "id":"1080p",
                     "bandwidth":5000000,
                     "codecs":"avc1.4d001f",
                     "continuationPattern":
                        "1080p-content-{segmentId}.mp4",
                     "initializationPattern":
                        "1080p-init-{initId}.mp4",
                     "resolution":{
                        "width":1920,
                        "height":1080
                     },
                     "segmentDuration":{
                        "value":540000,
                        "scale":90000
                     },
                     "segments":[
                        {
                           "id":300,
                           "timeBounds":{
                              "startTime":1134000000,
                              "scale":90000
                           }
                        }
                     ]
                  }
               ]
            }
         ]
      }
   ]
}
]]>
</sourcecode>
</section>

<section anchor="use-case-timing-information"><name>Timing information</name>
<t>In the above Manifest, we have two Presentations:</t>

<ul spacing="compact">
<li>The first Presentation with ID &quot;0&quot; is not active at the current time, though it was active previously. It contains one audio Track and one video Track.
It is 3 hours long in total and starts at 00:00:00.000 (Manifest Timestamp). No Track has a <tt>mediaTimeOffset</tt> defined, so the Manifest Timestamps match the Media Timestamps.</li>
<li>The second Presentation with ID &quot;1&quot; is active at the current time. It contains one audio Track and two video Tracks. Its total duration is yet to be defined.
Its start time is 03:00:00.000 in Manifest Time and its current time (at the moment the Manifest was retrieved) is 03:30:00.000. All Tracks have the same <tt>mediaTimeOffset</tt> defined. The Media Timestamps for this Presentation start at 00:00:00.000. That also means that for all the currently active Segments, the media data will contain a starting timestamp of 00:30:00.000.</li>
</ul>
<t>The <tt>availabilityDuration</tt> of the Manifest is 40 minutes. As only 30 minutes of the second Presentation have elapsed, the Segments covering the last 10 minutes of the first Presentation are still available. Therefore, the first Presentation must still be included in the Manifest. Once another 10 minutes have passed, the first Presentation can be left out of the Manifest.</t>
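<t>As a non-normative illustration, the timing arithmetic of this example can be sketched in Python. The helper names <tt>seconds</tt> and <tt>clock</tt> are hypothetical. The sketch assumes that a time without a <tt>scale</tt> member is expressed in seconds, and that a Media Timestamp equals the Manifest Timestamp plus the <tt>mediaTimeOffset</tt>, which is consistent with the values quoted above.</t>

<sourcecode type="python"><![CDATA[
from fractions import Fraction

def seconds(value, scale=1):
    """Convert a rational Manifest time (value / scale) to seconds."""
    return Fraction(value, scale)

def clock(t):
    """Format a non-negative number of seconds as HH:MM:SS.mmm."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

# Values copied from the example Manifest.
current_time = seconds(1134000000, 90000)
p0_end = seconds(972000000, 90000)
offset_p1 = seconds(-972000000, 90000)
# Assumption: a missing "scale" member means the value is in seconds.
availability = seconds(2400)

# Assumption: Media Timestamp = Manifest Timestamp + mediaTimeOffset.
media_now = current_time + offset_p1

# Time until the end of Presentation "0" drops out of the window
# [current_time - availability, current_time].
remaining = p0_end - (current_time - availability)

print(clock(current_time))   # 03:30:00.000
print(clock(media_now))      # 00:30:00.000
print(remaining / 60)        # 10
]]>
</sourcecode>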
</section>

<section anchor="use-case-content-addressing"><name>Content addressing</name>
<t>The client uses the Manifest to derive the content request URLs of each Track.</t>
<t>For the audio Track of the first Presentation, the following parts are found:</t>

<artwork><![CDATA[AudioSwitchingSet.baseUrl = "audio/"
AudioTrack.baseUrl = "96k/"
AudioSwitchingSet.initializationPattern = "init-{initId}.mp4"
AudioSwitchingSet.continuationPattern = "content-{segmentId}.mp4"
]]>
</artwork>
<t>The Manifest URL (<tt>https://example.com/stream1/manifest.json</tt>) is used as the base for resolution, and then relative resolution is applied to each of the parts above.</t>

<artwork><![CDATA[T = resolve(AudioSwitchingSet.baseUrl, manifestUrl) 
  = "https://example.com/stream1/audio/"
T = resolve(AudioTrack.baseUrl, T)
  = "https://example.com/stream1/audio/96k/"
initBaseUrl = resolve(AudioSwitchingSet.initializationPattern, T)
  = "https://example.com/stream1/audio/96k/init-{initId}.mp4"
contBaseUrl = resolve(AudioSwitchingSet.continuationPattern, T)
  = "https://example.com/stream1/audio/96k/content-{segmentId}.mp4"
]]>
</artwork>
<t>The final content request URLs are <tt>https://example.com/stream1/audio/96k/init-{initId}.mp4</tt> and <tt>https://example.com/stream1/audio/96k/content-{segmentId}.mp4</tt> for the Initialization Stream and Continuation Stream respectively.</t>
<t>For the audio Track of the second Presentation, the following parts are found:</t>

<artwork><![CDATA[Presentation.baseUrl = "https://otherexample.com/s2/"
AudioSwitchingSet.baseUrl = "audio/"
AudioTrack.initializationPattern = "128k-init-{initId}.mp4"
AudioTrack.continuationPattern = "128k-content-{segmentId}.mp4"
]]>
</artwork>
<t>The Manifest URL (<tt>https://example.com/stream1/manifest.json</tt>) is used as the base for resolution, and then relative resolution is applied to each of the parts above.</t>

<artwork><![CDATA[T = resolve(Presentation.baseUrl, manifestUrl)
  = "https://otherexample.com/s2/"
T = resolve(AudioSwitchingSet.baseUrl, T)
  = "https://otherexample.com/s2/audio/"
initBaseUrl = resolve(AudioTrack.initializationPattern, T)
  = "https://otherexample.com/s2/audio/128k-init-{initId}.mp4"
contBaseUrl = resolve(AudioTrack.continuationPattern, T)
  = "https://otherexample.com/s2/audio/128k-content-{segmentId}.mp4"
]]>
</artwork>
<t>The final content request URLs are <tt>https://otherexample.com/s2/audio/128k-init-{initId}.mp4</tt> and <tt>https://otherexample.com/s2/audio/128k-content-{segmentId}.mp4</tt> for the Initialization Stream and Continuation Stream respectively.</t>
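<t>As a non-normative illustration, both resolution chains can be reproduced with <tt>urllib.parse.urljoin</tt> from the Python standard library. The helper <tt>resolve_chain</tt> is hypothetical and only shown for the Initialization Stream patterns. Members that are absent from the Manifest are passed as <tt>None</tt>.</t>

<sourcecode type="python"><![CDATA[
from urllib.parse import urljoin

MANIFEST_URL = "https://example.com/stream1/manifest.json"

def resolve_chain(base, *parts):
    """Resolve each relative reference against the running base URL.

    Members that are absent from the Manifest are passed as None and
    skipped.
    """
    for part in parts:
        if part is not None:
            base = urljoin(base, part)
    return base

# Order: Presentation.baseUrl, SwitchingSet.baseUrl, Track.baseUrl,
# then the pattern that applies to the Track.
first = resolve_chain(MANIFEST_URL, None, "audio/", "96k/",
                      "init-{initId}.mp4")
second = resolve_chain(MANIFEST_URL, "https://otherexample.com/s2/",
                       "audio/", None, "128k-init-{initId}.mp4")

print(first)
print(second)
]]>
</sourcecode>
<t>The absolute <tt>baseUrl</tt> of the second Presentation replaces the Manifest URL entirely, which is why the second result no longer contains <tt>example.com/stream1</tt>.</t>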
</section>
</section>

<section anchor="initialization-stream-1"><name>Initialization Stream</name>
<t>In the second step, the client uses the Manifest to retrieve and then parse an Initialization Packet.</t>

<section anchor="retrieving-initialization-packets"><name>Retrieving Initialization Packets</name>
<t>The client decides to retrieve the only audio Track of the active Presentation. The URL pattern for the Initialization Stream of this Track is <tt>https://otherexample.com/s2/audio/128k-init-{initId}.mp4</tt>.</t>
<t>The client opts to retrieve the most recent Initialization Packet and sends out a request:</t>

<artwork><![CDATA[GET /s2/audio/128k-init-now.mp4 HTTP/1.1
Host: otherexample.com
Accept: */*
]]>
</artwork>
<t>The server responds with the following headers:</t>

<artwork><![CDATA[HTTP/1.1 200 OK
Content-Type: audio/mp4
Content-Length: 742
Date: Wed, 31 Mar 2021 08:00:02 GMT
]]>
</artwork>
<t>and a binary body containing the ISOBMFF media data of the audio Initialization Packet.</t>
<t>The client repeats this to retrieve one of the video Tracks.</t>
</section>

<section anchor="parsing-offset-information"><name>Parsing offset information</name>
<t>The client parses the ISOBMFF boxes in the audio Initialization Packet. Once the <tt>emsg</tt> box with scheme ID <tt>urn:theo:hesp:2020</tt> and value <tt>initdata</tt> is found, its <tt>message_data</tt> field is parsed.</t>
<t>In this case, <tt>message_data</tt> equals <tt>{&quot;index&quot;:200,&quot;offset&quot;:63275}</tt>. The following media data for the chosen audio Track is available in the Continuation Segment with ID 200 and at a byte offset of 63275.</t>
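<t>As a non-normative illustration, the following Python sketch turns this <tt>message_data</tt> payload into a Continuation Segment URL and a byte offset. The function name <tt>locate_continuation</tt> is hypothetical. The sketch assumes that the payload is UTF-8 encoded JSON and that <tt>index</tt> is substituted for <tt>{segmentId}</tt> in the Continuation Stream pattern of the chosen Track.</t>

<sourcecode type="python"><![CDATA[
import json

# message_data payload from the example (assumed UTF-8 JSON).
message_data = b'{"index":200,"offset":63275}'

# Continuation Stream URL pattern resolved from the Manifest.
CONT_PATTERN = ("https://otherexample.com/s2/audio/"
                "128k-content-{segmentId}.mp4")

def locate_continuation(payload, pattern):
    """Return (url, byte_offset) of the media data after the packet."""
    info = json.loads(payload)
    url = pattern.replace("{segmentId}", str(info["index"]))
    return url, info["offset"]

url, offset = locate_continuation(message_data, CONT_PATTERN)
print(url)
print(offset)
]]>
</sourcecode>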
</section>
</section>

<section anchor="continuation-stream-1"><name>Continuation Stream</name>
<t>In the final step, the client uses the retrieved information to retrieve Continuation Segments and reaches playback.</t>

<section anchor="retrieving-continuation-segments"><name>Retrieving Continuation Segments</name>
<t>The client retrieves the Continuation Segment of the chosen audio Track. The URL pattern for the Continuation Stream of this Track is <tt>https://otherexample.com/s2/audio/128k-content-{segmentId}.mp4</tt>.</t>
<t>The client sends out a request for the segment with ID 200, starting at offset 63275:</t>

<artwork><![CDATA[GET /s2/audio/128k-content-200.mp4 HTTP/1.1
Host: otherexample.com
Accept: */*
Range: bytes=63275-9007199254740991
]]>
</artwork>
<t>As discussed in <xref target="continuation-stream-requests"></xref>, a sufficiently large end byte value is chosen to ensure the entire range is retrieved.</t>
<t>The server responds with the following headers:</t>

<artwork><![CDATA[HTTP/1.1 206 Partial Content
Content-Type: audio/mp4
Content-Range: bytes 63275-9007199254740991/9007199254677716
Transfer-Encoding: chunked
Date: Wed, 31 Mar 2021 08:00:03 GMT
]]>
</artwork>
<t>and a (chunked) binary body containing the ISOBMFF media data of the Continuation Segment.</t>
<t>The client repeats this for the chosen video Track and uses the retrieved media data to reach playback. The Manifest that was retrieved previously contains sufficient information to retrieve new Continuation Segments from this point on.</t>
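<t>As a non-normative illustration, the request for the Continuation Segment can be assembled as follows. The function name <tt>continuation_request</tt> is hypothetical. The end byte used in this example, 9007199254740991, equals 2^53 - 1. The sketch derives the request target and <tt>Host</tt> from the resolved Continuation Stream URL.</t>

<sourcecode type="python"><![CDATA[
from urllib.parse import urlsplit

# 9007199254740991: the "sufficiently large" end byte of the example.
MAX_END = 2**53 - 1

def continuation_request(url, offset, end=MAX_END):
    """Build the HTTP/1.1 request head for an open-ended byte range."""
    parts = urlsplit(url)
    return "\r\n".join([
        f"GET {parts.path} HTTP/1.1",
        f"Host: {parts.netloc}",
        "Accept: */*",
        f"Range: bytes={offset}-{end}",
        "",  # blank line terminating the header section
        "",
    ])

request = continuation_request(
    "https://otherexample.com/s2/audio/128k-content-200.mp4", 63275)
print(request)
]]>
</sourcecode>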
</section>
</section>
</section>

<section anchor="cdns"><name>CDNs</name>
<t>A Content Delivery Network (CDN) MAY be employed to increase the scalability and cacheability for delivering HESP.</t>
<t>While the HESP protocol uses HTTP/1.1 and delivery should be possible over most HTTP CDNs, care must be taken to ensure that the CDN has all the required features.</t>
<t>To correctly handle HESP, the CDN MUST either support HTTP/1.1 with chunked transfer encoding or support HTTP/2. It MUST also support Range Requests as well as the caching of the partial object responses. This ensures that HESP requests and responses pass correctly through the CDN and that responses can be cached for future use.</t>
<t>The CDN SHOULD support collapsing multiple HTTP/1.1 Range Requests with overlapping byte-ranges into a single request. This ensures that two requests with byte-ranges that are partially overlapping require only a single request to the media server for the overlapping part. This reduces the number of concurrent requests arriving on the media server.</t>
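<t>As a non-normative illustration of the range arithmetic behind such collapsing, the following Python sketch merges overlapping byte ranges into the smallest set of upstream ranges. The function name <tt>collapse_ranges</tt> is hypothetical. How a CDN schedules and shares the resulting upstream requests is implementation specific and not shown.</t>

<sourcecode type="python"><![CDATA[
def collapse_ranges(ranges):
    """Merge overlapping inclusive (first, last) byte ranges."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1]:
            # Overlaps the previous range: extend it if needed.
            prev_first, prev_last = merged[-1]
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged

# Two partially overlapping client ranges and one disjoint range.
print(collapse_ranges([(500, 1499), (0, 999), (4000, 4999)]))
]]>
</sourcecode>
<t>In this example, the two overlapping client ranges result in a single upstream range of bytes 0 to 1499, while the disjoint range is left unchanged.</t>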
</section>

<section anchor="app-profiles"><name>HESP Profiles (using H.264 as video codec)</name>
<t>In this annex, we describe two possible profiles for video Tracks of HESP streams. Both profiles place certain requirements on the underlying H.264 encoding of the HESP stream.</t>

<section anchor="maximal-gain-profile"><name>Maximal Gain Profile</name>
<t>The goal of this profile is to allow the stream to reach the lowest latency and zapping times possible using the HESP protocol. In this profile, it should be ensured that an Initialization Packet exists for each frame of the Continuation Stream. As a result, a client can start playback, seek and switch between Tracks at any time position of the stream.</t>
<t>The Initialization Stream for Tracks of the Maximal Gain Profile must satisfy the following requirements:</t>

<ul spacing="compact">
<li>The frame rate of the Initialization Stream MUST match the Continuation Stream.</li>
<li>Each media sample of the Initialization Stream MUST be independent (i.e., an I frame in H.264) and individually addressable (the latter is currently always true for HESP).</li>
</ul>
<t>The Continuation Stream for Tracks of the Maximal Gain Profile must satisfy the following requirements:</t>

<ul spacing="compact">
<li>Each media sample of the Continuation Stream MUST be either independent (i.e., an I frame in H.264) or dependent on the media sample directly preceding it in decode order (i.e., a P frame referencing only the previous frame).</li>
<li>Each CMAF Fragment of the Continuation Segment MUST only contain one media sample.</li>
<li>Each Continuation Segment SHOULD be long (durations of multiple minutes are possible and encouraged).</li>
</ul>
<figure><name>Maximal Gain Profile
</name>
<artwork><![CDATA[               +----+----+----+----+----+----+----+----+----+----+
time positions | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
               +----+----+----+----+----+----+----+----+----+----+

               +----+----+----+----+----+----+----+----+----+----+
initialization | IDR| IDR| IDR| IDR| IDR| IDR| IDR| IDR| IDR| IDR|
stream         | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
               +----+----+----+----+----+----+----+----+----+----+
               | REF| REF| REF| REF| REF| REF| REF| REF| REF| REF|
               |  a |  b |  c |  d |  e |  f |  g |  h |  i |  j |
               +----+----+----+----+----+----+----+----+----+----+

               +----+----+----+----+----+----+----+----+----+----+
continuation   | I,P| I,P| I,P| I,P| I,P| I,P| I,P| I,P| I,P| I,P|
stream         | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
               +----+----+----+----+----+----+----+----+----+----+
                    |    |    |
                    |    |     \___ position c
                    |    |
                    |     \___ position b
                    |
                     \___ position a
]]>
</artwork>
</figure>
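<t>As a non-normative illustration, the dependency requirement on the Continuation Stream can be checked as follows. The representation is hypothetical: each sample is described by the set of decode indices it references, and an I frame has an empty set.</t>

<sourcecode type="python"><![CDATA[
def violates_maximal_gain(samples):
    """Return decode indices of samples that break the rule.

    `samples` is a list in decode order. Each entry is the set of
    decode indices referenced by that sample (empty for an I frame).
    A sample is acceptable if it references nothing, or only the
    sample directly preceding it.
    """
    return [
        i for i, refs in enumerate(samples)
        if not set(refs) <= {i - 1}
    ]

# I, P, P, I, P: every P frame references only its predecessor.
print(violates_maximal_gain([set(), {0}, {1}, set(), {3}]))
# The third sample also references sample 0, which is not allowed.
print(violates_maximal_gain([set(), {0}, {0, 1}]))
]]>
</sourcecode>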
</section>

<section anchor="compatibility-profile"><name>Compatibility Profile</name>
<t>The goal of this profile is to allow media data from other HTTP-based adaptive bitrate protocols to be reused. This comes at the cost of some optimizations made by the previous profile. In this profile, it is not possible to start playback at any time position of the stream, as it is not possible for Initialization Packets to refer to every sample of a Continuation Segment.</t>
<t>The Initialization Stream for Tracks of the Compatibility Profile must satisfy the following requirements:</t>

<ul spacing="compact">
<li>Each media sample of the Initialization Stream MUST be independent (i.e., an I frame in H.264) and individually addressable (the latter is currently always true for HESP).</li>
<li>Initialization Packets MUST contain a reference to the subsequent sample in the Continuation Segment, where this subsequent sample MUST be either independent (i.e., an I frame in H.264) or dependent on the media sample directly preceding it in decode order (i.e., a P frame referencing only the previous frame).
If the subsequent sample does not meet this constraint, then this Initialization Packet MUST NOT be published. Instead, the last valid Initialization Packet MUST be returned if this Sequence Number is queried.</li>
</ul>
<t>The Continuation Stream for Tracks of the Compatibility Profile must satisfy the following requirements:</t>

<ul spacing="compact">
<li>Continuation Segments SHOULD NOT exceed a duration of 30 seconds.</li>
<li>Each CMAF Fragment of the Continuation Segment MUST contain at most as many media samples as a sub-GOP (defined below).</li>
<li>The following GOP structure must be followed in the underlying H.264 stream: I, B (repeated n times), P, B (repeated n times), P, where n lies between 0 and 4.</li>
<li>Each sub-GOP (B ... B P) MUST depend on at most one previous frame (allowing for keyframe insertion).</li>
</ul>

<section anchor="example"><name>Example</name>
<t><xref target="fig-sub-gop-ex"></xref> depicts 7 frames of H.264 output, sorted in decode order. The dependencies of each frame are shown with arrows.</t>
<figure anchor="fig-sub-gop-ex"><name>Continuation stream with sub-GOPs
</name>
<artwork align="center"><![CDATA[            ______________________________________
            |                     |       |       |
            V                     |       |       |
+----+  +----+  +----+  +----+  +----+  +----+  +----+
|####|  |%%%%|  |%%%%|  |%%%%|  |@@@@|  |@@@@|  |@@@@|
| I1 |  | P4 |  | B2 |  | B3 |  | P7 |  | B5 |  | B6 |
+----+  +----+  +----+  +----+  +----+  +----+  +----+
    ^     |  ^     |       |        ^      |       |
    |_____|  |     |       |        |______|       |
    |        |     |       |        |              |
    |________|_____|       |        |______________|
    |                      |
    |______________________|
]]>
</artwork>
</figure>

<section anchor="sub-gops"><name>Sub-GOPs</name>
<t>A &quot;sub-GOP&quot; is a set of frames that, taken together, depends on at most one previous frame outside the set.</t>
<t>In <xref target="fig-sub-gop-ex"></xref>, there are 3 sub-GOPs:</t>

<ul spacing="compact">
<li>sub-GOP 1 (<tt>####</tt>) contains only a single I frame.</li>
<li>sub-GOP 2 (<tt>%%%%</tt>) contains a P frame (P4) that only depends on I1; all B frames depend on I1 and P4.</li>
<li>sub-GOP 3 (<tt>@@@@</tt>) contains a P frame (P7) that only depends on P4; all B frames depend on P7 and P4.</li>
</ul>
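<t>As a non-normative illustration, the following Python sketch computes, for each sub-GOP listed above, the frames it references outside of itself. The dependency table is transcribed from the list above and the function name <tt>external_refs</tt> is hypothetical. Each sub-GOP references at most one such frame.</t>

<sourcecode type="python"><![CDATA[
# Frame -> frames it references, as described for the figure.
DEPS = {
    "I1": set(),
    "P4": {"I1"},
    "B2": {"I1", "P4"},
    "B3": {"I1", "P4"},
    "P7": {"P4"},
    "B5": {"P4", "P7"},
    "B6": {"P4", "P7"},
}

# Sub-GOPs in decode order.
SUB_GOPS = [["I1"], ["P4", "B2", "B3"], ["P7", "B5", "B6"]]

def external_refs(sub_gop, deps):
    """Frames referenced by a sub-GOP that are not part of it."""
    members = set(sub_gop)
    refs = set()
    for frame in sub_gop:
        refs |= deps[frame] - members
    return refs

for gop in SUB_GOPS:
    outside = sorted(external_refs(gop, DEPS))
    print(gop, "depends on", outside)
]]>
</sourcecode>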
</section>

<section anchor="initialization-packets"><name>Initialization Packets</name>
<t>An Initialization Packet can be published if the subsequent media sample of the Continuation Stream depends on at most one previous frame.</t>
<t>For <xref target="fig-sub-gop-ex"></xref>, this is the case at the following positions:</t>

<ul spacing="compact">
<li>position 1: an Initialization Packet can be published. It will contain IDR1 (a keyframe matching the timestamp of I1) and will reference P4, the subsequent media sample of the Continuation Stream that only depends on I1. On the client-side, the media data will be decoded with IDR1 inserted at the location of I1.</li>
<li>position 4: an Initialization Packet can be published. It will contain IDR4 (a keyframe matching the timestamp of P4) and will reference P7, the subsequent media sample of the Continuation Stream that only depends on P4. On the client-side, the media data will be decoded with IDR4 inserted at the location of P4.</li>
</ul>
<t>A client requesting an Initialization Packet at other time positions must receive the most recent valid Initialization Packet. For example, that means that a request for an Initialization Packet at position 2 in <xref target="fig-sub-gop-ex"></xref> must return the Initialization Packet at position 1.</t>
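<t>As a non-normative illustration, the fallback to the most recent valid Initialization Packet can be sketched as a lookup in a sorted list of positions. The function name <tt>serve_position</tt> is hypothetical, and the list holds the two positions identified above.</t>

<sourcecode type="python"><![CDATA[
from bisect import bisect_right

# Positions of the example at which an Initialization Packet can be
# published, in ascending order.
VALID_POSITIONS = [1, 4]

def serve_position(requested, valid=VALID_POSITIONS):
    """Return the most recent valid position at or before `requested`."""
    i = bisect_right(valid, requested)
    if i == 0:
        raise LookupError("no Initialization Packet published yet")
    return valid[i - 1]

for position in (1, 2, 3, 4, 5):
    print(position, "->", serve_position(position))
]]>
</sourcecode>
<t>A request for position 2 or 3 is answered with the Initialization Packet of position 1, and a request for position 5 with that of position 4.</t>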
</section>
</section>
</section>
</section>

</back>

</rfc>
