<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" 
ipr="trust200902" 
docName="draft-ietf-nfsv4-scsi-layout-nvme-07" 
number="9561"
category="std" 
consensus="true" 
submissionType="IETF" 
obsoletes=""
updates=""
tocInclude="true" 
sortRefs="true" 
symRefs="true" 
version="3"
xml:lang="en"
>


  <front>
    <title abbrev="pNFS SCSI Layout for NVMe">Using the Parallel NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe) Storage Devices</title>
    <seriesInfo name="RFC" value="9561"/>
    <author initials="C." surname="Hellwig" fullname="Christoph Hellwig" role="editor">
      <organization/>
      <address>
        <email>hch@lst.de</email>
      </address>
    </author>
    <author initials="C." surname="Lever" fullname="Charles Lever">
      <organization abbrev="Oracle">Oracle Corporation</organization>
      <address>
        <postal>
          <country>United States of America</country>
        </postal>
        <email>chuck.lever@oracle.com</email>
      </address>
    </author>
    <author initials="S." surname="Faibish" fullname="Sorin Faibish">
      <organization>Opendrives.com</organization>
      <address>
        <postal>
          <street>11 Selwyn Road</street>
          <city>Newton</city>
          <region>MA</region>
          <code>02461</code>
          <country>United States of America</country>
        </postal>
        <phone>+1 617-510-0422</phone>
        <email>s.faibish@opendrives.com</email>
      </address>
    </author>
    <author initials="D." surname="Black" fullname="David L. Black">
      <organization>Dell Technologies</organization>
      <address>
        <postal>
          <street>176 South Street</street>
          <city>Hopkinton</city>
          <region>MA</region>
          <code>01748</code>
          <country>United States of America</country>
        </postal>
        <email>david.black@dell.com</email>
      </address>
    </author>

    <date year="2024" month="April"/>

    <area>Web and Internet Transport</area>
    <workgroup>NFSv4</workgroup>
    <keyword>NFSv4</keyword>

    <abstract>
      <t>This document specifies how to use the Parallel Network File System (pNFS)
Small Computer System Interface (SCSI) Layout Type to access storage devices
using the Non-Volatile Memory Express (NVMe) protocol family.</t>
    </abstract>
  </front>
  <middle>
    <section anchor="sec_intro">
      <name>Introduction</name>
      <t>NFSv4.1 <xref target="RFC8881"/> includes a pNFS feature that allows
reads and writes to be performed by means other than directing read and
write operations to the server.  Through use of this feature, the server,
in the role of metadata server, is responsible for managing file and
directory metadata while separate means are provided to execute 
reads and writes.</t>
      <t>These other means of performing file reads and writes are defined by
individual mapping types, which often have their own specifications.</t>
      <t>The pNFS Small Computer System Interface (SCSI) layout <xref target="RFC8154"/> is a layout
type that allows NFS clients to directly perform I&wj;/&wj;O to block storage devices
while bypassing the Metadata Server (MDS).  It is specified by using
concepts from the SCSI protocol family for the data path to the storage devices.</t>
      <t>NVM Express (NVMe), or the Non-Volatile Memory Host Controller Interface
Specification, is a set of specifications to talk to storage devices over
a number of protocols such as PCI Express (PCIe), Fibre Channel (FC),
TCP/IP, or Remote Direct Memory Access (RDMA) networking.  NVMe is currently
the predominantly used protocol to access PCIe Solid State Disks (SSDs),
and it is increasingly being adopted for remote storage access to replace
SCSI-based protocols such as iSCSI.</t>
      <t>This document defines how NVMe Namespaces using the NVM Command Set <xref target="NVME-NVM"/>
exported by NVMe Controllers implementing the
NVMe Base specification <xref target="NVME-BASE"/> are to be used as
storage devices using the SCSI Layout Type.
The definition is independent of the underlying transport used by the
NVMe Controller and thus supports Controllers implementing a wide variety
of transports, including PCIe, RDMA, TCP, and FC.</t>
      <t>This document does not amend the existing SCSI layout document.  Rather,
it
defines how NVMe Namespaces can be used within the SCSI Layout by
establishing a mapping of the SCSI constructs used in the SCSI layout
document to corresponding NVMe constructs.</t>
      <section anchor="ssc_intro_reqlang">
        <name>Requirements Language</name>
        <t>
    The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
    "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>",
    "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>",
    "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
    "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be
    interpreted as described in BCP&nbsp;14 <xref target="RFC2119"/> <xref
    target="RFC8174"/> when, and only when, they appear in all capitals, as
    shown here.
        </t>

      </section>
      <section anchor="ssc_intro_defs">
        <name>General Definitions</name>
        <t>The following definitions are included to provide context for the reader.</t>
        <dl>
          <dt>Client:</dt>
          <dd>
            <t>The "client" is the entity that accesses the NFS server's
resources.  The client may be an application that contains the
logic to access the NFS server directly, or it may be part of the operating
system that provides remote file system services for a set of
applications.</t>
          </dd>
          <dt>Metadata Server (MDS):</dt>
          <dd>
            <t>The Metadata Server (MDS) is the entity responsible for coordinating
client access to a set of file systems and is identified by a server
owner.</t>
          </dd>
        </dl>
      </section>
      <section anchor="ssc_intro_conv">
        <name>Numerical Conventions</name>
        <t>Numerical values defined in the SCSI specifications (e.g., <xref target="SPC5"/>) and the
NVMe specifications (e.g., <xref target="NVME-BASE"/>) are represented using the same
conventions as those specifications wherein a 'b' suffix denotes a binary
(base 2) number (e.g., 110b = 6 decimal) and an 'h' suffix denotes a
hexadecimal (base 16) number (e.g., 1ch = 28 decimal).</t>
      </section>
    </section>
    <section anchor="sec_slm">
      <name>SCSI Layout Mapping to NVMe</name>
      <t>The SCSI layout definition <xref target="RFC8154"/> references only a
few SCSI-specific concepts directly.  This document provides a mapping
from these SCSI concepts to NVM Express concepts that are used
when using the pNFS SCSI layout with NVMe namespaces.</t>
      <section anchor="ssc_volident">
        <name>Volume Identification</name>
        <t>The pNFS SCSI layout uses the Device Identification Vital Product Data (VPD)
page (page code 83h) from <xref target="SPC5"/> to identify the devices used by
a layout. Implementations that use NVMe namespaces as storage devices
map NVMe namespace identifiers to a subset of the identifiers
that the Device Identification VPD page supports for SCSI logical
units.</t>
        <t>To be used as storage devices for the pNFS SCSI layout, NVMe namespaces
<bcp14>MUST</bcp14> support either the IEEE Extended Unique Identifier (EUI64) or
Namespace Globally Unique Identifier (NGUID) value reported in a Namespace
Identification Descriptor, the I&wj;/&wj;O Command Set Independent Identify
Namespace data structure, and the Identify Namespace data structure,
NVM Command Set <xref target="NVME-BASE"/>. If available, use of the NGUID value is
preferred as it is the larger identifier.</t>

        <aside><t>Note: The PS_DESIGNATOR_T10 and PS_DESIGNATOR_NAME have no equivalent
in NVMe and cannot be used to identify NVMe storage devices.</t></aside>
        <t>The pnfs_scsi_base_volume_info4 structure for an NVMe namespace
<bcp14>SHALL</bcp14> be constructed as follows:</t>
        <ol spacing="normal" type="1"><li>
            <t>The "sbv_code_set" field <bcp14>SHALL</bcp14> be set to PS_CODE_SET_BINARY.</t>
          </li>
          <li>
            <t>The "pnfs_scsi_designator_type" field <bcp14>SHALL</bcp14> be set to
  PS_DESIGNATOR_EUI64.</t>
          </li>
          <li>
            <t>The "sbv_designator" field <bcp14>SHALL</bcp14> contain either the NGUID or
  the EUI64 identifier for the namespace.  If both NGUID and EUI64
  identifiers are available, then the NGUID identifier <bcp14>SHOULD</bcp14> be
  used as it is the larger identifier.</t>
          </li>
        </ol>

        <t>RFC 8154 <xref target="RFC8154"/> specifies the "sbv_designator" field as an XDR variable
length opaque&lt;&gt; (refer to Section <xref target="RFC4506" sectionFormat="bare" section="4.10"/> of RFC 4506 <xref target="RFC4506"/>). The length of that XDR opaque&lt;&gt; value (part of
its XDR representation) indicates which NVMe identifier is present.
That length <bcp14>MUST</bcp14> be 16 octets for an NVMe NGUID identifier and
<bcp14>MUST</bcp14> be 8 octets for an NVMe EUI64 identifier.  All other lengths
<bcp14>MUST NOT</bcp14> be used with an NVMe namespace.</t>
      </section>
      <section anchor="ssc_fencing">
        <name>Client Fencing</name>
        <t>The SCSI layout uses Persistent Reservations (PRs) to provide client
fencing.  For this to be achieved, both the MDS and the Clients have to
register a key with the storage device, and the MDS has to create a
reservation on the storage device.</t>
        <t>The following subsections provide a full mapping of the required
PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT SCSI commands <xref target="SPC5"/>
to NVMe commands that <bcp14>MUST</bcp14> be used when using
NVMe namespaces as storage devices for the pNFS SCSI layout.</t>
        <section anchor="ssc_fencing_keys">
          <name>PRs - Key Registration</name>
          <t>On NVMe namespaces, reservation keys are registered using the
Reservation Register command (refer to Section 7.3 of <xref target="NVME-BASE"/>)
with the Reservation Register Action
(RREGA) field set to 000b (i.e., Register Reservation Key) and
supplying the reservation key in the New Reservation Key (NRKEY)
field.</t>
          <t>Reservation keys are unregistered using the Reservation Register
command with the Reservation Register Action (RREGA) field set to
001b (i.e., Unregister Reservation Key) and supplying the reservation
key in the Current Reservation Key (CRKEY) field.</t>
          <t>One important difference between SCSI Persistent Reservations
and NVMe Reservations is that NVMe reservation keys always apply
to all controllers used by a host (as indicated by the NVMe Host
Identifier). This behavior is analogous to setting the ALL_TG_PT
bit when registering a SCSI Reservation Key, and it is always supported
by NVMe Reservations, unlike the ALL_TG_PT for which SCSI support is
inconsistent and cannot be relied upon.
Registering a reservation key with a namespace creates an
association between a host and a namespace. A host that is a
registrant of a namespace may use any controller with which that
host is associated (i.e., that has the same Host Identifier,
refer to Section 5.27.1.25 of <xref target="NVME-BASE"/>)
to access that namespace as a registrant.</t>
        </section>
        <section anchor="ssc_fencing_reg">
          <name>PRs - MDS Registration and Reservation</name>
          <t>Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the MDS
needs to prepare the volume for fencing using PRs. This is done by
registering the reservation generated for the MDS with the device
(see <xref target="ssc_fencing_keys"/>) followed by a Reservation Acquire
command (refer to Section 7.2 of <xref target="NVME-BASE"/>) with
the Reservation Acquire Action (RACQA) field set to 000b (i.e., Acquire)
and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive Access
- Registrants Only Reservation).</t>
        </section>
        <section anchor="ssc_fenceaction">
          <name>Fencing Action</name>
          <t>In case of a non-responding client, the MDS fences the client by
executing a Reservation Acquire command (refer to Section 7.2 of <xref target="NVME-BASE"/>),
with the Reservation Acquire Action
(RACQA) field set to 001b (i.e., Preempt) or 010b (i.e., Preempt and
Abort), the Current Reservation Key (CRKEY) field set to the
server's reservation key, the Preempt Reservation Key (PRKEY) field
set to the reservation key associated with the non-responding client,
and the Reservation Type (RTYPE) field set to 4h (i.e., Exclusive
Access - Registrants Only Reservation).
The client can distinguish I&wj;/&wj;O errors due to fencing from other
errors based on the Reservation Conflict NVMe status code.</t>
        </section>
        <section anchor="ssc_recovery">
          <name>Client Recovery after a Fence Action</name>
          <t>If an NVMe command issued by the client to the storage device returns
a non-retryable error (refer to the DNR bit defined in Figure 92 in
<xref target="NVME-BASE"/>), the client <bcp14>MUST</bcp14> commit all layouts that
use the storage device through the MDS, return all outstanding layouts
for the device, forget the device ID, and unregister the reservation
key.</t>
        </section>
      </section>
      <section anchor="ssc_caches">
        <name>Volatile Write Caches</name>
        <t>For NVMe controllers, a volatile write cache is enabled if bit 0 of the
Volatile Write Cache (VWC) field in the Identify Controller data
structure, I&wj;/&wj;O Command Set Independent (refer to Figure 275 in <xref target="NVME-BASE"/>)
is set and the Volatile Write Cache Enable (WCE) bit (i.e., bit 00) in
the Volatile Write Cache Feature (Feature Identifier 06h)
(refer to Section 5.27.1.4 of <xref target="NVME-BASE"/>) is set.
If a volatile write cache is enabled on an NVMe namespace used as a
storage device for the pNFS SCSI layout, the pNFS server (MDS) <bcp14>MUST</bcp14>
use the NVMe Flush command to flush the volatile write cache to
stable storage before the LAYOUTCOMMIT operation returns by using the
Flush command (refer to Section 7.1 of <xref target="NVME-BASE"/>).
The NVMe Flush command is the equivalent to the SCSI SYNCHRONIZE
CACHE commands.</t>
      </section>
    </section>
    <section anchor="sec_security">
      <name>Security Considerations</name>

      <t>NFSv4 clients access NFSv4 metadata servers using the NFSv4
protocol. The security considerations generally described in <xref target="RFC8881"/>
apply to a client's interactions with
the metadata server. However, NFSv4 clients and servers access
NVMe storage devices at a lower layer than NFSv4. NFSv4 and
RPC security are not directly applicable to the I&wj;/&wj;Os to data servers
using NVMe.
Refer to Sections <xref target="RFC8154" section="2.4.6" sectionFormat="bare">Extents Are Permissions</xref> and <xref target="RFC8154" section="4" sectionFormat="bare">Security Considerations</xref> of <xref target="RFC8154"/> for the
security considerations of direct access to block storage from NFS clients.</t>
      <t>pNFS with an NVMe layout can be used with NVMe transports (e.g., NVMe
over PCIe <xref target="NVME-PCIE"/>) that provide essentially no additional security
functionality. Or, pNFS may be used with storage protocols such as NVMe
over TCP <xref target="NVME-TCP"/> that can provide significant transport layer
security.</t>
      <t>It is the responsibility of those administering and deploying pNFS with
an NVMe layout to ensure that appropriate protection is deployed to that
protocol based on the deployment environment as well as the nature and
sensitivity of the data and storage devices involved.  When using IP-based
storage protocols such as NVMe over TCP, data confidentiality and
integrity <bcp14>SHOULD</bcp14> be provided for traffic between pNFS clients and NVMe
storage devices by using a secure communication protocol such as Transport
Layer Security (TLS) <xref target="RFC8446"/>. For NVMe over TCP, TLS <bcp14>SHOULD</bcp14> be used as
described in <xref target="NVME-TCP"/> to protect traffic between pNFS clients and NVMe
namespaces used as storage devices.</t>
      <t>A secure communication protocol might not be needed for pNFS with NVMe
layouts in environments where physical and/or logical security measures
(e.g., air gaps, isolated VLANs) provide effective access control
commensurate with the sensitivity and value of the storage devices and data
involved (e.g., public website contents may be significantly less sensitive
than a database containing personal identifying information, passwords,
and other authentication credentials).</t>
      <t>Physical security is a common means for protocols not based on IP. In environments where the security requirements for the storage
protocol cannot be met, pNFS with an NVMe layout <bcp14>SHOULD NOT</bcp14> be
deployed.</t>
      <t>When security is available for the data server storage protocol,
it is generally at a different granularity and with a different
notion of identity than NFSv4 (e.g., NFSv4 controls user access
to files, and NVMe controls initiator access to volumes).  As
with pNFS with the block layout type <xref target="RFC5663"/>,
the pNFS client is responsible for enforcing appropriate
correspondences between these security layers. In environments
where the security requirements are such that client-side
protection from access to storage outside of the layout is not
sufficient, pNFS with a SCSI layout on a NVMe namespace <bcp14>SHOULD
NOT</bcp14> be deployed.</t>
      <t>As with other block-oriented pNFS layout types, the metadata server
is able to fence off a client's access to the data on an NVMe namespace
used as a storage device.  If a metadata server revokes a layout, the
client's access <bcp14>MUST</bcp14> be terminated at the storage devices via fencing
as specified in <xref target="ssc_fencing"/>.  The client has a
subsequent opportunity to acquire a new layout.</t>
    </section>
    <section anchor="sec_iana">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5663.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8154.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>

        <reference anchor="SPC5">
          <front>
            <title>SCSI Primary Commands - 5 (SPC-5)</title>
            <author>
              <organization>INCITS Technical Committee T10</organization>
            </author>
            <date year="2019"/>
          </front>
          <seriesInfo name="INCITS" value="502-2019"/>
        </reference>

        <reference anchor="NVME-BASE" target="https://nvmexpress.org/wp-content/uploads/NVM-Express-Base-Specification-2.0d-2024.01.11-Ratified.pdf">
          <front>
            <title>NVM Express Base Specification</title>
            <author>
              <organization>NVM Express, Inc.</organization>
            </author>
            <date year="2024" month="January"/>
          </front>
<refcontent>Revision 2.0d</refcontent>
        </reference>

        <reference anchor="NVME-NVM" target="https://nvmexpress.org/wp-content/uploads/NVM-Express-             NVM-Command-Set-Specification-1.0d-2023.12.28-Ratified.pdf">
          <front>
            <title>NVM Express NVM Command Set Specification</title>
            <author>
              <organization>NVM Express, Inc.</organization>
            </author>
            <date year="2023" month="December"/>
          </front>
<refcontent>Revision 1.0d</refcontent>
        </reference>

        <reference anchor="NVME-TCP" target="https://nvmexpress.org/wp-content/uploads/NVM-Express-TCP-Transport-Specification-1.0d-2023.12.27-Ratified.pdf">
          <front>
            <title>NVM Express TCP Transport Specification</title>
            <author>
              <organization>NVM Express, Inc.</organization>
            </author>
            <date year="2023" month="December"/>
          </front>
<refcontent>Revision 1.0d</refcontent>
        </reference>

        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8446.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
      </references>
      <references>
        <name>Informative References</name>

        <reference anchor="NVME-PCIE" target="https://nvmexpress.org/wp-content/uploads/NVM-Express-PCIe-Transport-Specification-1.0d-2023.12.27-Ratified.pdf">
          <front>
            <title>NVMe over PCIe Transport Specification</title>
            <author>
              <organization>NVM Express, Inc.</organization>
            </author>
            <date year="2023" month="December"/>
          </front>
<refcontent>Revision 1.0d</refcontent>
        </reference>

      </references>
    </references>
    <section numbered="false" anchor="acknowledgements">
      <name>Acknowledgements</name>
      <t><contact fullname="Carsten Bormann"/> converted an earlier RFCXML v2 source for this document to a markdown source format.</t>
      <t>David Noveck provided ample feedback to various drafts of this document.</t>
    </section>

  </back>
</rfc>
