<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://samuelwstark.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://samuelwstark.com/" rel="alternate" type="text/html" /><updated>2026-04-04T20:00:42+01:00</updated><id>https://samuelwstark.com/feed.xml</id><title type="html">Samuel W. Stark</title><subtitle>Samuel W. Stark is a PhD Candidate at the University of Cambridge, studying computer science with interests in computer hardware, security, graphics, and GPUs.</subtitle><author><name>Samuel W. Stark</name></author><entry><title type="html">My first article!</title><link href="https://samuelwstark.com/posts/2023/07/my-first-article" rel="alternate" type="text/html" title="My first article!" /><published>2023-07-07T00:00:00+01:00</published><updated>2023-07-07T00:00:00+01:00</updated><id>https://samuelwstark.com/posts/2023/07/my-first-article</id><content type="html" xml:base="https://samuelwstark.com/posts/2023/07/my-first-article"><![CDATA[<p>My first article, “How Flexible is CXL’s Memory Protection?”, was just published in ACM Queue (it’s even on the front cover!)
It discusses PCIe and CXL, their memory security model (and how it leaves something to be desired), and various flavors of <em>capability</em> which may be able to help.
<a href="https://queue.acm.org/detail.cfm?id=3606014">The article is available online here</a>, and I plan to publish a copy here in the coming weeks to demo <a href="https://github.com/theturboturnip/turnip_text">my pie-in-the-sky alternative typesetting language</a>.</p>]]></content><author><name>Samuel W. Stark</name></author><category term="PhD" /><category term="cxl" /><category term="capabilities" /><summary type="html"><![CDATA[My first article, “How Flexible is CXL’s Memory Protection?”, was just published in ACM Queue (it’s even on the front cover!) It discusses PCIe and CXL, their memory security model (and how it leaves something to be desired), and various flavors of capability which may be able to help. The article is available online here, and I plan to publish a copy here in the coming weeks to demo my pie-in-the-sky alternative typesetting language.]]></summary></entry><entry><title type="html">CHERI-RVV summary: The quest for secure, vectorized memcpy</title><link href="https://samuelwstark.com/posts/2022/12/cheri-rvv-rise" rel="alternate" type="text/html" title="CHERI-RVV summary: The quest for secure, vectorized memcpy" /><published>2022-12-08T00:00:00+00:00</published><updated>2022-12-08T00:00:00+00:00</updated><id>https://samuelwstark.com/posts/2022/12/cheri-rvv-condensed</id><content type="html" xml:base="https://samuelwstark.com/posts/2022/12/cheri-rvv-rise"><![CDATA[<p>This is a condensed version of my master’s thesis (from 15,000 to 5,000 words), which <a href="https://twitter.com/UK_RISE/status/1598685532434964480">won the RISE 2022 Student competition!</a>
I also submitted a 5-minute presentation, which is available <a href="https://www.youtube.com/watch?v=J82OFvF3yGY">on YouTube</a> and <a href="/academia/2022-06-06-capability-protection-scalable-vectors">the project summary page</a>.</p>

<p>The RISE submission form did not allow images, so this summary is a bit sparse.
I plan to release the full chapters as blog posts in the future, images and tables included, which will serve as a testing ground for <a href="https://github.com/theturboturnip/turnip_text">my pie-in-the-sky alternative typesetting language</a>.</p>

<h1 data-number="1" id="introduction"><span class="header-section-number">1</span> Introduction</h1>
<p>The CHERI architecture project improves computer security by checking all memory accesses in hardware. Under CHERI,
    memory cannot be accessed through plain integer addresses. Every access must instead go through a <em>capability</em><span class="citation" data-cites="TR-941">[1]</span>: an unforgeable token that grants fine-grained access to a range of memory. Capabilities
    cannot be generated from scratch, and must be <em>derived</em> from another capability with equal or greater
    permissions. This vastly reduces the scope of security violations through spatial errors (e.g. buffer overflows<span class="citation" data-cites="szekeresSoKEternalWar2013">[2]</span>), and creates interesting opportunities for
    software compartmentalization<span class="citation" data-cites="watsonCHERIHybridCapabilitySystem2015">[3]</span>.
</p>
<p>Industry leaders have recognized the value CHERI provides. Arm Inc have manufactured the Morello System-on-Chip, which
    incorporates CHERI into the Armv8.2 ISA. However, some features have not yet fully embraced CHERI, such as Arm’s Scalable
    Vector Extension (SVE), which is designed to remain in use well into the future<span class="citation" data-cites="stephensARMScalableVector2017">[4]</span>. Supporting this and other scalable vector ISAs is
    essential to CHERI’s long-term relevance.</p>
<h2 data-number="1.1" id="motivation"><span class="header-section-number">1.1</span> Motivation</h2>
<p>Modern vector implementations all provide vector load/store instructions. Vector-enabled CHERI CPUs must support
    those instructions, but adding CHERI’s bounds-checking for each vector element could impact performance.</p>
<p>Vector memory access performance is critical, because vectors aren’t just used for computation. For example,
    <code>glibc</code> uses vector memory accesses to implement <code>memcpy</code> where available. These
    implementations are written in assembly and heavily optimized. When their accesses hit the cache, even a few extra
    cycles of bounds-checking per access could noticeably reduce throughput.
</p>
<p><code>memcpy</code> also raises the important question of how vectors interact with capabilities. In non-CHERI
    processors, <code>memcpy</code> will copy pointers around in memory. An equivalent CHERI-enabled vector memcpy would
    need to load/store capabilities from vectors without violating security guarantees.</p>
<p>The goal of this project is to investigate the impact of, and the roadblocks for, integrating a scalable vector
    architecture with CHERI’s memory protection system. Specifically, we focus on integrating the RISC-V Vector
    extension<span class="citation" data-cites="specification-RVV-v1.0">[5]</span> (RVV) with the CHERI-RISC-V ISA, with
    the aim of enabling a future CHERI-RVV implementation and informing the approach for a future CHERI Arm SVE
    implementation.</p>
<p>The full dissertation addresses nine hypotheses, but for the sake of brevity we examine four here. Chapters 2-5 each
    cover one hypothesis in order, and Chapter 6 concludes.</p>
<ol type="1">
    <li>It is possible to use CHERI capabilities as memory references in all vector instructions.</li>
    <li>The capability bounds checks for vector elements within a known range (e.g. a cache line) can be performed in a
        single check, amortizing the cost.</li>
    <li>Legacy vector code can be compiled into a pure-capability form with no changes.</li>
    <li>It is possible for a vector architecture to load, store, and manipulate capabilities in vector registers without
        violating CHERI security principles.</li>
</ol>
<h1 data-number="2" id="background"><span class="header-section-number">2</span> Background</h1>
<p>This chapter provides background on CHERI and RVV. It has been cut down to only include information relevant to the
    rest of this summary.</p>
<h2 data-number="2.1" id="cheri"><span class="header-section-number">2.1</span> CHERI</h2>
<p>In CHERI, addresses/pointers are replaced with capabilities: unforgeable tokens that provide <em>specific kinds of
        access</em> to an <em>address</em> within a <em>range of memory</em>. The above statement is enough to
    understand what capabilities contain:</p>
<ul>
    <li>
        <p>Permission bits, to restrict access</p>
    </li>
    <li>
        <p>The <em>cursor</em>, i.e. the address it currently points to</p>
    </li>
    <li>
        <p>The <em>bounds</em>, i.e. the range of addresses this capability could point to</p>
    </li>
</ul>
<p>By using a floating-point-style compressed encoding for the bounds, all of this data fits in just 2x the architectural register size (see <span class="citation" data-cites="woodruffCHERIConcentratePractical2019">[6]</span>). For example, on 64-bit RISC-V a
    standard capability is 128 bits long, and we assume capabilities are 128 bits long throughout this summary.</p>
<p>A CHERI implementation has to enforce three security properties about its capabilities<span class="citation" data-cites="TR-951">[7]</span>:</p>
<ul>
    <li>
        <p>Provenance - Capabilities must always be derived from valid manipulations of other capabilities.</p>
    </li>
    <li>
        <p>Integrity - Corrupted capabilities cannot be dereferenced.</p>
    </li>
    <li>
        <p>Monotonicity - Capabilities cannot increase their rights.</p>
    </li>
</ul>
<p>Integrity is enforced by tagging registers and memory. Every 128-bit register and aligned 128-bit region of memory
    has an associated tag bit, which denotes whether its data encodes a valid capability. If any non-capability data is
    written to any part of the region, the tag bit is zeroed out. Instructions that perform memory accesses can only do
    so if the provided capability has a valid tag bit. As with the capability encoding, significant work has gone into
    reducing the DRAM overhead of this method (see <span class="citation" data-cites="joannouEfficientTaggedMemory2017">[8]</span>).</p>
<p>Provenance and Monotonicity are enforced by all instructions that manipulate capabilities. If an instruction violates
    either property, it will zero out the tag bit and rely on Integrity enforcement to ensure it is not dereferenced.
    Some CHERI-enabled architectures, such as CHERI-RISC-V, also raise a synchronous exception when this occurs.</p>
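<p>As a rough illustration of how the three properties interact, the following Rust sketch models an uncompressed capability and a monotonic derivation step. The names <code>Cap</code>, <code>derive</code>, <code>can_load</code> and the two permission constants are hypothetical, the real 128-bit compressed encoding is ignored, and the synchronous exception raised by CHERI-RISC-V is not modelled: a violating derivation simply clears the tag.</p>

```rust
/// Hypothetical, uncompressed model of a capability. Real CHERI
/// capabilities pack these fields into 128 bits plus an out-of-band tag.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Cap {
    pub tag: bool,   // Integrity: true only for valid capabilities
    pub perms: u8,   // permission bits (only load/store modelled here)
    pub base: u64,   // inclusive lower bound
    pub top: u64,    // exclusive upper bound
    pub cursor: u64, // the address currently pointed to
}

pub const PERM_LOAD: u8 = 0b01;
pub const PERM_STORE: u8 = 0b10;

impl Cap {
    /// Derive a new capability from `self` (Provenance). If the request
    /// would add permissions or widen the bounds, the result is untagged
    /// (Monotonicity), so it can never be dereferenced (Integrity).
    pub fn derive(self, perms: u8, base: u64, top: u64) -> Cap {
        let monotonic = self.tag
            && (perms & !self.perms) == 0
            && base >= self.base
            && top <= self.top
            && base <= top;
        Cap { tag: monotonic, perms, base, top, cursor: base }
    }

    /// Check whether a `len`-byte load at the cursor is allowed.
    pub fn can_load(self, len: u64) -> bool {
        match self.cursor.checked_add(len) {
            Some(end) => {
                self.tag
                    && (self.perms & PERM_LOAD) != 0
                    && self.cursor >= self.base
                    && end <= self.top
            }
            None => false, // address arithmetic overflowed
        }
    }
}
```

<p>In this sketch, shrinking a read-write capability to a read-only sub-range keeps the tag, while any attempt to widen the result again yields an untagged value that fails every later check.</p>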
<h3 data-number="2.1.1" id="cheri-risc-v-isa"><span class="header-section-number">2.1.1</span> CHERI-RISC-V ISA</h3>
<p>CHERI-RISC-V (described in <span class="citation" data-cites="TR-951">[7]</span>) is a straightforward set of
    additions to basic RISC-V ISAs. It adds thirty-two general-purpose capability registers <code>cx0-cx31</code>,
    thirty-two Special Capability Registers (SCRs), and many new instructions. <!-- The new general-purpose capability registers are each of size
`CLEN = 2 * XLEN` plus a tag bit. --> Many of the new SCRs are intended to support the privileged ISA extensions for
    e.g. hypervisors or operating systems and are unused here. The two most relevant SCRs are the Program Counter
    Capability (PCC) and Default Data Capability (DDC).</p>
<p>The PCC replaces the program counter and adds more metadata, ensuring instruction fetches have the same security
    properties as normal loads and stores. The DDC is used to sandbox integer addressing modes. CHERI-RISC-V includes
    new instructions which use integer addressing, and allows legacy (i.e. integer addressed) code to function on CHERI
    systems without recompiling for CHERI-RISC-V. These instructions all use integer addresses relative to the DDC, and
    the DDC controls the permissions those instructions have.</p>
<h3 data-number="2.1.2" id="capability-and-integer-encoding-mode"><span class="header-section-number">2.1.2</span>
    Capability and Integer encoding mode</h3>
<p>CHERI-RISC-V specifies two encoding modes, selected using a flag in the PCC <code>flags</code> field. <em>Capability
        mode</em> modifies the behaviour of pre-existing instructions (e.g. Load Byte) to use capability addressing, and
    <em>Integer mode</em> keeps those instructions using integer addresses but dereferences them relative to the DDC.
    This allows legacy code to run in a sandbox defined by the DDC without recompiling.
</p>
<h3 data-number="2.1.3" id="cheri_purecap_hybrid"><span class="header-section-number">2.1.3</span> Pure-capability and
    Hybrid compilation modes</h3>
<p>CHERI-Clang, the main CHERI-enabled compiler, supports two ways to compile CHERI-RISC-V which map to the different
    encoding modes.</p>
<p><em>Pure-capability</em> mode treats all pointers as capabilities, and emits pre-existing RISC-V instructions that
    expect to be run in capability mode.</p>
<p><em>Hybrid</em> mode treats pointers as integer addresses, dereferenced relative to the DDC, unless they are
    annotated with <code>__capability</code>. This mode emits a mix of capability-addressed and integer-addressed memory
    instructions. All capabilities in hybrid mode are created manually by the program by copying and shrinking the DDC.
</p>
<p>Hybrid mode allows programs to be gradually ported to CHERI, making it very easy to adopt on legacy/large codebases.
    Any extensions to the model (e.g. CHERI-RVV) should try to retain this property.</p>
<h2 data-number="2.2" id="the-rvv-vector-model"><span class="header-section-number">2.2</span> The RVV vector model</h2>
<p>RVV defines thirty-two vector registers, each of an implementation-defined constant width <code>VLEN</code>. These
    registers can be interpreted as <em>vectors</em> of <em>elements</em>. The program can configure the size of
    elements, and the implementation defines a maximum width <code>ELEN</code>.</p>
<p>RVV instructions operate on <em>groups</em> of vector registers. The implementation holds two status registers,
    <code>vstart</code> and <code>vl</code>, which define the start and length of the “body” section within the vector.
    Instructions only operate on body elements, and some allow elements within the body to be masked out and ignored.
</p>
<h3 data-number="2.2.1" id="rvv-memory-instructions"><span class="header-section-number">2.2.1</span> RVV memory
    instructions</h3>
<p>The only RVV instructions that interact with memory are vectorized loads and stores. These instructions take a
    register index to use as the “base address”, and calculate offsets from that base for each individual vector
    element. The offsets can be calculated in three ways:</p>
<ul>
    <li>Unit-stride, where elements are tightly packed together.</li>
    <li>Strided, where a second register specifies the distance between consecutive elements.</li>
    <li>Indexed, where a vector register holds the offsets for each vector element.</li>
</ul>
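<p>The three offset calculations listed above can be sketched as a single address-generation function. This is an illustrative model rather than the emulator’s code: <code>Addressing</code>, <code>element_addr</code> and <code>eew_bytes</code> (the element width in bytes) are assumed names, strides are treated as signed byte distances, and indexed offsets as unsigned byte offsets.</p>

```rust
/// The three RVV addressing modes, reduced to what is needed to
/// compute element addresses. Names are illustrative.
pub enum Addressing<'a> {
    /// Elements are tightly packed, `eew_bytes` apart.
    UnitStride,
    /// A scalar register gives the (signed) byte distance between elements.
    Strided { stride: i64 },
    /// A vector register gives one byte offset per element.
    Indexed { offsets: &'a [u64] },
}

/// Address of element `i` for an access starting at `base`.
pub fn element_addr(base: u64, eew_bytes: u64, i: usize, mode: &Addressing) -> u64 {
    match mode {
        Addressing::UnitStride => base.wrapping_add((i as u64).wrapping_mul(eew_bytes)),
        Addressing::Strided { stride } => {
            // Negative strides wrap around, matching two's-complement addition.
            base.wrapping_add((i as i64).wrapping_mul(*stride) as u64)
        }
        Addressing::Indexed { offsets } => base.wrapping_add(offsets[i]),
    }
}
```

<p>Unit-stride and strided addresses are a simple function of the element index, whereas indexed addresses depend on the contents of a vector register.</p>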
<p>Additionally, unit-stride accesses support a few alternate modes of operation. The most relevant one here is the
    fault-only-first unit-stride load. This loads as much contiguous data as possible from the base address, until one
    of the elements triggers an exception (such as a capability bounds violation). Unless the offending element is the
    first one, which traps as normal, that exception will be silently swallowed, and <code>vl</code> will be shrunk to the index of the offending element.</p>
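<p>A minimal sketch of fault-only-first behaviour follows, under some assumptions: <code>FofResult</code> and <code>fault_only_first</code> are hypothetical names, the <code>allowed</code> closure stands in for whatever per-element check applies (for CHERI, a capability bounds check), and no data is actually loaded. It also models the RVV rule that a fault on element 0 is still trapped rather than swallowed.</p>

```rust
/// Outcome of a fault-only-first load in this sketch.
#[derive(Debug, PartialEq)]
pub enum FofResult {
    /// Elements `0..new_vl` were loaded; `vl` should be set to `new_vl`.
    Loaded { new_vl: usize },
    /// Element 0 faulted, so the trap is taken as for a normal load.
    Trap,
}

/// Walk the unit-stride elements in order and stop at the first one
/// that `allowed(addr, len)` rejects.
pub fn fault_only_first(
    base: u64,
    eew_bytes: u64,
    vl: usize,
    allowed: impl Fn(u64, u64) -> bool,
) -> FofResult {
    for i in 0..vl {
        let addr = base.wrapping_add((i as u64).wrapping_mul(eew_bytes));
        if !allowed(addr, eew_bytes) {
            return if i == 0 {
                FofResult::Trap
            } else {
                // The exception is swallowed and vl shrinks to the faulting index.
                FofResult::Loaded { new_vl: i }
            };
        }
    }
    FofResult::Loaded { new_vl: vl }
}
```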
<h3 data-number="2.2.2" id="exception-handling"><span class="header-section-number">2.2.2</span> Exception handling</h3>
<p>If synchronous exceptions (e.g. invalid memory access) or asynchronous interrupts are encountered while executing a
    vector instruction, RVV defines two ways to trap them. In both cases, the PC of the instruction is saved in a
    register <code>*epc</code>.</p>
<p>If the instruction should be resumed after handling the trap, e.g. in the case of demand paging, the implementation
    may use a “precise trap”. The implementation must complete all instructions up to <code>*epc</code>, and no
    instructions after that, and save the index of the offending vector element in <code>vstart</code>. Within the
    instruction, all vector elements before <code>vstart</code> must have committed their results, and all other
    elements must either 1) not have committed results, or 2) be idempotent i.e. repeatable without changing the
    outcome.</p>
<p>In other cases “imprecise traps” may be used, which allow instructions after <code>*epc</code> and vector elements
    after <code>vstart</code> to commit their results. <code>vstart</code> must still be recorded, however.</p>
<h2 data-number="2.3" id="the-hypothesis"><span class="header-section-number">2.3</span> The Hypothesis</h2>
<p><em>It is possible to use CHERI capabilities as memory references in all vector instructions.</em></p>
<p>This is entirely true - all RVV memory instructions take the index of a “base address register” in the scalar
    register file, and it is trivial to index into the capability register file instead. This can be applied to other
    ISAs wherever memory references are accessed through a scalar register file, e.g. all Arm Morello scalar
    instructions and most of Arm SVE’s memory instructions. Notably Arm SVE’s <code>u64base</code> addressing mode,
    which uses a vector directly as a set of 64-bit integer addresses<span class="citation" data-cites="armltdArmCompilerScalable2019">[9]</span>, is not as simple to port to CHERI.</p>
<h1 data-number="3" id="hardware-emulation-investigation"><span class="header-section-number">3</span> Hardware
    emulation investigation</h1>
<p>In order to experiment with integrating CHERI and RVV, we implemented a RISC-V emulator named
    <code>riscv-v-lite</code> in the Rust programming language. The emulator supports the Multiply, CSR, Vector, and CHERI extensions, and
    was also used as the base for capabilities-in-vectors research.</p>
<p>The emulator is very modular, such that each ISA extension is defined as a separate module which can easily be
    plugged into different processor implementations. Each ISA module uses a “connector” structure, containing
    e.g. virtual references to register files and memory, which allows different processors to reuse ISA modules despite
    using different register file/memory implementations.</p>
<p>Each processor implements a single-stage pipeline. Instructions are fetched, decoded with a common decoder function,
    and executed. The processor asks each ISA module in turn if it wants to handle the instruction, and uses the first
    module to say yes. If the ISA module returns a new PC value it is immediately applied; otherwise the PC is automatically
    incremented. This structure easily represents basic RISC-V architectures, and can scale up to support many different
    new modules.</p>
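<p>The dispatch loop described above might look roughly like the sketch below. The trait <code>IsaModule</code>, its two methods, the <code>step</code> function and the two toy modules are all invented for illustration and are not taken from <code>riscv-v-lite</code>. The sketch also assumes fixed 4-byte instructions and omits fetch and decode.</p>

```rust
/// Hypothetical interface each ISA extension module implements.
pub trait IsaModule {
    /// Does this module want to handle the instruction word?
    fn will_handle(&self, inst: u32) -> bool;
    /// Execute it, optionally returning a new PC (e.g. for jumps).
    fn execute(&mut self, inst: u32, pc: u64) -> Option<u64>;
}

/// Ask each module in turn and use the first one to say yes.
/// Returns the next PC, or None if no module claimed the instruction.
pub fn step(modules: &mut [Box<dyn IsaModule>], inst: u32, pc: u64) -> Option<u64> {
    for module in modules.iter_mut() {
        if module.will_handle(inst) {
            // A returned PC is applied immediately, otherwise the PC is incremented.
            return Some(module.execute(inst, pc).unwrap_or(pc.wrapping_add(4)));
        }
    }
    None
}

/// Toy module: claims opcode 0x6f and always jumps to a fixed target.
pub struct JumpModule {
    pub target: u64,
}

impl IsaModule for JumpModule {
    fn will_handle(&self, inst: u32) -> bool {
        (inst & 0x7f) == 0x6f
    }
    fn execute(&mut self, _inst: u32, _pc: u64) -> Option<u64> {
        Some(self.target)
    }
}

/// Toy module: claims everything and does nothing but count.
pub struct NopModule {
    pub executed: u32,
}

impl IsaModule for NopModule {
    fn will_handle(&self, _inst: u32) -> bool {
        true
    }
    fn execute(&mut self, _inst: u32, _pc: u64) -> Option<u64> {
        self.executed += 1;
        None
    }
}
```

<p>Because modules are tried in order, a more specific module placed earlier in the list takes priority over a catch-all module placed later.</p>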
<h2 data-number="3.1" id="emulating-cheri"><span class="header-section-number">3.1</span> Emulating CHERI</h2>
<p>Manipulating CHERI capabilities securely and correctly is a must for any CHERI-enabled emulator. Capability encoding
    logic is not trivial by any means, so the <code>cheri-compressed-cap</code> C library was re-used rather than
    implementing it from scratch. There were a few issues with implementing Rust/C interoperation, which are addressed
    in the dissertation.</p>
<p>Integrating capabilities into the emulator was relatively simple thanks to the modular emulator structure.
    CHERI-specific memory and register file types were created, which could expose both integer and capability
    functionality. The CHERI register file exposed integer-mode and capability-mode accesses, and memory was built in
    three layers:</p>
<ol type="1">
    <li>Normal integer-addressed memory</li>
    <li>Capability-addressed CHERI memory, which checks capability properties before accessing 1)</li>
    <li>Integer-mode CHERI memory, which adds an integer address to the DDC before accessing 2)</li>
</ol>
<p>This approach meant code for basic RV64I operations did not need to be modified for CHERI at all - simply passing the
    integer-mode memory and register file would perform all relevant checks.</p>
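<p>The three memory layers listed above can be sketched as nested wrappers. All type and method names here are assumptions made for illustration, the capability is reduced to a tag and bounds, and integer addresses are treated as offsets from the DDC’s base, which is a simplification of the real DDC-relative addressing rules.</p>

```rust
/// Layer 1: plain integer-addressed memory (a small flat byte array here).
pub struct Memory {
    pub bytes: [u8; 64],
}

impl Memory {
    pub fn load_u8(&self, addr: u64) -> u8 {
        self.bytes[addr as usize]
    }
}

/// Minimal stand-in for a capability: a tag and [base, top) bounds.
#[derive(Clone, Copy)]
pub struct Cap {
    pub tag: bool,
    pub base: u64,
    pub top: u64,
}

#[derive(Debug, PartialEq)]
pub enum Fault {
    Tag,
    Bounds,
}

/// Layer 2: capability-addressed memory, which checks the capability
/// before touching layer 1.
pub struct CheriMemory {
    pub inner: Memory,
}

impl CheriMemory {
    pub fn load_u8(&self, cap: Cap, addr: u64) -> Result<u8, Fault> {
        if !cap.tag {
            return Err(Fault::Tag);
        }
        if addr < cap.base || addr >= cap.top {
            return Err(Fault::Bounds);
        }
        Ok(self.inner.load_u8(addr))
    }
}

/// Layer 3: integer-mode view, which resolves integer addresses through
/// the DDC and then defers to layer 2 for the checks.
pub struct IntegerModeMemory {
    pub inner: CheriMemory,
    pub ddc: Cap,
}

impl IntegerModeMemory {
    pub fn load_u8(&self, addr: u64) -> Result<u8, Fault> {
        // Simplification: the integer address is an offset from the DDC base.
        self.inner.load_u8(self.ddc, self.ddc.base.wrapping_add(addr))
    }
}
```

<p>Code written against the integer-addressed interface of layer 3 never sees a capability, yet every access it makes is still checked by layer 2.</p>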
<p>Integrating capability instructions was also simple. Two new ISA modules were created: <code>XCheri64</code> for the
    new CHERI instructions, and <code>Rv64imCapabilityMode</code> to override legacy instructions in
    capability-encoding-mode. Integer addresses were changed to capabilities throughout, memory and register file types
    were changed as described above, and the PCC/DDC were added.</p>
<p>To reduce the chances of accidentally converting integers to capabilities, the emulator defines a
    <code>SafeTaggedCap</code> type: a sum type which represents either a <code>CcxCap</code> with the tag bit set, or
    raw data with the tag bit unset. This adds type safety, as the Rust compiler forces every usage of
    <code>SafeTaggedCap</code> to consider both options, preventing raw data from being interpreted as a capability by
    accident and enforcing Provenance.
</p>
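<p>A sum type of this shape could be sketched as follows. Only the names <code>SafeTaggedCap</code> and <code>CcxCap</code> come from the text above. The variant names, the fields of the stand-in <code>CcxCap</code>, and the helper functions are guesses made for illustration.</p>

```rust
/// Stand-in for the decoded capability type from `cheri-compressed-cap`.
/// The real type has many more fields.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CcxCap {
    pub base: u64,
    pub top: u64,
    pub cursor: u64,
}

/// Either a valid (tagged) capability or 128 bits of untagged data.
/// Variant names are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SafeTaggedCap {
    ValidCap(CcxCap),
    RawData(u128),
}

#[derive(Debug, PartialEq)]
pub struct TagFault;

/// Writing an integer into a capability register yields untagged data.
pub fn from_integer(value: u64) -> SafeTaggedCap {
    SafeTaggedCap::RawData(value as u128)
}

/// Resolve the address to dereference. The compiler requires the `match`
/// to cover `RawData`, so untagged bits cannot be used as a capability
/// by accident.
pub fn checked_addr(value: SafeTaggedCap) -> Result<u64, TagFault> {
    match value {
        SafeTaggedCap::ValidCap(cap) => Ok(cap.cursor),
        SafeTaggedCap::RawData(_) => Err(TagFault),
    }
}
```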
<h2 data-number="3.2" id="emulating-vectors"><span class="header-section-number">3.2</span> Emulating vectors</h2>
<p>Vector instructions are executed by a Vector ISA module, which stores all registers and other state.
    <code>VLEN</code> is hardcoded as 128 bits, chosen because that is the width of Rust’s largest integer primitive
    (<code>u128</code>), which is large enough to hold a capability. <code>ELEN</code> is also 128 bits, which isn’t supported by the
    specification, but is required for capabilities-in-vectors.
</p>
<p>To support both CHERI and non-CHERI execution, pointers are separated into an address and a <em>provenance</em><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. The vector unit retrieves an
    address + provenance pair from the base register, generates a stream of addresses to access, then rejoins each
    address with the provenance to access memory. When using capabilities, provenance is defined in terms of the base
    register e.g. “the provenance is provided by capability register X”, or defined by the DDC in integer mode. On
    non-CHERI platforms the vector unit doesn’t check provenance.</p>
<p>The initial motivation for this project was investigating the impact of capability checks on performance. Rather than
    check each element’s access individually, we determine a set of “fast-path” checks which count as checks for
    multiple elements at once.</p>
<h2 data-number="3.3" id="fast-path-calculations"><span class="header-section-number">3.3</span> Fast-path calculations
</h2>
<p>A fast-path check can be performed over various sets of elements. The emulator chooses to perform a single fast-path
    check for each vector access, calculating the tight bounds before starting the actual access, but in hardware this
    may introduce prohibitive latency. This section describes the general principles surrounding fast-paths for
    CHERI-RVV, notes the areas where whole-access fast-paths are difficult to calculate, and describes possible
    approaches for hardware.</p>
<h3 data-number="3.3.1" id="possible-fast-path-outcomes"><span class="header-section-number">3.3.1</span> Possible
    fast-path outcomes</h3>
<p>In some cases, a failed address range check may not mean the access fails. The obvious case is fault-only-first
    loads, where capability exceptions may be handled without triggering a trap. Implementations may also choose to
    calculate wider bounds than accessed for the sake of simplicity, or even forego a fast-path check altogether. Thus,
    a fast-path check can have four outcomes depending on the circumstances.</p>
<ul>
    <li>Success - All accesses will succeed</li>
    <li>Failure - At least one access <em>will</em> raise an exception</li>
    <li>Likely-Failure <em>or</em> Unchecked - At least one access <em>may</em> raise an exception</li>
</ul>
<p>A Success means no per-access capability checks are required. Likely-Failure and Unchecked results mean each access
    must be checked, to see if any of them actually raises an exception. Unfortunately, accesses still need to be checked
    under Failure, because both precise and imprecise traps need to report the offending element in <code>vstart</code>.
    Because every kind of vector access may have Failure or Likely-Failure outcomes, and thus requires a fallback slow-path which
    checks elements individually, computing the fast-path can only be worthwhile if Success is the common case.</p>
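<p>One way to picture the outcomes is as an enum returned by a range check, as in the hedged sketch below. <code>FastPath</code>, <code>RangeCheck</code>, <code>classify</code> and <code>needs_per_element_checks</code> are hypothetical names. The <code>exact</code> flag is an assumption standing for “the checked range is exactly what will be accessed and faults cannot be suppressed”, so it would be false for fault-only-first loads or deliberately widened bounds.</p>

```rust
#[derive(Debug, PartialEq)]
pub enum FastPath {
    Success,
    Failure,
    LikelyFailure,
    Unchecked,
}

/// The address range `[lo, hi)` a fast-path check was computed over.
pub struct RangeCheck {
    pub lo: u64,
    pub hi: u64,
    /// True if the range is tight and any fault would definitely trap.
    pub exact: bool,
}

/// Classify a fast-path check against capability bounds `[base, top)`.
/// `None` models an implementation that skips the fast-path entirely.
pub fn classify(check: Option<RangeCheck>, base: u64, top: u64) -> FastPath {
    match check {
        None => FastPath::Unchecked,
        Some(c) if c.lo >= base && c.hi <= top => FastPath::Success,
        Some(c) if c.exact => FastPath::Failure,
        Some(_) => FastPath::LikelyFailure,
    }
}

/// Only Success lets the vector unit skip the slow path. Failure still
/// needs per-element checks to find the index to report in vstart.
pub fn needs_per_element_checks(outcome: &FastPath) -> bool {
    !matches!(outcome, FastPath::Success)
}
```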
<h3 data-number="3.3.2" id="m-element-known-range-fast-paths"><span class="header-section-number">3.3.2</span>
    <em>m</em>-element known-range fast-paths
</h3>
<p>A hardware implementation of a vector unit may be able to issue <em>m</em> requests within a set range in parallel.
    For example, elements in the same cache line may be accessible all at once. In these cases, checking elements
    individually would either require <em>m</em> parallel bounds checks, <em>m</em> checks’ worth of latency, or
    something in-between. In this subsection we consider a fast-path check for <em>m</em> elements for unit and strided
    accesses. Indexed addressing has very little opportunity for fast-path checking, which is discussed in the full
    dissertation.</p>
<p>We consider two approaches for these accesses. First, one could amortize the checking logic cost over multiple sets
    of <em>m</em> elements by operating in terms of cache lines. Iterating through all accessed cache lines, and then
    iterating over the elements inside, allows the fast-path to hardcode the bounds width and do one check for multiple
    cycles of work (if cache lines contain more than <em>m</em> elements). Cache-line-aligned allocations benefit here,
    as all fast-path checks will be in-bounds i.e. Successful, but misaligned data is guaranteed to create at least one
    Likely-Failure outcome per access (requiring a slow-path check). Calculating tight bounds for the <em>m</em>
    accessed elements per cycle could address this.</p>
<p>Another approach is to simply calculate the bounds occupied by <em>m</em> elements, which is simple for unit and
    strided accesses. The minimum and maximum can then be picked easily to generate tight bounds. An <em>m</em>-way
    multiplexer is still required for taking the minimum and maximum, because <code>evl</code> (the effective vector length) and <code>vstart</code>
    may not be <em>m</em>-aligned. If <em>m</em> is small, this also neatly extends to handle masked/inactive elements.
    This may use less logic overall than <em>m</em> parallel bounds checks, depending on the hardware platform<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a>, but it definitely uses more
    logic than the cache-line approach. Clearly, there’s a trade-off to be made.</p>
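<p>For unit and strided accesses, the tight bounds of a group of elements depend only on the first and last element in the group, since addresses change monotonically with the element index. The sketch below illustrates this under stated assumptions: <code>tight_bounds</code> and its parameters are hypothetical names, <code>first</code> and <code>last</code> are the group’s element range already clipped to <code>vstart</code> and <code>evl</code>, masking is ignored, and address wrap-around is not handled.</p>

```rust
/// Tight byte range `[lo, hi)` touched by elements `first..last` of a
/// strided access (unit-stride is the case `stride == eew_bytes`).
/// Returns None if the element range is empty.
pub fn tight_bounds(
    base: u64,
    stride: i64,
    eew_bytes: u64,
    first: u64,
    last: u64,
) -> Option<(u64, u64)> {
    if first >= last {
        return None;
    }
    // i128 arithmetic cannot overflow for 64-bit inputs in this sketch.
    let addr = |i: u64| base as i128 + (i as i128) * (stride as i128);
    let a = addr(first);
    let b = addr(last - 1);
    // With a negative stride the last element has the lowest address,
    // so take the min and max of the two endpoints.
    let lo = a.min(b);
    let hi = a.max(b) + eew_bytes as i128;
    Some((lo as u64, hi as u64))
}
```

<p>The resulting pair can then be compared against the capability bounds in a single check, instead of checking each of the <em>m</em> elements separately.</p>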
<h2 data-number="3.4" id="the-hypothesis-1"><span class="header-section-number">3.4</span> The Hypothesis</h2>
<p><em>The capability bounds checks for vector elements within a known range (e.g. a cache line) can be performed in a
        single check, amortizing the cost.</em></p>
<p>This is true for Successful accesses. Because the RVV spec requires that the faulting element is <em>always</em>
    recorded<span class="citation" data-cites="specification-RVV-v1.0">[5]</span>, a Failure due to a capability
    violation requires elements to be checked individually.</p>
<p>There are fast-path checks that consume less logic than <em>m</em> parallel checks for unit and strided accesses.
    Even though a slow-path is always necessary, it can be implemented in a slow way (e.g. doing one check per cycle) to
    save on logic. Particularly if other parts of the system rely on constraining the addresses accessed in each cycle,
    a fast-path check can take advantage of those constraints.</p>
<h1 data-number="4" id="the-cheri-rvv-software-stack"><span class="header-section-number">4</span> The CHERI-RVV
    software stack</h1>
<p>This chapter, being less relevant to RISE/hardware security, has been greatly condensed.</p>
<p>As part of the project, we considered how adding CHERI to RVV would affect the software stack. We tested our
    hypotheses by adding CHERI-RVV to Clang, which is the current focus for CHERI and RVV compiler development. Clang
    supports three methods of vectorization:</p>
<ol type="1">
    <li>Auto-vectorization, where the compiler converts scalar code to vector code</li>
    <li>Vector intrinsics, where the programmer writes vector code and the compiler handles low-level details
        e.g. register allocation</li>
    <li>Inline assembly, where the programmer directly describes the assembly instructions to execute.</li>
</ol>
<h2 data-number="4.1" id="our-changes-to-cheri-clang"><span class="header-section-number">4.1</span> Our changes to
    CHERI-Clang</h2>
<p>The first step was to fix up the pre-existing RVV definition in Clang to use capability registers for the base
    address. This meant CHERI-RVV assembly code could be compiled, as long as it explicitly referenced capability
    registers. Unfortunately, non-CHERI inline assembly could not be automatically compiled under this method.</p>
<p>We investigated updating vector intrinsics to support CHERI, but found the code defining the intrinsics was more
    complicated than for assembly. We believe it is possible to update the intrinsics, but it requires significant
    engineering work.</p>
<p>Clang currently supports intrinsics and inline assembly for RVV, but does not yet support auto-vectorization. This just requires
    engineering work - Arm SVE, a similar model, has great auto-vectorization.</p>
<h2 data-number="4.2" id="the-hypothesis-2"><span class="header-section-number">4.2</span> The Hypothesis</h2>
<p><em>Legacy vector code can be compiled into a pure-capability form with no changes.</em></p>
<p>This is true for CHERI-RVV, but cannot be done in practice yet. Engineering effort is required to support this in
    CHERI-Clang. Because this argument concerns source code, all three ways to generate CHERI-RVV instructions must be
    examined.</p>
<h3 data-number="4.2.1" id="inline-assembly"><span class="header-section-number">4.2.1</span> Inline assembly</h3>
<p>For GCC-style inline assembly, where register types are specified in the source code, this is impossible. Legacy
    integer-addressed RVV instructions will specify general-purpose registers for the base address, and the new
    pure-capability versions require capability registers instead. A programmer will have to change the register types
    by hand before the code compiles in pure-capability form.</p>
<h3 data-number="4.2.2" id="intrinsics"><span class="header-section-number">4.2.2</span> Intrinsics</h3>
<p>The current RVV intrinsics use pointer types for all base addresses<span class="citation" data-cites="specification-RVV-intrinsics">[11]</span>. In pure-capability compilers these pointers should be
    treated as capabilities instead of integers. All RVV memory intrinsics have equivalent RVV instructions, which all
    use capabilities in pure-capability mode, so changing the intrinsics to match is valid.</p>
<p>This is not currently the case for CHERI-Clang, as RVV memory access intrinsics are broken, but this can be fixed
    with engineering effort.</p>
<h3 data-number="4.2.3" id="auto-vectorization"><span class="header-section-number">4.2.3</span> Auto-vectorization</h3>
<p>All vanilla RVV instructions have counterparts with identical encodings and behaviour in CHERI-RVV pure-capability
    mode, assuming the base addresses can be converted to valid capabilities. Any auto-vectorized legacy code which uses
    valid base addresses can thus be converted to pure-capability CHERI-RVV code with no changes.</p>
<p>This is not currently possible for CHERI-Clang, as RVV auto-vectorization is not implemented yet.</p>
<h1 data-number="5" id="capabilities-in-vectors"><span class="header-section-number">5</span> Capabilities-in-vectors
</h1>
<p>Implementing <code>memcpy</code> correctly for CHERI systems requires copying the tag bits as well as the data, which
    means the vector registers must store the tag bits and thus store valid capabilities. <code>memcpy</code> is
    frequently vectorized, so it’s vital that CHERI-RVV can implement it correctly. Manipulating capabilities-in-vectors
    could also accelerate CHERI-specific processes, such as revoking capabilities for freed memory<span class="citation" data-cites="xiaCHERIvokeCharacterisingPointer2019">[12]</span>.</p>
<h2 data-number="5.1" id="extending-the-emulator"><span class="header-section-number">5.1</span> Extending the emulator
</h2>
<p>We defined three goals for capabilities-in-vectors:</p>
<ol type="1">
    <li>Vector registers should be able to hold capabilities.</li>
    <li>At least one vector memory operation should be able to load/store capabilities from vectors.
        <ul>
            <li>Because <code>memcpy</code> should copy both integer and capability data, vector memory operations
                should be able to handle both together.</li>
        </ul>
    </li>
    <li>Vector instructions should be able to manipulate capabilities.
        <ul>
            <li>Clearing tag bits counts as manipulation.</li>
        </ul>
    </li>
</ol>
<p>First, we considered the impact on the theoretical vector model. We decided that any operation with elements smaller
    than <code>CLEN</code> cannot output valid capabilities under any circumstances<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a>, meaning a new element width equal to <code>CLEN</code> must be
    introduced. We set <code>ELEN = VLEN = CLEN = 128</code><a href="#fn4" class="footnote-ref" id="fnref4" role="doc-noteref"><sup>4</sup></a> for our vector unit.</p>
<p>Two new memory access instructions were created to take advantage of this new element width. Similar to
    CHERI-RISC-V’s <code>LC/SC</code> instructions, we implemented 128-bit unit-stride vector loads and stores, which
    took over officially-reserved encodings for 128-bit accesses. We have not tested other types of access, but expect
    them to be noncontroversial. <!-- Indexed accesses require extra scrutiny, as they
may be expected to use 128-bit offsets on 64-bit systems. --></p>
<p>The next step was to add capability support to the vector register file. Our approach to capabilities-in-vectors is
    similar in concept to CHERI-RISC-V’s merged scalar register file, in that the same bits of a register can be
    accessed in two contexts: an integer context, which zeroes the tag, or a capability context, which maintains the current
    tag. The only instructions which can access data in a capability context are 128-bit memory accesses<a href="#fn5" class="footnote-ref" id="fnref5" role="doc-noteref"><sup>5</sup></a>. All other instructions read out untagged
    integer data and clear tags when writing data.</p>
<p>A new CHERI-specific vector register file was created, where each register is a <code>SafeTaggedCap</code>,
    i.e. either zero-tagged integer data or a valid tagged capability. This makes it much harder to accidentally violate
    Provenance, and reuses the code path (and related security properties) for accessing capabilities in memory. Just
    like scalar accesses, vectorized capability accesses are atomic and 128-bit aligned.</p>
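<p>The two access contexts can be sketched in Rust as follows. This is a minimal illustration rather than the emulator’s actual code: <code>ValidCap</code>, <code>TaggedReg</code>, <code>VectorRegFile</code> and their methods are hypothetical names, and the definition of the real <code>SafeTaggedCap</code> type is not reproduced here. The sketch assumes <code>VLEN = CLEN = 128</code> and the 32 vector registers of RVV.</p>

```rust
/// Hypothetical stand-in for a 128-bit capability that has already been validated.
/// The emulator's real capability type is not shown in this summary.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ValidCap(pub u128);

/// One 128-bit vector register: either untagged integer data or a tagged capability.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TaggedReg {
    Data(u128),
    Cap(ValidCap),
}

/// Hypothetical register file with RVV's 32 vector registers.
pub struct VectorRegFile {
    regs: [TaggedReg; 32],
}

impl VectorRegFile {
    /// All registers start as untagged zero data.
    pub fn new() -> Self {
        VectorRegFile { regs: [TaggedReg::Data(0); 32] }
    }

    /// Integer-context read: returns the raw bits and never exposes a tag.
    pub fn read_int(&self, idx: usize) -> u128 {
        match self.regs[idx] {
            TaggedReg::Data(bits) => bits,
            TaggedReg::Cap(ValidCap(bits)) => bits,
        }
    }

    /// Integer-context write: the result is always untagged data.
    pub fn write_int(&mut self, idx: usize, bits: u128) {
        self.regs[idx] = TaggedReg::Data(bits);
    }

    /// Capability-context read: keeps the current tag.
    /// In this sketch only 128-bit memory accesses would call it.
    pub fn read_cap(&self, idx: usize) -> TaggedReg {
        self.regs[idx]
    }

    /// Capability-context write: the only way a tagged value enters a register.
    pub fn write_cap(&mut self, idx: usize, value: TaggedReg) {
        self.regs[idx] = value;
    }
}
```

<p>Because <code>write_int</code> can only construct <code>TaggedReg::Data</code>, reading a capability out in an integer context and writing the same bits back leaves the register untagged, which is the behaviour described above.</p>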
<h2 data-number="5.2" id="the-hypothesis-3"><span class="header-section-number">5.2</span> The Hypothesis</h2>
<p><em>It is possible for a vector architecture to load, store, and manipulate capabilities in vector registers without
        violating CHERI security principles.</em></p>
<p>We considered this from three perspectives, checking they each fulfil Provenance, Monotonicity, and Integrity.</p>
<h3 data-number="5.2.1" id="holding-capabilities-in-vector-registers"><span class="header-section-number">5.2.1</span>
    Holding capabilities in vector registers</h3>
<p>As long as <code>VLEN = CLEN</code>, and a tag bit is stored alongside each register, a single vector register can hold a
    capability and differentiate it from integer data. One could also hold multiple capabilities in a register if
    <code>VLEN</code> were an integer multiple of <code>CLEN</code>, but this was not tested. We decided that only
    <code>CLEN</code>-width operations could produce capabilities, thus we had to ensure <code>ELEN = CLEN</code>. Our
    implementation sets <code>VLEN = ELEN = CLEN = 128</code> bits.
</p>
<p>To ensure Provenance and Monotonicity were upheld, we decided that the tag bit for a register would only be set when
    loading a valid capability from memory, and cleared in all other circumstances. Integrity is not affected by how a
    capability is stored.</p>
<h3 data-number="5.2.2" id="moving-capabilities-between-vector-registers-and-memory"><span class="header-section-number">5.2.2</span> Moving capabilities between vector registers and memory</h3>
<p>Memory access instructions must follow the same rules as scalar capability accesses. To maintain Provenance they must
    be <code>CLEN</code>-aligned, or at least only load valid capabilities from aligned addresses, because tag bits only
    apply to <code>CLEN</code>-aligned regions; and they must be atomic<span class="citation" data-cites="TR-951">[7]</span>.</p>
<p>This atomicity requirement applies to the individual element accesses within each vector access too. If multiple
    elements within a vector access try to write to the same 128-bit region non-atomically, it could result in a
    corrupted/malformed/forged capability.</p>
<p>Monotonicity is not affected by simply loading capabilities from memory or storing them to it. Integrity requires that the accesses
    themselves are checked against a valid base capability, just like normal scalar and vector accesses.</p>
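<p>The alignment rule can be illustrated with a small tagged-memory model in Rust. This is a sketch under stated assumptions, not the emulator’s implementation: <code>TaggedMemory</code> and its methods are hypothetical, one tag bit is kept per 16-byte (<code>CLEN</code>) granule, data is stored little-endian, and the bounds check against a base capability required for Integrity is left out. Atomicity is trivial here because the model is single-threaded and each capability store updates data and tag together.</p>

```rust
/// Size of one capability in bytes, assuming CLEN = 128.
const CLEN_BYTES: usize = 16;

/// Hypothetical memory model: raw bytes plus one tag bit per CLEN-aligned granule.
pub struct TaggedMemory {
    bytes: Vec<u8>,
    tags: Vec<bool>,
}

impl TaggedMemory {
    /// Creates zeroed, untagged memory. `size` must be a multiple of CLEN_BYTES.
    pub fn new(size: usize) -> Self {
        assert!(size % CLEN_BYTES == 0);
        TaggedMemory { bytes: vec![0; size], tags: vec![false; size / CLEN_BYTES] }
    }

    /// Integer store: writes the bytes and clears the tag of every granule touched.
    pub fn store_bytes(&mut self, addr: usize, data: &[u8]) -> Result<(), &'static str> {
        let end = addr.checked_add(data.len()).ok_or("address overflow")?;
        if end > self.bytes.len() {
            return Err("out of bounds");
        }
        if data.is_empty() {
            return Ok(());
        }
        self.bytes[addr..end].copy_from_slice(data);
        for granule in (addr / CLEN_BYTES)..=((end - 1) / CLEN_BYTES) {
            self.tags[granule] = false;
        }
        Ok(())
    }

    /// Capability store: must be CLEN-aligned; writes data and tag together.
    pub fn store_cap(&mut self, addr: usize, bits: u128, tag: bool) -> Result<(), &'static str> {
        if addr % CLEN_BYTES != 0 {
            return Err("misaligned capability access");
        }
        let end = addr.checked_add(CLEN_BYTES).ok_or("address overflow")?;
        if end > self.bytes.len() {
            return Err("out of bounds");
        }
        self.bytes[addr..end].copy_from_slice(&bits.to_le_bytes());
        self.tags[addr / CLEN_BYTES] = tag;
        Ok(())
    }

    /// Capability load: must be CLEN-aligned; returns the data and its tag.
    pub fn load_cap(&self, addr: usize) -> Result<(u128, bool), &'static str> {
        if addr % CLEN_BYTES != 0 {
            return Err("misaligned capability access");
        }
        let end = addr.checked_add(CLEN_BYTES).ok_or("address overflow")?;
        if end > self.bytes.len() {
            return Err("out of bounds");
        }
        let mut buf = [0u8; CLEN_BYTES];
        buf.copy_from_slice(&self.bytes[addr..end]);
        Ok((u128::from_le_bytes(buf), self.tags[addr / CLEN_BYTES]))
    }
}
```

<p>In this model a one-byte integer store into a granule that holds a capability clears its tag, so a later <code>load_cap</code> from that granule returns untagged data.</p>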
<h3 data-number="5.2.3" id="manipulating-capabilities-in-vector-registers"><span class="header-section-number">5.2.3</span> Manipulating capabilities in vector registers</h3>
<p>The emulator limits all manipulation to clearing the tag bit, achieved by writing data to the register in an integer
    context. This preserves Provenance and Monotonicity, because it’s impossible to create or change capabilities, and
    doesn’t affect Integrity.</p>
<p>In theory it’s possible to do more complex transformations, but there aren’t many scalar transformations which would
    benefit from vectorization. If more transformations are added they should be considered carefully, rather than
    creating vector equivalents for all scalar manipulations. For example, revocation as described in <span class="citation" data-cites="xiaCHERIvokeCharacterisingPointer2019">[12]</span> may benefit from a vector
    equivalent to <code>CLoadTags</code>.</p>
<h1 data-number="6" id="conclusion"><span class="header-section-number">6</span> Conclusion</h1>
<p>This project demonstrated the viability of integrating CHERI with scalable vector models by producing an example
    CHERI-RVV implementation. This required both research effort in studying the related specifications and a
    substantial implementation effort. We produced four software artifacts: a Rust wrapper for the
    <code>cheri-compressed-cap</code> C library (900 lines of code), a RISC-V emulator supporting multiple architecture
    extensions (5,300 LoC), a fork of CHERI-Clang supporting CHERI-RVV (400 changed LoC), and test programs for the
    emulator (3,000 LoC). Developing these artifacts provided enough information to draw conclusions about the initial
    hypotheses.
</p>
<p>Based on the hypotheses examined in this summary and the original dissertation, scalable vector models can be adapted
    to CHERI without significant loss of functionality. Most of the hypotheses are general enough to cover other
    scalable models, e.g. Arm SVE, but any differences from RVV’s model will require careful examination. Given the
    importance of vector processing to modern computing, and thus its importance to CHERI, we hope that this research
    paves the way for future vector-enabled CHERI processors.</p>
<h2 data-number="6.1" id="testing"><span class="header-section-number">6.1</span> Testing</h2>
<p>Alongside the theoretical hypotheses, the emulator was tested with a comprehensive set of self-checking test
    programs. RVV memory access correctness was tested by implementing functions that mimicked <code>memcpy</code> on
    integer data, under various scenarios for different instructions and addressing modes. These included unit, strided,
    and indexed addressing modes, “segmented” and “masked” accesses (explained in the dissertation), and
    fault-only-first loads. Fault-only-first loads were also tested on the boundary of mapped memory, showing they
    correctly swallowed memory access exceptions.</p>
<p>Capabilities-in-vectors had a dedicated testbench, which would attempt to <code>memcpy</code> an array of structures
    holding pointers to other structures. This had two variants. The first simply copied the data, which worked on CHERI
    and non-CHERI. The other added 0 to the data after loading it from memory, invalidating any capabilities, before
    copying it to the output. That only worked on CHERI, showing that capability manipulation was possible.</p>
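<p>The difference between the two variants can be modelled in a few lines of Rust. This is an illustrative sketch, not the test programs themselves, which run on the emulator using vector instructions. <code>Element</code>, <code>copy_preserving_tags</code> and <code>copy_via_integer_add</code> are hypothetical names, and each 128-bit element is modelled as raw bits plus a tag.</p>

```rust
/// Hypothetical model of one 128-bit element together with its tag bit.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Element {
    pub bits: u128,
    pub tag: bool,
}

/// First variant: every element is moved in a capability context, so tags survive.
pub fn copy_preserving_tags(src: &[Element], dst: &mut [Element]) {
    assert_eq!(src.len(), dst.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}

/// Second variant: adds 0 to each element in an integer context before storing it.
/// The bits are unchanged, but an integer-context write always clears the tag.
pub fn copy_via_integer_add(src: &[Element], dst: &mut [Element]) {
    assert_eq!(src.len(), dst.len());
    for (d, s) in dst.iter_mut().zip(src) {
        let result = s.bits.wrapping_add(0);
        *d = Element { bits: result, tag: false };
    }
}
```

<p>Under this model both variants produce identical bits in the output, and only the first leaves the copied capabilities tagged.</p>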
<h2 data-number="6.2" id="future-work"><span class="header-section-number">6.2</span> Future work</h2>
<p>The stated purpose of this project was to enable future implementations of CHERI-RVV and CHERI Arm SVE. We’ve shown
    this is feasible, and we believe our research is enough to create an initial CHERI-RVV specification, but both could
    benefit from more research on capabilities-in-vectors.</p>
<p>All architectures may benefit from more advanced vectorized capability manipulation. Because these processes are
    still evolving, it may be wise to standardize the first version of CHERI-RVV based on this dissertation and only add
    new instructions as required. Once created, the standard can be implemented in CHERI-Clang and added to existing
    CHERI-RISC-V processors.</p>
<p>More theoretically, other vector models could benefit from <em>dereferencing</em> capabilities-in-vectors. Arm SVE
    has addressing modes that directly use vector elements as memory references, as do its predecessors and
    contemporaries. A draft specification of CHERI-x86 is in the works<span class="citation" data-cites="TR-951">[7]</span>, and existing x86 vector models like AVX have similar features. Dereferencing capabilities-in-vectors may prove
    impractical, but this could be mitigated by e.g. replacing these addressing modes with variants of RVV’s “indexed”
    mode. Once this problem is solved, CHERI will be able to match the memory access abilities of any vector ISA it
    needs to, making it that much easier for industry to adopt CHERI in the long term.</p>

<h1 id="references">References</h1>
<div id="refs" class="references csl-bib-body" role="doc-bibliography">
    <div id="ref-TR-941" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[1] </div>
        <div class="csl-right-inline">Watson R N M, Moore S W, et al; 2019; "<em>An <span>Introduction</span> to
                <span>CHERI</span></em>". <a href="https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-941.pdf">https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-941.pdf</a>
        </div>
    </div>
    <div id="ref-szekeresSoKEternalWar2013" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[2] </div>
        <div class="csl-right-inline">Szekeres L, Payer M, et al; 2013; "<span>SoK</span>: <span>Eternal War</span> in
            <span>Memory</span>". 2013 <span>IEEE Symposium</span> on <span>Security</span> and <span>Privacy</span> pp
            48–62. <a href="https://doi.org/10.1109/SP.2013.13">https://doi.org/10.1109/SP.2013.13</a>
        </div>
    </div>
    <div id="ref-watsonCHERIHybridCapabilitySystem2015" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[3] </div>
        <div class="csl-right-inline">Watson R N M, Woodruff J, et al; 2015; "<span>CHERI</span>: <span>A Hybrid
                Capability-System Architecture</span> for <span>Scalable Software Compartmentalization</span>". 2015
            <span>IEEE Symposium</span> on <span>Security</span> and <span>Privacy</span> (<span>SP</span>) (<span>San
                Jose, CA</span>: <span>IEEE</span>) pp 20–37. <a href="https://ieeexplore.ieee.org/document/7163016/">https://ieeexplore.ieee.org/document/7163016/</a>
        </div>
    </div>
    <div id="ref-stephensARMScalableVector2017" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[4] </div>
        <div class="csl-right-inline">Stephens N, Biles S, et al; 2017; "The <span>ARM Scalable Vector
                Extension</span>". <em>IEEE Micro</em> 37 26–39. <a href="http://ieeexplore.ieee.org/document/7924233/">http://ieeexplore.ieee.org/document/7924233/</a>
        </div>
    </div>
    <div id="ref-specification-RVV-v1.0" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[5] </div>
        <div class="csl-right-inline">Anon; 2021; "<span>RISC-V</span> "<span>V</span>" <span>Vector Extension</span>".
            <a href="https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf">https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf</a>
        </div>
    </div>
    <div id="ref-woodruffCHERIConcentratePractical2019" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[6] </div>
        <div class="csl-right-inline">Woodruff J, Joannou A, et al; 2019; "<span>CHERI Concentrate</span>:
            <span>Practical Compressed Capabilities</span>". 15. <a href="https://doi.org/gm9ngf">https://doi.org/gm9ngf</a>
        </div>
    </div>
    <div id="ref-TR-951" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[7] </div>
        <div class="csl-right-inline">Watson R N M, Neumann P G, et al; 2020; "<em>Capability <span>Hardware Enhanced
                    RISC Instructions</span>: <span>CHERI Instruction-Set Architecture</span> (<span>Version</span>
                8)</em>". (<span>University of Cambridge, Computer Laboratory</span>). <a href="https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-951.html">https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-951.html</a>
        </div>
    </div>
    <div id="ref-joannouEfficientTaggedMemory2017" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[8] </div>
        <div class="csl-right-inline">Joannou A, Woodruff J, et al; 2017; "Efficient <span>Tagged Memory</span>". 2017
            <span>IEEE International Conference</span> on <span>Computer Design</span> (<span>ICCD</span>) pp 641–8. <a href="https://doi.org/ghnj26">https://doi.org/ghnj26</a>
        </div>
    </div>
    <div id="ref-armltdArmCompilerScalable2019" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[9] </div>
        <div class="csl-right-inline">Arm Ltd; 2019; "<em>Arm <span>Compiler Scalable Vector Extension User Guide
                    Version</span> 6.12</em>". <a href="https://developer.arm.com/documentation/100891/latest/">https://developer.arm.com/documentation/100891/latest/</a>
        </div>
    </div>
    <div id="ref-memarianExploringSemanticsPointer2019" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[10] </div>
        <div class="csl-right-inline">Memarian K, Gomes V B F, et al; 2019; "Exploring <span>C</span> semantics and
            pointer provenance". <em>Proc. ACM Program. Lang.</em> 3 1–32. <a href="https://dl.acm.org/doi/10.1145/3290380">https://dl.acm.org/doi/10.1145/3290380</a></div>
    </div>
    <div id="ref-specification-RVV-intrinsics" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[11] </div>
        <div class="csl-right-inline">Anon; 2021; "<em><span>RISC-V Vector Extension Intrinsics</span> (v1.0)</em>".
            (<span>RISC-V Non-ISA Specifications</span>). <a href="https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/00882f19a84ab354dc8cf6a10c100b8daa2654e4/rvv-intrinsic-api.md">https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/00882f19a84ab354dc8cf6a10c100b8daa2654e4/rvv-intrinsic-api.md</a>
        </div>
    </div>
    <div id="ref-xiaCHERIvokeCharacterisingPointer2019" class="csl-entry" role="doc-biblioentry">
        <div class="csl-left-margin">[12] </div>
        <div class="csl-right-inline">Xia H, Woodruff J, et al; 2019; "<span>CHERIvoke</span>: <span>Characterising
                Pointer Revocation</span> using <span>CHERI Capabilities</span> for <span>Temporal Memory
                Safety</span>". 14. <a href="https://doi.org/gm9ngg">https://doi.org/gm9ngg</a></div>
    </div>
</div>
<section id="footnotes" class="footnotes footnotes-end-of-document" role="doc-endnotes">
    <hr />
    <ol>
        <li id="fn1">
            <p>The “original allocation the pointer is derived from”<span class="citation" data-cites="memarianExploringSemanticsPointer2019">[10]</span>, or in CHERI terms the bounds within
                which the pointer is valid.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p>
        </li>
        <li id="fn2">
            <p>e.g. on FPGAs multiplexers can be relatively cheap.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p>
        </li>
        <li id="fn3">
            <p>This avoids edge cases with masking, where one part of a capability could be modified while the other
                parts are left alone.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p>
        </li>
        <li id="fn4">
            <p>The tag bits are included implicitly rather than explicitly here because <code>VLEN</code> and <code>ELEN</code> must be
                powers of two.<a href="#fnref4" class="footnote-back" role="doc-backlink">↩︎</a></p>
        </li>
        <li id="fn5">
            <p>The encoding mode does not affect register usage: when using the Integer encoding mode, instructions can
                still access the vector registers in a capability context. This is just like how scalar capability
                registers are still accessible in Integer encoding mode.<a href="#fnref5" class="footnote-back" role="doc-backlink">↩︎</a></p>
        </li>
    </ol>
</section>]]></content><author><name>Samuel W. Stark</name></author><category term="CHERI" /><category term="RISC-V" /><category term="RISC-V V" /><category term="cheri" /><category term="rvv" /><category term="risc-v" /><summary type="html"><![CDATA[This is a condensed version of my master’s thesis (from 15,000 to 5,000 words), which won the RISE 2022 Student competition! I also submitted a 5-minute presentation, which is available on YouTube and the project summary page.]]></summary></entry></feed>