Title : Examining Effective Design Patterns for Concurrent Game Technologies
Author : Kieran Osborne
Affiliation : University of Derby
Email : k.osborne2@unimail.derby.ac.uk
Bibliography: example.bib
Logo : False
[TITLE]
~ Abstract
Within the last two decades, computer hardware has shifted from single-core performance driven by clock speed to multi-core architectures concerned with distributing latency across smaller, concurrent workloads. In the domain of software, many tools have not embraced these new architectural constraints. Further, those that have often do so through highly specialised tools.
This study focuses on identifying the re-usable patterns that emerge from various software tools in order to propose a better, concurrency-focused model for developing real-time software, and on testing its performance characteristics against existing tools within the same market.
~
&pagebreak;
[TOC]
&pagebreak;
# Acknowledgments {-}
In the course of writing this dissertation I have received assistance and support throughout. I would like to thank my supervisor, Dr. C. Windmill, whose expertise and passion surrounding the Cell Broadband Engine Architecture helped to identify the core drive for this research project.
More personally, I would like to thank my mother for her support throughout my studies and Maria for her patience while I have undertaken both this project and the final year of my undergraduate degree.
&pagebreak;
# Introduction { #sec-introduction }
Over the last twenty years, hardware-assisted parallelism and concurrency have become the new frontier for low-latency software engineering. While the transition to a multi-core paradigm has produced some new best practices and design patterns, the problem of building re-usable software that can efficiently exploit the gains of parallel execution in an easy and modular manner is still largely unsolved.
The goal of this study is to examine existing literature in the field of parallel computation for real-time simulations. Upon completion of a thorough literature survey, a framework for effective parallel computation - along with a prototype implementation - is identified. The objective of the produced artefact is to propose a simple and reusable framework for building parallel software in the domain of latency-constrained applications, such as a game or real-time simulation. In doing so, this study plans to identify a model for computation that is easily scalable within the needs of a real-time application.
In development of the prototype, this study aims to demonstrate that parallel computation is achievable in a trivial manner within the domain of computer games programming via an asynchronous systems-oriented approach to game development. The prototype implementation intends to be measurably more performant than competing game engines and tooling already present in the market today, providing itself as a model from which more involved tooling may be built.
This study does not intend to produce fully-featured game engine tooling. Instead, the priority of development focuses on two-dimensional texture rendering and data passing between systems of interaction - fundamentals of real-time two-dimensional games programming. That being said, the architectural principles of this project should transfer to three-dimensional and virtual reality experiences as well.
&pagebreak;
# Terminology { #sec-terminology }
Language surrounding multi-core programming is not well defined as of this writing, with many words adopting multiple meanings; take "thread", for example, which may refer to hardware threads, software threads, or both [@bell_2021]. This paper opts to use "thread" when referring to software threads. Any reference to hardware threads will name them explicitly or use their implementation-specific terminology.
Concepts outlined in the following sections are generalised to apply to many computer processor architectures that align with the well-established conventions of parallel computation [@randell_2013]. In particular, this paper examines architectures that support distributed processing elements with layered memory hierarchies, such as x86/x86-64, ARM, and PowerPC.
Unlike threads, the terminology established for concurrent and parallel execution is far better defined. Fundamentally, concurrency is about performing multiple operations simultaneously, while parallelism concerns the splitting of divisible operations into smaller units of computation to be processed alongside each other [@jenkov_2021].
This paper is foremost interested in the concepts and problems that belong to parallel computation. That being said, concurrent computation will be referred to when examining the wider context of multi-processing.
&pagebreak;
# Literature Review { #sec-literature-review }
Prior to the discussion of the project artefact, an analysis of existing literature and systems is needed to assess software interactions, proposed solutions, and open issues. Foremost, this paper is concerned with the architecture of the hardware platform and how its evolution has changed the way that software is conceptualised and realised.
## Atomics { #sec-atomics }
Atomics are one such feature of hardware that interacts with whole systems. As modern hardware can divide simple operations like arithmetic into multiple stages, these instructions are not guaranteed to complete before something else begins to operate on data [@williams_2021]. Under these circumstances, two parallel actions enter a contest in accessing the data, which produces an ill-defined result.
It is the case that most in-memory data operations are not guaranteed to occur in a single action on modern architectures [@williams_2021]. For example, incrementing integers on x86 first requires loading the operand value into a register, writing to it, then saving the altered state back to memory [@cmpxchg_2021]. While this operation may appear to all happen in a single instruction, the CPU will internally compute it in multiple steps.
~ Figure { #fig-atomic-exchange; caption:"Atomic compare-and-exchange under Intel x86 assembly."; page-align:top}
```
lock cmpxchg source, eax
```
~
An atomic operation is a hardware-level assurance that an instruction is indivisible in its execution [@lock_free_2019]. Because atomic operations are practically instantaneous, actions brokered through them do not need to protect against concurrent read-write operations [@williams_2021]. Such erroneous read-write operations are known as "data races" or "race conditions". Typically, atomic instructions use a single instruction with a locking modifier [@cmpxchg_2021] - as with figure [#fig-atomic-exchange] - however, the exact implementation details can vary wildly between processor architecture and software platform. For example, the C++ standard specifies `std::atomic` as using software-assisted locks when there is no intrinsic hardware support for atomic operations on a given type [@stdatomic_2020].
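Figure [#fig-atomic-increment-sketch] offers a minimal sketch of this distinction in C++ (assuming a C++11 or later toolchain; the counters, thread count, and iteration count are illustrative only), contrasting a plain increment, which is divisible and therefore racy, with an `std::atomic` increment.
~ Figure { #fig-atomic-increment-sketch; caption:"Illustrative sketch contrasting a plain increment with an atomic increment in C++."; page-align:top}
```cpp
#include <atomic>
#include <cstdio>
#include <thread>

static int plainCounter = 0;               // Divisible read-modify-write: subject to a data race.
static std::atomic<int> atomicCounter{0};  // Indivisible increments, no software lock required.

int main() {
    auto worker = [] {
        for (int i = 0; i < 100000; i += 1) {
            plainCounter += 1;  // Load, add, store as separate steps.
            atomicCounter.fetch_add(1, std::memory_order_relaxed);  // Single atomic step.
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    // plainCounter may end up anywhere below 200000; atomicCounter is exactly 200000.
    std::printf("%d %d\n", plainCounter, atomicCounter.load());
}
```
~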
&pagebreak;
## Memory Hierarchy { #sec-memory-hierarchy }
~ Figure { #fig-memory-hierarchy; caption:"Abstract representation of a typical processor memory hierarchy."; page-align:top}
![MemoryHierachy]
[MemoryHierachy]: images/MemoryHierarchy.png "Memory Hierarchy" { width:auto; max-width:30% }
~
The memory hierarchy is a conceptual representation of the data storage layers that a processor may access [@shanthi_2021]. Figure [#fig-memory-hierarchy] is organised visually by response latency, with the top-most being the quickest and the bottom-most being the slowest. In the case of x86/x86-64, ARM, and PowerPC, registers are the quickest storage devices to access, while disk storage presents the highest latency.
> Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, for each of which has greater capacity than the preceding but which is less quickly accessible. - A. W. Burks, H. H. Goldstine, J. von Neumann [@randell_2013]
As explained by Burks, Goldstine, and von Neumann, the need to split memory into a layered hierarchy is derived from the physical limitations of how quickly light can move around the processor [@randell_2013]. Access speed negatively scales with the size of a memory region, resulting in the need to segment memory into a prioritised hierarchy [@randell_2013].
Under these constraints, the second-most ideal scenario would be to perform all computation within the confines of registers. While registers are utilised constantly by software, they are far too small to be used for anything more than temporary transfer or transmutation storage. In programming languages targeting the x86/x86-64 architecture, registers see frequent usage in moving data to and from procedures [@fog_2021].
Beyond registers, the only other processor-bound storage is the memory cache. Following the memory hierarchy model established in figure [#fig-memory-hierarchy], cache operates as an intermediary buffer between the CPU and main memory [@shanthi_2021]. Before the CPU reaches out to main memory for data, it will first check that said data isn't already in the cache, thereby potentially avoiding a high-latency memory transaction [@shanthi_2021].
~ Figure { #fig-cache-hierarchy; caption:"Abstract representation of different CPU cache levels."; page-align:top}
![CacheLevels]
[CacheLevels]: images/CacheLevels.png "Cache Levels" { width:auto; max-width:90% }
~
Due to the ever-battling demands of storage capacity versus access speed, cache has its own multi-level hierarchy on most modern architectures. Figure [#fig-cache-hierarchy] represents the most common cache configuration at the moment, with the outer-most third layer shared between multiple cores [@drepper_2007]. While not directly controllable from software, many established practices can be employed to encourage better utilisation of the cache [@drepper_2007].
&pagebreak;
### Data Contiguity { #sec-data-contiguity }
The object array is a prevalent design pattern of object-oriented software development. The practice involves storing an array of references to objects allocated elsewhere in memory for iteration and manipulation [@fabian_2018].
~ Figure { #fig-cache-inefficient-java; caption:"Cache-inefficient java software using an object array"; page-align:top}
```java
class Foo {
    private int bar = 0;

    public static void compute(Foo[] foos) {
        for (int i = 0; i < foos.length; i += 1) {
            foos[i].bar += i;
        }
    }
}
```
~
~ Figure { #fig-object-array-layout; caption:"Example of an object array layout in memory."; page-align:top}
![CacheIneffieicnt]
[CacheIneffieicnt]: images/CacheInefficient.png "Cache Inefficient" { width:auto; max-width:50% }
~
Within the concerns of cache efficiency, such code can be a burden to the CPU. Due to the sparse clustering of the data set, the CPU is encouraged to repeatedly call out from cache and into main memory to find the relevant data [@drepper_2007]. In a scenario where the data depicted in figure [#fig-cache-inefficient-java] is not readily available to the CPU cache, a read operation can balloon from the range of `50` cycles to upwards of `200` [@terman_2018]. Platforms like the Java Virtual Machine employ escape analysis to move short-lived objects onto the stack for both cache and allocation efficiency [@mirek_2017]; however, this has little benefit on persistent data structures like the array in figure [#fig-cache-inefficient-java], due to the memory layout depicted in figure [#fig-object-array-layout].
~ Figure { #fig-cache-efficient-c; caption:"Cache-efficient C software using a packed array."; page-align:top}
```cpp
struct Foo {
    int32_t bar;
};

void compute(size_t fooCount, Foo * foos) {
    for (size_t i = 0; i < fooCount; i += 1) {
        foos[i].bar += i;
    }
}
```
~
~ Figure { #fig-packed-array-layout; caption:"Example of a packed array layout in memory."; page-align:top}
![CacheEfficient]
[CacheEfficient]: images/CacheEfficient.png "Cache Efficient" { width:auto; max-width:50% }
~
In languages like C, where the data model allows aggregates to be stored directly within an array, the CPU has the working data set locally available, encouraging far better utilisation of cache memory [@drepper_2007].
&pagebreak;
### Memory Alignment { #sec-memory-alignment }
~ Figure { #fig-poorly-aligned-c; caption:"Alignment-inefficient C structure type."; page-align:top}
```cpp
struct Foo {
    int32_t bar;
    uint8_t buffer;
};
```
~
~ Figure { #fig-alignment-gaps; caption:"Example of struct alignment under x86-64."; page-align:top}
![AlignmentIneffecient]
[AlignmentIneffecient]: images/AlignmentInefficient.png "Alignment Inefficiency" { width:auto; max-width:30% }
~
Unlike data contiguity, memory alignment is significantly more difficult to solve in existing, popular programming languages. Consider the example program in figure [#fig-cache-efficient-c] using the structure type in figure [#fig-poorly-aligned-c]. Implementations of the C programming language standard, namely those targeting Intel and AMD CPUs, will insert additional padding to align the memory footprint of structures to a power of `2` [@naik_2013]. Memory alignment scenarios, like the one depicted in figure [#fig-alignment-gaps], can result in sub-optimal cache throughput as the padding introduces gaps in the data set.
Simply packing data into a more compact representation will not improve cache utilisation, as x86 fetches data in powers of `2` [@naik_2013] as well. Ergo, data that cannot fit on the current cache line due to alignment issues will be deferred to the next fetch [@naik_2013], resulting in alignment problems in the cache.
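As a minimal sketch of how this padding manifests (the exact figures assume a typical x86-64 C++ toolchain and may differ elsewhere), figure [#fig-alignment-sizeof-sketch] uses the `sizeof` and `alignof` operators to expose the padding inserted into the structure from figure [#fig-poorly-aligned-c].
~ Figure { #fig-alignment-sizeof-sketch; caption:"Illustrative sketch exposing structure padding with `sizeof` and `alignof`."; page-align:top}
```cpp
#include <cstdint>
#include <cstdio>

struct Foo {
    int32_t bar;     // 4 bytes.
    uint8_t buffer;  // 1 byte.
};

int main() {
    // On a typical x86-64 compiler this prints "size: 8, alignment: 4",
    // revealing 3 bytes of trailing padding inserted after `buffer`.
    std::printf("size: %zu, alignment: %zu\n", sizeof(Foo), alignof(Foo));
}
```
~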
The Zig programming language is currently experimenting with zero-guarantee memory layouts for structures [@zigdoc_2021]. Avoiding layout standardisation gives the compiler greater freedom during optimisation stages. Regardless, this is not a silver bullet, as data cannot become aligned to a power of `2` without either truncating or padding its memory.
In situations where maximum throughput is a must, programmers are encouraged to refactor their data into smaller, distinct arrays, with each one specialised to the current working set. This approach to data composition is commonly known as "structure of arrays", the inverse of the "array of structures" composition technique [@fabian_2018].
~ Figure { #fig-array-of-structures; caption:"Array of structure composition in C."; page-align:top}
```cpp
struct Vector2 {
    float x, y;
};

struct Invader {
    Vector2 position;
    int32_t health;
};

void updateInvaders(size_t invaderCount, Invader * invaders) {
    for (size_t i = 0; i < invaderCount; i += 1) {
        renderInvader(invaders[i].position);

        if (invaders[i].health <= 0) {
            markForRemoval(i);
        }
    }
}
```
~
~ Figure { #fig-array-of-structures-diagram; caption:"Invaders \"array of structures\" composition."; page-align:top}
![ArrayOfStructures]
[ArrayOfStructures]: images/ArrayOfStructures.png "Array of Structures" { width:auto; max-width:90% }
~
Consider that the `Invader` type in figure [#fig-array-of-structures] is composed of both an `8` byte `Vector2` type and a `4` byte `int32_t` type. On x86, the compiler will align `Invader` to `16` bytes unless prompted otherwise, meaning each element receives `4` bytes of padding, as depicted in figure [#fig-array-of-structures-diagram].
~ Figure { #fig-structure-of-arrays; caption:"Structure of arrays composition in C."; page-align:top}
```cpp
struct Vector2 {
    float x, y;
};

void updateInvaders(
    size_t invaderCount,
    Vector2 * invaderPositions,
    int32_t * invaderHealths
) {
    for (size_t i = 0; i < invaderCount; i += 1) {
        renderInvader(invaderPositions[i]);
    }

    for (size_t i = 0; i < invaderCount; i += 1) {
        if (invaderHealths[i] <= 0) {
            markForRemoval(i);
        }
    }
}
```
~
~ Figure { #fig-structures-of-arrays-diagram; caption:"Invaders \"structure of arrays\" composition."; page-align:top}
![StructureOfArrays]
[StructureOfArrays]: images/StructureOfArrays.png "Structure of Arrays" { width:auto; max-width:90% }
~
By decomposing the `Invader` structure into its position and health components, as shown in figure [#fig-structures-of-arrays-diagram], each data set now aligns to powers of 2 in figure [#fig-structure-of-arrays]. An improvement in cache throughput occurs even though the processor now has to perform an additional looping operation, due to how the CPU fetches and pre-fetches data and instructions. Moreover, decomposition of the structure opens up new future optimisations, as neither loop depends on the state of the other, thereby allowing them to be processed in parallel.
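As a hedged sketch of that final point (using `std::async`, discussed further in section [#sec-cpp-asynchronous-execution], purely for illustration; the rendering and removal routines are hypothetical stand-ins), figure [#fig-parallel-soa-sketch] dispatches the two independent loops from figure [#fig-structure-of-arrays] as concurrent tasks.
~ Figure { #fig-parallel-soa-sketch; caption:"Illustrative sketch running the decomposed invader loops in parallel."; page-align:top}
```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <future>
#include <vector>

struct Vector2 { float x, y; };

// Hypothetical stand-ins for the routines assumed by figure [#fig-structure-of-arrays].
void renderInvader(Vector2 position) { std::printf("render %f, %f\n", position.x, position.y); }
void markForRemoval(std::size_t index) { std::printf("remove %zu\n", index); }

void updateInvaders(
    std::vector<Vector2> const & invaderPositions,
    std::vector<int32_t> const & invaderHealths
) {
    // Neither loop touches the other's data set, so each may run as its own task.
    auto rendering = std::async(std::launch::async, [&] {
        for (std::size_t i = 0; i < invaderPositions.size(); i += 1) {
            renderInvader(invaderPositions[i]);
        }
    });
    auto culling = std::async(std::launch::async, [&] {
        for (std::size_t i = 0; i < invaderHealths.size(); i += 1) {
            if (invaderHealths[i] <= 0) {
                markForRemoval(i);
            }
        }
    });
    rendering.get();
    culling.get();
}
```
~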
&pagebreak;
## Instruction Pipeline { #sec-instruction-pipeline }
Alongside the memory cache, processors also host instruction pipelines. Many modern CPUs facilitate automatic concurrency at the instruction level through the batch processing of multiple instructions at once, referred to as "pipelining" [@alglave_2009;@sutter_2007].
~ Figure { #fig-sequential-pipeline; caption:"Sequential instruction pipeline across 5 different cycle stages."; page-align:top}
![SequentialPipeline]
[SequentialPipeline]: images/SequentialPipeline.png "Sequential instructions" { width:auto; max-width:40% }
~
Under the naive model of instruction processing represented by figure [#fig-sequential-pipeline], only a single instruction may occupy the fetch-decode-execute pipeline at a time. In reality, this has not been the case for practical implementations of the von Neumann architecture for over forty years now [@shinsel_2021].
~ Figure { #fig-concurrent-pipeline; caption:"Concurrent instruction pipeline across 5 different cycle stages."; page-align:top}
![ConcurrentPipeline]
[ConcurrentPipeline]: images/ConcurrentPipeline.png "Parallel instructions" { width:auto; max-width:40% }
~
The instruction pipeline is composed of both a front-end - responsible for fetching and decoding instructions - and a back-end, overseeing execution and retirement of instructions. Figure [#fig-concurrent-pipeline] demonstrates how four instructions, each prefixed with "op" and identified by a unique number, make their way through the pipeline across different cycle stages. As more instructions begin to saturate the pipeline, more of them may be processed concurrently by the processing element.
Note that, while figure [#fig-concurrent-pipeline] presents a pipeline as having four stages, modern processor pipelines can actually have any number of internal stages [@sutter_2007].
> You can always buy more bandwidth... but you can't count on buying your way out of a latency problem. - H. Sutter [@sutter_2007]
Hardware initialisation isn't the only time that the instruction pipeline can be empty. Similar to CPU-bound memory caches, numerous programming practices can introduce stalls in the pipelining of instructions [@shinsel_2021]. H. Sutter, a lead software architect at Microsoft and convener of the ISO C++ committee, spoke at the 2007 Northwest C++ Users' Group conference to present issues he saw with modern software tooling and how it abstracts away hardware. Sutter's salient point was how difficult it is to reason about what work a processor is doing in modern C++ [@sutter_2007].
There are a variety of pipelining hazards that software can silently introduce as part of its regular flow control and data manipulation, but the majority of them fall into one of three categories.
### Structural Hazards
Structural hazards occur when two or more instructions need access to the same resource. Cross-instruction dependencies typically result in a pipeline stall as dependent instructions have to operate sequentially rather than concurrently [@prabhu_2021].
Additionally, parallel operations on resources shared across multiple cores can also increase the likelihood of structural hazards occurring, by increasing the number of operations the CPU must pipeline [@prabhu_2021].
### Control Hazards
Sometimes referred to as "branch hazards", control hazards are products of misprediction by the processor when performing decision-based branching operations [@prabhu_2021;@naik_2013]. Control hazards cause pre-fetched instructions to be considered invalid and sometimes must flush the entire pipeline to begin again [@prabhu_2021;@naik_2013].
### Data Hazards
Similar to the inter-instructional dependencies presented by structural hazards, data hazards introduce bottlenecks by making instructions rely on data computed by prior instructions [@prabhu_2021].
### Hazard Avoidance
While case-specific, there are some established practices for avoiding the introduction of hazards into the instruction pipeline.
* Avoid branching logic where possible to reduce control hazards [@fabian_2018].
* Introduce padding around instructions in the pipeline that can produce structural hazards [@cheng_2013].
Some degree of pipelining hazard is inherent to every software system, as hazards are nearly impossible to reason about without sufficient profiling tools [@sutter_2007]. The focus should not be on avoiding them entirely, but rather on keeping track of them and identifying case-specific practices that help to minimise them.
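As a small illustration of the first practice listed above (a hedged sketch only; whether the compiler emits a branch or a conditional move depends on the toolchain and optimisation level), figure [#fig-branchless-sketch] rewrites a data-dependent branch as branch-free arithmetic.
~ Figure { #fig-branchless-sketch; caption:"Illustrative sketch replacing a branch with branch-free arithmetic."; page-align:top}
```cpp
#include <cstdint>

// Branching version: a mispredicted comparison can force pre-fetched
// instructions to be discarded.
int32_t clampToZeroBranching(int32_t value) {
    if (value < 0) {
        return 0;
    }
    return value;
}

// Branch-free version: the mask is computed unconditionally, leaving nothing
// for the branch predictor to guess at. Relies on arithmetic right shift of
// negative values, which the architectures discussed here provide.
int32_t clampToZeroBranchless(int32_t value) {
    int32_t const mask = ~(value >> 31);  // All zeroes when negative, all ones otherwise.
    return value & mask;
}
```
~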
&pagebreak;
## Hardware { #sec-hardware }
While it is true that many of the complexities associated with software development on modern computing hardware are due to the aforementioned memory hierarchies and instruction pipelines [@alglave_2009], they are not without good reason. Over the last four decades, increases in both CPU clock speed and performance per clock have slowed to a halt [@sutter_2005]. Processor manufacturers arrived at the physical limitations of single-core computation at the turn of the twenty-first century, with complications surrounding heat production and power consumption currently prohibiting the fabrication of anything practical that breaks the 5 GHz boundary [@sutter_2005].
Hardware vendors have made many efforts to improve latency in single-core programs via instruction pipelining and vectorisation [@alglave_2009]. However, newer multi-core processors have made it a requirement for software to distribute its workload across more than one core [@sutter_2005]. One alternative way forward is to take advantage of other heuristics inherent in modern processors, such as CPU-bound memory caches, speculative execution, and multi-core processing.
### Transition to Multi-Core { #sec-transition-to-multicore }
H. Sutter, who was discussed earlier in section [#sec-instruction-pipeline], authored a 2005 article for Dr Dobb's Journal analysing the then-current trends of computer hardware and the wider programming industry. In the article, Sutter posited that the software industry would change drastically over the following decade as the world quickly approached the physical limitations of raw single-core performance [@sutter_2005]. Sutter also levelled many critiques at a perceived mentality of the software development industry at the turn of the new millennium, making an open call for a change in how software was built [@sutter_2005].
It was not until a further article published in 2005 by Sutter, together with J. Larus, that Sutter re-evaluated his perspective on concurrent software development [@sutter_larus_2005].
> Although sequential programming is hard, concurrent programming is demonstrably more difficult. - H. Sutter, J. Larus [@sutter_larus_2005]
Sutter and Larus go on to explain how parallelism for server-side applications is a "mostly solved issue", as the operations of a server-side application are inherently isolated and rarely rely on shared state within their logic [@sutter_larus_2005]. Because of these architectural constraints, server-side applications are not a well-formed model for comparing against client-side applications. Instead, Sutter and Larus suggested focusing on gains through "granular parallelism", a form of parallelism that focuses on the division of a workload into smaller pieces so it may be processed in parallel [@sutter_larus_2005]. While simple to conceptualise and implement, this model quickly breaks down for data sets that contain interdependencies, as the order of computation prevents efficient parallelism.
~ Figure { #fig-granular-parallelism; caption:"Granular parallelism with data cross-dependencies [@sutter_larus_2005]."; page-align:top}
```cpp
A[i, j] =
(A[i - 1, j] + A[i, j - 1] + A[i + 1, j] + A[i, j + 1]) / 4;
```
~
Figure [#fig-granular-parallelism] is provided by Sutter and Larus in the article as an example of cross-dependencies in a data set constraining parallel computation [@sutter_larus_2005]. While the problem remains solvable, the complexity increases, and so too does the cost of maintaining the software implementation.
Sutter continued to codify his conceptual model for multi-core computation and later produced a further article. Unlike Sutter's previous works, this article focused on standardising a way of conveying parallel concepts [@sutter2_2007]. Throughout the publication, Sutter describes his mental model and places each understood problem posed by parallelism under one of three categories.
#### Consistency via Safely Shared Resources
Avoids race conditions when reading from and writing to shared, mutable state by gaining mutually exclusive access to it [@sutter2_2007].
Aggressive synchronisation is a popular but often-naive approach to concurrency, as it can easily introduce bottlenecks within a system. Despite this, when used effectively and in indivisible operations, such as atomic reference counting, it can be the only way to achieve working parallelism [@lock_free_2019].
#### Responsiveness and Isolation via Asynchronous Agents
Maximises responsiveness by avoiding blocking operations and decoupling systems to run independently from each other. Any communication that needs to occur happens over a minimal and restricted message-passing interface [@sutter2_2007].
Real-world examples of this in practice include application user interfaces, web services, and background processes that communicate over a separate front-end.
#### Throughput and Scalability via Concurrent Collections
Enables higher throughput of data processing in existing systems by breaking large data sets into smaller components of computation and running them in parallel across multiple cores to reach the solution faster [@sutter2_2007].
Unlike asynchronous agents, Sutter describes this approach as more of an optimisation technique than a complete architecture for building new, large-scale systems.
#### Pillars of Concurrency
As referred to in the paper by Sutter, these "pillars of concurrency" should be composed together to build scalable solutions rather than treated as tightly encapsulated, holistic solutions to the performance concerns of an entire software application [@sutter2_2007].
Sutter et al have done the important groundwork for developing a common conceptual basis for concurrency. That being said, the hypothesises made across the discussed papers and articles are all heavily centred around traditional desktop computing interfaces. It is the case that embedded hardware solutions, like that of games consoles, make up the majority of the consumer video games market [@clement_2021].
&pagebreak;
### Parallel Hardware in Practice { #sec-parallel-hardware-in-practice }
Starting in 2001, an alliance involving Sony, Toshiba, and IBM began the development of a new processor based on the PowerPC architecture named the "Cell Broadband Engine Architecture", often shortened to just "CBEA" or "Cell" [@kahle_day_hofstee_johns_shippy_2005]. Unlike architectures competing in the same market, Cell is composed of a single PowerPC core for general-purpose processing. Alongside this primary core are a variable number of "synergistic processing elements", often abbreviated to "SPE", that handle specialised computations of floating-point values [@kahle_day_hofstee_johns_shippy_2005].
The design philosophy of low-latency, high-bandwidth data throughput made Cell the processor of choice for the Sony PlayStation 3 [@kahle_day_hofstee_johns_shippy_2005;@maragos_2005]. Though, despite being promoted for use in the Sony games console, the Cell architecture later found further application in the IBM Roadrunner supercomputer, being responsible for the first demonstration of a system capable of computing 1.0 petaflops at a sustained rate [@gaudin_2008].
Despite its potential, however, the design of Cell has made it very difficult to attract programmers and even harder for the programmers that it did attract to make full use of the hardware [@krishnalal_2017]. Software applications that do not make use of Cell hardware features perform worse when compared to the same software targeted for, and running on, dual-core architectures like the Xbox PowerPC processor and general-purpose x86 processors [@turley_2021].
&pagebreak;
### Non-Uniform Memory Access Architectures { #sec-non-uniform-memory-access-architectures }
Architectures like Cell have also been traditionally difficult to develop for due to hardware design choices that propagate into the implementations of software targeted for them [@turley_2021]. Efficient saturation of the SPE cores is required to get the most out of Cell [@krishnalal_2017], whereas Intel x86 Hyper-Threading is an automatic instruction scheduling optimisation performed by the processor on behalf of the program [@hyperthreading_2021].
Floating-point computations are also more costly on Cell by default, a consequence of its simpler core architecture [@krishnalal_2017]. To efficiently perform floating-point computations on Cell, the task of vectorising sequential operations is mostly left in the hands of the programmer [@alimemon_amin_2009]. Typically, this involves batching all necessary data and streaming it to the various cores of the SPE array for the result to be computed and returned asynchronously [@alimemon_amin_2009]. Asynchronous data streaming entails a few complications and is made more difficult by Cell not having a uniform memory layout [@turley_2021].
Modern CPU architectures, such as the new generation of AMD Zen chipsets, also make use of non-uniform memory access - or NUMA for short [@amdnuma_2018]. Recent AMD CPUs have featured non-uniform pathways to dedicated memory hardware such that, depending on the core that is processing the task, preferential treatment is given to cold memory accesses [@amdnuma_2018]. More recent implementations of Zen have featured "Dynamic Local Mode", which gives preferential treatment to processes based on the number of cores they use [@amddlm2_2018]. While Dynamic Local Mode can assist existing software, it forgoes the scalability that NUMA offers in doing so.
The growing presence of NUMA on the desktop was investigated in a 2011 study by L. Bergstrom, which compared the performance characteristics of newer processors against the STREAM benchmark [@bergstrom_2011]. Bergstrom found that the effects of NUMA were far more profound on high-end AMD processors compared to high-end Intel ones [@bergstrom_2011]. Additionally, Bergstrom recommended that software which intends to make full usage of NUMA-backed architectures should lean heavily on its CPU-local caches, rather than relying on shared memory access [@bergstrom_2011]. While heavy cache utilisation has been a staple of low-latency programming on x86 for many years now, the shift to GPU-like streaming architectures similar to Zen makes it more integral to the development of consistently fast software on the desktop [@amdnuma_2018].
&pagebreak;
### Memory Locality { #sec-memory-locality }
Beyond NUMA, accessing main memory on modern desktop hardware still has a high cost associated with it [@memperf_2015]. Modern processor clock speeds have diverged from main memory read speeds appreciably over the last forty years, encouraging more use of the multiple memory cache layers built into the processors of today [@memperf_2015]. While the cost of floating-point operations has reduced drastically over the last four decades, main memory access speeds have seen a much more tepid improvement [@sutter_2007].
Schoene, Hackenberg, and Molka explored the divergence of clock and access speeds in a study that tested how much bandwidth was affected by a reduction in processor clock speed. Schoene et al. developed highly optimised assembler and C code which was executed in parallel to equally saturate each core [@schoene_hackenberg_molka_2012]. Benchmarks tested the rate at which data could be streamed from dedicated memory at different voltage frequencies to analyse changes in bandwidth performance [@schoene_hackenberg_molka_2012]. The study found that, at relative clock frequencies ranging from `1.0` down to `0.7`, the majority of processors tested experienced minimal or no drop in main memory bandwidth performance [@schoene_hackenberg_molka_2012]. Meanwhile, the drop-off in L3 cache bandwidth performance with most processors happened earlier at a relative clock frequency of `0.8` [@schoene_hackenberg_molka_2012].
In the main memory bandwidth benchmarks detailed by Schoene et al., CPUs with notably divergent test results included AMD Magny-Cours, AMD Istanbul, AMD Interlagos, and Intel Sandy Bridge-EP [@schoene_hackenberg_molka_2012]. Each of these four processors was designed for use in servers and/or workstations, with Istanbul and Magny-Cours part of the Opteron line, Interlagos part of the Bulldozer line, and Sandy Bridge-EP a member of the Sandy Bridge-E line of products. Moreover, these results align with the conclusions made by L. Bergstrom in his earlier-mentioned analysis of NUMA-enabled AMD cores making greater use of streaming than Intel ones [@bergstrom_2011].
As discussed earlier in section [#sec-memory-hierarchy], the expected wait time for main memory access is around `200` cycles [@terman_2018]. Compared to the typical cycle cost of accessing a CPU register, this is 200 times slower [@terman_2018]. An overhead of such magnitude places additional strain on the requirements of both the programmer and compiler to produce software that minimises the frequency that main memory needs to be accessed by the CPU, as such an indirection can create significant bottlenecks in the pipelining of instructions [@prabhu_2021].
&pagebreak;
### Complexity { #sec-complexity }
Managing the expectations of the CPU is complex enough within serial software but quickly scales in complexity within parallel systems [@sutter_larus_2005]. Consider a program that must transform a set of fragmented object-oriented data into a processor-optimised format for operation in software; such parallelisable bulk data transformations will experience an increase in pipelining hazards as well as throughput. Alongside this, some latency problems are simply not solvable by parallelising them, as discussed in section [#sec-transition-to-multicore]. Refactoring software to support parallel computation based on assumed gains is fruitless if no empirical latency improvement can be proven. Ideally, the theoretical impact of parallelism in a software system would be mathematically provable.
While no single proof exists for deriving all characteristics of a concrete software system, many individual principles exist to help inform time investment in optimisation through parallel execution [@ashcroft_1975]. One such principle is a formula described by Gene Amdahl and later codified as "Amdahl's Law", an approach for deriving the theoretical limit in execution time reduction after splitting a parallelisable workload across units of parallel computation [@amdahl_1967]. The fundamental argument of Amdahl's Law states that a system can become no faster than the time it takes its non-parallelisable components to process [@amdahl_1967].
~ Figure { #fig-amdahls-law; caption:"Depiction of process time before and after it is parallelised across three processing elements."; page-align:top}
![AmdahlsLaw]
[AmdahlsLaw]: images/AmdahlsLaw.png "Amdahl's Law" { width:auto; max-width:90% }
~
Figure [#fig-amdahls-law] expresses a scenario where `0.5` seconds of a system is non-parallelisable while the remaining `1` second is. Under Amdahl's Law, the parallelised workload can theoretically subdivide infinitely across more and more processing units to reduce latency [@amdahl_1967]. Amdahl's Law is formally expressed as
~ Equation { #eqa-amdahls-law }
speedup = \dfrac{1}{\left( 1 - fractionEnhanced \right) + \left( \dfrac{fractionEnhanced}{speedupEnhanced} \right)}.
~
`fractionEnhanced` represents the fraction of the processing time that is enhanced by parallelism, `speedupEnhanced` is the factor by which the parallelisable tasks are improved, and `speedup` is the resulting improvement of the system as a whole.
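Figure [#fig-amdahl-worked-sketch] gives a brief worked sketch of the scenario from figure [#fig-amdahls-law] (the timings are taken from that figure; the helper function is illustrative only).
~ Figure { #fig-amdahl-worked-sketch; caption:"Illustrative worked example of Amdahl's Law for a 0.5 second serial, 1.0 second parallelisable workload."; page-align:top}
```cpp
#include <cstdio>

// Amdahl's Law: overall speedup given the fraction of the workload that is
// enhanced and the factor by which that fraction is sped up.
double amdahlSpeedup(double fractionEnhanced, double speedupEnhanced) {
    return 1.0 / ((1.0 - fractionEnhanced) + (fractionEnhanced / speedupEnhanced));
}

int main() {
    // 0.5 seconds is non-parallelisable; the remaining 1.0 second is spread
    // across three processing elements.
    double const totalSeconds = 1.5;
    double const fractionEnhanced = 1.0 / 1.5;
    double const speedup = amdahlSpeedup(fractionEnhanced, 3.0);
    // Prints a speedup of ~1.8x, i.e. roughly 0.83 seconds in total
    // (0.5 seconds serial plus 1.0 / 3 seconds parallel).
    std::printf("speedup: %.2fx, new time: %.2fs\n", speedup, totalSeconds / speedup);
}
```
~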
Despite inherently serial problems having no formal definition, complexity classes like P-completeness help to group problems that are believed to be inherently sequential or have no known efficient parallel solution [@greenlaw_hoover_ruzzo_1995].
The Lempel-Ziv-Welch compression algorithm is a well-established example of a P-complete problem, due to how it encodes data [@pm_chezian_2014]. LZW operates in two distinct stages of string factorisation derived from the LZ1 [@ziv_lempel_1977] and LZ2 [@ziv_lempel_1978] compression algorithms. The serial constraints exist in the requirements of the second stage, with its input data relying on the order of accumulated state guaranteed by sequential execution [@ziv_lempel_1978;@pm_chezian_2014].
Research conducted by M. Kumar Mishra, T. Kumar Mishra, and A. Kumar Pani proposed a solution for parallelising Lempel-Ziv-Welch encoding through interning and comparison. However, these advances deviate from the original LZW specification and, therefore, cannot be considered a faithful implementation of the algorithm [@mishra_mishra_pani_2012].
Inversely to P complexity, NC complexity groups problems for which there are well-understood, efficient parallel solutions to, such as integer arithmetic and matrix multiplication [@arora_barak_2007]. While it is currently unclear if `P != NC`, this is widely speculated by the scientific community to be the case [@greenlaw_hoover_ruzzo_1995]. Greenlaw, Hoover, and Ruzzo discuss the failure to demonstrate `P = NC` as evidence that it is false [@greenlaw_hoover_ruzzo_1995].
> The best known parallel solutions of general sequential models give very modest improvements - R. Greenlaw, H. J. Hoover, W. L. Ruzzo [@greenlaw_hoover_ruzzo_1995]
Alongside inherently serial problems, practical computer science is replete with examples of accidentally serial problems. Many libraries, frameworks, and tools prefer design practices that hinder parallel processing, such as unnecessary reliance on far-reaching shared resources and mutable state. The OpenGL standard is a prominent example of a serial problem created by shared state, in part, due to its use of an opaque rendering context that exists locally to the thread that created it [@glconcepts_2017]. In more conventionally object-oriented tooling, this same practice can manifest itself in the singleton pattern, incurring the same complexities should it need to parallelise logic that mutates its state [@gamma_helm_johnson_vlissides_1994].
Accessing shared state also typically incurs cache misses, as such shared data usually resides outside the stack, in static or heap memory [@fog2_2021]. Moreover, as reliances on shared state accumulate, sections of program logic become non-parallelisable, and poor separation makes the logic harder to reason about [@sutter_2005].
&pagebreak;
## Abstractions { #sec-abstractions }
Complications that arise at the hardware level are compounded further by the layers of software built atop it [@sutter_2007]. It is the case that the majority of real-time consumer applications target an operating system of some kind, such as the PlayStation 4 and Orbis OS [@clark_2013]. Further to this, managed runtimes and scripting languages add more layers of indirection, composing a hierarchy of abstraction from the hardware.
### Operating Systems { #sec-operating-systems }
Within the desktop computing space, threading implementations at the operating system level have largely converged on a common approach that treats hardware resources as typed values to be acquired and released as needed [@bell_2021]. Between the threading library implementations present on major desktop software platforms, three distinct primitives are common throughout them.
#### Threads { #sec-threads }
Threads represent an execution context for running sections of software logic independently from one another [@bell_2021;@mitchell_samuel_oldham_2001]. While similar in a sense to operating system processes, threads differ in that many of them can be composed under a single process to perform isolated sets of computation [@bell_2021;@mitchell_samuel_oldham_2001].
Nothing strictly requires a thread type to map directly to a hardware thread and, in many cases, threading libraries will silently provide more threads than are actually available on a given hardware platform [@bell_2021;@mitchell_samuel_oldham_2001].
#### Mutexes { #sec-mutexes }
Mutexes are a synchronisation primitive used for brokering access to a shared resource [@mitchell_samuel_oldham_2001]. Synchronising is necessary when there is the potential for multiple threads to concurrently write to a single resource, such as memory, as the program otherwise runs the risk of retrieving partially written, invalid data [@mitchell_samuel_oldham_2001].
#### Conditions { #sec-conditions }
Sometimes referred to as "condition variables" or "signals", conditions are a way of sending an actively processing thread into a sleeping state [@mitchell_samuel_oldham_2001]. Threads marked as sleeping will continue to do so until the condition variable they are waiting on emits a "wake up" signal [@mitchell_samuel_oldham_2001].
Conditions are particularly useful in avoiding the pushing of unnecessary data and instructions through the CPU while a thread is not actually required to be doing anything [@mitchell_samuel_oldham_2001].
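Figure [#fig-thread-primitives-sketch] offers a minimal C++ sketch combining the three primitives described above, using the standard `std::thread`, `std::mutex`, and `std::condition_variable` wrappers over the operating system facilities; the shared flag and printed message are illustrative only.
~ Figure { #fig-thread-primitives-sketch; caption:"Illustrative sketch combining a thread, a mutex, and a condition variable in C++."; page-align:top}
```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

static std::mutex workMutex;               // Brokers access to the shared flag.
static std::condition_variable workReady;  // Signals the sleeping worker.
static bool hasWork = false;

int main() {
    std::thread worker([] {
        std::unique_lock<std::mutex> lock(workMutex);
        // Sleep until signalled; the predicate guards against spurious wake-ups.
        workReady.wait(lock, [] { return hasWork; });
        std::printf("worker woke up\n");
    });
    {
        std::lock_guard<std::mutex> lock(workMutex);
        hasWork = true;
    }
    workReady.notify_one();  // Wake the sleeping worker thread.
    worker.join();
}
```
~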
#### Platform Differences { #sec-platform-differences }
Despite the high degree of similarity in the programming interfaces expressed by many threading libraries, their implementations often vary significantly, making them far from inter-operable [@pthreadswin32_2003]. For example, all Win32 threading primitives are generalised under an opaque "handle" type [@pthreadswin32_2003]. Meanwhile, the POSIX threading library chooses to use concrete, transparent structures that the API consumer has full access to [@pthreadswin32_2003]. This results in POSIX being considered more strongly typed and safer than Win32, as functions intended for use with thread instances will only accept `pthread_t` instances [@pthreadswin32_2003]. Conversely, a Win32 `HANDLE` instance may reference a thread, a mutex, or something else entirely [@pthreadswin32_2003].
~ Figure { #fig-win32-critical-sections; caption:"Example of Win32 critical section object usage."; page-align:top}
```cpp
static CRITICAL_SECTION criticalSection;

void initCriticalSections() {
    InitializeCriticalSection(&criticalSection);
}

void synchronisedDoThing() {
    EnterCriticalSection(&criticalSection);
    sharedResource->DoThing();
    LeaveCriticalSection(&criticalSection);
}
```
~
Alongside differences in typing disciplines, approaches to the handling of shared resource access also differ between Windows and POSIX threads [@pthreadswin32_2003]. Take, for example, critical section objects, which act as synchronisation primitives that offer a lower performance overhead when compared to regular mutex handles on Windows [@win32criticalsections_2018;@binstock_2011]. In exchange for having a lower performance penalty, critical sections only work within the process that created them, disallowing cross-process resource synchronisation [@win32criticalsections_2018].
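For comparison with figure [#fig-win32-critical-sections], figure [#fig-posix-mutex-sketch] sketches the equivalent pattern under POSIX threads; the shared `doThing` routine is assumed, and error handling is omitted for brevity.
~ Figure { #fig-posix-mutex-sketch; caption:"Illustrative sketch of the equivalent synchronisation under POSIX threads."; page-align:top}
```cpp
#include <pthread.h>

// POSIX exposes the mutex as a concrete, statically initialisable structure
// rather than an opaque handle.
static pthread_mutex_t resourceMutex = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical shared routine, mirroring the Win32 example.
void doThing();

void synchronisedDoThing() {
    pthread_mutex_lock(&resourceMutex);
    doThing();
    pthread_mutex_unlock(&resourceMutex);
}
```
~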
For a long time, programmers have worked with these low-level threading APIs either directly or through a platform-agnostic layer like `std::thread` in C++11 [@stdthread_2021] or `SDL_Thread *` in Simple Direct-Media Layer 2 [@sdlthread_2021]. Despite being hardware abstractions, all of the mentioned threading libraries depict a close mapping of the actual hardware resources involved in parallel computation [@ousterhout_1995]. While granular control over the hardware threading model is necessary for the domain of systems programming, this model of computation becomes cumbersome and difficult at higher levels of abstraction due to the boilerplate and generality involved in their programming interfaces [@ousterhout_1995]. These kinds of abstraction have been referred to as "leaky abstractions", owing to the way hardware-specific implementation details "leak" into the software tooling [@spolsky_2002]. For that reason, it has become commonplace for more specialised models of multi-core computation to be implemented at the programming language level.
&pagebreak;
### Java and Concurrent Memory Models { #sec-java-concurrent-memory-models }
Despite these threading abstractions being considered "leaky" in design, they have seen varied mileage depending on their implementations. M. Batty, K. Memarian, K. Nienhuis, J. Pichon-Pharabod, and P. Sewell conducted a formal analysis of the standardised threading support outlined in the C++11 and Java JSR-133 language specifications [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015]. The paper levied many criticisms at the memory model proposed by the original Java language specification, said to be both too restrictive to be practically implemented on hardware and too weakly specified to be safe [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015]. Much of the paper focused on an article by W. Pugh regarding the Java Memory Model, which further elaborated that the constraints specified made it impossible for well-understood compiler optimisations to be applied to generated code [@pugh_2000; @batty_memarian_nienhuis_pichon-pharabod_sewell_2015]. Further, over-specificity in certain sections made it impractical for existing hardware to execute Java threads with meaningful performance gains [@pugh_2000]. Throughout Pugh's paper, there were two salient points that arose multiple times.
* Memory coherence requirements guarantee bottlenecks when reading memory [@pugh_2000].
* Safety guarantees enforced by the Java Memory Model are stronger than those that actually exist on real hardware [@pugh_2000].
Pugh also made many references to "coherence", specifically memory coherence, which is the way a CPU keeps writable data consistent between parallel accesses [@li_hudak_1989]. Because the original specification of the Java language guaranteed strong memory coherence, calls to the `getfield` JVM instruction could not be optimised away, as the operand variable may refer to aliased data [@pugh_2000].
Another paper by Pugh analysed the flaws of "double-checked locking" in Java, again, giving reference to the original specification of the Java memory model [@pugh_2021].
> Double-Checked Locking is widely cited and used as an efficient method for implementing lazy initialization in a multithreaded environment - W. Pugh [@pugh_2021]
~ Figure { #fig-java-double-checked-locking; caption:"Example of double-checked locking in Java."; page-align:top}
```java
class SharedResource {
    private static SharedResource lazyInstance = null;

    public final String message = "Hello, world";

    public static SharedResource instance() {
        if (lazyInstance == null) {
            synchronized (SharedResource.class) {
                if (lazyInstance == null) {
                    lazyInstance = new SharedResource();
                }
            }
        }
        return lazyInstance;
    }

    public static void main(String[] args) {
        var resource = SharedResource.instance();
        System.out.println(resource.message);
    }
}
```
~
A frequent usage pattern of double-checked locking appears in the Java implementation of singletons demonstrated in figure [#fig-java-double-checked-locking], which may lazily initialise a new global instance of `SharedResource` if one does not yet exist when calling `SharedResource.instance()`. Passing the uninitialised check will cause the control flow to enter a synchronised block and initialise the `SharedResource` instance. Subsequent calls to `SharedResource.instance()` thereafter fail the "not initialised" check and return the same instance. Placing the synchronised block within an outer "not initialised" check allows calls to `SharedResource.instance()` to skip synchronisation once the instance has been initialised. Nevertheless, as noted by Pugh, depending on how the Java virtual machine or the underlying hardware chooses to order instructions, this design pattern can become fatally flawed in practice [@pugh_2021]. If either the instructions for class construction or assignment of the constructed instance are re-ordered, `SharedResource.instance()` runs the risk of exposing the accessing thread to either `null` or a partially constructed instance of `SharedResource` [@pugh_2021]. In the paper, Pugh proposed two pragmatic solutions to the broken design pattern.
* Apply the `synchronized` keyword to the function body rather than inside a single branch, sacrificing the benefits of unnecessary synchronisation supposedly avoided by double-checked locking in favour of tighter guarantees on memory safety [@pugh_2021].
* Replace the usage of lazy initialisation with direct assignment at the static variable declaration. Java already guarantees that explicitly initialised static variables will never be observed in an invalid state [@pugh_2021]. However, this will present complications for singleton classes that do not supply a parameterless constructor.
Issues surrounding double-checked locking in Java - as well as the declaration presented by Pugh - would go on to be recognised in an InfoWorld article by the now Java language architect B. Goetz [@goetz_2001]. The issues persisted until changes to the specification of the `volatile` keyword as part of Java 5 ensured that double-checked locking on a `volatile` field would produce correct results [@jenkov_2020;@javaatomic_2021].
More recently, Java has been making continued efforts toward a better memory model and concurrency features, with projects Valhalla, Loom, and the Foreign-Memory Access API Java enhancement proposals [@smith_2021;@javaloom_2021;@cimadamore_2021]. Nevertheless, all but the Foreign-Memory Access API are still in active development and experience frequent changes in design direction [@smith_2021;@javaloom_2021].
&pagebreak;
### C++ and Asynchronous Execution { #sec-cpp-asynchronous-execution }
As mentioned in section [#sec-java-concurrent-memory-models], the paper by M. Batty et al. also analysed the concurrency model outlined by the C++11 standard [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015]. Due to the number of correctness proofs produced by various independent and company-sponsored research efforts, Batty et al. state that, with respect to compiler optimisations, C++11 boasted one of the most sound models for computational concurrency at the time of the paper [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015]. With that said, issues were raised about the opaqueness of the implementation, referring to details specified in the standard that are not actually present in real software and hardware platform implementations [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015].
> Without a semantics, programmers currently have to program against their folklore understanding of what the Java and C/C++ implementations provide. - M. Batty, K. Memarian, K. Nienhuis, J. Pichon-Pharabod, and P. Sewell [@batty_memarian_nienhuis_pichon-pharabod_sewell_2015]
Alongside threads, C++11 also saw the addition of `std::async` and `std::future`, two complementary facilities for expressing parallel operations and their results at a level of abstraction above the operating system [@stdthread_2021;@stdfuture_2020].
~ Figure { #fig-cpp-futures-example; caption:"Example of `std::future` usage in C++."; page-align:top}
```cpp
int32_t const seed = std::rand();
std::future<std::vector<int>> futureValues =
    std::async(std::launch::async, computeValues, seed);

// Blocks the process from continuing until the values are
// computed by the asynchronous function call.
for (int value : futureValues.get()) {
    std::printf("%d\n", value);
}
```
~
While `std::async` presents a more concise API for easier concurrency over `std::thread`, there are some hidden costs and pitfalls associated with its use compared to similar language features present in other languages like JavaScript and C#.
* Thread pooling is not a requirement of the standard specification [@stdasync_2021;@bendersky_2016].
* `std::launch::async` guarantees that every invocation will always be executed asynchronously, typically on a new thread [@stdasync_2021;@asynctut_2017]. On x86 processors, this causes diminishing returns as the operating system switches between virtual execution contexts with many asynchronous tasks in flight [@mitchell_samuel_oldham_2001].
* `std::future` and `std::async` lack practical composability with other C++11 threading library primitives. C++11 locking primitives are difficult to use from within asynchronous tasks and can result in undefined behaviour [@milewski_2009].
Improvements in the C++17 standard revision have resolved some of the composability issues of `std::async`. Nevertheless, the tooling still falls short of the performance of higher-level language implementations of the same feature.
For all of these reasons, `std::async` is closer to a fire-and-forget operation with hard-to-predict performance characteristics, rather than a reusable or composable element for creating larger systems.
&pagebreak;
### Dlang and Shared State { #sec-dlang-shared-state }
Despite being similar in stated language goals to C++ [@doverview_2021], the D programming language - also known as "Dlang" - has expressed a very different approach to multi-threaded programming. One such difference in design direction is the handling of shared state across multiple threads.
~ Figure { #fig-dlang-shared-state; caption:"Example of `shared` keyword usage in D."; page-align:top}
```d
import std.stdio : writeln;

class SharedResource {
    private static shared (SharedResource) _instance =
        new shared SharedResource();

    public const (string) message = "Hello, world";

    public static shared (SharedResource) instance() {
        return _instance;
    }
}

void main(string[] args) {
    shared (SharedResource) resource = SharedResource.instance();
    writeln(resource.message);
}
```
~
By default, global variables in D are restricted to the thread that created them [@dshared_2021]. Other threads that attempt to access that same variable will be touching their thread-local copy of it [@dshared_2021]. The D language specification states that thread-local by default ensures tighter thread-safe execution and allows for more intelligent optimisations [@dshared_2021]. There are a few restricted mechanisms for making data accessible between threads, with one such example being the `shared` keyword.
Qualifying a global variable instance with the `shared` keyword, as depicted in figure [#fig-dlang-shared-state], is the only way to make a global variable shared between threads [@dsynchronization_2021]. Additionally, any variable not qualified as `shared` is not permitted to interact with a shared global [@dshared_2021]. In principle, this makes `shared` similar in concept to `const` in C and C++. However, unlike `const`, a non-`shared` instance and its members may not become `shared` later after initialisation. Figure [#fig-dlang-shared-state] shows this, as the result of `SharedResource.instance()` must be assigned to a `shared (SharedResource)` qualified variable.
~ Figure { #fig-dlang-immutability; caption:"Example of `immutable` keyword usage in D."; page-align:top}
```d
class SharedResource {
private static immutable (SharedResource) _instance =
new SharedResource();
public const (string) message = "Hello, world";
public static immutable (SharedResource) instance() {
return this._instance;
}
}
void main(string[] args) {
immutable (SharedResource) resource =
SharedResource.instance();
writeln(resource.message);
}
```
~
Alongside `shared`, D also has `immutable` [@dattributes_2021]. Immutability serves as a stricter variant of `const`, wherein any variable marked `immutable` is guaranteed to have been so since its creation [@dmutability_2021]. Similar to figure [#fig-dlang-shared-state], the result of `SharedResource.instance()` in figure [#fig-dlang-immutability] must be assigned to another `SharedResource` type with the same storage qualifier [@dmutability_2021]. Furthermore, like `shared`, `immutable` makes variables accessible from multiple threads, due to the strong read-only guarantees it gives data [@dmutability_2021;@dsynchronization_2021].
Comparable experiments to `shared` and `immutable` occurred in a study attempting to trace and validate Java object references through arbitrary object hierarchies, carried out by M. Servetto, D. J. Pearce, L. Groves, and A. Potanin [@servetto_pearce_groves_potanin_2013]. An artefact of the study was a pre-processor for the Java programming language that the paper coined "Balloon Immutable Java", or "BI-Java" for short [@servetto_pearce_groves_potanin_2013]. Similar to D, BI-Java has transitive immutability guarantees, meaning that any Java class marked as immutable also guarantees immutability for any of its "owned" members [@servetto_pearce_groves_potanin_2013]. Moreover, Servetto et al found that, through the insertion of a mathematically sound ownership model into the Java programming language, the potential for parallelism presented itself as an automatic compiler optimisation [@servetto_pearce_groves_potanin_2013]. Servetto et al went on to implement these optimisation features in the compiler through automatically inserted thread fork and join operations [@servetto_pearce_groves_potanin_2013].
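Although the fork and join operations in BI-Java are inserted by the compiler, the transformation itself can be illustrated by hand. The C++ sketch below is an analogue written for this paper rather than code from the study: because both halves of the work only read the immutable input, they may safely execute in parallel and join before the results are combined.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hand-written analogue of the fork/join transformation that BI-Java applies
// automatically. Both tasks only read the immutable input vector, so no
// synchronisation beyond the final join is required.
long SumOfSquares(std::vector<int> const & values) {
	std::size_t const middle = values.size() / 2;
	long lowerSum = 0;
	long upperSum = 0;

	// "Fork": the lower half is computed on a second thread.
	std::thread forked{[&values, &lowerSum, middle] {
		for (std::size_t i = 0; i < middle; i += 1) {
			lowerSum += static_cast<long>(values[i]) * values[i];
		}
	}};

	// The upper half is computed on the calling thread in the meantime.
	for (std::size_t i = middle; i < values.size(); i += 1) {
		upperSum += static_cast<long>(values[i]) * values[i];
	}

	// "Join": wait for the forked work before combining the results.
	forked.join();

	return lowerSum + upperSum;
}
```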
&pagebreak;
### Golang and Fibers { #sec-golang-fibers }
While compiler-automated parallelisation can be useful, it is still fundamentally limited by what the compiler can deduce from the rules of the language. Rather than abstract away concurrent operations, the Go programming language - also known as "Golang" - offers a suite of concurrency features that cordon off direct access to operating system threads and encourage fire-and-forget programming through the use of "goroutines" [@goconcurrency_2021].
~ Figure { #fig-golang-concurrency; caption:"Example of concurrent operations in Golang."; page-align:top}
```go
package main

import (
	"fmt"
	"math/rand"
)

// Placeholder workload standing in for any expensive calculation.
func computeValues(seed int32) []int {
	return []int{int(seed % 10), int(seed % 100)}
}

func main() {
	valuesChannel := make(chan []int)
	go func() {
		seed := rand.Int31()
		// Pass the values back to the "main thread".
		valuesChannel <- computeValues(seed)
	}()
	// Wait here until the values are received.
	for _, value := range <-valuesChannel {
		fmt.Printf("%d\n", value)
	}
}
```
~
On the surface, goroutines appear to be very similar to the concept of "async await" present in other languages like C++, C#, and JavaScript. However, goroutines differ in that they cannot return values like other async implementations [@goroutines_2020;@stdasync_2021;@stdfuture_2020]. For a goroutine to return values, a language primitive known as a "channel" must be used to transmit data across asynchronous boundaries [@goroutines_2020;@gochannels_2021].
Goroutines and channels operate atop a runtime that utilises fibers [@goconcurrency_2021], a self-managing execution context that exists within the program runtime [@greenthreads_2021]. Unlike OS threads, which are designed around pre-emptive multitasking, fibers are intended for co-operative multitasking. Further, the relationship between OS threads and fibers is commonly one-to-many, as fibers carry a far lower instantiation overhead compared to threads [@singh_2020].
An analysis by N. Togashi and V. Klyuev tested the performance characteristics of concurrency features in both Go and Java by implementing a matrix multiplication program in each [@togashi_klyuev_2014]. Togashi et al found that Go performed appreciably better than Java when concurrency was involved [@togashi_klyuev_2014]. Despite Go having the lead under a multi-core workload, Java scaled noticeably better with matrix size on single-core computation [@togashi_klyuev_2014]. The 2014 paper noted that Java has had many more years of optimisations in its just-in-time HotSpot compiler compared to the relatively new ahead-of-time Go compiler toolchain [@togashi_klyuev_2014].
## Findings { #sec-findings }
Of the abstractions researched as part of this literature survey, the fiber-based, concurrent programming model touted by the Go programming language appears to have the lowest barrier to entry with the highest yields. While lessons in thread-local access and immutability can be taken from the D programming language, they are not applicable as a full-scale solution to concurrent computation in applications that must manipulate state constantly. Further, the lower-level programming interfaces present in C++ and Java are overly cumbersome for the majority of consumers to use in an efficient manner.
While it is clear that the multi-core programming model exposed at the operating system level is less than adequate for the straightforward creation of multi-threaded software, there is no consensus on how it should be abstracted. The lack of a common abstraction above the hardware suggests that higher-level concurrency has not been solved at the general level. However, these findings do not contradict the study goals, as the applications for concurrency and parallelism here are focused primarily on the needs of real-time applications, which most of these higher-level programming interfaces struggle to serve.
&pagebreak;
# Implementation { #sec-implementation }
Ona, Catalan for "Wave", is a real-time rendering framework written in C++ atop Simple DirectMedia Layer 2, OpenGL 4.6, and the OpenGL Extension Wrangler. The Ona framework serves as an informed prototype implementation of the concepts explored in this paper regarding practical concurrency under the features of modern computing hardware.
As with many game engines and frameworks focused on performance, Ona tries not to hinder itself with hidden costs. While much of the framework is not immediately related to the research area of this paper, understanding its architectural underpinnings is useful for contextualising any performance characteristics identified during testing.
## Systems and Parallelism { #systems-and-parallelism }
A prominent example of avoiding hidden cost is the user-facing API that Ona exposes. Generic game engines typically impose a rigid model of data representation, with the most common example being the scene graph. Used in engines like Unity and Godot, the scene graph expresses game logic and data as a series of loosely connected objects in memory [@godotscenetree_2021;@unityhirarchy_2021]. While the representation of these scene graphs varies between engines, the fundamental concepts that they espouse remain the same.
Alternative frameworks, such as the Rust-based Amethyst game engine, take a more data-oriented approach, handling individual data as flat components in memory rather than as a tree [@amethystbook_2021]. Although the design used by engines like Amethyst is generally easier to optimise, it still subscribes to a particular kind of data representation. Such abstractions on data can complicate tasks that would otherwise be trivial in nature, like performing collision detection or rendering a 3D model.
In either of the above-mentioned cases, choosing one data layout over another restricts, or at least complicates, the expression of patterns required by certain games. Take, for example, a grid-based game that has to express itself within a sparse, object-oriented tree layout [@fabian_2018]. Mapping the requirements of a grid-based system onto a hierarchy makes them less efficient and harder to build. Rather than subscribe to any one of these data abstraction ideologies, Ona provides systems.
~ Figure { #fig-ona-systems; caption:"Spawning systems from a module in Ona."; page-align:top}
```cpp
void OnaInit(OnaContext * ona, void * module) {
OnaSystemInfo const playerControllerInfo = {
.size = sizeof(PlayerController),
.initializer = [](void * system, OnaContext const * ona) {
reinterpret_cast<PlayerController *>(system)->Init(ona);
},
.processor = [](
void * system,
OnaContext const * ona,
OnaEvents const * events
) {
reinterpret_cast<PlayerController *>(system)->
Process(ona, events);
},
.finalizer = [](void * system, OnaContext const * ona) {
reinterpret_cast<PlayerController *>(system)->Exit(ona);
},
};
ona->spawnSystem(module, &playerControllerInfo);
}
```
~
Systems, shown in figure [#fig-ona-systems], are class-like encapsulations of behaviour and state with an emphasis on isolation. In the context of a system, isolation refers to the constraint that two systems are not permitted to cross-communicate with each other without using an intermediary message-passing facility. When spawned via `OnaContext::spawnSystem`, they are added to a flat list of running systems managed by the engine runtime.
Conceptually, systems are akin to fibers, with the addition of a class context for storing state local to their callback functions. Under this design philosophy of tightly encapsulated actors containing both logic and state, Ona achieves concurrent and parallel execution without having to manage the risks associated with deadlocks and data races.
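For context, the `PlayerController` referenced in figure [#fig-ona-systems] is an ordinary class exposing the three callbacks that the engine invokes through the registered function pointers. The sketch below is a minimal illustration written for this paper; its members are hypothetical and are not part of the engine API.

```cpp
// Minimal sketch of a system class matching the callbacks registered in
// figure [#fig-ona-systems]; the members shown here are hypothetical.
struct PlayerController {
	Vector2 position;

	void Init(OnaContext const * ona) {
		// Acquire resources and set up initial state for the system.
		this->position = Vector2{.x = 0, .y = 0};
	}

	void Process(OnaContext const * ona, OnaEvents const * events) {
		// Per-frame logic runs here, isolated from every other system.
		this->position.x += 1;
	}

	void Exit(OnaContext const * ona) {
		// Release any resources before the system instance is destroyed.
	}
};
```

The engine allocates `sizeof(PlayerController)` bytes for the instance, as specified by the `size` field in figure [#fig-ona-systems], and owns its lifetime from spawning through to finalisation.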
~ Figure { #fig-ona-rendering; caption:"Rendering sprites in Ona."; page-align:top}
```cpp
Sprite const sprite = {
.origin = Vector3{position.x, position.y, 0},
.tint = Color{0xFF, 0xFF, 0xFF, 0xFF},
};
ona->renderSprite(this->graphicsQueue, this->material, &sprite);
```
~
While less approachable for beginning game developers, the model presented by Ona has a lower barrier to entry for experienced programmers, allowing them to apply their own specialised data management model to the engine without having to abstract away another one. To exemplify the barrier to placing something on the screen, figure [#fig-ona-rendering] contains the required code to submit a sprite render request. Ona systems map directly to the conceptual systems prevalent in video game design, such as player managers, progression systems, and AI controllers. In doing so, management of state becomes localised to easily parallelisable systems, as the management logic is less fragmented across many individual class instances in a sparse scene graph.
&pagebreak;
## Scheduling and Communication { #sec-scheduling-communication }
The isolation and immutability of systems are easy ways to guarantee safe parallel execution, but they are not sufficient for real-time applications that must change state many times a second. Furthermore, video games rely on complex interactions between systems to fulfil the requirements of their gameplay experience. Under these constraints, Ona systems on their own do not suffice in supplying parity between other game logic management approaches. To remedy this without creating too many couplings between systems, Ona uses channels.
~ Figure { #fig-ona-channels; caption:"Communicating between systems in Ona using channels."; page-align:top}
```cpp
// Sending system...
void Process(OnaContext const * ona, OnaEvents const * events) {
Vector2 const position = {.x = 0, .y = 0};
ona->channelSend(
playerPositionChannel,
sizeof(Vector2),
&position
);
}
// ...
// Receiving system...
void Process(OnaContext const * ona, OnaEvents const * events) {
Vector2 position;
ona->channelReceive(
playerPositionChannel,
sizeof(Vector2),
&position
);
}
// ...
```
~
Ona channels are blocking collections that function similarly to the Go-style channels discussed in section [#sec-golang-fibers]; however, they differ in a few ways.
* Ona channels are single-element "collections", meaning that only one value of the channel type may occupy space in the channel at a time.
* Ona channels have tighter constraints on how long a system may wait for another to receive a value, printing warnings when a channel blocks for longer than `16` milliseconds, as per the maximum latency between frames to achieve 60 frames per second.
* Ona channels may only be used from within the process callback of an Ona system.
By following these rules, data may be passed between two separate systems asynchronously. Systems and channels may be used without the need to use low-level locking primitives, making concurrent communication simpler and safer.
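To make the single-element, blocking behaviour concrete, the sketch below models such a channel using standard C++ primitives. It is a conceptual model only and not the Ona implementation, which integrates with the userland scheduler discussed below rather than blocking operating system threads directly.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <mutex>

// Conceptual model of a single-slot blocking channel; not Ona's own code.
class SingleSlotChannel {
	std::mutex mutex;
	std::condition_variable condition;
	bool occupied = false;
	char slot[64]; // Fixed-size slot, large enough for the examples here.

public:
	void Send(void const * data, std::size_t size) {
		std::unique_lock<std::mutex> lock{this->mutex};
		// Block until any previously sent value has been consumed.
		this->condition.wait(lock, [this] { return !this->occupied; });
		std::memcpy(this->slot, data, size);
		this->occupied = true;
		this->condition.notify_all();
	}

	void Receive(void * data, std::size_t size) {
		std::unique_lock<std::mutex> lock{this->mutex};
		// Block until a value has been placed into the slot.
		this->condition.wait(lock, [this] { return this->occupied; });
		std::memcpy(data, this->slot, size);
		this->occupied = false;
		this->condition.notify_all();
	}
};
```

In Ona, the same behaviour is reached through the `channelSend` and `channelReceive` calls shown in figure [#fig-ona-channels], with a warning emitted when a blocking call exceeds the `16` millisecond frame budget.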
Initial approaches to the Ona systems manager applied a naive thread pool, wherein each new system would be allocated its own thread. However, this presented a few performance complexities.
* Ona systems sitting idle awaiting results still consumed CPU resources while doing nothing, as communication between threads would block the whole execution context.
* The workload of Ona systems was not balanced between CPU cores as efficiently as the workload distribution would allow.
Issues surrounding the thread pool approach led to the incorporation of Google Marl - a userland scheduling library designed for embedding into programs that need to maximise throughput via asynchronous operations like web requests and multi-core processing. Marl provided the two missing pieces of the Ona concurrency model: lightweight tasks that yield cooperatively while they block, and automatic distribution of work across the available hardware threads. With these in place, multi-threaded operations work cooperatively under a multi-tasking environment hosted by the userland software program.
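The sketch below illustrates the cooperative model that Marl provides, following the patterns of its public API; it is an illustration written for this paper rather than an excerpt of the Ona scheduler integration. Each scheduled task stands in for one system's per-frame processing, and the wait group marks the point at which control returns to the main thread.

```cpp
#include <cstdio>

#include "marl/defer.h"
#include "marl/scheduler.h"
#include "marl/waitgroup.h"

int main() {
	// Bind a scheduler spanning all hardware threads to the main thread.
	marl::Scheduler scheduler(marl::Scheduler::Config::allCores());
	scheduler.bind();
	defer(scheduler.unbind());

	constexpr int systemCount = 8;
	marl::WaitGroup frameComplete(systemCount);

	for (int i = 0; i < systemCount; i += 1) {
		// Tasks yield cooperatively while blocked instead of stalling a thread.
		marl::schedule([=] {
			defer(frameComplete.done());
			std::printf("system %d processed\n", i);
		});
	}

	// Wait until every task has finished its work for the frame.
	frameComplete.wait();

	return 0;
}
```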
&pagebreak;
## Architecture and Concurrency { #sec-architecture-and-concurrency }
Concurrency is built directly into the core architecture of Ona. Rather than taking a single-core design and expanding it to multi-core later, every initial consideration regarding the internals of Ona was based around supporting straightforward concurrent and parallel execution.
~ Figure { #fig-ona-flow; caption:"Ona operational flow chart."; page-align:top}
![OnaFlow]
[OnaFlow]: images/OnaFlow.png "Engine API Diagram" { width:auto; max-width:80% }
~
Described in figure [#fig-ona-flow] is the process workflow that Ona goes through in its lifetime. Not depicted are the variable user-defined components, such as the configuration file or systems code. During the initialisation stage, the engine loads any dynamic libraries specified in its configuration file - otherwise known as "modules" in Ona - into memory and attempts to invoke an entry point named `OnaInit`. From here, the user can begin spawning systems for the engine to run in parallel.
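On Linux, this module-loading step can be pictured with the POSIX dynamic loading interface. The sketch below is a simplified illustration written for this paper rather than Ona's actual loader; the entry point signature matches the `OnaInit` function shown in figure [#fig-ona-systems].

```cpp
#include <cstdio>
#include <dlfcn.h>

struct OnaContext; // Opaque engine context provided by the runtime.

// Signature of the module entry point, as seen in figure [#fig-ona-systems].
using OnaInitFunction = void (*)(OnaContext * ona, void * module);

// Simplified illustration of loading one module named in the configuration.
bool LoadModule(OnaContext * ona, char const * path) {
	void * module = dlopen(path, RTLD_NOW);

	if (module == nullptr) {
		std::fprintf(stderr, "Failed to load %s: %s\n", path, dlerror());

		return false;
	}

	// Look up the exported "OnaInit" entry point within the module.
	auto init = reinterpret_cast<OnaInitFunction>(dlsym(module, "OnaInit"));

	if (init == nullptr) {
		dlclose(module);

		return false;
	}

	init(ona, module);

	return true;
}
```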
While the user-facing API exposes class-like concepts, the engine internals take a composition-focused approach, using virtual dispatch and inheritance sparingly. Where more rigid, heavier abstractions are necessary, such as the graphics subsystem, the engine utilises lightweight, pure virtual classes like `Ona::GraphicsServer` and `Ona::GraphicsQueue`. The structure of the engine API makes efforts to reduce how frequently these interfaces are interacted with, avoiding unnecessary pointer chasing and cache misses. Conversely, for more lightweight abstractions, like the application configuration instance, the composition of data structures like `Ona::HashTable` is preferred to avoid additional memory indirections and allocations, which would otherwise accumulate as complexity in the CPU instruction pipeline.
~ Figure { #fig-engine-api; caption:"Architectural diagram of the engine API."; page-align:top}
![EngineAPI]
[EngineAPI]: images/EngineAPI.png "Engine API Diagram" { width:auto; max-width:90% }
~
Aside from Ona systems, the other most frequently used component of the engine runtime every frame is `Ona::GraphicsServer`, an interface that provides highly abstracted access to the underlying hardware and software implementations related to the rasterisation of vector graphics. Currently, `Ona::OpenGLServer` is the only graphics server implementation within Ona, as depicted in figure [#fig-engine-api]. The OpenGL graphics backend supports all of the functionality defined in the graphics server API on any OpenGL 4.6 compliant platform.
The decision to use OpenGL 4.6 over earlier versions with wider hardware compatibility was made due to the advances in the OpenGL specification that reduce developmental friction when working with it. OpenGL 4.5 and upwards provide support for the direct state access API and SPIR-V shader binaries, alongside longer-standing features such as uniform buffer objects [@opengldsa_2020;@openglspirv_2018;@openglubo_2017]. Irrespective of the performance benefits intrinsic to these recent specification additions, they enable easier development as the APIs require less opaque state management.
On the point of opaque, shared state within the OpenGL specification: using it as a graphics backend has had implications for how the user-facing APIs handle graphics servers. Originally, Ona was not intended to have one central graphics server that receives all requests. Nevertheless, due to the constraints imparted by the foundations of the OpenGL backend, the decision to make the core graphics server shared state raised difficult questions for everything that came after it.
* How to handle requests made asynchronously to create new graphics resources?
* How to handle the dispatching of rendering requests without relying on synchronisation?
* How much of the rendering logic can be parallelised?
By introducing the `Ona::GraphicsQueue` type into the API specification, these synchronisation concerns were solved. The graphics queue is a shared graphics resource restricted to the scope of the acting thread via thread-local storage. Each thread used by Ona holds its own instance of a graphics queue, removing any need for the queue to use synchronisation primitives like mutexes. It is the job of the graphics queue to take the data provided through its public methods and normalise it into a format easier and quicker for the main thread to compute. In the case of textured quads, this data normalisation takes the form of packing the transformed data into contiguous buffers for efficient instancing on the GPU.
Commands submitted to the graphics queue are stored within it and dispatched at the end of the current process update, once all threads conclude processing and control returns to the main thread. Synchronisation is not required under this model, as it is only possible for the main thread to touch the queues after their threads have finished executing logic for the current frame. Under the definitions of concurrency described by H. Sutter, the Ona graphics queue API can be considered a lock-free concurrent collection designed for high throughput and scalability [@sutter2_2007]. However, this is not to say that bottlenecks cannot occur with practices that are considered ill-formed by the API specifications. For example, accessing the core graphics server from within a processing function will synchronise access.
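A minimal sketch of this thread-local queue pattern is given below, under the assumption of a simplified draw command; it is a conceptual model and does not mirror the internals of `Ona::GraphicsQueue`, which additionally packs sprite data into contiguous buffers for instancing.

```cpp
#include <cstdint>
#include <vector>

// Conceptual model of a per-thread command queue; not Ona's implementation.
struct DrawCommand {
	float x, y;
	std::uint32_t textureId;
};

class LocalGraphicsQueue {
	std::vector<DrawCommand> commands;

public:
	// Worker threads record commands into their own queue without locking.
	void Record(DrawCommand const & command) {
		this->commands.push_back(command);
	}

	// Invoked by the main thread once all worker threads have finished the
	// frame, so this requires no synchronisation either.
	void Flush() {
		for (DrawCommand const & command : this->commands) {
			// A real implementation would submit the command to the backend here.
			static_cast<void>(command);
		}

		this->commands.clear();
	}
};

// Each thread receives its own queue instance through thread-local storage.
LocalGraphicsQueue & AcquireLocalQueue() {
	thread_local LocalGraphicsQueue queue;

	return queue;
}
```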
As the core graphics server is a globally shared resource, it relies on locking operations to broker synchronised access to avoid falling into data races at run-time. That being said, it is not intended for the core graphics server to be used from within a processing loop. Further, any operation that falls under the domain of the graphics server already has an inherently high cost associated with its use regardless.
&pagebreak;
## Strings and Thread-Safety { #sec-strings-thread-safety }
The string data type is another problem with a high cost associated with it. It is typical that strings created dynamically at run-time need to be allocated in dynamic memory, due to their often unknown size requirements. However, aggressive allocation can become a major bottleneck, especially in low-latency systems, as demonstrated by the Chromium web browser.
Java and C# both prefer visible immutability in their string type implementations, which guarantees inherent thread safety. However, immutability alone does not address other concerns that dynamic strings in C++ face, such as the lack of intelligent memory management. This problem of dynamic memory overhead became a long-standing issue in the Chromium web engine codebase, which was raised in a 2014 Google Groups discussion started by G. Khalil [@khalil_2014]. In the thread, Khalil profiled the software and found that around half of all allocations made during its execution are related to `std::string` [@khalil_2014]. Alongside the number of `std::string` instances created, Khalil also identified several recurring patterns of inefficient `std::string` usage within the codebase.
* `std::string` values were frequently unpacked to their raw `char const *` representation to be re-packaged into new `std::string` instances [@khalil_2014].
* `std::string::operator+` was largely preferred for string concatenation over more specialised multi-string formatting functions [@khalil_2014].
* The codebase possessed many microscopic string operations, which result in many unnecessary re-allocations of the internal buffer [@khalil_2014].
Many of these hidden costs are related to the implementation differences of `std::string` versus string data types present in other programming languages [@stdbasicstring_2021]. In 2016, C. Carruth, LLVM compiler infrastructure contributor and software engineer at Google, gave a CppCon talk discussing hybrid data structures used throughout the codebase of the LLVM toolchain [@carruth_2016]. In the conference talk, Carruth explored a basic optimisation principle named "small buffer optimisation", or "SBO" for short. SBO is the practice of using the unoccupied bytes of an object type to circumvent the need to dynamically allocate resources [@carruth_2016;@guntheroth_2016]. Carruth explored examples of SBO-backed types used by LLVM like `llvm::SmallVector` [@carruth_2016;@llvmsmallvector_2021]. Further, Carruth pointed out that `std::string` is functionally equivalent to `std::vector<char>`, meaning that strings can theoretically benefit from the same kinds of optimisation, as both serve as mutable collections of characters [@carruth_2016].
Many platform-specific implementations of `std::string` can and do make use of small buffer optimisation, but the standard does not require it [@mike_2016;@stdbasicstring_2021]. Ergo, it becomes harder to reason about memory allocation characteristics - something that is incredibly detrimental to software, like games, that falls under real-time constraints.
An older optimisation of `std::string` exists within some versions of the GNU C++ implementation [@gccdualabi_2021;@copyonwrite_2020]. To avoid potentially unnecessary re-allocations of memory when copying `std::string` instances, GNU C++ opted to use atomic reference counting to manage string data automatically at run-time [@copyonwrite_2020]. Whenever a program running under GCC 5 or earlier instantiated a `std::string` from another, both instances would point to the same memory allocation and internal reference counter. The counter updated according to the number of in-memory references belonging to its associated resource, typically through atomic operations [@copyonwrite_2020]. Once the reference count of a GNU C++ `std::string` reached zero, the last instance would clean up the dynamic memory on its way out of scope [@copyonwrite_2020]. Should the contents of the `std::string` change at run-time, the implementation duplicated the memory contents and reset its reference counter to `1` - thereby giving the instance the illusion of owning its referenced memory [@copyonwrite_2020].
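The share-on-copy and duplicate-on-write behaviour described above can be summarised by the sketch below. It is a simplified model written for this paper, not the GNU C++ implementation, and it omits the additional race handling that a production implementation requires.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

// Simplified model of a copy-on-write string with atomic reference counting.
class CowString {
	struct SharedBuffer {
		std::atomic<long> referenceCount;
		std::size_t length;
		char * text;
	};

	SharedBuffer * buffer;

public:
	explicit CowString(char const * text) {
		std::size_t const length = std::strlen(text);

		this->buffer = new SharedBuffer{1, length, new char[length]};
		std::memcpy(this->buffer->text, text, length);
	}

	CowString(CowString const & that) : buffer{that.buffer} {
		// Copying only increments the reference count; no text is duplicated.
		this->buffer->referenceCount.fetch_add(1);
	}

	CowString & operator=(CowString const &) = delete;

	~CowString() {
		// The last remaining owner frees the allocation on its way out of scope.
		if (this->buffer->referenceCount.fetch_sub(1) == 1) {
			delete[] this->buffer->text;
			delete this->buffer;
		}
	}

	char & operator[](std::size_t index) {
		// A mutable access forces the instance to own a unique allocation,
		// preserving the illusion that each string owns its own memory.
		if (this->buffer->referenceCount.load() > 1) {
			SharedBuffer * unique =
				new SharedBuffer{1, this->buffer->length, new char[this->buffer->length]};

			std::memcpy(unique->text, this->buffer->text, this->buffer->length);
			this->buffer->referenceCount.fetch_sub(1);
			this->buffer = unique;
		}

		return this->buffer->text[index];
	}
};
```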
Preceding the GNU Compiler Collection implementation of copy-on-write strings, A. Meredith, H. Boehm, L. Crowl, and P. Dimov presented a draft proposal for such a feature in the C++ standard [@meredith_boehm_crowl_dimov_2008]. Meredith et al identified complications for thread-safe concurrency in the specification of `std::string` [@meredith_boehm_crowl_dimov_2008]. To resolve the issues identified in the proposal, Meredith et al put forward two variants of a new specification: a weak proposal that supplied the minimum changes required for copy-on-write semantics in `std::string`, and a strong proposal that made `std::string` iterators safer in concurrent contexts [@meredith_boehm_crowl_dimov_2008]. GNU C++ would eventually adopt non-standardised copy-on-write semantics for `std::string`. However, the implementation received heavy criticism on the fronts of performance, memory safety, and standard compliance [@copyonwrite_2020;@gccdualabi_2021].
A decade earlier, H. Sutter analysed how different string implementations compare in performance, memory safety, and thread-safety under the compliance of the earlier C++98 standards [@sutter_1999]. Of the various string implementations examined by Sutter, three supported thread-safe reference counting and copy-on-write semantics [@sutter_1999]. Sutter's benchmark concluded that copy-on-write with atomic reference counting was 223 milliseconds slower compared to implementations that had unique ownership of their memory [@sutter_1999]. Further, copy-on-write strings that used Windows Critical Section objects and mutexes performed materially worse, with a latency increase of `8,934` and `33,571` milliseconds respectively under Sutter's benchmark [@sutter_1999].
~ Figure { #fig-copy-on-write; caption:"Copy-on-write semantics invalidating standard C++ types that carry references"; page-align:top}
```cpp
String greeting = "Hello, world";
String greetingCopy = greeting;
// The iterator references memory still shared with "greeting".
auto greetingIterator = greetingCopy.begin();
// "operator[]" returns a mutable reference, so to be safe a
// copy-on-write string will ensure its memory contents are
// unique and allocate a new copy if not.
char greetingCharacter = greetingCopy[0];
```
~
Sutter also identified non-trivial memory safety concerns regarding the invalidation of references in situations where the C++ standard forbids them from being invalidated [@sutter_1999]. A modernised example of what Sutter explores in the article is presented with C++11 iterators in figure [#fig-copy-on-write]. Assuming that `String` is a copy-on-write data type, it will have to copy its memory contents when returning a mutable reference for the character at index `0`. This copy results in `greetingIterator` pointing to the old shared allocation, which is no longer owned or used by `greetingCopy` - despite the standard requiring that `operator[]` leave iterators valid.
Due to these irreconcilable differences between copy-on-write and the standard specification, the GNU Compiler Collection decided to remove copy-on-write semantics from their platform implementation of the standard in GCC 5.1 [@gccdualabi_2021;@copyonwrite_2020]. Today, GNU C++ copy-on-write data types may still be used through a special compilation mode for the sake of backwards compatibility with existing codebases that rely on copy-on-write behaviour [@gccdualabi_2021].
~ Figure { #fig-small-buffer-optimisation; caption:"Small buffer optimisation in `Ona::String`."; page-align:top}
![SmallBufferOptimisation]
[SmallBufferOptimisation]: images/SmallBufferOptimisation.png "Small Buffer Optimisation" { width:auto; max-width:90% }
~
In contrast, Ona benefits from not following the requirements of the C++ standard in its string implementation. Instead, it veers toward the more managed implementations present in languages like C# and Java. `Ona::String` is a UTF-8 encoded string type that leverages immutable visibility, small-buffer optimisation, and situation-specific atomic reference counting to give the runtime more intelligent control over how instances of text are created and passed around in memory. As shown in figure [#fig-small-buffer-optimisation], Ona is able to avoid the need for both dynamic allocation and reference counting by giving instances local ownership of UTF-8 encoded text that is `24` bytes or fewer in size. For text that exceeds this constraint, its memory is promoted to the heap and automatically managed through atomic-backed reference counting.
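While the precise layout of `Ona::String` is internal to the engine, the general structure implied by figure [#fig-small-buffer-optimisation] can be sketched as below, assuming a `24`-byte inline buffer and a reference-counted heap allocation for longer text. The field names are illustrative rather than taken from the Ona source.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative small-buffer-optimised layout; field names are hypothetical
// and do not mirror Ona::String exactly.
struct StringLayout {
	// Header for text promoted to the heap; the UTF-8 bytes are allocated
	// immediately after it in the same block.
	struct DynamicBuffer {
		std::atomic<std::uint32_t> referenceCount;
	};

	std::uint32_t size; // Number of UTF-8 encoded bytes in the string.

	union {
		std::uint8_t local[24];  // Text of 24 bytes or fewer lives inline.
		DynamicBuffer * dynamic; // Longer text is promoted to the heap.
	} data;

	bool IsDynamic() const {
		return this->size > sizeof(this->data.local);
	}
};
```

Under such a layout, copying a short string is a plain member-wise copy, while copying a long string only needs to increment the shared reference count.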
~ Figure { #fig-string-testing-environment; caption:"String benchmarking test environment."; page-align:top}
```cpp
void Greet(StringType message) {
// Prevent optimising function call away.
message;
}
int main(int argc, char * * argv) {
for (size_t i = 0; i < Iterations; i += 1) {
StringType createdString{Message};
		Greet(createdString);
}
}
```
~
Figure [#fig-string-testing-environment] depicts a benchmark designed to compare the execution time of a program creating and copying strings using `Ona::String` versus `std::string` on QuickBench under Clang 11.0 C++17 at optimisation level 1. The benchmark environment tests the execution time of creating and copying for both short and long strings. For each test case, `StringType` is substituted with the test subject string implementation and `Message` is either `"Hello, world"` or `"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet"` - a section of the popularly used "Lorem Ipsum" placeholder text, depending on whether short or long strings are being tested.
~ Figure { #fig-string-testing-results; caption:"String benchmarking test results."; page-align:top}
| String Type | QuickBench Execution Duration (Milliseconds) |
| ------------------- | :------------------------------------------: |
| Short `std::string` | 382.30 |
| Short `Ona::String` | 308.72 |
| Long `std::string` | 575.97 |
| Long `Ona::String` | 546.58 |
~
In both benchmarks, `Ona::String` leads, but not by much. Reference counting optimisations become more apparent the longer data exists, as it is passed around by atomic reference rather than re-allocated each time. However, creating benchmarks that specifically stress-test reference counting is difficult, as this requires a benchmark to intentionally run for longer periods of time. With the constraints of most short-running benchmarking frameworks like QuickBench, this is not an option made trivially available.
Alongside better performance, `Ona::String` also guarantees higher consistency in operational latency than `std::string`, due to small buffer optimisation being a known quantity of its specification. M. Acton, the current principal architect of the Unity game engine's Data-Oriented Technology Stack, argues that understanding the constraints of a target platform is fundamental to effective reasoning about its problems [@acton_2014]. In exchange for safer concurrency and better performance, `Ona::String` adheres to many constraints that the C++ standard specified variant does not.
* Every instance of `Ona::String` is read-only after construction, offering no way to modify itself through the user API.
* String concatenation must be done through dedicated formatting functions rather than by chaining operator overload functions.
* `Ona::String` instances are not guaranteed to be sentineled with a `0` byte after the last valid character.
* Characters of an `Ona::String` cannot be individually queried, nor are any handles that rely on the internal data handed out.
&pagebreak;
# Testing { #sec-testing }
Per the study goals, testing is foremost concerned with overall performance as measured by frames per second. The efficient utilisation of hardware forms the second concern of testing, with software that achieves both the highest output workload and the most efficient utilisation of the hardware platform considered the best result. Three different game engines have been chosen for comparison to Ona, due to the specific problems that each attempts to solve.
* Unity is a game engine with full three-dimensional rendering and partial two-dimensional support. Over the last decade, it has risen to prominence as one of the industry standard tools for both independent and large-scale game development.
* Godot provides a lightweight game engine and development environment intended for independent development. While it has full support for three-dimensional rendering, it is primarily targeted at two-dimensional games development at the moment.
* Raylib focuses on minimalism and simplicity, written entirely in C99 and catering foremost to two-dimensional games development [@raylib_2021].
~ Figure { #fig-ona-test-flow; caption:"Ona testing logic."; page-align:top}
![OnaTestFlow]
[OnaTestFlow]: images/OnaTestFlow.png "Engine API Diagram" { width:auto; max-width:80% }
~
Ona is tested and compared to these three different game technologies using the logic outlined in figure [#fig-ona-test-flow]. Further, three different execution configurations of Ona are used in the comparison.
* Sequential, single-threaded execution using one core to schedule each system.
* Parallel, multi-threaded, unbatched execution that passes the new data calculations between two concurrent systems every frame.
* Parallel, multi-threaded, batched execution that bulk passes data calculations between two concurrent systems every frame.
## Methodology { #sec-methodology }
In order to ensure a fair assessment of how Ona compares to other games engines and frameworks, a two-stage test is conducted on the technologies. The first test stage focuses on identifying the surface-level performance characteristics, while the second stage is used to identify how well the subject software makes use of available hardware.
~ Figure { #fig-testing-texture; caption:"Testing texture used across benchmarks."; page-align:top}
![Texture]
[Texture]: images/Texture.png "Testing texture used across benchmarks" { width:auto; max-width:25% }
~
For stage one, each application is stress-tested by drawing figure [#fig-testing-texture] to the screen at new positions every frame, as many times as possible, in increasing increments of `100`. The graphic in figure [#fig-testing-texture] was chosen as it is easily visible and memory efficient due to its simple colour palette; beyond that, the decision is primarily arbitrary, as rendering one texture over another at the same resolution is no more expensive on modern rendering hardware, regardless of contents.
While executing, benchmarking of the software is performed by monitoring framerate. This stage is primarily concerned with framerate output, as it is the most significant indicator of performance in a real-time application. To accommodate inconsistency in application framerates, each threshold has an acceptable framerate range. The highest number of sprites rendered for each threshold is recorded as the performance benchmark for that application.
* `50 - 55` frames per second is considered the minimum working framerate based on the standard refresh rate of current displays being 60Hz [@haynes_2020]. A range below `60` is used to handle scenarios where the framerate may temporarily drop, as is common in real-time software with a lot of content on the screen at a single time.
* `25 - 30` frames per second is the worst case scenario and cut-off point for testing. Beyond this, a software application is not considered to be rendering enough frames to satisfy real-time requirements.
* `40 - 45` frames per second acts as the control framerate for better averaging between the minimum working and worst case framerate thresholds.
Application framerate is monitored externally rather than from within the software itself, as significant drops in frames per second would result in the framerate indicator losing real-time accuracy. Therefore, framerate monitoring is provided via the Steam gaming client overlay, as it is a lightweight and readily available external tool.
Prior to testing stage two, the software application is run ten times to derive the mean time between launch and the start of test execution. After the mean initialisation time is accounted for, stress testing begins by rendering as many sprites as were recorded for the working framerate threshold. As the process runs, the Linux performance analysis tool `perf` records its hardware utilisation for an arbitrary duration of `5` seconds plus initialisation time, after which the process terminates. As the application runs in real-time and the workload is constant, `5` seconds is long enough to produce data that is holistically representative of the wider application. Of the telemetry recorded by `perf`, this test is interested in the average number of instructions per cycle, the percentage of process cycles spent idle, and the percentage of branches mispredicted.
* Instructions per cycle, or IPC, represents how well the process avoids pipelining hazards in its execution. A low IPC yield shows that an application is struggling to efficiently pipeline its instructions, while a high IPC yield suggests that the program is well-structured and does not fall into many of the hazards discussed in section [#sec-instruction-pipeline].
* Idle cycles are a strong measure of how well a software application can schedule its parallel tasks, with gaps in a single thread workload resulting in idle cycles as it does nothing. While a relevant metric, `perf` does not directly produce readings for overall cycles idle. However, these performance criteria are derivable from the mean of the percentage of stalled front-end and back-end cycles.
* Branch mispredictions are another indicator of pipelining hazards and are beneficial in giving further context to the results derived from the number of instructions executed per cycle. For example, given an application with a low IPC and a high misprediction rate, it is evident that there are many control hazards in its logic.
~ Figure { #fig-testing-function; caption:"Example of function used across benchmarks."; page-align:top}
```cpp
#include <cstdint>
#include <cstdlib>

int32_t RandomValue(int32_t min, int32_t max) {
return (rand() % (abs(max - min) + 1) + min);
}
```
~
To ensure equality of opportunity across the software tested, each application is timed for `5` seconds of execution after initialisation before being exited by a bash script. Initialisation time is discerned by running the software `10` times and averaging the duration taken to get from the initialisation or splash screen to the testing logic. Attempts at achieving further consistency between testing conditions are made by implementing a custom randomisation function, as close to figure [#fig-testing-function] as each software will allow. Furthermore, software that requires native compilation, such as Raylib and Ona, uses the same version of the Clang C++ compiler with identical optimisation flags. The Clang C++ compiler is used due to its currently superior code generation in its LLVM backend compared to other compiler vendors.
Finally, the testing platform used in the test is an AMD Ryzen 7 1700X clocked at 3.4GHz with an AMD RX 5600, running Manjaro Linux x86_64 under Linux kernel 5.10. Each test is run on the hardware at an unrestricted framerate and a resolution of 1920×1080, based on the most popular resolution from the April 2021 Steam hardware survey [@steamsurvey_2021].
&pagebreak;
## Stage 1 Results { #sec-stage-1-results }
~ Figure { #fig-framework-benchmarks; caption:"Engine and framework texture drawing benchmark results."; page-align:top}
| Software | <55 FPS | <45 FPS | <30 FPS | Mean |
| :-------------------------- | :-----: | :-----: | :-----: | :----: |
| Raylib | 35,000 | 42,500 | 67,500 | 48,333 |
| Ona (Batched Channels) | 24,100 | 29,800 | 43,200 | 32,033 |
| Unity 2021.1                | 20,800  | 25,800  | 39,100  | 28,567 |
| Ona (Sequential) | 17,400 | 21,500 | 34,500 | 24,467 |
| Ona (Unbatched Channels) | 15,100 | 17,700 | 27,400 | 20,067 |
| Godot 3.3 (Single-Threaded) | 8,700 | 10,600 | 16,100 | 11,800 |
| Godot 3.3 (Multi-Threaded) | 6,600 | 8,200 | 12,800 | 9,200 |
~
The Raylib game framework demonstrates the most workload throughput of the software tools benchmarked, despite avoiding any explicit use of multi-threading in its source code [@raylibmultithreading_2019]. While `perf` reports Raylib as a multi-threaded process, it is likely a library dependency performing multi-threaded tasks in the background.
Ona, on the other hand, achieved second place in the benchmarks in its batched message-passing configuration. While exhibiting notably better throughput than both Godot 3.3 and Unity 2021.1, Raylib still outperformed it. To investigate further, the Linux profiling tool, "OProfile", was executed on both Raylib and Ona to see where each software spent most of its time. In the case of Raylib, `63.34%` of its execution time was spent within its library calls for updating the running loop, polling events, and dispatching draw calls. The code generated by the random number calculation function consumes a further `15.08%` of the execution time, and the remaining `21.58%` is spent in calls with no discernible debug information. Meanwhile, Ona uses `51.01%` of its execution time within its parallel calculation and rendering systems, `27.3%` in a system-blocking operation that copies the batched data between systems, and `5.49%` in the end-of-frame graphics server update. The remaining `16.2%` is composed of calls with no debug information. From these results, it is evident that Ona spends a significant amount of time in blocking operations, likely explaining the performance disparity against Raylib. The Raylib framework, being single-threaded in design, does not need to worry about scheduling and blocking.
Unity, while demonstrating an average disparity of `-3,466` sprites per frame against parallel Ona, maintained a significant lead over purely sequential Ona, averaging at `+4,100` sprites per frame. The lead presented by Unity is significant due to the amount of additional work it must do compared to Ona; Unity is not only rendering sprites but also processing a sparse scene graph hierarchy. The throughput demonstrated in the Unity benchmark suggests that significant engineering efforts have gone into its rendering pipeline and scene processing technologies.
Godot 3.3 was determined to be the slowest of the benchmarked software, with both its multi-threaded and single-threaded configurations showing measurably lower throughput compared to the other technologies tested. Furthermore, Godot presented a surprising characteristic in that its multi-threaded rendering system performed worse than its single-threaded one.
&pagebreak;
## Stage 2 Results { #sec-stage-2-results }
~ Figure { #fig-framework-performance; caption:"Engine and framework hardware utilisation as described by average instructions per cycle (IPC), total CPU time idle, and total branch mispredictions."; page-align:top}
| Software | Avg. IPC | Tot. Idle | Tot. BM |
| :---------------------------| :---: | :---: | :------:|
| Raylib | 2.18 | 29.4% | 0.26% |
| Ona (Batched Channels) | 2.12 | 12.96% | 0.31% |
| Ona (Sequential) | 2.00 | 25.31% | 0.13% |
| Ona (Unbatched Channels) | 1.85 | 27.58% | 0.1% |
| Godot 3.3 (Single-Threaded) | 1.64 | 10.23% | 0.45% |
| Godot 3.3 (Multi-Threaded) | 1.59 | 9.46% | 0.56% |
| Unity 2021.1 | 1.02 | 21.27% | 0.4% |
~
Raylib has the highest effective CPU throughput of the tested software, as evidenced by its instructions per cycle. Further, of all the applications tested, Raylib spends the most time idle in its process. It is likely that, while it utilises the available hardware less efficiently in terms of idle time, its performance comes from the comparatively small number of pipelining hazards its instruction stream encounters.
Under the hardware utilisation test, each of the configurations used by Ona sit next to each other according to their instructions per cycle. This is unsurprising, as adding further threads should not significantly complicate the already efficient pipelining of instructions on the part of the process. More interesting, however, is the percentage of cycles spent idle - with the batched message-passing approach demonstrating the lowest. The reduction in CPU idle time suggests that the parallel model used by Ona is successful in reducing idle time on the processor.
While the lowest-performing in throughput, both configurations of Godot 3.3 present better hardware utilisation compared to Unity 2021.1. Nevertheless, similar to stage 1 of the testing, the Godot single-threaded renderer performs better when compared to its multi-threaded counterpart. Godot is still a relatively young engine compared to Unity, so it has not had the same number of years to receive as many optimisations. Regardless, the performance characteristics of its single-threaded renderer versus its multi-threaded one are interesting.
Unity 2021.1, compared to the other software tested, has the lowest average number of instructions executed per cycle. This lower average occurs despite Unity performing better than Godot 3.3 in overall throughput. Additionally, Unity 2021.1 also struggles with a relatively high percentage of branch mispredictions, pointing to a high number of pipelining hazards in its internals.
# Conclusions { #sec-conclusions }
While Ona performed as expected compared to higher-level frameworks like Unity 2021.1 and Godot 3.3, it failed to out-perform the sequential logic used in Raylib. As observed in section [#sec-stage-1-results], this is likely related to the overhead of message-passing between systems every frame. Beating single-threaded performance on small-scale tasks is also difficult, as multi-threading presents diminishing returns the more atomised a task becomes.
In conclusion, the research conducted and the prototype produced are solid foundations for a more refined model of computation for real-time solutions. While further optimisations can be made based on the findings of this research, it has been demonstrated that the construction of an easy-to-use framework for parallel programming in real-time systems is possible without the intervention of lower-level concepts like threads or mutexes.
# Future Work { #sec-future-work }
Future work on Ona will look at improving performance further to out-compete single-threaded lightweight solutions like Raylib. As observed from testing, a significant bottleneck in processing exists in the channel passing mechanism - specifically in data copying. Copying of data passed across system boundaries is necessary for both memory and thread safety, as the runtime cannot currently ensure the lifetime of data shared between two systems. Further, immutability guarantees in C++ are relatively weak, meaning that data changed by the receiving system may invalidate the state of the sending system. C++17 alone does not offer viable solutions to both of these safety concerns, so a managed language runtime would be preferable. Due to their growing popularity in general-purpose game engines, integration of a managed language like C# to address the above-mentioned problems may be an avenue for future research.
&pagebreak;
[BIB]