v1 - ami-0185c61124653544c
Propagate .NET HPC — Supercomputing Edition
AMI Documentation & User Guide
Overview
Propagate .NET HPC is a production-ready Amazon Machine Image (AMI) built on Ubuntu 24.04 LTS, purpose-configured for high-performance .NET workloads on AWS. It provides three .NET runtimes, a full parallel computing stack, kernel-level performance tuning, and integrated management tooling — eliminating days of manual setup and configuration.
This AMI is designed for teams running scientific simulations, parallel number crunching, high-throughput server applications, and any workload where .NET performance matters.
What's Installed
.NET SDKs & Runtimes
| Version | Type | Support Window | Use Case |
|---|---|---|---|
| .NET 6.0.428 | SDK + Runtime | Legacy (EOL Nov 2024) | Existing codebases that haven't migrated |
| .NET 8.0.125 | SDK + Runtime | LTS (through Nov 2026) | Current production stable |
| .NET 10.0.104 | SDK + Runtime | LTS (through Nov 2028) | Latest performance features, AOT compilation |
All three versions are installed side by side. Switch between them per-project using a global.json file or set a system-wide default with propagate dotnet-default <version>.
A template global.json is provided at /opt/propagate/config/global.json.
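A minimal global.json that pins a project to the 8.0 SDK looks like the following. This is illustrative rather than the verbatim contents of the shipped template; the rollForward policy shown is one common choice:

```json
{
  "sdk": {
    "version": "8.0.125",
    "rollForward": "latestFeature"
  }
}
```

Place it at the root of a repository and every dotnet invocation below that directory resolves to the pinned SDK.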
.NET Global Profiling Tools
| Tool | Version | Purpose |
|---|---|---|
| dotnet-trace | 9.0.x | Collect diagnostic traces from running .NET processes |
| dotnet-counters | 9.0.x | Monitor real-time .NET performance counters (GC, threadpool, exceptions) |
| dotnet-dump | 9.0.x | Capture and analyze process dumps for debugging |
| dotnet-gcdump | 9.0.x | Capture GC heap snapshots for memory analysis |
All tools are installed to /usr/local/share/dotnet-tools and symlinked to /usr/local/bin so they are available to all users without PATH configuration.
Parallel Computing Libraries
| Library | Version | Purpose |
|---|---|---|
| OpenMPI | 4.1.6 | Distributed message passing for multi-process and multi-node parallel computing |
| OpenBLAS | System | Optimized Basic Linear Algebra Subprograms (matrix operations, vector math) |
| LAPACK / LAPACKE | System | Linear algebra routines (eigenvalues, SVD, least squares) |
| FFTW3 | System | Fast Fourier Transform library, including MPI-distributed FFT support |
| HDF5 (OpenMPI) | System | High-performance data format for large scientific datasets |
| Eigen3 | System | C++ template library for linear algebra (for native interop via P/Invoke) |
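As a sketch of native interop with these libraries, the following P/Invoke call into OpenBLAS computes a dot product from C#. The shared library name libopenblas.so.0 and its presence on the default loader path are assumptions about this image, not verified facts:

```csharp
using System;
using System.Runtime.InteropServices;

double[] x = { 1, 2, 3 };
double[] y = { 4, 5, 6 };

// cblas_ddot computes x . y = 1*4 + 2*5 + 3*6 = 32
Console.WriteLine(Blas.DDot(x.Length, x, 1, y, 1));

static class Blas
{
    // cblas_ddot(N, X, incX, Y, incY) from the CBLAS interface
    [DllImport("libopenblas.so.0", EntryPoint = "cblas_ddot")]
    public static extern double DDot(int n, double[] x, int incX, double[] y, int incY);
}
```

If the library is installed under a different soname, adjust the DllImport name or use NativeLibrary.SetDllImportResolver to map it at startup.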
Performance Profiling Tools
| Tool | Location | Purpose |
|---|---|---|
| Linux perf | System | Hardware-level CPU profiling (cache misses, branch prediction, cycles) |
| FlameGraph | /opt/FlameGraph | Stack trace visualization toolkit for generating flame graphs from perf data |
| BCC/eBPF tools | System | Dynamic kernel tracing and analysis without recompilation |
| dotnet-trace | /usr/local/bin | .NET-specific event tracing (GC events, JIT, threadpool) |
| dotnet-counters | /usr/local/bin | Real-time .NET runtime metrics |
| htop | System | Interactive process viewer |
| sysstat (sar, iostat) | System | System activity reporting and I/O statistics |
| numactl | System | NUMA policy control for process binding |
| hwloc | System | Hardware topology discovery and visualization |
System Utilities
| Package | Purpose |
|---|---|
| build-essential | GCC, G++, make — for compiling native interop libraries |
| cmake | Build system for C/C++ dependencies |
| git | Version control |
| jq | JSON processing from the command line |
| curl, wget | HTTP clients for downloading packages and data |
| zip, unzip | Archive management |
Security
| Component | Configuration |
|---|---|
| UFW (Uncomplicated Firewall) | Enabled. Default deny incoming, allow outgoing. SSH (port 22) and OpenMPI (ports 10000-10100) allowed. |
| fail2ban | Enabled. SSH brute-force protection with 3 max retries, 1 hour ban time. |
| SSH | Key-based authentication only. No default passwords. |
Kernel & System Optimizations
CPU Performance
| Setting | Value | Effect |
|---|---|---|
| CPU governor | performance | CPU runs at maximum frequency at all times. Eliminates frequency scaling latency that can cause inconsistent benchmark results and computation stalls. Configured via systemd service propagate-cpu-governor. |
| CPU idle latency | Minimized | Reduces C-state transition latency by writing to /dev/cpu_dma_latency. Configured via systemd service propagate-cpu-latency. |
| Scheduler migration cost | Profile-dependent | Controls how aggressively the kernel migrates processes between CPUs. Higher values reduce migration (better cache locality). |
| Scheduler autogroup | Profile-dependent | Controls automatic task grouping. Disabled in compute-heavy profile to give the scheduler full control. |
| NUMA balancing | Disabled | Automatic NUMA page migration is turned off (kernel.numa_balancing=0). This prevents the kernel from moving memory pages between NUMA nodes during computation, which causes unpredictable latency spikes. Applications should manage their own NUMA placement using numactl. |
Memory
| Setting | Value | Effect |
|---|---|---|
| Huge pages (2MB) | 512 default | Pre-allocated huge pages reduce TLB misses for large memory allocations. Configurable via setup wizard or propagate hugepages <count>. The recommended value is calculated during setup based on available RAM. |
| Swappiness | 10 (default) / 1 (compute-heavy) | Controls how aggressively the kernel swaps memory to disk. Low values keep compute data in RAM. |
| Dirty ratio | 40% | Percentage of RAM that can hold dirty (unwritten) pages before processes generating writes are forced into synchronous writeback. |
| Dirty background ratio | 10% | Percentage of RAM with dirty pages before background writeback starts. |
| Shared memory max | 64 GB | kernel.shmmax set to 68719476736 bytes. Required for large MPI shared memory segments. |
| Shared memory total pages | 4 billion | kernel.shmall set to 4294967296. Total shared memory pages available system-wide. |
| Max memory map count | Profile-dependent | Increased to 1048576 in memory-heavy profile for applications that memory-map many files. |
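The reserved sizes above can be sanity-checked with shell arithmetic:

```shell
# 512 pre-allocated huge pages x 2 MB each = 1024 MB of reserved RAM
total_mb=$((512 * 2))

# kernel.shmmax of 64 GB expressed in bytes
shmmax=$((64 * 1024 * 1024 * 1024))

echo "$total_mb"   # 1024
echo "$shmmax"     # 68719476736
```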
Network
| Setting | Value | Effect |
|---|---|---|
| TCP receive buffer max | 16 MB | net.core.rmem_max=16777216. Allows large TCP receive windows for high-throughput MPI communication between nodes. |
| TCP send buffer max | 16 MB | net.core.wmem_max=16777216. Allows large TCP send windows. |
| TCP congestion control | HTCP | Hamilton TCP — designed for high-bandwidth, high-latency networks. Better throughput than default Cubic for inter-node MPI traffic. |
| MTU probing | Enabled | net.ipv4.tcp_mtu_probing=1. Automatically discovers the path MTU, avoiding fragmentation and black-hole issues on networks with jumbo frames. |
| Backlog queue | 30000 | net.core.netdev_max_backlog=30000. Prevents packet drops during burst MPI communication. |
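These settings correspond to sysctl lines along the following lines in /etc/sysctl.d/99-propagate-hpc.conf (a sketch built from the values in the table above, not the file's verbatim contents):

```
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 30000
```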
Process Limits
| Limit | Value | Why |
|---|---|---|
| Open files (nofile) | 1,048,576 | Large parallel jobs may open thousands of file descriptors simultaneously (sockets, data files, shared memory segments). |
| Max processes (nproc) | Unlimited | MPI applications spawn one process per core per node. No artificial limit. |
| Locked memory (memlock) | Unlimited | Required for MPI shared memory and RDMA. Prevents the kernel from swapping pinned buffers. |
| Stack size | Unlimited | Deep recursion in scientific computing code (solvers, tree searches) needs large stacks. |
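In pam_limits format, these limits would look roughly like the following in /etc/security/limits.d/99-propagate-hpc.conf (a sketch assuming limits are applied to all users; the actual file may scope them differently):

```
*  soft  nofile   1048576
*  hard  nofile   1048576
*  soft  nproc    unlimited
*  hard  nproc    unlimited
*  soft  memlock  unlimited
*  hard  memlock  unlimited
*  soft  stack    unlimited
*  hard  stack    unlimited
```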
File System
| Setting | Value | Effect |
|---|---|---|
| fs.file-max | 2,097,152 | System-wide maximum file descriptors. |
| fs.nr_open | 2,097,152 | Per-process maximum file descriptors. |
.NET Runtime Environment Variables
The following environment variables are set globally via /etc/profile.d/dotnet-hpc.sh and apply to all .NET processes:
| Variable | Value | Effect |
|---|---|---|
| DOTNET_gcServer | 1 | Enables server garbage collection. Uses one GC thread per logical processor, reducing pause times for multi-threaded applications. Critical for compute workloads — workstation GC (the default) collects on a single thread, pausing all application threads during collection. |
| DOTNET_EnableAVX2 | 1 | Enables AVX2 SIMD instructions (256-bit vector operations). Allows Vector&lt;T&gt; to process 8 floats or 4 doubles per instruction. |
| DOTNET_EnableSSE41 | 1 | Enables SSE4.1 instructions. Provides additional vectorized operations for string processing, integer operations, and rounding. |
| DOTNET_TieredCompilation | 1 | Enables tiered JIT compilation. Methods are first quickly JIT-compiled (Tier 0), then recompiled with full optimizations (Tier 1) after repeated execution. Balances startup speed with steady-state performance. |
| DOTNET_TC_QuickJitForLoops | 1 | Allows Tier 0 compilation for methods containing loops. Without this, loop-containing methods skip Tier 0 and wait for full optimization, slowing initial execution. |
| DOTNET_ReadyToRun | 1 | Uses pre-compiled (ReadyToRun) framework assemblies. Reduces startup time by avoiding JIT compilation of framework code. |
Override any variable per-process by setting it before the command:
DOTNET_gcServer=0 dotnet run # Use workstation GC for this run only
HPC Tuning Profiles
Four pre-configured profiles are available via sudo propagate tune <profile>. Each adjusts kernel parameters for different workload types.
compute-heavy
Best for CPU-bound simulations, numerical methods, Monte Carlo, ray tracing.
| Parameter | Value |
|---|---|
| vm.swappiness | 1 |
| kernel.sched_migration_cost_ns | 5000000 |
| kernel.sched_autogroup_enabled | 0 |
| CPU governor | performance |
What it does: Virtually eliminates swapping, keeps processes pinned to their current CPU longer (better L1/L2 cache hit rates), and disables automatic task grouping so the scheduler treats every process individually. Best when every CPU cycle matters.
balanced
Best for mixed workloads, development, testing, web APIs.
Uses the default kernel tuning applied during provisioning. No additional changes.
What it does: Provides the HPC network tuning and memory configuration without aggressive CPU pinning. Good starting point when you're not sure which profile to use.
memory-heavy
Best for large dataset processing, in-memory databases, genomics, bioinformatics.
| Parameter | Value |
|---|---|
| vm.swappiness | 5 |
| vm.overcommit_memory | 1 |
| vm.max_map_count | 1048576 |
What it does: Allows memory overcommit (the kernel won't refuse allocations based on available RAM), increases the maximum number of memory-mapped regions (important for memory-mapped files and databases like MongoDB), and keeps swappiness very low.
mpi-cluster
Best for distributed computing across multiple EC2 instances.
| Parameter | Value |
|---|---|
| net.core.rmem_max | 33554432 (32 MB) |
| net.core.wmem_max | 33554432 (32 MB) |
| net.ipv4.tcp_rmem | 4096 1048576 33554432 |
| net.ipv4.tcp_wmem | 4096 1048576 33554432 |
What it does: Doubles the network buffer sizes from the default HPC configuration. Designed for high-throughput MPI message passing between nodes where large messages need to be buffered in-kernel.
SIMD Capabilities
This AMI runs on x86_64 EC2 instances and automatically enables all available SIMD instruction sets. The actual capabilities depend on your instance type:
| Instruction Set | Vector Width | Operations per Instruction (float) | Supported Instance Families |
|---|---|---|---|
| SSE4.2 | 128-bit | 4 floats / 2 doubles | All x86_64 |
| AVX2 | 256-bit | 8 floats / 4 doubles | All current-gen (c5+, m5+, r5+) |
| AVX-512 | 512-bit | 16 floats / 8 doubles | c5.metal, m5zn, c6i, r6i, hpc6a |
| FMA | 256-bit | 8 fused multiply-adds | All current-gen |
.NET's System.Numerics.Vector<T> and System.Runtime.Intrinsics.X86 namespaces automatically use the best available instruction set. No code changes required — the JIT compiler detects CPU capabilities at runtime.
Verify your instance's SIMD support:
propagate benchmark # Runs the built-in SIMD capability check
Propagate Management CLI
The propagate command is available system-wide and provides all management operations.
Commands
| Command | Requires sudo | Description |
|---|---|---|
| propagate status | No | Full system overview: instance info, CPU, memory, .NET versions, OpenMPI, load, storage, HPC profile |
| propagate benchmark | No | Run the built-in benchmark suite: vector/SIMD throughput, parallel math, memory bandwidth, MPI test |
| propagate dotnet-list | No | List all installed .NET SDKs, runtimes, and global tools |
| propagate dotnet-default &lt;ver&gt; | Yes | Set the default .NET SDK version (6.0, 8.0, or 10.0) |
| propagate tune &lt;profile&gt; | Yes | Apply an HPC tuning profile (compute-heavy, balanced, memory-heavy, mpi-cluster) |
| propagate mpi-test [n] | No | Run an MPI hello world test with n processes (default: 4) |
| propagate profile &lt;PID&gt; | Yes | Profile a running .NET process using dotnet-trace, with perf fallback |
| propagate hugepages [count] | Yes (to set) | Show current huge pages status, or set a new count |
| propagate numa | No | Display NUMA topology and CPU affinity map |
| propagate ebs | No | Show EBS volume status and mount info |
| propagate logs | No | View provisioning and setup logs |
| propagate setup | Yes | Re-run the interactive setup wizard |
Multi-Node MPI Setup
For distributed computing across multiple EC2 instances:
Prerequisites
- Launch all instances from this AMI in the same VPC and subnet
- Use a placement group (cluster strategy) for lowest latency
- Configure the security group to allow TCP ports 10000-10100 between instances
- Use the same SSH key pair for all instances
Configuration Steps
1. Set up passwordless SSH between nodes. On the primary node, generate a key and distribute it:

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
# Copy to each worker node:
ssh-copy-id -i ~/.ssh/id_ed25519 ubuntu@<worker-ip>

2. Edit the MPI hostfile:

sudo nano /opt/propagate/config/mpi_hosts

Example for three 32-vCPU nodes:

10.0.1.10 slots=32
10.0.1.11 slots=32
10.0.1.12 slots=32

3. Apply the MPI cluster tuning profile on all nodes:

sudo propagate tune mpi-cluster

4. Test connectivity:

propagate mpi-test 8

5. Run your application:

mpirun --hostfile /opt/propagate/config/mpi_hosts \
  -np 96 --map-by node \
  dotnet run -c Release
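The -np value should match the total slot count in the hostfile. A quick way to compute it (shown here against a temporary example file, since the real file lives at /opt/propagate/config/mpi_hosts):

```shell
# Build an example hostfile matching the three-node layout above
cat > /tmp/mpi_hosts.example <<'EOF'
10.0.1.10 slots=32
10.0.1.11 slots=32
10.0.1.12 slots=32
EOF

# Sum the slots= values to get the total rank count for -np
total=$(awk -F'slots=' '{n += $2} END {print n}' /tmp/mpi_hosts.example)
echo "np=$total"   # np=96
```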
Recommended Instance Types for MPI
| Instance | vCPUs | RAM | Network | Notes |
|---|---|---|---|---|
| hpc6a.48xlarge | 96 | 384 GB | 100 Gbps EFA | Purpose-built HPC, best MPI performance |
| c6i.32xlarge | 128 | 256 GB | 50 Gbps | High CPU count, good price/performance |
| c7i.48xlarge | 192 | 384 GB | 50 Gbps | Latest gen compute, highest single-node CPU count |
Recommended Instance Types
| Instance | vCPUs | RAM | Best For | Approx. EC2 Cost/hr |
|---|---|---|---|---|
| c6i.xlarge | 4 | 8 GB | Development, testing, small simulations | $0.17 |
| c6i.4xlarge | 16 | 32 GB | Medium parallel workloads | $0.68 |
| c6i.8xlarge | 32 | 64 GB | Production compute jobs | $1.36 |
| c7i.16xlarge | 64 | 128 GB | Large parallel simulations | $2.86 |
| c7i.metal-48xl | 192 | 384 GB | Maximum single-node performance | $8.57 |
| m5zn.6xlarge | 24 | 96 GB | High clock speed (4.5 GHz), latency-sensitive | $1.98 |
| r6i.8xlarge | 32 | 256 GB | Memory-heavy scientific computing | $2.02 |
| hpc6a.48xlarge | 96 | 384 GB | Dedicated HPC with EFA networking | $2.88 |
Firewall Rules
| Port | Protocol | Purpose | Default |
|---|---|---|---|
| 22 | TCP | SSH access | Open (configurable via setup wizard) |
| 10000-10100 | TCP | OpenMPI inter-node communication | Open |
All other incoming ports are blocked by default. Add rules as needed:
sudo ufw allow 8080/tcp comment 'Web API'
sudo ufw allow 5000/tcp comment 'Kestrel'
sudo ufw status
File System Layout
| Path | Contents |
|---|---|
| /opt/propagate/bin/ | Management CLI and setup scripts |
| /opt/propagate/config/ | Configuration files (hpc.conf, mpi_hosts, global.json) |
| /opt/propagate/benchmarks/ | Built-in benchmark source code (SimdCheck.cs) |
| /opt/propagate/docs/ | Documentation |
| /data/ | Default EBS data volume mount point (configured via setup wizard) |
| /usr/lib/dotnet/sdk/ | .NET SDK installations |
| /usr/share/dotnet/ | .NET shared runtime components |
| /usr/local/share/dotnet-tools/ | Global .NET diagnostic tools |
| /opt/FlameGraph/ | Brendan Gregg's FlameGraph toolkit |
| /etc/sysctl.d/99-propagate-hpc.conf | HPC kernel tuning parameters |
| /etc/profile.d/dotnet-hpc.sh | .NET environment variables (loaded on login) |
| /etc/security/limits.d/99-propagate-hpc.conf | Process resource limits |
| /var/log/propagate-provision.log | Provisioning log |
Quick Start Examples
Hello World with .NET 10
dotnet new console -n HelloHPC -f net10.0
cd HelloHPC
dotnet run
SIMD Vector Addition
using System.Numerics;
var a = new float[1024];
var b = new float[1024];
var c = new float[1024];
// Fill with data...
for (int i = 0; i <= a.Length - Vector<float>.Count; i += Vector<float>.Count)
{
var va = new Vector<float>(a, i);
var vb = new Vector<float>(b, i);
(va + vb).CopyTo(c, i);
}
// Note: a.Length (1024) is an exact multiple of Vector<float>.Count here;
// otherwise a scalar tail loop would be needed for the remaining elements.
Console.WriteLine($"Vector width: {Vector<float>.Count} floats");
Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
Parallel Computation
using System.Threading.Tasks;
var data = new double[10_000_000];
var results = new double[data.Length];
Parallel.For(0, data.Length, i =>
{
results[i] = Math.Sin(data[i]) * Math.Cos(data[i]);
});
Profiling a Running Application
# Find the PID
ps aux | grep dotnet
# Collect a 30-second trace
dotnet-trace collect -p <PID> --duration 00:00:30
# Monitor real-time counters
dotnet-counters monitor -p <PID>
# Generate a flame graph with perf
sudo perf record -g -p <PID> -- sleep 30
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > flame.svg
Troubleshooting
.NET SDK not found
If dotnet --list-sdks doesn't show all three versions, source the environment:
source /etc/profile.d/dotnet-hpc.sh
dotnet --list-sdks
Instance type shows "unknown"
The instance metadata service may require IMDSv2 with a hop limit of 2. Check with:
curl -s -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 60"
If this fails, update the instance metadata options in the AWS Console to allow IMDSv1 or increase the hop limit.
CPU governor shows "N/A"
This is normal for virtualized EC2 instances. The hypervisor manages CPU frequency directly. The governor service is included for bare metal instances (.metal types) where it does take effect.
High memory usage after boot
Huge pages are pre-allocated at boot time. 512 huge pages × 2 MB = 1 GB of reserved memory. This is intentional and reduces TLB misses during computation. Adjust with:
sudo propagate hugepages 256 # Reduce to 256 pages (512 MB)
MPI test fails
For single-node MPI, use --oversubscribe if you're requesting more slots than available cores:
mpirun -np 8 --oversubscribe ./your_program
For multi-node, ensure SSH connectivity between all nodes and that the security group allows TCP ports 10000-10100.
Support
- Vendor: Propagate LLC
- Product: .NET HPC — Supercomputing Edition
- Base OS: Ubuntu 24.04 LTS
- Architecture: x86_64 (amd64)