v1 - ami-0185c61124653544c
Propagate .NET HPC — Supercomputing Edition
AMI Documentation & User Guide
Overview
Propagate .NET HPC is a production-ready Amazon Machine Image (AMI) built on performance-tuned Ubuntu 24.04 LTS, with the .NET 6, 8 and 10 SDKs installed side by side and purpose-configured for compute-intensive .NET workloads on AWS. It provides three .NET runtimes, a full parallel computing stack, kernel-level tuning, and integrated management tooling, eliminating days of manual setup and configuration.
The system ships pre-tuned for high performance computing (Server GC, CPU governor, NUMA, huge pages, kernel and network tuning) and includes an optional containerized .NET app-hosting mode behind nginx-proxy with Let's Encrypt.
Security is handled by AWS Security Groups only. There is no host firewall on this AMI; open just the ports you need in your Security Group.
This AMI is designed for teams running scientific simulations, parallel number crunching, high-throughput server applications, and any workload where .NET performance matters.
Requirements
| Data volume (EBS) | Attach |
For real HPC throughput, pick a compute-optimized or HPC instance family and attach a second EBS volume for your data.
First boot
On first launch the instance auto-configures via dotnet-hpc-firstboot.service:
- No user-data → applies the throughput HPC profile, mounts any attached data volume, sets .NET 8 as default, and does not start any container. SSH in and start building immediately.
- JSON user-data → applies your settings and (optionally) deploys a containerized .NET app.
Example user-data:
{
"dotnet_version": "8",
"hpc_profile": "throughput",
"hugepages_mb": 0,
"enable_app": true,
"app_image": "your-registry/your-app:latest",
"internal_port": 8080,
"domain": "app.example.com",
"enable_https": true,
"letsencrypt_email": "admin@example.com",
"admin_user": "admin",
"admin_password": "ChangeMe123!"
}
To configure (or re-configure) interactively at any time:
sudo bash /opt/dotnet-hpc/configure-dotnet-hpc.sh
After configuring, open a new shell (or source /etc/profile.d/dotnet-hpc.sh)
so the .NET environment variables load.
Using .NET
All SDKs live under /usr/share/dotnet and dotnet is on the PATH.
dotnet --list-sdks # show installed SDKs
dotnet --list-runtimes # show installed runtimes
dotnet new console -o app # uses the default SDK
Set the default SDK for new projects (writes a global.json template):
sudo dotnet-hpc default 10
.NET 6 is provided for legacy compatibility and is past Microsoft's support window. Use .NET 8 or .NET 10 for new work.
HPC tuning
The image applies Server GC, the performance CPU governor, NUMA settings, raised file/socket limits and network buffer tuning. Switch profiles anytime:
sudo dotnet-hpc tune throughput # max compute throughput (default)
sudo dotnet-hpc tune latency # low/steady latency for services
sudo dotnet-hpc tune balanced # general purpose
Verify the environment and run a quick parallel benchmark:
dotnet-hpc status
dotnet-hpc bench 8 # Monte Carlo Pi across all vCPUs
Optional large-page GC (reserve huge pages first, e.g. 2048 MiB): during configure, answer the "huge pages" prompt with a MiB value.
What's Installed
.NET SDKs & Runtimes
All three versions are installed side by side. Switch between them per-project using a global.json file, or set a system-wide default with propagate dotnet-default. A template global.json is provided at /opt/propagate/config/global.json.
.NET Global Profiling Tools
| Tool | Version | Purpose |
| --- | --- | --- |
| dotnet-trace | 9.0.x | Collect diagnostic traces from running .NET processes |
| dotnet-counters | 9.0.x | Monitor real-time .NET performance counters (GC, threadpool, exceptions) |
| dotnet-dump | 9.0.x | Capture and analyze process dumps for debugging |
| dotnet-gcdump | 9.0.x | Capture GC heap snapshots for memory analysis |
All tools are installed to /usr/local/share/dotnet-tools and symlinked to /usr/local/bin so they are available to all users without PATH configuration.
Parallel Computing Libraries
| Library | Version | Purpose |
| --- | --- | --- |
| OpenMPI | 4.1.6 | Distributed message passing for multi-process and multi-node parallel computing |
| OpenBLAS | System | Optimized Basic Linear Algebra Subprograms (matrix operations, vector math) |
| LAPACK / LAPACKE | System | Linear algebra routines (eigenvalues, SVD, least squares) |
| FFTW3 | System | Fast Fourier Transform library, including MPI-distributed FFT support |
| HDF5 (OpenMPI) | System | High-performance data format for large scientific datasets |
| Eigen3 | System | C++ template library for linear algebra (for native interop via P/Invoke) |
Performance Profiling Tools
| Tool | Location | Purpose |
| --- | --- | --- |
| Linux perf | System | Hardware-level CPU profiling (cache misses, branch prediction, cycles) |
| FlameGraph | /opt/FlameGraph | Stack trace visualization toolkit for generating flame graphs from perf data |
| BCC/eBPF tools | System | Dynamic kernel tracing and analysis without recompilation |
| dotnet-trace | /usr/local/bin | .NET-specific event tracing (GC events, JIT, threadpool) |
| dotnet-counters | /usr/local/bin | Real-time .NET runtime metrics |
| htop | System | Interactive process viewer |
| sysstat (sar, iostat) | System | System activity reporting and I/O statistics |
| numactl | System | NUMA policy control for process binding |
| hwloc | System | Hardware topology discovery and visualization |
System Utilities
| Package | Purpose |
| --- | --- |
| build-essential | GCC, G++, make (for compiling native interop libraries) |
| cmake | Build system for C/C++ dependencies |
| git | Version control |
| jq | JSON processing from the command line |
| curl, wget | HTTP clients for downloading packages and data |
| zip, unzip | Archive management |
Security
| Component | Configuration |
| --- | --- |
| UFW (Uncomplicated Firewall) | Enabled. Default deny incoming, allow outgoing. SSH (port 22) and OpenMPI (ports 10000-10100) allowed. |
| fail2ban | Enabled. SSH brute-force protection with 3 max retries, 1 hour ban time. |
| SSH | Key-based authentication only. No default passwords. |
Kernel & System Optimizations
CPU Performance
| Setting | Value | Effect |
| --- | --- | --- |
| CPU governor | performance | CPU runs at maximum frequency at all times. Eliminates frequency scaling latency that can cause inconsistent benchmark results and computation stalls. Configured via systemd service propagate-cpu-governor. |
| CPU idle latency | Minimized | Reduces C-state transition latency by writing to /dev/cpu_dma_latency. Configured via systemd service propagate-cpu-latency. |
| Scheduler migration cost | Profile-dependent | Controls how aggressively the kernel migrates processes between CPUs. Higher values reduce migration (better cache locality). |
| Scheduler autogroup | Profile-dependent | Controls automatic task grouping. Disabled in the compute-heavy profile to give the scheduler full control. |
| NUMA balancing | Disabled | Automatic NUMA page migration is turned off (kernel.numa_balancing=0). This prevents the kernel from moving memory pages between NUMA nodes during computation, which causes unpredictable latency spikes. Applications should manage their own NUMA placement using numactl. |
Memory
| Setting | Value | Effect |
| --- | --- | --- |
| Huge pages (2MB) | 512 default | Pre-allocated huge pages reduce TLB misses for large memory allocations. Configurable via setup wizard or propagate hugepages <count>. The recommended value is calculated during setup based on available RAM. |
| Swappiness | 10 (default) / 1 (compute-heavy) | Controls how aggressively the kernel swaps memory to disk. Low values keep compute data in RAM. |
| Dirty ratio | 40% | Percentage of RAM that can be filled with dirty (unwritten) pages before the process must write to disk. |
| Dirty background ratio | 10% | Percentage of RAM with dirty pages before background writeback starts. |
| Shared memory max | 64 GB | kernel.shmmax set to 68719476736 bytes. Required for large MPI shared memory segments. |
| Shared memory total pages | 4 billion | kernel.shmall set to 4294967296. Total shared memory pages available system-wide. |
| Max memory map count | Profile-dependent | Increased to 1048576 in the memory-heavy profile for applications that memory-map many files. |
Network
| Setting | Value | Effect |
| --- | --- | --- |
| TCP receive buffer max | 16 MB | net.core.rmem_max=16777216. Allows large TCP receive windows for high-throughput MPI communication between nodes. |
| TCP send buffer max | 16 MB | net.core.wmem_max=16777216. Allows large TCP send windows. |
| TCP congestion control | HTCP | Hamilton TCP, designed for high-bandwidth, high-latency networks. Better throughput than default Cubic for inter-node MPI traffic. |
| MTU probing | Enabled | net.ipv4.tcp_mtu_probing=1. Automatically discovers the maximum segment size, avoiding fragmentation on networks with jumbo frames. |
| Backlog queue | 30000 | net.core.netdev_max_backlog=30000. Prevents packet drops during burst MPI communication. |
Process Limits
| Limit | Value | Why |
| --- | --- | --- |
| Open files (nofile) | 1,048,576 | Large parallel jobs may open thousands of file descriptors simultaneously (sockets, data files, shared memory segments). |
| Max processes (nproc) | Unlimited | MPI applications spawn one process per core per node. No artificial limit. |
| Locked memory (memlock) | Unlimited | Required for MPI shared memory and RDMA. Prevents the kernel from swapping pinned buffers. |
| Stack size | Unlimited | Deep recursion in scientific computing code (solvers, tree searches) needs large stacks. |
File System
| Setting | Value | Effect |
| --- | --- | --- |
| fs.file-max | 2,097,152 | System-wide maximum file descriptors. |
| fs.nr_open | 2,097,152 | Per-process maximum file descriptors. |
.NET Runtime Environment Variables
The following environment variables are set globally via /etc/profile.d/dotnet-hpc.sh and apply to all .NET processes:
| Variable | Value | Effect |
| --- | --- | --- |
| DOTNET_gcServer | 1 | Enables server garbage collection. Uses one GC thread per logical processor, reducing pause times for multi-threaded applications. Critical for compute workloads: workstation GC (default) uses a single GC thread that blocks all application threads. |
| DOTNET_EnableAVX2 | 1 | Enables AVX2 SIMD instructions (256-bit vector operations). Allows Vector<T> to process 8 floats or 4 doubles per instruction. |
| DOTNET_EnableSSE41 | 1 | Enables SSE4.1 instructions. Provides additional vectorized operations for string processing, integer operations, and rounding. |
| DOTNET_TieredCompilation | 1 | Enables tiered JIT compilation. Methods are first quickly JIT-compiled (Tier 0), then recompiled with full optimizations (Tier 1) after repeated execution. Balances startup speed with steady-state performance. |
| DOTNET_TC_QuickJitForLoops | 1 | Allows Tier 0 compilation for methods containing loops. Without this, loop-containing methods skip Tier 0 and wait for full optimization, slowing initial execution. |
| DOTNET_ReadyToRun | 1 | Uses pre-compiled (ReadyToRun) framework assemblies. Reduces startup time by avoiding JIT compilation of framework code. |
Override any variable per-process by setting it before the command:
DOTNET_gcServer=0 dotnet run # Use workstation GC for this run only
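To confirm which runtime settings a shell will actually pass to dotnet, the exported variables can be listed directly. The sketch below assumes a POSIX shell; the helper name dotnet_env is hypothetical and is not a command shipped with this AMI.

```shell
# List every DOTNET_* variable exported in the current shell, sorted by name.
# dotnet_env is a hypothetical helper name used only for illustration.
dotnet_env() {
  env | grep '^DOTNET_' | sort
}
```

Running dotnet_env after sourcing /etc/profile.d/dotnet-hpc.sh should list the variables from the table above. A prefix assignment such as DOTNET_gcServer=0 affects only the single command it precedes, so the listing in the parent shell stays unchanged.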
Management commands
dotnet-hpc status # Show versions, tuning state and app stack
dotnet-hpc versions # List installed SDKs and runtimes
dotnet-hpc default <6|8|10> # Set default SDK
dotnet-hpc tune <profile> # Re-apply HPC profile
dotnet-hpc bench [6|8|10] # Run the compute benchmark
dotnet-hpc info # Show saved configuration
dotnet-hpc backup # Tar.gz the data volume to /root
App stack (only if containerized hosting is enabled):
dotnet-hpc start | stop | restart | logs | update
HPC Tuning Profiles
Four pre-configured profiles are available via sudo propagate tune <profile>. Each adjusts kernel parameters for different workload types.
compute-heavy
Best for CPU-bound simulations, numerical methods, Monte Carlo, ray tracing.
| Parameter | Value |
| --- | --- |
| vm.swappiness | 1 |
| kernel.sched_migration_cost_ns | 5000000 |
| kernel.sched_autogroup_enabled | 0 |
| CPU governor | performance |
What it does: Virtually eliminates swapping, keeps processes pinned to their current CPU longer (better L1/L2 cache hit rates), and disables automatic task grouping so the scheduler treats every process individually. Best when every CPU cycle matters.
balanced
Best for mixed workloads, development, testing, web APIs.
Uses the default kernel tuning applied during provisioning. No additional changes.
What it does: Provides the HPC network tuning and memory configuration without aggressive CPU pinning. Good starting point when you're not sure which profile to use.
memory-heavy
Best for large dataset processing, in-memory databases, genomics, bioinformatics.
| Parameter | Value |
| --- | --- |
| vm.swappiness | 5 |
| vm.overcommit_memory | 1 |
| vm.max_map_count | 1048576 |
What it does: Allows memory overcommit (the kernel won't refuse allocations based on available RAM), increases the maximum number of memory-mapped regions (important for memory-mapped files and databases like MongoDB), and keeps swappiness very low.
mpi-cluster
Best for distributed computing across multiple EC2 instances.
| Parameter | Value |
| --- | --- |
| net.core.rmem_max | 33554432 (32 MB) |
| net.core.wmem_max | 33554432 (32 MB) |
| net.ipv4.tcp_rmem | 4096 1048576 33554432 |
| net.ipv4.tcp_wmem | 4096 1048576 33554432 |
What it does: Doubles the network buffer sizes from the default HPC configuration. Designed for high-throughput MPI message passing between nodes where large messages need to be buffered in-kernel.
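The byte values in these buffer settings are plain MiB multiples, which makes it easy to check that mpi-cluster doubles the 16 MB defaults. A minimal sketch, assuming a POSIX shell; mib_to_bytes is a hypothetical helper name.

```shell
# Convert a size in MiB to bytes (1 MiB = 1024 * 1024 bytes).
# mib_to_bytes is a hypothetical helper, not part of the AMI tooling.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 16   # prints 16777216 (default HPC configuration)
mib_to_bytes 32   # prints 33554432 (mpi-cluster profile)
```

After applying the profile, sysctl -n net.core.rmem_max should report the same number as mib_to_bytes 32.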
SIMD Capabilities
This AMI runs on x86_64 EC2 instances and automatically enables all available SIMD instruction sets. The actual capabilities depend on your instance type:
| Instruction Set | Vector Width | Operations per Instruction (float) | Supported Instance Families |
| --- | --- | --- | --- |
| SSE4.2 | 128-bit | 4 floats / 2 doubles | All x86_64 |
| AVX2 | 256-bit | 8 floats / 4 doubles | All current-gen (c5+, m5+, r5+) |
| AVX-512 | 512-bit | 16 floats / 8 doubles | c5.metal, m5zn, c6i, r6i |
| FMA | 256-bit | 8 fused multiply-adds | All current-gen |
.NET's System.Numerics.Vector<T> and System.Runtime.Intrinsics.X86 namespaces automatically use the best available instruction set. No code changes required; the JIT compiler detects CPU capabilities at runtime.
Verify your instance's SIMD support:
propagate benchmark # Runs the built-in SIMD capability check
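Independently of the built-in benchmark, the kernel lists the CPU feature flags in /proc/cpuinfo. The following sketch checks for the instruction sets in the table; it assumes a Linux x86_64 host, and the function names are hypothetical.

```shell
# Print the feature flags of the first CPU listed in /proc/cpuinfo.
cpu_flags() {
  grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
}

# Return success if the space-separated flag list $1 contains the flag $2.
has_flag() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report the SIMD-related flags (Linux names: sse4_2, avx2, fma, avx512f).
report_simd() {
  flags=$(cpu_flags)
  for f in sse4_2 avx2 fma avx512f; do
    if has_flag "$flags" "$f"; then echo "$f: yes"; else echo "$f: no"; fi
  done
}
```

Calling report_simd on an instance prints one line per instruction set. The avx512f flag is the AVX-512 foundation subset, so its absence means 512-bit vectors are not available on that instance type.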
Propagate Management CLI
The propagate command is available system-wide and provides all management operations.
Commands
| Command | Requires sudo | Description |
| --- | --- | --- |
| propagate status | No | Full system overview: instance info, CPU, memory, .NET versions, OpenMPI, load, storage, HPC profile |
| propagate benchmark | No | Run the built-in benchmark suite: vector/SIMD throughput, parallel math, memory bandwidth, MPI test |
| propagate dotnet-list | No | List all installed .NET SDKs, runtimes, and global tools |
| propagate dotnet-default <ver> | Yes | Set the default .NET SDK version (6.0, 8.0, or 10.0) |
| propagate tune <profile> | Yes | Apply an HPC tuning profile (compute-heavy, balanced, memory-heavy, mpi-cluster) |
| propagate mpi-test [n] | No | Run an MPI hello world test with n processes (default: 4) |
| propagate profile <PID> | Yes | Profile a running .NET process using dotnet-trace, with perf fallback |
| propagate hugepages [count] | Yes (to set) | Show current huge pages status, or set a new count |
| propagate numa | No | Display NUMA topology and CPU affinity map |
| propagate ebs | No | Show EBS volume status and mount info |
| propagate logs | No | View provisioning and setup logs |
| propagate setup | Yes | Re-run the interactive setup wizard |
Multi-Node MPI Setup
For distributed computing across multiple EC2 instances:
Prerequisites
- Launch all instances from this AMI in the same VPC and subnet
- Use a placement group (cluster strategy) for lowest latency
- Configure the security group to allow TCP ports 10000-10100 between instances
- Use the same SSH key pair for all instances
Configuration Steps
Set up passwordless SSH between nodes. On the primary node, generate a key and distribute it:
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
# Copy to each worker node:
ssh-copy-id -i ~/.ssh/id_ed25519 ubuntu@<worker-ip>
Edit the MPI hostfile:
sudo nano /opt/propagate/config/mpi_hosts
Example for three 32-vCPU nodes:
10.0.1.10 slots=32
10.0.1.11 slots=32
10.0.1.12 slots=32
Apply the MPI cluster tuning profile on all nodes:
sudo propagate tune mpi-cluster
Test connectivity:
propagate mpi-test 8
Run your application:
mpirun --hostfile /opt/propagate/config/mpi_hosts \
  -np 96 --map-by node \
  dotnet run -c Release
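The -np value in the mpirun command (96) is the sum of the slots entries in the example hostfile (3 × 32). A minimal sketch of deriving it instead of hard-coding it, assuming the hostfile uses the slots=N form shown in the example; total_slots is a hypothetical helper name.

```shell
# Sum the slots=N values in an Open MPI hostfile.
# total_slots is a hypothetical helper, not part of the AMI tooling.
total_slots() {
  awk '
    /^[[:space:]]*#/ { next }   # skip comment lines
    {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=/) { split($i, kv, "="); sum += kv[2] }
    }
    END { print sum + 0 }       # prints 0 for an empty hostfile
  ' "$1"
}
```

With this in place, the process count can follow the hostfile, for example by passing -np "$(total_slots /opt/propagate/config/mpi_hosts)" to mpirun, so adding or removing a node only requires editing one file.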
Recommended Instance Types for MPI
| Instance | vCPUs | RAM | Network | Notes |
| --- | --- | --- | --- | --- |
| hpc6a.48xlarge | 96 | 384 GB | 100 Gbps EFA | Purpose-built HPC, best MPI performance |
| c6i.32xlarge | 128 | 256 GB | 50 Gbps | High CPU count, good price/performance |
| c7i.48xlarge | 192 | 384 GB | 50 Gbps | Latest gen compute, highest single-node CPU count |
Recommended Instance Types
| Instance | vCPUs | RAM | Best For | Approx. EC2 Cost/hr |
| --- | --- | --- | --- | --- |
| c6i.xlarge | 4 | 8 GB | Development, testing, small simulations | $0.17 |
| c6i.4xlarge | 16 | 32 GB | Medium parallel workloads | $0.68 |
| c6i.8xlarge | 32 | 64 GB | Production compute jobs | $1.36 |
| c7i.16xlarge | 64 | 128 GB | Large parallel simulations | $2.86 |
| c7i.metal-48xl | 192 | 384 GB | Maximum single-node performance | $8.57 |
| m5zn.6xlarge | 24 | 96 GB | High clock speed (4.5 GHz), latency-sensitive | $1.98 |
| r6i.8xlarge | 32 | 256 GB | Memory-heavy scientific computing | $2.02 |
| hpc6a.48xlarge | 96 | 384 GB | Dedicated HPC with EFA networking | $2.88 |
Firewall Rules
| Port | Protocol | Purpose | Default |
| --- | --- | --- | --- |
| 22 | TCP | SSH access | Open (configurable via setup wizard) |
| 10000-10100 | TCP | OpenMPI inter-node communication | Open |
All other incoming ports are blocked by default. Add rules as needed:
sudo ufw allow 8080/tcp comment 'Web API'
sudo ufw allow 5000/tcp comment 'Kestrel'
sudo ufw status
Ports
| Port | Purpose | When |
| --- | --- | --- |
| 22 | SSH | Always (open to your IP only) |
| 80 | HTTP (nginx-proxy) | App hosting enabled |
| 443 | HTTPS (nginx-proxy) | App hosting enabled with Let's Encrypt |
| 8080 | App container (internal) | Never exposed; reached via the proxy |
Open the relevant ports in your Security Group. Pure compute use needs only port 22.
HTTPS notes
- HTTPS uses nginx-proxy + acme-companion (Let's Encrypt).
- Let's Encrypt requires a domain name, not a bare IP. Point an A record at the instance's public IP before enabling HTTPS.
- The first certificate may take 1–2 minutes to issue after the first request.
- The app is protected by HTTP basic-auth using the admin credentials you set.
- Public IPs change on stop/start. Use an Elastic IP or a domain for stable access.
Backup
sudo dotnet-hpc backup
Writes /root/dotnet-hpc-backup-<timestamp>.tar.gz of the data volume. For durable backups, copy that archive to Amazon S3, or take an EBS snapshot of the attached data volume.
File System Layout
| Path | Contents |
| --- | --- |
| /opt/propagate/bin/ | Management CLI and setup scripts |
| /opt/propagate/config/ | Configuration files (hpc.conf, mpi_hosts, global.json) |
| /opt/propagate/benchmarks/ | Built-in benchmark source code (SimdCheck.cs) |
| /opt/propagate/docs/ | Documentation |
| /data/ | Default EBS data volume mount point (configured via setup wizard) |
| /usr/lib/dotnet/sdk/ | .NET SDK installations |
| /usr/share/dotnet/ | .NET shared runtime components |
| /usr/local/share/dotnet-tools/ | Global .NET diagnostic tools |
| /opt/FlameGraph/ | Brendan Gregg's FlameGraph toolkit |
| /etc/sysctl.d/99-propagate-hpc.conf | HPC kernel tuning parameters |
| /etc/profile.d/dotnet-hpc.sh | .NET environment variables (loaded on login) |
| /etc/security/limits.d/99-propagate-hpc.conf | Process resource limits |
| /var/log/propagate-provision.log | Provisioning log |
Quick Start Examples
Hello World with .NET 10
dotnet new console -n HelloHPC -f net10.0
cd HelloHPC
dotnet run
SIMD Vector Addition
using System.Numerics;
var a = new float[1024];
var b = new float[1024];
var c = new float[1024];
// Fill with data...
for (int i = 0; i <= a.Length - Vector<float>.Count; i += Vector<float>.Count)
{
    var va = new Vector<float>(a, i);
    var vb = new Vector<float>(b, i);
    (va + vb).CopyTo(c, i);
}
Console.WriteLine($"Vector width: {Vector<float>.Count} floats");
Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
Parallel Computation
using System.Threading.Tasks;
var data = new double[10_000_000];
var results = new double[data.Length];
Parallel.For(0, data.Length, i =>
{
    results[i] = Math.Sin(data[i]) * Math.Cos(data[i]);
});
Profiling a Running Application
# Find the PID
ps aux | grep dotnet
# Collect a 30-second trace
dotnet-trace collect -p <PID> --duration 00:00:30
# Monitor real-time counters
dotnet-counters monitor -p <PID>
# Generate a flame graph with perf
sudo perf record -g -p <PID> -- sleep 30
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl | /opt/FlameGraph/flamegraph.pl > flame.svg
Troubleshooting
.NET SDK not found
If dotnet --list-sdks doesn't show all three versions, source the environment:
source /etc/profile.d/dotnet-hpc.sh
dotnet --list-sdks
Instance type shows "unknown"
The instance metadata service may require IMDSv2 with a hop limit of 2. Check with:
curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
If this fails, update the instance metadata options from the AWS console or CLI to allow IMDSv1 or increase the hop limit.
CPU governor shows "N/A"
This is normal for virtualized EC2 instances. The hypervisor manages CPU frequency directly. The governor service is included for bare metal instances (.metal types) where it does take effect.
High memory usage after boot
Huge pages are pre-allocated at boot time. 512 huge pages × 2 MB = 1 GB of reserved memory. This is intentional and reduces TLB misses during computation. Adjust with:
sudo propagate hugepages 256 # Reduce to 256 pages (512 MB)
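The reservation arithmetic generalizes to any page count. A minimal sketch, assuming the 2 MB huge page size stated above; both function names are hypothetical.

```shell
# Memory reserved (in MiB) by a given number of 2 MiB huge pages.
hugepages_to_mib() {
  echo $(( $1 * 2 ))
}

# Number of 2 MiB huge pages needed to cover a size in MiB (rounded up).
mib_to_hugepages() {
  echo $(( ($1 + 1) / 2 ))
}

hugepages_to_mib 512    # prints 1024 (the 1 GB default reservation)
hugepages_to_mib 256    # prints 512
mib_to_hugepages 2048   # prints 1024
```

To see what the kernel has actually reserved, grep Huge /proc/meminfo reports HugePages_Total and Hugepagesize.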
MPI test fails
For single-node MPI, use --oversubscribe if you're requesting more slots than available cores:
mpirun -np 8 --oversubscribe ./your_program
For multi-node, ensure SSH connectivity between all nodes and that the security group allows TCP ports 10000-10100.
Support
Documentation: https://docs.propagate.solutions
Vendor: Propagate LLC
Product: .NET HPC — Supercomputing Edition
Base OS: Ubuntu 24.04 LTS
Architecture: x86_64 (amd64)
- 30-day money-back guarantee.