HPC in AWS
Links: 101 AWS SAA Index
Services that will be used¶
Data management and transfer¶
- AWS Direct Connect: Move GB/s of data to the cloud, over a private secure network
- Snowball & Snowmobile: Move PB of data to the cloud
- AWS DataSync: Move large amount of data between on-premise and S3, EFS, FSx for Windows
Compute and Networking¶
- Cluster placement group for good performance (low latency, 10Gbps network)
-
If you are experiencing high packets per second (PPS) or require lower latencies then it means you need enhanced networking.
-
EC2 Enhanced Networking (SR-IOV) (ENA)
- Higher bandwidth, higher PPS (packet per second), lower latency
- Option I: Elastic Network Adapter (ENA) up to 100 Gbps
- Works only for newer EC2 instances.
- Option 2: Intel 82599 VF up to 10 Gbps LEGACY
- Option I: Elastic Network Adapter (ENA) up to 100 Gbps
- Higher bandwidth, higher PPS (packet per second), lower latency
-
Elastic Fabric Adapter (EFA)
- Improved ENA for HPC, only works for Linux
- Great for inter-node communications, tightly coupled workloads
- Leverages Message Passing Interface (MPI) standard
- Bypasses the underlying Linux OS to provide low-latency, reliable transport
If you need OS bypassing capabilities use EFA.
If you have windows instances then you have to use ENA over EFA since EFA is supported by only Linux instances.
Storage¶
- FSx for Lustre is the best option.
Automation and Orchestration¶
AWS Batch¶
- AWS Batch supports multi-node parallel jobs, which enables you to run single jobs that span multiple EC2 instances.
- Easily schedule jobs and launch EC2 instances accordingly
AWS Parallel Cluster¶
- Open source cluster management tool to deploy HPC on AWS
- Automate creation of VPC, Subnet, cluster type and instance types
Last updated: 2023-03-11