SAS recently performed testing using the Intel Cloud Edition for Lustre* Software - Global Support (HVM) available on the AWS Marketplace to determine how well a standard workload mix using SAS Grid Manager performs on AWS. Our testing demonstrates that with the right design choices you can run demanding compute and I/O applications on AWS. You can find the detailed results in the technical paper, SAS® Grid Manager 9.4 Testing on AWS using Intel® Lustre.
In addition to the paper, Amazon will publish a post on the AWS Big Data Blog that looks at how to scale the underlying AWS infrastructure so that SAS Grid Manager can support SAS applications with demanding I/O requirements.
System design overview – network, instance sizes, topology, performance
For our testing, we set up AWS infrastructure to support the compute and I/O needs of these two components of the system:
- the SAS workload that was submitted using SAS Grid Manager
- the underlying Lustre file system required to meet the clustered file system requirement of SAS Grid Manager
The SAS Grid nodes in the cluster are i2.8xlarge instances. The 8xlarge instance size provides proportionally the best network performance to shared storage of any instance size, assuming minimal EBS traffic. The i2 instance also provides high-performance local storage, which is covered in more detail in the following section.
Using an 8xlarge size for the Lustre cluster matters less because there is significant traffic to both EBS and the file system clients, although an 8xlarge is still the better choice. The Lustre file system has a caching strategy, and you will see higher throughput to clients when cache hits are frequent, because cache hits reduce the network traffic to EBS.
Steps to maximize storage I/O performance
The shared storage for SAS applications needs to be high-speed temporary storage. Typically temporary storage has the most demanding load. The high I/O instance family, I2, and the recently released dense storage instance, D2, provide high aggregate throughput to ephemeral (local) storage. The i2.8xlarge used for the SAS workload tested has 6.4 TB of local SSD storage, while the D2 has 48 TB of HDD.
Throughput testing and results
We wanted to achieve a throughput of at least 100 MB/sec/core to temporary storage, and 50-75 MB/sec/core to shared storage. The i2.8xlarge has 16 cores, exposed as 32 virtual CPUs: each virtual CPU is a hyperthread, and each core has two hyperthreads. Testing with lower-level tools (fio and a SAS tool, iotest.sh) showed a throughput of about 3 GB/sec to ephemeral (temporary) storage and about 1.5 GB/sec to shared storage. The shared storage figure does not take into account file system caching, which Lustre does well.
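To relate the per-core targets to the per-instance measurements, the arithmetic can be sketched in a few lines of Python. This is only an illustrative calculation: the function names are hypothetical, and the conversion of 1 GB to 1000 MB is an assumption, since the unit convention is not stated here.

```python
# Rough sanity check of the throughput figures quoted above.
# Assumption: 1 GB = 1000 MB (the unit convention is not stated in the text).

CORES = 16  # physical cores on an i2.8xlarge, as stated above
MB_PER_GB = 1000  # assumed conversion factor


def required_gb_per_sec(mb_per_sec_per_core, cores=CORES):
    """Aggregate throughput (GB/sec) one instance needs to hit a per-core target."""
    return mb_per_sec_per_core * cores / MB_PER_GB


def per_core_mb_per_sec(gb_per_sec, cores=CORES):
    """Per-core throughput (MB/sec/core) implied by a per-instance measurement."""
    return gb_per_sec * MB_PER_GB / cores


# Targets from the text, expressed per instance.
temp_target = required_gb_per_sec(100)
shared_target_low = required_gb_per_sec(50)
shared_target_high = required_gb_per_sec(75)

# Approximate measurements from the text, expressed per core.
temp_measured = per_core_mb_per_sec(3.0)
shared_measured = per_core_mb_per_sec(1.5)

if __name__ == "__main__":
    print(f"Temporary storage target: {temp_target:.2f} GB/sec per instance")
    print(f"Shared storage target: {shared_target_low:.2f} to "
          f"{shared_target_high:.2f} GB/sec per instance")
    print(f"Temporary storage measured: {temp_measured:.2f} MB/sec/core")
    print(f"Shared storage measured: {shared_measured:.2f} MB/sec/core")
```

Under these assumptions, the targets work out to 1.6 GB/sec per instance for temporary storage and 0.8 to 1.2 GB/sec for shared storage. The measured figures of about 3 GB/sec and 1.5 GB/sec correspond to roughly 187 MB/sec/core and 94 MB/sec/core, so both exceed their targets.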
This testing demonstrates that with the right design choices you can run demanding compute and I/O applications on AWS. For full details of the testing configuration and results, please see the SAS® Grid Manager 9.4 Testing on AWS using Intel® Lustre technical white paper.