
Dell | Terascala HPC Storage Solution (DT-HSS3)

A Dell Technical White Paper

David Duncan


THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. © 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the DELL logo, the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its own.



Contents
Figures
Tables
Introduction
Lustre Object Storage Attributes
Dell | Terascala HPC Storage Solution Description
    Metadata Server Pair
    Object Storage Server Pair
    Scalability
    Networking
Managing the Dell | Terascala HPC Storage Solution
Performance Studies
    N-to-N Sequential Reads / Writes
    Random Reads and Writes
    IOR N-to-1 Reads and Writes
    Metadata Testing
Conclusions
Appendix A: Benchmark Command Reference
Appendix B: Terascala Lustre Kit
References



Figures
Figure 1: Lustre Overview
Figure 2: Dell PowerEdge R710
Figure 3: Metadata Server Pair
Figure 4: OSS Pair
Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration
Figure 6: MDS Cable Configuration
Figure 7: OSS Scalability
Figure 8: Cable Diagram for OSS
Figure 9: Terascala Management Console
Figure 10: Test Cluster Configuration
Figure 11: Sequential Reads / Writes DT-HSS3
Figure 12: N-to-N Random Reads and Writes
Figure 13: N-to-1 Read / Write
Figure 14: Metadata Create Comparison
Figure 15: Metadata File / Directory Stat
Figure 16: Metadata File / Directory Remove



Tables
Table 1: Test Cluster Details
Table 2: DT-HSS3 Configuration
Table 3: Metadata Server Configuration



Introduction
In high-performance computing (HPC), the efficient delivery of data to and from compute nodes is critical. With the speed at which research consumes data in HPC systems, storage is fast becoming a bottleneck, and managing complex storage systems is a growing burden on storage administrators and researchers. Data requirements are increasing rapidly in terms of both performance and capacity, and increasing the throughput and scalability of the storage supporting those compute nodes typically requires a great deal of planning and configuration.

The Dell | Terascala HPC Storage Solution (DT-HSS) is designed for researchers, universities, HPC users and enterprises looking to deploy a fully supported, easy-to-use, scale-out and cost-effective parallel file system storage solution. DT-HSS is a scale-out storage solution that provides high-throughput storage as an appliance. Utilizing an intelligent yet intuitive management interface, the solution simplifies managing and monitoring all of the hardware and file system components. It is easy to scale in capacity, performance or both, providing a path for future growth. The storage appliance uses Lustre, the leading open source parallel file system, and is delivered as a pre-configured, ready-to-go appliance with full hardware and software support from Dell and Terascala.

Built on enterprise-ready Dell PowerEdge servers and PowerVault storage products, the latest generation Dell | Terascala HPC Storage Solution, referred to as DT-HSS3 in the rest of this paper, delivers a strong combination of performance, reliability, ease of use and cost-effectiveness. Because of the changes made in this third generation of the solution, it is important to evaluate its performance to determine what should be expected. This paper describes this generation of the solution and outlines some of its performance characteristics. Many details of the Lustre filesystem and of the integration with Platform Computing's Platform Cluster Manager (PCM) remain unchanged from the previous-generation solution, DT-HSS2. Readers may refer to Appendix B for instructions on integrating with PCM.

Lustre Object Storage Attributes


Lustre is a clustered file system, offering high performance through parallel access and distributed locking. In the DT-HSS family, access to the storage is provided via a single namespace that is easily mounted from PCM and managed through the Terascala Management Console. A parallel file system such as Lustre delivers its performance and scalability by distributing data across multiple access points (a technique called striping), allowing multiple compute engines to access data simultaneously.

A Lustre installation consists of three key systems: the metadata system, the object storage system, and the compute clients. The metadata system is comprised of the Metadata Target (MDT) and the Metadata Server (MDS). The MDT stores all metadata for the file system, including file names, permissions, time stamps, and the location of data objects within the object storage system. The MDS is the dedicated server that manages the MDT. For the Lustre version used in our testing, an active/passive pair of MDS servers provides a highly available metadata service to the cluster.

The object storage system is comprised of the Object Storage Target (OST) and the Object Storage Server (OSS). The OST provides storage for file object data, while the OSS manages one or more OSTs. There are typically several active OSSs at any time, and Lustre delivers increased throughput with an increase in the number of active OSSs: each additional OSS increases the maximum network throughput and storage capacity. Figure 1 shows the relationship of the OSS and OST components of the Lustre filesystem.
Figure 1: Lustre Overview

[Diagram: Lustre clients connect to an MDS server, backed by the MDT for metadata, and to OSS servers (OSS 1, OSS 2), backed by OSTs for file data.]

The Lustre client software is installed on the compute nodes to gain access to data stored in the Lustre clustered filesystem. To the clients, the filesystem appears as a single branch in the filesystem tree. This single directory makes a simple starting point for application data access and ensures access via tools native to the operating system for ease of administration. Lustre also includes a sophisticated storage network protocol enhancement, referred to as lnet, that can leverage features of certain types of networks. For example, when the DT-HSS3 utilizes Infiniband in support of lnet's features, Lustre can take advantage of the RDMA capabilities of the Infiniband fabric to provide more rapid I/O transport than can be achieved by typical networking protocols.

The elements of the Lustre clustered file system are as follows:

Metadata Target (MDT) - tracks the location of the stripes of data
Object Storage Target (OST) - stores the stripes, or extents, on a filesystem
Lustre Clients - access the MDS to determine where files are located, then access the OSSs to read and write data
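As an illustration of the client-side view, the sketch below mounts the filesystem over the Infiniband (o2ib) lnet network and inspects its default striping. The MGS/MDS hostname (mds01), filesystem name (testfs) and mount point are placeholder assumptions; the DT-HSS3 appliance supplies the actual values and ships with the clients preconfigured.

# Mount the Lustre filesystem on a compute node over the o2ib (Infiniband) lnet network.
mount -t lustre mds01@o2ib0:/testfs /mnt/lustre

# Confirm the mount and inspect the default striping layout of the namespace.
df -h /mnt/lustre
lfs getstripe /mnt/lustre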

Typically, Lustre deployments and configurations are considered complex and time consuming. Lustre installation and administration are performed via a command-line interface, which can deter a systems administrator unfamiliar with Lustre from performing an installation, possibly preventing his or her organization from experiencing the benefits of a clustered filesystem. The Dell | Terascala HPC Storage Solution removes the complexities of installation and minimizes Lustre deployment and configuration time. This decrease in time and effort leaves more opportunity for filesystem testing and speeds the general preparations for production use.



Dell | Terascala HPC Storage Solution Description


The DT-HSS3 solution provides a pre-configured storage solution consisting of Lustre Metadata Servers (MDS), Lustre Object Storage Servers (OSS), and the associated storage. The application software images have been modified to support the PowerEdge R710 as a standards-based replacement for the custom-fabricated servers previously used as the Object Storage and Metadata Servers. This substitution, shown in Figure 2, allows for a significant enhancement in the performance and serviceability of these solution components while decreasing the overall complexity of the solution.
Figure 2: Dell PowerEdge R710

Metadata Server Pair

In the latest DT-HSS3 solution, the Metadata Server (MDS) pair is comprised of two Dell PowerEdge R710 servers configured as an active/passive cluster. Each server is directly attached to a Dell PowerVault MD3220 storage array housing the Lustre MDT. The MD3220 is fully populated with 24 x 500GB 7.2K RPM 2.5" near-line SAS drives configured as RAID 10, for a total available storage of 6TB. In this single Metadata Target (MDT), the DT-HSS3 provides 4.8TB of formatted space for filesystem metadata. The MDS is responsible for handling file and directory requests and routing tasks to the appropriate Object Storage Targets (OSTs) for fulfillment. With this single MDT, the maximum number of files served will be in excess of 1.45 billion. Storage requests are handled across lnet by either a single 40Gb QDR Infiniband connection or a 10Gb Ethernet connection.
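As a quick check of the remaining metadata headroom, the inode usage on the MDT can be queried from any client with lfs df; a minimal sketch, assuming the filesystem is mounted at /mnt/lustre:

# Report inode totals and usage per target; the MDT line shows how many more
# files the metadata target can accommodate.
lfs df -i /mnt/lustre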
Figure 3: Metadata Server Pair


Object Storage Server Pair
Figure 4: OSS Pair

The hardware improvements also extend to the Object Storage Server (OSS). The previous generation of the OSS was built on a custom-fabricated server; the PowerEdge R710 is now the standard server for this solution. In the DT-HSS3, Object Storage Servers are arranged in two-node high availability (HA) clusters providing active/active access to two Dell PowerVault MD3200 storage arrays. Each MD3200 array is fully populated with 12 x 2TB 3.5" 7.2K RPM near-line SAS drives, and its capacity can be extended with up to seven additional MD1200 enclosures. Each OSS pair provides raw storage capacity ranging from 48TB up to 384TB.

Object Storage Servers (OSSs) are the building blocks of the DT-HSS solutions. Using the two 6Gb SAS controllers in each PowerEdge R710, the two servers are redundantly connected to each of the two MD3200 storage arrays, and the MD3200 arrays are extended with SAS-attached MD1200 enclosures to provide additional capacity. The 12 x 7.2K RPM 2TB near-line SAS drives in each PowerVault MD3200 or MD1200 enclosure provide a total of 24TB of raw storage capacity. The storage in each enclosure is divided equally into two RAID 5 (five data and one parity disk) virtual disks, yielding two Object Storage Targets (OSTs) per enclosure, and each OST provides 9TB of formatted object storage space. Each expansion stage therefore adds four OSTs to the OSS pair. The OSTs can be connected to lnet via 40Gb QDR Infiniband or 10Gb Ethernet connections. When viewed from any compute node equipped with the Lustre client, the entire namespace can be viewed and managed like any other filesystem, but with enhancements for Lustre management.
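Because the namespace presents itself as an ordinary mount point, standard tools can be used alongside the Lustre-specific ones; a brief sketch from a client, assuming a hypothetical mount point of /mnt/lustre:

# Standard POSIX view of the aggregate filesystem capacity.
df -h /mnt/lustre

# Lustre-specific view: capacity and usage broken out for the MDT and each OST.
lfs df -h /mnt/lustre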




Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration

Scalability

Providing the Object Storage Servers in active/active cluster configurations yields greater throughput and product reliability. This is also the recommended configuration for decreasing maintenance requirements, consequently reducing potential downtime. The PowerEdge servers are included to provide greater performance and simplify maintenance by eliminating specialty hardware. This third-generation solution continues to scale from 48TB up to 384TB of raw storage for each OSS pair. The DT-HSS3 solution leverages QDR Infiniband interconnects for high-speed, low-latency storage transactions, or 10Gb Ethernet for high speed at lower cost and the ability to use any existing 10Gb Ethernet infrastructure. The third-generation upgrade to the PowerEdge R710 takes advantage of the PCIe Gen2 interface for QDR Infiniband, helping achieve higher network throughput per OSS. The Terascala Management Console offers the same level of management and monitoring as before, and the Terascala Lustre Kit for PCM 2.0.1 remains unchanged at version 1.1. This combination of components, from storage to client access, is formally offered as the DT-HSS3 appliance. An example 96TB configuration with the PowerEdge R710 is shown in Figure 5.




Figure 6: MDS Cable Configuration

Figure 7: OSS Scalability

Scaling the DT-HSS3 is achieved in two ways. The first method, demonstrated in Figure 8, involves simply expanding the storage capacity of a single OSS pair by adding MD1200 storage arrays in pairs. This increases the volume of storage available while maintaining a consistent maximum network throughput. As shown in Figure 7, the second method adds additional OSS pairs, increasing both the total network throughput and the storage capacity. See Figure 8 for detail of the cable configuration and of expanding DT-HSS3 storage by increasing the number of storage arrays.




Figure 8: Cable Diagram for OSS

Networking

Management Network

The private management network provides a communication infrastructure for Lustre and Lustre HA functionality as well as storage configuration and maintenance. This network creates the segmentation required to facilitate day-to-day operations and to limit the scope of troubleshooting and maintenance. Access to this network is provided via the Terascala Management Server, the TMS2000. The TMS2000 exposes a single communications port to the external management network, and all user-level access is via this device. The TMS2000 is responsible for user interaction as well as system health management and monitoring. While the TMS2000 is responsible for collecting data and for management, it does not play an active role in the Lustre filesystem and can be serviced without requiring downtime on the filesystem. The TMS2000 presents the data collected and provides management via an interactive GUI; users interact with it through the Terascala Management Console.

Data Network

The Lustre filesystem is served via a preconfigured lnet implementation on either Infiniband or 10Gb Ethernet. On the Infiniband network, fast transfer speeds with low latency can be achieved, and lnet leverages RDMA for rapid file and metadata transfer from MDTs and OSTs to the clients. The OSS and MDS servers take advantage of the full QDR Infiniband fabric with single-port ConnectX-2 40Gb adapters. The QDR Infiniband controllers can be easily integrated into existing DDR networks using QDR-to-DDR crossover cables if necessary. With the 10Gb Ethernet network, Lustre can still benefit from fast transfer speeds while taking advantage of the lower cost and pervasiveness of Ethernet technology, allowing any existing 10Gb infrastructure to be leveraged.
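On a stock Lustre client, the choice of lnet transport is expressed as a kernel module option; a minimal sketch, assuming the Infiniband interface is ib0 and the 10Gb Ethernet interface is eth2 (the DT-HSS3 client setup provides the appropriate configuration, so this is illustrative only):

# /etc/modprobe.conf (RHEL 5 era) or /etc/modprobe.d/lustre.conf:
# RDMA-capable lnet over Infiniband ...
options lnet networks=o2ib0(ib0)
# ... or TCP-based lnet over a 10Gb Ethernet interface instead:
# options lnet networks=tcp0(eth2)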

Managing the Dell | Terascala HPC Storage Solution


The Terascala Management Console (TMC) takes the complexity out of administering a Lustre file system by providing a centralized GUI for management purposes. The TMC can be used as a tool to standardize the following actions: mounting and unmounting the file system, initiating failover of the file system from one node to another, and monitoring the performance of the file system and the status of its components. Figure 9 illustrates the TMC main interface.

The TMC is a Java-based application that can be run from any computer equipped with a Java JRE to remotely manage the solution (assuming all security requirements are met). It provides a comprehensive view of the hardware and filesystem, while allowing monitoring and management of the DT-HSS solution. Figure 9 shows the initial window view of a DT-HSS3 system. In the left pane of the window are all the key hardware elements of the system; each element can be selected to drill down for additional information. In the center pane is a view of the system from the Lustre perspective, showing the status of the MDS and the various OSS nodes. In the right pane is a message window that highlights any conditions or status changes. The bottom pane displays a view of the PowerVault storage arrays. Using the TMC, many tasks that once required complex CLI instructions can now be completed easily with a few mouse clicks: shutting down the file system, initiating a failover from one MDS to another, monitoring the MD32XX arrays, and so on.
Figure 9: Terascala Management Console

Performance Studies
The goal of the performance studies in this paper is to profile the capabilities of a selected DT-HSS3 configuration, identifying points of peak performance and the most appropriate methods for scaling to a variety of use cases. The test bed is a Dell HPC compute cluster configured as described in Table 1.


A number of performance studies were executed, stressing a 192TB DT-HSS3 configuration with different types of workloads to determine the limits of its performance and the sustainability of that performance. For these studies, Infiniband was the network technology of choice because of its higher bandwidth and lower latency compared to 10Gb Ethernet.

Table 1: Test Cluster Details

Compute Nodes: Dell PowerEdge M610, 16 each in 4 PowerEdge M1000e chassis
Server BIOS: 3.0.0
Processors: Intel Xeon X5650
Memory: 6 x 4GB 1333MHz RDIMMs
Interconnect: Infiniband - Mellanox MTH MT26428 [ConnectX IB QDR, Gen-2 PCIe]
IB Switch: Mellanox 3601Q QDR blade chassis I/O switch module
Cluster Suite: Platform PCM 2.0.1
OS: Red Hat Enterprise Linux 5.5 x86_64 (2.6.18-194.el5)
IB Stack: OFED 1.5.1

Testing focused on three key performance markers:

Throughput, data transferred in MB/s
I/O operations per second (IOPS)
Metadata operations per second

The goal is a broad but accurate review of the computing platform. Two types of file access methods are used in these benchmarks. The first is N-to-N, where every thread of the benchmark (N clients) writes to a different file (N files) on the storage system; IOzone and IOR can both be configured to use the N-to-N file-access method. The second is N-to-1, where every thread writes to the same file (N clients, 1 file); IOR can use MPI-IO, HDF5, or POSIX to run N-to-1 file-access tests. N-to-1 testing determines how the file system handles the overhead introduced when multiple clients (threads) write concurrently to the same file; this overhead comes from threads dealing with Lustre's file locking and serialized writes. See Appendix A for examples of the commands used to run these benchmarks.

The number of threads or tasks used with these benchmarks is chosen around the test cluster configuration. The number of threads corresponds to the number of hosts up to 64; that is, one thread per host is used until 64 hosts are reached, and thread counts above 64 are achieved by increasing the number of threads evenly across all clients. IOzone limits the number of threads to 255, so at the final IOzone data point one client runs three threads rather than the four run by the other clients. Figure 10 shows a diagram of the cluster configuration used for this study.
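The thread placement described above is driven by the host file passed to IOzone with the -+m option, which lists one line per thread: the client hostname, the working directory on the Lustre mount, and the path to the IOzone binary on that client. The hostnames and paths below are illustrative placeholders; threads are added round-robin across chassis and compute nodes.

compute-00-01 /mnt/lustre/iozone-work /root/iozone/iozone
compute-01-01 /mnt/lustre/iozone-work /root/iozone/iozone
compute-02-01 /mnt/lustre/iozone-work /root/iozone/iozone
compute-03-01 /mnt/lustre/iozone-work /root/iozone/iozone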



Figure 10: Test Cluster Configuration

The test environment for the DT-HSS3 has a single MDS pair and a single OSS pair, with a total of 192TB of raw disk space available for the performance testing. The OSS pair contains two PowerEdge R710s, each with 24GB of memory, two 6Gbps SAS controllers and a single Mellanox ConnectX-2 QDR HCA. Each MDS has 48GB of memory, a single 6Gbps SAS controller and a single Mellanox ConnectX-2 QDR HCA. The Infiniband fabric was configured as a 50% blocking fabric by populating mezzanine slot B on all blades, using a single Infiniband switch in I/O module slot B1 of each chassis, and using one external 36-port switch; eight external ports on each internal switch were connected to the external 36-port switch. Because the total throughput available to and from any chassis exceeds the throughput to the OSS servers by 4:1, this fabric configuration exceeded the bandwidth requirements necessary to fully utilize the OSS pair and MDS pair for all testing.




Table 2: DT-HSS3 Configuration

Configuration Size: 192TB Medium
Lustre Version: 1.8.4
Number of Drives: 96 x 2TB NL SAS drives
OSS Nodes: 2 x PowerEdge R710 servers with 24GB memory
OSS Storage Arrays: 2 x MD3200 / 3 x MD1200
Drives in OSS Storage Arrays: 96 x 3.5" 2TB 7200RPM near-line SAS
MDS Nodes: 2 x PowerEdge R710 servers with 48GB memory
MDS Storage Array: 1 x MD3220
Drives in MDS Storage Array: 24 x 2.5" 500GB 7200RPM near-line SAS

Infiniband
PowerEdge R710 Servers: Mellanox ConnectX-2 QDR HCA
Compute Nodes: Mellanox ConnectX-2 QDR HCA - mezzanine card
External QDR IB Switch: Mellanox 36-port IS5030
IB Switch Connectivity: QDR cables

10Gb Ethernet (test data not included)
PowerEdge R710 Servers: Intel X520 DA 10Gb Dual Port SFP+ Advanced
Compute Nodes: Intel 82599 10G mezzanine card
External 10Gb Ethernet Switch: PowerConnect 8024 10GbE switch
10Gb Ethernet Switch Connectivity: SFP+ cables

All tests are performed with a cold cache established with the following technique. After each test, the filesystem is unmounted from all clients and then unmounted from the metadata server. Once the metadata server is unmounted, a sync is performed and the kernel is instructed to drop caches with the following commands:

sync
echo 3 > /proc/sys/vm/drop_caches
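A sketch of how this sequence might be scripted between runs is shown below; the client list, mount point, and server hostname are illustrative assumptions, and in practice the appliance's own tools handle the server-side unmount.

#!/bin/bash
# Re-establish a cold cache between benchmark runs.
CLIENTS="compute-00-01 compute-00-02"   # illustrative client list
MNT=/mnt/lustre

# 1. Unmount the Lustre filesystem from every client.
for c in $CLIENTS; do
    ssh "$c" umount "$MNT"
done

# 2. After the metadata server has been unmounted, flush and drop its caches.
ssh mds01 "sync; echo 3 > /proc/sys/vm/drop_caches"

# 3. Remount the clients (assumes an /etc/fstab entry for the mount point).
for c in $CLIENTS; do
    ssh "$c" mount "$MNT"
done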

To further guard against client cache effects, the compute node responsible for writing a sequential file was never the same one used for reading it. A file size of 64GB was chosen for sequential writes. This is larger than the combined memory of each PowerEdge R710 pair, so all sequential reads and writes have an aggregate sample size of 64GB multiplied by the number of threads, up to 64GB x 255 threads, or more than 15TB. The block size for IOzone was set to 1MB to match the 1MB request size. In measuring the performance of the DT-HSS3, all tests were performed within similar environments: the filesystem was configured to be fully functional, and the targets tested were emptied of files and directories prior to each test.

N-to-N Sequential Reads / Writes

The sequential testing was done with the IOzone benchmark, version 3.347; throughput results are reported by IOzone in KB/sec and converted to MB/sec here for consistency. N-to-N testing describes the technique of generating a single file for each process thread spawned by IOzone. A 64GB file size was determined to be appropriate for testing: each individual file written was large enough to saturate the total available cache on each OSS/MD3200 pair. Files written were distributed evenly across the client LUNs to prevent an uneven I/O load on any single SAS connection, in the same way that a user would be expected to balance a workload.
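Tying this to the command reference in Appendix A, a single sequential data point could be collected as sketched below; the IOzone path, host file, and thread count are illustrative.

# 16-thread sequential write pass: 1MB records, 64GB per thread, include close
# and flush in timing (-c -e), threads placed according to ./hosts (-+m).
THREAD=16
/root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts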
Figure 11: Sequential Reads / Writes DT-HSS3

[Chart: sequential read and write throughput in MB/sec vs. number of threads (1 to 255).]

Figure 11 demonstrates the performance of the DT-HSS3 192TB test configuration. Write performance peaks at 4.3GB/sec at sixteen concurrent threads and continues to exceed 3.0GB/sec while each of the 64 clients writes up to four files at a time; with four files per client, sixteen files are being written simultaneously on each OST. Read performance exceeds 6.5GB/sec for sixteen simultaneous processes and continues to exceed 3.5GB/sec while four threads per compute node access individual OSTs. The write and read performance rises steadily as the number of process threads increases up to sixteen; this is the result of increasing the number of OSTs utilized directly with the number of processes and compute nodes. When the client-to-OST ratio (CO) exceeds one, the write workload is distributed between multiple files on each OST, consequently increasing seek and Lustre callback time. By careful crafting of the IOzone hosts file, each added thread is balanced between blade chassis, compute nodes, and SAS controllers, and adds a single file to each of the OSTs, allowing a consistent increase in the number of files written while creating minimal locking contention between the files.

As the number of files written increases beyond one per OST, throughput declines. This is most likely the result of the larger number of requests per OST: throughput drops as the number of clients making requests to each OST increases, and positional delays and other overhead are expected as a result of the increased randomization of client requests. To maintain the higher throughput for a greater number of files, increasing the number of OSTs should help. A review of the storage array performance from the Dell PowerVault Modular Disk Storage Manager, or using the SMcli performance monitor, was done to independently confirm the throughput values produced by the benchmarking tools.

As a reference point for 10Gb Ethernet implementations, an HSS using two OSSs with four OSTs each, tested against 16 clients on 1Gb Ethernet links running two threads each (32 threads in total), gave a write performance of 1.75GB/sec and a read performance of 1.37GB/sec.

Random Reads and Writes

The same tool, IOzone, was used to gather random read and write metrics. In this case, a total of 64GB is written during any one execution, divided evenly among the threads, and the IOzone hosts are arranged to distribute the workload evenly across the compute nodes. The storage is addressed as a single volume with an OST count of 16 and a stripe size of 4MB. As performance is measured in operations per second (OPS), a 4KB request size is used because it aligns with this Lustre version's 4KB filesystem block size. Figure 12 shows that the random writes saturate at sixteen threads and around 5,200 IOPS; at sixteen threads, the client-to-OST ratio (CO) is one. As the number of IOzone threads increases, the CO increases to a maximum of 4:1 at 64 threads. The reads increase rapidly from sixteen to 32 threads and then continue to increase at a relatively steady rate. As the writes require a file lock per OST accessed, saturation is not unexpected; reads take advantage of Lustre's ability to grant overlapping read extent locks for part or all of a file. Increasing the number of disks in the single OSS pair, or adding OSS pairs, can increase the throughput.
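A sketch of one random I/O data point, again following Appendix A, is shown below. The directory path and the per-thread size are illustrative, and the lfs setstripe size option is written as -S here, whereas the Lustre 1.8-era client used in this study takes -s instead.

# Stripe the working directory across all available OSTs with a 4MB stripe size.
lfs setstripe -c -1 -S 4m /mnt/lustre/iozone-random

# 64GB total divided evenly across the threads, 4KB requests, O_DIRECT (-I) to
# bypass the client-side cache.
THREAD=32
SIZE=2g        # 64GB / 32 threads
/root/iozone/iozone -i 2 -w -O -r 4k -s $SIZE -t $THREAD -I -+n -+m ./hosts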
Figure 12: N-to-N Random reads and writes.

[Chart: random read and random write IOPS vs. number of threads (8 to 255).]

IOR N-to-1 Reads and Writes

Performance of the DT-HSS3 with reads and writes to a single shared file was reviewed with the IOR benchmarking tool. IOR is written to accommodate MPI communications for parallel operations and has support for manipulating Lustre striping. IOR allows several different I/O interfaces for working with the files, but this testing used the POSIX interface to exclude more advanced features and their associated overhead. This gives an opportunity to review the filesystem and hardware performance independent of those additional enhancements, and most applications have adopted a one-file-per-processor approach to disk I/O rather than using parallel I/O APIs. IOR benchmark version 2.10.3 was used in this study and is available at http://sourceforge.net/projects/ior-sio/. The MPI stack used for this study was Open MPI version 1.4.1, provided as part of the Mellanox OFED kit.

The configuration for writing included a directory set with striping characteristics designed to stripe across all sixteen OSTs with a stripe chunk size of 4MB, so the file to which all clients write is evenly striped across all OSTs. The request size for Lustre is 1MB, but in this test a transfer size of 4MB was used to match the stripe size.
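A sketch of this setup, combining the striping layout with the IOR write command from Appendix A, is shown below. The directory, file name, block size, and task count are illustrative, and -S is again -s on the Lustre 1.8 client.

# Stripe the target directory across all sixteen OSTs with a 4MB stripe chunk.
lfs setstripe -c 16 -S 4m /mnt/lustre/ior

# 32 tasks writing one shared file through the POSIX interface with 4MB transfers.
IOR=/root/ior/IOR
IORFILE=/mnt/lustre/ior/shared_file
SIZE=4g
mpirun -np 32 --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -w -o $IORFILE -t 4m -s 1 -b $SIZE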
Figure 13: N-to-1 Read / Write

[Chart: N-to-1 read and write throughput in MB/sec vs. number of threads (2 to 64).]

After setting the 4MB transfer size for IOR, reads have the advantage and peak at greater than 5.9GB/sec, a value close to the sequential N-to-N performance. Write performance peaks at 4.1GB/sec, within 10% of the sequential writes, but as the client-to-OST ratio (CO) increases, the write performance settles at a sustained rate of around 2.9GB/sec. Reads are less affected, since the read locks granted to clients are allowed to overlap for some or all of a given file.

Metadata Testing

Metadata testing measures the time to complete certain file and directory operations that return attributes. Mdtest is an MPI-coordinated benchmark that performs create, stat, and delete operations on files and directories and reports the rate at which the operations complete, in operations per second. Because mdtest is MPI enabled, mpirun is used to launch test threads from each included compute node, and mdtest can be configured to compare metadata performance for directories and files. This study used mdtest version 1.8.3, which is available at http://sourceforge.net/projects/mdtest/. The MPI stack used for this study was Open MPI version 1.4.1, provided as part of the Mellanox OFED kit. The Metadata Server internal details, including the processor and memory configuration, are outlined in Table 3 for comparison.
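A sketch of one mdtest file-create run built from the Appendix A command is shown below; the variable values are illustrative placeholders for the values used in a given sweep.

# 16 tasks, 6 iterations, files only (-F), create phase only (-C);
# -I sets the number of items per directory for each task.
THREAD=16
HOSTFILE=./hosts
MDTEST=/root/mdtest/mdtest
FILEDIR=/mnt/lustre/mdtest
DIRS=1
FTP=1000
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -C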




Table 3: Metadata Server Configuration

Processor: 2 x Intel Xeon E5620, 2.4GHz, 12M cache, Turbo
Memory: 48GB (12 x 4GB), 1333MHz dual-ranked RDIMMs
Controller: 2 x SAS 6Gbps HBA
PCI Riser: Riser with 2 PCIe x8 + 2 PCIe x4 slots

On a Lustre file system, directory metadata operations are all performed on the MDT. The OSTs are queried for object identifiers in order to allocate or locate the extents associated with the metadata operations, so most metadata operations indirectly involve the OSTs. As a result of this dependency, Lustre favors metadata operations on files with lower stripe counts: the greater the stripe count, the more operations are required to complete the OST object allocation or calculations. The first test stripes files in 1MB stripes over all available OSTs; the next test repeats the same operations with files on a single OST for comparison. Metadata operations against a single OST and the same operations with single-stripe files spread evenly over sixteen OSTs turn out to be very similar.
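A sketch of how the two layouts compared here could be prepared; the directory names are illustrative, and -S is -s on the Lustre 1.8 client.

# Directory whose files are striped 1MB-wide across all available OSTs.
lfs setstripe -c -1 -S 1m /mnt/lustre/mdtest-16ost

# Directory whose files each reside on a single OST.
lfs setstripe -c 1 /mnt/lustre/mdtest-1ost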
Figure 14: Metadata create comparison

[Chart: file and directory create rates in operations/sec vs. number of tasks (1 to 64), comparing 16-OST file create, 1-OST file create, and 16-OST directory create.]

Preallocating object identifiers from sixteen OSTs or from a single OST does not appear to change the rate at which files are created; the performance of the MDT and MDS is more important to the create operation.




Figure 15: Metadata File / Directory Stat

[Chart: file and directory stat rates in operations/sec vs. number of tasks (1 to 64), comparing 16-OST and 1-OST file stat and directory stat.]

The mdtest benchmark for stat uses the native stat call, as opposed to the bash shell command, to retrieve file and directory metadata. The directory stats appear to peak around 26,000 operations per second, and the increased number of OSTs makes a significant difference in the number of calls handled at one task and above. If the files were striped across more than one OST, all of those OSTs would have to be contacted in order to accurately calculate the file size. The file stats indicate that beyond 8 tasks the advantage lies in distributing the requests across more than a single OST, but the increase is not significant, as the requests to any one OST do not represent more than a fraction of the data requested by the test.

Directory creates do not require the same draw from the prefetch pool of OST object IDs as file creates do. The result is fewer operations to completion and therefore a faster time to completion. Removal of files requires less communication between the compute nodes (Lustre clients) and the MDS than removing directories, since directory removal requires verification from the MDS that the directory is void of content prior to execution. At 32 simultaneous tasks, a break occurs in the file removal results between the single-OST case and the sixteen-OST case. The metadata operations for directories across multiple OSTs appear to gain a slight advantage after the number of tasks exceeds 32, and in the case of file removal there is a definite increase in performance when the commands are executed across multiple OSTs.




Figure 16: Metadata File / Directory Remove

[Chart: file and directory remove rates in operations/sec vs. number of tasks (1 to 64), comparing 16-OST and 1-OST file remove and directory remove.]



Conclusions
There is a well-known requirement for scalable, high-performance clustered filesystem solutions, and there are many options for customized storage. The third generation of the Dell | Terascala HSS, the DT-HSS3, addresses this need with the added benefit of standards-based, proven Dell PowerEdge and PowerVault products and Lustre, the leading open source clustered filesystem. The Terascala Management Console (TMC) unifies the enhancements and optimizations included in the DT-HSS3 into a single control and monitoring panel for ease of use.

The latest generation Dell | Terascala HPC Storage Solution continues to offer the same advantages as the previous-generation solutions, but with greater processing power, more memory and greater I/O bandwidth. Raw storage scaling from 48TB up to a total of 384TB per Object Storage Server pair, and up to 6.2GB/sec of throughput in a packaged component, is consistent with the needs of high-performance computing environments, and the DT-HSS3 is capable of scaling horizontally as easily as it scales vertically. The performance studies demonstrate an increase in throughput for both reads and writes over previous generations for N-to-N and N-to-1 file access, and the mdtest results show a generous increase in metadata operation rates. With the PCIe 2.0 interface, the controllers and HCAs excel in both high-IOPS and high-bandwidth applications.

The continued use of generally available, industry-standard tools like IOzone, IOR and mdtest provides an easy way to match current and expected growth against the performance outlined for the DT-HSS3. The profiles reported by each of these tools provide sufficient information to align the configuration of the DT-HSS3 with the requirements of many applications or groups of applications. In short, the Dell | Terascala HPC Storage Solution delivers the benefits of scale-out, parallel file system based storage for high-performance computing needs.



Appendix A: Benchmark Command Reference


This section describes the commands used to benchmark the Dell | Terascala HPC Storage Solution 3.

IOzone

IOzone Sequential Writes:
/root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone Sequential Reads:
/root/iozone/iozone -i 1 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone IOPS Random Reads / Writes:
/root/iozone/iozone -i 2 -w -O -r 4k -s $SIZE -t $THREAD -I -+n -+m ./hosts

The O_DIRECT command line parameter (-I) allows us to bypass the cache on the compute nodes where the IOzone threads are running.

IOR

IOR Writes:
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -w -o $IORFILE -t 4m -s 1 -b $SIZE

IOR Reads:
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -r -o $IORFILE -t 4m -s 1 -b $SIZE

mdtest

Create files:
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -C

Stat files:
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -R -T

Remove files:
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -r



Appendix B: Terascala Lustre Kit


PCM software uses the following terminology to describe software provisioning concepts:

1. Installer Node - runs services such as DNS, DHCP, HTTP, TFTP, etc.
2. Components - RPMs
3. Kits - collections of components
4. Repositories - collections of kits
5. Node Groups - allow association of software with a particular set of compute nodes

The following is an example of how to integrate a Dell | Terascala HPC Storage Solution into an existing PCM cluster:

1. Access a list of existing repositories on the head node:

The key here is the Repo name: line.

2. Add the Terascala kit to the cluster:

3. Confirm the kit has been added:

4. Associate the kit to the compute node group on which the Lustre client should be installed:
   a. Launch ngedit at the console: # ngedit
   b. On the Node Group Editor screen, select the compute node group to add the Lustre client.
   c. Select the Edit button at the bottom of the screen.
   d. Accept all default settings until you reach the Components screen.
   e. Using the down arrow, select the Terascala Lustre kit.
   f. Expand and select the Terascala Lustre kit component.
   g. Accept the default settings, and on the Summary of Changes screen, accept the changes and push the packages out to the compute nodes.

On the front-end node, there is now a /root/terascala directory that contains sample directory setup scripts. There is also a /home/apps-lustre directory that contains Lustre client configuration parameters. This directory contains a file that the Lustre file system startup script uses to optimize clients for Lustre operations.



References
Dell | Terascala HPC Storage Solution Brief
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-hpcstoragesolution-brief.pdf

Dell | Terascala HPC Storage Solution Whitepaper
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-dt-hss2.pdf

Dell PowerVault MD3200 / MD3220
http://www.dell.com/us/en/enterprise/storage/powervault-md3200/pd.aspx?refid=powervaultmd3200&s=biz&cs=555

Dell PowerVault MD1200
http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1218&s=biz

Lustre Home Page
http://wiki.lustre.org/index.php/Main_Page

Transitioning to 6Gb/s SAS
http://www.dell.com/downloads/global/products/pvaul/en/6gb-sas-transition.pdf

Dell HPC Solutions Home Page
http://www.dell.com/hpc

Dell HPC Wiki
http://www.HPCatDell.com

Terascala Home Page
http://www.terascala.com

Platform Computing
http://www.platform.com

Mellanox Technologies
http://www.mellanox.com

