Include Discovery in a Grant

Grant-Specific Language

Description of Discovery

NMSU Information and Communication Technologies (ICT) centrally manages Discovery, a high performance computing (HPC) resource. It is free for current students, faculty, and staff to use for research and classroom activities. As of December 2021, the campus HPC has 54 compute nodes (38 CPU nodes and 16 GPU nodes), with a total of 1,536 cores. The nodes have 16–32 cores each and 64 GB to 256 GB of RAM per node. High-memory nodes have up to 3 TB of RAM per node. The system runs Red Hat Enterprise Linux release 8 (RHEL 8). The Slurm scheduler and environment modules are used to enhance the user experience and to mimic the design of national computing resources (e.g., ACCESS). Unlike most HPC systems, the NMSU system employs a fair-share policy: those who use the system less are given slight priority over those who use it at a high rate, but every user can use the system as much as they need.

Purchasing hardware on the Discovery cluster guarantees on-demand access to the purchased resources and priority access to any idle compute resources on the cluster. This allows investors to "burst out" into the main cluster, giving them access to more resources than a stand-alone system can offer. Investors also contribute to the community resource, because their idle resources are available for community use. No allocations of time are given, so Discovery is a good environment in which to learn and become comfortable with the command line and HPC systems before moving on to the highly competitive environments of the national resources, where time and access are limited.

The enterprise data center provides a professionally managed, secure, and environmentally controlled facility. This 24/7 facility houses enterprise and special research servers and storage equipment. The data center provides enhanced physical and data security, many-layered redundant power and cooling systems, off-site data backup in secure facilities, an Inergen fire suppression system, and redundant data-center-class high-speed networking to ensure high availability of critical instruction, research, and administrative systems.

The NMSU IT staff is well trained to develop next-generation cyberinfrastructure for data centers and network virtualization. They are well respected throughout the campuses and the state and have provided support for many other institutions. Strahinja Trecakov and Nicholas Von Wolff are the current HPC administrators. Mr. Trecakov has over 2 years of experience and an extensive network of HPC professionals to draw upon for assistance. Mr. Von Wolff is new to HPC but has over 3 years of experience in systems administration. Curtis Ewing is the Systems Administration Manager, with over 25 years of experience. Mr. Trecakov, Mr. Von Wolff, and Mr. Ewing are the personnel who will support and maintain the proposed computational system, and their resumes are included in this submission. In addition, the HPC is supported by the entire Systems Team, consisting of 9 additional people, who provide niche expertise, storage, and backups, among other services.

Nicholas’s and Strahinja’s resumes can be found by clicking on their names (last updated January 2020).

Standard Compute Nodes

We propose to purchase XX compute nodes to be installed in the NMSU Discovery campus cluster. Each node will contain dual Intel XXX processors, translating into XX total physical cores, and 192/256/384 GB of dual-rank RDIMM memory running at 2666 MT/s. The nodes are connected via an InfiniBand network. NMSU provides staff to administer and maintain the cluster. NMSU guarantees immediate, on-demand access to the nodes for no less than 5 years.

GPU Nodes

We propose to purchase XX compute nodes to be installed in the NMSU Discovery campus cluster. Each node will contain dual Intel XXX processors, translating into XX total physical cores, and 192/256/384 GB of dual-rank RDIMM memory running at 2666 MT/s. Each node will be equipped with two NVIDIA Tesla V100 GPUs. The nodes are connected via an InfiniBand network. NMSU provides staff to administer and maintain the cluster. NMSU guarantees immediate, on-demand access to the nodes for no less than 5 years.

Storage

The following language is a placeholder until an appropriate storage solution has been determined.