Description of Discovery
NMSU Information and Communication Technologies (ICT) centrally manages Discovery, a high-performance computing (HPC) resource that is free for current students, faculty, and staff to use for research and classroom activities. As of December 2019, the campus HPC has 25 compute nodes, 11 GPU nodes, and 2 high-memory nodes, with a total of 968 cores. The nodes have 16 to 32 cores each and 64 GB to 256 GB of RAM per node. High-memory nodes have up to 3 TB of RAM per node. The system runs CentOS 7. Both SLURM, a scheduler, and modules are used to enhance the user experience and to mimic the design of national computing resources (e.g., XSEDE). Unlike most HPC systems, however, the NMSU system employs a fair-share policy: those who use the system less are given slight priority over those who use it at a high rate, but every user can use the system as much as they need.
Purchasing hardware on the Discovery cluster guarantees on-demand access to the purchased resources and priority access to any idle compute resources on the cluster. Investors can therefore “burst out” into the main cluster and use more resources than a stand-alone system can provide, while also contributing to the community resource, because their own idle resources are available for community use. No allocations of time are given, so Discovery is a good environment in which to learn and become comfortable with the command line and HPC systems before moving on to the highly competitive environments of the national resources, where time and access are limited.
The enterprise datacenter provides a professionally managed, secure, and environmentally controlled facility. This 24/7 facility houses enterprise and special research servers and storage equipment. The datacenter provides enhanced physical and data security, many-layered redundant power and cooling systems, off-site data backup in secure facilities, an Inergen fire suppression system, and redundant datacenter-class high-speed networking to ensure high availability of critical instruction, research, and administrative systems.
The NMSU IT staff is well trained to develop next-generation cyberinfrastructure for datacenters and network virtualization. They are well respected throughout the campuses and the state and have provided support for numerous other institutions. Strahinja Trecakov and Nicholas Von Wolff are the current HPC administrators. Mr. Trecakov has over 2 years of experience and an extensive network of HPC professionals to draw upon for assistance. Mr. Von Wolff is new to HPC but has over 3 years of experience in systems administration. Curtis Ewing is the Systems Administration Manager, with over 25 years of experience. Mr. Trecakov, Mr. Von Wolff, and Mr. Ewing will support and maintain the proposed computational system; their resumes are included in this submission. In addition, the entire Systems Team, consisting of 9 additional people, supports the HPC in areas such as niche expertise, storage, and backups.
Standard Compute Nodes
We propose to purchase XX compute nodes to be installed in the NMSU Discovery campus cluster. Each node will contain dual Intel XXX processors, translating into XX total physical cores, and 192/256/384 GB of 2666 MT/s dual-rank RDIMM memory. The nodes are connected via an InfiniBand network. NMSU provides staff to administer and maintain the cluster. NMSU guarantees immediate, on-demand access to the nodes for no less than 5 years.
GPU Compute Nodes
We propose to purchase XX compute nodes to be installed in the NMSU Discovery campus cluster. Each node will contain dual Intel XXX processors, translating into XX total physical cores, and 192/256/384 GB of 2666 MT/s dual-rank RDIMM memory. Each node will be equipped with dual NVIDIA Tesla V100 GPUs. The nodes are connected via an InfiniBand network. NMSU provides staff to administer and maintain the cluster. NMSU guarantees immediate, on-demand access to the nodes for no less than 5 years.
The following language is a placeholder until an appropriate storage solution has been determined.