Policies

New Mexico State University’s High-Performance Computing (HPC) mission is to provide the best possible computer systems, networking, and expertise to optimize utilization of these technologies. The Discovery Cluster supports computationally intensive tasks that need to run faster, or require more storage, than a typical PC can provide.

Account Policies

Eligibility

  1. Accounts are available to current students, faculty, staff, and affiliates of New Mexico State University for free.

  2. To get an account, you will need to pass our self-paced Onboarding course or complete 1-on-1 Onboarding with an HPC team member.

  3. Affiliated users must have a New Mexico State University faculty member who will serve as their sponsor. The sponsor assumes responsibility for the affiliate’s compliance with these policies.

Removal

HPC accounts will be removed or deleted under the following circumstances:

  1. When a user indicates via written request that they no longer need the account.

  2. When a user leaves the university.

  3. Users affiliated with New Mexico State University will have their accounts removed when their sponsorship ends.

  4. Accounts that aren’t accessed for a year will be locked, and their data will be removed 60 days after locking. Users will be notified when they have not logged in for a year and will receive 3 warnings (at 60, 40, and 20 days) before the account and its data are deleted.

Locking

  1. Account sharing is prohibited and will result in account locking.

  2. NMSU reserves the right to lock your account at any time, including for mandatory retraining of users.

  3. NMSU reserves the right to revoke access to, and retain, any data or files that may be used as evidence of a violation of NMSU’s Data Use Agreement.

Software Policies

New Mexico State University’s HPC team supports its users by installing and maintaining the Core Software Stack. When a user requests a software installation, the HPC team determines whether the software will be accepted into the Core Software Stack or installed in the user’s home/project directory.

For example, custom Python/R environments should be stored in the user’s home/project directories, while software for OnDemand Applications will be included in the Core Software Stack. The latest software stack can be updated, although compilers and programming languages will be version locked. Older software stacks are frozen and no longer receive new packages. The software stack naming convention indicates the year and release within that year (2022A, 2022B, 2023A, etc.). New software stacks will be released yearly, or when a compiler or language needs to be updated, and each stack will remain available for a minimum of 2 years from its release date. To request a software installation, please fill out the form at → Software Request Form. Certain software may not be able to be installed due to licensing issues, and our team retains the right to reject a request if we believe the software might be malicious or might harm the system.

Software that doesn’t require root-level/core-stack (administrative) access should be installed in a user’s home directory or a group’s project directory. If the software does require root privileges to install, or you feel it would benefit other HPC users, you can submit a request to have it installed.
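
As a minimal sketch of a user-level installation that needs no administrative access, the commands below create a custom Python environment under a home directory, as mentioned above. The module name "python" and the environment path are assumptions; run "module avail" to see what the current software stack actually provides.

    # Sketch of a user-level install (no root/core-stack access needed).
    # The module name "python" is an assumption; run "module avail" to see
    # what the current software stack actually provides.
    module load python

    # Create and activate a virtual environment in your home directory
    python -m venv $HOME/envs/myproject
    source $HOME/envs/myproject/bin/activate

    # Packages now install into $HOME/envs/myproject, not the Core Software Stack
    pip install numpy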

Security Policies

Our policy is to take reasonable measures to ensure the security of our systems.

  1. Please read and comply with NMSU Information Management and Data Security Policies.

  2. We don’t allow account sharing with anyone under any circumstances.

  3. Don’t leave your terminal unattended while you are logged in to your account.

  4. Don’t store or use sensitive or regulated data on NMSU’s HPC cluster. This includes HIPAA, FERPA, and other legally regulated data.

  5. Don’t distribute or copy software or privileged data.

  6. If you see anything suspicious or face any security issue, please send an email to the HPC Team at hpc-team@nmsu.edu.

Downtime and Maintenance Policies

NMSU’s HPC system administrators will schedule downtimes to install patches and work on improvements to the system.

  1. Full outages (every 6 months). Our goal is to have no full outages during the school year, or to keep them as short as possible. These are expected to last up to 2 weeks; users will be notified 2 weeks in advance, and alternative resources will be suggested for those who need a system during the downtime.

  2. Partial outages (as needed). These will be done as needed, but we will always try to minimize the impact and keep the cluster running. They happen on a node-by-node basis, so computational resources remain available even though not all nodes will be. Users may not be notified, as the impact should be negligible.

  3. Planned downtimes. We will notify HPC users 2 weeks in advance through our user mailing list, our group on MS Teams, our website hpc.nmsu.edu, Twitter, Discourse, our newsletter, and a "message of the day" shown when users log in to the Discovery cluster.

  4. Unplanned/emergency downtimes. These can happen for various reasons, such as:

    1. Loss of power or cooling in the data center.

    2. Failure of hardware.

    3. Security issues.

      1. Critical patches.

    4. Adjusting the configuration of the whole cluster where an immediate change is necessary. If any of the above occurs, users will be notified as soon as possible; however, running jobs will likely be impacted. Because emergency downtimes are out of our control, we aim to get the system functioning again as quickly as possible, but some impact on users is unavoidable.

Computational Resource Policies

Policies around the use of computational resources adapt to support the greatest number of users. As this is a moving target, these policies may be updated regularly and communicated via our user mailing list, our group on MS Teams, our website hpc.nmsu.edu, Twitter, Discourse, our newsletter, and a "message of the day" shown when users log into the Discovery cluster.

Login Node

  1. The login node of the HPC cluster isn’t designed to run intensive interactive work. It’s the gateway to the HPC clusters, providing users with access to cluster resources and job submission capabilities. Processes run directly on this node shouldn’t be resource intensive. Please use the Slurm scheduler (via the sbatch/srun commands) to compile or test your code (see the sketch after this list).

  2. Arbiter2 is the service that monitors CPU and memory usage and enforces this policy. It protects the node with two kinds of limits:

    1. When a user reaches the soft limit, their processes can keep running, but Arbiter2 will impose more restrictions on the user’s access to CPU and memory resources. The user will also receive an email showing the high-impact processes and graphs of CPU and memory usage over time. After a period of time, this status will be lifted, with an email notification.

    2. When a user reaches the hard limit, the process will be killed.

  3. For large data transfers (greater than 100 GB or many files at once), please contact the HPC team at hpc-team@nmsu.edu, and we will assist you.
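
As a minimal sketch of moving compile and test work off the login node, the srun command below requests a short interactive session through the Slurm scheduler. The partition name "interactive", the time limit, and the compiler invocation are assumptions; adjust them to your needs.

    # Request a short interactive shell on a compute node instead of
    # running heavy work on the login node (values are placeholders)
    srun --partition=interactive --time=00:30:00 --ntasks=1 --cpus-per-task=2 --pty bash

    # Once the shell starts on the compute node, compile and test as usual
    gcc -O2 -o my_test my_test.c && ./my_test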

Batch System

  1. Users are allowed to have up to 10 jobs actively running on the normal partition and another 10 in the queue, for a total of 20 jobs on the normal partition (a quick way to check your current job counts is shown after this list).

  2. Users are allowed to have up to 2 jobs actively running on the GPU partition and another 2 in the queue, for a total of 4 jobs on the GPU partition.

  3. Users are allowed to have up to 3 jobs actively running on the interactive partition and another 3 in the queue, for a total of 6 jobs on the interactive partition.

  4. Users are allowed to have up to 10 jobs actively running on the backfill partition and another 10 in the queue, for a total of 20 jobs on the backfill partition.

  5. Restricted partitions or partitions owned by labs don’t have these restrictions in place.
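
A quick way to check how close you are to these limits is to count your running and pending jobs per partition with squeue. A minimal sketch follows; the partition names "normal" and "gpu" are taken from the limits above, but confirm the exact names on Discovery with sinfo.

    # Count your running and pending jobs on the normal partition
    squeue -u $USER -p normal -t RUNNING -h | wc -l
    squeue -u $USER -p normal -t PENDING -h | wc -l

    # Combined count for the GPU partition (partition name is an assumption)
    squeue -u $USER -p gpu -t RUNNING,PENDING -h | wc -l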

Job Scheduling

  1. Wall time: varies per partition. If you don’t specify "--time" in your SBATCH script, your job will run with the default limit of 1 minute. This is to encourage users to specify "--time", which enables the backfill scheduler to run more efficiently (see the example script after this list).

  2. CPU memory: the default memory per CPU is set to 500MB. If you need more, please add this requirement to your SBATCH script.

  3. Scheduling: the highest-priority jobs will run first; however, backfill scheduling is enabled. You can find more about backfill scheduling at https://slurm.schedmd.com/sched_config.html.
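
A minimal SBATCH header illustrating these settings is sketched below. The job name, partition, time limit, and memory request are placeholder assumptions; only "--time" and "--mem-per-cpu" relate directly to the defaults described above.

    #!/bin/bash
    #SBATCH --job-name=example        # placeholder job name
    #SBATCH --partition=normal        # partition name from the batch system section
    #SBATCH --time=02:00:00           # without --time, the default wall time is 1 minute
    #SBATCH --ntasks=1
    #SBATCH --mem-per-cpu=2G          # raise the 500MB per-CPU default when needed

    # Replace with your actual program (placeholder)
    ./my_program

Submit the script with "sbatch job.sh"; Slurm applies the defaults above for anything you leave out.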

File Storage Policies

We run the GPFS parallel file system, and all files located in the home, project, and scratch directories are mounted from a storage system. This means that all requests for data must go over the network, so proper management of files is critical to the performance of your applications and of the entire network.

Storage Space for Accounts

  1. Each account is given 100GB of quota on /home by default (a sketch for checking your usage appears after this list).

  2. Each account is given 1024GB of quota on /scratch by default.

  3. Users working together on a project can request 500GB of project space by default. This storage space is shared between users working on a project and can be requested by a faculty member by submitting this form: Project Space Request. The permissions on the project directory will be set as owner=root, group="group of users that share that space".

  4. Users are able to purchase extra storage for $8/TB/month by submitting this form: Project Space Request.
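
To see how much of these quotas you are using, standard disk-usage tools are enough; a minimal sketch follows. The /scratch/$USER path is an assumption about the directory layout, and the site may also provide a dedicated quota command.

    # Summarize total usage of your home and scratch directories
    du -sh $HOME
    du -sh /scratch/$USER        # path layout is an assumption

    # Show the largest top-level directories under your home directory
    du -h --max-depth=1 $HOME | sort -h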

Backups and Data Retention

  1. ICT will make nightly backups of /home and /project. These backups are kept for 30 days. Files in /home and /project will be kept as long as your account is active. If your account remains inactive for 1 year, it will be locked, and after a 60-day grace period your files in /home and /project will be deleted.

  2. ICT doesn’t make backups of /scratch. Files in /scratch not accessed for 120 days will be automatically deleted. Users will be notified when their files haven’t been accessed for 92 days, 106 days, and 120 days (at which point they are purged). A sketch for spotting files nearing the purge window appears after this list.
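
To spot /scratch files approaching the 120-day purge window, you can check access times with find; a minimal sketch, assuming your scratch space lives at /scratch/$USER:

    # List files under your scratch space not accessed in the last 90 days
    find /scratch/$USER -type f -atime +90 -ls

    # Copy anything worth keeping back to /home or /project before it is purged
    # (source and destination paths are placeholders)
    cp -a /scratch/$USER/important_results $HOME/important_results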

Acknowledging HPC team in Publications Policy

If you use NMSU HPC systems and services, please let us know of any published results.

Publications that feature work that relied on NMSU’s HPC or HTC computing resources should cite the following publication:

Strahinja Trecakov and Nicholas Von Wolff. 2021. Doing more with less: Growth, improvements, and management of NMSU’s computing capabilities. In Practice and Experience in Advanced Research Computing (PEARC '21), July 18–22, 2021, Boston, MA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3437359.3465610

Please use the following language to acknowledge Discovery and the New Mexico State University High Performance Computing group in any published or presented work for which results were obtained using the Discovery cluster:

This work utilized resources from the New Mexico State University High Performance Computing Group, which is directly supported by the National Science Foundation (OAC-2019000), the Student Technology Advisory Committee, and New Mexico State University and benefits from inclusion in various grants (DoD ARO-W911NF1810454; NSF EPSCoR OIA-1757207; Partnership for the Advancement of Cancer Research, supported in part by NCI grants U54 CA132383 (NMSU)).