OpenFlight Slurm nodes set "down" after start

Hi All,

We’re experiencing a problem where, after running the OpenFlight HPC Ansible Playbook and Slurm has been installed and configured, the nodes start up and “sinfo” reports them as “idle”, but shortly afterwards they drop out and are marked “down”.

We started a cluster with 2 nodes and found the Slurm controller log at /opt/flight/opt/slurm/var/log/slurm/slurmctld.log; however, it only reports:

[2020-04-15T18:12:29.520] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-04-15T18:17:26.489] error: Nodes node[01-02] not responding
[2020-04-15T18:17:26.490] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-04-15T18:17:26.490] No job state file (/opt/flight/opt/slurm/var/spool/slurm.state/job_state.old) found
[2020-04-15T18:19:09.659] error: Nodes node[01-02] not responding, setting DOWN

There’s nothing else in the log to explain why these nodes are going away, and we can still SSH into them and they seem to be fine. Has anyone else experienced this?

Many Thanks,
Antonio

Hi Antonio,

I’ve launched a CentOS 7 research environment using the Ansible playbook. I was, sadly, unable to reproduce the issue you’re seeing.

SLURM has a page on its website with troubleshooting tips for nodes in the DOWN state. It may also be worth checking your network configuration.
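A quick first check would be whether slurmd is actually still running on the compute nodes and what its own log says. I’m assuming here that slurmd is managed by systemd and that its log sits next to the slurmctld.log you already found; adjust the unit name and path if your install differs:

# on node01 / node02: is slurmd running?
systemctl status slurmd

# last few entries of the slurmd log (path guessed from your slurmctld.log location)
tail -n 50 /opt/flight/opt/slurm/var/log/slurm/slurmd.log

# from the head node: can the controller resolve and reach the node?
ping -c 3 node01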

Could you share the output of scontrol show node node01? This may provide further information on why the node went unresponsive.

Kind Regards,

Stu

Hi Stu,

Thanks for the fast response as always!

The output from the command you requested is as follows:

[tony@headnode1 (mycluster) ~]$ scontrol show node node01
NodeName=node01 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.17
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node01 NodeHostName=node01 Version=17.11
OS=Linux 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019
RealMemory=1 AllocMem=0 FreeMem=325 Sockets=1 Boards=1
State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=all
BootTime=2020-04-16T15:41:34 SlurmdStartTime=2020-04-16T15:46:48
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [nobody@2020-04-16T15:53:33]

I don’t know whether that’s of any use in diagnosing the issue?

We’re also seeing this error when running “flight start” on the headnode:

[tony@headnode1 ~]$ flight start
==> [OpenFlight ASCII-art banner]
==> Welcome to mycluster
==> OpenFlight r2019.2
==> Based on CentOS Linux 7.7.1908
TIPS:

‘flight help’ - get help on available commands
‘flight env’ - manage software package environments
‘flight desktop’ - manage interactive GUI desktop sessions

Traceback (most recent call last):
10: from /opt/flight/opt/flight-env/bin/flenv:36:in `<main>'
 9: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler.rb:101:in `setup'
 8: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler.rb:135:in `definition'
 7: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/definition.rb:34:in `build'
 6: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/dsl.rb:13:in `evaluate'
 5: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/dsl.rb:234:in `to_definition'
 4: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/dsl.rb:234:in `new'
 3: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/definition.rb:83:in `initialize'
 2: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/definition.rb:83:in `new'
 1: from /opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/lockfile_parser.rb:95:in `initialize'
/opt/flight/opt/runway/embedded/lib/ruby/site_ruby/2.6.0/bundler/lockfile_parser.rb:108:in `warn_for_outdated_bundler_version': You must use Bundler 2 or greater with this lockfile. (Bundler::LockfileError)
OpenFlight is now active.

Does anyone have any ideas why this might be happening?

Thanks,
Tony

Hi Tony,

That command output has proven less useful than I’d hoped! It’s worth checking a few things on both the controller and the nodes, such as (a few example commands are sketched after this list):

  • Dates and times align
  • Nodes can SSH/ping one another
  • Firewalls are properly configured to allow SLURM ports (or disabled entirely)
  • Munge service is running correctly
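Some example commands for the above (assuming CentOS 7 with systemd and firewalld; adjust the node names to suit your cluster):

date; ssh node01 date              # clocks on the controller and node should roughly agree
ping -c 3 node01                   # basic connectivity from the head node
firewall-cmd --list-all            # confirm the SLURM ports are allowed (or the firewall is off)
systemctl status munge             # munge should be active on every host
munge -n | ssh node01 unmunge      # a credential generated here should decode on the node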

Further to the above, the SLURM troubleshooting tips may provide some extra clarity on the issues you’re facing.
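One thing to bear in mind: depending on your ReturnToService setting, nodes that have been marked DOWN may not rejoin automatically even once slurmd is responding again. If that’s the case, you can clear the state manually with:

scontrol update NodeName=node[01-02] State=RESUME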

As for your flight start issue, it seems there is a version mismatch in Ruby’s Bundler. I suspect this is due to running the latest runway with an older version of env/desktop. Could you send me the output of rpm -qa | grep flight, please?

Cheers,

Stu