
Re: [HTCondor-users] Multiple HTCondor workers on a single compute node




   Hello,

I use the -n option with a random number in a bash script, like this:

#!/bin/bash

# Random suffix so each condor_master instance on the node gets a unique name.
export VERY_RNUM=$RANDOM

# Run in the foreground (-f) under that per-instance name.
${_CONDOR_SBIN}/condor_master -f -n compute_condor_${VERY_RNUM}

but I imagine there could be other ways.
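For example, one variation (a sketch only, not something I have in production) would be to also give each instance its own local directories, assuming your glide-in wrapper can export _CONDOR_-prefixed environment variables, which HTCondor treats as configuration overrides. That keeps the master's InstanceLock, logs, spool, and execute directories from colliding between instances; the base path below is just a placeholder:

```
#!/bin/bash
# Sketch: per-instance directories so lock/log/spool/execute paths do not collide.
# The base path is a placeholder; _CONDOR_SBIN is assumed to be set as in the script above.

VERY_RNUM=$RANDOM
BASE="${TMPDIR:-/tmp}/condor_local_${HOSTNAME}_${VERY_RNUM}"   # placeholder location
mkdir -p "$BASE"/{log,lock,spool,execute}

# Environment variables prefixed with _CONDOR_ override the matching config macros.
export _CONDOR_LOCAL_DIR="$BASE"
export _CONDOR_LOG="$BASE/log"
export _CONDOR_LOCK="$BASE/lock"
export _CONDOR_SPOOL="$BASE/spool"
export _CONDOR_EXECUTE="$BASE/execute"

# With shared port off, each daemon binds its own ephemeral port instead of 9618,
# so two instances on the same node do not compete for one listening port.
export _CONDOR_USE_SHARED_PORT=False

exec "${_CONDOR_SBIN}/condor_master" -f -n "compute_condor_${VERY_RNUM}"
```

If you would rather keep shared port enabled, I believe SHARED_PORT_PORT could be set to a distinct value per instance instead, but I have not tested that myself.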

          Greg 



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Seung-Jin Sul <ssul@xxxxxxx>
Sent: Monday, November 6, 2023 5:21 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Multiple HTCondor workers on a single compute node
 
Hi, 

We use SLURM as a glide-in backend and sometimes need to run multiple HTCondor worker services on the same node. This happens when we request only part of a compute node from SLURM, e.g. 1 CPU and 10 GB of memory.

When we try to start another instance of HTCondor on the same node, we see the following error:

```
11/06/23 14:49:54 lock_file returning ERROR, errno=11 (Resource temporarily unavailable)
11/06/23 14:49:54 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
11/06/23 14:49:54 ERROR "Can't get lock on "/clusterfs/jgi/scratch/dsi/aa/jaws/dori-dev/htcondor-log/n0099/log/InstanceLock"" at line 1691 in file /var/lib/condor/execute/slot1/dir_3620933/userdir/.tmpdnieob/BUILD/condor-10.2.2/src/condor_master.V6/master.cpp

```


How can we start multiple HTCondor worker services on a node? Any information on how to set the port and handle the lock file would be helpful.

Thank you!

Best, 
Seung