Re: [HTCondor-users] Multiple HTCondor workers on a single compute node



Dear Greg

I appreciate your help. The method works perfectly for me.

```
export VERY_RNUM=$RANDOM
export _CONDOR_STARTD_RESOURCE_PREFIX=slot_${VERY_RNUM}_
export _CONDOR_LOCAL_DIR=$SCRATCH/${SLURM_NODELIST%.*}/log/${VERY_RNUM}
mkdir -p ${_CONDOR_LOCAL_DIR}/log
mkdir -p ${_CONDOR_LOCAL_DIR}/execute
condor_master -f -n compute_condor_${VERY_RNUM}
```
Thank you very much!

Best, 
Seung

On Mon, Nov 6, 2023 at 4:11 PM Daues, Gregory Edward <daues@xxxxxxxxxxxx> wrote:

And I guess I left off some elements; one can also set a LOCAL_DIR
and a prefix for the glide-in using the random number:

export VERY_RNUM=$RANDOM
export _CONDOR_STARTD_RESOURCE_PREFIX=slot_${VERY_RNUM}_
export _CONDOR_LOCAL_DIR=/scratch.local/${USER}/${VERY_RNUM}
${_CONDOR_SBIN}/condor_master -f -n compute_condor_${VERY_RNUM}

Those should be the elements that keep the different condor_master
instances from interfering with one another.
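
Since $RANDOM only yields values from 0 to 32767, two glide-ins on the same node could in principle draw the same number. A minimal sketch of one way to guard against that, assuming bash: claim the per-instance directory with a plain mkdir, which fails if the directory already exists, and retry on a collision. The names BASE_DIR and claim_instance_dir are hypothetical and not HTCondor settings, and the final command is only printed so the sketch can be tried without HTCondor installed.

```shell
#!/bin/bash
# Sketch only: BASE_DIR and claim_instance_dir are illustrative names.

claim_instance_dir() {
    local base=$1 rnum attempt
    mkdir -p "$base" || return 1
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        rnum=$RANDOM
        # mkdir without -p fails if the directory already exists,
        # so only one caller can claim a given number.
        if mkdir "$base/$rnum" 2>/dev/null; then
            echo "$rnum"
            return 0
        fi
    done
    return 1
}

# Assumed scratch location; adjust for the site.
BASE_DIR=${BASE_DIR:-${TMPDIR:-/tmp}/condor_glidein_${USER:-user}}

VERY_RNUM=$(claim_instance_dir "$BASE_DIR") || exit 1
export VERY_RNUM
export _CONDOR_STARTD_RESOURCE_PREFIX=slot_${VERY_RNUM}_
export _CONDOR_LOCAL_DIR=$BASE_DIR/$VERY_RNUM
mkdir -p "$_CONDOR_LOCAL_DIR/log" "$_CONDOR_LOCAL_DIR/execute"

# Printed rather than executed; replace echo with the real call when ready.
echo "condor_master -f -n compute_condor_${VERY_RNUM}"
```

Each instance then gets its own LOCAL_DIR, so the masters do not contend for the same InstanceLock file in a shared log directory.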

    Greg



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Daues, Gregory Edward <daues@xxxxxxxxxxxx>
Sent: Monday, November 6, 2023 5:59 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Multiple HTCondor workers on a single compute node

Hello,

I use the -n option with a random number in a bash script like

#!/bin/bash

export VERY_RNUM=$RANDOM
${_CONDOR_SBIN}/condor_master -f -n compute_condor_${VERY_RNUM}

but I imagine there could be other ways.
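
One such alternative, sketched here as an assumption rather than something tested with HTCondor: inside a SLURM job the SLURM_JOB_ID environment variable is set and is unique per job, so it can serve as the suffix in place of $RANDOM. The helper name instance_suffix is hypothetical, and the command is only printed so the sketch runs without HTCondor installed.

```shell
#!/bin/bash
# Sketch only: instance_suffix is an illustrative helper name.

instance_suffix() {
    # SLURM sets SLURM_JOB_ID inside a job; fall back to $RANDOM elsewhere.
    echo "${SLURM_JOB_ID:-$RANDOM}"
}

SUFFIX=$(instance_suffix)

# Printed rather than executed; replace echo with the real call when ready.
echo "condor_master -f -n compute_condor_${SUFFIX}"
```

Note that two glide-ins started from within the same SLURM job would share the job ID, so this only separates instances that come from different jobs.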

     Greg



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Seung-Jin Sul <ssul@xxxxxxx>
Sent: Monday, November 6, 2023 5:21 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Multiple HTCondor workers on a single compute node

Hi,

We use SLURM as a glide-in backend and sometimes need to run multiple HTCondor worker services on the same node. This happens when we request only part of a compute node from SLURM, for example 1 CPU and 10 GB of memory.

When we try to start another instance of HTCondor on the same node, we see the following error:

```
11/06/23 14:49:54 lock_file returning ERROR, errno=11 (Resource temporarily unavailable)
11/06/23 14:49:54 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
11/06/23 14:49:54 ERROR "Can't get lock on "/clusterfs/jgi/scratch/dsi/aa/jaws/dori-dev/htcondor-log/n0099/log/InstanceLock"" at line 1691 in file /var/lib/condor/execute/slot1/dir_3620933/userdir/.tmpdnieob/BUILD/condor-10.2.2/src/condor_master.V6/master.cpp

```


How can we start multiple HTCondor worker services on a node? Any information on setting the port and on the lock file would be helpful.

Thank you!

Best,
Seung
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/