
Re: [HTCondor-users] htcondor + gpudirect + openmpi



Hello all,

Can you tell me how this works, and which file we have to edit to make this change?

Regards,
Malathi

----- Original Message -----
From: "Harald van Pee" <pee@xxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, September 12, 2017 8:31:03 PM
Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi

Hello all,

I think I now have a working version for all cases:
CONDOR_CHIRP=`condor_config_val libexec`
CONDOR_CHIRP=$CONDOR_CHIRP/condor_chirp
ncpus=`$CONDOR_CHIRP get_job_attr RequestCpus`
ngpus=`$CONDOR_CHIRP get_job_attr RequestGpus`
...
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
#sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
for(( i=1 ; i <$ngpus ; i++)) ; do
    echo i= $i
    sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' >> machines
#    sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
done;
...
nmpinodes=$(( $ngpus * $_CONDOR_NPROCS))
...
mpirun -v  --prefix $MPDIR --mca $mca_ssh_agent $CONDOR_SSH -n $nmpinodes -hostfile machines $EXECUTABLE $@ &

but

I have to use the old condor_ssh version from HTCondor 8.4, which uses
hostnames rather than proc numbers (indeed, I just changed these parts back).
If I do not use hostnames, it can happen that, when a
request_cpus=1/request_gpus=1 job lands several times on one machine, an sshd
is running there and mpirun starts all jobs on that machine and completely
ignores all the others.
Therefore I think we need hostnames in the machine file, because mpirun cannot
handle proc numbers.
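
For readers following along, here is a minimal, self-contained sketch of the hostfile construction described above. It assumes the contact file has the proc number in column 1 and the hostname in column 2, as the awk calls in the script suggest. The sample contact lines, the node names and the value of ngpus are made up for illustration, and temporary files stand in for $CONDOR_CONTACT_FILE and the machines file.

```shell
# Hypothetical contact file: proc number, hostname, then other fields.
contact=$(mktemp)
printf '%s\n' '1 nodeB 4444 user /tmp x' '0 nodeA 4444 user /tmp x' > "$contact"

ngpus=2                 # assumed value of RequestGpus
machines=$(mktemp)      # stand-in for the "machines" hostfile

# Append the sorted hostname column once per requested GPU.
i=0
while [ "$i" -lt "$ngpus" ]; do
    sort -n -k 1 < "$contact" | awk '{print $2}' >> "$machines"
    i=$(( i + 1 ))
done

# One contact line per HTCondor proc, so this mirrors ngpus * _CONDOR_NPROCS.
nprocs=$(( $(wc -l < "$contact") ))
nmpinodes=$(( ngpus * nprocs ))

echo "ranks: $nmpinodes"
cat "$machines"
```

Each hostname ends up in the hostfile once per requested GPU, so the number of lines matches the value passed to mpirun with -n.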

Why was it changed? Any other pitfalls?

Best
Harald


On Monday 11 September 2017 23:10:09 Harald van Pee wrote:
> On Monday 11 September 2017 22:10:18 Michael Pelletier wrote:
> > I've been using the job ad file for non-dynamic values. For dynamic stuff
> > you could use condor_chirp get_job_attr.
> > 
> >      condor_q -jobads $_CONDOR_JOB_AD -autoformat RequestCpus
> 
> Thanks!
> 
> > -Michael Pelletier.
> > 
> > > -----Original Message-----
> > > From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> > > Behalf Of Harald van Pee
> > > Sent: Monday, September 11, 2017 3:45 PM
> > > To: htcondor-users@xxxxxxxxxxx
> > > Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi
> > > 
> > > Hi Jason,
> > > 
> > > > I think I have done something wrong, or it's just working since I have
> > > > installed Mellanox OFED 4.1.
> > > > 
> > > > Up to now I have tested it only with openmpi-2.0.2a1, but at least for
> > > > this version it's working if I loop over the requested gpus to
> > > > just get more lines in the machine file, and for the -n argument I have
> > > > to multiply $_CONDOR_NPROCS by the number of requested gpus.
> > > > 
> > > > How do I get the number of requested gpus in the script?
> > > > At the moment I would parse
> > > > _CONDOR_AssignedGPUs and count them.
> > > > 
> > > > OMP_NUM_THREADS can be used to get the request_cpus value, but in
> > > > general this is not the number of mpinodes per node.
> > > > 
> > > > Up to now I have only tested with cpus, but I hope I can start with real gpu
> > > > jobs soon.
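
A rough sketch of the counting approach mentioned in the quoted message, assuming _CONDOR_AssignedGPUs holds a comma-separated list such as CUDA0,CUDA1. The helper name count_gpus and the sample values are hypothetical; in a real job the variable would be passed in from the environment rather than a literal.

```shell
# Count the entries of a comma-separated GPU list (format is an assumption).
count_gpus() {
    if [ -z "$1" ]; then
        # Empty or unset list: no GPUs assigned.
        echo 0
        return
    fi
    # Put one entry on each line, then count the non-empty lines.
    echo "$1" | tr ',' '\n' | grep -c .
}

# Sample value for illustration; in a job, pass "$_CONDOR_AssignedGPUs".
ngpus=$(count_gpus "CUDA0,CUDA1")
echo "ngpus=$ngpus"
```

Reading RequestGpus through condor_chirp, as in the later message, avoids the parsing altogether.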
> > 
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
> > a subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
> 

