
Re: [HTCondor-users] htcondor + gpudirect + openmpi



Harald,

> I have to use the old condor_ssh version from HTCondor 8.4, which uses
> hostnames, not proc numbers (indeed, I just changed these parts back).
> If I do not use hostnames, it can happen that when a
> request_cpus=1/request_gpus=1 job lands several times on one machine, there is
> an sshd running and mpirun starts all jobs on that machine and completely
> ignores all the others.
> Therefore I think we need hosts in the machine file, because mpirun cannot
> handle proc numbers.
>
> Why was it changed? Any other pitfalls?

What I found was that if a job landed several times on the same
machine, and if the same hostname was then listed consecutively in the
machine file, mpirun would only generate a single SSH command (with
the combined number of CPUs) for that hostname and, therefore, only
run under one condor slot.

What you're describing actually sounds a lot like the same problem.
The idea with the proc numbers is that they should be unique per slot
so that mpirun is tricked into thinking that every hostname is unique
and generates an SSH command for every slot. (Then condor_ssh
translates the proc number to the actual hostname and issues the
actual SSH command.) So to be clear, this is not happening correctly
for you? If you comment out the rm of the contact and machine files
and let them transfer back, what do they look like? (You can send them
off-list if you don't want to expose any hostnames.)
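
To illustrate the idea, here is a rough sketch. It is not the actual
condor_ssh or openmpiscript code: the function names, the "procN" alias
format, and the two-column contact file layout are all assumptions made
for this example. Each slot gets its own alias in the machine file, and
a lookup maps the alias back to the real host.

```shell
# Hypothetical helpers; the real scripts may use a different file layout.

# write_files CONTACT MACHINES host1 host2 ...
# Writes one "alias hostname" line per slot to CONTACT and one unique
# alias per slot to MACHINES, even when the same hostname repeats.
write_files() {
    local contact=$1 machines=$2
    shift 2
    : > "$contact"
    : > "$machines"
    local i=0 host
    for host in "$@"; do
        echo "proc$i $host" >> "$contact"
        echo "proc$i" >> "$machines"
        i=$((i + 1))
    done
}

# resolve_alias CONTACT ALIAS
# Prints the real hostname for ALIAS, as an ssh wrapper would do
# before issuing the actual ssh command.
resolve_alias() {
    awk -v a="$2" '$1 == a { print $2; exit }' "$1"
}
```

With two slots on the same machine, the machine file then holds two
different entries, so a launcher that merges identical hostnames has
nothing to merge.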

Jason

> Best
> Harald
>
>
> On Monday 11 September 2017 23:10:09 Harald van Pee wrote:
>> On Monday 11 September 2017 22:10:18 Michael Pelletier wrote:
>> > I've been using the job ad file for non-dynamic values. For dynamic stuff
>> > you could use condor_chirp get_job_attr.
>> >
>> >      condor_q -jobads $_CONDOR_JOB_AD -autoformat RequestCpus
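
As a possible alternative to calling condor_q, the job ad file can also
be read directly. This is only a sketch: it assumes the file holds one
"Name = value" line per attribute, and the function name job_ad_attr is
made up for this example. The match ignores case because ClassAd
attribute names are case-insensitive.

```shell
# job_ad_attr FILE ATTR
# Prints the raw value of ATTR from a job ad file, assuming one
# "Name = value" line per attribute. String values keep their quotes.
job_ad_attr() {
    awk -v a="$2" '
        tolower($1) == tolower(a) && $2 == "=" {
            sub(/^[^=]*= */, "")   # strip "Name = " and keep the value
            print
            exit
        }' "$1"
}

# Possible use inside a job script (assumes the attribute is present):
#   ngpus=$(job_ad_attr "$_CONDOR_JOB_AD" RequestGpus)
```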
>>
>> Thanks!
>>
>> > -Michael Pelletier.
>> >
>> > > -----Original Message-----
>> > > From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
>> > > Behalf Of Harald van Pee
>> > > Sent: Monday, September 11, 2017 3:45 PM
>> > > To: htcondor-users@xxxxxxxxxxx
>> > > Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi
>> > >
>> > > Hi Jason,
>> > >
>> > > I think I have done something wrong, or it's just working since I
>> > > installed Mellanox OFED 4.1.
>> > >
>> > > Up to now I have tested it only with openmpi-2.0.2a1, but at least for
>> > > this version it works if I loop over the requested GPUs to
>> > > get more lines in the machine file, and for the -n argument I have
>> > > to multiply $_CONDOR_NPROCS by the number of requested GPUs.
>> > >
>> > > How do I get the number of requested GPUs in the script?
>> > > At the moment I would parse
>> > > _CONDOR_AssignedGPUs and count them.
>> > >
>> > > OMP_NUM_THREADS can be used to get the request_cpus value, but in
>> > > general this is not the number of MPI processes per node.
>> > >
>> > > Up to now I just test the cpus but I hope I can start with real gpu
>> > > jobs soon.
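
The counting approach described above could look roughly like the
following. This is a sketch only: it assumes _CONDOR_AssignedGPUs holds
a comma-separated list, and the function names count_list_items and
total_ranks are invented for this example.

```shell
# count_list_items LIST
# Prints the number of comma-separated items in LIST (0 if empty).
count_list_items() {
    local list=$1
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    local IFS=,
    # Split on commas only; the item count is the argument count.
    set -- $list
    echo $#
}

# total_ranks NPROCS GPUS_PER_SLOT
# Prints the product, i.e. the value to pass to mpirun -n.
total_ranks() {
    echo $(( $1 * $2 ))
}

# Possible use inside the job script:
#   ngpus=$(count_list_items "$_CONDOR_AssignedGPUs")
#   mpirun -n "$(total_ranks "$_CONDOR_NPROCS" "$ngpus")" ...
```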
>> >
>> > _______________________________________________
>> > HTCondor-users mailing list
>> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
>> > a subject: Unsubscribe
>> > You can also unsubscribe by visiting
>> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> >
>> > The archives can be found at:
>> > https://lists.cs.wisc.edu/archive/htcondor-users/