
Re: [HTCondor-users] parallel universe fails with PREEMPTION_REQUIREMENTS == False



Hi Jason,

sorry for the delay; I had to rule out some other things :(

Back to my problem, yes everything looks OK, I can list the dedicated nodes:

[chbeyer@bird-htc-sched01]~% condor_status -constraint DedicatedScheduler=!=Null  
Name                  OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@xxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 11838  1+01:02:10
slot1@xxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 11838  2+21:40:19
<snip>
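
For good measure, one can also cross-check that the job and the startds agree on the dedicated scheduler name (855240 is the cluster from the logs below; the attribute names are taken from the log and status output here):

    # attribute the dedicated scheduler stamps into the job
    condor_q 855240.0 -af Scheduler
    # attribute the startds advertise
    condor_status -af:h Name DedicatedScheduler -constraint 'DedicatedScheduler =!= undefined'

Both should print the same "DedicatedScheduler@..." string.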


The scheduler does its job, as far as I can see:

03/09/18 11:16:00 Found idle MPI cluster 855240
03/09/18 11:16:00 Inserting new attribute Scheduler into non-active cluster cid=855240 acid=-1
03/09/18 11:16:00 Dedicated job: 855240.0 chbeyer
03/09/18 11:16:00 Trying to find 3 resource(s) for dedicated job 855240.0
03/09/18 11:16:00 satisfyJobs: finding resources for 855240.0
03/09/18 11:16:00 satisfyJobs:     855240.0 satisfied with slot slot1@xxxxxxxxxxxxxxx
03/09/18 11:16:00 satisfyJobs: finding resources for 855240.0
03/09/18 11:16:00 satisfyJobs:     855240.0 satisfied with slot slot1@xxxxxxxxxxxxxxx
03/09/18 11:16:00 satisfyJobs: finding resources for 855240.0
03/09/18 11:16:00 satisfyJobs:     855240.0 satisfied with slot slot1@xxxxxxxxxxxxxxx
03/09/18 11:16:00 Satisfied job 855240 with 3 unclaimed resources
03/09/18 11:16:00 Generating 3 resource requests for job 855240
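
(Jason's question below, whether the schedd actually claims the resources, can be checked roughly like this; slots the dedicated scheduler holds should show up as Claimed while the job waits:)

    condor_status -af:h Name State Activity -constraint 'DedicatedScheduler =!= undefined'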


But the negotiator just rejects the parallel job: 

03/08/18 13:28:54     Request 855240.00000: autocluster -1 (request count 1 of 0)
03/08/18 13:28:54 matchmakingAlgorithm: limit 3.000000 used 0.000000 pieLeft 3.000000
03/08/18 13:28:54       Rejected 855240.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <131.169.56.32:9618?addrs=131.169.56.32-9618&noUDP&sock=2854700_55af_3>: no match found
03/08/18 13:28:54     Request 855240.00000: autocluster -1 (request count 1 of 0)
03/08/18 13:28:54 matchmakingAlgorithm: limit 3.000000 used 0.000000 pieLeft 3.000000
03/08/18 13:28:54       Rejected 855240.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <131.169.56.32:9618?addrs=131.169.56.32-9618&noUDP&sock=2854700_55af_3>: no match found
03/08/18 13:28:54     Request 855240.00000: autocluster -1 (request count 1 of 0)
03/08/18 13:28:54 matchmakingAlgorithm: limit 3.000000 used 0.000000 pieLeft 3.000000
03/08/18 13:28:54       Rejected 855240.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <131.169.56.32:9618?addrs=131.169.56.32-9618&noUDP&sock=2854700_55af_3>: no match found
03/08/18 13:28:54     Sending SEND_RESOURCE_REQUEST_LIST/20/eom
03/08/18 13:28:54     Getting reply from schedd ...
03/08/18 13:28:54     Got NO_MORE_JOBS;  schedd has no more requests
03/08/18 13:28:54   This submitter hit its submitterLimit.
03/08/18 13:28:54  resources used scheddUsed= 0.000000
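
Given the subject line, the preemption knobs on the negotiator host are presumably worth dumping, e.g.:

    condor_config_val -negotiator PREEMPTION_REQUIREMENTS
    condor_config_val -negotiator NEGOTIATOR_CONSIDER_PREEMPTION

and, if those look sane, match logging can be turned up to see why "no match found" is reported:

    # negotiator config; assumes a writable local config dir
    NEGOTIATOR_DEBUG = D_MATCH D_FULLDEBUG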

There is nothing wrong with the jobs though, according to '-better-analyze':

[chbeyer@bird-htc-sched01]~% condor_q -better-analyze 855240.0 -reverse -machine bird621.desy.de


-- Schedd: bird-htc-sched01.desy.de : <131.169.56.32:9618?...

-- Slot: slot1@xxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    (START) && (IsValidCheckpointPlatform) &&
            (WithinResourceLimits)

  START is
    (NODE_IS_HEALTHY is true) &&
            (StartJobs is true)

  IsValidCheckpointPlatform is
    (TARGET.JobUniverse isnt 1 ||
            ((MY.CheckpointPlatform isnt undefined) &&
                ((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) ||
                    (TARGET.NumCkpts == 0))))

  WithinResourceLimits is
    (ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
        TARGET._condor_RequestCpus <= MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
          TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
      ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 &&
        TARGET._condor_RequestMemory <= MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0 &&
          TARGET.RequestMemory <= MY.Memory,false)) &&
      ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
        TARGET._condor_RequestDisk <= MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
          TARGET.RequestDisk <= MY.Disk,false)))

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 3.10.0-693.17.1.el7.x86_64 normal N/A ssse3 sse4_1 sse4_2"
    Cpus = 8
    Disk = 80198152
    Memory = 11838
    NODE_IS_HEALTHY = true
    StartJobs = true

Job 855240.0 has the following attributes:

    TARGET.JobUniverse = 11
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3072000
    TARGET.RequestMemory = 1536

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[3]           1  IsValidCheckpointPlatform
[5]           1  WithinResourceLimits

slot1@xxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.
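
Since the negotiator also logs "This submitter hit its submitterLimit", it might be worth a glance at the fair-share picture for the dedicated scheduler's submitter, e.g.:

    condor_userprio -allusers -all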

I most likely missed something stupid (?) ;) 

Best
Christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Jason Patton" <jpatton@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, March 5, 2018 21:28:28
Subject: Re: [HTCondor-users] parallel universe fails with PREEMPTION_REQUIREMENTS == False

Christoph,

Is the pool busy at the time? Do I understand your first email correctly
that the schedd does claim the resources the job needs?

Jason

On Mon, Mar 5, 2018 at 8:18 AM, Beyer, Christoph
<christoph.beyer@xxxxxxx> wrote:
> Hi,
>
> hmm, no idea, found one on the net, I think it's this one:
>
> https://github.com/htcondor/htcondor/blob/master/build/packaging/srpm/condor_config.local.dedicated.resource
>
> [root@bird621 /etc/condor/config.d]# grep -v \# 100dedicated_ressource_wn.conf
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"
> SUSPEND = False
> CONTINUE        = True
> PREEMPT = False
> KILL            = False
> WANT_SUSPEND    = False
> WANT_VACATE     = False
> RANK            = Scheduler =?= $(DedicatedScheduler)
> MPI_CONDOR_RSH_PATH = $(LIBEXEC)
> CONDOR_SSHD = /usr/sbin/sshd
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
> STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
> START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True)
>
> Best
> Christoph
>
>
> --
> Christoph Beyer
> DESY Hamburg
> IT-Department
>
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
>
> phone:+49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
>
> ----- Original Message -----
> From: "Jason Patton" <jpatton@xxxxxxxxxxx>
> To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
> Sent: Monday, March 5, 2018 15:11:32
> Subject: Re: [HTCondor-users] parallel universe fails with PREEMPTION_REQUIREMENTS == False
>
> Christoph,
>
> Are you using one of the pre-defined cases from the
> condor_config.local.dedicated.resource example config? If so, which
> one?
>
> Jason Patton
>
> On Fri, Mar 2, 2018 at 8:37 AM, Beyer, Christoph
> <christoph.beyer@xxxxxxx> wrote:
>> Hi everybody,
>>
>> I guess I need a hint :(
>>
>> I am trying to run the parallel universe following the example in the documentation, and everything looks quite OK to me: the schedd knows about the dedicated resources and gathers the slots it needs.
>>
>> The negotiator, on the other hand, is not happy and rejects the parallel job:
>>
>> Rejected 855232.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx <131.169.56.32:9618?addrs=131.169.56.32-9618&noUDP&sock=2854700_55af_3>: PREEMPTION_REQUIREMENTS == False
>>
>> I used the example config file for the parallel universe on the worker nodes and do not see any other obvious errors/problems, so I think the overall setup is OK. Maybe someone can point me in the right direction as to what the reject message means?
>>
>> Best
>> Christoph
>>
>> --
>> Christoph Beyer
>> DESY Hamburg
>> IT-Department
>>
>> Notkestr. 85
>> Building 02b, Room 009
>> 22607 Hamburg
>>
>> phone:+49-(0)40-8998-2317
>> mail: christoph.beyer@xxxxxxx

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/