
Re: [HTCondor-users] Unable to run a standard universe job.



Collin,

The 'condor_q -better -analyze 183.0 -reverse -machine
bane.hq.ierustech.com' yielded the following:

condor_q -better -analyze 185.0 -reverse -machine bane.hq.ierustech.com
-- Schedd: banzai.hq.ierustech.com : <192.168.6.67:9618?...
-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is

    ( START ) && ( IsValidCheckpointPlatform ) &&
            ( WithinResourceLimits )

  START is
    ( ( ( ( ( LoadAvg - CondorLoadAvg ) <= 1.000000000000000E+00 ) ||
                              ( State != "Unclaimed" && State != "Owner" ) ) &&
                        ( IsDesktop isnt true ||
          ( KeyboardIdle > 15 * 60 ) ) ) ) &&
            ( ( TARGET.RequiresWholeMachine isnt true &&
                        MY.CAN_RUN_WHOLE_MACHINE == false &&
                        eval(strcat("Slot",2,"_State")) isnt "Claimed" ) ||
                  ( TARGET.RequiresWholeMachine is true &&
                MY.CAN_RUN_WHOLE_MACHINE ) )

  IsValidCheckpointPlatform is
    ( TARGET.JobUniverse isnt 1 ||
            ( ( MY.CheckpointPlatform isnt undefined ) &&
                ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
                    ( TARGET.NumCkpts == 0 ) ) ) )

  WithinResourceLimits is
    ( ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
        TARGET._condor_RequestCpus <= MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
          TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
      ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 &&
        TARGET._condor_RequestMemory <= MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0 &&
          TARGET.RequestMemory <= MY.Memory,false)) &&
      ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
        TARGET._condor_RequestDisk <= MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
          TARGET.RequestDisk <= MY.Disk,false)) &&
      ( TARGET.RequestGPUs is undefined ||
        MY.GPUs >= ifThenElse(TARGET._condor_RequestGPUs is undefined,TARGET.RequestGPUs,TARGET._condor_RequestGPUs) ) )

This slot defines the following attributes:

    CAN_RUN_WHOLE_MACHINE = SlotID == 2
    CheckpointPlatform = "LINUX X86_64 3.10.0-957.12.1.el7.x86_64 normal 0x2aaaaaaab000 avx ssse3 sse4_1 sse4_2"
    CondorLoadAvg = 0.0
    Cpus = 24
    Disk = 7208238
    GPUs = 0
    IsDesktop = true
    KeyboardIdle = 10095
    LoadAvg = 0.39
    Memory = 128889
    SlotID = 1
    State = "Unclaimed"

Job 185.0 has the following attributes:

    TARGET.JobUniverse = 1
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3750
    TARGET.RequestMemory = 4

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[10]          1  TARGET.RequiresWholeMachine isnt true
[20]          1  IsValidCheckpointPlatform
[22]          1  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.

-- Slot: slot2@xxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    ( START ) &&
        ( IsValidCheckpointPlatform )

  START is
    ( ( ( ( ( LoadAvg - CondorLoadAvg ) <= 1.000000000000000E+00 ) ||
                    ( State != "Unclaimed" && State != "Owner" ) ) &&
                ( IsDesktop isnt true || ( KeyboardIdle > 15 * 60 ) ) ) ) &&
        ( ( TARGET.RequiresWholeMachine isnt true &&
                MY.CAN_RUN_WHOLE_MACHINE == false &&
                eval(strcat("Slot",2,"_State")) isnt "Claimed" ) ||
            ( TARGET.RequiresWholeMachine is true &&
        MY.CAN_RUN_WHOLE_MACHINE ) )

  IsValidCheckpointPlatform is
    ( TARGET.JobUniverse isnt 1 ||
      ( ( MY.CheckpointPlatform isnt undefined ) &&
        ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
          ( TARGET.NumCkpts == 0 ) ) ) )

This slot defines the following attributes:

    CAN_RUN_WHOLE_MACHINE = SlotID == 2
    CheckpointPlatform = "LINUX X86_64 3.10.0-957.12.1.el7.x86_64 normal 0x2aaaaaaab000 avx ssse3 sse4_1 sse4_2"
    CondorLoadAvg = 0.0
    IsDesktop = true
    KeyboardIdle = 10087
    LoadAvg = 1.0
    SlotID = 2
    State = "Owner"

Job 185.0 has the following attributes:

    TARGET.JobUniverse = 1
    TARGET.NumCkpts = 0

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[15]          0  TARGET.RequiresWholeMachine is true

slot2@xxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.


My read on this is that slot1_x should have allowed this job to run.
Slot 1 is partitionable and Slot 2 is a whole-machine slot. They
are exclusive and cannot run at the same time. I forgot to add that all
of our condor nodes run either Red Hat EL 7.6 or the CentOS equivalent.
HTCondor binaries are pulled from the HTCondor Stable RPM Repository
for Red Hat Enterprise Linux 7.
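
To double-check that reading, here is a rough Python sketch of the two
slot Requirements expressions, fed with the attribute values from the
output above. It is only an approximation of ClassAd evaluation: None
stands in for "undefined", the _condor_Request* overrides are ignored,
and Slot2_State is assumed to be "Owner" (taken from slot2's State,
since slot1's ad does not show it). The function and variable names are
mine, not HTCondor's.

```python
# Rough model of the slot Requirements; not real ClassAd evaluation.
# None stands in for the ClassAd value "undefined".

def start(slot, job, slot2_state):
    load_ok = ((slot["LoadAvg"] - slot["CondorLoadAvg"]) <= 1.0
               or slot["State"] not in ("Unclaimed", "Owner"))
    desktop_ok = slot["IsDesktop"] is not True or slot["KeyboardIdle"] > 15 * 60
    wants_whole = job.get("RequiresWholeMachine") is True
    can_run_whole = slot["SlotID"] == 2  # CAN_RUN_WHOLE_MACHINE = SlotID == 2
    shared = not wants_whole and not can_run_whole and slot2_state != "Claimed"
    exclusive = wants_whole and can_run_whole
    return load_ok and desktop_ok and (shared or exclusive)

def valid_checkpoint_platform(slot, job):
    if job["JobUniverse"] != 1:  # only standard universe jobs are checked
        return True
    platform = slot.get("CheckpointPlatform")
    if platform is None:
        return False
    return job.get("LastCheckpointPlatform") == platform or job.get("NumCkpts") == 0

def within_resource_limits(slot, job):
    cpus_ok = slot["Cpus"] > 0 and job.get("RequestCpus", 1) <= slot["Cpus"]
    mem = job.get("RequestMemory")
    mem_ok = mem is not None and slot["Memory"] > 0 and mem <= slot["Memory"]
    disk = job.get("RequestDisk")
    disk_ok = disk is not None and slot["Disk"] > 0 and disk <= slot["Disk"]
    gpus = job.get("RequestGPUs")
    gpus_ok = gpus is None or slot["GPUs"] >= gpus
    return cpus_ok and mem_ok and disk_ok and gpus_ok

platform = ("LINUX X86_64 3.10.0-957.12.1.el7.x86_64 normal "
            "0x2aaaaaaab000 avx ssse3 sse4_1 sse4_2")
slot1 = {"SlotID": 1, "State": "Unclaimed", "LoadAvg": 0.39, "CondorLoadAvg": 0.0,
         "IsDesktop": True, "KeyboardIdle": 10095, "CheckpointPlatform": platform,
         "Cpus": 24, "Memory": 128889, "Disk": 7208238, "GPUs": 0}
slot2 = {"SlotID": 2, "State": "Owner", "LoadAvg": 1.0, "CondorLoadAvg": 0.0,
         "IsDesktop": True, "KeyboardIdle": 10087, "CheckpointPlatform": platform}
job = {"JobUniverse": 1, "NumCkpts": 0,
       "RequestCpus": 1, "RequestDisk": 3750, "RequestMemory": 4}

slot2_state = "Owner"  # assumption: slot1 sees Slot2_State equal to slot2's State

# slot1's Requirements include WithinResourceLimits; slot2's do not.
slot1_ok = (start(slot1, job, slot2_state)
            and valid_checkpoint_platform(slot1, job)
            and within_resource_limits(slot1, job))
slot2_ok = start(slot2, job, slot2_state) and valid_checkpoint_platform(slot2, job)

print("slot1 accepts job:", slot1_ok)
print("slot2 accepts job:", slot2_ok)
```

Under those assumptions slot1 accepts the job and slot2 rejects it,
which agrees with the two run analysis summaries above. slot2 fails
only on the whole-machine clause of START, because the job does not set
RequiresWholeMachine.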

Thanks,


Michael McInerny Murphy
IERUS Technologies, Inc.
2904 Westcorp Blvd., Suite 210
Huntsville, AL  35805
(O): (256) 319-2026 ext 107

-----Original Message-----
From: Collin Mehring <collin.mehring@xxxxxxxxxxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Unable to run a standard universe job.
Date: Fri, 14 Jun 2019 10:23:38 -0700

Hi Michael,

From the analyze output it seems like that machine is rejecting your
job. I would either check the START expression on that machine directly
(1) or do a reverse analyze with condor_q (2) to find out why.

1: condor_config_val -name bane.hq.ierustech.com -v START
2: condor_q 183.0 --better-analyze -reverse -machine bane.hq.ierustech.com

Best,
Collin

On Fri, Jun 14, 2019 at 6:43 AM Michael Murphy <
Michael.Murphy@xxxxxxxxxxxxx> wrote:
> Greetings,
> 
> I am trying to run a standard job in our condor pool. However, I
> cannot get a test job to execute. The matchmaker is not finding a
> match even though my requirement only specifies a hostname. I have
> never run a standard job in our pool before. I am not sure it's
> configured properly. Here's my submit script:
> 
> universe = standard
> executable = ./Cicero_CC_12750
> should_transfer_files = YES
> Requirements = machine == "bane.hq.ierustech.com"
> when_to_transfer_output = ON_EXIT_OR_EVICT
> log = $(Cluster).log
> 
> input = test_run.inp
> output = test_run.out
> error = test_run.err
> transfer_input_files = test_run.inp
> queue
> 
> The executable is compiled FORTRAN code relinked with
> condor_compile. 
> 
> When I check the status and try to determine why it's not matched to
> the execute host I use 'condor_q -analyze -better <JOB ID>' with the
> following output:
> 
> [michael.murphy@banzai Condor_checkpoint_test]$ condor_q -better
> -analyze 183.0 
> -- Schedd: banzai.hq.ierustech.com : <192.168.6.67:9618?...
> The Requirements expression for job 183.000 is
> 
>     ( machine == "bane.hq.ierustech.com" ) && ( TARGET.Arch ==
> "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( ( CkptArch ==
> TARGET.Arch ) || ( CkptArch is undefined ) ) && ( ( CkptOpSys ==
> TARGET.OpSys ) ||
>       ( CkptOpSys is undefined ) ) && ( TARGET.Disk >= RequestDisk )
> && ( TARGET.Memory >= RequestMemory )
> 
> Job 183.000 defines the following attributes:
> 
>     DiskUsage = 3750
>     ImageSize = 3500
>     RequestDisk = DiskUsage
>     RequestMemory = ifthenelse(MemoryUsage =!=
> undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
> 
> The Requirements expression for job 183.000 reduces to these
> conditions:
> 
>          Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           2  machine == "bane.hq.ierustech.com"
> [6]         560  CkptArch is undefined
> [10]        560  CkptOpSys is undefined
> 
> No successful match recorded.
> Last failed match: Fri Jun 14 08:24:48 2019
> 
> Reason for last match failure: no match found 
> 
> 183.000:  Run analysis summary ignoring user priority.  Of 560
> machines,
>     544 are rejected by your job's requirements 
>       2 reject your job because of their own requirements 
>      14 are exhausted partitionable slots 
>       0 match and are already running your jobs 
>       0 match but are serving other users 
>       0 are available to run your job
> 
> WARNING:  Be advised:
>    Job did not match any machines's constraints
>    To see why, pick a machine that you think should match and add
>      -reverse -machine <name>
>    to your query.
> 
> The submitting machine's name is "banzai.hq.ierustech.com" and the
> execution machine is called "bane.hq.ierustech.com".
> 
> Have I forgotten to specify some macros to enable std universe jobs?
> Thanks for your time.
> 
>  -- 
> Michael McInerny Murphy
> IERUS Technologies, Inc.
> 2904 Westcorp Blvd., Suite 210
> Huntsville, AL  35805
> (O): (256) 319-2026 ext 107
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

