
Re: [HTCondor-users] dynamic slots with gpus



The first thing I recommend is to run condor_q -analyze with the -machine option, giving the name of the machine with 2 gpus that's not running your jobs. That way, when the analysis says 1 machine is able to run your job, you know it means that machine.

If condor_q -analyze says that the 2-gpu machine is able to run your gpu jobs, the next thing to check is whether HTCondor is trying and failing to run your gpu jobs on the 2-gpu machine. There are a few things you can check for this. To start, submit a gpu job with a Requirements expression that limits it to run only on the 2-gpu machine. For example, you can add this to the submit file:

requirements = Machine == "name-of-2gpu-machine"

Then, after a few minutes, run "condor_q <job id> -af NumShadowStarts"
If this prints a number greater than 0, then HTCondor is trying to run the job on the machine.

Look in the job event log (set with "log" in the submit file). Any events other than "Job submitted" will show HTCondor's attempts to run the job, and may indicate what's going wrong.

You can also see if any jobs at all can run on the 2-gpu machine. Try submitting a simple job that just runs /bin/date. Make sure to set Requirements so this test job runs only on that machine.
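
As a rough sketch of such a test job, a submit file along these lines should work. The file names (date.out, date.err, date.log) and the small resource requests are placeholders I've assumed, not values from your setup, and the machine name must be replaced with the name condor_status shows for the 2-gpu machine:

```
# Minimal test job: checks whether anything at all can run on the 2-gpu machine.
# It requests no gpus on purpose, so gpu matching is taken out of the picture.
executable     = /bin/date

# Placeholder: use the machine name reported by condor_status
requirements   = Machine == "name-of-2gpu-machine"

# Assumed small requests so the partitionable slot can easily satisfy them
request_cpus   = 1
request_memory = 100 MB
request_disk   = 100 MB

# Assumed file names; the event log shows each attempt to start the job
output         = date.out
error          = date.err
log            = date.log

queue
```

If this job runs but the gpu jobs pinned to the same machine stay idle, that points at the gpu request or the slot's gpu resources rather than at the machine as a whole.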

 - Jaime

> On Oct 6, 2023, at 12:07 PM, Justin Killebrew via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Hello.
> 
> I've enabled dynamic slots on 2 machines:
> 
> use feature : GPUs
> GPU_DISCOVERY_EXTRA = -extra
> # dynamic slot config
> cpu = 24
> # 24 * 662
> memory = 15888
> disk = BIG
> NUM_SLOTS = 1
> NUM_SLOTS_TYPE = 1
> SLOT_TYPE_1 = 100%
> SLOT_TYPE_1_PARTITIONABLE = TRUE
> 
> 
> 1 machine has 1 gpu, the other has 2, and condor_status -long bench5 shows correct gpu info.
> 
> I submit with:
> 
> [...]
> request_cpus = 1
> request_memory = 800 MB
> request_disk = 1 GB
> request_gpus = 1
> should_transfer_files = yes
> when_to_transfer_output = ON_EXIT
> transfer_input_files =  enable_gpus.py, blender-3.5-splash.blend
> queue 10
> 
> 
> output of condor_q -better-analyze 111.002:
> 
> The Requirements expression for job 111.002 is
> 
>    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
>    (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)
> 
> Job 111.002 defines the following attributes:
> 
>    RequestDisk = 1048576
>    RequestGPUs = 1
>    RequestMemory = 800
> 
> The Requirements expression for job 111.002 reduces to these conditions:
> 
>         Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           6  TARGET.Arch == "X86_64"
> [1]           6  TARGET.OpSys == "LINUX"
> [3]           6  TARGET.Disk >= RequestDisk
> [5]           6  TARGET.Memory >= RequestMemory
> [7]           2  TARGET.GPUs >= RequestGPUs
> 
> 
> 111.002:  Job is running.
> 
> Last successful match: Fri Oct  6 12:19:16 2023
> 
> 111.002:  Run analysis summary ignoring user priority.  Of 6 machines,
>      4 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      1 match and are already running your jobs
>      0 match but are serving other users
>      1 are able to run your job
> 
> 
> Only 1 machine (with 1 gpu) matches (and runs) all the jobs but I expected the machine with 2 gpus to be split into 2 partitions and run 2 jobs, 1 gpu each.  
> 
> Is there additional configuration for the 2 gpu machine?  Why doesn't it at least run 1 job?
> 
> I tried request_gpus >= 1 in the submit file but that's a syntax error.
> 
> 
> Thanks,
> JK