Re: [Condor-users] Jobs aren't running on a specific machine






On Thursday, April 21, 2011 at 12:25 PM, swarna baggani wrote:

Hi Everyone,

I have 6 Windows machines, each dual core (so each machine has slot1 and slot2), and some R jobs that I want to run on a specific slot of a specific machine. For example, I have three different jobs J1, J2 and J3 that I want to run on slot1@M1, slot1@M2 and slot1@M3 respectively. When I submit them to Condor the jobs don't run properly (they run for a few seconds and then exit).

Here is my submit file
-------------------------------------------------------------------
universe = vanilla
Executable = C:\Progra~1\R\R-2.11.1-x64\bin\Rscript.exe
getenv = true

Output = sim_boot_omega_3_1_3.out
Log = sim_boot_omega_3_1_3.log
error = sim_boot_omega_3_1_3.error

Rank = Machine == "slot1@xxxxxxxxxxxxxxxxxxxx"

input = sim_boot_omega_3_1_3.R
arguments = sim_boot_omega_3_1_3.R
queue 

----------------------------------------------------------------------

I have tried using Requirements instead of Rank, but the jobs still don't run. Can anyone tell me what the problem might be?

Sounds like you have two problems. The first is that Rank doesn't restrict a job to a particular match; it only expresses a preference when a job has multiple matches to pick from. If you really only want to run in slot 1, on any machine, you want:

requirements = SlotID == 1

That'll restrict the job to just slot 1 on *any* machine. If you only want slot 1 on a specific machine or two you want:

requirements = SlotID == 1 && (Machine == "machineA" || Machine == "machineB")

Now your job will only run on slot 1 on machineA or machineB.

So that solves your steering problem.
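
For the three-job case in your question, a submit file along these lines might do it. Treat it as a sketch only: M1, M2 and M3, and the J1.R/J2.R/J3.R file names, are placeholders I made up from your description. As far as I know the Machine attribute holds just the host name (typically fully qualified) with no "slot1@" prefix, so compare against whatever condor_status reports for your machines.
-------------------------------------------------------------------
universe   = vanilla
Executable = C:\Progra~1\R\R-2.11.1-x64\bin\Rscript.exe
getenv     = true

# J1: slot 1 on M1 (placeholder machine name)
Output       = J1.out
Log          = J1.log
error        = J1.error
input        = J1.R
arguments    = J1.R
requirements = SlotID == 1 && Machine == "M1"
queue

# J2: slot 1 on M2 (placeholder machine name)
Output       = J2.out
Log          = J2.log
error        = J2.error
input        = J2.R
arguments    = J2.R
requirements = SlotID == 1 && Machine == "M2"
queue

# J3: slot 1 on M3 (placeholder machine name)
Output       = J3.out
Log          = J3.log
error        = J3.error
input        = J3.R
arguments    = J3.R
requirements = SlotID == 1 && Machine == "M3"
queue
-------------------------------------------------------------------
Each queue statement picks up the settings in effect at that point, so changing requirements (and the file names) before each queue gives every job its own target slot.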

Your second problem appears to be related to executing the job. Maybe that's cleared up by using just slot 1 on a specific machine, but if it isn't, you'll need to start reading logs and output to figure out why the jobs aren't running properly. The first place to look is the stdout and stderr captures for your jobs (the Output and error files from your submit file). There might be useful information in there. If those are empty, try checking the StarterLog.slot1 log on the machine where the job tried to run, and the ShadowLog file on the scheduler.

If you want more help debugging why the job isn't running, post some of the stderr/stdout from a failing job, or the StarterLog.slot1 section for the failed run attempt.

Regards,
- Ian

-- 
Ian Chesal
ichesal@xxxxxxxxxxxxxxxxxx
http://www.cyclecomputing.com/