
Re: [Condor-users] Jobs aren't running on a specific machine



Hi,

I ran the jobs using the Requirements ClassAd expression, but the jobs still don't run. There is nothing in the output and error files in the job folder. Here are the StartLog and StarterLog entries for the job I submitted to Condor. I couldn't understand these logs.

Here is my submit file
-----------------------------------------------------
# Submit file for combining the output

universe = vanilla
Executable = C:\Progra~1\R\R-2.11.1-x64\bin\Rscript.exe
getenv = true

Output = sim_boot_omega_3_1_3.out
Log = sim_boot_omega_3_1_3.log
error = sim_boot_omega_3_1_3.error

Requirements = SlotID == 1 && Machine == "omegaws3.sci.odu.edu"

input = sim_boot_omega_3_1_3.R
arguments = sim_boot_omega_3_1_3.R
queue 

----------------------------------------------------- 
I also tried using only SlotID == 1, but even that didn't work.

Start Log
-----------------------------------------------------
04/21 16:20:03 slot1: match_info called
04/21 16:20:03 slot1: Received match <192.168.0.103:49160>#1303284283#32#...
04/21 16:20:03 slot1: State change: match notification protocol successful
04/21 16:20:03 slot1: Changing state: Unclaimed -> Matched
04/21 16:20:03 slot1: Request accepted.
04/21 16:20:03 slot1: Remote owner is OmegaAdmin@xxxxxxxxxxxxxxxx
04/21 16:20:03 slot1: State change: claiming protocol successful
04/21 16:20:03 slot1: Changing state: Matched -> Claimed
04/21 16:20:04 slot1: Got activate_claim request from shadow (<192.168.0.103:50577>)
04/21 16:20:04 slot1: Remote job ID is 471.0
04/21 16:20:04 slot1: Got universe "VANILLA" (5) from request classad
04/21 16:20:04 slot1: State change: claim-activation protocol successful
04/21 16:20:04 slot1: Changing activity: Idle -> Busy
04/21 16:20:06 slot1: Called deactivate_claim_forcibly()
04/21 16:20:07 Starter pid 280 exited with status 0
04/21 16:20:07 slot1: State change: starter exited
04/21 16:20:07 slot1: Changing activity: Busy -> Idle
04/21 16:20:09 slot1: State change: received RELEASE_CLAIM command
04/21 16:20:09 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
04/21 16:20:09 slot1: State change: No preempting claim, returning to owner
04/21 16:20:09 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/21 16:20:09 slot1: State change: IS_OWNER is false
04/21 16:20:09 slot1: Changing state: Owner -> Unclaimed
----------------------------------------------------------------------------------------------------------------

Starter Log for Slot 1
----------------------------------------------------------------------------------------------------------------
04/21 16:20:04 Locale: English_United States.1252
04/21 16:20:04 ******************************************************
04/21 16:20:04 ** condor_starter (CONDOR_STARTER) STARTING UP
04/21 16:20:04 ** C:\condor\bin\condor_starter.exe
04/21 16:20:04 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/21 16:20:04 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/21 16:20:04 ** $CondorVersion: 7.4.2 Mar 30 2010 BuildID: 227044 $
04/21 16:20:04 ** $CondorPlatform: INTEL-WINNT50 $
04/21 16:20:04 ** PID = 280
04/21 16:20:04 ** Log last touched 4/21 15:15:06
04/21 16:20:04 ******************************************************
04/21 16:20:04 Using config source: C:\condor\condor_config
04/21 16:20:04 Using local config sources: 
04/21 16:20:04    C:\condor\condor_config.local
04/21 16:20:04 DaemonCore: Command Socket at <192.168.0.103:50581>
04/21 16:20:05 GLEXEC_JOB not supported on this platform; ignoring
04/21 16:20:05 Setting resource limits not implemented!
04/21 16:20:05 Communicating with shadow <192.168.0.103:50573>
04/21 16:20:05 Submitting machine is "omegaws3.sci.odu.edu"
04/21 16:20:05 setting the orig job name in starter
04/21 16:20:05 setting the orig job iwd in starter
04/21 16:20:05 File transfer completed successfully.
04/21 16:20:06 Job 471.0 set to execute immediately
04/21 16:20:06 Starting a VANILLA universe job with ID: 471.0
04/21 16:20:06 Tracking process family by login "condor-reuse-slot1"
04/21 16:20:06 IWD: C:\condor\execute\dir_280
04/21 16:20:06 Input file: C:\condor\execute\dir_280\sim_boot_omega_3_1_3.R
04/21 16:20:06 Output file: C:\condor\execute\dir_280\sim_boot_omega_3_1_3.out
04/21 16:20:06 Error file: C:\condor\execute\dir_280\sim_boot_omega_3_1_3.error
04/21 16:20:06 Renice expr "10" evaluated to 10
04/21 16:20:06 About to exec C:\condor\execute\dir_280\condor_exec.exe sim_boot_omega_3_1_3.R
04/21 16:20:06 Create_Process succeeded, pid=2644
04/21 16:20:06 Process exited, pid=2644, status=10
04/21 16:20:06 Got SIGQUIT.  Performing fast shutdown.
04/21 16:20:06 ShutdownFast all jobs.
04/21 16:20:06 **** condor_starter (condor_STARTER) pid 280 EXITING WITH STATUS 0
----------------------------------------------------------------------------------------------------------------

Can anyone tell me how to solve this problem?

Thanks,

Shruti

From: Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Sent: Thu, 21 April, 2011 12:36:06 PM
Subject: Re: [Condor-users] Jobs aren't running on a specific machine




On Thursday, April 21, 2011 at 12:25 PM, swarna baggani wrote:

Hi Everyone,

I have 6 Windows machines that are dual core (each machine has slot1 and slot2), and I have some R jobs that I want to run on a specific slot of a specific machine. For example, suppose I have three different jobs J1, J2, J3 that I want to run on slot1@M1, slot1@M2 and slot1@M3 respectively. When I submit these to Condor, the jobs don't run (they run for a few seconds and exit).

Here is my submit file
-------------------------------------------------------------------
universe = vanilla
Executable = C:\Progra~1\R\R-2.11.1-x64\bin\Rscript.exe
getenv = true

Output = sim_boot_omega_3_1_3.out
Log = sim_boot_omega_3_1_3.log
error = sim_boot_omega_3_1_3.error

Rank = Machine == "slot1@xxxxxxxxxxxxxxxxxxxx"

input = sim_boot_omega_3_1_3.R
arguments = sim_boot_omega_3_1_3.R
queue 

----------------------------------------------------------------------

I have tried using Requirements instead of Rank, but the jobs still don't run. Can anyone tell me what might be the problem?

Sounds like you have two problems. The first is that Rank doesn't restrict a job to a particular match; it just prefers that match when a job has multiple matches to pick from. If you really only want to run in slot 1 on any machine, you want:

requirements = SlotID == 1

That'll restrict the job to just slot 1 on *any* machine. If you only want slot 1 on a specific machine or two you want:

requirements = SlotID == 1 && (Machine == "machineA" || Machine == "machineB")

Now your job will only run on slot 1 on machineA or machineB.

So that solves your steering problem.
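
If you have several jobs to steer (J1, J2, J3 each to a different machine), you could generate the requirements line rather than hand-editing it. This is only a rough sketch in Python: the helper name build_requirements is made up for illustration, and it just assembles the same expression shown above from a slot number and a list of machine names.

```python
def build_requirements(slot_id, machines):
    """Build a submit-file requirements expression (hypothetical helper).

    slot_id  -- the SlotID to restrict the job to
    machines -- list of machine names; an empty list means any machine
    """
    if not machines:
        return f"SlotID == {slot_id}"
    hosts = " || ".join(f'Machine == "{m}"' for m in machines)
    if len(machines) > 1:
        # Parenthesise the alternatives so && applies to the whole group.
        hosts = f"({hosts})"
    return f"SlotID == {slot_id} && {hosts}"


print("requirements = " + build_requirements(1, ["machineA", "machineB"]))
```

The printed line can be pasted into the submit file as-is.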

Your second problem appears to be related to executing the job. Maybe that's cleared up by using just slot 1 on a specific machine, but if it isn't, you'll need to start reading logs and output to figure out why the jobs aren't running properly. The first place to look is the stdout and stderr captures for your jobs. There might be useful information in there. If those are empty, try checking the StarterLog.slot1 log on the machine where the job tried to run, and the ShadowLog file on the scheduler.

If you want more help debugging why the job isn't running, post some of the stderr/stdout from a failing job or the StarterLog.slot1 section for the failed run attempt.
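
When reading StarterLog.slot1, the line to look for is the one that reports the exit status of the job's own process, which is separate from the starter's exit status. Here is a minimal Python sketch of that check. It assumes exit lines of the form "Process exited, pid=N, status=N", as they appear in the starter log posted in this thread; other Condor versions may word it differently. The function name find_job_exits is only illustrative.

```python
import re

# Matches StarterLog lines such as:
#   04/21 16:20:06 Process exited, pid=2644, status=10
# (format assumed from the Condor 7.4.2 log posted in this thread)
EXIT_LINE = re.compile(
    r"^(\d\d/\d\d \d\d:\d\d:\d\d) Process exited, pid=(\d+), status=(-?\d+)"
)


def find_job_exits(log_text):
    """Return (timestamp, pid, status) for each job-process exit in the log text."""
    exits = []
    for line in log_text.splitlines():
        m = EXIT_LINE.match(line.strip())
        if m:
            exits.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return exits


sample = """04/21 16:20:06 Create_Process succeeded, pid=2644
04/21 16:20:06 Process exited, pid=2644, status=10
04/21 16:20:06 **** condor_starter (condor_STARTER) pid 280 EXITING WITH STATUS 0"""

for when, pid, status in find_job_exits(sample):
    note = "ok" if status == 0 else "non-zero: the job itself reported a failure"
    print(f"{when} pid={pid} status={status} ({note})")
```

A non-zero status there means Condor started the job fine and the program itself exited with an error, so the next thing to examine is what the executable needs in order to run on the execute machine.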

Regards,
- Ian

-- 
Ian Chesal
ichesal@xxxxxxxxxxxxxxxxxx