[Condor-users] Starter problem? Shadow problem? File transfer problem?



Hi,

I am having a difficult time debugging a problem.  I am sending jobs to
my condor pool.  The jobs start to run and then go into an idle state. 
Here are the errors I am seeing:

In the StartLog on the node the job runs on I see this:

12/27 10:12:26 DaemonCore: Command received via TCP from host
<131.225.207.15:47421>
12/27 10:12:26 DaemonCore: received command 444 (ACTIVATE_CLAIM),
calling handler (command_activate_claim)
12/27 10:12:26 vm1: Got activate_claim request from shadow
(<131.225.207.15:47421>)
12/27 10:12:26 vm1: Remote job ID is 680.0
12/27 10:12:26 vm1: Got universe "VANILLA" (5) from request classad
12/27 10:12:26 vm1: State change: claim-activation protocol successful
12/27 10:12:26 vm1: Changing activity: Idle -> Busy
12/27 10:12:26 Starter pid 24865 exited with status 4
12/27 10:12:26 vm1: State change: starter exited
12/27 10:12:26 vm1: Changing activity: Busy -> Idle
12/27 10:12:30 DaemonCore: Command received via TCP from host
<131.225.207.15:47430>
12/27 10:12:30 DaemonCore: received command 444 (ACTIVATE_CLAIM),
calling handler (command_activate_claim)
12/27 10:12:30 vm1: Got activate_claim request from shadow
(<131.225.207.15:47430>)
12/27 10:12:30 vm1: Remote job ID is 680.0
12/27 10:12:30 vm1: Got universe "VANILLA" (5) from request classad
12/27 10:12:30 vm1: State change: claim-activation protocol successful
12/27 10:12:30 vm1: Changing activity: Idle -> Busy
12/27 10:12:30 Starter pid 24871 exited with status 4
12/27 10:12:31 vm1: State change: starter exited
12/27 10:12:31 vm1: Changing activity: Busy -> Idle
12/27 10:12:31 DaemonCore: Command received via UDP from host
<131.225.207.15:40753>
12/27 10:12:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
12/27 10:12:31 vm1: State change: received RELEASE_CLAIM command
12/27 10:12:31 vm1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
12/27 10:12:31 vm1: State change: No preempting claim, returning to
owner
12/27 10:12:31 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
12/27 10:12:31 vm1: State change: IS_OWNER is false
12/27 10:12:31 vm1: Changing state: Owner -> Unclaimed
12/27 10:12:31 DaemonCore: Command received via UDP from host
<131.225.207.15:40753>
12/27 10:12:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
12/27 10:12:31 Error: can't find resource with capability
(<131.225.207.232:32776>#6113182116)

In the StarterLog.vm1 on the node the job is running on I see this:
12/27 10:12:30 ******************************************************
12/27 10:12:30 ** condor_starter (CONDOR_STARTER) STARTING UP
12/27 10:12:30 ** /opt/condor/sbin/condor_starter
12/27 10:12:30 ** $CondorVersion: 6.6.6 Jul 26 2004 $
12/27 10:12:30 ** $CondorPlatform: I386-LINUX_RH72 $
12/27 10:12:30 ** PID = 24871
12/27 10:12:30 ******************************************************
12/27 10:12:30 Using config file: /etc/condor/condor_config
12/27 10:12:30 Using local config files:
/opt/condor/local.cmswn103/condor_config.local
12/27 10:12:30 DaemonCore: Command Socket at <131.225.207.232:39991>
12/27 10:12:30 Done setting resource limits
12/27 10:12:30 Starter communicating with condor_shadow
<131.225.207.15:47428>
12/27 10:12:30 Submitting machine is "cmssrv10.fnal.gov"
12/27 10:12:30 File transfer completed successfully.
12/27 10:12:30 Starting a VANILLA universe job with ID: 680.0
12/27 10:12:30 IWD: /opt/condor/local.cmswn103/execute/dir_24871
12/27 10:12:30 Output file:
/opt/condor/local.cmswn103/execute/dir_24871/_condor_stdout_680.0
12/27 10:12:30 Error file:
/opt/condor/local.cmswn103/execute/dir_24871/_condor_stderr_680.0
12/27 10:12:30 About to exec
/opt/condor/local.cmswn103/execute/dir_24871/condor_exec.exe -l
12/27 10:12:30 Create_Process succeeded, pid=24873
12/27 10:12:30 Process exited, pid=24873, status=0
12/27 10:12:30 ReliSock: put_file: Failed to open file
/opt/condor/local.cmswn103/execute/dir_24871/output_kludge, errno = 2.
12/27 10:12:30 ERROR "DoUpload: Failed to send file
/opt/condor/local.cmswn103/execute/dir_24871/output_kludge, exiting at
1386
" at line 1385 in file file_transfer.C
12/27 10:12:30 ShutdownFast all jobs.


Web searches suggest checking that the permissions on
/opt/condor/local.cmswn103/execute are correct.  They are: the directory
is currently writable by everyone.
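
As a side check on the permissions theory, the errno value in the
put_file line can be decoded.  This is a minimal Python sketch, assuming
the standard Linux errno numbering (the log comes from an
I386-LINUX_RH72 build).  Nothing in it is Condor-specific, and
describe_errno is just a helper name made up for the illustration.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 2 is the value reported by "ReliSock: put_file" in StarterLog.vm1.
    print(describe_errno(2))
    # EACCES is what a permissions failure would normally report.
    print(describe_errno(errno.EACCES))
```

On Linux, 2 maps to ENOENT (no such file or directory), whereas a
permissions failure would normally show up as EACCES.  So the log seems
to point at output_kludge not existing rather than at the permissions on
the execute directory.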

I don't know why this job does not complete.  I have done everything I
can think to do.  Does anyone have any ideas?

Here is the jdf that I am submitting:

universe = vanilla
should_transfer_files = YES
transfer_output_files = output_kludge
when_to_transfer_output = ON_EXIT_OR_EVICT
output = /storage/remote/data1/sam/jim/ot.$(cluster).$(process)
error = /storage/remote/data1/sam/jim/err.$(cluster).$(process)
log = log.$(cluster).$(process)
executable = ./binary/ls
arguments = -l
# JIM_LOCAL_JID needs to be just the cluster value
environment = SAM_STATION=cms-grid;JIM_LOCAL_JID=$(cluster);JIM_JOB_NUMBER=
+sam_project = "test"
+grid_jid = "test"
+IsCommitted = True
+IsCommittedJob = True
queue 1
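
Since the submit file lists output_kludge in transfer_output_files, one
thing worth checking is whether the job actually leaves that file in its
scratch directory when it exits.  This is a minimal Python sketch of
that check; missing_outputs is a hypothetical helper, not part of
Condor, and it assumes the value is a plain comma-separated list of file
names.

```python
import os


def missing_outputs(scratch_dir, transfer_output_files):
    """Return the names from a comma-separated transfer_output_files
    value that do not exist under scratch_dir."""
    names = [n.strip() for n in transfer_output_files.split(",") if n.strip()]
    return [n for n in names
            if not os.path.exists(os.path.join(scratch_dir, n))]


if __name__ == "__main__":
    import tempfile

    # Stand-in for the job's scratch directory (dir_24871 in the log);
    # an empty temporary directory is used here for illustration only.
    with tempfile.TemporaryDirectory() as scratch:
        print(missing_outputs(scratch, "output_kludge"))
```

The executable in this test is ls with the argument -l, which only
writes to stdout, so unless something else creates output_kludge in the
scratch directory, a check like this would presumably report it as
missing.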

Any ideas at all?

Thanks,

Joe