
[condor-users] RE: why jobs are always evicted?



Hello,

First, thank you for your last advice.

>>To lay out the facts for you, this is the
>>last mail I sent you:
>>I have a 4-node Linux cluster running Condor. I have
>>tried, unsuccessfully, to run jobs on the remote
>>nodes, but they were evicted on these nodes, and
>>finally all the executions took place locally on the
>>submitting machine.
>>I don't understand why the jobs cannot be executed
>>on the remote machines.

So I changed some parameters, such as the universe
from standard to vanilla, and I think things are
clearer now.
My jobs are no longer always evicted, but they are not
executed remotely either! From time to time my job log
files report a Shadow exception!

I think the shadow forcibly closed the connection
from the starter, and there seems to be a problem
creating something on the executing machines.

I also get this message in the StarterLog of the
executing machine: couldn't create dir
/home00/condor/execute/dir_7805: Permission denied

I checked, then gave the "condor" user the same UID
and GID on all machines, and granted full permissions
on its home directories, but it made no difference.
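
To double-check the permissions, something like the following could be run on each executing machine, as whichever user the starter runs under. It is only a sketch: the path /home00/condor/execute is taken from the StarterLog message above, and the helper names describe_dir and can_create_subdir are made up for illustration. It prints the owner, group and mode of the directory, and whether the current user could create a dir_<pid> entry inside it.

```python
import os
import stat


def describe_dir(path):
    """Return owner UID, group GID and permission bits of a directory."""
    info = os.stat(path)
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        "mode": oct(stat.S_IMODE(info.st_mode)),
    }


def can_create_subdir(path):
    """True if the current user may create entries inside path."""
    # Creating dir_<pid> needs write and search (execute) permission
    # on the parent directory.
    return os.access(path, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Assumption: execute directory as it appears in the log message.
    execute_dir = "/home00/condor/execute"
    if os.path.isdir(execute_dir):
        print(describe_dir(execute_dir), can_create_subdir(execute_dir))
    else:
        print(execute_dir, "does not exist on this machine")
```

Note that os.access answers for the user running the script, so the result is only meaningful when it is run as the same user as the starter.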


Below are the most relevant parts of my log files.
 I submitted my jobs from node3.
 node1 is one of my executing machines in this step.
 
My submit file:
 universe = vanilla
 Executable     = /home/condor/test
 initialdir = /home/condor
 transfer_executable = TRUE
 should_transfer_files = YES
 transfer_files = ALWAYS
 when_to_transfer_output = ON_EXIT
 Output         = out$(process)
 Log            = log$(process)
  
 Queue 8  

Here are the log files of node3 (submitting
machine) and of node1 (executing machine).



ScheddLog of node1:

10/30 14:36:20
******************************************************
10/30 14:36:20 ** condor_schedd (CONDOR_SCHEDD)
STARTING UP
10/30 14:36:20 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:36:20 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:36:20 ** PID = 14292
10/30 14:36:20
******************************************************
10/30 14:36:20 DaemonCore: Command Socket at
<130.98.172.55:33140>


ScheddLog of node3:

10/30 14:36:28
******************************************************
10/30 14:36:28 ** condor_schedd (CONDOR_SCHEDD)
STARTING UP
10/30 14:36:28 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:36:28 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:36:28 ** PID = 3238
10/30 14:36:28
******************************************************
10/30 14:36:28 DaemonCore: Command Socket at
<130.98.172.57:33381>
10/30 14:36:39 DaemonCore: Command received via UDP
from host 
<130.98.172.57:32785>
10/30 14:36:39 DaemonCore: received command 421
(RESCHEDULE), calling 
handler (reschedule_negotiator)
10/30 14:36:39 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:36:39 Called reschedule_negotiator()
10/30 14:36:51 DaemonCore: Command received via TCP
from host 
<130.98.172.56:34157>
10/30 14:36:51 DaemonCore: received command 416
(NEGOTIATE), calling 
handler (negotiate)
10/30 14:36:51 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:36:51 Checking consistency running and
runnable jobs
10/30 14:36:51 Tables are consistent
10/30 14:36:51 Out of servers - 3 jobs matched, 5 jobs
idle, 1 jobs 
rejected
10/30 14:36:53 Started shadow for job 113.0 on
"<130.98.172.55:33139>", 
(shadow pid = 3245)
10/30 14:36:54 ERROR: Shadow exited with job exception
code!
10/30 14:36:55 Started shadow for job 113.1 on
"<130.98.172.56:34148>", 
(shadow pid = 3246)
10/30 14:36:56 ERROR: Shadow exited with job exception
code!
10/30 14:36:57 Started shadow for job 113.2 on
"<130.98.172.57:33380>", 
(shadow pid = 3247)
10/30 14:36:59 Started shadow for job 113.0 on
"<130.98.172.55:33139>", 
(shadow pid = 3250)
10/30 14:37:00 ERROR: Shadow exited with job exception
code!
10/30 14:37:02 Started shadow for job 113.1 on
"<130.98.172.56:34148>", 
(shadow pid = 3258)
10/30 14:37:03 ERROR: Shadow exited with job exception
code!
10/30 14:37:04 Started shadow for job 113.3 on
"<130.98.172.57:33380>", 
(shadow pid = 3259)
10/30 14:37:07 Started shadow for job 113.0 on
"<130.98.172.55:33139>", 
(shadow pid = 3262)
10/30 14:37:08 ERROR: Shadow exited with job exception
code!
10/30 14:37:09 Started shadow for job 113.1 on
"<130.98.172.56:34148>", 
(shadow pid = 3270)
10/30 14:37:10 ERROR: Shadow exited with job exception
code!
10/30 14:37:11 Started shadow for job 113.4 on
"<130.98.172.57:33380>", 
(shadow pid = 3271)
10/30 14:37:11 Activity on stashed negotiator socket
10/30 14:37:11 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:37:11 Checking consistency running and
runnable jobs
10/30 14:37:11 Tables are consistent
10/30 14:37:11 Out of servers - 0 jobs matched, 3 jobs
idle, 1 jobs 
rejected
10/30 14:37:13 Started shadow for job 113.0 on
"<130.98.172.55:33139>", 
(shadow pid = 3274)
10/30 14:37:14 ERROR: Shadow exited with job exception
code!
10/30 14:37:15 Started shadow for job 113.1 on
"<130.98.172.56:34148>", 
(shadow pid = 3282)
10/30 14:37:16 ERROR: Shadow exited with job exception
code!
10/30 14:37:17 Started shadow for job 113.5 on
"<130.98.172.57:33380>", 
(shadow pid = 3283)
10/30 14:37:19 Started shadow for job 113.0 on
"<130.98.172.55:33139>", 
(shadow pid = 3286)
10/30 14:37:20 ERROR: Shadow exited with job exception
code!
10/30 14:37:20 Match for cluster 113 has had 5 shadow
exceptions, 
relinquishing.
10/30 14:37:20 Sent RELEASE_CLAIM to startd on
<130.98.172.55:33139>
10/30 14:37:20 Match record (<130.98.172.55:33139>,
113, 0) deleted
10/30 14:37:20 DaemonCore: Command received via TCP
from host 
<130.98.172.55:33147>
10/30 14:37:20 DaemonCore: received command 443
(VACATE_SERVICE), 
calling handler (vacate_service)
10/30 14:37:20 Got VACATE_SERVICE from
<130.98.172.55:33147>
10/30 14:37:21 Started shadow for job 113.1 on
"<130.98.172.56:34148>", 
(shadow pid = 3294)
10/30 14:37:22 ERROR: Shadow exited with job exception
code!
10/30 14:37:22 Match for cluster 113 has had 5 shadow
exceptions, 
relinquishing.
10/30 14:37:22 Sent RELEASE_CLAIM to startd on
<130.98.172.56:34148>
10/30 14:37:22 Match record (<130.98.172.56:34148>,
113, 1) deleted
10/30 14:37:22 DaemonCore: Command received via TCP
from host 
<130.98.172.56:34165>
10/30 14:37:22 DaemonCore: received command 443
(VACATE_SERVICE), 
calling handler (vacate_service)
10/30 14:37:22 Got VACATE_SERVICE from
<130.98.172.56:34165>
10/30 14:37:23 Started shadow for job 113.6 on
"<130.98.172.57:33380>", 
(shadow pid = 3295)
10/30 14:37:23 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:37:25 Started shadow for job 113.0 on
"<130.98.172.57:33380>", (shadow pid = 3305)
10/30 14:37:28 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:37:28 Started shadow for job 113.1 on
"<130.98.172.57:33380>", (shadow pid = 3315)
10/30 14:37:30 Started shadow for job 113.7 on
"<130.98.172.57:33380>", 
(shadow pid = 3325)
10/30 14:37:31 Activity on stashed negotiator socket
10/30 14:37:31 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:37:31 Checking consistency running and
runnable jobs
10/30 14:37:31 Tables are consistent
10/30 14:37:31 Out of jobs - 0 jobs matched, 0 jobs
idle, flock level = 
0
10/30 14:37:33 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/30 14:37:33 match (<130.98.172.57:33380>#1800841445)
out of jobs (cluster id 113); relinquishing
10/30 14:37:33 Sent RELEASE_CLAIM to startd on
<130.98.172.57:33380>
10/30 14:37:33 Match record (<130.98.172.57:33380>,
113, -1) deleted
10/30 14:37:33 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33496>
10/30 14:37:33 DaemonCore: received command 443
(VACATE_SERVICE), 
calling handler (vacate_service)
10/30 14:37:33 Got VACATE_SERVICE from
<130.98.172.57:33496>
10/30 14:42:33 Sent owner (0 jobs) ad to central
manager
[condor@node3 condor]$ 
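
A log like the one above is easier to read once the shadow starts and the shadow exceptions are counted per execute host. The following is a rough sketch of such a tally, not anything Condor provides: the function name tally_shadow_events is made up, and blaming each exception on the most recently started shadow is only a heuristic that happens to fit this log, where each failing shadow exits about one second after it starts.

```python
import re
from collections import Counter

# Either a shadow start (captures the execute host IP) or a shadow exception.
EVENT = re.compile(
    r'Started shadow for job (?P<job>\d+\.\d+) on "<(?P<host>[\d.]+):\d+>"'
    r"|(?P<exc>ERROR: Shadow exited with job exception)"
)


def tally_shadow_events(log_text):
    """Count shadow starts and shadow exceptions per execute host."""
    # The mail client wrapped the log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    starts, exceptions = Counter(), Counter()
    last_host = None
    for m in EVENT.finditer(flat):
        if m.group("host"):
            last_host = m.group("host")
            starts[last_host] += 1
        elif last_host is not None:
            # Heuristic (assumption): blame the most recently started shadow.
            exceptions[last_host] += 1
    return starts, exceptions
```

Calling tally_shadow_events(open("SchedLog").read()) would return two counters keyed by IP address; the file name here is only an example.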



StarterLog of node1:

10/30 14:36:54
******************************************************
10/30 14:36:54 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:36:54 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:36:54 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:36:54 ** PID = 14303
10/30 14:36:54
******************************************************
10/30 14:36:54 DaemonCore: Command Socket at
<130.98.172.55:33142>
10/30 14:36:54 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:36:54 Done setting resource limits
10/30 14:36:54 couldn't create dir
/home00/condor/execute/dir_14303: 
Permission denied
10/30 14:36:54 Unable to start job.
10/30 14:36:54 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
1
10/30 14:37:00
******************************************************
10/30 14:37:00 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:00 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:00 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:00 ** PID = 14304
10/30 14:37:00
******************************************************
10/30 14:37:00 DaemonCore: Command Socket at
<130.98.172.55:33143>
10/30 14:37:00 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:00 Done setting resource limits
10/30 14:37:00 couldn't create dir
/home00/condor/execute/dir_14304: 
Permission denied
10/30 14:37:00 Unable to start job.
10/30 14:37:00 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
1
10/30 14:37:08
******************************************************
10/30 14:37:08 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:08 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:08 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:08 ** PID = 14309
10/30 14:37:08
******************************************************
10/30 14:37:08 DaemonCore: Command Socket at
<130.98.172.55:33144>
10/30 14:37:08 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:08 Done setting resource limits
10/30 14:37:08 couldn't create dir
/home00/condor/execute/dir_14309: 
Permission denied
10/30 14:37:08 Unable to start job.
10/30 14:37:08 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
1
10/30 14:37:14
******************************************************
10/30 14:37:14 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:14 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:14 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:14 ** PID = 14310
10/30 14:37:14
******************************************************
10/30 14:37:14 DaemonCore: Command Socket at
<130.98.172.55:33145>
10/30 14:37:14 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:14 Done setting resource limits
10/30 14:37:14 couldn't create dir
/home00/condor/execute/dir_14310: 
Permission denied
10/30 14:37:14 Unable to start job.
10/30 14:37:14 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 1
10/30 14:37:20
******************************************************
10/30 14:37:20 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:20 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:20 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:20 ** PID = 14311
10/30 14:37:20
******************************************************
10/30 14:37:20 DaemonCore: Command Socket at
<130.98.172.55:33146>
10/30 14:37:20 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:20 Done setting resource limits
10/30 14:37:20 couldn't create dir
/home00/condor/execute/dir_14311: 
Permission denied
10/30 14:37:20 Unable to start job.
10/30 14:37:20 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
1
[condor@node1 condor]$ 
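
Every starter attempt in this log fails at the same step, which a small script can confirm by extracting the directories that could not be created. This is a minimal sketch under the assumption that the message format is exactly the one shown above; the function name failed_parent_dirs is invented for illustration.

```python
import posixpath
import re
from collections import Counter

# Matches the starter message once wrapped lines have been rejoined.
CREATE_FAILED = re.compile(r"couldn't create dir (\S+): Permission denied")


def failed_parent_dirs(log_text):
    """Count 'Permission denied' failures per parent directory."""
    # Collapse the mail client's line wrapping before matching.
    flat = " ".join(log_text.split())
    parents = Counter()
    for path in CREATE_FAILED.findall(flat):
        # dir_<pid> is created per job; the parent is what needs checking.
        parents[posixpath.dirname(path)] += 1
    return parents
```

If all failures share one parent directory, the ownership and mode of that directory on the executing machine are the first things to look at.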



StartLog of node1:

10/30 14:37:20 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33457>
10/30 14:37:20 DaemonCore: received command 444
(ACTIVATE_CLAIM), 
calling handler (command_activate_claim)
10/30 14:37:20 Got activate_claim request from shadow 
(<130.98.172.57:33457>)
10/30 14:37:20 Remote job ID is 113.0
10/30 14:37:20 Got universe (5) from request classad
10/30 14:37:20 Startd using *_VANILLA control
expressions.
10/30 14:37:20 State change: claim-activation protocol
successful
10/30 14:37:20 Changing activity: Idle -> Busy
10/30 14:37:20 Starter pid 14311 exited with status 1
10/30 14:37:20 State change: starter exited
10/30 14:37:20 Changing activity: Busy -> Idle
10/30 14:37:20 DaemonCore: Command received via UDP
from host 
<130.98.172.57:32786>
10/30 14:37:20 DaemonCore: received command 443
(RELEASE_CLAIM), 
calling handler (command_handler)
10/30 14:37:20 State change: received RELEASE_CLAIM
command
10/30 14:37:20 Changing state and activity:
Claimed/Idle -> Preempting/Vacating
10/30 14:37:20 State change: No preempting match,
returning to owner
10/30 14:37:20 Changing state and activity:
Preempting/Vacating -> Owner/Idle
10/30 14:37:20 State change: IS_OWNER is false
10/30 14:37:20 Changing state: Owner -> Unclaimed
10/30 14:37:20 DaemonCore: Command received via UDP
from host 
<130.98.172.57:32786>
10/30 14:37:20 DaemonCore: received command 443
(RELEASE_CLAIM), 
calling handler (command_handler)
10/30 14:37:20 Error: can't find resource with
capability 
(<130.98.172.55:33139>#2302711529)
[condor@node1 condor]$ 


StarterLog of node3:

[condor@node3 condor]$ cat log/StarterLog
Now in new log file /home00/condor/log/StarterLog
10/30 14:37:24 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:24 Done setting resource limits
10/30 14:37:24 File transfer completed successfully.
10/30 14:37:25 Starting a VANILLA universe job.
10/30 14:37:25 Output file:
/home00/condor/execute/dir_3296/out6
10/30 14:37:25 About to exec 
/home00/condor/execute/dir_3296/condor_exec.exe
10/30 14:37:25 Create_Process succeeded, pid=3298
10/30 14:37:25 Job exited, pid=3298, status=44
10/30 14:37:25 Got SIGQUIT.  Performing fast shutdown.
10/30 14:37:25 ShutdownFast all jobs.
10/30 14:37:25 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
0
10/30 14:37:26
******************************************************
10/30 14:37:26 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:26 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:26 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:26 ** PID = 3306
10/30 14:37:26
******************************************************
10/30 14:37:26 DaemonCore: Command Socket at
<130.98.172.57:33473>
10/30 14:37:26 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:26 Done setting resource limits
10/30 14:37:27 File transfer completed successfully.
10/30 14:37:28 Starting a VANILLA universe job.
10/30 14:37:28 Output file:
/home00/condor/execute/dir_3306/out0
10/30 14:37:28 About to exec 
/home00/condor/execute/dir_3306/condor_exec.exe
10/30 14:37:28 Create_Process succeeded, pid=3308
10/30 14:37:28 Job exited, pid=3308, status=44
10/30 14:37:28 Got SIGQUIT.  Performing fast shutdown.
10/30 14:37:28 ShutdownFast all jobs.
10/30 14:37:28 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
0
10/30 14:37:29
******************************************************
10/30 14:37:29 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:29 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:29 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:29 ** PID = 3316
10/30 14:37:29
******************************************************
10/30 14:37:29 DaemonCore: Command Socket at
<130.98.172.57:33482>
10/30 14:37:29 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:29 Done setting resource limits
10/30 14:37:29 File transfer completed successfully.
10/30 14:37:30 Starting a VANILLA universe job.
10/30 14:37:30 Output file:
/home00/condor/execute/dir_3316/out1
10/30 14:37:30 About to exec 
/home00/condor/execute/dir_3316/condor_exec.exe
10/30 14:37:30 Create_Process succeeded, pid=3318
10/30 14:37:30 Job exited, pid=3318, status=44
10/30 14:37:30 Got SIGQUIT.  Performing fast shutdown.
10/30 14:37:30 ShutdownFast all jobs.
10/30 14:37:30 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
0
10/30 14:37:31
******************************************************
10/30 14:37:31 ** condor_starter (CONDOR_STARTER)
STARTING UP
10/30 14:37:31 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:31 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:31 ** PID = 3326
10/30 14:37:31
******************************************************
10/30 14:37:31 DaemonCore: Command Socket at
<130.98.172.57:33491>
10/30 14:37:31 Submitting machine is
"node3.xtrem.der.edf.fr"
10/30 14:37:31 Done setting resource limits
10/30 14:37:31 File transfer completed successfully.
10/30 14:37:32 Starting a VANILLA universe job.
10/30 14:37:32 Output file:
/home00/condor/execute/dir_3326/out7
10/30 14:37:32 About to exec 
/home00/condor/execute/dir_3326/condor_exec.exe
10/30 14:37:32 Create_Process succeeded, pid=3328
10/30 14:37:33 Job exited, pid=3328, status=44
10/30 14:37:33 Got SIGQUIT.  Performing fast shutdown.
10/30 14:37:33 ShutdownFast all jobs.
10/30 14:37:33 **** condor_starter (condor_STARTER)
EXITING WITH STATUS 
0
[condor@node3 condor]$ 




ShadowLog of node3:


10/30 14:37:28
******************************************************
10/30 14:37:28 ** condor_shadow (CONDOR_SHADOW)
STARTING UP
10/30 14:37:28 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:28 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:28 ** PID = 3315
10/30 14:37:28
******************************************************
10/30 14:37:28 DaemonCore: Command Socket at
<130.98.172.57:33478>
10/30 14:37:29 Initializing a VANILLA shadow
10/30 14:37:29 (113.1) (3315): Request to run on
<130.98.172.57:33380> was ACCEPTED
10/30 14:37:30 (113.1) (3315): **** condor_shadow
(condor_SHADOW) 
EXITING WITH STATUS 100
10/30 14:37:30
******************************************************
10/30 14:37:30 ** condor_shadow (CONDOR_SHADOW)
STARTING UP
10/30 14:37:30 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/30 14:37:30 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/30 14:37:30 ** PID = 3325
10/30 14:37:30
******************************************************
10/30 14:37:30 DaemonCore: Command Socket at
<130.98.172.57:33487>
10/30 14:37:31 Initializing a VANILLA shadow
10/30 14:37:31 (113.7) (3325): Request to run on
<130.98.172.57:33380> was ACCEPTED
10/30 14:37:33 (113.7) (3325): **** condor_shadow
(condor_SHADOW) 
EXITING WITH STATUS 100
[condor@node3 condor]$ 




StartLog of node3:


10/30 14:37:28 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33476>
10/30 14:37:28 DaemonCore: received command 404 
(DEACTIVATE_CLAIM_FORCIBLY), calling handler
(command_handler)
10/30 14:37:28 Called deactivate_claim_forcibly()
10/30 14:37:28 Starter pid 3306 exited with status 0
10/30 14:37:28 State change: starter exited
10/30 14:37:28 Changing activity: Busy -> Idle
10/30 14:37:29 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33481>
10/30 14:37:29 DaemonCore: received command 444
(ACTIVATE_CLAIM), 
calling handler (command_activate_claim)
10/30 14:37:29 Got activate_claim request from shadow 
(<130.98.172.57:33481>)
10/30 14:37:29 Remote job ID is 113.1
10/30 14:37:29 Got universe (5) from request classad
10/30 14:37:29 Startd using *_VANILLA control
expressions.
10/30 14:37:29 State change: claim-activation protocol
successful
10/30 14:37:29 Changing activity: Idle -> Busy
10/30 14:37:30 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33485>
10/30 14:37:30 DaemonCore: received command 404 
(DEACTIVATE_CLAIM_FORCIBLY), calling handler
(command_handler)
10/30 14:37:30 Called deactivate_claim_forcibly()
10/30 14:37:30 Starter pid 3316 exited with status 0
10/30 14:37:30 State change: starter exited
10/30 14:37:30 Changing activity: Busy -> Idle
10/30 14:37:31 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33490>
10/30 14:37:31 DaemonCore: received command 444
(ACTIVATE_CLAIM), 
calling handler (command_activate_claim)
10/30 14:37:31 Got activate_claim request from shadow 
(<130.98.172.57:33490>)
10/30 14:37:31 Remote job ID is 113.7
10/30 14:37:31 Got universe (5) from request classad
10/30 14:37:31 Startd using *_VANILLA control
expressions.
10/30 14:37:31 State change: claim-activation protocol
successful
10/30 14:37:31 Changing activity: Idle -> Busy
10/30 14:37:33 DaemonCore: Command received via TCP
from host 
<130.98.172.57:33494>
10/30 14:37:33 DaemonCore: received command 404 
(DEACTIVATE_CLAIM_FORCIBLY), calling handler
(command_handler)
10/30 14:37:33 Called deactivate_claim_forcibly()
10/30 14:37:33 Starter pid 3326 exited with status 0
10/30 14:37:33 State change: starter exited
10/30 14:37:33 Changing activity: Busy -> Idle
10/30 14:37:33 DaemonCore: Command received via UDP
from host 
<130.98.172.57:32786>
10/30 14:37:33 DaemonCore: received command 443
(RELEASE_CLAIM), 
calling handler (command_handler)
10/30 14:37:33 State change: received RELEASE_CLAIM
command
10/30 14:37:33 Changing state and activity:
Claimed/Idle -> Preempting/Vacating
10/30 14:37:33 State change: No preempting match,
returning to owner
10/30 14:37:33 Changing state and activity:
Preempting/Vacating -> Owner/Idle
10/30 14:37:33 State change: IS_OWNER is false
10/30 14:37:33 Changing state: Owner -> Unclaimed
10/30 14:37:33 DaemonCore: Command received via UDP
from host 
<130.98.172.57:32786>
10/30 14:37:33 DaemonCore: received command 443
(RELEASE_CLAIM), 
calling handler (command_handler)
10/30 14:37:33 Error: can't find resource with
capability 
(<130.98.172.57:33380>#1800841445)
[condor@node3 condor]$ 



MatchLog of node2 (pool master):

10/30 14:36:51       Matched 113.0
condor@xxxxxxxxxxxxxxxxxxxxxx <130.98.172.57:33381>
preempting none <130.98.172.55:33139>
10/30 14:36:51       Matched 113.1
condor@xxxxxxxxxxxxxxxxxxxxxx <130.98.172.57:33381>
preempting none <130.98.172.56:34148>
10/30 14:36:51       Matched 113.2
condor@xxxxxxxxxxxxxxxxxxxxxx <130.98.172.57:33381>
preempting none <130.98.172.57:33380>
10/30 14:36:51       Rejected 113.3
condor@xxxxxxxxxxxxxxxxxxxxxx <130.98.172.57:33381>:
no match found
10/30 14:37:11       Rejected 113.5
condor@xxxxxxxxxxxxxxxxxxxxxx <130.98.172.57:33381>:
no match found
[condor@node2 condor]$ 




condor_status:


[root@node3 bin]# ./condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

node1.xtrem.d LINUX       INTEL  Unclaimed  Idle       0.000   248  0+00:02:41
node2.xtrem.d LINUX       INTEL  Unclaimed  Idle       0.000   248  0+00:02:28
node3.xtrem.d LINUX       INTEL  Unclaimed  Idle       0.000   248  0+00:02:35

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        3     0       0         3       0          0

               Total        3     0       0         3       0          0



condor_status -l:

[root@node3 bin]# ./condor_status -l
MyType = "Machine"
TargetType = "Job"
Name = "node1.xtrem.der.edf.fr"
Machine = "node1.xtrem.der.edf.fr"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMachineID = 1
ImageSize = 3511
ExecutableSize = 3511
JobUniverse = 5
NiceUser = FALSE
VirtualMemory = 521600
Disk = 764184
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 293
ConsoleIdle = 32350
Memory = 248
Cpus = 1
StartdIpAddr = "<130.98.172.55:33139>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "node1.xtrem.der.edf.fr"
FileSystemDomain = "node1.xtrem.der.edf.fr"
Subnet = "130.98.172"
HasIOProxy = TRUE
TotalVirtualMemory = 521600
TotalDisk = 764184
KFlops = 274667
Mips = 825
LastBenchmark = 1067520982
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 977
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1067527038
Activity = "Idle"
EnteredCurrentActivity = 1067527041
Start = TRUE
Requirements = START
CurrentRank = 0.000000
RemoteUser = "condor@xxxxxxxxxxxxxxxxxxxxxx"
RemoteOwner = "condor@xxxxxxxxxxxxxxxxxxxxxx"
ClientMachine = "node3.xtrem.der.edf.fr"
JobId = "115.0"
JobStart = 1067527041
LastPeriodicCheckpoint = 1067527041
AvailTime = 0.960000
LastAvailInterval = 32496
AvailSince = 1067520982
AvailTimeEstimate = 7270
UpdateSequenceNumber = 31
DaemonStartTime = 1067520980
LastHeardFrom = 1067527042

MyType = "Machine"
TargetType = "Job"
Name = "node2.xtrem.der.edf.fr"
Machine = "node2.xtrem.der.edf.fr"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMachineID = 1
ImageSize = 3511
ExecutableSize = 3511
JobUniverse = 5
NiceUser = FALSE
VirtualMemory = 521600
Disk = 859148
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 212
ConsoleIdle = 84753
Memory = 248
Cpus = 1
StartdIpAddr = "<130.98.172.56:34148>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "node2.xtrem.der.edf.fr"
FileSystemDomain = "node2.xtrem.der.edf.fr"
Subnet = "130.98.172"
HasIOProxy = TRUE
TotalVirtualMemory = 521600
TotalDisk = 859148
KFlops = 254321
Mips = 837
LastBenchmark = 1067520971
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 976
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1067527038
Activity = "Idle"
EnteredCurrentActivity = 1067527038
Start = TRUE
Requirements = START
CurrentRank = 0.000000
RemoteUser = "condor@xxxxxxxxxxxxxxxxxxxxxx"
RemoteOwner = "condor@xxxxxxxxxxxxxxxxxxxxxx"
ClientMachine = "node3.xtrem.der.edf.fr"
AvailTime = 1.000000
LastAvailInterval = 3450
AvailSince = 1067520971
AvailTimeEstimate = 138209024
UpdateSequenceNumber = 33
DaemonStartTime = 1067520969
LastHeardFrom = 1067527042

MyType = "Machine"
TargetType = "Job"
Name = "node3.xtrem.der.edf.fr"
Machine = "node3.xtrem.der.edf.fr"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMachineID = 1
ExecutableSize = 3511
JobUniverse = 5
NiceUser = FALSE
ImageSize = 4352
VirtualMemory = 521600
Disk = 915440
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 92
ConsoleIdle = 84775
Memory = 248
Cpus = 1
StartdIpAddr = "<130.98.172.57:33380>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "node3.xtrem.der.edf.fr"
FileSystemDomain = "node3.xtrem.der.edf.fr"
Subnet = "130.98.172"
HasIOProxy = TRUE
TotalVirtualMemory = 521600
TotalDisk = 915440
KFlops = 274667
Mips = 825
LastBenchmark = 1067520989
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 976
ClockDay = 4
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1067527038
Activity = "Idle"
EnteredCurrentActivity = 1067527038
Start = TRUE
Requirements = START
CurrentRank = 0.000000
RemoteUser = "condor@xxxxxxxxxxxxxxxxxxxxxx"
RemoteOwner = "condor@xxxxxxxxxxxxxxxxxxxxxx"
ClientMachine = "node3.xtrem.der.edf.fr"
AvailTime = 0.960000
LastAvailInterval = 32445
AvailSince = 1067520989
AvailTimeEstimate = 138224504
UpdateSequenceNumber = 37
DaemonStartTime = 1067520988
LastHeardFrom = 1067527042



[root@node3 bin]# ./condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

node1.xtrem.d LINUX       INTEL  Claimed    Idle       0.000   248  0+00:00:04
node2.xtrem.d LINUX       INTEL  Claimed    Idle       0.000   248  0+00:00:04
node3.xtrem.d LINUX       INTEL  Claimed    Idle       0.110   248  0+00:00:03

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        3     0       3         0       0          0

               Total        3     0       3         0       0          0



Any ideas would be appreciated. Thanks in advance.

Habib






