
[condor-users] why are jobs always evicted on remote machines?



Hello,

I have already sent three messages but unfortunately received no answer,
so let me summarize my problem once again:

I have a 4-node Linux cluster running Condor. I have tried, without
success, to run jobs on the remote nodes: the jobs are evicted from those
nodes, and in the end all executions are carried out locally on the
submitting machine. I don't understand why the jobs cannot be executed on
the remote machines.



********************************************************
My submit file has a simple structure, like this:
********************************************************
universe       = standard                         
Executable     = /home/condor/test                
initialdir     = /home/condor                         
                                   
transfer_executable = TRUE                        

should_transfer_files = YES
when_to_transfer_output = ON_EXIT                     

Output        = out.$(process)   
Log            = log.$(process)                       
Queue 15             
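
For completeness: the test binary is relinked for the standard universe
with condor_compile and then submitted in the usual way. The file names
below (test.c, test.sub) are only illustrative; the paths match my setup.

    condor_compile gcc -o /home/condor/test test.c   # relink for the standard universe
    condor_submit test.sub                           # submit the cluster of 15 jobs
    condor_q                                         # watch the queue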


*******************************************************
Here is the relevant part of the job log file log.1:
*******************************************************

[condor@node1 condor]$ cat log.1
000 (133.001.000) 10/28 14:17:17 Job submitted from
host: <130.98.172.55:58106>...
001 (133.001.000) 10/28 14:17:49 Job executing on
host: <130.98.172.56:37429>
...
004 (133.001.000) 10/28 14:17:50 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	224  -  Run Bytes Sent By Job
	3587556  -  Run Bytes Received By Job
...
001 (133.001.000) 10/28 14:18:07 Job executing on
host: <130.98.172.56:37429>
...
004 (133.001.000) 10/28 14:18:07 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	224  -  Run Bytes Sent By Job
	3587556  -  Run Bytes Received By Job
...
001 (133.001.000) 10/28 14:18:11 Job executing on
host: <130.98.172.55:58105>
...
005 (133.001.000) 10/28 14:18:42 Job terminated.
	(1) Normal termination (return value 13)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote
Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	893  -  Run Bytes Sent By Job
	3587999  -  Run Bytes Received By Job
	1341  -  Total Bytes Sent By Job
	10763111  -  Total Bytes Received By Job
...

*******************************************************
Concerning the job queue, I obtain the following when I run
condor_q -analyze:
*******************************************************
[root@node1 bin]# ./condor_q -analyze


-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
 ID      OWNER            SUBMITTED     RUN_TIME ST
PRI SIZE CMD               
---
133.002:  Request is being serviced

---
133.003:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      1 match, but are serving users with a better
priority in the pool
      2 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job
	Last successful match: Tue Oct 28 14:18:57 2003
---
133.005:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      1 match, but are serving users with a better
priority in the pool
      2 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job
	No successful match recorded.
	Last failed match: Tue Oct 28 14:18:57 2003
	Reason for last match failure: no match found
---
133.006:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      1 match, but are serving users with a better
priority in the pool
      2 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job
---
133.007:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      1 match, but are serving users with a better
priority in the pool
      2 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job

5 jobs; 4 idle, 1 running, 0 held
[root@node1 bin]# ./condor_q -analyze


-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
 ID      OWNER            SUBMITTED     RUN_TIME ST
PRI SIZE CMD               
---
133.002:  Request is being serviced

---
133.003:  Request is being serviced

---
133.005:  Request is being serviced

---
133.006:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      3 match, but are serving users with a better
priority in the pool
      0 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job
	No successful match recorded.
	Last failed match: Tue Oct 28 14:19:17 2003
	Reason for last match failure: no match found
---
133.007:  Run analysis summary.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own
requirements
      3 match, but are serving users with a better
priority in the pool
      0 match, but prefer another specific job despite
its worse user-priority
      0 match, but cannot currently preempt their
existing job
      0 are available to run your job

5 jobs; 2 idle, 3 running, 0 held
[root@node1 bin]# ./condor_q -analyze


-- Submitter: node1.xtrem.der.edf.fr :
<130.98.172.55:58106> : node1.xtrem.der.edf.fr
 ID      OWNER            SUBMITTED     RUN_TIME ST
PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
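
(Since the analysis says the machines are serving users with a better
priority, I also had a look at the pool's user priorities; the command
below is the standard one, its output is omitted here:)

    condor_userprio -all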


********************************************************
The relevant part of my SchedLog file:
********************************************************
10/28 14:17:17 DaemonCore: Command received via UDP
from host <130.98.172.55:33426>
10/28 14:17:17 DaemonCore: received command 421
(RESCHEDULE), calling handler (reschedule_negotiator)
10/28 14:17:17 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:17 Called reschedule_negotiator()
10/28 14:17:35 Activity on stashed negotiator socket
10/28 14:17:35 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:35 Checking consistency running and
runnable jobs
10/28 14:17:35 Tables are consistent
10/28 14:17:35 Out of servers - 3 jobs matched, 5 jobs
idle, 1 jobs rejected
10/28 14:17:38 Started shadow for job 133.0 on
"<130.98.172.55:58105>", (shadow pid = 25865)
10/28 14:17:40 Started shadow for job 133.1 on
"<130.98.172.56:37429>", (shadow pid = 25869)
10/28 14:17:43 Started shadow for job 133.2 on
"<130.98.172.57:45074>", (shadow pid = 25871)
10/28 14:17:43 Sent ad to central manager for
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:50 Sent RELEASE_CLAIM to startd on
<130.98.172.56:37429>
10/28 14:17:50 Match record (<130.98.172.56:37429>,
133, 1) deleted
10/28 14:17:51 DaemonCore: Command received via TCP
from host <130.98.172.56:46324>
10/28 14:17:51 DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
10/28 14:17:51 Got VACATE_SERVICE from
<130.98.172.56:46324>
10/28 14:17:51 Sent RELEASE_CLAIM to startd on
<130.98.172.57:45074>
10/28 14:17:51 Match record (<130.98.172.57:45074>,
133, 2) deleted
10/28 14:17:52 DaemonCore: Command received via TCP
from host <130.98.172.57:48867>
10/28 14:17:52 DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
10/28 14:17:52 Got VACATE_SERVICE from
<130.98.172.57:48867>
10/28 14:17:56 Activity on stashed negotiator socket
10/28 14:17:56 Negotiating for owner:
condor@xxxxxxxxxxxxxxxxxxxxxx
10/28 14:17:56 Checking consistency running and
runnable jobs
10/28 14:17:56 Tables are consistent
10/28 14:17:56 Out of servers - 2 jobs matched, 5 jobs
idle, 1 jobs rejected
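
(In case the startd policy on the execute machines is preempting the jobs,
the relevant expressions can be checked directly on those nodes; these are
standard configuration macros, listed here only as an example of what to
inspect:)

    condor_config_val START
    condor_config_val PREEMPT
    condor_config_val SUSPEND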


*******************************************************
The StarterLog file of node3, on which the jobs were evicted:
*******************************************************
[condor@node3 log]$ cat StarterLog
Now in new log file /home/condor/log/StarterLog
GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT
PERIODIC_CKPT  ]
10/28 14:19:31 	*FSM* Got asynchronous event "DIE"
10/28 14:19:31 	*FSM* Executing transition function
"req_die"
10/28 14:19:31 	*FSM* Transitioning to state
"TERMINATE"
10/28 14:19:31 	*FSM* Executing state func
"terminate_all()" [  ]
10/28 14:19:31 	*FSM* Transitioning to state
"SEND_STATUS_ALL"
10/28 14:19:31 	*FSM* Executing state func
"dispose_all()" [  ]
10/28 14:19:31 	*FSM* Reached state "END"
10/28 14:19:31 ********* STARTER terminating normally
**********
10/28 14:19:43 ********** STARTER starting up
***********
10/28 14:19:43 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/28 14:19:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/28 14:19:43
******************************************
10/28 14:19:43 Submitting machine is
"node1.xtrem.der.edf.fr"
10/28 14:19:43 EventHandler {
10/28 14:19:43 	func = 0x80706d0
10/28 14:19:43 	mask = SIGALRM SIGHUP SIGINT SIGUSR1
SIGUSR2 SIGCHLD SIGTSTP 
10/28 14:19:43 }
10/28 14:19:43 Done setting resource limits
10/28 14:19:43 	*FSM* Transitioning to state
"GET_PROC"
10/28 14:19:43 	*FSM* Executing state func
"get_proc()" [  ]
10/28 14:19:43 Entering get_proc()
10/28 14:19:43 Entering get_job_info()
10/28 14:19:43 Startup Info:
10/28 14:19:43 	Version Number: 1
10/28 14:19:43 	Id: 133.5
10/28 14:19:43 	JobClass: STANDARD
10/28 14:19:43 	Uid: 504
10/28 14:19:43 	Gid: 505
10/28 14:19:43 	VirtPid: -1
10/28 14:19:43 	SoftKillSignal: 20
10/28 14:19:43 	Cmd: "/home/condor/test"
10/28 14:19:43 	Args: ""
10/28 14:19:43 	Env: ""
10/28 14:19:43 	Iwd: "/home/condor"
10/28 14:19:43 	Ckpt Wanted: TRUE
10/28 14:19:43 	Is Restart: FALSE
10/28 14:19:43 	Core Limit Valid: TRUE
10/28 14:19:43 	Coredump Limit 0
10/28 14:19:43 User uid set to 99
10/28 14:19:43 User uid set to 99

10/28 14:19:43 User Process 133.5 {
10/28 14:19:43   cmd = /home/condor/test
10/28 14:19:43   args = 
10/28 14:19:43   env = 
10/28 14:19:43   local_dir = dir_13235
10/28 14:19:43   cur_ckpt =
dir_13235/condor_exec.133.5
10/28 14:19:43   core_name = dir_13235/core
10/28 14:19:43   uid = 99, gid = 99
10/28 14:19:43   v_pid = -1
10/28 14:19:43   pid = (NOT CURRENTLY EXECUTING)
10/28 14:19:43   exit_status_valid = FALSE
10/28 14:19:43   exit_status = (NEVER BEEN EXECUTED)
10/28 14:19:43   ckpt_wanted = TRUE
10/28 14:19:43   coredump_limit_exists = TRUE
10/28 14:19:43   coredump_limit = 0
10/28 14:19:43   soft_kill_sig = 20
10/28 14:19:43   job_class = STANDARD
10/28 14:19:43   state = NEW
10/28 14:19:43   new_ckpt_created = FALSE
10/28 14:19:43   ckpt_transferred = FALSE
10/28 14:19:43   core_created = FALSE
10/28 14:19:43   core_transferred = FALSE
10/28 14:19:43   exit_requested = FALSE
10/28 14:19:43   image_size = -1 blocks
10/28 14:19:43   user_time = 0
10/28 14:19:43   sys_time = 0
10/28 14:19:43   guaranteed_user_time = 0
10/28 14:19:43   guaranteed_sys_time = 0
10/28 14:19:43 }
10/28 14:19:43 	*FSM* Transitioning to state
"GET_EXEC"
10/28 14:19:43 	*FSM* Executing state func
"get_exec()" [ SUSPEND VACATE DIE  ]
10/28 14:19:43 Entering get_exec()
10/28 14:19:43 Executable is located on submitting
host
10/28 14:19:43 Expanded executable name is
"/home/condor/spool/cluster133.ickpt.subproc0"
10/28 14:19:43 Going to try 3 attempts at getting the
inital executable
10/28 14:19:43 Entering get_file(
/home/condor/spool/cluster133.ickpt.subproc0,
dir_13235/condor_exec.133.5, 0755 )
10/28 14:19:44 Opened
"/home/condor/spool/cluster133.ickpt.subproc0" via
file stream
10/28 14:19:49 Get_file() transferred 3587233 bytes,
587500 bytes/second
10/28 14:19:49 Fetched orig ckpt file
"/home/condor/spool/cluster133.ickpt.subproc0" into
"dir_13235/condor_exec.133.5" with 1 attempt
10/28 14:19:50 Executable
'dir_13235/condor_exec.133.5' is linked with
"$CondorVersion: 6.4.7 Jan 26 2003 $" on a
"$CondorPlatform: INTEL-LINUX-GLIBC22 $"
10/28 14:19:50 	*FSM* Executing transition function
"spawn_all"
10/28 14:19:50 Pipe built
10/28 14:19:50 New pipe_fds[14,1]
10/28 14:19:50 cmd_fd = 14
10/28 14:19:50 Calling execve(
"/home/condor/execute/dir_13235/condor_exec.133.5",
"condor_exec.133.5", "-_condor_cmd_fd", "14", 0,
"CONDOR_VM=vm1",
"CONDOR_SCRATCH_DIR=/home/condor/execute/dir_13235", 0
)
10/28 14:19:50 Started user job - PID = 13236
10/28 14:19:50 cmd_fp = 0x82b2d30
10/28 14:19:50 end
10/28 14:19:50 	*FSM* Transitioning to state
"SUPERVISE"
10/28 14:19:50 	*FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
DIE CHILD_EXIT PERIODIC_CKPT  ]
10/28 14:19:50 	*FSM* Got asynchronous event
"CHILD_EXIT"
10/28 14:19:50 	*FSM* Executing transition function
"reaper"
10/28 14:19:50 Process 13236 exited with status 129
10/28 14:19:50 EXEC of user process failed, probably
insufficient swap
10/28 14:19:50 	*FSM* Transitioning to state
"PROC_EXIT"
10/28 14:19:50 	*FSM* Executing state func
"proc_exit()" [ DIE  ]
10/28 14:19:50 	*FSM* Executing transition function
"dispose_one"
10/28 14:19:50 Sending final status for process 133.5
10/28 14:19:50 STATUS encoded as CKPT, *NOT*
TRANSFERRED
10/28 14:19:50 User time = 0.000000 seconds
10/28 14:19:50 System time = 0.000000 seconds
10/28 14:19:50 Unlinked "dir_13235/condor_exec.133.5"
10/28 14:19:50 Can't unlink "dir_13235/core" - errno =
2
10/28 14:19:50 Removed directory "dir_13235"
10/28 14:19:50 	*FSM* Transitioning to state
"SUPERVISE"
10/28 14:19:50 	*FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
DIE CHILD_EXIT PERIODIC_CKPT  ]
10/28 14:19:50 	*FSM* Got asynchronous event "DIE"
10/28 14:19:50 	*FSM* Executing transition function
"req_die"
10/28 14:19:50 	*FSM* Transitioning to state
"TERMINATE"
10/28 14:19:50 	*FSM* Executing state func
"terminate_all()" [  ]
10/28 14:19:50 	*FSM* Transitioning to state
"SEND_STATUS_ALL"
10/28 14:19:50 	*FSM* Executing state func
"dispose_all()" [  ]
10/28 14:19:50 	*FSM* Reached state "END"
10/28 14:19:50 ********* STARTER terminating normally
**********
...

...
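
(One line in this StarterLog worries me: "EXEC of user process failed,
probably insufficient swap". I am not sure this is really the cause, but
the available swap on the execute node can at least be checked with the
usual tools:)

    free -m
    swapon -s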


*******************************************************
When I checked the negotiator's log on the central manager, I noticed that
the period between negotiation cycles is rather short. How can I change it?
*******************************************************

10/28 15:01:38 ---------- Started Negotiation Cycle
----------
10/28 15:01:38 Phase 1:  Obtaining ads from collector
...
10/28 15:01:38   Getting all public ads ...
10/28 15:01:38   Sorting 9 ads ...
10/28 15:01:38   Getting startd private ads ...
10/28 15:01:38 Got ads: 9 public and 3 private
10/28 15:01:38 Public ads include 0 submitter, 3
startd
10/28 15:01:38 Phase 2:  Performing accounting ...
10/28 15:01:38 Phase 3:  Sorting submitter ads by
priority ...
10/28 15:01:38 Phase 4.1:  Negotiating with schedds
...
10/28 15:01:38 ---------- Finished Negotiation Cycle
----------
10/28 15:01:58 ---------- Started Negotiation Cycle
----------
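
(From what I understand of the manual, the time between negotiation cycles
is controlled by NEGOTIATOR_INTERVAL in the condor_config of the central
manager, followed by a condor_reconfig on that machine; the value below is
only an example, please correct me if this is not the right knob:)

    NEGOTIATOR_INTERVAL = 300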


Any help would be appreciated!
MAZOUNI habib.

