
[condor-users] what's a COMMAND 404?



Hello,

I have a 4-node Linux cluster running Condor. I have
tried, unsuccessfully, to run jobs on the remote
nodes, but they were evicted from those nodes and,
in the end, every job ran locally on the
submitting machine.

So, what is a command 404 (DEACTIVATE_CLAIM_FORCIBLY)?


Here are the relevant parts of the log files from one
of the machines that ran the job.

cat log1:
*********

001 (109.001.000) 10/16 15:53:39 Job executing on
host: <130.98.172.56:51420>
...
004 (109.001.000) 10/16 15:53:39 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	224  -  Run Bytes Sent By Job
	3587556  -  Run Bytes Received By Job
...
001 (109.001.000) 10/16 15:53:43 Job executing on
host: <130.98.172.55:43121>
...
005 (109.001.000) 10/16 15:53:43 Job terminated.
	(1) Normal termination (return value 13)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	893  -  Run Bytes Sent By Job
	3587999  -  Run Bytes Received By Job
	1117  -  Total Bytes Sent By Job
	7175555  -  Total Bytes Received By Job
...
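
To make the sequence of events in a user log like the one above easier to follow, here is a minimal Python sketch that pulls out each event header line and labels it. The function name `summarize` and the `LABELS` table are illustrative assumptions; the three labels are taken only from the event texts visible in this excerpt (001, 004, 005) and are not a complete list of Condor event codes.

```python
import re

# Event header as it appears in the excerpt, e.g.
# "004 (109.001.000) 10/16 15:53:39 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

# Assumption: labels copied from the event texts shown in this log only.
LABELS = {"001": "executing", "004": "evicted", "005": "terminated"}


def summarize(log_text):
    """Return (timestamp, job id, label, detail) for each event header line."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            code, cluster, proc, _sub, stamp, detail = m.groups()
            job = f"{int(cluster)}.{int(proc)}"
            # Unknown codes are passed through unchanged.
            events.append((stamp, job, LABELS.get(code, code), detail))
    return events


SAMPLE = """001 (109.001.000) 10/16 15:53:39 Job executing on
host: <130.98.172.56:51420>
...
004 (109.001.000) 10/16 15:53:39 Job was evicted.
...
005 (109.001.000) 10/16 15:53:43 Job terminated.
"""

if __name__ == "__main__":
    for event in summarize(SAMPLE):
        print(event)
```

Indented detail lines and the `...` separators are skipped because they do not match the header pattern.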

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

The StartLog shows these messages at the time of
eviction:

Now in new log file /home/condor/log/StartLog
10/16 15:53:40 Changing state and activity:
Preempting/Vacating -> Owner/Idle
10/16 15:53:40 State change: IS_OWNER is false
10/16 15:53:40 Changing state: Owner -> Unclaimed
10/16 15:53:40 DaemonCore: Command received via UDP
from host <130.98.172.55:33364>
10/16 15:53:40 DaemonCore: received command 443
(RELEASE_CLAIM), calling handler (command_handler)
10/16 15:53:40 Error: can't find resource with capability
(<130.98.172.56:51420>#5582989707)
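
When looking for a particular command number such as 404 in a longer StartLog, a small script can list every "received command" entry. The sketch below is only an illustration: the function name `commands_seen` is hypothetical, and the pattern is based solely on the DaemonCore message format shown above. It collapses whitespace first, so it also works on lines that were wrapped by the mail client.

```python
import re

# Matches entries such as "received command 443 (RELEASE_CLAIM)".
CMD_RE = re.compile(r"received command (\d+)\s*\((\w+)\)")


def commands_seen(log_text):
    """Return (number, name) pairs for each DaemonCore command entry."""
    # Collapse all whitespace so that wrapped log lines are rejoined.
    flat = " ".join(log_text.split())
    return [(int(num), name) for num, name in CMD_RE.findall(flat)]


SAMPLE = """10/16 15:53:40 DaemonCore: Command received via UDP
from host <130.98.172.55:33364>
10/16 15:53:40 DaemonCore: received command 443
(RELEASE_CLAIM), calling handler (command_handler)
"""

if __name__ == "__main__":
    print(commands_seen(SAMPLE))
```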

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

The StarterLog shows these messages:

10/16 15:50:58 ********* STARTER terminating normally
**********
10/16 15:53:32 ********** STARTER starting up
***********
10/16 15:53:32 ** $CondorVersion: 6.4.7 Jan 26 2003 $
10/16 15:53:32 ** $CondorPlatform: INTEL-LINUX-GLIBC22
$
10/16 15:53:32
******************************************
10/16 15:53:32 Submitting machine is
"node1.xtrem.der.edf.fr"
10/16 15:53:32 EventHandler {
10/16 15:53:32 	func = 0x80706d0
10/16 15:53:32 	mask = SIGALRM SIGHUP SIGINT SIGUSR1
SIGUSR2 SIGCHLD SIGTSTP 
10/16 15:53:32 }
10/16 15:53:32 Done setting resource limits
10/16 15:53:32 	*FSM* Transitioning to state
"GET_PROC"
10/16 15:53:32 	*FSM* Executing state func
"get_proc()" [  ]
10/16 15:53:32 Entering get_proc()
10/16 15:53:32 Entering get_job_info()
10/16 15:53:32 Startup Info:
10/16 15:53:32 	Version Number: 1
10/16 15:53:32 	Id: 109.1
10/16 15:53:32 	JobClass: STANDARD
10/16 15:53:32 	Uid: 504
10/16 15:53:32 	Gid: 505
10/16 15:53:32 	VirtPid: -1
10/16 15:53:32 	SoftKillSignal: 20
10/16 15:53:32 	Cmd: "/home/condor/test"
10/16 15:53:32 	Args: ""
10/16 15:53:32 	Env: ""
10/16 15:53:32 	Iwd: "/home/condor"
10/16 15:53:32 	Ckpt Wanted: TRUE
10/16 15:53:32 	Is Restart: FALSE
10/16 15:53:32 	Core Limit Valid: TRUE
10/16 15:53:32 	Coredump Limit 0
10/16 15:53:32 User uid set to 99
10/16 15:53:32 User uid set to 99
10/16 15:53:32 User Process 109.1 {
10/16 15:53:32   cmd = /home/condor/test
10/16 15:53:32   args = 
10/16 15:53:32   env = 
10/16 15:53:32   local_dir = dir_29257
10/16 15:53:32   cur_ckpt =
dir_29257/condor_exec.109.1
10/16 15:53:32   core_name = dir_29257/core
10/16 15:53:32   uid = 99, gid = 99
10/16 15:53:32   v_pid = -1
10/16 15:53:32   pid = (NOT CURRENTLY EXECUTING)
10/16 15:53:32   exit_status_valid = FALSE
10/16 15:53:32   exit_status = (NEVER BEEN EXECUTED)
10/16 15:53:32   ckpt_wanted = TRUE
10/16 15:53:32   coredump_limit_exists = TRUE
10/16 15:53:32   coredump_limit = 0
10/16 15:53:32   soft_kill_sig = 20
10/16 15:53:32   job_class = STANDARD
10/16 15:53:32   state = NEW
10/16 15:53:32   new_ckpt_created = FALSE
10/16 15:53:32   ckpt_transferred = FALSE
10/16 15:53:32   core_created = FALSE
10/16 15:53:32   core_transferred = FALSE
10/16 15:53:32   exit_requested = FALSE
10/16 15:53:32   image_size = -1 blocks
10/16 15:53:32   user_time = 0
10/16 15:53:32   sys_time = 0
10/16 15:53:32   guaranteed_user_time = 0
10/16 15:53:32   guaranteed_sys_time = 0
10/16 15:53:32 }
10/16 15:53:32 	*FSM* Transitioning to state
"GET_EXEC"
10/16 15:53:32 	*FSM* Executing state func
"get_exec()" [ SUSPEND VACATE DIE  ]
10/16 15:53:32 Entering get_exec()
10/16 15:53:32 Executable is located on submitting
host
10/16 15:53:32 Expanded executable name is
"/home/condor/spool/cluster109.ickpt.subproc0"
10/16 15:53:32 Going to try 3 attempts at getting the
inital executable
10/16 15:53:32 Entering get_file(
/home/condor/spool/cluster109.ickpt.subproc0,
dir_29257/condor_exec.109.1, 0755 )
10/16 15:53:33 Opened
"/home/condor/spool/cluster109.ickpt.subproc0" via
file stream
10/16 15:53:39 Get_file() transferred 3587233 bytes,
552788 bytes/second
10/16 15:53:39 Fetched orig ckpt file
"/home/condor/spool/cluster109.ickpt.subproc0" into
"dir_29257/condor_exec.109.1" with 1 attempt
10/16 15:53:39 Executable
'dir_29257/condor_exec.109.1' is linked with
"$CondorVersion: 6.4.7 Jan 26 2003 $" on a
"$CondorPlatform: INTEL-LINUX-GLIBC22 $"
10/16 15:53:39 	*FSM* Executing transition function
"spawn_all"
10/16 15:53:39 Pipe built
10/16 15:53:39 New pipe_fds[14,1]
10/16 15:53:39 cmd_fd = 14
10/16 15:53:39 Calling execve(
"/home/condor/execute/dir_29257/condor_exec.109.1",
"condor_exec.109.1", "-_condor_cmd_fd", "14", 0,
"CONDOR_VM=vm1",
"CONDOR_SCRATCH_DIR=/home/condor/execute/dir_29257", 0
)
10/16 15:53:39 Started user job - PID = 29258
10/16 15:53:39 cmd_fp = 0x82b3018
10/16 15:53:39 end
10/16 15:53:39 	*FSM* Transitioning to state
"SUPERVISE"
10/16 15:53:39 	*FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
DIE CHILD_EXIT PERIODIC_CKPT  ]
10/16 15:53:39 	*FSM* Got asynchronous event
"CHILD_EXIT"
10/16 15:53:39 	*FSM* Executing transition function
"reaper"
10/16 15:53:39 Process 29258 exited with status 129
10/16 15:53:39 EXEC of user process failed, probably
insufficient swap
10/16 15:53:39 	*FSM* Transitioning to state
"PROC_EXIT"
10/16 15:53:39 	*FSM* Executing state func
"proc_exit()" [ DIE  ]
10/16 15:53:39 	*FSM* Executing transition function
"dispose_one"
10/16 15:53:39 Sending final status for process 109.1
10/16 15:53:39 STATUS encoded as CKPT, *NOT*
TRANSFERRED
10/16 15:53:39 User time = 0.000000 seconds
10/16 15:53:39 System time = 0.000000 seconds
10/16 15:53:39 Unlinked "dir_29257/condor_exec.109.1"
10/16 15:53:39 Can't unlink "dir_29257/core" - errno =
2
10/16 15:53:39 Removed directory "dir_29257"
10/16 15:53:39 	*FSM* Transitioning to state
"SUPERVISE"
10/16 15:53:39 	*FSM* Got asynchronous event "DIE"
10/16 15:53:39 	*FSM* Executing transition function
"req_die"
10/16 15:53:39 	*FSM* Transitioning to state
"TERMINATE"
10/16 15:53:39 	*FSM* Executing state func
"terminate_all()" [  ]
10/16 15:53:39 	*FSM* Transitioning to state
"SEND_STATUS_ALL"
10/16 15:53:39 	*FSM* Executing state func
"dispose_all()" [  ]
10/16 15:53:39 	*FSM* Reached state "END"
10/16 15:53:39 ********* STARTER terminating normally
**********

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
 
Any help would be appreciated!
MAZOUNI habib
EDF.