
Re: [condor-users] why jobs are always evicted on remote machines?



On Tuesday 28 October 2003 8:19 am, habib mazouni wrote:
> hello,
> I already sent three messages but unfortunately I
> didn't receive any answer.
>
> well, I will summarize my problem once again:
>
> I have a 4-node Linux cluster running Condor. I have
> tried, unsuccessfully, to run jobs on the remote
> nodes, but they were evicted from those nodes, and
> finally, all the executions were held locally on the
> submitting machine.
> I don't understand why the jobs cannot be executed on
> the remote machines.

Ok, look at the log snippet you sent (quoted below), and you'll see at least 
some information about what's going wrong.

<snip>

> 10/28 14:19:49 Get_file() transferred 3587233 bytes,
> 587500 bytes/second
> 10/28 14:19:49 Fetched orig ckpt file
> "/home/condor/spool/cluster133.ickpt.subproc0" into
> "dir_13235/condor_exec.133.5" with 1 attempt
> 10/28 14:19:50 Executable
> 'dir_13235/condor_exec.133.5' is linked with
> "$CondorVersion: 6.4.7 Jan 26 2003 $" on a
> "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
> 10/28 14:19:50 	*FSM* Executing transition function
> "spawn_all"
> 10/28 14:19:50 Pipe built
> 10/28 14:19:50 New pipe_fds[14,1]
> 10/28 14:19:50 cmd_fd = 14
> 10/28 14:19:50 Calling execve(
> "/home/condor/execute/dir_13235/condor_exec.133.5",
> "condor_exec.133.5", "-_condor_cmd_fd", "14", 0,
> "CONDOR_VM=vm1",
> "CONDOR_SCRATCH_DIR=/home/condor/execute/dir_13235", 0
> )
> 10/28 14:19:50 Started user job - PID = 13236
> 10/28 14:19:50 cmd_fp = 0x82b2d30
> 10/28 14:19:50 end
> 10/28 14:19:50 	*FSM* Transitioning to state
> "SUPERVISE"
> 10/28 14:19:50 	*FSM* Executing state func
> "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM
> DIE CHILD_EXIT PERIODIC_CKPT  ]
> 10/28 14:19:50 	*FSM* Got asynchronous event
> "CHILD_EXIT"
> 10/28 14:19:50 	*FSM* Executing transition function
> "reaper"
> 10/28 14:19:50 Process 13236 exited with status 129
> 10/28 14:19:50 EXEC of user process failed, probably
> insufficient swap
> 10/28 14:19:50 	*FSM* Transitioning to state
> "PROC_EXIT"
> 10/28 14:19:50 	*FSM* Executing state func
> "proc_exit()" [ DIE  ]

Notice the "Process 13236 exited with status 129" and "EXEC of user process 
failed, probably insufficient swap" messages.

Exit status 129 would indicate that the process was killed by signal 1 
(SIGHUP) immediately after it started execution: by the usual convention, an 
exit status of 128 + N means the process died on signal N, and 129 - 128 = 1.  
This is odd.  Can you run the executable directly on the target machine 
(node3)?  It's a long shot, but something strange is going on.
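As a quick illustration of that 128 + N convention, here's a small sketch in 
C (just an illustration, assuming the status the starter reports follows the 
usual shell-style encoding; it is not Condor source):

#include <stdio.h>

/* Decode a shell-style exit status: by the common 128 + N
 * convention, a value above 128 means the process was
 * terminated by signal N. */
int main(void)
{
    int status = 129;   /* the value from the starter log above */

    if (status > 128)
        printf("killed by signal %d\n", status - 128);  /* 129 - 128 = 1, i.e. SIGHUP */
    else
        printf("exited normally with code %d\n", status);

    return 0;
}

Compiling and running that prints "killed by signal 1", which is where the 
SIGHUP interpretation comes from.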

Could you send the output of 'condor_status' and 'condor_status -l node3'?

Thanks

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences
