
[condor-users] Jobs submit, immediate eviction



I have a new install of Condor 6.6.3; the central manager and the execute hosts
are all running v6.6.3.  The CM is running Fedora Core 1 and using the static
tarball for rh9.  The execute hosts are running LTSP-4.0.1 and using the
dynamic tarball for rh8.

All execute nodes share a common root filesystem mounted ro via NFS.  They mount
/home rw via NFS.  Their local directory is /home/condor/hosts/$(HOSTNAME). 
They share a common NIS domain of beowulfnis and a common Internet domain of
just beowulf.  Each hostname is of the form: nodeXXX.beowulf where XXX is a
number ranging from 000 to 021.

I can use most of the Condor functions, such as condor_version, condor_status to
view available hosts, and condor_q to see the job status.  However, I cannot
successfully run a job.  Here is what I have done:

C source:
---------------------------------
#include <stdio.h>
 
int main(void)
{
  int i;
  for (i = 0; i < 20; ++i)
    printf("hello, Condor\n");
  return 0;
}
---------------------------------


Output of "condor_compile gcc -O -o hello hello.c":

---------------------------------
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor/lib -Bstatic --eh-frame-hdr -m
elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o hello
/opt/condor/lib/condor_rt0.o
/usr/lib/gcc-lib/i386-redhat-linux/3.3.2/../../../crti.o
/usr/lib/gcc-lib/i386-redhat-linux/3.3.2/crtbeginT.o -L/opt/condor/lib
-L/usr/lib/gcc-lib/i386-redhat-linux/3.3.2
-L/usr/lib/gcc-lib/i386-redhat-linux/3.3.2/../../.. /tmp/ccndlyTE.o
/opt/condor/lib/libcondorzsyscall.a /opt/condor/lib/libz.a
/opt/condor/lib/libcomp_libstdc++.a /opt/condor/lib/libcomp_libgcc.a
/opt/condor/lib/libcomp_libgcc_eh.a /opt/condor/lib/libcomp_libgcc_eh.a -lc
-lnss_files -lnss_dns -lresolv -lc -lnss_files -lnss_dns -lresolv -lc
/opt/condor/lib/libcomp_libgcc.a /opt/condor/lib/libcomp_libgcc_eh.a
/opt/condor/lib/libcomp_libgcc_eh.a
/usr/lib/gcc-lib/i386-redhat-linux/3.3.2/crtend.o
/usr/lib/gcc-lib/i386-redhat-linux/3.3.2/../../../crtn.o
/opt/condor/lib/libcondorzsyscall.a(condor_file_agent.o)(.text+0x250): In
function `CondorFileAgent::open(char const*, int, int)':
/home/condor/execute/dir_22897/src/condor_ckpt/condor_file_agent.C:99: warning:
the use of `tmpnam' is dangerous, better use `mkstemp'
-----------------------------------

Contents of submit.hello:
---------------------------------
Executable      = hello
Universe        = Standard
InitialDir      = /home/condor/hello
UidDomain       = beowulfnis
FileSystemDomain= beowulfnis
Output          = /home/condor/hello/hello.out
Log             = /home/condor/hello/hello.log
Should_transfer_files = YES
When_to_transfer_output= ON_EXIT
Queue
-------------------------------------



I then verify that the ownership of all files is condor:condor, and I run
"condor_submit submit.hello" from the /home/condor/hello directory as user condor:

---------------------------------
[condor@server hello]$ condor_submit submit.hello
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 111.
------------------------------------------

hello.log and hello.out are created.  After the job leaves idle state and
executes on a host, hello.out is empty and hello.log contains 422 bytes:

---------------------------------
000 (111.000.000) 04/15 21:17:20 Job submitted from host: <172.16.0.1:57081>
...
001 (111.000.000) 04/15 21:18:56 Job executing on host: <172.16.1.3:1026>
...
004 (111.000.000) 04/15 21:19:03 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        344  -  Run Bytes Sent By Job
        12352778  -  Run Bytes Received By Job
...
---------------------------------


From the output, we see that the job executed on machine node003 (all IPs are
statically assigned).  Node003's log files show confusing information.  Here is
the complete StarterLog; the StartLog is rather long, so I will not include it
unless it is requested.


----------------------------------
4/15 20:57:14 ********** STARTER starting up ***********
4/15 20:57:14 ** $CondorVersion: 6.6.3 Mar 29 2004 $
4/15 20:57:14 ** $CondorPlatform: I386-LINUX-RH80 $
4/15 20:57:14 ******************************************
4/15 20:57:14 Submitting machine is "server"
4/15 20:57:15 EventHandler {
4/15 20:57:15 	func = 0x80733ea
4/15 20:57:15 	mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
4/15 20:57:17 }
4/15 20:57:18 Done setting resource limits
4/15 20:57:18 	*FSM* Transitioning to state "GET_PROC"
4/15 20:57:18 	*FSM* Executing state func "get_proc()" [  ]
4/15 20:57:19 Entering get_proc()
4/15 20:57:19 Entering get_job_info()
4/15 20:57:19 Startup Info:
4/15 20:57:19 	Version Number: 1
4/15 20:57:19 	Id: 111.0
4/15 20:57:20 	JobClass: STANDARD
4/15 20:57:20 	Uid: 500
4/15 20:57:20 	Gid: 500
4/15 20:57:20 	VirtPid: -1
4/15 20:57:21 	SoftKillSignal: 20
4/15 20:57:21 	Cmd: "/home/condor/hello/hello"
4/15 20:57:21 	Args: ""
4/15 20:57:21 	Env: ""
4/15 20:57:21 	Iwd: "/home/condor/hello"
4/15 20:57:21 	Ckpt Wanted: TRUE
4/15 20:57:21 	Is Restart: FALSE
4/15 20:57:22 	Core Limit Valid: TRUE
4/15 20:57:22 	Coredump Limit 0
4/15 20:57:22 User uid set to 99
4/15 20:57:22 User uid set to 99
4/15 20:57:23 User Process 111.0 {
4/15 20:57:23   cmd = /home/condor/hello/hello
4/15 20:57:23   args = 
4/15 20:57:24   env = 
4/15 20:57:24   local_dir = dir_9222
4/15 20:57:24   cur_ckpt = dir_9222/condor_exec.111.0
4/15 20:57:24   core_name = (either 'core' or 'core.<pid>')
4/15 20:57:24   uid = 99, gid = 99
4/15 20:57:25   v_pid = -1
4/15 20:57:25   pid = (NOT CURRENTLY EXECUTING)
4/15 20:57:25   exit_status_valid = FALSE
4/15 20:57:25   exit_status = (NEVER BEEN EXECUTED)
4/15 20:57:25   ckpt_wanted = TRUE
4/15 20:57:26   coredump_limit_exists = TRUE
4/15 20:57:26   coredump_limit = 0
4/15 20:57:26   soft_kill_sig = 20
4/15 20:57:27   job_class = STANDARD
4/15 20:57:27   state = NEW
4/15 20:57:27   new_ckpt_created = FALSE
4/15 20:57:27   ckpt_transferred = FALSE
4/15 20:57:28   core_created = FALSE
4/15 20:57:28   core_transferred = FALSE
4/15 20:57:28   exit_requested = FALSE
4/15 20:57:28   image_size = -1 blocks
4/15 20:57:28   user_time = 0
4/15 20:57:29   sys_time = 0
4/15 20:57:29   guaranteed_user_time = 0
4/15 20:57:29   guaranteed_sys_time = 0
4/15 20:57:29 }
4/15 20:57:30 	*FSM* Transitioning to state "GET_EXEC"
4/15 20:57:30 	*FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
4/15 20:57:31 Entering get_exec()
4/15 20:57:31 Executable is located on submitting host
4/15 20:57:31 Expanded executable name is
"/home/condor/hosts/server/spool/cluster111.ickpt.subproc0"
4/15 20:57:31 Going to try 3 attempts at getting the inital executable
4/15 20:57:32 Entering get_file(
/home/condor/hosts/server/spool/cluster111.ickpt.subproc0,
dir_9222/condor_exec.111.0, 0755 )
4/15 20:57:32 Opened "/home/condor/hosts/server/spool/cluster111.ickpt.subproc0"
via file stream
4/15 20:57:57 Get_file() transferred 12352126 bytes, 501260 bytes/second
4/15 20:57:57 Fetched orig ckpt file
"/home/condor/hosts/server/spool/cluster111.ickpt.subproc0" into
"dir_9222/condor_exec.111.0" with 1 attempt
4/15 20:57:58 Executable 'dir_9222/condor_exec.111.0' is linked with
"$CondorVersion: 6.6.3 Mar 29 2004 $" on a "$CondorPlatform: I386-LINUX-RH9 $"
4/15 20:57:59 	*FSM* Executing transition function "spawn_all"
4/15 20:57:59 Pipe built
4/15 20:57:59 New pipe_fds[14,1]
4/15 20:57:59 cmd_fd = 14
4/15 20:57:59 Calling execve(
"/home/condor/hosts/node003/execute/dir_9222/condor_exec.111.0",
"condor_exec.111.0", "-_condor_cmd_fd", "14", 0, "CONDOR_VM=vm1",
"CONDOR_SCRATCH_DIR=/home/condor/hosts/node003/execute/dir_9222", 0 )
4/15 20:58:01 Started user job - PID = 9413
4/15 20:58:02 cmd_fp = 0x8292d90
4/15 20:58:02 end
4/15 20:58:02 	*FSM* Transitioning to state "SUPERVISE"
4/15 20:58:03 	*FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT ]
4/15 20:58:05 	*FSM* Got asynchronous event "CHILD_EXIT"
4/15 20:58:05 	*FSM* Executing transition function "reaper"
4/15 20:58:05 Process 9413 killed by signal 9
4/15 20:58:05 Process exited by request
4/15 20:58:06 	*FSM* Transitioning to state "PROC_EXIT"
4/15 20:58:06 	*FSM* Executing state func "proc_exit()" [ DIE  ]
4/15 20:58:06 	*FSM* Executing transition function "dispose_one"
4/15 20:58:07 Sending final status for process 111.0
4/15 20:58:07 STATUS encoded as CKPT, *NOT* TRANSFERRED
4/15 20:58:07 User time = 0.000000 seconds
4/15 20:58:07 System time = 0.000000 seconds
4/15 20:58:08 Unlinked "dir_9222/condor_exec.111.0"
4/15 20:58:08 Removed directory "dir_9222"
4/15 20:58:09 	*FSM* Transitioning to state "SUPERVISE"
4/15 20:58:09 	*FSM* Got asynchronous event "DIE"
4/15 20:58:09 	*FSM* Executing transition function "req_die"
4/15 20:58:09 	*FSM* Transitioning to state "TERMINATE"
4/15 20:58:10 	*FSM* Executing state func "terminate_all()" [  ]
4/15 20:58:10 	*FSM* Transitioning to state "SEND_STATUS_ALL"
4/15 20:58:10 	*FSM* Executing state func "dispose_all()" [  ]
4/15 20:58:11 	*FSM* Reached state "END"
4/15 20:58:11 ********* STARTER terminating normally **********
----------------------------
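
One detail that stands out to me in the StarterLog above: the starter reports
"$CondorPlatform: I386-LINUX-RH80 $", while the fetched executable is linked
with "$CondorPlatform: I386-LINUX-RH9 $" (I compiled on the CM with the rh9
static tarball, but the execute hosts use the rh8 dynamic tarball).  I do not
know whether that mismatch is the cause, but a throwaway script like the
following makes the two platform strings easy to compare.  The two log lines
are inlined here for illustration; in practice I would point $log at a node's
real StarterLog:

```shell
# Hypothetical sanity check: does the starter's platform string agree with
# the platform the executable was linked against?
log=StarterLog.sample

# Sample lines copied from the StarterLog above (a real run would use the
# node's actual StarterLog instead of this inlined excerpt).
cat > "$log" <<'EOF'
4/15 20:57:14 ** $CondorPlatform: I386-LINUX-RH80 $
4/15 20:57:58 Executable 'dir_9222/condor_exec.111.0' is linked with
"$CondorVersion: 6.6.3 Mar 29 2004 $" on a "$CondorPlatform: I386-LINUX-RH9 $"
EOF

# First match is the starter's platform, last match is the executable's.
starter=$(grep -o 'I386-LINUX-RH[0-9]*' "$log" | head -n 1)
exe=$(grep -o 'I386-LINUX-RH[0-9]*' "$log" | tail -n 1)

if [ "$starter" != "$exe" ]; then
  echo "platform mismatch: starter=$starter executable=$exe"
else
  echo "platforms match: $starter"
fi
rm -f "$log"
```

If the two strings differ, rebuilding the job with condor_compile against the
same tarball release as the execute hosts would be an obvious first experiment.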




Now then...   Does anyone have any idea whatsoever as to the nature of this
problem?  Is there any more information I could include that would shed light on
the cause and, ideally, the solution?  Is there a way to verify my setup to
ensure that everything is configured as it should be?

Hmm...  that would be a good addition to the Condor package: condor_diagnostic,
a tool that tests all of the Condor functions to verify that the installation is
correct.  At any rate, thoughts on the matter are more than welcome.