
Re: [Condor-users] Job starts then immediately idles





On 5/19/06, Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:
Hi guys

So, I have a standard job that runs for 3 seconds, then goes Idle, and I don't understand why.

From all the logs I went through, the surprising point is this:
Error: Couldn't open standard file '/dev/null'
in the ShadowLog (copy below).
I also copied the StarterLog from the executing host, which tells me the job is killed because it receives signal 9.

FYI, this is the first "standard" job I'm trying to run: could this be a problem with the checkpoint server?

Do you have any idea where else I could check?

Thanks in advance
Nicolas


****************************************
ShadowLog :

5/19 11:38:15 (?.?) (13373):******* Standard Shadow starting up *******
5/19 11:38:15 (?.?) (13373):** $CondorVersion: 6.7.18 Mar 22 2006 $
5/19 11:38:15 (?.?) (13373):** $CondorPlatform: I386-LINUX_RH9 $
5/19 11:38:15 (?.?) (13373):*******************************************
5/19 11:38:15 (?.?) (13373):uid=0, euid=5556, gid=0, egid=100
5/19 11:38:15 (?.?) (13373):Hostname = "<193.49.27.58:33060>", Job = 1.2
5/19 11:38:15 (1.2) (13373):Requesting Primary Starter
5/19 11:38:15 (1.2) (13373):Shadow: Request to run a job was ACCEPTED
5/19 11:38:15 (1.2) (13373):Shadow: RSC_SOCK connected, fd = 17
5/19 11:38:15 (1.2) (13373):Shadow: CLIENT_LOG connected, fd = 18
5/19 11:38:15 (1.2) (13373):My_Filesystem_Domain = " galaxy.ibpc.fr"
5/19 11:38:15 (1.2) (13373):My_UID_Domain = "galaxy.ibpc.fr"
5/19 11:38:15 (1.2) (13373):    Entering pseudo_get_file_stream
5/19 11:38:15 (1.2) (13373):    file ="/scratch/condor/spool/cluster1.ickpt.subproc0"
5/19 11:38:16 (1.2) (13373):Reaped child status - pid 13375 exited with status 0
5/19 11:38:17 (1.2) (13373):Read: User Job - $CondorPlatform: I386-LINUX_RH9 $
5/19 11:38:17 (1.2) (13373):Read: User Job - $CondorVersion: 6.7.18 Mar 22 2006 $
5/19 11:38:17 (1.2) (13373):Read: Checkpoint file name is "/scratch/condor/spool/cluster1.proc2.subproc0"
5/19 11:38:17 (1.2) (13373):error: Error: Couldn't open standard file '/dev/null'
5/19 11:38:17 (1.2) (13373):Shadow: Job 1.2 exited, termsig = 9, coredump = 0, retcode = 0
5/19 11:38:17 (1.2) (13373):Shadow: Job was kicked off without a checkpoint
5/19 11:38:17 (1.2) (13373):Shadow: DoCleanup: unlinking TmpCkpt '/scratch/condor/spool/cluster1.proc2.subproc0.tmp'
5/19 11:38:17 (1.2) (13373):Trying to unlink /scratch/condor/spool/cluster1.proc2.subproc0.tmp
5/19 11:38:17 (1.2) (13373):user_time = 1 ticks
5/19 11:38:17 (1.2) (13373):sys_time  = 3 ticks
5/19 11:38:17 (1.2) (13373):********** Shadow Exiting(107)**********

****************************************


StarterLog@vm1 on executing host
5/19 11:18:36 ********** STARTER starting up ***********
5/19 11:18:36 ** $CondorVersion: 6.7.18 Mar 22 2006 $
5/19 11:18:36 ** $CondorPlatform: I386-LINUX_RH9 $
5/19 11:18:36 ******************************************
5/19 11:18:36 Submitting machine is "chagall.galaxy.ibpc.fr"
5/19 11:18:36 EventHandler {
5/19 11:18:36   func = 0x80d80b6
5/19 11:18:36   mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
5/19 11:18:36 }
5/19 11:18:36 Done setting resource limits
5/19 11:18:36   *FSM* Transitioning to state "GET_PROC"
5/19 11:18:36   *FSM* Executing state func "get_proc()" [  ]
5/19 11:18:36 Entering get_proc()
5/19 11:18:36 Entering get_job_info()
5/19 11:18:36 Startup Info:
5/19 11:18:36   Version Number: 1
5/19 11:18:36   Id: 1.0
5/19 11:18:36   JobClass: STANDARD
5/19 11:18:36   Uid: 1105
5/19 11:18:36   Gid: 100
5/19 11:18:36   VirtPid: -1
5/19 11:18:36   SoftKillSignal: 20
5/19 11:18:36   Cmd: "/ibpc/rhea/saladin/Attract/Condor/attract"
5/19 11:18:36   Args: "receptor.red ligand.red 0"
5/19 11:18:36   Env: ""
5/19 11:18:36   Iwd: "/ibpc/jaune/RecA/RecAlc/Rigid/AvecHelice/AvecCharge2"
5/19 11:18:36   Ckpt Wanted: TRUE
5/19 11:18:36   Is Restart: FALSE
5/19 11:18:36   Core Limit Valid: TRUE
5/19 11:18:36   Coredump Limit 0
5/19 11:18:36 User uid set to 1105
5/19 11:18:36 User uid set to 100
5/19 11:18:36 User Process 1.0 {
5/19 11:18:36   cmd = /ibpc/rhea/saladin/Attract/Condor/attract
5/19 11:18:36   args = receptor.red ligand.red 0
5/19 11:18:36   env = CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_2425
5/19 11:18:36   local_dir = dir_2425
5/19 11:18:36   cur_ckpt = dir_2425/condor_exec.1.0
5/19 11:18:36   core_name = (either 'core' or 'core.<pid>')
5/19 11:18:36   uid = 1105, gid = 100
5/19 11:18:36   v_pid = -1
5/19 11:18:36   pid = (NOT CURRENTLY EXECUTING)
5/19 11:18:36   exit_status_valid = FALSE
5/19 11:18:36   exit_status = (NEVER BEEN EXECUTED)
5/19 11:18:36   ckpt_wanted = TRUE
5/19 11:18:36   coredump_limit_exists = TRUE
5/19 11:18:36   coredump_limit = 0
5/19 11:18:36   soft_kill_sig = 20
5/19 11:18:36   job_class = STANDARD
5/19 11:18:36   state = NEW
5/19 11:18:36   new_ckpt_created = FALSE
5/19 11:18:36   ckpt_transferred = FALSE
5/19 11:18:36   core_created = FALSE
5/19 11:18:36   core_transferred = FALSE
5/19 11:18:36   exit_requested = FALSE
5/19 11:18:36   image_size = -1 blocks
5/19 11:18:36   user_time = 0
5/19 11:18:36   sys_time = 0
5/19 11:18:36   guaranteed_user_time = 0
5/19 11:18:36   guaranteed_sys_time = 0
5/19 11:18:36 }
5/19 11:18:36   *FSM* Transitioning to state "GET_EXEC"
5/19 11:18:36   *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
5/19 11:18:36 Entering get_exec()
5/19 11:18:36 Executable is located on submitting host
5/19 11:18:36 Expanded executable name is "/scratch/condor/spool/cluster1.ickpt.subproc0"
5/19 11:18:36 Going to try 3 attempts at getting the initial executable
5/19 11:18:36 Entering get_file( /scratch/condor/spool/cluster1.ickpt.subproc0, dir_2425/condor_exec.1.0, 0755 )
5/19 11:18:36 Opened "/scratch/condor/spool/cluster1.ickpt.subproc0" via file stream
5/19 11:18:41 Get_file() transferred 13731741 bytes, 2988499 bytes/second
5/19 11:18:41 Fetched orig ckpt file "/scratch/condor/spool/cluster1.ickpt.subproc0" into "dir_2425/condor_exec.1.0" with 1 attempt
5/19 11:18:41 Executable 'dir_2425/condor_exec.1.0' is linked with "$CondorVersion: 6.7.18 Mar 22 2006 $" on a "$CondorPlatform: I386-LINUX_RH9 $"
5/19 11:18:41   *FSM* Executing transition function "spawn_all"
5/19 11:18:41 Pipe built
5/19 11:18:41 New pipe_fds[14,1]
5/19 11:18:41 cmd_fd = 14
5/19 11:18:41 Calling execve( "/scratch/condor/execute/dir_2425/condor_exec.1.0", "condor_exec.1.0", "-_condor_cmd_fd", "14", "receptor.red", "ligand.red", "0", 0, "CONDOR_VM=vm1", "_condor_BIND_ALL_INTERFACES=FALSE", "CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_2425", 0 )
5/19 11:18:41 Started user job - PID = 2431
5/19 11:18:41 cmd_fp = 0x836c5d8
5/19 11:18:41 end
5/19 11:18:41   *FSM* Transitioning to state "SUPERVISE"
5/19 11:18:41   *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]
5/19 11:18:41   *FSM* Got asynchronous event "CHILD_EXIT"
5/19 11:18:41   *FSM* Executing transition function "reaper"
5/19 11:18:41 Process 2431 killed by signal 9
5/19 11:18:41 Process exited by request
5/19 11:18:41   *FSM* Transitioning to state "PROC_EXIT"
5/19 11:18:41   *FSM* Executing state func "proc_exit()" [ DIE  ]
5/19 11:18:41   *FSM* Executing transition function "dispose_one"
5/19 11:18:41 Sending final status for process 1.0
5/19 11:18:41 STATUS encoded as CKPT, *NOT* TRANSFERRED
5/19 11:18:41 User time = 0.000000 seconds
5/19 11:18:41 System time = 0.000000 seconds
5/19 11:18:41 Unlinked "dir_2425/condor_exec.1.0"
5/19 11:18:41 Removed directory "dir_2425"
5/19 11:18:41   *FSM* Transitioning to state "SUPERVISE"
5/19 11:18:41   *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]
5/19 11:18:41   *FSM* Got asynchronous event "DIE"
5/19 11:18:41   *FSM* Executing transition function "req_die"
5/19 11:18:41   *FSM* Transitioning to state "TERMINATE"
5/19 11:18:41   *FSM* Executing state func "terminate_all()" [  ]
5/19 11:18:41   *FSM* Transitioning to state "SEND_STATUS_ALL"
5/19 11:18:41   *FSM* Executing state func "dispose_all()" [  ]
5/19 11:18:41   *FSM* Reached state "END"
5/19 11:18:41 ********* STARTER terminating normally **********


I guess you specified /dev/null for one of the standard files (input, output or error) in the submit file. What happens if you change /dev/null to a real file? If the job then runs, the problem lies with /dev/null itself.
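
If you want to test /dev/null directly, here is a minimal Python sketch. It only checks that the path exists, is a character device, and can be opened for reading and writing by whoever runs the script. The function name check_dev_null is made up for this example, and this is only a guess at what could be wrong, not a confirmed diagnosis. Since the error shows up in the ShadowLog, it may be worth running it on the submit machine as well as on the execute machine, as the user who owns the job.

```python
import os
import stat


def check_dev_null(path="/dev/null"):
    """Return a list of problems found with `path` (empty list = none found).

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    problems = []

    # The path must exist and be visible to the current user.
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc}"]

    # A healthy /dev/null is a character device, not a regular file.
    if not stat.S_ISCHR(st.st_mode):
        problems.append(f"{path} is not a character device")

    # Try opening it both ways, since stdin needs read and stdout/stderr need write.
    for flag, purpose in ((os.O_RDONLY, "reading"), (os.O_WRONLY, "writing")):
        try:
            fd = os.open(path, flag)
        except OSError as exc:
            problems.append(f"cannot open {path} for {purpose}: {exc}")
        else:
            os.close(fd)

    return problems


if __name__ == "__main__":
    issues = check_dev_null()
    print("\n".join(issues) if issues else "/dev/null looks fine")
```

If it reports that /dev/null is not a character device or cannot be opened, that would point to the device node or its permissions on that machine rather than to Condor.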

--
Diego Bello Carreño
Thesis student, Ingeniería Civil Informática
UTFSM, Valparaíso, Chile
User #294897 counter.li.org