
[Condor-users] Problem Shadow Exiting(107) Warning: can't find resource with ClaimId (< 192.168.7.221:57320>





Hello,
I am submitting a job through Globus to Condor, but the job stays in the idle state. The same thing happens when I submit the job directly with condor_submit. Please help me out.

Shadow.log
=================================================================================
11/22 20:47:29 ( 4.0) (4473):My_UID_Domain = "niting-w2p.corp.cdac.in"
11/22 20:47:35 (4.0) (4473):Shadow: Job 4.0 exited, termsig = 9, coredump = 0, retcode = 0
11/22 20:47:35 ( 4.0) (4473):Shadow: Job was kicked off without a checkpoint
11/22 20:47:35 (4.0) (4473):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp'
11/22 20:47:36 ( 4.0) (4473):Trying to unlink /home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp
11/22 20:47:36 (4.0) (4473):user_time = 1 ticks
11/22 20:47:36 (4.0) (4473):sys_time = 0 ticks
11/22 20:47:36 (4.0) (4473):Asked to write event of number 1.
11/22 20:47:36 (4.0) (4473):Asked to write event of number 4.
11/22 20:47:36 (4.0) (4473):********** Shadow Exiting(107) **********
11/22 20:49:02 (?.?) (4574):******* Standard Shadow starting up *******
11/22 20:49:02 (?.?) (4574):** $CondorVersion: 6.8.6 Sep 13 2007 $
11/22 20:49:02 (?.?) (4574):** $CondorPlatform: I386-LINUX_RH9 $
11/22 20:49:02 (?.?) (4574):*******************************************
11/22 20:49:02 (?.?) (4574):uid=0, euid=900, gid=0, egid=900
11/22 20:49:02 (?.?) (4574):Hostname = "<192.168.7.221:57320>", Job = 5.0
11/22 20:49:02 (5.0) (4574):Requesting Primary Starter
11/22 20:49:02 ( 5.0) (4574):Shadow: Request to run a job was ACCEPTED
11/22 20:49:02 (5.0) (4574):Shadow: RSC_SOCK connected, fd = 17
11/22 20:49:03 (5.0) (4574):Shadow: CLIENT_LOG connected, fd = 18
11/22 20:49:03 (5.0) (4574):My_Filesystem_Domain = " niting-w2p.corp.cdac.in"
11/22 20:49:03 (5.0) (4574):My_UID_Domain = "niting-w2p.corp.cdac.in"
11/22 20:49:10 (5.0) (4574):Shadow: Job 5.0 exited, termsig = 9, coredump = 0, retcode = 0
11/22 20:49:10 (5.0) (4574):Shadow: Job was kicked off without a checkpoint
11/22 20:49:10 (5.0) (4574):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
11/22 20:49:10 (5.0) (4574):Trying to unlink /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
11/22 20:49:10 (5.0) (4574):user_time = 0 ticks
11/22 20:49:10 (5.0) (4574):sys_time = 0 ticks
11/22 20:49:10 ( 5.0) (4574):Asked to write event of number 1.
11/22 20:49:10 (5.0) (4574):Asked to write event of number 4.
11/22 20:49:10 (5.0) (4574):********** Shadow Exiting(107) **********
11/22 20:59:02 (?.?) (4621):******* Standard Shadow starting up *******
11/22 20:59:02 (?.?) (4621):** $CondorVersion: 6.8.6 Sep 13 2007 $
11/22 20:59:02 (?.?) (4621):** $CondorPlatform: I386-LINUX_RH9 $
11/22 20:59:02 (?.?) (4621):*******************************************
11/22 20:59:03 (?.?) (4621):uid=0, euid=900, gid=0, egid=900
11/22 20:59:03 (?.?) (4621):Hostname = "<192.168.7.221:57320>", Job = 5.0
11/22 20:59:03 (5.0) (4621):Requesting Primary Starter
11/22 20:59:03 ( 5.0) (4621):Shadow: Request to run a job was ACCEPTED
11/22 20:59:03 (5.0) (4621):Shadow: RSC_SOCK connected, fd = 17
11/22 20:59:03 (5.0) (4621):Shadow: CLIENT_LOG connected, fd = 18
11/22 20:59:03 (5.0) (4621):My_Filesystem_Domain = " niting-w2p.corp.cdac.in"
11/22 20:59:03 (5.0) (4621):My_UID_Domain = "niting-w2p.corp.cdac.in"
11/22 20:59:10 (5.0) (4621):Shadow: Job 5.0 exited, termsig = 9, coredump = 0, retcode = 0
11/22 20:59:10 (5.0) (4621):Shadow: Job was kicked off without a checkpoint
11/22 20:59:10 (5.0) (4621):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
11/22 20:59:10 (5.0) (4621):Trying to unlink /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
11/22 20:59:10 (5.0) (4621):user_time = 1 ticks
11/22 20:59:11 (5.0) (4621):sys_time = 0 ticks
11/22 20:59:11 ( 5.0) (4621):Asked to write event of number 1.
11/22 20:59:11 (5.0) (4621):Asked to write event of number 4.
11/22 20:59:11 (5.0) (4621):********** Shadow Exiting(107) **********
=====================================================================================
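A minimal sketch, assuming the ShadowLog line format shown in the excerpt above, of how the "Job ... exited" lines could be pulled out and the terminating signal named. The function names are hypothetical and not part of Condor; on Linux, signal 9 is SIGKILL.

```python
import re
import signal

# Pattern for the exit line format seen in the ShadowLog excerpt above.
EXIT_RE = re.compile(
    r"Shadow: Job (?P<job>\d+\.\d+) exited, "
    r"termsig = (?P<termsig>\d+), coredump = (?P<coredump>\d+), "
    r"retcode = (?P<retcode>\d+)"
)


def signal_name(number):
    """Map a signal number to its name, falling back to a plain label."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "signal %d" % number


def summarize_shadow_exits(log_text):
    """Return one dict per 'Job ... exited' line found in log_text."""
    exits = []
    for line in log_text.splitlines():
        m = EXIT_RE.search(line)
        if m is None:
            continue
        termsig = int(m.group("termsig"))
        exits.append({
            "job": m.group("job"),
            "termsig": termsig,
            "signal_name": signal_name(termsig),
            "retcode": int(m.group("retcode")),
        })
    return exits


# Lines copied from the excerpt above.
sample = """\
11/22 20:49:10 (5.0) (4574):Shadow: Job 5.0 exited, termsig = 9, coredump = 0, retcode = 0
11/22 20:49:10 (5.0) (4574):Shadow: Job was kicked off without a checkpoint
11/22 20:59:10 (5.0) (4621):Shadow: Job 5.0 exited, termsig = 9, coredump = 0, retcode = 0
"""

for entry in summarize_shadow_exits(sample):
    print(entry["job"], entry["signal_name"], entry["retcode"])
```

This only summarizes what the shadow recorded; it does not say which daemon sent the signal or why.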
startd.log
==================================
11/22 20:47:35 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/22 20:47:36 Starter pid 4474 exited with status 0
11/22 20:47:36 vm1: State change: starter exited
11/22 20:47:36 vm1: State change: No preempting claim, returning to owner
11/22 20:47:36 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/22 20:47:36 vm1: State change: IS_OWNER is false
11/22 20:47:36 vm1: Changing state: Owner -> Unclaimed
11/22 20:47:37 DaemonCore: Command received via UDP from host < 192.168.7.221:32863 >
11/22 20:47:37 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/22 20:47:37 Warning: can't find resource with ClaimId (< 192.168.7.221:57320>#1195742198#13#...)
11/22 20:48:56 DaemonCore: Command received via UDP from host <192.168.7.127:32845>
11/22 20:48:56 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/22 20:48:56 vm1: match_info called
11/22 20:48:57 vm1: Received match <192.168.7.221:57320>#1195742198#16#...
11/22 20:48:57 vm1: State change: match notification protocol successful
11/22 20:48:57 vm1: Changing state: Unclaimed -> Matched
11/22 20:48:57 DaemonCore: Command received via TCP from host <192.168.7.221:38154>
11/22 20:48:57 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/22 20:48:57 vm1: Request accepted.
11/22 20:48:57 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
11/22 20:48:57 vm1: State change: claiming protocol successful
11/22 20:48:57 vm1: Changing state: Matched -> Claimed
11/22 20:49:02 DaemonCore: Command received via TCP from host <192.168.7.221:38436>
11/22 20:49:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/22 20:49:02 vm1: Got activate_claim request from shadow (<192.168.7.221:38436>)
11/22 20:49:02 vm1: Remote job ID is 5.0
11/22 20:49:02 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4575
11/22 20:49:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std, "condor_starter", niting-w2p.corp.cdac.in , 0)
11/22 20:49:03 vm1: Got universe "STANDARD" (1) from request classad
11/22 20:49:03 vm1: State change: claim-activation protocol successful
11/22 20:49:03 vm1: Changing activity: Idle -> Busy
11/22 20:49:08 vm1: State change: PREEMPT is TRUE
11/22 20:49:08 vm1: Changing activity: Busy -> Retiring
11/22 20:49:08 vm1: State change: retirement ended/expired
11/22 20:49:08 vm1: State change: WANT_VACATE is FALSE
11/22 20:49:08 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Killing
11/22 20:49:10 DaemonCore: Command received via TCP from host <192.168.7.221:37844>
11/22 20:49:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/22 20:49:10 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/22 20:49:10 Starter pid 4575 exited with status 0
11/22 20:49:10 vm1: State change: starter exited
11/22 20:49:10 vm1: State change: No preempting claim, returning to owner
11/22 20:49:10 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/22 20:49:11 vm1: State change: IS_OWNER is false
11/22 20:49:11 vm1: Changing state: Owner -> Unclaimed
11/22 20:49:11 DaemonCore: Command received via UDP from host < 192.168.7.221:32878>
11/22 20:49:12 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/22 20:49:12 Warning: can't find resource with ClaimId (< 192.168.7.221:57320>#1195742198#16#...)
11/22 20:58:57 DaemonCore: Command received via UDP from host < 192.168.7.127:32861>
11/22 20:58:57 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/22 20:58:57 vm1: match_info called
11/22 20:58:57 vm1: Received match < 192.168.7.221:57320 >#1195742198#18#...
11/22 20:58:57 vm1: State change: match notification protocol successful
11/22 20:58:57 vm1: Changing state: Unclaimed -> Matched
11/22 20:58:57 DaemonCore: Command received via TCP from host < 192.168.7.221:40060>
11/22 20:58:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/22 20:58:58 vm1: Request accepted.
11/22 20:58:58 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
11/22 20:58:58 vm1: State change: claiming protocol successful
11/22 20:58:58 vm1: Changing state: Matched -> Claimed
11/22 20:59:03 DaemonCore: Command received via TCP from host < 192.168.7.221:56177>
11/22 20:59:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
11/22 20:59:03 vm1: Got activate_claim request from shadow (< 192.168.7.221:56177>)
11/22 20:59:03 vm1: Remote job ID is 5.0
11/22 20:59:03 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4622
11/22 20:59:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std, "condor_starter", niting-w2p.corp.cdac.in, 0)
11/22 20:59:03 vm1: Got universe "STANDARD" (1) from request classad
11/22 20:59:03 vm1: State change: claim-activation protocol successful
11/22 20:59:03 vm1: Changing activity: Idle -> Busy
11/22 20:59:09 vm1: State change: PREEMPT is TRUE
11/22 20:59:09 vm1: Changing activity: Busy -> Retiring
11/22 20:59:09 vm1: State change: retirement ended/expired
11/22 20:59:09 vm1: State change: WANT_VACATE is FALSE
11/22 20:59:09 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Killing
11/22 20:59:10 DaemonCore: Command received via TCP from host < 192.168.7.221:39386>
11/22 20:59:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
11/22 20:59:10 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/22 20:59:11 Starter pid 4622 exited with status 0
11/22 20:59:11 vm1: State change: starter exited
11/22 20:59:11 vm1: State change: No preempting claim, returning to owner
11/22 20:59:11 vm1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/22 20:59:11 vm1: State change: IS_OWNER is false
11/22 20:59:11 vm1: Changing state: Owner -> Unclaimed
11/22 20:59:12 DaemonCore: Command received via UDP from host < 192.168.7.221:32895>
11/22 20:59:12 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/22 20:59:12 Warning: can't find resource with ClaimId (< 192.168.7.221:57320>#1195742198#18#...)
===================================================================================
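In the StartLog above, PREEMPT becomes TRUE a few seconds after the claim goes Busy. A rough sketch, assuming the line format shown above and the year 2007 (taken from the EventTime values in globus-condor.log), of how that delay could be measured. The function name is hypothetical.

```python
import re
from datetime import datetime

# "MM/DD HH:MM:SS <slot>: <message>", as in the StartLog excerpt above.
LINE_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) (?P<slot>\w+): (?P<msg>.*)$"
)


def seconds_until_preempt(log_text, year=2007):
    """For each 'Idle -> Busy' activation, return seconds until PREEMPT was TRUE."""
    delays = []
    busy_since = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        # The log omits the year, so it is supplied by the caller.
        when = datetime.strptime(
            "%d/%s" % (year, m.group("stamp")), "%Y/%m/%d %H:%M:%S"
        )
        msg = m.group("msg")
        if msg == "Changing activity: Idle -> Busy":
            busy_since = when
        elif msg == "State change: PREEMPT is TRUE" and busy_since is not None:
            delays.append((when - busy_since).total_seconds())
            busy_since = None
    return delays


# Lines copied from the excerpt above.
sample = """\
11/22 20:49:03 vm1: Changing activity: Idle -> Busy
11/22 20:49:08 vm1: State change: PREEMPT is TRUE
11/22 20:49:08 vm1: Changing activity: Busy -> Retiring
11/22 20:59:03 vm1: Changing activity: Idle -> Busy
11/22 20:59:09 vm1: State change: PREEMPT is TRUE
"""

print(seconds_until_preempt(sample))
```

The sketch reports only the timing. Why the startd's PREEMPT expression evaluates to TRUE would still have to be checked against the policy settings on that machine.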
starter.vm1
======================================================
11/22 20:47:35     *FSM* Transitioning to state "SEND_STATUS_ALL"
11/22 20:47:35     *FSM* Executing state func "dispose_all()" [  ]
11/22 20:47:35 Sending final status for process 4.0
11/22 20:47:35 STATUS encoded as CKPT, *NOT* TRANSFERRED
11/22 20:47:35 User time = 0.000000 seconds
11/22 20:47:35 System time = 0.000000 seconds
11/22 20:47:35 Can't unlink "dir_4474/condor_exec.4.0" - errno = 2
11/22 20:47:35 Removed directory "dir_4474"
11/22 20:47:36     *FSM* Reached state "END"
11/22 20:47:36 ********* STARTER terminating normally **********
11/22 20:49:03 ********** STARTER starting up ***********
11/22 20:49:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/22 20:49:03 ** $CondorPlatform: I386-LINUX_RH9 $
11/22 20:49:03 ******************************************
11/22 20:49:03 Submitting machine is " niting-w2p.corp.cdac.in"
11/22 20:49:03 EventHandler {
11/22 20:49:03     func = 0x80e3bde
11/22 20:49:03     mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
11/22 20:49:04 }
11/22 20:49:04 Done setting resource limits
11/22 20:49:04     *FSM* Transitioning to state "GET_PROC"
11/22 20:49:04     *FSM* Executing state func "get_proc()" [  ]
11/22 20:49:04 Entering get_proc()
11/22 20:49:04 Entering get_job_info()
11/22 20:49:04 Startup Info:
11/22 20:49:04     Version Number: 1
11/22 20:49:05     Id: 5.0
11/22 20:49:05     JobClass: STANDARD
11/22 20:49:05     Uid: 503
11/22 20:49:05     Gid: 503
11/22 20:49:05     VirtPid: -1
11/22 20:49:05     SoftKillSignal: 20
11/22 20:49:05     Cmd: "/home/psegrid/NIP/nip"
11/22 20:49:05     Args: ""
11/22 20:49:05     Env: "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE= https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH= "
11/22 20:49:05     Iwd: "/home/psegrid"
11/22 20:49:05     Ckpt Wanted: TRUE
11/22 20:49:05     Is Restart: FALSE
11/22 20:49:05     Core Limit Valid: TRUE
11/22 20:49:05     Coredump Limit 0
11/22 20:49:06 User uid set to 503
11/22 20:49:06 User uid set to 503
11/22 20:49:06 User Process 5.0 {
11/22 20:49:06   cmd = /home/psegrid/NIP/nip
11/22 20:49:06   args =
11/22 20:49:06   env = GLOBUS_LOCATION=/usr/local/globus- 4.0.5/ X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY= X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE= https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878 LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4575
11/22 20:49:06   local_dir = dir_4575
11/22 20:49:06   cur_ckpt = dir_4575/condor_exec.5.0
11/22 20:49:06   core_name = (either 'core' or 'core.<pid>')
11/22 20:49:06   uid = 503, gid = 503
11/22 20:49:06   v_pid = -1
11/22 20:49:06   pid = (NOT CURRENTLY EXECUTING)
11/22 20:49:06   exit_status_valid = FALSE
11/22 20:49:07   exit_status = (NEVER BEEN EXECUTED)
11/22 20:49:07   ckpt_wanted = TRUE
11/22 20:49:07   coredump_limit_exists = TRUE
11/22 20:49:07   coredump_limit = 0
11/22 20:49:07   soft_kill_sig = 20
11/22 20:49:07   job_class = STANDARD
11/22 20:49:07   state = NEW
11/22 20:49:07   new_ckpt_created = FALSE
11/22 20:49:07   ckpt_transferred = FALSE
11/22 20:49:07   core_created = FALSE
11/22 20:49:07   core_transferred = FALSE
11/22 20:49:07   exit_requested = FALSE
11/22 20:49:07   image_size = -1 blocks
11/22 20:49:08   user_time = 0
11/22 20:49:08   sys_time = 0
11/22 20:49:08   guaranteed_user_time = 0
11/22 20:49:08   guaranteed_sys_time = 0
11/22 20:49:08 }
11/22 20:49:08     *FSM* Transitioning to state "GET_EXEC"
11/22 20:49:08     *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
11/22 20:49:08 Entering get_exec()
11/22 20:49:08 Executable is located on submitting host
11/22 20:49:08     *FSM* Got asynchronous event "DIE"
11/22 20:49:09     *FSM* Executing transition function "req_die"
11/22 20:49:09 req_exit_all: Proc -1 in state NEW
11/22 20:49:09     *FSM* Transitioning to state "TERMINATE"
11/22 20:49:09     *FSM* Executing state func "terminate_all()" [  ]
11/22 20:49:09     *FSM* Transitioning to state "SEND_STATUS_ALL"
11/22 20:49:09     *FSM* Executing state func "dispose_all()" [  ]
11/22 20:49:09 Sending final status for process 5.0
11/22 20:49:09 STATUS encoded as CKPT, *NOT* TRANSFERRED
11/22 20:49:09 User time = 0.000000 seconds
11/22 20:49:09 System time = 0.000000 seconds
11/22 20:49:10 Can't unlink "dir_4575/condor_exec.5.0" - errno = 2
11/22 20:49:10 Removed directory "dir_4575"
11/22 20:49:10     *FSM* Reached state "END"
11/22 20:49:10 ********* STARTER terminating normally **********
11/22 20:59:03 ********** STARTER starting up ***********
11/22 20:59:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/22 20:59:03 ** $CondorPlatform: I386-LINUX_RH9 $
11/22 20:59:03 ******************************************
11/22 20:59:03 Submitting machine is "niting-w2p.corp.cdac.in"
11/22 20:59:04 EventHandler {
11/22 20:59:04     func = 0x80e3bde
11/22 20:59:04     mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
11/22 20:59:04 }
11/22 20:59:04 Done setting resource limits
11/22 20:59:05     *FSM* Transitioning to state "GET_PROC"
11/22 20:59:05     *FSM* Executing state func "get_proc()" [  ]
11/22 20:59:05 Entering get_proc()
11/22 20:59:05 Entering get_job_info()
11/22 20:59:05 Startup Info:
11/22 20:59:05     Version Number: 1
11/22 20:59:05     Id: 5.0
11/22 20:59:05     JobClass: STANDARD
11/22 20:59:05     Uid: 503
11/22 20:59:05     Gid: 503
11/22 20:59:05     VirtPid: -1
11/22 20:59:05     SoftKillSignal: 20
11/22 20:59:06     Cmd: "/home/psegrid/NIP/nip"
11/22 20:59:06     Args: ""
11/22 20:59:06     Env: "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE= https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH= "
11/22 20:59:06     Iwd: "/home/psegrid"
11/22 20:59:06     Ckpt Wanted: TRUE
11/22 20:59:06     Is Restart: FALSE
11/22 20:59:06     Core Limit Valid: TRUE
11/22 20:59:06     Coredump Limit 0
11/22 20:59:06 User uid set to 503
11/22 20:59:06 User uid set to 503
11/22 20:59:06 User Process 5.0 {
11/22 20:59:06   cmd = /home/psegrid/NIP/nip
11/22 20:59:06   args =
11/22 20:59:06   env = GLOBUS_LOCATION=/usr/local/globus- 4.0.5/ X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY= X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE= https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878 LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4622
11/22 20:59:07   local_dir = dir_4622
11/22 20:59:07   cur_ckpt = dir_4622/condor_exec.5.0
11/22 20:59:07   core_name = (either 'core' or 'core.<pid>')
11/22 20:59:07   uid = 503, gid = 503
11/22 20:59:07   v_pid = -1
11/22 20:59:07   pid = (NOT CURRENTLY EXECUTING)
11/22 20:59:07   exit_status_valid = FALSE
11/22 20:59:07   exit_status = (NEVER BEEN EXECUTED)
11/22 20:59:07   ckpt_wanted = TRUE
11/22 20:59:07   coredump_limit_exists = TRUE
11/22 20:59:07   coredump_limit = 0
11/22 20:59:07   soft_kill_sig = 20
11/22 20:59:07   job_class = STANDARD
11/22 20:59:08   state = NEW
11/22 20:59:08   new_ckpt_created = FALSE
11/22 20:59:08   ckpt_transferred = FALSE
11/22 20:59:08   core_created = FALSE
11/22 20:59:08   core_transferred = FALSE
11/22 20:59:08   exit_requested = FALSE
11/22 20:59:08   image_size = -1 blocks
11/22 20:59:08   user_time = 0
11/22 20:59:08   sys_time = 0
11/22 20:59:08   guaranteed_user_time = 0
11/22 20:59:08   guaranteed_sys_time = 0
11/22 20:59:08 }
11/22 20:59:08     *FSM* Transitioning to state "GET_EXEC"
11/22 20:59:09     *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
11/22 20:59:09 Entering get_exec()
11/22 20:59:09     *FSM* Got asynchronous event "DIE"
11/22 20:59:09     *FSM* Executing transition function "req_die"
11/22 20:59:09 req_exit_all: Proc -1 in state NEW
11/22 20:59:09     *FSM* Transitioning to state "TERMINATE"
11/22 20:59:09     *FSM* Executing state func "terminate_all()" [  ]
11/22 20:59:09     *FSM* Transitioning to state "SEND_STATUS_ALL"
11/22 20:59:10     *FSM* Executing state func "dispose_all()" [  ]
11/22 20:59:10 Sending final status for process 5.0
11/22 20:59:10 STATUS encoded as CKPT, *NOT* TRANSFERRED
11/22 20:59:10 User time = 0.000000 seconds
11/22 20:59:10 System time = 0.000000 seconds
11/22 20:59:10 Can't unlink "dir_4622/condor_exec.5.0" - errno = 2
11/22 20:59:10 Removed directory "dir_4622"
11/22 20:59:10     *FSM* Reached state "END"
11/22 20:59:10 ********* STARTER terminating normally **********
=====================================================================
globus-condor.log
==============================================================
<c>
    <a n="MyType"><s>JobAbortedEvent</s></a>
    <a n="EventTypeNumber"><i>9</i></a>
    <a n="MyType"><s>JobAbortedEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:48:10</s></a>
    <a n="Cluster"><i>4</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Reason"><s>via condor_rm (by user psegrid)</s></a>
</c>
<c>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTypeNumber"><i>0</i></a>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:48:55</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="SubmitHost"><s>&lt;192.168.7.221:42898&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;192.168.7.221:57320&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>2.570000000000000E+02</r></a>
    <a n="ReceivedBytes"><r> 6.650000000000000E+02</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;192.168.7.221:57320&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r> 2.490000000000000E+02</r></a>
    <a n="ReceivedBytes"><r>5.970000000000000E+02</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
====================================================================
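The XML user log above is a run of `<c>` elements with no enclosing root. A minimal sketch, assuming that layout and assuming `v="t"` marks a true boolean (only `v="f"` appears above), of how the events could be read into dictionaries. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_user_log(xml_text):
    """Parse a log_xml-style user log (a run of <c> elements) into dicts."""
    # The file has no single root element, so wrap it before parsing.
    root = ET.fromstring("<log>" + xml_text + "</log>")
    events = []
    for classad in root.findall("c"):
        event = {}
        for attr in classad.findall("a"):
            value = attr[0]  # one typed child: <s>, <i>, <r> or <b>
            name = attr.get("n")
            if value.tag == "i":
                event[name] = int(value.text)
            elif value.tag == "r":
                event[name] = float(value.text)
            elif value.tag == "b":
                event[name] = value.get("v") == "t"  # assumption: "t" means true
            else:
                event[name] = value.text
        events.append(event)
    return events


# Two events abridged from the log above.
sample = """
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="ExecuteHost"><s>&lt;192.168.7.221:57320&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
    <a n="Cluster"><i>5</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
"""

events = parse_user_log(sample)
for event in events:
    print(event["EventTime"], event["MyType"], event.get("Checkpointed"))
```

Read this way, the log shows each ExecuteEvent for cluster 5 followed in the same second by a JobEvictedEvent with no checkpoint, which matches the shadow and startd logs.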

Nitin


On Nov 20, 2007 9:24 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:

>        Last successful match: Tue Nov 20 22:36:21 2007


This indicates that the job is successfully getting matched to a
machine.  Something must be going wrong when Condor tries to run the
job on that machine.  Look for clues about what is going wrong here:

The "user log": /usr/local/globus-4.0.5//var/globus-condor.log
The ShadowLog (condor_config_val SHADOW_LOG)
The StartLog (condor_config_val STARTD_LOG)
The StarterLog (condor_config_val STARTER_LOG)

I hope that helps!

--Dan
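The log locations listed in the reply above can be looked up in one pass. This is a sketch only: it assumes `condor_config_val` is on the PATH, and `locate_logs` and `run_config_val` are hypothetical helper names. The lookup is injectable so the example can be tried on a machine without Condor; the paths in the stand-in table are made up.

```python
import subprocess

# Configuration macros named in the reply above.
LOG_MACROS = ["SHADOW_LOG", "STARTD_LOG", "STARTER_LOG"]


def run_config_val(macro):
    """Ask the local Condor installation for one configuration value."""
    result = subprocess.run(
        ["condor_config_val", macro],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def locate_logs(lookup=run_config_val, macros=LOG_MACROS):
    """Return {macro: path}; `lookup` can be replaced for testing."""
    return {macro: lookup(macro) for macro in macros}


# Stand-in lookup with hypothetical paths, not taken from a real pool.
fake = {
    "SHADOW_LOG": "/tmp/ShadowLog",
    "STARTD_LOG": "/tmp/StartLog",
    "STARTER_LOG": "/tmp/StarterLog",
}
paths = locate_logs(lookup=fake.__getitem__)
for macro, path in paths.items():
    print(macro, "->", path)
```

On a real submit or execute machine, calling `locate_logs()` with no arguments would run `condor_config_val` once per macro.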

Nitin Gavhane wrote:

> Hello all,
> I am submitting a job through Globus to Condor, but the job stays in the idle
> state. The job details are as follows.
> ================================================
> *The Job Description Generated by GRAM is as follows *
>
> [condor@niting-w2p etc]$ cat /tmp/condor_job_description
> #
> # description file for condor submission
> #
> Universe = standard
> Notification = Never
> Executable = /home/psegrid/NIP/nip
> Requirements = OpSys == "LINUX"  && Arch == "INTEL"
> Environment = GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?7f408200-9789-11dc-9f1a-b41f06e1e2ea;LD_LIBRARY_PATH=
> Arguments =
> InitialDir = /home/psegrid
> Input = /dev/null
> Log = /usr/local/globus-4.0.5//var/globus-condor.log
> log_xml = True
> #Extra attributes specified by client
>
> Output = /home/psegrid/stdout
> Error = /home/psegrid/stderr
> queue 1
>
> =======================================================================
> *[psegrid@niting-w2p NIP]$ condor_q -better-analyze*
>
>
> -- Submitter: niting-w2p.corp.cdac.in : <192.168.7.221:42993> : niting-w2p.corp.cdac.in
> ---
> 005.000:  Run analysis summary.  Of 7 machines,
>      4 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match but are serving users with a better priority in the pool
>      3 match but reject the job for unknown reasons
>      0 match but will not currently preempt their existing job
>      0 are available to run your job
>        Last successful match: Tue Nov 20 22:36:21 2007
>
> The Requirements expression for your job is:
>
> ( target.OpSys == "LINUX" && target.Arch == "INTEL" ) &&
> ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
> ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
> ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
>
>    Condition                                                                        Machines Matched    Suggestion
>    ---------                                                                        ----------------    ----------
> 1   target.Arch == "INTEL"                                                          3
> 2   target.OpSys == "LINUX"                                                         7
> 3   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )      7
> 4   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )   7
> 5   ( target.Disk >= 20000 )                                                        7
> 6   ( ( 1024 * target.Memory ) >= 20000 )                                           7
>
>
>
>
> ==========================================================
> *[psegrid@niting-w2p NIP]$ condor_status*
>
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
>
> vm1@niting-w2 LINUX       INTEL  Unclaimed  Idle       0.000   469  0+00:05:26
> vm2@niting-w2 LINUX       INTEL  Unclaimed  Idle       0.140   469  0+00:26:42
> sskadam-w2p.c LINUX       INTEL  Unclaimed  Idle       0.000   248  0+00:44:38
> vm1@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.400   753  0+00:30:04
> vm2@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:05
> vm3@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:06
> vm4@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:27
>
>                     Total Owner Claimed Unclaimed Matched Preempting Backfill
>
>         INTEL/LINUX     3     0       0         3       0          0        0
>        X86_64/LINUX     4     0       0         4       0          0        0
>
>               Total     7     0       0         7       0          0        0
> ==============================================================
> *The DAEMON details for all three machines are as follows *
>
> [condor@niting-w2p etc]$ ./test.sh
> current file: condor_config
> ##  checkpoint server isn't available or USE_CKPT_SERVER is set to
> USE_CKPT_SERVER = True
> CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
> ##  checkpoint server?  If False, the CKPT_SERVER_HOST set on
> ##  the submit machine is used.  Otherwise, the CKPT_SERVER_HOST set
> STARTER_CHOOSES_CKPT_SERVER = True
> #WALL_CLOCK_CKPT_INTERVAL = 3600
> ##  setting is only used if USE_CKPT_SERVER (from above) is True.
> #COMPRESS_PERIODIC_CKPT = False
> #COMPRESS_VACATE_CKPT = False
> #SLOW_CKPT_SPEED = 0
> DAEMON_LIST                     = MASTER, STARTD, SCHEDD
> #DC_DAEMON_LIST = \
> =============
> current file: psewebs-w2p.local
> USE_CKPT_SERVER = True
> CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
> DAEMON_LIST = MASTER, STARTD, SCHEDD
> DAEMON_LIST   = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
> =============
> current file: niting-w2p.local
> USE_CKPT_SERVER = True
> CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
> DAEMON_LIST = MASTER, STARTD, SCHEDD
> =============
> current file: sskadam-w2p.local
> USE_CKPT_SERVER = True
> CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
> DAEMON_LIST = MASTER, STARTD, SCHEDD
> ===============================
>
> Please tell me what is wrong with the job submission.
> Thank you.
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Nitin M. Gavhane
> MS in Advanced Software Technologies
> International Institute of Information Technology
> P-14,Hinjewadi,Pune, India.
> ---------------------------------------------------------------------------------------------------------------------------
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Condor-users mailing list
>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>subject: Unsubscribe
>You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>The archives can be found at:
>https://lists.cs.wisc.edu/archive/condor-users/
>
>



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Nitin M. Gavhane
MS in Advanced Software Technologies
International Institute of Information Technology
P-14,Hinjewadi,Pune, India.
---------------------------------------------------------------------------------------------------------------------------