
Re: [Condor-users] Problem Condor Job Stays Idle Because of target.CkptArch



The PREEMPT expression is:
[condor@niting-w2p ~]$ condor_config_val PREEMPT
( ((Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) > 10 * 60)) || (SUSPEND && (WANT_SUSPEND == False)) )
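Read as plain boolean logic, the expression has two clauses. A slot that has only just gone Busy cannot satisfy the first one (it is not Suspended), so PREEMPT could only come out TRUE through the second: SUSPEND being true while WANT_SUSPEND is False. Below is a minimal sketch of that reading. It is a simplification that ignores ClassAd UNDEFINED/ERROR values, and the function and argument names are made up for illustration; they are not part of Condor.

```python
def preempt(activity, secs_in_activity, suspend, want_suspend):
    """Plain-boolean model of the PREEMPT expression quoted above."""
    # Clause 1: suspended for more than 10 minutes.
    suspended_too_long = activity == "Suspended" and secs_in_activity > 10 * 60
    # Clause 2: SUSPEND is true but suspending is not wanted.
    suspend_without_want = suspend and (want_suspend is False)
    return suspended_too_long or suspend_without_want


# A slot that has been Busy for 5 seconds, as in the StartLog excerpt.
print(preempt("Busy", 5, suspend=False, want_suspend=False))        # False
print(preempt("Busy", 5, suspend=True, want_suspend=False))         # True
print(preempt("Busy", 5, suspend=True, want_suspend=True))          # False
print(preempt("Suspended", 601, suspend=False, want_suspend=True))  # True
```

If this reading holds, the values of SUSPEND and WANT_SUSPEND on the execute machine would be the next things to check.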

nitin

On Nov 28, 2007 5:28 AM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
Here is the clue about why the job is not running for long:

11/22 20:49:03 vm1: Changing activity: Idle -> Busy
11/22 20:49:08 vm1: State change: PREEMPT is TRUE
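The gap between those two StartLog lines can be measured mechanically. The sketch below is only an illustration: the function name is hypothetical, and the year is an assumption passed in by the caller because the log timestamps do not carry one.

```python
from datetime import datetime


def seconds_busy_before_preempt(lines, year=2007):
    """Return the gap in seconds between each 'Idle -> Busy' line
    and the next 'PREEMPT is TRUE' line."""
    gaps = []
    busy_since = None
    for line in lines:
        line = line.lstrip("> ").strip()  # drop email quote markers
        try:
            # The first 14 characters look like "11/22 20:49:03".
            stamp = datetime.strptime(f"{year}/{line[:14]}", "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line or non-log text
        if "Changing activity: Idle -> Busy" in line:
            busy_since = stamp
        elif "PREEMPT is TRUE" in line and busy_since is not None:
            gaps.append((stamp - busy_since).total_seconds())
            busy_since = None
    return gaps


sample = [
    "11/22 20:49:03 vm1: Changing activity: Idle -> Busy",
    "11/22 20:49:08 vm1: State change: PREEMPT is TRUE",
]
print(seconds_busy_before_preempt(sample))  # [5.0]
```

A gap of a few seconds means the job is evicted before it has really started running.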

What is your PREEMPT expression?

condor_config_val PREEMPT

--Dan

Nitin Gavhane wrote:
> Hello Dan,
> The following are snapshots of the log files; please take a look at them.
>
> *Shadow.log*
> =================================================================================
> 11/22 20:47:29 (4.0) (4473):My_UID_Domain = "niting-w2p.corp.cdac.in
> 11/22 20:47:35 (4.0) (4473):Shadow: Job 4.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:47:35 (4.0) (4473):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:47:35 (4.0) (4473):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp'
> 11/22 20:47:36 (4.0) (4473):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp
> 11/22 20:47:36 (4.0) (4473):user_time = 1 ticks
> 11/22 20:47:36 (4.0) (4473):sys_time = 0 ticks
> 11/22 20:47:36 (4.0) (4473):Asked to write event of number 1.
> 11/22 20:47:36 (4.0) (4473):Asked to write event of number 4.
> 11/22 20:47:36 (4.0) (4473):********** Shadow Exiting(107) **********
> 11/22 20:49:02 (?.?) (4574):******* Standard Shadow starting up *******
> 11/22 20:49:02 (?.?) (4574):** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:49:02 (?.?) (4574):** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:49:02 (?.?) (4574):*******************************************
> 11/22 20:49:02 (?.?) (4574):uid=0, euid=900, gid=0, egid=900
> 11/22 20:49:02 (?.?) (4574):Hostname = "<192.168.7.221:57320>", Job = 5.0
> 11/22 20:49:02 (5.0) (4574):Requesting Primary Starter
> 11/22 20:49:02 (5.0) (4574):Shadow: Request to run a job was ACCEPTED
> 11/22 20:49:02 (5.0) (4574):Shadow: RSC_SOCK connected, fd = 17
> 11/22 20:49:03 (5.0) (4574):Shadow: CLIENT_LOG connected, fd = 18
> 11/22 20:49:03 (5.0) (4574):My_Filesystem_Domain = "
> 11/22 20:49:03 (5.0) (4574):My_UID_Domain = "niting-w2p.corp.cdac.in
> 11/22 20:49:10 (5.0) (4574):Shadow: Job 5.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:49:10 (5.0) (4574):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:49:10 (5.0) (4574):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
> 11/22 20:49:10 (5.0) (4574):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
> 11/22 20:49:10 (5.0) (4574):user_time = 0 ticks
> 11/22 20:49:10 (5.0) (4574):sys_time = 0 ticks
> 11/22 20:49:10 (5.0) (4574):Asked to write event of number 1.
> 11/22 20:49:10 (5.0) (4574):Asked to write event of number 4.
> 11/22 20:49:10 (5.0) (4574):********** Shadow Exiting(107) **********
> 11/22 20:59:02 (?.?) (4621):******* Standard Shadow starting up *******
> 11/22 20:59:02 (?.?) (4621):** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:59:02 (?.?) (4621):** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:59:02 (?.?) (4621):*******************************************
> 11/22 20:59:03 (?.?) (4621):uid=0, euid=900, gid=0, egid=900
> 11/22 20:59:03 (?.?) (4621):Hostname = "<192.168.7.221:57320>", Job = 5.0
> 11/22 20:59:03 (5.0) (4621):Requesting Primary Starter
> 11/22 20:59:03 (5.0) (4621):Shadow: Request to run a job was ACCEPTED
> 11/22 20:59:03 (5.0) (4621):Shadow: RSC_SOCK connected, fd = 17
> 11/22 20:59:03 (5.0) (4621):Shadow: CLIENT_LOG connected, fd = 18
> 11/22 20:59:03 (5.0) (4621):My_Filesystem_Domain = "
> 11/22 20:59:03 (5.0) (4621):My_UID_Domain = "niting-w2p.corp.cdac.in
> 11/22 20:59:10 (5.0) (4621):Shadow: Job 5.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:59:10 (5.0) (4621):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:59:10 (5.0) (4621):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
> 11/22 20:59:10 (5.0) (4621):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
> 11/22 20:59:10 (5.0) (4621):user_time = 1 ticks
> 11/22 20:59:11 (5.0) (4621):sys_time = 0 ticks
> 11/22 20:59:11 (5.0) (4621):Asked to write event of number 1.
> 11/22 20:59:11 (5.0) (4621):Asked to write event of number 4.
> 11/22 20:59:11 (5.0) (4621):********** Shadow Exiting(107) **********
> =====================================================================================
>
> *startd.log*
> ==================================
> 11/22 20:47:35 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
> 11/22 20:47:36 Starter pid 4474 exited with status 0
> 11/22 20:47:36 vm1: State change: starter exited
> 11/22 20:47:36 vm1: State change: No preempting claim, returning to owner
> 11/22 20:47:36 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:47:36 vm1: State change: IS_OWNER is false
> 11/22 20:47:36 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:47:37 DaemonCore: Command received via UDP from host <192.168.7.221:32863>
> 11/22 20:47:37 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:47:37 Warning: can't find resource with ClaimId (<192.168.7.221:57320>#1195742198#13#...)
> 11/22 20:48:56 DaemonCore: Command received via UDP from host <192.168.7.127:32845>
> 11/22 20:48:56 DaemonCore: received command 440 (MATCH_INFO), calling
> handler (command_match_info)
> 11/22 20:48:56 vm1: match_info called
> 11/22 20:48:57 vm1: Received match <192.168.7.221:57320>#1195742198#16#...
> 11/22 20:48:57 vm1: State change: match notification protocol successful
> 11/22 20:48:57 vm1: Changing state: Unclaimed -> Matched
> 11/22 20:48:57 DaemonCore: Command received via TCP from host <192.168.7.221:38154>
> 11/22 20:48:57 DaemonCore: received command 442 (REQUEST_CLAIM),
> calling handler (command_request_claim)
> 11/22 20:48:57 vm1: Request accepted.
> 11/22 20:48:57 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
> 11/22 20:48:57 vm1: State change: claiming protocol successful
> 11/22 20:48:57 vm1: Changing state: Matched -> Claimed
> 11/22 20:49:02 DaemonCore: Command received via TCP from host <192.168.7.221:38436>
> 11/22 20:49:02 DaemonCore: received command 444 (ACTIVATE_CLAIM),
> calling handler (command_activate_claim)
> 11/22 20:49:02 vm1: Got activate_claim request from shadow (<192.168.7.221:38436>)
> 11/22 20:49:02 vm1: Remote job ID is 5.0
> 11/22 20:49:02 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4575
> 11/22 20:49:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std,
> "condor_starter", niting-w2p.corp.cdac.in, 0)
> 11/22 20:49:03 vm1: Got universe "STANDARD" (1) from request classad
> 11/22 20:49:03 vm1: State change: claim-activation protocol successful
> 11/22 20:49:03 vm1: Changing activity: Idle -> Busy
> 11/22 20:49:08 vm1: State change: PREEMPT is TRUE
> 11/22 20:49:08 vm1: Changing activity: Busy -> Retiring
> 11/22 20:49:08 vm1: State change: retirement ended/expired
> 11/22 20:49:08 vm1: State change: WANT_VACATE is FALSE
> 11/22 20:49:08 vm1: Changing state and activity: Claimed/Retiring ->
> Preempting/Killing
> 11/22 20:49:10 DaemonCore: Command received via TCP from host <192.168.7.221:37844>
> 11/22 20:49:10 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 11/22 20:49:10 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
> 11/22 20:49:10 Starter pid 4575 exited with status 0
> 11/22 20:49:10 vm1: State change: starter exited
> 11/22 20:49:10 vm1: State change: No preempting claim, returning to owner
> 11/22 20:49:10 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:49:11 vm1: State change: IS_OWNER is false
> 11/22 20:49:11 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:49:11 DaemonCore: Command received via UDP from host <192.168.7.221:32878>
> 11/22 20:49:12 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:49:12 Warning: can't find resource with ClaimId (<192.168.7.221:57320>#1195742198#16#...)
> 11/22 20:58:57 DaemonCore: Command received via UDP from host <192.168.7.127:32861>
> 11/22 20:58:57 DaemonCore: received command 440 (MATCH_INFO), calling
> handler (command_match_info)
> 11/22 20:58:57 vm1: match_info called
> 11/22 20:58:57 vm1: Received match <192.168.7.221:57320>#1195742198#18#...
> 11/22 20:58:57 vm1: State change: match notification protocol successful
> 11/22 20:58:57 vm1: Changing state: Unclaimed -> Matched
> 11/22 20:58:57 DaemonCore: Command received via TCP from host <192.168.7.221:40060>
> 11/22 20:58:58 DaemonCore: received command 442 (REQUEST_CLAIM),
> calling handler (command_request_claim)
> 11/22 20:58:58 vm1: Request accepted.
> 11/22 20:58:58 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
> 11/22 20:58:58 vm1: State change: claiming protocol successful
> 11/22 20:58:58 vm1: Changing state: Matched -> Claimed
> 11/22 20:59:03 DaemonCore: Command received via TCP from host <192.168.7.221:56177>
> 11/22 20:59:03 DaemonCore: received command 444 (ACTIVATE_CLAIM),
> calling handler (command_activate_claim)
> 11/22 20:59:03 vm1: Got activate_claim request from shadow (<192.168.7.221:56177>)
> 11/22 20:59:03 vm1: Remote job ID is 5.0
> 11/22 20:59:03 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4622
> 11/22 20:59:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std,
> "condor_starter", niting-w2p.corp.cdac.in, 0)
> 11/22 20:59:03 vm1: Got universe "STANDARD" (1) from request classad
> 11/22 20:59:03 vm1: State change: claim-activation protocol successful
> 11/22 20:59:03 vm1: Changing activity: Idle -> Busy
> 11/22 20:59:09 vm1: State change: PREEMPT is TRUE
> 11/22 20:59:09 vm1: Changing activity: Busy -> Retiring
> 11/22 20:59:09 vm1: State change: retirement ended/expired
> 11/22 20:59:09 vm1: State change: WANT_VACATE is FALSE
> 11/22 20:59:09 vm1: Changing state and activity: Claimed/Retiring ->
> Preempting/Killing
> 11/22 20:59:10 DaemonCore: Command received via TCP from host <192.168.7.221:39386>
> 11/22 20:59:10 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 11/22 20:59:10 vm1: Got KILL_FRGN_JOB while in Preempting state,
> ignoring.
> 11/22 20:59:11 Starter pid 4622 exited with status 0
> 11/22 20:59:11 vm1: State change: starter exited
> 11/22 20:59:11 vm1: State change: No preempting claim, returning to owner
> 11/22 20:59:11 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:59:11 vm1: State change: IS_OWNER is false
> 11/22 20:59:11 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:59:12 DaemonCore: Command received via UDP from host <192.168.7.221:32895>
> 11/22 20:59:12 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:59:12 Warning: can't find resource with ClaimId (<192.168.7.221:57320>#1195742198#18#...)
> ===================================================================================
> starter.vm1
> ======================================================
> 11/22 20:47:35     *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:47:35     *FSM* Executing state func "dispose_all()" [  ]
> 11/22 20:47:35 Sending final status for process 4.0
> 11/22 20:47:35 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:47:35 User time = 0.000000 seconds
> 11/22 20:47:35 System time = 0.000000 seconds
> 11/22 20:47:35 Can't unlink "dir_4474/condor_exec.4.0" - errno = 2
> 11/22 20:47:35 Removed directory "dir_4474"
> 11/22 20:47:36     *FSM* Reached state "END"
> 11/22 20:47:36 ********* STARTER terminating normally **********
> 11/22 20:49:03 ********** STARTER starting up ***********
> 11/22 20:49:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:49:03 ** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:49:03 ******************************************
> 11/22 20:49:03 Submitting machine is "niting-w2p.corp.cdac.in
> 11/22 20:49:03 EventHandler {
> 11/22 20:49:03     func = 0x80e3bde
> 11/22 20:49:03     mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2
> SIGCHLD SIGTSTP
> 11/22 20:49:04 }
> 11/22 20:49:04 Done setting resource limits
> 11/22 20:49:04     *FSM* Transitioning to state "GET_PROC"
> 11/22 20:49:04     *FSM* Executing state func "get_proc()" [  ]
> 11/22 20:49:04 Entering get_proc()
> 11/22 20:49:04 Entering get_job_info()
> 11/22 20:49:04 Startup Info:
> 11/22 20:49:04     Version Number: 1
> 11/22 20:49:05     Id: 5.0
> 11/22 20:49:05     JobClass: STANDARD
> 11/22 20:49:05     Uid: 503
> 11/22 20:49:05     Gid: 503
> 11/22 20:49:05     VirtPid: -1
> 11/22 20:49:05     SoftKillSignal: 20
> 11/22 20:49:05     Cmd: "/home/psegrid/NIP/nip"
> 11/22 20:49:05     Args: ""
> 11/22 20:49:05     Env:
> "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH="
> 11/22 20:49:05     Iwd: "/home/psegrid"
> 11/22 20:49:05     Ckpt Wanted: TRUE
> 11/22 20:49:05     Is Restart: FALSE
> 11/22 20:49:05     Core Limit Valid: TRUE
> 11/22 20:49:05     Coredump Limit 0
> 11/22 20:49:06 User uid set to 503
> 11/22 20:49:06 User uid set to 503
> 11/22 20:49:06 User Process 5.0 {
> 11/22 20:49:06   cmd = /home/psegrid/NIP/nip
> 11/22 20:49:06   args =
> 11/22 20:49:06   env = GLOBUS_LOCATION=/usr/local/globus-4.0.5/
> X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY=
> X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid
> SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch
> JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878
> LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE
> CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4575
> 11/22 20:49:06   local_dir = dir_4575
> 11/22 20:49:06   cur_ckpt = dir_4575/condor_exec.5.0
> 11/22 20:49:06   core_name = (either 'core' or 'core.<pid>')
> 11/22 20:49:06   uid = 503, gid = 503
> 11/22 20:49:06   v_pid = -1
> 11/22 20:49:06   pid = (NOT CURRENTLY EXECUTING)
> 11/22 20:49:06   exit_status_valid = FALSE
> 11/22 20:49:07   exit_status = (NEVER BEEN EXECUTED)
> 11/22 20:49:07   ckpt_wanted = TRUE
> 11/22 20:49:07   coredump_limit_exists = TRUE
> 11/22 20:49:07   coredump_limit = 0
> 11/22 20:49:07   soft_kill_sig = 20
> 11/22 20:49:07   job_class = STANDARD
> 11/22 20:49:07   state = NEW
> 11/22 20:49:07   new_ckpt_created = FALSE
> 11/22 20:49:07   ckpt_transferred = FALSE
> 11/22 20:49:07   core_created = FALSE
> 11/22 20:49:07   core_transferred = FALSE
> 11/22 20:49:07   exit_requested = FALSE
> 11/22 20:49:07   image_size = -1 blocks
> 11/22 20:49:08   user_time = 0
> 11/22 20:49:08   sys_time = 0
> 11/22 20:49:08   guaranteed_user_time = 0
> 11/22 20:49:08   guaranteed_sys_time = 0
> 11/22 20:49:08 }
> 11/22 20:49:08     *FSM* Transitioning to state "GET_EXEC"
> 11/22 20:49:08     *FSM* Executing state func "get_exec()" [ SUSPEND
> VACATE DIE  ]
> 11/22 20:49:08 Entering get_exec()
> 11/22 20:49:08 Executable is located on submitting host
> 11/22 20:49:08     *FSM* Got asynchronous event "DIE"
> 11/22 20:49:09     *FSM* Executing transition function "req_die"
> 11/22 20:49:09 req_exit_all: Proc -1 in state NEW
> 11/22 20:49:09     *FSM* Transitioning to state "TERMINATE"
> 11/22 20:49:09     *FSM* Executing state func "terminate_all()" [  ]
> 11/22 20:49:09     *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:49:09     *FSM* Executing state func "dispose_all()" [  ]
> 11/22 20:49:09 Sending final status for process 5.0
> 11/22 20:49:09 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:49:09 User time = 0.000000 seconds
> 11/22 20:49:09 System time = 0.000000 seconds
> 11/22 20:49:10 Can't unlink "dir_4575/condor_exec.5.0" - errno = 2
> 11/22 20:49:10 Removed directory "dir_4575"
> 11/22 20:49:10     *FSM* Reached state "END"
> 11/22 20:49:10 ********* STARTER terminating normally **********
> 11/22 20:59:03 ********** STARTER starting up ***********
> 11/22 20:59:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:59:03 ** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:59:03 ******************************************
> 11/22 20:59:03 Submitting machine is "niting-w2p.corp.cdac.in
> 11/22 20:59:04 EventHandler {
> 11/22 20:59:04     func = 0x80e3bde
> 11/22 20:59:04     mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2
> SIGCHLD SIGTSTP
> 11/22 20:59:04 }
> 11/22 20:59:04 Done setting resource limits
> 11/22 20:59:05     *FSM* Transitioning to state "GET_PROC"
> 11/22 20:59:05     *FSM* Executing state func "get_proc()" [  ]
> 11/22 20:59:05 Entering get_proc()
> 11/22 20:59:05 Entering get_job_info()
> 11/22 20:59:05 Startup Info:
> 11/22 20:59:05     Version Number: 1
> 11/22 20:59:05     Id: 5.0
> 11/22 20:59:05     JobClass: STANDARD
> 11/22 20:59:05     Uid: 503
> 11/22 20:59:05     Gid: 503
> 11/22 20:59:05     VirtPid: -1
> 11/22 20:59:05     SoftKillSignal: 20
> 11/22 20:59:06     Cmd: "/home/psegrid/NIP/nip"
> 11/22 20:59:06     Args: ""
> 11/22 20:59:06     Env:
> "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH="
> 11/22 20:59:06     Iwd: "/home/psegrid"
> 11/22 20:59:06     Ckpt Wanted: TRUE
> 11/22 20:59:06     Is Restart: FALSE
> 11/22 20:59:06     Core Limit Valid: TRUE
> 11/22 20:59:06     Coredump Limit 0
> 11/22 20:59:06 User uid set to 503
> 11/22 20:59:06 User uid set to 503
> 11/22 20:59:06 User Process 5.0 {
> 11/22 20:59:06   cmd = /home/psegrid/NIP/nip
> 11/22 20:59:06   args =
> 11/22 20:59:06   env = GLOBUS_LOCATION=/usr/local/globus-4.0.5/
> X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY=
> X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid
> SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch
> JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878
> LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE
> CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4622
> 11/22 20:59:07   local_dir = dir_4622
> 11/22 20:59:07   cur_ckpt = dir_4622/condor_exec.5.0
> 11/22 20:59:07   core_name = (either 'core' or 'core.<pid>')
> 11/22 20:59:07   uid = 503, gid = 503
> 11/22 20:59:07   v_pid = -1
> 11/22 20:59:07   pid = (NOT CURRENTLY EXECUTING)
> 11/22 20:59:07   exit_status_valid = FALSE
> 11/22 20:59:07   exit_status = (NEVER BEEN EXECUTED)
> 11/22 20:59:07   ckpt_wanted = TRUE
> 11/22 20:59:07   coredump_limit_exists = TRUE
> 11/22 20:59:07   coredump_limit = 0
> 11/22 20:59:07   soft_kill_sig = 20
> 11/22 20:59:07   job_class = STANDARD
> 11/22 20:59:08   state = NEW
> 11/22 20:59:08   new_ckpt_created = FALSE
> 11/22 20:59:08   ckpt_transferred = FALSE
> 11/22 20:59:08   core_created = FALSE
> 11/22 20:59:08   core_transferred = FALSE
> 11/22 20:59:08   exit_requested = FALSE
> 11/22 20:59:08   image_size = -1 blocks
> 11/22 20:59:08   user_time = 0
> 11/22 20:59:08   sys_time = 0
> 11/22 20:59:08   guaranteed_user_time = 0
> 11/22 20:59:08   guaranteed_sys_time = 0
> 11/22 20:59:08 }
> 11/22 20:59:08     *FSM* Transitioning to state "GET_EXEC"
> 11/22 20:59:09     *FSM* Executing state func "get_exec()" [ SUSPEND
> VACATE DIE  ]
> 11/22 20:59:09 Entering get_exec()
> 11/22 20:59:09     *FSM* Got asynchronous event "DIE"
> 11/22 20:59:09     *FSM* Executing transition function "req_die"
> 11/22 20:59:09 req_exit_all: Proc -1 in state NEW
> 11/22 20:59:09     *FSM* Transitioning to state "TERMINATE"
> 11/22 20:59:09     *FSM* Executing state func "terminate_all()" [  ]
> 11/22 20:59:09     *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:59:10     *FSM* Executing state func "dispose_all()" [  ]
> 11/22 20:59:10 Sending final status for process 5.0
> 11/22 20:59:10 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:59:10 User time = 0.000000 seconds
> 11/22 20:59:10 System time = 0.000000 seconds
> 11/22 20:59:10 Can't unlink "dir_4622/condor_exec.5.0" - errno = 2
> 11/22 20:59:10 Removed directory "dir_4622"
> 11/22 20:59:10     *FSM* Reached state "END"
> 11/22 20:59:10 ********* STARTER terminating normally **********
> =====================================================================
> *globus-condor.log*
> ==============================================================
> <c>
>     <a n="MyType"><s>JobAbortedEvent</s></a>
>     <a n="EventTypeNumber"><i>9</i></a>
>     <a n="MyType"><s>JobAbortedEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:48:10</s></a>
>     <a n="Cluster"><i>4</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="Reason"><s>via condor_rm (by user psegrid)</s></a>
> </c>
> <c>
>     <a n="MyType"><s>SubmitEvent</s></a>
>     <a n="EventTypeNumber"><i>0</i></a>
>     <a n="MyType"><s>SubmitEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:48:55</s></a>
>     <a n="Cluster"><i>5</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="SubmitHost"><s>&lt;192.168.7.221:42898&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTypeNumber"><i>1</i></a>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
>     <a n="Cluster"><i>5</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="ExecuteHost"><s>&lt;192.168.7.221:57320&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobEvictedEvent</s></a>
>     <a n="EventTypeNumber"><i>4</i></a>
>     <a n="MyType"><s>JobEvictedEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
>     <a n="Cluster"><i>5</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="Checkpointed"><b v="f"/></a>
>     <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="SentBytes"><r>2.570000000000000E+02</r></a>
>     <a n="ReceivedBytes"><r> 6.650000000000000E+02</r></a>
>     <a n="TerminatedAndRequeued"><b v="f"/></a>
>     <a n="TerminatedNormally"><b v="f"/></a>
> </c>
> <c>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTypeNumber"><i>1</i></a>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
>     <a n="Cluster"><i>5</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="ExecuteHost"><s>&lt;192.168.7.221:57320&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobEvictedEvent</s></a>
>     <a n="EventTypeNumber"><i>4</i></a>
>     <a n="MyType"><s>JobEvictedEvent</s></a>
>     <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
>     <a n="Cluster"><i>5</i></a>
>     <a n="Proc"><i>0</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="Checkpointed"><b v="f"/></a>
>     <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="SentBytes"><r> 2.490000000000000E+02</r></a>
>     <a n="ReceivedBytes"><r> 5.970000000000000E+02</r></a>
>     <a n="TerminatedAndRequeued"><b v="f"/></a>
>     <a n="TerminatedNormally"><b v="f"/></a>
> </c>
> ====================================================================
>
> Nitin
>
> On Nov 20, 2007 9:24 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
>
>
>     >        Last successful match: Tue Nov 20 22:36:21 2007
>
>
>     This indicates that the job is successfully getting matched to a
>     machine.  Something must be going wrong when Condor tries to run the
>     job on that machine.  Look for clues about what is going wrong here:
>
>     The "user log": /usr/local/globus-4.0.5//var/globus-condor.log
>     The ShadowLog (condor_config_val SHADOW_LOG)
>     The StartLog (condor_config_val STARTD_LOG)
>     The StarterLog (condor_config_val STARTER_LOG)
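The three daemon log locations listed above can be collected in one step. This is a rough sketch, assuming `condor_config_val` is on the PATH of the machine being inspected. The helper names are hypothetical, and the demonstration at the end uses placeholder paths rather than real values from this pool.

```python
import subprocess

# Config knob names taken from the list above.
LOG_KNOBS = ("SHADOW_LOG", "STARTD_LOG", "STARTER_LOG")


def query_config(knob):
    """Return what condor_config_val prints for one knob (needs Condor on PATH)."""
    result = subprocess.run(
        ["condor_config_val", knob],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def log_paths(lookup=query_config):
    """Map each log knob to its configured path using the given lookup function."""
    return {knob: lookup(knob) for knob in LOG_KNOBS}


# Stand-in lookup so the sketch runs without Condor; these paths are placeholders.
placeholder = {knob: f"/placeholder/{knob}" for knob in LOG_KNOBS}
print(log_paths(placeholder.get))
```

On a real node, calling `log_paths()` with no argument would shell out to `condor_config_val` once per knob.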
>
>     I hope that helps!
>
>     --Dan
>
>     Nitin Gavhane wrote:
>
>     > Hello all,
>     > I am submitting a job through Globus to Condor, but the job stays in
>     > the idle state. The job details are as follows.
>     > ================================================
>     > *The Job Description Generated by GRAM is as follows *
>     >
>     > [condor@niting-w2p etc]$ cat /tmp/condor_job_description
>     > #
>     > # description file for condor submission
>     > #
>     > Universe = standard
>     > Notification = Never
>     > Executable = /home/psegrid/NIP/nip
>     > Requirements = OpSys == "LINUX"  && Arch == "INTEL"
>     > Environment = GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?7f408200-9789-11dc-9f1a-b41f06e1e2ea;LD_LIBRARY_PATH=
>     > Arguments =
>     > InitialDir = /home/psegrid
>     > Input = /dev/null
>     > Log = /usr/local/globus-4.0.5//var/globus-condor.log
>     > log_xml = True
>     > #Extra attributes specified by client
>     >
>     > Output = /home/psegrid/stdout
>     > Error = /home/psegrid/stderr
>     > queue 1
>     >
>     >
>     =======================================================================
>     > *[psegrid@niting-w2p NIP]$ condor_q -better-analyze*
>     >
>     >
>     > -- Submitter: niting-w2p.corp.cdac.in : <192.168.7.221:42993> : niting-w2p.corp.cdac.in
>     > ---
>     > 005.000:  Run analysis summary.  Of 7 machines,
>     >      4 are rejected by your job's requirements
>     >      0 reject your job because of their own requirements
>     >      0 match but are serving users with a better priority in the pool
>     >      3 match but reject the job for unknown reasons
>     >      0 match but will not currently preempt their existing job
>     >      0 are available to run your job
>     >        Last successful match: Tue Nov 20 22:36:21 2007
>     >
>     > The Requirements expression for your job is:
>     >
>     > ( target.OpSys == "LINUX" && target.Arch == "INTEL" ) &&
>     > ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
>     > ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
>     > ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
>     >
>     >    Condition                         Machines Matched    Suggestion
>     >    ---------                         ----------------    ----------
>     > 1   target.Arch == "INTEL"            3
>     > 2   target.OpSys == "LINUX"           7
>     > 3   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )    7
>     > 4   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )    7
>     > 5   ( target.Disk >= 20000 )          7
>     > 6   ( ( 1024 * target.Memory ) >= 20000 )    7
>     >
>     >
>     >
>     >
>     > ==========================================================
>     > *[psegrid@niting-w2p NIP]$ condor_status*
>     >
>     > Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
>     >
>     > vm1@niting-w2 LINUX       INTEL  Unclaimed  Idle       0.000   469  0+00:05:26
>     > vm2@niting-w2 LINUX       INTEL  Unclaimed  Idle       0.140   469  0+00:26:42
>     > sskadam-w2p.c LINUX       INTEL  Unclaimed  Idle       0.000   248  0+00:44:38
>     > vm1@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.400   753  0+00:30:04
>     > vm2@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:05
>     > vm3@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:06
>     > vm4@psewebs-w LINUX       X86_64 Unclaimed  Idle       0.000   753  0+00:30:27
>     >
>     >                     Total Owner Claimed Unclaimed Matched Preempting Backfill
>     >
>     >         INTEL/LINUX     3     0       0         3       0          0        0
>     >        X86_64/LINUX     4     0       0         4       0          0        0
>     >
>     >               Total     7     0       0         7       0          0        0
>     > ==============================================================
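The "4 are rejected by your job's requirements" line is consistent with the condor_status listing above: four of the seven slots report Arch X86_64, while the job asks for INTEL. Here is a small sketch that re-checks only the OpSys and Arch clauses against that listing. Disk and the checkpoint attributes do not appear in the listing, so they are left out. The dictionary layout and the function name are illustrative only and are not Condor's own matchmaking.

```python
# Slot names are copied as they appear (truncated) in the condor_status output.
machines = [
    {"Name": "vm1@niting-w2", "OpSys": "LINUX", "Arch": "INTEL"},
    {"Name": "vm2@niting-w2", "OpSys": "LINUX", "Arch": "INTEL"},
    {"Name": "sskadam-w2p.c", "OpSys": "LINUX", "Arch": "INTEL"},
    {"Name": "vm1@psewebs-w", "OpSys": "LINUX", "Arch": "X86_64"},
    {"Name": "vm2@psewebs-w", "OpSys": "LINUX", "Arch": "X86_64"},
    {"Name": "vm3@psewebs-w", "OpSys": "LINUX", "Arch": "X86_64"},
    {"Name": "vm4@psewebs-w", "OpSys": "LINUX", "Arch": "X86_64"},
]


def meets_arch_and_opsys(machine):
    """First clause of the job's Requirements: OpSys == "LINUX" && Arch == "INTEL"."""
    return machine["OpSys"] == "LINUX" and machine["Arch"] == "INTEL"


matched = [m["Name"] for m in machines if meets_arch_and_opsys(m)]
rejected = [m["Name"] for m in machines if not meets_arch_and_opsys(m)]
print(len(matched), "matched,", len(rejected), "rejected")  # 3 matched, 4 rejected
```

The three remaining slots are the ones reported as "match but reject the job for unknown reasons", so the cause has to be looked for on the execute side rather than in the Requirements.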
>     > *The DAEMON details for all three machines are as follows *
>     >
>     > [condor@niting-w2p etc]$ ./test.sh
>     > current file: condor_config
>     > ##  checkpoint server isn't available or USE_CKPT_SERVER is set to
>     > USE_CKPT_SERVER = True
>     > CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
>     > ##  checkpoint server?  If False, the CKPT_SERVER_HOST set on
>     > ##  the submit machine is used.  Otherwise, the CKPT_SERVER_HOST set
>     > STARTER_CHOOSES_CKPT_SERVER = True
>     > #WALL_CLOCK_CKPT_INTERVAL = 3600
>     > ##  setting is only used if USE_CKPT_SERVER (from above) is True.
>     > #COMPRESS_PERIODIC_CKPT = False
>     > #COMPRESS_VACATE_CKPT = False
>     > #SLOW_CKPT_SPEED = 0
>     > DAEMON_LIST                     = MASTER, STARTD, SCHEDD
>     > #DC_DAEMON_LIST = \
>     > =============
>     > current file: psewebs-w2p.local
>     > USE_CKPT_SERVER = True
>     > CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
>     > DAEMON_LIST = MASTER, STARTD, SCHEDD
>     > DAEMON_LIST   = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
>     > =============
>     > current file: niting-w2p.local
>     > USE_CKPT_SERVER = True
>     > CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
>     > DAEMON_LIST = MASTER, STARTD, SCHEDD
>     > =============
>     > current file: sskadam-w2p.local
>     > USE_CKPT_SERVER = True
>     > CKPT_SERVER_HOST        = psewebs-w2p.corp.cdac.in
>     > DAEMON_LIST = MASTER, STARTD, SCHEDD
>     > ===============================
>     >
>     > Please tell me what is wrong with the job submission.
>     > Thank you.
>     > --
>     > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     > Nitin M. Gavhane
>     > MS in Advanced Software Technologies
>     > International Institute of Information Technology
>     > P-14,Hinjewadi,Pune, India.
>     >
>     ---------------------------------------------------------------------------------------------------------------------------
>
>     >
>     >
>     >------------------------------------------------------------------------
>     >
>     >_______________________________________________
>     >Condor-users mailing list
>     >To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>     >subject: Unsubscribe
>     >You can also unsubscribe by visiting
>     > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>     >
>     >The archives can be found at:
>     >https://lists.cs.wisc.edu/archive/condor-users/
>     >
>     >
>
>
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Nitin M. Gavhane
> MS in Advanced Software Technologies
> International Institute of Information Technology
> P-14,Hinjewadi,Pune, India.
> ---------------------------------------------------------------------------------------------------------------------------




--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nitin M. Gavhane
MS in Advanced Software Technologies
International Institute of Information Technology
P-14,Hinjewadi,Pune, India.
---------------------------------------------------------------------------------------------------------------------------