
Re: [Condor-users] Condor-G jobs remains idle



PENDING status just means that the job is queued on the remote resource and no free slot is available to run it yet.

Eventually it should run OK. It's perfectly normal for most gt2 jobs to show a PENDING state first.
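
For reference, the numeric states in the gridmanager log ("gram callback: state 64", "globusState 1") are the GRAM job state codes. Below is a minimal sketch, not part of Condor, that decodes them from log lines. The helper name decode_callbacks and the regular expression are illustrative assumptions based only on the log format quoted in this thread.

```python
import re

# GRAM (gt2) job state codes from the Globus GRAM protocol.
GRAM_STATES = {
    1: "PENDING",
    2: "ACTIVE",
    4: "FAILED",
    8: "DONE",
    16: "SUSPENDED",
    32: "UNSUBMITTED",
    64: "STAGE_IN",
    128: "STAGE_OUT",
}

# Matches gridmanager log lines such as:
#   04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0
# (pattern is an assumption derived from the log excerpt in this thread)
CALLBACK_RE = re.compile(
    r"^(?P<when>\S+ \S+) \[\d+\] \((?P<job>\d+\.\d+)\) "
    r"gram callback: state (?P<state>\d+), errorcode (?P<err>\d+)"
)


def decode_callbacks(lines):
    """Return (timestamp, job id, state name, error code) per callback line."""
    events = []
    for line in lines:
        m = CALLBACK_RE.match(line.strip())
        if m:
            code = int(m.group("state"))
            name = GRAM_STATES.get(code, "UNKNOWN(%d)" % code)
            events.append(
                (m.group("when"), m.group("job"), name, int(m.group("err")))
            )
    return events


sample = [
    "04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0",
    "04/30/12 11:07:34 [31322] (1111.0) gram callback: state 1, errorcode 0",
]
for event in decode_callbacks(sample):
    print(event)
```

Run against the two callback lines from the log, this reports STAGE_IN followed by PENDING, which matches the "globus state change" lines the gridmanager wrote itself.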

 

Steve Timm

 

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Hameed Alzahrani
Sent: Monday, April 30, 2012 5:23 AM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Condor-G jobs remains idle

 

When I submit a Condor-G job, its status stays "idle" in "condor_q" and "PENDING" in "condor_q -globus". Is there some configuration missing that I need to add in order to submit Condor-G jobs successfully?
I use Condor 7.6.6 and VDT 2.

Submission file and process:

[zhrani@CM Grid]$ cat hostname_submit.jcl

grid_resource = gt2 head.beng02.com/jobmanager-pbs
Universe = grid
when_to_transfer_output = ON_EXIT
Executable = /bin/hostname
Arguments = -f
Output = cout.$(Cluster).$(Process)
Log = clog.$(Cluster).$(Process)
Queue

[zhrani@CM Grid]$ condor_submit hostname_submit.jcl

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1111.


[zhrani@CM Grid]$ condor_q

-- Submitter: CM.CHPC.hud.ac.uk : <192.168.0.10:21871> : CM.CHPC.hud.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1111.0   zhrani          4/30 11:07   0+00:00:00 I  0   0.0  hostname -f

1 jobs; 1 idle, 0 running, 0 held

[zhrani@CM Grid]$ condor_q -globus

-- Submitter: CM.CHPC.hud.ac.uk : <192.168.0.10:21871> : CM.CHPC.hud.ac.uk
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE
1111.0   zhrani        PENDING pbs      head.beng02.com     /bin/hostname


[zhrani@CM Grid]$ cat clog.1111.0
000 (1111.000.000) 04/30 11:07:24 Job submitted from host: <192.168.0.10:21871>
...
017 (1111.000.000) 04/30 11:07:34 Job submitted to Globus
    RM-Contact: head.beng02.com/jobmanager-pbs
    JM-Contact: https://head.beng02.com:53994/13404/1335780447/
    Can-Restart-JM: 1
...
027 (1111.000.000) 04/30 11:07:34 Job submitted to grid resource
    GridResource: gt2 head.beng02.com/jobmanager-pbs
    GridJobId: gt2 head.beng02.com/jobmanager-pbs https://head.beng02.com:53994/13404/1335780447/
...


Gridmanager LOG:

04/30/12 11:07:34 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:34 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '64' '0'
04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0
04/30/12 11:07:34 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 32
04/30/12 11:07:34 [31322] (1111.0) globus state change: UNSUBMITTED -> STAGE_IN
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13/73
04/30/12 11:07:34 [31322] FileLock object is updating timestamp on: /tmp/condorLocks/13/73/8624055152012540.lockc
04/30/12 11:07:34 [31322] (1111.0) Writing globus submit record to user logfile
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.150935 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.154117 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.154250 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] directory_util::rec_clean_up: file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] Lock file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.154591 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13/73
04/30/12 11:07:34 [31322] FileLock object is updating timestamp on: /tmp/condorLocks/13/73/8624055152012540.lockc
04/30/12 11:07:34 [31322] (1111.0) Writing grid submit record to user logfile
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.155638 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.157136 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.157265 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] directory_util::rec_clean_up: file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] Lock file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.157598 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:34 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '1' '0'
04/30/12 11:07:34 [31322] (1111.0) gram callback: state 1, errorcode 0
04/30/12 11:07:34 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 64
04/30/12 11:07:34 [31322] (1111.0) globus state change: STAGE_IN -> PENDING
04/30/12 11:07:38 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:07:38 [31322] GAHP[31326] <- 'GRAM_JOB_REQUEST 7 head.beng02.com:2119/jobmanager-fork https://cm.chpc.hud.ac.uk:24383/ 1 &(executable=https://cm.chpc.hud.ac.uk:20886/usr/sbin/grid_monitor.sh)(stdout=https://cm.chpc.hud.ac.uk:20886/tmp/condor_g_scratch.0x19390fd0.25029/grid-monitor.head.beng02.com:2119.1/grid-monitor-log)(arguments='--dest-url= [rest of line truncated in the archive]
04/30/12 11:07:38 [31322] GAHP[31326] -> 'S'
04/30/12 11:07:39 [31322] in doContactSchedd()
04/30/12 11:07:39 [31322] querying for removed/held jobs
04/30/12 11:07:39 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:07:39 [31322] Fetched 0 job ads from schedd
04/30/12 11:07:39 [31322] Updating classad values for 1111.0:
04/30/12 11:07:39 [31322]    GlobusStatus = 1
04/30/12 11:07:39 [31322]    GridJobStatus = "PENDING"
04/30/12 11:07:39 [31322]    LastRemoteStatusUpdate = 1335780454
04/30/12 11:07:39 [31322]    NumGlobusSubmits = 1
04/30/12 11:07:39 [31322] leaving doContactSchedd()
04/30/12 11:07:42 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:42 [31322] GAHP[31326] -> '7' '0' 'https://head.beng02.com:60336/13434/1335780456/'
04/30/12 11:07:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:07:42 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:42 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:60336/13434/1335780456/' '64' '0'
04/30/12 11:07:42 [31322] grid_monitor for head.beng02.com:2119: gram callback status=64 errorcode=0
04/30/12 11:07:43 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:43 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:43 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:43 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:60336/13434/1335780456/' '2' '0'
04/30/12 11:07:43 [31322] grid_monitor for head.beng02.com:2119: gram callback status=2 errorcode=0
04/30/12 11:08:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:08:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:08:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780406, scan finish=1335780406, job count=0
04/30/12 11:08:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:08:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:08:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:08:12 [31322] Successfully started grid_monitor for head.beng02.com:2119
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 1
04/30/12 11:08:12 [31322] (1111.0) gm state change: GM_SUBMITTED -> GM_PUT_TO_SLEEP
04/30/12 11:08:12 [31322] GAHP[31326] <- 'GRAM_JOB_SIGNAL 8 https://head.beng02.com:53994/13404/1335780447/ 9 NULL'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S'
04/30/12 11:08:12 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'R'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:08:12 [31322] GAHP[31326] -> '8' '0' '0' '1'
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_PUT_TO_SLEEP, globusState 1
04/30/12 11:08:12 [31322] (1111.0) gm state change: GM_PUT_TO_SLEEP -> GM_JOBMANAGER_ASLEEP
04/30/12 11:08:12 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'R'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:08:12 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '4' '130'
04/30/12 11:08:12 [31322] (1111.0) gram callback: state 4, errorcode 130
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_JOBMANAGER_ASLEEP, globusState 1
04/30/12 11:08:25 [31322] Received CHECK_LEASES signal
04/30/12 11:08:25 [31322] in doContactSchedd()
04/30/12 11:08:25 [31322] querying for renewed leases
04/30/12 11:08:25 [31322] querying for removed/held jobs
04/30/12 11:08:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:08:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:08:25 [31322] leaving doContactSchedd()
04/30/12 11:08:28 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:28 [31322] GAHP[31326] -> 'S' '0'
04/30/12 11:08:30 [31322] Evaluating staleness of remote job statuses.
04/30/12 11:08:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:09:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:09:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:09:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780466, scan finish=1335780466, job count=1
04/30/12 11:09:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:09:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:09:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:09:12 [31322] in doContactSchedd()
04/30/12 11:09:12 [31322] querying for removed/held jobs
04/30/12 11:09:12 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:09:12 [31322] Fetched 0 job ads from schedd
04/30/12 11:09:12 [31322] Updating classad values for 1111.0:
04/30/12 11:09:12 [31322]    LastRemoteStatusUpdate = 1335780552
04/30/12 11:09:12 [31322] leaving doContactSchedd()
04/30/12 11:09:25 [31322] Received CHECK_LEASES signal
04/30/12 11:09:25 [31322] in doContactSchedd()
04/30/12 11:09:25 [31322] querying for renewed leases
04/30/12 11:09:25 [31322] querying for removed/held jobs
04/30/12 11:09:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:09:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:09:25 [31322] leaving doContactSchedd()
04/30/12 11:09:28 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:09:28 [31322] GAHP[31326] -> 'S' '0'
04/30/12 11:09:30 [31322] Evaluating staleness of remote job statuses.
04/30/12 11:09:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:10:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:10:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:10:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780526, scan finish=1335780526, job count=1
04/30/12 11:10:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:10:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:10:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:10:12 [31322] in doContactSchedd()
04/30/12 11:10:12 [31322] querying for removed/held jobs
04/30/12 11:10:12 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:10:12 [31322] Fetched 0 job ads from schedd
04/30/12 11:10:12 [31322] Updating classad values for 1111.0:
04/30/12 11:10:12 [31322]    LastRemoteStatusUpdate = 1335780612
04/30/12 11:10:12 [31322] leaving doContactSchedd()
04/30/12 11:10:25 [31322] Received CHECK_LEASES signal
04/30/12 11:10:25 [31322] in doContactSchedd()
04/30/12 11:10:25 [31322] querying for renewed leases
04/30/12 11:10:25 [31322] querying for removed/held jobs
04/30/12 11:10:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:10:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:10:25 [31322] leaving doContactSchedd()




Remote Host Log:


TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 6: globus-gatekeeper pid=13401 starting at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Requested service: jobmanager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: Child 13402 started
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 6: globus-gatekeeper pid=13403 starting at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Requested service: jobmanager-pbs
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: Child 13404 started
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 6: globus-gatekeeper pid=13433 starting at Mon Apr 30 11:07:36 2012

TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:36 2012

TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Requested service: jobmanager-fork
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: Child 13434 started




Regards,