
[Condor-users] Condor-G jobs remain idle



When I submit a Condor-G job, its status stays "idle" in the output of "condor_q" and "PENDING" in the output of "condor_q -globus". Is there a configuration setting I am missing that is needed to submit Condor-G jobs successfully?
I use Condor 7.6.6 and VDT 2.

Submission file and process:

[zhrani@CM Grid]$ cat hostname_submit.jcl

grid_resource = gt2 head.beng02.com/jobmanager-pbs
Universe = grid
when_to_transfer_output = ON_EXIT
Executable = /bin/hostname
Arguments = -f
Output = cout.$(Cluster).$(Process)
Log = clog.$(Cluster).$(Process)
Queue

[zhrani@CM Grid]$ condor_submit hostname_submit.jcl

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1111.

[zhrani@CM Grid]$ condor_q

-- Submitter: CM.CHPC.hud.ac.uk : <192.168.0.10:21871> : CM.CHPC.hud.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1111.0   zhrani          4/30 11:07   0+00:00:00 I  0   0.0  hostname -f

1 jobs; 1 idle, 0 running, 0 held

[zhrani@CM Grid]$ condor_q -globus

-- Submitter: CM.CHPC.hud.ac.uk : <192.168.0.10:21871> : CM.CHPC.hud.ac.uk
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE
1111.0   zhrani        PENDING pbs      head.beng02.com     /bin/hostname

[zhrani@CM Grid]$ cat clog.1111.0
000 (1111.000.000) 04/30 11:07:24 Job submitted from host: <192.168.0.10:21871>
...
017 (1111.000.000) 04/30 11:07:34 Job submitted to Globus
    RM-Contact: head.beng02.com/jobmanager-pbs
    JM-Contact: https://head.beng02.com:53994/13404/1335780447/
    Can-Restart-JM: 1
...
027 (1111.000.000) 04/30 11:07:34 Job submitted to grid resource
    GridResource: gt2 head.beng02.com/jobmanager-pbs
    GridJobId: gt2 head.beng02.com/jobmanager-pbs https://head.beng02.com:53994/13404/1335780447/
...


Gridmanager LOG:

04/30/12 11:07:34 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:34 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '64' '0'
04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0
04/30/12 11:07:34 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 32
04/30/12 11:07:34 [31322] (1111.0) globus state change: UNSUBMITTED -> STAGE_IN
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13/73
04/30/12 11:07:34 [31322] FileLock object is updating timestamp on: /tmp/condorLocks/13/73/8624055152012540.lockc
04/30/12 11:07:34 [31322] (1111.0) Writing globus submit record to user logfile
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.150935 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.154117 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.154250 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] directory_util::rec_clean_up: file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] Lock file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.154591 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13
04/30/12 11:07:34 [31322] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13/73
04/30/12 11:07:34 [31322] FileLock object is updating timestamp on: /tmp/condorLocks/13/73/8624055152012540.lockc
04/30/12 11:07:34 [31322] (1111.0) Writing grid submit record to user logfile
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.155638 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.157136 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] FileLock::obtain(1) - @1335780454.157265 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now WRITE
04/30/12 11:07:34 [31322] directory_util::rec_clean_up: file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] Lock file /tmp/condorLocks/13/73/8624055152012540.lockc has been deleted.
04/30/12 11:07:34 [31322] FileLock::obtain(2) - @1335780454.157598 lock on /tmp/condorLocks/13/73/8624055152012540.lockc now UNLOCKED
04/30/12 11:07:34 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:34 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:34 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '1' '0'
04/30/12 11:07:34 [31322] (1111.0) gram callback: state 1, errorcode 0
04/30/12 11:07:34 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 64
04/30/12 11:07:34 [31322] (1111.0) globus state change: STAGE_IN -> PENDING
04/30/12 11:07:38 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:07:38 [31322] GAHP[31326] <- 'GRAM_JOB_REQUEST 7 head.beng02.com:2119/jobmanager-fork https://cm.chpc.hud.ac.uk:24383/ 1 &(executable=https://cm.chpc.hud.ac.uk:20886/usr/sbin/grid_monitor.sh)(stdout=https://cm.chpc.hud.ac.uk:20886/tmp/condor_g_scratch.0x19390fd0.25029/grid-monitor.head.beng02.com:2119.1/grid-monitor-log)(arguments='--dest-url="">
04/30/12 11:07:38 [31322] GAHP[31326] -> 'S'
04/30/12 11:07:39 [31322] in doContactSchedd()
04/30/12 11:07:39 [31322] querying for removed/held jobs
04/30/12 11:07:39 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:07:39 [31322] Fetched 0 job ads from schedd
04/30/12 11:07:39 [31322] Updating classad values for 1111.0:
04/30/12 11:07:39 [31322]    GlobusStatus = 1
04/30/12 11:07:39 [31322]    GridJobStatus = "PENDING"
04/30/12 11:07:39 [31322]    LastRemoteStatusUpdate = 1335780454
04/30/12 11:07:39 [31322]    NumGlobusSubmits = 1
04/30/12 11:07:39 [31322] leaving doContactSchedd()
04/30/12 11:07:42 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:42 [31322] GAHP[31326] -> '7' '0' 'https://head.beng02.com:60336/13434/1335780456/'
04/30/12 11:07:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:07:42 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:42 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:42 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:60336/13434/1335780456/' '64' '0'
04/30/12 11:07:42 [31322] grid_monitor for head.beng02.com:2119: gram callback status=64 errorcode=0
04/30/12 11:07:43 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:07:43 [31322] GAHP[31326] -> 'R'
04/30/12 11:07:43 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:07:43 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:60336/13434/1335780456/' '2' '0'
04/30/12 11:07:43 [31322] grid_monitor for head.beng02.com:2119: gram callback status=2 errorcode=0
04/30/12 11:08:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:08:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:08:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780406, scan finish=1335780406, job count=0
04/30/12 11:08:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:08:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:08:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:08:12 [31322] Successfully started grid_monitor for head.beng02.com:2119
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 1
04/30/12 11:08:12 [31322] (1111.0) gm state change: GM_SUBMITTED -> GM_PUT_TO_SLEEP
04/30/12 11:08:12 [31322] GAHP[31326] <- 'GRAM_JOB_SIGNAL 8 https://head.beng02.com:53994/13404/1335780447/ 9 NULL'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S'
04/30/12 11:08:12 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'R'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:08:12 [31322] GAHP[31326] -> '8' '0' '0' '1'
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_PUT_TO_SLEEP, globusState 1
04/30/12 11:08:12 [31322] (1111.0) gm state change: GM_PUT_TO_SLEEP -> GM_JOBMANAGER_ASLEEP
04/30/12 11:08:12 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'R'
04/30/12 11:08:12 [31322] GAHP[31326] -> 'S' '1'
04/30/12 11:08:12 [31322] GAHP[31326] -> '2' 'https://head.beng02.com:53994/13404/1335780447/' '4' '130'
04/30/12 11:08:12 [31322] (1111.0) gram callback: state 4, errorcode 130
04/30/12 11:08:12 [31322] (1111.0) doEvaluateState called: gmState GM_JOBMANAGER_ASLEEP, globusState 1
04/30/12 11:08:25 [31322] Received CHECK_LEASES signal
04/30/12 11:08:25 [31322] in doContactSchedd()
04/30/12 11:08:25 [31322] querying for renewed leases
04/30/12 11:08:25 [31322] querying for removed/held jobs
04/30/12 11:08:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:08:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:08:25 [31322] leaving doContactSchedd()
04/30/12 11:08:28 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:08:28 [31322] GAHP[31326] -> 'S' '0'
04/30/12 11:08:30 [31322] Evaluating staleness of remote job statuses.
04/30/12 11:08:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:09:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:09:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:09:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780466, scan finish=1335780466, job count=1
04/30/12 11:09:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:09:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:09:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:09:12 [31322] in doContactSchedd()
04/30/12 11:09:12 [31322] querying for removed/held jobs
04/30/12 11:09:12 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:09:12 [31322] Fetched 0 job ads from schedd
04/30/12 11:09:12 [31322] Updating classad values for 1111.0:
04/30/12 11:09:12 [31322]    LastRemoteStatusUpdate = 1335780552
04/30/12 11:09:12 [31322] leaving doContactSchedd()
04/30/12 11:09:25 [31322] Received CHECK_LEASES signal
04/30/12 11:09:25 [31322] in doContactSchedd()
04/30/12 11:09:25 [31322] querying for renewed leases
04/30/12 11:09:25 [31322] querying for removed/held jobs
04/30/12 11:09:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:09:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:09:25 [31322] leaving doContactSchedd()
04/30/12 11:09:28 [31322] GAHP[31326] <- 'RESULTS'
04/30/12 11:09:28 [31322] GAHP[31326] -> 'S' '0'
04/30/12 11:09:30 [31322] Evaluating staleness of remote job statuses.
04/30/12 11:09:42 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:10:12 [31322] grid_monitor for head.beng02.com:2119 entering CheckMonitor
04/30/12 11:10:12 [31322] grid_monitor job status for head.beng02.com:2119 file has been refreshed.
04/30/12 11:10:12 [31322] Read full grid_monitor status file for head.beng02.com:2119: scan start=1335780526, scan finish=1335780526, job count=1
04/30/12 11:10:12 [31322] Read grid_monitor status file for head.beng02.com:2119 successfully
04/30/12 11:10:12 [31322] grid_monitor log file for head.beng02.com:2119 updated.
04/30/12 11:10:12 [31322] grid_monitor log file for head.beng02.com:2119 looks normal
04/30/12 11:10:12 [31322] in doContactSchedd()
04/30/12 11:10:12 [31322] querying for removed/held jobs
04/30/12 11:10:12 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:10:12 [31322] Fetched 0 job ads from schedd
04/30/12 11:10:12 [31322] Updating classad values for 1111.0:
04/30/12 11:10:12 [31322]    LastRemoteStatusUpdate = 1335780612
04/30/12 11:10:12 [31322] leaving doContactSchedd()
04/30/12 11:10:25 [31322] Received CHECK_LEASES signal
04/30/12 11:10:25 [31322] in doContactSchedd()
04/30/12 11:10:25 [31322] querying for renewed leases
04/30/12 11:10:25 [31322] querying for removed/held jobs
04/30/12 11:10:25 [31322] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 11:10:25 [31322] Fetched 0 job ads from schedd
04/30/12 11:10:25 [31322] leaving doContactSchedd()
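
To make the numeric codes in the gridmanager log above easier to follow, here is a small Python sketch that decodes the "gram callback" lines. The state table reflects the GRAM2 job state constants as I understand them (the values 32, 64 and 1 agree with the UNSUBMITTED -> STAGE_IN -> PENDING transitions in the log). The function name gram_callbacks and the regular expression are my own and are not part of Condor.

```python
import re

# GRAM2 job state codes (assumed from the GRAM protocol constants;
# 32, 64 and 1 are confirmed by the state-change lines in the log).
GRAM_STATES = {
    1: "PENDING",
    2: "ACTIVE",
    4: "FAILED",
    8: "DONE",
    16: "SUSPENDED",
    32: "UNSUBMITTED",
    64: "STAGE_IN",
    128: "STAGE_OUT",
}

# Matches per-job lines such as:
# 04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0
CALLBACK = re.compile(
    r"^(\S+ \S+) \[\d+\] \((\d+\.\d+)\) gram callback: state (\d+), errorcode (\d+)"
)


def gram_callbacks(lines):
    """Return (timestamp, job id, state name, error code) for each callback line."""
    events = []
    for line in lines:
        m = CALLBACK.match(line)
        if m:
            ts, job, state, err = m.groups()
            name = GRAM_STATES.get(int(state), "UNKNOWN(%s)" % state)
            events.append((ts, job, name, int(err)))
    return events


# The three callbacks for job 1111.0, copied from the log.
sample = [
    "04/30/12 11:07:34 [31322] (1111.0) gram callback: state 64, errorcode 0",
    "04/30/12 11:07:34 [31322] (1111.0) gram callback: state 1, errorcode 0",
    "04/30/12 11:08:12 [31322] (1111.0) gram callback: state 4, errorcode 130",
]
for event in gram_callbacks(sample):
    print(event)
```

For job 1111.0 this gives STAGE_IN, PENDING, and then FAILED with error code 130. The last callback arrives right after the GRAM_JOB_SIGNAL that puts the jobmanager to sleep, and the gridmanager still reports globusState 1 (PENDING) for the job afterwards.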



Remote Host Log:

TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 6: globus-gatekeeper pid=13401 starting at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Requested service: jobmanager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:27 2012
 PID: 13401 -- Notice: 0: Child 13402 started
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 6: globus-gatekeeper pid=13403 starting at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:27 2012

TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Requested service: jobmanager-pbs
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:27 2012
 PID: 13403 -- Notice: 0: Child 13404 started
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 6: globus-gatekeeper pid=13433 starting at Mon Apr 30 11:07:36 2012

TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 11:07:36 2012

TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Requested service: jobmanager-fork
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 11:07:36 2012
 PID: 13433 -- Notice: 0: Child 13434 started
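
The gatekeeper log above is easier to read when grouped by PID. The following Python sketch is only an illustration: summarize_gatekeeper is a hypothetical helper of mine, and it assumes the "PID: N -- Notice: L: message" layout shown in the log.

```python
import re

# Matches lines such as:
#  PID: 13403 -- Notice: 5: Requested service: jobmanager-pbs
NOTICE = re.compile(r"^\s*PID: (\d+) -- Notice: \d+: (.*)$")

SERVICE_PREFIX = "Requested service: "
USER_PREFIX = "Authorized as local user: "


def summarize_gatekeeper(lines):
    """Map each gatekeeper PID to the requested service and the local user."""
    summary = {}
    for line in lines:
        m = NOTICE.match(line)
        if not m:
            continue  # skip TIME: lines and blank lines
        pid, text = int(m.group(1)), m.group(2)
        entry = summary.setdefault(pid, {})
        if text.startswith(SERVICE_PREFIX):
            entry["service"] = text[len(SERVICE_PREFIX):]
        elif text.startswith(USER_PREFIX):
            entry["user"] = text[len(USER_PREFIX):]
    return summary


# A few lines copied from the remote host log.
sample = [
    "TIME: Mon Apr 30 11:07:27 2012",
    " PID: 13403 -- Notice: 5: Requested service: jobmanager-pbs",
    " PID: 13403 -- Notice: 5: Authorized as local user: zhrani",
    "TIME: Mon Apr 30 11:07:36 2012",
    " PID: 13433 -- Notice: 5: Requested service: jobmanager-fork",
    " PID: 13433 -- Notice: 5: Authorized as local user: zhrani",
]
for pid, info in sorted(summarize_gatekeeper(sample).items()):
    print(pid, info)
```

Applied to the full log, this shows three connections, all authorized as local user zhrani: PID 13401 requested "jobmanager", PID 13403 requested "jobmanager-pbs", and PID 13433 requested "jobmanager-fork". The child of the last one, 13434, matches the grid_monitor job contact in the gridmanager log.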



Regards,