
[Condor-users] Condor-G submission does not work while globus submit works



Hi,

When I submit the following submit file through Condor, the job never runs and remains idle, whereas submitting the same job with globus-job-submit works without any errors. The log on the remote host shows an authentication failure in the Condor-G case, but no failure when the job is submitted through Globus. Has anyone come across this problem, or does anyone know how to solve it? Any help would be appreciated.

I am using Condor 7.6.6 and VDT 2.

Submission file and process:

[zhrani@CM Grid]$ cat hostname_submit.jcl
grid_resource = gt4 https://head.beng02.com:2119/wsrf/services/ManagedJobFactoryService PBS
Universe = grid
when_to_transfer_output = ON_EXIT
Executable = /bin/hostname
Arguments = -f
Output = cout.$(Cluster).$(Process)
Log =clog.$(Cluster).$(Process)
Queue

[zhrani@CM Grid]$ condor_submit hostname_submit.jcl
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1106.

[zhrani@CM Grid]$ condor_q -globus


-- Submitter: CM.CHPC.hud.ac.uk : <192.168.0.10:21871> : CM.CHPC.hud.ac.uk
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE
1106.0   zhrani        UNSUBMITTED PBS      head.beng02.com     /bin/hostname

[zhrani@CM Grid]$ condor_rm zhrani
User zhrani's job(s) have been marked for removal.

[zhrani@CM Grid]$ globus-job-submit head.beng02.com /bin/hostname -f
https://head.beng02.com:37308/6261/1335746926/
[zhrani@CM Grid]$ globus-job-status https://head.beng02.com:37308/6261/1335746926/
DONE
[zhrani@CM Grid]$ globus-job-get-output https://head.beng02.com:37308/6261/1335746926/
head.beng02.com


Gridmanager log:

04/30/12 01:46:29 [25065] resource https://head.beng02.com:2119/wsrf/services/ManagedJobFactoryService is now up
04/30/12 01:46:29 [25065] *** checkDelegation()
04/30/12 01:46:29 [25065] (1106.0) doEvaluateState called: gmState GM_UNSUBMITTED, globusState
04/30/12 01:47:19 [25065] Received CHECK_LEASES signal
04/30/12 01:47:19 [25065] in doContactSchedd()
04/30/12 01:47:19 [25065] querying for renewed leases
04/30/12 01:47:19 [25065] querying for removed/held jobs
04/30/12 01:47:19 [25065] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 01:47:19 [25065] Fetched 0 job ads from schedd
04/30/12 01:47:19 [25065] leaving doContactSchedd()
04/30/12 01:47:22 [25065] GridftpServer: Submitting job for proxy '/O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani'
04/30/12 01:47:22 [25065] entering FileTransfer::SimpleInit
04/30/12 01:47:22 [25065] Input files: /tmp/condor_g_scratch.0x19360fd0.25029/grid-mapfile
04/30/12 01:47:22 [25065] entering FileTransfer::UploadFiles (final_transfer=0)
04/30/12 01:47:22 [25065] entering FileTransfer::Upload
04/30/12 01:47:22 [25065] entering FileTransfer::DoUpload
04/30/12 01:47:22 [25065] DoUpload: sending file /tmp/condor_g_scratch.0x19360fd0.25029/master_proxy.2
04/30/12 01:47:22 [25065] FILETRANSFER: outgoing file_command is 4 for /tmp/condor_g_scratch.0x19360fd0.25029/master_proxy.2
04/30/12 01:47:22 [25065] Received GoAhead from peer to send /tmp/condor_g_scratch.0x19360fd0.25029/master_proxy.2 and all further files.
04/30/12 01:47:22 [25065] Sending GoAhead for 192.168.0.10 to receive /tmp/condor_g_scratch.0x19360fd0.25029/master_proxy.2 and all further files.
04/30/12 01:47:22 [25065] DoUpload: put_x509_delegation() returned 0
04/30/12 01:47:22 [25065] DoUpload: sending file /tmp/condor_g_scratch.0x19360fd0.25029/grid-mapfile
04/30/12 01:47:22 [25065] FILETRANSFER: outgoing file_command is 1 for /tmp/condor_g_scratch.0x19360fd0.25029/grid-mapfile
04/30/12 01:47:22 [25065] ReliSock::put_file_with_permissions(): going to send permissions 100644
04/30/12 01:47:22 [25065] put_file: going to send from filename /tmp/condor_g_scratch.0x19360fd0.25029/grid-mapfile
04/30/12 01:47:22 [25065] put_file: Found file size 84
04/30/12 01:47:22 [25065] put_file: sending 84 bytes
04/30/12 01:47:22 [25065] ReliSock: put_file: sent 84 bytes
04/30/12 01:47:22 [25065] DoUpload: sending file /usr/libexec/condor/gridftp_wrapper.sh
04/30/12 01:47:22 [25065] FILETRANSFER: outgoing file_command is 1 for /usr/libexec/condor/gridftp_wrapper.sh
04/30/12 01:47:22 [25065] ReliSock::put_file_with_permissions(): going to send permissions 100755
04/30/12 01:47:22 [25065] put_file: going to send from filename /usr/libexec/condor/gridftp_wrapper.sh
04/30/12 01:47:22 [25065] put_file: Found file size 1057
04/30/12 01:47:22 [25065] put_file: sending 1057 bytes
04/30/12 01:47:22 [25065] ReliSock: put_file: sent 1057 bytes
04/30/12 01:47:22 [25065] DoUpload: exiting at 3003
04/30/12 01:47:25 [25065] GAHP[25071] <- 'RESULTS'
04/30/12 01:47:25 [25065] GAHP[25071] -> 'S' '0'
04/30/12 01:47:25 [25065] in doContactSchedd()
04/30/12 01:47:25 [25065] querying for removed/held jobs
04/30/12 01:47:25 [25065] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 01:47:25 [25065] Fetched 0 job ads from schedd
04/30/12 01:47:25 [25065] 1108.0 job status: 4
04/30/12 01:47:25 [25065] leaving doContactSchedd()
04/30/12 01:47:26 [25065] Evaluating staleness of remote job statuses.
04/30/12 01:47:42 [25065] Received REMOVE_JOBS signal
04/30/12 01:47:42 [25065] in doContactSchedd()
04/30/12 01:47:42 [25065] querying for new jobs
04/30/12 01:47:42 [25065] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (Matched =!= FALSE) && (JobStatus != 5) && (Managed =!= "External")
04/30/12 01:47:42 [25065] Fetched 0 new job ads from schedd
04/30/12 01:47:42 [25065] querying for removed/held jobs
04/30/12 01:47:42 [25065] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 01:47:42 [25065] Fetched 1 job ads from schedd
04/30/12 01:47:42 [25065] leaving doContactSchedd()
04/30/12 01:47:42 [25065] (1106.0) doEvaluateState called: gmState GM_UNSUBMITTED, globusState
04/30/12 01:47:42 [25065] (1106.0) gm state change: GM_UNSUBMITTED -> GM_DELETE
04/30/12 01:47:42 [25065] directory_util::rec_touch_file: Creating directory /tmp
04/30/12 01:47:42 [25065] directory_util::rec_touch_file: Creating directory /tmp/condorLocks
04/30/12 01:47:42 [25065] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13
04/30/12 01:47:42 [25065] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/13/73
04/30/12 01:47:42 [25065] FileLock object is updating timestamp on: /tmp/condorLocks/13/73/8341789162039746.lockc
04/30/12 01:47:42 [25065] (1106.0) Writing abort record to user logfile
04/30/12 01:47:42 [25065] FileLock::obtain(1) - @1335746862.880224 lock on /tmp/condorLocks/13/73/8341789162039746.lockc now WRITE
04/30/12 01:47:42 [25065] FileLock::obtain(2) - @1335746862.882102 lock on /tmp/condorLocks/13/73/8341789162039746.lockc now UNLOCKED
04/30/12 01:47:42 [25065] FileLock::obtain(1) - @1335746862.882247 lock on /tmp/condorLocks/13/73/8341789162039746.lockc now WRITE
04/30/12 01:47:42 [25065] directory_util::rec_clean_up: file /tmp/condorLocks/13/73/8341789162039746.lockc has been deleted.
04/30/12 01:47:42 [25065] Lock file /tmp/condorLocks/13/73/8341789162039746.lockc has been deleted.
04/30/12 01:47:42 [25065] FileLock::obtain(2) - @1335746862.882583 lock on /tmp/condorLocks/13/73/8341789162039746.lockc now UNLOCKED
04/30/12 01:47:47 [25065] in doContactSchedd()
04/30/12 01:47:47 [25065] querying for removed/held jobs
04/30/12 01:47:47 [25065] Using constraint ((Owner=?="zhrani"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
04/30/12 01:47:47 [25065] Fetched 1 job ads from schedd
04/30/12 01:47:47 [25065] Updating classad values for 1106.0:
04/30/12 01:47:47 [25065]    Managed = "ScheddDone"
04/30/12 01:47:47 [25065] Deleting job 1106.0 from schedd
04/30/12 01:47:47 [25065] GAHP[25071] <- 'UNCACHE_PROXY 1'
04/30/12 01:47:47 [25065] GAHP[25071] -> 'S'
04/30/12 01:47:47 [25065] No jobs left, shutting down
04/30/12 01:47:47 [25065] leaving doContactSchedd()
04/30/12 01:47:47 [25065] Got SIGTERM. Performing graceful shutdown.
04/30/12 01:47:47 [25065] Started timer to call main_shutdown_fast in 1800 seconds
04/30/12 01:47:47 [25065] **** condor_gridmanager (condor_GRIDMANAGER) pid 25065 EXITING WITH STATUS 0


Remote host log, covering both the Condor-G submission and the globus-job-submit run:

TIME: Mon Apr 30 01:46:26 2012
 PID: 6255 -- Notice: 6: globus-gatekeeper pid=6255 starting at Mon Apr 30 01:46:26 2012

TIME: Mon Apr 30 01:46:26 2012
 PID: 6255 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 01:46:26 2012

GSS authentication failure
GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error during delegation: Delegation protocol violation
Failure: GSS failed Major:000d0000 Minor:00000002 Token:00000000

TIME: Mon Apr 30 01:46:26 2012
 PID: 6255 -- Failure: GSS failed Major:000d0000 Minor:00000002 Token:00000000

TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 6: globus-gatekeeper pid=6260 starting at Mon Apr 30 01:48:46 2012

TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 01:48:46 2012

TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 5: Requested service: jobmanager
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 01:48:46 2012
 PID: 6260 -- Notice: 0: Child 6261 started
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 6: globus-gatekeeper pid=6275 starting at Mon Apr 30 01:49:21 2012

TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 6: Got connection 10.71.88.93 at Mon Apr 30 01:49:21 2012

TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-head.beng02.com/OU=beng02.com/CN=zahrani
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 5: Requested service: jobmanager
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 5: Authorized as local user: zhrani
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 5: Authorized as local uid: 516
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 5:           and local gid: 516
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 0: executing /usr/local/globus-4.2.0/libexec/globus-job-manager
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Mon Apr 30 01:49:21 2012
 PID: 6275 -- Notice: 0: Child 6276 started


Regards,