
[Condor-users] Failed to set Qdate...





I am seeing the following error when submitting jobs from
Condor-G into a Condor cluster.  The error occurred on 7 out
of 300 jobs.  Note that this was a stress test: I submitted
three batches of 100 jobs each, with the batches about
10 minutes apart.

Partway down the gram_job_mgr log, the following errors
appear when the globus-job-manager attempts to submit
the job to Condor.  The submission fails, and Condor-G on the remote
submit host shows the job in status H (held).  Unfortunately the
SchedLog has already rotated out, so I don't have that information.
From the Condor log of the job:

...
012 (040.072.000) 10/13 09:54:31 Job was held.
Globus error 17: the job failed when the job manager attempted to run it
Code 2 Subcode 17



<clip>
Thu Oct 13 09:52:25 2005 JM_SCRIPT: Using jm supplied job dir: /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:52:25 2005 JM_SCRIPT: Using jm supplied job dir: /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:52:25 2005 JM_SCRIPT: About to submit condor job
Thu Oct 13 09:52:25 2005 JM_SCRIPT: I am the parent
Thu Oct 13 09:53:08 2005 JM_SCRIPT: submission failed!!!
Thu Oct 13 09:53:18 2005 JM_SCRIPT: Sent NFS sync for /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/scheduler_condor_submit_stderr
Thu Oct 13 09:53:29 2005 JM_SCRIPT: Error file is not empty, and submission failed


Thu Oct 13 09:53:29 2005 JM_SCRIPT: Error text is
ERROR: Failed to set QDate=1129215145 for job 9137.0

ERROR: Failed to set CompletionDate=0 for job 9137.0

ERROR: Failed to set Owner="fnalgrid" for job 9137.0

ERROR: Failed to set RemoteWallClockTime=0.000000 for job 9137.0

ERROR: Failed to set LocalUserCpu=0.000000 for job 9137.0

ERROR: Failed to set LocalSysCpu=0.000000 for job 9137.0

ERROR: Failed to set RemoteUserCpu=0.000000 for job 9137.0

ERROR: Failed to set RemoteSysCpu=0.000000 for job 9137.0

ERROR: Failed to set ExitStatus=0 for job 9137.0

ERROR: Failed to set NumCkpts=0 for job 9137.0

ERROR: Failed to set NumRestarts=0 for job 9137.0

ERROR: Failed to set NumSystemHolds=0 for job 9137.0

ERROR: Failed to set CommittedTime=0 for job 9137.0

ERROR: Failed to set TotalSuspensions=0 for job 9137.0

ERROR: Failed to set LastSuspensionTime=0 for job 9137.0

ERROR: Failed to set CumulativeSuspensionTime=0 for job 9137.0

ERROR: Failed to set ExitBySignal=FALSE for job 9137.0

ERROR: Failed to set CondorVersion="$CondorVersion: 6.7.7 Apr 27 2005 $" for job 9137.0

ERROR: Failed to set CondorPlatform="$CondorPlatform: I386-LINUX_RH9 $" for job 9137.0

ERROR: Failed to set RootDir="/" for job 9137.0

ERROR: Failed to set Iwd="/home/fnalgrid/gram_scratch_hpL0apPGSd" for job 9137.0

ERROR: Failed to set JobUniverse=5 for job 9137.0

ERROR: Failed to set Cmd="/home/fnalgrid/.globus/.gass_cache/local/md5/45/2bda2377346e122f0c70404222c1de/md5/b0/c678798d7b3f3e7d70fda95f0a0367/data" for job 9137.0

ERROR: Failed to set MinHosts=1 for job 9137.0

ERROR: Failed to set MaxHosts=1 for job 9137.0

ERROR: Failed to set CurrentHosts=0 for job 9137.0

ERROR: Failed to set WantRemoteSyscalls=FALSE for job 9137.0

ERROR: Failed to set WantCheckpoint=FALSE for job 9137.0

ERROR: Failed to set JobStatus=1 for job 9137.0

ERROR: Failed to set EnteredCurrentStatus=1129215151 for job 9137.0

ERROR: Failed to set JobPrio=0 for job 9137.0

ERROR: Failed to set NiceUser=FALSE for job 9137.0

ERROR: Failed to set Env="X509_USER_PROXY=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/x509_up;GLOBUS_REMOTE_IO_URL=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/remote_io_url;GLOBUS_LOCATION=/export/osg/grid/globus;GLOBUS_GRAM_JOB_CONTACT=https://fngp-osg.fnal.gov:51523/19277/1129212071/;GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://fngp-osg.fnal.gov:51525/;SCRATCH_DIRECTORY=/home/fnalgrid//gram_scratch_hpL0apPGSd;LD_LIBRARY_PATH=/export/osg/grid/prima/lib:/export/osg/grid/voms/lib:/export/osg/grid/globus/lib:;HOME=/home/fnalgrid;LOGNAME=fnalgrid"; for job 9137.0

ERROR: Failed to set JobNotification=0 for job 9137.0

ERROR: Failed to set UserLog="/export/osg/grid/globus/tmp/gram_job_state/gram_condor_log.19277.1129212071" for job 9137.0

ERROR: Failed to set CoreSize=0 for job 9137.0

ERROR: Failed to set KillSig="SIGTERM" for job 9137.0

ERROR: Failed to set Rank=0.000000 for job 9137.0

ERROR: Failed to set In="/dev/null" for job 9137.0

ERROR: Failed to set StreamIn=FALSE for job 9137.0

ERROR: Failed to set Out="/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/stdout" for job 9137.0

ERROR: Failed to set StreamOut=FALSE for job 9137.0

ERROR: Failed to set Err="/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/stderr" for job 9137.0

ERROR: Failed to set StreamErr=FALSE for job 9137.0

ERROR: Failed to set BufferSize=524288 for job 9137.0

ERROR: Failed to set BufferBlockSize=32768 for job 9137.0

ERROR: Failed to set ShouldTransferFiles="NO" for job 9137.0

ERROR: Failed to set TransferFiles="NEVER" for job 9137.0

ERROR: Failed to set ImageSize=60 for job 9137.0

ERROR: Failed to set ExecutableSize=60 for job 9137.0

ERROR: Failed to set DiskUsage=60 for job 9137.0

ERROR: Failed to set Requirements=(OpSys == "LINUX" && Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain) for job 9137.0

ERROR: Failed to set FileSystemDomain="fnal.gov" for job 9137.0

ERROR: Failed to set AccountingGroup="group_fnalgrid.fnalgrid" for job 9137.0

ERROR: Failed to set PeriodicHold=FALSE for job 9137.0

ERROR: Failed to set PeriodicRelease=FALSE for job 9137.0

ERROR: Failed to set PeriodicRemove=FALSE for job 9137.0

ERROR: Failed to set OnExitHold=FALSE for job 9137.0

ERROR: Failed to set OnExitRemove=TRUE for job 9137.0

ERROR: Failed to set LeaveJobInQueue=FALSE for job 9137.0

ERROR: Failed to set Args="" for job 9137.0

ERROR: Failed to queue job.
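
For anyone who wants to check whether a stderr file like the one above lists every attribute or only a few, here is a minimal Python sketch. The function name `summarize_submit_errors` and the regular expression are my own illustration, not part of Condor or Globus; the pattern assumes the exact "ERROR: Failed to set NAME=VALUE for job CLUSTER.PROC" wording shown in this log.

```python
import re

# Assumed line format, taken from the stderr text quoted above:
#   ERROR: Failed to set <Attr>=<value> for job <cluster>.<proc>
FAILED_SET = re.compile(r"^ERROR: Failed to set (\w+)=(.*) for job (\d+\.\d+)$")


def summarize_submit_errors(stderr_text):
    """Return {job_id: [names of attributes that could not be set]}."""
    failed = {}
    for line in stderr_text.splitlines():
        match = FAILED_SET.match(line.strip())
        if match:
            attr, _value, job_id = match.groups()
            failed.setdefault(job_id, []).append(attr)
    return failed


# Three lines copied from the error text above, plus the final summary line.
sample = """\
ERROR: Failed to set QDate=1129215145 for job 9137.0

ERROR: Failed to set Owner="fnalgrid" for job 9137.0

ERROR: Failed to set Args="" for job 9137.0

ERROR: Failed to queue job.
"""

print(summarize_submit_errors(sample))
```

Run against the full scheduler_condor_submit_stderr contents, this would show whether the failures start at the very first attribute (QDate), as they do for job 9137.0 here.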

Thu Oct 13 09:53:29 2005 JM_SCRIPT: Writing extended error information to stderr
10/13 09:53:29 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE: ERROR: Failed to set QDate=1129215145 for job 9137.0 ERROR: Failed to set CompletionDate=0 for job 9137.0 ERROR: Failed to set Owner="fnalgrid" for job 9137.0 ERROR: Failed to set RemoteWallClockTime=0.000000 for job 9137.0 ERROR: Failed to set LocalUserCpu=0.000000 for job 9137.0 ERROR: Failed to set LocalSysCpu=0.000000 for job 9137.0 ERROR: Failed to set RemoteUserCpu=0.000000 for job 9137.0 ERROR: Failed to set RemoteSysCpu=0.000000 for job 9137.0 ERROR: Failed to set ExitStatus=0 for job 9137.0 ERROR: Failed to set NumCkpts=0 for job 9137.0 ERROR: Failed to set NumRestarts=0 for job 9137.0 ERROR: Failed to set NumSystemHolds=0 for job 9137.0 ERROR: Failed to set CommittedTime=0 for job 9137.0 ERROR: Failed to set TotalSuspensions=0 for job 9137.0 ERROR: Failed to set LastSuspensionTime=0 for job 9137.0 ERROR: Failed to set CumulativeSuspensionTime=0 for job 9137.0 ERROR: Failed to set
[...].0 ERROR: Failed to set Env="X509_USER_PROXY=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/x509_up;GLOBUS_REMOTE_IO_URL=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/remote_io_url;GLOBUS_LOCATION=/export/osg/grid/globus;GLOBUS_GRAM_JOB_CONTACT=https://fngp-osg.fnal.gov:51523/19277/1129212071/;GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://fngp-osg.fnal.gov:51525/;SCRATCH_DIRECTORY=/home/fnalgrid//gram_scratch_hpL0apPGSd;LD_LIBRARY_PATH=/export/osg/grid/prima/lib:/export/osg/grid/voms/lib:/export/osg/grid/globus/lib:;HOME=/home/fnalgrid;LOGNAME=fnalgrid"; for job 9137.0 ERROR: Failed to set JobNotification=0 for job 9137.0 ERROR: Failed to set UserLog="/export/osg/grid/globus/tmp/gram_job_state/gram_condor_log.19277.1129212071" for job 9137.0 ERROR: Failed to set CoreSize=0 for job 9137.0 ERROR: Failed to set KillSig="SIGTERM" for job 9137.0 ERROR: Failed to set Rank=0.000000 for job 9137.0 ERROR: Failed to set In="/dev/null" for job 9137.0 ERROR: Failed to set
[...]ingGroup="group_fnalgrid.fnalgrid" for job 9137.0 ERROR: Failed to set PeriodicHold=FALSE for job 9137.0 ERROR: Failed to set PeriodicRelease=FALSE for job 9137.0 ERROR: Failed to set PeriodicRemove=FALSE for job 9137.0 ERROR: Failed to set OnExitHold=FALSE for job 9137.0 ERROR: Failed to set OnExitRemove=TRUE for job 9137.0 ERROR: Failed to set LeaveJobInQueue=FALSE for job 9137.0 ERROR: Failed to set Args="" for job 9137.0 ERROR: Failed to queue job.
10/13 09:53:29 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = ERROR: Failed to set QDate=1129215145 for job 9137.0 ERROR: Failed to set CompletionDate=0 for job 9137.0 ERROR: Failed to set Owner="fnalgrid" for job 9137.0 ERROR: Failed to set RemoteWallClockTime=0.000000 for job 9137.0 ERROR: Failed to set LocalUserCpu=0.000000 for job 9137.0 ERROR: Failed to set LocalSysCpu=0.000000 for job 9137.0 ERROR: Failed to set RemoteUserCpu=0.000000 for job 9137.0 ERROR: Failed to set RemoteSysCpu=0.000000 for job 9137.0 ERROR: Failed to set ExitStatus=0 for job 9137.0 ERROR: Failed to set NumCkpts=0 for job 9137.0 ERROR: Failed to set NumRestarts=0 for job 9137.0 ERROR: Failed to set NumSystemHolds=0 for job 9137.0 ERROR: Failed to set CommittedTime=0 for job 9137.0 ERROR: Failed to set TotalSuspensions=0 for job 9137.0 ERROR: Failed to set LastSuspensionTime=0 for job 9137.0 ERROR: Failed to set CumulativeSuspensionTime=0 for job 9137.0 ERROR: Failed to set ExitB
[...]ROR: Failed to set Env="X509_USER_PROXY=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/x509_up;GLOBUS_REMOTE_IO_URL=/home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071/remote_io_url;GLOBUS_LOCATION=/export/osg/grid/globus;GLOBUS_GRAM_JOB_CONTACT=https://fngp-osg.fnal.gov:51523/19277/1129212071/;GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://fngp-osg.fnal.gov:51525/;SCRATCH_DIRECTORY=/home/fnalgrid//gram_scratch_hpL0apPGSd;LD_LIBRARY_PATH=/export/osg/grid/prima/lib:/export/osg/grid/voms/lib:/export/osg/grid/globus/lib:;HOME=/home/fnalgrid;LOGNAME=fnalgrid"; for job 9137.0 ERROR: Failed to set JobNotification=0 for job 9137.0 ERROR: Failed to set UserLog="/export/osg/grid/globus/tmp/gram_job_state/gram_condor_log.19277.1129212071" for job 9137.0 ERROR: Failed to set CoreSize=0 for job 9137.0 ERROR: Failed to set KillSig="SIGTERM" for job 9137.0 ERROR: Failed to set Rank=0.000000 for job 9137.0 ERROR: Failed to set In="/dev/null" for job 9137.0 ERROR: Failed to set Stream
[...]up="group_fnalgrid.fnalgrid" for job 9137.0 ERROR: Failed to set PeriodicHold=FALSE for job 9137.0 ERROR: Failed to set PeriodicRelease=FALSE for job 9137.0 ERROR: Failed to set PeriodicRemove=FALSE for job 9137.0 ERROR: Failed to set OnExitHold=FALSE for job 9137.0 ERROR: Failed to set OnExitRemove=TRUE for job 9137.0 ERROR: Failed to set LeaveJobInQueue=FALSE for job 9137.0 ERROR: Failed to set Args="" for job 9137.0 ERROR: Failed to queue job.
10/13 09:53:29 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17
10/13 09:53:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
10/13 09:53:29 JM: in globus_gram_job_manager_reporting_file_create()
10/13 09:53:29 JM: not reporting job information
10/13 09:53:29 JM: in globus_gram_job_manager_history_file_create()
10/13 09:53:29 JM: Writing state file
10/13 09:53:29 JM: Writing state file
10/13 09:53:29 JM: in globus_gram_job_manager_reporting_file_create()
10/13 09:53:29 JM: not reporting job information
10/13 09:53:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
10/13 09:53:29 JM: Writing state file
10/13 09:53:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
10/13 09:53:29 JM: NOT empty client callback list.
10/13 09:53:29 JM: sending callback of status 4 (failure code 17) to https://snowball.fnal.gov:33169/.
10/13 09:53:29 globus_gram_job_manager_query_callback() not a literal URI match
10/13 09:53:29 JM : in globus_l_gram_job_manager_query_callback, query=signal 10
10/13 09:53:29 JM : reply: (status=4 failure code=0 (Success))
10/13 09:53:29 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 0
job-failure-code: 17


10/13 09:53:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
10/13 09:53:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
10/13 09:53:29 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/13 09:53:29 JMI: completed script validation: job manager type is condor.
10/13 09:53:29 JMI: in globus_gram_job_manager_rm_scratchdir()
10/13 09:53:29 JMI: cmd = remove_scratchdir
10/13 09:53:31 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
10/13 09:53:31 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/13 09:53:31 JMI: completed script validation: job manager type is condor.
10/13 09:53:31 JMI: cmd = cache_cleanup
Thu Oct 13 09:53:32 2005 JM_SCRIPT: New Perl JobManager created.
Thu Oct 13 09:53:32 2005 JM_SCRIPT: Using jm supplied job dir: /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:53:32 2005 JM_SCRIPT: Using jm supplied job dir: /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:53:32 2005 JM_SCRIPT: cache_cleanup(enter)
Thu Oct 13 09:53:34 2005 JM_SCRIPT: Cleaning files in job dir /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:54:29 2005 JM_SCRIPT: Removed 6 files from /home/fnalgrid/.globus/job/fngp-osg.fnal.gov/19277.1129212071
Thu Oct 13 09:54:29 2005 JM_SCRIPT: cache_cleanup(exit)
10/13 09:54:29 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CACHE_CLEAN_UP
10/13 09:54:29 JM: in globus_gram_job_manager_reporting_file_remove()
10/13 09:54:29 JM: exiting globus_gram_job_manager.
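
To line up the epoch values in the error text (QDate, EnteredCurrentStatus) with the wall-clock times in the job manager log, a small Python sketch follows. It assumes the gatekeeper logs in US Central Daylight Time (UTC-5); that offset is my assumption and is not stated anywhere in the log. The helper name `epoch_to_local` is only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: log timestamps are US Central Daylight Time (UTC-5).
CDT = timezone(timedelta(hours=-5), "CDT")


def epoch_to_local(epoch_seconds, tz=CDT):
    """Convert a Unix epoch value (as stored in QDate) to a local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz)


# Values copied from the error text above.
qdate = epoch_to_local(1129215145)
entered_status = epoch_to_local(1129215151)

# "submission failed!!!" timestamp copied from the JM_SCRIPT log line.
failed_at = datetime(2005, 10, 13, 9, 53, 8, tzinfo=CDT)

print("QDate:               ", qdate.strftime("%Y-%m-%d %H:%M:%S"))
print("EnteredCurrentStatus:", entered_status.strftime("%Y-%m-%d %H:%M:%S"))
print("Seconds from QDate to reported failure:",
      (failed_at - qdate).total_seconds())
```

Under that time-zone assumption, QDate falls at 09:52:25, the same second as the "About to submit condor job" line, and the failure is reported 43 seconds later.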