
Re: [HTCondor-users] condor: Error parsing classad or job not found



Hi Brian,
    I enabled verbose logging and have inlined the output below. I am not familiar with condor, so nothing jumps out at me. One thing that may be important: plain "qstat" does not work on this host; it requires a special argument that essentially specifies which queue qstat should use.

Thanks again,
Adam

-----------------------
GridmanagerLog:

02/13/15 15:10:46 Using processor count: 1 processors, 1 CPUs, 0 HTs
02/13/15 15:10:46 Enumerating interfaces: lo 127.0.0.1 up
02/13/15 15:10:46 Enumerating interfaces: eth0 160.91.202.132 up
02/13/15 15:10:46 Initializing Directory: curr_dir = /etc/condor/config.d
02/13/15 15:10:46 ******************************************************
02/13/15 15:10:46 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
02/13/15 15:10:46 ** /usr/sbin/condor_gridmanager
02/13/15 15:10:46 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
02/13/15 15:10:46 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
02/13/15 15:10:46 ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $
02/13/15 15:10:46 ** $CondorPlatform: x86_64_RedHat6 $
02/13/15 15:10:46 ** PID = 19922
02/13/15 15:10:46 ** Log last touched 2/13 15:05:22
02/13/15 15:10:46 ******************************************************
02/13/15 15:10:46 Using config source: /etc/condor/condor_config
02/13/15 15:10:46 Using local config sources: 
02/13/15 15:10:46    /etc/condor/condor_config.local
02/13/15 15:10:46 config Macros = 60, Sorted = 60, StringBytes = 1701, TablesBytes = 2208
02/13/15 15:10:46 CLASSAD_CACHING is ENABLED
02/13/15 15:10:46 Running as root.  Enabling specialized core dump routines
02/13/15 15:10:46 Daemon Log is logging: D_FULLDEBUG D_ALWAYS D_ERROR
02/13/15 15:10:46 Not using shared port because USE_SHARED_PORT=false
02/13/15 15:10:46 DaemonCore: command socket at <160.91.202.132:53679>
02/13/15 15:10:46 DaemonCore: private command socket at <160.91.202.132:53679>
02/13/15 15:10:46 Setting maximum accepts per cycle 8.
02/13/15 15:10:46 Setting maximum reaps per cycle 8.
02/13/15 15:10:46 Will use UDP to update collector workflow1.ccs.ornl.gov <160.91.202.132:9618>
02/13/15 15:10:46 Not using shared port because USE_SHARED_PORT=false
02/13/15 15:10:46 [19922] Welcome to the all-singing, all dancing, "amazing" GridManager!
02/13/15 15:10:46 [19922] DaemonCore: in SendAliveToParent()
02/13/15 15:10:46 [19922] Completed DC_CHILDALIVE to daemon at <160.91.202.132:41221>
02/13/15 15:10:46 [19922] DaemonCore: Leaving SendAliveToParent() - success
02/13/15 15:10:46 [19922] Checking proxies
02/13/15 15:10:49 [19922] Received ADD_JOBS signal
02/13/15 15:10:49 [19922] in doContactSchedd()
02/13/15 15:10:49 [19922] querying for new jobs
02/13/15 15:10:49 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (((Matched =!= FALSE) && (JobStatus != 5)) || (Managed =?= "External"))
02/13/15 15:10:49 [19922] Using job type INFNBatch for job 95.0
02/13/15 15:10:49 [19922] (95.0) SetJobLeaseTimers()
02/13/15 15:10:49 [19922] Found job 95.0 --- inserting
02/13/15 15:10:49 [19922] Fetched 1 new job ads from schedd
02/13/15 15:10:49 [19922] querying for removed/held jobs
02/13/15 15:10:49 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:10:49 [19922] Fetched 0 job ads from schedd
02/13/15 15:10:49 [19922] leaving doContactSchedd()
02/13/15 15:10:49 [19922] gahp server not up yet, delaying ping
02/13/15 15:10:49 [19922] *** UpdateLeases called
02/13/15 15:10:49 [19922]     Leases not supported, cancelling timer
02/13/15 15:10:49 [19922] BaseResource::UpdateResource: 
NumJobs = 1
HashName = "batch PBS"
Machine = "workflow1.ccs.ornl.gov"
SubmitsAllowed = 0
Name = "batch "
CondorPlatform = "$CondorPlatform: x86_64_RedHat6 $"
RunningJobs = 0
Owner = "atj"
MyType = "Grid"
ScheddName = "workflow1.ccs.ornl.gov"
MyAddress = "<160.91.202.132:53679>"
CondorVersion = "$CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $"
SubmitsWanted = 0
ScheddIpAddr = "<160.91.202.132:41221>"
CurrentTime = time()
MyCurrentTime = 1423858249
IdleJobs = 1
JobLimit = 1000

02/13/15 15:10:49 [19922] Trying to update collector <160.91.202.132:9618>
02/13/15 15:10:49 [19922] Attempting to send update via UDP to collector workflow1.ccs.ornl.gov <160.91.202.132:9618>
02/13/15 15:10:49 [19922] File descriptor limits: max 4096, safe 3277
02/13/15 15:10:49 [19922] (95.0) doEvaluateState called: gmState GM_INIT, remoteState 0
02/13/15 15:10:49 [19922] GAHP server pid = 19927
02/13/15 15:10:49 [19922] GAHP server version: $GahpVersion: 1.16.5 Mar 31 2008 INFN blahpd (poly,new_esc_format) $
02/13/15 15:10:49 [19922] GAHP[19927] <- 'COMMANDS'
02/13/15 15:10:49 [19922] GAHP[19927] -> 'S' 'ASYNC_MODE_OFF' 'ASYNC_MODE_ON' 'BLAH_GET_HOSTPORT' 'BLAH_JOB_CANCEL' 'BLAH_JOB_HOLD' 'BLAH_JOB_REFRESH_PROXY' 'BLAH_JOB_RESUME' 'BLAH_JOB_SEND_PROXY_TO_WORKER_NODE' 'BLAH_JOB_STATUS' 'BLAH_JOB_SUBMIT' 'BLAH_SET_GLEXEC_DN' 'BLAH_SET_GLEXEC_OFF' 'BLAH_SET_SUDO_ID' 'BLAH_SET_SUDO_OFF' 'COMMANDS' 'QUIT' 'RESULTS' 'VERSION'
02/13/15 15:10:49 [19922] GAHP[19927] <- 'ASYNC_MODE_ON'
02/13/15 15:10:49 [19922] GAHP[19927] -> 'S' 'Async mode on'
02/13/15 15:10:49 [19922] (95.0) gm state change: GM_INIT -> GM_START
02/13/15 15:10:49 [19922] (95.0) gm state change: GM_START -> GM_CLEAR_REQUEST
02/13/15 15:10:49 [19922] (95.0) gm state change: GM_CLEAR_REQUEST -> GM_UNSUBMITTED
02/13/15 15:10:49 [19922] (95.0) gm state change: GM_UNSUBMITTED -> GM_SAVE_SANDBOX_ID
02/13/15 15:10:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:10:54 [19922] resource  is now up
02/13/15 15:10:54 [19922] in doContactSchedd()
02/13/15 15:10:54 [19922] querying for removed/held jobs
02/13/15 15:10:54 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:10:54 [19922] Fetched 0 job ads from schedd
02/13/15 15:10:54 [19922] Updating classad values for 95.0:
02/13/15 15:10:54 [19922]    GridJobId = "batch pbs workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#95.0#1423858246"
02/13/15 15:10:54 [19922]    LastRemoteStatusUpdate = 1423858249
02/13/15 15:10:54 [19922] leaving doContactSchedd()
02/13/15 15:10:54 [19922] (95.0) doEvaluateState called: gmState GM_SAVE_SANDBOX_ID, remoteState 0
02/13/15 15:10:54 [19922] (95.0) gm state change: GM_SAVE_SANDBOX_ID -> GM_TRANSFER_INPUT
02/13/15 15:10:54 [19922] (95.0) gm state change: GM_TRANSFER_INPUT -> GM_SUBMIT
02/13/15 15:10:54 [19922] GAHP[19927] <- 'BLAH_JOB_SUBMIT 2 [\ RequestMemory\ =\ ifthenelse(MemoryUsage\ isnt\ undefined,MemoryUsage,(\ ImageSize\ +\ 1023\ )\ /\ 1024);\ queue\ =\ "titan";\ Out\ =\ "/ccs/home/atj/PEGASUS/condor_tests/out.95.0";\ cerequirements\ =\ NODES\ ==\ 1\ &&\ PROJECT\ ==\ "STF007"\ &&\ WALLTIME\ ==\ "00:03:00";\ gridtype\ =\ "pbs";\ Environment\ =\ "";\ GridResource\ =\ "pbs";\ Iwd\ =\ "/autofs/na4_home2/atj/PEGASUS/condor_tests";\ Err\ =\ "/ccs/home/atj/PEGASUS/condor_tests/err.95.0";\ In\ =\ "/dev/null";\ Cmd\ =\ "/bin/hostname";\ JobDirectory\ =\ "home_bl_workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#95.0#1423858246";\ CurrentTime\ =\ time();\ Arguments\ =\ ""\ ]'
02/13/15 15:10:54 [19922] GAHP[19927] -> 'S'
02/13/15 15:10:56 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:10:56 [19922] GAHP[19927] -> 'R'
02/13/15 15:10:56 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:10:56 [19922] GAHP[19927] -> '2' '0' 'No error' 'pbs/20150213/2249536'
02/13/15 15:10:56 [19922] (95.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
02/13/15 15:10:56 [19922] dirscat: dirpath = /tmp
02/13/15 15:10:56 [19922] dirscat: subdir = condorLocks
02/13/15 15:10:56 [19922] directory_util::rec_touch_file: Creating directory /tmp 
02/13/15 15:10:56 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks 
02/13/15 15:10:56 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/10 
02/13/15 15:10:56 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/10/77 
02/13/15 15:10:56 [19922] FileLock object is updating timestamp on: /tmp/condorLocks/10/77/231338051342387.lockc
02/13/15 15:10:56 [19922] WriteUserLog::initialize: opened /ccs/home/atj/PEGASUS/condor_tests/con.log successfully
02/13/15 15:10:56 [19922] (95.0) Writing grid submit record to user logfile
02/13/15 15:10:56 [19922] FileLock::obtain(1) - @1423858256.105281 lock on /tmp/condorLocks/10/77/231338051342387.lockc now WRITE
02/13/15 15:10:56 [19922] FileLock::obtain(2) - @1423858256.105892 lock on /tmp/condorLocks/10/77/231338051342387.lockc now UNLOCKED
02/13/15 15:10:56 [19922] FileLock::obtain(1) - @1423858256.105987 lock on /tmp/condorLocks/10/77/231338051342387.lockc now WRITE
02/13/15 15:10:56 [19922] directory_util::rec_clean_up: file /tmp/condorLocks/10/77/231338051342387.lockc has been deleted. 
02/13/15 15:10:56 [19922] Lock file /tmp/condorLocks/10/77/231338051342387.lockc has been deleted. 
02/13/15 15:10:56 [19922] FileLock::obtain(2) - @1423858256.106166 lock on /tmp/condorLocks/10/77/231338051342387.lockc now UNLOCKED
02/13/15 15:10:56 [19922] (95.0) gm state change: GM_SUBMIT -> GM_SUBMIT_SAVE
02/13/15 15:10:59 [19922] in doContactSchedd()
02/13/15 15:10:59 [19922] querying for removed/held jobs
02/13/15 15:10:59 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:10:59 [19922] Fetched 0 job ads from schedd
02/13/15 15:10:59 [19922] Updating classad values for 95.0:
02/13/15 15:10:59 [19922]    GridJobId = "batch pbs workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#95.0#1423858246 pbs/20150213/2249536"
02/13/15 15:10:59 [19922] leaving doContactSchedd()
02/13/15 15:10:59 [19922] (95.0) doEvaluateState called: gmState GM_SUBMIT_SAVE, remoteState 0
02/13/15 15:10:59 [19922] (95.0) gm state change: GM_SUBMIT_SAVE -> GM_SUBMITTED
02/13/15 15:11:46 [19922] Received CHECK_LEASES signal
02/13/15 15:11:46 [19922] in doContactSchedd()
02/13/15 15:11:46 [19922] querying for renewed leases
02/13/15 15:11:46 [19922] querying for removed/held jobs
02/13/15 15:11:46 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:11:46 [19922] Fetched 0 job ads from schedd
02/13/15 15:11:46 [19922] leaving doContactSchedd()
02/13/15 15:11:49 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:11:49 [19922] GAHP[19927] -> 'S' '0'
02/13/15 15:11:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:11:59 [19922] (95.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/13/15 15:11:59 [19922] (95.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
02/13/15 15:11:59 [19922] GAHP[19927] <- 'BLAH_JOB_STATUS 3 pbs/20150213/2249536'
02/13/15 15:11:59 [19922] GAHP[19927] -> 'S'
02/13/15 15:11:59 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:11:59 [19922] GAHP[19927] -> 'R'
02/13/15 15:11:59 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:11:59 [19922] GAHP[19927] -> '3' '1' 'Error parsing classad or job not found' '0' 'N/A'
02/13/15 15:11:59 [19922] (95.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/13/15 15:11:59 [19922] (95.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED
02/13/15 15:12:46 [19922] Received CHECK_LEASES signal
02/13/15 15:12:46 [19922] in doContactSchedd()
02/13/15 15:12:46 [19922] querying for renewed leases
02/13/15 15:12:46 [19922] querying for removed/held jobs
02/13/15 15:12:46 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:12:46 [19922] Fetched 0 job ads from schedd
02/13/15 15:12:46 [19922] leaving doContactSchedd()
02/13/15 15:12:49 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:12:49 [19922] GAHP[19927] -> 'S' '0'
02/13/15 15:12:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:12:59 [19922] (95.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/13/15 15:12:59 [19922] (95.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
02/13/15 15:12:59 [19922] GAHP[19927] <- 'BLAH_JOB_STATUS 4 pbs/20150213/2249536'
02/13/15 15:12:59 [19922] GAHP[19927] -> 'S'
02/13/15 15:13:00 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:13:00 [19922] GAHP[19927] -> 'R'
02/13/15 15:13:00 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:13:00 [19922] GAHP[19927] -> '4' '1' 'Error parsing classad or job not found' '0' 'N/A'
02/13/15 15:13:00 [19922] (95.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/13/15 15:13:00 [19922] (95.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED
02/13/15 15:13:46 [19922] Received CHECK_LEASES signal
02/13/15 15:13:46 [19922] in doContactSchedd()
02/13/15 15:13:46 [19922] querying for renewed leases
02/13/15 15:13:46 [19922] querying for removed/held jobs
02/13/15 15:13:46 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:13:46 [19922] Fetched 0 job ads from schedd
02/13/15 15:13:46 [19922] leaving doContactSchedd()
02/13/15 15:13:49 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:13:49 [19922] GAHP[19927] -> 'S' '0'
02/13/15 15:13:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:14:00 [19922] (95.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/13/15 15:14:00 [19922] (95.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
02/13/15 15:14:00 [19922] GAHP[19927] <- 'BLAH_JOB_STATUS 5 pbs/20150213/2249536'
02/13/15 15:14:00 [19922] GAHP[19927] -> 'S'
02/13/15 15:14:01 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:14:01 [19922] GAHP[19927] -> 'R'
02/13/15 15:14:01 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:14:01 [19922] GAHP[19927] -> '5' '1' 'Error parsing classad or job not found' '0' 'N/A'
02/13/15 15:14:01 [19922] (95.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/13/15 15:14:01 [19922] (95.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED
02/13/15 15:14:46 [19922] Received CHECK_LEASES signal
02/13/15 15:14:46 [19922] in doContactSchedd()
02/13/15 15:14:46 [19922] querying for renewed leases
02/13/15 15:14:46 [19922] querying for removed/held jobs
02/13/15 15:14:46 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:14:46 [19922] Fetched 0 job ads from schedd
02/13/15 15:14:46 [19922] leaving doContactSchedd()
02/13/15 15:14:49 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:14:49 [19922] GAHP[19927] -> 'S' '0'
02/13/15 15:14:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:15:01 [19922] (95.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/13/15 15:15:01 [19922] (95.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
02/13/15 15:15:01 [19922] GAHP[19927] <- 'BLAH_JOB_STATUS 6 pbs/20150213/2249536'
02/13/15 15:15:01 [19922] GAHP[19927] -> 'S'
02/13/15 15:15:01 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:15:01 [19922] GAHP[19927] -> 'R'
02/13/15 15:15:01 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:15:01 [19922] GAHP[19927] -> '6' '1' 'Error parsing classad or job not found' '0' 'N/A'
02/13/15 15:15:01 [19922] (95.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/13/15 15:15:01 [19922] (95.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED
02/13/15 15:15:46 [19922] Evaluating periodic job policy expressions.
02/13/15 15:15:46 [19922] Received CHECK_LEASES signal
02/13/15 15:15:46 [19922] in doContactSchedd()
02/13/15 15:15:46 [19922] querying for renewed leases
02/13/15 15:15:46 [19922] querying for removed/held jobs
02/13/15 15:15:46 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:15:46 [19922] Fetched 0 job ads from schedd
02/13/15 15:15:46 [19922] leaving doContactSchedd()
02/13/15 15:15:49 [19922] BaseResource::UpdateResource: 
NumJobs = 1
HashName = "batch PBS"
Machine = "workflow1.ccs.ornl.gov"
SubmitsAllowed = 1
Name = "batch "
CondorPlatform = "$CondorPlatform: x86_64_RedHat6 $"
RunningJobs = 0
Owner = "atj"
MyType = "Grid"
ScheddName = "workflow1.ccs.ornl.gov"
MyAddress = "<160.91.202.132:53679>"
CondorVersion = "$CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $"
SubmitsWanted = 0
ScheddIpAddr = "<160.91.202.132:41221>"
CurrentTime = time()
MyCurrentTime = 1423858549
IdleJobs = 1
JobLimit = 1000

02/13/15 15:15:49 [19922] Trying to update collector <160.91.202.132:9618>
02/13/15 15:15:49 [19922] Attempting to send update via UDP to collector workflow1.ccs.ornl.gov <160.91.202.132:9618>
02/13/15 15:15:49 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:15:49 [19922] GAHP[19927] -> 'S' '0'
02/13/15 15:15:51 [19922] Evaluating staleness of remote job statuses.
02/13/15 15:16:01 [19922] (95.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/13/15 15:16:01 [19922] (95.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
02/13/15 15:16:01 [19922] GAHP[19927] <- 'BLAH_JOB_STATUS 7 pbs/20150213/2249536'
02/13/15 15:16:01 [19922] GAHP[19927] -> 'S'
02/13/15 15:16:02 [19922] GAHP[19927] <- 'RESULTS'
02/13/15 15:16:02 [19922] GAHP[19927] -> 'R'
02/13/15 15:16:02 [19922] GAHP[19927] -> 'S' '1'
02/13/15 15:16:02 [19922] GAHP[19927] -> '7' '1' 'Error parsing classad or job not found' '0' 'N/A'
02/13/15 15:16:02 [19922] (95.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/13/15 15:16:02 [19922] (95.0) blah_job_status() failed: Error parsing classad or job not found
02/13/15 15:16:02 [19922] (95.0) gm state change: GM_POLL_ACTIVE -> GM_HOLD
02/13/15 15:16:02 [19922] dirscat: dirpath = /tmp
02/13/15 15:16:02 [19922] dirscat: subdir = condorLocks
02/13/15 15:16:02 [19922] directory_util::rec_touch_file: Creating directory /tmp 
02/13/15 15:16:02 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks 
02/13/15 15:16:02 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/10 
02/13/15 15:16:02 [19922] directory_util::rec_touch_file: Creating directory /tmp/condorLocks/10/77 
02/13/15 15:16:02 [19922] FileLock object is updating timestamp on: /tmp/condorLocks/10/77/231338051342387.lockc
02/13/15 15:16:02 [19922] WriteUserLog::initialize: opened /ccs/home/atj/PEGASUS/condor_tests/con.log successfully
02/13/15 15:16:02 [19922] (95.0) Writing hold record to user logfile
02/13/15 15:16:02 [19922] FileLock::obtain(1) - @1423858562.481823 lock on /tmp/condorLocks/10/77/231338051342387.lockc now WRITE
02/13/15 15:16:02 [19922] FileLock::obtain(2) - @1423858562.482502 lock on /tmp/condorLocks/10/77/231338051342387.lockc now UNLOCKED
02/13/15 15:16:02 [19922] FileLock::obtain(1) - @1423858562.482604 lock on /tmp/condorLocks/10/77/231338051342387.lockc now WRITE
02/13/15 15:16:02 [19922] directory_util::rec_clean_up: file /tmp/condorLocks/10/77/231338051342387.lockc has been deleted. 
02/13/15 15:16:02 [19922] Lock file /tmp/condorLocks/10/77/231338051342387.lockc has been deleted. 
02/13/15 15:16:02 [19922] FileLock::obtain(2) - @1423858562.482808 lock on /tmp/condorLocks/10/77/231338051342387.lockc now UNLOCKED
02/13/15 15:16:02 [19922] (95.0) gm state change: GM_HOLD -> GM_DELETE
02/13/15 15:16:02 [19922] in doContactSchedd()
02/13/15 15:16:02 [19922] querying for removed/held jobs
02/13/15 15:16:02 [19922] Using constraint ((Owner=?="atj"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
02/13/15 15:16:02 [19922] Fetched 0 job ads from schedd
02/13/15 15:16:02 [19922] Updating classad values for 95.0:
02/13/15 15:16:02 [19922]    EnteredCurrentStatus = 1423858562
02/13/15 15:16:02 [19922]    HoldReason = "Error parsing classad or job not found"
02/13/15 15:16:02 [19922]    HoldReasonCode = 0
02/13/15 15:16:02 [19922]    HoldReasonSubCode = 0
02/13/15 15:16:02 [19922]    JobStatus = 5
02/13/15 15:16:02 [19922]    Managed = "Schedd"
02/13/15 15:16:02 [19922]    NumSystemHolds = 1
02/13/15 15:16:02 [19922]    ReleaseReason = undefined
02/13/15 15:16:02 [19922] No jobs left, shutting down
02/13/15 15:16:02 [19922] leaving doContactSchedd()
02/13/15 15:16:02 [19922] Got SIGTERM. Performing graceful shutdown.
02/13/15 15:16:02 [19922] Started timer to call main_shutdown_fast in 1800 seconds
02/13/15 15:16:02 [19922] **** condor_gridmanager (condor_GRIDMANAGER) pid 19922 EXITING WITH STATUS 0
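
In case it helps anyone skimming a D_FULLDEBUG log like the one above, here is a minimal Python sketch that pulls out only the GAHP replies carrying a non-zero result code. It is not part of HTCondor; the function name failed_gahp_results and the regular expressions are my own assumptions, based only on the line format visible in this log (timestamp, [pid], GAHP[pid] ->, then single-quoted fields with no embedded quotes).

```python
import re

# Assumed line shape, taken from the log above:
#   MM/DD/YY HH:MM:SS [pid] GAHP[pid] -> 'field1' 'field2' ...
GAHP_REPLY = re.compile(r"^(\S+ \S+) \[\d+\] GAHP\[\d+\] -> (.*)$")
FIELD = re.compile(r"'([^']*)'")


def failed_gahp_results(lines):
    """Yield (timestamp, request_id, message) for replies with a non-zero result code.

    Hypothetical helper: it assumes a result line starts with a numeric
    request id followed by a numeric result code and a message, as in
    '3' '1' 'Error parsing classad or job not found' '0' 'N/A'.
    """
    for line in lines:
        match = GAHP_REPLY.match(line.strip())
        if not match:
            continue
        fields = FIELD.findall(match.group(2))
        # Acknowledgements such as 'S', 'R', or 'S' '1' are skipped because
        # their first field is not numeric or they have too few fields.
        if len(fields) < 3:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        if fields[1] != "0":
            yield match.group(1), int(fields[0]), fields[2]
```

Run over the log above (for example with open("GridmanagerLog") as f: list(failed_gahp_results(f)), where the path is whatever your configuration uses), this should list one failed BLAH_JOB_STATUS reply per polling cycle, for request ids 3 through 7, while skipping the successful BLAH_JOB_SUBMIT reply with id 2.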


On Feb 12, 2015, at 9:48 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

> Hi Adam,
> 
> Jaime and Derek really are the experts here, but this indicates that the query to PBS from HTCondor is failing.
> 
> A few thoughts:
> 
> 1) Do you have the PBS binaries somewhere besides /usr/bin?  I.e., maybe HTCondor can't find them?
> 2) Does PBS leave your jobs in the "C" state for a few minutes after they finish?
> 3) Can you increase the gridmanager log level so we can see more information about what it's doing?
> 
> I see you're running 8.2.6. I can't think of any bugs that you'd be hitting.  I used to know how to invoke the "pbs_status.sh" script directly (typically this error is caused by some errant output) but I can't recall right now.
> 
> Brian
> 
>> On Feb 12, 2015, at 4:33 PM, Simpson, Adam B. <simpsonab@xxxxxxxx> wrote:
>> 
>> Hi,
>>  I am new to condor and am having trouble getting it connected correctly to our PBS (Torque) system. I am able to submit jobs to PBS from condor, and they appear to run correctly, but condor_q never shows the jobs as running or finished, and after several minutes it places the jobs in the held state. Does anyone have any idea what may be going wrong? It might also help if I understood better how condor works with PBS to determine the state of the job when I run condor_q; does it use qstat, for instance?
>> 
>> Many Thanks,
>> Adam
>> 
>> My condor submission file:
>> 
>> universe=grid
>> grid_resource=pbs
>> skip_filechecks=true
>> transfer_executable=false
>> +remote_queue="thequeue"
>> +remote_cerequirements=NODES==1 && PROJECT=="ABC123" && WALLTIME=="00:03:00"
>> executable=/bin/hostname
>> output=/home/condor_tests/out.$(cluster).$(process)
>> error=/home/condor_tests/err.$(cluster).$(process)
>> log=/home/condor_tests/con.log
>> 
>> queue
>> 
>> -----------------
>> 
>> I submit it and it runs; out.$(cluster).$(process) contains the correct hostname as expected, but condor_q shows the job as idle:
>> 
>> $ condor_q
>> 
>> -- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
>> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>> 94.0   me             2/12 15:40   0+00:00:00 I  0   0.0  hostname
>> 
>> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>> 
>> After several minutes the state changes to held:
>> 
>> $ condor_q
>> 
>> -- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
>> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>> 94.0   me             2/12 15:41   0+00:00:00 H  0   0.0  hostname
>> 
>> 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
>> 
>> Looking at my log file (the one specified in the condor submission file), I see that "Error parsing classad or job not found" appears when the job changes from idle to held:
>> 
>> $ cat con.log
>> 000 (094.000.000) 02/12 15:41:10 Job submitted from host: <160.91.202.132:58725>
>> ...
>> 027 (094.000.000) 02/12 15:41:19 Job submitted to grid resource
>>  GridResource: pbs
>>  GridJobId: batch pbs workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#94.0#1423773670 pbs/20150212/2248486
>> ...
>> 012 (094.000.000) 02/12 15:46:27 Job was held.
>> Error parsing classad or job not found
>> Code 0 Subcode 0
>> 
>> And looking at the GridmanagerLog file:
>> 02/12/15 15:41:10 ******************************************************
>> 02/12/15 15:41:10 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
>> 02/12/15 15:41:10 ** /usr/sbin/condor_gridmanager
>> 02/12/15 15:41:10 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
>> 02/12/15 15:41:10 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
>> 02/12/15 15:41:10 ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $
>> 02/12/15 15:41:10 ** $CondorPlatform: x86_64_RedHat6 $
>> 02/12/15 15:41:10 ** PID = 23165
>> 02/12/15 15:41:10 ** Log last touched 2/12 15:39:31
>> 02/12/15 15:41:10 ******************************************************
>> 02/12/15 15:41:10 Using config source: /etc/condor/condor_config
>> 02/12/15 15:41:10 Using local config sources:
>> 02/12/15 15:41:10    /etc/condor/condor_config.local
>> 02/12/15 15:41:10 config Macros = 59, Sorted = 59, StringBytes = 1689, TablesBytes = 2172
>> 02/12/15 15:41:10 CLASSAD_CACHING is ENABLED
>> 02/12/15 15:41:10 Daemon Log is logging: D_ALWAYS D_ERROR
>> 02/12/15 15:41:10 DaemonCore: command socket at <160.91.202.132:45460>
>> 02/12/15 15:41:10 DaemonCore: private command socket at <160.91.202.132:45460>
>> 02/12/15 15:41:13 [23165] Found job 94.0 --- inserting
>> 02/12/15 15:41:13 [23165] gahp server not up yet, delaying ping
>> 02/12/15 15:41:13 [23165] (94.0) doEvaluateState called: gmState GM_INIT, remoteState 0
>> 02/12/15 15:41:13 [23165] GAHP server pid = 23171
>> 02/12/15 15:41:18 [23165] resource  is now up
>> 02/12/15 15:41:18 [23165] (94.0) doEvaluateState called: gmState GM_SAVE_SANDBOX_ID, remoteState 0
>> 02/12/15 15:41:19 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
>> 02/12/15 15:41:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT_SAVE, remoteState 0
>> 02/12/15 15:42:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
>> 02/12/15 15:42:24 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
>> 02/12/15 15:43:24 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
>> 02/12/15 15:43:25 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
>> 02/12/15 15:44:25 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
>> 02/12/15 15:44:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
>> 02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
>> 02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
>> 02/12/15 15:46:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
>> 02/12/15 15:46:27 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
>> 02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: Error parsing classad or job not found
>> 02/12/15 15:46:27 [23165] No jobs left, shutting down
>> 02/12/15 15:46:27 [23165] Got SIGTERM. Performing graceful shutdown.
>> 02/12/15 15:46:27 [23165] **** condor_gridmanager (condor_GRIDMANAGER) pid 23165 EXITING WITH STATUS 0
>> 
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>> 
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 