
[HTCondor-users] condor: Error parsing classad or job not found



Hi,
   I am new to condor and am having trouble getting it connected correctly to our PBS (Torque) system. I am able to submit jobs to PBS from condor, and they appear to run correctly, but condor_q never shows the jobs as running or finished, and after several minutes it places the jobs in the held state. Does anyone have any ideas about what may be going wrong? It might also help if I understood better how condor works with PBS to determine the state of a job when I run condor_q. Does it use qstat, for instance?

Many Thanks,
Adam

My condor submission file:

universe=grid
grid_resource=pbs
skip_filechecks=true
transfer_executable=false
+remote_queue="thequeue"
+remote_cerequirements=NODES==1 && PROJECT=="ABC123" && WALLTIME=="00:03:00"
executable=/bin/hostname
output=/home/condor_tests/out.$(cluster).$(process)
error=/home/condor_tests/err.$(cluster).$(process)
log=/home/condor_tests/con.log

queue

-----------------

I submit it and it runs. The output file out.$(cluster).$(process) contains the correct hostname as expected, but condor_q shows the job as idle:

$ condor_q

-- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 94.0   me             2/12 15:40   0+00:00:00 I  0   0.0  hostname

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

After several minutes the state changes to held:

$ condor_q

-- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 94.0   me             2/12 15:41   0+00:00:00 H  0   0.0  hostname

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

Looking at my log file (the one specified in the condor submission file), I see that "Error parsing classad or job not found" appears when the job changes from idle to held:

$ cat con.log
000 (094.000.000) 02/12 15:41:10 Job submitted from host: <160.91.202.132:58725>
...
027 (094.000.000) 02/12 15:41:19 Job submitted to grid resource
   GridResource: pbs
   GridJobId: batch pbs workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#94.0#1423773670 pbs/20150212/2248486
...
012 (094.000.000) 02/12 15:46:27 Job was held.
Error parsing classad or job not found
Code 0 Subcode 0
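
To cross-check the job by hand on the PBS side, the batch job ID can be pulled out of the GridJobId line. This is only a minimal sketch: it assumes the last field always has the form pbs/<date>/<job id>, which I am inferring from the single log entry above rather than from any documentation, and the function name extract_batch_job_id is my own.

```python
import re

# Assumption: the final whitespace-separated field of a GridJobId line looks
# like "<lrms>/<YYYYMMDD>/<job id>", e.g. "pbs/20150212/2248486". This is
# inferred from one sample log entry, not from documentation.
GRID_JOB_ID = re.compile(r"GridJobId:.*\s(\w+)/(\d{8})/(\S+)\s*$")


def extract_batch_job_id(log_text):
    """Return (lrms, date, job_id) tuples for every GridJobId line found."""
    found = []
    for line in log_text.splitlines():
        match = GRID_JOB_ID.search(line)
        if match:
            found.append(match.groups())
    return found


sample = (
    "   GridJobId: batch pbs "
    "workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#94.0#1423773670 "
    "pbs/20150212/2248486"
)
print(extract_batch_job_id(sample))
```

The last element of each tuple is the number that can be compared against what qstat reports for the queue.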

And looking at the GridmanagerLog file:
02/12/15 15:41:10 ******************************************************
02/12/15 15:41:10 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
02/12/15 15:41:10 ** /usr/sbin/condor_gridmanager
02/12/15 15:41:10 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
02/12/15 15:41:10 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
02/12/15 15:41:10 ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $
02/12/15 15:41:10 ** $CondorPlatform: x86_64_RedHat6 $
02/12/15 15:41:10 ** PID = 23165
02/12/15 15:41:10 ** Log last touched 2/12 15:39:31
02/12/15 15:41:10 ******************************************************
02/12/15 15:41:10 Using config source: /etc/condor/condor_config
02/12/15 15:41:10 Using local config sources:
02/12/15 15:41:10    /etc/condor/condor_config.local
02/12/15 15:41:10 config Macros = 59, Sorted = 59, StringBytes = 1689, TablesBytes = 2172
02/12/15 15:41:10 CLASSAD_CACHING is ENABLED
02/12/15 15:41:10 Daemon Log is logging: D_ALWAYS D_ERROR
02/12/15 15:41:10 DaemonCore: command socket at <160.91.202.132:45460>
02/12/15 15:41:10 DaemonCore: private command socket at <160.91.202.132:45460>
02/12/15 15:41:13 [23165] Found job 94.0 --- inserting
02/12/15 15:41:13 [23165] gahp server not up yet, delaying ping
02/12/15 15:41:13 [23165] (94.0) doEvaluateState called: gmState GM_INIT, remoteState 0
02/12/15 15:41:13 [23165] GAHP server pid = 23171
02/12/15 15:41:18 [23165] resource  is now up
02/12/15 15:41:18 [23165] (94.0) doEvaluateState called: gmState GM_SAVE_SANDBOX_ID, remoteState 0
02/12/15 15:41:19 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
02/12/15 15:41:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT_SAVE, remoteState 0
02/12/15 15:42:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/12/15 15:42:24 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:43:24 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/12/15 15:43:25 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:44:25 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/12/15 15:44:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:46:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
02/12/15 15:46:27 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: Error parsing classad or job not found
02/12/15 15:46:27 [23165] No jobs left, shutting down
02/12/15 15:46:27 [23165] Got SIGTERM. Performing graceful shutdown.
02/12/15 15:46:27 [23165] **** condor_gridmanager (condor_GRIDMANAGER) pid 23165 EXITING WITH STATUS 0
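
To see how long the gridmanager polled before giving up, the log above can be summarized per job. This is a rough sketch that assumes every relevant line has the layout "<date> <time> [pid] (cluster.proc) message" seen in this excerpt; the summarize helper and the dictionary layout are hypothetical names of my own.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
#   MM/DD/YY HH:MM:SS [pid] (cluster.proc) message
LINE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \[\d+\] \((\d+\.\d+)\) (.*)$"
)


def summarize(log_text):
    """Collect gmState transitions and the first 'failed' message per job."""
    jobs = {}
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # startup banner and other lines without a job ID
        when = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        job = jobs.setdefault(m.group(2), {"states": [], "failure": None})
        message = m.group(3)
        state = re.search(r"gmState (\w+)", message)
        if state:
            job["states"].append((when, state.group(1)))
        elif "failed" in message and job["failure"] is None:
            job["failure"] = (when, message)
    return jobs


sample = """\
02/12/15 15:41:19 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
02/12/15 15:46:27 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: Error parsing classad or job not found
"""

info = summarize(sample)["94.0"]
submitted = next(t for t, s in info["states"] if s == "GM_SUBMIT")
elapsed = info["failure"][0] - submitted
print(elapsed, "-", info["failure"][1])
```

For job 94.0 this gives a little over five minutes between GM_SUBMIT and the blah_job_status() failure, which matches the "several minutes" before the job is held.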