Re: [HTCondor-users] condor: Error parsing classad or job not found



Hi Adam,

Jaime and Derek really are the experts here, but this indicates that the query to PBS from HTCondor is failing.

A few thoughts:

1) Do you have the PBS binaries somewhere besides /usr/bin?  I.e., maybe HTCondor can't find them?
2) Does PBS leave your jobs in the "C" (completed) state for a few minutes after they finish?
3) Can you increase the gridmanager log level (for example, by setting GRIDMANAGER_DEBUG = D_FULLDEBUG in the HTCondor configuration) so we can see more information about what it's doing?
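
For the first point, a quick check of where the PBS client commands resolve can be sketched as below. This is only an illustration: the list of command names is an assumption (which commands HTCondor's PBS scripts actually call is not stated in this thread), and the helper name locate_commands is hypothetical. The result reflects the PATH of whoever runs the script, which may differ from the environment the gridmanager runs in.

```python
import shutil

# Torque/PBS client commands to look for. ASSUMPTION: this list is
# illustrative; the thread does not say which commands HTCondor calls.
PBS_COMMANDS = ("qsub", "qstat", "qdel")


def locate_commands(names=PBS_COMMANDS, search_path=None):
    """Map each command name to its full path, or None if it is not found.

    search_path is a PATH-style string; None means use the current PATH.
    """
    return {name: shutil.which(name, path=search_path) for name in names}


if __name__ == "__main__":
    for name, location in locate_commands().items():
        print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

If any command resolves outside /usr/bin, or is not found at all, that would be worth mentioning in a follow-up.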

I see you're running 8.2.6, and I can't think of any bugs in that release that you'd be hitting. This error is typically caused by some errant output from the "pbs_status.sh" script. I used to know how to invoke that script directly, but I can't recall right now.
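
Once more verbose logging is available, it may help to pull out only the failed status polls from the GridmanagerLog. The following is a minimal sketch that assumes the line layout seen in the log excerpt quoted below (timestamp, PID in brackets, job ID in parentheses); the function name find_status_failures is hypothetical, and the pattern may need adjusting for other log formats.

```python
import re

# Matches gridmanager lines such as:
# 02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: <reason>
# ASSUMPTION: the layout is taken from the log excerpt in this thread.
FAILURE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+\] "
    r"\((?P<job>\d+\.\d+)\) blah_job_status\(\) failed: (?P<reason>.*)$"
)


def find_status_failures(lines):
    """Return (timestamp, job_id, reason) for each failed status poll."""
    failures = []
    for line in lines:
        # Drop any "> " quoting prefix left over from an email reply.
        match = FAILURE_RE.match(line.lstrip("> ").rstrip())
        if match:
            failures.append((match["stamp"], match["job"], match["reason"]))
    return failures


if __name__ == "__main__":
    sample = [
        "02/12/15 15:46:27 [23165] (94.0) doEvaluateState called: "
        "gmState GM_POLL_ACTIVE, remoteState 0",
        "02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: "
        "Error parsing classad or job not found",
    ]
    for stamp, job, reason in find_status_failures(sample):
        print(f"{stamp} job {job}: {reason}")
```

To run it against a real log, pass the open file object to find_status_failures, since it accepts any iterable of lines.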

Brian

> On Feb 12, 2015, at 4:33 PM, Simpson, Adam B. <simpsonab@xxxxxxxx> wrote:
> 
> Hi,
>   I am new to condor and am having trouble connecting it to our PBS (Torque) system. I can submit jobs to PBS from condor, and they appear to run correctly, but condor_q never shows the jobs as running or finished, and after several minutes it places them in the held state. Does anyone have any ideas about what may be going wrong? It might also help if I understood better how condor works with PBS to determine the state of a job when I run condor_q. Does it use qstat, for instance?
> 
> Many Thanks,
> Adam
> 
> My condor submission file:
> 
> universe=grid
> grid_resource=pbs
> skip_filechecks=true
> transfer_executable=false
> +remote_queue="thequeue"
> +remote_cerequirements=NODES==1 && PROJECT=="ABC123" && WALLTIME=="00:03:00"
> executable=/bin/hostname
> output=/home/condor_tests/out.$(cluster).$(process)
> error=/home/condor_tests/err.$(cluster).$(process)
> log=/home/condor_tests/con.log
> 
> queue
> 
> -----------------
> 
> I submit it and it runs; out.$(cluster).$(process) contains the correct hostname as expected, but condor_q shows the job as idle:
> 
> $ condor_q
> 
> -- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> 94.0   me             2/12 15:40   0+00:00:00 I  0   0.0  hostname
> 
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
> 
> After several minutes the state changes to held:
> 
> $ condor_q
> 
> -- Submitter: myhost.org : <123.45.678.911:12345> : myhost.org
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> 94.0   me             2/12 15:41   0+00:00:00 H  0   0.0  hostname
> 
> 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
> 
> In the log file specified in the condor submission file, I see that "Error parsing classad or job not found" appears when the job changes from idle to held:
> 
> $ cat con.log
> 000 (094.000.000) 02/12 15:41:10 Job submitted from host: <160.91.202.132:58725>
> ...
> 027 (094.000.000) 02/12 15:41:19 Job submitted to grid resource
>   GridResource: pbs
>   GridJobId: batch pbs workflow1.ccs.ornl.gov_workflow1.ccs.ornl.gov#94.0#1423773670 pbs/20150212/2248486
> ...
> 012 (094.000.000) 02/12 15:46:27 Job was held.
> Error parsing classad or job not found
> Code 0 Subcode 0
> 
> And looking at the GridmanagerLog file:
> 02/12/15 15:41:10 ******************************************************
> 02/12/15 15:41:10 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
> 02/12/15 15:41:10 ** /usr/sbin/condor_gridmanager
> 02/12/15 15:41:10 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
> 02/12/15 15:41:10 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
> 02/12/15 15:41:10 ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $
> 02/12/15 15:41:10 ** $CondorPlatform: x86_64_RedHat6 $
> 02/12/15 15:41:10 ** PID = 23165
> 02/12/15 15:41:10 ** Log last touched 2/12 15:39:31
> 02/12/15 15:41:10 ******************************************************
> 02/12/15 15:41:10 Using config source: /etc/condor/condor_config
> 02/12/15 15:41:10 Using local config sources:
> 02/12/15 15:41:10    /etc/condor/condor_config.local
> 02/12/15 15:41:10 config Macros = 59, Sorted = 59, StringBytes = 1689, TablesBytes = 2172
> 02/12/15 15:41:10 CLASSAD_CACHING is ENABLED
> 02/12/15 15:41:10 Daemon Log is logging: D_ALWAYS D_ERROR
> 02/12/15 15:41:10 DaemonCore: command socket at <160.91.202.132:45460>
> 02/12/15 15:41:10 DaemonCore: private command socket at <160.91.202.132:45460>
> 02/12/15 15:41:13 [23165] Found job 94.0 --- inserting
> 02/12/15 15:41:13 [23165] gahp server not up yet, delaying ping
> 02/12/15 15:41:13 [23165] (94.0) doEvaluateState called: gmState GM_INIT, remoteState 0
> 02/12/15 15:41:13 [23165] GAHP server pid = 23171
> 02/12/15 15:41:18 [23165] resource  is now up
> 02/12/15 15:41:18 [23165] (94.0) doEvaluateState called: gmState GM_SAVE_SANDBOX_ID, remoteState 0
> 02/12/15 15:41:19 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT, remoteState 0
> 02/12/15 15:41:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMIT_SAVE, remoteState 0
> 02/12/15 15:42:23 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
> 02/12/15 15:42:24 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
> 02/12/15 15:43:24 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
> 02/12/15 15:43:25 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
> 02/12/15 15:44:25 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
> 02/12/15 15:44:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
> 02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
> 02/12/15 15:45:26 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
> 02/12/15 15:46:26 [23165] (94.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
> 02/12/15 15:46:27 [23165] (94.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
> 02/12/15 15:46:27 [23165] (94.0) blah_job_status() failed: Error parsing classad or job not found
> 02/12/15 15:46:27 [23165] No jobs left, shutting down
> 02/12/15 15:46:27 [23165] Got SIGTERM. Performing graceful shutdown.
> 02/12/15 15:46:27 [23165] **** condor_gridmanager (condor_GRIDMANAGER) pid 23165 EXITING WITH STATUS 0
> 