
Re: [HTCondor-users] Unable to find/track submitted PBS batch jobs



Hi Lukas,

What's the value of PBS_GAHP (condor_config_val -v PBS_GAHP)? I would unset it so that your setup uses the generic BATCH_GAHP (aka Bosco or BLAH), which should be set to "$(GLITE_LOCATION)/bin/batch_gahp".

What's the value of pbs_binpath (grep pbs_binpath `condor_config_val GLITE_LOCATION`/etc/batch_gahp.config)? Is it the directory that contains "qstat"?
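
If it helps, the pbs_binpath check can be scripted. This is only a minimal sketch: it assumes batch_gahp.config consists of shell-style key=value lines (values optionally quoted), and the helper names read_blah_setting and has_executable are made up for illustration, not part of HTCondor or BLAH.

```python
import os
import tempfile


def read_blah_setting(config_text, key):
    """Return the last value assigned to `key`, or None if it is not set.

    Assumption: the config is shell-style `key=value` text with `#` comments.
    """
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding double or single quotes.
            value = raw.strip().strip('"').strip("'")
    return value


def has_executable(directory, name="qstat"):
    """True if `directory/name` is a file the current user may execute."""
    if not directory:
        return False
    path = os.path.join(directory, name)
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

You would feed it the contents of `condor_config_val GLITE_LOCATION`/etc/batch_gahp.config and call has_executable(read_blah_setting(text, "pbs_binpath")); a False result means pbs_binpath does not point at a directory with a usable qstat.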

Under the hood, BLAH runs a qstat wrapper, so you should:

1) Verify that running qstat as a non-privileged user works from the host in question
2) Run the qstat wrapper for a job that's currently in the PBS queue:

    $ python `condor_config_val GLITE_LOCATION`/bin/pbs_status.py <BLAH JOB ID>

   Where <BLAH JOB ID> has the following format: pbs/<YYYYMMDD>/<PBS JOB ID> (for the job in your log, that is pbs/20200302/10044)
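
The BLAH job ID can be read off the GridJobId line in the user log. Here is a small sketch of that extraction, assuming the ID is always the last whitespace-separated field of GridJobId, as it is in the log quoted below; the function names are hypothetical helpers, not an HTCondor API.

```python
import re

# <lrms>/<YYYYMMDD>/<batch system job id>, e.g. pbs/20200302/10044
BLAH_ID_RE = re.compile(r"^(?P<lrms>[a-z]+)/(?P<date>\d{8})/(?P<job>\S+)$")


def blah_job_id_from_grid_job_id(grid_job_id):
    """Return the trailing BLAH job ID of a GridJobId string, or None.

    Assumption: the BLAH job ID is the last whitespace-separated field.
    """
    fields = grid_job_id.split()
    if not fields:
        return None
    candidate = fields[-1]
    return candidate if BLAH_ID_RE.match(candidate) else None


def split_blah_job_id(blah_id):
    """Split a BLAH job ID into (lrms, date, batch system job id)."""
    match = BLAH_ID_RE.match(blah_id)
    if match is None:
        raise ValueError("not a BLAH job ID: %r" % blah_id)
    return match.group("lrms"), match.group("date"), match.group("job")
```

The first function gives you the argument for pbs_status.py; the last element returned by split_blah_job_id is the number to look for in plain qstat output.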

- Brian

On 3/2/20 1:01 PM, Koschmieder, Lukas Michael wrote:
Hi,

I'm trying to set up Condor as an alternative interface to our PBS cluster.

This is my setup so far:

    - I've installed Condor (BoSCO) on our PBS login/submit node.
    - I've enabled MASTER, COLLECTOR, NEGOTIATOR, and SCHEDD.
    - I've set GLITE_LOCATION and PBS_GAHP in condor_config.
    - I've set pbs_binpath and pbs_spoolpath in GLITE_LOCATION/etc/batch_gahp.config.

With this setup, I can submit jobs to our PBS cluster using `condor_submit`. But for some reason, Condor cannot find or track the submitted jobs. While the actual PBS jobs keep running (and eventually terminate), the corresponding Condor "meta jobs" remain IDLE for a few minutes and then change their status to HELD.

Do you have an idea what might cause this behavior or how to debug it?

Cheers,
Lukas


User LOG:

    027 (001.000.000) 03/02 18:47:52 Job submitted to grid resource
        GridResource: batch pbs
        GridJobId: batch pbs acsrvcl02.gi.rwth-aachen.de_9618_acsrvcl02.gi.rwth-aachen.de#1.0#1583171263 pbs/20200302/10044
    ...
    012 (001.000.000) 03/02 18:53:01 Job was held.
            Error parsing classad or job not found
            Code 0 Subcode 0


GridmanagerLog.lukask (D_FULLDEBUG):

    03/02/20 18:50:43 [2578688] Received CHECK_LEASES signal
    03/02/20 18:50:43 [2578688] in doContactSchedd()
    03/02/20 18:50:43 [2578688] querying for renewed leases
    03/02/20 18:50:43 [2578688] querying for removed/held jobs
    03/02/20 18:50:43 [2578688] Using constraint ((Owner=?="lukask"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
    03/02/20 18:50:43 [2578688] Fetched 0 job ads from schedd
    03/02/20 18:50:43 [2578688] leaving doContactSchedd()
    03/02/20 18:50:45 [2578688] Evaluating periodic job policy expressions.
    03/02/20 18:50:46 [2578688] GAHP[2578692] <- 'RESULTS'
    03/02/20 18:50:46 [2578688] GAHP[2578692] -> 'S' '0'
    03/02/20 18:50:48 [2578688] Evaluating staleness of remote job statuses.
    03/02/20 18:50:58 [2578688] (1.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 0
    03/02/20 18:50:58 [2578688] (1.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
    03/02/20 18:50:58 [2578688] GAHP[2578692] <- 'BLAH_JOB_STATUS 5 pbs/20200302/10044'
    03/02/20 18:50:58 [2578688] GAHP[2578692] -> 'S'
    03/02/20 18:50:59 [2578688] GAHP[2578692] <- 'RESULTS'
    03/02/20 18:50:59 [2578688] GAHP[2578692] -> 'R'
    03/02/20 18:50:59 [2578688] GAHP[2578692] -> 'S' '1'
    03/02/20 18:50:59 [2578688] GAHP[2578692] -> '5' '1' 'Error parsing classad or job not found' '0' 'N/A'
    03/02/20 18:50:59 [2578688] (1.0) doEvaluateState called: gmState GM_POLL_ACTIVE, remoteState 0
    03/02/20 18:50:59 [2578688] (1.0) gm state change: GM_POLL_ACTIVE -> GM_SUBMITTED

