
Re: [HTCondor-users] (Yes & No!) Try to set DedicatedScheduler = NO JOBS EVER onto WN :[



Hi Winnie,

On 1/16/20 5:02 AM, Winnie Lacesso wrote:
> So this looks like "job from lcgce01":
>
> NordugridQueue = "gridAMD"
>
> On a WN, .job.ad file from a grid job vs .job.ad file from a local
> user was compared. Local user .job.ad file not have that!
>
> Hope! Put in /etc/condor/config.d/20_workernode.config on experimental WN:
>
> START = NordugridQueue == "gridAMD"
>
> (Is that syntax correct?)
> & ran condor_reconfig

That syntax looks right. You can run `condor_config_val -verbose START`
on the worker node to verify that the setting was picked up and to see
which configuration file defined it. A service restart also wouldn't
hurt.
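
As a rough sketch of that check (the function name `show_start_config`
is just a placeholder, and this assumes the HTCondor command-line tools
are in PATH on the worker node), something like the following lists the
configuration files being read and then the effective START value:

```shell
#!/bin/sh
# Sketch: confirm which START expression this node's configuration yields.
# Intended to be run on the worker node itself.

show_start_config() {
    # Bail out cleanly if the HTCondor tools are not installed here.
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found in PATH" >&2
        return 1
    fi
    # List the configuration files HTCondor is reading.
    condor_config_val -config
    # Show the effective START value and where it was defined.
    condor_config_val -verbose START
}

show_start_config || echo "skipped: run this on a node with HTCondor installed"
```

If the file reported for START is not 20_workernode.config, a file read
later may be overriding the setting.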

> Progress - grid jobs land+run on WN! Big Improvement over "no jobs, EVER"!
>
> But then I submitted jobs as myself (= a local user) not via lcgce01,
> & they *also* land+run on this experimental WN. DARN!
>
> So, it doesn't seem to work either (to prevent local user jobs)! :[
> I can't parse the htc00 /var/log/condor/*Log logs to understand where
> the WN is supposed to be telling it "I only want jobs with blahblah set"
>
> root@htc00> cd /var/log/condor
> root@htc00> nice -n 19 grep -li NordugridQueue *Log
> root@htc00> # nothing
> Where might one find out what the WN is "advertising" to the MatchMaker?
> (Assumption: it is logged somewhere! Could be wrong about that!)
>
> In MatchLog is a definite "Matched 2005139.0 phpwl@xxxxxxxxxxxxxx
> snip snip ... slot1@xxxxxxxxxxxxxxxxxxxxxxxx"
>
> gridftp02 (yeah, was gridftp server, repurposed as WN) is the experimental
> WN who's supposed to advertise
> START = NordugridQueue == "gridAMD"
> and my local-user job (this is confirmed) does NOT have that in its
> condor_q -l output, or in the .job.ad file when it's running on the
> experimental WN.

To help with matching issues, `condor_q -better-analyze` (which can be
shortened to `-better`) will be your best friend. For instance, if your
locally submitted job ID is 1234.0, you can run the following to check
why a particular slot does or does not match:

$ condor_q -better -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxx 1234.0

Alternatively, you can check how many resources match your job with some 
hints for why or why not:

$ condor_q -better 1234.0
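
On the question of what the WN is advertising: the slot's ad can be
queried from the collector directly rather than dug out of the logs. A
minimal sketch, assuming the tools are in PATH, that `condor_q` is run
where the job was submitted, and with a made-up slot name
(`slot1@wn.example.org`) and the example job ID from above standing in
for your own values:

```shell
#!/bin/sh
# Sketch: compare what the slot advertises with what the job carries.
SLOT="slot1@wn.example.org"   # hypothetical slot name, replace with yours
JOB_ID="1234.0"               # example job ID, replace with yours

compare_ads() {
    slot="$1"
    job="$2"
    # Bail out cleanly if the HTCondor tools are not installed here.
    for tool in condor_status condor_q; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found in PATH" >&2
            return 1
        fi
    done
    # Print the slot name and the Start expression from its machine ad.
    condor_status "$slot" -af Name Start
    # Print the job's NordugridQueue attribute; a job without it
    # shows "undefined".
    condor_q "$job" -af NordugridQueue
}

compare_ads "$SLOT" "$JOB_ID" || echo "skipped: HTCondor tools not available"
```

If the Start value printed for the slot is not the NordugridQueue
expression, the startd is not using the configuration you expect.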

>
> Note - my local user account doesn't even have an /etc/passwd entry on the
> test WN (that's one somewhat sort of drastic way supposedly to try to
> prevent local user jobs landing on WN where they're not supposed to).
>
> It's a bit astonishing the local-user job was allowed to start+run with no
> entry in /etc/passwd!...
>
> Any further tips/clues/advice most gratefully welcomed!
>
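It's possible that you're using slot users, in which case jobs run
under a dedicated slot account (or `nobody`) rather than as the
submitting user, so the submitter needs no /etc/passwd entry on the WN:
https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#STARTER_ALLOW_RUNAS_OWNER
Run `condor_config_val -verbose STARTER_ALLOW_RUNAS_OWNER` on the
worker node to check.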