
Re: [HTCondor-users] (Yes & No!) Try to set DedicatedScheduler = NO JOBS EVER onto WN :[



Good morning!

Thank you again a zillion times, Brian Lin! Bless you for your wise 
help! 

To refresh context:

On Mon, 13 Jan 2020, Brian Lin wrote:
> If you can identify jobs that are being passed through your CE, you can
> inspect their ClassAds [1] with `condor_q -l <JOB ID>` and compare those
> ClassAds with the ClassAds of your non-grid jobs to find an 
> attribute/value pair that's unique to your grid jobs. For example, if 
> Arc CE sets 'SourceCE = "lcgce01"' for all grid jobs, you could set the 
> following on your worker nodes:
> 
> START = SourceCE == "lcgce01"

So this attribute looks like it marks a "job from lcgce01":

NordugridQueue = "gridAMD"

On a WN, I compared the .job.ad file from a grid job with the .job.ad 
file from a local user's job. The local user's .job.ad file does not have 
that attribute!
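
For anyone repeating that comparison, here is a minimal sketch of one way to 
list the attribute names that appear only in the grid job's ad. The function 
name unique_attrs is made up for illustration, and it assumes each ad file 
holds one "Name = value" pair per line, as the .job.ad files here do. It 
compares names only, not values, and treats names as case-sensitive.

```shell
#!/bin/sh
# unique_attrs is a hypothetical helper, not part of HTCondor.
# It prints attribute names found in the first ad file but not in the second.
unique_attrs() {
    grid_ad=$1
    local_ad=$2
    tmp_grid=$(mktemp) || return 1
    tmp_local=$(mktemp) || { rm -f "$tmp_grid"; return 1; }
    # Keep only the attribute name (the text before the first "="),
    # sorted and de-duplicated so that comm can compare the two lists.
    sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\) *=.*/\1/p' "$grid_ad" | sort -u > "$tmp_grid"
    sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\) *=.*/\1/p' "$local_ad" | sort -u > "$tmp_local"
    # comm -23 prints the lines that occur only in the first sorted input.
    comm -23 "$tmp_grid" "$tmp_local"
    rm -f "$tmp_grid" "$tmp_local"
}
```

Usage would be something like: unique_attrs <grid job .job.ad> <local job .job.ad>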

Hope! I put this in /etc/condor/config.d/20_workernode.config on an 
experimental WN:

START = NordugridQueue == "gridAMD"

(Is that syntax correct?)
Then I ran condor_reconfig.
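
One check that might be worth running on the WN at this point is whether 
the new START line is really the value the configuration resolves to, since 
a file read later can redefine the same variable. The sketch below is only 
that, a sketch: show_start_config is a made-up wrapper name, and it assumes 
condor_config_val is in PATH on the WN.

```shell
#!/bin/sh
# show_start_config is a hypothetical wrapper, not an HTCondor command.
# It prints the START expression in effect and the config files that were read.
show_start_config() {
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor_config_val not found in PATH" >&2
        return 1
    fi
    # -verbose also reports where the final value of START was defined.
    condor_config_val -verbose START
    # -config lists the configuration files in use.
    condor_config_val -config
}
```

If the reported source is not 20_workernode.config, some other file is 
overriding the setting.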

Progress - grid jobs land+run on WN! Big Improvement over "no jobs, EVER"!

But then I submitted jobs as myself (= a local user) not via lcgce01,
& they *also* land+run on this experimental WN. DARN!

So it doesn't seem to prevent local user jobs either! :[
I can't work out from the /var/log/condor/*Log files on htc00 where 
the WN is supposed to be telling it "I only want jobs with blahblah set".

root@htc00> cd /var/log/condor
root@htc00> nice -n 19 grep -li NordugridQueue *Log
root@htc00> # nothing
Where might one find out what the WN is "advertising" to the MatchMaker?
(Assumption: it is logged somewhere! Could be wrong about that!)
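
As far as I know, what a WN advertises can also be queried directly from 
the collector rather than dug out of logs: condor_status -long prints the 
full machine ClassAd, which includes a Start attribute. A minimal sketch, 
assuming condor_status is in PATH and can reach the pool; the function name 
show_advertised_start is invented for illustration.

```shell
#!/bin/sh
# show_advertised_start is a hypothetical wrapper, not an HTCondor command.
# It prints the Start expression from the machine ad(s) of the given host.
show_advertised_start() {
    machine=$1
    if [ -z "$machine" ]; then
        echo "usage: show_advertised_start <machine name>" >&2
        return 2
    fi
    if ! command -v condor_status >/dev/null 2>&1; then
        echo "condor_status not found in PATH" >&2
        return 1
    fi
    # -long dumps every attribute of the machine ad; keep only the Start line.
    condor_status -long "$machine" | grep -i '^Start = '
}
```

If the Start line shown for the experimental WN is not the NordugridQueue 
expression, the new setting never reached the advertised ad.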

MatchLog does contain a definite "Matched 2005139.0 phpwl@xxxxxxxxxxxxxx 
snip snip ... slot1@xxxxxxxxxxxxxxxxxxxxxxxx"

gridftp02 (yeah, it was a gridftp server, repurposed as a WN) is the 
experimental WN that is supposed to advertise
START = NordugridQueue == "gridAMD"
and my local-user job (this is confirmed) does NOT have that in its 
condor_q -l output, or in the .job.ad file when it's running on the 
experimental WN.  

Note - my local user account doesn't even have an /etc/passwd entry on the
test WN (supposedly a somewhat drastic way to prevent local user jobs 
from landing on WNs where they're not supposed to run).

It's a bit astonishing the local-user job was allowed to start+run with no 
entry in /etc/passwd!... 

Any further tips/clues/advice most gratefully welcomed!