
Re: [HTCondor-users] [SOLVED] token jobs not being routed by HTCondor-CE



Hi all,

The problem has been solved.

I evaluated the JobRouter's "umbrella constraint" against a non-routed job. This is a large and rather intimidating boolean expression built by OR-ing
the REQUIREMENTS of each JOB_ROUTER_ROUTE_<name> together with a number of other checks.
It evaluated to error instead of true/false. I then inspected the REQUIREMENTS of each route individually and found the one evaluating to error:

JOB_ROUTER_ROUTE_group2 @=jrt
 REQUIREMENTS StringListMember(x509UserProxyVoName, "muoncoll.infn.it:eic:ams02.cern.ch:[SNIP]",":")

The non-routed job was scitokens-only, so x509UserProxyVoName is undefined, and StringListMember evaluates to error in that case. The fix was simple:

REQUIREMENTS StringListMember(x509UserProxyVoName ?: "", "muoncoll.infn.it:eic:ams02.cern.ch:[SNIP]",":")
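
The effect of the guard can be illustrated with a small Python sketch. This is a simplified model, not HTCondor code: string_list_member and elvis are hypothetical stand-ins for StringListMember and the `?:` operator, the job ad is a plain dict, and the VO list is cut to the entries visible in the route above.

```python
import re


class _Sentinel:
    """Named marker for the non-boolean ClassAd outcomes modelled here."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


UNDEFINED = _Sentinel("UNDEFINED")  # attribute missing from the job ad
ERROR = _Sentinel("ERROR")          # evaluation failed


def string_list_member(item, string_list, delimiters=" ,"):
    """Simplified stand-in for StringListMember: ERROR unless all arguments are strings."""
    if not all(isinstance(arg, str) for arg in (item, string_list, delimiters)):
        return ERROR
    # Split on any of the delimiter characters and ignore empty entries.
    parts = re.split("[" + re.escape(delimiters) + "]", string_list)
    members = [part.strip() for part in parts if part.strip()]
    return item in members


def elvis(value, default):
    """Stand-in for `value ?: default`: use the default only when value is undefined."""
    return default if value is UNDEFINED else value


# VO list truncated to the entries visible in the route above.
VOS = "muoncoll.infn.it:eic:ams02.cern.ch"

# A scitokens-only job ad carries no x509UserProxyVoName attribute.
token_job = {"AuthTokenIssuer": "https://iam-herd.cloud.cnaf.infn.it/"}
vo_name = token_job.get("x509UserProxyVoName", UNDEFINED)

unguarded = string_list_member(vo_name, VOS, ":")
guarded = string_list_member(elvis(vo_name, ""), VOS, ":")
print("unguarded:", unguarded)
print("guarded:  ", guarded)
```

In this model the unguarded call yields ERROR for a token-only job, while the guarded one yields a plain False, which is what a route's REQUIREMENTS needs to produce.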

Notes/comments:
- the problem was not in the particular "non-working route"
- the error was in a route that should be evaluated *after* the matching one, according to the order defined by
JOB_ROUTER_ROUTE_NAMES
- the routing failure was not systematic, i.e. scitokens-only jobs were flowing seamlessly most of the time, but occasionally
they remained stuck. My test jobs, however, remained stuck every time :)
- I remember from some HTCondor WS talk that condor does honour "short-circuit" evaluation of boolean expressions; that would mean that my test route
 should have been evaluated first (being the first in
JOB_ROUTER_ROUTE_NAMES) and succeeded without ever considering the part with StringListMember.
However, when looking at the "umbrella constraint", I noticed that the order of the route requirements appears to be rather random.
- I would suggest that the configuration manual emphasize / warn the reader that the
REQUIREMENTS expression in a route MUST evaluate to a boolean
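
Why the position of the failing route matters can be sketched in Python. The model below is an assumption, not the JobRouter's actual evaluator: it assumes that a True on the left of `||` short-circuits and that an ERROR on the left propagates, which is consistent with the behaviour described above. The function or_chain and the per-route results are hypothetical and for illustration only.

```python
class _Sentinel:
    """Named marker for a failed evaluation."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


ERROR = _Sentinel("ERROR")


def or_chain(results):
    """OR per-route results left to right under the assumed rules:
    a True on the left short-circuits, an ERROR on the left propagates."""
    combined = False
    for value in results:
        if combined is True or combined is ERROR:
            break  # nothing to the right is looked at any more
        combined = value  # False || value reduces to value
    return combined


# Hypothetical per-route REQUIREMENTS results for one scitokens-only job:
# one route matches, another one evaluates to error.
route_results = {"herdcloud": True, "group2": ERROR}

matching_first = ["herdcloud", "group2"]
failing_first = ["group2", "herdcloud"]

print(or_chain(route_results[name] for name in matching_first))
print(or_chain(route_results[name] for name in failing_first))
```

Under these assumptions the same two routes give True in one order and ERROR in the other, so a job can be blocked by a route that is listed after the matching one if the umbrella expression is assembled in a different order.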

Hope this saves someone a few headaches :)

Regards,
Stefano


On 30/03/23 17:40, Stefano Dal Pra wrote:
Hello all, an update on this:

I replicated the non-working rules on a condor-ce with little load (it serves only one VO), and there they work as expected.
This confirms that the rule syntax is correct.

Then I noticed that on the other CEs there were several non-routed jobs from a VO that recently started using token credentials and
whose jobrouter rule was not yet token-aware. After fixing that rule, the pending jobs were routed and my rule also started working.
But only for a while. This morning I found several non-routed jobs (Qdate --> midnight, routing rule correct, i.e. those jobs SHOULD have been routed).
I manually removed those stuck jobs, and the next fresh ones were routed flawlessly. The route I'm adding, however, still does not work.

Questions:
- is there a maximum length for the list of active routes in JOB_ROUTER_ROUTE_NAMES?
- is there a "cache effect", so that fixing an error in a JOB_ROUTER_ROUTE_<name> entry does not take effect until <some_cache> expires?
- is there a (short) timeout for a scitoken job to be routed, after which it has no more chance of being routed?
- if I rename an existing route, does that help with "caching" problems? (spoiler: no, I just verified that).

Stefano





On 28/03/23 23:55, Stefano Dal Pra wrote:
Hi Todd, thanks for the advice;
yes, I issued condor_ce_reconfig. The suggested command prints:

[root@ce07-htc ~]# condor_ce_history -l 3250138.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.

And the same for other test jobs in the queue:

[root@ce03-htc ~]# condor_ce_q 6384655.0 -l | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.

Since the REQUIREMENTS expression evaluates to True, my guess is that routing is attempted but fails, possibly because
of some residual problem with that specific token issuer. In fact, there are token-only jobs flowing regularly; for example these from atlas:

[root@ce06-htc ~]# cccv JOB_ROUTER_ROUTE_atlas_sam
REQUIREMENTS (x509UserProxyVoName =?= "atlas" && x509UserProxyFirstFQAN =?= "/atlas/Role=lcgadmin/Capability=NULL") || (AuthTokenIssuer =?= "https://atlas-auth.web.cern.ch/" && AuthTokenSubject =?= "5c5d2a4d-9177-3efa-912f-1b4e5c9fb660")
UNIVERSE VANILLA
SET Requirements (TARGET.t1_allow_sam =?= true) && (!StringListMember("gpfs_atlas",t1_GPFS_CHECK ?: "",":"))

[root@ce06-htc ~]# condor_ce_q -cons 'x509userproxyvoname =?= undefined && AuthTokenSubject == "5c5d2a4d-9177-3efa-912f-1b4e5c9fb660"' -af:j owner jobstatus routedtojobid qdate 'formattime(qdate)'
5394875.0 atlassgm006 1 8837395.0 1680039294 Tue Mar 28 23:34:54 2023

The above is a "token only job". However this other one remains idle:

[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS ; condor_submit -pool ce06-htc.cr.cnaf.infn.it:9619 -remote ce06-htc.cr.cnaf.infn.it -append '+WantRoute = "herd_cloud"' ce_scitok308.sub
Submitting job(s).
1 job(s) submitted to cluster 5394871.

[root@ce06-htc ~]# cccv JOB_ROUTER_ROUTE_herdcloud
REQUIREMENTS (AuthTokenIssuer =?= "https://iam-herd.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "6f925657-f9aa-4cb6-b264-a3b1ee78df57")
UNIVERSE VANILLA
SET Requirements (TARGET.t1_group =?= "herd_cloud")
SET RequestMemory 400
SET MaxJobs 35
SET MaxIdleJobs 12

[root@ce06-htc ~]# condor_ce_q 5394871.0 -af:j owner routedtojobid '(AuthTokenIssuer =?= "https://iam-herd.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "6f925657-f9aa-4cb6-b264-a3b1ee78df57")'
5394871.0 herd006 undefined true

[root@ce06-htc ~]# condor_ce_q 5394871.0 -l | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -

Matching jobs against routes to find candidate jobs.


Stefano




On 28/03/23 21:36, Todd Tannenbaum wrote:
On 3/28/2023 5:42 AM, Stefano Dal Pra wrote:

When using (only) x509 and no token, the job is mapped (by argus) to dteam026.
StringListMember should work the same with dteam007 or dteam026;
however, it only matches with dteam026 (i.e. GSI) and not with dteam007.
I normally check for issuer and subject in the jobrouter; I tried StringListMember to
restrict the check to Owner only.


Hi Stefano -

After changing the route to try StringListMember, did you remember to issue a "condor_ce_reconfig" command?

For job 3250138.0 below, it sure looks like the owner mapping from the token worked fine... perhaps this command will give a clue:
root@host # condor_ce_history -l 3250138.0 | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
Also see the CE Manual for troubleshooting tips when a job does not route at URL:
 https://htcondor.com/htcondor-ce/v4/troubleshooting/troubleshooting/#jobs-stay-idle-on-the-ce

Hope the above helps, let us know how it goes, feel free to ask for more help if you continue to be stuck.

regards,
Todd




Adding a detail on the submit file used for GSI and SCITOKENS
#submit file for GSI
[sdalpra@ui-htc CE5]$ cat ce_gsi308.sub
universe = vanilla
use_x509userproxy = true
+Owner = undefined
[...]

[sdalpra@ui-htc CE5]$ cat ce_scitok308.sub
universe = vanilla
use_scitokens = true
+Owner = undefined


Stefano



On 28/03/23 11:56, Thomas Hartmann wrote:
Hi Stefano,

what does your token mapping look like?

Just a suspicion, but maybe the token subject is mapped to another user than the X509 mapped user and the requirement
 REQUIREMENTS StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")
does not get triggered?

Cheers,
 Thomas

On 27/03/2023 22.50, Stefano Dal Pra wrote:
Hello to all,

htcondor-ce-5.1.6 + condor-9.0.17 here.

I'm having problems with HTCondor-CE not routing jobs submitted with an IAM token [1]. The same routing rules ([2] or [3]) that work with GSI do not work with tokens.
More notes in [4].

USING GSI
#This works
[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI ; condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it ce_gsi308.sub
Submitting job(s).
1 job(s) submitted to cluster 3250129.

#the job is routed and submitted to condor; note the local user (dteam026), that is mapped by argus
[root@ce07-htc ~]# condor_ce_q 3250129. -af:j owner routedtojobid
3250129.0 dteam026 4991835.0

USING SCITOKENS
#This does not work
[sdalpra@ui-htc CE5]$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS ; condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it ce_scitok308.sub
Submitting job(s).
1 job(s) submitted to cluster 3250138.

#the job is never routed. Note that the REQUIREMENTS expression evaluates to true.
[root@ce07-htc ~]# condor_ce_q 3250138. -af:j owner routedtojobid 'StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")'
3250138.0 dteam007 undefined true


[1] The token being used
[sdalpra@ui-htc CE5]$ cat $BEARER_TOKEN_FILE|jwt.py -v
{
  "alg": "RS256",
  "kid": "rsa1"
}
{
  "sub": "9662c0b5-31a1-4478-963e-bdf3783232ed",
  "iss": "https://wlcg.cloud.cnaf.infn.it/",
  "wlcg.groups": [
    "/wlcg",
    "/wlcg/pilots",
    "/wlcg/xfers"
  ],
  "wlcg.ver": "1.0",
  "jti": "4270f069-81d9-48fb-88ef-817a83b98c6a",
  "exp": 1679943559,
  "iat": 1679939959,
  "client_id": "ad852b22-e517-44a4-99e8-7c0660f878a1",
  "scope": "openid compute.create profile compute.read storage.read:/ compute.modify eduperson_entitlement wlcg storage.create:/ offline_access compute.cancel eduperson_scoped_affiliation storage.modify:/ email wlcg.groups",
  "nbf": 1679939959,
  "aud": "https://wlcg.cern.ch/jwt/v1/any"
}
exp: Mon Mar 27 20:59:19 2023

[2],[3] Jobrouter rules

JOB_ROUTER_ROUTE_routestsci @=jrt
   REQUIREMENTS StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")
   UNIVERSE VANILLA
   SET Requirements (TARGET.t1_group =?= "myfancygroup")
   SET RequestMemory 400
   SET MaxJobs 5
   SET MaxIdleJobs 10
@jrt

JOB_ROUTER_ROUTE_routestgsi @=jrt
   REQUIREMENTS (x509UserProxyVOName == "dteam") || (AuthTokenIssuer =?= "https://wlcg.cloud.cnaf.infn.it/" && AuthTokenSubject =?= "9662c0b5-31a1-4478-963e-bdf3783232ed")
   UNIVERSE VANILLA
   SET Requirements (TARGET.t1_group =?= "testgroup")
@jrt

JOB_ROUTER_ROUTE_NAMES= routestsci routestgsi $(JOB_ROUTER_ROUTE_NAMES)

[4] Notes

- the scitoken is at least "partially" valid, as the mapping to the local user succeeds.
- the REQUIREMENTS expression matches the condor-ce job, i.e.
     condor_ce_q <jobid> -af 'StringListMember(Owner, "dteam007|dteam026|cmssgm017","|")'
  returns True.
- These rules used to work, as far as I know. More complex REQUIREMENTS expressions were successfully used with tokens.
- I checked rule [2] against a condor-ce at another site, where a colleague agreed to test it; the result is the same: using GSI the job is routed, using SCITOKENS it is not.
- I find nothing useful in the condor-ce logs:

[root@ce07-htc ~]# grep 3250492. /var/log/condor-ce/*Log
/var/log/condor-ce/AuditLog:03/27/23 21:54:54 (cid:18395186) (D_AUDIT) Submitting new job 3250492.0
/var/log/condor-ce/AuditLog:03/27/23 21:54:54 (cid:18395188) (D_AUDIT) Transferring files for jobs 3250492.0
/var/log/condor-ce/SchedLog:03/27/23 21:54:55 (D_ALWAYS) Job 3250492.0 released from hold: Data files spooled

Even at maximum verbosity, nothing shows up in the JobRouterLog.
I'm out of ideas now. Any hint to find out what's wrong?
Thanks
Stefano



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 

