
Re: [HTCondor-users] job factory universe changed?



There was a change.  A bug fix actually.

Transforms and submit requirements are now applied both to the factory at submit time and to the jobs as they materialize.  You can see that happening in the log:

06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.-1: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
...
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.0 step=0 row=0
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.1 step=0 row=1
06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.0: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
06/16/23 15:00:56 (pid:5534) CommitTransaction() failed for cluster 19803671 rval=-1 (Invalid value for search tag: None)

The first line shows the transforms being applied to the factory (19803671.-1).  When that finishes, the factory has no value for AccountingGroup, AcctGroupUser, or AcctGroup, because RemoveAcctGroup has deleted them.

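So when job 19803671.0 is materialized, it *also* has no value for these attributes, since it inherits its attributes from the factory.  The TagJob transform then does a COPY from the missing AcctGroup and ends up replacing the LigoSearchTag, which the job also inherited, with undefined.  The EVAL_SET that follows turns that undefined into "None".

Then the submit requirement rejects the job because LigoSearchTag no longer holds a valid value; that is the "Invalid value for search tag: None" in the last log line.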

What you need to do is change the TagJob transform so it does not overwrite a LigoSearchTag value if the job already has one.
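
One possible way to do that, as an untested sketch rather than a verified fix: drop the two COPY statements and let EVAL_SET fall back to AcctGroup / AcctGroupUser only when the job has no tag of its own yet.  This assumes it is acceptable for a LigoSearchTag that is already in the job ad to take precedence over accounting_group, which differs from the current transform, where the COPY always wins.

# Sketch only: keep an existing LigoSearchTag / LigoSearchUser, otherwise
# fall back to the accounting attributes, then to the old defaults.
JOB_TRANSFORM_TagJob @=end
[
   EVAL_SET_LigoSearchTag = LigoSearchTag ?: (AcctGroup ?: "None");
   EVAL_SET_LigoSearchUser = LigoSearchUser ?: (AcctGroupUser ?: Owner);
]
@end

With something like this, the factory should pick up its tag from AcctGroup at submit time, and the materialized jobs should keep the tag they inherit from the factory even though AcctGroup has been deleted by then.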

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Thomas
Sent: Friday, June 16, 2023 3:40 PM
To: condor-users@xxxxxxxxxxx
Subject: [HTCondor-users] job factory universe changed?

I'm trying to submit a set of jobs using the schedd late materialization 
job factory in condor 10.0.4.  I note that the same submit file and 
schedd configuration worked fine in condor v9, so I'm guessing there was 
some behavior change that I overlooked.

My submit file contains an accounting_group, which a job transform turns 
into a LigoSearchTag and validates that it has an acceptable value.

To start, here is my submit file:

executable = validate_files.sh
log = /home/michael.thomas/condor/rawtrend/job.log.$(Process)
universe = vanilla
accounting_group=llo.test
request_disk = 2048MB
notification = Always
notify_user = michael.thomas@xxxxxxxx
should_transfer_files = YES
stream_output = True
request_HeavyNetwork = 1
max_materialize = 5
arguments = input/condor_input_$(Process)
error = /home/michael.thomas/condor/rawtrend/validation/job.err.$(Process)
output = /home/michael.thomas/condor/rawtrend/validation/job.out.$(Process)
transfer_input_files = input/condor_input_$(Process),validaterawtrend
transfer_output_files = validation
preserve_relative_paths = True
queue 10

...and here are the job transforms:

JOB_TRANSFORM_NAMES = TagJob,RemoveAcctGroup

JOB_TRANSFORM_TagJob @=end
[
   COPY_AcctGroup = "LigoSearchTag";
   COPY_AcctGroupUser = "LigoSearchUser";
   EVAL_SET_LigoSearchTag = LigoSearchTag ?: "None";
   EVAL_SET_LigoSearchUser = LigoSearchUser ?: Owner;
]
@end

# do not strip accounting classads from scheduler universe
# because their presence is necessary to propagate to child
# jobs and sub-DAGs
JOB_TRANSFORM_RemoveAcctGroup @=end
[
Requirements = JobUniverse != 7;
delete_AccountingGroup = True;
delete_AcctGroup = True;
delete_AcctGroupUser = True;
]
@end

SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) ValidSearchTags ValidSearchUsers
CLASSAD_USER_MAPFILE_ValidSearchTags = /etc/condor/accounting/valid_tags
CLASSAD_USER_MAPFILE_ValidSearchUsers = /etc/condor/accounting/valid_users

SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) ValidateSearchTag ValidateSearchUser

SUBMIT_REQUIREMENT_ValidateSearchTag = JobUniverse == 7 || \
   userMap("ValidSearchTags",LigoSearchTag) isnt undefined
SUBMIT_REQUIREMENT_ValidateSearchTag_REASON = \
   strcat("Invalid value for search tag: ",LigoSearchTag ?: "<undefined>")

SUBMIT_REQUIREMENT_ValidateSearchUser = \
   JobUniverse == 7 || \
   userMap("ValidSearchUsers",Owner,LigoSearchUser) is LigoSearchUser || \
   userMap("ValidSearchUsers",Owner) is undefined && Owner =?= LigoSearchUser
SUBMIT_REQUIREMENT_ValidateSearchUser_REASON = \
   strcat("Invalid value for search user: ", LigoSearchUser ?: "<undefined>", "\n", \
          "       Valid values are: ",userMap("ValidSearchUsers",Owner))


Now when I submit, I'm getting an error that my search tag isn't found:

06/16/23 15:00:56 (pid:5534) Calling HandleReq <handle_q> (0) for command 1112 (QMGMT_WRITE_CMD) from michael.thomas@xxxxxxxxxxxxxxxxxxxxxxxx <10.13.5.32:27419>
06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.-1: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
06/16/23 15:00:56 (pid:5534) Return from HandleReq <handle_q> (handler: 0.045252s, sec: 0.002s, payload: 0.001s)
06/16/23 15:00:56 (pid:5534) Return from Handler <DaemonCore::HandleReqPayloadReady> 0.045702s
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.0 step=0 row=0
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.1 step=0 row=1
06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.0: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
06/16/23 15:00:56 (pid:5534) CommitTransaction() failed for cluster 19803671 rval=-1 (Invalid value for search tag: None)

Which I presume means that either the transform failed to copy 
AccountingGroup to LigoSearchTag, or that it didn't execute in the 
scheduler universe and deleted the AccountingGroup tag.  Any tips on how 
to debug this or what might have changed between v9 and v10 are appreciated.

--Mike