
[HTCondor-users] job factory universe changed?



I'm trying to submit a set of jobs using the schedd late materialization job factory in condor 10.0.4. I note that the same submit file and schedd configuration worked fine in condor v9, so I'm guessing there was some behavior change that I overlooked.

My submit file sets an accounting_group. A job transform copies it into a LigoSearchTag attribute, and a submit requirement then validates that the tag has an acceptable value.

To start, here is my submit file:

executable = validate_files.sh
log = /home/michael.thomas/condor/rawtrend/job.log.$(Process)
universe = vanilla
accounting_group=llo.test
request_disk = 2048MB
notification = Always
notify_user = michael.thomas@xxxxxxxx
should_transfer_files = YES
stream_output = True
request_HeavyNetwork = 1
max_materialize = 5
arguments = input/condor_input_$(Process)
error = /home/michael.thomas/condor/rawtrend/validation/job.err.$(Process)
output = /home/michael.thomas/condor/rawtrend/validation/job.out.$(Process)
transfer_input_files = input/condor_input_$(Process),validaterawtrend
transfer_output_files = validation
preserve_relative_paths = True
queue 10

...and here are the job transforms:

JOB_TRANSFORM_NAMES = TagJob,RemoveAcctGroup

JOB_TRANSFORM_TagJob @=end
[
  COPY_AcctGroup = "LigoSearchTag";
  COPY_AcctGroupUser = "LigoSearchUser";
  EVAL_SET_LigoSearchTag = LigoSearchTag ?: "None";
  EVAL_SET_LigoSearchUser = LigoSearchUser ?: Owner;
]
@end

# do not strip accounting classads from scheduler universe
# because their presence is necessary to propagate to child
# jobs and sub-DAGs
JOB_TRANSFORM_RemoveAcctGroup @=end
[
  Requirements = JobUniverse != 7;
  delete_AccountingGroup = True;
  delete_AcctGroup = True;
  delete_AcctGroupUser = True;
]
@end
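
As a way of reasoning about what these two transforms are meant to do to a job ad, here is a toy Python model. It is only a sketch under stated assumptions: the job ad is a plain dict, apply_tag_job and apply_remove_acct_group are hypothetical helper names, the owner name is made up, and a COPY of a missing attribute is assumed to leave the target unset. It does not reproduce how the schedd actually evaluates transforms.

```python
SCHEDULER_UNIVERSE = 7  # JobUniverse value tested by RemoveAcctGroup


def apply_tag_job(ad):
    """Toy model of JOB_TRANSFORM_TagJob on a dict-based job ad."""
    out = dict(ad)
    # COPY_AcctGroup / COPY_AcctGroupUser.
    # Assumption: copying a missing attribute is a no-op.
    if "AcctGroup" in out:
        out["LigoSearchTag"] = out["AcctGroup"]
    if "AcctGroupUser" in out:
        out["LigoSearchUser"] = out["AcctGroupUser"]
    # EVAL_SET_LigoSearchTag = LigoSearchTag ?: "None"
    if out.get("LigoSearchTag") is None:
        out["LigoSearchTag"] = "None"
    # EVAL_SET_LigoSearchUser = LigoSearchUser ?: Owner
    if out.get("LigoSearchUser") is None:
        out["LigoSearchUser"] = out.get("Owner")
    return out


def apply_remove_acct_group(ad):
    """Toy model of JOB_TRANSFORM_RemoveAcctGroup."""
    if ad.get("JobUniverse") == SCHEDULER_UNIVERSE:
        return dict(ad)  # Requirements not met: leave the ad untouched
    doomed = {"AccountingGroup", "AcctGroup", "AcctGroupUser"}
    return {k: v for k, v in ad.items() if k not in doomed}


if __name__ == "__main__":
    # Hypothetical vanilla-universe ad (JobUniverse 5) with an accounting group.
    with_group = {"Owner": "someuser", "JobUniverse": 5, "AcctGroup": "llo.test"}
    print(apply_remove_acct_group(apply_tag_job(with_group)))
    # Same ad, but without AcctGroup when TagJob runs.
    without_group = {"Owner": "someuser", "JobUniverse": 5}
    print(apply_remove_acct_group(apply_tag_job(without_group)))
```

In this model the only way LigoSearchTag ends up as the string "None" is when neither AcctGroup nor an existing LigoSearchTag is present in the ad at the moment TagJob runs.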

SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) ValidSearchTags ValidSearchUsers
CLASSAD_USER_MAPFILE_ValidSearchTags = /etc/condor/accounting/valid_tags
CLASSAD_USER_MAPFILE_ValidSearchUsers = /etc/condor/accounting/valid_users

SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) ValidateSearchTag ValidateSearchUser

SUBMIT_REQUIREMENT_ValidateSearchTag = JobUniverse == 7 || \
  userMap("ValidSearchTags",LigoSearchTag) isnt undefined
SUBMIT_REQUIREMENT_ValidateSearchTag_REASON = \
  strcat("Invalid value for search tag: ",LigoSearchTag ?: "<undefined>")

SUBMIT_REQUIREMENT_ValidateSearchUser = \
  JobUniverse == 7 || \
  userMap("ValidSearchUsers",Owner,LigoSearchUser) is LigoSearchUser || \
  userMap("ValidSearchUsers",Owner) is undefined && Owner =?= LigoSearchUser
SUBMIT_REQUIREMENT_ValidateSearchUser_REASON = \
  strcat("Invalid value for search user: ", LigoSearchUser ?: "<undefined>", "\n", \
         "       Valid values are: ",userMap("ValidSearchUsers",Owner))

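The intent of the ValidateSearchTag requirement can be sketched in Python as follows. This is an illustration only: validate_search_tag is a hypothetical name, the userMap lookup is simplified to membership in a set, and the set of valid tags is made up, since the real contents of /etc/condor/accounting/valid_tags are not shown here.

```python
SCHEDULER_UNIVERSE = 7


def validate_search_tag(ad, valid_tags):
    """Toy model of SUBMIT_REQUIREMENT_ValidateSearchTag.

    Returns (ok, reason). `valid_tags` stands in for the ValidSearchTags
    user map; a real map file lookup may behave differently.
    """
    if ad.get("JobUniverse") == SCHEDULER_UNIVERSE:
        return True, ""
    tag = ad.get("LigoSearchTag")
    if tag in valid_tags:
        return True, ""
    shown = tag if tag is not None else "<undefined>"
    return False, "Invalid value for search tag: " + shown


if __name__ == "__main__":
    hypothetical_tags = {"llo.test"}  # assumed map contents, for illustration
    print(validate_search_tag({"JobUniverse": 5, "LigoSearchTag": "llo.test"},
                              hypothetical_tags))
    print(validate_search_tag({"JobUniverse": 5, "LigoSearchTag": "None"},
                              hypothetical_tags))
```

With a tag of "None" the model produces the reason string "Invalid value for search tag: None", which is the form of message the REASON expression builds when the tag was defaulted rather than copied.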

Now when I submit, I'm getting an error that my search tag isn't found:

06/16/23 15:00:56 (pid:5534) Calling HandleReq <handle_q> (0) for command 1112 (QMGMT_WRITE_CMD) from michael.thomas@xxxxxxxxxxxxxxxxxxxxxxxx <10.13.5.32:27419>
06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.-1: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
06/16/23 15:00:56 (pid:5534) Return from HandleReq <handle_q> (handler: 0.045252s, sec: 0.002s, payload: 0.001s)
06/16/23 15:00:56 (pid:5534) Return from Handler <DaemonCore::HandleReqPayloadReady> 0.045702s
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.0 step=0 row=0
06/16/23 15:00:56 (pid:5534) Trying to Materializing new job 19803671.1 step=0 row=1
06/16/23 15:00:56 (pid:5534) job_transforms for 19803671.0: 2 considered, 2 applied (TagJob,RemoveAcctGroup)
06/16/23 15:00:56 (pid:5534) CommitTransaction() failed for cluster 19803671 rval=-1 (Invalid value for search tag: None)

I presume this means that either the TagJob transform failed to copy the accounting group into LigoSearchTag, or the RemoveAcctGroup transform (which is skipped only for the scheduler universe) ran and deleted the AccountingGroup attribute. Any tips on how to debug this, or on what might have changed between v9 and v10, are appreciated.

--Mike