
Re: [HTCondor-users] Adding custom job classads on condor_starter nodes

My mistake. You are correct: STARTD_JOB_ATTRS doesn't do what you want.

I'm not sure that there is a knob that does.


From: ervikrant06@xxxxxxxxx <ervikrant06@xxxxxxxxx>
Sent: Tuesday, June 2, 2020 10:46 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Adding custom job classads on condor_starter nodes


Hi John,


Thanks for your response. 


As per the documentation, STARTD_JOB_ATTRS seems to have a different purpose: it advertises job ClassAd attributes in the machine ClassAd. In my case I want to advertise machine ClassAd attributes in the job ClassAd.



When the machine is claimed by a remote user, the condor_startd can also advertise arbitrary attributes from the job ClassAd in the machine ClassAd. List the attribute names to be advertised. NOTE: Since these are already ClassAd expressions, do not do anything unusual with strings. By default, the job ClassAd attributes JobUniverse, NiceUser, ExecutableSize and ImageSize are advertised into the machine ClassAd. This setting was formerly called STARTD_JOB_EXPRS. The older name is still supported, but support for the older name may be removed in a future version of HTCondor.



I also checked STARTD_ATTRS, but it doesn't seem to be the right candidate for this either, as it only adds attributes to the machine ClassAd. I thought of doing a hack on the submit side, but it doesn't work, so I discarded the idea.


submit_attrs = $(submit_attrs) nodeempted
nodeempted = ifthenelse(!isundefined(target.nodehealth),target.nodehealth,False)


But in the job ClassAd, nodeempted always comes out as False.

Thanks & Regards,

Vikrant Aggarwal



On Tue, Jun 2, 2020 at 8:33 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

SYSTEM_JOB_MACHINE_ATTRS is a list of Machine attributes copied from the match ad (which the Negotiator sends to the Schedd) into the Job ad when a job starts running.  This is something that the Schedd does and only when a job starts. 


A change on the execute node to the value of an attribute that SYSTEM_JOB_MACHINE_ATTRS is copying will not be reflected in jobs until the next time a job starts on that machine *as the result of a full negotiation cycle*, so this can take a very long time to propagate, and the value will never change while a job is running.
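
As a minimal sketch of the setup described above, assuming the execute node advertises a custom machine attribute named nodehealth (the name used in the original question), the submit-node configuration might look like the following. The history length line is optional, and the value 2 is only an illustration:

```
# Submit node (schedd) configuration: a sketch, not a tested setup.
# Copy the machine attribute "nodehealth" from the match ad into the
# job ad each time a job starts running.
SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) nodehealth

# Optional: remember the values from the last two runs. They appear in
# the job ad as MachineAttrnodehealth0 (most recent) and
# MachineAttrnodehealth1.
SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH = 2
```

The copied values are a snapshot taken at job start, which is why they do not follow later changes on the execute node.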


For something like node health, which can change as the job runs, I think you want to configure STARTD_JOB_ATTRS on the execute node instead of SYSTEM_JOB_MACHINE_ATTRS on the submit node.




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of ervikrant06@xxxxxxxxx
Sent: Tuesday, June 2, 2020 6:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Adding custom job classads on condor_starter nodes


Hello Experts, 


We are running condor jobs on pre-emptible google cloud instances. I wanted to add something in job classad to identify the jobs impacted because of pre-empted instances. 


On sched file: 




On the startd, the ClassAd attribute is advertised:


test.example:/etc/condor/config.d# condor_status -compact `hostname` -af machine nodehealth
test.example.com False1


I can see the following in job classAD. 


$ condor_q -run -af jobruncount MachineAttrnodehealth0 MachineAttrnodehealth1
1 False1 undefined
1 False1 undefined


But when I change the value of the attribute on the execute node (by directly modifying the condor configuration and running condor_reconfig), the change is not reflected in the job ad.


I have seen this message in log file. Our executor directory is only 


06/02/20 06:49:38 slot1_1: Failed to open '/spare/condor/dir_418909/.update.ad.tmp' for writing update ad: No such file or directory (2).


However, I do see that the .update.ad file inside the execution directory has the updated value, but the machine and job ads there still reflect the old value, as they can't change dynamically.


# grep nodehealth .update.ad
nodehealth = "False4"


# grep nodehealth .job.ad
MachineAttrnodehealth0 = "False1"


# grep nodehealth .machine.ad
nodehealth = "False1"


# condor_status -compact `hostname` -af machine nodehealth
test.example.com False4


After a hold/release the job picks up the new value, but I want to update the value in the running instance of the job.


I have gone through link [1], but that one is not useful either.


Any input is highly appreciated. 


Thanks & Regards,

Vikrant Aggarwal
