
Re: [Condor-users] jobs matched to slots in OWNER state when startd_cron used



Hi,

It does seem like a race condition to me...  For testing, I had hardwired the "health" script to always return the result True, so the result changing wasn't an issue.  I also had the script writing timestamps to a file, and the script was getting run immediately at startd startup.  Also, if the health script output False, then no jobs were matched.
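
For anyone reproducing the test, here is a minimal sketch of a hardwired health script of the kind described above. It is not the actual script from this setup. It assumes the startd_cron job is configured without an attribute prefix, so the script prints the full name Startd_Cron_Health itself, and the log file name and location are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a startd_cron health probe hardwired to report True."""
import os
import tempfile
import time

# Attribute referenced in START; assumes no prefix is added by the cron config.
ATTR_NAME = "Startd_Cron_Health"
# Hypothetical location for the timestamp log.
LOG_PATH = os.path.join(tempfile.gettempdir(), "startd_cron_health.log")


def health_ad(healthy):
    """Return one 'Attribute = Value' line for the startd to merge into the slot ad."""
    return "%s = %s" % (ATTR_NAME, "True" if healthy else "False")


def record_run(log_path, now=None):
    """Append a timestamp so it is possible to see when the startd ran the script."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(log_path, "a") as f:
        f.write(stamp + "\n")
    return stamp


def main(log_path=LOG_PATH):
    record_run(log_path)
    line = health_ad(True)  # hardwired: always healthy, for testing only
    print(line)             # the startd reads attribute lines from stdout
    return line


if __name__ == "__main__":
    main()
```

Comparing the timestamps in the log against the startd start time shows whether the script ran before the first negotiation cycle.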

I did look at condor_status and all seemed fine with the ClassAd, but I'll repeat with the two versions of the command you gave.

The batch node has the condor config setting "IS_OWNER = False".  I see that if the Startd_Cron related item is removed from the START expression, then the node goes right into UNCLAIMED state at startup.

Thanks,
Tom



On Apr 18, 2012, at 5:56 PM, Dan Bradley wrote:

> Hi Tom,
> 
> There is a race condition between decisions made by the negotiator and the startd.  The negotiator makes its decisions based on the state of machines as it observes them at the beginning of each negotiation cycle and the startd makes its decisions based on its current state.
> 
> The behavior you describe makes me wonder if the Startd_Cron_Health attribute is not getting published in a timely manner.  You can query the published state and the current state by doing something like this:
> 
> # published state
> condor_status -f "%s\n" Startd_Cron_Health <machine>
> # current state
> condor_status -f "%s\n" Startd_Cron_Health -direct <machine>
> 
> --Dan
> 
> On 4/18/12 3:54 PM, Tom Rockwell wrote:
>> Hi,
>> 
>> I'm wanting to implement a "startd_cron" job that does some checks on batch node "health" and provides results into the node's ClassAd and for use in the START macro.
>> 
>> I have this going in a test setup and jobs are not started when the "health" result is false, so it is mostly working.
>> 
>> However, I see that when the health status is included in START, the node/slots go into OWNER state for 5 minutes after the startd is launched and also after health transitions from False to True.  This isn't so bad, except that jobs are being matched to the slots during this 5 minute period and then fail to start.  This seems like wasted work that might lead to problems at larger scale.  I've talked with admins at another site that uses this mechanism and they see the same 5 minute periods of slots in OWNER but don't get jobs matched during this time.
>> 
>> I have a mix of condor versions in the test setup: startd is 7.6.6, schedd is 7.4.2 and the collector is on 7.6.0
>> 
>> The START macro looks like:
>> 
>> START = ( Startd_Cron_Health =?= True )
>> 
>> Any suggestions on how to avoid the matches to slots in OWNER state?  My next guess is to try a later condor version for the schedd.
>> 
>> Thanks,
>> Tom Rockwell
>> Michigan State U.
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>> 
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/