
[HTCondor-users] A heads-up about HTCondor on Windows and daylight savings time.



At 2 a.m. on November 2 here in the USA, daylight saving time will end. This will trigger a bug in the C runtime on Windows that will cause HTCondor to think that the timestamp on condor_master.exe has changed. We thought that we had code in HTCondor to work around this bug, but it turns out to be only a partial fix, and all versions of HTCondor will react to the time change this coming weekend.

What happens at this point depends on what your MASTER_NEW_BINARY_RESTART configuration variable is set to. The choices are:

MASTER_NEW_BINARY_RESTART = GRACEFUL
The condor_master restarts all of the child daemons and itself gracefully. This means that jobs get a signal to checkpoint, are killed 2 minutes later, and are then put back in the queue and restarted on some other node. SCHEDDs will shut down, restart, and try to reconnect to running jobs. This is the default behavior, and a reasonable choice for SCHEDDs and your central manager.

MASTER_NEW_BINARY_RESTART = PEACEFUL
The condor_master tells child daemons to restart when they are done with their current work. This means that STARTDs will finish current jobs but not accept any new ones until they get a chance to restart. SCHEDDs will finish current jobs but will not start any new ones. This is a reasonable configuration for STARTDs, but not the best choice for your SCHEDDs or central manager.

MASTER_NEW_BINARY_RESTART = FAST
The condor_master kills all child daemons immediately and then restarts them. This is a reasonable configuration for your central manager.

MASTER_NEW_BINARY_RESTART = NO
MASTER_NEW_BINARY_RESTART = NEVER
MASTER_NEW_BINARY_RESTART = NONE
These three values are equivalent. The condor_master notices the change but does nothing. This is a reasonable choice for all daemons, but it disables the ability to upgrade the HTCondor binaries without explicitly restarting them.

I recommend that you set MASTER_NEW_BINARY_RESTART to either PEACEFUL or NEVER on your execute nodes before next weekend if you expect to have jobs running over the weekend.
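
As a rough sketch of that recommendation, the fragment below shows what the setting might look like in the local configuration of an execute node. Which file it belongs in depends on how your pool's configuration is laid out, so treat the placement as an assumption rather than a prescription.

```
# Execute node (a machine running a STARTD): let running jobs finish
# before the daemons restart in response to the apparent binary change.
MASTER_NEW_BINARY_RESTART = PEACEFUL

# Alternative: ignore the apparent change entirely. With this value,
# upgraded binaries are not picked up until the daemons are restarted
# explicitly.
# MASTER_NEW_BINARY_RESTART = NEVER
```

The condor_master has to pick up the new value before the time change for it to matter, so a reconfiguration (for example with condor_reconfig) or a restart is likely needed after editing. Running condor_config_val MASTER_NEW_BINARY_RESTART on the node should show the value currently in effect.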

-tj