
Re: [HTCondor-users] update of condor-version and job-behaviour




Hi Greg!

Thanks for the clarification.

Indeed, what I want is to avoid restarting jobs: every job should run to completion on its worker node when I upgrade the cluster to a newer version of condor.

So, if I understand you correctly, I have to disable the nodes in batches before I start the update, in order to carry it out as unobtrusively as possible. In other words, does every condor package update include an automatic restart of the daemons, and therefore also of the running jobs on the worker nodes?

By the way, on the master and schedd hosts we use this workflow:

1) condor_off -master -fast
2) Upgrade the binaries
3) restart the master
 ALL WITHIN 20 MINUTES.
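
A minimal sketch of the three steps above as a script, with a check on the 20-minute window. The names DRY_RUN, MAX_WINDOW, run and upgrade_access_point are made up for illustration, and "systemctl start condor" assumes the host uses a systemd unit called condor; adjust that to however the master is started locally.

```shell
#!/bin/sh
# Hedged sketch: stop, upgrade and restart HTCondor on a master/schedd host.
# DRY_RUN, MAX_WINDOW, run() and upgrade_access_point() are illustrative
# names, not part of HTCondor.

DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands, 0 = execute them
MAX_WINDOW="${MAX_WINDOW:-1200}"   # 20 minutes in seconds, as in the workflow

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_access_point() {
    start=$(date +%s)

    # 1) shut down all daemons including the master
    run condor_off -master -fast

    # 2) upgrade the binaries (site-specific, so only a placeholder here)
    echo "# upgrade the HTCondor packages here"

    # 3) restart the master (assumption: systemd unit named "condor")
    run systemctl start condor

    # Warn if the whole procedure took longer than the allowed window.
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$MAX_WINDOW" ]; then
        echo "WARNING: upgrade took ${elapsed}s, limit is ${MAX_WINDOW}s" >&2
        return 1
    fi
}
```

With the default DRY_RUN=1, calling upgrade_access_point only prints the commands; set DRY_RUN=0 to execute them.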

> If you are more concerned about the badput from restarting a running job,
> than the potential loss of throughput from keeping cores idle, you can run
> "condor_off -peaceful" on the worker node before your upgrade, and condor
> will wait until all the jobs exit before it, itself exits, at which time you
> could upgrade the machine.

I didn't know the command

condor_off -peaceful


We usually disable worker nodes with

condor_config_val -startd -name bird055.desy.de -set "StartJobs = false"
condor_reconfig -startd -name bird055.desy.de
condor_drain -graceful bird055.desy.de


Is this equivalent?

All in all, the workflow should be:

- disable the worker node
- wait until all jobs have finished
- update
- enable the worker node again
- ?
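
The steps above can be sketched as a script. This is only an outline under assumptions: DRY_RUN, run, wait_until_idle and upgrade_worker are hypothetical names, the polling via condor_status is one possible way to detect that no job is left, and re-enabling simply mirrors the disable commands by setting StartJobs back to true. Whether an explicit "condor_drain -cancel" is also needed afterwards depends on whether the upgrade restarts the startd, so it is left out here.

```shell
#!/bin/sh
# Hedged sketch: disable one worker node, wait, upgrade, enable it again.
# DRY_RUN, run(), wait_until_idle() and upgrade_worker() are illustrative
# names, not part of HTCondor.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

wait_until_idle() {
    # Assumption: the node is idle once no slot on it is in state Claimed.
    [ "$DRY_RUN" = "1" ] && return 0
    while [ -n "$(condor_status \
            -constraint "Machine == \"$1\" && State == \"Claimed\"" \
            -af Name)" ]; do
        sleep 60
    done
}

upgrade_worker() {
    node="$1"

    # disable the worker node (the three commands we already use)
    run condor_config_val -startd -name "$node" -set "StartJobs = false"
    run condor_reconfig -startd -name "$node"
    run condor_drain -graceful "$node"

    # wait until all jobs have finished
    wait_until_idle "$node"

    # update (site-specific, so only a placeholder here)
    echo "# upgrade the HTCondor packages on $node here"

    # enable the worker node again (assumed mirror of the disable step)
    run condor_config_val -startd -name "$node" -set "StartJobs = true"
    run condor_reconfig -startd -name "$node"
}
```

For example, "upgrade_worker bird055.desy.de" with the default DRY_RUN=1 prints the five condor commands in order without touching the node.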


thanks & cheers,
   Martin



On Mon, 30 Aug 2021, Greg Thain wrote:


Hi Martin:

When HTCondor is upgraded *on the worker node*, or, more generally, when the HTCondor worker node daemons restart for any reason:

Any running jobs are killed, will go back to the "I"dle state in the queue, and HTCondor will restart them, perhaps on another machine.

If you are more concerned about the badput from restarting a running job, than the potential loss of throughput from keeping cores idle, you can run "condor_off -peaceful" on the worker node before your upgrade, and condor will wait until all the jobs exit before it, itself exits, at which time you could upgrade the machine.
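
A rough sketch of how the peaceful shutdown could be scripted on the worker node itself. The names DRY_RUN, run and peaceful_stop are invented for this example, and polling for the condor_startd process with pgrep is just one assumed way to notice that the jobs have exited and the daemon has gone away.

```shell
#!/bin/sh
# Hedged sketch: peacefully stop HTCondor on this worker node and wait.
# DRY_RUN, run() and peaceful_stop() are illustrative names, not part
# of HTCondor.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

peaceful_stop() {
    # Ask HTCondor to shut down only after all running jobs have exited.
    run condor_off -peaceful

    # Assumption: the node is ready once no condor_startd process is left.
    [ "$DRY_RUN" = "1" ] && return 0
    while pgrep -x condor_startd >/dev/null; do
        sleep 60
    done
}
```

Once peaceful_stop returns, the machine can be upgraded without killing a running job.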

And just for completeness, upgrading the central manager will not evict jobs.  Upgrading the access point (where the schedd runs) will not evict jobs, if the new daemons restart quickly enough.

-greg



 Hi!

 What is the default behaviour of running jobs on a worker node on which
 the condor packages are updated?

 a) the running jobs keep running with the old version, and each job
 started after the package update runs with the newly installed
 condor version?

 b) the running jobs are cancelled after the update and
 rescheduled with the new version?

 c) the running jobs are cancelled and lost

 d) ....

 cheers & thanks,

        Martin

 _______________________________________________
 HTCondor-users mailing list
 To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
 subject: Unsubscribe
 You can also unsubscribe by visiting
 https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

 The archives can be found at:
 https://lists.cs.wisc.edu/archive/htcondor-users/



Regards

       Martin Flemming


______________________________________________________
Martin Flemming
DESY / IT          office : Building 2b / 008a
Notkestr. 85       phone  : 040 - 8998 - 4667
22603 Hamburg      mail   : martin.flemming@xxxxxxx
______________________________________________________