
Re: [HTCondor-users] condor_status update time



 

Hi TJ,

 

We have a pretty controlled ops environment where the queue is always empty when we issue the command, so we have gotten away with a sudo condor stop. I will go experiment with the other shutdown options that you have suggested, but that will take a little time as I have to work with our IT group to set up sudo permissions. I will come back if I have any questions.

 

     Thank You for your suggestions.

 

              Mary

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Tuesday, October 10, 2017 at 11:01 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_status update time

 

There has been no change in the shutdown logic as far as I know.

 

It sounds like you are killing the HTCondor daemons rather than actually shutting them down. If you hard kill the daemons (or if you terminate the VM without shutting it down), then the HTCondor daemons never get a chance to send a DELETE_AD notification to the collector, and so their ClassAds will remain visible until they expire. That sure sounds like what is happening here.

 

To shut the daemons down cleanly, you can use condor_off -master. Or you can send the condor_master a SIGTERM signal, which has the same effect. Then you have to give the condor_startd and condor_master time to shut down cleanly. I believe that this can take as much as 2 minutes if the condor_startd is running a job and the job doesn't respond to SIGTERM (or doesn't respond quickly).
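As a rough illustration of the "send SIGTERM, then wait" approach, here is a minimal Python sketch. The function names, the 150-second default timeout (a little above the 2 minutes mentioned above), and the polling interval are all assumptions for illustration, not anything HTCondor provides. The caller has to supply the condor_master pid, and signalling it will usually need the same privileges as the account the daemons run under.

```python
import os
import signal
import time


def pid_alive(pid):
    """Best-effort probe: signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def graceful_stop(pid, timeout_s=150.0, poll_s=1.0, is_alive=pid_alive):
    """Send SIGTERM to pid and wait up to timeout_s for it to exit.

    Returns True if the process was gone before the timeout, False otherwise.
    The defaults are assumptions, not HTCondor settings.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return True
        time.sleep(poll_s)
    return not is_alive(pid)
```

Only if graceful_stop() returns True is it reasonable to tear down the VM; returning False means the daemon was still running when the timeout expired.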

 

If the daemons shut down cleanly, you should see that reflected in the MasterLog and StartdLog.  I would look there first.   You should also see the DELETE_AD notifications in the collector.

 

A clean shutdown will show this message in the StartdLog:

10/05/17 12:51:12.038 (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.

and this message, which will be the last message in the log.

10/05/17 12:51:12.170 (D_ALWAYS) **** condor_startd.exe (condor_STARTD) pid 10328 EXITING WITH STATUS 0

 

If the second message is missing, that's a strong indicator that the shutdown was not clean.
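That log check could be scripted roughly as follows. This is only a sketch: the two patterns are taken from the sample lines quoted above and may not match every HTCondor version or platform, and the function name is made up for illustration. It only looks at whether the final non-empty line is the exit message with status 0 and whether a SIGTERM message appears somewhere before it, so a log covering several restarts could fool it.

```python
import re

# Patterns derived from the two sample StartdLog lines quoted above (assumption:
# the wording is the same on the version and platform being checked).
SIGTERM_RE = re.compile(r"Got SIGTERM\. Performing graceful shutdown")
EXIT_RE = re.compile(
    r"\*\*\*\* \S+ \(condor_STARTD\) pid \d+ EXITING WITH STATUS (\d+)"
)


def shutdown_was_clean(log_lines):
    """Return True if the log ends with an exit-status-0 line preceded by a SIGTERM line."""
    lines = [ln.rstrip("\n") for ln in log_lines if ln.strip()]
    if not lines:
        return False
    last = EXIT_RE.search(lines[-1])
    if last is None or last.group(1) != "0":
        return False
    # Require the graceful-shutdown message somewhere before the final line.
    return any(SIGTERM_RE.search(ln) for ln in lines[:-1])
```

It takes any iterable of lines, so an open file handle on the StartdLog can be passed in directly.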

 

-tj

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Mary Romelfanger
Sent: Monday, October 9, 2017 10:08 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] condor_status update time

 

 

Hi Everyone,

 

My apologies if this has been asked already, or if I missed a notification.   I have searched and not found any references to this question.

 

There appears to be a new delay in how quickly condor_status reflects the availability of a core in the pool when HTCondor on a machine is stopped. I am pretty sure that delay was not there before.

 

Example: If I have 80 cores (16 cores on each of 5 VMs that are only running startd's) and they are all up, then condor_status correctly shows 80 cores. If I then shut down HTCondor on one of the VMs, a ps shows that the condor processes are gone, but condor_status does not update to reflect that the number of cores is down to 64 for many minutes (as many as 10 or 15 minutes).

 

I believe that this is new behavior in 8.6 (we are currently running 8.6.6).  I double checked in our 8.4 pool before we updated it and I am pretty sure that it did not have that behavior, meaning a shutdown of HTCondor on a VM in a pool was immediately reflected in condor_status.

 

Is this behavior expected? Is there a better way (other than the ps) to determine what cores are really there with a reliable immediate answer? We have been troubleshooting some issues which have required a number of shutdowns and startups, and it has become an issue (really just a pain in the..., there are other ways to tell) that the condor_status result is not a true current reflection of the status of the pool. Did I miss a new knob or a new command?  :)

 

            Thank You -- Mary

 

Mary Romelfanger

Deputy Branch Manager

Data Systems Branch

.___.      

{o,o}      Phone 410-338-6708
/)__)      Cell      443-244-0191
-"-"-          mary@xxxxxxxxx

 

Space Telescope Science Institute

3700 San Martin Drive

Baltimore, MD 21218