
Re: [HTCondor-users] Ghost machine list from `condor_status -any`



I suspect the HTCondor daemons in these SLURM jobs are being killed with insufficient time to inform the central manager that they are going away. With a standard configuration, they should disappear from condor_status after 15 minutes.
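
One way to give the daemons time to shut down, sketched here under some assumptions: start condor_master in the background so the batch shell can react to signals, ask SLURM to signal the batch shell ahead of the time limit, and forward that signal to the master. The 120-second lead time and the `forward_shutdown` helper name are illustrative, not something from your setup. As far as I know, condor_master treats SIGTERM as a request for a graceful shutdown, which should give it the chance to inform the central manager before it exits.

```shell
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --exclusive
# Assumption: signal the batch shell (B:) with SIGTERM 120 seconds before the time limit.
#SBATCH --signal=B:TERM@120

# Run condor_master in the foreground (-f) but as a background job of this
# shell, so that bash can run the trap below while the master is still alive.
condor_master -f &
master_pid=$!

# Hypothetical helper: pass the shutdown request on to condor_master.
forward_shutdown() {
    kill -TERM "$master_pid" 2>/dev/null
}
trap forward_shutdown TERM INT

# `wait` returns early (non-zero) when a trapped signal arrives,
# so keep waiting until the master process has really exited.
while kill -0 "$master_pid" 2>/dev/null; do
    wait "$master_pid" || true
done
```

Note that `--signal` only covers the approach of the time limit. If the jobs are removed with scancel, whether the batch shell receives the signal depends on how scancel is invoked and on the site's SLURM configuration, so it is worth testing on your cluster.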

If they still appear in condor_status after 15 minutes, that suggests that the daemons are still running on the SLURM nodes (i.e. SLURM failed to kill them).
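
To see which ads have outlived that window, a small filter along these lines may help. It is only a sketch: `stale_ads` is a made-up helper, and it assumes the collector attaches a `LastHeardFrom` timestamp (seconds since the epoch) to the ads it returns, which is worth confirming on your version. The 900-second default simply mirrors the 15 minutes mentioned above.

```shell
# Hypothetical helper: read lines of the form "<LastHeardFrom> <MyType> <Name>"
# on stdin and print the ads whose last update is older than max_age seconds.
stale_ads() {
    local max_age=${1:-900}
    local now
    now=$(date +%s)
    awk -v now="$now" -v max="$max_age" '
        # Skip ads without a numeric timestamp (e.g. "undefined").
        $1 ~ /^[0-9]+$/ && (now - $1) > max {
            age = now - $1
            $1 = ""              # drop the timestamp, keep type and name
            sub(/^ +/, "")
            printf "%ds\t%s\n", age, $0
        }'
}

# Intended usage on the central manager (timestamp first, since names may contain spaces):
#   condor_status -any -af LastHeardFrom MyType Name | stale_ads 900
```

Any node that keeps showing a small age after its SLURM job has ended is still sending updates, which points to daemons that were not actually killed.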

 - Jaime

> On Aug 22, 2023, at 1:18 PM, Seung-Jin Sul <ssul@xxxxxxx> wrote:
> 
> Hi, 
> 
> We are using HTCondor with a SLURM backend, and recently we have seen deallocated SLURM nodes showing up in the output of the `condor_status -any` command, as in the listing below.
> 
> 
> ```
> $ condor_status -any
> MyType             TargetType         Name
> 
> Collector          None               My Pool - ln010@ln010
> Submitter          None               condor_pool@svc
> Scheduler          None               svc@ln010
> DaemonMaster       None               svc@ln010
> Negotiator         None               svc@ln010
> Machine            Job                slot1@n0013
> DaemonMaster       None               svc@n0013
> Machine            Job                slot1@n0004
> DaemonMaster       None               svc@n0004
> Accounting         none               <none>
> Accounting         none               condor_pool@svc
> ```
> 
> 
> `n0013` and `n0004` were previously allocated and used as HTCondor worker nodes, but they have since been deallocated.
> We know the `n0013` and `n0004` entries will be cleared eventually, but we are wondering whether there is a better way to handle this case, such as removing them from the list promptly.
> 
> We are starting an HTCondor worker with a SLURM script like the one below.
> 
> ```
> #!/bin/bash
> #SBATCH -t 72:00:00
> #SBATCH --exclusive
> 
> # Run condor_master in the foreground (-f)
> condor_master -f
> ```
> 
> Any comment will be appreciated. 
> 
> 
> Best, 
> Seung Sul