Re: [HTCondor-users] Removing a node ungracefully from the master.



So if I understand correctly, that prevents additional jobs from being matched to that machine, correct?  But how do I get Condor to immediately stop the jobs that were running on that machine and resubmit them to the next available machine?  I essentially want that machine to disappear from Condor until it becomes available again, and any jobs that were running on it to go back into the queue to be matched to another machine.

Is this possible?

On Wed, Feb 20, 2013 at 4:46 PM, Nathan Panike <nwp@xxxxxxxxxxx> wrote:
condor_advertise INVALIDATE_STARTD_ADS ...

The condor_advertise man page covers this pretty well.
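
For reference, here is a rough sketch of what that invocation might look like when scripted. It assumes the invalidation ad takes the query form I recall from the condor_advertise man page (MyType, TargetType, and a Requirements expression), so check the man page for your version before relying on it. The function names and the example hostname are hypothetical, made up for illustration. Note that this only asks the collector to drop the machine's ads; by itself it does not touch jobs the scheduler already believes are running.

```python
import subprocess
import tempfile


def build_invalidate_ad(machine):
    """Return ClassAd text meant to match every startd ad from one machine.

    Assumption: the invalidation ad is a query ad with these three
    attributes, as recalled from the condor_advertise man page.
    """
    return (
        'MyType = "Query"\n'
        'TargetType = "Machine"\n'
        f'Requirements = Machine == "{machine}"\n'
    )


def build_command(ad_path):
    """Return the argv for condor_advertise with the invalidate command."""
    return ["condor_advertise", "INVALIDATE_STARTD_ADS", ad_path]


def invalidate_machine(machine):
    """Write the ad to a temporary file and hand it to condor_advertise.

    Hypothetical helper: run it on a host that is allowed to advertise
    to the collector.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
        f.write(build_invalidate_ad(machine))
        ad_path = f.name
    subprocess.run(build_command(ad_path), check=True)
```

Calling invalidate_machine("node17.example.com") would then be the scripted equivalent of writing the three-line ad by hand and passing the file name to condor_advertise.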

Nathan Panike

On Wed, Feb 20, 2013 at 03:46:23PM -0800, Matt Wallington wrote:
> I am having an issue where, when a node is shut down forcefully (i.e. the
> power cable is yanked from the system), the master and scheduler continue
> to think the job is running on the node until it hits a timeout (which is a
> significant amount of time).  Eventually Condor realizes the node is
> offline and resubmits the job to another node.  I've found that if I reduce
> the timeout, then the node will time out on jobs that run longer than the
> timeout even if the node is online and operating properly.
>
> Is there a way to forcefully remove a node from the master (a node that has
> dropped offline but Condor still thinks is running) from the command line
> on the scheduler or master?  I've tried condor_off, condor_vacate,
> condor_vacate_job, etc. but none of these work because they all try to
> reach out to the node (which is now offline).  Is there a command to simply
> remove a node from the pool immediately and have the job start over on
> another node?
>
> Thanks,

--
Matt Wallington  |  CPUsage, Inc.  |  503-708-1919