[HTCondor-users] schedd state

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

So, I have used a lot of job schedulers in the past, and in studying the Condor architecture a bit, I found what seems to be a feature unique to Condor.

In all the other systems I've used, there was one job queue, and it was a separate thing from the machine submitting the jobs.

In this kind of arrangement, we've always considered our login nodes somewhat ephemeral: we could scale or reinstall them after removing them from the load balancer or DNS and after all users had logged out.

But with a schedd running on each login node, the login nodes are stateful beyond just having users logged in.

So, some questions:
* How do you know it is safe to shut down a schedd node without affecting a running job? Can you temporarily mark the schedd as not accepting new jobs, so that no new ones start and the node can drain? Does condor_q only show local jobs? If so, is checking for running = 0 enough to tell whether it's safe to shut down?

* If you want to reinstall the node but not lose the jobs, you have to maintain Condor's job state somehow. Is persisting /var/lib/condor/spool all you need to maintain this state, or are there other places on the file system that need to persist?

* For sites that want to scale the number of schedds and the number of login nodes differently, is that possible? Is there a remote schedd mode? I'm sure things like the syscall shadowing wouldn't work in such a mode, but we haven't had a need for that at our site.

Thanks,
Kevin
