
[HTCondor-users] What if an active pool PC disappears from the HTCondor radar?



Hi,

I have an HTCondor pool of public library PCs. The PCs can be switched on/off by the library visitors ad libitum. Despite this, the 400+ PCs provide HTCondor with some 2500 hrs of CPU time.

Occasionally a PC that is running an HTCondor job suddenly vanishes, either because it is switched off or because of some other unfortunate event (someone pulling the network cable, etc.).

What parameters on the HTCondor master determine how to handle such a case?

1) I have noticed that my HTCondor master seems to wait for a certain amount of time, but then decides to give up on the job and restart it elsewhere.

2) I have also noticed that this "wait time until giving up" is added to the HTCondor RUN_TIME value, although the job made no progress during that time; the log file then has one "ExecuteEvent" followed immediately by the next "ExecuteEvent", without any suspension or checkpointing in between. In that case the value of RUN_TIME obviously ends up too large. Could this be a bug in HTCondor?
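
To make the pattern in point 2 concrete, here is a minimal Python sketch that scans a list of job events for an "ExecuteEvent" directly followed by another "ExecuteEvent" and reports the time between them. The event records are hypothetical, simplified (name, timestamp) tuples with made-up times, not real HTCondor log output, and the function name is my own. Note that the reported interval contains both the real computation on the first PC and the wait before the restart; the log alone does not seem to let me separate the two.

```python
from datetime import datetime

# Hypothetical, simplified event records: (event name, timestamp).
# The times are invented for illustration, not taken from a real log.
events = [
    ("SubmitEvent", datetime(2014, 1, 1, 9, 0, 0)),
    ("ExecuteEvent", datetime(2014, 1, 1, 9, 5, 0)),    # job starts on PC A
    ("ExecuteEvent", datetime(2014, 1, 1, 11, 45, 0)),  # job restarts on PC B
    ("JobTerminatedEvent", datetime(2014, 1, 1, 12, 45, 0)),
]


def back_to_back_executes(events):
    """Find each ExecuteEvent that is directly followed by another one.

    Returns a list of (first_start, restart, interval) tuples. The interval
    is the time between the two events, with no suspension, checkpoint or
    eviction recorded in between.
    """
    found = []
    # Walk over adjacent pairs of events in log order.
    for (name_a, time_a), (name_b, time_b) in zip(events, events[1:]):
        if name_a == "ExecuteEvent" and name_b == "ExecuteEvent":
            found.append((time_a, time_b, time_b - time_a))
    return found


gaps = back_to_back_executes(events)
for first_start, restart, interval in gaps:
    print(f"started {first_start}, restarted {restart}, interval {interval}")
```

If RUN_TIME simply accumulates wall-clock time from each start until the master gives up, then the whole interval reported here would be counted, which matches what I observe.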

I hope somebody can help me out here.

Thank you.
Rob