
[Condor-users] Checkpoint servers and vanilla universes on windows



We make extensive use of the inferred checkpointing available on
clipped platforms to self-checkpoint.

Of late this has been causing some serious performance issues, due
(I think) to the overhead when a sudden batch of preemptions occurs.

When 50+ servers, each with 2 CPUs, all try to send back up to 0.5 GB
of (compressed) data, the schedd and shadows seem to give up the
ghost.

I think I am being somewhat optimistic in ever expecting the local
submitters to handle this (they are dual Xeons, but the disks aren't
RAIDed and the network is not gigabit right through to them).
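
To put a rough number on that, here is a back-of-envelope estimate. It
is only a sketch: the 100 Mbit/s link speed is an assumption (all I
know for sure is that it is not gigabit), it assumes 0.5 GB per CPU
rather than per server, and it pretends everything funnels through a
single link with no protocol or disk overhead.

```python
def worst_case_transfer_seconds(servers, cpus_per_server, gb_per_job,
                                link_mbit_per_s):
    """Time to push every job's checkpoint through one link, back to back."""
    # Total payload in bytes (decimal GB), one checkpoint per CPU.
    total_bytes = servers * cpus_per_server * gb_per_job * 1e9
    # Link speed converted from megabits/s to bytes/s.
    link_bytes_per_s = link_mbit_per_s * 1e6 / 8
    return total_bytes / link_bytes_per_s


# 50 servers x 2 CPUs x 0.5 GB over an assumed 100 Mbit/s link.
seconds = worst_case_transfer_seconds(50, 2, 0.5, 100)
print(f"{seconds / 60:.0f} minutes")
```

Even with those generous simplifications it comes out at over an hour
of sustained transfer, which fits with the submitters falling over.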

Therefore I want to sort out storing the checkpoint data on (several)
central fileservers.

So I have two (possible) options:

1) Just change the app to write the data to the fileserver as needed,
then write a small file to disk recording what it wrote and where. On
restart, the presence of that file tells the app it has been
restarted, and it goes and gets the data.

Great, as I get full control, but it needs explicit coding in the app,
not to mention the risk of the server being down, issues remapping
jobs to point to the right place, scalability, preening dead data,
etc.

2) Use a Condor checkpoint server (or, more likely, two, located for
each bank of nodes) so that the machine itself knows where to get the
data from/to and will handle unique identification of jobs for me, as
well as (I believe) removing old files for dead jobs.
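
For what it's worth, option 1 above could be sketched roughly as
follows. This is a minimal illustration, not what our app does today:
the marker file name, the JSON layout and both function names are made
up, and it assumes the fileserver is reachable as an ordinary path
(e.g. a mapped share) and that the small marker file is what ends up
back in the job's working directory on restart.

```python
import json
import os
import shutil

MARKER_NAME = "checkpoint_marker.json"  # hypothetical marker file name


def save_checkpoint(local_data_path, fileserver_dir, job_id, scratch_dir):
    """Copy checkpoint data to the fileserver, then record where it went."""
    os.makedirs(fileserver_dir, exist_ok=True)
    remote_path = os.path.join(fileserver_dir, f"{job_id}.ckpt")

    # Copy under a temporary name first so a half-written file is never
    # mistaken for a complete checkpoint.
    partial = remote_path + ".partial"
    shutil.copyfile(local_data_path, partial)
    os.replace(partial, remote_path)

    # Write the marker only after the data is safely in place.
    marker_path = os.path.join(scratch_dir, MARKER_NAME)
    tmp_marker = marker_path + ".tmp"
    with open(tmp_marker, "w") as f:
        json.dump({"job_id": job_id, "remote_path": remote_path}, f)
    os.replace(tmp_marker, marker_path)
    return marker_path


def restore_checkpoint(scratch_dir, restore_to):
    """Return True if a marker was found and the data was fetched."""
    marker_path = os.path.join(scratch_dir, MARKER_NAME)
    if not os.path.exists(marker_path):
        return False  # no marker: this is a fresh start
    with open(marker_path) as f:
        marker = json.load(f)
    remote_path = marker["remote_path"]
    if not os.path.exists(remote_path):
        return False  # server down or data preened: start from scratch
    shutil.copyfile(remote_path, restore_to)
    return True
```

Note that this does nothing about cleaning up checkpoints for dead
jobs or coping with a fileserver outage beyond falling back to a fresh
start, which is exactly the extra work I would rather not take on.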

The question is: do Condor checkpoint servers work on Windows and with
inferred checkpointing?
And if they do, are there performance implications to handling that
volume of concurrent checkpointing (compared to just letting the
Windows server deal with it)?

http://www.cs.wisc.edu/condor/manual/v6.7/3_4Contrib_Module.html#sec:Ckpt-Server

...doesn't really say whether it works in this manner or not.

Any other suggestions most welcome

Thanks,
Matt