
[Condor-users] Recovering spool files from windows machines



Hi,

I'm running jobs on Windows machines at Purdue that take a number of days to complete, and in the process I am wasting a lot of computing time. I have my jobs set up so that the dumpfiles are transferred to the spool directory when a job is evicted from a machine, and the updated spool files are then transferred to the next machine when the job executes again. The problem is that the machines often seem to be manually rebooted (or something else unexpected happens that suddenly takes the machine off the network). This accounts for a large fraction of the cases where my jobs stop executing on a given machine. When it happens, I lose all of the computing done on that machine, because Condor never goes through the normal eviction process and so my dumpfiles are not updated.

An example of this can be found in this log file (more text below excerpt):

----------------------------------------------------------------------------------

006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
 Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
 Job disconnected too long: JobLeaseDuration (1200 seconds) expired
 Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host: <128.210.59.72:1057>

----------------------------------------------------------------------
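To get a sense of how often this happens, the user log could be scanned for the same pattern as in the excerpt above. The following is only a rough sketch: it assumes every event header line looks like the ones shown (three-digit event code, job id in parentheses, date, time, message), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches event header lines such as:
# 024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
# Assumption: all event headers in the log follow this layout.
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def parse_events(log_text):
    """Return (code, job, timestamp, message) for each event header line."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:  # indented detail lines and "..." separators are skipped
            events.append((m["code"], m["job"], m["stamp"], m["text"]))
    return events


def failed_reconnects(log_text):
    """Count 'Job reconnection failed' (code 024) events per job."""
    return Counter(
        job for code, job, _, _ in parse_events(log_text) if code == "024"
    )


# Sample taken from the excerpt above (host names left out).
SAMPLE = """\
006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
 Socket between submit and execute hosts closed unexpectedly
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
 Job disconnected too long: JobLeaseDuration (1200 seconds) expired
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host: <128.210.59.72:1057>
"""

print(failed_reconnects(SAMPLE))
```

Each job that shows up in the counter lost its machine without a normal eviction at least once.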

So far, the only way I've found to get around this is to run a condor_vacate_job /cluster/ command periodically, forcing the jobs to vacate and update normally. I have two questions: 1) Do you know why so many jobs are suddenly killed in the way I've described above? 2) Is there an easier or more efficient way to update the spool files periodically to avoid this problem?
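For reference, the periodic workaround could be scripted roughly as follows. This is a sketch only: the helper names, the interval, and the number of rounds are hypothetical, and it assumes condor_vacate_job is on the PATH and is called with just the cluster number.

```python
import subprocess
import time


def vacate_command(cluster_id):
    """Build the condor_vacate_job command line for one cluster."""
    return ["condor_vacate_job", str(cluster_id)]


def vacate_periodically(cluster_id, interval_seconds, rounds,
                        run=subprocess.run, sleep=time.sleep):
    """Vacate the cluster `rounds` times, waiting between calls.

    `run` and `sleep` are parameters only so the loop can be tried out
    without a Condor installation; the defaults run the real command.
    """
    for _ in range(rounds):
        # check=False: keep looping even if one vacate attempt fails.
        run(vacate_command(cluster_id), check=False)
        sleep(interval_seconds)


# Hypothetical usage: vacate cluster 78912 every 6 hours, 20 times.
# vacate_periodically(78912, 6 * 3600, 20)
```

Each forced vacate costs a restart on another machine, so the interval trades lost work against restart overhead.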

Thanks,

Nate Kaib