Re: [Condor-users] Recovering spool files from windows machines



Nate,

The reason for the job becoming "disconnected" is best investigated by looking in the StarterLog and StartLog on the execute machine. In case it helps, I should mention that someone with administrative powers can retrieve these logs remotely using condor_fetchlog. Example:

condor_fetchlog full.host.name STARTD
condor_fetchlog full.host.name STARTER.slot1

Also, what version of Condor?  (condor_version)

--Dan

Nathan Kaib wrote:

Hi,

I'm running jobs on Windows machines at Purdue that take several days to complete, and in the process I am wasting a lot of computing time. I have my jobs set up so that the dumpfiles are transferred to the spool directory when a job is evicted from a machine, and the updated spool files are then transferred to the next machine when the job starts executing again. The problem is that the machines often seem to be manually rebooted (or something else unexpected suddenly takes the machine off the network). This is why a large fraction of my jobs stop executing on a given machine, and when that happens I lose all of the completed computation, because Condor never goes through its normal eviction process and so my dumpfiles are never updated.
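For context, this kind of setup typically relies on ON_EXIT_OR_EVICT file transfer in the submit file. The following is just a simplified sketch with placeholder executable and file names, not my exact submit file:

universe                = vanilla
executable              = sim.exe          # placeholder name
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT # copy intermediate files to the spool on eviction
transfer_input_files    = dump.dat         # placeholder dumpfile name
transfer_output_files   = dump.dat
log                     = job_$(Cluster).$(Process).log
queue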

An example of this can be seen in the following excerpt from one of my job log files (my message continues below the excerpt):

----------------------------------------------------------------------------------

006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
 Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
 Job disconnected too long: JobLeaseDuration (1200 seconds) expired
 Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host: <128.210.59.72:1057>

----------------------------------------------------------------------

So far the only way I've found to get around this is to run a condor_vacate_job /cluster/ command periodically to force the jobs to vacate and update normally. I have two questions: 1) Do you know why so many jobs are suddenly killed in the way I've described above? 2) Is there an easier or more efficient way to update the spool files periodically and avoid this problem?
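For reference, the periodic vacate I mentioned above is currently just a cron entry along these lines (the cluster number is the one from the log excerpt; the interval is arbitrary):

# Refresh the spooled dumpfiles by forcing a normal eviction twice a day
0 */12 * * * condor_vacate_job 78912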

Thanks,

Nate Kaib

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/