Re: [Condor-users] Recovering spool files from windows machines



Nate,

The reason for the job becoming "disconnected" is best investigated by looking in the StarterLog and StartLog on the execute machine. In case it helps, I should mention that someone with administrative powers can retrieve these logs remotely using condor_fetchlog. Example:

condor_fetchlog full.host.name STARTD
condor_fetchlog full.host.name STARTER.slot1

Also, what version of Condor?  (condor_version)

--Dan

Nathan Kaib wrote:

Hi,

I'm running jobs on Windows machines at Purdue that take several days to complete, and in the process I am wasting a lot of computing time. I have my jobs set up so that the dumpfiles are transferred to the spool directory when a job is evicted from a machine, and the updated spool files are then transferred to the next machine when the job starts executing again. The problem is that the machines often seem to be manually rebooted (or something else unexpected suddenly takes the machine off the network). This is why a large fraction of my jobs stop executing on a given machine, and when that happens I lose all of the completed computation, because Condor never goes through its normal eviction process and so my dumpfiles are never updated.
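For context, this kind of setup typically relies on ON_EXIT_OR_EVICT file transfer in the submit file. The following is just a simplified sketch with placeholder executable and file names, not my exact submit file:

universe                = vanilla
executable              = sim.exe          # placeholder name
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT # copy intermediate files to the spool on eviction
transfer_input_files    = dump.dat         # placeholder dumpfile name
transfer_output_files   = dump.dat
log                     = job_$(Cluster).$(Process).log
queue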

An example of this can be seen in the following excerpt from one of my job log files (my message continues below the excerpt):

----------------------------------------------------------------------------------

006 (78912.994.000) 08/25 11:46:21 Image size of job updated: 4980
...
022 (78912.994.000) 08/25 11:49:23 Job disconnected, attempting to reconnect
 Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <128.210.37.33:1061>
...
024 (78912.994.000) 08/25 12:09:23 Job reconnection failed
 Job disconnected too long: JobLeaseDuration (1200 seconds) expired
 Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (78912.994.000) 08/25 14:38:31 Job executing on host: <128.210.59.72:1057>

----------------------------------------------------------------------

So far the only way I've found to get around this is to run a condor_vacate_job /cluster/ command periodically to force the jobs to vacate and update normally. I have two questions: 1) Do you know why so many jobs are suddenly killed in the way I've described above? 2) Is there an easier or more efficient way to update the spool files periodically and avoid this problem?
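For reference, the periodic vacate I mentioned above is currently just a cron entry along these lines (the cluster number is the one from the log excerpt; the interval is arbitrary):

# Refresh the spooled dumpfiles by forcing a normal eviction twice a day
0 */12 * * * condor_vacate_job 78912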

Thanks,

Nate Kaib

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/