[Condor-users] job_lease_duration / multiple nics

I have a dedicated Condor cluster here at FSU.  Occasionally jobs are evicted after 20 minutes.  Setting job_lease_duration in the submit file resolves the problem.
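
For reference, the workaround looks roughly like the submit file sketch below. The executable name and the lease value are placeholders chosen for illustration, not the actual settings from my jobs; job_lease_duration is given in seconds.

```
# Sketch of a vanilla-universe submit file with a longer job lease.
# my_job.sh and 7200 are hypothetical placeholder values.
universe           = vanilla
executable         = my_job.sh
# Lease length in seconds; a longer lease means a longer stretch without
# a renewal is tolerated before the claim on the job is given up.
job_lease_duration = 7200
queue
```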

My submit nodes have both a public and a private interface.  Communication with condor processing nodes occurs over the private network and the central manager is on the same private network.
##  What machine is your central manager?

Looking at the processing node logs (StartLog) I see the following - 

03/16/12 12:21:05 slot5: Remote owner is dcshrum@xxxxxxxxxxxxxxxxx
03/16/12 12:21:05 slot5: State change: claiming protocol successful
03/16/12 12:21:05 slot5: Changing state: Matched -> Claimed
03/16/12 12:21:05 slot4: Got activate_claim request from shadow (<>)
03/16/12 12:21:05 slot4: Remote job ID is 9184.0
03/16/12 12:21:05 slot4: Got universe "VANILLA" (5) from request classad
03/16/12 12:21:05 slot4: State change: claim-activation protocol successful
03/16/12 12:21:05 slot4: Changing activity: Idle -> Busy
03/16/12 12:23:59 attempt to connect to <> failed: Connection timed out (connect errno = 110).  Will keep trying for 597 total seconds (576 to go).

The two addresses in the log lines above are both IPs on the submit node.  It appears that communication over the private address is fine for accepting the job.  I presume the failure to connect back on the public address is my problem.

From this I have two questions:
1 - I'm not sure how to force Condor to ignore the public IP address on the submit node.
2 - I thought the lease renewal occurred when the submit node contacted the processing node, not the other way around as my log seems to indicate.