
Re: [Condor-users] job_lease_duration / multiple nics



On the submit host, you can set NETWORK_INTERFACE = 10.x.x.x
and BIND_ALL_INTERFACES = FALSE in condor_config.local.
That might have other consequences, but it should be the first thing you try.
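
As a rough sketch, the condor_config.local on the submit host might look like the fragment below. The address 10.175.15.2 is an assumption: it is the private address of the submit node as reported in the quoted message, so substitute whatever your private interface actually uses. The daemons on that host will most likely need a restart, not just a reconfig, before the new binding takes effect.

```
# condor_config.local on the submit host (illustrative sketch)

# Bind to and advertise only the private interface.
# Assumed value: the submit node's private address from the quoted message.
NETWORK_INTERFACE = 10.175.15.2

# Do not listen on every interface; use only the one named above.
BIND_ALL_INTERFACES = FALSE
```

Afterwards, condor_config_val NETWORK_INTERFACE on the submit host should report the private address.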

Steve Timm



On Fri, 16 Mar 2012, Shrum, Donald C wrote:

I have a dedicated Condor cluster here at FSU. Occasionally, jobs are evicted at the 20-minute mark. Setting job_lease_duration in the submit file resolves the problem.

My submit nodes have both a public and a private interface. Communication with the Condor processing nodes occurs over the private network, and the central manager is on the same private network:
##  What machine is your central manager?
CONDOR_HOST     = 10.178.6.5



Looking at the processing node logs (StartLog), I see the following:

03/16/12 12:21:05 slot5: Remote owner is dcshrum@xxxxxxxxxxxxxxxxx
03/16/12 12:21:05 slot5: State change: claiming protocol successful
03/16/12 12:21:05 slot5: Changing state: Matched -> Claimed
03/16/12 12:21:05 slot4: Got activate_claim request from shadow (<10.175.15.2:37311>)
03/16/12 12:21:05 slot4: Remote job ID is 9184.0
03/16/12 12:21:05 slot4: Got universe "VANILLA" (5) from request classad
03/16/12 12:21:05 slot4: State change: claim-activation protocol successful
03/16/12 12:21:05 slot4: Changing activity: Idle -> Busy
03/16/12 12:23:59 attempt to connect to <144.174.80.97:51477> failed: Connection timed out (connect errno = 110).  Will keep trying for 597 total seconds (576 to go).


10.175.15.2 (private) and 144.174.80.97 (public) are the IP addresses of the submit node. Communication over 10.175.15.2 appears to work, since the job is accepted. I presume the failure to connect back to 144.174.80.97 is my problem.


From this I have two questions:
1 - How do I force Condor to ignore the public IP address on the submit node?
2 - I thought the lease renewal occurred when the submit node contacted the processing node, not the other way around as my log seems to indicate. Is that correct?




--Donny
FSU HPC



------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.