
[Condor-users] CCB Server - Client Communication (Condor 7.3.1)



Hello Condor Group,

We are having an issue running jobs using the new CCB feature in Condor. We have nodes running a master and a startd behind a NAT (Condor 7.3.1). These execute hosts connect to a Central Manager with a public address, running a Collector, Negotiator, etc. (Condor 7.3.0). The machines appear to join the pool correctly: we can see them in condor_status and their status changes appropriately.

However, running a job on the machine works only intermittently. Most of the time we see the following in the schedd log:

Match record (worker_EEFFCD67D127.domain.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted
07/16 16:42:18 (pid:562) Sent ad to central manager for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:18 (pid:562) Sent ad to 1 collectors for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Activity on stashed negotiator socket
07/16 16:42:37 (pid:562) Negotiating for owner: herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Out of servers - 1 jobs matched, 9 jobs idle, 1 jobs rejected
07/16 16:42:37 (pid:562) Failed to send CCB_REQUEST to collector 192.168.10.18:9618: 
07/16 16:42:37 (pid:562) CCBClient: no more CCB servers to try for requesting reversed connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx; giving up.
07/16 16:42:37 (pid:562) Failed to send REQUEST_CLAIM to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx failed
07/16 16:42:37 (pid:562) Match record (worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted.

The log files on the execute host show nothing unusual: no jobs are rejected, and there is no indication that any sort of communication failure has occurred. The ALLOW_DAEMON line on the execute host is set to *.
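For completeness, the CCB-related configuration on the execute hosts is along these lines (values are illustrative; the CCB_ADDRESS is the collector address that appears in the log above):

```
# condor_config.local on the NAT'ed execute hosts (illustrative)
# Point CCB at the central manager's collector so inbound
# connections to the startd can be brokered/reversed:
CCB_ADDRESS  = 192.168.10.18:9618
# As noted above, daemon-level authorization is wide open:
ALLOW_DAEMON = *
```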

Sometimes a series of jobs does run successfully (usually right after the execute node joins the pool). Any help with this matter would be greatly appreciated.

Many thanks,
David