Re: [Condor-users] CCB Server - Client Communication (Condor 7.3.1)

Look in CollectorLog on the central manager. Does it report any errors around the time of the failed CCB_REQUEST attempt?

If not, please add D_FULLDEBUG, D_SECURITY, and D_COMMAND to COLLECTOR_DEBUG and SCHEDD_DEBUG.
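As a rough sketch of that change, the settings might look like the fragment below. Which file they go in (for example a local config file on the central manager and on the submit machine) and whether these macros already have values at your site are assumptions; merge with any existing flags rather than overwriting them.

```
# Assumed location: the local Condor config file on each machine.
# COLLECTOR_DEBUG applies on the central manager, SCHEDD_DEBUG on the submit machine.
COLLECTOR_DEBUG = D_FULLDEBUG D_SECURITY D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG D_SECURITY D_COMMAND
```

Running condor_reconfig afterwards should make the daemons pick up the new debug levels; the extra output then appears in CollectorLog and SchedLog.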

--Dan

Herzfeld, David wrote:
Hello Condor Group,

We are having an issue running jobs using the new CCB feature in Condor. We have nodes running a master and startd behind a NAT (Condor 7.3.1). These execute hosts connect to a central manager with a public address, which runs a Collector, Negotiator, etc. (Condor 7.3.0). The machines appear to join the pool correctly: we can see them in condor_status and their status changes appropriately.
However, running a job on the machine only works intermittently. Most of the time we receive the following in the schedd log:

Match record (worker_EEFFCD67D127.domain.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted
07/16 16:42:18 (pid:562) Sent ad to central manager for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:18 (pid:562) Sent ad to 1 collectors for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Activity on stashed negotiator socket
07/16 16:42:37 (pid:562) Negotiating for owner: herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Out of servers - 1 jobs matched, 9 jobs idle, 1 jobs rejected
07/16 16:42:37 (pid:562) Failed to send CCB_REQUEST to collector 192.168.10.18:9618: 07/16 16:42:37 (pid:562) CCBClient: no more CCB servers to try for requesting reversed connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx; giving up.
07/16 16:42:37 (pid:562) Failed to send REQUEST_CLAIM to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx failed
07/16 16:42:37 (pid:562) Match record (worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted.
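For anyone reading these log lines, the address in angle brackets can be split into the startd's private address and the CCB contact information. The Python sketch below is based only on the shape of the addresses shown in this log; the function and field names are made up for illustration and are not part of Condor.

```python
import re
from typing import NamedTuple, Optional


class CCBAddress(NamedTuple):
    """Pieces of an address such as <10.0.2.15:52446?CCBID=192.168.10.18:9618#217>."""
    host: str                 # address the daemon reports for itself
    port: int
    ccb_host: Optional[str]   # CCB server named after CCBID=, if any
    ccb_port: Optional[int]
    ccb_id: Optional[int]     # number after '#'


# Pattern inferred from the log lines above; real addresses may carry
# other parameters that this sketch does not handle.
_ADDR = re.compile(
    r"<(?P<host>[^:>]+):(?P<port>\d+)"
    r"(?:\?CCBID=(?P<ccb_host>[^:>]+):(?P<ccb_port>\d+)#(?P<ccb_id>\d+))?>"
)


def parse_ccb_address(text: str) -> Optional[CCBAddress]:
    """Return the first bracketed address found in text, or None."""
    m = _ADDR.search(text)
    if m is None:
        return None
    has_ccb = m.group("ccb_host") is not None
    return CCBAddress(
        host=m.group("host"),
        port=int(m.group("port")),
        ccb_host=m.group("ccb_host"),
        ccb_port=int(m.group("ccb_port")) if has_ccb else None,
        ccb_id=int(m.group("ccb_id")) if has_ccb else None,
    )


if __name__ == "__main__":
    line = ("Failed to send REQUEST_CLAIM to startd "
            "<10.0.2.15:52446?CCBID=192.168.10.18:9618#217>")
    print(parse_ccb_address(line))
```

In the lines above, the schedd is asking the CCB server at 192.168.10.18:9618 to have the startd at 10.0.2.15:52446 connect back to it, and it is that request to the collector that fails.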

The log files on the execute host show nothing unusual: no jobs are being rejected, and nothing indicates that any sort of communication failure has occurred. The ALLOW_DAEMON line on the execute host is set to *.

Sometimes a series of jobs are able to run (usually right after the execute node joins the pool). Any help in this matter would be greatly appreciated.

Many thanks,
David