
Re: [Condor-users] CCB Server - Client Communication (Condor 7.3.1)



Hello - I apologize if this is posted in the wrong place, but my email client is having problems responding to the mailing list post.

In response to Dan's request, 
>Look in CollectorLog on the central manager. Does it report any errors around the time of the failed CCB_REQUEST attempt?
>
>If not, please add D_FULLDEBUG, D_SECURITY, and D_COMMAND to COLLECTOR_DEBUG and SCHEDD_DEBUG.

I have added the following lines to the condor_config.local file on the CM:
STARTD_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY
NEGOTIATOR_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY
COLLECTOR_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY

During this time, the CollectorLog shows no errors - all authorizations are granted. In the SchedLog:
07/18 01:48:45 SECMAN: not negotiating, just sending command (68)
07/18 01:48:45 Authorizing server '*/134.48.90.158'.
07/18 01:48:45 Return from Handler <SecManStartCommand::WaitForSocketCallback CCB_REQUEST>
07/18 01:48:45 Calling Handler <DCMessenger::receiveMsgCallback CCB_REQUEST> (5)
07/18 01:48:45 Completed CCB_REQUEST to collector 134.48.90.158:9618
07/18 01:48:45 CCBClient: received failure message from CCB server 134.48.90.158:9618 in response to (non-blocking) request for reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxx
07/18 01:48:45 CCBClient: no more CCB servers to try for requesting reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.0.2.15:51444>#1247818233#5#... for herzfeldd.
07/18 01:48:45 Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (6)
07/18 01:48:45 SECMAN: resuming command 442 REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.0.2.15:51444>#1247818233#5#... for herzfeldd@xxxxxxxxxxxxxxxxxxxxx from TC.
07/18 01:48:45 SECMAN: TCP connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.0.2.15:51444>#1247818233#5#... for herzfeldd@xxxxxxxxxxxxxxxxxxxxx failed.
07/18 01:48:45 Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.0.2.15:51444>#1247818233#5#... for herzfeldd@xxxxxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP conn.
07/18 01:48:45 Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.0.2.15:51444>#1247818233#5#... for herzfeldd@xxxxxxxxxxxxxxxxxxxxx, 850.2) deleted
07/18 01:48:45 Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM>
07/18 01:48:45 Return from Handler <DCMessenger::receiveMsgCallback CCB_REQUEST>
07/18 01:48:45 Calling Handler <DCMessenger::receiveMsgCallback CCB_REQUEST> (15)
07/18 01:48:45 Completed CCB_REQUEST to collector 134.48.90.158:9618

On a perhaps related note: I am unable to run condor_q -better-analyze; it fails with the error "Unable to process machine ClassAds."

Many thanks,
David
_______________________________________
From: Herzfeld, David
Sent: Thursday, July 16, 2009 5:15 PM
To: condor-users@xxxxxxxxxxx
Subject: CCB Server - Client Communication (Condor 7.3.1)

Hello Condor Group,

We are having an issue running jobs using the new CCB feature in Condor. We have nodes running a master and startd behind a NAT (Condor 7.3.1). These execute hosts connect to a Central Manager with a public address, running a Collector, Negotiator, etc. (Condor 7.3.0). The machines appear to join the pool correctly - we can see them in condor_status and their status changes appropriately.
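In case it helps, the CCB-related configuration on our side is roughly the following (host names are placeholders; these are the standard 7.3 knobs as we understand them):

```
# On the NATed execute hosts (7.3.1): register with the public
# collector, which acts as our CCB broker.
CCB_ADDRESS = $(COLLECTOR_HOST)

# Hosts on the same private LAN should still connect directly,
# bypassing CCB (network name is illustrative).
PRIVATE_NETWORK_NAME = nat.example.edu
```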

However, running a job on the machine only works intermittently. Most of the time we receive the following in the schedd log:

Match record (worker_EEFFCD67D127.domain.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted
07/16 16:42:18 (pid:562) Sent ad to central manager for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:18 (pid:562) Sent ad to 1 collectors for herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Activity on stashed negotiator socket
07/16 16:42:37 (pid:562) Negotiating for owner: herzfeldd@xxxxxxxxxxxxxxx
07/16 16:42:37 (pid:562) Out of servers - 1 jobs matched, 9 jobs idle, 1 jobs rejected
07/16 16:42:37 (pid:562) Failed to send CCB_REQUEST to collector 192.168.10.18:9618:
07/16 16:42:37 (pid:562) CCBClient: no more CCB servers to try for requesting reversed connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx; giving up.
07/16 16:42:37 (pid:562) Failed to send REQUEST_CLAIM to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx failed
07/16 16:42:37 (pid:562) Match record (worker_EEFFCD67D127.bio.mscs.mu.edu <10.0.2.15:52446?CCBID=192.168.10.18:9618#217> for herzfeldd@xxxxxxxxxxxxxxx, 30.0) deleted.

The log files on the execute host show nothing unusual - no jobs are rejected, nor is any communication failure reported. ALLOW_DAEMON on the execute host is set to *.
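One thing we were unsure of: since CCB requests are handled by the collector on the Central Manager, perhaps the CM's DAEMON-level authorization also matters, not just the execute host's. What we have there is roughly this (the host list below is illustrative, not our literal config):

```
# On the central manager: the schedd's host and the execute nodes'
# public (post-NAT) addresses presumably need DAEMON-level
# authorization for CCB_REQUEST to be accepted by the collector.
ALLOW_DAEMON = *.mscs.mu.edu, 134.48.90.158
```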

Sometimes a series of jobs runs successfully (usually right after the execute node joins the pool). Any help with this matter would be greatly appreciated.

Many thanks,
David