
Re: [condor-users] newbie question




The HOSTALLOW_* values are whatever the installation script set them to; I have not edited them. Tailing NegotiatorLog, I see connect failures. The central manager and one of the submit machines, 141.142.65.40, are not firewalled. The other submit machine, 141.142.15.3, may be.


4/23 16:06:38 ---------- Started Negotiation Cycle ----------
4/23 16:06:38 Phase 1:  Obtaining ads from collector ...
4/23 16:06:38   Getting all public ads ...
4/23 16:06:38   Sorting 15 ads ...
4/23 16:06:38   Getting startd private ads ...
4/23 16:06:38 Got ads: 15 public and 7 private
4/23 16:06:38 Public ads include 2 submitter, 7 startd
4/23 16:06:38 Phase 2:  Performing accounting ...
4/23 16:06:38 Phase 3:  Sorting submitter ads by priority ...
4/23 16:06:38 Phase 4.1:  Negotiating with schedds ...
4/23 16:06:38   Negotiating with remijan@xxxxxxxxxxxxx at <141.142.15.3:33875>
4/23 16:07:08 select returns 0, connect failed
4/23 16:07:08 Will keep trying for 30 seconds...
4/23 16:07:09 Connect failed for 30 seconds; returning FALSE
4/23 16:07:09     Failed to connect to <141.142.15.3:33875>
4/23 16:07:09   Error: Ignoring schedd for this cycle
4/23 16:07:09   Negotiating with remijan@xxxxxxxxxxxxx at <141.142.65.40:35243>
4/23 16:07:09     Request 00004.00000:
4/23 16:10:18 Can't connect to <141.142.15.3:33876>:0, errno = 110
4/23 16:10:18 Will keep trying for 10 seconds...
4/23 16:10:19 Connect failed for 10 seconds; returning FALSE
4/23 16:10:19 ERROR:
SECMAN:2003:TCP connection to <141.142.15.3:33876> failed
4/23 16:10:19 condor_write(): Socket closed when trying to write buffer
4/23 16:10:19 Buf::write(): condor_write() failed
4/23 16:10:19       Could not send PERMISSION
4/23 16:10:19   Error: Ignoring schedd for this cycle
4/23 16:10:19 ---------- Finished Negotiation Cycle ----------
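The failures above are all against 141.142.15.3, and errno 110 is a connection timeout on Linux, which would fit a firewall silently dropping packets. One way to check this independently of Condor is to attempt a plain TCP connection from the central manager to the schedd's address. This is only a rough sketch: the helper name `can_connect` is made up for illustration, and the schedd's port is assumed to be whatever the log currently shows (it appears to be ephemeral, so it may change when the schedd restarts).

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests raw TCP reachability
    and does not speak any Condor protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Run on the central manager, `can_connect("141.142.15.3", 33875)` returning False while `can_connect("141.142.65.40", 35243)` returns True would point at the network path to the first submit machine rather than at the Condor configuration.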




At 02:26 PM 4/23/2004, you wrote:


5 match, but prefer another specific job despite its worse user-priority

This message is misleading. It really means, "something else is wrong". Yup, that's vague and not useful.


Your ClassAds don't reveal anything interesting. Is there anything useful in your job log file?

/home/remijan/condor-jobs/hello/job.log

Are your permissions set up correctly to allow you to access the central manager? That is, are the HOSTALLOW_* variables set correctly? If they aren't, you'll see permission denied errors in the CollectorLog files on the central manager.

If the above do not help, try this:

1) On your central manager:
   tail -f NegotiatorLog

2) On the submit computer:
   condor_reschedule

You should see the negotiator trying to look for a match with your jobs. It may report errors, or it may fail. What do you see?
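If the NegotiatorLog is busy, a small filter can pull out just the schedds the negotiator gave up on. The following is a minimal sketch, not part of Condor: it assumes failures are logged in the form `Failed to connect to <host:port>`, as in the log excerpt elsewhere in this thread, and the function name `failed_schedds` is invented for illustration.

```python
import re

# Matches lines such as: "Failed to connect to <141.142.15.3:33875>"
FAILED = re.compile(r"Failed to connect to <([\d.]+):(\d+)>")


def failed_schedds(lines):
    """Return the unique (host, port) pairs reported as unreachable, in first-seen order."""
    seen = []
    for line in lines:
        match = FAILED.search(line)
        if match:
            addr = (match.group(1), int(match.group(2)))
            if addr not in seen:
                seen.append(addr)
    return seen
```

It could be fed the lines of the log file, for example `failed_schedds(open("NegotiatorLog"))`, with the path adjusted to wherever the log lives on the central manager.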

If we don't get anywhere with this, let's set up a VNC session/phone call, and I'll help debug it more directly. (VNC will let us share an X window so we can both type and see what is in it.)

-alain


Condor Support Information: http://www.cs.wisc.edu/condor/condor-support/ To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with unsubscribe condor-users <your_email_address>

