
[Condor-users] Test jobs failed



Hi,

We have a problem with our Condor installation.

First, we installed Condor on two machines using the full-install option, without a shared file system.
One was chosen as the central manager, and the other is a client.

The Condor version is 6.6.8, and the platform is Scientific Linux CERN 3.

(Scientific Linux CERN 3 is a Linux distribution built within the framework of Scientific Linux, which in turn is rebuilt
from the freely available Red Hat Enterprise Linux 3 product sources under the terms and conditions of that product's EULA.)

After installation we started the daemons and checked that the central manager runs all five required processes (condor_master,
condor_collector, condor_negotiator, condor_startd, condor_schedd) and that the client runs all three required processes
(condor_master, condor_startd, condor_schedd).

Then, using condor_status, we saw that both machines are active.

But when we tried to run some test (example) jobs, there was a PROBLEM:
Jobs submitted on the central manager were executed ONLY on the central manager,
and jobs submitted on the client machine were NOT executed at all!

In the NegotiatorLog (on the central manager) we saw:

4/15 12:34:18 ---------- Started Negotiation Cycle ----------
4/15 12:34:18 Phase 1:  Obtaining ads from collector ...
4/15 12:34:18   Getting all public ads ...
4/15 12:34:18   Sorting 10 ads ...
4/15 12:34:18   Getting startd private ads ...
4/15 12:34:18 Got ads: 10 public and 4 private
4/15 12:34:18 Public ads include 2 submitter, 4 startd
4/15 12:34:18 Phase 2:  Performing accounting ...
4/15 12:34:18 Phase 3:  Sorting submitter ads by priority ...
4/15 12:34:18 Phase 4.1:  Negotiating with schedds ...
4/15 12:34:18   Negotiating with condor@xxxxxxxxxxxxxxxxxxxxx at <147.91.83.228:32770>
4/15 12:34:20 getpeername failed so connect must have failed
4/15 12:34:49 Connect failed for 30 seconds; returning FALSE
4/15 12:34:49     Failed to connect to <147.91.83.228:32770>
4/15 12:34:49   Error: Ignoring schedd for this cycle
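
The "Failed to connect" lines suggest that the negotiator cannot open a TCP connection to the schedd on the client. One way to check this independently of Condor is a small script like the sketch below, run on the central manager. It is not part of Condor; the function names failed_endpoints and can_connect are made up for illustration. It pulls the <ip:port> endpoint out of log lines like the ones above and tries a plain TCP connection to it. The schedd port appears to be assigned dynamically, so the address should be taken from a current log rather than reused from an old one.

```python
import re
import socket

# Matches the "<ip:port>" endpoint in a NegotiatorLog "Failed to connect" line.
FAILED_RE = re.compile(r"Failed to connect to <(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def failed_endpoints(log_text):
    """Return the unique (host, port) pairs from the log, in first-seen order."""
    seen = []
    for host, port in FAILED_RE.findall(log_text):
        endpoint = (host, int(port))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Sample line copied from the NegotiatorLog excerpt above.
    sample = "4/15 12:34:49     Failed to connect to <147.91.83.228:32770>"
    for host, port in failed_endpoints(sample):
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

If the connection fails here as well, the cause is probably outside Condor itself (for example a firewall or packet filter on the client), but that is only a guess based on the log excerpt.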

On the client machine, the condor_shadow, condor_starter and condor_exec processes were not running
at all, and in the StartLog we saw:

4/15 12:28:20 Swap space: 522104
4/15 12:28:20 70405876 kbytes available for "/home/condor/execute"
4/15 12:28:20 Looking up RESERVED_DISK parameter
4/15 12:28:20 Reserving 5120 kbytes for file system
4/15 12:28:20 Disk space: 70400756
4/15 12:28:20 Error on stat(/dev/:0,0xbfffe500), errno = 2(No such file or directory)
4/15 12:28:24 Attempting to send update via UDP to collector <147.91.83.254:9618>
4/15 12:28:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/15 12:28:24 vm1: Sent update to 1 collector(s)
4/15 12:28:25 Attempting to send update via UDP to collector <147.91.83.254:9618>
4/15 12:28:25 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/15 12:28:25 vm2: Sent update to 1 collector(s)

The command condor_q -analyze reports the following for all jobs:

0 ...
4 match, match, but reject the job for unknown reasons
0 ...

What is the problem?

The complete client log files are attached (ClientLog.zip).

Thanks in advance,
   Dusan Radevic

Attachment: ClientLog.zip
Description: Zip archive