
[Condor-users] RE: Condor Question

> Mike, you seem to know quite a bit about Condor, and since the mailing
> list seems rather slow to respond, I am going to hope that you have a
> minute to give me a hand.
> My problem is like this...
> I have Condor installed on 2 Fedora Core 3 dual-processor machines -
> machine1 and machine2. Machine1 is set up as a central manager and
> submission node, and machine2 is an execute/submit node. Running
> condor_status on machine1 gives me:
> -------------------------
> $ condor_status
> Name          OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
> vm1@machine2  LINUX  INTEL  Owner      Idle      0.000     1  0+00:05:05
> vm2@machine2  LINUX  INTEL  Unclaimed  Idle      0.000     1  0+02:55:18
> vm3@machine2  LINUX  INTEL  Unclaimed  Idle      0.000     1  0+02:55:15
> vm4@machine2  LINUX  INTEL  Unclaimed  Idle      0.000     1  0+02:55:11
>                      Machines Owner Claimed Unclaimed Matched
>          INTEL/LINUX        4     1       0         3       0
>                Total        4     1       0         3       0
> ------------------------
> This implies to me that there is at least some communication going on
> between the two boxes. When I run condor_findhost right after starting
> up I get:
> ------------------------
> $ condor_findhost
> Warning:  Found no submitters
> ERROR: 1 machines not available
> -------------------------
> If I submit a command from machine1 the 'Warning:  Found no
> submitters' goes away (this makes sense, as the submit command
> probably registers the schedd with the negotiator, but why didn't it
> happen on startup?).

Very close - when you submit a job the schedd sends a "Submitter" ad to
the collector.  Since you hadn't submitted anything yet, you got the
"Warning:  Found no submitters" message.

> Also, the command waits on the queue indefinitely.

Which command hangs?  condor_submit?  condor_q?

> Now the weird stuff starts to happen. When I wait a few minutes and
> run condor_findhost again, I get:
> -------------------------
> $ condor_findhost
> vm2@xxxxxxxxxxxxxxxx
> -------------------------
> Perhaps I waited the 5 minutes for the negotiator cycle to begin
> again? I'm not sure. 

Condor requires patience.  :-)  Things like this are completely normal -
it takes time for information to percolate through the system.  You're
right, the delay can probably be attributed to the negotiator
periodically (once every negotiation cycle) getting the submitter
information from the collector.
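
As a rough illustration of where that delay comes from: the length of the
cycle is controlled by the NEGOTIATOR_INTERVAL setting, which has
historically defaulted to 300 seconds (the 5 minutes you mention).  A
sketch for a small test pool, assuming the central manager reads a local
config file; the value 60 is only an example, not a recommendation:

```
# Local config on the central manager (machine1); file location is an assumption
# Seconds between negotiation cycles (historical default: 300)
NEGOTIATOR_INTERVAL = 60
```

Run condor_reconfig on machine1 (or restart the daemons) afterwards so
the negotiator picks up the change.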

I suspect that the above "ERROR: 1 machines not available" may just be a
bug - it was probably tested in UW's condor pool, and they *always* have
plenty of submitters.  

> But either way, the jobs that are on the queue
> remain there in an idle state. A shadow is never spawned on the master
> end (I have no shadow log at all). When I run condor_q -analyze on
> machine1 I get:
> -------------------------
> 005.000:  Run analysis summary.  Of 4 machines,
>       0 are rejected by your job's requirements
>       1 reject your job because of their own requirements
>       0 match, but are serving users with a better priority in the
>       3 match, but reject the job for unknown reasons
>       0 match, but will not currently preempt their existing job
>       0 are available to run your job
> -------------------------

Ah, the dreaded unknown reasons.

> The user that I am running Condor under is a networked user, and
> Condor is installed to an NFS-mounted space. I installed Condor under
> a similar setup on 3 Solaris machines last week, and it is working
> fine. Do you have any idea of what might be going on here?

No, but I can give you things to try:

- Increase your logging level if you haven't done so already.
- Restart your daemons - on machine1, 'condor_restart -all'.  It rarely
helps, but it feels good and sometimes I'm pleasantly surprised.  Don't
forget to wait a bit for all the daemons to report in.
- Put a job in the queue (if it isn't there already).  Watch the schedd
log in one window and the negotiator log in another window.  Do a
condor_reschedule and see what happens.  

This ought to give you some clues, or at least more ammunition for the
condor-users list. :-)
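
For the logging step, something along these lines is a reasonable
starting point.  It is a sketch under assumptions, not the only choice:
the file location is a guess at your setup, and you only need the lines
for the daemons you are actually debugging.

```
# Local config on machine1 (central manager and submit node); location is an assumption
SCHEDD_DEBUG     = D_FULLDEBUG
NEGOTIATOR_DEBUG = D_FULLDEBUG

# Local config on machine2 (execute node); location is an assumption
STARTD_DEBUG     = D_FULLDEBUG
```

After a condor_reconfig, the extra detail should show up in SchedLog,
NegotiatorLog and StartLog, in the directory that 'condor_config_val LOG'
prints on each machine.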

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134