
Re: [Condor-users] Jobs never start (well never finish turns out to be a better description)



Hermann,

Well, I think you are about right, but I found the evidence in the job output file rather than in the log files. I probably missed it earlier because I was not waiting five minutes before looking.

I am still not sure why this is happening. One more point: the network is a simple Windows workgroup, not a Windows domain, and there is no DNS.

It appears that I have some connection issue between the two machines. Condor appears to be rescheduling the job every five minutes.

______________________________________________________________________________

000 (008.000.000) 04/16 14:27:04 Job submitted from host: <192.168.50.1:54597>
...
001 (008.000.000) 04/16 14:47:07 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:47:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
...
024 (008.000.000) 04/16 14:47:08 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@jhowes-HPT, rescheduling job
...
001 (008.000.000) 04/16 14:52:08 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:52:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
...
024 (008.000.000) 04/16 14:52:08 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@jhowes-HPT, rescheduling job
...
001 (008.000.000) 04/16 14:57:08 Job executing on host: <192.168.50.1:54599>
...
022 (008.000.000) 04/16 14:57:08 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
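
The five-minute cycle can be checked from the event headers themselves. What follows is only a minimal Python sketch, not anything that ships with Condor: the names parse_events and execute_gaps are made up for illustration, the year default is an assumption (the log lines carry no year), and the pattern only covers the header format seen in the excerpt above. It measures the time between successive 001 (Job executing) events.

```python
import re
from datetime import datetime

# Header format as seen in the excerpt: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS text"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")

SAMPLE = """
000 (008.000.000) 04/16 14:27:04 Job submitted from host: <192.168.50.1:54597>
...
001 (008.000.000) 04/16 14:47:07 Job executing on host: <192.168.50.1:54599>
...
024 (008.000.000) 04/16 14:47:08 Job reconnection failed
    Job not found at execution machine
...
001 (008.000.000) 04/16 14:52:08 Job executing on host: <192.168.50.1:54599>
...
001 (008.000.000) 04/16 14:57:08 Job executing on host: <192.168.50.1:54599>
"""

def parse_events(text, year=2012):
    """Return (event_code, job_id, timestamp, message) for each event header line.

    The year is an assumption supplied by the caller; indented detail lines
    and "..." separators do not match the pattern and are skipped.
    """
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            when = datetime.strptime(f"{year}/{m.group(3)}", "%Y/%m/%d %H:%M:%S")
            events.append((int(m.group(1)), m.group(2), when, m.group(4)))
    return events

def execute_gaps(events):
    """Seconds between consecutive 001 (Job executing) events."""
    starts = [when for code, _, when, _ in events if code == 1]
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]

if __name__ == "__main__":
    print(execute_gaps(parse_events(SAMPLE)))  # [301.0, 300.0]
```

Gaps that sit right at 300 seconds, each followed by a 024 event, match the reschedule loop described above.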

Best,
John L. (Jack) Howes


On 16.04.2012 09:21, Hermann Fuchs wrote:
Hi

You should have a look at the NegotiatorLog on the master server as
well as the StartLog on the execute node.
I had a similar case where the master matched the job, while the execute
node rejected it for some reason.
Then the master matched it again, the execute node rejected it again,
and so on...

The message
"Request has not yet been considered by the matchmaker."
means in this case that after the master has matched a job, it forgets
all about it. If the job comes back again (e.g. because it was rejected
by the execute node a split second later), the master thinks it is a
new job.

Cheers,
Hermann

On Mon, 2012-04-16 at 08:16 -0400, jhowes@xxxxxxxxxxxxxxxx wrote:
I am looking for some help in trying to debug something that seems like
it should work without trouble [but not for me].

I set up a personal Condor on my laptop under Win7 with no trouble and
also added the extra config settings to enable RunAsOwner. I tested it
with a simple Perl script job and it works as expected.

The next step was to add another node to create a real pool. So I
installed the same version (7.6.6) on the desktop. I just used the MSI
installer and pointed it at my laptop as the pool's central manager. I
also added the credd changes to the config to allow RunAsOwner. Both
machines are quad-core, running 64-bit Win7.

But jobs just sit in the queue -- there is no difference in behavior
whether or not the job is submitted as RunAsOwner.

condor_status looks right -- two machines, four slots each. The daemons
that are running look right, and nothing jumps out at me in the logs.

Seems like this should be dead simple but I am stuck. Any insight in
where to look would be appreciated.


_____________________________________________________________________________________
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\jhowes>condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.070   973  0+00:15:04
slot2@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:14:46
slot3@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:06
slot4@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:07
slot1@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.090  2026  0+00:35:31
slot2@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:34:40
slot3@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:33
slot4@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:34

                Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/WINNT61      8     0       0         8       0          0        0

         Total      8     0       0         8       0          0        0

C:\Users\jhowes>condor_q


-- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   jhowes          4/16 07:49   0+00:00:00 I  0   0.0  TimeStamp.pl

1 jobs; 1 idle, 0 running, 0 held

C:\Users\jhowes>condor_q -analyze


-- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
---
007.000:  Request has not yet been considered by the matchmaker.





_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

--
-------------
DI Hermann Fuchs
Christian Doppler Laboratory for Medical Radiation Research for
Radiation Oncology
Department of Radiation Oncology
Medical University Vienna
Währinger Gürtel 18-20
A-1090 Wien

Tel.  + 43 / 1 / 40 400 7271
Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
