
Re: [Condor-users] Jobs never start (well never finish turns out to be a better description)



Is Condor on the firewall exception list?
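
One rough way to check is to test whether a plain TCP connection gets through from one machine to the other. The sketch below is only an illustration: the helper name can_connect is made up here, and the address and port in the comment are copied from the quoted job log. Condor chooses ports dynamically, so use whatever your current log shows. A successful connect does not prove that every Condor daemon is allowed through, but a refused or timed-out connect is a strong hint that a firewall is in the way.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (address taken from the quoted job log; adjust as needed):
# can_connect("192.168.50.1", 54599)
```

Run it from the submit machine against the execute machine's address, and then the other way around, since the traffic has to pass in both directions.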

On Mon, Apr 16, 2012 at 3:07 PM,  <jhowes@xxxxxxxxxxxxxxxx> wrote:
> Hermann,
>
> Well, I think you are about right, but I found the evidence in the job output
> file rather than the log files.  I probably missed this because I was not
> waiting five minutes before looking.
>
> I am still not sure why this is happening.  One more point: the network is a
> simple Windows workgroup, not a Windows domain, and there is no DNS.
>
> It appears that I have some connection issue between the two machines.
> Condor appears to be rescheduling the job every five minutes.
>
> ______________________________________________________________________________
>
> 000 (008.000.000) 04/16 14:27:04 Job submitted from host: <192.168.50.1:54597>
> ...
> 001 (008.000.000) 04/16 14:47:07 Job executing on host: <192.168.50.1:54599>
> ...
> 022 (008.000.000) 04/16 14:47:08 Job disconnected, attempting to reconnect
>    Socket between submit and execute hosts closed unexpectedly
>    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
> ...
> 024 (008.000.000) 04/16 14:47:08 Job reconnection failed
>    Job not found at execution machine
>    Can not reconnect to slot3@jhowes-HPT, rescheduling job
> ...
> 001 (008.000.000) 04/16 14:52:08 Job executing on host: <192.168.50.1:54599>
> ...
> 022 (008.000.000) 04/16 14:52:08 Job disconnected, attempting to reconnect
>    Socket between submit and execute hosts closed unexpectedly
>    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
> ...
> 024 (008.000.000) 04/16 14:52:08 Job reconnection failed
>    Job not found at execution machine
>    Can not reconnect to slot3@jhowes-HPT, rescheduling job
> ...
> 001 (008.000.000) 04/16 14:57:08 Job executing on host: <192.168.50.1:54599>
> ...
> 022 (008.000.000) 04/16 14:57:08 Job disconnected, attempting to reconnect
>    Socket between submit and execute hosts closed unexpectedly
>    Trying to reconnect to slot3@jhowes-HPT <192.168.50.1:54599>
>
> Best,
> John L. (Jack) Howes
>
>
> On 16.04.2012 09:21, Hermann Fuchs wrote:
>>
>> Hi
>>
>> You should have a look at the NegotiatorLog on the master server as
>> well as the StartLog on the execute node.
>> I had a similar case where the Master matched the job, while the execute
>> node rejected it for some reason.
>> Then the master matched it again, the execute node rejected it and so
>> on...
>>
>> The message
>> "Request has not yet been considered by the matchmaker."
>> means in this case that once the Master has matched a job, it forgets all
>> about it. If the job comes back again (e.g. because it was rejected by
>> the execute node a split second later), the Master treats it as a new
>> job.
>>
>> Cheers,
>> Hermann
>>
>> On Mon, 2012-04-16 at 08:16 -0400, jhowes@xxxxxxxxxxxxxxxx wrote:
>>>
>>> I am looking for some help in trying to debug something that seems like
>>> it should work without trouble [but not for me].
>>>
>>> I set up a personal Condor on my laptop under Win7 with no trouble and
>>> also added the extra configuration needed to enable RunAsOwner.  I tested
>>> it with a simple Perl script job, and it works as expected.
>>>
>>> The next step was to add another node to create a real pool, so I installed
>>> the same version (7.6.6) on the desktop.  I just used the MSI installer and
>>> pointed it at my laptop as the pool's central manager.  I also added the
>>> credd changes to the config to allow RunAsOwner.  Both machines are
>>> quad-cores running 64-bit Win7.
>>>
>>> But jobs just sit in the queue, and the behavior is the same whether or
>>> not the job is submitted with RunAsOwner.
>>>
>>> The condor_status output looks right: two machines, four slots each.  The
>>> daemons that are running look right, and nothing jumps out at me in
>>> the logs.
>>>
>>> It seems like this should be dead simple, but I am stuck.  Any insight
>>> into where to look would be appreciated.
>>>
>>>
>>>
>>> _____________________________________________________________________________________
>>> Microsoft Windows [Version 6.1.7601]
>>> Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
>>>
>>> C:\Users\jhowes>condor_status
>>>
>>> Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
>>>
>>> slot1@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.070   973  0+00:15:04
>>> slot2@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:14:46
>>> slot3@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:06
>>> slot4@HPTlaptop    WINNT61    X86_64 Unclaimed Idle     0.000   973  0+00:15:07
>>> slot1@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.090  2026  0+00:35:31
>>> slot2@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:34:40
>>> slot3@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:33
>>> slot4@jhowes-HPT   WINNT61    X86_64 Unclaimed Idle     0.000  2026  0+00:35:34
>>>                      Total Owner Claimed Unclaimed Matched Preempting Backfill
>>>
>>>       X86_64/WINNT61     8     0       0         8       0          0        0
>>>
>>>                Total     8     0       0         8       0          0        0
>>>
>>> C:\Users\jhowes>condor_q
>>>
>>>
>>> -- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
>>>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>>>    7.0   jhowes          4/16 07:49   0+00:00:00 I  0   0.0  TimeStamp.pl
>>>
>>> 1 jobs; 1 idle, 0 running, 0 held
>>>
>>> C:\Users\jhowes>condor_q -analyze
>>>
>>>
>>> -- Submitter: jhowes-HPT : <192.168.50.1:52258> : jhowes-HPT
>>> ---
>>> 007.000:  Request has not yet been considered by the matchmaker.
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>
>> --
>> -------------
>> DI Hermann Fuchs
>> Christian Doppler Laboratory for Medical Radiation Research for
>> Radiation Oncology
>> Department of Radiation Oncology
>> Medical University Vienna
>> Währinger Gürtel 18-20
>> A-1090 Wien
>>
>> Tel.  + 43 / 1 / 40 400 7271
>> Mail. hermann.fuchs@xxxxxxxxxxxxxxxx
>>
>



-- 
Condor Project Windows Developer