Re: [Condor-users] Problems with Windows jobs running indefinitely!
- Date: Mon, 14 Jul 2008 18:05:59 -0400
- From: alan.ocais@xxxxxxxxx
- Subject: Re: [Condor-users] Problems with Windows jobs running indefinitely!
That sounds like the culprit. The jobs I had, once suspended, would
cycle in and out of suspension (without cause) before being evicted.
On 14/07/2008, Greg Quinn <gquinn@xxxxxxxxxxx> wrote:
> Hi all,
>
> One problem that I've recently been made aware of exists if you are
> using suspension policies in your pool. A bug introduced in the 6.9
> series causes suspended jobs on Windows to never be properly resumed,
> even if they are reported to be resumed by Condor.
>
> A fix for this problem will be available in Condor 7.0.4.
>
> Greg
>
> On Mon, 2008-07-14 at 17:37 -0400, Alan Cass wrote:
>> Hi Chris,
>>
>> I've had a very similar experience over the last few weeks with
>> version 7.0.1 but we've also experienced it on 7.0.0 since Christmas.
>> For some reason socket connections go bad and ultimately result in job
>> eviction after days of (aimlessly) occupying the node.
>>
>> We originally believed it was a problem with the network/firewall but
>> having not been able to find it we've reluctantly reverted to 6.8.8
>> and these particular issues have disappeared.
>>
>> This wasn't the only problem with 7.0.1; we also experienced problems
>> with dynamic user creation on a large number of machines, which again
>> do not occur in 6.8.8.
>>
>> Alan
>>
>> 2008/7/14 Ian Chesal <ICHESAL@xxxxxxxxxx>:
>> > I managed to get a Windows Condor environment working fine on
>> > a simple multi-PC isolated network using a common login for all PCs.
>> > I am now attempting to get Condor to work across a corporate
>> > network.....! Well I can see the slots in the pool and can
>> > successfully submit jobs from one PC to a head node and the
>> > jobs get assigned to selected slots (aren't ClassAds
>> > useful!). However, the jobs run indefinitely - last one I
>> > stopped after 4 days (the test model run is only a 15 minute
>> > task!). Key files are meant to be transferred from (model
>> > input files) and to (model results file) the local drive of
>> > the submitting PC, and I have added my Windows AD user
>> > ID/password using condor_store_cred to all machines in
>> > question (just in case!). Is this 'hanging' behaviour
>> > permissions related or possibly something else? I am using
>> > Condor version 7.0.1.
>> > Any help would be gratefully received!
>>
>>
>> Chris, I can't offer you any direct help but here are some tips for
>> debugging the problem. Windows makes running batch programs particularly
>> annoying because of its security model and its insistence that even
>> batch, command line programs should generate graphical warnings and
>> dialog boxes. Keeps us in jobs though! :)
>>
>> Download Process Explorer from Microsoft and install it on one of your
>> clients where your jobs are running. You can use this to take a better
>> look at the job processes:
>>
>> http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
>>
>> Check to see if the job is actually taking up any CPU. My hunch is your
>> jobs aren't running indefinitely but waiting indefinitely for something.
>>
>> They might be producing a pop-up window (like a missing DLL error for
>> example) that's not visible (because Condor by default doesn't run the
>> jobs in a visible desktop) that needs to get clicked.
>>
>> To check for the pop-up window problem, set your machines to 'use a
>> visible desktop' -- this'll tell Condor to run the jobs on the desktop
>> of the logged in user. You'll see cmd windows pop up on the desktop when
>> Condor starts to run the jobs and you'll be able to see if they're
>> producing pop ups that are causing your software to hang indefinitely.
>> You can learn more about USE_VISIBLE_DESKTOP here:
>>
>>
>> http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14350
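>>
>> As a rough sketch of what that setting might look like (assuming it
>> goes in the local Condor configuration file on the Windows execute
>> machine; the exact file and its location depend on your install and
>> are not specified in this thread):
>>
>> ```
>> # Assumed placement: local config file on the execute machine.
>> # Lets jobs create windows on the logged-in user's desktop so that
>> # hidden dialogs become visible. Intended for debugging only.
>> USE_VISIBLE_DESKTOP = True
>> ```
>>
>> Running condor_reconfig on that machine (or restarting Condor there)
>> should then be enough for newly started jobs to pick up the change.
>> Remove the line or set it back to False once you've found the problem.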
>>
>> That should get you started. Good luck!
>>
>> - Ian
>>
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to
>> condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>