Re: [Condor-users] Problems with Windows jobs running indefinitely!
- Date: Mon, 14 Jul 2008 16:53:02 -0500
- From: Greg Quinn <gquinn@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Problems with Windows jobs running indefinitely!
Hi all,
One problem that I've recently been made aware of exists if you are
using suspension policies in your pool. A bug introduced in the 6.9
series causes suspended jobs on Windows to never be properly resumed,
even if they are reported to be resumed by Condor.
A fix for this problem will be available in Condor 7.0.4.
Greg
On Mon, 2008-07-14 at 17:37 -0400, Alan Cass wrote:
> Hi Chris,
>
> I've had a very similar experience over the last few weeks with
> version 7.0.1 but we've also experienced it on 7.0.0 since Christmas.
> For some reason socket connections go bad and ultimately result in job
> eviction after days of (aimlessly) occupying the node.
>
> We originally believed it was a problem with the network/firewall but
> having not been able to find it we've reluctantly reverted to 6.8.8
> and these particular issues have disappeared.
>
> This wasn't the only problem with 7.0.1; we also experienced problems
> with dynamic user creation on a large number of machines, which again
> do not occur in 6.8.8.
>
> Alan
>
> 2008/7/14 Ian Chesal <ICHESAL@xxxxxxxxxx>:
> > I managed to get a windows Condor environment working fine on
> > a simple multi pc isolated network using a common login for all pcs.
> > I am now attempting to get Condor to work across a corporate
> > network.....! Well I can see the slots in the pool and can
> > successfully submit jobs from one PC to a head node and the
> > jobs get assigned to selected slots (aren't ClassADs
> > useful!). However, the jobs run indefinitely - last one I
> > stopped after 4 days (the test model run is only a 15 minute
> > task!). Key files are meant to be transferred from (model
> > input files) and to (model results file) the local drive of
> > the submitting PC, and I have added my windows AD user
> > ID/password using condor_store_cred to all machines in
> > question (just in case!). Is this 'hanging' behaviour
> > permissions related or possibly something else? I am using
> > Condor version 7.0.1.
> > Any help would be gratefully received!
>
>
> Chris, I can't offer you any direct help but here are some tips for
> debugging the problem. Windows makes running batch programs particularly
> annoying because of its security model and its insistence that even
> batch, command line programs should generate graphical warnings and
> dialog boxes. Keeps us in jobs though! :)
>
> Download Process Explorer from Microsoft and install it on one of your
> clients where your jobs are running. You can use this to take a better
> look at the job processes:
>
> http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
>
> Check to see if the job is actually taking up any CPU. My hunch is your
> jobs aren't running indefinitely but waiting indefinitely for something.
>
> They might be producing a pop-up window (like a missing DLL error, for
> example) that's not visible (because Condor by default doesn't run the
> jobs in a visible desktop) and that needs to get clicked.
>
> To check for the pop-up window problem, set your machines to 'use a
> visible desktop' -- this'll tell Condor to run the jobs on the desktop
> of the logged-in user. You'll see cmd windows pop up on the desktop when
> Condor starts to run the jobs and you'll be able to see if they're
> producing pop-ups that are causing your software to hang indefinitely.
> You can learn more about USE_VISIBLE_DESKTOP here:
>
> http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14350
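>
> As a minimal sketch of the setting described above (an illustration
> only, not a tested recipe): the line below would go in the local Condor
> configuration file of each Windows execute machine. Where that file
> lives depends on your installation, and it is an assumption here that
> the daemons pick the change up after a condor_reconfig or a restart.
>
> ```
> # Assumed placement: local config file on each Windows execute node.
> # Lets jobs create windows on the desktop of the logged-in user, so
> # hidden dialogs (e.g. a missing-DLL error) become visible.
> USE_VISIBLE_DESKTOP = True
> ```
>
> This is best treated as a temporary debugging aid: once the cause of
> the hang is found, the setting can be removed again so that job windows
> no longer appear on users' desktops.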
>
> That should get you started. Good luck!
>
> - Ian
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to
> condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/