
Re: [Condor-users] Problems with Windows jobs running indefinitely!



Hi all,

One problem that I've recently been made aware of affects pools that
use suspension policies. A bug introduced in the 6.9 series causes
suspended jobs on Windows to never be properly resumed, even when
Condor reports them as resumed.

A fix for this problem will be available in Condor 7.0.4.

Greg

On Mon, 2008-07-14 at 17:37 -0400, Alan Cass wrote:
> Hi Chris,
> 
> I've had a very similar experience over the last few weeks with
> version 7.0.1 but we've also experienced it on 7.0.0 since Christmas.
> For some reason socket connections go bad and ultimately result in job
> eviction after days of (aimlessly) occupying the node.
> 
> We originally believed it was a problem with the network/firewall but
> having not been able to find it we've reluctantly reverted to 6.8.8
> and these particular issues have disappeared.
> 
> This wasn't the only problem with 7.0.1; we also experienced problems
> with dynamic user creation on a large number of machines, which again
> do not occur in 6.8.8.
> 
> Alan
> 
> 2008/7/14 Ian Chesal <ICHESAL@xxxxxxxxxx>:
>         > I managed to get a Windows Condor environment working fine on
>         > a simple multi-PC isolated network using a common login for
>         > all PCs. I am now attempting to get Condor to work across a
>         > corporate network.....!  Well, I can see the slots in the pool
>         > and can successfully submit jobs from one PC to a head node,
>         > and the jobs get assigned to selected slots (aren't ClassAds
>         > useful!).  However, the jobs run indefinitely - last one I
>         > stopped after 4 days (the test model run is only a 15 minute
>         > task!).  Key files are meant to be transferred from (model
>         > input files) and to (model results file) the local drive of
>         > the submitting PC, and I have added my Windows AD user
>         > ID/password using condor_store_cred to all machines in
>         > question (just in case!).  Is this 'hanging' behaviour
>         > permissions related or possibly something else?  I am using
>         > Condor version 7.0.1.
>         > Any help would be gratefully received!
>         
>         
>         Chris, I can't offer you any direct help but here are some tips
>         for debugging the problem. Windows makes running batch programs
>         particularly annoying because of its security model and its
>         insistence that even batch, command line programs should
>         generate graphical warnings and dialog boxes. Keeps us in jobs
>         though! :)
>         
>         Download Process Explorer from Microsoft and install it on one
>         of your clients where your jobs are running. You can use this to
>         take a better look at the job processes:
>         
>         http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
>         
>         Check to see if the job is actually taking up any CPU. My hunch
>         is your jobs aren't running indefinitely but waiting
>         indefinitely for something.
>         
>         They might be producing a pop-up window (like a missing DLL
>         error, for example) that's not visible (because Condor by
>         default doesn't run the jobs in a visible desktop) and that
>         needs to get clicked.
>         
>         To check for the pop-up windows problem, set your machines to
>         'use a visible desktop' -- this'll tell Condor to run the jobs
>         on the desktop of the logged in user. You'll see cmd windows pop
>         up on the desktop when Condor starts to run the jobs and you'll
>         be able to see if they're producing pop-ups that are causing
>         your software to hang indefinitely. You can learn more about
>         USE_VISIBLE_DESKTOP here:
>         
>         http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14350
>         
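A minimal sketch of the visible-desktop setting described above. It
assumes the line goes in the local Condor configuration file on each
Windows execute machine (the file name and location depend on the
install) and that the change takes effect after a condor_reconfig or a
restart of the Condor service; neither detail is stated in the thread.

```
# Local Condor configuration on a Windows execute machine
# (file name and location are an assumption; they vary by install).
# Run jobs on the desktop of the logged-in user so that any dialog
# boxes the job raises become visible instead of blocking unseen.
USE_VISIBLE_DESKTOP = True
```

This is meant as a debugging aid on one or two test machines; once the
cause of the hang is found, the setting can be removed again so jobs
go back to running on a non-visible desktop.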
>         That should get you started. Good luck!
>         
>         - Ian
>         
>         
>         _______________________________________________
>         Condor-users mailing list
>         To unsubscribe, send a message to
>         condor-users-request@xxxxxxxxxxx with a
>         subject: Unsubscribe
>         You can also unsubscribe by visiting
>         https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>         
>         The archives can be found at:
>         https://lists.cs.wisc.edu/archive/condor-users/
>         