Re: [Condor-users] Problems with Windows jobs running indefinitely!



That sounds like the culprit. The jobs I had, once suspended, would
cycle in and out of suspension (without cause) before being evicted.

On 14/07/2008, Greg Quinn <gquinn@xxxxxxxxxxx> wrote:
> Hi all,
>
> One problem that I've recently been made aware of exists if you are
> using suspension policies in your pool. A bug introduced in the 6.9
> series causes suspended jobs on Windows to never be properly resumed,
> even if they are reported to be resumed by Condor.
>
> A fix for this problem will be available in Condor 7.0.4.
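As a possible stopgap until 7.0.4 (an assumption, not something suggested in this thread), suspension could be switched off on the affected Windows execute nodes so that jobs are vacated rather than suspended. A minimal sketch for the local configuration file, assuming the stock startd policy macros are otherwise unchanged:

```
# Windows execute nodes only (hypothetical workaround, untested here):
# never suspend jobs, so the resume bug cannot be triggered. With the
# default policy, the job is then preempted when it would otherwise
# have been suspended.
WANT_SUSPEND = False
```

The trade-off is that work may be lost on eviction instead of being paused, so this only makes sense if your jobs tolerate being restarted.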
>
> Greg
>
> On Mon, 2008-07-14 at 17:37 -0400, Alan Cass wrote:
>> Hi Chris,
>>
>> I've had a very similar experience over the last few weeks with
>> version 7.0.1, but we've also experienced it on 7.0.0 since Christmas.
>> For some reason socket connections go bad and ultimately result in job
>> eviction after days of (aimlessly) occupying the node.
>>
>> We originally believed it was a problem with the network/firewall but
>> having not been able to find it we've reluctantly reverted to 6.8.8
>> and these particular issues have disappeared.
>>
>> This wasn't the only problem with 7.0.1; we also experienced problems
>> with dynamic user creation on a large number of machines, which again
>> do not occur in 6.8.8.
>>
>> Alan
>>
>> 2008/7/14 Ian Chesal <ICHESAL@xxxxxxxxxx>:
>>         > I managed to get a Windows Condor environment working fine on
>>         > a simple multi-PC isolated network using a common login for
>>         > all PCs. I am now attempting to get Condor to work across a
>>         > corporate network...! Well, I can see the slots in the pool and
>>         > can successfully submit jobs from one PC to a head node, and the
>>         > jobs get assigned to selected slots (aren't ClassAds useful!).
>>         > However, the jobs run indefinitely: the last one I stopped after
>>         > 4 days (the test model run is only a 15-minute task!). Key files
>>         > are meant to be transferred from (model input files) and to
>>         > (model results file) the local drive of the submitting PC, and I
>>         > have added my Windows AD user ID/password using condor_store_cred
>>         > to all machines in question (just in case!). Is this 'hanging'
>>         > behaviour permissions-related or possibly something else? I am
>>         > using Condor version 7.0.1.
>>         > Any help would be gratefully received!
>>
>>
>>         Chris, I can't offer you any direct help but here are some tips for
>>         debugging the problem. Windows makes running batch programs
>>         particularly annoying because of its security model and its
>>         insistence that even batch, command-line programs should generate
>>         graphical warnings and dialog boxes. Keeps us in jobs though! :)
>>
>>         Download Process Explorer from Microsoft and install it on one of
>>         your clients where your jobs are running. You can use this to take
>>         a better look at the job processes:
>>
>>         http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
>>
>>         Check to see if the job is actually taking up any CPU. My hunch is
>>         your jobs aren't running indefinitely but waiting indefinitely for
>>         something.
>>
>>         They might be producing a pop-up window (like a missing DLL error,
>>         for example) that's not visible (because Condor by default doesn't
>>         run the jobs in a visible desktop) and that needs to get clicked.
>>
>>         To check for the pop-up window problem, set your machines to 'use a
>>         visible desktop' -- this'll tell Condor to run the jobs on the
>>         desktop of the logged-in user. You'll see cmd windows pop up on the
>>         desktop when Condor starts to run the jobs and you'll be able to
>>         see if they're producing pop-ups that are causing your software to
>>         hang indefinitely. You can learn more about USE_VISIBLE_DESKTOP
>>         here:
>>
>>         http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14350
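For reference, a minimal sketch of what that setting might look like in the local configuration file of a Windows execute machine. The file name and the reconfig step are assumptions about a typical install; check the manual page above for your version.

```
# condor_config.local on the Windows execute node (assumed location)
# Let jobs open their windows on the logged-in user's desktop so that
# hidden pop-ups (e.g. a missing-DLL dialog) become visible.
USE_VISIBLE_DESKTOP = True
```

After editing, running condor_reconfig on that machine (or restarting the Condor service) should let newly started jobs pick up the change. This is probably best treated as a debugging aid on one or two nodes rather than a pool-wide default.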
>>
>>         That should get you started. Good luck!
>>
>>         - Ian
>>
>>
>>
>>         _______________________________________________
>>         Condor-users mailing list
>>         To unsubscribe, send a message to
>>         condor-users-request@xxxxxxxxxxx with a
>>         subject: Unsubscribe
>>         You can also unsubscribe by visiting
>>         https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>>         The archives can be found at:
>>         https://lists.cs.wisc.edu/archive/condor-users/
>>