Re: [condor-users] Bad Condor jobs killing GUI apps



I wonder whether this kind of behavior has also been seen on Linux pools?

Perhaps XP is to blame rather than Condor?
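
One way to compare pools would be to scan each machine's StarterLog for the
pattern in Andy's excerpt: a "Job exited" line followed a few seconds later by
"Got SIGQUIT". Below is a minimal Python sketch of such a scan. It assumes the
"M/D HH:MM:SS message" timestamp layout shown in the quoted log; the function
names, the default year (the log lines carry none) and the 10-second window
are my own choices for illustration, not anything provided by Condor.

```python
import re
from datetime import datetime

# Assumed layout, taken from the quoted excerpt: "4/6 08:40:04 Job exited, ..."
LINE_RE = re.compile(r"^\s*(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")


def parse_line(line, year=2004):
    """Return (timestamp, message), or None if the line is not in that layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
    # The log has no year field, so the caller has to supply one (assumption).
    return datetime(year, month, day, hh, mm, ss), m.group(6)


def find_fast_shutdowns(lines, year=2004, window_seconds=10):
    """List (exit_time, sigquit_time, exit_message) for each "Job exited"
    line that is followed by "Got SIGQUIT" within window_seconds."""
    hits = []
    last_exit = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # banner fragments, wrapped paths, blank lines
        when, msg = parsed
        if msg.startswith("Job exited"):
            last_exit = (when, msg)
        elif msg.startswith("Got SIGQUIT") and last_exit is not None:
            gap = (when - last_exit[0]).total_seconds()
            if 0 <= gap <= window_seconds:
                hits.append((last_exit[0], when, last_exit[1]))
            last_exit = None
    return hits
```

Running find_fast_shutdowns() over the lines of each StarterLog and comparing
the reported times against the users' crash reports would show whether the two
line up. It would not show whether the SIGQUIT itself is normal.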

	Cheers,

	Alain

On Thu, 2004-04-08 at 09:59, Chris Tottle wrote:
> I have also noticed something very similar, and this has happened to us on
> 6.6.0, 6.6.1 and 6.6.2 running on Windows XP Pro.
> 
> The w/s in question were all installed using the install program, and the user
> is allowed to both run jobs on the w/s and submit from that w/s.
> 
> Since we are still evaluating Condor, the machine was set up to always run
> Condor, and process jobs.
> 
> The following has happened to several users...
> 
> The user has logged onto the w/s, started Word or something, and then gone to a
> meeting (leaving themselves logged in). During the meeting Condor starts
> processing jobs on the w/s.
> 
> When the user comes back and moves the mouse, the w/s has
> "crashed". All of the icons have vanished from the System Tray, and the w/s has
> lost network connectivity.
> 
> This seems to be an intermittent fault, and like Andy, we are also unable to
> reproduce the error.
> 
> My StarterLog file looks very similar to the one that Andy posted.
> 
> Regards,
> 
> 
> Chris Tottle
> ISG Windows Development (Team Leader)
> INFOS
> Cardiff University
> 39 - 41 Park Place
> Cardiff
> CF10 3BB
> 
> 029 20875221
> 
> >>> agoar@xxxxxxxxxx 07/04/2004 15:48:20 >>>
> We have an issue with running Condor that has proven very difficult to
> debug. 
> 
> Our Condor pool consists of about 2000 desktop machines (all running
> WindowsXP). Condor uses the machines when they are idle (mostly at
> night). 
> 
> We have received occasional reports from users that when they come in in
> the morning all (or almost all) of their GUI apps had been shut down.
> Users were reporting that if they disable Condor on their machine (thus
> removing the machine from the pool), then the problem would go away. At
> first we thought the GUI apps shutting down had nothing to do with
> Condor. But it's happened enough times, and we have finally seen the
> behavior for ourselves, to be convinced there is a link. One of our
> admins was standing by a PC in the pool, when all of a sudden all the
> GUI apps shut down. He looked at the Condor log files, and verified that
> a job had just finished running on the machine. The starter log file
> contains the following lines just before the GUI apps started shutting
> down:
> 
> 	4/6 08:39:21 ******************************************************
> 	4/6 08:39:21 ** condor_starter (CONDOR_STARTER) STARTING UP
> 	4/6 08:39:21 ** $CondorVersion: 6.4.7 Jan 27 2003 $
> 	4/6 08:39:21 ** $CondorPlatform: INTEL-WINNT40 $
> 	4/6 08:39:21 ** PID = 3236
> 	4/6 08:39:21 ******************************************************
> 	4/6 08:39:21 DaemonCore: Command Socket at <10.104.41.216:3239>
> 	4/6 08:39:21 Submitting machine is "admin-srv50.micron.com"
> 	4/6 08:39:21 entering init_user_ids()...watch out.
> 	4/6 08:39:22 File transfer completed successfully.
> 	4/6 08:39:23 Starting a VANILLA universe job.
> 	4/6 08:39:23 Output file: C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat out
> 	4/6 08:39:23 Error file: C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat err
> 	4/6 08:39:23 About to exec C:\WINNT\System32\cmd.exe /Q /C condor_exec.bat
> 	4/6 08:39:23 Create_Process succeeded, pid=3320
> 	4/6 08:40:04 Job exited, pid=3320, status=0
> 	4/6 08:40:06 Got SIGQUIT.  Performing fast shutdown.
> 	4/6 08:40:06 ShutdownFast all jobs.
> 	4/6 08:40:06 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
> 
> Can someone explain the "Got SIGQUIT." line? What's a fast shutdown? Is
> this normal? Has anyone seen cases where the Condor starter daemon
> finishing a job affects the interactive apps running on the same
> machine?
> 
> So far, we have not been able to reproduce the issue at will (although
> we are still trying). It does seem to be a specific job that causes this
> every time.
> 
> Thanks.
> 
> Andy Goar
> Middleware Group
> Micron Technology Inc.
> email: agoar@xxxxxxxxxx 
> Phone: (208)368-3254
> Support: (208)368-4850
>     "Three things are certain:  Death, taxes, and lost data.  Guess
> which has occurred?"
> 
> 
> 
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
-- 
------------------------------------------------------------
Dr Alain Empain  <alain.empain@xxxxxxxxx> <alain@xxxxxxxxxx>
      Bioinformatics, Molecular Genetics, 
      Fac. Med. Vet., University of Liège, Belgium
      Bd de Colonster, B43   B-4000 Liège (Sart-Tilman)
WORK: +32 4 366 3821  FAX: +32 4 366 4122
HOME: rue des Martyrs,7  B- 4550 Nandrin       
  +32 85 51 23 41  GSM: +32 497 70 17 64