
[condor-users] Bad Condor jobs killing GUI apps




We have an issue with running Condor that has proven very difficult to debug.

Our Condor pool consists of about 2000 desktop machines, all running Windows XP. Condor uses the machines when they are idle (mostly at night).

We have received occasional reports from users that when they arrive in the morning, all (or almost all) of their GUI apps have been shut down. Users reported that if they disable Condor on their machine (thus removing the machine from the pool), the problem goes away. At first we thought the GUI apps shutting down had nothing to do with Condor. But it has happened often enough, and we have now seen the behavior for ourselves, that we are convinced there is a link. One of our admins was standing by a PC in the pool when, all of a sudden, all the GUI apps shut down. He looked at the Condor log files and verified that a job had just finished running on the machine. The starter log file contains the following lines from just before the GUI apps started shutting down:

        4/6 08:39:21 ******************************************************
        4/6 08:39:21 ** condor_starter (CONDOR_STARTER) STARTING UP
        4/6 08:39:21 ** $CondorVersion: 6.4.7 Jan 27 2003 $
        4/6 08:39:21 ** $CondorPlatform: INTEL-WINNT40 $
        4/6 08:39:21 ** PID = 3236
        4/6 08:39:21 ******************************************************
        4/6 08:39:21 DaemonCore: Command Socket at <10.104.41.216:3239>
        4/6 08:39:21 Submitting machine is "admin-srv50.micron.com"
        4/6 08:39:21 entering init_user_ids()...watch out.
        4/6 08:39:22 File transfer completed successfully.
        4/6 08:39:23 Starting a VANILLA universe job.
        4/6 08:39:23 Output file: C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat.out
        4/6 08:39:23 Error file: C:\Progra~1\Condor/execute\dir_3236\admin-srv50_tppprod_21097_EngExt.bat.err
        4/6 08:39:23 About to exec C:\WINNT\System32\cmd.exe /Q /C condor_exec.bat
        4/6 08:39:23 Create_Process succeeded, pid=3320
        4/6 08:40:04 Job exited, pid=3320, status=0
        4/6 08:40:06 Got SIGQUIT.  Performing fast shutdown.
        4/6 08:40:06 ShutdownFast all jobs.
        4/6 08:40:06 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

Can someone explain the "Got SIGQUIT.  Performing fast shutdown." line? What is a fast shutdown? Is this normal? Has anyone seen cases where the Condor starter daemon finishing a job affects the interactive apps running on the same machine?

So far we have not been able to reproduce the issue at will, although we are still trying. It does appear that one specific job is responsible each time it happens.
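To help narrow down which job is involved, the starter logs from affected machines could be scanned for runs that end in the SIGQUIT line. Below is a rough Python sketch of that idea, not a tested tool. It assumes every starter log uses the same "M/D HH:MM:SS message" layout as the excerpt above; the function name find_fast_shutdowns and the record fields are made up for illustration. It only reports which runs logged a SIGQUIT. It does not tell us whether that line is normal.

```python
import re

# Matches the "M/D HH:MM:SS message" layout seen in the excerpt above.
# Assumption: all starter log lines of interest follow this layout.
LINE_RE = re.compile(r"^\s*(\d+/\d+ \d+:\d+:\d+) (.*)$")


def find_fast_shutdowns(lines):
    """Return one record per starter run that logged "Got SIGQUIT"."""
    hits = []
    current = None
    for raw in lines:
        m = LINE_RE.match(raw)
        if not m:
            continue
        stamp, msg = m.groups()
        if "STARTING UP" in msg:
            # A new starter run begins; start collecting its details.
            current = {"started": stamp, "submitter": None,
                       "output": None, "exit": None, "sigquit": None}
        elif current is None:
            continue
        elif msg.startswith("Submitting machine is"):
            parts = msg.split('"')
            current["submitter"] = parts[1] if len(parts) > 2 else None
        elif msg.startswith("Output file:"):
            current["output"] = msg.split(":", 1)[1].strip()
        elif msg.startswith("Job exited"):
            current["exit"] = msg
        elif msg.startswith("Got SIGQUIT"):
            current["sigquit"] = stamp
            hits.append(current)
            current = None
    return hits
```

Running this over the starter log from each machine where users reported closed apps (for example, find_fast_shutdowns(open(path)) with whatever path the log lives at on that machine) and comparing the "output" and "submitter" fields across machines should show whether the same job name keeps turning up.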

Thanks.

Andy Goar
Middleware Group
Micron Technology Inc.

email: agoar@xxxxxxxxxx
Phone: (208)368-3254
Support: (208)368-4850
    "Three things are certain:  Death, taxes, and lost data.  Guess which has occurred?"