RE: [Condor-users] schedd problems?



Hi,

On Thu, 24 Feb 2005, Ian Chesal wrote:
> > Hi,
> > I've got a strange problem (aren't they all?), and could use 
> > guidance on how to figure out what's wrong.  I have a submit 
> > machine that can no longer tell what jobs are in its own 
> > queue.  I upgraded condor to 6.7.3 (from 6.6.7) on Feb 10; 
> > yesterday (Feb 23), it was noticed that condor_q would return:
> > 
> > -- Failed to fetch ads from: <129.89.201.232:38456> : 
> > hydra.phys.uwm.edu
> > 
> > SchedLog doesn't seem to show anything interesting...
> > 
> > How can I debug what's failing?
> 
> Hi Paul,
> 
> We've seen similar messages when a single schedd instance has LOTS of
> ports open in the 6.7.3 builds. Can you check the number of open network
> connections on the machine?

Nothing out of the ordinary...
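
For anyone wanting to repeat that check, here is a minimal sketch of one
way to tally connections on a Linux box. It assumes the standard
/proc/net/tcp table layout (connection state in the fourth column, as a
hex code); the helper name count_tcp_states is made up for illustration
and is not part of Condor.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as they appear (in hex) in
# /proc/net/tcp; anything not listed here is lumped under "OTHER".
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first row is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        state = TCP_STATES.get(fields[3].upper(), "OTHER")
        counts[state] += 1
    return counts
```

Feeding it the contents of /proc/net/tcp (for example,
count_tcp_states(open("/proc/net/tcp").read())) gives a per-state count
that can be compared against a machine that is behaving normally.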

> Is the schedd currently preempting a lot of
> startd machines in your cluster?

No, schedd seems to be permanently out to lunch at the moment...

It appears to have been restarted a few days ago, after which it 
immediately marked a bunch of jobs (I'm assuming "a bunch" = all) as 
IDLE, and since then it has just sat there.  I've tried restarting 
condor on this submit machine and got this:

2/24 10:07:07 ******************************************************
2/24 10:07:07 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
2/24 10:07:07 ** /opt/condor/sbin/condor_schedd
2/24 10:07:07 ** $CondorVersion: 6.7.3 Dec 28 2004 $
2/24 10:07:07 ** $CondorPlatform: I386-LINUX_RH9 $
2/24 10:07:07 ** PID = 16027
2/24 10:07:07 ******************************************************
2/24 10:07:07 Using config file: /etc/condor/condor_config
2/24 10:07:07 Using local config files: /opt/condor/home/condor_config.local
2/24 10:07:07 DaemonCore: Command Socket at <129.89.201.232:38456>
2/24 10:07:07 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 Will use UDP to update collector condor.medusa.phys.uwm.edu <129.89.201.238:9618>
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 Using name: hydra.phys.uwm.edu
2/24 10:07:07 No Accountant host specified in config file
2/24 10:07:07 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
2/24 10:07:07 JOB_START_COUNT is undefined, using default value of 1
2/24 10:07:07 MAX_JOBS_SUBMITTED is undefined, using default value of 2147483647
2/24 10:07:07 STARTD_CONTACT_TIMEOUT is undefined, using default value of 45
2/24 10:07:07 Queue Management Super Users:
2/24 10:07:07   root
2/24 10:07:07   condor
2/24 10:07:13 About to truncate log /opt/condor/home/spool/job_queue.log
2/24 10:07:14 Marked job 104860.0 as IDLE
2/24 10:07:14 Marked job 104806.0 as IDLE
2/24 10:07:14 Marked job 104761.0 as IDLE
2/24 10:07:14 Marked job 105652.0 as IDLE
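
To check the "a bunch = all" guess rather than assume it, the IDLE
markings can be counted straight from the log text. This is only a
sketch: it assumes every such entry has the same "Marked job
<cluster>.<proc> as IDLE" wording as the four lines above, and the
function name jobs_marked_idle is hypothetical.

```python
import re

# Matches the wording seen in the log excerpt above,
# e.g. "Marked job 104860.0 as IDLE".
MARKED_IDLE = re.compile(r"Marked job (\d+)\.(\d+) as IDLE")


def jobs_marked_idle(log_text):
    """Return (cluster, proc) pairs for each job the log reports as IDLE."""
    return [
        (int(match.group(1)), int(match.group(2)))
        for match in MARKED_IDLE.finditer(log_text)
    ]
```

Comparing len(jobs_marked_idle(text)) with the number of jobs expected
in the queue would show whether every job was reset or only some.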



> 
> - Ian

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator        parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462                                                            +
+ U. of W. - Milwaukee                                                   +
+ PO Box 413                                                414-229-2677 +
+ Milwaukee, WI 53201                                   fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++