
Re: [Condor-users] Submitted jobs remaining idle



Assuming nothing has changed on your cluster, there are a couple of
things to check:

1. Are all of the Condor daemons running on your front-end node? Maybe
your central manager is down. Use ps to see which Condor daemons are
running on your central manager. Also check the log files on your
central manager, and try running 'condor_status -any' to see whether it
returns any results. (At a minimum, there should be entries for your
submit machine.)

2. Check the logfiles on your execute machines. If the central manager
is running, they may have more details about why they can't reach it.
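For step 2, a quick way to skim the logs on an execute node is to ask Condor for its log directory with `condor_config_val LOG` and grep the tail of MasterLog and StartLog. The Python sketch below does that. The keyword list is purely an assumption (log wording differs between Condor versions), so treat any hits as leads to read in context, not as a diagnosis.

```python
import os
import subprocess

# Hypothetical keywords; Condor's exact log wording varies by version.
KEYWORDS = ("error", "failed", "refused", "timed out")


def suspicious_lines(log_text, keywords=KEYWORDS, tail=200):
    """Return lines among the last `tail` lines of a log that contain a keyword."""
    lines = log_text.splitlines()[-tail:]
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]


def log_dir():
    """Return Condor's LOG directory via condor_config_val, or None if unavailable."""
    try:
        out = subprocess.run(["condor_config_val", "LOG"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    if out.returncode != 0:
        return None
    return out.stdout.strip() or None


if __name__ == "__main__":
    directory = log_dir()
    if directory is None:
        print("condor_config_val not found or LOG not defined")
    else:
        for name in ("MasterLog", "StartLog"):
            path = os.path.join(directory, name)
            if not os.path.exists(path):
                continue
            with open(path, errors="replace") as fh:
                for line in suspicious_lines(fh.read()):
                    print(f"{name}: {line}")
```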

We'll need to know what you find there in order to help debug further.

-Erik


On Tue, Aug 5, 2008 at 11:32 AM, Patrick Haley <phaley@xxxxxxx> wrote:
>
> Hi,
>
> I'm running condor 6.8.5 under Rocks 4.3 (a CentOS-based Linux
> distribution).  Last Thursday, jobs submitted to condor stopped
> running and now just show up as idle (before this, condor had
> been running fine for about a year).  It almost looks like the
> condor daemons on the front-end machine are no longer
> communicating with the daemons on the compute nodes
> (although I can still ping and ssh into the compute nodes).
>
> I've tried condor_restart on the front-end and all the
> compute nodes with no change (also "condor_restart -all" on
> the front-end).  I'm at a bit of a loss on how to proceed.
>
> The output from condor_status is blank.
>
> The output from "condor_q -better" is blank on the compute nodes
> I've tested, but on the front-end the output for all jobs looks like
>
> ---
> 742.006:  Run analysis summary.  Of 0 machines,
>      0 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match but are serving users with a better priority in the pool
>      0 match but reject the job for unknown reasons
>      0 match but will not currently preempt their existing job
>      0 are available to run your job
>
> WARNING:  Be advised:
>   No resources matched request's constraints
>
> WARNING:  Be advised:   Request 742.6 did not match any resource's constraints
>
> ---
>
> The output for "condor_q -ana" on the front-end looks like
> (again same for all jobs)
> ---
> 742.006:  Run analysis summary.  Of 0 machines,
>      0 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match but are serving users with a better priority in the pool
>      0 match but reject the job for unknown reasons
>      0 match but will not currently preempt their existing job
>      0 are available to run your job
>
> WARNING:  Be advised:
>   No resources matched request's constraints
>   Check the Requirements expression below:
>
> Requirements = ((machine != "nas-0-0.local") && (machine != "nas-0-1.local") &&
> (machine != "nas-0-2.local") && (machine != "pvfs2-io-0-0.local") && (machine
> != "mseas.local")) && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >=
> DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain ==
> MY.FileSystemDomain)
>
>
> WARNING:  Be advised:   Request 742.6 did not match any resource's constraints
>
> ---
>
> 12 jobs; 12 idle, 0 running, 0 held
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Pat Haley                          Email:  phaley@xxxxxxx
> Center for Ocean Engineering       Phone:  (617) 253-6824
> Dept. of Mechanical Engineering    Fax:    (617) 253-8125
> MIT, Room 5-222B                   http://web.mit.edu/phaley/www/
> 77 Massachusetts Avenue
> Cambridge, MA  02139-4301
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>