
[Condor-users] ' --- ???? ---' in v6.7.8 WAS Hawkeye module and condor_q problems in condor-6.6.6



Hi,

I'd be grateful if someone could respond further on the two issues I raised in the parent thread:

1) The "hawkeye modules stuck running" problem:

Where is the lock file or accounting information that registers a hawkeye module as running even when, according to "ps", it is not? How can I re-synchronize this information with reality, short of restarting condor_startd?

2) The "condor_q shows garbage" problem:

Yesterday, I upgraded our system to condor v6.7.8, and this problem is currently still manifest on a linux-glibc2.3 system. We have a queue with ~6400 jobs on it, all marked simply with

 --- ???? ---

according to condor_q (no job information here, although condor_q -long works just fine). According to my users, this has been an intermittent problem for some time, but it appears to have become worse since I started running the attached hawkeye module. In addition, in the current case the jobs will not start at all, for no reason that I can find. "condor_hold -all" and similar commands fail with errors like:

Could not hold all jobs.

Pointers would be appreciated. At least one member of the condor team (Peter Couvares) has login privileges on our machines (e.g., maxwell.fnal.gov, our queue manager) if a first-hand look would be helpful.

Thanks for any help,
Chris.


On Mon, 6 Jun 2005, Chris Green wrote:

Hi,

The problem is that I can't find any evidence outside of this message (with ps, for example) that this process really is running! So, my job is not being run every five minutes like it's supposed to and the values it publishes are never being updated. What I need is a way to "unstick" the startd so that its idea of whether a hawkeye module is still running re-aligns itself with reality.

Thanks,
Chris.




Hope this helps. :-)

-Nick





-- Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx Tel: (630) 840-2167. Fax: (630) 840-3867
#!/usr/bin/perl -w

use strict;

# Obtain path and add to search path
BEGIN
{
    my $Dir = $0;
    if ( $Dir =~ /(.*)\/.*/ )
    {
        push @INC, $1;
    }
}

# Include Hawkeye support libraries
use HawkeyePublish;
use HawkeyeLib;

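# Select idle jobs (JobStatus == 1) that report a non-zero ImageSize.
my $command_options = "-constraint 'JobStatus == 1 && ImageSize > 0.0'";
# "\$BIN" reaches the shell as a literal $BIN, so the BIN environment
# variable must be set to the Condor bin directory when this module runs.
my @jobs_to_fix = `\$BIN/condor_q $command_options 2>/dev/null`;
# Keep only the leading "cluster.proc" job ID from each matching output line.
@jobs_to_fix = map { /^\s*(\d+\.\d+)/ ? $1 : () } @jobs_to_fix;
if (@jobs_to_fix) {
  print STDERR "Fixing jobs: ", join(", ", @jobs_to_fix), "\n";
  # Reset ImageSize on every job that matches the same constraint.
  system("\$BIN/condor_qedit $command_options ImageSize 0.0");
}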

# Return true
1;