
Re: [Condor-users] Why isn't the negotiation cycle finding this job?



I also have this specified

GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )

which was necessary so that the negotiator wouldn't include the monitoring slots in its calculations and mess up the "surplus" it hands to each group.

I guess that's making it not see the monitoring slots.

It does sometimes run the monitoring jobs, though. I think that only happens when the user has other jobs in the "idle" state and the negotiator hands the submitter off to the schedd to run jobs; the schedd then seems to find the monitoring slot. The repeatable failure to run the monitoring job seems to happen when the user has no other idle jobs.
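For what it's worth, the exclusion follows from ClassAd "is not identical to" (=!=) semantics: unlike !=, which propagates UNDEFINED, =!= always yields a definite boolean, so ordinary slots that never advertise IS_MONITOR_VM still pass the constraint while monitoring slots (which set it to True) fail it. A toy Python model (not HTCondor code) of that behavior:

```python
# Toy model (NOT HTCondor source) of the ClassAd "is not identical to"
# operator (=!=). Unlike !=, which propagates UNDEFINED, =!= always
# returns a definite boolean, comparing both type and value.

UNDEFINED = object()  # stands in for a slot attribute that isn't advertised

def is_not(a, b):
    """ClassAd =!= : False only when both sides are identical (same type and value)."""
    if a is UNDEFINED or b is UNDEFINED:
        return not (a is UNDEFINED and b is UNDEFINED)
    return type(a) is not type(b) or a != b

def passes_constraint(slot):
    # GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )
    return is_not(slot.get("IS_MONITOR_VM", UNDEFINED), True)

monitor_slot = {"IS_MONITOR_VM": True}  # monitoring slots advertise this
worker_slot = {}                        # ordinary slots leave it undefined

assert passes_constraint(monitor_slot) is False  # excluded from accounting
assert passes_constraint(worker_slot) is True    # still counted
```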

joe

On 11/09/2011 11:15 AM, Joe Boyd wrote:
Ever since we installed condor 7.6 (upgraded from 7.4), the new negotiation
algorithm used when groups are enabled has been biting us. I have this ticket
open which I'm hoping to get a response to:

http://www.cs.wisc.edu/condor/fermi-tickets/22715.html

That ticket deals only with our jobs not running in the order we expect; at
least there the slots are staying full.

Now I see we aren't running our monitoring jobs which use the glideinwms
monitoring slot. Here is the monitoring job which is targeted at the monitoring
slot for a specific job (classads below):

Output from condor_q

1543992.0 willis 11/9 11:00 0+00:00:00 I 0 0.0 mon.sh

From the negotiator log with full_debug

11/09/11 11:03:00 ---------- Started Negotiation Cycle ----------
11/09/11 11:03:00 Phase 1: Obtaining ads from collector ...
11/09/11 11:03:00 Getting all public ads ...
11/09/11 11:03:00 Trying to query collector <131.225.240.215:9618>
11/09/11 11:03:08 Sorting 8584 ads ...
<snip>
11/09/11 11:03:08 Ignoring submitter willis@xxxxxxxx with no requested jobs

The classad of the job

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_q -name
schedd_3@xxxxxxxxxxxxxxxxxxx -l 1543992.0


-- Schedd: schedd_3@xxxxxxxxxxxxxxxxxxx : <131.225.240.215:50394>
PeriodicRemove = ( CurrentTime > 1320858524 )
CommittedSlotTime = 0
Out = "_condor_stdout"
ImageSize_RAW = 1
NumCkpts_RAW = 0
AutoClusterAttrs =
"CAFGroup,CAFAcctGroup,CAF_DEFAULT_START,GLIDEIN_Is_Monitor,CAFDH"
EnteredCurrentStatus = 1320858014
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 1
Cmd = "/tmp/glidein_intmon_HzIdSU/mon.sh"
x509UserProxyVOName = "cdf"
CurrentHosts = 0
Iwd = "/tmp/glidein_intmon_HzIdSU"
CumulativeSlotTime = 0
ExecutableSize_RAW = 1
CondorVersion = "$CondorVersion: 7.6.2 Jul 14 2011 BuildID: 351672 $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
Arguments = ""
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 1543992
In = "/dev/null"
LocalUserCpu = 0.0
x509UserProxyFQAN =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis,/cdf/Role=NULL/Capability=NULL"
MinHosts = 1
Environment = ""
JobUniverse = 5
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!=
undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "schedd_3@xxxxxxxxxxxxxxxxxxx#1543992.0#1320858014"
x509UserProxyFirstFQAN = "/cdf/Role=NULL/Capability=NULL"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 1
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
UserLog = "/tmp/glidein_intmon_HzIdSU/mon.log"
GLIDEIN_Is_Monitor = true
ExecutableSize = 1
MaxHosts = 1
ServerTime = 1320858260
CoreSize = 0
DiskUsage_RAW = 1
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "_condor_stderr"
x509userproxysubject =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis"
AutoClusterId = 496
RequestCpus = 1
StreamErr = false
x509UserProxyExpiration = 1321256898
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
TransferOutputRemaps =
"_condor_stdout=/tmp/glidein_intmon_HzIdSU/mon.out;_condor_stderr=/tmp/glidein_intmon_HzIdSU/mon.err"

PeriodicHold = false
QDate = 1320858014
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: x86_64_rhap_5 $"
JobPrio = 0
LastSuspensionTime = 0
CurrentTime = time()
User = "willis@xxxxxxxx"
x509userproxy = "/export/CafCondor/tickets/x509cc_willis"
JobNotification = 0
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( ( Name =?= "monitor_30769@xxxxxxxxxxxxxxxxxxxx" ) && ( Arch =!=
"Absurd" ) ) && ( ( Memory >= 1 ) ) && ( TARGET.OpSys == "LINUX" ) && (
TARGET.Disk >= DiskUsage ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && (
TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "willis"
LastJobStatus = 0
TransferIn = false


The slot it wants is there

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_status -constraint 'name ==
"monitor_30769@xxxxxxxxxxxxxxxxxxxx"'

Name OpSys Arch State Activity LoadAv Mem ActvtyTime

monitor_30769@fcdf LINUX X86_64 Owner Idle 5.870 393 0+23:01:13
Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 1 1 0 0 0 0 0

Total 1 1 0 0 0 0 0

The slot is free and not usable by anything else, but this job won't run in the 8
minutes allowed. It used to run on the next negotiation cycle, since there is a
slot sitting there free for it. Why does it say "with no requested jobs" for the
user "willis" when there is a job in the queue?

I believe it has to do with the way all the slots are now parcelled out to
groups (even jobs not in any group are covered, because they get added to a
<none> group) and we have this set:

GROUP_ACCEPT_SURPLUS = True
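For context, here is roughly the shape of the group configuration in play. The group names and quota numbers below are made-up placeholders, not our real values; GROUP_NAMES, GROUP_QUOTA_<name>, GROUP_ACCEPT_SURPLUS, and GROUP_DYNAMIC_MACH_CONSTRAINT are the standard knobs:

```
# Hypothetical condor_config sketch -- group names and quotas are placeholders.
GROUP_NAMES = group_physics, group_mc
GROUP_QUOTA_group_physics = 400
GROUP_QUOTA_group_mc = 200

# Let groups run on slots beyond their quota when other groups leave them idle.
GROUP_ACCEPT_SURPLUS = True

# Keep monitoring slots out of the quota/surplus accounting.
GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )
```

With GROUP_ACCEPT_SURPLUS = True the negotiator computes per-group surplus from the slots matching GROUP_DYNAMIC_MACH_CONSTRAINT, which is presumably why the monitoring slots drop out of its view entirely.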

I'll keep digging but I'm hoping someone has advice.

Thanks,

joe