
Re: [Condor-users] Why isn't the negotiation cycle finding this job?



I also have this specified

GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )

which was necessary so that the negotiator wouldn't include the monitoring slots in its calculations and mess up the "surplus" it hands to each group.

I guess that's making it not see the monitoring slots.

It does sometimes run the monitoring jobs, though. I think that only happens when the user has other jobs in the "idle" state and the negotiator hands the submitter off to the schedd to run jobs; the schedd then seems to find the monitoring slot. The repeatable failure to run the monitoring job seems to happen when the user has no other idle jobs.
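For what it's worth, the exclusion follows from ClassAd "is not identical to" (=!=) semantics: unlike !=, which propagates UNDEFINED, =!= always yields a definite boolean, so ordinary slots that never advertise IS_MONITOR_VM still pass the constraint while monitoring slots (which set it to True) fail it. A toy Python model (not HTCondor code) of that behavior:

```python
# Toy model (NOT HTCondor source) of the ClassAd "is not identical to"
# operator (=!=). Unlike !=, which propagates UNDEFINED, =!= always
# returns a definite boolean, comparing both type and value.

UNDEFINED = object()  # stands in for a slot attribute that isn't advertised

def is_not(a, b):
    """ClassAd =!= : False only when both sides are identical (same type and value)."""
    if a is UNDEFINED or b is UNDEFINED:
        return not (a is UNDEFINED and b is UNDEFINED)
    return type(a) is not type(b) or a != b

def passes_constraint(slot):
    # GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )
    return is_not(slot.get("IS_MONITOR_VM", UNDEFINED), True)

monitor_slot = {"IS_MONITOR_VM": True}  # monitoring slots advertise this
worker_slot = {}                        # ordinary slots leave it undefined

assert passes_constraint(monitor_slot) is False  # excluded from accounting
assert passes_constraint(worker_slot) is True    # still counted
```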

joe

On 11/09/2011 11:15 AM, Joe Boyd wrote:
Ever since we installed condor 7.6 (upgraded from 7.4), the new negotiation
algorithm used when groups are enabled has been biting us. I have this ticket
open which I'm hoping to get a response to:

http://www.cs.wisc.edu/condor/fermi-tickets/22715.html

That ticket deals only with our jobs not running in the order we expect; at
least there the slots are staying full.

Now I see we aren't running our monitoring jobs which use the glideinwms
monitoring slot. Here is the monitoring job which is targeted at the monitoring
slot for a specific job (classads below):

Output from condor_q

1543992.0 willis 11/9 11:00 0+00:00:00 I 0 0.0 mon.sh

From the negotiator log with full_debug

11/09/11 11:03:00 ---------- Started Negotiation Cycle ----------
11/09/11 11:03:00 Phase 1: Obtaining ads from collector ...
11/09/11 11:03:00 Getting all public ads ...
11/09/11 11:03:00 Trying to query collector <131.225.240.215:9618>
11/09/11 11:03:08 Sorting 8584 ads ...
<snip>
11/09/11 11:03:08 Ignoring submitter willis@xxxxxxxx with no requested jobs

The classad of the job

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_q -name
schedd_3@xxxxxxxxxxxxxxxxxxx -l 1543992.0


-- Schedd: schedd_3@xxxxxxxxxxxxxxxxxxx : <131.225.240.215:50394>
PeriodicRemove = ( CurrentTime > 1320858524 )
CommittedSlotTime = 0
Out = "_condor_stdout"
ImageSize_RAW = 1
NumCkpts_RAW = 0
AutoClusterAttrs =
"CAFGroup,CAFAcctGroup,CAF_DEFAULT_START,GLIDEIN_Is_Monitor,CAFDH"
EnteredCurrentStatus = 1320858014
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 1
Cmd = "/tmp/glidein_intmon_HzIdSU/mon.sh"
x509UserProxyVOName = "cdf"
CurrentHosts = 0
Iwd = "/tmp/glidein_intmon_HzIdSU"
CumulativeSlotTime = 0
ExecutableSize_RAW = 1
CondorVersion = "$CondorVersion: 7.6.2 Jul 14 2011 BuildID: 351672 $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
Arguments = ""
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 1543992
In = "/dev/null"
LocalUserCpu = 0.0
x509UserProxyFQAN =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis,/cdf/Role=NULL/Capability=NULL"
MinHosts = 1
Environment = ""
JobUniverse = 5
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!=
undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "schedd_3@xxxxxxxxxxxxxxxxxxx#1543992.0#1320858014"
x509UserProxyFirstFQAN = "/cdf/Role=NULL/Capability=NULL"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 1
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
UserLog = "/tmp/glidein_intmon_HzIdSU/mon.log"
GLIDEIN_Is_Monitor = true
ExecutableSize = 1
MaxHosts = 1
ServerTime = 1320858260
CoreSize = 0
DiskUsage_RAW = 1
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "_condor_stderr"
x509userproxysubject =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis"
AutoClusterId = 496
RequestCpus = 1
StreamErr = false
x509UserProxyExpiration = 1321256898
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
TransferOutputRemaps =
"_condor_stdout=/tmp/glidein_intmon_HzIdSU/mon.out;_condor_stderr=/tmp/glidein_intmon_HzIdSU/mon.err"

PeriodicHold = false
QDate = 1320858014
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: x86_64_rhap_5 $"
JobPrio = 0
LastSuspensionTime = 0
CurrentTime = time()
User = "willis@xxxxxxxx"
x509userproxy = "/export/CafCondor/tickets/x509cc_willis"
JobNotification = 0
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( ( Name =?= "monitor_30769@xxxxxxxxxxxxxxxxxxxx" ) && ( Arch =!=
"Absurd" ) ) && ( ( Memory >= 1 ) ) && ( TARGET.OpSys == "LINUX" ) && (
TARGET.Disk >= DiskUsage ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && (
TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "willis"
LastJobStatus = 0
TransferIn = false


The slot it wants is there

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_status -constraint 'name ==
"monitor_30769@xxxxxxxxxxxxxxxxxxxx"'

Name OpSys Arch State Activity LoadAv Mem ActvtyTime

monitor_30769@fcdf LINUX X86_64 Owner Idle 5.870 393 0+23:01:13
Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 1 1 0 0 0 0 0

Total 1 1 0 0 0 0 0

The slot is free and not usable by anything else, but this job won't run in the 8
minutes allowed. It used to run on the next negotiation cycle, since there is a
slot sitting there free for it. Why does it say "with no requested jobs" for the
user "willis" when there is a job in the queue?

I believe it has to do with the way all the slots are now parcelled out to
groups (even jobs not in any group are covered, because they get added to a
<none> group) and we have this set:

GROUP_ACCEPT_SURPLUS = True
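For context, here is roughly the shape of the group configuration in play. The group names and quota numbers below are made-up placeholders, not our real values; GROUP_NAMES, GROUP_QUOTA_<name>, GROUP_ACCEPT_SURPLUS, and GROUP_DYNAMIC_MACH_CONSTRAINT are the standard knobs:

```
# Hypothetical condor_config sketch -- group names and quotas are placeholders.
GROUP_NAMES = group_physics, group_mc
GROUP_QUOTA_group_physics = 400
GROUP_QUOTA_group_mc = 200

# Let groups run on slots beyond their quota when other groups leave them idle.
GROUP_ACCEPT_SURPLUS = True

# Keep monitoring slots out of the quota/surplus accounting.
GROUP_DYNAMIC_MACH_CONSTRAINT = ( IS_MONITOR_VM =!= True )
```

With GROUP_ACCEPT_SURPLUS = True the negotiator computes per-group surplus from the slots matching GROUP_DYNAMIC_MACH_CONSTRAINT, which is presumably why the monitoring slots drop out of its view entirely.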

I'll keep digging but I'm hoping someone has advice.

Thanks,

joe