
[Condor-users] Why isn't the negotiation cycle finding this job?



Ever since we upgraded from Condor 7.4 to 7.6, the new negotiation algorithm used when groups are enabled has been biting us. I have this ticket open and am hoping for a response:

http://www.cs.wisc.edu/condor/fermi-tickets/22715.html

That ticket deals only with the pool not running the jobs we expect it to run; at least there the slots stay full.

Now I see we also aren't running our monitoring jobs, which use the glideinWMS monitoring slot. Here is the monitoring job, which targets the monitoring slot for a specific glidein (classads below):

Output from condor_q

1543992.0   willis         11/9  11:00   0+00:00:00 I  0   0.0  mon.sh

From the negotiator log, with D_FULLDEBUG enabled:

11/09/11 11:03:00 ---------- Started Negotiation Cycle ----------
11/09/11 11:03:00 Phase 1:  Obtaining ads from collector ...
11/09/11 11:03:00   Getting all public ads ...
11/09/11 11:03:00 Trying to query collector <131.225.240.215:9618>
11/09/11 11:03:08   Sorting 8584 ads ...
<snip>
11/09/11 11:03:08 Ignoring submitter willis@xxxxxxxx with no requested jobs

The classad of the job

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_q -name schedd_3@xxxxxxxxxxxxxxxxxxx -l 1543992.0


-- Schedd: schedd_3@xxxxxxxxxxxxxxxxxxx : <131.225.240.215:50394>
PeriodicRemove = ( CurrentTime > 1320858524 )
CommittedSlotTime = 0
Out = "_condor_stdout"
ImageSize_RAW = 1
NumCkpts_RAW = 0
AutoClusterAttrs = "CAFGroup,CAFAcctGroup,CAF_DEFAULT_START,GLIDEIN_Is_Monitor,CAFDH"
EnteredCurrentStatus = 1320858014
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 1
Cmd = "/tmp/glidein_intmon_HzIdSU/mon.sh"
x509UserProxyVOName = "cdf"
CurrentHosts = 0
Iwd = "/tmp/glidein_intmon_HzIdSU"
CumulativeSlotTime = 0
ExecutableSize_RAW = 1
CondorVersion = "$CondorVersion: 7.6.2 Jul 14 2011 BuildID: 351672 $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
Arguments = ""
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 1543992
In = "/dev/null"
LocalUserCpu = 0.0
x509UserProxyFQAN = "/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K. Sakumoto/CN=UID:willis,/cdf/Role=NULL/Capability=NULL"
MinHosts = 1
Environment = ""
JobUniverse = 5
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!= undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "schedd_3@xxxxxxxxxxxxxxxxxxx#1543992.0#1320858014"
x509UserProxyFirstFQAN = "/cdf/Role=NULL/Capability=NULL"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 1
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
UserLog = "/tmp/glidein_intmon_HzIdSU/mon.log"
GLIDEIN_Is_Monitor = true
ExecutableSize = 1
MaxHosts = 1
ServerTime = 1320858260
CoreSize = 0
DiskUsage_RAW = 1
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "_condor_stderr"
x509userproxysubject = "/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K. Sakumoto/CN=UID:willis"
AutoClusterId = 496
RequestCpus = 1
StreamErr = false
x509UserProxyExpiration = 1321256898
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
TransferOutputRemaps = "_condor_stdout=/tmp/glidein_intmon_HzIdSU/mon.out;_condor_stderr=/tmp/glidein_intmon_HzIdSU/mon.err"
PeriodicHold = false
QDate = 1320858014
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: x86_64_rhap_5 $"
JobPrio = 0
LastSuspensionTime = 0
CurrentTime = time()
User = "willis@xxxxxxxx"
x509userproxy = "/export/CafCondor/tickets/x509cc_willis"
JobNotification = 0
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( ( Name =?= "monitor_30769@xxxxxxxxxxxxxxxxxxxx" ) && ( Arch =!= "Absurd" ) ) && ( ( Memory >= 1 ) ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "willis"
LastJobStatus = 0
TransferIn = false


The slot it wants is there:

[cdfcaf@fcdfhead10 /export/condor_local/log] condor_status -constraint 'name == "monitor_30769@xxxxxxxxxxxxxxxxxxxx"'

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

monitor_30769@fcdf LINUX      X86_64 Owner     Idle     5.870   393  0+23:01:13
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     1     1       0         0       0          0        0

               Total     1     1       0         0       0          0        0

The slot is free and not usable by anything else, yet this job won't run within the 8 minutes allowed. Under 7.4 it would run on the next negotiation cycle, since the slot is sitting there free for it. Why does the negotiator say "with no requested jobs" for user "willis" when there is clearly one job in the queue?
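To double-check that the ads themselves really do match, here is a hand evaluation of each clause of the job's Requirements against the slot. This is a plain-Python sketch, not real ClassAd semantics: =?= is modelled as ordinary equality, and the slot's Disk and HasFileTransfer values are assumed, since condor_status doesn't show them.

```python
slot = {
    "Name": "monitor_30769@xxxxxxxxxxxxxxxxxxxx",
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "Memory": 393,            # from the condor_status output above
    "Disk": 1_000_000,        # assumed; only needs to exceed DiskUsage = 1
    "HasFileTransfer": True,  # assumed; standard on glidein startds
}
job = {"DiskUsage": 1, "ImageSize": 1, "RequestMemory": 1}

clauses = [
    slot["Name"] == "monitor_30769@xxxxxxxxxxxxxxxxxxxx",  # Name =?= ...
    slot["Arch"] != "Absurd",                               # Arch =!= "Absurd"
    slot["Memory"] >= 1,
    slot["OpSys"] == "LINUX",
    slot["Disk"] >= job["DiskUsage"],
    job["RequestMemory"] * 1024 >= job["ImageSize"],
    slot["HasFileTransfer"],
]
print(all(clauses))  # True: every clause is satisfied, so the ads do match
```

So matchmaking itself can't be the problem; the job is being dropped before Phase 2 ever looks at it.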

I believe it has to do with the way all the slots are now parcelled out to groups (even jobs that aren't in any group are handled this way, because they get added to a <none> group), combined with having this set:

GROUP_ACCEPT_SURPLUS = True
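For reference, the group-quota knobs in play look roughly like this. The macro names are the standard 7.6 ones, but the group names and quota numbers below are invented for illustration; only GROUP_ACCEPT_SURPLUS reflects our actual config:

```
# Negotiator group-quota config sketch (hypothetical group names/quotas)
GROUP_NAMES = group_cdf, group_monitor
GROUP_QUOTA_group_cdf = 800
GROUP_QUOTA_group_monitor = 10
# Let a group that exhausts its quota borrow unused slots from other groups
GROUP_ACCEPT_SURPLUS = True
# Jobs with no AccountingGroup attribute fall into the "<none>" group
```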

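Here is a toy sketch of my reading of the 7.6 behavior (NOT the actual negotiator code, just an illustration of the hypothesis): idle jobs are bucketed per accounting group before negotiation, and each submitter is considered once per group bucket. A submitter whose only idle job sits under some group then appears to have zero requested jobs when the negotiator walks a different bucket, such as "<none>". The group name below is hypothetical.

```python
from collections import defaultdict

idle_jobs = [
    # (submitter, accounting group of the job) -- group name is made up
    ("willis@xxxxxxxx", "group_monitor"),
]

# Tally requested jobs per (group, submitter), as I suspect 7.6 does.
requested = defaultdict(lambda: defaultdict(int))
for submitter, group in idle_jobs:
    requested[group][submitter] += 1

def negotiate(group, submitters):
    """Return the log lines this toy negotiator emits for one group bucket."""
    return [
        f"Ignoring submitter {s} with no requested jobs"
        for s in submitters
        if requested[group][s] == 0
    ]

# Walking the "<none>" bucket, willis appears to have no jobs at all:
print(negotiate("<none>", ["willis@xxxxxxxx"]))
# But the job is there -- it is simply counted under its own group's bucket:
print(negotiate("group_monitor", ["willis@xxxxxxxx"]))  # no "Ignoring" line
```

If that reading is right, it would explain the log line even though the queue plainly holds an idle job for willis.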
I'll keep digging but I'm hoping someone has advice.

Thanks,

joe