
Re: [Condor-users] "condor_q -better-analyze" suggests removing a requirement I do not use !?!?



Hello,

The master (where the Negotiator and the other central manager daemons run)
runs Condor version 7.4.2.

The pool PCs run version 7.4.2 (some already run 7.4.3).

I am starting to wonder whether this problem is the result of an
incorrect configuration of the (commercial) firewall on the Windows
pool PCs.

For example, this is what happens when I submit the job:

NegotiatorLog:
=================================
09/04 10:31:56 ---------- Started Negotiation Cycle ----------
09/04 10:31:56 Phase 1:  Obtaining ads from collector ...
09/04 10:31:56   Getting all public ads ...
09/04 10:31:56   Sorting 40 ads ...
09/04 10:31:56   Getting startd private ads ...
09/04 10:31:56 Got ads: 40 public and 23 private
09/04 10:31:56 Public ads include 1 submitter, 23 startd
09/04 10:31:56 Phase 2:  Performing accounting ...
09/04 10:31:56 Phase 3:  Sorting submitter ads by priority ...
09/04 10:31:56 Phase 4.1:  Negotiating with schedds ...
09/04 10:31:56   Negotiating with user@xxxxxxxxxxxxxxxxxxx at <115.145.220.21:33108>
09/04 10:31:56 0 seconds so far
09/04 10:31:56     Request 00236.00000:
09/04 10:31:56       Matched 236.0 user@xxxxxxxxxxxxxxxxxxx <115.145.220.21:33108> preempting none <115.145.228.11:1044> slot1@1-1
09/04 10:31:56       Successfully matched with slot1@1-1
09/04 10:31:56     Got NO_MORE_JOBS;  done negotiating
09/04 10:31:56 ---------- Finished Negotiation Cycle ----------
09/04 10:32:17 attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110).
09/04 10:32:17 ERROR: SECMAN:2004:Failed to create security session to <115.145.228.11:1044> with TCP.|SECMAN:2003:TCP connection to <115.145.228.11:1044> failed.
09/04 10:32:17       Failed to initiate socket to send MATCH_INFO to slot1@1-1
=================================


SchedLog:
=================================
09/04 10:31:56 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
09/04 10:31:56 (pid:2109) Sent ad to central manager for user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
...
09/04 10:32:17 (pid:2109) attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110).  Will keep trying for 45 total seconds (24 to go).
09/04 10:32:42 (pid:2109) attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110).
09/04 10:32:42 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx failed.
09/04 10:32:42 (pid:2109) Match record (slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx, 236.0) deleted
09/04 10:32:57 (pid:2109) Activity on stashed negotiator socket
09/04 10:32:57 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
09/04 10:32:57 (pid:2109) Sent ad to central manager for user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
=================================


After the pool PC is matched to the job, every connection attempt to its
startd at <115.145.228.11:1044> times out, from both the negotiator and
the schedd.

However, condor_status on the master does show the status of the pool PCs,
and I can also fetch the pool PCs' log files with condor_fetchlog from the master!

Any idea what part of the communication is broken here?
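
One way to narrow this down, independent of Condor itself, is to try a plain
TCP connection to the startd address from the machines that log the timeout.
The following is only a minimal sketch: parse_address and can_connect are
hypothetical helper names (not Condor tools), and the 5 second timeout is an
arbitrary choice.

```python
import re
import socket


def parse_address(text):
    """Split a log-style address such as '<115.145.228.11:1044>' into (host, port)."""
    match = re.fullmatch(r"<([^:<>]+):(\d+)>", text.strip())
    if match is None:
        raise ValueError("not an <ip:port> address: %r" % (text,))
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False
```

Running can_connect(*parse_address("<115.145.228.11:1044>")) on the submit
machine and on the master should show whether inbound TCP to that port is
reachable at all. If it is not, that would be consistent with the firewall
suspicion above, although it does not prove it.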

Thank you!

Rob.





----- Original Message ----
From: Timothy St. Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Sent: Sat, September 4, 2010 5:50:19 AM
Subject: Re: [Condor-users] "condor_q -better-analyze" suggests removing a 
requirement I do not use !?!?

What version is your negotiator? 

On Wed, 2010-09-01 at 07:44 -0700, Rob wrote:
> > 
> > ----- Original Message ----
> > From: Timothy St. Clair <tstclair@xxxxxxxxxx>
> > If we break this down:
> > 
> > ( ( ( 1024 * target.Memory ) >= 25 )
> > 
> > This appears to be ok.
> > 
> > && ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt
> > undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> > 
> > assuming JobVMMemory is not defined...
> > 1024 * 2.441406250000000E-02 >= 25
> > or
> > 25 >= 25
> > 
> > so I'm a bit confused as well, could you show us your entire job ad?
> > condor_q -l <cluster>.<job>
> > 
> > You could always short circuit by adding Memory to your requirements as
> > well.
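
The arithmetic in the breakdown above can be checked outside Condor. This is
only a sketch in plain Python that mimics the two memory clauses; it does not
call the ClassAd evaluator, the function names are hypothetical, and it
assumes JobVMMemory is undefined and ImageSize is 25 as in the analysis
output. Note that once ceiling() is applied the left-hand side becomes
1024 rather than 25, so the second clause holds either way.

```python
import math


def request_memory(image_size_kb, job_vm_memory=None):
    """Mimic ceiling(ifThenElse(JobVMMemory =!= UNDEFINED, JobVMMemory, ImageSize / 1024.0))."""
    value = job_vm_memory if job_vm_memory is not None else image_size_kb / 1024.0
    return math.ceil(value)


def memory_clauses_hold(machine_memory_mb, image_size_kb, job_vm_memory=None):
    """Mimic ((Memory * 1024) >= ImageSize) && ((RequestMemory * 1024) >= ImageSize)."""
    machine_ok = machine_memory_mb * 1024 >= image_size_kb
    request_ok = request_memory(image_size_kb, job_vm_memory) * 1024 >= image_size_kb
    return machine_ok and request_ok


# 25 / 1024.0 is the 2.441406250000000E-02 constant shown in the analysis.
fallback = 25 / 1024.0
```

For example, memory_clauses_hold(512, 25) evaluates both clauses for a
machine advertising 512 MB (an assumed value); with these inputs neither
clause fails, which matches the conclusion that the expression itself
looks fine.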
> 
> Here it is:
> 
> ClusterId = 221
> QDate = 1283350229
> CompletionDate = 0
> Owner = "rob"
> RemoteWallClockTime = 0.000000
> LocalUserCpu = 0.000000
> LocalSysCpu = 0.000000
> RemoteUserCpu = 0.000000
> RemoteSysCpu = 0.000000
> ExitStatus = 0
> NumCkpts_RAW = 0
> NumCkpts = 0
> NumJobStarts = 0
> NumRestarts = 0
> NumSystemHolds = 0
> CommittedTime = 0
> TotalSuspensions = 0
> LastSuspensionTime = 0
> CumulativeSuspensionTime = 0
> ExitBySignal = FALSE
> CondorVersion = "$CondorVersion: 7.4.2 Apr 21 2010 BuildID: Fedora-7.4.2-1.fc12 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_F12 $"
> RootDir = "/"
> Iwd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla"
> JobUniverse = 5
> Cmd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.exe"
> MinHosts = 1
> MaxHosts = 1
> CurrentHosts = 0
> WantRemoteSyscalls = FALSE
> WantCheckpoint = FALSE
> RequestCpus = 1
> EnteredCurrentStatus = 1283350229
> JobPrio = 0
> User = "rob@xxxxxxxxxxxxxx"
> NiceUser = FALSE
> Environment = ""
> JobNotification = 2
> WantRemoteIO = TRUE
> UserLog = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.log"
> CoreSize = 0
> KillSig = "SIGTERM"
> Rank = 0.000000
> In = "/dev/null"
> TransferIn = FALSE
> Out = "helloworld.out"
> StreamOut = FALSE
> Err = "helloworld.err"
> StreamErr = FALSE
> BufferSize = 524288
> BufferBlockSize = 32768
> ShouldTransferFiles = "YES"
> WhenToTransferOutput = "ON_EXIT"
> TransferFiles = "ONEXIT"
> ImageSize_RAW = 23
> ImageSize = 25
> ExecutableSize_RAW = 23
> ExecutableSize = 25
> DiskUsage_RAW = 23
> DiskUsage = 25
> RequestMemory = ceiling(ifThenElse(JobVMMemory =!= UNDEFINED, JobVMMemory, ImageSize / 1024.000000))
> RequestDisk = DiskUsage
> Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51") && (Machine == "46-5")) && (Disk >= DiskUsage) && (((Memory * 1024) >= ImageSize) && ((RequestMemory * 1024) >= ImageSize)) && (HasFileTransfer)
> JobLeaseDuration = 1200
> PeriodicHold = FALSE
> PeriodicRelease = FALSE
> PeriodicRemove = FALSE
> OnExitHold = FALSE
> OnExitRemove = TRUE
> LeaveJobInQueue = FALSE
> Arguments = ""
> GlobalJobId = "condor1.dyndns.org#221.0#1283350229"
> LastJobStatus = 0
> JobStatus = 1
> ProcId = 0
> AutoClusterId = 0
> AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
> 
> WantMatchDiagnostics = TRUE
> LastMatchTime = 1283351860
> NumJobMatches = 28
> ServerTime = 1283351868
> 
> Rob.
> 
> > On Wed, 2010-09-01 at 01:06 -0700, Rob wrote:
> >> Hi,
> >>
> >> I submit a simple "Hello World" executable as a Vanilla job
> >> to a Windows XP pool PC:
> >>
> >> Universe  = Vanilla
> >> Executable = helloworld.exe
> >> output = helloworld.out
> >> error  = helloworld.err
> >> log    = helloworld.log
> >> Requirements = (Arch == "INTEL") && (OpSys == "WINNT51")
> >> should_transfer_files = YES
> >> when_to_transfer_output = ON_EXIT
> >> Queue
> >>
> >>
> >> When I submit this job, it sits idle in the pool because:
> >>
> >> ===================================================
> >> 219.000:  Run analysis summary.  Of 581 machines,
> >>      2 are rejected by your job's requirements
> >>    281 reject your job because of their own requirements
> >>      0 match but are serving users with a better priority in the pool
> >>    298 match but reject the job for unknown reasons
> >>      0 match but will not currently preempt their existing job
> >>      0 match but are currently offline
> >>      0 are available to run your job
> >>    Last successful match: Wed Sep  1 17:00:50 2010
> >>
> >> The Requirements expression for your job is:
> >>
> >> ( ( target.Arch == "INTEL" ) && ( target.OpSys == "WINNT51" ) &&
> >> ( target.Disk >= DiskUsage ) &&
> >> ( ( ( target.Memory * 1024 ) >= ImageSize ) &&
> >> ( ( RequestMemory * 1024 ) >= ImageSize ) ) && ( target.HasFileTransfer )
> >>
> >>    Condition                        Machines Matched    Suggestion
> >>    ---------                        ----------------    ----------
> >> 1  ( ( ( 1024 * target.Memory ) >= 25 ) && ( ( 1024 *
> >>    ceiling(ifThenElse(JobVMMemory isnt
> >>    undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> >>                                     0                   REMOVE
> >> 2  ( target.Arch == "INTEL" )       581
> >> 3  ( target.OpSys == "WINNT51" )    581
> >> 4  ( target.Disk >= 25 )            581
> >> 5  ( target.HasFileTransfer )       581
> >> ===================================================
> >>
> >> I have no idea what to "REMOVE" here !?!?!
> >> This tiny helloworld executable has very minimal memory requirements,
> >> so I don't understand why this Memory stuff is blocking the job.
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Rob.
> >> 
> 
> 
>      
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
