Re: [Condor-users] "condor_q -better-analyze" suggests removing a requirement I do not use !?!?
- Date: Fri, 3 Sep 2010 18:43:07 -0700 (PDT)
- From: Rob <spamrefuse@xxxxxxxxx>
- Subject: Re: [Condor-users] "condor_q -better-analyze" suggests removing a requirement I do not use !?!?
Hello,
The master (where the Negotiator and the other central daemons run)
runs Condor version 7.4.2.
The pool PCs run version 7.4.2 (some already run 7.4.3).
I am starting to suspect that this problem is caused by an
incorrect configuration of the (commercial) firewall on the Windows
pool PCs.
For example, this is what happens when I submit the job:
NegotiatorLog:
=================================
09/04 10:31:56 ---------- Started Negotiation Cycle ----------
09/04 10:31:56 Phase 1: Obtaining ads from collector ...
09/04 10:31:56 Getting all public ads ...
09/04 10:31:56 Sorting 40 ads ...
09/04 10:31:56 Getting startd private ads ...
09/04 10:31:56 Got ads: 40 public and 23 private
09/04 10:31:56 Public ads include 1 submitter, 23 startd
09/04 10:31:56 Phase 2: Performing accounting ...
09/04 10:31:56 Phase 3: Sorting submitter ads by priority ...
09/04 10:31:56 Phase 4.1: Negotiating with schedds ...
09/04 10:31:56 Negotiating with user@xxxxxxxxxxxxxxxxxxx at <115.145.220.21:33108>
09/04 10:31:56 0 seconds so far
09/04 10:31:56 Request 00236.00000:
09/04 10:31:56 Matched 236.0 user@xxxxxxxxxxxxxxxxxxx <115.145.220.21:33108> preempting none <115.145.228.11:1044> slot1@1-1
09/04 10:31:56 Successfully matched with slot1@1-1
09/04 10:31:56 Got NO_MORE_JOBS; done negotiating
09/04 10:31:56 ---------- Finished Negotiation Cycle ----------
09/04 10:32:17 attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110).
09/04 10:32:17 ERROR: SECMAN:2004:Failed to create security session to <115.145.228.11:1044> with TCP.|SECMAN:2003:TCP connection to <115.145.228.11:1044> failed.
09/04 10:32:17 Failed to initiate socket to send MATCH_INFO to slot1@1-1
=================================
SchedLog:
=================================
09/04 10:31:56 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
09/04 10:31:56 (pid:2109) Sent ad to central manager for user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
...
09/04 10:32:17 (pid:2109) attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110). Will keep trying for 45 total seconds (24 to go).
09/04 10:32:42 (pid:2109) attempt to connect to <115.145.228.11:1044> failed: Connection timed out (connect errno = 110).
09/04 10:32:42 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx failed.
09/04 10:32:42 (pid:2109) Match record (slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx, 236.0) deleted
09/04 10:32:57 (pid:2109) Activity on stashed negotiator socket
09/04 10:32:57 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
09/04 10:32:57 (pid:2109) Sent ad to central manager for user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
=================================
After the pool PC is matched to the job, every connection attempt to its
startd at <115.145.228.11:1044> times out.
However, condor_status on the master does show the status of the pool PCs,
and I can fetch the pool PC's log files with condor_fetchlog on the master!
Any idea what part of the communication is broken here?
Thank you!
Rob.
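
One way to separate a firewall problem from a Condor problem is to try a plain TCP connection to the startd address from the machine whose connection attempts time out. The sketch below is only an illustration: the helper name can_connect and the 5-second timeout are assumptions, not part of Condor. condor_status reads its data from the collector, so a working condor_status does not by itself show that the startd port is reachable.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration only; it does not speak the
    Condor protocol, it only checks that the TCP handshake completes.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Timeouts (such as errno 110 in the logs) and refusals both land here.
        print(f"connect to {host}:{port} failed: {exc}")
        return False


# Example call, using the address of slot1@1-1 from the logs:
#     can_connect("115.145.228.11", 1044)
```

If this call times out from the master and the submit machine while the same port is reachable from the pool PC itself, that would point towards the firewall rather than the job's requirements.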
----- Original Message ----
From: Timothy St. Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Sent: Sat, September 4, 2010 5:50:19 AM
Subject: Re: [Condor-users] "condor_q -better-analyze" suggests removing a
requirement I do not use !?!?
What version is your negotiator?
On Wed, 2010-09-01 at 07:44 -0700, Rob wrote:
> >
> > ----- Original Message ----
> > From: Timothy St. Clair <tstclair@xxxxxxxxxx>
> > If we break this down:
> >
> > ( ( ( 1024 * target.Memory ) >= 25 )
> >
> > This appears to be ok.
> >
> > && ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt
> > undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> >
> > assuming JobVMMemory is not defined...
> > 1024 * 2.441406250000000E-02 >= 25
> > or
> > 25 >= 25
> >
> > so I'm a bit confused as well, could you show us your entire job ad?
> > condor_q -l <cluster>.<job>
> >
> > You could always short circuit by adding Memory to your requirements as
> > well.
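
The job-side half of that memory clause can also be checked by hand. The sketch below redoes the arithmetic in Python, assuming ImageSize is 25 (the value the analyzer substituted) and JobVMMemory is undefined; the function names are illustrative only and are not Condor APIs. Note that once ceiling() is applied, the left-hand side becomes 1024 rather than 25.

```python
import math


def request_memory(image_size_kb, job_vm_memory=None):
    """Mirror ceiling(ifThenElse(JobVMMemory isnt undefined,
    JobVMMemory, ImageSize / 1024.0)) from the job ad."""
    if job_vm_memory is not None:
        return math.ceil(job_vm_memory)
    return math.ceil(image_size_kb / 1024.0)


def job_side_clause(image_size_kb, job_vm_memory=None):
    """Evaluate ( RequestMemory * 1024 ) >= ImageSize."""
    return request_memory(image_size_kb, job_vm_memory) * 1024 >= image_size_kb


image_size_kb = 25                       # assumed: value shown by the analyzer
print(image_size_kb / 1024.0)            # 0.0244140625, i.e. 2.44140625E-02
print(request_memory(image_size_kb))     # ceiling rounds this up to 1
print(job_side_clause(image_size_kb))    # 1024 >= 25
```

Under these assumptions the clause holds for the job regardless of the machine, so it does not on its own explain why 0 machines matched condition 1.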
>
> Here it is:
>
> ClusterId = 221
> QDate = 1283350229
> CompletionDate = 0
> Owner = "rob"
> RemoteWallClockTime = 0.000000
> LocalUserCpu = 0.000000
> LocalSysCpu = 0.000000
> RemoteUserCpu = 0.000000
> RemoteSysCpu = 0.000000
> ExitStatus = 0
> NumCkpts_RAW = 0
> NumCkpts = 0
> NumJobStarts = 0
> NumRestarts = 0
> NumSystemHolds = 0
> CommittedTime = 0
> TotalSuspensions = 0
> LastSuspensionTime = 0
> CumulativeSuspensionTime = 0
> ExitBySignal = FALSE
> CondorVersion = "$CondorVersion: 7.4.2 Apr 21 2010 BuildID: Fedora-7.4.2-1.fc12 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_F12 $"
> RootDir = "/"
> Iwd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla"
> JobUniverse = 5
> Cmd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.exe"
> MinHosts = 1
> MaxHosts = 1
> CurrentHosts = 0
> WantRemoteSyscalls = FALSE
> WantCheckpoint = FALSE
> RequestCpus = 1
> EnteredCurrentStatus = 1283350229
> JobPrio = 0
> User = "rob@xxxxxxxxxxxxxx"
> NiceUser = FALSE
> Environment = ""
> JobNotification = 2
> WantRemoteIO = TRUE
> UserLog = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.log"
> CoreSize = 0
> KillSig = "SIGTERM"
> Rank = 0.000000
> In = "/dev/null"
> TransferIn = FALSE
> Out = "helloworld.out"
> StreamOut = FALSE
> Err = "helloworld.err"
> StreamErr = FALSE
> BufferSize = 524288
> BufferBlockSize = 32768
> ShouldTransferFiles = "YES"
> WhenToTransferOutput = "ON_EXIT"
> TransferFiles = "ONEXIT"
> ImageSize_RAW = 23
> ImageSize = 25
> ExecutableSize_RAW = 23
> ExecutableSize = 25
> DiskUsage_RAW = 23
> DiskUsage = 25
> RequestMemory = ceiling(ifThenElse(JobVMMemory =!= UNDEFINED, JobVMMemory, ImageSize / 1024.000000))
> RequestDisk = DiskUsage
> Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51") && (Machine == "46-5")) && (Disk >= DiskUsage) && (((Memory * 1024) >= ImageSize) && ((RequestMemory * 1024) >= ImageSize)) && (HasFileTransfer)
> JobLeaseDuration = 1200
> PeriodicHold = FALSE
> PeriodicRelease = FALSE
> PeriodicRemove = FALSE
> OnExitHold = FALSE
> OnExitRemove = TRUE
> LeaveJobInQueue = FALSE
> Arguments = ""
> GlobalJobId = "condor1.dyndns.org#221.0#1283350229"
> LastJobStatus = 0
> JobStatus = 1
> ProcId = 0
> AutoClusterId = 0
> AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
> WantMatchDiagnostics = TRUE
> LastMatchTime = 1283351860
> NumJobMatches = 28
> ServerTime = 1283351868
>
> Rob.
>
> > On Wed, 2010-09-01 at 01:06 -0700, Rob wrote:
> >> Hi,
> >>
> >> I submit a simple "Hello World" executable as a Vanilla job
> >> to a Windows XP pool PC:
> >>
> >> Universe = Vanilla
> >> Executable = helloworld.exe
> >> output = helloworld.out
> >> error = helloworld.err
> >> log = helloworld.log
> >> Requirements = (Arch == "INTEL") && (OpSys == "WINNT51")
> >> should_transfer_files = YES
> >> when_to_transfer_output = ON_EXIT
> >> Queue
> >>
> >>
> >> When I submit this job, it sits idle in the pool because:
> >>
> >> ===================================================
> >> 219.000: Run analysis summary. Of 581 machines,
> >> 2 are rejected by your job's requirements
> >> 281 reject your job because of their own requirements
> >> 0 match but are serving users with a better priority in the pool
> >> 298 match but reject the job for unknown reasons
> >> 0 match but will not currently preempt their existing job
> >> 0 match but are currently offline
> >> 0 are available to run your job
> >> Last successful match: Wed Sep 1 17:00:50 2010
> >>
> >> The Requirements expression for your job is:
> >>
> >> ( ( target.Arch == "INTEL" ) && ( target.OpSys == "WINNT51" ) &&
> >> ( target.Disk >= DiskUsage ) &&
> >> ( ( ( target.Memory * 1024 ) >= ImageSize ) &&
> >> ( ( RequestMemory * 1024 ) >= ImageSize ) ) && ( target.HasFileTransfer )
> >>
> >>     Condition                         Machines Matched    Suggestion
> >>     ---------                         ----------------    ----------
> >> 1   ( ( ( 1024 * target.Memory ) >= 25 ) && ( ( 1024 *
> >>     ceiling(ifThenElse(JobVMMemory isnt
> >>     undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> >>                                       0                   REMOVE
> >> 2   ( target.Arch == "INTEL" )        581
> >> 3   ( target.OpSys == "WINNT51" )     581
> >> 4   ( target.Disk >= 25 )             581
> >> 5   ( target.HasFileTransfer )        581
> >> ===================================================
> >>
> >> I have no idea what to "REMOVE" here !?!?!
> >> This tiny helloworld executable has very minimal memory requirements,
> >> so I don't understand why this Memory stuff is blocking the job.
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Rob.
> >>
>
>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/