
Re: [Condor-users] "condor_q -better-analyze" suggests removing a requirement I do not use !?!?



Hi, could you turn off the firewall and test again?
On Scientific Linux we hit the same failure, and the firewall was the cause: once we disabled it, the jobs ran.


On Fri, Sep 3, 2010 at 8:43 PM, Rob <spamrefuse@xxxxxxxxx> wrote:

Hello,

Master (where the Negotiator and other 'master-stuff' is running)
has condor version 7.4.2.

The pool PCs have version 7.4.2 (and some already have 7.4.3).

I actually started thinking whether this problem is a result of an
incorrect configuration of the (commercial) firewall on the Windows
pool PCs.....

For example, this is what happens when I submit the job:

NegotiatorLog:
=================================
09/04 10:31:56 ---------- Started Negotiation Cycle ----------
09/04 10:31:56 Phase 1:  Obtaining ads from collector ...
09/04 10:31:56   Getting all public ads ...
09/04 10:31:56   Sorting 40 ads ...
09/04 10:31:56   Getting startd private ads ...
09/04 10:31:56 Got ads: 40 public and 23 private
09/04 10:31:56 Public ads include 1 submitter, 23 startd
09/04 10:31:56 Phase 2:  Performing accounting ...
09/04 10:31:56 Phase 3:  Sorting submitter ads by priority ...
09/04 10:31:56 Phase 4.1:  Negotiating with schedds ...
09/04 10:31:56   Negotiating with user@xxxxxxxxxxxxxxxxxxx at
<115.145.220.21:33108>
09/04 10:31:56 0 seconds so far
09/04 10:31:56     Request 00236.00000:
09/04 10:31:56       Matched 236.0 user@xxxxxxxxxxxxxxxxxxx
<115.145.220.21:33108> preempting none <115.145.228.11:1044> slot1@1-1
09/04 10:31:56       Successfully matched with slot1@1-1
09/04 10:31:56     Got NO_MORE_JOBS;  done negotiating
09/04 10:31:56 ---------- Finished Negotiation Cycle ----------
09/04 10:32:17 attempt to connect to <115.145.228.11:1044> failed: Connection
timed out (connect errno = 110).
09/04 10:32:17 ERROR: SECMAN:2004:Failed to create security session to
<115.145.228.11:1044> with TCP.|SECMAN:2003:TCP connection to
<115.145.228.11:1044> failed.
09/04 10:32:17       Failed to initiate socket to send MATCH_INFO to slot1@1-1
=================================


SchedLog:
=================================
09/04 10:31:56 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level
= 0
09/04 10:31:56 (pid:2109) Sent ad to central manager for
user@xxxxxxxxxxxxxxxxxxx
09/04 10:31:56 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
...
09/04 10:32:17 (pid:2109) attempt to connect to <115.145.228.11:1044> failed:
Connection timed out (connect errno = 110).  Will keep trying for 45 total
seconds (24 to go).
09/04 10:32:42 (pid:2109) attempt to connect to <115.145.228.11:1044> failed:
Connection timed out (connect errno = 110).
09/04 10:32:42 (pid:2109) Failed to send REQUEST_CLAIM to startd slot1@1-1
<115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection
to startd slot1@1-1 <115.145.228.11:1044> for user@xxxxxxxxxxxxxxxxxxx failed.
09/04 10:32:42 (pid:2109) Match record (slot1@1-1 <115.145.228.11:1044> for
user@xxxxxxxxxxxxxxxxxxx, 236.0) deleted
09/04 10:32:57 (pid:2109) Activity on stashed negotiator socket
09/04 10:32:57 (pid:2109) Negotiating for owner: user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Out of jobs - 1 jobs matched, 0 jobs idle, flock level
= 0
09/04 10:32:57 (pid:2109) Sent ad to central manager for
user@xxxxxxxxxxxxxxxxxxx
09/04 10:32:57 (pid:2109) Sent ad to 1 collectors for user@xxxxxxxxxxxxxxxxxxx
=================================


After the pool PC is found as a match for the job, there is a constant
failure of connections.
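
To see which endpoints are timing out, the connect failures can be pulled
out of the logs. Below is a minimal sketch in Python, assuming the log text
uses the same "attempt to connect to <ip:port> failed" wording as the
excerpts above. The name failed_endpoints is made up for illustration and is
not part of Condor.

```python
import re
from collections import Counter

# Matches the "attempt to connect to <ip:port> failed" lines seen in the
# NegotiatorLog and SchedLog excerpts (wording assumed from those excerpts).
FAILED_CONNECT = re.compile(
    r"attempt to connect to <(\d+\.\d+\.\d+\.\d+):(\d+)> failed"
)


def failed_endpoints(log_text):
    """Count failed connection attempts per (host, port) in a log string."""
    return Counter(
        (host, int(port)) for host, port in FAILED_CONNECT.findall(log_text)
    )
```

Fed the SchedLog excerpt above, this should report two failures, both for
115.145.228.11 port 1044, which is the startd address of slot1@1-1.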

Now, condor_status on the master does get the status info of the pool PCs;
also, I can get the pool PC's Log files using condor_fetchlog on the master!
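
A working condor_status does not by itself show that the startd port from
the logs is reachable from the master, so that port can be tested directly.
This is a rough sketch that assumes plain TCP reachability is the question.
can_connect is a hypothetical helper, not a Condor tool, and the 5 second
default timeout is an arbitrary choice.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on timeout, refusal, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On the master, calling can_connect("115.145.228.11", 1044) would try the
address from the log lines. A False result would be consistent with a
firewall on the pool PC dropping inbound connections to that port.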

Any idea what part of the communication is broken here?

Thank you!

Rob.





----- Original Message ----
From: Timothy St. Clair <tstclair@xxxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Sent: Sat, September 4, 2010 5:50:19 AM
Subject: Re: [Condor-users] "condor_q -better-analyze" suggests removing a
requirement I do not use !?!?

What version is your negotiator?

On Wed, 2010-09-01 at 07:44 -0700, Rob wrote:
> >
> > ----- Original Message ----
> > From: Timothy St. Clair <tstclair@xxxxxxxxxx>
> > If we break this down:
> >
> > ( ( ( 1024 * target.Memory ) >= 25 )
> >
> > This appears to be ok.
> >
> > && ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt
> > undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> >
> > assuming JobVMMemory is not defined...
> > 1024 * ceiling(2.441406250000000E-02) >= 25
> > or
> > 1024 >= 25
> >
> > so I'm a bit confused as well, could you show us your entire job ad?
> > condor_q -l <cluster>.<job>
> >
> > You could always short circuit by adding Memory to your requirements as
> > well.
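
The memory clause can also be checked by hand outside of Condor. This is a
small sketch in Python that mirrors the RequestMemory expression from the
job ad and the two memory comparisons in Requirements. The function names
and the 512 MB machine value are illustrative assumptions, and ClassAd
UNDEFINED semantics (for example a machine ad with no Memory attribute) are
not modelled.

```python
import math


def request_memory_mb(image_size_kb, job_vm_memory=None):
    """ceiling(ifThenElse(JobVMMemory =!= UNDEFINED, JobVMMemory, ImageSize / 1024.0))"""
    if job_vm_memory is not None:
        value = job_vm_memory
    else:
        value = image_size_kb / 1024.0
    return math.ceil(value)


def memory_clause(machine_memory_mb, image_size_kb, job_vm_memory=None):
    """((Memory * 1024) >= ImageSize) && ((RequestMemory * 1024) >= ImageSize)"""
    machine_ok = machine_memory_mb * 1024 >= image_size_kb
    request_ok = request_memory_mb(image_size_kb, job_vm_memory) * 1024 >= image_size_kb
    return machine_ok and request_ok
```

With ImageSize = 25 and JobVMMemory undefined, 25 / 1024.0 is 0.0244140625,
which ceiling rounds up to 1, so the second comparison becomes 1024 >= 25.
For any machine advertising at least 1 MB of Memory the first comparison
holds too, so evaluated this way the clause would not be expected to reject
the job.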
>
> Here it is:
>
> ClusterId = 221
> QDate = 1283350229
> CompletionDate = 0
> Owner = "rob"
> RemoteWallClockTime = 0.000000
> LocalUserCpu = 0.000000
> LocalSysCpu = 0.000000
> RemoteUserCpu = 0.000000
> RemoteSysCpu = 0.000000
> ExitStatus = 0
> NumCkpts_RAW = 0
> NumCkpts = 0
> NumJobStarts = 0
> NumRestarts = 0
> NumSystemHolds = 0
> CommittedTime = 0
> TotalSuspensions = 0
> LastSuspensionTime = 0
> CumulativeSuspensionTime = 0
> ExitBySignal = FALSE
> CondorVersion = "$CondorVersion: 7.4.2 Apr 21 2010 BuildID: Fedora-7.4.2-1.fc12 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_F12 $"
> RootDir = "/"
> Iwd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla"
> JobUniverse = 5
> Cmd = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.exe"
> MinHosts = 1
> MaxHosts = 1
> CurrentHosts = 0
> WantRemoteSyscalls = FALSE
> WantCheckpoint = FALSE
> RequestCpus = 1
> EnteredCurrentStatus = 1283350229
> JobPrio = 0
> User = "rob@xxxxxxxxxxxxxx"
> NiceUser = FALSE
> Environment = ""
> JobNotification = 2
> WantRemoteIO = TRUE
> UserLog = "/home/rob/Desktop/Research/Condor/Examples/Vanilla/helloworld.log"
> CoreSize = 0
> KillSig = "SIGTERM"
> Rank = 0.000000
> In = "/dev/null"
> TransferIn = FALSE
> Out = "helloworld.out"
> StreamOut = FALSE
> Err = "helloworld.err"
> StreamErr = FALSE
> BufferSize = 524288
> BufferBlockSize = 32768
> ShouldTransferFiles = "YES"
> WhenToTransferOutput = "ON_EXIT"
> TransferFiles = "ONEXIT"
> ImageSize_RAW = 23
> ImageSize = 25
> ExecutableSize_RAW = 23
> ExecutableSize = 25
> DiskUsage_RAW = 23
> DiskUsage = 25
> RequestMemory = ceiling(ifThenElse(JobVMMemory =!= UNDEFINED, JobVMMemory,
> ImageSize / 1024.000000))
> RequestDisk = DiskUsage
> Requirements = ((Arch == "INTEL") && (OpSys == "WINNT51") && (Machine ==
> "46-5")) && (Disk >= DiskUsage) && (((Memory * 1024) >= ImageSize) &&
> ((RequestMemory * 1024) >= ImageSize)) && (HasFileTransfer)
> JobLeaseDuration = 1200
> PeriodicHold = FALSE
> PeriodicRelease = FALSE
> PeriodicRemove = FALSE
> LeaveJobInQueue = FALSE
> Arguments = ""
> GlobalJobId = "condor1.dyndns.org#221.0#1283350229"
> LastJobStatus = 0
> JobStatus = 1
> ProcId = 0
> AutoClusterId = 0
> AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
>
> WantMatchDiagnostics = TRUE
> LastMatchTime = 1283351860
> NumJobMatches = 28
> ServerTime = 1283351868
>
> Rob.
>
> > On Wed, 2010-09-01 at 01:06 -0700, Rob wrote:
> >> Hi,
> >>
> >> I submit a simple "Hello World" executable as a Vanilla job
> >> to a Windows XP pool PC:
> >>
> >> Universe  = Vanilla
> >> Executable = helloworld.exe
> >> output = helloworld.out
> >> error  = helloworld.err
> >> log    = helloworld.log
> >> Requirements = (Arch == "INTEL") && (OpSys == "WINNT51")
> >> should_transfer_files = YES
> >> when_to_transfer_output = ON_EXIT
> >> Queue
> >>
> >>
> >> When I submit this job, it sits idle in the pool because:
> >>
> >> ===================================================
> >> 219.000:  Run analysis summary.  Of 581 machines,
> >>      2 are rejected by your job's requirements
> >>    281 reject your job because of their own requirements
> >>      0 match but are serving users with a better priority in the pool
> >>    298 match but reject the job for unknown reasons
> >>      0 match but will not currently preempt their existing job
> >>      0 match but are currently offline
> >>      0 are available to run your job
> >>    Last successful match: Wed Sep  1 17:00:50 2010
> >>
> >> The Requirements expression for your job is:
> >>
> >> ( ( target.Arch == "INTEL" ) && ( target.OpSys == "WINNT51" ) &&
> >> ( target.Disk >= DiskUsage ) &&
> >> ( ( ( target.Memory * 1024 ) >= ImageSize ) &&
> >> ( ( RequestMemory * 1024 ) >= ImageSize ) ) && ( target.HasFileTransfer )
> >>
> >>    Condition                        Machines Matched    Suggestion
> >>    ---------                        ----------------    ----------
> >> 1  ( ( ( 1024 * target.Memory ) >= 25 ) && ( ( 1024 *
> >> ceiling(ifThenElse(JobVMMemory isnt
> >> undefined,JobVMMemory,2.441406250000000E-02)) ) >= 25 ) )
> >>                                      0                  REMOVE
> >> 2  ( target.Arch == "INTEL" )        581
> >> 3  ( target.OpSys == "WINNT51" )    581
> >> 4  ( target.Disk >= 25 )            581
> >> 5  ( target.HasFileTransfer )        581
> >> ===================================================
> >>
> >> I have no idea what to "REMOVE" here !?!?!
> >> This tiny helloworld executable has very minimal memory requirements,
> >> so I don't understand why this Memory stuff is blocking the job.
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Rob.
> >>
>
>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/

--
----
Edier Alberto Zapata Hernández
Est. Ingeniería de Sistemas
Universidad de Valle