
Re: [Condor-users] condor-g & matching a cluster to multiple jobs at once





Warren Smith wrote:


Thanks - that got things going.

With that negotiator option, will I be able to use something like:

match_list_length = 1
Rank  = TARGET.Name != LastMatchName0

in a job submit file?


Yes. The "match list" in your example is entirely different from the "match list" that the configuration variable NEGOTIATOR_MATCHLIST_CACHING refers to. The latter is an internal optimization that users should never need to be aware of. Unfortunately, that optimization does not work correctly with grid matchmaking.

--Dan



Warren


Dan Bradley wrote:

This is a bug in Condor.  A fix for it has been discussed but not yet
implemented.

The workaround is to add the following to your fake startd ads:

RemoteUser = "fake_user"
Rank = 1.0
CurrentRank = 0.0


and to add the following to your negotiator configuration:

NEGOTIATOR_MATCHLIST_CACHING = false
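
If the fake startd ads are produced by a script, the ad-side half of this workaround amounts to overriding three attributes before the ad is published. The following is a minimal Python sketch under that assumption; the thread does not say how the ads are generated or published, and the names `WORKAROUND` and `apply_workaround` are hypothetical, not part of Condor.

```python
# Attributes from the workaround above, written as ClassAd literals.
WORKAROUND = {
    "RemoteUser": '"fake_user"',
    "Rank": "1.0",
    "CurrentRank": "0.0",
}

def apply_workaround(ad_text):
    """Return ad_text with the workaround attributes set.

    Assumes one "Attr = value" pair per line. Existing values of the
    workaround attributes are dropped; attribute names are compared
    case-insensitively, as ClassAd attribute names are.
    """
    overridden = {name.lower() for name in WORKAROUND}
    kept = []
    for line in ad_text.splitlines():
        name = line.split("=", 1)[0].strip().lower()
        if name not in overridden:
            kept.append(line)
    kept.extend(f"{name} = {value}" for name, value in WORKAROUND.items())
    return "\n".join(kept) + "\n"

# A shortened ad using attribute values quoted elsewhere in this thread.
ad = 'Name = "tacc.lonestar.serial"\nRank = 0.000000\nCurrentRank = 0.000000\n'
patched = apply_workaround(ad)
print(patched)
```

The negotiator setting is separate and still has to be made in the negotiator's configuration.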


--Dan

Warren Smith wrote:

Hi, I'm working on deploying Condor-G and matchmaking. My problem is
that although jobs are being matched and executed, they are matched to
a system only one at a time. I'd like Condor-G to have several jobs
submitted to a system at the same time. I have a simple test job that
can match only a single ClassAd:


executable = /bin/hostname
arguments = --fqdn
transfer_executable = false

output = hostname-match-$(CLUSTER)-$(PROCESS).out
error = hostname-match-$(CLUSTER)-$(PROCESS).err
log = hostname-match-$(CLUSTER)-$(PROCESS).log

universe = grid
x509userproxy=/home/utexas/staff/wsmith/.globus/userproxy.pem
grid_resource = $$(GridResource)
Requirements = (Name=="tacc.lonestar.serial")
globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))

queue 10



And the classad in Condor is:

lslogin2$ condor_status -l tacc.lonestar.serial
MyType = "Machine"
TargetType = "Job"
Requirements = (TARGET.JobUniverse == 9)
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = TRUE
CurMatches = 0
Name = "tacc.lonestar.serial"
Machine = "gatekeeper.lonestar.tacc.teragrid.org"
StartdIpAddr = "<129.114.50.32>"
GridResource = "gt2 gatekeeper.lonestar.tacc.teragrid.org:2119/jobmanager-lsf"
State = "Unclaimed"
Activity = "Idle"
UpdateSequenceNumber = 1220367368
Arch = "X86_64"
OpSys = "LINUX"
LoadAvg = 0.865580
TotalMemory = 11840721
Memory = 1725537
Queue = "serial"
Priority = 0.030000
MaxWallTime = 720
MaxProcessors = 1
MyAddress = "<192.5.198.172:0>"
LastHeardFrom = 1220367369
UpdatesTotal = 1328
UpdatesSequenced = 0
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"


From the Condor manual, it seems like setting WantAdRevaluate to True
will result in Condor matching multiple jobs to this system. What I'm
seeing is that the jobs run one at a time on the system. Here's part of
the MatchLog:

9/2 09:48:49       Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49       Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:53:51       Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:53:51       Rejected 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:58:52       Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:58:52       Rejected 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:03:53       Matched 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:03:53       Rejected 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:08:55       Matched 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:08:55       Rejected 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:13:56       Matched 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:13:56       Rejected 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:18:58       Matched 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:18:58       Rejected 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:24:00       Matched 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:24:00       Rejected 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:29:01       Matched 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:29:01       Rejected 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:34:02       Matched 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial

As you can see, all of the jobs get matched and run, but only one is
matched about every 5 minutes (once per negotiator cycle?). The serial
queue on lonestar was empty, so the jobs ran quickly.
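
The spacing between matches can be checked directly from the MatchLog timestamps. The following is a minimal Python sketch, not a Condor tool: the function name `match_intervals` is hypothetical, the regular expression is only fitted to the log lines quoted in this message, and the `year` parameter is an assumption needed because the log entries carry no year.

```python
import re
from datetime import datetime

# Start of a "Matched" entry, e.g. "9/2 09:48:49       Matched 153.0 ..."
MATCHED = re.compile(r"^(\d+/\d+ \d+:\d+:\d+)\s+Matched\s+(\d+\.\d+)")

def match_intervals(log_text, year=2008):
    """Return (job_id, seconds_since_previous_match) for every Matched
    entry after the first one."""
    previous = None
    intervals = []
    for line in log_text.splitlines():
        m = MATCHED.match(line)
        if m is None:
            continue  # "Rejected" entries and wrapped continuation lines
        # The log omits the year, so an assumed one completes the date.
        stamp = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
        if previous is not None:
            gap = int((stamp - previous).total_seconds())
            intervals.append((m.group(2), gap))
        previous = stamp
    return intervals

# The first entries of the log excerpt quoted in this message.
sample = """\
9/2 09:48:49       Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx
<129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49       Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx
<129.114.69.97:50761>: no match found
9/2 09:53:51       Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx
<129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:58:52       Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx
<129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
"""
intervals = match_intervals(sample)
print(intervals)  # [('153.1', 302), ('153.2', 301)]
```

Gaps of about 300 seconds between consecutive matches are what one would expect if exactly one job is matched per negotiation cycle and the cycle runs every 5 minutes.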

The collector and negotiator are from Condor 7.1.0. I sent an earlier
query to the list about a STARTD_AD_REEVAL_EXPR error message in my
NegotiatorLog that I don't think is related to this...


Thanks for the help,


Warren

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

