
[Condor-users] condor-g & matching a cluster to multiple jobs at once





Hi, I'm working on deploying Condor-G with matchmaking. My problem is that although jobs are being matched and executed, they are matched to a system only one at a time. I'd like Condor-G to have several jobs submitted to a system at the same time. I have a simple test job that can match only a single ClassAd:


executable = /bin/hostname
arguments = --fqdn
transfer_executable = false

output = hostname-match-$(CLUSTER)-$(PROCESS).out
error = hostname-match-$(CLUSTER)-$(PROCESS).err
log = hostname-match-$(CLUSTER)-$(PROCESS).log

universe = grid
x509userproxy=/home/utexas/staff/wsmith/.globus/userproxy.pem
grid_resource = $$(GridResource)
Requirements = (Name=="tacc.lonestar.serial")
globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))

queue 10



And the ClassAd in Condor is:

lslogin2$ condor_status -l tacc.lonestar.serial
MyType = "Machine"
TargetType = "Job"
Requirements = (TARGET.JobUniverse == 9)
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = TRUE
CurMatches = 0
Name = "tacc.lonestar.serial"
Machine = "gatekeeper.lonestar.tacc.teragrid.org"
StartdIpAddr = "<129.114.50.32>"
GridResource = "gt2 gatekeeper.lonestar.tacc.teragrid.org:2119/jobmanager-lsf"
State = "Unclaimed"
Activity = "Idle"
UpdateSequenceNumber = 1220367368
Arch = "X86_64"
OpSys = "LINUX"
LoadAvg = 0.865580
TotalMemory = 11840721
Memory = 1725537
Queue = "serial"
Priority = 0.030000
MaxWallTime = 720
MaxProcessors = 1
MyAddress = "<192.5.198.172:0>"
LastHeardFrom = 1220367369
UpdatesTotal = 1328
UpdatesSequenced = 0
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
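For reference, this is how I understand the $$() references in the submit file to be filled in from the matched ad. It is only a rough Python sketch of the substitution, not Condor's actual code: the function name and the dictionary are made up for illustration, and it handles only the plain $$(Attr) form.

```python
import re

# Subset of the machine ad shown above; only the attributes the job references.
MACHINE_AD = {
    "GridResource": "gt2 gatekeeper.lonestar.tacc.teragrid.org:2119/jobmanager-lsf",
    "Queue": "serial",
}

def expand_matched_attrs(text, machine_ad):
    """Replace each $$(Attr) in text with the value of Attr from machine_ad.

    Hypothetical helper for illustration; raises KeyError if the ad lacks Attr.
    """
    def lookup(match):
        return str(machine_ad[match.group(1)])
    return re.sub(r"\$\$\((\w+)\)", lookup, text)

if __name__ == "__main__":
    print(expand_matched_attrs("grid_resource = $$(GridResource)", MACHINE_AD))
    print(expand_matched_attrs(
        "globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))", MACHINE_AD))
```

So once a job matches tacc.lonestar.serial, it should be submitted to the lsf jobmanager on lonestar with queue=serial in its RSL.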


From the Condor manual, it seems like setting WantAdRevaluate to True will result in Condor matching multiple jobs to this system. What I'm seeing is that the jobs run one at a time on the system. Here's part of the MatchLog:

9/2 09:48:49 Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49 Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:53:51 Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:53:51 Rejected 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:58:52 Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:58:52 Rejected 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:03:53 Matched 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:03:53 Rejected 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:08:55 Matched 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:08:55 Rejected 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:13:56 Matched 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:13:56 Rejected 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:18:58 Matched 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:18:58 Rejected 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:24:00 Matched 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:24:00 Rejected 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:29:01 Matched 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:29:01 Rejected 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:34:02 Matched 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial

As you can see, all of the jobs get matched and run, but only one is matched roughly every 5 minutes (once per negotiator cycle?), and in each cycle the next job in the cluster is rejected with "no match found". The serial queue on lonestar was empty, so the jobs ran quickly.
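To check the interval rather than eyeball it, here is a small Python sketch that pulls the "Matched" timestamps out of MatchLog entries and prints the gap between consecutive matches. The regular expression is only inferred from the entries shown above, the function names are my own, and the year is an assumption because the log does not record one.

```python
import re
from datetime import datetime

ASSUMED_YEAR = 2008  # assumption: MatchLog timestamps carry no year

# First few entries copied from the MatchLog excerpt above.
MATCH_LOG = """\
9/2 09:48:49 Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49 Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:53:51 Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:53:51 Rejected 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:58:52 Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
"""

# month/day hour:minute:second, then the word "Matched" and the job id
MATCHED_RE = re.compile(r"^(\d+)/(\d+) (\d+):(\d+):(\d+) Matched (\S+)")

def match_times(log_text):
    """Return (job_id, datetime) for every 'Matched' entry, in log order."""
    result = []
    for line in log_text.splitlines():
        m = MATCHED_RE.match(line)
        if m:
            month, day, hour, minute, second = (int(g) for g in m.groups()[:5])
            stamp = datetime(ASSUMED_YEAR, month, day, hour, minute, second)
            result.append((m.group(6), stamp))
    return result

def match_gaps(log_text):
    """Return the seconds elapsed between consecutive 'Matched' entries."""
    stamps = [stamp for _, stamp in match_times(log_text)]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

if __name__ == "__main__":
    print(match_gaps(MATCH_LOG))
```

On the entries above the gaps come out at just over 300 seconds each, which fits one match per cycle.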

The collector and negotiator are from Condor 7.1.0. I sent an earlier query to the list about a STARTD_AD_REEVAL_EXPR error message in my NegotiatorLog that I don't think is related to this...


Thanks for the help,


Warren