[Condor-users] Evaluating Condor 6.8.2 (GROUP, RANK)



Hello,

I just installed Condor 6.8.2 and I am testing it on a single Windows 2000 machine.

I'm using the installed condor_config file, except for adding/changing the following:
RANK, PREEMPTION_REQUIREMENTS = TRUE, GROUP_NAMES = group_test1, group_test2, GROUP_PRIO_FACTOR_<groupname>, NEGOTIATOR_INTERVAL = 30.
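
Spelled out as a condor_config fragment, those changes look roughly like the sketch below. The two priority factor values are placeholders chosen only for illustration (they are not the values from my setup), and the RANK line is the variant from item 2 further down.

```
# Sketch of the local additions to condor_config.
# The 2.0 and 4.0 factors are illustrative placeholders, not the real values.
GROUP_NAMES = group_test1, group_test2
GROUP_PRIO_FACTOR_group_test1 = 2.0
GROUP_PRIO_FACTOR_group_test2 = 4.0

PREEMPTION_REQUIREMENTS = TRUE
NEGOTIATOR_INTERVAL = 30

# Startd rank: prefer jobs whose AccountingGroup is group_test1 (item 2 variant).
RANK = (AccountingGroup == "group_test1")*20
```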

I have encountered a few unexpected things/problems.  Could you please answer the following questions?

1. Does GROUP_PRIO_FACTOR_<groupname> work at all?  When I run condor_userprio -all, I don't see the values
I specified in condor_config, only 1.0 (the default). So I have to use condor_userprio to set the
priority factors manually.

TESTING RANK:

I set the effective priorities of "group_test1", "group_test2", and "myusername" to be the same.

2. I use RANK = (AccountingGroup == "group_test1")*20 and run a job with +AccountingGroup = "group_test",
but condor_status -l shows CurrentRank = 0.0.  The job from group_test does preempt a job submitted by myusername
without a group. Why is CurrentRank zero then?

3. I use RANK = (AccountingGroup == "group_test1@xxxxxxxxxxxx")*20 and run a job with +AccountingGroup = "group_test1",
and condor_status -l does show CurrentRank = 20.0 (although I think it sometimes shows 0.0).  HOWEVER, the
job from group_test does NOT preempt a job submitted by myusername without a group. I would have thought that the
AccountingGroup expression does not evaluate correctly (because of the domain name) and that this
would keep preemption from occurring, BUT then why does condor_status -l show 20.0 while the job still can't preempt?!
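
For reference, the submit side of these tests looks roughly like the sketch below. Only the +AccountingGroup line comes from my actual tests; the universe, executable name, and queue statement are placeholders standing in for an ordinary test job.

```
# Sketch of a submit description file for the item 3 test.
# test_job.bat is a hypothetical placeholder executable.
universe         = vanilla
executable       = test_job.bat
+AccountingGroup = "group_test1"
queue
```

The comparison job from myusername uses the same kind of file without the +AccountingGroup line.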

4. RANK does not seem to work for preferring one group over another, because a job from group_test2 is not preempted by a job from group_test1 (regardless of whether RANK is set as in 2 or 3 above).  It does work for preferring a group over a user not using a group (myusername in this case).  Is this true? Why?

From time to time I see an "Error evaluating rank." message in the StartLog, so there is definitely a problem evaluating the rank.

5. I have observed the following a couple of times:
A job stays Idle for a long time (about 20 minutes) until it finally starts running.
Do you know what the source of the problem may be? Is it maybe related to the RANK issue?

NegotiatorLog shows "Rejected 109.0 group_test2@... <...>: no match" for several negotiation cycles, until
the job is finally matched about 20 minutes later.

During that time, StartLog shows the following:

loadavg thread died, restarting. (exit code=2)
no loadavg samples this minute, maybe thread died???
...
DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
Called deactivate_claim_forcibly()

DaemonCore: received command 60011 (DC_NOP), ...
Starter pid 592 exited with status 0
State change: starter exited
Changing activity: Busy -> Idle
...
DaemonCore: received command 443 (RELEASE_CLAIM), ...
State change: received RELEASE_CLAIM command
Changing state and activity: Claimed/Idle -> Preempting/Vacating
State change: No preempting claim, returning to owner
Changing state and activity: Preempting/Vacating -> Owner/Idle
State change; IS_OWNER is false
Changing state: Owner -> Unclaimed
...
DaemonCore: received command 443 (RELEASE_CLAIM), ...
Warning: can't find resource with ClaimId (<>#1162331881#5)
loadavg thread died, restarting. (exit code=2)
no loadavg samples this minute, maybe thread died???
...
DaemonCore: received command 440 (MATCH_INFO), ...
match_info called
Received match <>#1162331881#7
State change: match notification protocol successful
Changing change: Unclaimed -> Matched
...
DaemonCore: received command 442 (REQUEST_CLAIM), ...
Request accepted
Remote owner is myusername@ ...
State change: claiming protocol successful
Changing state: Matched -> Claimed
loadavg thread died, restarting. (exit code=2)
no loadavg samples this minute, maybe thread died???
...

6. This may also be related to items 2-5: When I see an Idle job and do: condor_q -analyze job_number,
I've seen the following message:

WARNING: Be advised: No resources matched request's constraints. Check the Requirements expression below:
...
WARNING: Be advised: Request job_number did not match any resources constraints.

However, a while later and without my doing anything at all, the job starts running!

Another message I get (sometimes after the one I just reproduced above) is:
1 match but rejected the job for unknown reasons.

But again, a while later and without my doing anything at all, the job starts running!

And yet other times (everything the same as when I get the messages I just pointed out), 
the job starts running right away!

This Condor behavior seems very erratic, even though I'm just running on one machine!

Please help me understand all this (it's very confusing).  Any pointers would be appreciated.

Thank you very much,

Roberto