
Re: [Condor-users] Condor configuration question



Steve,

I think I have a better understanding now. I changed the configuration like this:

[bala@vulcan condor-7.1.4]$ condor_config_val PREEMPT
(CurrentTime - JobStart) > 60
[bala@vulcan condor-7.1.4]$ condor_config_val MaxJobRetirementTime
1
[bala@vulcan condor-7.1.4]$  condor_config_val WANT_VACATE
(CurrentTime - JobStart) > 60
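For reference, that corresponds to roughly the following lines in my local config file (a sketch of the equivalent settings rather than a copy of the actual file):

  PREEMPT = (CurrentTime - JobStart) > 60
  WANT_VACATE = (CurrentTime - JobStart) > 60
  MaxJobRetirementTime = 1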

and from the logs, I see:

1/30 15:09:43 slot2: match_info called
1/30 15:09:44 slot2: Received match <10.0.0.2:33539>#1233004692#549#...
1/30 15:09:44 slot2: State change: match notification protocol successful
1/30 15:09:44 slot2: Changing state: Unclaimed -> Matched
1/30 15:09:44 slot2: Request accepted.
1/30 15:09:54 slot2: Remote owner is bala@xxxxxxxxxx
1/30 15:09:54 slot2: State change: claiming protocol successful
1/30 15:09:54 slot2: Changing state: Matched -> Claimed
1/30 15:09:54 slot2: Got activate_claim request from shadow (<10.0.0.105:51060>)
1/30 15:09:54 slot2: Remote job ID is 46.0
1/30 15:09:54 slot2: Got universe "STANDARD" (1) from request classad
1/30 15:09:54 slot2: State change: claim-activation protocol successful
1/30 15:09:54 slot2: Changing activity: Idle -> Busy

After a minute or so, PREEMPT and WANT_VACATE are true,

1/30 15:10:59 slot2: State change: PREEMPT is TRUE
1/30 15:10:59 slot2: Changing activity: Busy -> Retiring
1/30 15:10:59 slot2: State change: claim retirement ended/expired
1/30 15:10:59 slot2: State change: WANT_VACATE is TRUE
1/30 15:10:59 slot2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating

The job gets preempted,

1/30 15:11:01 slot2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
1/30 15:11:01 slot2: Got RELEASE_CLAIM while in Preempting state, ignoring.

I need to look into what the above two signals are and why they are being ignored,

1/30 15:11:01 Starter pid 9365 exited with status 0
1/30 15:11:01 slot2: State change: starter exited
1/30 15:11:01 slot2: State change: No preempting claim, returning to owner
1/30 15:11:01 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/30 15:11:01 slot2: State change: IS_OWNER is false
1/30 15:11:01 slot2: Changing state: Owner -> Unclaimed

And the job goes back to the Idle state, but still remains in the queue until the next time match_info is called,

1/30 15:19:43 slot2: match_info called
1/30 15:19:43 slot2: Received match <10.0.0.2:33539>#1233004692#551#...
1/30 15:19:44 slot2: State change: match notification protocol successful
1/30 15:19:44 slot2: Changing state: Unclaimed -> Matched
1/30 15:19:44 slot2: Request accepted.
1/30 15:19:54 slot2: Remote owner is bala@xxxxxxxxxx
1/30 15:19:54 slot2: State change: claiming protocol successful
1/30 15:19:54 slot2: Changing state: Matched -> Claimed
1/30 15:19:54 slot2: Got activate_claim request from shadow (<10.0.0.105:44826>)
1/30 15:19:54 slot2: Remote job ID is 46.0
1/30 15:19:54 slot2: Got universe "STANDARD" (1) from request classad
1/30 15:19:54 slot2: State change: claim-activation protocol successful
1/30 15:19:54 slot2: Changing activity: Idle -> Busy
1/30 15:20:59 slot2: State change: PREEMPT is TRUE
1/30 15:20:59 slot2: Changing activity: Busy -> Retiring
1/30 15:20:59 slot2: State change: claim retirement ended/expired
1/30 15:20:59 slot2: State change: WANT_VACATE is TRUE
1/30 15:20:59 slot2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
1/30 15:21:01 slot2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
1/30 15:21:01 slot2: Got RELEASE_CLAIM while in Preempting state, ignoring.
1/30 15:21:01 Starter pid 9383 exited with status 0
1/30 15:21:01 slot2: State change: starter exited
1/30 15:21:01 slot2: State change: No preempting claim, returning to owner
1/30 15:21:01 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/30 15:21:01 slot2: State change: IS_OWNER is false
1/30 15:21:01 slot2: Changing state: Owner -> Unclaimed


So every time preemption happens, the job gets removed from the execute machine but not from the Condor queue. Since my job has checkpoints, when it is scheduled to run again it continues from where it left off. I misunderstood the term PREEMPTION: it is not killing a job and removing it from the queue, but vacating the job from the execute machine and possibly rescheduling it again sometime later.
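If I ever do want jobs removed from the queue after they have run too long, I suppose something like a periodic remove expression on the schedd side would be the closer tool (this is just my guess at the shape of it, untested):

  SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && ((CurrentTime - JobCurrentStartDate) > 60)

or I can simply condor_rm the job by hand.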

Thanks.
.Bala.

Steven Timm wrote:
WANT_VACATE and PREEMPT are evaluated by the startd on the execute
machine. PREEMPTION_REQUIREMENTS are evaluated by the negotiator.
You should be looking at the StartLog to see what is actually happening.
Also, by default a negotiation cycle only runs every 20 seconds, so
PREEMPTION_REQUIREMENTS as you have written it is probably not
going to have time to kick in.
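If you want to check how often the negotiator runs on your central manager, look at NEGOTIATOR_INTERVAL (I don't know what it is set to on your pool, so treat this only as a way to verify):

  condor_config_val NEGOTIATOR_INTERVAL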

Steve Timm



On Fri, 30 Jan 2009, Balamurali Ananthan wrote:

Thanks for the reply Steve. Here is another question:

I want to preempt a job. Here is the configuration I have on both
the master/submit machine and the execute machine:

[bala@node2 condor-7.1.4]$ condor_config_val WANT_VACATE
(CurrentTime - JobStart) > 10

[bala@node2 condor-7.1.4]$ condor_config_val PREEMPTION_REQUIREMENTS
(CurrentTime - JobStart) > 10

[bala@node2 condor-7.1.4]$ condor_config_val PREEMPT
(CurrentTime - JobStart) > 10

[bala@node2 condor-7.1.4]$ condor_config_val MaxJobRetirementTime
10

I did a condor_reconfig on all the machines. With this configuration in
place, I was expecting every job to be preempted 10 seconds after it starts
and to then have 10 seconds to clean up before being killed. But the jobs I
submit, which usually run for about 3 minutes, run for around 30 minutes
(a lot of the time I see the job in the Idle state) and then complete,
which is not what I expected.

Any idea what is wrong with the configuration? I would like Condor to
kill my jobs in 20 seconds.
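For what it is worth, this is the effect I am trying to get, written out as startd settings (the first three are what I already have; the KILL line is only my guess at how to force the hard kill after 20 seconds, I have not actually set it):

  # start preempting 10 seconds after the job starts
  PREEMPT = (CurrentTime - JobStart) > 10
  # vacate (soft kill) rather than suspend at that point
  WANT_VACATE = (CurrentTime - JobStart) > 10
  # allow 10 more seconds of retirement to clean up
  MaxJobRetirementTime = 10
  # hard-kill anything still running after 20 seconds (my addition, untested)
  KILL = (CurrentTime - JobStart) > 20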

Thanks.
.Bala.

Steven Timm wrote:
There are 2 ways to do it:
a) Here is a preemption requirements statement much like
the UWCS default one.

[root@fcdf2x1 ~]# condor_config_val PREEMPTION_REQUIREMENTS
(((CurrentTime - EnteredCurrentState) > (1 * (10 * 60)) && RemoteUserPrio >
SubmittorPrio * 1.2) && RemoteUser =!= "cdf@xxxxxxxx" && RemoteUser =!=
"cdffgrid@xxxxxxxx" && RemoteUser =!= "cdfnam@xxxxxxxx" && RemoteUser =!=
"cdfdev@xxxxxxxx")

All you have to do is increase the time threshold from the 600 seconds
shown above to however much time you want, in seconds.
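For the 11 hours you asked about originally, that would look something like this (untested, and with the user-specific clauses dropped):

  PREEMPTION_REQUIREMENTS = ((CurrentTime - EnteredCurrentState) > (11 * 60 * 60)) \
      && RemoteUserPrio > SubmittorPrio * 1.2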

b) The second thing you can do is to use a nonzero MaxJobRetirementTime,
so things will still preempt, but the job will still have
MaxJobRetirementTime seconds to finish.

For both of the scenarios above, the machine RANK should be set to zero.
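Again for the 11-hour case, option b) would be roughly (untested):

  # 11 hours, in seconds
  MaxJobRetirementTime = 11 * 60 * 60
  RANK = 0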

Steve Timm



On Tue, 6 Jan 2009, Balamurali Ananthan wrote:

Greetings!

I am wondering if it is possible to configure Condor in such a way that a
remote user's job is not preempted before a certain amount of time has elapsed.

For example, userx submits a job that runs for more or less 10 hours. I want
to configure Condor in such a way that the job, once started on an execute
machine, is not disturbed for 11 hours.

If this is possible, could someone please point me to the right
documentation?

Thanks much!
.Bala.

--
Balamurali Ananthan (bala@xxxxxxxxxx) (720.974.1843)
Tech-X Corp, 5621 Arapahoe Ave, Suite A, Boulder, CO 80303