Re: [Condor-users] Jobs interruption in the middle of running cause end results to failed

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Douglas,

Thanks for replying to my e-mail,

I started testing the pool again after the changes you suggested but now it looks like some nodes are rejecting the jobs due to the requirements of the jobs. This are jobs that I ran before the changes and they were no rejected by any of the CPU nodes. Before there were 8 machines that were rejecting the job after restarting the condor service on both only one accepted the jobs, leaving only one with the condition of rejecting jobs. Can the changes I made be responsible for this behavior? Is there a way to list the job requirement? Should I issue a condor_reconfig –all, what I did after the changes was to restart on each machine the condor service, starting from the negotiator machine to all the condor startd machines,

Thanks in advance for your input,

Sincerely

Alex

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Douglas Clayton
Sent: Thursday, February 19, 2009 1:02 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Jobs interruption in the middle of running causeend results to failed

Alex,

At minimum, to simply disable preemption, set PREEMPT = FALSE and RANK = 0 in the configuration for your startds, and PREEMPTION_REQUIREMENTS = FALSE for your negotiator. That will stop preempting based on user activity, better-ranked jobs, and user prio, respectively.

There is a helpful section explaining preemption in section 3.5 of the Condor manual at

http://www.cs.wisc.edu/condor/manual/v7.2/3_5Policy_Configuration.html#sec:Disabling_Preemption

Regards,

Doug

On Feb 19, 2009, at 10:33 AM, Alas, Alex [FEDI] wrote:

Hello to all of you,

I have a situation here. I have a Condor Pool of 22 CPU’s running version 7.05, distributed in 4 Windows 2003 boxes (each win2k3 box has 4 cpu’s) and 3 Windows XP boxes (each winxp box has 2 cpu’s) . Jobs runs fine while there is only one user running his jobs at a time. I have to clarify that we haven’t set any kind of priorities (job or user) when the jobs are been sent. The issues presents when two users send their jobs at the same time or one after the other, being more specific if one user sends a jobs first and minutes later the other one send his jobs while the first ones are running. The Central manager puts the first jobs on hold to take the jobs of the second user to run with higher priority but once those jobs of the first users were placed on idle the integrity of the data is corrupted and from there the rest of jobs once they come back to running mode is wrong. The same happen with the jobs of the second user, during that process of putting them on idle and running the data is mishandle it and it will degenerate the results.

Is there any setting I need to configure at the condor_config level of the central manager to stipulate that once a jobs is running on one node it should not be interrupted and need to finish to its completion?

Thanks for your answer\input in advance!!!

Respectfully,

Alex Alas

Systems Administrator
Fugro EarthData Inc.

Tel. 301-948-8550 x219 Fax 301-963-2064 E-mail: aalas@xxxxxxxxxxxxx

7320 Executive Way, Frederick, MD 21704

Website: http://www.fugroearthdata.com

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

===================================
Douglas Clayton

phone: 919.647.9648

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com

http://www.cyclecloud.com

Mailing List Archives

Public Access

Re: [Condor-users] Jobs interruption in the middle of running cause end results to failed