
Re: [HTCondor-users] dagman evicting other jobs



Hi Jacek,

I work as a Research Computing Facilitator in the Center for High Throughput Computing on campus, which produces the HTCondor software.

You may get a more specific answer from the administrator of the WEI HTCondor pool, who will have configured its policies. In the CHTC pool, for example, we configure our execute servers (where jobs run) to allow running jobs to be preempted by higher-priority jobs after 72 hours. This effectively caps how long jobs are guaranteed to run (barring power/network interruptions, etc.). Such runtime caps are usually implemented to balance slot turnover and fairness, and to discourage very long jobs (longer jobs are more susceptible to other kinds of failure, and computational work can generally be run more efficiently by splitting one long job into multiple shorter ones).
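For illustration only, a policy along those lines might look roughly like this in an execute-server and negotiator configuration. This is a hypothetical sketch, not the WEI pool's (or CHTC's) actual settings; the 72-hour figure and the 1.2 priority factor are assumptions:

```
# Hypothetical sketch, not a real pool's configuration.
# Execute side: give a running job up to 72 hours of "retirement"
# before a preempting claim can actually kick it off the machine.
MAXJOBRETIREMENTTIME = 72 * 60 * 60

# Negotiator side: only preempt when the incoming user's priority is
# meaningfully better than the running user's (1.2 is an arbitrary factor).
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2
```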

There are other potential reasons for eviction, depending on the specific configuration, such as interrupting jobs that use far more memory or CPUs than they requested. At some point, HTCondor will evict such jobs automatically, if only to keep the execute server from crashing.
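As a sketch of what that kind of policy can look like (MemoryUsage and RequestMemory are standard HTCondor job ClassAd attributes, but the exact expression here is a hypothetical example, not Scarcity's actual configuration):

```
# Hypothetical schedd policy sketch: put a job on hold once its
# measured memory use exceeds what it requested (values in MB).
SYSTEM_PERIODIC_HOLD = MemoryUsage > RequestMemory
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested memory."
```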

If you're not sure who the administrator of the cluster is, let us know, and we can help make a connection since we're on campus. Otherwise, please let us know if we haven't helped you get to an answer, or if anything I've described isn't clear.

Best,
Lauren Michael

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing
University of Wisconsin - Madison
lmichael@xxxxxxxx
www.tinyurl.com/LMichaelCalendar
Discovery 2262, 608.316.4430

On Thu, Jun 22, 2017 at 7:30 AM, Jacek Kominek <jkominek@xxxxxxxx> wrote:
Hi,
My name is Jacek; I work at UW in Genetics as a bioinformatician, and I have the following problem: when I use our local WEI Scarcity cluster (running Condor 8.4.7), some of my jobs that are already running on the local pool (sometimes for days) get evicted by another person's condor_dagman jobs. Is this the expected behaviour? I know that Condor has its internal user-specific priorities based on usage etc., and yes, I am a higher-than-average-volume user on that pool, but I would expect that to kick in when there are competing jobs in the queue (i.e. other users' jobs getting started earlier than mine), rather than during job execution. I would very much appreciate some clarification, thank you.

Cheers,
-Jacek
-- 
Jacek Kominek
Department of Genetics
University of Wisconsin-Madison
425-G Henry Mall, Genetics/Biotechnology Center
Madison, WI  53706-1580
jkominek@xxxxxxxx
http://hittinger.genetics.wisc.edu

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/