
Re: [Condor-users] problem with matched job sitting idle



Hi,

This issue is still causing problems for me. I've created a mini-scenario (independent of my larger project) that reproduces the problem described below, though in a slightly different way (***). I will gladly provide a tar.gz of the scenario (I'd rather not spam this mailing list with attachments). Here is more detailed output of the problem I am seeing:

 1.0   armenb   4/19 14:19   0+00:17:50 R  0   3.8  condor_dagman -f -
 4.0   armenb   4/19 14:22   0+00:15:05 R  0   3.8  condor_dagman -f -
 5.0   armenb   4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 7.0   armenb   4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 8.0   armenb   4/19 14:22   0+00:14:59 R  0   3.8  condor_dagman -f -
 9.0   armenb   4/19 14:22   0+00:04:49 R  0   0.0  mysleep 600 -n 0
10.0   armenb   4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
11.0   armenb   4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
12.0   armenb   4/19 14:33   0+00:00:00 I  0   0.0  mysleep2 600 -n 0

mysleep requires MY_RESOURCE_1, which is provided by VM1 of machine grid-2. mysleep2 requires MY_RESOURCE_2, which is provided by VM1 of machine grid-3. grid-3's VM1 is currently idle, so it should be able to run mysleep2 (job 12).
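
For context, the submit description for mysleep2 contains lines roughly like the following (a sketch of the relevant parts only; the complete files are in the tar.gz):

    # mysleep2 submit description (sketch); MY_RESOURCE_2 is the
    # attribute advertised by grid-3's VM1
    universe     = vanilla
    executable   = mysleep2
    arguments    = 600 -n 0
    requirements = (MY_RESOURCE_2 =?= True)
    queue

The mysleep jobs are identical except that they require MY_RESOURCE_1 instead.
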
condor_q has this to say about job 12:

012.000:  Run analysis summary.  Of 50 machines,
    49 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     1 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 are available to run your job

There is no mention of job "12.0" anywhere in the condor logs - only the following in the submit machine's spool directory:

local.grid-8/spool/job_queue.log:101 12.0 Job Machine
local.grid-8/spool/job_queue.log:103 12.0 GlobalJobId "grid-8.llan.ll.mit.edu#1145471586#12.0"
local.grid-8/spool/job_queue.log:103 12.0 ProcId 0

When job 9 finishes, condor_dagman adds another mysleep2 (job 13) to the pipeline, but it also remains idle in the queue:

 1.0   armenb   4/19 14:19   0+00:25:40 R  0   3.8  condor_dagman -f -
 4.0   armenb   4/19 14:22   0+00:22:55 R  0   3.8  condor_dagman -f -
 5.0   armenb   4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
 7.0   armenb   4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
 8.0   armenb   4/19 14:22   0+00:22:49 R  0   3.8  condor_dagman -f -
10.0   armenb   4/19 14:22   0+00:02:36 R  0   0.0  mysleep 600 -n 0
11.0   armenb   4/19 14:22   0+00:00:00 I  0   0.0  mysleep 600 -n 0
12.0   armenb   4/19 14:33   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
13.0   armenb   4/19 14:43   0+00:00:00 I  0   0.0  mysleep2 600 -n 0

(***) I should mention that in my actual application, job 9 would not terminate until job 12 terminated; that handshake is controlled by an external program that both job 9 and job 12 talk to. Combined with the scheduling problem described here, this causes deadlock.

*Finally*, only when job 10 exits does Condor fire up job 12 on grid-3:

 1.0   armenb   4/19 14:19   0+00:34:01 R  0   3.8  condor_dagman -f -
 4.0   armenb   4/19 14:22   0+00:31:16 R  0   3.8  condor_dagman -f -
 5.0   armenb   4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
 7.0   armenb   4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
 8.0   armenb   4/19 14:22   0+00:31:10 R  0   3.8  condor_dagman -f -
11.0   armenb   4/19 14:22   0+00:00:55 R  0   0.0  mysleep 600 -n 0
12.0   armenb   4/19 14:33   0+00:00:51 R  0   0.0  mysleep2 600 -n 0
13.0   armenb   4/19 14:43   0+00:00:00 I  0   0.0  mysleep2 600 -n 0
14.0   armenb   4/19 14:53   0+00:00:00 I  0   0.0  mysleep2 600 -n 0

Why doesn't Condor schedule an idle job whose required resource is available? How do I get Condor to be more opportunistic in its scheduling?

Any advice would be very helpful,

Thanks!

 - Armen

Armen Babikyan wrote:

Hi,

In my Condor setup, I'm using Condor's DAG functionality (DAGMan) to enforce the ordering dependencies between different programs and to maximize the throughput of the pipeline through my system. I'm having a problem, though:

In two consecutive stages of the pipeline (called B and C, respectively), interaction with an external hardware device occurs. The hardware is designed for pipelined control: iteration #N of stage B can occur while iteration #(N-1)'s stage C is still taking place. However, there is only one of these hardware resources available. I've written a proxy between Condor and this hardware, and I have two programs which (should) effectively run and exit right after each other.
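
Each iteration of the pipeline is submitted as a small DAG along these lines (a sketch; the node names and submit-file names here are placeholders):

    # one pipeline iteration (sketch; file names are placeholders)
    JOB    B  jobB.submit
    JOB    C  jobC.submit
    PARENT B  CHILD C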

My experiment generates lots of these "B -> C" DAGs. Of the several B jobs that appear as idle in 'condor_q', one gets to run while the others sit idle, waiting for the first B job to exit. When that B job exits, Condor schedules a C job and another B job.

When only one B -> C process is occurring, the system runs fine: B runs, B exits, Condor schedules C, C runs, C exits, and the pipeline continues.

The problem I'm having is that with more than one "B -> C" DAG, the second B job runs, but the first C job sits idle in 'condor_q' forever, and I'm not sure why. The single machine controlling this external hardware has two VMs on it (configured with NUM_CPUS = 2; it also has hyperthreading, but I'm fairly sure that shouldn't matter). I've made sure to define my two resources (in the machine and job configurations) and to add one resource to each of the VM1_STARTD_EXPRS and VM2_STARTD_EXPRS variables in the machine's config. In other words, job B and job C require different resources (e.g. JOB_B and JOB_C, the former provided by VM1 and the latter by VM2), roughly as sketched below.
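
The relevant pieces of the configuration look roughly like this (a sketch; JOB_B and JOB_C are the example attribute names from above, and the exact lines in my real config may differ slightly):

    # condor_config.local on the machine controlling the hardware (sketch)
    NUM_CPUS = 2
    JOB_B = True
    JOB_C = True
    # publish each resource attribute only in the ClassAd of the VM that provides it
    VM1_STARTD_EXPRS = JOB_B
    VM2_STARTD_EXPRS = JOB_C

The B and C submit files then contain requirements = (JOB_B =?= True) and requirements = (JOB_C =?= True), respectively.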

I looked up C's job number in the log files of the machines in my cluster, and none of them mention the job. I can only find mention of the job in the spool directory of the submitting machine. 'condor_q -analyze' has this to say about the C job that won't run:

129.000:  Run analysis summary.  Of 50 machines,
    49 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 match but are serving users with a better priority in the pool
     0 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     1 are available to run your job

Any ideas or advice would be most helpful.  Thanks!

 - Armen
--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796