
[Condor-users] Results of jobs coming back unexpectedly slowly



Dear all,

We have a Condor pool of more than 1500 execution nodes (Windows XP) and one Central Master server (Linux) from which we submit jobs.

The problem: when I submit 1000 jobs (each doing exactly the same small test computation, taking about 30 seconds at most), the results do not come back any faster than if I had run the jobs on a one-machine pool. They arrive slowly, one by one, with 5 or more seconds between them.

They do eventually get through, and the logs show that the jobs actually run on many different execution nodes.
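A rough back-of-the-envelope sketch of why this looks wrong: if the jobs really ran in parallel, the batch should finish in a few "waves" of roughly 30 seconds each, not at one result every 5 seconds. The sketch below assumes about 320 slots running jobs at once (the Claimed count from the condor_status output further down) and ignores matchmaking and file-transfer overhead; the helper names are hypothetical.

```python
import math


def expected_makespan(n_jobs: int, n_slots: int, job_seconds: float) -> float:
    """Idealised wall-clock time for n_jobs identical jobs on n_slots slots.

    Assumes every slot stays busy and ignores scheduling and transfer overhead.
    """
    waves = math.ceil(n_jobs / n_slots)  # rounds of jobs needed to cover the batch
    return waves * job_seconds


def serial_arrival_time(n_jobs: int, seconds_between_results: float) -> float:
    """Wall-clock time if results trickle back one at a time at a fixed interval."""
    return n_jobs * seconds_between_results


# Figures from the post; 320 parallel slots is an assumption taken from 'Claimed'.
print(expected_makespan(1000, 320, 30))  # 4 waves of 30 s -> 120
print(serial_arrival_time(1000, 5))      # one result per 5 s -> 5000 (about 83 minutes)
```

The gap between roughly 2 minutes and well over an hour is the symptom being asked about.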

From condor_status:
                    Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51  1592  1187     320        70      15          0        0

               Total  1592  1187     320        70      15          0        0
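One detail in these totals: slots in the Owner state are not available to Condor jobs at that moment, so only the remaining slots could run work. A minimal sketch that reads the Total row above and computes that figure (the parsing helper is hypothetical and assumes the column layout shown here):

```python
def parse_totals(header: str, row: str) -> dict[str, int]:
    """Map condor_status summary column names to the integer counts in one row."""
    names = header.split()
    fields = row.split()
    # The row begins with a label ("Total" or an arch/OS pair), so keep the last N fields.
    values = [int(v) for v in fields[-len(names):]]
    return dict(zip(names, values))


header = "Total Owner Claimed Unclaimed Matched Preempting Backfill"
row = "Total  1592  1187     320        70      15          0        0"

totals = parse_totals(header, row)
usable = totals["Total"] - totals["Owner"]  # slots not reserved for their owners
print(usable)  # 405
```

So at this moment roughly 405 of the 1592 slots could accept Condor jobs, of which 320 were claimed.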

Somewhat later:

From condor_q:

20 jobs; 566 idle, 354 running, 0 held


The submit machine is almost idle. condor_q reports 354 jobs running, yet these small jobs should run for only 30 seconds each and then return a result. I have now been waiting for hours. It looks as if the results are stuck on the execution nodes and cannot be delivered to the submit machine.
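To put a number on "waiting for hours": with 354 jobs running and 566 idle, and assuming the running count stays constant and every job takes about 30 seconds, the whole queue should drain in a few minutes at most. A minimal sketch that reads the idle/running/held counts from the condor_q summary line quoted above (the regex and helper names are hypothetical, and the leading job total is deliberately not used):

```python
import math
import re

# Matches the "<n> idle, <n> running, <n> held" part of a condor_q summary line.
SUMMARY = re.compile(r"(\d+) idle, (\d+) running, (\d+) held")


def parse_summary(line: str) -> dict[str, int]:
    """Extract idle/running/held counts from a condor_q summary line."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError(f"not a condor_q summary line: {line!r}")
    idle, running, held = map(int, m.groups())
    return {"idle": idle, "running": running, "held": held}


def expected_drain_seconds(idle: int, running: int, job_seconds: float) -> float:
    """Idealised time to finish all idle and running jobs at a constant running count."""
    if running == 0:
        raise ValueError("no running jobs; the queue cannot drain")
    waves = math.ceil((idle + running) / running)
    return waves * job_seconds


q = parse_summary("20 jobs; 566 idle, 354 running, 0 held")
print(expected_drain_seconds(q["idle"], q["running"], 30))  # 3 waves -> 90
```

A running count that stays in the hundreds for hours is therefore far outside what the job length alone would explain.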

Any help is appreciated.

Sincerely,

Luc de Zeeuw
Rotterdam University