
[HTCondor-users] Most of the time in Condor jobs gets wasted in I/o




Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. I have approximately 20,000 input files in the /rdata2 directory:

/dev/sdd1              39T   19T   21T  48% /NFSv3exports/rdata2


I have a list file (Full2013.list) containing the full paths of the 20,000 input files, one per line. I split that list into 120 parts, one per job, so each job processes approximately 20,000/120 = 166 files.
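
To make the split concrete, here is a minimal Python sketch of dividing a list of paths into 120 near-equal parts. It only illustrates the arithmetic and is not the script actually used for these jobs. The function name `split_list` and the placeholder file names are hypothetical.

```python
def split_list(lines, n_parts):
    """Split lines into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), n_parts)
    chunks = []
    start = 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


# Placeholder paths standing in for the contents of Full2013.list (hypothetical names).
paths = [f"/rdata2/input_{i}" for i in range(20000)]
chunks = split_list(paths, 120)
print(len(chunks), min(len(c) for c in chunks), max(len(c) for c in chunks))
```

Since 20,000 is not an exact multiple of 120, some parts hold 166 files and the rest hold 167.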

Under Condor, it is taking 1 day for my jobs to finish.


I ran one job interactively on a single node: it finishes in 40 min.

15126.0   bawa            4/23 04:56   0+03:50:16 R  0   317.4 parallel_90.sh



Statistics for comparison:

Interactively:
==============
real    63m57.321s
user    42m17.957s
sys     1m24.413s


Statistics for Condor node:
===========================
condor_q -analyze 15126.0


-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
---
15126.000:  Request is being serviced


The jobs have been running for 1 day. If I look at the actual CPU time of this job, it is:
[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime


-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
 ID      OWNER            SUBMITTED     CPU_TIME ST PRI SIZE CMD              
15126.0   bawa            4/23 04:56   0+00:06:47 R  0   317.4 parallel_90.sh


If I understand correctly, the CPU time (the time the CPU actually spent working on the job) is just 6 min 47 sec out of a run time of 3 hr 50 min. I suspect something is seriously wrong with data transfer (I/O).
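
To put a number on that ratio, here is a small Python sketch that converts the `D+HH:MM:SS` values shown by condor_q into seconds and divides CPU time by run time. The helper names `condor_time_to_seconds` and `cpu_efficiency` are hypothetical, and the inputs are simply the two values copied from the condor_q output above.

```python
def condor_time_to_seconds(text):
    """Convert a condor_q style duration 'D+HH:MM:SS' into seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, run_time):
    """Fraction of the run time that was spent on the CPU."""
    return condor_time_to_seconds(cpu_time) / condor_time_to_seconds(run_time)


# Values taken from the two condor_q listings for job 15126.0.
eff = cpu_efficiency("0+00:06:47", "0+03:50:16")
print(f"CPU efficiency under Condor: {eff:.1%}")
```

That is 407 s of CPU in 13,816 s of run time, about 3%. The interactive run gives (user + sys) / real of roughly 68%, so under Condor the job seems to spend nearly all of its time waiting rather than computing.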

Are there any suggestions on how to debug this?

Thanks
-Harinder