
Re: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/o

I suspect that your NFS share is not optimized for your deployment or use case, and that this is what slows down reading and writing your files.
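To put a number on it, here is a small Python sketch that turns the two condor_q times quoted below (run time 0+03:50:16, CPU time 0+00:06:47) into a CPU efficiency, and does the same for the interactive `time` output. The function and variable names are my own; only the figures come from your message.

```python
def condor_time_to_seconds(stamp: str) -> int:
    """Convert HTCondor's 'days+HH:MM:SS' format to seconds."""
    days, clock = stamp.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


# Figures from the condor_q output for job 15126.0.
run_time = condor_time_to_seconds("0+03:50:16")  # wall-clock run time
cpu_time = condor_time_to_seconds("0+00:06:47")  # from condor_q -cputime
efficiency = cpu_time / run_time
print(f"Condor job CPU efficiency: {efficiency:.1%}")  # prints 2.9%

# Figures from the `time` output (real, user, sys), in seconds.
real = 63 * 60 + 57.321
user = 42 * 60 + 17.957
sys_time = 1 * 60 + 24.413
interactive_efficiency = (user + sys_time) / real
print(f"Interactive CPU efficiency: {interactive_efficiency:.1%}")  # prints 68.3%
```

A job that keeps the CPU busy for roughly 3% of its run time is spending nearly all of it waiting, which is consistent with many jobs contending for the same share.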


If the share is mounted on all machines, make sure 'should_transfer_files = NO' is set in your submit file too. If you still see long wait times, you can enforce concurrency limits on your jobs so they don't all hit the same shared resource at once.
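As a rough sketch of both settings together, the Python snippet below just renders a submit description and the matching configuration line. The limit name `rdata2` and the cap of 20 are placeholders I made up, not values tuned for your cluster; `parallel_90.sh` is taken from your condor_q output. The `<NAME>_LIMIT` line belongs in the central manager's (negotiator's) configuration, not in the submit file.

```python
# Sketch: render an HTCondor submit description that reads input directly
# from the shared NFS mount and tags the job with a concurrency limit.
# "rdata2" and the cap of 20 are hypothetical placeholders.

SUBMIT_TEMPLATE = """\
universe              = vanilla
executable            = {executable}
should_transfer_files = NO
concurrency_limits    = {limit_name}
queue
"""

# Matching line for the central manager's (negotiator's) configuration.
CONFIG_TEMPLATE = "{limit_upper}_LIMIT = {max_running}\n"


def render_submit(executable: str, limit_name: str) -> str:
    """Fill in the submit description for one job script."""
    return SUBMIT_TEMPLATE.format(executable=executable, limit_name=limit_name)


def render_config(limit_name: str, max_running: int) -> str:
    """Fill in the configuration line that caps jobs holding the limit."""
    return CONFIG_TEMPLATE.format(
        limit_upper=limit_name.upper(), max_running=max_running
    )


submit_text = render_submit("parallel_90.sh", "rdata2")
config_text = render_config("rdata2", 20)
print(submit_text)
print(config_text)
```

With a cap like this, at most 20 of the 120 jobs would run at the same time; the rest stay idle in the queue until a slot under the limit frees up. The right cap is something you would have to find by experiment.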

Long term, you may want to look into other distributed filesystems (e.g. Gluster, HDFS, QFS) to reduce the load on a single source.


From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa@xxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Wednesday, April 24, 2013 8:14:31 AM
Subject: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/o

Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. I have approx. 20,000 input files in the /rdata2 directory.

/dev/sdd1              39T   19T   21T  48% /NFSv3exports/rdata2

I have a file (Full2013.list) containing the names and paths of the 20,000 input files, one file per line. I split that list into 120 parts, one per job, so each job has approx. 20,000/120 = 166 files.

In Condor, it's taking 1 day to finish my jobs.

I ran one job interactively on one node: it finishes in 40 min.

15126.0   bawa            4/23 04:56   0+03:50:16 R  0   317.4 parallel_90.sh

Statistics for comparison:
real    63m57.321s
user    42m17.957s
sys     1m24.413s

Statistics for the Condor node:

condor_q -analyze 15126.0

-- Submitter: t3nfs.atlas.csufresno.edu : <> : t3nfs.atlas.csufresno.edu
15126.000:  Request is being serviced

The jobs have been running for 1 day. If I look at the real CPU time of this job, it is:
[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime

-- Submitter: t3nfs.atlas.csufresno.edu : <> : t3nfs.atlas.csufresno.edu
 ID      OWNER            SUBMITTED     CPU_TIME ST PRI SIZE CMD              
15126.0   bawa            4/23 04:56   0+00:06:47 R  0   317.4 parallel_90.sh

If I understand correctly, the CPU time (the time the CPU was actually busy) is just 6 min 47 sec out of a run time of 3 hr 50 min. I suspect something serious is going on in data transfer (I/O).

Is there any suggestion for how to debug that?

