
Re: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/o



Hi David,

Thanks for your reply. I am still getting to know the system myself. To answer a few of your questions:


>>>>To understand whether i/o is a factor it is important to know how much raw data is being processed, i.e. what is the total size of the 20,000 files?

**********The total size of the 20,000 files is 16 TB.

What I am doing is this: I have a file, fulllist.txt, which contains the names and paths of the 20k input files. Since my cluster has 120 nodes, I split that list into 120 parts, so instead of one list of 20k input files I have 120 list files, each containing roughly 20,000/120 ≈ 167 files, one list per node.
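
For reference, such a split can be done with plain coreutils, roughly like this (a minimal sketch; the part_ prefix and the 167-lines-per-chunk count are only illustrative):

    # split the master list into chunks of ~167 lines each,
    # producing part_000, part_001, ... (about 120 list files in total)
    split -d -a 3 -l 167 fulllist.txt part_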

All 20k files are in the /rdata2 directory. When I submit the 120 jobs to the 120 nodes, each job (now holding ~167 files) reads its input from /rdata2, all in parallel. So each job needs approximately 16 TB / 120 ≈ 133 GB of input from /rdata2.

You said: "So if your average file size is 15 Megabytes or more that could explain the throughput limitation." That does point to the limitation you described.
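(For the record: 16 TB across 20,000 files is roughly 800 MB per file on average, well above the 15 MB figure you mention, so per-file I/O alone could explain the slowness.)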


PS: I am not able to run the following command:
"iostat -dx 10 300"

it says "iostat: command not found". Is this OS-specific? I am using Linux.
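
My guess is that it comes from an extra package that is not installed here; if it is the usual sysstat package, something like this would probably add it (assuming a Red Hat style system; on Debian/Ubuntu it would be apt-get):

    # iostat is shipped in the sysstat package on most Linux distributions
    yum install sysstat        # (as root) or: apt-get install sysstat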


The following is the df output. I run Condor jobs in the /disk directory, and /rdata2 is the directory containing all the input files.
****************
[bawa@t3nfs ~]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             10153988   5656168   3973704  59% /
/dev/mapper/vgsys-atlas
                     203147960  20286340 172375860  11% /nfs/t3nfs/share/atlas
/dev/mapper/vgsys-pilot
                     203147960    191892 192470308   1% /nfs/t3nfs/share/pilot
/dev/mapper/vgsys-osg
                      50786940    184272  47981228   1% /nfs/t3nfs/share/osg
/dev/mapper/vgsys-export_home
                      10157368    154236   9478844   2% /nfs/t3nfs/home
/dev/mapper/vgsys-opt--csu
                     101573920  32019424  64311616  34% /NFSv3exports/opt-csu
/dev/mapper/vgsys-home
                     253934980  18641784 222185996   8% /NFSv3exports/home
/dev/mapper/vgsys-tmp
                      10157368    160592   9472488   2% /tmp
/dev/mapper/vgsys-var
                      10157368   9293392    339688  97% /var
/dev/mapper/vgsys-opt
                      10157368   1922196   7710884  20% /opt
/dev/mapper/vgarchive-archive
                     1032090368   1500980 978162228   1% /NFSv3exports/archive
/dev/mapper/vgsys-vmsystems
                     182833140    191952 173204004   1% /vmsystems
/dev/mapper/vgsys-scratch
                     253934980   1567704 239260076   1% /disk
/dev/mapper/vgsys-cvmfs2
                      50786940  13433864  34731636  28% /var/cache/cvmfs2
/dev/mapper/vgsys-condor_log
                      10157368    161428   9471652   2% /var/log/condor
/dev/mapper/vgsys-condor_lib
                      10157368    211448   9421632   3% /var/lib/condor
/dev/mapper/vgsys-pandat3
                     203147960    191892 192470308   1% /nfs/t3nfs/share/pandat3-output
/dev/sdc1            14640611456 409333600 14231277856   3% /NFSv3exports/rdata1
/dev/sdd1            41012297692 19402289068 21610008624  48% /NFSv3exports/rdata2
tmpfs                 12336172         0  12336172   0% /dev/shm
xrootdfs              10153988   7633004   2520984  76% /xrootdfs/atlas
cvmfs2                38912000  13159700  25752301  34% /cvmfs/atlas.cern.ch
pt3head:/xdata       3565950376 386530112 2998280400  12% /nfs/t3head/xdata
pt3head:/etc/condor-etc
                      10157368    154256   9478824   2% /nfs/t3head/condor-etc
***********************************************************************************




On Wed, Apr 24, 2013 at 9:20 PM, David Hentchel <dhentchel@xxxxxxxxx> wrote:
I am not a Condor specialist, but I work in the area of performance engineering.

As your subject implies, slow response coupled with low CPU usage often indicates the system is i/o bound, but it is also possible there is some kind of resource lock that is forcing operations to be single-threaded when they don't need to be.
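
A quick, generic way to tell those two cases apart (nothing Condor-specific) is to watch the CPU breakdown while one of the jobs is running: a large iowait share with low user/system time points at i/o, whereas an almost idle CPU with no iowait points at lock contention or serialization. For example:

    # print system-wide statistics every 5 seconds; under the "cpu" heading,
    # watch the "wa" (iowait) column relative to "us" (user) and "sy" (system)
    vmstat 5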

My suggestions:

To understand whether i/o is a factor it is important to know how much raw data is being processed, i.e. what is the total size of the 20,000 files?  If it takes 40 minutes to process 160 files, that is 15 seconds per file.  Typical i/o subsystems can process data at rates ranging from 1 to 100 Megabytes per second (depends on device, read versus write, random versus sequential, block/packet size and other stuff).  So if your average file size is 15 Megabytes or more that could explain the throughput limitation.
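
A rough way to get that total, assuming the 20,000 paths sit in a plain text list (one path per line, no spaces in the paths; the fulllist.txt name here is just a placeholder):

    # sum the sizes of all files named in the list and print the total in GB
    xargs -a fulllist.txt stat -c %s | awk '{ s += $1 } END { printf "%.1f GB\n", s/1024/1024/1024 }'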

Next question is what kind of i/o, disk or network?  You can start by logging in to the machine where the files are stored and running "iostat -dx 10 300", which will report how many bytes are being read and written every 10 seconds for 300 intervals (roughly 50 minutes).  You can compare that to the specs for the disk, or simply copy a big file and time it yourself to determine whether that's near the limit.
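
For example, a crude sequential-read check against the disk that holds the data (the file name is just a placeholder; pick any large existing file, ideally one that has not been read recently so the page cache does not inflate the number):

    # read one big input file and let dd report the throughput at the end
    dd if=/rdata2/some_large_input_file of=/dev/null bs=1M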

Even if the disk capacity is not the problem there is the possibility that you are network bound.  You say this is a single-node test - but is there a chance that within that test the application does some network i/o?  If the files you are using are located on a remote disk (use the "df" command to see which filepaths are mounted from remote hosts) it could be the network, not the physical disk, causing the problem. You need to know whether the networks for your system are configured as 100 megabit versus 1 gigabit (high end systems have 10 gigabit networks) - but remember a "bit" is 1/8 of a "byte", so the 1 gigabit network has a limitation around 125 Megabytes per second.  While the test is running you can login to the machine and run "sar -n DEV 10 36" to check network traffic.
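
Two concrete checks, assuming the standard tools are installed and the interface is called eth0 (adjust to whatever "ifconfig" lists on your machines):

    # negotiated link speed of the NIC (100 Mb/s vs 1000 Mb/s vs 10000 Mb/s)
    ethtool eth0 | grep Speed
    # per-interface traffic, sampled every 10 seconds, 36 times (about 6 minutes)
    sar -n DEV 10 36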

The final possibility (since you know you're not CPU bound) is that there is contention on some other kind of system or application resource that is slowing down operations and/or causing work to be single-threaded across the run.  Pay particular attention to any external service (e.g. a web service call that runs slowly), but if the job is self-contained you'd have to use a profiling tool to find and tune any underlying bottlenecks.

More relevant, I think, is whether running Condor jobs in parallel could help you finish the batch faster.  If the average job takes 40 minutes, as your sample did, running all 120 jobs serially would take 120 x 40 minutes = 80 hours.  Since you indicated the complete run finishes in 24 hours, it is clear that Condor is succeeding in scheduling the jobs concurrently.  You could look into whether you could increase the number of condor nodes per host, or more likely figure out if there's a way you could spread the work over multiple machines.  For example, is there a good reason why all the files end up in the same location?

It would be desirable for condor to somehow track Disk capacity as it does CPU.  Also pretty nifty if condor tools made it easier to manage clusters of hosts, maybe a Group ClassAd that advertised the network speed and aggregate disk resource for a ParallelSchedulingGroup mapping to a physical machine cluster or rack.  If anyone knows of Condor features that help in these areas, I'd be keenly interested to learn about them.


dave


On Wed, Apr 24, 2013 at 9:14 AM, Dr. Harinder Singh Bawa <harinder.singh.bawa@xxxxxxxxx> wrote:

Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. Basically, I have approximately 20,000 input files in the /rdata2 directory.

/dev/sdd1              39T   19T   21T  48% /NFSv3exports/rdata2


I have a file (Full2013.list) containing the names and paths of the 20,000 input files, one per line. I split that file into 120 parts, one per job, so each job gets approximately 20,000/120 ≈ 166 files.

Under Condor, it takes 1 day for all the jobs to finish.


I ran one job interactively, running over one node: **** finishes in 40 min.

15126.0   bawa            4/23 04:56   0+03:50:16 R  0   317.4 parallel_90.sh



Statistics for comparison:-  
Interactively:-
==============
real    63m57.321s
user    42m17.957s
sys     1m24.413s


Statistics for Condor node:
==========
condor_q -analyze 15126.0


-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
---
15126.000:  Request is being serviced


The jobs have been running for 1 day now. If I look at the real CPU time of this job, it is:
[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime


-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
 ID      OWNER            SUBMITTED     CPU_TIME ST PRI SIZE CMD              
15126.0   bawa            4/23 04:56   0+00:06:47 R  0   317.4 parallel_90.sh


If I understand correctly, the CPU time (i.e. the time the job actually spent running on the CPU) is just 6 min 47 sec out of a run time of 3 hr 50 min. I suspect something serious is going on in data transfer (i/o).
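
I suppose the same numbers can also be pulled straight from the job ClassAd, e.g. something like:

    # CPU seconds consumed versus total wall-clock seconds for this job
    condor_q -l 15126.0 | egrep 'RemoteUserCpu|RemoteSysCpu|RemoteWallClockTime'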

Is there any suggestion on how to debug that?

Thanks
-Harinder






_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--

David Hentchel

Performance Engineer

www.nuodb.com

(617) 803 - 1193





--
Dr. Harinder Singh Bawa
Experimental High Energy Physics
ATLAS Experiment
@CERN, Geneva