Re: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/o

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

I am not a Condor specialist, but I work in the area of performance engineering.

As your subject implies, slow response coupled with low CPU usage often indicates the system is i/o bound, but it is also possible there is some kind of resource lock that is forcing operations to be single-threaded when they don't need to be.
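To put a number on "low CPU usage", the two condor_q lines in the quoted message below can be compared directly. This is only a sketch: it assumes the times are in condor_q's days+HH:MM:SS layout, and the function names are made up for illustration.

```python
def condor_time_to_seconds(text):
    """Convert a 'days+HH:MM:SS' string (as printed by condor_q) to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, run_time):
    """Fraction of the wall-clock run time that was spent on the CPU."""
    return condor_time_to_seconds(cpu_time) / condor_time_to_seconds(run_time)


# Values copied from the condor_q output for job 15126.0 in the quoted message.
ratio = cpu_efficiency("0+00:06:47", "0+03:50:16")
print(f"CPU time is {ratio:.1%} of wall-clock run time")
```

A ratio of roughly 3 percent means the job spends nearly all of its time waiting on something other than the CPU.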

My suggestions:

To understand whether i/o is a factor it is important to know how much raw data is being processed, i.e. what is the total size of the 20,000 files? If it takes 40 minutes to process roughly 160 files, that is about 15 seconds per file. Typical i/o subsystems can process data at rates ranging from 1 to 100 Megabytes per second (depends on device, read versus write, random versus sequential, block/packet size and other stuff). So at the slow end of that range (1 Megabyte per second), an average file size of 15 Megabytes or more could by itself explain the throughput limitation; at faster rates the files would have to be proportionally larger.
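The arithmetic above can be sketched as follows. The function names are illustrative only, and the three rates are just sample points from the 1 to 100 Megabytes per second range, not measurements of your system.

```python
def seconds_per_file(total_minutes, file_count):
    """Average wall-clock seconds spent on each file."""
    return total_minutes * 60 / file_count


def breakeven_file_size_mb(secs_per_file, throughput_mb_per_s):
    """Average file size (MB) at which pure i/o at the given rate
    would account for all of the observed time per file."""
    return secs_per_file * throughput_mb_per_s


per_file = seconds_per_file(40, 160)
print(f"{per_file:.0f} seconds per file")
for rate in (1, 10, 100):  # MB/s, sample points only
    size = breakeven_file_size_mb(per_file, rate)
    print(f"at {rate} MB/s, i/o alone explains the run if files average {size:.0f} MB")
```

If the real average file size is well below the break-even size for your measured i/o rate, raw throughput is probably not the whole story.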

Next question is what kind of i/o, disk or network? You can start by logging in to the machine where the files are stored and running "iostat -dx 10 30", which reports read and write throughput for each device every 10 seconds over five minutes. You can compare that to the specs for the disk, or simply do a copy of a big file and time it yourself to determine whether that's near the limit.
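For the "time it yourself" check, a minimal sketch of a timed sequential read is shown below. The function name and the 1 MB block size are arbitrary choices for illustration. Note the assumption that the file is not already in the operating system's page cache: use a file that has not been read recently (ideally one larger than RAM), otherwise the result will look far better than the disk really is.

```python
import time


def measure_read_throughput(path, block_size=1024 * 1024):
    """Read a file sequentially and return (megabytes, seconds, MB per second)."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    megabytes = total_bytes / 1e6
    rate = megabytes / elapsed if elapsed > 0 else float("inf")
    return megabytes, elapsed, rate
```

Calling measure_read_throughput on one of the larger input files, once from the file server and once from a worker node, gives two numbers to compare against the iostat output.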

Even if the disk capacity is not the problem there is the possibility that you are network bound. You say this is a single-node test - but is there a chance that within that test the application does some network i/o? If the files you are using are located on a remote disk (use the "df" command to see which filepaths are mounted from remote hosts) it could be the network, not the physical disk, causing the problem. You need to know whether the networks for your system are configured as 100 megabit versus 1 gigabit per second (high end systems have 10 gigabit networks) - but remember a "bit" is 1/8 of a "byte", so the 1 gigabit network has a limitation around 125 Megabytes per second. While the test is running you can log in to the machine and run "sar -n DEV 10 36" to check network traffic.
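The bit-to-byte conversion, and what it implies when many jobs read from the same server, can be sketched like this. The per-job figure rests on an assumption that is not confirmed anywhere in this thread: that all 120 jobs pull their input through a single network link on the file server and share it evenly. Protocol overhead is ignored, so real rates will be somewhat lower.

```python
def link_ceiling_mb_per_s(megabits_per_s):
    """Theoretical ceiling of a link in Megabytes per second (8 bits per byte)."""
    return megabits_per_s / 8


def per_job_share_mb_per_s(megabits_per_s, job_count):
    """Even share of one link per job, assuming all jobs read through it at once."""
    return link_ceiling_mb_per_s(megabits_per_s) / job_count


for label, megabits in (("100 megabit", 100), ("1 gigabit", 1000), ("10 gigabit", 10000)):
    ceiling = link_ceiling_mb_per_s(megabits)
    share = per_job_share_mb_per_s(megabits, 120)
    print(f"{label}: {ceiling:.1f} MB/s total, {share:.2f} MB/s per job with 120 jobs")
```

Under that assumption a 1 gigabit link leaves each of 120 concurrent jobs only about 1 Megabyte per second, which is at the very bottom of the range mentioned earlier.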

The final possibility (since you know you're not CPU bound) is that there is contention on some other kind of system or application resource that is slowing down operations and/or causing work to be single-threaded across the run. Pay particular attention to any external service (e.g. a web service call that runs slowly) but if the job is self-contained you'd have to use a profiling tool to find any opportunities to tune any underlying bottlenecks.

More relevant, I think, is whether running Condor jobs in parallel could help you finish the batch faster. If the average job takes 40 minutes, as your sample did, running all 120 jobs one after another would take 120 x 40 minutes = 80 hours. Since you indicated the complete run finishes in 24 hours, it is clear that Condor is succeeding in scheduling the jobs concurrently. You could look into whether you could increase the number of condor nodes per host or, more likely, figure out if there's a way you could spread the work over multiple machines. For example, is there a good reason why all the files end up in the same location?
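The same estimate can be turned into an "effective concurrency" figure. This is a rough sketch with made-up function names, and it assumes every job would take the 40 minutes of the interactive sample if it ran alone.

```python
def serial_hours(job_count, minutes_per_job):
    """Hours needed if the jobs ran strictly one after another."""
    return job_count * minutes_per_job / 60


def effective_concurrency(job_count, minutes_per_job, observed_hours):
    """How many jobs' worth of progress the batch makes at once, on average."""
    return serial_hours(job_count, minutes_per_job) / observed_hours


print(f"serial estimate: {serial_hours(120, 40):.0f} hours")
print(f"effective concurrency over a 24 hour run: {effective_concurrency(120, 40, 24):.1f}")
```

An effective concurrency of a little over 3 across 120 submitted jobs suggests that most of them are waiting on a shared resource for most of the run.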

It would be desirable for condor to somehow track Disk capacity as it does CPU. It would also be pretty nifty if condor tools made it easier to manage clusters of hosts, maybe a Group ClassAd that advertised the network speed and aggregate disk resource for a ParallelSchedulingGroup mapping to a physical machine cluster or rack. If anyone knows of Condor features that help in these areas, I'd be keenly interested to learn about them.

dave

On Wed, Apr 24, 2013 at 9:14 AM, Dr. Harinder Singh Bawa <harinder.singh.bawa@xxxxxxxxx> wrote:

Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. I have approx. 20,000 input files in the /rdata2 dir.

/dev/sdd1              39T   19T   21T 48% /NFSv3exports/rdata2

I have a file (Full2013.list) containing the names and paths of the 20,000 input files, one per line. I split that file into 120 parts, one per job, so each job has approx. 20,000/120 = 166 files.

In Condor, it's taking 1 day to finish my jobs.

I ran one job interactively on one node: it finishes in 40 min.

15126.0   bawa            4/23 04:56   0+03:50:16 R 0   317.4 parallel_90.sh

Statistics for comparison:-
Interactively:-
==============
real    63m57.321s
user    42m17.957s
sys     1m24.413s

Statistics for Condor node:-
==========
condor_q -analyze 15126.0

-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
---
15126.000: Request is being serviced

The jobs have been running for 1 day. If I look at the real CPU time of this job, it is:
[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime

-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
ID      OWNER            SUBMITTED     CPU_TIME ST PRI SIZE CMD
15126.0   bawa            4/23 04:56   0+00:06:47 R 0   317.4 parallel_90.sh

If I understand correctly, the CPU time (the time the CPU was actually running the job) is just 6 min 47 sec out of a run time of 3 hr 50 min. I suspect something is seriously wrong with the data transfer (i/o).

Is there any suggestion on how to debug that?

Thanks
-Harinder

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

David Hentchel

Performance Engineer

www.nuodb.com

(617) 803 - 1193
