Re: [Condor-users] Condor performance problem



hi hao...

with regards to condor, if i understand your comments, you might find that
condor doesn't really give you the parallelization that you're trying to
achieve. condor appears to provide good parallelization, if you have the
'num_cpus' set to something appropriate, and your jobs are sufficiently
long.

from my testing with short jobs, i found that condor is relatively good at
the job/load balancing aspects across the network. however, because my jobs
are short in duration, i wasn't able to get condor to run many jobs in
parallel, even though i was using the 'testing' mode of condor.

to get around these issues, i created an intermediate shell process that
essentially runs multiple processes on a given node of the condor system.
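
a minimal sketch of that kind of wrapper, written in python rather than
shell. it is an illustration only, not the script described above: the name
run_batch and the way the commands are passed in are assumptions. the idea
is that one condor job launches several short tasks on the node it lands on,
so the per-job startup delay is paid once for the whole batch.

```python
import subprocess
import sys


def run_batch(commands):
    """Start every command at once on this node, then wait for all of them.

    `commands` is a list of argument lists, e.g. [["./task", "1"], ...].
    Returns the exit codes in the same order as the commands.
    """
    # Launch all tasks first so they run side by side.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Then collect the exit status of each one.
    return [proc.wait() for proc in procs]
```

the condor submit file would then name this wrapper as its executable, and
the wrapper would be given the list of short tasks to run. a nonzero entry
in the returned list marks a task that failed.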

regards,

-bruce


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Hao Liu
Sent: Monday, January 15, 2007 8:16 AM
To: Condor-users@xxxxxxxxxxx
Subject: [Condor-users] Condor performance problem


Hello everyone:

I have noticed what might be a common performance problem in Condor. For
example, when I submit 10 jobs to a Condor pool, even if there are more than
enough idle nodes at that moment (50 nodes), not all 10 jobs start running
immediately. Only some of them start right away, while the others start
later. Here is one log:

***************************************************
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.001.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.002.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.003.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.005.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.007.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.008.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.009.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
001 (126.001.000) 01/12 16:20:52 Job executing on host: <128.16.13.27:33551>
...
001 (126.002.000) 01/12 16:20:54 Job executing on host: <128.16.13.42:33469>
...
001 (126.003.000) 01/12 16:20:56 Job executing on host: <128.16.9.23:33266>
...
005 (126.000.000) 01/12 16:20:58 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.006.000) 01/12 16:21:00 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.005.000) 01/12 16:21:00 Job executing on host: <128.16.13.38:34509>
...
001 (126.007.000) 01/12 16:21:03 Job executing on host: <128.16.13.34:33987>
...
005 (126.001.000) 01/12 16:21:03 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.008.000) 01/12 16:21:04 Job executing on host: <128.16.13.37:33208>
...
005 (126.002.000) 01/12 16:21:04 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.003.000) 01/12 16:21:06 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.009.000) 01/12 16:21:07 Job executing on host: <128.16.13.28:34609>
...
001 (126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
...
005 (126.005.000) 01/12 16:21:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.007.000) 01/12 16:21:13 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.008.000) 01/12 16:21:14 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.009.000) 01/12 16:21:17 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.004.000) 01/12 16:21:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
******************************************************

This is not a big problem if all the jobs are relatively long, but if the
jobs are very short compared with the delay and there are huge numbers of
them, it becomes an obvious problem.
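
The start delay per job can be read off a log like the one above. The
following is a rough Python sketch, assuming each event line begins with a
three-digit event code (000 for submitted, 001 for executing, as in the log),
then the job id in parentheses, then the date and time. The function name
start_delays, the SAMPLE text, and the year argument are assumptions made for
illustration; the log itself does not record a year.

```python
import re
from datetime import datetime

# Event code, (cluster.proc.subproc), MM/DD, HH:MM:SS at the start of a line.
# Stray spaces inside the parentheses are tolerated.
EVENT_RE = re.compile(
    r"^(\d{3}) \(\s*(\d+)\.(\d+)\.\d+\s*\) (\d\d)/(\d\d) (\d\d):(\d\d):(\d\d)"
)

SAMPLE = """\
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
005 (126.000.000) 01/12 16:20:58 Job terminated.
        (1) Normal termination (return value 0)
...
001 ( 126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
"""


def start_delays(log_text, year=2007):
    """Map (cluster, proc) to seconds between submission and first execution."""
    submitted = {}
    delays = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # separator or detail line, not an event header
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        month, day, hour, minute, second = (int(g) for g in match.groups()[3:])
        when = datetime(year, month, day, hour, minute, second)
        if code == "000":
            submitted[job] = when
        elif code == "001" and job in submitted and job not in delays:
            delays[job] = (when - submitted[job]).total_seconds()
    return delays
```

Calling start_delays(SAMPLE) gives 5 seconds for job 126.0, 7 seconds for
job 126.6 and 25 seconds for job 126.4, which matches the timestamps in the
log: the jobs are all submitted at 16:20:43 but start a few seconds apart.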

So, could anybody tell me why this problem happens? Is it because the
matchmaking process can only handle one job at a time?