
[Condor-users] Condor performance problem



Hello everyone:

I have noticed what might be a common performance problem in Condor. For example, when I submit 10 jobs to a Condor pool, not all 10 jobs start immediately, even though there are more than enough idle nodes at that moment (50 nodes). Only some of them start right away; the others start later. Here is one log:

***************************************************
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.001.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.002.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.003.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.005.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.007.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.008.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.009.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
001 (126.001.000) 01/12 16:20:52 Job executing on host: <128.16.13.27:33551>
...
001 (126.002.000) 01/12 16:20:54 Job executing on host: <128.16.13.42:33469>
...
001 (126.003.000) 01/12 16:20:56 Job executing on host: <128.16.9.23:33266>
...
005 (126.000.000) 01/12 16:20:58 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.006.000) 01/12 16:21:00 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.005.000) 01/12 16:21:00 Job executing on host: <128.16.13.38:34509>
...
001 (126.007.000) 01/12 16:21:03 Job executing on host: <128.16.13.34:33987>
...
005 (126.001.000) 01/12 16:21:03 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.008.000) 01/12 16:21:04 Job executing on host: <128.16.13.37:33208>
...
005 (126.002.000) 01/12 16:21:04 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.003.000) 01/12 16:21:06 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.009.000) 01/12 16:21:07 Job executing on host: <128.16.13.28:34609>
...
001 (126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
...
005 (126.005.000) 01/12 16:21:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.007.000) 01/12 16:21:13 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.008.000) 01/12 16:21:14 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.009.000) 01/12 16:21:17 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.004.000) 01/12 16:21:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
******************************************************

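This is not a big problem if all the jobs are relatively long, but if the jobs are very short compared with the delay and there are a huge number of them, it becomes an obvious problem. In the log above, the last job started 25 seconds after submission, while each job ran for only about 10 seconds.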
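
To put numbers on the delay, here is a rough Python sketch that reads a job log like the one above and reports how long each job waited between its "Job submitted" event (code 000) and its "Job executing" event (code 001). The function name start_delays, the regular expression, and the placeholder year are my own assumptions for illustration (the log timestamps carry no year); this is not a Condor tool. The sample uses three jobs copied from the log above.

```python
import re
from datetime import datetime

# Matches event header lines such as:
#   001 (126.004.000) 01/12 16:21:08 Job executing on host: <...>
# Captures: event code, cluster id, proc id, timestamp.
EVENT = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\s*\) (\d\d/\d\d \d\d:\d\d:\d\d) "
)


def start_delays(log_text, year=2000):
    """Return {(cluster, proc): seconds from submit to first execute}.

    The log has no year, so `year` is an arbitrary placeholder (an
    assumption); it only matters if a log spans a year boundary.
    """
    submitted = {}
    delays = {}
    for line in log_text.splitlines():
        m = EVENT.match(line)
        if not m:
            continue  # usage details, "..." separators, etc.
        code, cluster, proc, stamp = m.groups()
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        job = (int(cluster), int(proc))
        if code == "000":  # job submitted
            submitted[job] = when
        elif code == "001" and job in submitted and job not in delays:
            # first "job executing" event for this job
            delays[job] = (when - submitted[job]).total_seconds()
    return delays


# Three jobs taken from the log above.
SAMPLE = """\
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
001 (126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
"""

if __name__ == "__main__":
    for (cluster, proc), secs in sorted(start_delays(SAMPLE).items()):
        print(f"{cluster}.{proc}: waited {secs:.0f} s before starting")
```

For these three jobs the waits come out as 5, 25 and 7 seconds for 126.0, 126.4 and 126.6 respectively, so the spread in start times is larger than the run time of a single job.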

So, could anybody tell me why this happens? Is it because the matchmaking process can only handle one job at a time?