
Re: [Condor-users] Condor performance problem



Hi Hao,

Take a screenshot of the condor_q -analyze output and send the NegotiatorLog and ShadowLog so we can see why the jobs are not being matched right away. You might also want to use a separate log file for each job; that way you can rule out any delay caused by all the jobs writing to the same log file.

Regards

Mark

Hao Liu wrote:
Hello everyone:

I have noticed what may be a common performance problem in Condor. For example, when I submit 10 jobs to the Condor pool, not all 10 jobs run immediately, even though there are more than enough idle nodes at that moment (50 nodes). Only some of them start right away, while the others start later. Here is one log:

******************************************************
000 (126.000.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.001.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.002.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.003.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.005.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.007.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.008.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.009.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: < 128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: < 128.16.13.22:32975>
...
001 (126.001.000) 01/12 16:20:52 Job executing on host: < 128.16.13.27:33551>
...
001 (126.002.000 ) 01/12 16:20:54 Job executing on host: <128.16.13.42:33469>
...
001 (126.003.000) 01/12 16:20:56 Job executing on host: < 128.16.9.23:33266 >
...
005 (126.000.000) 01/12 16:20:58 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.006.000) 01/12 16:21:00 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.005.000) 01/12 16:21:00 Job executing on host: < 128.16.13.38:34509>
...
001 (126.007.000) 01/12 16:21:03 Job executing on host: <128.16.13.34:33987 >
...
005 (126.001.000) 01/12 16:21:03 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.008.000) 01/12 16:21:04 Job executing on host: <128.16.13.37:33208>
...
005 ( 126.002.000) 01/12 16:21:04 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.003.000) 01/12 16:21:06 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
001 (126.009.000) 01/12 16:21:07 Job executing on host: <128.16.13.28:34609>
...
001 ( 126.004.000) 01/12 16:21:08 Job executing on host: < 128.16.9.11:33303>
...
005 (126.005.000) 01/12 16:21:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.007.000) 01/12 16:21:13 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.008.000) 01/12 16:21:14 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.009.000) 01/12 16:21:17 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:03, Sys 0 00:00:04  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:03, Sys 0 00:00:04  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
...
005 (126.004.000) 01/12 16:21:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:02, Sys 0 00:00:05  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:02, Sys 0 00:00:05  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        51  -  Run Bytes Sent By Job
        5419  -  Run Bytes Received By Job
        51  -  Total Bytes Sent By Job
        5419  -  Total Bytes Received By Job
******************************************************

This is not a big problem if all the jobs are relatively long, but if the jobs are very short compared with the delay and there is a huge number of them, it becomes an obvious problem.
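
To put numbers on the delay, the submit-to-execute gap for each job can be read straight out of the user log. The following is only a minimal sketch in Python, not a Condor tool: the function name start_delays and the regular expression are my own, and because the log lines carry no year, the script assumes a fixed placeholder year (2000) and that all events fall within the same year.

```python
import re
from datetime import datetime

# Matches user-log event headers such as
#   "001 (126.006.000) 01/12 16:20:50 Job executing on host: ..."
# and tolerates stray spaces inside the parentheses.
EVENT_RE = re.compile(
    r"^(\d{3})\s+\(\s*(\d+)\.(\d+)\.\d+\s*\)\s+(\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2})"
)

def start_delays(log_text):
    """Return {(cluster, proc): seconds between event 000 (submitted)
    and the first event 001 (executing)} for every job seen in log_text."""
    submitted, delays = {}, {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # "..." separators and indented usage lines
        code, cluster, proc, date, clock = m.groups()
        job = (int(cluster), int(proc))
        # ASSUMPTION: the log has no year, so a placeholder year is used.
        when = datetime.strptime(f"2000 {date} {clock}", "%Y %m/%d %H:%M:%S")
        if code == "000":
            submitted[job] = when
        elif code == "001" and job in submitted and job not in delays:
            delays[job] = (when - submitted[job]).total_seconds()
    return delays

# Four event lines copied from the log above.
sample = """\
000 (126.000.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: < 128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: < 128.16.9.11:33303>
...
001 ( 126.004.000) 01/12 16:21:08 Job executing on host: < 128.16.9.11:33303>
"""

delays = start_delays(sample)
for job, seconds in sorted(delays.items()):
    print(job, seconds)
```

For these four lines it reports 5 seconds for job 126.000 and 25 seconds for job 126.004. Run over the whole log, it shows the start times spread between 16:20:48 and 16:21:08 for jobs that were all submitted at 16:20:43.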

So, could anybody tell me why this happens? Is it because the matchmaking process can only handle one job at a time?


-- 
Mark Ellul
Research and Development Manager