[Condor-users] why my submitted job runs for 2 mins and gets suspended and unsuspended every 10 mins



Hi,

I have set up Condor on 2 machines with NFS sharing (Linux Fedora Core 3). The installation and configuration went fine. But when I submit a simple example job, sh_loop.cmd, it executes for a couple of minutes, then gets suspended, after some time gets unsuspended, and is then evicted. Can anyone please help me out with this? I am showing the sh_loop.log file here; I hope it is of some help. This log file is from a single machine. I am facing the same problem with both single and multiple machines.




000 (002.000.000) 04/25 20:02:23 Job submitted from host: <127.0.0.1:32878>
...
001 (002.000.000) 04/25 20:02:25 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:02:30 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:12:31 Job was unsuspended.
...
004 (002.000.000) 04/25 20:12:32 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
001 (002.000.000) 04/25 20:22:29 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:22:33 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:32:34 Job was unsuspended.
...
004 (002.000.000) 04/25 20:32:34 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
001 (002.000.000) 04/25 20:42:28 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:42:33 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:52:36 Job was unsuspended.
...
004 (002.000.000) 04/25 20:52:36 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
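
To make the timing in the log easier to read, here is a rough Python sketch that pulls the event header lines out of a log like the one above and reports how long the job spent in each state. The names `parse_events`, `state_durations`, `EVENT_NAMES` and `SAMPLE` are made up for this illustration and are not part of Condor. The event code mapping is taken only from the lines shown in this log, and since the log has no year, the timestamps are compared within whatever default year `strptime` picks.

```python
import re
from datetime import datetime

# Event codes exactly as they appear in the log above (assumed mapping,
# read off the text that follows each code).
EVENT_NAMES = {
    "000": "submitted",
    "001": "executing",
    "004": "evicted",
    "010": "suspended",
    "011": "unsuspended",
}

# Header line: "<code> (<cluster.proc.subproc>) MM/DD HH:MM:SS <message>"
HEADER = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) "
)

# First cycle copied from the log above.
SAMPLE = """\
000 (002.000.000) 04/25 20:02:23 Job submitted from host: <127.0.0.1:32878>
...
001 (002.000.000) 04/25 20:02:25 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:02:30 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:12:31 Job was unsuspended.
...
004 (002.000.000) 04/25 20:12:32 Job was evicted.
"""


def parse_events(text):
    """Return a list of (event_name, job_id, timestamp) for each header line."""
    events = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            code, job, date, time = match.groups()
            when = datetime.strptime(f"{date} {time}", "%m/%d %H:%M:%S")
            events.append((EVENT_NAMES.get(code, code), job, when))
    return events


def state_durations(events):
    """Return (state, seconds) for each gap between consecutive events."""
    return [
        (name, int((nxt - when).total_seconds()))
        for (name, _, when), (_, _, nxt) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for state, seconds in state_durations(parse_events(SAMPLE)):
        print(f"{state:12s} {seconds:5d} s")
```

For the first cycle this gives 5 seconds in the executing state and 601 seconds suspended, followed by an eviction 1 second after the job is unsuspended.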