[Condor-users] Jobs Evicted, never finish!



Hi all.
I started using Condor successfully, but for the past few weeks all my jobs keep getting suspended and are only continued some moments (or even hours) later.

Can anyone help explain what is happening? Is there anywhere I can look for an answer?
I don't think it can be a priority problem, because it also happens when there are fewer jobs queued than unclaimed machines, and all the queued jobs are owned by the same user.

Here is a sample log, in case it helps (I'm using version 6.8.7):
000 (2339.000.000) 07/02 19:25:33 Job submitted from host: <192.168.100.10:56196>
...
001 (2339.000.000) 07/02 19:25:38 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/02 19:25:46 Image size of job updated: 386580
...
006 (2339.000.000) 07/02 19:45:45 Image size of job updated: 1871392
...
006 (2339.000.000) 07/02 20:45:46 Image size of job updated: 1871396
...
006 (2339.000.000) 07/02 21:25:47 Image size of job updated: 1871924
...
010 (2339.000.000) 07/03 04:26:07 Job was suspended.
        Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 04:36:09 Job was unsuspended.
...
004 (2339.000.000) 07/03 04:36:09 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 07:40:23, Sys 0 01:28:49  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (2339.000.000) 07/03 04:46:02 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/03 05:06:11 Image size of job updated: 1871444
...
006 (2339.000.000) 07/03 05:26:11 Image size of job updated: 1871548
...
006 (2339.000.000) 07/03 06:06:11 Image size of job updated: 1871592
...
010 (2339.000.000) 07/03 07:42:49 Job was suspended.
        Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 07:52:50 Job was unsuspended.
...
004 (2339.000.000) 07/03 07:52:50 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 02:36:22, Sys 0 00:29:42  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
001 (2339.000.000) 07/03 08:00:23 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/03 08:20:32 Image size of job updated: 1871492
...
010 (2339.000.000) 07/03 08:32:07 Job was suspended.
        Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 08:35:42 Job was unsuspended.
...
006 (2339.000.000) 07/03 08:35:42 Image size of job updated: 1871548

Thanks in advance,
Sergio