Hi all,
I started using Condor successfully, but for a few weeks now all my
jobs have been getting suspended and then continued a few moments (or even hours) later.
Can anyone help or explain to me what is happening? Is there anywhere
I can look for an answer?
I don't think it can be a priority problem, because it also happens when
there are fewer jobs queued than unclaimed machines, and all the queued jobs
are owned by the same user.
Here is a sample log, in case it helps (I'm using version 6.8.7):
000 (2339.000.000) 07/02 19:25:33 Job submitted from host: <192.168.100.10:56196>
...
001 (2339.000.000) 07/02 19:25:38 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/02 19:25:46 Image size of job updated: 386580
...
006 (2339.000.000) 07/02 19:45:45 Image size of job updated: 1871392
...
006 (2339.000.000) 07/02 20:45:46 Image size of job updated: 1871396
...
006 (2339.000.000) 07/02 21:25:47 Image size of job updated: 1871924
...
010 (2339.000.000) 07/03 04:26:07 Job was suspended.
Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 04:36:09 Job was unsuspended.
...
004 (2339.000.000) 07/03 04:36:09 Job was evicted.
(0) Job was not checkpointed.
Usr 0 07:40:23, Sys 0 01:28:49 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (2339.000.000) 07/03 04:46:02 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/03 05:06:11 Image size of job updated: 1871444
...
006 (2339.000.000) 07/03 05:26:11 Image size of job updated: 1871548
...
006 (2339.000.000) 07/03 06:06:11 Image size of job updated: 1871592
...
010 (2339.000.000) 07/03 07:42:49 Job was suspended.
Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 07:52:50 Job was unsuspended.
...
004 (2339.000.000) 07/03 07:52:50 Job was evicted.
(0) Job was not checkpointed.
Usr 0 02:36:22, Sys 0 00:29:42 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (2339.000.000) 07/03 08:00:23 Job executing on host: <192.168.39.13:32772>
...
006 (2339.000.000) 07/03 08:20:32 Image size of job updated: 1871492
...
010 (2339.000.000) 07/03 08:32:07 Job was suspended.
Number of processes actually suspended: 4
...
011 (2339.000.000) 07/03 08:35:42 Job was unsuspended.
...
006 (2339.000.000) 07/03 08:35:42 Image size of job updated: 1871548
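
In case it helps anyone reading the log above, the suspend/unsuspend pattern can be pulled out with a small script. This is only a rough sketch, not anything that ships with Condor: the function names `parse_events` and `suspension_spans` are made up for illustration, the event header format is assumed to be exactly as in the sample (code, job id, date, time, message), and since the log carries no year, the `year` argument is an arbitrary placeholder.

```python
import re
from datetime import datetime

# Event header as it appears in the sample log, e.g.
# "010 (2339.000.000) 07/03 04:26:07 Job was suspended."
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

def parse_events(text, year=2000):
    """Return (code, timestamp, message) for each event header line.

    The log has no year, so `year` is an assumed placeholder.
    Detail lines and "..." separators are skipped.
    """
    events = []
    for line in text.splitlines():
        m = HEADER.match(line.strip())
        if m:
            when = datetime.strptime(f"{year}/{m.group(5)}", "%Y/%m/%d %H:%M:%S")
            events.append((m.group(1), when, m.group(6)))
    return events

def suspension_spans(events):
    """Pair each 010 (suspended) event with the next 011 (unsuspended) event."""
    spans = []
    started = None
    for code, when, _ in events:
        if code == "010":
            started = when
        elif code == "011" and started is not None:
            spans.append((started, when, when - started))
            started = None
    return spans
```

Run over the whole log text (for example, read from a hypothetical file `job.log`), `suspension_spans(parse_events(text))` gives one entry per suspension with its start, end, and duration, which makes it easy to see whether the suspensions always last about the same time.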
Thanks in advance,
Sergio