
Re: [Condor-users] jobs evicted for no obvious reason



Also, the log shows SUSPEND and CONTINUE becoming TRUE around the times the job is being suspended and resumed. You may want to look at those policy expressions to see whether they are load-average based, and whether the suspend times correlate with any other tasks that happen on the exec machine, cron jobs maybe.
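For reference, the relevant pieces of the stock UWCS example policy look roughly like the sketch below; this is from the example config that ships with Condor, so check your pool's actual config before trusting these names or thresholds, since sites often change them:

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad   = 0.3
HighLoad         = 0.5
KeyboardBusy     = (KeyboardIdle < $(MINUTE))
CPUIdle          = ($(NonCondorLoadAvg) <= $(BackgroundLoad))

UWCS_SUSPEND  = ( $(KeyboardBusy) || \
                  ( (CpuBusyTime > 2 * $(MINUTE)) \
                    && $(ActivationTimer) > 90 ) )
UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
                  && (KeyboardIdle > $(ContinueIdleTime)) )

If some periodic task pushes the non-Condor load average above HighLoad for a couple of minutes, you would see exactly this kind of suspend/continue cycling, even with nobody logged in.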

Best,


matt

Dan Bradley wrote:
The eviction notices in your job log appear precisely 10 minutes after the job gets suspended. That's what would happen if you were using a policy equivalent to the example "UWCS" policy that ships with condor:

MaxSuspendTime          = 10 * $(MINUTE)
UWCS_PREEMPT = ( ((Activity == "Suspended") && \
                  ($(ActivityTimer) > $(MaxSuspendTime))) \
                 || (SUSPEND && (WANT_SUSPEND == False)) )
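If that is the policy in effect, the fix is either to stop the machines suspending the job in the first place, or to give them a longer suspension window before they give up and preempt. A minimal sketch for an exec machine's local config, assuming the stock macro names (MaxSuspendTime, HOUR) are in use:

# Hypothetical override: tolerate up to 2 hours of suspension before
# preempting. The 2-hour figure is illustrative, not a recommendation.
MaxSuspendTime = 2 * $(HOUR)

You can confirm what a given startd is actually evaluating with condor_config_val, e.g. condor_config_val -name noidea -startd PREEMPT.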

--Dan

Richard Gillman wrote:
We run Condor 6.8.8. This user is experiencing problems on systems running SUSE SLED 10.

One of our users is having problems with her job being evicted by condor. Although this example happened during the day, these evictions have apparently also been happening in the middle of the night, when it is unlikely that another user is involved. She says:

I usually log on to notion and submit my jobs to condor from there.

I've attached a typical log file that I get. The corresponding error files are now empty, as they are overwritten when the job automatically starts again, so when it eventually runs to completion the error file remains empty. During the time immediately after the job has been evicted and before it restarts (this lapse ranges from a couple of seconds up to an hour), the error file contains the following line:
forrtl: error (78): process killed (SIGTERM)

For the attached log file, the job is submitted to condor from notion.

On its first attempt it is noidea which starts the run; this then gets evicted and the job starts to run on nozzle; after another eviction it is nohow, and the list goes on. I haven't noticed any obvious pattern in which machines the jobs are evicted from.

I attach her log file, and extracts from the logs on the first of the systems on which the job ran but was evicted.

Does anyone have any ideas?

Her log file:

000 (437.009.000) 01/29 10:31:12 Job submitted from host: <xxx.xxx.250.91:57377>
...
001 (437.009.000) 01/29 10:31:39 Job executing on host: <xxx.xxx.250.90:60167>
...
006 (437.009.000) 01/29 10:31:47 Image size of job updated: 12712
...
010 (437.009.000) 01/29 10:35:54 Job was suspended.
Number of processes actually suspended: 1
...
011 (437.009.000) 01/29 10:38:05 Job was unsuspended.
...
010 (437.009.000) 01/29 10:41:05 Job was suspended.
Number of processes actually suspended: 1
...
011 (437.009.000) 01/29 10:51:07 Job was unsuspended.
...
004 (437.009.000) 01/29 10:51:08 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:02:32, Sys 0 00:00:12 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (437.009.000) 01/29 10:56:25 Job executing on host: <xxx.xxx.250.94:49884>
...
010 (437.009.000) 01/29 11:00:30 Job was suspended.
Number of processes actually suspended: 1
...
011 (437.009.000) 01/29 11:10:32 Job was unsuspended.
...
004 (437.009.000) 01/29 11:10:33 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:46, Sys 0 00:00:05 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (437.009.000) 01/29 11:36:33 Job executing on host: <xxx.xxx.250.65:33824>
...
010 (437.009.000) 01/29 11:41:26 Job was suspended.
Number of processes actually suspended: 1
...
011 (437.009.000) 01/29 11:43:07 Job was unsuspended.
...
010 (437.009.000) 01/29 11:52:58 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 11:52:59 Image size of job updated: 12676
...
011 (437.009.000) 01/29 11:54:48 Job was unsuspended.
...
006 (437.009.000) 01/29 11:56:41 Image size of job updated: 12684
...
010 (437.009.000) 01/29 11:59:09 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 11:59:09 Image size of job updated: 12696
...
011 (437.009.000) 01/29 12:00:14 Job was unsuspended.
...
010 (437.009.000) 01/29 12:09:16 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 12:09:16 Image size of job updated: 12732
...
011 (437.009.000) 01/29 12:10:22 Job was unsuspended.
...
010 (437.009.000) 01/29 12:14:47 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 12:14:47 Image size of job updated: 12752
...
011 (437.009.000) 01/29 12:16:12 Job was unsuspended.
...
010 (437.009.000) 01/29 12:20:02 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 12:20:02 Image size of job updated: 12772
...
011 (437.009.000) 01/29 12:21:27 Job was unsuspended.
...
006 (437.009.000) 01/29 12:36:41 Image size of job updated: 12844
...
010 (437.009.000) 01/29 12:44:46 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 12:44:46 Image size of job updated: 12884
...
011 (437.009.000) 01/29 12:45:56 Job was unsuspended.
...
006 (437.009.000) 01/29 12:56:41 Image size of job updated: 12932
...
006 (437.009.000) 01/29 13:16:41 Image size of job updated: 13032
...
010 (437.009.000) 01/29 13:27:22 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 13:27:22 Image size of job updated: 13076
...
011 (437.009.000) 01/29 13:32:52 Job was unsuspended.
...
010 (437.009.000) 01/29 13:35:12 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 13:35:12 Image size of job updated: 13088
...
011 (437.009.000) 01/29 13:37:37 Job was unsuspended.
...
010 (437.009.000) 01/29 13:45:52 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 13:45:53 Image size of job updated: 13128
...
011 (437.009.000) 01/29 13:48:48 Job was unsuspended.
...
010 (437.009.000) 01/29 13:52:33 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 13:52:34 Image size of job updated: 13148
...
011 (437.009.000) 01/29 13:54:07 Job was unsuspended.
...
006 (437.009.000) 01/29 13:56:40 Image size of job updated: 13160
...
010 (437.009.000) 01/29 13:59:49 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 13:59:49 Image size of job updated: 13176
...
011 (437.009.000) 01/29 14:01:09 Job was unsuspended.
...
010 (437.009.000) 01/29 14:06:55 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 14:06:56 Image size of job updated: 1320
...
011 (437.009.000) 01/29 14:08:51 Job was unsuspended.
...
006 (437.009.000) 01/29 14:16:41 Image size of job updated: 13244
...
010 (437.009.000) 01/29 14:22:57 Job was suspended.
Number of processes actually suspended: 1
...
006 (437.009.000) 01/29 14:22:57 Image size of job updated: 13276
...
011 (437.009.000) 01/29 14:24:32 Job was unsuspended.
...
006 (437.009.000) 01/29 14:36:41 Image size of job updated: 13336
...
005 (437.009.000) 01/29 14:45:09 Job terminated.
(1) Normal termination (return value 0)
Usr 0 01:01:41, Sys 0 00:03:40 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 01:01:41, Sys 0 00:03:40 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...

Trying to track 437.009 in the logs of the first system from which it was evicted:

StarterLog.vm2

1/29 10:31:57 ** condor_starter (CONDOR_STARTER) STARTING UP
1/29 10:31:57 ** /nerc/packages/condor/6.8.8/rhel3/condor-6.8.8/sbin/condor_starter
1/29 10:31:57 ** $CondorVersion: 6.8.8 Dec 19 2007 $
1/29 10:31:57 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/29 10:31:57 ** PID = 6254
1/29 10:31:57 ** Log last touched 1/28 01:19:21
1/29 10:31:57 ******************************************************
1/29 10:31:57 Using config source: /users/condor/condor/condor_config
1/29 10:31:57 Using local config sources:
1/29 10:31:57 /users/condor/condor/condor_config.X86_64.LINUX
1/29 10:31:57 /users/condor/condor/hosts/noidea/condor_config.local
1/29 10:31:57 DaemonCore: Command Socket at <xxx.xxx.xxx.90:38582>
1/29 10:31:57 Done setting resource limits
1/29 10:31:57 Communicating with shadow <xxx.xxx.xxx.91:37063>
1/29 10:31:58 Submitting machine is "notion.noc.soton.ac.uk"
1/29 10:31:58 Starting a VANILLA universe job with ID: 437.9
1/29 10:31:58 IWD: /noc/data/scratch/ndg/ooc/mucm/EmulateDerivs/genie-cgoldstein
1/29 10:31:58 Input file: /noc/data/scratch/ndg/ooc/mucm/EmulateDerivs/genie-cgoldstein/../goin/goin.9
1/29 10:31:58 Output file: /noc/data/scratch/ndg/ooc/mucm/EmulateDerivs/genie-cgoldstein/../goout/goout.9
1/29 10:31:58 Error file: /noc/data/scratch/ndg/ooc/mucm/EmulateDerivs/genie-cgoldstein/../err/err.9
1/29 10:31:58 About to exec /noc/data/scratch/ndg/ooc/mucm/EmulateDerivs/genie-cgoldstein/goldstein
1/29 10:31:58 Create_Process succeeded, pid=6259
1/29 10:36:12 Suspending all jobs.
1/29 10:38:23 Continuing all jobs.
1/29 10:41:23 Suspending all jobs.
1/29 10:51:25 Continuing all jobs.
1/29 10:51:25 Got SIGTERM. Performing graceful shutdown.
1/29 10:51:25 ShutdownGraceful all jobs.
1/29 10:51:26 Process exited, pid=6259, status=1
1/29 10:51:26 Last process exited, now Starter is exiting
1/29 10:51:26 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
1/29 11:46:50 ******************************************************

StartLog.old:

1/29 10:31:32 DaemonCore: Command received via UDP from host <139.166.240.211:35940>
1/29 10:31:32 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/29 10:31:32 vm1: match_info called
1/29 10:31:32 vm1: Received match <xxx.xxx.xxx.90:60167>#1228475580#2012#...
1/29 10:31:32 vm1: State change: match notification protocol successful
1/29 10:31:32 vm1: Changing state: Unclaimed -> Matched
1/29 10:31:32 DaemonCore: Command received via UDP from host <139.166.240.211:35940>
1/29 10:31:32 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/29 10:31:32 vm2: match_info called
1/29 10:31:32 vm2: Received match <xxx.xxx.xxx.90:60167>#1228475580#2015#...
1/29 10:31:32 vm2: State change: match notification protocol successful
1/29 10:31:32 vm2: Changing state: Unclaimed -> Matched
1/29 10:31:32 DaemonCore: Command received via UDP from host <139.166.240.211:35940>
1/29 10:31:32 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/29 10:31:32 vm3: match_info called
1/29 10:31:32 vm3: Received match <xxx.xxx.xxx.90:60167>#1228475580#2010#...
1/29 10:31:32 vm3: State change: match notification protocol successful
1/29 10:31:32 vm3: Changing state: Unclaimed -> Matched
1/29 10:31:32 DaemonCore: Command received via UDP from host <139.166.240.211:35940>
1/29 10:31:32 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/29 10:31:32 vm4: match_info called
1/29 10:31:32 vm4: Received match <xxx.xxx.xxx.90:60167>#1228475580#2011#...
1/29 10:31:32 vm4: State change: match notification protocol successful
1/29 10:31:32 vm4: Changing state: Unclaimed -> Matched
1/29 10:31:33 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:34338>
1/29 10:31:33 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/29 10:31:33 vm2: Request accepted.
1/29 10:31:33 vm2: Remote owner is nice-user.gs7@xxxxxxxxxxxxxxx
1/29 10:31:33 vm2: State change: claiming protocol successful
1/29 10:31:33 vm2: Changing state: Matched -> Claimed
1/29 10:31:34 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:46593>
1/29 10:31:34 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/29 10:31:34 vm4: Request accepted.
1/29 10:31:34 vm4: Remote owner is nice-user.gs7@xxxxxxxxxxxxxxx
1/29 10:31:34 vm4: State change: claiming protocol successful
1/29 10:31:34 vm4: Changing state: Matched -> Claimed
1/29 10:31:34 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:33678>
1/29 10:31:34 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/29 10:31:34 vm3: Request accepted.
1/29 10:31:34 vm3: Remote owner is nice-user.gs7@xxxxxxxxxxxxxxx
1/29 10:31:34 vm3: State change: claiming protocol successful
1/29 10:31:34 vm3: Changing state: Matched -> Claimed
1/29 10:31:38 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:33483>
1/29 10:31:38 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/29 10:31:38 vm1: Request accepted.
1/29 10:31:38 vm1: Remote owner is nice-user.gs7@xxxxxxxxxxxxxxx
1/29 10:31:38 vm1: State change: claiming protocol successful
1/29 10:31:38 vm1: Changing state: Matched -> Claimed
1/29 10:31:57 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:52368>
1/29 10:31:57 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/29 10:31:57 vm2: Got activate_claim request from shadow (<xxx.xxx.xxx.91:52368>)
1/29 10:31:57 vm2: Remote job ID is 437.9
1/29 10:31:57 vm2: Got universe "VANILLA" (5) from request classad
1/29 10:31:57 vm2: State change: claim-activation protocol successful
1/29 10:31:57 vm2: Changing activity: Idle -> Busy
1/29 10:32:00 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:52003>
1/29 10:32:00 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/29 10:32:00 vm3: Got activate_claim request from shadow (<xxx.xxx.xxx.91:52003>)
1/29 10:32:00 vm3: Remote job ID is 437.10
1/29 10:32:00 vm3: Got universe "VANILLA" (5) from request classad
1/29 10:32:00 vm3: State change: claim-activation protocol successful
1/29 10:32:00 vm3: Changing activity: Idle -> Busy
1/29 10:32:02 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:52261>
1/29 10:32:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/29 10:32:02 vm4: Got activate_claim request from shadow (<xxx.xxx.xxx.91:52261>)
1/29 10:32:02 vm4: Remote job ID is 437.11
1/29 10:32:02 vm4: Got universe "VANILLA" (5) from request classad
1/29 10:32:02 vm4: State change: claim-activation protocol successful
1/29 10:32:02 vm4: Changing activity: Idle -> Busy
1/29 10:33:23 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:57073>
1/29 10:33:23 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/29 10:33:23 vm1: Got activate_claim request from shadow (<xxx.xxx.xxx.91:57073>)
1/29 10:33:23 vm1: Remote job ID is 437.8
1/29 10:33:23 vm1: Got universe "VANILLA" (5) from request classad
1/29 10:33:23 vm1: State change: claim-activation protocol successful
1/29 10:33:23 vm1: Changing activity: Idle -> Busy
1/29 10:36:12 vm2: State change: SUSPEND is TRUE
1/29 10:36:12 vm2: Changing activity: Busy -> Suspended
1/29 10:38:23 vm2: State change: CONTINUE is TRUE
1/29 10:38:23 vm2: Changing activity: Suspended -> Busy
1/29 10:41:23 vm2: State change: SUSPEND is TRUE
1/29 10:41:23 vm2: Changing activity: Busy -> Suspended
1/29 10:51:25 vm2: State change: PREEMPT is TRUE
1/29 10:51:25 vm2: Changing activity: Suspended -> Retiring
1/29 10:51:25 vm2: State change: claim retirement ended/expired
1/29 10:51:25 vm2: State change: WANT_VACATE is TRUE
1/29 10:51:25 vm2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
1/29 10:51:26 DaemonCore: Command received via TCP from host <xxx.xxx.xxx.91:59681>
1/29 10:51:26 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/29 10:51:26 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
1/29 10:51:26 Starter pid 6254 exited with status 0
1/29 10:51:26 vm2: State change: starter exited
1/29 10:51:26 vm2: State change: No preempting claim, returning to owner
1/29 10:51:26 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/29 10:51:26 DaemonCore: Command received via UDP from host <xxx.xxx.xxx.91:57374>
1/29 10:51:26 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
1/29 10:51:26 Warning: can't find resource with ClaimId (<xxx.xxx.xxx.90:60167>#1228475580#2015#...)



