
[Condor-users] job only checkpoints sometimes



Hi all, I see this strange situation:

Sometimes a job (standard universe) is checkpointed, and sometimes it is NOT! Could this be because different signals are being used, i.e. suspend/preempt/owner?

See my Condor logs below, which show what happened on each machine (note: the log files are in reverse time order). The machine that did NOT checkpoint seems to have called DEACTIVATE_CLAIM_FORCIBLY immediately, while the one that DID checkpoint correctly called DEACTIVATE_CLAIM and then DEACTIVATE_CLAIM_FORCIBLY. What could have caused this?

Ashish

LOG ON MACHINE THAT DID CHECKPOINT
1/5 14:35:45 Error: can't find resource with capability (< 192.168.1.102:32775>#3511251448)
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
1/5 14:35:45 vm2: Changing state: Owner -> Unclaimed
1/5 14:35:45 vm2: State change: IS_OWNER is false
1/5 14:35:45 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 14:35:45 vm2: State change: No preempting claim, returning to owner
1/5 14:35:45 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 14:35:45 vm2: State change: received RELEASE_CLAIM command
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host < 128.2.211.9:33316>
1/5 14:35:43 vm2: Changing activity: Busy -> Idle
1/5 14:35:43 vm2: State change: starter exited
1/5 14:35:43 Starter pid 7870 exited with status 0
1/5 14:35:43 vm2: Called deactivate_claim_forcibly()
1/5 14:35:43 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 14:35:43 DaemonCore: Command received via TCP from host < 128.2.211.9:37412>
1/5 14:35:42 Assuming the keyboard and mouse to be infinitely idle.
1/5 14:35:42 Failed to obtain keyboard or mouse idle information.
1/5 14:35:40 vm2: Called deactivate_claim()
1/5 14:35:40 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
1/5 14:35:40 DaemonCore: Command received via UDP from host < 128.2.211.9:33316 >

LOG ON MACHINE THAT DID NOT CHECKPOINT

1/5 18:43:06 Error: can't find resource with capability (< 192.168.1.104:32774>#1929149340)
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host <128.2.211.9:33346>
1/5 18:43:06 vm2: Changing state: Owner -> Unclaimed
1/5 18:43:06 vm2: State change: IS_OWNER is false
1/5 18:43:06 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 18:43:06 vm2: State change: No preempting claim, returning to owner
1/5 18:43:06 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 18:43:06 vm2: State change: received RELEASE_CLAIM command
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host < 128.2.211.9:33346>
1/5 18:43:06 vm2: Changing activity: Busy -> Idle
1/5 18:43:06 vm2: State change: starter exited
1/5 18:43:06 Starter pid 6381 exited with status 0
1/5 18:43:06 vm2: Called deactivate_claim_forcibly()
1/5 18:43:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via TCP from host < 128.2.211.9:38919>
1/5 18:43:05 vm2: Performing a periodic checkpoint on vm2@xxxxxxxxxxxxxxxxxxxxx.
1/5 18:43:05 Assuming the keyboard and mouse to be infinitely idle.
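
Because the two machine log excerpts above are pasted newest-first, the order of commands is easy to misread. The following is a minimal Python sketch, not part of Condor, that reverses such an excerpt and lists the DaemonCore commands in chronological order. The function names `command_sequence` and `graceful_before_forcible` are hypothetical, and the regular expression assumes the "received command N (NAME)" line format seen in these excerpts.

```python
import re

# Assumed line format, taken from the excerpts above:
#   "... DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (...)"
COMMAND_RE = re.compile(r"received command (\d+) \((\w+)\)")


def command_sequence(reverse_log):
    """Return command names oldest-first from a newest-first log excerpt."""
    lines = [ln for ln in reverse_log.splitlines() if ln.strip()]
    names = []
    # The excerpt is newest-first, so walk it backwards to get time order.
    for line in reversed(lines):
        match = COMMAND_RE.search(line)
        if match:
            names.append(match.group(2))
    return names


def graceful_before_forcible(reverse_log):
    """True if DEACTIVATE_CLAIM was received before the first DEACTIVATE_CLAIM_FORCIBLY."""
    seq = command_sequence(reverse_log)
    if "DEACTIVATE_CLAIM_FORCIBLY" not in seq:
        return "DEACTIVATE_CLAIM" in seq
    first_forcible = seq.index("DEACTIVATE_CLAIM_FORCIBLY")
    return "DEACTIVATE_CLAIM" in seq[:first_forcible]
```

Run against the two excerpts, this should make the difference described in the question explicit: the first machine received DEACTIVATE_CLAIM before DEACTIVATE_CLAIM_FORCIBLY, while the second excerpt contains no DEACTIVATE_CLAIM at all.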


001 (214.000.000) 01/05 14:33:24 Job executing on host: <192.168.1.102:32775>
...
006 ( 214.000.000) 01/05 14:35:40 Image size of job updated: 107981
...
004 (214.000.000) 01/05 14:35:45 Job was evicted.
        (1) Job was checkpointed.
                Usr 0 00:01:52, Sys 0 00:00:01  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Local Usage
        97712680  -  Run Bytes Sent By Job
        240792208  -  Run Bytes Received By Job
...
001 (214.000.000) 01/05 14:43:02 Job executing on host: < 192.168.1.104:32774 >
...
004 (214.000.000) 01/05 18:43:06 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:12, Sys 0 00:01:12  -  Run Local Usage
        11367188  -  Run Bytes Sent By Job
        8140449280  -  Run Bytes Received By Job
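
To line up the evictions in the job log above with the machine logs, it can help to pull out each "Job was evicted" event together with its checkpoint status. Below is a small Python sketch under the assumption that the job log always looks like the excerpt above (an `004` event line followed by a "(1) Job was checkpointed." or "(0) Job was not checkpointed." line). The function name `evictions` is hypothetical and this is not a Condor tool.

```python
import re

# Assumed formats, taken from the job log excerpt above.
EVICT_RE = re.compile(r"^004 \(\s*([\d.]+)\s*\) (\S+ \S+) Job was evicted\.")
CKPT_RE = re.compile(r"^\s*\((\d)\) Job was (not )?checkpointed\.")


def evictions(user_log):
    """Return a list of (job_id, timestamp, checkpointed) for each eviction event."""
    results = []
    pending = None  # eviction header waiting for its checkpoint status line
    for line in user_log.splitlines():
        header = EVICT_RE.match(line)
        if header:
            pending = (header.group(1), header.group(2))
            continue
        status = CKPT_RE.match(line)
        if status and pending is not None:
            # group(2) is "not " when the job was NOT checkpointed.
            results.append((pending[0], pending[1], status.group(2) is None))
            pending = None
    return results
```

The resulting timestamps (14:35:45 and 18:43:06 in the excerpt) can then be matched against the machine log entries for the same moments.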