
[condor-users] Output files not being returned upon preemption (transfer_files = ALWAYS)



I have a problem that I hope you guys can help me with.

Condor (version 6.4.7, BTW) isn't transferring output files back upon job preemption, although it does transfer them back if the job runs to completion.  "transfer_files = ALWAYS" is set in the job's .sub file.

The .sub file is as follows:
universe = vanilla
requirements = (OpSys == "WINNT50" || OpSys == "WINNT51") && Machine == "nsi-compaq2.noregon.com"
executable = sleeper.bat
transfer_files = ALWAYS
copy_to_spool = false
# sleeper.exe is a tiny program that just sleeps a given number of seconds
transfer_input_files = sleeper.exe
output = sleeper.out
error  = sleeper.err
log    = sleeper.log
queue

I run this job on a node over my company's network.  I've shared the node's hard drive over the network, so I can peek into its execution directory during job execution without affecting the job.

I've narrowed the problem to as small a job as I can.  I created a job that is a 3-line batch file (sleeper.bat), as follows:
type nul > begin.flg
sleeper 120
type nul > end.flg

It creates an empty file, sleeps for 2 minutes (120 seconds), then creates another empty file.  If I let it run to completion, both created files (begin.flg and end.flg) are returned.  If I preempt it, neither file is returned.  I know that the first file (begin.flg) was created, because it's there when I peek into the execution directory prior to preemption.
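
The check described above (which of the two flag files made it back after a run) could be scripted roughly as follows. This is only a sketch: the function name `returned_flags` and the idea of pointing it at the submit directory are assumptions for illustration, not part of Condor.

```python
from pathlib import Path

# The two marker files created by sleeper.bat.
FLAGS = ("begin.flg", "end.flg")

def returned_flags(submit_dir):
    """Report, per flag file, whether it exists in the given directory."""
    base = Path(submit_dir)
    return {name: (base / name).is_file() for name in FLAGS}

if __name__ == "__main__":
    # "." is assumed to be the directory the job was submitted from.
    for name, present in returned_flags(".").items():
        print(f"{name}: {'returned' if present else 'missing'}")
```

After a completed run both entries should read "returned"; after the preempted run described here, both come back "missing".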

On the Central Manager, ShadowLog records the preemption of this job (cluster 1072) as follows:
2/24 09:56:42 (1072.0) (2372): Request to run on <192.168.33.130:1267> was ACCEPTED
2/24 09:58:53 (1072.0) (2372): Job 1072.0 is being evicted
2/24 09:58:54 (1072.0) (2372): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107

According to the Condor source code, status 107 is JOB_NOT_CKPTED.
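
To pick the shadow exit status out of a longer ShadowLog, something like the following sketch might help. The regular expression is based only on the three log lines quoted above, and the lookup table deliberately contains just the one code mentioned in this post; both are assumptions, not a complete description of Condor's log format or exit codes.

```python
import re

# Only the status named in this post is listed; other codes are left unknown.
KNOWN_STATUS = {107: "JOB_NOT_CKPTED"}

# Matches e.g. "(1072.0) ... EXITING WITH STATUS 107" (format assumed from the excerpt).
EXIT_RE = re.compile(r"\((\d+\.\d+)\).*EXITING WITH STATUS (\d+)")

def shadow_exit_statuses(lines):
    """Return (job_id, status, name) for every shadow exit line found."""
    results = []
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            status = int(m.group(2))
            results.append((m.group(1), status, KNOWN_STATUS.get(status, "unknown")))
    return results

sample = [
    "2/24 09:56:42 (1072.0) (2372): Request to run on <192.168.33.130:1267> was ACCEPTED",
    "2/24 09:58:53 (1072.0) (2372): Job 1072.0 is being evicted",
    "2/24 09:58:54 (1072.0) (2372): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107",
]
print(shadow_exit_statuses(sample))  # [('1072.0', 107, 'JOB_NOT_CKPTED')]
```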

On the node, StartLog records the job's preemption as follows:
2/24 09:58:54 DaemonCore: Command received via TCP from host <192.168.33.4:1677>
2/24 09:58:54 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
2/24 09:58:54 Got deactivate_claim_forcibly while in Preempting state, ignoring.
2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.130:2377>
2/24 09:58:54 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
2/24 09:58:54 Starter pid 1652 exited with status 0
2/24 09:58:54 dynuser: LsaRemoveAccountRights Failed winerr=87l
2/24 09:58:54 State change: starter exited
2/24 09:58:54 State change: No preempting match, returning to owner
2/24 09:58:54 Changing state and activity: Preempting/Vacating -> Owner/Idle
2/24 09:58:54 State change: IS_OWNER is false
2/24 09:58:54 Changing state: Owner -> Unclaimed
2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.4:1680>
2/24 09:58:54 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
2/24 09:58:54 Error: can't find resource with capability (<192.168.33.130:1267>#2160232456)
2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.4:1681>
2/24 09:58:54 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
2/24 09:58:54 Error: can't find resource with capability (<192.168.33.130:1267>#2160232456)

I see several errors, but I don't know what they mean, and I don't know if they're pertinent.  Do these errors indicate the reason that my output files are not being transferred back to the Central Manager when my jobs are preempted?
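
For scanning a longer StartLog for the same kind of messages, a rough filter might look like the sketch below. The keyword list (Error, Failed, ignoring) is taken from the excerpt above and is an assumption; it is not a Condor convention, and a match does not by itself mean the line is related to the missing output files.

```python
import re

# Keywords picked from the StartLog excerpt in this post (an assumption).
SUSPECT = re.compile(r"Error|Failed|ignoring", re.IGNORECASE)

def suspicious_lines(lines):
    """Return the log lines that contain one of the suspect keywords."""
    return [ln for ln in lines if SUSPECT.search(ln)]

sample = [
    "2/24 09:58:54 Got deactivate_claim_forcibly while in Preempting state, ignoring.",
    "2/24 09:58:54 Starter pid 1652 exited with status 0",
    "2/24 09:58:54 dynuser: LsaRemoveAccountRights Failed winerr=87l",
    "2/24 09:58:54 Error: can't find resource with capability (<192.168.33.130:1267>#2160232456)",
]
for line in suspicious_lines(sample):
    print(line)
```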