
RE: [SPAM] - Re: [condor-users] Output files not being returned upon preemption (transfer_files = ALWAYS) - Email found in subject



Alexander,

I think I've read about your experiences with symptoms like these in http://www.cs.wisc.edu/~lists/archive/condor-users/msg00129.html, with the following differences: in that issue, the CONDOR_SHADOW exit value was 100 and CONDOR_SHADOW exited almost immediately after job acceptance, whereas in my log the exit value is 107 and CONDOR_SHADOW exited when the job was vacated.
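
To make the comparison of exit values concrete, here is a minimal Python sketch that pulls the exit value out of a ShadowLog line. The helper name and regular expression are illustrative assumptions, not part of Condor. The sample line is taken from the ShadowLog excerpt quoted further down in this message, and the table holds only status 107, the one code whose name is given in this thread.

```python
import re

# Only status 107 is named in this thread; other codes are deliberately left out.
KNOWN_STATUS = {107: "JOB_NOT_CKPTED"}

# Hypothetical pattern: matches the "EXITING WITH STATUS <n>" tail of a ShadowLog line.
_EXIT_RE = re.compile(r"EXITING WITH STATUS (\d+)")


def shadow_exit_status(line):
    """Return the integer exit status found in a ShadowLog line, or None if absent."""
    m = _EXIT_RE.search(line)
    return int(m.group(1)) if m else None


line = ("2/24 09:58:54 (1072.0) (2372): **** condor_shadow "
        "(condor_SHADOW) EXITING WITH STATUS 107")
status = shadow_exit_status(line)
print(status, KNOWN_STATUS.get(status, "unknown"))
```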

I took Zach's advice from that thread and added the following to the Central Manager's config file, but it did not help:
SEC_DEFAULT_SESSION_DURATION = 8640000

I was preempting the job by interacting with the node, not with condor_vacate.  However, I tried preempting it with condor_vacate, with the same result.  The files do not appear in the condor spool directory either.

I note with interest that in that past issue, as well as in mine, these errors are in the node's StartLog:
10/7 12:22:52 dynuser: LsaRemoveAccountRights Failed winerr=87l
10/7 12:22:53 Error: can't find resource with capability (<192.168.1.21:1112>#1709965739)

However, I can't find what Windows error 871 is; it doesn't appear in MSDN (http://tinyurl.com/3xt6m) or in Winerror.h.
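
One possible reading, offered as an assumption rather than a confirmed diagnosis: the StartLog text is "winerr=87l", ending in a lowercase letter l rather than the digit 1, and that letter could be a stray character from the log format. Windows system error 87 is ERROR_INVALID_PARAMETER in Winerror.h. The Python sketch below (helper name and lookup table are illustrative only) extracts just the leading digits from such a line, so that a trailing letter is not mistaken for part of the number.

```python
import re

# Small illustrative table; 87 is ERROR_INVALID_PARAMETER in Winerror.h.
WINERR_NAMES = {87: "ERROR_INVALID_PARAMETER"}


def parse_winerr(line):
    """Return the decimal digits that follow 'winerr=' as an int, or None."""
    m = re.search(r"winerr=(\d+)", line)
    return int(m.group(1)) if m else None


log_line = "10/7 12:22:52 dynuser: LsaRemoveAccountRights Failed winerr=87l"
code = parse_winerr(log_line)
print(code, WINERR_NAMES.get(code, "unknown"))
```

If the log had really contained the digits 871, the same helper would return 871 and the lookup would report it as unknown.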

Could this indicate some sort of misconfiguration of the condor dynamic user?  If so, could that explain why output files of vacated jobs aren't returned?  I should note that my Central Manager and the node are WinXP, and that I'm using Condor 6.4.7.

In response to J Kewley: unfortunately, I can't list the output files explicitly.  They'll vary a good deal depending on what jobs are submitted, and I won't know them in advance.

Thanks to both of you,
David


-----Original Message-----
From: Alexander Klyubin [mailto:A.Kljubin@xxxxxxxxxxx]
Sent: Tuesday, February 24, 2004 12:44 PM
To: condor-users@xxxxxxxxxxx
Subject: [SPAM] - Re: [condor-users] Output files not being returned
upon preemption (transfer_files = ALWAYS) - Email found in subject


I remember I had a similar question. If I recall correctly, the answer 
was that if you used condor_vacate to preempt a job, its files were at 
best returned to the *spool* directory on the submit machine. So, not to 
the job's submit directory. However, even that did not seem to work for 
me during a brief test.

Regards,
Alexander Klyubin

David Vestal wrote:
> I have a problem that I hope you guys can help me with.
> 
> Condor (version 6.4.7, BTW) isn't transferring output files back upon job preemption, although it does transfer them back if the job runs to completion.  "transfer_files = ALWAYS" is set in the job's .sub file.
> 
> The .sub file is as follows:
> universe = vanilla
> requirements = (OpSys == "WINNT50" || OpSys == "WINNT51") && Machine == "nsi-compaq2.noregon.com"
> executable = sleeper.bat
> transfer_files = ALWAYS
> copy_to_spool = false
> transfer_input_files = sleeper.exe   [sleeper is a tiny program that just sleeps a given # of seconds]
> output = sleeper.out
> error  = sleeper.err
> log    = sleeper.log
> queue
> 
> I run this job on a node over my company's network.  I've shared the node's hard drive over the network, so I can peek into its execution directory during job execution without affecting the job.
> 
> I've narrowed the problem to as small a job as I can.  I created a job that is a 3-line batch file (sleeper.bat), as follows:
> type nul > begin.flg
> sleeper 120
> type nul > end.flg
> 
> It creates an empty file, sleeps for 2 minutes (120 seconds), then creates another empty file.  If I let it run to completion, both created files (begin.flg and end.flg) are returned.  If I preempt it, neither file is returned.  I know that the first file (begin.flg) was created, because it's there when I peek into the execution directory prior to preemption.
> 
> On the Central Manager, ShadowLog records the preemption of this job (cluster 1072) as follows:
> 2/24 09:56:42 (1072.0) (2372): Request to run on <192.168.33.130:1267> was ACCEPTED
> 2/24 09:58:53 (1072.0) (2372): Job 1072.0 is being evicted
> 2/24 09:58:54 (1072.0) (2372): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
> 
> According to the Condor source code, status 107 is JOB_NOT_CKPTED.
> 
> On the node, StartLog records the job's preemption as follows:
> 2/24 09:58:54 DaemonCore: Command received via TCP from host <192.168.33.4:1677>
> 2/24 09:58:54 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 2/24 09:58:54 Got deactivate_claim_forcibly while in Preempting state, ignoring.
> 2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.130:2377>
> 2/24 09:58:54 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 2/24 09:58:54 Starter pid 1652 exited with status 0
> 2/24 09:58:54 dynuser: LsaRemoveAccountRights Failed winerr=87l
> 2/24 09:58:54 State change: starter exited
> 2/24 09:58:54 State change: No preempting match, returning to owner
> 2/24 09:58:54 Changing state and activity: Preempting/Vacating -> Owner/Idle
> 2/24 09:58:54 State change: IS_OWNER is false
> 2/24 09:58:54 Changing state: Owner -> Unclaimed
> 2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.4:1680>
> 2/24 09:58:54 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> 2/24 09:58:54 Error: can't find resource with capability (<192.168.33.130:1267>#2160232456)
> 2/24 09:58:54 DaemonCore: Command received via UDP from host <192.168.33.4:1681>
> 2/24 09:58:54 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> 2/24 09:58:54 Error: can't find resource with capability (<192.168.33.130:1267>#2160232456)
> 
> I see several errors, but I don't know what they mean, and I don't know if they're pertinent.  Do these errors indicate the reason that my output files are not being transferred back to the Central Manager when my jobs are preempted?
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
> 