
Re: [condor-users] Shadow Exception!

Try adding

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

to your .sub file.
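
For context, here is a minimal sketch of what a complete submit file with these two lines might look like. The executable name trapez.exe and the log file name trapez.log are assumptions (only trapez.out and trapez.err appear in the starter log quoted below), so adjust them to match your actual job.

```
# Hypothetical submit file; names marked "assumed" are not from the original post
universe                = vanilla
executable              = trapez.exe
# executable name above is assumed
output                  = trapez.out
error                   = trapez.err
log                     = trapez.log
# log file name above is assumed

# Let Condor's file transfer mechanism move files to and from the execute machine
should_transfer_files   = YES
# Send output files back to the submit machine when the job exits
when_to_transfer_output = ON_EXIT

queue
```

If this works as intended, files the job creates in its scratch directory (such as fort.19, fort.20 and fort.21) should be copied back to the submit directory when the job exits.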



Thomas Bauer wrote:

Hello again,

I am still trying to get Condor working on my small test pool of 4
Intel Windows 2000 (SP4) machines. I found out that there seems to be a
problem with writing the results back to the submitting machine, but I
don't know how to solve it. I have a test program that creates three
files (fort.19, fort.20, fort.21) with the results. This program works
fine on any machine of the pool without Condor. When I submit it to the
pool, everything works until the first result is calculated. The job's
log file says the following:
000 (007.000.000) 10/10 18:57:48 Job submitted from host: <x.x.x.x:4178>
001 (007.000.000) 10/10 18:57:56 Job executing on host: <y.y.y.y:1281>
006 (007.000.000) 10/10 18:58:05 Image size of job updated: 792
007 (007.000.000) 10/10 18:58:59 Shadow exception!
	Can no longer talk to condor_starter on execute machine (y.y.y.y)
	0  -  Run Bytes Sent By Job
	528441  -  Run Bytes Received By Job
(The 528441 received bytes are exactly the size of the executable)

To see what had happened, I checked the StarterLog on the executing machine:
10/10 18:57:52 ******************************************************
10/10 18:57:52 ** condor_starter (CONDOR_STARTER) STARTING UP
10/10 18:57:52 ** $CondorVersion: 6.5.5 Sep 17 2003 $
10/10 18:57:53 ** $CondorPlatform: INTEL-WINNT40 $
10/10 18:57:53 ** PID = 652
10/10 18:57:53 ******************************************************
10/10 18:57:53 Using config file: C:\Condor\condor_config
10/10 18:57:53 Using local config files: C:\Condor\condor_config.local
10/10 18:57:53 DaemonCore: Command Socket at <y.y.y.y:1328>
10/10 18:57:53 Setting resource limits not implemented!
10/10 18:57:53 Starter communicating with condor_shadow <x.x.x.x:4327>
10/10 18:57:53 Submitting machine is "COMPUTERNAME.DOMAIN.COM"
10/10 18:57:55 File transfer completed successfully.
10/10 18:57:56 Starting a VANILLA universe job with ID: 7.0
10/10 18:57:56 IWD: C:\Condor\execute\dir_652
10/10 18:57:56 Output file: C:\Condor\execute\dir_652\trapez.out
10/10 18:57:56 Error file: C:\Condor\execute\dir_652\trapez.err
10/10 18:57:56 Renice expr "10" evaluated to 10
10/10 18:57:56 About to exec C:\Condor\execute\dir_652\condor_exec.exe
10/10 18:57:56 Create_Process succeeded, pid=1416
10/10 18:58:38 Process exited, pid=1416, status=0
10/10 18:58:59 ReliSock: put_file: TransmitFile() failed, errno=10054
10/10 18:58:59 ERROR "DoUpload: Failed to send file
C:\Condor\execute\dir_652\fort.19, exiting at 1371
" at line 1370 in file ..\src\condor_c++_util\file_transfer.C
10/10 18:58:59 ShutdownFast all jobs.
10/10 18:58:59 Error disabling account condor-reuse-vm1 (ACCESS DENIED)

The failure seems to be in one of the last lines. The file fort.19 is
calculated and created, but can't be sent back. I don't have the
source code, so I don't know why the program exits at that line.

Then I tested a job that had a batch file (@echo Hello!) as its executable.
This job ran without any problems. I didn't find any error messages, but
the output (Hello!) was not written to hello.out, which I had designated
as the output file.

Does anybody know what I am doing wrong? I don't believe this has anything
to do with user rights, because I already ran tests in which only very low
privileges were needed to write to those hard disks. Maybe one of you can
tell me what is on line 1370 of that C program?

Thanks in advance,
Thomas Bauer

Condor Support Information:
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>
