
[condor-users] Still Shadow Exceptions :-(



Hello all,

I still have the shadow exception problem. I have a small test pool of
4 Windows 2000 (SP4) machines, all with Condor 6.6.0 (Nov 24 2003)
installed.
On every machine the local condor-reuse account is in a group called
PowerUsers, and this group has the right to start a batch job. I verified
this by checking the local security policy on each machine.
OK, that's my setup. Now the problem:
I submit a job called trapez.exe. The program should create 3 result
files: fort.19, fort.20 and fort.21.
After a few seconds the shadow exception occurs.
The trapez.log says:

000 (015.000.000) 12/08 13:43:37 Job submitted from host: <128.176.208.220:1051>
...
001 (015.000.000) 12/08 13:43:47 Job executing on host: <128.176.206.149:1048>
...
006 (015.000.000) 12/08 13:43:55 Image size of job updated: 868
...
007 (015.000.000) 12/08 13:44:12 Shadow exception!
        Can no longer talk to condor_starter on execute machine (128.176.206.149)
        0  -  Run Bytes Sent By Job
        528441  -  Run Bytes Received By Job
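
For anyone who wants to pick the events out of such a user log automatically, here is a minimal Python sketch. It assumes only the layout visible in the excerpt above (a three-digit event code, the job ID in parentheses, date, time, then text, with "..." lines between events). The function name and the dictionary keys are made up for illustration and are not part of Condor.

```python
import re

# Header line of one event: code, (cluster.proc.subproc), date, time, text.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d+/\d+) (\d+:\d+:\d+) (.*)$"
)

def parse_events(log_text):
    """Split a user log excerpt into events; '...' lines separate events."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            events.append({
                "code": m.group(1),
                "job": f"{int(m.group(2))}.{int(m.group(3))}",
                "time": m.group(6),
                "text": m.group(7).strip(),
                "details": [],
            })
        elif line.strip() and line.strip() != "..." and events:
            # Continuation line belonging to the most recent event.
            events[-1]["details"].append(line.strip())
    return events

sample = """\
001 (015.000.000) 12/08 13:43:47 Job executing on host: <128.176.206.149:1048>
...
007 (015.000.000) 12/08 13:44:12 Shadow exception!
        Can no longer talk to condor_starter on execute machine (128.176.206.149)
"""
for ev in parse_events(sample):
    print(ev["code"], ev["job"], ev["time"], ev["text"])
```

Filtering the result for code "007" gives just the shadow exceptions together with their detail lines.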

The StarterLog on the execute machine says:

12/8 13:43:43 ******************************************************
12/8 13:43:43 ** condor_starter (CONDOR_STARTER) STARTING UP
12/8 13:43:43 ** $CondorVersion: 6.6.0 Nov 24 2003 $
12/8 13:43:43 ** $CondorPlatform: INTEL-WINNT40 $
12/8 13:43:43 ** PID = 580
12/8 13:43:44 ******************************************************
12/8 13:43:44 Using config file: C:\Condor\condor_config
12/8 13:43:44 Using local config files: C:\Condor\condor_config.local
12/8 13:43:44 DaemonCore: Command Socket at <128.176.206.149:1153>
12/8 13:43:44 Setting resource limits not implemented!
12/8 13:43:44 Starter communicating with condor_shadow <128.176.208.220:3410>
12/8 13:43:44 Submitting machine is "PFT23.NWZNET.UNI-MUENSTER.DE"
12/8 13:43:45 File transfer completed successfully.
12/8 13:43:46 Starting a VANILLA universe job with ID: 15.0
12/8 13:43:46 IWD: C:\Condor\execute\dir_580
12/8 13:43:46 Output file: C:\Condor\execute\dir_580\trapez.out
12/8 13:43:46 Error file: C:\Condor\execute\dir_580\trapez.err
12/8 13:43:46 Renice expr "10" evaluated to 10
12/8 13:43:46 About to exec C:\Condor\execute\dir_580\condor_exec.exe
12/8 13:43:47 Create_Process succeeded, pid=528
12/8 13:44:11 Process exited, pid=528, status=0
12/8 13:44:12 ReliSock: put_file: TransmitFile() failed, errno=10054
12/8 13:44:12 ERROR "DoUpload: Failed to send file C:\Condor\execute\dir_580\fort.19, exiting at 1371
" at line 1370 in file ..\src\condor_c++_util\file_transfer.C
12/8 13:44:12 ShutdownFast all jobs.
12/8 13:44:12 Error disabling account condor-reuse-vm1 (ACCESS DENIED)

OK, something goes wrong here. The fort.19 file is created on the execute
machine but can't be uploaded to the submit machine.
But why? Why is there suddenly a problem bringing that file back to the
submit machine?
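
One detail worth noting: errno=10054 in the TransmitFile() line is the Winsock code WSAECONNRESET, i.e. the connection was reset while the file was being sent. That narrows the failure to the network connection between starter and shadow, although it does not say what reset it. Below is a minimal Python sketch for pulling such codes out of a log excerpt. It assumes the "errno=N" format seen in the log above; the function name and the small lookup table are illustrative only and are not part of Condor.

```python
import re

# Small, hand-picked subset of Winsock error codes (not a complete table).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED",
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
    10061: "WSAECONNREFUSED",
}

ERRNO_RE = re.compile(r"errno=(\d+)")

def find_socket_errors(log_text):
    """Return (line, code, name) for every 'errno=N' found in log_text."""
    hits = []
    for line in log_text.splitlines():
        m = ERRNO_RE.search(line)
        if m:
            code = int(m.group(1))
            name = WINSOCK_ERRORS.get(code, "unknown")
            hits.append((line.strip(), code, name))
    return hits

sample = "12/8 13:44:12 ReliSock: put_file: TransmitFile() failed, errno=10054"
for line, code, name in find_socket_errors(sample):
    print(code, name)
```

Codes missing from the table are reported as "unknown" rather than guessed.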

Thanks in advance,
Thomas Bauer
--------------------------------------------
Westfaelische Wilhelms-Universitaet Muenster
Institut fuer Festkoerpertheorie
Wilhelm-Klemm-Str. 10
D 48149 Muenster
++49 (251) 8339040
--------------------------------------------
