RE: [Condor-users] Transfer problem?



Gerar,

This looks like a problem I was having. I had to
change HOSTALLOW_WRITE and HOSTALLOW_READ on my
central manager to use the numeric network address
(rather than the network name). Somehow the reverse
DNS lookup wasn't being done correctly.
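
If you want to check whether reverse lookup is the culprit on your
machines, something like the following might help. It is only a rough
sketch, not anything Condor itself provides: the function name
reverse_lookup_matches and its resolver parameter are made up for
illustration, and it relies only on Python's standard
socket.gethostbyaddr call.

```python
import socket


def reverse_lookup_matches(ip, expected_host, resolver=socket.gethostbyaddr):
    """Return True if the reverse lookup of `ip` yields `expected_host`.

    `resolver` is a hypothetical hook so the check can be exercised
    without real DNS; by default the system resolver is used.
    """
    try:
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        primary, aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses;
        # a failed lookup means the address has no usable reverse record.
        return False
    names = {primary.lower()}
    names.update(alias.lower() for alias in aliases)
    return expected_host.lower() in names


# Example call (assumes the address and name taken from the log excerpt
# further down really belong together, which the log suggests but does
# not prove):
#   reverse_lookup_matches("193.147.240.196", "adonis.imim.es")
```

If this returns False for a machine that is listed by name in
HOSTALLOW_READ or HOSTALLOW_WRITE, that would be consistent with the
problem I ran into, and listing the numeric address instead may be
worth a try.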

Please note, I am not a wizard at this, but I haven't had 
any responses on this board and I am hoping this will 
help you out.

Good luck,
Jim

-----Original Message-----
From: LOPEZ BARNES, GERAR [mailto:glopez1@xxxxxxx]
Sent: Wednesday, July 21, 2004 5:53 AM
To: 'condor-users@xxxxxxxxxxx'
Subject: [Condor-users] Transfer problem?


 
We have a problem running Condor that we can't figure out how to
solve. When we submit a new job to the queue, we can see that the job
runs for a short period of time and then goes back to idle status.
Checking the log files, we found this:

Starterlog.vm2:

7/21 14:25:32 ******************************************************
7/21 14:25:32 ** condor_starter (CONDOR_STARTER) STARTING UP
7/21 14:25:32 ** $CondorVersion: 6.6.5 May  3 2004 $
7/21 14:25:32 ** $CondorPlatform: I386-LINUX-RH9 $
7/21 14:25:32 ** PID = 6863
7/21 14:25:32 ******************************************************
7/21 14:25:32 Using config file: /home/condor/condor-6.6.5/etc/condor_config
7/21 14:25:32 Using local config files: /home/condor/condor-6.6.5/local.thymus/condor_config.local
7/21 14:25:32 DaemonCore: Command Socket at <193.147.240.191:50621>
7/21 14:25:32 Done setting resource limits
7/21 14:25:32 Starter communicating with condor_shadow <193.147.240.196:4582>
7/21 14:25:32 Submitting machine is "adonis.imim.es"
7/21 14:25:32 File transfer completed successfully.
7/21 14:25:32 Starting a VANILLA universe job with ID: 3.0
7/21 14:25:32 IWD: /home/condor/condor-6.6.5/local.thymus/execute/dir_6863
7/21 14:25:32 Output file: /home/condor/condor-6.6.5/local.thymus/execute/dir_6863/2program.out
7/21 14:25:32 Error file: /home/condor/condor-6.6.5/local.thymus/execute/dir_6863/2program.err
7/21 14:25:32 About to exec /home/condor/condor-6.6.5/local.thymus/execute/dir_6863/condor_exec.exe
7/21 14:25:32 Create_Process succeeded, pid=6865
7/21 14:25:35 Process exited, pid=6865, status=0
7/21 14:25:35 ReliSock: put_file: Failed to open file /home/condor/condor-6.6.5/local.thymus/execute/dir_6863/2program.log, errno = 2.
7/21 14:25:35 ERROR "DoUpload: Failed to send file /home/condor/condor-6.6.5/local.thymus/execute/dir_6863/2program.log, exiting at 1379
" at line 1378 in file file_transfer.C
7/21 14:25:35 ShutdownFast all jobs.

And when we look at the job's log file, we get:

2program.log:

000 (003.000.000) 07/21 14:24:38 Job submitted from host: <193.147.240.196:4543>
...
001 (003.000.000) 07/21 14:28:59 Job executing on host: <193.147.240.191:50474>
...
007 (003.000.000) 07/21 14:29:03 Shadow exception!
        Can no longer talk to condor_starter on execute machine (193.147.240.191)
        0  -  Run Bytes Sent By Job
        14829  -  Run Bytes Received By Job
...
001 (003.000.000) 07/21 14:29:04 Job executing on host: <193.147.240.191:50474>
...
007 (003.000.000) 07/21 14:29:07 Shadow exception!
        Can no longer talk to condor_starter on execute machine (193.147.240.191)
        0  -  Run Bytes Sent By Job
        14829  -  Run Bytes Received By Job

Any idea where the problem is?

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users