
[Condor-users] Chirp problem?



Hello everybody,

I have a problem with MPICH jobs.
I want to run the NAS benchmark bt.A.1 with Condor, but when I submit the job,
it runs for 3 seconds and then ends prematurely. The output and error files contain these messages:

nas.out
error 0 chirp putting identity keys back

nas.error
chirp: couldn't get response from server: Illegal seek
/usr/local/condor/libexec/sshd.sh: line 72: 3356 Aborted (core dumped) $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key


Job submission file:

universe = parallel
executable = mp1script
arguments = bt.A.1
machine_count = 1
transfer_input_files = bt.A.1
should_transfer_files = yes
when_to_transfer_output = on_exit
+WantIOProxy=true
log = nas.log
output = nas.out
error = nas.error
queue


On the host where the job starts to run, StarterLog.vm1 contains:

12/4 16:31:24 ******************************************************
12/4 16:31:24 ** condor_starter (CONDOR_STARTER) STARTING UP
12/4 16:31:24 ** /usr/local/condor/sbin/condor_starter
12/4 16:31:24 ** $CondorVersion: 6.8.1 Sep 17 2006  $
12/4 16:31:24 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/4 16:31:24 ** PID = 3341
12/4 16:31:24 ** Log last touched time unavailable (No such file or directory)
12/4 16:31:24 ******************************************************
12/4 16:31:24 Using config source: /usr/local/condor/etc/condor_config
12/4 16:31:24 Using local config sources:
12/4 16:31:24    /usr/local/condor/home/condor_config.local
12/4 16:31:24 DaemonCore: Command Socket at <0.0.0.0:32780>
12/4 16:31:24 Done setting resource limits
12/4 16:31:24 Communicating with shadow <172.24.1.3:33401>
12/4 16:31:24 Submitting machine is "gdx0003.orsay.grid5000.fr"
12/4 16:31:24 Job has WantIOProxy=true
12/4 16:31:24 Initialized IO Proxy.
12/4 16:31:24 File transfer completed successfully.
12/4 16:31:25 Starting a PARALLEL universe job with ID: 27.0
12/4 16:31:25 IWD: /usr/local/condor/home/execute/dir_3341
12/4 16:31:25 Output file: /usr/local/condor/home/execute/dir_3341/nas1.out
12/4 16:31:25 Error file: /usr/local/condor/home/execute/dir_3341/nas1.error
12/4 16:31:25 About to exec /usr/local/condor/home/execute/dir_3341/condor_exec.exe bt.A.1
12/4 16:31:25 Create_Process succeeded, pid=3343
12/4 16:31:26 IOProxy: rejecting connection from 127.0.0.1: invalid ip addr
12/4 16:31:26 Process exited, pid=3343, status=255
12/4 16:31:26 Got SIGQUIT.  Performing fast shutdown.
12/4 16:31:26 ShutdownFast all jobs.
12/4 16:31:26 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
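
The line that looks relevant in the log above is the IOProxy rejection. As a rough aid for checking whether the same rejection shows up in other starter logs, here is a minimal Python sketch that pulls such lines out of a log. It assumes only the line format seen above (date, time, then the message); the function name find_ioproxy_rejections and the regular expression are my own and are not part of Condor.

```python
import re

# Matches starter log lines of the form seen above, e.g.
# "12/4 16:31:26 IOProxy: rejecting connection from 127.0.0.1: invalid ip addr"
# (the pattern is an assumption based on that single line).
REJECT_RE = re.compile(
    r"(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) "
    r"IOProxy: rejecting connection from (?P<addr>[\d.]+): (?P<reason>.+)"
)


def find_ioproxy_rejections(lines):
    """Return (date, time, addr, reason) for each rejected IOProxy connection."""
    hits = []
    for line in lines:
        # Drop whitespace and any stray '*' markers around a pasted line.
        match = REJECT_RE.search(line.strip("* \n"))
        if match:
            hits.append(
                (match["date"], match["time"], match["addr"], match["reason"])
            )
    return hits
```

Any iterable of lines works, so an open file handle for StarterLog.vm1 can be passed in directly. An empty result would mean the starter never logged a rejection in that format.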


Thank you,
Ala Rezmerita