
[HTCondor-users] Run mpich, error, then get: condor_write() failed: send() 1 bytes returned -1, timeout=0, errno=32 Broken pipe



Hi,

I'm trying to run MPICH under HTCondor. I've configured the dedicated scheduler, and that part works. I then tried to run the pi_montecarlo.x application. According to the log file the job appears to run normally, but in the end I do not get any result.

#### Submission file:

######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = pi_montecarlo.x
machine_count = 1
output         = loop.out               
error          = loop.error            
log            = loop.log   
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = pi_montecarlo.x
queue

#### Log file:

000 (018.000.000) 01/15 11:38:41 Job submitted from host: <10.3.16.144:55930>
...
014 (018.000.000) 01/15 11:38:56 Node 0 executing on host: <10.3.16.112:39838>
...
001 (018.000.000) 01/15 11:38:56 Job executing on host: MPI_job
...
015 (018.000.000) 01/15 11:39:04 Node 0 terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    479  -  Run Bytes Sent By Node
    1317684  -  Run Bytes Received By Node
    479  -  Total Bytes Sent By Node
    1317684  -  Total Bytes Received By Node
...
005 (018.000.000) 01/15 11:39:05 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    479  -  Run Bytes Sent By Job
    1317684  -  Run Bytes Received By Job
    479  -  Total Bytes Sent By Job
    1317684  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :     1500     1500  13561016
       Memory (MB)          :        3        1      1995
...

#############

In the error file, I got this:

/etc/condor/var/execute/dir_15035/condor_exec.exe: 125: [: Illegal number: pi_montecarlo.x
/etc/condor/var/execute/dir_15035/condor_exec.exe: 35: [: Illegal number: pi_montecarlo.x
/etc/condor/var/execute/dir_15035/condor_exec.exe: 61: /etc/condor/var/execute/dir_15035/condor_exec.exe: cannot open /etc/condor/var/execute/dir_15035/contact: No such file
/etc/condor/var/execute/dir_15035/condor_exec.exe: 64: /etc/condor/var/execute/dir_15035/condor_exec.exe: mpirun: not found

##############

This is the StarterLog from 10.3.16.112 (the worker):

01/15/14 11:38:50 ** condor_starter (CONDOR_STARTER) STARTING UP
01/15/14 11:38:50 ** /etc/condor/sbin/condor_starter
01/15/14 11:38:50 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/15/14 11:38:50 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/15/14 11:38:50 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
01/15/14 11:38:50 ** $CondorPlatform: x86_64_Ubuntu12 $
01/15/14 11:38:50 ** PID = 15035
01/15/14 11:38:50 ** Log last touched 1/15 10:48:16
01/15/14 11:38:50 ******************************************************
01/15/14 11:38:50 Using config source: /etc/condor/etc/condor_config
01/15/14 11:38:50 Using local config sources:
01/15/14 11:38:50    /etc/condor/var/condor_config.local
01/15/14 11:38:50 DaemonCore: command socket at <10.3.16.112:33481>
01/15/14 11:38:50 DaemonCore: private command socket at <10.3.16.112:33481>
01/15/14 11:38:50 Communicating with shadow <10.3.16.144:41776?noUDP>
01/15/14 11:38:50 Submitting machine is "hpclab.abcd.efg.hi"
01/15/14 11:38:50 setting the orig job name in starter
01/15/14 11:38:50 setting the orig job iwd in starter
01/15/14 11:38:50 Job has WantIOProxy=true
01/15/14 11:38:50 Initialized IO Proxy.
01/15/14 11:38:50 Done setting resource limits
01/15/14 11:38:50 File transfer completed successfully.
01/15/14 11:38:51 Job 18.0 set to execute immediately
01/15/14 11:38:51 Starting a PARALLEL universe job with ID: 18.0
01/15/14 11:38:51 IWD: /etc/condor/var/execute/dir_15035
01/15/14 11:38:51 Output file: /etc/condor/var/execute/dir_15035/_condor_stdout
01/15/14 11:38:51 Error file: /etc/condor/var/execute/dir_15035/_condor_stderr
01/15/14 11:38:56 About to exec /etc/condor/var/execute/dir_15035/condor_exec.exe pi_montecarlo.x
01/15/14 11:38:56 Setting job's virtual memory rlimit to 0 megabytes
01/15/14 11:38:56 Running job as user nobody
01/15/14 11:38:56 Create_Process succeeded, pid=15039
01/15/14 11:38:57 condor_write() failed: send() 1 bytes to <10.3.16.112:39661> returned -1, timeout=0, errno=32 Broken pipe.
01/15/14 11:38:59 Process exited, pid=15039, status=0
01/15/14 11:39:00 Got SIGQUIT.  Performing fast shutdown.
01/15/14 11:39:00 ShutdownFast all jobs.
01/15/14 11:39:00 **** condor_starter (condor_STARTER) pid 15035 EXITING WITH STATUS 0

####

I guess the problem is related to "condor_write() failed: send() 1 bytes to <10.3.16.112:47961> returned -1, timeout=0, errno=32 Broken pipe", but I do not know how to solve it. I am also wondering why there is a condor_exec.exe in a Linux environment (see the error file).

I hope someone can help me.

Thank you very much in advance.