
[Condor-users] Problem running mpi job on Condor 6.7.5 Feb 28 2005, I386-LINUX_RH9



Hi!

I want to benchmark my setup using an MPI Linpack job.

The binary is compiled and runs on a single machine without problems. When
submitted via Condor, it never gets executed:

=== Submit file ====
universe = MPI
executable = xhpl-mpi-condor
log = log/logfile
output = log/outfile.$(cluster).$(process)
error = log/errfile.$(cluster).$(process)
machine_count = 10
queue
=== Submit file ====

My cluster is a Linux cluster running Debian Sarge with kernel 2.6.10.

Here is the schedd log with D_FULLDEBUG enabled:

4/20 20:24:21 DaemonCore: Command received via TCP from host
<193.170.74.44:48747>
4/20 20:24:21 DaemonCore: received command 1111 (QMGMT_CMD), calling handler
(handle_q)
4/20 20:24:21 AUTHENTICATE_FS: used file /tmp/qmgr_lWcXDp, status: 1
4/20 20:24:21 OwnerCheck retval 1 (success),no ad
4/20 20:24:21 OwnerCheck retval 1 (success),no ad
4/20 20:24:21 get_file(): going to write to filename
/grid/condor/hosts/gridmaster/spool/cluster99.ickpt.subproc0
4/20 20:24:21 get_file: Receiving 1476050 bytes
4/20 20:24:21 get_file: wrote 1476050 bytes to file
4/20 20:24:21 done with transfer, errno = 0
4/20 20:24:22 condor_read(): Socket closed when trying to read buffer
4/20 20:24:22 QMGR Connection closed
4/20 20:24:22 DaemonCore: Command received via TCP from host
<193.170.74.44:48748>
4/20 20:24:22 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling
handler (attempt_access_handler)
4/20 20:24:22 ATTEMPT_ACCESS: Switching to user uid: 22677 gid: 22677.
4/20 20:24:22 Checking file /grid/home/pkolmann/mpitest/log/outfile.99.0 for
write permission.
4/20 20:24:22 Switching back to old priv state.
4/20 20:24:22 DaemonCore: Command received via TCP from host
<193.170.74.44:48749>
4/20 20:24:22 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling
handler (attempt_access_handler)
4/20 20:24:22 ATTEMPT_ACCESS: Switching to user uid: 22677 gid: 22677.
4/20 20:24:22 Checking file /grid/home/pkolmann/mpitest/log/errfile.99.0 for
write permission.
4/20 20:24:22 Switching back to old priv state.
4/20 20:24:22 Found idle MPI cluster 99
4/20 20:24:22 Started timer (1401) to call handleDedicatedJobs() in 2 secs
4/20 20:24:22 JobsRunning = 0
4/20 20:24:22 JobsIdle = 0
4/20 20:24:22 JobsHeld = 0
4/20 20:24:22 JobsRemoved = 0
4/20 20:24:22 LocalUniverseJobsRunning = 0
4/20 20:24:22 LocalUniverseJobsIdle = 0
4/20 20:24:22 SchedUniverseJobsRunning = 0
4/20 20:24:22 SchedUniverseJobsIdle = 0
4/20 20:24:22 N_Owners = 1
4/20 20:24:22 MaxJobsRunning = 200
4/20 20:24:22 ENABLE_SOAP is undefined, using default value of False
4/20 20:24:22 Trying to update collector <193.170.74.44:9618>
4/20 20:24:22 Attempting to send update via UDP to collector
gridmaster.ben.tuwien.ac.at <193.170.74.44:9618>
4/20 20:24:22 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/20 20:24:22 Sent HEART BEAT ad to 1 collectors. Number of submittors=1
4/20 20:24:22 Changed attribute: RunningJobs = 0
4/20 20:24:22 Changed attribute: IdleJobs = 0
4/20 20:24:22 Changed attribute: HeldJobs = 0
4/20 20:24:22 Changed attribute: FlockedJobs = 0
4/20 20:24:22 Changed attribute: Name = "pkolmann@xxxxxxxxxxxxxxxx"
4/20 20:24:22 Sent ad to central manager for pkolmann@xxxxxxxxxxxxxxxx
4/20 20:24:22 Trying to update collector <193.170.74.44:9618>
4/20 20:24:22 Attempting to send update via UDP to collector
gridmaster.ben.tuwien.ac.at <193.170.74.44:9618>
4/20 20:24:22 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/20 20:24:22 Sent ad to 1 collectors for pkolmann@xxxxxxxxxxxxxxxx
4/20 20:24:22 ============ Begin clean_shadow_recs =============
4/20 20:24:22 ============ End clean_shadow_recs =============
4/20 20:24:22 Called reschedule_negotiator()
4/20 20:24:22 Sending RESCHEDULE command to negotiator(s)
4/20 20:24:22 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/20 20:24:22 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/20 20:24:24 Starting DedicatedScheduler::handleDedicatedJobs
4/20 20:24:24 Found 1 idle dedicated job(s)
4/20 20:24:24 DedicatedScheduler: Listing all dedicated jobs -
4/20 20:24:24 Dedicated job: 99.0 pkolmann
4/20 20:24:24 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/20 20:24:24 Will use UDP to update collector gridmaster.ben.tuwien.ac.at
<193.170.74.44:9618>
4/20 20:24:24 Trying to query collector <193.170.74.44:9618>
4/20 20:24:24 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/20 20:24:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/20 20:24:24 Found 0 potential dedicated resources
4/20 20:24:24 idle resource list
4/20 20:24:24  ************ empty ************
4/20 20:24:24 limbo resource list
4/20 20:24:24  ************ empty ************
4/20 20:24:24 unclaimed resource list
4/20 20:24:24  ************ empty ************
4/20 20:24:24 busy resource list
4/20 20:24:24  ************ empty ************
4/20 20:24:24 Trying to find 10 resource(s) for dedicated job 99.0
4/20 20:24:24 Trying to satisfy job with all possible resources
4/20 20:24:24 Can't satisfy job 99 with all possible resources... trying next job
4/20 20:24:24 In DedicatedScheduler::publishRequestAd()
4/20 20:24:24 Trying to update collector <193.170.74.44:9618>
4/20 20:24:24 Attempting to send update via UDP to collector
gridmaster.ben.tuwien.ac.at <193.170.74.44:9618>
4/20 20:24:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/20 20:24:24 Entering DedicatedScheduler::checkSanity()
4/20 20:24:24 Finished DedicatedScheduler::handleDedicatedJobs
4/20 20:25:02 -------- Begin starting jobs --------
4/20 20:25:02 -------- Done starting jobs --------
4/20 20:27:40 Getting monitoring info for pid 29731
4/20 20:27:53 DaemonCore: in SendAliveToParent()
4/20 20:27:53 DaemonCore: attempting to connect to '<193.170.74.44:53510>'
4/20 20:27:53 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/20 20:27:53 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
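
The lines in the log above that look most relevant are the dedicated
scheduler's resource counts ("Found 0 potential dedicated resources" against
"Trying to find 10 resource(s)"). As a minimal sketch for pulling these out of
a longer schedd log, something like the following Python could be used. The
function name summarize_dedicated and the regular expressions are assumptions
of mine, based only on the log lines shown here and not on any Condor API.

```python
import re

# Patterns for lines the dedicated scheduler writes at D_FULLDEBUG.
# They are copied from the log excerpt above (an assumption: other
# Condor versions may word these messages differently).
FOUND_RE = re.compile(r"Found (\d+) potential dedicated resources")
WANT_RE = re.compile(
    r"Trying to find (\d+) resource\(s\) for dedicated job (\d+\.\d+)"
)


def summarize_dedicated(log_text):
    """Return (resources_found, {job_id: resources_wanted}) from log text.

    resources_found is None if no matching line appears.
    """
    found = None
    wanted = {}
    for line in log_text.splitlines():
        m = FOUND_RE.search(line)
        if m:
            found = int(m.group(1))  # keep the most recent count
            continue
        m = WANT_RE.search(line)
        if m:
            wanted[m.group(2)] = int(m.group(1))
    return found, wanted


sample = """\
4/20 20:24:24 Found 0 potential dedicated resources
4/20 20:24:24 Trying to find 10 resource(s) for dedicated job 99.0
"""
found, wanted = summarize_dedicated(sample)
print(found, wanted)  # prints: 0 {'99.0': 10}
```

For this log the result is 0 resources found for a job that wants 10, which
matches the "Can't satisfy job 99" message.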

pkolmann@gridmaster:~/mpitest$ condor_q -analyze
-- Submitter: gridmaster.ben.tuwien.ac.at : <193.170.74.44:43696> :
gridmaster.ben.tuwien.ac.at
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
---
099.000:  Run analysis summary.  Of 67 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
     67 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

WARNING: Analysis is meaningless for MPI universe jobs.

1 jobs; 1 idle, 0 running, 0 held


Maybe someone has a suggestion for me. The DedicatedScheduler is set to
gridmaster in the local config files:

DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxx"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler


thanks for your help
Kind Regards
Philipp Kolmann
TU Wien, Austria

-- 
If you have problems in Windows: REBOOT
If you have problems in Linux:   BE ROOT