
[Condor-users] Condor MPI Problem on Windows



I am testing an MPI application on Condor for Windows. I configured two
machines as dedicated machines, and one of them serves as the dedicated
scheduler in the Condor pool. I submitted a simple MPI program that
requires two processors. The job stays in the idle state and never runs.
Meanwhile, I kept getting the following error email, and the dedicated
scheduler was restarted repeatedly until I removed the job. Has
anyone had a similar experience, and can anyone offer some help?
Thanks a lot!
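
A note on the log in the forwarded message: errno 10061 is the Winsock
code for "connection refused" (WSAECONNREFUSED), which suggests nothing
was accepting TCP connections at that address and port when the schedd
tried. A minimal Python sketch for checking whether a port is reachable
follows. The function name can_connect is made up for illustration and is
not part of Condor, and this only tests plain TCP reachability, not
anything Condor-specific.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers a refused connection (Winsock 10061) as well as timeouts
        # and unreachable hosts.
        return False


# Example call, using the address that appears in the SchedLog excerpt:
#   can_connect("130.39.187.25", 9614)
```

If the call returns False from the submit machine, the daemon expected at
that port may not be running, or a firewall may be blocking the connection.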

---------- Forwarded message ----------
From: SYSTEM@hliu <SYSTEM@hliu>
Date: Apr 7, 2005 2:58 PM
Subject: [Condor] Problem
To: honggao.liu@xxxxxxxxx


This is an automated email from the Condor system
on machine "hliu.ocs.lsu.edu".  Do not reply.

"C:\Condor/bin/condor_schedd.exe" on "hliu.ocs.lsu.edu" exited with status 3.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
4/7 14:53:21 failed to send RESCHEDULE command to negotiator
4/7 14:53:21 Sent ad to central manager for honggao@xxxxxxx
4/7 14:53:21 Sent ad to 1 collectors for honggao@xxxxxxx
4/7 14:53:24 Can't connect to <130.39.187.25:9614>:0, errno = 10061
4/7 14:53:24 Will keep trying for 10 seconds...
4/7 14:53:33 Connect failed for 10 seconds; returning FALSE
4/7 14:53:33 ERROR: SECMAN:2003:TCP connection to <130.39.187.25:9614> failed
4/7 14:53:33 failed to send RESCHEDULE command to negotiator
4/7 14:53:33 DaemonCore: Command received via UDP from host
<130.39.198.109:1809>
4/7 14:53:33 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())

4/7 14:53:33 Shadow pid 3928 for job 10.0 exited with status 100
4/7 14:53:33 match (<130.39.187.25:44195>#1112903283#1) out of jobs
(cluster id 10); relinquishing
4/7 14:53:33 Sent RELEASE_CLAIM to startd on <130.39.187.25:44195>
4/7 14:53:33 Match record (<130.39.187.25:44195>, 10, -1) deleted
4/7 14:53:33 DaemonCore: Command received via TCP from host
<130.39.187.25:44230>
4/7 14:53:33 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
4/7 14:53:33 Got VACATE_SERVICE from <130.39.187.25:44230>
4/7 14:58:08 Activity on stashed negotiator socket
4/7 14:58:08 Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxx
4/7 14:58:09 Out of requests - 2 reqs matched, 0 reqs idle
*** End of file SchedLog

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: honggao.liu@xxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor



-- 
Honggao Liu, Ph.D
High Performance Computing
Office of Computing Services
Louisiana State University
Tel: (225) 578-0235
Fax: (225) 578-6400
E-mail: honggao@xxxxxxx
            honggao.liu@xxxxxxxxx