[Condor-users] Condor Problem when running MPI job



Hi!

I get this message when I have an MPI job running:

----- Forwarded message from condor@xxxxxxxxxxxxxxxxxxxxxx -----

> Date: Fri, 22 Apr 2005 21:04:34 +0200
> From: condor@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [Condor] Problem
> To: kolmann@xxxxxxxxxxxxxxxx
> X-Spam-Status: No, hits=-4.7 required=5.0 tests=BAYES_00,NO_REAL_NAME 
> 	autolearn=no version=2.64
> 
> This is an automated email from the Condor system
> on machine "pc167.ben.tuwien.ac.at".  Do not reply.
> 
> "/grid/condor/sbin/condor_startd" on "pc167.ben.tuwien.ac.at" exited with status 4.
> Condor will automatically restart this process in 10 seconds.
> 
> *** Last 20 line(s) of file StartLog:
> 4/22 20:46:33 Changing state: Owner -> Unclaimed
> 4/22 21:03:37 DaemonCore: Command received via UDP from host <193.170.74.44:48698>
> 4/22 21:03:37 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 4/22 21:03:37 match_info called
> 4/22 21:03:37 Received match <193.170.75.167:32801>#1114195587#1
> 4/22 21:03:37 State change: match notification protocol successful
> 4/22 21:03:37 Changing state: Unclaimed -> Matched
> 4/22 21:03:38 DaemonCore: Command received via TCP from host <193.170.74.44:54791>
> 4/22 21:03:38 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> 4/22 21:03:38 Request accepted.
> 4/22 21:03:38 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> 4/22 21:03:38 State change: claiming protocol successful
> 4/22 21:03:38 Changing state: Matched -> Claimed
> 4/22 21:04:33 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 894 in file Resource.C
> 4/22 21:04:33 Changing state and activity: Claimed/Idle -> Preempting/Killing
> 4/22 21:04:33 State change: No preempting claim, returning to owner
> 4/22 21:04:33 Changing state and activity: Preempting/Killing -> Owner/Idle
> 4/22 21:04:33 State change: IS_OWNER is false
> 4/22 21:04:33 Changing state: Owner -> Unclaimed
> 4/22 21:04:33 startd exiting because of fatal exception.
> *** End of file StartLog
> 
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Questions about this message or Condor in general?
> Email address of the local Condor administrator: kolmann@xxxxxxxxxxxxxxxx
> The Official Condor Homepage is http://www.cs.wisc.edu/condor

----- End forwarded message -----
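
In case it helps with reading the log: a minimal sketch in Python that pulls the ERROR lines and the state changes out of a StartLog excerpt like the one above. It assumes the line format shown in the forwarded message ("M/D HH:MM:SS message", optionally prefixed by the "> " quote marker). The function name summarize_startlog and the sample lines are only for illustration and are not part of Condor.

```python
import re

# Matches lines such as "> 4/22 21:04:33 ERROR ..."; the "> " quote marker is optional.
LINE_RE = re.compile(r"^(?:>\s*)?(\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def summarize_startlog(lines):
    """Return (errors, transitions) found in StartLog-style lines."""
    errors, transitions = [], []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers, blank lines, and anything without a timestamp
        date, time, msg = m.groups()
        if msg.startswith("ERROR"):
            errors.append((date, time, msg))
        elif msg.startswith("Changing state"):
            # Keep only the "Old -> New" part after the first ": ".
            transitions.append((date, time, msg.split(": ", 1)[1]))
    return errors, transitions


# Sample lines copied from the forwarded StartLog excerpt.
sample = [
    "> 4/22 21:03:38 Changing state: Matched -> Claimed",
    "> 4/22 21:04:33 ERROR \"Can't find WANT_SUSPEND in internal ClassAd\" "
    "at line 894 in file Resource.C",
    "> 4/22 21:04:33 Changing state and activity: Claimed/Idle -> Preempting/Killing",
]

errors, transitions = summarize_startlog(sample)
for date, time, msg in errors:
    print(date, time, msg)
for date, time, change in transitions:
    print(date, time, change)
```

Run over the excerpt above, this picks out the single ERROR line at 21:04:33, right after the machine was claimed by the dedicated scheduler.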


condor -v
$CondorVersion: 6.7.6 Mar 15 2005 $
$CondorPlatform: I386-LINUX_RH9 $

The MPI job seems to continue running on the nodes.

Thanks for any help,
philipp

-- 
If you have problems in Windows: REBOOT
If you have problems in Linux:   BE ROOT