[Condor-users] Parallel Universe and dedicated scheduling



Hi All,

I'm trying to get dedicated scheduling set up for Parallel Universe
jobs.  My understanding is that all I need to do is define
DedicatedScheduler on the execute nodes and do all submissions through
the host I define there (and I should also make sure these are never
preempted or suspended).  I have a set of nodes that lets all jobs run
to completion except NiceUser jobs; I chose these as my test set.
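
A minimal sketch of the kind of execute-node configuration described
above, assuming the settings suggested in the Condor manual's section on
dedicated scheduling.  The host name is a placeholder, and none of these
values are taken from the actual configuration of the nodes in question:

```
# Hypothetical example: "submit.example.org" is a placeholder host name.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"

# Advertise the attribute in the machine ClassAd so the dedicated
# scheduler can find this node.
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Never suspend, preempt, or kill jobs on this node.
SUSPEND      = False
CONTINUE     = True
PREEMPT      = False
KILL         = False
WANT_SUSPEND = False
WANT_VACATE  = False

# Prefer claims that come from the dedicated scheduler.
RANK         = Scheduler =?= $(DedicatedScheduler)
```

Note that this sketch sets WANT_SUSPEND to a constant, whereas the nodes
discussed here use an expression that depends on job attributes.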

My parallel universe jobs are getting scheduled, but they are crashing
the startd, which claims WANT_SUSPEND is undefined:

condor/latest-install/sbin/condor_startd" on "borg68.csail.mit.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /opt/condor/log/StartLog:
11/9 22:02:33 Calling HandleReq <command_match_info> (0)
11/9 22:02:33 match_info called
11/9 22:02:33 Received match <128.30.112.196:38230>#1254759541#411#...
11/9 22:02:33 State change: match notification protocol successful
11/9 22:02:33 Changing state: Unclaimed -> Matched
11/9 22:02:33 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.003s)
11/9 22:02:33 Calling Handler <DaemonCore::HandleReqSocketHandler>
11/9 22:02:33 Received TCP command 442 (REQUEST_CLAIM) from condor <128.30.112.26:34738>, access level DAEMON
11/9 22:02:33 Calling HandleReq <command_request_claim> (0)
11/9 22:02:33 Request accepted.
11/9 22:02:33 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxx
11/9 22:02:33 State change: claiming protocol successful
11/9 22:02:33 Changing state: Matched -> Claimed
11/9 22:02:33 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1226 in file Resource.cpp
11/9 22:02:33 Changing state and activity: Claimed/Idle -> Preempting/Killing
11/9 22:02:34 State change: No preempting claim, returning to owner
11/9 22:02:34 Changing state and activity: Preempting/Killing -> Owner/Idle
11/9 22:02:34 State change: IS_OWNER is false
11/9 22:02:34 Changing state: Owner -> Unclaimed
11/9 22:02:34 startd exiting because of fatal exception.
*** End of file StartLog

But WANT_SUSPEND *is* defined (the three values below are DedicatedScheduler, Start, and Want_Suspend, in that order):

[jon@borg-login-1 ~]$ condor_config_val -n borg68 DedicatedScheduler Start Want_Suspend
"DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxx"
True
((TARGET.ImageSize <  (15 * 1024)) || ((KeyboardIdle < 60) == False) || (TARGET.JobUniverse == 4) || (TARGET.JobUniverse == 5) ) && ( NiceUser == True)


I'm puzzled...

-Jon