
[Condor-users] Parallel universe job can't match ?



Fellow Condor users,
	I am trying to submit a parallel universe job to Condor.
	My pool is a single-machine pool, with the master, negotiator, and
	collector all running under my own username, i.e. a "personal Condor".
	I first added settings to my local Condor config file (section 4 in
	the attached debug file) to specify that the schedd is a dedicated
	scheduler and that the startd accepts requests from that dedicated
	scheduler.

	With that in place I can submit normal vanilla jobs, and they run.
	But when I submit a parallel universe job that requests a single node
	(machine_count = 1), it never runs. The negotiator log (section 5 in
	the attached debug file) shows the job being rejected with "no match
	found". Why is this? condor_q and condor_status (sections 1 and 2 in
	the attached file) show that the job is sitting idle and the machine
	is in the Unclaimed state.

	I don't know why the schedd ClassAd and the startd ClassAd can't be
	matched. Could anybody give me a clue?

	Also, I see that two schedd ClassAds are posted (NumScheddAds = 2 in
	the negotiator log): one with the normal yyang@hostname identifier,
	the other with DedicatedScheduler@yyang@hostname. Is it true that
	whenever I submit a parallel job, it goes to the negotiator as both a
	normal request and a dedicated request, in the hope that one of them
	will match?


	Thanks a lot
	Yang

********************************************************************************
---- 1) condor_q output:

-- Submitter: stocksong.corp.yahoo.com : <10.72.107.32:38440> : stocksong.corp.yahoo.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   yyang           1/18 12:05   0+00:00:00 I  0   9.8  echo hello        

1 jobs; 1 idle, 0 running, 0 held

********************************************************************************
---- 2) condor_status output:
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

stocksong.cor LINUX       INTEL  Unclaimed  Idle       0.310  1003  0+00:54:51

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     1     0       0         1       0          0        0

               Total     1     0       0         1       0          0        0
********************************************************************************
---- 3) par.job file
######################################
## Parallel example submit description file
######################################
universe = parallel
executable = /bin/echo
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
Arguments = hello 
machine_count = 1
queue

********************************************************************************
---- 4) extra dedicated schedd and startd config
# schedd identity
DedicatedScheduler = "DedicatedScheduler@yyang@stocksong.corp.yahoo.com"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# startd policy
START		= True
SUSPEND	= False
CONTINUE	= True
PREEMPT	= False
KILL		= False
WANT_SUSPEND	= False
WANT_VACATE	= False
RANK		= Scheduler =?= $(DedicatedScheduler)

NEGOTIATOR_INTERVAL = 10
ALL_DEBUG=D_ALL
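
For anyone who wants to sanity-check a config fragment like the one above, here is a minimal Python sketch. It is not a Condor tool: the helper names parse_config and check_dedicated are made up for illustration, and the three checks only mirror what this post configures (DedicatedScheduler defined, listed in STARTD_EXPRS, and referenced from RANK). They are not an authoritative list of what dedicated scheduling needs.

```python
# Hypothetical helpers, not part of Condor: a plain-text check of a
# config fragment for the dedicated-scheduler settings used in this post.

SAMPLE_CONFIG = """\
# schedd identity
DedicatedScheduler = "DedicatedScheduler@yyang@stocksong.corp.yahoo.com"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# startd policy
START = True
RANK = Scheduler =?= $(DedicatedScheduler)
"""


def parse_config(text):
    """Collect NAME = value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so '=?=' inside a value survives.
        name, sep, value = line.partition("=")
        if sep:
            # Names are treated as case-insensitive in this sketch.
            settings[name.strip().upper()] = value.strip()
    return settings


def check_dedicated(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if "DEDICATEDSCHEDULER" not in settings:
        problems.append("DedicatedScheduler is not defined")
    advertised = [n.strip().lower()
                  for n in settings.get("STARTD_EXPRS", "").split(",")]
    if "dedicatedscheduler" not in advertised:
        problems.append("DedicatedScheduler is not listed in STARTD_EXPRS")
    if "$(dedicatedscheduler)" not in settings.get("RANK", "").lower():
        problems.append("RANK does not reference $(DedicatedScheduler)")
    return problems


if __name__ == "__main__":
    for problem in check_dedicated(parse_config(SAMPLE_CONFIG)):
        print("WARNING:", problem)
```

The fragment in this post passes all three checks, so the config text itself does not obviously explain the failed match.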
********************************************************************************
---- 5) Negotiator log
1/18 12:58:04 (fd:7) (pid:10718) Phase 4.1:  Negotiating with schedds ...
1/18 12:58:04 (fd:7) (pid:10718)     NumStartdAds = 1
1/18 12:58:04 (fd:7) (pid:10718)     NormalFactor = 2.000000
1/18 12:58:04 (fd:7) (pid:10718)     MaxPrioValue = 0.500000
1/18 12:58:04 (fd:7) (pid:10718)     NumScheddAds = 2
1/18 12:58:04 (fd:7) (pid:10718)   Negotiating with DedicatedScheduler@yyang@stocksong.corp.yahoo.com at <10.72.107.32:38440>
1/18 12:58:04 (fd:7) (pid:10718) 0 seconds so far
1/18 12:58:04 (fd:7) (pid:10718) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
1/18 12:58:04 (fd:7) (pid:10718)   Calculating schedd limit with the following parameters
1/18 12:58:04 (fd:7) (pid:10718)     ScheddPrio       = 0.500000
1/18 12:58:04 (fd:7) (pid:10718)     ScheddPrioFactor = 1.000000
1/18 12:58:04 (fd:7) (pid:10718)     scheddShare      = 0.500000
1/18 12:58:04 (fd:7) (pid:10718)     scheddAbsShare   = 0.500000
1/18 12:58:04 (fd:7) (pid:10718)     ScheddUsage      = 0
1/18 12:58:04 (fd:7) (pid:10718)     scheddLimit      = 0
1/18 12:58:04 (fd:7) (pid:10718)     MaxscheddLimit   = 0
1/18 12:58:04 (fd:7) (pid:10718) Socket to <10.72.107.32:38440> already in cache, reusing
1/18 12:58:04 (fd:7) (pid:10718)     Over submitter resource limit (0) ... only consider startd ranks
1/18 12:58:04 (fd:7) (pid:10718)     Sending SEND_JOB_INFO/eom
1/18 12:58:04 (fd:7) (pid:10718)     Getting reply from schedd ...
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718)     Got JOB_INFO command; getting classad/eom
1/18 12:58:04 (fd:7) (pid:10718)     Request 00002.00000:
1/18 12:58:04 (fd:7) (pid:10718)       Rejected 2.0 DedicatedScheduler@yyang@stocksong.corp.yahoo.com <10.72.107.32:38440>: no match found
1/18 12:58:04 (fd:7) (pid:10718)     Sending SEND_JOB_INFO/eom
1/18 12:58:04 (fd:7) (pid:10718)     Getting reply from schedd ...
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718)     Got NO_MORE_JOBS;  done negotiating
1/18 12:58:04 (fd:7) (pid:10718)   This schedd hit its scheddlimit.
1/18 12:58:04 (fd:7) (pid:10718) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
1/18 12:58:04 (fd:7) (pid:10718)   Negotiating with yyang@xxxxxxxxxxxxxxxxxxxxxxxx skipped because no idle jobs
1/18 12:58:04 (fd:7) (pid:10718)   Schedd yyang@xxxxxxxxxxxxxxxxxxxxxxxx got all it wants; removing it.
1/18 12:58:04 (fd:7) (pid:10718) ---------- Finished Negotiation Cycle ----------
1/18 12:58:04 (fd:7) (pid:10718) in DaemonCore NewTimer()
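
With ALL_DEBUG=D_ALL the negotiator log gets long, so the lines that matter are easy to miss. Below is a minimal Python sketch that pulls out, per submitter, the computed scheddLimit and any rejected requests. The function name summarize and the regular expressions are my own, written against the log format shown above. They are assumptions about that format, not anything Condor provides, and may need adjusting for other versions or debug levels.

```python
import re

# Patterns written against the log excerpt in this post (assumed format).
NEGOTIATING_RE = re.compile(r"Negotiating with (\S+) at <")
LIMIT_RE = re.compile(r"\bscheddLimit\s*=\s*(-?\d+)")
REJECT_RE = re.compile(r"Rejected (\d+\.\d+) (\S+) <[^>]*>: (.+)$")

SAMPLE_LOG = """\
1/18 12:58:04 (fd:7) (pid:10718) Phase 4.1:  Negotiating with schedds ...
1/18 12:58:04 (fd:7) (pid:10718)   Negotiating with DedicatedScheduler@yyang@stocksong.corp.yahoo.com at <10.72.107.32:38440>
1/18 12:58:04 (fd:7) (pid:10718)     scheddLimit      = 0
1/18 12:58:04 (fd:7) (pid:10718)     MaxscheddLimit   = 0
1/18 12:58:04 (fd:7) (pid:10718)       Rejected 2.0 DedicatedScheduler@yyang@stocksong.corp.yahoo.com <10.72.107.32:38440>: no match found
"""


def summarize(log_text):
    """Map each submitter name to its scheddLimit and rejected jobs."""
    summary = {}
    current = None  # submitter whose negotiation block we are inside
    for line in log_text.splitlines():
        started = NEGOTIATING_RE.search(line)
        if started:
            current = started.group(1)
            summary[current] = {"limit": None, "rejected": []}
            continue
        if current is None:
            continue
        limit = LIMIT_RE.search(line)
        if limit:
            summary[current]["limit"] = int(limit.group(1))
            continue
        rejected = REJECT_RE.search(line)
        if rejected:
            job_id, _submitter, reason = rejected.groups()
            summary[current]["rejected"].append((job_id, reason))
    return summary


if __name__ == "__main__":
    for name, info in summarize(SAMPLE_LOG).items():
        print(name, "scheddLimit =", info["limit"])
        for job_id, reason in info["rejected"]:
            print("  job", job_id, "rejected:", reason)
```

Run against the excerpt above, this shows the two facts the question turns on side by side: the dedicated submitter's scheddLimit is 0, and job 2.0 is rejected with "no match found".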