[Condor-users] 2 match but reject the job for unknown reasons



Hi,

I've run into a problem that I'm trying to debug, but I haven't found any clue as to what is going wrong.

I've set up the Condor binaries on my own cluster and submitted a glide-in request to another system. This works: the nodes show up on my local cluster. I can then send vanilla universe Condor jobs to them, and they execute. I can also send simple (one-job) DAGs, and the job executes as well.

What I haven't been able to get working is the same setup under the parallel universe. I've simplified this to the "sleep" example (with "mydomain.org" pointing at my cluster's site):

    universe = parallel
    executable = /bin/sleep
    arguments = 30
    machine_count = 2
    Requirements = target.Disk == 0 && TARGET.FileSystemDomain == "mydomain.org"
    queue

On the nodes where this would execute, I have the following lines added to the generic "glidein_condor_config" file that comes with the distribution (I put these lines at the bottom of the file):

     DEDICATEDSCHEDULER = "DedicatedScheduler@myusername@mylocalnode.mydomain.org"
     STARTD_ATTRS = $(STARTD_ATTRS), DEDICATEDSCHEDULER

Everything else is a stock, untouched install, apart from the condor_config.local changes I had to make to get it working in the first place. I have DAEMON_LIST set to:

     DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, SHADOW


With all this in place, when the job tries to run, "condor_q -analyze" and "condor_q -better-analyze" both report:

     2 match but reject the job for unknown reasons 

It appears that I'm missing a configuration parameter somewhere, either locally or remotely. I've looked through the log files and haven't seen why the job is being rejected. I've tried setting:

     DEDICATEDSCHEDULER = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"

in the "glidein_condor_config" file on the execute nodes, but that doesn't appear to have made a difference either.
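
For comparison, the Condor manual's section on dedicated scheduling pairs the two lines above with a startd policy along the following lines. This is only a sketch: the scheduler name is the one from my config above, START is left at whatever the glide-in config already sets, and whether the stock "glidein_condor_config" overrides any of these settings is an assumption that would need checking:

     # Sketch only; not verified against this glide-in setup.
     DedicatedScheduler = "DedicatedScheduler@myusername@mylocalnode.mydomain.org"
     STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
     # Never suspend or preempt jobs once they are running.
     SUSPEND = False
     CONTINUE = True
     PREEMPT = False
     KILL = False
     WANT_SUSPEND = False
     WANT_VACATE = False
     # Prefer jobs from the dedicated scheduler over everything else.
     RANK = Scheduler =?= $(DedicatedScheduler)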

Can someone please point me to a log file I should be looking at, or let me know a parameter I should be setting?

I would really appreciate the help!

Thanks,

Steve