
Re: [HTCondor-users] Jobs only running on submit machine



74 cores unclaimed, 10 jobs submitted


codytrey@metis:~/condor$ condor_q -analyze


-- Submitter: metis.physics.tamu.edu : <128.194.151.193:49656> : metis.physics.tamu.edu
---
055.000:  Request has not yet been considered by the matchmaker.

---
055.001:  Request has not yet been considered by the matchmaker.

---
055.002:  Request has not yet been considered by the matchmaker.

---
055.003:  Request has not yet been considered by the matchmaker.

---
055.004:  Request has not yet been considered by the matchmaker.

---
055.005:  Request has not yet been considered by the matchmaker.

---
055.006:  Request has not yet been considered by the matchmaker.

---
055.007:  Request has not yet been considered by the matchmaker.

---
055.008:  Request has not yet been considered by the matchmaker.

---
055.009:  Request has not yet been considered by the matchmaker.




I'm waiting for them to be considered (how long should this take? Sometimes it seems very fast, other times it takes upwards of 10 minutes).
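
For what it's worth, matches only happen when the negotiator runs a cycle, so the wait depends on the negotiation interval; assuming the standard configuration knob, it can be checked on the central manager with:

    condor_config_val NEGOTIATOR_INTERVAL

which prints the number of seconds between negotiation cycles.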

After some time, it changes to:

-- Submitter: metis.physics.tamu.edu : <128.194.151.193:49656> : metis.physics.tamu.edu
---
057.004:  Request is being serviced

---
057.005:  Request is being serviced


The Python script that it runs sleeps for some amount of time, then echoes the hostname. The output of this shows that they all run on the machine named metis.
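
For reference, a minimal sketch of that kind of test script (the real test.py is not shown in the thread, and the sleep length here is arbitrary) would be:

    #!/usr/bin/env python
    # Sleep for a while, then report which machine ran the job.
    import socket
    import time

    time.sleep(30)                 # simulate some work
    print(socket.gethostname())    # hostname ends up in the .out file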

I was looking at the logs on the central manager; it seems that it often tries to communicate with itself over 127.0.0.1:49152 but fails. Could this be related to, or the cause of, my problems?
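
In case it helps, the interface and collector address the daemons are configured to use can be checked with (assuming the standard knob names):

    condor_config_val NETWORK_INTERFACE
    condor_config_val COLLECTOR_HOST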

I also just noticed that condor_status shows 20 cores matched and 16 remaining unclaimed. This leads me to think that the jobs are matched to run on other nodes, but the central manager is not able to send them.
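
One way to confirm whether slots really are stuck in the Matched state (used here purely as an illustration) is a constrained condor_status query:

    condor_status -constraint 'State == "Matched"'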

You've been very helpful thus far, I greatly appreciate it.

-Cody

 

On 2013-02-26 11:50, Jaime Frey wrote:

Make sure all of the machines are in the Unclaimed state in condor_status, and not Owner. If they're in Owner state, they don't want to accept jobs.
Then, submit a new set of jobs and run condor_q -analyze, using the id of one of the idle jobs. That may provide some information about why the jobs aren't running on the other machines.
 -- Jaime
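
For example, using one of the idle job ids shown earlier purely as an illustration:

    condor_q -analyze 55.3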

On Feb 26, 2013, at 11:03 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:

Jaime,

My submit file is:

        Executable = PQL
        Universe = vanilla
        Output = pql.out
        Log = pql.log
        Error = pql.err
        Arguments = -p params.in -t temps.in
        notification = Error
        notify_user = codytrey@xxxxxxxx
        should_transfer_files = YES
Queue 20


I have it queue 20 jobs to see if it would force jobs onto other machines once the submit node had all its processors in use, but it just ran 4 at a time until it was complete.

Same results with:

Executable = test.py
Universe = vanilla
Output = /Volumes/Scratch/test/test.out.$(Process)
Log = /Volumes/Scratch/test/test.log
Error = /Volumes/Scratch/test/test.err
should_transfer_files = ALWAYS
Queue 10


-Cody

 

On 2013-02-26 10:29, Jaime Frey wrote:

What does your submit file look like?
A common problem is that the machines don't have a shared filesystem, and HTCondor's file transfer option isn't being requested in the submit file. In this case, HTCondor will only run the jobs on the submit machine.
 -- Jaime
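
For reference, a minimal vanilla-universe submit file that requests HTCondor's file transfer might look like the sketch below; the executable and file names are placeholders rather than anything specific to this thread.

        Executable              = test.py
        Universe                = vanilla
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        Output                  = test.out.$(Process)
        Error                   = test.err.$(Process)
        Log                     = test.log
        Queue 10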

On Feb 26, 2013, at 9:09 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:

I do see all of the machines in condor_status.

"codytrey@metis:~$ condor_config_val DAEMON_LIST
MASTER, SCHEDD, STARTD"

This is from the submit machine; it is the same on an execute node I just tried.

-Cody

On 2013-02-26 08:47, Cotton, Benjamin J wrote:

Cody,

The first question is: are you sure they're all in the same pool? To check this, do they all show up in the output of condor_status?

My suspicion is that your submit/execute machine might be running its
own condor_collector and condor_negotiator processes. You can check this
with 

condor_config_val DAEMON_LIST

If that's the case, then your execute-only nodes might be as well.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project
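
For comparison, in a typical pool with a single central manager, the DAEMON_LIST settings would look roughly like the sketch below; the exact lists depend on each machine's role and local configuration.

    # central manager (here it also submits and runs jobs)
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

    # execute-only node
    DAEMON_LIST = MASTER, STARTD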