
[Condor-users] Swapped out condor master node



Hi All,

I had a pool of 32 WINNT 51 processors (5 Intel Macs running WinXP, with Condor 7.2) and 8 Linux processors (running RHEL 4.5, also with Condor 7.2). The Linux machine was my central manager. For various reasons, we swapped out the central manager for a new machine running OSX (Condor version 7.2) with 16 processors. We “may” have swapped it out while the WinXP machines were still running Condor; I’m not sure it matters, but I wanted to mention it in case it does.

Anyway, I’m getting some “weird” behavior: when I do a condor_submit for my WinXP jobs, they come back with “job matched but rejected for unknown reasons”. However, if I restart the Condor master on the central manager (and BTW, I actually need to run the condor_master command after a restart, so there is likely something going on here as well), all the machines pop up in condor_status, and if I resubmit the WinXP jobs, they start and complete.

So, my questions are:
  1. How can I tell why a job was rejected on a node in the pool, and can I do that without accessing the node? I’ve looked in all the *Log files on the central manager, but nothing pops out as to why the job was rejected. BTW, I do my submissions remotely and I only have access (ssh, etc.) to the central manager.
  2. I’m guessing this has something to do with permissions, but I’m not sure where or what to check. Any suggestions?
  3. Is there any information I’m missing that might help me debug, or are there general procedures for solving this type of problem?

BTW, condor is a great tool!! You guys have done some awesome work. I’ve been using condor now for about 5 years, and I “usually” don’t have any problems, except when it comes to some off-the-wall problem, mostly of my own doing. ;-)

Thanx
steve


Stephen C. Upton
Research Associate
SEED (Simulation Experiments & Efficient Designs) Center for Data Farming
Cell: 831-402-3888