
Re: [Condor-users] Easy Setup Guide



Hi,
Check flock_to and flock_from in condor_config.local and add the IP of the central machine.
Then run condor_reconfig.
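
A rough way to see what those two settings currently resolve to before and after editing the file. This is only a sketch: it assumes condor_config_val is on the PATH, and the helper name show_flock_settings is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of each flocking setting.
# Falls back to a placeholder if the setting is undefined or the Condor
# tools are not installed on this machine.
show_flock_settings() {
    for name in FLOCK_TO FLOCK_FROM; do
        if command -v condor_config_val >/dev/null 2>&1; then
            printf '%s = %s\n' "$name" "$(condor_config_val "$name" 2>/dev/null || echo '(not defined)')"
        else
            printf '%s = (condor_config_val not found)\n' "$name"
        fi
    done
}

show_flock_settings
```

Running it again after condor_reconfig should show whether the edited values were picked up.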
Bye

On 3 Apr 2012 at 05:30 PM, "Alain Roy" <roy@xxxxxxxxxxx> wrote:

On Mar 30, 2012, at 1:21 PM, Spuds wrote:
I inherited a cluster of condor machines.  I don't know anything about condor and I have a mess on my hands.

Is there access to an easy setup guide to just set up a simple, no-nonsense, nothing special, just the basics cluster?

I think that's a hard question because each person has their own "no-nonsense" needs. The Condor manual is really more of a reference manual than an easy setup guide, that's for sure. 

You might find past tutorials helpful in getting up to speed, though they aren't straightforward "easy setup guides". One example is our Condor administration tutorial. Last year's can be found at:

Three years ago, we had a video made of the Condor administration tutorial. While that's out of date, the basics really haven't changed. If you prefer video to reading, you can check it out:

We have 3 windows hosts and a linux host, and I'm having all kinds of issues.  Some of them I have solved, others still exist.

1) Can't submit jobs on linux
rholloway@rebelbase:~$ condor_submit submit
Submitting job(s)
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <127.0.1.1:60211>

That sounds like Condor isn't running on that computer. How did you start up Condor? What do the following commands print?

condor_version
condor_config_val -v DAEMON_LIST
ps faux | grep condor
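
If the ps output is long, something like the following can narrow it down. It is a sketch, not an official tool: the function name missing_daemons is made up, and checking only condor_master and condor_schedd is an assumption (condor_submit talks to the schedd, so those two are the ones that matter for this error).

```shell
#!/bin/sh
# Hypothetical helper: read a process listing on stdin and print each
# daemon name given as an argument that does NOT appear in it.
missing_daemons() {
    listing=$(cat)
    for d in "$@"; do
        printf '%s\n' "$listing" | grep -q "$d" || printf '%s\n' "$d"
    done
}

# Feed it the command names of all running processes.
ps -e -o comm= | missing_daemons condor_master condor_schedd
```

No output means both daemons were found; any name printed is one that is not running.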

2) Jobs get submitted to the cluster and then show up as "held" and never do anything.

You can find out why the jobs are held, though it's harder than it should be to get reasons. On the computer from which you submitted the jobs:

condor_q -hold JOB_ID
condor_q -l JOB_ID | grep -i hold

Change JOB_ID to the id of a single job that is in the held state.
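
To pull out just the reason text instead of every line that mentions "hold", a small filter along these lines may help. It is a sketch that assumes the long listing contains a line of the form HoldReason = "..."; the function name hold_reason is made up.

```shell
#!/bin/sh
# Hypothetical helper: read a job's long-form listing on stdin and print
# only the text inside the quotes of its HoldReason line.
hold_reason() {
    sed -n 's/^HoldReason = "\(.*\)"$/\1/p'
}

# Usage (JOB_ID is a placeholder, as above):
#   condor_q -l JOB_ID | hold_reason
```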

3) I get all kinds of errors in the Collector Log on what is supposed to be the master:

03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129419:23, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.

03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.

Something could be down. Try those commands I suggested above:

condor_config_val -v DAEMON_LIST
ps faux | grep condor

But I have no problems joining other machines to the cluster. 

And if there are any contractors out there that do this for a living, I'll even pay to have someone fix the environment.  We just need it to work. 

There are several. I'm aware of Cycle Computing:

-alain
------------------------------
Alain Roy
Condor Project
