
Re: [Condor-users] New installation of Condor : questions about configuration files and other general things



Thomas Pegeot wrote:
Hello,

I'm doing a work placement related to Condor, but as I'm a newbie in Condor, I need your help. Currently, I have to set up a desktop grid composed of virtual machines running on dual-core computers (Windows hosts running VirtualBox or VMware Server virtual machines with Debian Etch), and I want to use Condor to manage these virtual machines.

I set up Condor-6.9.2

Condor-6.9.2 is a development release; maybe you should stick with the stable 6.8.4?

as a Central Manager (cm) on a Linux server (Debian Etch) running an NFS server, which allows me to share the release directory (/usr/local/condor in my case) and a single home directory (/home/condor/hosts/$hostname/) with the virtual machines (node1, node2, ...).

To install the central manager, I used the Older Unix Installation Procedure (condor_install).

Unfortunately, there are a lot of things I don't know how to deal with, even with the help of the Condor Manual.

First, about file sharing: do you think my setup is correct?

It is ok. AFAIK, apart from the lock files (which reside in local /tmp by default), you can share everything over NFS.

On each virtual machine, I mount cm:/home/condor/ on /home/condor (I have a condor user on each VM). Is that correct?

Yes.

Same question for the release directory: is it fine to share the same folder between the central manager and the nodes?

Yes.
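A quick way to confirm that a node really sees both directories over NFS is to look them up in its mount table. The sketch below is a minimal illustration, not part of Condor: the function names are hypothetical, and it assumes a Linux node (where /proc/mounts lists the mount point and file system type in fields 2 and 3) with the release directory mounted at the same path, /usr/local/condor, as on the central manager.

```shell
#!/bin/sh
# is_nfs_mounted PATH: succeed if PATH appears as a mount point with an
# nfs/nfs4 file system type in /proc/mounts-style input read from stdin.
# (Hypothetical helper; field 2 is the mount point, field 3 the fs type.)
is_nfs_mounted() {
  awk -v p="$1" '$2 == p && $3 ~ /^nfs/ { found = 1 } END { exit !found }'
}

# check_shares: check both directories shared by the central manager
# (paths taken from the setup described in this thread).
check_shares() {
  status=0
  for dir in /home/condor /usr/local/condor; do
    if is_nfs_mounted "$dir" < /proc/mounts; then
      echo "ok: $dir is NFS-mounted"
    else
      echo "missing: $dir is not an NFS mount" >&2
      status=1
    fi
  done
  return $status
}
```

On a node, running check_shares prints one line per directory and returns non-zero if either share is missing.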

After the installation, I ran condor_status, but I only got the status of my central manager; there was nothing about my nodes. :s

How long have you waited? (A few minutes should be enough.)
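The startd on each node advertises itself to the collector periodically, so a node can take a little while to show up. Instead of re-running condor_status by hand, a small polling loop can wait for it. This is a sketch with a hypothetical function name; it relies on condor_status's -format option to print the Machine attribute of every ad, and assumes that attribute holds the node's host name, possibly fully qualified.

```shell
#!/bin/sh
# wait_for_host HOST [TRIES] [DELAY]: poll the collector via condor_status
# until a machine named HOST (or starting with "HOST.") appears.
# Returns 0 once it appears, 1 if it is still absent after TRIES polls.
wait_for_host() {
  host=$1
  tries=${2:-10}
  delay=${3:-30}
  i=0
  while :; do
    # One Machine value per line, for every ad known to the collector.
    if condor_status -format '%s\n' Machine | grep -q -e "^$host\$" -e "^$host\\."; then
      echo "$host is in the pool"
      return 0
    fi
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && break
    sleep "$delay"
  done
  echo "$host did not appear after $tries tries" >&2
  return 1
}
```

For example, wait_for_host node1 10 30 checks ten times, 30 seconds apart.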

http://pastebin.ca/448387

When I run "condor_status -direct node1", "condor_status -direct node2" ..., I get the following error: "condor_status:can't find address for startd ...".
So I checked which processes were running with "ps -ef | egrep condor_":
- on my cm: condor_master, condor_procd, condor_negotiator, condor_collector, condor_startd, condor_schedd
- on my nodes: condor_master, condor_procd, condor_startd, condor_schedd
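Comparing daemon lists by eye is error-prone, so the check can be done mechanically. This hypothetical helper reads ps -ef output, takes the command from field 8 (the CMD column, assuming STIME contains no spaces) and strips its directory, so a daemon name that merely appears as an argument to some other command is not counted. It assumes the nodes are meant to run and submit jobs, as the process list above suggests.

```shell
#!/bin/sh
# check_daemons DAEMON...: read `ps -ef` output on stdin and report, for
# each daemon name given, whether a process with exactly that program
# name is running. Returns 1 if any daemon is missing.
check_daemons() {
  # Field 8 of `ps -ef` is the command; drop any leading directory.
  names=$(awk '{ cmd = $8; sub(/.*\//, "", cmd); print cmd }')
  missing=0
  for d in "$@"; do
    if printf '%s\n' "$names" | grep -qx -- "$d"; then
      echo "running: $d"
    else
      echo "missing: $d"
      missing=1
    fi
  done
  return $missing
}
```

Usage on a node: ps -ef | check_daemons condor_master condor_startd condor_schedd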

So I think my problem must be related to my configuration files.

It could also be your network configuration (firewall, routing, etc.).
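A first network test, run from a node, is whether the central manager's collector accepts TCP connections at all. This sketch assumes the collector listens on Condor's default port 9618 and that the installed netcat supports -z (probe only) and -w (timeout); the function name is hypothetical. A successful check covers only node-to-cm traffic: Condor daemons also open connections in the other direction, so a firewall on the node can still get in the way.

```shell
#!/bin/sh
# check_port HOST [PORT]: try a TCP connection to HOST:PORT (default 9618,
# the usual collector port) and report whether it succeeded.
# Assumes a netcat supporting -z and -w.
check_port() {
  host=$1
  port=${2:-9618}
  if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

For example, check_port cm on node1 tests whether node1 can reach the collector on the central manager.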

Here is my condor_config : http://pastebin.ca/448390
My cm condor_config.local : http://pastebin.ca/448397

I don't see any errors at first glance.

But the nodes' condor_config.local files are empty. I wonder whether it is correct to have empty configuration files. :s

It is okay for the local configuration files to be empty. It simply means that you do not override any option from the global configuration file.

Is there another script to run on my virtual machines?

No.

I ran condor_init, but it didn't fix my problem.

The next thing to do is to inspect the logs on the cm and on one of the nodes:
$(TILDE)/hosts/$(HOSTNAME)/log/*

Shut down Condor (e.g. by killing condor_master on the cm and on the node), manually remove the files from the log directory, then start the condor_master process again and look at (or paste here) the messages that appear in the log files.
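The stop, clean, restart procedure can be sketched as a small helper. It is hypothetical and deliberately conservative: it assumes pkill/pgrep from procps and condor_master on the PATH, that it runs as the user owning the Condor daemons (often root), and that a plain SIGTERM to condor_master requests a graceful shutdown of its child daemons.

```shell
#!/bin/sh
# restart_with_clean_logs LOGDIR: stop Condor on this host, empty LOGDIR,
# then start condor_master again so the logs hold only fresh messages.
restart_with_clean_logs() {
  logdir=$1
  # Refuse to continue without an existing directory, so "$logdir"/*
  # can never expand to something like /*.
  if [ ! -d "$logdir" ]; then
    echo "no such directory: $logdir" >&2
    return 1
  fi
  # SIGTERM (pkill's default) asks condor_master to shut down gracefully.
  pkill -x condor_master
  # Wait up to about 60 seconds for every condor_* process to exit.
  i=0
  while pgrep condor_ >/dev/null && [ "$i" -lt 60 ]; do
    sleep 1
    i=$((i + 1))
  done
  # Remove the old log files (subdirectories are left alone), then restart.
  rm -f -- "$logdir"/*
  condor_master
}
```

With the layout described in this thread, and assuming $(TILDE) is /home/condor and Condor's $(HOSTNAME) matches the short host name, the call would presumably be: restart_with_clean_logs "/home/condor/hosts/$(hostname -s)/log"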

Regards,
Jan Ploski