
[HTCondor-users] Help with setting up a new cluster



Hi,

I'm new to the list, so if this question is not appropriate, please let me know.

First, a little background: I work at a small startup doing computational fluid dynamics. We have 5 high-end workstations with 2 GPUs each that we would like to use off-hours as a cluster for running single-node jobs (no MPI). From what I understand, HTCondor is perfectly suited for this use case.

A little info on our setup: The 5 workstations are all networked, each with its own local IP address on a subnet. I have an Ubuntu VM running on a network-attached storage device that I would like to act as the central manager. The 5 workstations will each serve in both the execute and submit roles.
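A minimal sketch of the role layout described above, assuming the standard HTCondor role metaknobs rather than the contents of the attached files (which may differ). The host name cm.example.internal is a placeholder, not a name from this setup:

```
# Central manager (the Ubuntu VM), e.g. /etc/condor/config.d/01-central-manager.config
# cm.example.internal is a hypothetical host name used only for illustration.
CONDOR_HOST = cm.example.internal
use ROLE : CentralManager

# Each workstation, e.g. config.d/01-submit.config and config.d/02-execute.config
# Every node points at the same central manager and takes both roles.
CONDOR_HOST = cm.example.internal
use ROLE : Submit
use ROLE : Execute
```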

I used the get_htcondor.sh script to set up the central manager on the VM as well as my workstation (1 of the 5).

The installation went OK and all the services are running and healthy, but the two machines cannot communicate with each other. I also suspect something is very wrong, since neither sudo condor_reconfig nor sudo condor_restart works on either machine. There are no firewalls enabled on either machine (for example, I can ssh between the machines).

I've attached the two condor_config files (renamed from the default file names under /etc/condor). The submit/execute node has additional 01-submit.config and 02-execute.config files defined in the config.d directory. Likewise, the central manager has 01-central-manager.config. They are mostly unmodified from what the installer set up, except for all the ALLOW macros I added in my attempt to get things working.

When I originally ran the get_htcondor.sh script, I specified a shared filesystem, which I suspect was a mistake: all the nodes do have a mounted, shared NFS folder, but not the entire filesystem. I've since commented out those settings, but I don't think the change has taken effect, since the reconfig/restart commands don't work.

For example, when I run condor_restart on the submit/execute node I get the following:

kenway@haleakala:/etc/condor$ sudo condor_restart
[sudo] password for kenway:
ERROR
SECMAN:2010:Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
Can't send Restart command to local master

Likewise, on the central manager node I get:

volcano@volcano:~$ sudo condor_restart
ERROR
SECMAN:2010:Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
Can't send Restart command to local master

I'm sure there is something obvious I'm doing wrong. I have lots of experience using a variety of workload managers (although not HTCondor), but this is the first time I'm trying to administer one.

I will add that the mini-condor setup worked perfectly on my workstation when all 3 roles were on the same computer, so I suspect this is a networking issue on my end.

Any pointers on where to go from here would be greatly appreciated.

Thank you,
Gaetan Kenway
Volcano Platforms
www.volcanoplatforms.com

Attachment: condor_config.execute_submit
Attachment: condor_config.central
Attachment: 01-submit.config
Attachment: 02-execute.config
Attachment: 01-central-manager.config