
Re: [HTCondor-users] Condor Jobs getting stuck in idle state. No valid starters were found!



Sridhar:

I have run into problems like this when the condor user cannot read your configuration files. Maybe the ownership of your Gluster volume changed during the migration?
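
For example, something along these lines (run on a worker node, using the paths from your message, and assuming your daemons run as the "condor" user) should show whether that account can actually read the new config and launch a starter:

    # Is the new gluster mount readable by the condor user?
    ls -ld /mnt/gfs2/files
    ls -l /mnt/gfs2/files/workernode.config
    sudo -u condor cat /mnt/gfs2/files/workernode.config > /dev/null && echo readable

    # Reproduce the startd's probe from your StartLog by hand
    sudo -u condor /usr/sbin/condor_starter -classad

    # Confirm which config files the daemons are actually reading
    condor_config_val -config

If the -classad probe prints nothing (or fails) as the condor user but works as root, it is almost certainly a permissions or ownership problem on the new mount rather than the file contents.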

My recommendation is to use Puppet (or Chef, or CFEngine) to ensure that your configuration files stay identical across nodes. It's not that hard, and it's a solution that will scale to far more than a handful of nodes.
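
Until you have something like that in place, a quick spot check that every worker node really sees the same file, with the same ownership and permissions, might look like this (worker1/worker2 are just placeholders for your node names):

    # Compare config checksum, ownership and permissions across worker nodes
    for host in worker1 worker2; do
        ssh "$host" 'hostname; md5sum /mnt/gfs2/files/workernode.config; ls -l /mnt/gfs2/files/workernode.config'
    done

Identical checksums but different ownership or mode would point back at the mount rather than the file contents.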

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

On Mon, Jun 22, 2015 at 1:08 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:
Anyone got an idea?

On Sun, Jun 21, 2015 at 12:50 AM, Sridhar Thumma <deadman.den@xxxxxxxxx> wrote:
Hi,

I am facing a strange issue with my worker nodes. My setup is: one machine acts as the central manager and a couple of machines act as worker nodes. We were using GlusterFS as the shared file system. Due to some technical issues, we migrated from gfs1 to gfs2 (two different servers). Since then, jobs are not getting executed; they are stuck in idle. I am seeing this in the worker node's StartLog file:

06/20/15 13:59:11 Detected hibernation states: S3,S4
06/20/15 13:59:11 "/usr/sbin/condor_starter -classad" did not produce any output, ignoring
06/20/15 13:59:11 Failed to execute /usr/sbin/condor_starter.std, ignoring
06/20/15 13:59:11 WARNING WARNING WARNING: No valid starters were found! Is something wrong with your Condor installation? This startd will not be able to run jobs.
06/20/15 13:59:11 No STARTD_HISTORY file specified in config file
06/20/15 13:59:11 History file rotation is enabled.
06/20/15 13:59:11   Maximum history file size is: 20971520 bytes

gfs1_config=/mnt/gfs1/files/workernode.config
gfs2_config=/mnt/gfs2/files/workernode.config

I externalized the worker node configuration so that all worker nodes share the same configuration. If I start with the gfs1_config external configuration, jobs execute fine, but if I start with the gfs2_config external configuration, jobs get stuck and the above messages are printed. Going mad over this :( not sure what is going wrong. I compared both configuration files and they are the same; I actually copied the gfs1 configuration to gfs2, so nothing changed.

Does anyone have any idea about this?




