
Re: [Condor-users] condor_master.exe does not start on windows machine reboot




Hi Ben,

Thanks for your answers. Here are my comments:

>>>>>>
On some machines, when the Condor service tries to start
the condor_master daemon, it fails to find the
configuration files in the shared filesystem.
<<<<<<

>>>
Do you know for sure if the files are available to the
master when it goes to fetch them?  That is, is the
shared fs available when the Condor service starts
or does it depend on another service to be started
locally?

<<<

The shared filesystem (Samba) is served to the
Windows machines by some Linux servers.
It does not depend on another service being started
locally.
On most of our pool machines the service finds the
configuration files, but on some machines it doesn't.
It seems that a delay in the NETLOGON authentication
of such a machine against the AD is preventing
Condor from finding the configuration files on the
network.



>>>>>>
I have renamed the service command name to run a
condor_master.bat that waits for 15 secs and then
runs condor_master.exe.
<<<<<<

>>>
Is it always the case that 15 seconds is enough time
for the shared fs to become available?  You could
also have the service control manager restart the
service after the first failure (via the recovery tab
in the services management console extension).  
<<<


The 15-second delay still needs to be verified, but it
worked on one machine we have investigated. We are trying
your suggestion to use the "first failure" setting on the
recovery tab and will let you know whether it is
satisfactory for the end user.



>>>>>>
The Condor service starts, but the jobs submitted from
this machine run only locally and are not distributed
to the other pool machines.
Could someone help me with this?
<<<<<<

>>>
You need a central manager to start distributing jobs.
Do all Condor instances pull from the same configuration
file?  If so you will need one to have a slightly different
configuration so that it can be the pool's CM.
<<<


I did not make this clear in my previous message,
but we do have a CM for this pool. If we remove the
workaround, leave the service as installed, and start
the Condor service manually, all machines in the pool run
the jobs submitted from this machine.
The DAEMON_LIST on all executor machines is
MASTER SCHEDD STARTD.
With the batch file defined as the service, only this
machine runs the jobs submitted on it. Communication with
the CM still works; there are entries for this PC in the
CollectorLog.

Our .bat is as follows:

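@echo off
setlocal
rem Default delay (ping count) before starting the master
set THINKING_TIME=15
rem An optional first argument overrides the default delay
if not "%~1"=="" set "THINKING_TIME=%~1"
rem ping -n N pauses for roughly N-1 seconds
ping -n %THINKING_TIME% 127.0.0.1 >NUL 2>&1
c:\condor\bin\condor_master.exe
endlocal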

Any suggestions for fixing this behavior? Are there command-line
parameters we should pass, or a different way to run
condor_master.exe from the batch file?

>>>>>>
Is there any other way to delay the Condor service start
until the shared filesystems are available?
<<<<<<

>>>
Sure, there are a few ways to do this.  If the shared fs
relies on some service running, you could make the Condor
service list it as a dependency (i.e. the other service
must start before Condor will).  

<<<

The shared filesystem does not depend on a local service
to start.
It is a Samba service and depends only on the network being
available and the machine being authenticated on the network.



>>>
Or make sure that when
the Condor service fails to start, that it waits for some
period of time.  

<<<

I will try this option. I will let you know if it
satisfies the end user's needs.



>>>
Or, as you have tried with the batch file,
you could run a loop in the batch file and periodically
check for the existence of the configuration files.  Once
it finds it, then start the master running.

<<<

This option does start the service, but the jobs then run
only on the submit machine, not on the whole pool, as
mentioned above. The script I am trying is pasted above.
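
For reference, the "loop and check for the configuration file" idea
can be sketched as below. This is only an illustration, written in
Python rather than batch; the function name wait_for_file, the UNC
path, and the timeout and interval values are assumptions, not
anything taken from our setup. It only covers waiting for the file
and does not address why jobs stay local when the master is started
from a wrapper.

```python
import os
import subprocess
import sys
import time


def wait_for_file(path, timeout=120.0, interval=5.0):
    """Poll until `path` exists; return True if found, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical UNC path to the shared configuration file; adjust for your site.
    config = r"\\server\share\condor\condor_config"
    if not wait_for_file(config):
        # Exit with a non-zero status so the failure is visible to the caller.
        sys.exit("configuration file not found: " + config)
    # Same executable path as in the batch file quoted earlier in this message.
    sys.exit(subprocess.call([r"c:\condor\bin\condor_master.exe"]))
```

Unlike a fixed 15-second sleep, this gives up after a bounded time
and returns a non-zero exit status, which could be combined with the
service recovery settings.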


Regards, Klaus




Ben Burnett <burnett@xxxxxxxxxxx>
Sent by: condor-users-bounces@xxxxxxxxxxx

16/03/2009 20:41

Please respond to
Condor-Users Mail List <condor-users@xxxxxxxxxxx>

To
"'Condor-Users Mail List'" <condor-users@xxxxxxxxxxx>
cc
Subject
Re: [Condor-users] condor_master.exe does not start on windows machine reboot





Hi Klaus:

>>>
On some machines, when the Condor service tries to start
the condor_master daemon, it fails to find the
configuration files in the shared filesystem.
<<<

Do you know for sure if the files are available to the
master when it goes to fetch them?  That is, is the
shared fs available when the Condor service starts
or does it depend on another service to be started
locally?

>>>
I have renamed the service command name to run a
condor_master.bat that waits for 15 secs and then
runs condor_master.exe.
<<<

Is it always the case that 15 seconds is enough time
for the shared fs to become available?  You could
also have the service control manager restart the
service after the first failure (via the recovery tab
in the services management console extension).  

>>>
The Condor service starts, but the jobs submitted from
this machine run only locally and are not distributed
to the other pool machines.
Could someone help me with this?
<<<

You need a central manager to start distributing jobs.
Do all Condor instances pull from the same configuration
file?  If so you will need one to have a slightly different
configuration so that it can be the pool's CM.

>>>
Is there any other way to delay the Condor service start
until the shared filesystems are available?
<<<

Sure, there are a few ways to do this.  If the shared fs
relies on some service running, you could make the Condor
service list it as a dependency (i.e. the other service
must start before Condor will).  Or make sure that when
the Condor service fails to start, that it waits for some
period of time.  Or, as you have tried with the batch file,
you could run a loop in the batch file and periodically
check for the existence of the configuration files.  Once
it finds it, then start the master running.

Hope some of that helps.

Regards,
-B




