
Re: [condor-users] condor doesn't perceive that job is done



On Fri, Aug 08, 2003 at 10:52:31AM +0200, Anika Boehm wrote:
> Hi,
> 
> the program I submit to condor starts up fine and terminates
> successfully, but condor_q still shows the program as running, although there is no job
> at all anymore on the executing machine. This program is a binary which
> starts a wrapper script which in turn 'exec's some binary, as far as I could
> figure out. The PID of this first binary is logged in the StarterLog file of the
> executing machine and this binary itself runs as long as the second binary
> runs. Both condor_starter and condor_shadow continue running after my job
> terminates until the condor_starter daemon eventually dies and a shadow exception
> is reported.
> 
> Does anyone know how condor knows that a job is done? What does condor look
> at or what is it waiting for?

The condor_starter is waiting for the first process to exit. We're specifically 
watching for SIGCHLD, and we've got an added bonus of periodically walking 
through /proc and building a process tree rooted at the pid of the job.
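
As a rough illustration of "waiting for the first process to exit" (this is
not the condor_starter code, just a minimal Python sketch), a parent only
learns that a job is done when the process it started itself exits. The
helper name run_and_wait is hypothetical. The second call assumes a POSIX
system, where exec replaces the program image but keeps the same pid, so the
parent still sees the exit of the exec'ed program:

```python
import subprocess
import sys


def run_and_wait(argv):
    """Start argv as a child and block until that first process exits.

    Returns (pid, exit_status) of the direct child only; any other
    processes the child may have spawned are not tracked here.
    """
    child = subprocess.Popen(argv)
    status = child.wait()  # returns once the direct child has been reaped
    return child.pid, status


if __name__ == "__main__":
    # A child that simply exits with a status code.
    plain = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_wait(plain))

    # A child that exec's another program (POSIX assumption: the pid is
    # unchanged, so waiting on the original pid still works).
    execing = [
        sys.executable,
        "-c",
        "import os, sys; "
        "os.execv(sys.executable, [sys.executable, '-c', 'import sys; sys.exit(5)'])",
    ]
    print(run_and_wait(execing))
```

If the first process never exits, a wait like this never returns, no matter
what exit code the later binaries produce.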

> It seems to me that the condor daemon is
> waiting for some signal or whatever else that my program doesn't send 

Your program shouldn't have to do anything except exit.

> or that never
> reaches the daemon for whatever reason.
> 
> The binary actually does return an exit code, which I checked with a shell
> script. I also set this shell script as 'executable' in the submit file,
> started the binary from this script, and made the script explicitly return the exit
> code when the binaries are done. But this didn't change condor's behaviour at
> all. However when I start 'uname -a' instead of the binary everything works
> correctly (I'm working on Solaris).
> 
> Any ideas why condor doesn't perceive that the job is done?
> 

In order to debug this, we'd like to see the StarterLog from the machine
where the job ran. 

Please make the following changes to your config file on the execute machine
as well:

MAX_STARTER_LOG = 640000
STARTER_DEBUG = D_FULLDEBUG D_PROCFAMILY D_PROC

And then run a job through.

Thanks,

-Erik