
Re: [Condor-users] IDLE then RUN then IDLE for nothing



On Fri June 25 2004 7:11 am, Jérôme Jaglale wrote:
> Hello,
>
> 	When I submit some condor jobs, they begin to run (ST=R) for a few
> seconds. Then they return to idle (ST=I) without any results. I
> examined the logs:
>
>
> The job log on the submit machine :
> 000 (001.000.000) 06/25 11:59:41 Job submitted from host:
> <192.168.1.1:54151>
> ...
> 007 (001.000.000) 06/25 11:59:58 Shadow exception!
> 	Can no longer talk to condor_starter on execute machine (192.168.1.23)
> 	0  -  Run Bytes Sent By Job
> 	0  -  Run Bytes Received By Job
>
>
> The StartLog on the execute machine :
> 6/25 11:59:19 Starter pid 25840 exited with status 4
>
>
> The StarterLog.vm2 on the execute machine :
> 6/25 11:59:13 ******************************************************
> 6/25 11:59:13 ** condor_starter (CONDOR_STARTER) STARTING UP
> 6/25 11:59:13 ** $CondorVersion: 6.6.5 May  3 2004 $
> 6/25 11:59:13 ** $CondorPlatform: PPC-DARWIN-6_8 $
> 6/25 11:59:13 ** PID = 25840
> 6/25 11:59:13 ******************************************************
> 6/25 11:59:13 Using config file:
> /Users/condor/Programmes/condor-6.6.5/etc/condor_config
> 6/25 11:59:13 Using local config files:
> /Users/condor/Programmes/condor-6.6.5/local.cluster13/
> condor_config.local
> 6/25 11:59:13 DaemonCore: Command Socket at <192.168.1.23:55008>
> 6/25 11:59:13 Setting resource limits not implemented!
> 6/25 11:59:13 Starter communicating with condor_shadow
> <192.168.1.1:54937>
> 6/25 11:59:13 Submitting machine is "(null)"
> 6/25 11:59:13 ERROR "Assertion ERROR on (shadow->name())" at line 984
> in file jic_shadow.C
> 6/25 11:59:13 ShutdownFast all jobs.

Notice the 'Submitting machine is "(null)"' line?  That, to me, is the smoking
gun; the assertion failure on shadow->name() right after it is consistent with
the starter never getting a submit machine name.  I can't offer an explanation
as to what's causing it, but it's certainly where I'd start digging.  Almost
certainly something is wrong with your schedd or its configuration.
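
If you have several execute machines to check, a small script can pull the
telltale lines out of each StarterLog.  This is only a rough sketch: the
patterns below are my own guess at what is worth flagging (taken from the log
excerpts in this thread), not an official list of Condor error messages, and
the function name is made up for illustration.

```python
import re

# Assumption: these patterns are simply the suspicious strings seen in this
# thread's logs; they are not an exhaustive or official set.
SUSPECT = re.compile(
    r'Submitting machine is "\(null\)"'   # starter has no submit machine name
    r'|ERROR "'                           # assertion failures and other errors
    r'|exited with status [1-9]'          # non-zero starter exit in StartLog
)

def suspicious_lines(log_text):
    """Return the lines of a log excerpt that match one of the patterns above."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]
```

Feed it the contents of StarterLog.vm2 (or StartLog) read as text; any line it
returns is a place to start looking.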

> It's strange, I don't understand what really happens. Could you help ?
> The computers are running with Mac OS X.

Just a thought: are the host name and IP address properly set up?  We've
had a number of issues with /etc/hosts on Mac OS X not being set up properly,
or something along those lines.
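
One quick way to test that on each machine is to check that the host name
resolves to a non-loopback address and that the address maps back to the same
name.  The sketch below does that with Python's standard socket module.  It is
an illustration of the kind of check I mean, not something Condor itself runs;
the function name and the two specific checks are my own choice.

```python
import socket

def check_host_setup(hostname=None,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Return a list of problems found with the host name / IP setup.

    forward and reverse are the resolver functions to use; they default to
    the real ones and are parameters only so the check can be tried offline.
    """
    name = hostname or socket.gethostname()
    try:
        ip = forward(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"cannot resolve host name {name!r}: {exc}"]

    problems = []
    if ip.startswith("127."):
        problems.append(f"{name!r} resolves to loopback address {ip}")

    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"no reverse lookup for {ip}: {exc}")
    else:
        # Compare short names so "host" and "host.example.org" count as equal.
        short = name.split(".")[0].lower()
        known = {n.split(".")[0].lower() for n in [primary, *aliases]}
        if short not in known:
            problems.append(f"{ip} maps back to {primary!r}, not {name!r}")
    return problems
```

Running check_host_setup() with no arguments on both the submit and the
execute machine should return an empty list if names and addresses agree;
anything it reports is worth comparing against /etc/hosts.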

> Thanks,
> Jérôme

Hope this helps

-Nick

-- 
           <<< The answer is out there, Neo. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences