Re: [Condor-users] Jobs remaining Idle



If there were firewall problems, then the jobs wouldn't start.
Check the logs to see whether the job actually got through from the 
submit node to the execute node.
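
For what it's worth, here is a rough Python sketch of both checks: pulling
the connection-failure lines out of a daemon log, and testing whether a TCP
port is reachable at all. The function names and the marker strings are my
own choices for illustration (the markers are copied from the CollectorLog
excerpt quoted below); nothing here is part of Condor itself. For reference,
errno 10060 on Windows is WSAETIMEDOUT, i.e. the connection attempt timed out.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False


def find_connect_failures(log_lines):
    """Return the log lines that look like failed connection attempts.

    The marker strings are assumptions taken from the log excerpt in this
    thread; other Condor versions may word these messages differently.
    """
    markers = ("Can't connect to", "Connect failed", "TCP connection to")
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in markers)
    ]
```

To use it, call can_connect() from each machine against the other one's
address and the port shown in the logs (9618 for the collector in the
excerpt below), and feed find_connect_failures() the lines of whichever
log file you are looking at.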

JK

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of rnayar@xxxxxxxx
> Sent: Monday, May 22, 2006 6:11 PM
> To: Condor-Users Mail List; Shaun J. O'Callaghan
> Subject: Re: [Condor-users] Jobs remaining Idle
> 
> 
> Shaun,
> 
> Hey buddy, I don't know if this will help you, but "failing to connect"
> could be the result of your firewall not being configured properly. I had
> a little run-in with this problem when I was getting Condor to work.
> 
> Cheers
> 
> Danny
> 
> Quoting "Shaun J. O'Callaghan" <Shaun.OCallaghan@xxxxxxxxxxxx>:
> 
> > Hi there,
> > 
> > Firstly, apologies if this has been dealt with in this list already.
> > I've searched through this list, and the docs, and don't seem to be able
> > to find an answer.
> > 
> > I'm running a test Condor pool at the moment.  I have a Windows XP
> > machine (the master server) and a Windows Server 2003 machine (the only
> > other machine in the pool).
> > 
> > I've written a test application, a 'hello world' app, in C just to
> > demonstrate that jobs actually get executed and run ok.  However, the
> > jobs are queued and then appear to run briefly before entering the
> > "Idle" state which is where they stay.  I submit the job from the
> > Windows Server 2003 machine to the pool.
> > 
> > The submit file is as follows:
> > 
> > ---
> > executable = 	condortestapp.exe
> > universe =	vanilla
> > Requirements = (OpSys == "WINNT50") || (OpSys == "WINNT51") || (OpSys == "WINNT52")
> > error = 	error.output
> > output =	out.output
> > 
> > queue
> > 
> > ---
> > 
> > Negotiator.log has the following line:
> > 
> > 5/22 16:42:32 DC_AUTHENTICATE: attempt to open invalid session
> > GEOG41:2204:1148048993:2, failing.
> > 
> > ---
> > 
> > CollectorLog.log has the following:
> > 
> > 5/22 16:42:08 (Sent 7 ads in response to query)
> > 5/22 16:42:08 Got QUERY_STARTD_PVT_ADS
> > 5/22 16:42:08 (Sent 2 ads in response to query)
> > 5/22 16:42:32 SubmittorAd  : Inserting ** "<
> > Administrator@xxxxxxxxxxxxxxxxxx , xxx.xxx.xxx.xxx >"
> > 5/22 16:42:32 stats: Inserting new hashent for
> > 'Submittor':'Administrator@xxxxxxxxxxxxxxxxxx:' xxx.xxx.xxx.xxx'
> > 5/22 16:42:49 Got QUERY_SCHEDD_ADS
> > 5/22 16:42:49 (Sent 1 ads in response to query)
> > 5/22 16:46:44 Housekeeper:  Ready to clean old ads
> > 5/22 16:46:44 	Cleaning StartdAds ...
> > 5/22 16:46:44 	Cleaning StartdPrivateAds ...
> > 5/22 16:46:44 	Cleaning ScheddAds ...
> > 5/22 16:46:44 	Cleaning SubmittorAds ...
> > 5/22 16:46:44 	Cleaning LicenseAds ...
> > 5/22 16:46:44 	Cleaning MasterAds ...
> > 5/22 16:46:44 	Cleaning CkptServerAds ...
> > 5/22 16:46:44 	Cleaning CollectorAds ...
> > 5/22 16:46:44 	Cleaning StorageAds ...
> > 5/22 16:46:44 Housekeeper:  Done cleaning
> > 5/22 16:46:48 Can't connect to < xxx.xxx.xxx.xxx:9618>:0, errno = 10060
> > 5/22 16:46:48 Will keep trying for 10 seconds...
> > 5/22 16:46:57 Connect failed for 10 seconds; returning FALSE
> > 5/22 16:46:57 ERROR:
> > SECMAN:2003:TCP connection to <xxx.xxx.xxx.xxx:9618> failed
> > 
> > 5/22 16:46:57 Can't send UPDATE_COLLECTOR_AD to collector
> > (condor.cs.wisc.edu): Failed to send UDP update command to collector
> > 5/22 16:47:09 (Sent 8 ads in response to query)
> > 5/22 16:47:09 Got QUERY_STARTD_PVT_ADS
> > 5/22 16:47:09 (Sent 2 ads in response to query)
> > 
> > 
> > condor_q -analyze gives the following output on the Windows Server
> > 2003 machine:
> > 
> > 011.000:  Run analysis summary.  Of 2 machines,
> >       0 are rejected by your job's requirements
> >       0 reject your job because of their own requirements
> >       0 match, but are serving users with a better priority in the pool
> >       2 match, match, but reject the job for unknown reasons
> >       0 match, but will not currently preempt their existing job
> >       0 are available to run your job
> >         Last successful match: Mon May 22 16:47:10 2006
> > 
> > 1 jobs; 1 idle, 0 running, 0 held
> > 
> > ----
> > 
> > 
> > condor_q -global gives the following output on the Windows XP machine
> > (central server):
> > 
> > ---
> > 
> > -- Failed to fetch ads from: <xxx.xxx.xxx.xxx:12566> :
> > internaldomain.com (IP of Windows Server 2003)
> > 
> > 
> > 
> > If anybody can shed any light on why these jobs are remaining idle,
> > that'd be great.  I'm sure it's a pretty straightforward error, but I
> > just can't seem to put my finger on it.
> > 
> > Thanks in advance,
> > 
> > Shaun James O'Callaghan
> > 
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > 
> 
> 