
Re: [HTCondor-users] Failed to send REQUEST_CLAIM to startd



On Wed, May 15, 2013 at 09:44:44AM -0500, Cody Belcher wrote:
> Thanks for the quick reply and awesomely clear explanation. I found
> that on helike Allow_Write had never been set. Fixed that, and bam.
> It works great. One more question, why did condor keep trying to
> send the job to helike not one of the other nodes?

Cody:

I am sorry, but that is a difficult question to answer without knowing
more about your pool.  Maybe your submit file caused helike to rank
highest, or it was the only resource able to run your job, or nobody
else was using it (obviously), and so the negotiator just kept matching
you with helike. The MatchLog and NegotiatorLog will have information
about the computations the negotiator made for the match.

See section 3.4.5 in the condor manual for details about the matching
algorithm.

http://research.cs.wisc.edu/htcondor/manual/v7.9/3_4User_Priorities.html#SECTION00445000000000000000

Nathan Panike

> 
> Thanks,
> 
> Cody Belcher
> 
> On 05/15/2013 09:29 AM, Nathan Panike wrote:
> >When the schedd said "hey, I want to run a job", the startd said, "You
> >cannot run here".  So the schedd says the match is bad and deletes it,
> >so that it can try to match the job the next time it talks to the
> >negotiator.
> >
> >You need to log into the helike.physics.tamu.edu machine, if possible,
> >and check the StartLog to see what the exact problem is on the execute
> >side. Otherwise, you will need to talk to the admin of that machine.
> >
> >Nathan Panike
> >
> >On Wed, May 15, 2013 at 09:13:14AM -0500, Cody Belcher wrote:
> >>Can someone explain to me what this means and how to fix it? I've
> >>been trying to figure out a reliable way to submit Mathematica jobs
> >>to condor so that I can write a how-to for my users, but the jobs
> >>stay in idle state. I believe this is the reason why.
> >>
> >>05/15/13 09:07:02 (pid:71980) Sent ad to central manager for
> >>codytrey@xxxxxxxxxxxxxxxx
> >>05/15/13 09:07:02 (pid:71980) Sent ad to 1 collectors for
> >>codytrey@xxxxxxxxxxxxxxxx
> >>05/15/13 09:07:02 (pid:71980) Response problem from startd when
> >>requesting claim slot1@xxxxxxxxxxxxxxxxxxxxxxx
> >><128.194.151.209:49154> for codytrey 20.0.
> >>05/15/13 09:07:02 (pid:71980) Failed to send REQUEST_CLAIM to startd
> >>slot1@xxxxxxxxxxxxxxxxxxxxxxx <128.194.151.209:49154> for codytrey:
> >>CEDAR:6004:failed reading from socket
> >>05/15/13 09:07:02 (pid:71980) Match record
> >>(slot1@xxxxxxxxxxxxxxxxxxxxxxx <128.194.151.209:49154> for codytrey,
> >>20.0) deleted
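
To see how often the schedd kept being matched to the same slot, one
option is to tally the "Failed to send REQUEST_CLAIM" lines from the
schedd log excerpt above. The following is only a rough sketch: the
function name count_failed_claims and the example.org host names are
made up for illustration, and it assumes each log entry sits on a single
line (the mail client wrapped the ones quoted above) and follows the
same layout as that excerpt.

```python
import re
from collections import Counter

# Pattern modelled on the log lines quoted in this thread. Assumption:
# in the real log each entry is on ONE line; the wrapping seen in the
# e-mail comes from the mail client.
FAILED_CLAIM = re.compile(
    r"Failed to send REQUEST_CLAIM to startd\s+(?P<slot>\S+)\s+<(?P<addr>[^>]+)>"
)


def count_failed_claims(lines):
    """Return a Counter mapping (slot, address) to failed claim attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED_CLAIM.search(line)
        if match:
            counts[(match.group("slot"), match.group("addr"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; the host names are placeholders because the
    # archive obscures the real ones.
    sample = [
        "05/15/13 09:07:02 (pid:71980) Sent ad to 1 collectors for "
        "codytrey@example.org",
        "05/15/13 09:07:02 (pid:71980) Failed to send REQUEST_CLAIM to startd "
        "slot1@helike.example.org <128.194.151.209:49154> for codytrey: "
        "CEDAR:6004:failed reading from socket",
    ]
    for (slot, addr), n in count_failed_claims(sample).most_common():
        print(f"{n:4d}  {slot}  {addr}")
```

In practice one would pass the open log file instead of the sample
list, e.g. count_failed_claims(open(path)), where path is wherever the
schedd log lives on the submit machine. A slot that dominates the tally
is the one the negotiator keeps handing back, which is the point to
start looking in the StartLog on that machine.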