
Re: [Condor-users] standard universe jobs won't start but vanilla are OK



I think I'm making some headway with this, so thanks for the pointers.
I was using a cut-down Condor installation to save space on the VM, so
I've now done a full install, and at least now I'm getting a different
problem.

When I run a job under standard universe it still fails almost immediately
and gives this message in the user log:

007 (053.000.000) 11/25 11:22:43 Shadow exception!
        Unable to talk to job: disconnected

If I look in the startd log I can see this:


11/25/11 06:14:03 Job wants old RSC/Ckpt starter, skipping /usr/sbin/condor_starter
11/25/11 06:14:03 Sock::bindWithin - failed to bind any port within (9600 ~ 9609)
11/25/11 06:14:03 Failed to listen on TCP socket, because it is not bound to a port.
11/25/11 06:14:03 Sock::bindWithin - failed to bind any port within (9600 ~ 9609)
11/25/11 06:14:03 Failed to listen on TCP socket, because it is not bound to a port.
11/25/11 06:14:03 tcp_accept_timeout returns -1, errno=22
11/25/11 06:14:03 State change: received RELEASE_CLAIM command

so I suspect this is the culprit. I've forwarded ports 9600-9609 inclusive
from the VM, but I'm wondering if this is enough for the standard universe,
even though the manual suggests only 5 + 5*no_of_slots are needed. Looking
at netstat suggests that it is trying to use the entire forwarded port range.
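
As a rough sanity check on those numbers, here is a minimal Python sketch.
It is not part of Condor, and the function names ports_needed and
bindable_ports are made up for illustration. It applies the
5 + 5*no_of_slots rule of thumb quoted above and then tries to bind each
TCP port in the forwarded range to see which ones are free at that moment.
The single-slot figure in the example is an assumption, not something taken
from the actual configuration.

```python
import socket
from typing import List


def ports_needed(num_slots: int) -> int:
    """Rule of thumb quoted from the manual: 5 + 5 * number of slots."""
    return 5 + 5 * num_slots


def bindable_ports(low: int, high: int, host: str = "0.0.0.0") -> List[int]:
    """Return the TCP ports in [low, high] that can be bound right now."""
    free = []
    for port in range(low, high + 1):
        # Each probe socket is closed again as soon as the attempt finishes.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # Port is in use, reserved, or otherwise unavailable.
                continue
            free.append(port)
    return free


if __name__ == "__main__":
    low, high = 9600, 9609      # the forwarded range from the message
    slots = 1                   # assumption: a single-slot VM
    available = high - low + 1
    print(f"range size: {available}, rule of thumb wants: {ports_needed(slots)}")
    print(f"bindable right now: {bindable_ports(low, high)}")
```

With one slot the rule of thumb asks for 10 ports, which is exactly the size
of the 9600-9609 range, so there would be no headroom if anything else on
the machine is already holding a port in that range.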

Will go and RTFM again but if anyone has a quick answer that would be useful.
BTW I think the two error messages below are red herrings.

Cheers,

-ian.

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Cottam
Sent: 24 November 2011 11:20
To: condor-users@xxxxxxxxxxx
Subject: Re: [Condor-users] standard universe jobs won't start but vanilla are OK

On 24/11/2011 10:42, "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx> wrote:

>Job wants old RSC/Ckpt starter, skipping /usr/sbin/condor_starter
Could be your problem, but don't know how to fix, sorry.


>
>Is this relevant ? Also whenever I run the code I get this error:
>
>Condor: Notice: Remote system calls disabled.
>
>Which sounds serious ???
You always get that if you just test the relinked code locally. The remote
procedure calls only work when it is running under Condor on a remote
machine.

-Ian






-- 
Ian Cottam
ext. 61851
IT Services for Research
Faculty of Engineering and Physical Sciences
The University of Manchester
"The only strategy that is guaranteed to fail is not taking risks." Mark
Zuckerberg



