Re: [HTCondor-users] First experience with the parallel universe.



Amy,

Are you using partitionable slots on these execute nodes? If so, and
you're able to upgrade to 8.6.5, a recent bugfix may have taken care
of this.

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6308
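
If it helps to compare the retry count before and after an upgrade, here is a minimal Python sketch (not an HTCondor tool) that counts "Shadow exception!" events (code 007) per job in a user log. It assumes the event header layout shown in the excerpt quoted below; the function name, the regular expression, and the sample text are illustrative only.

```python
import re
from collections import Counter

# Event header layout as it appears in the quoted excerpt, e.g.
# "007 (066.000.000) 09/10 22:40:48 Shadow exception!"
# (assumption: other HTCondor versions may format the date differently).
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def count_shadow_exceptions(log_text):
    """Return a Counter mapping 'cluster.proc' to its number of 007 events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if match and match.group(1) == "007":
            job_id = f"{int(match.group(2))}.{int(match.group(3))}"
            counts[job_id] += 1
    return counts


# Hypothetical sample built from the lines in the quoted excerpt.
SAMPLE_LOG = """\
001 (066.000.000) 09/10 22:40:11 Job executing on host: MPI_job
...
007 (066.000.000) 09/10 22:40:48 Shadow exception!
     Assertion ERROR on (nextResourceToStart == numNodes)
     0  -  Run Bytes Sent By Job
     0  -  Run Bytes Received By Job
...
"""

if __name__ == "__main__":
    for job_id, n in count_shadow_exceptions(SAMPLE_LOG).items():
        print(f"job {job_id}: {n} shadow exception(s)")
```

To use it on a real log, read the user's log file into a string and pass it to `count_shadow_exceptions`.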

Jason Patton

On Mon, Sep 11, 2017 at 4:13 PM, Amy Bush <amy@xxxxxxxxxxxxx> wrote:
> We're running a fairly long-running instantiation of htcondor here, but
> only just recently has one of my users decided to try out the parallel
> universe. It isn't going flawlessly, and I thought maybe someone has
> seen this before and might be able to help. Hopefully I'm just doing or
> overlooking something dumb and obvious.
>
> condor 8.6.4, runs flawlessly normally.
>
> parallel job is submitted, excerpt from user's log file:
>
> 001 (066.000.000) 09/10 22:40:11 Job executing on host: MPI_job
> ...
> 008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 6 of 13
> ...
> 008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 7 of 13
> ...
> (snip)
> ...
> 008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 6
> ...
> 008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 7
> ...
> (snip)
> ...
> 008 (066.000.000) 09/10 22:40:11 All 12 followers found, Starting Orc leader
> ...
> 007 (066.000.000) 09/10 22:40:48 Shadow exception!
>      Assertion ERROR on (nextResourceToStart == numNodes)
>      0  -  Run Bytes Sent By Job
>      0  -  Run Bytes Received By Job
>
> It will then retry. And retry. And retry. And then run successfully.
> Evidently it retried 70 times last night before it was ultimately
> successful. On the same machine it had failed on up until then.
>
> Looking in the StarterLogs for that host:
>
> 09/11/17 15:30:44 (pid:388) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <(dedicated hostname redacted):46369>.
> 09/11/17 15:30:44 (pid:388) IO: Failed to read packet header
> 09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 104
> 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 21 bytes to <(dedicated hostname redacted):46369>, fd is 9
> 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> 09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 0
> 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 152 bytes to <(dedicated hostname redacted):46369>, fd is 9
> 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> 09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
> 09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 182 bytes to <(dedicated hostname redacted):46369>, fd is 9
> 09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
> 09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
> 09/11/17 15:30:48 (pid:588) ******************************************************
> 09/11/17 15:30:48 (pid:588) ** condor_starter (CONDOR_STARTER) STARTING UP
>
>
> As I said, we have no experience with the parallel universe, so I'm not sure what direction to explore. We have a dedicated submit node for it, several dedicated hosts to run the jobs, and it will successfully run tiny test cases. (And there are many, many available nodes.)
>
> Google hasn't been able to help me so far, so I turn to you. You guys. Yous. Y'all. Any hints you might be able to provide would be deeply appreciated.