
[HTCondor-users] First experience with the parallel universe.



We have a fairly long-running HTCondor installation here, but only
recently has one of my users decided to try out the parallel universe.
It isn't going flawlessly, and I thought someone might have seen this
before and be able to help. Hopefully I'm just doing or overlooking
something dumb and obvious.

HTCondor 8.6.4, which otherwise runs flawlessly.

A parallel job is submitted; here is an excerpt from the user's log file:

001 (066.000.000) 09/10 22:40:11 Job executing on host: MPI_job
...
008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 6 of 13
...
008 (066.000.000) 09/10 22:40:11 Greetings and felicitations from node 7 of 13
...
(snip)
...
008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 6
...
008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 7
...
(snip)
...
008 (066.000.000) 09/10 22:40:11 All 12 followers found, Starting Orc leader
...
007 (066.000.000) 09/10 22:40:48 Shadow exception!
     Assertion ERROR on (nextResourceToStart == numNodes)
     0  -  Run Bytes Sent By Job
     0  -  Run Bytes Received By Job

It will then retry. And retry. And retry. And then run successfully.
Evidently it retried 70 times last night before it finally succeeded,
on the same machine it had been failing on up until then.
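
For anyone who wants to count the retries in their own user log rather than eyeball them, here is a minimal Python sketch. It assumes event headers look like the ones in the excerpt above (three-digit event code, job id in parentheses, then a timestamp); the names count_events and SAMPLE_LOG are only illustrative and are not part of HTCondor.

```python
import re
from collections import Counter

# Header shape assumed from the excerpt above, e.g.
# "007 (066.000.000) 09/10 22:40:48 Shadow exception!"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) \d\d/\d\d \d\d:\d\d:\d\d "
)

def count_events(log_text):
    """Count events per (event code, "cluster.proc") pair in a user log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match:
            code = match.group(1)
            job = f"{int(match.group(2))}.{int(match.group(3))}"
            counts[(code, job)] += 1
    return counts

# A few lines copied from the excerpt above, used as sample input.
SAMPLE_LOG = (
    "001 (066.000.000) 09/10 22:40:11 Job executing on host: MPI_job\n"
    "...\n"
    "008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 6\n"
    "...\n"
    "008 (066.000.000) 09/10 22:40:11 Starting Orc follower node 7\n"
    "...\n"
    "007 (066.000.000) 09/10 22:40:48 Shadow exception!\n"
    "     Assertion ERROR on (nextResourceToStart == numNodes)\n"
    "     0  -  Run Bytes Sent By Job\n"
    "     0  -  Run Bytes Received By Job\n"
)

counts = count_events(SAMPLE_LOG)
# Event code 007 is the shadow exception shown in the excerpt.
print("shadow exceptions for job 66.0:", counts[("007", "66.0")])
```

To use it on a real log, read the file and pass its contents to count_events. The indented detail lines under each event are skipped because they do not start with an event code.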

Looking in the StarterLogs for that host:

09/11/17 15:30:44 (pid:388) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <(dedicated hostname redacted):46369>.
09/11/17 15:30:44 (pid:388) IO: Failed to read packet header
09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 104
09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 21 bytes to <(dedicated hostname redacted):46369>, fd is 9
09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
09/11/17 15:30:44 (pid:388) i/o error result is 0, errno is 0
09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 152 bytes to <(dedicated hostname redacted):46369>, fd is 9
09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
09/11/17 15:30:44 (pid:388) condor_write(): Socket closed when trying to write 182 bytes to <(dedicated hostname redacted):46369>, fd is 9
09/11/17 15:30:44 (pid:388) Buf::write(): condor_write() failed
09/11/17 15:30:44 (pid:388) ERROR "Assertion ERROR on (result)" at line 902 in file /slots/03/dir_36001/sources/src/condor_starter.V6.1/NTsenders.cpp
09/11/17 15:30:48 (pid:588) ******************************************************
09/11/17 15:30:48 (pid:588) ** condor_starter (CONDOR_STARTER) STARTING UP
09/11/17 15:30:48 (pid:588) ** condor_starter (CONDOR_STARTER) STARTING UP


As I said, we have no experience with the parallel universe, so I'm not sure which direction to explore. We have a dedicated submit node for it and several dedicated hosts to run the jobs, and it successfully runs tiny test cases. (And there are many, many available nodes.)

Google hasn't been able to help me so far, so I turn to you. You guys. Yous. Y'all. Any hints you can provide would be deeply appreciated.