
[condor-users] file transfer failure with MPI jobs

Hello all,

MPI jobs with a high machine_count setting and several transfer_input_files fail
to run, yielding the error: "Can no longer talk to condor_starter on execute
machine".

I'm using a Win32 pool running Condor 6.6.5.  I can successfully run MPI jobs
in the pool if they contain one or two transfer_input_files and machine_count
is low (say, 5).  However, when I submit an MPI job to 15 nodes with 2 input
files (they can be as small as 1 byte), I receive the following errors in my logs:

[ShadowLog of central manager running all daemons (collector, negotiator, ...)]:
3/16 18:19:16 (20.0) (1292): condor_read(): recv() returned -1, errno = 10054,
assuming failure.
3/16 18:19:16 (20.0) (1292): ERROR "Can no longer talk to condor_starter on
execute machine (" at line 63 in file

[StarterLog of execute node, ip=]:
3/16 18:12:43 entering FileTransfer::Init
3/16 18:12:43 entering FileTransfer::SimpleInit
3/16 18:12:43 TransferIntermediate=""
3/16 18:12:43 entering FileTransfer::DownloadFiles
3/16 18:12:43 Can't connect to <>:0, errno = 10061
3/16 18:12:43 Will keep trying for 10 seconds...
3/16 18:12:53 Connect failed for 10 seconds; returning FALSE
3/16 18:12:53 ERROR "Unable to connect to server <>
" at line 559 in file ..\src\condor_c++_util\file_transfer.C
3/16 18:12:53 ShutdownFast all jobs.
3/16 18:12:53 Got ShutdownFast when no jobs running.

The error condition occurs unpredictably.  Condor tries and retries to acquire
the set of 15 MPI nodes that my job requested, and the job does eventually run,
but the higher the machine_count, the longer that seems to take.  I can't
predict which execute nodes the failure will occur on, either.  I once saw it
occur on the central manager itself (I allow the central manager to run MPI
jobs).  This leads me to believe that it's not a network issue but possibly a
bug in the file transfer mechanism.  The problem has also existed in previous
versions of Condor that I have used.

The size of the input files does not seem to be a major factor; I ran trials
with 1 MB files and 1-byte files.  The relevant factors seem to be the value
of machine_count and the number of files in transfer_input_files.  I have
seen the same execute nodes both succeed and fail on the same MPI job.  There
are no firewalls in this setup.
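For reference, a submit description along these lines reproduces the problem for me (the executable and input file names below are placeholders, not my actual files):

```
universe              = MPI
executable            = my_mpi_app.exe
machine_count         = 15
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files  = input1.dat,input2.dat
log                   = mpi.log
output                = out.$(NODE)
error                 = err.$(NODE)
queue
```

With machine_count = 5 and a single entry in transfer_input_files, the same description runs reliably.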

Are more logs needed to clarify anything?  Let me know.  Have an awesome day,
and thank you.  :-)


< NPACI Education Center on Computational Science and Engineering >
< http://www.edcenter.sdsu.edu/>

"A friend is someone who knows the song in your heart and can sing it back to you when you have forgotten the words."  -Unknown Author 
