
[HTCondor-users] schedd crash due to failed file transfers



We use a configuration with a central schedd, and all users submit to it using condor_submit -remote. The schedd runs on Windows 2008R2; the clients also run Windows (Win7). Sometimes a file transfer appears to die, especially when there is heavy network traffic on the client machine. The remote schedd then crashes and terminates itself. The relevant part of the schedd log (file names are obfuscated):

MoveFileEx(xxx,yyy) failed with error 32
08/20/13 12:52:14 (pid:1480) ERROR "FileTransfer CommitFiles Failed -- What Now?!?!" at line 2309 in file Z:\home\felixwolfheimer\drm-development\trunk\condor\condor-7.8.7\src\condor_utils\file_transfer.cpp
08/20/13 12:52:14 (pid:1480) DaemonCore: async_pipe is signalled, but async_pipe_signal is false.
08/20/13 12:52:14 (pid:1480) DaemonCore: async_pipe[0]0.bytes_available_to_read returned WSA Error 10093
08/20/13 12:52:14 (pid:1480) ERROR "Assertion ERROR on (already_been_here == false)" at line 329 in file z:\home\felixwolfheimer\drm-development\trunk\condor\condor-7.8.7\src\condor_utils\condor_threads.cpp
08/20/13 12:52:14 (pid:1480) Cron: Killing all jobs
08/20/13 12:52:14 (pid:1480) CronJobList: Deleting all jobs
08/20/13 12:52:14 (pid:1480) Cron: Killing all jobs
08/20/13 12:52:14 (pid:1480) CronJobList: Deleting all jobs
08/20/13 12:52:14 (pid:1480) DaemonCore::Wake_up_select called from an unknown thread. windows tid = 3340
08/20/13 12:52:14 (pid:1480) DaemonCore::Wake_up_select called from an unknown thread. windows tid = 3340
08/20/13 12:52:26 (pid:1296) Locale: English_United States.1252
08/20/13 12:52:26 (pid:1296) Setting maximum accepts per cycle 8.
08/20/13 12:52:26 (pid:1296)
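
For anyone who wants to pull the same information out of a longer schedd log, here is a minimal Python sketch. It assumes every timestamped line has the form "MM/DD/YY HH:MM:SS (pid:N) message", as in the excerpt above. The function name errors_by_pid is made up for illustration and is not part of HTCondor. It groups ERROR lines by pid, so a change of pid (here 1480 to 1296) shows where the schedd was restarted.

```python
import re

# Matches lines such as: 08/20/13 12:52:14 (pid:1480) Cron: Killing all jobs
LINE_RE = re.compile(
    r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\)\s*(.*)$'
)


def errors_by_pid(lines):
    """Return {pid: [(timestamp, message), ...]} for ERROR lines.

    Every pid seen in the log gets an entry, even if it logged no errors,
    so the order of the keys shows when a new schedd process took over.
    Lines without a timestamp and pid (e.g. the MoveFileEx line) are skipped.
    """
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        timestamp, pid, message = match.groups()
        errors = result.setdefault(int(pid), [])
        if message.startswith("ERROR"):
            errors.append((timestamp, message))
    return result


# Lines copied from the log excerpt above.
SAMPLE = r'''MoveFileEx(xxx,yyy) failed with error 32
08/20/13 12:52:14 (pid:1480) ERROR "FileTransfer CommitFiles Failed -- What Now?!?!" at line 2309 in file Z:\home\felixwolfheimer\drm-development\trunk\condor\condor-7.8.7\src\condor_utils\file_transfer.cpp
08/20/13 12:52:14 (pid:1480) ERROR "Assertion ERROR on (already_been_here == false)" at line 329 in file z:\home\felixwolfheimer\drm-development\trunk\condor\condor-7.8.7\src\condor_utils\condor_threads.cpp
08/20/13 12:52:14 (pid:1480) Cron: Killing all jobs
08/20/13 12:52:26 (pid:1296) Setting maximum accepts per cycle 8.
'''

if __name__ == "__main__":
    for pid, errors in errors_by_pid(SAMPLE.splitlines()).items():
        print(pid, len(errors))
```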

Has anyone seen this, and is there a workaround that decouples file transfer from the schedd? I found condor_transferd, which seems to be a way of doing this, but I could not find any documentation for it. Is this Condor component documented somewhere, and might it help in my case?

As a workaround, I'll probably change the schedd code so that it logs an error message instead of throwing an exception, but of course it would be nice to understand and fix the underlying problem.
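
To make the idea concrete, here is a rough Python sketch of the behaviour I have in mind. It is not HTCondor code (the real commit step is C++ in file_transfer.cpp), and commit_file and its parameters are names invented for this illustration. Windows error 32 is ERROR_SHARING_VIOLATION, meaning another process still has the file open. The sketch assumes the condition is transient, which I have not verified for the schedd: it retries the move a few times and, if that still fails, logs an error and returns False instead of raising.

```python
import logging
import os
import time

log = logging.getLogger("commit_sketch")

# Windows system error code: the file is in use by another process.
ERROR_SHARING_VIOLATION = 32


def commit_file(src, dst, attempts=5, delay=0.5,
                move=os.replace, sleep=time.sleep):
    """Move src to dst, retrying on a Windows sharing violation.

    Returns True on success. On failure it logs an error and returns
    False rather than raising, so the caller can fail one transfer
    without taking the whole process down.
    """
    for attempt in range(1, attempts + 1):
        try:
            move(src, dst)
            return True
        except OSError as exc:
            # OSError.winerror is only set on Windows; elsewhere it is absent.
            if getattr(exc, "winerror", None) != ERROR_SHARING_VIOLATION:
                log.error("move %s -> %s failed: %s", src, dst, exc)
                return False
            if attempt < attempts:
                log.warning("sharing violation on %s (attempt %d of %d)",
                            src, attempt, attempts)
                sleep(delay * attempt)  # back off a little longer each time
    log.error("move %s -> %s still blocked after %d attempts",
              src, dst, attempts)
    return False
```

The caller would then mark the affected job's transfer as failed when commit_file returns False, rather than aborting the daemon.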

Thanks for your help.