
[Condor-users] Problem with MSYS command started from condor_starter

I have been using Condor as a user for some time. Recently I started using MSYS (http://www.mingw.org/wiki/msys), calling MSYS commands from some scripts.
We have 8 condor_starter processes running concurrently on an 8-core Windows XP machine.

To reproduce the issue, I created a script that just copies some files and then removes them using MSYS commands.
I submitted 10000 jobs running this script (with the MSYS calls) to the Condor farm. After the jobs have been running concurrently for some time (1 or 2 hours), one MSYS command (e.g. cp.exe) hangs. I attached the MinGW gdb to the hung command and got the following call stack.

(gdb) where
#0  0x7d61002e in strchr () from C:\WINDOWS\system32\ntdll.dll
#1  0x7d666ea1 in ntdll!RtlCopyUnicodeString ()
   from C:\WINDOWS\system32\ntdll.dll
#2  0x00000000 in ?? ()

While the hung process is present, subsequent jobs that call MSYS commands fail mysteriously with the following error.

    cp: cannot stat `s:/data/regtestfiles/main/current/regtest/infrastructure/regutils/reg_run/reg_copy_files//prev/reg.rout': No such file or directory

The file does actually exist. If I kill the hung process, subsequent jobs go back to normal.
I checked in Process Explorer: the hung process and my script are both child processes of condor_master->condor_starter.

By the way, it works fine on Windows 7.
I also tried running the same workload from 8 cmd prompts (not from condor_starter), and everything ran fine.
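
A standalone comparison of this kind could be sketched roughly as follows. This is a minimal Python sketch, not the original script: the function names, the 60-second timeout, and the use of threads to stand in for the 8 concurrent cmd prompts are all assumptions made for illustration. It only assumes that `cp` and `rm` (the MSYS cp.exe and rm.exe in this setup) are on the PATH.

```python
import os
import subprocess
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def copy_and_remove(src, work_dir, cp_cmd="cp", rm_cmd="rm", timeout_s=60):
    """Copy src to a unique file in work_dir, then remove the copy.

    Returns "ok", "failed" (non-zero exit code) or "hung" (timeout hit).
    The command names and the timeout are illustrative assumptions.
    """
    fd, dst = tempfile.mkstemp(dir=work_dir)  # unique destination per call
    os.close(fd)
    try:
        for argv in ([cp_cmd, src, dst], [rm_cmd, dst]):
            # subprocess.run kills the child if the timeout expires
            if subprocess.run(argv, timeout=timeout_s).returncode != 0:
                return "failed"
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok"


def stress(src, work_dir, workers=8, iterations=100, **kwargs):
    """Run copy_and_remove concurrently and tally the outcomes."""
    def worker(_):
        return Counter(
            copy_and_remove(src, work_dir, **kwargs) for _ in range(iterations)
        )

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(worker, range(workers)), Counter())
```

Calling `stress(path_to_file, scratch_dir)` returns a tally such as the number of "ok", "failed" and "hung" iterations, which makes a hang visible as a timeout rather than a stuck process. Running the same script both as a Condor job and from a plain cmd prompt would be one way to narrow down whether condor_starter is involved.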

Thanks in advance for your replies and insight.

Mun Soon