
[Condor-users] sched crash on dag removal on windows



Hi,

I sent the following mail to condor-admin, but apart from the automatic response (case #13153)
I have received no reply for quite a few days.
Since it is still a major problem for me, I am posting it to the list as well; maybe someone
knows how to avoid the problem.

Cheers,
Szabolcs

--


I tried 6.7.14 to test whether my dag problems had been fixed, but had no luck.
When I removed a dag job from the queue, all of its child jobs were left in the queue
and I received the usual scheduler crash message.
(I kept quite a few crash messages from the past and found a few with similar log-file errors,
so the log path issue is not specific to 6.7.14.)

---

"C:\Condor/bin/condor_schedd.exe" on "snoopy.digicpictures.local" died due to exception ACCESS_VIOLATION.

Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
1/5 11:29:19 IO: Failed to read packet header
1/5 11:29:19 IO: Failed to read packet header
1/5 11:29:20 IO: Failed to read packet header
1/5 11:29:24 DaemonCore: Command received via TCP from host <192.168.0.71:3595>
1/5 11:29:24 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:24 UserLog::initialize: fopen("X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log") failed - errno 22 (Invalid argument)


1/5 11:29:24 WARNING: Invalid user log file specified: X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log

1/5 11:29:26 IO: Failed to read packet header
1/5 11:29:30 IO: Failed to read packet header
1/5 11:29:38 IO: Failed to read packet header
1/5 11:29:39 IO: Failed to read packet header
1/5 11:29:44 IO: Failed to read packet header
1/5 11:29:44 IO: Failed to read packet header
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3602>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:46 IO: Failed to read packet header
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3604>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3605>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
*** End of file SchedLog

*** Last entry in core file core.SCHEDD.WIN32

==============================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  0040A8D6 01:000098D6 C:\Condor\bin\condor_schedd.exe

Registers:
EAX:00923A14
EBX:00000000
ECX:00019AE4
EDX:00000002
ESI:0000000E
EDI:00019AE4
CS:EIP:001B:0040A8D6
SS:ESP:0023:001292C4  EBP:001292C8
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010206

Call stack:
Address   Frame
0040A8D6  001292C8  DestroyProc+1EB
0040A85C  001293F0  DestroyProc+171
004143D0  00129658  Scheduler::actOnJobs+C28
0047599C  0012C0E0  DaemonCore::HandleReq+15E5
0047439A  0012D108  DaemonCore::ServiceCommandSocket+D5
00414425  0012D368  Scheduler::actOnJobs+C7D
0047599C  0012FDF0  DaemonCore::HandleReq+15E5
004741DD  0012FE34  DaemonCore::Driver+977
0047BDB4  0012FF68  dc_main+A4C
0047BEC3  0012FF80  main+CE
004A1A5E  00000001  mainCRTStartup+C5

*** End of file core.SCHEDD.WIN32

---



This part of the message looks suspicious: it appears to be a bad concatenation of the same root dir with itself,
mixing forward slashes and backslashes:
X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log
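The doubled path above can be reproduced with a small Python sketch. This is only a guess at the kind of logic that would produce it, not the actual schedd code: I assume the job's initial working directory is X:\temp\CondorJobs\1136456862, and that some check treats the forward-slash log path as relative because it only recognizes backslash-style absolute paths. The function names are hypothetical.

```python
import ntpath  # Windows path rules; importable on any platform

# Assumption: the job's initial working directory (taken from the first half
# of the bad path in SchedLog).
IWD = r"X:\temp\CondorJobs\1136456862"
# The "log" value from the submit file, written with forward slashes.
LOG = "x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log"


def naive_full_path(iwd, path):
    # Hypothetical buggy check: only a leading backslash or a drive letter
    # followed by a backslash is considered absolute.
    looks_absolute = path.startswith("\\") or path[1:3] == ":\\"
    return path if looks_absolute else iwd + "/" + path


def careful_full_path(iwd, path):
    # ntpath.join accepts both slash styles and discards iwd when the second
    # path is already absolute; normpath then converts "/" to "\".
    return ntpath.normpath(ntpath.join(iwd, path))


print(naive_full_path(IWD, LOG))    # same string as in the fopen() error
print(careful_full_path(IWD, LOG))  # a single, backslash-only path
```

If this guess is right, writing the log path in the submit file with backslashes might sidestep the bad concatenation, but I have not verified that.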


This is the submit file of the dag job:

# Filename: x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.condor.sub
# Generated by condor_submit_dag x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag 
universe	= scheduler
executable	= C:\Condor\bin\condor_dagman.exe
getenv		= True
output		= x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lib.out
error		= x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lib.out
log		= x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log
remove_kill_sig	= SIGUSR1
on_exit_remove	= (ExitBySignal == false || ExitSignal =!= 9)
arguments	= -f -l . -Debug 3 -Lockfile x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lock -Condorlog X:\temp\CondorJobs\1136456862\logs/Job.log -Dag x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag -Rescue x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.rescue
environment	=_CONDOR_DAGMAN_LOG=x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.out|_CONDOR_MAX_DAGMAN_LOG=0
queue



I hope this helps in tracking down these problems, since handling dag jobs on Windows is a major pain.

Cheers,
Szabolcs