[Condor-users] DAG not exiting on windows, condor 6.7.6



Hi,

I have a very basic DAG that executes correctly on
Windows machines with a central manager running on Linux
(all Condor 6.7.6); however, the DAG never exits.
It just sits there. In the logs, everything seems fine
(see below).

I also noticed that condor_rm on a DAG node on Windows
doesn't correctly kill all jobs of the DAG: it only
kills the main node and leaves the rest of the jobs
in the queue.

Has anybody else noticed this behaviour? I'm hoping that
I'm simply overlooking something.

Log files/DAG are below...

Thanks in advance,

- Filip

-- dag --

Job PROC_0 job.sub
VARS PROC_0 procNb="0"
Script POST PROC_0 C:\PROGRA~1\GdiExplorer\jre\bin\java.exe -cp C:\PROGRA~1\GdiExplorer\cache\gdi-queue\jars\libQueueStubs-impl-2.0-SNAPSHOT.jar;C:\PROGRA~1\GdiExplorer\cache\gdi-queue\jars\libQueueStubs-spi-2.0-SNAPSHOT.jar;C:\PROGRA~1\GdiExplorer\cache\avalon-framework\jars\avalon-framework-api-4.1.5.jar;C:\PROGRA~1\GdiExplorer\cache\avalon-framework\jars\avalon-framework-impl-4.1.5.jar com.cirque.gdi.queue.stubs.impl.CondorPostScript C:\Temp\queue\1bc58cf3ac10013d00aad8f99c5c750a\post.xconf 0 $RETURN xxxx.0.iff,xxxx.1.iff z:\xxx\iff ""

-- dag submit --

universe = scheduler
executable = C:\soft\condor\bin\condor_dagman.exe
getenv = True
output = condorlib.out
error = condorlib.out
log = job.log
remove_kill_sig = SIGUSR1
notification = NEVER
arguments = -f -l . -Debug 3 -Lockfile job.dag.lock -Dag job.dag -Rescue job.dag.rescue -Condorlog dummy_log
environment = _CONDOR_DAGMAN_LOG=dagman.out|_CONDOR_MAX_DAGMAN_LOG=0

+gdiJobId = "1bc58cf3ac10013d00aad8f99c5c750a"
+gdiJobTitle = "s3d_xxx net map"
+gdiOutputRegexpURL = "file://Z:/xxx/xxx_rgb_lin08_v0011x.{}.iff"
+gdiInputURL = "file://Z:/xxx/xxxx_v0009x_r0002.mb"
+gdiTemplate = "Maya 6.0 Render (Software)"

queue



-- dag log --

4/7 17:27:51 ******************************************************
4/7 17:27:51 ** condor_scheduniv_exec.12.0 (CONDOR_DAGMAN) STARTING UP
4/7 17:27:51 ** C:\soft\condor\bin\condor_dagman.exe
4/7 17:27:51 ** $CondorVersion: 6.7.5 Mar 22 2005 $
4/7 17:27:51 ** $CondorPlatform: INTEL-WINNT40 $
4/7 17:27:51 ** PID = 2580
4/7 17:27:51 ******************************************************
4/7 17:27:51 Using config file: c:\soft\condor\condor_config
4/7 17:27:51 Using local config files: c:\soft\condor/condor_config.local
4/7 17:27:51 DaemonCore: Command Socket at <172.16.1.61:2708>
4/7 17:27:51 argv[0] == "condor_scheduniv_exec.12.0"
4/7 17:27:51 argv[1] == "-Debug"
4/7 17:27:51 argv[2] == "3"
4/7 17:27:51 argv[3] == "-Lockfile"
4/7 17:27:51 argv[4] == "job.dag.lock"
4/7 17:27:51 argv[5] == "-Dag"
4/7 17:27:51 argv[6] == "job.dag"
4/7 17:27:51 argv[7] == "-Rescue"
4/7 17:27:51 argv[8] == "job.dag.rescue"
4/7 17:27:51 argv[9] == "-Condorlog"
4/7 17:27:51 argv[10] == "dummy_log"
4/7 17:27:51 DAG Lockfile will be written to job.dag.lock
4/7 17:27:51 DAG Input file is job.dag
4/7 17:27:51 Rescue DAG will be written to job.dag.rescue
4/7 17:27:51 All DAG node user log files:
4/7 17:27:51   C:\Temp\queue\1bc58cf3ac10013d00aad8f99c5c750a/job.log
4/7 17:27:51 Parsing job.dag ...
4/7 17:27:51 Argument added, Name="procNb"	Value="0"
4/7 17:27:51 jobName: PROC_0
4/7 17:27:51 Dag contains 1 total jobs
4/7 17:27:51 Deleting any older versions of log files...
4/7 17:27:51 ReadMultipleUserLogs: deleting older version of C:\Temp\queue\1bc58cf3ac10013d00aad8f99c5c750a/job.log
4/7 17:27:51 Bootstrapping...
4/7 17:27:51 Number of pre-completed jobs: 0
4/7 17:27:51 Registering condor_event_timer...
4/7 17:27:52 Submitting Condor Job PROC_0 ...
4/7 17:27:52 submitting: condor_submit  -a "dag_node_name = PROC_0" -a "+DAGManJobID = 12.0" -a "submit_event_notes = DAG Node: $(dag_node_name)" -a "procNb = 0" job.sub
4/7 17:27:52 	assigned Condor ID (13.0)
4/7 17:27:52 Just submitted 1 job this cycle...
4/7 17:27:52 Event: ULOG_EXECUTE for Unknown Job (12.0): ignoring...
4/7 17:27:52 Event: ULOG_SUBMIT for Condor Job PROC_0 (13.0)
4/7 17:27:52 Of 1 nodes total:
4/7 17:27:52  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/7 17:27:52   ===     ===      ===     ===     ===        ===      ===
4/7 17:27:52     0       0        1       0       0          0        0
4/7 17:28:27 Event: ULOG_EXECUTE for Condor Job PROC_0 (13.0)
4/7 17:28:32 Event: ULOG_IMAGE_SIZE for Condor Job PROC_0 (13.0)
4/7 17:31:22 Event: ULOG_JOB_TERMINATED for Condor Job PROC_0 (13.0)
4/7 17:31:22 Job PROC_0 completed successfully.
4/7 17:31:22 Running POST script of Job PROC_0...
4/7 17:31:22 Of 1 nodes total:
4/7 17:31:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/7 17:31:22   ===     ===      ===     ===     ===        ===      ===
4/7 17:31:22     0       0        0       1       0          0        0
4/7 17:31:27 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job PROC_0 (13.0)
4/7 17:31:27 POST Script of Job PROC_0 completed successfully.
4/7 17:31:27 Of 1 nodes total:
4/7 17:31:27  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/7 17:31:27   ===     ===      ===     ===     ===        ===      ===
4/7 17:31:27     1       0        0       0       0          0        0
4/7 17:31:27 All jobs Completed!
4/7 17:31:27 **** condor_scheduniv_exec.12.0 (condor_DAGMAN) EXITING WITH STATUS 0
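
-- checking the log --

The claim that "all seems fine" in the log can be checked mechanically by
reading the last node-status table. The following is only a minimal sketch,
assuming the seven-column table layout shown in the log above. The names
last_node_status, dag_finished, and SAMPLE are hypothetical helpers made up
for this illustration and are not part of Condor.

```python
import re

# One status row: "M/D HH:MM:SS" followed by exactly seven integer counts.
# Assumption: the column order matches the table printed in the log above.
ROW = re.compile(r"^\d+/\d+ \d+:\d+:\d+" + r"\s+(\d+)" * 7 + r"\s*$")
COLUMNS = ("done", "pre", "queued", "post", "ready", "unready", "failed")


def last_node_status(log_text):
    """Return the last status row found in the log as a dict, or None."""
    status = None
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            status = dict(zip(COLUMNS, map(int, match.groups())))
    return status


def dag_finished(log_text, total_nodes):
    """True if the last status row reports every node done and none failed."""
    status = last_node_status(log_text)
    return (
        status is not None
        and status["done"] == total_nodes
        and status["failed"] == 0
    )


# Excerpt copied from the log above, used as sample input.
SAMPLE = """\
4/7 17:31:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/7 17:31:22   ===     ===      ===     ===     ===        ===      ===
4/7 17:31:22     0       0        0       1       0          0        0
4/7 17:31:27 POST Script of Job PROC_0 completed successfully.
4/7 17:31:27  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/7 17:31:27   ===     ===      ===     ===     ===        ===      ===
4/7 17:31:27     1       0        0       0       0          0        0
"""

if __name__ == "__main__":
    print(last_node_status(SAMPLE))
    print(dag_finished(SAMPLE, total_nodes=1))
```

For this log the check only confirms what the final table already shows (1 of
1 nodes done, 0 failed). It says nothing about why the condor_dagman process
itself does not exit afterwards.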