[Condor-users] DAGMan log file bug....



I think I've found a bug in how DAGMan manages the log files for a large DAG. More precisely, the problem seems to be in parsing the DAG and .sub files, and it ultimately causes the DAG to hang without ever completing. I'm running Condor 6.8.0 on Windows XP. Here are some details:

I have a DAG with 82 nodes and no dependencies. In the .dag.dagman.out file I can see that, for a few of my nodes, the log file name is not read correctly from the .sub file: the path comes out truncated. Some of the pertinent lines from the .dag.dagman.out file are included below. Since DAGMan never gets the name right, it cannot read those log files, so the usual ULOG events never show up for those nodes and it never learns that they have completed. The nodes' own log files are created and contain reasonable information. Finally, if I build a DAG from a subset of the nodes, the problem goes away (or at least moves).

Help?

Thanks in advance,
Bob Mortensen

8/23 13:49:24 ******************************************************
8/23 13:49:24 ** condor_scheduniv_exec.1400.0 (CONDOR_DAGMAN) STARTING UP
8/23 13:49:24 ** D:\Condor\bin\condor_dagman.exe
8/23 13:49:24 ** $CondorVersion: 6.8.0 Jul 19 2006 $
8/23 13:49:24 ** $CondorPlatform: INTEL-WINNT50 $
8/23 13:49:24 ** PID = 6120
8/23 13:49:24 ** Log last touched time unavailable (No such file or directory)
8/23 13:49:24 ******************************************************
8/23 13:49:24 Using config source: D:\condor\condor_config
8/23 13:49:24 Using local config sources:
8/23 13:49:24    D:\condor/condor_config.local
8/23 13:49:24 DaemonCore: Command Socket at <10.1.3.110:2226>
8/23 13:49:24 DAGMAN_SUBMIT_DELAY setting: 0
8/23 13:49:24 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
8/23 13:49:24 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
8/23 13:49:24 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
8/23 13:49:24 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
8/23 13:49:24 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
8/23 13:49:24 DAGMAN_RETRY_NODE_FIRST setting: 0
8/23 13:49:24 DAGMAN_MAX_JOBS_IDLE setting: 0
8/23 13:49:24 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
8/23 13:49:24 DAGMAN_MUNGE_NODE_NAMES setting: 1
8/23 13:49:24 DAGMAN_DELETE_OLD_LOGS setting: 1
8/23 13:49:24 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
8/23 13:49:24 argv[0] == "condor_scheduniv_exec.1400.0"
8/23 13:49:24 argv[1] == "-Debug"
8/23 13:49:24 argv[2] == "3"
8/23 13:49:24 argv[3] == "-Lockfile"
8/23 13:49:24 argv[4] == "testcases.dag.lock"
8/23 13:49:24 argv[5] == "-Condorlog"
8/23 13:49:24 argv[6] == "E:\CondorTemp\JOB.1036\AddMoveRenGrpFilters\condor/AddMoveRenGrpFilters.log"
8/23 13:49:24 argv[7] == "-Dag"
8/23 13:49:24 argv[8] == "testcases.dag"
8/23 13:49:24 argv[9] == "-Rescue"
8/23 13:49:24 argv[10] == "testcases.dag.rescue"
8/23 13:49:24 argv[11] == "-MaxIdle"
8/23 13:49:24 argv[12] == "5"
8/23 13:49:24 argv[13] == "-MaxPre"
8/23 13:49:24 argv[14] == "5"
8/23 13:49:24 argv[15] == "-MaxPost"
8/23 13:49:24 argv[16] == "5"
8/23 13:49:24 DAG Lockfile will be written to testcases.dag.lock
8/23 13:49:24 DAG Input file is testcases.dag
8/23 13:49:24 Rescue DAG will be written to testcases.dag.rescue
8/23 13:49:24 All DAG node user log files:
8/23 13:49:24 E:\CondorTemp\JOB.1036\AddMoveRenGrpFilters\condor/AddMoveRenGrpFilters.log (Condor)
8/23 13:49:24 E:\CondorTemp\JOB.1036\AngularLuminanceMeter\condor/AngularLuminanceMeter.log (Condor)
8/23 13:49:24 E:\CondorTemp\JOB.1036\BestFocusExiting\condor/BestFocusExiting.log (Condor)
    .... lines skipped ....
8/23 13:49:24   E:\CondorTemp\JOB.1036\absViaBulkTrans\condo (Condor)
    .... lines skipped ....
8/23 13:49:24   E:\CondorTemp\JOB.1036\cylVolApod\condor/ (Condor)
    .... lines skipped ....
8/23 13:49:24   E:\CondorTemp\JOB.1036\exitingRayDataLumTest\con (Condor)
    .... lines skipped ....
8/23 13:49:24 Parsing testcases.dag ...
8/23 13:49:24 Dag contains 82 total jobs
8/23 13:49:24 Deleting any older versions of log files...
8/23 13:49:24 MultiLogFiles: deleting older version of E:\CondorTemp\JOB.1036\cylVolApod\condor/
8/23 13:49:24 MultiLogFiles error: can't remove E:\CondorTemp\JOB.1036\cylVolApod\condor/
8/23 13:49:24 MultiLogFiles: deleting older version of E:\CondorTemp\JOB.1036\exitingRayDataLumTest\con
8/23 13:49:24 MultiLogFiles error: can't remove E:\CondorTemp\JOB.1036\exitingRayDataLumTest\con
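
For anyone trying to reproduce this, the truncated names in the "All DAG node user log files" listing above can be picked out mechanically. The following is only a rough sketch, not anything that ships with Condor: the function name find_suspect_log_paths is made up for illustration, and it assumes that every intact user log path ends in ".log" (true of the complete entries in this excerpt, but not guaranteed in general) and that each entry has the "<timestamp> <path> (Condor)" shape shown above.

```python
import re

# One "<timestamp> <path> (Condor)" entry, as it appears in the excerpt above.
# Assumption: this shape is taken from this one log and may differ elsewhere.
ENTRY = re.compile(r"\d+/\d+ \d+:\d+:\d+\s+(\S.*?)\s+\(Condor\)")


def find_suspect_log_paths(text, expected_suffix=".log"):
    """Return listed user-log paths that do not end with expected_suffix.

    Hypothetical helper: a path is treated as truncated purely because it
    lacks the expected suffix, which is an assumption about naming.
    """
    paths = [m.group(1) for m in ENTRY.finditer(text)]
    return [p for p in paths if not p.lower().endswith(expected_suffix)]


# A few lines copied from the dagman.out excerpt above.
sample = "\n".join([
    r"8/23 13:49:24 All DAG node user log files:",
    r"8/23 13:49:24 E:\CondorTemp\JOB.1036\BestFocusExiting\condor/BestFocusExiting.log (Condor)",
    r"8/23 13:49:24   E:\CondorTemp\JOB.1036\absViaBulkTrans\condo (Condor)",
    r"8/23 13:49:24   E:\CondorTemp\JOB.1036\cylVolApod\condor/ (Condor)",
])

for path in find_suspect_log_paths(sample):
    print(path)
```

Running the same function over the whole .dag.dagman.out file (read into one string) should list every node whose log path was cut short, which may help show whether the truncation follows a pattern such as position in the DAG or path length.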