
Re: [Condor-users] log file indicates termination of job, but output file is empty !?!



Good morning Rob,
 I recommend adding the Error variable too. Maybe something is going wrong
on the execute nodes and the standard output doesn't catch it.
 You can add a line like this:
 error = 005/$(Cluster)_$(PROCESS).error
and test again.
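
For reference, here is a sketch of how your submit file might look with that
line added. Everything except the error line is copied from the file you
quoted below; the ".error" suffix is only my naming choice, and the 005
directory is the same one your output and log lines already use.

```
#---- Condor submit file (sketch: only the "error" line is new)
Universe   = Vanilla
Executable = sleeper.exe
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")

output = 005/$(Cluster)_$(PROCESS).out
# Capture standard error per job, next to the .out and .log files
error  = 005/$(Cluster)_$(PROCESS).error
log = 005/$(Cluster)_$(PROCESS).log
log_xml = true

arguments = "5"
Queue 15000
#----
```

If the .error file of one of the affected jobs is not empty, it should show
whatever sleeper.exe wrote to standard error on the execute node.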

Regards.


On 11/24/10, Rob <spamrefuse@xxxxxxxxx> wrote:
>
> Hi,
>
> I am testing my Condor pool by sending a large number of jobs to it:
>
> #---- Condor submit file
> Universe   = Vanilla
> Executable = sleeper.exe
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
>
> Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")
>
> output = 005/$(Cluster)_$(PROCESS).out
> log = 005/$(Cluster)_$(PROCESS).log
> log_xml = true
>
> arguments = "5"
> Queue 15000
> #----
>
>
> The 'arguments = "5"' tells the sleeper.exe to sleep for 5 minutes, so I know
> that this job will run for close to 5 minutes on a pool PC.
>
> Most of the jobs complete nicely, giving the report in the .log file and its
> output in the .out file.
>
> However, some jobs indicate that they have completed (see below), but the
> output file remains empty.
> Notice that the "SentBytes" and "TotalSentBytes" at the end of the log file
> are both zero in this case!
>
> Any idea why and how this happens?
> Should I investigate further? If yes, how?
>
> Thanks,
> Rob.
>
>
>
> <c>
>     <a n="MyType"><s>SubmitEvent</s></a>
>     <a n="EventTypeNumber"><i>0</i></a>
>     <a n="EventTime"><s>2010-11-24T08:46:40</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="SubmitHost"><s>&lt;115.125.120.71:60614&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTypeNumber"><i>1</i></a>
>     <a n="EventTime"><s>2010-11-24T13:18:55</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="ExecuteHost"><s>&lt;115.145.228.43:1047&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobImageSizeEvent</s></a>
>     <a n="EventTypeNumber"><i>6</i></a>
>     <a n="EventTime"><s>2010-11-24T13:19:03</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="Size"><i>756</i></a>
> </c>
> <c>
>     <a n="MyType"><s>JobSuspendedEvent</s></a>
>     <a n="EventTypeNumber"><i>10</i></a>
>     <a n="EventTime"><s>2010-11-24T13:19:38</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="NumberOfPIDs"><i>1</i></a>
> </c>
> <c>
>     <a n="MyType"><s>JobDisconnectedEvent</s></a>
>     <a n="EventTypeNumber"><i>22</i></a>
>     <a n="EventTime"><s>2010-11-24T15:19:38</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="StartdAddr"><s>&lt;115.145.228.43:1047&gt;</s></a>
>     <a n="StartdName"><s>slot1@06-3</s></a>
>     <a n="DisconnectReason"><s>Socket between submit and execute hosts closed unexpectedly</s></a>
>     <a n="EventDescription"><s>Job disconnected, attempting to reconnect</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobReconnectFailedEvent</s></a>
>     <a n="EventTypeNumber"><i>24</i></a>
>     <a n="EventTime"><s>2010-11-24T15:19:38</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="StartdName"><s>slot1@06-3</s></a>
>     <a n="Reason"><s>Job disconnected too long: JobLeaseDuration (1200 seconds) expired</s></a>
>     <a n="EventDescription"><s>Job reconnect impossible: rescheduling job</s></a>
> </c>
> <c>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTypeNumber"><i>1</i></a>
>     <a n="EventTime"><s>2010-11-24T15:19:54</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="ExecuteHost"><s>&lt;115.145.228.201:1045&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobSuspendedEvent</s></a>
>     <a n="EventTypeNumber"><i>10</i></a>
>     <a n="EventTime"><s>2010-11-24T15:20:47</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="NumberOfPIDs"><i>1</i></a>
> </c>
> <c>
>     <a n="MyType"><s>JobUnsuspendedEvent</s></a>
>     <a n="EventTypeNumber"><i>11</i></a>
>     <a n="EventTime"><s>2010-11-24T15:25:52</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
> </c>
> <c>
>     <a n="MyType"><s>JobEvictedEvent</s></a>
>     <a n="EventTypeNumber"><i>4</i></a>
>     <a n="EventTime"><s>2010-11-24T15:25:52</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="Checkpointed"><b v="f"/></a>
>     <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="SentBytes"><r>0.000000000000000E+00</r></a>
>     <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
>     <a n="TerminatedAndRequeued"><b v="f"/></a>
>     <a n="TerminatedNormally"><b v="f"/></a>
> </c>
> <c>
>     <a n="MyType"><s>ExecuteEvent</s></a>
>     <a n="EventTypeNumber"><i>1</i></a>
>     <a n="EventTime"><s>2010-11-24T15:26:00</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="ExecuteHost"><s>&lt;115.145.230.95:4146&gt;</s></a>
> </c>
> <c>
>     <a n="MyType"><s>JobTerminatedEvent</s></a>
>     <a n="EventTypeNumber"><i>5</i></a>
>     <a n="EventTime"><s>2010-11-24T15:26:00</s></a>
>     <a n="Cluster"><i>319</i></a>
>     <a n="Proc"><i>4146</i></a>
>     <a n="Subproc"><i>0</i></a>
>     <a n="TerminatedNormally"><b v="t"/></a>
>     <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
>     <a n="SentBytes"><r>0.000000000000000E+00</r></a>
>     <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
>     <a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
>     <a n="TotalReceivedBytes"><r>4.747200000000000E+04</r></a>
> </c>
>
>
>


-- 
----
Edier Alberto Zapata Hernández
Est. Ingeniería de Sistemas
Universidad de Valle