[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] log file indicates termination of job, but output file is empty !?!



Edier Alberto Zapata Hernández wrote:
> On 11/24/10, Rob <spamrefuse@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I am testing my condor pool by sending a large amount of jobs to it:
>>
>> #---- Condor submit file
>> Universe   = Vanilla
>> Executable = sleeper.exe
>> should_transfer_files = YES
>> when_to_transfer_output = ON_EXIT
>>
>> Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")
>>
>> output = 005/$(Cluster)_$(PROCESS).out
>> log = 005/$(Cluster)_$(PROCESS).log
>> log_xml = true
>>
>> arguments = "5"
>> Queue 15000
>> #----
>>
>>
>> The 'arguments = "5"' tells the sleeper.exe to sleep for 5 minutes, so I
>> know
>> that this job will run for close to 5 minutes on a pool PC.
>>
>> Most of the jobs complete nicely, giving the report in the .log file and its
>> output in the .out file.
>>
>> However, some jobs indicate that they have completed (see below), but the
>> output
>> file remains empty.
>> Notice that the "SentBytes" and "TotalSentBytes" at the end of the log file
>> are
>> both zero in this case!
>>
>> Any idea why and how this happens?
>> Should I investigate further? If yes, how?
>
> Good morning Rob,
>  I recommend add the Error variable too, maybe there's something wrong
> in the execute nodes, and the Standard output don't catch it.
>  You can add a line like this:
>  error = 005/$(Cluster)_$(PROCESS).error
> and test again.


Thank you. I have tested again using the error entry in the submit file.
I found that both the error and output files are empty when this problem 
re-occured.
The log file tells me that the job has terminated.

This is what I think is relevant in the log files:
(Notice the exit status "-1073741502"; what does that mean?)

ShadowLog:11/25 12:55:13 Initializing a VANILLA shadow for job 322.1464
ShadowLog:11/25 12:55:13 (322.1464) (19248): Request to run on slot1@34-5 
<115.125.128.213:1045> was ACCEPTED

SchedLog:11/25 12:55:13 (pid:2043) Starting add_shadow_birthdate(322.1464)
SchedLog:11/25 12:55:13 (pid:2043) Started shadow for job 322.1464 on slot1@34-5 
<115.125.128.213:1045> for spamrefuse@xxxxxxxxxxxxxx, (shadow pid = 19248)

ShadowLog:11/25 12:55:20 (322.1464) (19248): Job 322.1464 terminated: exited 
with status -1073741502
ShadowLog:11/25 12:55:20 (322.1464) (19248): **** condor_shadow (condor_SHADOW) 
pid 19248 EXITING WITH STATUS 100

SchedLog:11/25 12:55:20 (pid:2043) Shadow pid 19248 for job 322.1464 exited with 
status 100



Here are some more details of what I have done:

$ cat 010.condor 
Universe   = Vanilla
Executable = sleeper.exe
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

Requirements = (target.Arch == "INTEL") && (target.OpSys == "WINNT51")

output = 010/$(Cluster)_$(PROCESS).out
error = 010/$(Cluster)_$(PROCESS).err
log = 010/$(Cluster)_$(PROCESS).log
log_xml = true

arguments = "10"
Queue 10000

$ condor_submit 010.condor

[...snip...and wait for some time....]

$ ls -1s 010/322_1464.*
0K 010/322_1464.err
8K 010/322_1464.log
0K 010/322_1464.out

$ cat 010/322_1464.log
<c>
    <a n="MyType"><s>SubmitEvent</s></a>
    <a n="EventTypeNumber"><i>0</i></a>
    <a n="EventTime"><s>2010-11-25T08:45:51</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="SubmitHost"><s>&lt;115.145.140.71:60614&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2010-11-25T12:05:21</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;115.145.228.233:1048&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobImageSizeEvent</s></a>
    <a n="EventTypeNumber"><i>6</i></a>
    <a n="EventTime"><s>2010-11-25T12:05:29</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Size"><i>756</i></a>
</c>
<c>
    <a n="MyType"><s>JobSuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>10</i></a>
    <a n="EventTime"><s>2010-11-25T12:08:39</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
    <a n="MyType"><s>JobUnsuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>11</i></a>
    <a n="EventTime"><s>2010-11-25T12:13:44</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="EventTime"><s>2010-11-25T12:13:44</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2010-11-25T12:14:16</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;115.145.228.111:1047&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobSuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>10</i></a>
    <a n="EventTime"><s>2010-11-25T12:23:37</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
    <a n="MyType"><s>JobUnsuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>11</i></a>
    <a n="EventTime"><s>2010-11-25T12:28:42</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="EventTime"><s>2010-11-25T12:28:42</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2010-11-25T12:29:01</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;115.145.228.116:1046&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobSuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>10</i></a>
    <a n="EventTime"><s>2010-11-25T12:38:48</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
    <a n="MyType"><s>JobUnsuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>11</i></a>
    <a n="EventTime"><s>2010-11-25T12:43:53</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="EventTime"><s>2010-11-25T12:43:53</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2010-11-25T12:44:03</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;115.145.228.94:1050&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobSuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>10</i></a>
    <a n="EventTime"><s>2010-11-25T12:50:06</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="NumberOfPIDs"><i>1</i></a>
</c>
<c>
    <a n="MyType"><s>JobUnsuspendedEvent</s></a>
    <a n="EventTypeNumber"><i>11</i></a>
    <a n="EventTime"><s>2010-11-25T12:55:11</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
</c>
<c>
    <a n="MyType"><s>JobEvictedEvent</s></a>
    <a n="EventTypeNumber"><i>4</i></a>
    <a n="EventTime"><s>2010-11-25T12:55:11</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="Checkpointed"><b v="f"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
    <a n="TerminatedAndRequeued"><b v="f"/></a>
    <a n="TerminatedNormally"><b v="f"/></a>
</c>
<c>
    <a n="MyType"><s>ExecuteEvent</s></a>
    <a n="EventTypeNumber"><i>1</i></a>
    <a n="EventTime"><s>2010-11-25T12:55:19</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ExecuteHost"><s>&lt;115.145.228.213:1045&gt;</s></a>
</c>
<c>
    <a n="MyType"><s>JobTerminatedEvent</s></a>
    <a n="EventTypeNumber"><i>5</i></a>
    <a n="EventTime"><s>2010-11-25T12:55:20</s></a>
    <a n="Cluster"><i>322</i></a>
    <a n="Proc"><i>1464</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="TerminatedNormally"><b v="t"/></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>0.000000000000000E+00</r></a>
    <a n="ReceivedBytes"><r>2.373600000000000E+04</r></a>
    <a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
    <a n="TotalReceivedBytes"><r>1.186800000000000E+05</r></a>
</c>


Notice that "SentBytes" and "TotalReceivedBytes" are both zero; what does that 
mean?

Thank you!
Rob.