
[HTCondor-users] condor_submit logfile is missing "job submitted" entries



Hey all,

 

I’ve got two machines that I thought were configured identically, but one has started misbehaving.

The misbehaving one now writes logfiles (from the submission, not the daemon logs) which are missing the “000 Job submitted” entries.

It didn’t do so a week or two ago – I can see some correct submission logfiles on the misbehaving box.

 

Having the “job submitted” entries go missing breaks the Condor Perl module.

That module counts jobs up as they’re submitted and back down as they retire. If no submissions are logged, the retires immediately push the count negative, so it never returns to zero and the Monitor never terminates.

(The missing entries aren’t specific to the Perl module; I can reproduce the problem by invoking condor_submit manually.)
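
To illustrate the failure mode, here is a rough sketch of that kind of submit/retire counting. It is written in Python rather than Perl and is not the Condor Perl module's actual code. The function name `outstanding_after_each_event` is made up for this example, and it assumes only event 005 (job terminated) counts as a retire; the real module may well count other events too.

```python
import re

# Header of an event line in a submission logfile, for example:
#   005 (012.000.000) 11/19 17:15:08 Job terminated.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

def outstanding_after_each_event(log_text):
    """Return the running job count after each 000 (submit) or 005 (terminate) event."""
    count = 0
    history = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue  # detail lines and "..." separators
        code = match.group(1)
        if code == "000":
            count += 1  # job submitted
        elif code == "005":
            count -= 1  # job retired (assumption: only 005 counts)
        else:
            continue  # other events do not change the count
        history.append(count)
    return history

good_log = """000 (001.000.000) 11/04 11:05:47 Job submitted from host: <myip:57120?addrs=myip-57120>
...
001 (001.000.000) 11/04 11:06:06 Job executing on host: <otherip:58825?addrs=otherip-58825>
...
005 (001.000.000) 11/04 11:06:06 Job terminated.
"""

bad_log = """001 (012.000.000) 11/19 17:15:08 Job executing on host: <other2ip:59153?addrs=other2ip-59153>
...
005 (012.000.000) 11/19 17:15:08 Job terminated.
"""

print(outstanding_after_each_event(good_log))  # [1, 0]
print(outstanding_after_each_event(bad_log))   # [-1]
```

With the submit event present the count returns to zero; without it the first retire drives the count to -1 and a "wait until zero" loop never finishes.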

 

They’re both Windows Server 2012 R2, running $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $ $CondorPlatform: x86_64_Windows8 $

 

When I run a hello world submission:

 

#---------- Condor Variables -----------------------------------------

universe       = vanilla

priority       = 0

output         = $(Cluster).$(Process).out

error          = $(Cluster).$(Process).err

log            = $(Cluster).log

#-------------------------------------------------------------------------

 

executable             = CondorTestJob.bat

should_transfer_files   = YES

transfer_input_files    = CondorTestJob.pl

when_to_transfer_output = ON_EXIT

notification            = never

run_as_owner            = TRUE

#==============================================

#                 Machines

#==============================================

 

requirements   = ((Arch == "Intel") || (Arch == "X86_64")) && ((OpSys == "WINNT51") || (OpSys == "WINNT52") || (OpSys == "WINNT61") || (OpSys == "WINDOWS")) && ((LocalCredd =?= "mylocalcredd") || (LocalCredd =?= "mylocalcredd:9620"))

queue

 

A good logfile looks like this:

000 (001.000.000) 11/04 11:05:47 Job submitted from host: <myip:57120?addrs=myip-57120>

...

001 (001.000.000) 11/04 11:06:06 Job executing on host: <otherip:58825?addrs=otherip-58825>

...

006 (001.000.000) 11/04 11:06:06 Image size of job updated: 1

     0  -  MemoryUsage of job (MB)

     0  -  ResidentSetSize of job (KB)

...

005 (001.000.000) 11/04 11:06:06 Job terminated.

     (1) Normal termination (return value 0)

           Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage

     1145  -  Run Bytes Sent By Job

     84  -  Run Bytes Received By Job

     1145  -  Total Bytes Sent By Job

     84  -  Total Bytes Received By Job

     Partitionable Resources :    Usage  Request Allocated

        Cpus                 :                 1         1

        Disk (KB)            :       11        2  97755693

        Memory (MB)          :        0        1      8189

...

 

While a bad one misses the all-important first message:

001 (012.000.000) 11/19 17:15:08 Job executing on host: <other2ip:59153?addrs=other2ip-59153>

...

006 (012.000.000) 11/19 17:15:08 Image size of job updated: 1

     0  -  MemoryUsage of job (MB)

     0  -  ResidentSetSize of job (KB)

...

005 (012.000.000) 11/19 17:15:08 Job terminated.

     (1) Normal termination (return value 0)

           Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage

           Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage

     1153  -  Run Bytes Sent By Job

     84  -  Run Bytes Received By Job

     1153  -  Total Bytes Sent By Job

     84  -  Total Bytes Received By Job

     Partitionable Resources :    Usage  Request Allocated

        Cpus                 :                 1         1

        Disk (KB)            :       11        2  91585753

        Memory (MB)          :        0        1      8189

...
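
For checking older logfiles on the misbehaving box, a small script along these lines could list which jobs have events but no "000 Job submitted" entry. This is only a sketch: `jobs_missing_submit` is a hypothetical helper, and it assumes every event line starts with a three-digit code followed by (cluster.proc.subproc), as in the excerpts above.

```python
import re
from collections import defaultdict

# Matches the start of an event line, e.g. "001 (012.000.000) ..."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

def jobs_missing_submit(log_text):
    """Return sorted (cluster, proc) pairs that have events but no 000 event."""
    codes_by_job = defaultdict(set)
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            job = (int(match.group(2)), int(match.group(3)))
            codes_by_job[job].add(match.group(1))
    return sorted(job for job, codes in codes_by_job.items() if "000" not in codes)

sample_log = """000 (001.000.000) 11/04 11:05:47 Job submitted from host: <myip:57120?addrs=myip-57120>
...
005 (001.000.000) 11/04 11:06:06 Job terminated.
...
001 (012.000.000) 11/19 17:15:08 Job executing on host: <other2ip:59153?addrs=other2ip-59153>
...
005 (012.000.000) 11/19 17:15:08 Job terminated.
...
"""

print(jobs_missing_submit(sample_log))  # [(12, 0)]
```

Running it over each logfile in date order should show roughly when the submit entries stopped appearing.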

 

The only other piece of weirdness I can find is that the misbehaving box doesn’t want me to open a logfile by double-clicking in Windows Explorer – but right-clicking and using Open With… (anything) is fine.

 

I’ve diffed the daemon logs against a box that’s still working correctly, and nothing jumps out.

Any ideas where I might look next?

 

Thanks!

 

--------------------------------------------------------------

Jason Ross              Intel Corporation

Graphics Architect      FM5-64

VPG Architecture        1900 Prairie City Road

(916) 356-8964          Folsom, CA  95630