
[Condor-users] I'm having shadow exceptions also



Below are the tails of the logs for a couple of my running jobs. The run
time keeps going up, but the output file hasn't grown at all for a day or
two.
_____________________________________________________
007 (4275.000.000) 02/14 05:18:36 Shadow exception!
        ckpt server store failed
        111447288  -  Run Bytes Sent By Job
        12501003  -  Run Bytes Received By Job
...
001 (4275.000.000) 02/14 05:19:07 Job executing on host: <144.92.73.157:9699>
...
006 (4275.000.000) 02/14 08:19:01 Image size of job updated: 19250
...

==> columnsD16-17_L8-9_NT9_S1_4x100fairCloseStay45_60Psign.log <==
...
007 (4276.000.000) 02/14 05:18:47 Shadow exception!
        ckpt server store failed
        46565200  -  Run Bytes Sent By Job
        12500685  -  Run Bytes Received By Job
...
001 (4276.000.000) 02/14 05:19:14 Job executing on host: <144.92.73.157:9699>
...
006 (4276.000.000) 02/14 08:19:14 Image size of job updated: 16070
...
________________________________________________________________
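To see how often each event shows up per job, the log tails can be tallied with a short script. This is only a rough sketch: the header pattern is inferred from the excerpts above (event code, job id in parentheses, date, time, message), it assumes each event header sits on a single line, and the name summarize_events is made up for illustration.

```python
import re
from collections import Counter

# Event header shape inferred from the excerpts above, for example:
#   007 (4276.000.000) 02/14 05:18:47 Shadow exception!
# This is an assumption based on those lines, not a documented format.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.\d+\.\d+\)\s+(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def summarize_events(log_text):
    """Count events per (cluster id, event code) in a job log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            code, cluster = m.group(1), int(m.group(2))
            counts[(cluster, code)] += 1
    return counts


sample = """\
007 (4276.000.000) 02/14 05:18:47 Shadow exception!
        ckpt server store failed
...
001 (4276.000.000) 02/14 05:19:14 Job executing on host: <144.92.73.157:9699>
...
006 (4276.000.000) 02/14 08:19:14 Image size of job updated: 16070
"""

counts = summarize_events(sample)
for (cluster, code), n in sorted(counts.items()):
    print(cluster, code, n)
```

Run over a whole log file rather than just the tail, the count for code 007 would show whether the shadow exception is a one-off or keeps recurring.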
Is this a sign that my job is too big? I noticed the image size keeps
getting updated. Or is this something that just happens sometimes?
What other parameters can I look at to figure out what is going on?
condor_q -analyze gives:
 __________________________________________________________________
---
4276.000:  Request is being serviced

---
4277.000:  Request is being serviced

---
4278.000:  Run analysis summary.  Of 113 machines,
     73 are rejected by your job's requirements
     21 reject your job because of their own requirements
      8 match, but are serving users with a better priority in the pool
     11 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Thu Feb 16 11:56:47 2006

______________________________________________________________
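To put numbers on the summary for job 4278, the counts can be pulled out and turned into shares of the pool. This is a rough sketch that only handles text shaped like the output pasted above; the function name parse_analysis is made up for illustration.

```python
import re


def parse_analysis(text):
    """Return (total machines, [(count, reason), ...]) from an analysis summary."""
    total = None
    buckets = []
    for line in text.splitlines():
        line = line.strip()
        m = re.search(r"Of (\d+) machines", line)
        if m:
            total = int(m.group(1))
            continue
        # Bucket lines start with a count followed by the reason text.
        m = re.match(r"(\d+) (.+)$", line)
        if m and total is not None:
            buckets.append((int(m.group(1)), m.group(2)))
    return total, buckets


sample = """\
4278.000:  Run analysis summary.  Of 113 machines,
     73 are rejected by your job's requirements
     21 reject your job because of their own requirements
      8 match, but are serving users with a better priority in the pool
     11 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Thu Feb 16 11:56:47 2006
"""

total, buckets = parse_analysis(sample)
for count, reason in buckets:
    print(f"{count:4d}  {100 * count / total:5.1f}%  {reason}")
```

The six buckets add up to the 113 machines, and the 73 machines rejected by the job's own requirements are roughly 65% of the pool.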

Is there something I can change to make my requirements more generic,
so that fewer machines are rejected?


My submit description file is:
___________________________________
  ########################
  # Submit description file for
  # columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign program
  ########################
  Executable     = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60PsignC
  Requirements   = OpSys =!= UNDEFINED
  notification   = Always
  notify_user    = seavey@xxxxxxxxxxx
  Universe       = standard
  Output         = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.out
  input          = post-genM_D16-17_L8-9_NT9_S10fairCloseStay45_60P.params
  Log            = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.log
  error          = columnsD16-17_L8-9_NT9_S10_4x100fairCloseStay45_60Psign.error
  Queue

_____________________________________