
[Condor-users] condor: More jobs running than nodes available



Hi,

After running condor_q I get:


539 jobs; 20 idle, 190 running, 329 held

but condor_status shows:

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX   104     0      87        17       0          0        0

               Total   104     0      87        17       0          0        0


Looking at the shadow logs, I see many errors like:


5/15 11:52:03 (239919.0) (15089): Job 239919.0 going into Hold state (code 13,2): Error from starter on vm4@host: STARTER failed to receive file(s) from <Master_IP:33088>; SHADOW at Master_IP failed to send file(s) to <host:42514>: error reading from /cafdata/cafIn/submit_mmp_long_260583_14146/stage/__job_in__.tgz: (errno 2) No such file or directory

Obviously, that file does not exist:

[cdfcaf@ log]$ ls -lsa /cafdata/cafIn/submit_mmp_long_260583_14146/
total 820
  16 drwxrwxrwx    2 cdfcaf   cdfcaf      16384 May 14 23:06 .
  24 drwxrwxr-x  345 cdfcaf   cdfcaf      24576 May 15 09:16 ..
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 May 14 22:54 dprintf_failure.DAGMAN
   4 -rw-r--r--    1 cdfcaf   cdfcaf          7 May 12 11:19 job.ClusterId
  16 -rw-r--r--    1 cdfcaf   cdfcaf      14476 May 12 11:19 job.dag
   4 -rw-r--r--    1 cdfcaf   cdfcaf        571 May 12 11:19 job.dagman.ClassAd
  36 -rw-r--r--    1 cdfcaf   cdfcaf      36864 May 14 22:54 job.dagman.dagman.out
   0 -rw-rw-r--    1 cdfcaf   cdfcaf          0 May 14 22:54 job.dagman.lib.out
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 May 12 11:23 job.dagman.lock
   4 -rw-r--r--    1 cdfcaf   cdfcaf        452 May 12 11:19 job.Descript
   4 -rw-r--r--    1 cdfcaf   cdfcaf         17 May 12 11:09 job.email
 452 -rw-------    1 cdfcaf   cdfcaf     458302 May 15 11:53 job.log
  20 -rw-rw-rw-    1 cdfcaf   cdfcaf      20030 May 12 12:04 job.log.01.dmpi
   4 -rw-r--r--    1 cdfcaf   cdfcaf         98 May 12 11:09 job.outurl
   4 -rwxr--r--    1 cdfcaf   cdfcaf        101 May 12 11:09 mark_removed.sh
   4 -rwxr--r--    1 cdfcaf   cdfcaf         19 May 12 11:09 return_OK.sh
 220 -rw-rw-rw-    1 cdfcaf   cdfcaf     217888 May 14 23:06 sections.ClassAd.zip
   8 -rw-rw-rw-    1 cdfcaf   cdfcaf       4424 May 14 23:06 sections.log.tgz
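
The same check could be scripted across the whole shadow log instead of running ls by hand for each job. This is only a minimal sketch in Python, assuming every hold message follows the format of the line quoted above and that paths contain no spaces; the name missing_input_files and the regular expression are illustrative and not part of Condor:

```python
import os
import re

# Matches a shadow-log hold message such as:
#   ... (239919.0) (15089): Job 239919.0 going into Hold state ...
#   ... error reading from /some/path: (errno 2) No such file or directory
# Assumption: the path contains no whitespace, as in the quoted log line.
HOLD_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\).*Job \S+ going into Hold state"
    r".*error reading from (?P<path>\S+): \(errno \d+\)"
)


def missing_input_files(log_lines):
    """Return {job_id: path} for held jobs whose input file is absent now."""
    missing = {}
    for line in log_lines:
        m = HOLD_RE.search(line)
        # Only report paths that still do not exist on this machine.
        if m and not os.path.exists(m.group("path")):
            missing[m.group("job")] = m.group("path")
    return missing
```

It would be called on the submit machine with the open shadow log, for example `missing_input_files(open(path_to_shadow_log))`, where path_to_shadow_log is whatever the local shadow log location is.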


And the jobs go into the Hold state.


So I think I must resubmit this job, is that right? Yesterday Condor
worked fine, but the disk filled up. I freed some space and restarted
Condor, and now nothing works...

What is the correct procedure when the disk is full? Why are all the
jobs corrupted now?


TIA,
Arnau