
Re: [Condor-users] why my submitted job runs for 2 mins and gets suspended and unsuspended every 10 mins



Hi,

I've had the same problem with a Windows cluster using a Cygwin environment. I contacted the developers about it, and eventually we discovered that the process tracking in the startd and starter daemons was losing track of some of the processes the job created. In my case the daemons kept track of the Windows cmd process and the perl process but lost track of all processes spawned from within the perl script. Since the daemons didn't recognise the perl-spawned processes as part of the job, they registered the CPU as busy with non-Condor processes and suspended the job (after the configuration file's standard 2 minute wait period). The perl-spawned processes were not suspended, so they kept the CPU busy, and the job was subsequently evicted after the configuration file's 10 minute wait period. Your job appears to be showing the same characteristics.

A fix that might work is to set EXECUTE_LOGIN_IS_DEDICATED = True in your configuration file. This worked for me as far as suspending all jobs owned by the condor user. However, the processes seen by the starter and startd daemons are not necessarily identical in this case. In my case the starter did suspend all jobs owned by the user, but the startd still didn't see the perl-spawned processes and still registered the CPU as busy. This caused cyclical suspend-unsuspend behaviour. Here's what the development team last suggested to me:

Unfortunately, the behavior you
are seeing is expected given the way that our process family tracking code
works in the 6.8.x series and the 6.9.x series prior to 6.9.2.

In 6.9.2, all process family tracking logic has been moved into a separate
process, the ProcD. This gives Condor a single global view of its
tree of processes. In particular, the problem you see will not occur since
the StartD will see all the same processes that the Starter does.

So it appears you have two options to fix your problem:

1) Upgrade to 6.9.2 - our latest development release of Condor

2) Try to structure your jobs such that the  EXECUTE_LOGIN_IS_DEDICATED
  feature is not needed. Typically, this means ensuring that Condor can
  trace parent PID pointers from all the job's processes back to the
  Starter. Specifically, behavior that violates this is having one
  process from your job create a child and then exit without waiting
  for the child to finish. Hopefully, it is possible for you to make
  this modification for your job.
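
As a rough illustration of option 2, here is a minimal Python sketch of a wrapper that starts a child process and waits for it instead of exiting early, so the parent PID chain back to the Starter stays intact. The function name run_child_and_wait and the example child command are made up for illustration; they are not part of Condor or of the original job.

```python
import subprocess
import sys


def run_child_and_wait(cmd):
    """Start a child process and block until it exits.

    Because the parent stays alive for the child's whole lifetime, the
    child's parent PID keeps pointing at a process that is still part
    of the job.
    """
    child = subprocess.Popen(cmd)
    # wait() returns the child's exit code once it has finished.
    return child.wait()


if __name__ == "__main__":
    # Hypothetical child command; replace with the real work of the job.
    exit_code = run_child_and_wait([sys.executable, "-c", "print('child done')"])
    print("child exited with", exit_code)
```

The pattern to avoid is the opposite one: starting the child and returning immediately, which leaves the child running with no living parent inside the job.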

Regards,

Alan

On 26/04/07, VIT Students < vit.gridproject@xxxxxxxxx> wrote:
Hi,

I have set up Condor on 2 machines with NFS sharing (Linux Fedora Core 3). The installation and configuration went fine. But when I submit a simple example job, sh_loop.cmd, it executes for a couple of minutes, then gets suspended, after some time gets unsuspended again, and then gets evicted. Can anyone please help me out with this? I am showing the sh_loop.log file here; I hope it is of some help. This log file is from a single machine. I am facing the same problem on single and multiple machines.




000 (002.000.000) 04/25 20:02:23 Job submitted from host: <127.0.0.1:32878>
...
001 (002.000.000) 04/25 20:02:25 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:02:30 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:12:31 Job was unsuspended.
...
004 (002.000.000) 04/25 20:12:32 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
001 (002.000.000) 04/25 20:22:29 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:22:33 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:32:34 Job was unsuspended.
...
004 (002.000.000) 04/25 20:32:34 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
001 (002.000.000) 04/25 20:42:28 Job executing on host: <127.0.0.1:32877>
...
010 (002.000.000) 04/25 20:42:33 Job was suspended.
    Number of processes actually suspended: 2
...
011 (002.000.000) 04/25 20:52:36 Job was unsuspended.
...
004 (002.000.000) 04/25 20:52:36 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
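
To check how long each suspension in a log like the one above lasted, a small Python sketch such as the following could be used. It assumes the user log format shown here, with event code 010 for "suspended" and 011 for "unsuspended", and it assumes a year because the log timestamps carry none. The function name suspension_durations is made up for illustration.

```python
import re
from datetime import datetime

# Matches event header lines such as:
# 010 (002.000.000) 04/25 20:02:30 Job was suspended.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
)


def suspension_durations(log_text, year=2007):
    """Return the length in seconds of each suspend/unsuspend pair."""
    durations = []
    suspended_at = None
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue  # detail lines and "..." separators
        code, _job_id, stamp = match.groups()
        # The log has no year, so one is assumed for parsing.
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        if code == "010":
            suspended_at = when
        elif code == "011" and suspended_at is not None:
            durations.append((when - suspended_at).total_seconds())
            suspended_at = None
    return durations
```

Applied to the three cycles in this log, each suspension comes out at roughly 10 minutes, which matches the 10 minute wait period before eviction.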

