Hi All,
I have completed installing Condor 6.8.5 on three
Fedora Core 6 machines.
All of my universes appear to be working
satisfactorily; however, I wanted to verify whether the parallel universe is
behaving correctly. When I submit the job below, only one of the two
machines returns output. The other machine indicates that the user terminated
the job (see the e-mail below). I read in an online posting that
this behavior is correct: once a node in a parallel universe job returns
output, the remaining nodes are killed. Is this correct? Is the log output
from my machine correct? Is this the proper way of terminating a node that
has not completed processing? If not, does anyone see where I am making a
mistake?
Thank you all for your time.
Jeff
My parallel job:
#########################
# Parallel Job
#########################
universe = parallel
executable = /bin/hostname
machine_count = 2
log = parallellogfile
output = outfileMPI.$(NODE)
error = errfileMPI.$(NODE)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
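As a sanity check after the job finishes, each node should have written its hostname to its own output file (the file names come from the submit description above; the loop itself is just a sketch):

```shell
# Each node ran /bin/hostname, so each outfileMPI.$(NODE) should
# contain exactly one line. Report what each node produced:
for f in outfileMPI.0 outfileMPI.1; do
  if [ -s "$f" ]; then
    echo "$f: $(cat "$f")"
  else
    echo "$f: empty or missing"
  fi
done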
I receive correct output from one of the two
machines in my cluster. The e-mail received at the completion of
the job is below; it indicates that the second machine's job was removed by
the user, which I did not knowingly do. Could this be a configuration
error?
From condor@xxxxxxxxxxxxxxxxxxxxx  Tue Jul 24 20:54:18 2007
Return-Path: <condor@xxxxxxxxxxxxxxxxxxxxx>
Received: from stengal.cs.sunyit.edu (localhost.localdomain [127.0.0.1])
        by stengal.cs.sunyit.edu (8.13.8/8.13.8) with ESMTP id l6P0sIOg025549
        for <wells@xxxxxxxxxxxxxxxxxxxxx>; Tue, 24 Jul 2007 20:54:18 -0400
Received: (from condor@localhost)
        by stengal.cs.sunyit.edu (8.13.8/8.13.8/Submit) id l6P0sIWt025548
        for wells@xxxxxxxxxxxxxxxxxxxxx; Tue, 24 Jul 2007 20:54:18 -0400
Date: Tue, 24 Jul 2007 20:54:18 -0400
From: condor@xxxxxxxxxxxxxxxxxxxxx
Message-Id: <200707250054.l6P0sIWt025548@xxxxxxxxxxxxxxxxxxxxx>
To: wells@xxxxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 321.0

This is an automated email from the Condor system
on machine "stengal.cs.sunyit.edu". Do not reply.

Your Condor-MPI job 321.0 has completed.

Here are the machines that ran your MPI job.
They are listed in the order they were started in,
which is the same as MPI_Comm_rank.

Machine Name             Result
------------------------ -----------
jeter.cs.sunyit.edu      exited normally with status 0
dimagio.cs.sunyit.edu    was removed by the user

Have a nice day.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: wellsj1@xxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor

Below is the StartLog from the machine that was removed:
7/24 20:07:04 Remote global job ID is stengal.cs.sunyit.edu#1185322047#319.0
7/24 20:07:04 JobLeaseDuration defined in job ClassAd: 1200
7/24 20:07:04 Resetting ClaimLease timer (19) with new duration
7/24 20:07:04 About to Create_Process "condor_starter -f stengal.cs.sunyit.edu"
7/24 20:07:04 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:04 Got RemoteUser (wells@xxxxxxxxxxxxxxxxxxxxx) from request classad
7/24 20:07:04 Got universe "PARALLEL" (11) from request classad
7/24 20:07:04 State change: claim-activation protocol successful
7/24 20:07:04 Changing activity: Idle -> Busy
7/24 20:07:05 DaemonCore: Command received via UDP from host <192.168.0.40:33369>
7/24 20:07:05 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
7/24 20:07:06 DaemonCore: Command received via TCP from host <192.168.0.60:38711>
7/24 20:07:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/24 20:07:06 Called deactivate_claim_forcibly()
7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
7/24 20:07:07 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
7/24 20:07:07 IO: EOF reading packet header
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2004:Failed to start a session to <192.168.0.40:40894> with TCP|SECMAN:2003:TCP connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: Command received via TCP from host <192.168.0.60:45978>
7/24 20:07:07 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
7/24 20:07:07 Called deactivate_claim()
7/24 20:07:07 In Starter::kill() with pid 5324, sig 15 (SIGTERM)
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2003:TCP auth connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: No more children processes to reap.
7/24 20:07:07 Starter pid 5324 exited with status 4
7/24 20:07:07 ProcAPI::buildFamily failed: parent 5324 not found on system.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 Attempting to remove /home/condor/execute/dir_5324 as SuperUser (root)
7/24 20:07:07 State change: starter exited
7/24 20:07:07 Changing activity: Busy -> Idle
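For anyone wanting to trace the same teardown on their own machines, the claim-deactivation and kill lines can be pulled out of the StartLog with a simple grep (the log path below is an assumption; adjust it for your LOG directory):

```shell
# Path to the startd's StartLog is an assumption -- set it for your install.
LOG=/home/condor/log/StartLog

# Show when the claim was (forcibly) deactivated and which signals
# were sent to the starter process:
grep -E "DEACTIVATE_CLAIM|Starter::kill" "$LOG"
```

In the log above, these lines show the schedd forcibly deactivating the claim at 20:07:06, immediately followed by SIGQUIT/SIGTERM/SIGKILL being delivered to the starter.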