
Re: [Condor-users] How to solve problem between condor and globus?



On Dec 19, 2005, at 4:17 AM, Fu-Ming Tsai wrote:

Hello, all,
I installed Condor on 2 machines (I do not use NFS or AFS) and tried to submit some Globus jobs. When I submitted a vanilla job, everything worked fine. However, when I tried to submit Globus jobs, my jobs stayed in the job queue no matter whether a machine was available. Does anyone know how to solve this?

The following is what I got from the log files.
==MatchLog==
12/19 07:42:20       Matched 4350.0 sary357@xxxxxxxxxxxxxxxxxx
<140.109.98.41:34992> preempting none <140.109.98.40:42530>


===NegotiatorLog==
12/19 07:42:20   Getting startd private ads ...
12/19 07:42:20 Got ads: 14 public and 4 private
12/19 07:42:20 Public ads include 5 submitter, 4 startd
12/19 07:42:20 Phase 2:  Performing accounting ...
12/19 07:42:20 Phase 3:  Sorting submitter ads by priority ...
12/19 07:42:20 Phase 4.1:  Negotiating with schedds ...
12/19 07:42:20   Negotiating with sary357@xxxxxxxxxxxxxxxxxx at
<140.109.98.41:34992>
12/19 07:42:20     Request 04350.00000:
12/19 07:42:20       Matched 4350.0 sary357@xxxxxxxxxxxxxxxxxx
<140.109.98.41:34992> preempting none <140.109.98.40:42530>
12/19 07:42:20 Successfully matched with vm2@xxxxxxxxxxxxxxxxxxxxxxxxx
12/19 07:42:20     Got NO_MORE_JOBS;  done negotiating


==SchedLog==
12/19 07:42:30 Shadow pid 6088 for job 4350.0 exited with status 4
12/19 07:42:30 ERROR: Shadow exited with job exception code!
12/19 07:42:32 Starting add_shadow_birthdate(4350.0)
12/19 07:42:32 Started shadow for job 4350.0 on "<140.109.98.40:42530>",
(shadow pid = 6092)
12/19 07:42:32 Shadow pid 6092 for job 4350.0 exited with status 4
12/19 07:42:32 ERROR: Shadow exited with job exception code!
12/19 07:42:34 Starting add_shadow_birthdate(4350.0)
12/19 07:42:34 Started shadow for job 4350.0 on "<140.109.98.40:42530>",
(shadow pid = 6097)
12/19 07:42:34 Shadow pid 6097 for job 4350.0 exited with status 4
12/19 07:42:34 ERROR: Shadow exited with job exception code!
12/19 07:42:34 Match for cluster 4350 has had 5 shadow exceptions,
relinquishing.
12/19 07:42:34 Sent RELEASE_CLAIM to startd on <140.109.98.40:42530>
12/19 07:42:34 Match record (<140.109.98.40:42530>, 4350, 0) deleted
12/19 07:42:34 DaemonCore: Command received via TCP from host
<140.109.98.40:42555>
12/19 07:42:34 DaemonCore: received command 443 (VACATE_SERVICE), calling
handler (vacate_service)
12/19 07:42:34 Got VACATE_SERVICE from <140.109.98.40:42555>

==ShadowLog==
12/19 07:42:30 ******************************************************
12/19 07:42:30 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/19 07:42:30 ** /opt/osg/osg_0.2.0/condor/sbin/condor_shadow
12/19 07:42:30 ** $CondorVersion: 6.7.7 Apr 27 2005 $
12/19 07:42:30 ** $CondorPlatform: I386-LINUX_RH9 $
12/19 07:42:30 ** PID = 6088
12/19 07:42:30 ******************************************************
12/19 07:42:30 Using config file: /opt/osg/osg_0.2.0/condor/etc/condor_config
12/19 07:42:30 Using local config
files: /opt/osg/osg_0.2.0/condor/home/condor_config.local
12/19 07:42:30 DaemonCore: Command Socket at <140.109.98.41:35215>
12/19 07:42:30 Initializing a VANILLA shadow for job 4350.0
12/19 07:42:30 (4350.0) (6088): Request to run on <140.109.98.40:42530> was
ACCEPTED
12/19 07:42:30 (4350.0) (6088): ERROR "Error from starter on
vm2@xxxxxxxxxxxxxxxxxxxxxxxxx: Failed to open standard output
file '/home/sary357/.globus/job/osgc01.grid.sinica.edu.tw/5973.1134978120/stdout': No such file or directory (errno 2)" at line 597 in file pseudo_ops.C


And the following is my job description file.

Universe        = globus
globusscheduler = osgc01.grid.sinica.edu.tw/jobmanager-condor
Executable      = job4.sh
Output          = job4.out
Error           = job4.err
Log             = job4.log
Requirements    = (Name=="vm2@xxxxxxxxxxxxxxxxxxxxxxxxx")
should_transer_file =  IF_NEEDED
when_to_transfer_output = ON_EXIT
Queue

It sounds like FileSystemDomain is misconfigured in your Condor setup. Machines with the same FileSystemDomain are assumed to share a filesystem. Once that's fixed, jobs will only run across machines if you enable Condor's file transfer mechanism (should_transfer_files and when_to_transfer_output). Globus doesn't set these (it assumes all clusters have a shared filesystem), so you'll have to modify $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/condor.pm.
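For reference, the fix described above would look roughly like the following. This is a sketch, not drop-in configuration: FILESYSTEM_DOMAIN, FULL_HOSTNAME, should_transfer_files, and when_to_transfer_output are standard Condor knobs, but the exact values depend on your site, and the jobmanager side has to be applied by editing the submit description that condor.pm generates.

```
# condor_config.local, on each machine that does NOT share a filesystem:
# give each machine its own FileSystemDomain so Condor stops assuming
# files written on one host are visible on the other.
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# In the submit description file (note the spelling: should_transfer_files,
# plural -- a misspelled attribute is silently ignored):
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
```

These submit-file lines take effect for jobs Condor schedules directly; for jobs arriving through the Globus jobmanager, the equivalent lines have to come from condor.pm, which is why Jaime points you at that file.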

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+