
Re: [Condor-users] Jobs remain idle long time!



Jaime,

The jobs submitted through Globus 4.0.0 (GRAM, globusrun-ws) usually remain
idle. As you asked, here is the ShadowLog:

8/24 14:09:30 (?.?) (6449):******* Standard Shadow starting up *******
8/24 14:09:30 (?.?) (6449):** $CondorVersion: 6.7.10 Aug  3 2005 $
8/24 14:09:30 (?.?) (6449):** $CondorPlatform: I386-LINUX_RH9 $
8/24 14:09:30 (?.?) (6449):*******************************************
8/24 14:09:30 (?.?) (6449):uid=0, euid=503, gid=0, egid=503
8/24 14:09:30 (?.?) (6449):Hostname = "<150.162.60.140:32771>", Job = 43.0
8/24 14:09:30 (43.0) (6449):Requesting Primary Starter
8/24 14:09:30 (43.0) (6449):Shadow: Request to run a job was ACCEPTED
8/24 14:09:30 (43.0) (6449):Shadow: RSC_SOCK connected, fd = 17
8/24 14:09:30 (43.0) (6449):Shadow: CLIENT_LOG connected, fd = 18
8/24 14:09:30 (43.0) (6449):My_Filesystem_Domain = "labweb02.inf.ufsc.br"
8/24 14:09:30 (43.0) (6449):My_UID_Domain = "labweb02.inf.ufsc.br"
8/24 14:09:30 (43.0) (6449):    Entering pseudo_get_file_stream
8/24 14:09:30 (43.0) (6449):    file =
"/usr/local/condor/local.labweb02//spool/cluster43.ickpt.subproc0"
8/24 14:09:30 (43.0) (6449):    150.162.60.140
8/24 14:09:30 (43.0) (6449):    150.162.60.140
8/24 14:09:30 (43.0) (6449):Reaped child status - pid 6451 exited with
status 0
8/24 14:09:31 (43.0) (6449):Read: User Job - $CondorPlatform: I386-LINUX_RH9
$
8/24 14:09:31 (43.0) (6449):Read: User Job - $CondorVersion: 6.7.7 Apr 27
2005 $
8/24 14:09:31 (43.0) (6449):Read: Checkpoint file name is
"/usr/local/condor/local.labweb02//spool/cluster43.proc0.subproc0"
8/24 14:09:31 (43.0) (6449):error: Warning: READWRITE: File
'/usr/local/condor/local.labweb02/tmp' used for both reading and writing. 
This i$

8/24 14:09:31 (43.0) (6449):Read: Warning: READWRITE: File
'/usr/local/condor/local.labweb02/tmp' used for both reading and writing. 
This is$
8/24 14:09:32 (43.0) (6449):Shadow: Job 43.0 exited, termsig = 0, coredump =
0, retcode = 0
8/24 14:09:32 (43.0) (6449):Shadow: Job exited normally with status 0
8/24 14:09:32 (43.0) (6449):user_time = 1 ticks
8/24 14:09:32 (43.0) (6449):sys_time = 17 ticks
8/24 14:09:32 (43.0) (6449):Static Policy: removing job because OnExitRemove
has become true
8/24 14:21:13 (44.0) (6698):Shadow: Request to run a job was ACCEPTED
8/24 14:21:13 (44.0) (6698):Shadow: RSC_SOCK connected, fd = 17
8/24 14:21:13 (44.0) (6698):Shadow: CLIENT_LOG connected, fd = 18
8/24 14:21:13 (44.0) (6698):My_Filesystem_Domain = "labweb02.inf.ufsc.br"
8/24 14:21:13 (44.0) (6698):My_UID_Domain = "labweb02.inf.ufsc.br"
8/24 14:21:13 (44.0) (6698):    Entering pseudo_get_file_stream
8/24 14:21:13 (44.0) (6698):    file =
"/usr/local/condor/local.labweb02//spool/cluster44.ickpt.subproc0"
8/24 14:21:13 (44.0) (6698):    150.162.60.140
8/24 14:21:13 (44.0) (6698):    150.162.60.140
8/24 14:21:14 (44.0) (6698):Reaped child status - pid 6700 exited with
status 0
8/24 14:21:14 (44.0) (6698):Read: User Job - $CondorPlatform: I386-LINUX_RH9
$
8/24 14:21:14 (44.0) (6698):Read: User Job - $CondorVersion: 6.7.7 Apr 27
2005 $
8/24 14:21:14 (44.0) (6698):Read: Checkpoint file name is
"/usr/local/condor/local.labweb02//spool/cluster44.proc0.subproc0"
8/24 14:21:14 (44.0) (6698):error: Warning: READWRITE: File
'/usr/local/condor/local.labweb02/tmp' used for both reading and writing. 
This i$

8/24 14:21:14 (44.0) (6698):Read: Warning: READWRITE: File
'/usr/local/condor/local.labweb02/tmp' used for both reading and writing. 
This is$
8/24 14:21:15 (44.0) (6698):Shadow: Job 44.0 exited, termsig = 0, coredump =
0, retcode = 0
8/24 14:21:15 (44.0) (6698):Shadow: Job exited normally with status 0
8/24 14:21:15 (44.0) (6698):user_time = 2 ticks
8/24 14:21:15 (44.0) (6698):sys_time = 16 ticks
8/24 14:21:15 (44.0) (6698):Static Policy: removing job because OnExitRemove
has become true
8/24 14:44:44 ******************************************************
8/24 14:44:44 Using config file: /etc/condor/condor_config
8/24 14:44:44 Using local config files:
/usr/local/condor/local.labweb02/condor_config.local
8/24 14:44:44 DaemonCore: Command Socket at <150.162.60.140:36012>
8/24 14:44:45 Initializing a VANILLA shadow for job 46.1
8/24 14:44:45 (46.1) (7192): Request to run on <150.162.60.140:32771> was
ACCEPTED
8/24 14:44:45 (46.1) (7192): Asked to write event of number 1.
8/24 14:44:45 (46.1) (7192): Job 46.1 terminated: exited with status 0
8/24 14:44:45 (46.1) (7192): Asked to write event of number 5.
8/24 14:44:45 (46.1) (7192): **** condor_shadow (condor_SHADOW) EXITING WITH
STATUS 100


StarterLog:

8/24 14:44:42 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
8/24 14:44:45 ******************************************************
8/24 14:44:45 ** condor_starter (CONDOR_STARTER) STARTING UP
8/24 14:44:45 ** /usr/local/condor/sbin/condor_starter
8/24 14:44:45 ** $CondorVersion: 6.7.10 Aug  3 2005 $
8/24 14:44:45 ** $CondorPlatform: I386-LINUX_RH9 $
8/24 14:44:45 ** PID = 7199
8/24 14:44:45 ******************************************************
8/24 14:44:45 Using config file: /etc/condor/condor_config
8/24 14:44:45 Using local config files:
/usr/local/condor/local.labweb02/condor_config.local
8/24 14:44:45 DaemonCore: Command Socket at <150.162.60.140:36014>
8/24 14:44:45 Done setting resource limits
8/24 14:44:45 Communicating with shadow <150.162.60.140:36012>
8/24 14:44:45 Submitting machine is "labweb02.inf.ufsc.br"
8/24 14:44:45 Starting a VANILLA universe job with ID: 46.1
8/24 14:44:45 IWD: /usr/local/globus-4.0.0/
8/24 14:44:45 Output file:
/home/vinicius/.globus/950bb950-14c4-11da-8cd3-e97f6a0eea0e/stdout001
8/24 14:44:45 Error file:
/home/vinicius/.globus/950bb950-14c4-11da-8cd3-e97f6a0eea0e/stderr001
8/24 14:44:45 Renice expr "1" evaluated to 1
8/24 14:44:45 About to exec /bin/date
8/24 14:44:45 Create_Process succeeded, pid=7202
8/24 14:44:45 Process exited, pid=7202, status=0
8/24 14:44:45 Got SIGQUIT.  Performing fast shutdown.
8/24 14:44:45 ShutdownFast all jobs.


ScheddLog:

8/24 14:44:18 Out of servers - 0 jobs matched, 3 jobs idle, 2 jobs rejected
8/24 14:44:31 Sent ad to central manager for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:31 Sent ad to 1 collectors for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:38 Activity on stashed negotiator socket
8/24 14:44:38 Negotiating for owner: vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:38 Checking consistency running and runnable jobs
8/24 14:44:38 Tables are consistent
8/24 14:44:38 Out of servers - 1 jobs matched, 2 jobs idle, 2 jobs rejected
8/24 14:44:42 Sent ad to central manager for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:42 Sent ad to 1 collectors for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:42 Starting add_shadow_birthdate(46.0)
8/24 14:44:42 Started shadow for job 46.0 on "<150.162.60.140:32771>",
(shadow pid = 7184)
8/24 14:44:42 Shadow pid 7184 for job 46.0 exited with status 100
8/24 14:44:44 Starting add_shadow_birthdate(46.1)
8/24 14:44:45 Started shadow for job 46.1 on "<150.162.60.140:32771>",
(shadow pid = 7192)
8/24 14:44:45 Shadow pid 7192 for job 46.1 exited with status 100
8/24 14:44:47 Sent ad to central manager for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:47 Sent ad to 1 collectors for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:47 Starting add_shadow_birthdate(47.0)
8/24 14:44:49 Started shadow for job 47.0 on "<150.162.60.140:32771>",
(shadow pid = 7208)
8/24 14:44:51 Shadow pid 7208 for job 47.0 exited with status 100
8/24 14:44:54 match (<150.162.60.140:32771>#1124891069#27) out of jobs
(cluster id 46); relinquishing
8/24 14:44:54 Sent RELEASE_CLAIM to startd on <150.162.60.140:32771>
8/24 14:44:54 Match record (<150.162.60.140:32771>, 46, -1) deleted
8/24 14:44:54 Sent ad to central manager for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:54 Sent ad to 1 collectors for vinicius@xxxxxxxxxxxxxxxxxxxx
8/24 14:44:54 DaemonCore: Command received via TCP from host
<150.162.60.140:36035>
8/24 14:44:54 DaemonCore: received command 443 (VACATE_SERVICE), calling
handler (vacate_service)
8/24 14:44:54 Got VACATE_SERVICE from <150.162.60.140:36035>
8/24 16:22:14 Tables are consistent
8/24 16:22:14 Out of servers - 0 jobs matched, 3 jobs idle, 2 jobs rejected
8/24 16:22:18 DaemonCore: Command received via TCP from host
<150.162.60.140:37507>
8/24 16:22:18 DaemonCore: received command 478 (ACT_ON_JOBS), calling
handler (actOnJobs)
8/24 16:22:18 Asked to write event of number 9.
8/24 16:22:18 Attempting to chown
'/usr/local/condor/local.labweb02//spool/cluster52.proc0.subproc0', but it
doesn't appear to exist.
8/24 16:22:18 Error: Unable to chown
'/usr/local/condor/local.labweb02//spool/cluster52.proc0.subproc0' from 503
to 503.503
8/24 16:22:18 (52.0) Failed to chown
/usr/local/condor/local.labweb02//spool/cluster52.proc0.subproc0 from 503 to
503.503.  User may run into$
8/24 16:22:25 DaemonCore: Command received via TCP from host
<150.162.60.140:37513>
8/24 16:22:25 DaemonCore: received command 478 (ACT_ON_JOBS), calling
handler (actOnJobs)
8/24 16:22:25 Asked to write event of number 9.
8/24 16:22:25 Attempting to chown
'/usr/local/condor/local.labweb02//spool/cluster53.proc0.subproc0', but it
doesn't appear to exist.
8/24 16:22:25 Error: Unable to chown
'/usr/local/condor/local.labweb02//spool/cluster53.proc0.subproc0' from 500
to 503.503
8/24 16:22:25 (53.0) Failed to chown
/usr/local/condor/local.labweb02//spool/cluster53.proc0.subproc0 from 500 to
503.503.  User may run into$
8/24 16:22:27 DaemonCore: Command received via TCP from host
<150.162.60.140:37515>
8/24 16:22:27 DaemonCore: received command 478 (ACT_ON_JOBS), calling
handler (actOnJobs)
8/24 16:22:27 Asked to write event of number 9.
8/24 16:22:27 Attempting to chown
'/usr/local/condor/local.labweb02//spool/cluster54.proc0.subproc0', but it
doesn't appear to exist.
8/24 16:22:27 Error: Unable to chown
'/usr/local/condor/local.labweb02//spool/cluster54.proc0.subproc0' from 500
to 503.503
8/24 16:22:27 (54.0) Failed to chown
/usr/local/condor/local.labweb02//spool/cluster54.proc0.subproc0 from 500 to
503.503.  User may run into$
8/24 16:22:34 Activity on stashed negotiator socket
8/24 16:22:34 Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxx
8/24 16:22:34 Checking consistency running and runnable jobs
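
In case it helps to summarize the ScheddLog above, here is a minimal Python
sketch that pulls out the shadow exit lines and counts the exit statuses. The
function name shadow_exits and the sample list are only illustrative, and the
regular expression is based solely on the line format shown in this log, so
other Condor versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern taken from the ScheddLog lines in this message, e.g.
# "8/24 14:44:42 Shadow pid 7184 for job 46.0 exited with status 100".
# Assumption: other Condor versions may format this line differently.
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def shadow_exits(lines):
    """Return (job_id, shadow_pid, exit_status) for each shadow exit line."""
    exits = []
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            pid, job_id, status = match.groups()
            exits.append((job_id, int(pid), int(status)))
    return exits


# A few lines copied from the ScheddLog above (hypothetical sample input;
# in practice the lines would be read from the ScheddLog file itself).
sample = [
    "8/24 14:44:42 Starting add_shadow_birthdate(46.0)",
    "8/24 14:44:42 Shadow pid 7184 for job 46.0 exited with status 100",
    "8/24 14:44:45 Shadow pid 7192 for job 46.1 exited with status 100",
    "8/24 14:44:51 Shadow pid 7208 for job 47.0 exited with status 100",
    "8/24 14:44:54 Sent RELEASE_CLAIM to startd on <150.162.60.140:32771>",
]

# Count how many shadows ended with each exit status.
status_counts = Counter(status for _, _, status in shadow_exits(sample))
print(status_counts)
```

This only tallies what the schedd reported; it does not explain why the
remaining jobs stay idle.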


Could the problem be the vanilla universe? Which universe must I use when I
submit through Globus? And what permissions should the directory
/usr/local/condor/local.labweb02/spool have?



Thanks,


Vinicius


--------- Original Message --------
From: Jaime Frey <jfrey@xxxxxxxxxxx>
To: Vinicius da Cunha Martins Borges <vinicius@xxxxxxxxx>, Condor-Users
Mail List <condor-users@xxxxxxxxxxx>
Subject: Re: [Condor-users] Jobs remain idle long time!
Date: 24/08/05 21:42

>
> On Aug 24, 2005, at 5:33 AM, Vinicius da Cunha Martins Borges wrote:
>
> > I am not get to run jobs in Condor. They remain idle almost every
> > time. I am
> > new user in Condor. I installed condor in one machine, this machine
> > is my
> > pool condor(manager, submit and execute). I submit some jobs and the
> > StartLog show this error:
> ....
> > Does anyone know why my jobs remain idle during long time? Is it
> > related
> > with this error in StartLog?
>
> Your StarterLog and ShadowLog should have more information. Can you
> post those?
>
> +----------------------------------+---------------------------------+
> |            Jaime Frey            |  Public Split on Whether        |
> |        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
> |  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
> +----------------------------------+---------------------------------+