
Re: [Condor-users] Problems in Condor-C



Hi Dan,
 
I am sorry to say that neither of the solutions you mentioned worked. Both ended with the following error on the execution node.
 
In the SchedLog log file:
06/12 02:16:56 (pid:10751) Starting add_shadow_birthdate(13.0)
06/12 02:16:56 (pid:10751) Started shadow for job 13.0 on slot1@xxxxxxxxxxxxxxxxxxxxx <202.38.140.91:38395> for ddg2@xxxxxxxx.cn, (shadow pid = 29697)
06/12 02:16:57 (pid:10751) Shadow pid 29697 for job 13.0 exited with status 112
06/12 02:16:57 (pid:10751) Putting job 13.0 on hold
06/12 02:16:57 (pid:10751) Checking consistency running and runnable jobs
06/12 02:16:57 (pid:10751) Tables are consistent
06/12 02:16:57 (pid:10751) Rebuilt prioritized runnable job list in 0.000s.  (Expedited rebuild because no match was found)
06/12 02:16:57 (pid:10751) match (slot1@xxxxxxxxxxxxxxxxxxxxx <202.38.140.91:38395> for ddg2@xxxxxxxxxxx) out of jobs; relinquishing
06/12 02:16:57 (pid:10751) Completed RELEASE_CLAIM to startd at <202.38.140.91:38395>
06/12 02:16:57 (pid:10751) Match record (slot1@xxxxxxxxxxxxxxxxxxxxx <202.38.140.91:38395> for ddg2@xxxxxxxxxxx, 13.-1) deleted
 
In the StarterLog.slot1 log file:
06/12 02:16:56 ******************************************************
06/12 02:16:56 ** condor_starter (CONDOR_STARTER) STARTING UP
06/12 02:16:56 ** /opt/condor-7.4.1/sbin/condor_starter
06/12 02:16:56 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
06/12 02:16:56 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
06/12 02:16:56 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
06/12 02:16:56 ** $CondorPlatform: I386-LINUX_RHEL3 $
06/12 02:16:56 ** PID = 29698
06/12 02:16:56 ** Log last touched 6/12 02:13:16
06/12 02:16:56 ******************************************************
06/12 02:16:56 Using config source: /opt/condor-7.4.1/etc/condor_config
06/12 02:16:56 Using local config sources: 
06/12 02:16:56    /opt/condor-7.4.1/local.euchina08/condor_config.local
06/12 02:16:56 DaemonCore: Command Socket at <202.38.140.91:36244>
06/12 02:16:56 Done setting resource limits
06/12 02:16:56 Communicating with shadow <202.38.140.91:36240>
06/12 02:16:56 Submitting machine is "euchina08.buaa.edu.cn"
06/12 02:16:56 setting the orig job name in starter
06/12 02:16:56 setting the orig job iwd in starter
06/12 02:16:56 File transfer completed successfully.
06/12 02:16:57 Job 13.0 set to execute immediately
06/12 02:16:57 Starting a VANILLA universe job with ID: 13.0
06/12 02:16:57 IWD: /opt/condor-7.4.1/local.euchina08/execute/dir_29698
06/12 02:16:57 Output file: /opt/condor-7.4.1/local.euchina08/execute/dir_29698/hello.out
06/12 02:16:57 Error file: /opt/condor-7.4.1/local.euchina08/execute/dir_29698/hello.err
06/12 02:16:57 About to exec /opt/condor-7.4.1/local.euchina08/execute/dir_29698/condor_exec.exe 
06/12 02:16:57 Create_Process(/opt/condor-7.4.1/local.euchina08/execute/dir_29698/condor_exec.exe): child failed with errno 8 (Exec format error) before exec()
06/12 02:16:57 ERROR "Create_Process(/opt/condor-7.4.1/local.euchina08/execute/dir_29698/condor_exec.exe,, ...) failed: Exec format error" at line 530 in file os_proc.cpp
06/12 02:16:57 ShutdownFast all jobs.
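
The starter reports errno 8 (ENOEXEC), which on Linux generally means the kernel did not recognise the file it was asked to exec: for example an empty file, a script whose first bytes are not "#!", or a binary built for another architecture. As a rough sketch for narrowing this down, the helper below classifies a file by its first bytes. The function name exec_format_hint is hypothetical and is not part of Condor; it only inspects whatever copy of the executable you point it at and cannot say why that copy looks the way it does.

```shell
# Hypothetical helper (not a Condor tool): classify a file by its leading bytes.
# Prints "script" for a "#!" file, "elf" for an ELF binary,
# "empty or missing" for a zero-length or absent file, otherwise "unknown".
exec_format_hint() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "empty or missing"
        return 1
    fi
    # Read the first four bytes as lowercase hex with spaces and newlines removed.
    magic=$(od -An -tx1 -N4 "$file" | tr -d ' \n')
    case $magic in
        2321*)    echo "script" ;;   # 0x23 0x21 is "#!"
        7f454c46) echo "elf" ;;      # 0x7f followed by "ELF"
        *)        echo "unknown"; return 1 ;;
    esac
}

# Example (assumed file name): exec_format_hint simple.sh
```

Running it against simple.sh on the submit host and, where it is still accessible, against the staged copy of the executable would show whether the two differ.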
 
In the ShadowLog log file:
06/12 02:16:56 ******************************************************
06/12 02:16:56 ** condor_shadow (CONDOR_SHADOW) STARTING UP
06/12 02:16:56 ** /opt/condor-7.4.1/sbin/condor_shadow
06/12 02:16:56 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
06/12 02:16:56 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
06/12 02:16:56 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
06/12 02:16:56 ** $CondorPlatform: I386-LINUX_RHEL3 $
06/12 02:16:56 ** PID = 29697
06/12 02:16:56 ** Log last touched 6/12 02:13:16
06/12 02:16:56 ******************************************************
06/12 02:16:56 Using config source: /opt/condor-7.4.1/etc/condor_config
06/12 02:16:56 Using local config sources: 
06/12 02:16:56    /opt/condor-7.4.1/local.euchina08/condor_config.local
06/12 02:16:56 DaemonCore: Command Socket at <202.38.140.91:36240>
06/12 02:16:56 Initializing a VANILLA shadow for job 13.0
06/12 02:16:56 (13.0) (29697): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxx <202.38.140.91:38395> was ACCEPTED
06/12 02:16:57 (13.0) (29697): Job 13.0 going into Hold state (code 6,8): Error from slot1@xxxxxxxxxxxxxxxxxxxxx: Failed to execute '/opt/condor-7.4.1/local.euchina08/execute/dir_29698/condor_exec.exe': Exec format error
06/12 02:16:57 (13.0) (29697): **** condor_shadow (condor_SHADOW) pid 29697 EXITING WITH STATUS 112
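
The "(code 6,8)" in the ShadowLog line corresponds to the job's HoldReasonCode and HoldReasonSubCode attributes. In the documented hold reason codes, 6 means the starter failed to start the executable, and the subcode carries the Unix errno, here 8 (ENOEXEC), which agrees with the StarterLog. As a small sketch, the same values can be read from the held job's ClassAd as printed by condor_q -long. The helper name hold_summary is hypothetical, and it assumes the usual "Attribute = value" layout of that output.

```shell
# Hypothetical helper: summarise the hold attributes from ClassAd text on stdin.
# Expects lines of the form "Attribute = value", as printed by condor_q -long.
hold_summary() {
    awk '
        {
            pos = index($0, " = ")          # split only at the first " = "
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 3)
        }
        key == "HoldReasonCode"    { code = val }
        key == "HoldReasonSubCode" { sub_code = val }
        key == "HoldReason"        { reason = val }
        END { printf "code=%s subcode=%s reason=%s\n", code, sub_code, reason }
    '
}

# Example, run on the host whose schedd holds the job:
#   condor_q -long 13.0 | hold_summary
```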
 
The submit files I am using now are the following two:
[ddg2@www simple_test]$ cat simple.submit
Universe=grid
Executable=simple.sh
Output=simple.out
Error=simple.err
Log=simple.log
grid_resource = condor euchina08.buaa.edu.cn euchina08.buaa.edu.cn
remote_universe = vanilla
+remote_NeverCreateJobSandbox = False
+remote_requirements = True
+remote_ShouldTransferFiles = YES
+remote_WhenToTransferOutput = ON_EXIT
Queue
 
[ddg2@www simple_test]$ cat simple_2.submit
Universe=grid
grid_resource = condor euchina08.buaa.edu.cn euchina08.buaa.edu.cn
Executable=simple.sh
Output=simple.out
Error=simple.err
Log=simple.log
remote_universe = vanilla
+remote_requirements = True
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
Queue
 
-Hailong
 
2010-01-05

***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************

From: Dan Bradley
Sent: 2010-01-05  06:43:07
To: Condor-Users Mail List
Cc:
Subject: Re: [Condor-users] Problems in Condor-C
Hailong,
I found that in Condor 7.4.1 there is a problem with the attribute
NeverCreateJobSandbox. This explains your issue.
In addition to the workaround I already mentioned, your original submit
file can be made to work by adding the following:
+remote_NeverCreateJobSandbox = false
--Dan
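
For the attribute to take effect, the line has to appear before the queue statement in the submit file. As a minimal sketch, the helper below prints a copy of a submit file with the workaround line inserted ahead of the first queue statement, and leaves the file unchanged if the attribute is already present. The function name add_sandbox_workaround is hypothetical; it only edits text and does not check whether the workaround resolves the problem on a given installation.

```shell
# Hypothetical helper: print the submit file with the workaround attribute
# inserted before the first "queue" line, unless it is already set.
add_sandbox_workaround() {
    submit_file=$1
    # Leave the file as-is if the attribute already appears (any letter case).
    if grep -qiF -e '+remote_NeverCreateJobSandbox' "$submit_file"; then
        cat "$submit_file"
        return 0
    fi
    awk -v line='+remote_NeverCreateJobSandbox = false' '
        !inserted && tolower($0) ~ /^[ \t]*queue/ { print line; inserted = 1 }
        { print }
    ' "$submit_file"
}

# Example (assumed file names):
#   add_sandbox_workaround simple.submit > simple_fixed.submit
#   condor_submit simple_fixed.submit
```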
Dan Bradley wrote:
> Hi Hailong,
>
> I have reproduced the problem you reported. I haven't fully understood
> it, but I did find that I could make things work if I submit the
> original job with file transfer turned on. In other words, change your
> submit file to this:
>
> universe = grid
> grid_resource = condor euchina08.buaa.edu.cn euchina08.buaa.edu.cn
> executable = simple.sh
> output = simple.out
> error = simple.err
> log = simple.log
> remote_universe = vanilla
> +remote_requirements = True
> ShouldTransferFiles = yes
> WhenToTransferOutput = ON_EXIT
>
> queue
>
> --Dan
>
> hailong.yang1115 wrote:
>   
>> Hi Alain,
>> There are the corresponding log files from the execute node in the
>> attachment.
>> -Hailong
>> 2010-01-02
>> *From:* Alain Roy
>> *Sent:* 2010-01-01 00:14:03
>> *To:* Condor-Users Mail List
>> *Cc:*
>> *Subject:* Re: [Condor-users] Problems in Condor-C
>> Hi Hailong,
>> Do you have the corresponding logs from the execute side? The StartLog
>> or StarterLog might have more detail on that error.
>> -alain
>> On Dec 31, 2009, at 9:53 AM, hailong.yang1115 wrote:
>>
>>> Hi everyone,
>>>
>>> Recently we configured two Condor pools to flock jobs using
>>> Condor-C. The problem is that when the jobs appear in the remote Condor
>>> pool, they stay idle indefinitely. There is an error in the ShadowLog file:
>>>
>>> 06/07 12:43:20 ******************************************************
>>> 06/07 12:43:20 ** condor_shadow (CONDOR_SHADOW) STARTING UP
>>> 06/07 12:43:20 ** /opt/condor-7.4.1/sbin/condor_shadow
>>> 06/07 12:43:20 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
>>> 06/07 12:43:20 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
>>> 06/07 12:43:20 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
>>> 06/07 12:43:20 ** $CondorPlatform: I386-LINUX_RHEL3 $
>>> 06/07 12:43:20 ** PID = 11152
>>> 06/07 12:43:20 ** Log last touched 6/7 12:43:20
>>> 06/07 12:43:20 ******************************************************
>>> 06/07 12:43:20 Using config source: /opt/condor-7.4.1/etc/condor_config
>>> 06/07 12:43:20 Using local config sources:
>>> 06/07 12:43:20 /opt/condor-7.4.1/local.euchina08/condor_config.local
>>> 06/07 12:43:20 DaemonCore: Command Socket at <202.38.140.91:38889>
>>> 06/07 12:43:20 Initializing a VANILLA shadow for job 5.0
>>> 06/07 12:43:20 (5.0) (11152): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxx <202.38.140.91:38395> was ACCEPTED
>>> 06/07 12:43:20 (5.0) (11152): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxx: FileTransfer: DownloadFiles called on server side" at line 655 in file pseudo_ops.cpp
>>>
>>> Here is the job description file:
>>> [ddg2@www simple_test]$ cat simple.submit
>>> universe = grid
>>> grid_resource = condor euchina08.buaa.edu.cn euchina08.buaa.edu.cn
>>> executable = simple.sh
>>> output = simple.out
>>> error = simple.err
>>> log = simple.log
>>> remote_universe = vanilla
>>> +remote_requirements = True
>>> +remote_ShouldTransferFiles = "YES"
>>> +remote_WhenToTransferOutput = "ON_EXIT"
>>> queue
>>>
>>> [ddg2@www simple_test]$ cat simple.sh
>>> #!/bin/sh
>>> echo "Start to sleep for 5 seconds"
>>> sleep 5
>>> echo "All done"
>>>
>>> Any clue?
>>>
>>> -Hailong
>>>
>>> 2009-12-31
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/