Re: [HTCondor-users] Setting up BOINC backfill



Thanks Tim, we will test that.

Michael


On Sep 23, 2016 6:20 PM, "Tim Theisen" <tim@xxxxxxxxxxx> wrote:
Hi Michael,

Your problem looks suspiciously like the problem described in

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5862

The stack trace is not quite the same. Would you be willing to try the 8.4.9 prerelease? The release is due out next week.

...Tim


On 09/23/2016 07:15 AM, Michael Hanke wrote:
Hi,

we haven't been able to set up BOINC backfill properly. Running the
BOINC client via

# condor_starter -f -append boinc -job-keyword boinc

(as root) works as expected. However, we are not running condor as root,
but as user 'condor'. This leads to an immediate failure:

# sudo -u condor condor_starter -f -append boinc -job-keyword boinc

The relevant part in the logs (ALL_DEBUG = D_FULLDEBUG) seems to be:

==> /var/log/condor/StarterLog.boinc <==
09/23/16 13:59:22 (pid:15948) About to exec /usr/bin/boinc --attach_project http://www.worldcommunitygrid.org <snip>
09/23/16 13:59:22 (pid:15948) Env = TMP=/var/lib/condor/execute/dir_15948 _CONDOR_JOB_IWD=/boinc _CONDOR_SLOT= _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_15948/.machine.ad TEMP=/var/lib/condor/execute/dir_15948 _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* TMPDIR=/var/lib/condor/execute/dir_15948 BATCH_SYSTEM=HTCondor _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_15948 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_15948/.job.ad _CONDOR_JOB_PIDS=
09/23/16 13:59:22 (pid:15948) ENFORCE_CPU_AFFINITY not true, not setting affinity
09/23/16 13:59:22 (pid:15948) Running job as user same uid as parent: personal condor
09/23/16 13:59:22 (pid:15950) Can not remount filesystems because this system does can not have/allow unshare(2)
09/23/16 13:59:22 (pid:15948) Warning: Create_Process: failed to read child process failure code
09/23/16 13:59:22 (pid:15948) Create_Process(/usr/bin/boinc): child failed with errno 38 (Function not implemented) before exec()
09/23/16 13:59:22 (pid:15948) Create_Process(/usr/bin/boinc,--attach_project http://www.worldcommunitygrid.org <snip> ...) failed: (errno=38: 'Function not implemented')
09/23/16 13:59:22 (pid:15948) Failed to start job, exiting
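
For reference when reading the log above, errno 38 on Linux is ENOSYS, the code returned when a system call is not implemented, which would be consistent with the unshare(2) message a few lines earlier. The following is a minimal Python sketch for translating a numeric errno into its symbolic name and message. The helper name `describe_errno` is made up for illustration and is not part of HTCondor.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return '<number> <symbolic name>: <message>' for an errno value."""
    # errno.errorcode maps numeric values to names such as 'ENOSYS'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message for the code.
    return f"{code} {name}: {os.strerror(code)}"


if __name__ == "__main__":
    # ENOSYS is the value reported as errno 38 in the StarterLog on Linux.
    print(describe_errno(errno.ENOSYS))
```

Numeric errno values differ between operating systems, so the sketch looks up `errno.ENOSYS` by name rather than hard-coding 38.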

Here are the relevant config settings

# condor_config_val -dump |grep BOINC
BACKFILL_SYSTEM = BOINC
BOINC_Arguments = --attach_project http://www.worldcommunitygrid.org <snip>
BOINC_Error = $(BOINC_HOME)/boinc.err
BOINC_Executable = /usr/bin/boinc
BOINC_HOME = /boinc
BOINC_InitialDir = $(BOINC_HOME)
BOINC_Output = $(BOINC_HOME)/boinc.out
BOINC_Owner = boinc
BOINC_Universe = vanilla

When running as user condor, this yields the following before the crash:

# ls -l /boinc/
total 0
-rw-r--r-- 1 condor condor 0 Sep 23 13:59 boinc.err
-rw-r--r-- 1 condor condor 0 Sep 23 13:59 boinc.out

(BOINC_Owner is ignored as mentioned in the docs).

When running as root, this yields:

# ls -l /boinc/
total 296
-rw-r--r-- 1 boinc boinc  2140 Sep 23 14:05 account_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc 58679 Sep 23 14:05 all_projects_list.xml
-rw-r--r-- 1 boinc boinc   179 Sep 23 14:05 boinc.err
-rw-r--r-- 1 boinc boinc  8979 Sep 23 14:06 boinc.out
-rw-r--r-- 1 boinc boinc 54980 Sep 23 14:06 client_state.xml
-rw-r--r-- 1 boinc boinc 55699 Sep 23 14:06 client_state_prev.xml
-rw-r--r-- 1 boinc boinc   112 Sep 23 14:05 daily_xfer_history.xml
-rw-r--r-- 1 boinc boinc  1415 Sep 23 14:05 global_prefs.xml
-rw------- 1 boinc boinc    32 Sep 23 14:05 gui_rpc_auth.cfg
-rw-r--r-- 1 boinc boinc     0 Sep 23 14:05 lockfile
-rw-r--r-- 1 boinc boinc 26636 Sep 23 14:05 master_www.worldcommunitygrid.org.xml
drwxrwx--x 2 boinc boinc  4096 Sep 23 14:05 notices
drwxrwx--x 3 boinc boinc  4096 Sep 23 14:05 projects
-rw-r--r-- 1 boinc boinc 33531 Sep 23 14:05 sched_reply_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc  5630 Sep 23 14:05 sched_request_www.worldcommunitygrid.org.xml
drwxrwx--x 3 boinc boinc  4096 Sep 23 14:06 slots
-rw-r--r-- 1 boinc boinc   419 Sep 23 14:05 statistics_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc   114 Sep 23 14:05 time_stats_log

To be maximally permissive, the directory permissions in both cases are:

# ls -l / |grep boinc
drwxrwxrwx 5 boinc boinc 4096 Sep 23 14:08 boinc


Enabling backfill with a running condor_master (as user condor) results in a
loop that repeatedly segfaults condor_startd:

09/23/16 11:17:05 slot1: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot1: Changing state and activity: Unclaimed/Benchmarking -> Backfill/Idle
09/23/16 11:17:05 slot2: State change: IS_OWNER is false
09/23/16 11:17:05 slot2: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot2: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot2: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot2: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot2: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot2: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot3: State change: IS_OWNER is false
09/23/16 11:17:05 slot3: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot3: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot3: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot3: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot3: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot3: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot4: State change: IS_OWNER is false
09/23/16 11:17:05 slot4: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot4: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot4: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot4: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot4: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot4: State change: START_BACKFILL is TRUE
Stack dump for process 15529 at timestamp 1474622230 (16 frames)
/usr/lib/condor/libcondor_utils_8_4_8.so(dprintf_dump_stack+0x72)[0x7f1e70218162]
/usr/lib/condor/libcondor_utils_8_4_8.so(+0x10d9b7)[0x7f1e701819b7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f1e6bc088d0]
condor_startd(_ZN7Starter13execDCStarterERK7ArgListPK3EnvPiP6Stream+0x315)[0x7f1e70a9e405]
condor_startd(_ZN7Starter16execBOINCStarterEv+0x80)[0x7f1e70a9f2a0]
condor_startd(_ZN7Starter5spawnElP6Stream+0x98)[0x7f1e70a9f768]
condor_startd(_ZN17BOINC_BackfillMgr11spawnClientEP8Resource+0x67)[0x7f1e70a826d7]
condor_startd(_ZN17BOINC_BackfillMgr5startEi+0x2c3)[0x7f1e70a82c13]
condor_startd(_ZN8ResState4evalEv+0x412)[0x7f1e70a84ed2]
condor_startd(_ZN6ResMgr4walkEM8ResourceFvvE+0xbc)[0x7f1e70a6532c]
condor_startd(_ZN6ResMgr8eval_allEv+0x39)[0x7f1e70a68729]
/usr/lib/condor/libcondor_utils_8_4_8.so(_ZN12TimerManager7TimeoutEPiPd+0x16b)[0x7f1e7031e0bb]
/usr/lib/condor/libcondor_utils_8_4_8.so(_ZN10DaemonCore6DriverEv+0x86b)[0x7f1e7030eadb]
/usr/lib/condor/libcondor_utils_8_4_8.so(_Z7dc_mainiPPc+0x11e0)[0x7f1e70321cb0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1e6b86fb45]
condor_startd(+0x25549)[0x7f1e70a63549]
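
When comparing a stack dump like the one above against the trace in a ticket, it can help to split each frame into module, mangled symbol, offset, and address. The following is a minimal Python sketch, assuming frames follow the `module(symbol+offset)[address]` layout seen in this dump. The names `FRAME_RE` and `parse_frame` are made up for illustration and are not part of HTCondor.

```python
import re

# Matches frames such as:
#   condor_startd(_ZN7Starter16execBOINCStarterEv+0x80)[0x7f1e70a9f2a0]
# The symbol part may be empty when only an offset is available.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line: str):
    """Return a dict with module, symbol, offset, address, or None if no match."""
    match = FRAME_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    frames = [
        "condor_startd(_ZN7Starter16execBOINCStarterEv+0x80)[0x7f1e70a9f2a0]",
        "/usr/lib/condor/libcondor_utils_8_4_8.so(+0x10d9b7)[0x7f1e701819b7]",
    ]
    for frame in frames:
        print(parse_frame(frame))
```

The extracted mangled names can then be passed to a demangler such as `c++filt` to make the call chain easier to read.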


Any hints on what we are doing wrong would be much appreciated!

Thanks,

Michael




--
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736
