
[HTCondor-users] Setting up BOINC backfill



Hi,

We haven't been able to get BOINC backfill working. Running the
BOINC client via

# condor_starter -f -append boinc -job-keyword boinc

(as root) works as expected. However, we are not running condor as root,
but as user 'condor'. This leads to an immediate failure:

# sudo -u condor condor_starter -f -append boinc -job-keyword boinc

The relevant part in the logs (ALL_DEBUG = D_FULLDEBUG) seems to be:

==> /var/log/condor/StarterLog.boinc <==
09/23/16 13:59:22 (pid:15948) About to exec /usr/bin/boinc --attach_project http://www.worldcommunitygrid.org <snip>
09/23/16 13:59:22 (pid:15948) Env = TMP=/var/lib/condor/execute/dir_15948 _CONDOR_JOB_IWD=/boinc _CONDOR_SLOT= _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_15948/.machine.ad TEMP=/var/lib/condor/execute/dir_15948 _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* TMPDIR=/var/lib/condor/execute/dir_15948 BATCH_SYSTEM=HTCondor _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_15948 _CONDOR_JOB_AD=/var/lib/condor/execute/dir_15948/.job.ad _CONDOR_JOB_PIDS=
09/23/16 13:59:22 (pid:15948) ENFORCE_CPU_AFFINITY not true, not setting affinity
09/23/16 13:59:22 (pid:15948) Running job as user same uid as parent: personal condor
09/23/16 13:59:22 (pid:15950) Can not remount filesystems because this system does can not have/allow unshare(2)
09/23/16 13:59:22 (pid:15948) Warning: Create_Process: failed to read child process failure code
09/23/16 13:59:22 (pid:15948) Create_Process(/usr/bin/boinc): child failed with errno 38 (Function not implemented) before exec()
09/23/16 13:59:22 (pid:15948) Create_Process(/usr/bin/boinc,--attach_project http://www.worldcommunitygrid.org <snip> ...) failed: (errno=38: 'Function not implemented')
09/23/16 13:59:22 (pid:15948) Failed to start job, exiting
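
To make the failure code easier to read, here is a minimal Python sketch that pulls the errno out of a Create_Process failure line and looks up its symbolic name. The helper name parse_errno and the regular expression are illustrative assumptions, not part of HTCondor; the sample line is copied from the log above.

```python
import errno
import os
import re

# Sample line copied verbatim from the StarterLog excerpt above.
LINE = ("09/23/16 13:59:22 (pid:15948) Create_Process(/usr/bin/boinc): "
        "child failed with errno 38 (Function not implemented) before exec()")


def parse_errno(line):
    """Return the errno reported in a Create_Process failure line, or None.

    Hypothetical helper: it only matches the wording seen in the log above.
    """
    match = re.search(r"child failed with errno (\d+)", line)
    return int(match.group(1)) if match else None


code = parse_errno(LINE)
# errno.errorcode maps numbers to names for the platform Python runs on,
# so the lookup should be done on the same kind of machine that wrote the log.
print(code, errno.errorcode.get(code, "unknown"), os.strerror(code))
```

On Linux, errno 38 is ENOSYS, which matches the "Function not implemented" text in the log and the unshare(2) message just before it.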

Here are the relevant config settings

# condor_config_val -dump |grep BOINC
BACKFILL_SYSTEM = BOINC
BOINC_Arguments = --attach_project http://www.worldcommunitygrid.org <snip>
BOINC_Error = $(BOINC_HOME)/boinc.err
BOINC_Executable = /usr/bin/boinc
BOINC_HOME = /boinc
BOINC_InitialDir = $(BOINC_HOME)
BOINC_Output = $(BOINC_HOME)/boinc.out
BOINC_Owner = boinc
BOINC_Universe = vanilla
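
To see which paths these settings resolve to, the $(BOINC_HOME) references can be expanded by hand. The following Python sketch does this for the dump above; parse_dump and expand are hypothetical helpers that perform plain textual substitution only and do not reproduce HTCondor's full macro handling (for example, they treat names as case-sensitive). BOINC_Arguments is left out because its value is snipped above.

```python
import re

# Settings copied from the condor_config_val output above
# (BOINC_Arguments omitted because its value is snipped).
DUMP = """\
BACKFILL_SYSTEM = BOINC
BOINC_Error = $(BOINC_HOME)/boinc.err
BOINC_Executable = /usr/bin/boinc
BOINC_HOME = /boinc
BOINC_InitialDir = $(BOINC_HOME)
BOINC_Output = $(BOINC_HOME)/boinc.out
BOINC_Owner = boinc
BOINC_Universe = vanilla
"""

MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def parse_dump(text):
    """Split 'NAME = value' lines into a dict; other lines are skipped."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def expand(value, settings, max_depth=10):
    """Replace $(NAME) references until nothing changes (bounded by max_depth)."""
    for _ in range(max_depth):
        expanded = MACRO.sub(
            lambda m: settings.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


settings = parse_dump(DUMP)
for name in ("BOINC_InitialDir", "BOINC_Output", "BOINC_Error"):
    print(name, "=", expand(settings[name], settings))
```

With the values above, the initial directory and both output files all resolve to locations under /boinc.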

When running as user condor, this yields the following before the crash:

# ls -l /boinc/
total 0
-rw-r--r-- 1 condor condor 0 Sep 23 13:59 boinc.err
-rw-r--r-- 1 condor condor 0 Sep 23 13:59 boinc.out

(BOINC_Owner is ignored when not running as root, as mentioned in the docs.)

When running as root, this yields:

# ls -l /boinc/
total 296
-rw-r--r-- 1 boinc boinc  2140 Sep 23 14:05 account_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc 58679 Sep 23 14:05 all_projects_list.xml
-rw-r--r-- 1 boinc boinc   179 Sep 23 14:05 boinc.err
-rw-r--r-- 1 boinc boinc  8979 Sep 23 14:06 boinc.out
-rw-r--r-- 1 boinc boinc 54980 Sep 23 14:06 client_state.xml
-rw-r--r-- 1 boinc boinc 55699 Sep 23 14:06 client_state_prev.xml
-rw-r--r-- 1 boinc boinc   112 Sep 23 14:05 daily_xfer_history.xml
-rw-r--r-- 1 boinc boinc  1415 Sep 23 14:05 global_prefs.xml
-rw------- 1 boinc boinc    32 Sep 23 14:05 gui_rpc_auth.cfg
-rw-r--r-- 1 boinc boinc     0 Sep 23 14:05 lockfile
-rw-r--r-- 1 boinc boinc 26636 Sep 23 14:05 master_www.worldcommunitygrid.org.xml
drwxrwx--x 2 boinc boinc  4096 Sep 23 14:05 notices
drwxrwx--x 3 boinc boinc  4096 Sep 23 14:05 projects
-rw-r--r-- 1 boinc boinc 33531 Sep 23 14:05 sched_reply_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc  5630 Sep 23 14:05 sched_request_www.worldcommunitygrid.org.xml
drwxrwx--x 3 boinc boinc  4096 Sep 23 14:06 slots
-rw-r--r-- 1 boinc boinc   419 Sep 23 14:05 statistics_www.worldcommunitygrid.org.xml
-rw-r--r-- 1 boinc boinc   114 Sep 23 14:05 time_stats_log

To be maximally permissive, the directory is world-writable in both cases:

# ls -l / |grep boinc
drwxrwxrwx   5 boinc boinc  4096 Sep 23 14:08 boinc


Enabling backfill with a running condor_master (as user condor) results in a
loop that repeatedly segfaults condor_startd:

09/23/16 11:17:05 slot1: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot1: Changing state and activity: Unclaimed/Benchmarking -> Backfill/Idle
09/23/16 11:17:05 slot2: State change: IS_OWNER is false
09/23/16 11:17:05 slot2: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot2: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot2: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot2: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot2: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot2: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot3: State change: IS_OWNER is false
09/23/16 11:17:05 slot3: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot3: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot3: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot3: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot3: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot3: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot4: State change: IS_OWNER is false
09/23/16 11:17:05 slot4: Changing state: Owner -> Unclaimed
09/23/16 11:17:05 State change: RunBenchmarks is TRUE
09/23/16 11:17:05 slot4: Changing activity: Idle -> Benchmarking
09/23/16 11:17:05 slot4: Changing activity: Benchmarking -> Idle
09/23/16 11:17:05 slot4: State change: START_BACKFILL is TRUE
09/23/16 11:17:05 slot4: Changing state: Unclaimed -> Backfill
09/23/16 11:17:05 slot4: State change: START_BACKFILL is TRUE
Stack dump for process 15529 at timestamp 1474622230 (16 frames)
/usr/lib/condor/libcondor_utils_8_4_8.so(dprintf_dump_stack+0x72)[0x7f1e70218162]
/usr/lib/condor/libcondor_utils_8_4_8.so(+0x10d9b7)[0x7f1e701819b7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f1e6bc088d0]
condor_startd(_ZN7Starter13execDCStarterERK7ArgListPK3EnvPiP6Stream+0x315)[0x7f1e70a9e405]
condor_startd(_ZN7Starter16execBOINCStarterEv+0x80)[0x7f1e70a9f2a0]
condor_startd(_ZN7Starter5spawnElP6Stream+0x98)[0x7f1e70a9f768]
condor_startd(_ZN17BOINC_BackfillMgr11spawnClientEP8Resource+0x67)[0x7f1e70a826d7]
condor_startd(_ZN17BOINC_BackfillMgr5startEi+0x2c3)[0x7f1e70a82c13]
condor_startd(_ZN8ResState4evalEv+0x412)[0x7f1e70a84ed2]
condor_startd(_ZN6ResMgr4walkEM8ResourceFvvE+0xbc)[0x7f1e70a6532c]
condor_startd(_ZN6ResMgr8eval_allEv+0x39)[0x7f1e70a68729]
/usr/lib/condor/libcondor_utils_8_4_8.so(_ZN12TimerManager7TimeoutEPiPd+0x16b)[0x7f1e7031e0bb]
/usr/lib/condor/libcondor_utils_8_4_8.so(_ZN10DaemonCore6DriverEv+0x86b)[0x7f1e7030eadb]
/usr/lib/condor/libcondor_utils_8_4_8.so(_Z7dc_mainiPPc+0x11e0)[0x7f1e70321cb0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1e6b86fb45]
condor_startd(+0x25549)[0x7f1e70a63549]


Any hints on what we are doing wrong would be much appreciated!

Thanks,

Michael



-- 
Michael Hanke
GPG: 4096R/C073D2287FFB9E9B
http://psychoinformatics.de