
[HTCondor-users] Fwd: File permission errors running java across a small grid



Hi All

I've set up a small grid (a master and two slaves), and checked that everything appears to register properly.

I've run the basic java example here: http://research.cs.wisc.edu/htcondor/tutorials/intl-grid-school-3/submit_java.html. I submit using my normal login (not the condor or root user) on the master machine, and it works fine.

However, when I add more Queue statements, the additional jobs fail with permission errors. Here is the updated java.sub:

~/condor$ cat java.sub
Universe   = java
Executable = simple.class

Log        = simple$(Process).log
Output     = simple$(Process).out
Error      = simple$(Process).error

Arguments  = simple 4 10
Queue
Arguments  = simple 5 11
Queue
Arguments  = simple 10 20
Queue
Arguments  = simple 14 20
Queue

~/condor$ cat simple0.log
000 (007.000.000) 05/09 16:31:12 Job submitted from host: <10.8.0.10:53892>
...
001 (007.000.000) 05/09 16:31:26 Job executing on host: <10.8.0.14:45561>
...
006 (007.000.000) 05/09 16:31:31 Image size of job updated: 0
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (007.000.000) 05/09 16:31:36 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    56  -  Run Bytes Sent By Job
    1082  -  Run Bytes Received By Job
    56  -  Total Bytes Sent By Job
    1082  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :        9        2   4305764
       Memory (MB)          :        0        0      2003
...

That looks fine; however, the other jobs show:

~/condor$ cat simple1.log
000 (007.001.000) 05/09 16:31:12 Job submitted from host: <10.8.0.10:53892>
...
007 (007.001.000) 05/09 16:31:18 Shadow exception!
    Error from gridmaster: Failed to open '/home/me/condor/simple1.out' as standard output: Permission denied (errno 13)
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
...
012 (007.001.000) 05/09 16:31:18 Job was held.
    Error from gridmaster: Failed to open '/home/me/condor/simple1.out' as standard output: Permission denied (errno 13)
    Code 7 Subcode 13
...

What am I missing with the permissions here?

FWIW I've set ALLOW_* = * on the master node; permissions on the slaves are still at their defaults.
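
In case it helps with diagnosis, below is a minimal Python sketch (not part of HTCondor) for inspecting the output paths named in the hold reason. errno 13 is EACCES, so the check reports each file's owner, mode, and whether the user running the script can write to the file or to its parent directory. The helper name describe_write_access is hypothetical, and the result only reflects the account that runs the script, which may not be the account the shadow uses when it opens the file.

```python
import os
import stat


def describe_write_access(path):
    """Describe whether the current user could open `path` for writing.

    Hypothetical helper for illustration; it only reflects the user
    running this script, not any other account.
    """
    parent = os.path.dirname(os.path.abspath(path)) or "."
    info = {
        "path": path,
        "exists": os.path.exists(path),
        # Creating a new file needs write and search permission on the directory.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
    }
    if info["exists"]:
        st = os.stat(path)
        info["owner_uid"] = st.st_uid
        info["mode"] = stat.filemode(st.st_mode)
        info["writable"] = os.access(path, os.W_OK)
    else:
        # A missing file can only be created if the directory allows it.
        info["writable"] = info["parent_writable"]
    return info


if __name__ == "__main__":
    # Paths taken from the submit file above (simple$(Process).out, 4 jobs).
    for n in range(4):
        print(describe_write_access(f"/home/me/condor/simple{n}.out"))
```

Comparing the entry for simple0.out (which worked) with the entries for the failing jobs should show whether ownership or mode differs between them.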

thanks
Steve