Grid Engine 6.2 on Mac OS X

Installing Grid Engine 6.x on modern versions of Mac OS X (client and server)

As of early 2010, Grid Engine is effectively not installable out of the box on modern versions of Mac OS X. We have seen this recently with server and client versions of OS X 10.5.* as well as the new Snow Leopard (10.6.*) releases.

This used to be a minor issue with trivial workarounds. Something seems to have changed in recent updates to OS X, however, that renders the Grid Engine binaries unable to function when they are:

  • Started during the installation process
  • Started by the SystemStarter() scripts that SGE tries to install on Apple systems
  • Started manually by user root via the command line

This is obviously a show-stopper now for new SGE users. We have no idea why this is the case but have witnessed the behavior on many 10.5 and 10.6 systems (many of them running OS X Server).

The problem with Grid Engine on modern versions of OS X can be simply stated:

SGE binaries will not launch or will be unreliable when started by ANY METHOD other than the system level OS X “launchd” service framework.

… and since neither the SGE installation scripts nor the “traditional” SystemStarter() scripts that SGE installs on Mac OS X systems use the launchd framework, this basically means that SGE 6.2 is unusable out of the box without manual intervention and custom starter scripts.

How to manually get SGE 6.2x working on Mac OS X systems

This process will be very simple for someone who already understands Grid Engine administration and is comfortable with SGE admin commands. Unfortunately it may be confusing to novices, because we have to interrupt the automatic SGE installation process and complete things by hand using SGE admin commands.

The method:

  1. Run the “install_qmaster” script normally. The fact that it will fail is not a big deal: before it fails it will construct the SGE_CELL directory, configure spooling and otherwise do all of the behind-the-scenes steps necessary to support a functional SGE qmaster process.
  2. When the install_qmaster script fails, exit out and manually kill any zombie sge_qmaster daemons that may be hanging around on the system.
  3. Download and unpack the sge-launchd-scriptmaker tarball from this BioTeam Blog post.
  4. Run the sge-launchd-scriptmaker utility. This simple Perl script will query your SGE environment and construct several .plist files suitable for copying into the /Library/LaunchDaemons/ OS X folder.
  5. Using OS X “launchctl” commands, restart sge_qmaster.
  6. At this point the SGE qmaster is running and functional and we can manually complete a few additional configuration steps…
  7. Create and populate a hostgroup named “@allhosts”.
  8. Create and configure the default cluster queue object (“all.q”).
  9. At this point the SGE install_execd script should work, but you can also skip that step and just launch sge_execd via the launchctl framework (a step that will be necessary even if the exechost installer script functions fine).
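
Steps 2, 5, 7 and 8 above can be sketched as the following command sequence. This is a minimal sketch, not the exact session from the screencast: the plist file name net.sunsource.gridengine.sgeqmaster.plist is an assumption (modeled on the execd plist name that shows up in the comments), and DRY_RUN, run and finish_qmaster_setup are hypothetical helper names. With DRY_RUN=1 (the default) each command is only printed so it can be reviewed first.

```shell
#!/bin/sh
# Sketch of the manual qmaster bring-up (steps 2, 5, 7 and 8).
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical plist name; use whatever sge-launchd-scriptmaker generated.
QMASTER_PLIST="${QMASTER_PLIST:-/Library/LaunchDaemons/net.sunsource.gridengine.sgeqmaster.plist}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

finish_qmaster_setup() {
    # Step 2: clear out any leftover qmaster from the failed installer run.
    run killall sge_qmaster
    # Step 5: hand sge_qmaster over to launchd.
    run launchctl load -w "$QMASTER_PLIST"
    # Sanity check that the qmaster answers before configuring anything.
    run qconf -sconf
    # Step 7: create the @allhosts host group (opens an editor).
    run qconf -ahgrp @allhosts
    # Step 8: create the default cluster queue (opens an editor).
    run qconf -aq all.q
}
```

Calling finish_qmaster_setup prints the plan; running it as root with DRY_RUN=0 would execute it for real.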

The screencast below shows a recording of me stepping through this process on a small Apple Mac Mini running Mac OS X 10.5.8 Server (sorry, I thought it was a Snow Leopard box when I began the work …).

If you don’t want to watch the embedded video below, you can navigate directly to the screencast site and download the full movie file. The video is hosted here: http://www.screencast.com/t/NjMyNGJiNWM

Feedback welcome.


There Are 22 Responses So Far.

  1. Hey Chris,

    Awesome guide. It was extremely helpful and made quick work of setting up the qmaster correctly and efficiently.

    Have you tried to start up an execution host on a separate machine? I can’t for the life of me get this to work. The dead end that I keep running into looks like this:

    remotehost:SGE_ROOT root# bin/darwin-x86/sge_execd
    daemonize error: timeout while waiting for daemonize state

    Everything else is configured correctly as far as I can tell. The secondary host is configured as an administrative host, a submit host, an execution host, it has been added to the @allhosts group.

    qmaster:~ sgeadmin$ qstat -f
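    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    all.q@qmaster.cs               BIP   0/0/16         0.09     darwin-x86
    ---------------------------------------------------------------------------------
    all.q@remotehost.cs            BIP   0/0/16         -NA-     -NA-          au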

    Thanks again for all the help.
    - Trey Wessler

  2. Hi Trey, the error about being “unable to daemonize” is pretty much what I see when trying to run SGE binaries as the root user outside the control of the Launchd framework.

    To get your execution host going I’d first issue some commands on the qmaster to make sure that the compute node is known as an administrative host (“qconf -ah”), a submit host (“qconf -as”), etc.

    You can even use commands on the master to make your nodes parts of queues without having the exechosts running yet although this is not required. I’ve done many clusters in the past where we pre-staged all of the configuration settings for the compute nodes so that all we had to do was install the startup scripts on the nodes and it all worked.

    I think you need to skip the install script, get the sge_execd daemon running via launchd and just configure queue related stuff manually. You may have to make a spool directory for the compute nodes manually as that is one step that the install script usually does. Even that may not be necessary – it could all just get sorted out automatically.

    Once sge_execd is running under launchd it will accept commands and will be able to communicate. Getting things functional from there should be quite easy.
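
The pre-staging idea above can be sketched as a small helper that prints the qmaster-side commands for one compute node. This is only an illustration: prestage_commands is a hypothetical name, and the helper deliberately executes nothing, so the output can be reviewed and then run by an SGE admin.

```shell
#!/bin/sh
# Print the qmaster-side commands that pre-stage one compute node.
# Nothing is executed; review the output, then run it as an SGE admin.
prestage_commands() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: prestage_commands <node-hostname>" >&2
        return 1
    fi
    echo "qconf -ah $node"          # register as an administrative host
    echo "qconf -as $node"          # register as a submit host
    echo "qconf -shgrp @allhosts"   # show @allhosts to confirm membership
}
```

For example, prestage_commands remotehost.cs prints three lines, starting with "qconf -ah remotehost.cs".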

  3. Hey guys,

    Just a quick update and heads up.

    I finally got it working! The problem is that I was using AFP as my network sharing protocol. I don’t know the exact reason, but I decided to change my automount options to NFS on a hunch. Nothing else had to be modified. So, just as a tip:

    SGE does not work with AFP.

    My guess is that the sge_execd checks for nfs mounted directories. That, or, the service is waiting for NFS to start before it will start. I haven’t done much testing, but I came here to share before I forgot.

    Thanks for all of the help. You guys are great! Keep up the good work.

    -Trey
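
To check which protocol a shared SGE directory is actually mounted with, something like the following could help. It is a rough sketch that assumes the OS X style of `mount` output ("device on mountpoint (type, options)") and a mount point without spaces; fs_type_of is a hypothetical helper name. Whether AFP is really the cause is Trey's observation above, not something this snippet proves.

```shell
#!/bin/sh
# Report the filesystem type for a mount point by reading OS X style
# `mount` output on stdin: "<device> on <mountpoint> (<type>, <options>)".
# Assumes the mount point contains no spaces.
fs_type_of() {
    awk -v mp="$1" '$2 == "on" && $3 == mp {
        gsub(/[(),]/, "", $4)   # "(nfs," becomes "nfs"
        print $4
    }'
}
```

Typical use would be `mount | fs_type_of /common`, which should print a type such as nfs or afpfs for that mount point.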

  4. Very nice screencast. I remember a problem where SGE would not cooperate with users who were installed via OpenDirectory. Does this problem still exist?

    Do the users have to be added as local users on each machine?

  5. Hi Eric,

    From what I recall, OpenDirectory was not at fault, it was actually the incredibly long group membership that users created in OpenDirectory. Once I truncated group membership down in OD I had no issues with OD users and grid engine.

    Try running the command “id” or “groups” for the affected user, and if you see a massive group membership list, that may be the culprit. Reduce the size and number of groups that the user belongs to and you might be ok.

    -Chris
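
The group membership check can be made a little more concrete with a sketch like this one. The helper names are hypothetical, and the default limit of 16 is an illustrative threshold only, not a documented SGE limit.

```shell
#!/bin/sh
# Count how many groups a user belongs to, using `id -G`.
group_count() {
    id -G "$1" | wc -w | tr -d ' '
}

# Warn when the count exceeds a limit. The default of 16 is a
# hypothetical threshold for illustration, not a documented SGE limit.
check_groups() {
    user="$1"
    limit="${2:-16}"
    n="$(group_count "$user")"
    if [ "$n" -gt "$limit" ]; then
        echo "$user is in $n groups (more than $limit); consider trimming"
    else
        echo "$user is in $n groups"
    fi
}
```

Running check_groups with an OpenDirectory user name shows at a glance whether that account carries an unusually long group list.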

  6. Not clear what to download. I downloaded ge62u5_darwin-x86.tar. Untarring this gives: ge-6.2u5-bin-darwin-x86.tar.gz and ge-6.2u5-common.tar.gz. ge-6.2u5-common.tar.gz has install_qmaster, but does not have utilbin (unlike your example machine). When I run ./install_qmaster, I get:

    ./util/install_modules/inst_common.sh: line 69: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    ./util/install_modules/inst_common.sh: line 70: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    Can’t find binaries for architecture: darwin-unsupported!
    Please check your binaries. Installation failed!
    Exiting installation.

    Maybe I downloaded the wrong files? Where does one find the right ones?

    I’m trying to do this on a Fall 2009 MacPro running 10.6.2

    Thanks for any suggestions!

    -Tom

  7. Tom – this is very interesting. The root cause seems to be that SGE can’t figure out your machine architecture. On the system you describe, the output of the util/arch command should be “darwin-x86”, which is why you have the darwin-x86 architecture-specific binary tarball that came with the SGE distribution.

    You should be able to work around this by making a symbolic link named darwin-unsupported that points to darwin-x86. That will mask the problem and all of the tools and scripts looking for that “darwin-unsupported” path will then (hopefully) find binaries that work on your system.

    We really should tease apart the arch script and find the case statement that deals with darwin. Almost certainly it’s running some sort of ‘uname’ call and getting back some parsable response that it does not know about yet as a valid OS X system. This should be a quick fix and patch with the SGE team.

    -Chris
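
A sketch of the symbolic link workaround follows. It assumes the arch-specific trees live in bin, utilbin and lib under the SGE root (an assumption about the layout, so adjust the list to match your install), and link_unsupported_arch is a hypothetical helper name.

```shell
#!/bin/sh
# For each arch-specific directory under an SGE root, add a
# "darwin-unsupported" symlink that points at "darwin-x86".
link_unsupported_arch() {
    sge_root="$1"
    for sub in bin utilbin lib; do
        dir="$sge_root/$sub"
        # Skip trees that have no darwin-x86 directory.
        [ -d "$dir/darwin-x86" ] || continue
        # Skip trees where the link (or a real directory) already exists.
        if [ -e "$dir/darwin-unsupported" ] || [ -L "$dir/darwin-unsupported" ]; then
            continue
        fi
        ln -s darwin-x86 "$dir/darwin-unsupported"
        echo "linked $dir/darwin-unsupported -> darwin-x86"
    done
}
```

It would be called with the SGE root as its only argument, for example link_unsupported_arch /common, and is safe to run more than once.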

  8. i am almost complete, i just got stuck on the

    qconf -ahgrp @allhosts command
    rereseolve -> cannot resolve host name

    my /etc/hosts file is

    127.0.0.1 localhost
    255.255.255.255 broadcasthost
    ::1 localhost
    fe80::1%lo0 localhost

    but if i type /bin/hostname i get:
    ijorge.local

    im a little confused, i put common dir in “/common” and changed owner:
    chown ijorge /common
    but i was on a root session, so i had all mixed and i guess i messes it up.

    i have some questions:

    a) on my mac, the Admin User is “ijorge” but the root user is “root”. which user should i use to install? i see you used root because of the command prompt but the name was different from “root”, so i got confused.

    b) does the “common” directory have to be placed in “/common”, or does it depend on which user i used to install? im not clear here.

    thanks in advance!
    -Cristobal

  9. (1) You should install as the “root” user but the directory holding the SGE files can be “ijorge” or whomever you want to be the SGE Administrator.

    (2) SGE can be installed anywhere, my use of “/common” was just a personal preference. The main thing is that SGE should be installed on a shared NFS file system that is visible to all the computers in your cluster for the easiest method of operation. It is possible to install without a common filesystem underneath but it can be harder to set up and troubleshoot. If you are just installing SGE on a single machine then the location does not matter at all.

  10. Chris

    i reinstalled with your advices,

    im getting this error when doing “qconf -sconf”
    reresolve hostname failed: can’t resolve host name

    my host name from /bin/hostname is:
    ijorge.local

    and my file /etc/hosts is the one i mentioned on the last post

  11. SGE is sensitive to DNS and it looks like it is not set up for your system. One workaround would be to edit /etc/hosts on your system and make an entry for “ijorge.local” that uses the IP that the SGE qmaster is listening on. On most Apple OS X systems I will go out of my way to make sure that /etc/hosts is correct and fully populated in addition to having forward and reverse DNS set up.

    -Chris
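
One way to check what /etc/hosts actually says for a name is sketched below. The function names are hypothetical, and the loopback patterns cover only the addresses that appear in a stock OS X hosts file.

```shell
#!/bin/sh
# Print the address(es) that a hosts file maps to a given name,
# ignoring comments. Usage: hosts_addresses <name> [hosts-file]
hosts_addresses() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v name="$name" '
        { sub(/#.*/, "") }   # strip trailing comments
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "$file"
}

# Succeeds only if the name maps to at least one non-loopback address.
has_routable_entry() {
    hosts_addresses "$1" "$2" \
        | grep -v -e '^127\.' -e '^::1$' -e '^fe80::1%lo0$' \
        | grep -q .
}
```

For example, has_routable_entry ijorge.local returns success only when /etc/hosts maps that name to something other than a loopback address.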

  12. i tried adding the line

    127.0.0.1 ijorge.local

    but no luck,

    when you said that my DNS was probably not set up, do you mean that i have to go to System Preferences->Network->Ethernet1(my case)–> and set up DNS ip ??

    at the moment the values are
    DNS server: 192.168.1.254 (the same as router ip)
    DNS search domain: nothing

    and my ip is 192.168.1.31
    is this set up ok?

  13. You can’t use 127.0.0.1 for grid engine or any other program that talks on a network. The 127.0.0.1 address is a local loopback “special” IP address.

    Try putting:

    192.168.1.31 ijorge.local into your /etc/hosts file instead

  14. it still doesnt work :(

    im very thankful for your help already,
    i will try installing again everything and post back!
    -Cristobal

  15. Hello again Chris,

    i just cant make this work, i reinstalled all again.
    added to /etc/hosts

    192.168.1.31 ijorge.local

    and still qconf -sconf cannot resolve.
    however, if i run from the utilbin gethostbyname it does resolve it.

    sh-3.2# /common/utilbin/darwin-x86/gethostbyname ijorge.local
    Hostname: ijorge.local
    Aliases:
    Host Address(es): 192.168.1.31

    and reverse too
    sh-3.2# /common/utilbin/darwin-x86/gethostbyaddr 192.168.1.31
    Hostname: ijorge.local
    Aliases:
    Host Address(es): 192.168.1.31

    this must be a really small error hidden somewhere, i just cannot find it. has this ever happened to you??

  16. Chris

    it worked now, you wont believe how small was the problem.
    i had to add the line to /etc/hosts this way.

    192.168.1.31 ijorge.local ijorge

    and it worked everything till the end of your tutorial!!
    i have to say really thanks for your help, your tutorial (the best i’ve found on internet), and your will to help cluster-noobs like me.

    well now i have to do this, to the lab. because i was only testing at home with 1 iMac.

    on the lab we have 4 Mac Pros with Leopard 10.5.6, they are 64 bit archs if im not mistaken, with Intel Xeon, they were bought just on November 2009.

    my first attempt was using xgrid, it was easy to install but a headache to make it work with openMPI since the compatibility is somehow broken.

    my question is the following,

    i) where can i find the 64 bit version of the SGE installer for Mac XEON ??

    ii) if there wasnt any other chance than using this 32-bit version, will the C programs be able to use all 8GB of memory from each Mac pro??

    iii) and my last question since i was testing in my same machine i only added “myself” to the grid… is the process of adding more machines as simple as adding them in “@allhosts”. i mean do i have to do something on each machine apart from that?

    thanks in advance, your tutorial is exelent i hope the video never goes down.
    -Cristobal
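
The /etc/hosts format that finally worked in this comment (address, full name, then short alias) can be generated with a small helper. This is a sketch with a hypothetical function name; it only prints the line and does not edit /etc/hosts.

```shell
#!/bin/sh
# Build an /etc/hosts line that lists both the full name and the
# short alias, e.g. "192.168.1.31 ijorge.local ijorge".
hosts_line() {
    ip="$1"
    fqdn="$2"
    short="${fqdn%%.*}"      # text before the first dot
    if [ "$short" = "$fqdn" ]; then
        echo "$ip $fqdn"     # name has no dot, so there is no separate alias
    else
        echo "$ip $fqdn $short"
    fi
}
```

Calling hosts_line 192.168.1.31 ijorge.local prints the same line that is shown above, ready to be added to /etc/hosts by hand.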

  17. chris,

    i tried the ./install_qmaster script on a mac-pro with Leopard 10.5.6 and the daemon did start,
    however, after reboot it does not restart, so anyways i had to include your fixes.

    im facing a weird problem when configuring an execution host, its on the same network

    but this is the error of the script after the “port step”.

    Checking hostname resolving
    ---------------------------

    Cannot contact qmaster. The command failed:

    ./bin/darwin-x86/qconf -sh

    The error message was:

    ERROR: unable to send message to qmaster using port 6444 on host “mac-pro-3”: can’t resolve host name

    You can fix the problem now or abort the installation procedure.
    The problem can be:

    - the qmaster is not running
    - the qmaster host is down
    - an active firewall blocks your request

    Contact qmaster again (y/n) (‘n’ will abort) [y] >>

    —-

    im sure this is a problem of hostnames, because the exec-host tries to resolve “mac-pro-3”, but if i type hostname on the master node, i get “mac-pro-3.local”. i dont know how to “really” change the hostname, and get rid of that “local” suffix after each hostname.
    modifying the /etc/hosts file did not work, even the aliases didnt work. the “scutil” command did change the hostname to “mac-pro-3”, however when i tried to ping from another PC, it cannot resolve that new name and it still responds to pings as “mac-pro-3.local”.

    have you faced this problem before?

    thans in advance
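
One thing that may be worth checking here is the name SGE has recorded for the qmaster. Grid Engine keeps the active qmaster's host name in the act_qmaster file under the cell's common directory; the assumption in this sketch (not confirmed anywhere in this thread) is that the short name the exec host tries to resolve comes from that file. qmaster_name_matches is a hypothetical helper name.

```shell
#!/bin/sh
# Compare the qmaster name recorded in act_qmaster with the name a
# host reports for itself.
# Usage: qmaster_name_matches <act_qmaster-file> <hostname>
qmaster_name_matches() {
    recorded="$(head -n 1 "$1" | tr -d '[:space:]')"
    reported="$2"
    if [ "$recorded" = "$reported" ]; then
        echo "match: $recorded"
    else
        echo "mismatch: act_qmaster has '$recorded' but host reports '$reported'"
        return 1
    fi
}
```

On the qmaster it could be run as qmaster_name_matches "$SGE_ROOT/$SGE_CELL/common/act_qmaster" "$(hostname)". A mismatch such as mac-pro-3 versus mac-pro-3.local would suggest that every node needs to be able to resolve both forms of the name.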

  18. Did the bioteam ever figure out the issue with the Arch script on install as noted by Tom above? I’m getting the same response, “darwin-unsupported”, during install after the upgrade to Mac OS 10.6 and the install fails and exits.

    Specs: Darwin Kernel Version 10.3.0: Fri Feb 26 11:57:13 PST 2010; root:xnu-1504.3.12~1/RELEASE_X86_64 x86_64

    By the way, I noticed that when I run
    $ arch
    from within the util directory, “i386” is returned. When I run
    $ ./arch
    from within the util directory, “darwin-unsupported” is returned.

  19. nevermind–so far it’s working by changing unsupported in the arch script to “x86”

  20. I found a solution to the “darwin-unsupported” issue.
    Under 10.6 (desktop and server) and earlier releases running on an Intel chip, /usr/bin/arch returns “i386”. The SGE arch script uses “uname -m”.
    Running “uname -m” on 10.6 desktop returns “i386” while “uname -m” on 10.6 server returns “x86_64”.
    I added this to the SGE arch script inside the “Darwin” section.

    x86_64)
        darwin_machine=x86
        ;;

    This solved the darwin-unsupported problem.
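
The mapping described in this comment can be tried out on its own with a standalone sketch. This is not the actual SGE arch script: the function names are hypothetical, and only the two `uname -m` values reported in this thread are handled.

```shell
#!/bin/sh
# Map a `uname -m` value to the suffix used in SGE's "darwin-<suffix>"
# architecture string. Only the two values reported in this thread are
# handled; anything else falls through to "unsupported".
darwin_suffix() {
    case "$1" in
        i386)   echo x86 ;;
        x86_64) echo x86 ;;
        *)      echo unsupported ;;
    esac
}

sge_darwin_arch() {
    echo "darwin-$(darwin_suffix "$1")"
}
```

Running sge_darwin_arch "$(uname -m)" on a Mac shows which architecture string this mapping would produce for that machine.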

  21. We have had sge62u3 working fine in a 10.5 cluster using cron to startup the sge_execd process on clients. Trying a 10.6 client the sge_execd process hangs on the qping process. Next I have tried the launchctl files you generated but get a startup error -

    bash-3.2# launchctl start /Library//LaunchDaemons/net.sunsource.gridengine.sgeexecd.plist
    launchctl start error: No such process
    bash-3.2#

    The filenames in the plist are all defined -
    bash-3.2# ls -l /usr/local/sge/bin/darwin-x86/sge_execd
    -rwxr-xr-x 1 root wheel 1650040 Jun 4 2009 /usr/local/sge/bin/darwin-x86/sge_execd
    bash-3.2#

    Is there a problem with the syntax in the plist file ?
    cat net.sunsource.gridengine.sgeexecd.plist

    Label
    net.sunsource.gridengine.sgeexecd
    Program
    /usr/local/sge/bin/darwin-x86/sge_execd
    RunAtLoad

    EnvironmentVariables

    SGE_ROOT
    /usr/local/sge
    SGE_CELL

    SGE_ND
    1
    DYLD_LIBRARY_PATH
    /usr/local/sge/lib/darwin-x86

    StandardErrorPath
    /dev/null
    StandardOutPath
    /dev/null
    KeepAlive

    bash-3.2

  22. Barry — did you try “launchctl load” instead of “start” — that may be a requirement for initially telling launchctl about the new .plist files. I’ll have to check your syntax against one of ours and the blog comment may have messed with your formatting.
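
For comparison, here is a sketch of what a complete plist with the same keys might look like, written out by a small shell function. Several values are assumptions because the blog formatting stripped them from the comment above: RunAtLoad and KeepAlive are set to true, and SGE_CELL falls back to “default” (the standard SGE cell name). write_execd_plist is a hypothetical helper, not part of sge-launchd-scriptmaker.

```shell
#!/bin/sh
# Write a minimal launchd plist for sge_execd to the given path.
# Usage: write_execd_plist <output-file> <sge-root> [sge-cell]
# Assumed values: RunAtLoad/KeepAlive true, SGE_CELL "default".
write_execd_plist() {
    out="$1"
    sge_root="$2"
    sge_cell="${3:-default}"
    cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>net.sunsource.gridengine.sgeexecd</string>
    <key>Program</key>
    <string>$sge_root/bin/darwin-x86/sge_execd</string>
    <key>RunAtLoad</key>
    <true/>
    <key>EnvironmentVariables</key>
    <dict>
        <key>SGE_ROOT</key>
        <string>$sge_root</string>
        <key>SGE_CELL</key>
        <string>$sge_cell</string>
        <key>SGE_ND</key>
        <string>1</string>
        <key>DYLD_LIBRARY_PATH</key>
        <string>$sge_root/lib/darwin-x86</string>
    </dict>
    <key>StandardErrorPath</key>
    <string>/dev/null</string>
    <key>StandardOutPath</key>
    <string>/dev/null</string>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
EOF
}
```

After copying the file into /Library/LaunchDaemons/, note that “launchctl load” takes the path to the plist, while “launchctl start” takes the job label (net.sunsource.gridengine.sgeexecd here) rather than a file path.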
