Grid Engine 6.2 on Mac OS X

Installing Grid Engine 6.x on modern versions of Mac OS X (client and server)

As of early 2010, Grid Engine is effectively not installable out of the box on modern versions of Mac OS X. We have seen this recently with both server and client versions of OS X 10.5.* as well as the new Snow Leopard (10.6.*) releases.

This used to not be a big deal and the workarounds were trivial. Something seems to have changed, however, in recent updates to OS X that renders the Grid Engine binaries unable to function when they are:

  • Started during the installation process
  • Started by the SystemStarter() scripts that SGE tries to install on Apple systems
  • Manually started by user root via the command line

This is obviously a show-stopper now for new SGE users. We have no idea why this is the case but have witnessed the behavior on many 10.5 and 10.6 systems (many of them running OS X Server).

The problem with Grid Engine on modern versions of OS X can be simply stated:

SGE binaries will not launch or will be unreliable when started by ANY METHOD other than the system level OS X “launchd” service framework.

… and since neither the SGE installation scripts nor the “traditional” SystemStarter() scripts that SGE installs on Mac OS X systems use the launchd framework, this basically means that SGE 6.2 is unusable out of the box without manual intervention and custom starter scripts.
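
For readers who have not written a launchd job before, here is a rough sketch of what a job definition for sge_qmaster could look like. It is not the output of any BioTeam tool. The label net.example.sge.qmaster, the /opt/sge install path and the "default" cell name are placeholders, and the use of the SGE_ND variable to keep the daemon in the foreground (which launchd expects of its jobs) is an assumption you should verify against your own SGE build.

```shell
#!/bin/sh
# Sketch only: every default below is an assumption, not taken from the post.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"      # hypothetical install location
SGE_CELL="${SGE_CELL:-default}"       # hypothetical cell name
SGE_ARCH="${SGE_ARCH:-darwin-x86}"    # architecture string used on Intel Macs
LABEL="net.example.sge.qmaster"       # hypothetical launchd job label

# Print a launchd property list for sge_qmaster on stdout.
# SGE_ND=1 is assumed to stop the daemon from forking into the background.
make_qmaster_plist() {
cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>${LABEL}</string>
  <key>ProgramArguments</key>
  <array>
    <string>${SGE_ROOT}/bin/${SGE_ARCH}/sge_qmaster</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>SGE_ROOT</key>
    <string>${SGE_ROOT}</string>
    <key>SGE_CELL</key>
    <string>${SGE_CELL}</string>
    <key>SGE_ND</key>
    <string>1</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}

make_qmaster_plist
```

Redirecting the output into a .plist file under /Library/LaunchDaemons/ and loading it with launchctl is the general idea; the files generated by the scriptmaker utility described in the next section may well differ in detail.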

How to manually get SGE 6.2x working on Mac OS X systems

This process will be very simple for someone who already understands Grid Engine administration and is comfortable with SGE admin commands. Unfortunately it may be confusing to novices, because we have to interrupt the automatic SGE installation process and complete things by hand using SGE admin commands.

The method:

  1. Run the “install_qmaster” script normally. The fact that it will fail is not a big deal – before it fails it will construct the SGE_CELL directory, configure spooling and otherwise do all of the behind the scenes steps necessary to support a functional SGE qmaster process.
  2. When the install_qmaster script fails, exit out and manually kill any zombie sge_qmaster daemons that may be hanging around on the system.
  3. Download and unpack the sge-launchd-scriptmaker tarball from this BioTeam Blog post.
  4. Run the sge-launchd-scriptmaker utility. This simple Perl script will query your SGE environment and construct several .plist files suitable for copying into the /Library/LaunchDaemons/ OS X folder.
  5. Using OS X “launchctl” commands, restart sge_qmaster
  6. At this point the SGE qmaster is running and functional and we can manually complete a few additional configuration steps…
  7. Create and populate a hostgroup named “@allhosts”
  8. Create and configure the default cluster queue object (“all.q”)
  9. At this point the SGE install_execd script should work, but you can also skip that step and just launch sge_execd via the launchctl framework (a step that will be necessary even if the exechost installer script functions fine).
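
Steps 5 through 9 can be sketched as a short shell script. Treat it as an outline under stated assumptions rather than a tested installer: the two .plist file names and the node01/node02 host names are placeholders (use whatever the scriptmaker utility actually generated and your real hosts), and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
PLIST_DIR="/Library/LaunchDaemons"
QMASTER_PLIST="net.example.sge.qmaster.plist"   # hypothetical file name
EXECD_PLIST="net.example.sge.execd.plist"       # hypothetical file name
HOSTS="${HOSTS:-node01 node02}"                 # hypothetical host names

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 5: start sge_qmaster under launchd.
start_qmaster() {
  run launchctl load -w "$PLIST_DIR/$QMASTER_PLIST"
}

# Make each host known to the qmaster as an admin host and a submit host.
register_hosts() {
  for h in $HOSTS; do
    run qconf -ah "$h"
    run qconf -as "$h"
  done
}

# Step 7: print a hostgroup definition in the format qconf -Ahgrp reads.
make_hostgroup_file() {
  printf 'group_name @allhosts\nhostlist %s\n' "$HOSTS"
}

# Step 7 (continued): add the hostgroup from a temporary file.
add_hostgroup() {
  tmp="${TMPDIR:-/tmp}/allhosts.$$"
  make_hostgroup_file > "$tmp"
  run qconf -Ahgrp "$tmp"
  rm -f "$tmp"
}

# Step 9: start sge_execd under launchd (run this on each execution host).
start_execd() {
  run launchctl load -w "$PLIST_DIR/$EXECD_PLIST"
}

start_qmaster
register_hosts
add_hostgroup
run qconf -aq all.q    # Step 8: opens an editor on a queue template
start_execd
```

With DRY_RUN=1 the script only prints the commands, which is a cheap way to review them before running anything as root with DRY_RUN=0. Because "qconf -aq" opens an editor, step 8 stays interactive either way.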

The screencast below shows a recording of me stepping through this process on a small Apple Mac Mini running Mac OS X 10.5.8 Server (sorry, I thought it was a Snow Leopard box when I began the work …).

If you don’t want to watch the embedded video below, you can navigate directly to the screencast site and download the full movie file. The video is hosted here: http://www.screencast.com/t/NjMyNGJiNWM

Feedback welcome.

There Are 7 Responses So Far. »

  1. Hey Chris,

    Awesome guide. It was extremely helpful and made quick work of setting up the qmaster correctly and efficiently.

    Have you tried to start up an execution host on a separate machine? I can’t for the life of me get this to work. The dead end that I keep running into looks like this:

    remotehost:SGE_ROOT root# bin/darwin-x86/sge_execd
    daemonize error: timeout while waiting for daemonize state

    Everything else is configured correctly as far as I can tell. The secondary host is configured as an administrative host, a submit host, an execution host, it has been added to the @allhosts group.

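    qmaster:~ sgeadmin$ qstat -f
    queuename              qtype  resv/used/tot.  load_avg  arch        states
    ---------------------------------------------------------------------------
    all.q@qmaster.cs       BIP    0/0/16          0.09      darwin-x86
    ---------------------------------------------------------------------------
    all.q@remotehost.cs    BIP    0/0/16          -NA-      -NA-        au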

    Thanks again for all the help.
    - Trey Wessler

  2. Hi Trey, the error about being “unable to daemonize” is pretty much what I see when trying to run SGE binaries as the root user outside the control of the Launchd framework.

    To get your execution host going I’d first issue some commands on the qmaster to make sure that the compute node is known as a host (“qconf -ah” followed by the node’s hostname, “qconf -as” followed by the hostname, etc.).

    You can even use commands on the master to make your nodes parts of queues without having the exechosts running yet although this is not required. I’ve done many clusters in the past where we pre-staged all of the configuration settings for the compute nodes so that all we had to do was install the startup scripts on the nodes and it all worked.

    I think you need to skip the install script, get the sge_execd daemon running via launchd and just configure queue related stuff manually. You may have to make a spool directory for the compute nodes manually as that is one step that the install script usually does. Even that may not be necessary – it could all just get sorted out automatically.

    Once sge_execd is running under launchd it will accept commands and will be able to communicate. Getting things functional from there should be quite easy.

  3. Hey guys,

    Just a quick update and heads up.

    I finally got it working! The problem is that I was using AFP as my network sharing protocol. I don’t know the exact reason, but I decided to change my automount options to NFS on a hunch. Nothing else had to be modified. So, just as a tip:

    SGE does not work with AFP.

    My guess is that the sge_execd checks for nfs mounted directories. That, or, the service is waiting for NFS to start before it will start. I haven’t done much testing, but I came here to share before I forgot.

    Thanks for all of the help. You guys are great! Keep up the good work.

    -Trey

  4. Very nice screencast. I remember a problem where SGE would not cooperate with users who were installed via OpenDirectory. Does this problem still exist?

    Do the users have to be added as local users on each machine?

  5. Hi Eric,

    From what I recall, OpenDirectory was not at fault, it was actually the incredibly long group membership that users created in OpenDirectory. Once I truncated group membership down in OD I had no issues with OD users and grid engine.

    Try running the command “id” or “groups” followed by the username, and if you see a massive group membership list that may be the culprit. Reduce the size and number of groups that the user belongs to and you might be ok.

    -Chris

  6. Not clear what to download. I downloaded ge62u5_darwin-x86.tar. Untarring this gives: ge-6.2u5-bin-darwin-x86.tar.gz and ge-6.2u5-common.tar.gz. ge-6.2u5-common.tar.gz has install_qmaster, but does not have utilbin (unlike your example machine). When I run ./install_qmaster, I get:

    ./util/install_modules/inst_common.sh: line 69: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    ./util/install_modules/inst_common.sh: line 70: ./utilbin/darwin-unsupported/uidgid: No such file or directory
    Can’t find binaries for architecture: darwin-unsupported!
    Please check your binaries. Installation failed!
    Exiting installation.

    Maybe I downloaded the wrong files? Where does one find the right ones?

    I’m trying to do this on a Fall 2009 MacPro running 10.6.2

    Thanks for any suggestions!

    -Tom

  7. Tom – this is very interesting. The root cause seems to be that SGE can’t figure out your machine architecture. On the system you describe the output of the utilbin/arch command should be “darwin-x86”, which is why you have the darwin-x86 architecture specific binary tarball that came with the SGE distribution.

    You should be able to work around this by making a symbolic link named darwin-unsupported that points to darwin-x86. That will mask the problem and all of the tools and scripts looking for that “darwin-unsupported” path will then (hopefully) find binaries that work on your system.

    We really should tease apart the arch script and find the case statement that deals with darwin. Almost certainly it’s running some sort of ‘uname’ call and getting back some parsable response that it does not know about yet as a valid OS X system. This should be a quick fix and patch with the SGE team.

    -Chris
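
The symbolic-link workaround from the last response can be sketched as follows. This is a minimal illustration, not a vetted fix: the /opt/sge default for SGE_ROOT and the list of directories (bin, utilbin, lib) are assumptions, and only utilbin is confirmed by the error message quoted above. Directories that do not contain a darwin-x86 subdirectory are skipped.

```shell
#!/bin/sh
# Sketch only: the SGE_ROOT default and the directory list are assumptions.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
REAL_ARCH="darwin-x86"
FAKE_ARCH="darwin-unsupported"

# In each architecture-specific directory under $1, create a symlink named
# $FAKE_ARCH that points at $REAL_ARCH. Directories without $REAL_ARCH, or
# where $FAKE_ARCH already exists (even as a dangling link), are left alone.
link_arch_dirs() {
  root="$1"
  for sub in bin utilbin lib; do
    dir="$root/$sub"
    if [ -d "$dir/$REAL_ARCH" ] && [ ! -e "$dir/$FAKE_ARCH" ] && [ ! -L "$dir/$FAKE_ARCH" ]; then
      ln -s "$REAL_ARCH" "$dir/$FAKE_ARCH"
      echo "linked $dir/$FAKE_ARCH -> $REAL_ARCH"
    fi
  done
}

link_arch_dirs "$SGE_ROOT"
```

Relative link targets are used so that the links keep working if the whole SGE tree is shared or mounted under a different path on other hosts.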
