Open OnDemand

From Leo's Notes
Last edited on 7 February 2022, at 02:39.

Open OnDemand is an open source project by the Ohio Supercomputer Center that provides a web portal for HPC users. It is designed as a platform to which system administrators can add modules or 'apps'. Users can use it to launch interactive jobs or VNC/SSH sessions, view their job statuses, and interact with their files. It supports a variety of authentication mechanisms, including federated authentication (OpenID, CAS, Shibboleth) and the system's underlying PAM stack (LDAP, password file, etc.).

Interactive Apps

App Development

On Open OnDemand 1.8, development mode is enabled per user: create the directory /var/www/ood/apps/dev/$username, then symlink /var/www/ood/apps/dev/$username/gateway to /home/$username/ondemand/dev. Restart the PUN web server and a 'Develop' section should appear in the navbar. This only takes effect for $username.
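
The two filesystem steps above can be sketched as a small shell function. This is a minimal sketch, assuming the sandbox lives at ~/ondemand/dev as used elsewhere on this page; the function name and the APPS_DEV_ROOT and HOME_ROOT overrides are made up for illustration so that it can be tried against a scratch directory first.

```shell
# Sketch: enable development mode for one user (OnDemand 1.8 layout).
# APPS_DEV_ROOT and HOME_ROOT are hypothetical overrides; the defaults match
# the paths described in the text.
enable_dev_mode() {
    local username="$1"
    local apps_dev_root="${APPS_DEV_ROOT:-/var/www/ood/apps/dev}"
    local home_root="${HOME_ROOT:-/home}"

    [ -n "$username" ] || { echo "usage: enable_dev_mode USERNAME" >&2; return 1; }

    # Per-user directory under the dev apps root.
    mkdir -p "$apps_dev_root/$username" || return 1

    # 'gateway' points at the app sandbox in the user's home directory.
    # -n replaces an existing gateway symlink instead of descending into it.
    ln -sfn "$home_root/$username/ondemand/dev" "$apps_dev_root/$username/gateway"
}
```

On a real server this would be run as root, for example enable_dev_mode alice, followed by restarting the PUN web server.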

See: https://osc.github.io/ood-documentation/latest/app-development/enabling-development-mode.html#enable-in-ondemand-v1-6

It may help to look at other apps that are available at:

Deploying an app

Once you have finished developing an app, deploy it by moving it to /var/www/ood/apps/sys.

Jupyter Notebooks

Running a Jupyter Notebook interactive app requires tweaking the example OSC Jupyter Interactive App (https://github.com/OSC/bc_example_jupyter) to work in your environment. Before you get started, you should be able to submit jobs via Open OnDemand. If this isn't working, fix your configuration before proceeding with interactive apps.

The easiest way to get this working is to enable App Development on an account and then test from that account.

  1. Enable App Development on your account
  2. Clone the OSC Jupyter app to ~/ondemand/dev/bc_example_jupyter
  3. Tweak the form.yml (or form.yml.erb). Add or remove fields as required.
  4. Tweak submit.yml.erb to work in your environment.
  5. Try launching a Jupyter notebook and fix any issues as you go.
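
Step 2 can be sketched as follows. The helper name dev_app_dir is made up for illustration; it only works out the destination directory, and the clone itself is left as a comment because it needs network access.

```shell
# Sketch: work out where a cloned app should live in the dev sandbox.
# The second argument (sandbox root) is optional and defaults to ~/ondemand/dev.
dev_app_dir() {
    local repo_url="$1"
    local dev_root="${2:-$HOME/ondemand/dev}"
    local name
    # Use the last path component of the URL, without a trailing .git.
    name=$(basename "$repo_url" .git)
    printf '%s/%s\n' "$dev_root" "$name"
}

# Step 2 from the list above:
# repo=https://github.com/OSC/bc_example_jupyter
# mkdir -p "$HOME/ondemand/dev" && git clone "$repo" "$(dev_app_dir "$repo")"
```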

Remote Desktop

Setup

Clone one of the existing bc_desktop apps and edit form.yml. Set the cluster value to a cluster you have configured in OnDemand (hint: one of the files in the clusters.d directory). Add or remove the CPU, memory, and partition fields as desired. The default bc_desktop app includes startup scripts for various desktop environments, but I hardcoded mine to always use XFCE. On the compute nodes that will run the VNC sessions, install the desktop environment (XFCE in my case), TurboVNC, and websockify. On a CentOS 8 machine, I did this manually with:

yum install -y nc wget python3-devel python3-pip

# Websockify
pip3 install websockify

# TurboVNC
wget https://turbovnc.org/pmwiki/uploads/Downloads/TurboVNC.repo
mv TurboVNC.repo /etc/yum.repos.d/
yum install -y turbovnc

# Desktop environment (XFCE)
yum -y install epel-release
yum -y group install "Xfce" "base-x"
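
Before testing through OnDemand, it can help to confirm on a compute node that the tools installed above are actually found. The following is a minimal sketch; the function names are made up, and the default search path simply prepends /opt/TurboVNC/bin and /usr/local/bin (the locations used in the clusters.d example on this page) to the current PATH.

```shell
# Sketch: list which of the tools installed above cannot be found on a node.

# Print the first executable called $1 in the colon-separated directory list $2.
find_in_path() {
    local name="$1" dirs="$2" dir
    local IFS=:
    for dir in $dirs; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Print the missing tools on one line; succeed only if nothing is missing.
missing_vnc_prereqs() {
    local search_path="${1:-/opt/TurboVNC/bin:/usr/local/bin:$PATH}"
    local tool missing=""
    for tool in vncserver websockify nc xfce4-session; do
        find_in_path "$tool" "$search_path" >/dev/null || missing="$missing $tool"
    done
    # Drop the leading space before printing.
    printf '%s\n' "${missing# }"
    [ -z "$missing" ]
}
```

Running missing_vnc_prereqs with no arguments on a compute node prints an empty line when all four tools are found.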

If everything works correctly, requesting a desktop submits a job through the scheduler, which launches Xvnc and proxies it via websockify to OnDemand, which then renders it with noVNC.

VirtualGL

The setup described above renders everything with software rendering (Mesa), which you can confirm with glxinfo. On nodes that have dedicated GPUs, it may be desirable for the desktop to use hardware acceleration. This can be done with VirtualGL, which renders the user's application on a local X11 server using hardware acceleration and then copies the rendered output to the user's virtual VNC display.
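
The glxinfo check can be sketched like this. The function name is made up, and the classification is a simplification: it treats Mesa's software rasterizers (llvmpipe, softpipe, swrast) as software and everything else as hardware.

```shell
# Sketch: classify an "OpenGL renderer string" line taken from glxinfo output.
renderer_kind() {
    case "$1" in
        *llvmpipe*|*softpipe*|*swrast*|*"Software Rasterizer"*) echo software ;;
        *) echo hardware ;;
    esac
}

# Inside a desktop session (needs glxinfo to be installed):
# renderer_kind "$(glxinfo | grep 'OpenGL renderer string')"
```

With the plain VNC setup this should report software; when the same command is started through vglrun on a GPU node it should report hardware.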

I currently have nodes set up so that users may request 0 or 1 GPUs. If no GPUs are requested, the desktop is rendered using a GPU that is shared with other users on the node. If 1 GPU is requested, the desktop is rendered on a dedicated GPU for that user.

To accomplish this, I had to:

  1. On Slurm, set up cgroups and a gres for GPUs.
  2. Make all but the first GPU a gres. This lets all users see the first GPU plus any other GPUs requested by the job.
  3. Start X11 on the compute nodes on each GPU. Each Xorg server should target a specific GPU (via specific Screen/Device sections in xorg.conf).
  4. Change the script that starts the desktop environment (like XFCE) so that it loads the xauth cookies for the X11 server corresponding to the assigned GPU and runs the desktop environment with vglrun, with the appropriate VGL_DISPLAY environment variable set.
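
The display selection in step 4 could look roughly like the sketch below. It rests on an assumption that is not part of the setup described above: that the Xorg server for GPU N was started as display :N, with the shared first GPU on :0. The function name is made up, and the commented usage relies on SLURM_JOB_GPUS, which Slurm sets for jobs that were allocated GPUs.

```shell
# Sketch: choose the X display that VirtualGL should render on.
# Assumption: the Xorg server for GPU N runs as display :N, shared GPU on :0.
vgl_display_for_gpu() {
    local gpu_index="$1"
    case "$gpu_index" in
        ''|*[!0-9]*) echo ":0" ;;            # no usable GPU assignment: shared GPU
        *)           echo ":$gpu_index" ;;   # dedicated GPU requested by the job
    esac
}

# Hypothetical use in the desktop startup script, after loading the xauth
# cookie for the chosen X server:
# export VGL_DISPLAY="$(vgl_display_for_gpu "${SLURM_JOB_GPUS%%,*}")"
# exec vglrun xfce4-session
```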


Troubleshooting

Here are some issues I ran into. In hindsight, some of them are trivial, but hopefully this helps someone.

No cluster set

Out of the box, trying to make a desktop results in:

The cluster was never set. Either set it in form.yml.erb with `cluster` or `form.cluster` or set `cluster` in submit.yml.erb.
  • The Desktop session data for this session can be accessed under the staged root directory.

You need to tweak the form to work in your environment: either define the cluster value in /var/www/ood/apps/sys/bc_desktop/form.yml or launch your own interactive app.
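
A quick way to check whether a given form.yml sets the value is sketched below. The function name is made up, and this is a plain text match rather than a YAML parser, so it will not see a cluster that is produced by ERB or set in submit.yml.erb.

```shell
# Sketch: succeed if the file has a non-empty top-level `cluster:` key.
form_has_cluster() {
    grep -Eq '^cluster:[[:space:]]*[^[:space:]#]' "$1"
}

# Example:
# form_has_cluster /var/www/ood/apps/sys/bc_desktop/form.yml || echo "cluster is not set"
```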

bc_vnc_resolution is not showing

If your bc_vnc_resolution form field isn't showing, the most likely cause is that ENABLE_NATIVE_VNC is set, which lets the VNC client resize the desktop.

VNC: Failed to connect to server

Launching the VNC session opens noVNC. However, I got "Failed to connect to server". The Apache error log shows:

==> httpd/error.log <==                                                                                                                                                       
[Fri Jan 22 03:12:49.324498 2021] [proxy:error] [pid 5386:tid 140231151974144] (111)Connection refused: AH00957: WS: attempt to connect to 172.28.0.5:32316 (*) failed        
[Fri Jan 22 03:12:49.324540 2021] [proxy_wstunnel:error] [pid 5386:tid 140231151974144] [client 136.159.79.128:54416] AH02452: failed to make connection to backend: c2

This looked odd at first because c2 was listening on port 5901, not 32316. Port 32316 is the websockify port that OnDemand proxies to, and the job output clearly showed that the websockify binary was not found, so nothing was listening there. Fixing this fixed the VNC connection issue.

Solution: Ensure websockify is found. If it's installed in a nonstandard location, tweak your clusters.d yaml file so that the batch_connect/vnc has a script_wrapper that exports the websockify path as WEBSOCKIFY_CMD.

batch_connect:
    basic:
      script_wrapper: |
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      # websockify by pip3 is in /usr/local/bin
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/usr/local/bin/websockify"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

No Home Environment Error

Setting VNC password...
Error: no HOME environment variable
Starting VNC server...
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.

The Slurm job output showed this. It turned out that the nodes I had set up didn't have the user account, so the user's home directory could not be resolved.

Solution: Ensure the user account exists on the compute nodes.
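
To check this on a compute node, something like the following sketch can be used. The function name is made up; getent is used because it consults the same NSS sources (files, LDAP, and so on) that the system uses to resolve accounts.

```shell
# Sketch: print the home directory the node resolves for a user, or fail
# if the account is unknown on this node.
user_home() {
    local entry
    entry=$(getent passwd "$1") || return 1
    # passwd entries are colon-separated; the home directory is field 6.
    printf '%s\n' "$entry" | cut -d: -f6
}

# Example, run on the compute node itself:
# user_home "$USER" || echo "account $USER is unknown on $(hostname)"
```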

Job Composer

To install it, see:

The actual Job Composer app is at https://github.com/OSC/ondemand/tree/master/apps/myjobs

CSRF Fail, resulting in 'The change you wanted was rejected'

For some reason, when trying to create a new job or a new template, I got "The change you wanted was rejected.". The PUN logs show FATAL "ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):".

This was only 'fixed' by disabling the CSRF origin check, by adding Rails.application.config.action_controller.forgery_protection_origin_check = false to config/initializers/new_framework_defaults_5_2.rb. The most likely cause is the reverse proxy, either Traefik (likely) or nginx, making the CSRF check fail because of a difference in base_url. This issue affects all the other interactive apps as well and may require the same fix there.

See also, this issue: https://github.com/rails/rails/issues/22965.

Shell / Terminal Access

The Shell App basically just runs a web SSH client from the Open OnDemand server. You can limit which hosts it can connect to by tweaking the env file:

DEFAULT_SSHHOST="arc.ucalgary.ca"
OOD_SSHHOST_ALLOWLIST="arc.ucalgary.ca"

Failed to establish a websocket connection.

After setting up OnDemand, I had a hard time getting the SSH app to connect. I kept on getting:

Failed to establish a websocket connection. Be sure you are using a browser that supports websocket connections.

Debugging the socket revealed that it was getting a 401 error from PUN. At first, I thought the reverse proxy (traefik 1.7) wasn't forwarding the authentication headers, but this was a red herring. Something with the PUN application was throwing this 401 error. I verified that the node is able to SSH and that the default hostname is correct in /etc/ood/config/apps/shell/env. I even tried connecting to a specific host to no avail. I then set OOD_SSH_WRAPPER=/test.sh with test.sh just dumping the environment to a tmp file which showed me that it wasn't even reaching the point of calling the SSH wrapper. OOD 1.8 also requires setting the OOD_SSHHOST_ALLOWLIST, but that didn't help.

Solution: This only worked after setting OOD_SHELL_ORIGIN_CHECK='off' in the shell app's env file.
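
For reference, a debugging wrapper like the test.sh mentioned above can be sketched as follows. The function name and the default dump path are made up; any file the PUN user can write to will do.

```shell
# Sketch: record the environment the wrapper was called with, sorted for
# easier comparison between runs.
dump_env() {
    local dump_file="${1:-/tmp/ood_ssh_wrapper_env.txt}"
    {
        echo "# called at $(date)"
        env | sort
    } > "$dump_file"
}

# As a standalone wrapper script the body would be:
#   dump_env /tmp/ood_ssh_wrapper_env.txt
#   exec ssh "$@"
```

If the dump file never appears, the request is being rejected before the wrapper is called, which is what happened here.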

Authentication

Open OnDemand supports a number of authentication methods including plain old system authentication (based on PAM), LDAP/AD (with Dex), and federated authentication services (such as CAS, Shibboleth, OpenID).

For more information, review the docs at https://osc.github.io/ood-documentation-test/init-2.1/authentication.html

Plain old PAM

To have Apache handle authentication using PAM, define the following lines in ood_portal.yml:

auth:
  - 'AuthType Basic'
  - 'AuthName "Open OnDemand"'
  - 'AuthBasicProvider PAM'
  - 'AuthPAMService ood'
  - 'Require valid-user'

When users access the OnDemand instance, their web browser prompts for credentials using HTTP basic authentication. With this method, users cannot log out unless they clear the basic authentication credentials from their browser.

OpenID

To get Open OnDemand to authenticate against the Azure AD service that the University of Calgary is using:

  1. Obtain the OIDC Identity Provider URL, client ID, and client secret.
  2. Install the mod_auth_openidc package.
  3. Set up ood_portal.yml for OIDC authentication:
servername: "ood.example.com"

# Use OIDC logout
logout_redirect: "/oidc/?logout=https%3A%2F%2Food.example.com"

# Authentication via MS results in 'user@ucalgary.ca' as the username
user_map_cmd: "/opt/ood/ood_auth_map/bin/ood_auth_map.regex --regex='^(.+)@ucalgary.ca' "

oidc_uri: "/oidc/"
oidc_provider_metadata_url: "https://login.microsoftonline.com/** Provided by MS **/v2.0/.well-known/openid-configuration"
oidc_client_id: "** Provided by MS **"
oidc_client_secret: "** Provided by MS **"
oidc_remote_user_claim: "preferred_username"
oidc_scope: "openid profile email"
oidc_session_inactivity_timeout: 28800
oidc_session_max_duration: 28800
oidc_state_max_number_of_cookies: "10 true"
oidc_settings:
  OIDCPassIDTokenAs: "serialized"
  OIDCPassRefreshToken: "On"
  OIDCPassClaimsAs: "environment"
  OIDCStripCookies: "mod_auth_openidc_session mod_auth_openidc_session_chunks mod_auth_openidc_session_0 mod_auth_openidc_session_1"
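
The effect of the user_map_cmd regex above can be tried out with sed. This is only an illustration of the mapping, not what ood_auth_map.regex runs internally; the function name is made up, and unlike the configured pattern this variant escapes the dot and anchors the end of the string.

```shell
# Sketch: print the captured local part for addresses under the domain,
# and nothing for any other input.
map_user() {
    printf '%s\n' "$1" | sed -nE 's/^(.+)@ucalgary\.ca$/\1/p'
}

# Example:
# map_user 'jdoe@ucalgary.ca'
```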

If your Open OnDemand system is behind a reverse proxy doing SSL termination, you will still want the OIDCRedirectURI to use the https:// URL. However, the ood-portal-generator only uses the https protocol if the SSL certificates are defined in ood_portal.yml (which we don't want to do with an external SSL terminator). As a workaround, my Docker entrypoint.sh script replaces http with https before starting Apache.
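
The replacement step could be sketched as below. This is not the actual entrypoint.sh; the function name is made up, and the exact layout of the OIDCRedirectURI line in the generated config is an assumption. Only lines with that directive are rewritten.

```shell
# Sketch: switch the scheme of OIDCRedirectURI from http to https in place.
force_https_redirect_uri() {
    local conf="$1" tmp status
    tmp=$(mktemp) || return 1
    # Write the result back with cat so the file keeps its owner and mode.
    sed -E 's#^([[:space:]]*OIDCRedirectURI[[:space:]]+"?)http://#\1https://#' \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example, before starting Apache:
# force_https_redirect_uri /etc/httpd/conf.d/ood-portal.conf
```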

See Also:

Troubleshooting

Logs are stored at:

  • /var/log/ondemand-nginx/
  • /var/log/httpd/

Apps are stored at:

  • /var/www/ood/apps/sys
  • /var/www/ood/apps/dev/$username/gateway (symlinked to user's home directory ~/ondemand/dev)

400 - Bad Request

Apache and the nginx instance running the PUN support an 8k maximum client request header. Exceeding this limit results in a 400 - Bad Request error being returned by Apache to the client. A likely cause for exceeding this limit is an excessive number of cookies.

If there are a large number of OpenID state cookies, then this error is likely a symptom of another issue. For more information on this issue, see: https://github.com/zmartzone/mod_auth_openidc/wiki/Cookies

A workaround would be to increase the client header request size from the default 8k to something higher. To do this:

  1. Add LimitRequestFieldSize 65536 to /etc/httpd/conf.d/ood-portal.conf, within the VirtualHost tags.
  2. Create an additional config file at /var/lib/ondemand-nginx/config/apps/sys/large-header.conf containing large_client_header_buffers 8 64k;.
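
To judge whether cookies are the cause, the size of a Cookie header copied from the browser's developer tools can be compared against the limit. This is a rough sketch; the function names are made up, and 8192 bytes stands in for the 8k limit described above.

```shell
# Sketch: compare the size of a Cookie header value against the 8k limit.
header_limit_bytes=8192

cookie_header_size() {
    # Byte count of the value; the arithmetic expansion strips wc's padding.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

cookie_header_too_large() {
    [ "$(cookie_header_size "$1")" -gt "$header_limit_bytes" ]
}

# Example:
# cookie_header_too_large "$cookie_value" && echo "cookies exceed the header limit"
```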

Implementations


See Also