Bio-X² cluster | documentation

 

New status pages and My Jobs now available

07/18/08 21:17:55 | kilian | #

I'm very happy to announce the release of some new features for the Bio-X² cluster. To improve the cluster's usability and give users a better overview and finer understanding of how it functions, the following new pages are now available:

Finally, we're also introducing "My Jobs", a new feature available in the users' profile pages.

"My Jobs" gives each user a rich web interface to the scheduler. Users can get real-time information about their jobs (currently active or archived), analyze and profile their resource usage, see their historical usage, and get an overview of compute node loads, queue status and much more. Group leads can also take a glance at their group's jobs with a single click by following the "My Group's Jobs" link on their profile page.

Please don't hesitate to direct questions or report problems to biox2-administrators@lists.stanford.edu

Scheduled maintenance

07/07/08 13:58:22 | kilian | #

Mon, Jul 21, 9am-6pm

Sun released an update to the Lustre filesystem. This updated version incorporates performance improvements and bug fixes the Bio-X² cluster filesystem could benefit from. It also supports the latest Infiniband software stack, which also includes performance improvements and bug fixes, especially for MPI users.

We're planning to proceed with the upgrade on Monday, July 21st. The outage will begin at 9am and should last a few hours if everything works as expected. An announcement will be sent to this mailing list when access is restored.

Regarding jobs, the compute nodes will have to be shut down. As a consequence, all jobs still running at that time will be killed.

Given this two weeks' notice, it should be possible for all of the currently running jobs to terminate before the maintenance window. However, it remains the users' responsibility to ensure that their jobs are done by this time, and to refrain from submitting new jobs shortly before the shutdown. To help ensure this, queues will be closed to new job submissions starting Friday, July 18, 6pm.

As an extra precaution, please also make sure that you've backed up your latest files and results before the upgrade.

Schedule

Updated profile pages and new URL

05/06/08 15:35:31 | kilian | #

The Bio-X² cluster website, which hosts the documentation and the profile pages, has been relocated to a new web server. The new URL to access this site is now http://biox2.stanford.edu. This should be shorter to type and easier to remember. Redirections from the old addresses have been put in place to ensure continuity of access.

At the same time, a new version of the Bio-X² users' profile pages has been released, including new and updated functionality. New features include:

For the PIs:

All of the profile page's features are described in detail in WebProfile. The new version of the Profile Page is already accessible at http://biox2.stanford.edu/my

Below are some screenshots:

General maintenance

02/11/08 16:41:37 | kilian | #

Starting Tue, Feb 19, 9am

In an effort to increase both filesystem reliability and performance, as well as network bandwidth to the Bio-X² cluster, an upcoming maintenance window has been scheduled for Tuesday, February 19, starting 9am.

What will be done

This maintenance will focus on two items:

How long it will take

The aforementioned operations are expected to take two days. Users will be notified by an email announcement once access to the cluster is restored.

Impact on jobs

During the maintenance, all network traffic to the cluster will be interrupted. Once access is restored, users shouldn't notice any change: the steps to connect to the cluster will remain the same, and the only noticeable difference should be faster transfer rates.

Regarding jobs, the filesystem will have to be brought offline, and the compute nodes turned off. As a consequence, all jobs still running at that time will be killed.

Given this one-week notice, it should be possible for all of the currently running jobs to terminate before the maintenance window. However, it will remain users' responsibility to ensure that their jobs are done by this time, and to refrain from submitting new jobs shortly before the shutdown. To help ensure this, queues will be closed to new job submissions starting Saturday, February 16, 6pm.

Schedule

Filesystem maintenance

12/18/07 11:01:12 | kilian | #

Mon, Jan 7th, 9am-6pm

ClusterFS released a bug fix version of the Lustre filesystem. This version is expected to fix most of the problems currently encountered with the filesystem.

The upgrade will take place on Monday, January 7th, starting at 9am. The maintenance is expected to last a few hours. An announcement will be sent to this mailing list when access is restored.

The process requires bringing down all the compute nodes, and thus interrupting the running jobs. Please schedule your jobs accordingly, and refrain from submitting long jobs before the maintenance window, since they will be interrupted.

The upgrade is expected to be non-destructive, but we won't provide any backup in case of a problem. To be on the safe side, please make sure that you've backed up your data before the upgrade.

Power Outage

12/08/07 09:42:33 | kilian | #

A power outage in Clark affected the Bio-X² cluster. We are working to restore access as quickly as possible.

Update #1
12/08 1:10pm

The cluster was in pretty bad shape after the power outage. All the storage disk shelves were down, and the ethernet switches unresponsive. Most of the compute nodes had crashed, and some were still stuck in their booting phase.

Update #2
12/08 5:19pm

Most of the problems on the cluster were caused by overheating. On top of the power failure, and probably related to it, the chilled water flow was interrupted for two hours last night, and the server room temperature rose to more than 50° Celsius (~120°F). The disk shelves and switches actually shut down to protect themselves, and the plastic pieces around the raised floor tiles popped out of their sockets, probably because of the thermal expansion of the metallic floor tiles.

Update #3
12/12 5:14pm

All the switches have returned to normal operations, and all the compute nodes are now available for job submission.

Charging scheme for Bio-X², starting Dec. 1st

11/28/07 10:07:06 | kilian | #

The charging scheme for Bio-X² has been established, and will be as follows:

Users will be charged for CPU time, not run time or waiting time. There will be two group categories:

Both categories will be charged $2.00 per core-day in high-priority queues.
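As a rough illustration of the high-priority rate above, the following sketch converts a job's CPU time into a charge. It assumes that a core-day means 24 hours of CPU time consumed on one core; the function and constant names are hypothetical and not part of any cluster tool. See PricingScheme for the authoritative rules.

```python
# Hypothetical sketch: charge for CPU time in a high-priority queue.
# Assumption: one core-day equals 86,400 seconds of CPU time on one core.

SECONDS_PER_DAY = 24 * 60 * 60
HIGH_PRIORITY_RATE_PER_CORE_DAY = 2.00  # dollars, from the announcement


def high_priority_charge(cpu_seconds, rate=HIGH_PRIORITY_RATE_PER_CORE_DAY):
    """Return the charge in dollars for the given CPU time (not wall-clock time)."""
    if cpu_seconds < 0:
        raise ValueError("cpu_seconds must be non-negative")
    core_days = cpu_seconds / SECONDS_PER_DAY
    return core_days * rate


if __name__ == "__main__":
    # A job that used 8 cores at full load for 6 hours consumed 48 CPU-hours.
    cpu_seconds = 8 * 6 * 3600
    print(f"${high_priority_charge(cpu_seconds):.2f}")
```

Because the charge is based on CPU time, a job that sits idle or waits in a queue accrues nothing under this model.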

The charging will be retroactive to 1 July 2007 and all priority use will be ignored. The current charging scheme will start on 1 Dec. 2007 and be in effect to 30 June 2008.

More information is available in PricingScheme

A simpler way to check Lustre quotas

10/16/07 17:54:22 | kilian | #

A new tool is available for users to check their quota usage on the Lustre filesystems (/home and /scratch). It's called lfsquota and is available on the frontend.

Its output looks like the following:

Filesystem   Quota     Usage
/home        50GB      24GB      ||||||||| 48%
/scratch     1024GB    341GB     |||||| 33%

It can be added to the shell profile file to run at logon (~/.bash_profile for bash, or ~/.login for csh-like shells).

See StorageConfiguration#Howtocheckquotas for more details
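For readers who want to interpret the bar in the sample output above, here is a small sketch that reproduces the same kind of line. It is not the lfsquota source; it assumes the percentage is truncated to an integer and that one bar character is drawn per 5% of usage, which happens to match both sample lines. All names are hypothetical.

```python
# Hypothetical sketch of a quota usage line in the style of the sample output.
# Assumptions: integer-truncated percentage, one "|" per 5% used.

def usage_percent(quota_gb, usage_gb):
    """Return usage as a whole percentage of the quota (truncated)."""
    return (100 * usage_gb) // quota_gb


def usage_bar(percent, percent_per_bar=5):
    """Return a text bar with one '|' per `percent_per_bar` percent."""
    return "|" * (percent // percent_per_bar)


def usage_line(filesystem, quota_gb, usage_gb):
    """Format one line: filesystem, quota, usage, bar and percentage."""
    percent = usage_percent(quota_gb, usage_gb)
    quota = f"{quota_gb}GB"
    usage = f"{usage_gb}GB"
    return f"{filesystem:<13}{quota:<10}{usage:<10}{usage_bar(percent)} {percent}%"


if __name__ == "__main__":
    print(usage_line("/home", 50, 24))
    print(usage_line("/scratch", 1024, 341))
```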

Jobs milestone

10/11/07 10:25:14 | kilian | #

Over 1,000,000 jobs have been submitted to the scheduler since the cluster first saw light.

To celebrate this milestone, we've decided to offer a Peet's Coffee gift card to the lucky user who submitted the one millionth job on the Bio-X² cluster.

We currently have 130 registered users from over 30 labs. The total CPU time consumed has been 101334 days (something like 280 years of CPU time), with the longest job having used 6.5 CPU years. The average wait time in queues has been 1h 02 min, and the average turnaround time (elapsed time from job submission to job completion) has been 3.6 hours. The total throughput of the cluster has been 223.12 jobs per hour during the last 4 months.
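The conversion from CPU days to CPU years quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures from this announcement; the 365.25-day year and the per-user average are assumptions added for illustration.

```python
# Sanity check of the usage figures quoted in the announcement.
# Assumption: a year is counted as 365.25 days.

TOTAL_CPU_DAYS = 101334
REGISTERED_USERS = 130
DAYS_PER_YEAR = 365.25


def cpu_days_to_years(cpu_days, days_per_year=DAYS_PER_YEAR):
    """Convert CPU days to CPU years."""
    return cpu_days / days_per_year


total_cpu_years = cpu_days_to_years(TOTAL_CPU_DAYS)
cpu_days_per_user = TOTAL_CPU_DAYS / REGISTERED_USERS

if __name__ == "__main__":
    print(f"Total: {total_cpu_years:.1f} CPU-years")
    print(f"Average: {cpu_days_per_user:.0f} CPU-days per registered user")
```

The total comes out slightly under the rounded "280 years" figure, which is consistent with the text's "something like".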

Internal Ethernet switches maintenance

10/10/07 16:05:54 | kilian | #

Unscheduled maintenance is ongoing on the internal Ethernet switches. New firmware is being deployed which corrects previous issues. During the upgrade, inter-node network connectivity is degraded, and the compute nodes may appear to be unavailable from the scheduler. Jobs will however continue to run and the filesystem is unaffected.

Update: the maintenance is over, jobs have been resumed.

Using CVS or SVN from Bio-X²

10/02/07 14:07:09 | kilian | #

Due to numerous requests, users are now allowed to use CVS and SVN clients on the Bio-X² frontend to check out their code from remote servers. More details are available in the CvsSvn page.

Group quota increase on /scratch

08/27/07 16:46:43 | kilian | #

Requests for more storage space to host large data sets on the Bio-X² cluster are becoming more frequent, so we've decided to increase the group quotas on /scratch from 500GB to 1TB.

A very rough outline is available in the StorageConfiguration page.

Filesystem maintenance

08/10/07 16:20:37 | kilian | #

Tue, Aug 14th, 9am-6pm

ClusterFS will release a new version of their Lustre filesystem on Monday, August 13th. This updated version will incorporate fixes for most of the bugs behind the recent filesystem problems on the Bio-X² cluster.

That's why we're planning to bring the filesystem down next Tuesday (August 14th) so that we can proceed with the upgrade. The outage will begin at 9am and should last a few hours if everything works as expected. An announcement will be sent to this mailing list when access is restored.

The process requires rebooting all the compute nodes, and thus interrupting the running jobs. Please schedule your jobs accordingly, and refrain from submitting long jobs after the weekend, since they will be interrupted.

As an extra precaution, please also make sure that you've backed up your latest results before the upgrade.

We will also use this downtime to upgrade the Infiniband stack, which should bring nice performance improvements and bug fixes for MPI users. The storage access speed should also benefit from this upgrade.

We're sorry for the inconvenience, and hope that this step will resolve the issues we've encountered.

The Custom SSH port goes away

07/26/07 17:18:59 | kilian | #

To simplify the connection and authentication process, the custom SSH port previously used to connect to the Bio-X² cluster has been retired. You no longer need to add the -p <your_custom_port> option to your ssh command to connect to the frontend. The option will still work for backward compatibility, but it's not required anymore. You can now use the default SSH port (22).

Please refer to the documentation, and especially FirewallAccess for updated instructions.

Standard Queue Farewell

07/16/07 17:37:21 | kilian | #

The previous default queue, named "standard", has been removed today, after its last jobs successfully completed.

New queues active

07/02/07 14:16:00 | kilian | #

The new queues are now active.

New queues

06/29/07 17:03:49 | kilian | #

Starting July 2nd (Monday), the default queue will be switched from "standard" to "SP", and the "standard" queue will be closed. Jobs running in the "standard" queue at that time will finish, and the queue will be deactivated (destroyed) when empty.

All jobs submitted without any queue specification (bsub -q) will be submitted to the SP queue. Any job submitted without runtime specification will only be allowed to run for 2 hours, and will be killed thereafter.
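To make the defaults above concrete, here is a small sketch that builds a bsub command line with an explicit queue and run limit. It assumes the scheduler accepts LSF's standard -W option ([hour:]minute) for the run limit; the helper name and the example job script are hypothetical. See JobQueues for the actual queue limits.

```python
# Hypothetical helper that assembles a bsub command line.
# Assumption: the run limit is given with LSF's -W option as hours:minutes.

DEFAULT_QUEUE = "SP"                  # used when no -q is given (per the announcement)
DEFAULT_RUNTIME_LIMIT_MINUTES = 120   # jobs without a runtime are killed after 2 hours


def build_bsub_command(job_command, queue=None, runtime_minutes=None):
    """Return the argument list for submitting `job_command` with bsub."""
    args = ["bsub"]
    if queue is not None:
        args += ["-q", queue]
    if runtime_minutes is not None:
        hours, minutes = divmod(runtime_minutes, 60)
        args += ["-W", f"{hours}:{minutes:02d}"]
    args.append(job_command)
    return args


if __name__ == "__main__":
    # No queue or runtime: lands in the default queue with the 2-hour limit.
    print(" ".join(build_bsub_command("./my_job.sh")))
    # Explicit queue and a 6-hour run limit.
    print(" ".join(build_bsub_command("./my_job.sh", queue="SP", runtime_minutes=360)))
```

Stating the expected runtime explicitly avoids having a longer job killed at the two-hour default.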

See JobQueues for more details about the queues configuration.

Top500 Ranking

06/27/07 10:52:38 | kilian | #

The Bio-X² cluster has entered the Top500 supercomputer sites list. The June 2007 list was released during the opening session of ISC'07 in Dresden, Germany.

The first Top500 ranking of Bio-X² is #54, which makes it one of the top US academic supercomputers, with an Rmax (measured Linpack performance) of 15.57 Teraflops and a theoretical Rpeak of 20.58 Teraflops.

Excerpt from the latest list general highlights:

Earthquake bracing outage is over

06/26/07 18:15:27 | kilian | #

The earthquake bracing outage is over, everything went as planned, and the racks are now firmly bolted to the floor.

Access to the cluster has been restored, and jobs can be resumed.

Earthquake bracing outage

06/26/07 15:12:54 | kilian | #

A cluster outage has been scheduled for Tuesday, June 26, 6:00 am to 6:00 pm.

This outage is to fasten the racks firmly to the concrete sub-floor, bracing the system in the event of an earthquake. This will involve drilling through the raised floor tiles and concrete floor, generating dust and metal filings. We will need to power off the systems to prevent this detritus from being sucked into our systems.

Please schedule your jobs accordingly. Any jobs running at 6:00 am on Tuesday morning will be killed without notice.