ESS Storage
Fall 2002. Status as of 1/9/03: NAS gateway in Computer Building in production as ubackup.win.psu.edu; nightly jobs mirror udrive and profile data. 3/11/03: Second NAS gateway installed in Computer Building to connect to disks there.
Summary
CLC has bought some of the disk storage in the large IBM Enterprise Storage System (ESS) SAN (aka "Shark") purchased by Administrative Information Systems (AIS) for possible end-user services and for backup storage.
Background and Purpose
In the fall of 2001 we began looking at storage options for the "UDrive" and roaming profiles. Our existing IBM SAN was several years old, out of space and speed, and no longer supported by the vendor. Tape backups, both to our LTO library and to TSM, are very time consuming. We began discussions with AIS about sharing storage in a large SAN being planned for administrative data. As plans progressed it became clear we needed an upgrade before the new AIS storage could be in place, so we purchased an IBM FAStT700-based SAN with 2TB of disk space and put it into production in May 2002. We remained interested in participating in the AIS SAN for possible backup of data to another building and for user self-service online backup of the UDrive.
The IBM ESS (Enterprise Storage System) was selected by AIS, and we met with the AIS team and IBM storage experts. One limitation of the ESS is the cap on the number of drives that can go into a LUN (RAID set). We need one or two more years of a single large volume supporting Windows 9x/ME, and we also require clustering for high availability. Currently MSCS does not support spanned dynamic volumes (multiple LUNs looking like one disk), so the ESS could not be used for the UDrive. However, it could work as an online or offline backup (with a single front-end server). We agreed to buy into the Sharks with 1TB of disk space in each of the two buildings to investigate the speed, ease of use, and features available.
Progress Summary
- Summer 02 -- Sharks installed and working for mainframe use; we're too busy to look at it until . . .
- 11/19/02 -- meet with AIS staff and IBM engineer to discuss connectivity plans
- 11/19/02 -- fiber cable types and switch port identified to connect to ESS SAN in Computer Building
- 11/22/02 -- fiber pulled to CLC server area
- 11/22/02 -- get NAS gateway manuals to figure out what they are
- 11/25/02 -- pick up one of the NAS boxes at Shields and put it in our server area
- 11/25/02 -- install NAS in rack, update Windows OS, join to win.psu.edu domain as "remora1", connect fiber
- 11/25/02 -- fabric zone defined; 2 LUNs (421GB and 491GB) formatted
- 11/25/02 -- UDrive data duplicated -- 7.5 hours
- 11/26/02 -- UDrive mirror (update or incremental backup) job takes about 2 hours
- 11/26/02 -- Start job to mirror profiles (both UP and WB servers; the latter only weekly)
- 12/02/02 -- Space used on Shields Shark now 462GB; remove /SECFIX option from mirror profile job.
- 12/03/02 -- Installed TSM client (5.1.2) on Remora1; a 4.x version was there; set up job to do UDrive backup
NAS Gateways
Part of the investment includes two IBM TotalStorage Network Attached Storage (NAS) 300G-G26 servers to connect to the Sharks. These are essentially Intel 1.1GHz dual processor servers running Windows 2000 Advanced Server and preloaded with some IBM customization and configuration software. They each include a gigabit ethernet adapter to connect to an IP network and a two-port fibre-channel adapter to connect to the ESS SAN. The first one was installed in the Computer Building 11/25/02 and connected to the SAN fabric, and two LUNs in the Shields Building ESS were assigned to it. A spanned logical volume was created using the two LUNs.
Performance Tests
File Copies
File copies from one server to another are typically run on the receiving side; that is, data is pulled, so the production server is not bogged down. We use Microsoft's Robocopy ("Robust File Copy") utility, which can duplicate a directory structure, adding, replacing, and deleting files in the target storage space to match the source. We also use options to compare and duplicate folder and file permissions. The typical job launches 26 processes, one for each letter of the alphabet, to copy all 26 first-level folders simultaneously. This task stresses processor time, disk I/O, and network I/O. Just inspecting the large number of files, without copying anything, takes a lot of time.
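The fan-out described above could be sketched in Python roughly as follows. This is an illustration only, not the production job: the UNC paths, the retry settings, and the exact option set are assumptions. `/MIR`, `/SEC`, `/R`, and `/W` are standard Robocopy switches, but the options our job actually uses are not listed on this page.

```python
import string
import subprocess

# Hypothetical UNC paths; the real share names are not given on this page.
SOURCE_ROOT = "\\\\udrive\\users"
TARGET_ROOT = "\\\\remora1\\mirror"

def build_mirror_commands(source_root=SOURCE_ROOT, target_root=TARGET_ROOT):
    """Return one robocopy argument list per first-level folder (A-Z)."""
    commands = []
    for letter in string.ascii_uppercase:
        commands.append([
            "robocopy",
            source_root + "\\" + letter,
            target_root + "\\" + letter,
            "/MIR",   # mirror: add, replace, and delete files to match the source
            "/SEC",   # copy NTFS permissions along with the data
            "/R:1",   # assumed retry count
            "/W:1",   # assumed wait in seconds between retries
        ])
    return commands

def launch_all(commands):
    """Start every copy at once, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

Calling `launch_all(build_mirror_commands())` on the receiving server would pull all 26 folders in parallel. Robocopy exit codes below 8 indicate success, so a wrapper like this would need to treat nonzero codes with care.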
Both the UDrive data and the roaming profiles consist of large numbers of mostly small files. Counts on 12/02/02:
| Data Store | GB Used | Bytes Used | GB Files | Bytes in Files | Bytes/File | Folders | Files | Users |
| UDrive | 314 | 338,006,626,304 | 307 | 330,124,376,629 | 103,443 | 664,472 | 3,191,351 | 36,758 |
| Profiles, UP | 145 | 155,960,246,272 | 120 | 128,880,282,495 | 15,776 | 2,328,514 | 8,169,044 | 28,822 (147,672) |
| Profiles, WB | 5 | 5,669,007,360 | 4 | 4,184,121,213 | 8,979 | 88,052 | 465,943 | 702 (1,466) |
"GB Used" and "Bytes Used" is the size on disk; "GB Files" and "Bytes in Files" is the total data in files; due to blocking, files take more space than the data they contain. Profile folders are created for all PSU Access Accounts but contain data only when the person logs onto a machine in the win.psu.edu domain; the smaller number for "Users" is the count of non-empty profile folders.
UDrive Mirroring
The first test was to mirror all files from the UDrive to Remora1 over IP with a file copy utility. Both servers are on the same gigabit Ethernet switch. This went faster than anticipated; the 317GB went in 7 hours and 22 minutes, or about 734MB/minute. Both processors on Remora1 were near 100% busy the whole time. The UDrive server showed elevated processor utilization, but our remote file I/O benchmarks did not show any significant impact on end-user response time.
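The throughput figure can be reproduced with a few lines of Python, taking 1 GB = 1024 MB (an assumption about the units behind the 734MB/minute figure):

```python
def throughput_mb_per_minute(gigabytes, hours, minutes):
    """Average copy rate in MB/minute, taking 1 GB = 1024 MB."""
    total_minutes = hours * 60 + minutes
    return gigabytes * 1024 / total_minutes

rate = throughput_mb_per_minute(317, 7, 22)
print(f"{rate:.0f} MB/minute, {rate / 60:.1f} MB/second")
# prints: 734 MB/minute, 12.2 MB/second
```

At roughly 12 MB/second the copy used only about a tenth of a gigabit link, which is consistent with the processors on Remora1, rather than the network, being the busiest resource.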
At 3:25 am the next day, the mirror job ran again, finishing in 2 hours. Since the full copy had finished only a few hours earlier, there were not many changes to make. Nevertheless, this appears to be much faster than the current mirror job running on a SCSI-based 2-node cluster, which takes around 6 hours. Those servers are both processor- and disk-I/O-limited (but they do have gigabit Ethernet connections on the same switch as the UDrive cluster).
Timings (hours:minutes)
- 11/27/02 2:13 (few changes; full copy was 12 hours prior)
- 11/28/02 3:18 (overlapped with Profile mirror)
- 11/29/02 2:05 (very few changes, 28th was a holiday)
- 11/30/02 2:03
- 12/01/02 1:59
- 12/02/02 2:02
- 12/03/02 2:58 (students back, more file changes)
- 12/04/02 2:11
- 12/05/02 2:13
- 12/07/02 2:15
- 12/08/02 2:38
- 12/09/02 2:17
- 12/10/02 2:21
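To put the list above in perspective, the timings can be summarized and compared against the roughly 6 hours the existing mirror job takes. This is a small sketch using only the figures on this page; treating "around 6 hours" as exactly 360 minutes is an approximation.

```python
UDRIVE_MIRROR_TIMES = ["2:13", "3:18", "2:05", "2:03", "1:59", "2:02", "2:58",
                       "2:11", "2:13", "2:15", "2:38", "2:17", "2:21"]
OLD_MIRROR_MINUTES = 6 * 60  # "around 6 hours" on the SCSI-based cluster

def to_minutes(hmm):
    """Convert an 'h:mm' string to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

def summarize(times):
    """Return (shortest, longest, mean) run time in minutes."""
    minutes = [to_minutes(t) for t in times]
    return min(minutes), max(minutes), sum(minutes) / len(minutes)

low, high, mean = summarize(UDRIVE_MIRROR_TIMES)
print(f"shortest {low} min, longest {high} min, mean {mean:.0f} min")
print(f"about {OLD_MIRROR_MINUTES / mean:.1f}x faster than the existing job")
```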
Profile Mirroring
The user profiles are on a 409GB LUN on the same FAStT700 SAN as the UDrive, but are usually served from the second cluster node. On 11/26/02, 139GB were in use. There are also a small number of profiles (4GB / 466,000 files / 88,000 folders) on a file server at the Wilkes-Barre campus for lab users at that location, which are backed up here weekly.
Timings (hours:minutes)
- 11/27/02 14:57 (initial copy)
- 11/28/02 07:14 (overlapped with UDrive mirror)
- 11/29/02 05:12
- 11/30/02 06:04
- 12/01/02 06:04
- 12/02/02 05:35
- 12/03/02 02:50 (removing /SECFIX option must help a lot)
- 12/04/02 03:04
- 12/05/02 03:41
- 12/07/02 03:03
- 12/08/02 03:13
- 12/09/02 03:28
- 12/10/02 03:02
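The effect of dropping /SECFIX can be estimated by averaging the nightly runs before and after the change. The sketch below uses the timings listed above and leaves out the 14:57 initial copy. It is not a controlled comparison: the 11/28 run overlapped with the UDrive mirror, and day-to-day user activity also varies.

```python
from statistics import mean

# Nightly profile mirror times (hours:minutes) from the list above.
WITH_SECFIX = ["07:14", "05:12", "06:04", "06:04", "05:35"]  # 11/28 to 12/02
WITHOUT_SECFIX = ["02:50", "03:04", "03:41", "03:03",
                  "03:13", "03:28", "03:02"]                  # 12/03 to 12/10

def to_minutes(hhmm):
    """Convert an 'hh:mm' string to whole minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def average_minutes(times):
    """Mean run time in minutes for a list of 'hh:mm' strings."""
    return mean(to_minutes(t) for t in times)

before = average_minutes(WITH_SECFIX)
after = average_minutes(WITHOUT_SECFIX)
print(f"with /SECFIX: {before:.0f} min; without: {after:.0f} min; "
      f"reduction {1 - after / before:.0%}")
```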
TSM Backup
A nightly job does an incremental backup of the UDrive. This had been running on a cluster hosting the "UBackup" share, where it was often slow; we moved it to Remora1 to see how much faster processors would help. The data is still backed up from the UDrive server; the ESS is not involved here (yet).
The job has two sequential steps, each issuing a dsmc command for a different set of first-level folders. The times for this job are highly variable, recently ranging from 7 to 12 hours.
- 12/03/02 4:54 + 2:54 = 7:48 (amount of data backed up similar to a job that took 12 hours on the old server)
- 12/04/02 5:35 + 2:21 = 7:56
- 12/05/02 5:35 + 2:51 = 8:26
- 12/06/02 3:53 + 4:31 = 8:24
- 12/07/02 5:53 + 6:47 = 12:40
- 12/08/02 3:31 + 3:20 = 6:51
- 12/09/02 5:02 + 4:24 = 9:26
- 01/12/03 3:39 + 4:32 = 8:11 (students away)
- 01/13/03 5:41 + 8:10 = 13:51 (students back)
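A two-step job of this shape, with the per-step and total times reported as in the list above, might be sketched like this. The split of folders into two sets and the UNC paths are hypothetical, as is the use of `-subdir=yes`; the actual dsmc options used by our job are not recorded on this page.

```python
import subprocess
import time

# Hypothetical split of first-level folders into two sets.
STEP_FILESPECS = [
    ["\\\\udrive\\users\\A\\*", "\\\\udrive\\users\\B\\*"],
    ["\\\\udrive\\users\\Y\\*", "\\\\udrive\\users\\Z\\*"],
]

def format_hmm(seconds):
    """Format a duration in seconds as h:mm, dropping leftover seconds."""
    total_minutes = int(seconds // 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"

def run_backup(steps=STEP_FILESPECS, runner=subprocess.call, clock=time.monotonic):
    """Run each dsmc step in turn; return (per-step seconds, total seconds)."""
    elapsed = []
    for filespecs in steps:
        start = clock()
        # Assumed invocation: incremental backup of the listed folders.
        runner(["dsmc", "incremental", *filespecs, "-subdir=yes"])
        elapsed.append(clock() - start)
    return elapsed, sum(elapsed)
```

The `runner` and `clock` parameters exist only so the timing logic can be exercised without a TSM client installed; on the gateway the defaults would be used, and `format_hmm` applied to each result gives entries in the same "step + step = total" form as the list.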
This site maintained by the Classroom and Lab Computing group of Information Technology Services.
This page was last modified: 3/18/2003 10:29:02 AM.