Computational Resources at the IGSP

Integrated environment, from laptop to high-performance compute cluster

IGSP's information systems are designed to minimize time spent on staging data for analysis and maximize efficiency of analysis. In practical terms this means that data access is continuous throughout the Institute, and datasets visible on a networked laptop are also available to thousands of CPU cores on the Duke Shared Cluster Resource (DSCR), a high-performance computational resource shared by researchers across the university. The IGSP computational team also offers specialized and exotic computational resources through virtualization, using both University and "cloud"-based resources.

IGSP's DNA Microarray Core, Proteomics Core, and Sequencing Core facilities use the centralized storage, so that IGSP scientists can easily acquire large datasets and set them up for analysis. The Microarray and Proteomics Cores use the "Express" data repository for data distribution and analysis for major projects. "Express" has been developed by IGSP programmers to ease data production activities in the cores and provide a true repository for data of abiding scientific interest. The system has been used to automate data storage and analysis, and it is being expanded to increase its flexibility and capability.

The Express Data Repository and the IGSP infrastructure have been recognized as models for supporting efficient and secure data management for large genomic data.

Generous and secure storage

NetApp FAS 3170

In terms of raw storage capacity, Duke's IGSP has the fourth largest data storage system in the Duke University and Health System enterprise. Data is backed up to disc, and mirrored to separate locations for disaster recovery. Storage and computational resources of IGSP are located in enterprise-level data centers with access controls, and primary storage for the Institute's labs and staff is housed in a data center that is manned 24/7 and has redundant and emergency power supplies.

Currently, individuals are granted 25 gigabytes of backed up storage space, and labs have access to 150 gigabytes of backed up storage space. Labs and projects that require more storage can purchase additional storage to be added to the existing storage controllers and systems. The storage is designed to ease data sharing among the Institute's researchers.

IGSP has separate installations of NetApp FAS 3000 series filers in three Duke locations. Disc shelves attached to the filers use NetApp's Fibre Channel architecture for high-performance storage and SATA discs for more capacious, moderate-performance storage. Upgrades to the system planned for 2011/2012 will use newer SAS discs for higher speed and increased data capacity.

Datasets that are no longer being analyzed can be placed in much cheaper storage that is not directly accessible to computation servers but that can be restaged for analysis quickly. This storage consists of a Dell R710 server and shelves of MD1200 disc arrays. It is RAID 6 storage with hot-swappable discs, so a failed disc can be replaced without taking the system offline. Although the device is not backed up to a separate location, the system architecture is very failure resistant and has 24/7 local and vendor support. This storage is also quite inexpensive, costing $450 per terabyte in FY 2010/2011.
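As a rough illustration of how these figures translate into capacity and cost, the Python sketch below estimates the usable space of a single RAID 6 group and the cost of archiving a dataset at the FY 2010/2011 rate. RAID 6 reserves the equivalent of two discs for parity, so a group of n discs yields n - 2 discs of usable space. The function names, the 12-disc group, and the 2-terabyte disc size are hypothetical values chosen for illustration, not a description of the IGSP configuration, and the sketch assumes the quoted rate applies per terabyte stored.

```python
# Illustrative sketch only: disc counts and sizes are hypothetical,
# not the actual IGSP archive configuration.

RATE_USD_PER_TB = 450.0  # FY 2010/2011 rate quoted in the text


def raid6_usable_tb(num_discs: int, disc_tb: float) -> float:
    """Usable capacity of one RAID 6 group.

    RAID 6 stores two discs' worth of parity, so usable space is
    (num_discs - 2) * disc_tb. At least four discs are required.
    """
    if num_discs < 4:
        raise ValueError("RAID 6 needs at least 4 discs")
    return (num_discs - 2) * disc_tb


def archive_cost_usd(terabytes: float, rate: float = RATE_USD_PER_TB) -> float:
    """Cost of archiving a dataset, assuming the rate is per terabyte stored."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * rate


# Hypothetical example: a group of twelve 2 TB discs.
usable = raid6_usable_tb(12, 2.0)
print(f"Usable capacity: {usable} TB")
print(f"Cost to archive 5 TB: ${archive_cost_usd(5.0):,.2f}")
```

Actual charges and usable capacity depend on how the arrays are configured and billed, so the IGSP IT team's figures take precedence over this estimate.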

Processing power fit for a wide range of projects

Computational power is tailored to fit researchers' requirements, and the infrastructure handles large and small projects. The infrastructure is designed to be flexible, with open access to IGSP researchers on computational servers outfitted with a broad range of bioinformatics and application development tools. Software not currently on IGSP machines can be installed on request.

Dell M1000 Compute Infrastructure

Five 8 CPU-core Intel machines, each fitted with 32 gigabytes of RAM, are available to researchers with regular and basic demand for computation. Additional computational resources are available by arrangement for more computationally intensive projects, such as high-throughput gene expression microarray or sequence analysis. Access to these dedicated devices is restricted to specific research groups. Special provisions have been made on both the storage and the computational infrastructure for protected sensitive electronic information, such as datasets that fall under HIPAA and HITECH regulation.

The IGSP computation and core infrastructure uses Dell 1950 series 1U machines, Dell 1955 and M1000 blade/enclosure systems and a Dell R900 device for proteomics analysis.

High performance computation is executed on the Duke Shared Cluster Resource (DSCR), a computational cluster of over 4,000 CPU-cores. This cluster is directly connected to IGSP's storage infrastructure via dedicated 10 GigE fibre, allowing for easy staging of large datasets. The cluster has all commonly used software, and systems administrators will install additional software on request. IGSP is a major contributor to the DSCR and has added computational servers funded by the NIH (grant number 1S10RR025590-01) and the North Carolina Biotechnology Center (grant number 2009-IDG-1002).
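To give a sense of what a 10 GigE link means for staging data, the Python sketch below estimates how long a dataset would take to move at a given link speed. A 10 gigabit-per-second link carries at most 1.25 gigabytes per second, so one terabyte needs at least 800 seconds at line rate. The function name and the `efficiency` parameter are assumptions introduced here to stand in for protocol overhead and storage throughput, which the text does not quantify.

```python
# Illustrative sketch only: real staging times depend on protocol
# overhead, filer throughput, and competing traffic.


def transfer_seconds(dataset_bytes: float,
                     link_gbps: float = 10.0,
                     efficiency: float = 1.0) -> float:
    """Estimated time to move a dataset over a network link.

    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- hypothetical fraction of line rate actually achieved (0-1]
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return dataset_bytes / bytes_per_second


one_terabyte = 1e12  # decimal terabyte, in bytes
print(f"1 TB at line rate: {transfer_seconds(one_terabyte) / 60:.1f} minutes")
print(f"1 TB at 50% efficiency: "
      f"{transfer_seconds(one_terabyte, efficiency=0.5) / 60:.1f} minutes")
```

The line-rate figure is a lower bound; measured throughput on the actual link would give a more realistic estimate.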

The IGSP computational infrastructure is Linux-based, since Linux is a widely adopted and very reliable platform for computational biologists. Use of open source software is encouraged, though projects also use proprietary software when it fits their research needs.

The IGSP computational team is also trained in "cloud" technologies and can set up customized computational infrastructure for special purposes, including computational servers with Tesla "Fermi" GPU processors or machines fitted with up to 64 gigabytes of RAM. Together with DSCR staff, the IGSP IT team is conducting a pilot project supported by the Kimmel Foundation to make high-performance computing services available to Duke Medicine researchers in a secure "local cloud" that is suitable for analysis of sensitive and protected data.

Immediately available bioinformatics software and development infrastructure

Commonly used software for sequence analysis, gene expression analysis, and proteomics is available to all researchers. The IT infrastructure is particularly well suited for application development by IGSP researchers, and a significant number of computational and software-development projects are underway, ranging from software for specialized analysis to enterprise-wide data management and data analysis systems.

Currently, IGSP IT staff are involved in externally funded projects to expand the tools used to establish data provenance and ensure the reproducibility of highly complex and computationally challenging genomic analysis.

Talented core IT staff

IGSP's seven-member IT staff includes individuals trained and certified in Oracle and MySQL databases, systems administration, information security, and bioinformatics. A third of the staff hold advanced degrees. Programming and database staff have extensive training and experience in biology labs. Staff have education and work experience at a broad range of organizations, including the European Bioinformatics Institute (EBI), Duke, Northwestern, Western Kentucky University, and the Rochester Institute of Technology.

IGSP and basic sciences faculty members have found that the programming staff in particular have unique talents that can be used in various research projects, and it is common practice for PIs to include members of the IGSP IT team in their grant proposals for special analysis and development projects. This is typically done by including staff effort in the budget. Arrangements of this kind can be made by contacting Mark DeLong well before a grant is submitted.

Systems administration staff serve on a round-the-clock emergency on-call rotation for the Institute's main computational and storage infrastructure. Desktop and routine service is provided during normal business hours.