GPFS overview
• GPFS: General Parallel File System
• GPFS is a high-performance, shared-disk file system that provides data access from the nodes in a cluster environment.
  [Diagram: nodes A, B, and C sharing access to GPFS storage over the SAN]
• Developed by IBM Research for IBM SP supercomputers
  – GPFS has been available on AIX, Linux, and Windows
  – Disks (data and metadata) are shared across all SAN nodes
  – Concurrent, parallel access to data and metadata
  – Files are accessed using standard UNIX interfaces and commands
Why choose GPFS?
• GPFS is highly available and fault tolerant
  – Data protection mechanisms include journaling, replication (similar to mirroring), and support for storage array copies (synchronous and asynchronous)
  – Heartbeat mechanism to recover from multiple disk, node, and connectivity failures
  – Recovery software mechanisms implemented in all layers
• GPFS is highly scalable (2000+ nodes)
  – Symmetric, scalable software architecture
  – Distributed metadata management
  – Allows for incremental scaling of the system (nodes, disk space) with ease
• GPFS is a high-performance file system
  – Large, tunable block size with wide striping (across nodes and disks)
  – Parallel access to files from multiple nodes
  – Efficient deep prefetching: read ahead, write behind
  – Recognizes access patterns (adaptable mechanism)
  – Highly multithreaded daemon
GPFS terminology: SAN, NSD, VSD
• SAN – Storage Area Network
  – The disk is visible as a local device from any node in the cluster, typically over a switched Fibre Channel network.
• VSD – Virtual Shared Disk
  – Remote access to the disk across a network: the disk is local to one or more nodes and remote to the others. I/O access is over the network or an IBM high-performance switch. VSD requires the rsct.vsd fileset.
• NSD – Network Shared Disk
  – The ability to make a raw LUN available to remote clients over TCP/IP.
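As an illustration of how NSDs are typically defined, the sketch below uses the classic colon-delimited disk descriptor file accepted by mmcrnsd; the device names, node names, and usage fields are hypothetical, and the exact descriptor syntax varies by GPFS release.
  # cat /tmp/gpfs_disks.txt                (hypothetical disk descriptor file)
  hdisk2:node1_priv::dataAndMetadata:1
  hdisk3:node2_priv::dataAndMetadata:2
  # mmcrnsd -F /tmp/gpfs_disks.txt         (turns the raw LUNs into NSDs)
  # mmlsnsd                                (verify the NSDs and their server nodes)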
• Any given disk can belong to only one file system
• One file system can have many disks
• When a file system is created on more than one disk, it is striped across the disks using the block size specified at file system creation. Possible block sizes: 16K, 64K, 256K, 512K, and 1024K (256K is the default)
• Many operations on GPFS can be done dynamically (see the sketch below), such as
  – Adding/deleting disks
  – Restriping (rebalancing)
  – Increasing the number of inodes
  – Adding/removing nodes
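A sketch of these operations using the standard GPFS administration commands; the file system name fs1, mount point, descriptor files, and node names are hypothetical, and the exact option syntax differs between GPFS releases.
  # mmcrfs /gpfs/fs1 fs1 -F /tmp/gpfs_disks.txt -B 256K    (create the file system with a 256K block size)
  # mmadddisk fs1 -F /tmp/new_disk.txt                     (add a disk to the file system online)
  # mmrestripefs fs1 -b                                    (rebalance existing data across all disks)
  # mmchfs fs1 -F 2000000                                  (raise the maximum number of inodes)
  # mmaddnode -N node3_priv                                (add a node to the cluster)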
GPFS terminology: Replication
• Replication is the duplication of data and/or metadata (usually both) on GPFS disks for failover support
• Requires 2x the storage
• This is GPFS synchronous "mirroring"; GPFS cannot mirror at the logical volume level
• Can be used with extended-distance RAC clusters
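Replication levels are normally set when the file system is created. The sketch below shows the standard mmcrfs replication options with hypothetical file system and descriptor names; two failure groups must exist in the disk descriptors for two-way replication to be effective.
  # mmcrfs /gpfs/fs1 fs1 -F /tmp/gpfs_disks.txt -B 256K -m 2 -M 2 -r 2 -R 2
    (-m/-M: default/maximum metadata replicas, -r/-R: default/maximum data replicas)
  # mmlsfs fs1 -m -r                       (display the current replication settings)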
GPFS terminology: Cluster data configuration file
• A primary cluster data server must be defined to act as the primary holder of the GPFS cluster configuration information file /var/mmfs/gen/mmsdrfs
• A secondary GPFS cluster data server is highly recommended
• If you do not have a secondary cluster data server and the node housing the primary data server fails, no changes can be made to the cluster configuration.
• The cluster data server is specified when the cluster is formed:
  #mmcrcluster -p node1_priv -s node2_priv…
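A slightly fuller sketch of forming the cluster, assuming a node descriptor file and ssh/scp as the remote shell and copy commands; the node names, designations, and file paths are hypothetical, and the available options depend on the GPFS release.
  # cat /tmp/nodes.txt                     (hypothetical node descriptor file)
  node1_priv:quorum-manager
  node2_priv:quorum-manager
  node3_priv:client
  # mmcrcluster -N /tmp/nodes.txt -p node1_priv -s node2_priv -r /usr/bin/ssh -R /usr/bin/scp
  # mmlscluster                            (verify the cluster definition and the data servers)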
GPFS terminology: Configuration manager
• The configuration manager checks for failure of components (hardware and software: network, adapters, disks, nodes, …)
  – Drives recovery from node failure within the cluster
  – Selects the file system manager, avoiding data corruption
• Disk leasing: a request from the node to the CfgMgr to renew its lease
  – leaseDuration: time a disk lease is granted from the CfgMgr to any node (default is 35 sec.)
  – leaseRecoveryWait: additional wait time to allow transactions to complete (default is 35 sec.)
• Pings: sent from the CfgMgr to a node when the node fails to renew its lease
  – PingPeriod: seconds between pings (default is 2 sec.)
  – totalPingTimeout: total ping time before giving up (default is 120 sec.)
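These timers are cluster configuration parameters. A minimal sketch of inspecting and adjusting them, assuming they are exposed through mmlsconfig/mmchconfig on your GPFS level; the parameter names follow the slide and may not all be externally tunable on every release.
  # mmlsconfig                             (list the current cluster configuration)
  # mmchconfig leaseRecoveryWait=35        (adjust the recovery wait time)
  # mmchconfig leaseDuration=35            (assumed tunable; not exposed on all releases)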
GPFS terminology: File system manager
• Processing changes to the file system: adding disks, changing disk availability, repairing the file system, and mounting or unmounting the file system
• Management of disk space allocation: controls which regions of the disks are allocated to each node, allowing effective parallel allocation of space
• Token management: coordinates access to files on shared disks by granting tokens that convey the right to read or write the data or metadata of a file
• Quota management: allocating disk blocks to nodes that are writing to the file system and comparing the allocated space to the quota limits at regular intervals
• If the node containing the file system manager fails, the configuration manager assigns the role to another node
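To see which node currently holds the file system manager role, and to move it when needed, GPFS provides mmlsmgr and mmchmgr; the file system and node names below are hypothetical.
  # mmlsmgr fs1                            (show the file system manager node for fs1)
  # mmchmgr fs1 node2_priv                 (move the file system manager role to node2_priv)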
GPFS and Oracle
• Consistent with the Oracle SAME methodology
  – Stripe And Mirror Everything
• Supports Direct I/O (see the sketch below)
  – Bypasses the AIX buffer cache and GPFS cache for additional performance
  – Lets Oracle manage and optimize buffer caching
  – Performance close to that of raw LVs
• Supports all files: Oracle Home, CRS Home, data files, log files, control files, backups, and so forth
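On the Oracle side, direct I/O against a file system is normally requested through the filesystemio_options initialization parameter. This is a hedged sketch, not a GPFS-specific requirement; whether 'SETALL' or 'DIRECTIO' is appropriate depends on the Oracle release and platform, and an instance restart is needed for the change to take effect.
  # sqlplus / as sysdba
  SQL> ALTER SYSTEM SET filesystemio_options='SETALL' SCOPE=SPFILE;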
CSS Heartbeat and GPFS • CSS has two heartbeat mechanisms : – Network heartbeat across interconnect to establish/confirm cluster membership • CSS misscount parameter • If network ping time > CSS misscount, node evicted • CSS misscount = 30seconds for UNIX+Oracle Clusterware – Disk heartbeat to voting device • – Internal I/O timeout interval (IOT) where an I/O to voting disk must complete • – If I/O to voting disk > IOT, node evicted
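As an illustration, the current CSS timeouts can be inspected with crsctl; the available sub-commands and whether the values can be changed safely depend on the Oracle Clusterware version, so treat this as a sketch.
  # crsctl get css misscount               (network heartbeat timeout, in seconds)
  # crsctl get css disktimeout             (voting disk I/O timeout, in seconds)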