You are on page 1of 32

Linux Virtual File System

Peter J. Braam

P.J.Braam/CMU -- 1

Aims
Present the data structures in Linux VFS Provide information about flow of control Describe methods and invariants needed to implement a new file system Illustrate with some examples

P.J.Braam/CMU -- 2

History
BSD implemented VFS for NFS: aim dispatch to different filesystems VMS had elaborate filesystem NT/Win95 have VFS type interfaces Newer systems integrate VM with buffer cache. File access

P.J.Braam/CMU -- 3

Linux Filesystems
Media based
ext2 - Linux native ufs - BSD fat - DOS FS vfat - win 95 hpfs - OS/2 minix - well. Isofs - CDROM sysv - Sysv Unix hfs - Macintosh affs - Amiga Fast FS NTFS - NTs FS adfs - Acorn-strongarm

Network
nfs Coda AFS - Andrew FS smbfs - LanManager ncpfs - Novell

Special ones

procfs -/proc umsdos - Unix in DOS userfs - redirector to user

P.J.Braam/CMU -- 4

Linux Filesystems (ctd)


Forthcoming: Linux serves (unrelated devfs - device file system to the VFS!)
DFS - DCE distributed FS NFS - user & kernel Coda AppleShare netatalk/CAP SMB - samba NCP - Novell

Varia:
cfs - crypt filesystem cfs - cache filesystem ftpfs - ftp filesystem mailfs - mail filesystem pgfs - Postgres versioning file system

P.J.Braam/CMU -- 5

Usefulness Linux is Obsolete


Andrew Tanenbaum

P.J.Braam/CMU -- 6

Linux VFS
Multiple interfaces build up VFS:
files dentries inodes superblock quota

File access

VFS can do all caching & provides utility fctns to FS FS provides methods to VFS; many are optional
P.J.Braam/CMU -- 7

User level file access


Typical user level types and code:
pathnames: /myfile file descriptors: fd = open(/myfile) attributes in struct stat: stat(/myfile, &mybuf), chmod, chown... offsets: write, read, lseek directory handles: DIR *dh = opendir(/mydir) directory entries: struct dirent *ent = readdir(dh)

P.J.Braam/CMU -- 8

VFS
Manages kernel level file abstractions in one format for all file systems Receives system call requests from user level (e.g. write, open, stat, link) Interacts with a specific file system based on mount point traversal Receives requests from other parts of the kernel, mostly from memory management
P.J.Braam/CMU -- 9

File system level


Individual File Systems
responsible for managing file & directory data responsible for managing meta-data: timestamps, owners, protection etc translates data between particular FS data: e.g. disk data, NFS data, Coda/AFS data VFS data: attributes etc in standard format e.g. nfs_getattr(.) returns attributes in VFS format, acquires attributes in NFS format to do so.
P.J.Braam/CMU -- 10

Anatomy of stat system call


sys_stat(path, buf) { dentry = namei(path); if ( dentry == NULL ) return -ENOENT; Establish VFS data inode = dentry->d_inode; rc =inode->i_op->i_permission(inode); if ( rc ) return -EPERM;

Call into inode layer of filesystem

Call into inode layer rc = inode->i_op->i_getattr(inode, buf); of filesystem dput(dentry); return rc; }
P.J.Braam/CMU -- 11

Anatomy of fstatfs system call


sys_fstatfs(fd, buf) { /* for things like df */ file = fget(fd); Translate fd to VFS if ( file == NULL ) return -EBADF; data structure superb = file->f_dentry->d_inode->i_super; rc = superb->sb_op->sb_statfs(sb, buf); return rc; } Call into superblock layer of filesystem

P.J.Braam/CMU -- 12

Data structures
VFS data structures for:
VFS handle to the file: inode (BSD: vnode) User instantiated file handle: file (BSD: file) The whole filesystem: superblock (BSD: vfs) A name to inode translation: dentry

P.J.Braam/CMU -- 13

Shorthand method notation


super block methods: sss_methodname inode methods: iii_methodname dentry methods: ddd_methodname file methods: fff_methodname

instead of : inode i_op lookup we write iii_lookup


P.J.Braam/CMU -- 14

namei
VFS
struct dentry *namei(parent, name) { if (dentry = d_lookup(parent,name)) ddd_hash(parent, name) ddd_revalidate(dentry)

FS

else
iii_lookup(parent, name) struct inode *iget(ino, dev) { /* try cache else .. */ sss_read_inode() }
P.J.Braam/CMU -- 15

Superblocks
Handle metadata only (attributes etc) Responsible for retrieving and storing metadata from the FS media or peers Struct superblocks hold things like:
device, blocksize, dirty flags, list of dirty inodes super operations wait queue pointer to the root inode of this FS
P.J.Braam/CMU -- 16

Super Operations (sss_)


Ops on Inodes:
read_inode put_inode write_inode delete_inode clear_inode notify_change

Superblock manips:
read_super (mount) put_super (unmount) write_super (unmount) statfs (attributes)

P.J.Braam/CMU -- 17

Inodes
Inodes are VFS abstraction for the file Inode has operations (iii_methods) VFS maintains an inode cache, NOT the individual FSs (compare NT, BSD etc) Inodes contain an FS specific area where:
ext2 stores disk block numbers etc AFS would store the FID

Extraordinary inode ops are good for dealing with stale NFS file handles etc.
P.J.Braam/CMU -- 18

Whats inside an inode - 1


list_head i_hash list_head i_list list_head i_dentry int i_count long i_ino int i_dev
{m,a,c}time {u,g}id mode size n_link

caching

Identifies file

Usual stuff

P.J.Braam/CMU -- 19

Whats inside an inode -2


superblock i_sb inode_ops i_op
wait objects, semaphore lock vm_area_struct pipe/socket info page information union { ext2fs_inode_info i_ext2 nfs_inode_info i_nfs coda_inode_info i_coda ..} u

Which FS

For mmap, networking waiting

FS Specific info: blocknos fids etc


P.J.Braam/CMU -- 20

Inode state
Inode can be on one or two lists:
(hash & in_use) or (hash & dirty ) or unused inode has a use count i_count

Transitions
unused hash: iget calls sss_read_inode
dirty in_use: sss_write_inode hash unused: call on sss_clear_inode, but if i_nlink = 0: iput calls sss_delete_inode when i_count falls to 0
P.J.Braam/CMU -- 21

Inode Cache
Players: 1. iget: if i_count>0 ++ 2. iput: if i_count>1 - 3. free_inodes 4. syncing inodes

Inode_hashtable
sss_clear_inode (freeing inos) or sss_delete_inode (iput) Fs storage sss_read_inode (iget) Fs storage Unused inodes

Dirty inodes
media fs only (mark_inode_dirty) Used inodes

sss_write_inode (sync one) Fs storage


P.J.Braam/CMU -- 22

Sales
Red Hat Software sold 240,000 copies of Red Hat Linux in 1997 and expects to reach 400,000 in 1998.

Estimates of installed servers (InfoWorld): - Linux: 7 million - OS/2: 5 million - Macintosh: 1 million
P.J.Braam/CMU -- 23

Inode operations (iii_)


lookup: return inode
creation/removal
create link unlink symlink mkdir rmdir mknod rename calls iget

symbolic links
readlink follow link

pages
readpage, writepage, updatepage - read or write page. Generic for mediafs. bmap - return disk block number of logical block

special operations

revalidate - see dentry sect truncate permission


P.J.Braam/CMU -- 24

Dentry world
Dentry is a name to inode translation structure Cached agressively by VFS Eliminates lookups by FS & private caches
timing on Coda FS: ls -lR 1000 files after priming cache linux 2.0.32: 7.2secs linux 2.1.92: 0.6secs

disk fs: less benefit, NFS even more

Negative entries! Namei is dramatically simplified

P.J.Braam/CMU -- 25

Inside dentrys
name pointer to inode pointer to parent dentry list head of children chains for lots of lists use count

P.J.Braam/CMU -- 26

Dentry associated lists


Legend: inode dentry
dentry tree relationship dentry inode relationship

inode I_dentry list head

inode i_dentry list head

= d_inode pointer
d_alias chains place: d_instantiate remove: dentry_iput

= d_parent pointer
d_child chains place: d_alloc remove: d_prune, d_invalidate, d_put
P.J.Braam/CMU -- 27

Dcache
dentry_hashtable (d_hash chains) dhash(parent, name) list head namei tries cache: d_lookup ddd_compare Success: ddd_revalidate d_invalidate if fails proceed if success Failure: iii_lookup find inode iget sss_read_inode finish: d_add can give negative entry in dcache
P.J.Braam/CMU -- 28

prune d_invalidate d_drop

namei iii_lookup d_add

unused dentries (d_lru chains)

Dentry methods
ddd_revalidate: can force new lookup ddd_hash: compute hash value of name ddd_compare: are names equal? ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity

P.J.Braam/CMU -- 29

Dentry particulars:
ddd_hash and ddd_compare have to deal with extraordinary cases for msdos/vfat:
case insensitive long and short filename pleasantries

ddd_revalidate -- can force new lookup if inode not in use:


used for NFS/SMBfs aging used for Coda/AFS callbacks
P.J.Braam/CMU -- 30

Style

Dijkstra probably hates me Linus Torvalds

P.J.Braam/CMU -- 31

Memory mapping
vm_area structure has
vm_operations inode, addresses etc.

vm_operations
map, unmap swapin, swapout nopage -- read when page isnt in VM

mmap
calls on iii_readpage keeps a use count on the inode until unmap
P.J.Braam/CMU -- 32

You might also like