
volume 45 NUMBER 4 APRIL 2013

EDITORIAL
339 Predicting the influence of common variants

FOCUS ON CANCER RISK


Cover art by Daniela Leitner http://www.danielaleitner.de/

FOREWORD
343 iCOGS collection provides a collaborative model

© 2013 Nature America, Inc. All rights reserved.

COMMENTARY
345 Turning of COGS moves forward findings for hormonally mediated cancers
Lori C Sakoda, Eric Jorgenson & John S Witte
349 Public health implications from COGS and potential for risk stratification and screening
Hilary Burton, Susmita Chowdhury, Tom Dent, Alison Hall, Nora Pashayan & Paul Pharoah
352 Research highlights

ARTICLES
353 Large-scale genotyping identifies 41 new loci associated with breast cancer risk K Michailidou, P Hall, A Gonzalez-Neira, M Ghoussaini, J Dennis, R L Milne, M K Schmidt, J Chang-Claude, S E Bojesen, M K Bolla, Q Wang, E Dicks, A Lee, C Turnbull, N Rahman, The Breast and Ovarian Cancer Susceptibility Collaboration, O Fletcher, J Peto, L Gibson, I dos Santos Silva, H Nevanlinna, T A Muranen, K Aittomki, C Blomqvist, K Czene, A Irwanto, J Liu, Q Waisfisz, H Meijers-Heijboer, M Adank, Hereditary Breast and Ovarian Cancer Research Group Netherlands (HEBON), R B van der Luijt, R Hein, N Dahmen, L Beckman, A Meindl, R K Schmutzler, B Mller-Myhsok, P Lichtner, J L Hopper, M C Southey, E Makalic, D F Schmidt, A G Uitterlinden, A Hofman, D J Hunter, S J Chanock, D Vincent, F Bacot, D C Tessier, S Canisius, L F A Wessels, C A Haiman, M Shah, R Luben, J Brown, C Luccarini, N Schoof, K Humphreys, J Li, B G Nordestgaard, S F Nielsen, H Flyger, F J Couch, X Wang, C Vachon, K N Stevens, D Lambrechts, M Moisse, R Paridaens, M-R Christiaens, A Rudolph, S Nickels, D Flesch-Janys, N Johnson, Z Aitken, K Aaltonen, T Heikkinen, A Broeks, L J Vant Veer, C E van der Schoot, P Gunel, T Truong, P Laurent-Puig, F Menegaux, F Marme, A Schneeweiss, C Sohn, B Burwinkel, M P Zamora, J I Arias Perez, G Pita, M R Alonso, A Cox, I W Brock, S S Cross, M W R Reed, E J Sawyer, I Tomlinson, M J Kerin, N Miller, B E Henderson, F Schumacher, L Le Marchand, I L Andrulis, J A Knight, G Glendon, A Marie Mulligan, kConFab Investigators, Australian Ovarian
Nature Genetics (ISSN 1061-4036) is published monthly by Nature Publishing Group, a trading name of Nature America Inc. located at 75 Varick Street, Fl 9, New York, NY 10013-1917. Periodicals postage paid at New York, NY and additional mailing post offices. Editorial Office: 75 Varick Street, Fl 9, New York, NY 10013-1917. Tel: (212) 726 9314, Fax: (212) 545 8341. Annual subscription rates: USA/Canada: US$225 (personal), US$4,677 (institution). Canada add 5% GST #140911595rt001; Euro-zone: 287 (personal), 3,713 (institution); Rest of world (excluding China, Japan, Korea): 185 (personal), 2,400 (institution); Japan: Contact NPG Nature Asia-Pacific, Chiyoda Building, 2-37 Ichigayatamachi, Shinjuku-ku, Tokyo 162-0843. Tel: 81 (03) 3267 8751, Fax: 81 (03) 3267 8746. POSTMASTER: Send address changes to Nature Genetics, Subscriptions Department, 75 Varick Street, 9th Floor, New York, NY 10013-1917. Authorization to photocopy material for internal or personal use, or internal or personal use of specific clients, is granted by Nature Publishing Group to libraries and others registered with the Copyright Clearance Center (CCC) Transactional Reporting Service, provided the relevant copyright fee is paid direct to CCC, 222 Rosewood Drive, Danvers, MA 01923, USA. Identification code for Nature Genetics: 1061-4036/04. Back issues: US$45, Canada add 7% for GST. CPC PUBAGREEMENT #40032744. Printed on acid-free paper by The Sheridan Press, Hanover, PA, USA. Copyright 2013 Nature Publishing Group. Printed in USA.

[Figure: Invited Commentary on COGS papers (p 345)]

[Figure: Forty-one novel breast cancer susceptibility loci (p 353). Axes: -log10(P) by chromosome.]

Cancer Study Group, A Lindblom, S Margolin, M J Hooning, A Hollestelle, A M W van den Ouweland, A Jager, Q M Bui, J Stone, G S Dite, C Apicella, H Tsimiklis, G G Giles, G Severi, L Baglietto, P A Fasching, L Haeberle, A B Ekici, M W Beckmann, H Brenner, H Müller, V Arndt, C Stegmaier, A Swerdlow, A Ashworth, N Orr, M Jones, J Figueroa, J Lissowska, L Brinton, M S Goldberg, F Labrèche, M Dumont, R Winqvist, K Pylkäs, A Jukkola-Vuorinen, M Grip, H Brauch, U Hamann, T Brüning, The GENICA (Gene Environment Interaction and Breast Cancer in Germany) Network, P Radice, P Peterlongo, S Manoukian, B Bonanni, P Devilee, R A E M Tollenaar, C Seynaeve, C J van Asperen, A Jakubowska, J Lubinski, K Jaworska, K Durda, A Mannermaa, V Kataja, V-M Kosma, J M Hartikainen, N V Bogdanova, N N Antonenkova, T Dörk, V N Kristensen, H Anton-Culver, S Slager, A E Toland, S Edge, F Fostira, D Kang, K-Y Yoo, D-Y Noh, K Matsuo, H Ito, H Iwata, A Sueta, A H Wu, C-C Tseng, D Van Den Berg, D O Stram, X-O Shu, W Lu, Y-T Gao, H Cai, S H Teo, C H Yip, S Y Phuah, B K Cornes, M Hartman, H Miao, W Yen Lim, J-H Sng, K Muir, A Lophatananon, S Stewart-Brown, P Siriwanarangsan, C-Y Shen, C-N Hsiung, P-E Wu, S-L Ding, S Sangrajrang, V Gaborieau, P Brennan, J McKay, W J Blot, L B Signorello, Q Cai, W Zheng, S Deming-Halverson, M Shrubsole, J Long, J Simard, M Garcia-Closas, P D P Pharoah, G Chenevix-Trench, A M Dunning, J Benitez & D F Easton

362 GWAS meta-analysis and replication identifies three new susceptibility loci for ovarian cancer
P D P Pharoah, Y-Y Tsai, S J Ramus, C M Phelan, E L Goode, K Lawrenson, M Buckley, B L Fridley, J P Tyrer, H Shen, R Weber, R Karevan, M C Larson, H Song, D C Tessier, F Bacot, D Vincent, J M Cunningham, J Dennis, E Dicks, Australian Cancer Study, Australian Ovarian Cancer Study Group, K K Aben, H Anton-Culver, N Antonenkova, S M Armasu, L Baglietto, E V Bandera, M W Beckmann, M J Birrer, G Bloom, N Bogdanova, J D Brenton, L A Brinton, A Brooks-Wilson, R Brown, R Butzow, I Campbell, M E Carney, R S Carvalho, J Chang-Claude, Y A Chen, Z Chen, W-H Chow, M S Cicek, G Coetzee, L S Cook, D W Cramer, C Cybulski, A Dansonka-Mieszkowska, E Despierre, J A Doherty, T Dörk, A du Bois, M Dürst, D Eccles, R Edwards, A B Ekici, P A Fasching, D Fenstermacher, J Flanagan, Y-T Gao, M Garcia-Closas, A Gentry-Maharaj, G Giles, A Gjyshi, M Gore, J Gronwald, Q Guo, M K Halle, P Harter, A Hein, F Heitz, P Hillemanns, M Hoatlin, E Høgdall, C K Høgdall, S Hosono, A Jakubowska, A Jensen, K R Kalli, B Y Karlan, L E Kelemen, L A Kiemeney, S K Kjaer, G E Konecny, C Krakstad, J Kupryjanczyk, D Lambrechts, S Lambrechts, N D Le, N Lee, J Lee, A Leminen, B K Lim, J Lissowska, J Lubiński, L Lundvall, G Lurie, L F A G Massuger, K Matsuo, V McGuire, J R McLaughlin, U Menon, F Modugno, K B Moysich, T Nakanishi, S A Narod, R B Ness, H Nevanlinna, S Nickels, H Noushmehr, K Odunsi, S Olson, I Orlow, J Paul, T Pejovic, L M Pelttari, J Permuth-Wey, M C Pike, E M Poole, X Qu, H A Risch, L Rodriguez-Rodriguez, M A Rossing, A Rudolph, I Runnebaum, I K Rzepecka, H B Salvesen, I Schwaab, G Severi, H Shen, V Shridhar, X-O Shu, W Sieh, M C Southey, P Spellman, K Tajima, S-H Teo, K L Terry, P J Thompson, A Timorek, S S Tworoger, A M van Altena, D Van Den Berg, I Vergote, R A Vierkant, A F Vitonis, S Wang-Gohrke, N Wentzensen, A S Whittemore, E Wik, B Winterhoff, Y L Woo, A H Wu, H P Yang, W Zheng, A Ziogas, F Zulkifli, M T Goodman, P Hall, D F Easton, C L Pearce, A Berchuck, G Chenevix-Trench, E Iversen, A N A Monteiro, S A Gayther, J M Schildkraut & T A Sellers

371 Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer
S E Bojesen, K A Pooley, S E Johnatty, J Beesley, K Michailidou, J P Tyrer, S L Edwards, H A Pickett, H C Shen, C E Smart, K M Hillman, P L Mai, K Lawrenson, M D Stutz, Y Lu, R Karevan, N Woods, R L Johnston, J D French,

[Figure: Twenty-three new prostate cancer susceptibility loci (p 385). Axes: -log10(P value) by chromosome.]

[Figure: Independent influence of FTO on melanoma and BMI (p 428). Regional association plot around rs16953002; x-axis: position on Chr. 16 (Mb).]

X Chen, M Weischer, S F Nielsen, M J Maranian, M Ghoussaini, S Ahmed, C Baynes, M K Bolla, Q Wang, J Dennis, L McGuffog, D Barrowdale, A Lee, S Healey, M Lush, D C Tessier, D Vincent, F Bacot, Australian Cancer Study, Australian Ovarian Cancer Study, Kathleen Cuningham Foundation Consortium for research into Familial Breast cancer (kConFab), Gene Environment Interaction and Breast Cancer (GENICA), Swedish Breast Cancer Study (SWE-BRCA), The Hereditary Breast and Ovarian Cancer Research Group Netherlands (HEBON), Epidemiological study of BRCA1 & BRCA2 Mutation Carriers (EMBRACE), Genetic Modifiers of Cancer Risk in BRCA1/2 Mutation Carriers (GEMO), I Vergote, S Lambrechts, E Despierre, H A Risch, A Gonzlez-Neira, M A Rossing, G Pita, J A Doherty, N lvarez, M C Larson, B L Fridley, N Schoof, J Chang-Claude, M S Cicek, J Peto, K R Kalli, A Broeks, S M Armasu, M K Schmidt, L M Braaf, B Winterhoff, H Nevanlinna, G E Konecny, D Lambrechts, L Rogmann, P Gunel, A Teoman, R L Milne, J J Garcia, A Cox, V Shridhar, B Burwinkel, F Marme, R Hein, E J Sawyer, C A Haiman, S Wang-Gohrke, I L Andrulis, K B Moysich, J L Hopper, K Odunsi, A Lindblom, G G Giles, H Brenner, J Simard, G Lurie, P A Fasching, M E Carney, P Radice, L R Wilkens, A Swerdlow, M T Goodman, H Brauch, M Garca-Closas, P Hillemanns, R Winqvist, M Drst, P Devilee, I Runnebaum, A Jakubowska, J Lubinski, A Mannermaa, R Butzow, N V Bogdanova, T Drk, L M Pelttari, W Zheng, A Leminen, H Anton-Culver, C H Bunker, V Kristensen, R B Ness, K Muir, R Edwards, A Meindl, F Heitz, K Matsuo, Andreas du Bois, A H Wu, P Harter, S-H Teo, I Schwaab, X-O Shu, W Blot, S Hosono, D Kang, T Nakanishi, M Hartman, Y Yatabe, U Hamann, B Y Karlan, S Sangrajrang, S K Kjaer, V Gaborieau, A Jensen, D Eccles, E Hgdall, C-Y Shen, J Brown, Y L Woo, M Shah, M A N Azmi, R Luben, S Z Omar, K Czene, R A Vierkant, B G Nordestgaard, H Flyger, C Vachon, J E Olson, X Wang, D A Levine, A Rudolph, R P Weber, D Flesch-Janys, E Iversen, S Nickels, J M 
Schildkraut, I D S Silva, D W Cramer, L Gibson, K L Terry, O Fletcher, A F Vitonis, C E van der Schoot, E M Poole, F B L Hogervorst, S S Tworoger, J Liu, E V Bandera, J Li, S H Olson, K Humphreys, I Orlow, C Blomqvist, L Rodriguez-Rodriguez, K Aittomki, H B Salvesen, T A Muranen, E Wik, B Brouwers, C Krakstad, E Wauters, M K Halle, H Wildiers, L A Kiemeney, C Mulot, K K Aben, P Laurent-Puig, A M van Altena, T Truong, L F A G Massuger, J Benitez, T Pejovic, J I A Perez, M Hoatlin, M P Zamora, L S Cook, S P Balasubramanian, L E Kelemen, A Schneeweiss, N D Le, C Sohn, A Brooks-Wilson, I Tomlinson, M J Kerin, N Miller, C Cybulski, B E Henderson, J Menkiszak, F Schumacher, N Wentzensen, L L Marchand, H P Yang, A M Mulligan, G Glendon, S A Engelholm, J A Knight, C K Hgdall, C Apicella, M Gore, H Tsimiklis, H Song, M C Southey, A Jager, A M W van den Ouweland, R Brown, J W M Martens, J M Flanagan, M Kriege, J Paul, S Margolin, N Siddiqui, G Severi, A S Whittemore, L Baglietto, V McGuire, C Stegmaier, W Sieh, H Mller, V Arndt, F Labrche, Y-T Gao, M S Goldberg, G Yang, M Dumont, J R McLaughlin, A Hartmann, A B Ekici, M W Beckmann, C M Phelan, M P Lux, J Permuth-Wey, B Peissel, T A Sellers, F Ficarazzi, M Barile, A Ziogas, A Ashworth, A Gentry-Maharaj, M Jones, S J Ramus, N Orr, U Menon, C L Pearce, T Brning, M C Pike, Y-D Ko, J Lissowska, J Figueroa, J Kupryjanczyk, S J Chanock, A Dansonka-Mieszkowska, A Jukkola-Vuorinen, I K Rzepecka, K Pylks, M Bidzinski, S Kauppila, A Hollestelle, C Seynaeve, R A E M Tollenaar, K Durda, K Jaworska, J M Hartikainen, V-M Kosma, V Kataja, N N Antonenkova, J Long, M Shrubsole, S Deming-Halverson, A Lophatananon, P Siriwanarangsan, S Stewart-Brown, N Ditsch, P Lichtner, R K Schmutzler, H Ito, H Iwata, K Tajima, C-C Tseng, D O Stram, D van den Berg, C H Yip, M K Ikram, Y-C Teh, H Cai, W Lu, L B Signorello, Q Cai, D-Y Noh, K-Y Yoo, H Miao, PT-C Iau, Y Y Teo, J McKay, C Shapiro, F Ademuyiwa, G Fountzilas, C-N Hsiung, J-C Yu, M-F Hou, C S Healey, 
C Luccarini, S Peock, D Stoppa-Lyonnet, P Peterlongo, T R Rebbeck, M Piedmonte, C F Singer, E Friedman, M Thomassen, K Offit, T V O Hansen, S L Neuhausen,

[Figure: Admixed populations to complete the genome (p 406)]

C I Szabo, I Blanco, J Garber, S A Narod, J N Weitzel, M Montagna, E Olah, A K Godwin, D Yannoukakos, D E Goldgar, T Caldes, E N Imyanitov, L Tihomirova, B K Arun, I Campbell, A R Mensenkamp, C J van Asperen, K E P van Roozendaal, H Meijers-Heijboer, J M Colle, J C Oosterwijk, M J Hooning, M A Rookus, R B van der Luijt, T A M van Os, D G Evans, D Frost, E Fineberg, J Barwell, L Walker, M J Kennedy, R Platte, R Davidson, S D Ellis, T Cole, B Bressac-de Paillerets, B Buecher, F Damiola, L Faivre, M Frenay, O M Sinilnikova, O Caron, S Giraud, S Mazoyer, V Bonadona, V Caux-Moncoutier, A Toloczko-Grabarek, J Gronwald, T Byrski, A B Spurdle, B Bonanni, D Zaffaroni, G Giannini, L Bernard, R Dolcetti, S Manoukian, N Arnold, C Engel, H Deissler, K Rhiem, D Niederacher, H Plendl, C Sutter, B Wappenschmidt, ke Borg, B Melin, J Rantala, M Soller, K L Nathanson, S M Domchek, G C Rodriguez, R Salani, D G Kaulich, M-K Tea, S S Paluch, Y Laitman, A-B Skytte, T A Kruse, U B Jensen, M Robson, A-M Gerdes, B Ejlertsen, L Foretova, S A Savage, J Lester, P Soucy, K B Kuchenbaecker, C Olswold, J M Cunningham, S Slager, V S Pankratz, E Dicks, S R Lakhani, F J Couch, P Hall, A N A Monteiro, S A Gayther, P D P Pharoah, R R Reddel, E L Goode, M H Greene, D F Easton, A Berchuck, A C Antoniou, G Chenevix-Trench & A M Dunning

LETTERS

[Figure: Sea lamprey genome (p 415)]

385

Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom genotyping array R A Eeles, A A Al Olama, S Benlloch, E J Saunders, D A Leongamornlert, M Tymrakiewicz, M Ghoussaini, C Luccarini, J Dennis, S Jugurnauth-Little, T Dadaev, D E Neal, F C Hamdy, J L Donovan, K Muir, G G Giles, G Severi, F Wiklund, H Gronberg, C A Haiman, F Schumacher, B E Henderson, L L Marchand, S Lindstrom, P Kraft, D J Hunter, S Gapstur, S J Chanock, S I Berndt, D Albanes, G Andriole, J Schleutker, M Weischer, F Canzian, E Riboli, T J Key, R C Travis, D Campa, S A Ingles, E M John, R B Hayes, P D P Pharoah, N Pashayan, K-T Khaw, J L Stanford, E A Ostrander, L B Signorello, S N Thibodeau, D Schaid, C Maier, W Vogel, A S Kibel, C Cybulski, J Lubinski, L Cannon-Albright, H Brenner, J Y Park, R Kaneva, J Batra, A Spurdle, J A Clements, M R Teixeira, E Dicks, A Lee, A M Dunning, C Baynes, D Conroy, M J Maranian, S Ahmed, K Govindasami, M Guy, R A Wilkinson, E J Sawyer, A Morgan, D P Dearnaley, A Horwich, R A Huddart, V S Khoo, C C Parker, N J Van As, C J Woodhouse, A Thompson, T Dudderidge, C Ogden, C S Cooper, A Lophatananon, A Cox, M C Southey, J L Hopper, D R English, M Aly, J Adolfsson, J Xu, S L Zheng, M Yeager, R Kaaks, W R Diver, M M Gaudet, M C Stern, R Corral, A D Joshi, A Shahabi, T Wahlfors, Teuvo L J Tammela, A Auvinen, J Virtamo, P Klarskov, B G Nordestgaard, M A Rder, S F Nielsen, S E Bojesen, A Siddiq, L M FitzGerald, S Kolb, E M Kwon, D M Karyadi, W J Blot, W Zheng, Q Cai, S K McDonnell, A E Rinckleb, B Drake, G Colditz, D Wokolorczyk, R A Stephenson, C Teerlink, H Muller, D Rothenbacher, T A Sellers, H-Y Lin, C Slavov, V Mitev, F Lose, S Srinivasan, S Maia, P Paulo, E Lange, K Cooney, A C Antoniou, D Vincent, F Bacot, D C Tessier, The COGSCancer Research UK GWASELLIPSE (part of GAME-ON) Initiative, The Australian Prostate Cancer Bioresource, The UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons Section of 
Oncology, The UK ProtecT (Prostate testing for cancer and Treatment) Study Collaborators, The PRACTICAL (Prostate Cancer Association Group to Investigate Cancer-Associated Alteration in the Genome) Consortium, Zsofia Kote-Jarai & Douglas F Easton


[Figure: ER-negative breast cancer risk loci (p 392). Regional plot around rs4245739; x-axis: chromosome 1 position (kb).]

392

Genome-wide association studies identify four ER-negativespecific breast cancer risk loci M Garcia-Closas, F J Couch, S Lindstrom, K Michailidou, M K Schmidt, M N Brook, N Orr, S K Rhie, E Riboli, H S Feigelson, L L Marchand, J E Buring, D Eccles, P Miron, P A Fasching, H Brauch, J Chang-Claude, J Carpenter, A K Godwin, H Nevanlinna, G G Giles, A Cox, J L Hopper, M K Bolla, Q Wang, J Dennis, E Dicks, W J Howat, N Schoof, S E Bojesen, D Lambrechts, A Broeks, I L Andrulis, P Gunel, B Burwinkel, E J Sawyer, A Hollestelle, O Fletcher, R Winqvist, H Brenner, A Mannermaa, U Hamann, A Meindl, A Lindblom, W Zheng, P Devillee, M S Goldberg, J Lubinski, V Kristensen, A Swerdlow, H Anton-Culver, T Drk, K Muir, K Matsuo, A H Wu, P Radice, S H Teo, X-O Shu, W Blot, D Kang, M Hartman, S Sangrajrang, C-Y Shen, M C Southey, D J Park, F Hammet, J Stone, L J Vant Veer, E J Rutgers, A Lophatananon, S Stewart-Brown, P Siriwanarangsan, J Peto, M G Schrauder, A B Ekici, M W Beckmann, I dos Santos Silva, N Johnson, H Warren, I Tomlinson, M J Kerin, N Miller, F Marme, A Schneeweiss, C Sohn, T Truong, P Laurent-Puig, P Kerbrat, B G Nordestgaard, S F Nielsen, H Flyger, R L Milne, J I A Perez, P Menndez, H Mller, V Arndt, C Stegmaier, P Lichtner, M Lochmann, C Justenhoven, Y-D Ko, The Gene Environmental Interaction and breast Cancer (GENICA) Network, T Muranen, K Aittomki, C Blomqvist, D Greco, T Heikkinen, H Ito, H Iwata, Y Yatabe, N N Antonenkova, S Margolin, V Kataja, V-M Kosma, J M Hartikainen, R Balleine, kConFab Investigators, C-C Tseng, D Van Den Berg, D O Stram, P Neven, A-S Dieudonn, K Leunen, A Rudolph, S Nickels, D Flesch-Janys, P Peterlongo, B Peissel, L Bernard, J E Olson, X Wang, K Stevens, G Severi, L Baglietto, C McLean, G A Coetzee, Y Feng, B E Henderson, F Schumacher, N V Bogdanova, F Labrche, M Dumont, C H Yip, N Aishah Mohd Taib, C-Y Cheng, M Shrubsole, J Long, K Pylks, A Jukkola-Vuorinen, S Kauppila, J A Knight, G Glendon, A M Mulligan, R A E M Tollenaar, C Seynaeve, M 
Kriege, M J Hooning, A M W van den Ouweland, C H M van Deurzen, W Lu, Y-T Gao, H Cai, S P Balasubramanian, S S Cross, M W R Reed, L Signorello, Q Cai, M Shah, H Miao, C W Chan, K S Chia, A Jakubowska, K Jaworska, K Durda, C-N Hsiung, P-E Wu, J-C Yu, A Ashworth, M Jones, D C Tessier, A Gonzlez-Neira, G Pita, M R Alonso, D Vincent, F Bacot, C B Ambrosone, E V Bandera, E M John, G K Chen, J J Hu, J L Rodriguez-Gil, L Bernstein, M F Press, R G Ziegler, R M Millikan, S L Deming-Halverson, S Nyante, S A Ingles, Q Waisfisz, H Tsimiklis, E Makalic, D Schmidt, M Bui, L Gibson, B Mller-Myhsok, R K Schmutzler, R Hein, N Dahmen, L Beckmann, K Aaltonen, K Czene, A Irwanto, J Liu, C Turnbull, Familial Breast Cancer Study (FBCS), N Rahman, H Meijers-Heijboer, A G Uitterlinden, F Rivadeneira, Australian Breast Cancer Tissue Bank (ABCTB) Investigators, C Olswold, S Slager, R Pilarski, F Ademuyiwa, I Konstantopoulou, N G Martin, G W Montgomery, D J Slamon, C Rauh, M P Lux, S M Jud, T Bruning, J E Weaver, P Sharma, H Pathak, W Tapper, S Gerty, L Durcan, D Trichopoulos, R Tumino, P H Peeters, R Kaaks, D Campa, F Canzian, E Weiderpass, M Johansson, K-T Khaw, R Travis, F Clavel-Chapelon, L N Kolonel, C Chen, A Beck, S E Hankinson, C D Berg, R N Hoover, J Lissowska, J D Figueroa, D I Chasman, M M Gaudet, W R Diver, W C Willett, D J Hunter, J Simard, J Benitez, A M Dunning, M E Sherman, G Chenevix-Trench, S J Chanock, P Hall, P D P Pharoah, C Vachon, D F Easton, C A Haiman & P Kraft

ANALYSIS
400 Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies N Chatterjee, B Wheeler, J Sampson, P Hartge, S J Chanock & J-H Park

[Figure: Polygenic models for genetic risk prediction (p 400)]

ARTICLES
406 Using population admixture to help complete maps of the human genome
G Genovese, R E Handsaker, H Li, N Altemose, A M Lindgren, K Chambert, B Pasaniuc, A L Price, D Reich, C C Morton, M R Pollak, J G Wilson & S A McCarroll

415 Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution OPEN
J J Smith, S Kuraku, C Holt, T Sauka-Spengler, N Jiang, M S Campbell, M D Yandell, T Manousaki, A Meyer, O E Bloom, J R Morgan, J D Buxbaum, R Sachidanandam, C Sims, A S Garruss, M Cook, R Krumlauf, L M Wiedemann, S A Sower, W A Decatur, J A Hall, C T Amemiya, N R Saha, K M Buckley, J P Rast, S Das, M Hirano, N McCurley, P Guo, N Rohner, C J Tabin, P Piccinelli, G Elgar, M Ruffier, B L Aken, S M J Searle, M Muffato, M Pignatelli, J Herrero, M Jones, C T Brown, Y-W Chung-Davidson, K G Nanlohy, S V Libants, C-Y Yeh, D W McCauley, J A Langeland, Z Pancer, B Fritzsch, P J de Jong, B Zhu, L L Fulton, B Theising, P Flicek, M E Bronner, W C Warren, S W Clifton, R K Wilson & W Li

[Figure: Loci controlling telomere length (p 422). Axes: -log10(P value) by chromosome.]

LETTERS
422 Identification of seven loci affecting mean telomere length and their association with disease
V Codd, C P Nelson, E Albrecht, M Mangino, J Deelen, J L Buxton, J J Hottenga, K Fischer, T Esko, I Surakka, L Broer, D R Nyholt, I M Leach, P Salo, S Hägg, M K Matthews, J Palmen, G D Norata, P F O'Reilly, D Saleheen, N Amin, A J Balmforth, M Beekman, R A de Boer, S Böhringer, P S Braund, P R Burton, A J M de Craen, M Denniff, Y Dong, K Douroudis, E Dubinina, J G Eriksson, K Garlaschelli, D Guo, A-L Hartikainen, A K Henders, J J Houwing-Duistermaat, L Kananen, L C Karssen, J Kettunen, N Klopp, V Lagou, E M van Leeuwen, P A Madden, R Mägi, P K E Magnusson, S Männistö, M I McCarthy, S E Medland, E Mihailov, G W Montgomery, B A Oostra, A Palotie, A Peters, H Pollard, A Pouta, I Prokopenko, S Ripatti, V Salomaa, H E D Suchiman, A M Valdes, N Verweij, A Viñuela, X Wang, H-E Wichmann, E Widen, G Willemsen, M J Wright, K Xia, X Xiao, D J van Veldhuisen, A L Catapano, M D Tobin, A S Hall, A I F Blakemore, W H van Gilst, H Zhu, CARDIoGRAM consortium, J Erdmann, M P Reilly, S Kathiresan, H Schunkert, P J Talmud, N L Pedersen, M Perola, W Ouwehand, J Kaprio, N G Martin, C M van Duijn, I Hovatta, C Gieger, A Metspalu, D I Boomsma, M-R Jarvelin, P E Slagboom, J R Thompson, T D Spector, P van der Harst & N J Samani

428 A variant in FTO shows association with melanoma risk not due to BMI
M M Iles, M H Law, S N Stacey, J Han, S Fang, R Pfeiffer, M Harland, S MacGregor, J C Taylor, K K Aben, L A Akslen, M-F Avril, E Azizi, B Bakker, K R Benediktsdottir, W Bergman, G B Scarr, K M Brown, D Calista, V Chaudru, M C Fargnoli, A E Cust, F Demenais, A C de Waal, T Debniak, D E Elder, E Friedman, P Galan, P Ghiorzo, E M Gillanders, A M Goldstein, N A Gruis, J Hansson, P Helsing, M Hocevar, V Hiom, J L Hopper, C Ingvar, M Janssen, M A Jenkins, P A Kanetsky, L A Kiemeney, J Lang, G M Lathrop, S Leachman, J E Lee, J Lubiński, R M Mackie, G J Mann, N G Martin, J I Mayordomo, A Molven, S Mulder, E Nagore, S Novakovic, I Okamoto, J H Olafsson, H Olsson, H Pehamberger, K Peris, M P Grasa, D Planelles, S Puig, J A Puig-Butille, Q-MEGA and A M F S Investigators, J Randerson-Moor, C Requena, L Rivoltini, M Rodolfo, M Santinami, B Sigurgeirsson, H Snowden, F Song, P Sulem, K Thorisdottir, R Tuominen, P Van Belle, N van der Stoep, M M van Rossum, Q Wei, J Wendt, D Zelenika, M Zhang, M T Landi, G Thorleifsson, D T Bishop, C I Amos, N K Hayward, K Stefansson, J A Newton Bishop & J H Barrett for the GenoMEL Consortium

433 Seven new loci associated with age-related macular degeneration
The AMD Gene Consortium

440 Somatic mutations in ATP1A1 and ATP2B3 lead to aldosterone-producing adenomas and secondary hypertension
F Beuschlein, S Boulkroun, A Osswald, T Wieland, H N Nielsen, U D Lichtenauer, D Penton, V R Schack, L Amar, E Fischer, A Walther, P Tauber, T Schwarzmayr, S Diener, E Graf, B Allolio, B Samson-Couterie, A Benecke, M Quinkler, F Fallo, P-F Plouin, F Mantero, T Meitinger, P Mulatero, X Jeunemaitre, R Warth, B Vilsen, M-C Zennaro, T M Strom & M Reincke

445 De novo mutations in the autophagy gene WDR45 cause static encephalopathy of childhood with neurodegeneration in adulthood
H Saitsu, T Nishimura, K Muramatsu, H Kodera, S Kumada, K Sugai, E Kasai-Yoshida, N Sawaura, H Nishida, A Hoshino, F Ryujin, S Yoshioka, K Nishiyama, Y Kondo, Y Tsurusaki, M Nakashima, N Miyake, H Arakawa, M Kato, N Mizushima & N Matsumoto

450 Sequencing ancient calcified dental plaque shows changes in oral microbiota with dietary shifts of the Neolithic and Industrial revolutions
C J Adler, K Dobney, L S Weyrich, J Kaidonis, A W Walker, W Haak, J A Bradshaw, G Townsend, A Sołtysiak, K W Alt, J Parkhill & A Cooper

456 The draft genome of the fast-growing non-timber forest species moso bamboo (Phyllostachys heterocycla) OPEN
Z Peng, Y Lu, L Li, Q Zhao, Q Feng, Z Gao, H Lu, T Hu, N Yao, K Liu, Y Li, D Fan, Y Guo, W Li, Y Lu, Q Weng, C C Zhou, L Zhang, T Huang, Y Zhao, C Zhu, X Liu, X Yang, T Wang, K Miao, C Zhuang, X Cao, W Tang, G Liu, Y Liu, J Chen, Z Liu, L Yuan, Z Liu, X Huang, T Lu, B Fei, Z Ning, B Han & Z Jiang

462 OsLG1 regulates a closed panicle trait in domesticated rice
T Ishii, K Numaguchi, K Miura, K Yoshida, P T Thanh, T M Htun, M Yamasaki, N Komeda, T Matsumoto, R Terauchi, R Ishikawa & M Ashikari

[Figure: Age-related macular degeneration loci (p 433). Axes: -log10 P by chromosome.]

[Figure: Somatic mutations as a cause of secondary hypertension (p 440). ATP1A1: Leu104Arg, Phe100_Leu104del, Val332Gly; ATP2B3: Val426_Val427del, Leu425_Val426del.]

Nature Genetics classified
See back pages.

Editorial

Predicting the influence of common variants



Progressively larger studies can account for an ever-larger proportion of the liability to common and complex disease. However, for most diseases, the sample sizes required to gain usable predictions will be out of reach of sequencing technologies for the foreseeable future. Genome-wide association studies (GWAS) based on array genotyping still offer a reliable harvest of biological hypotheses for many diseases, together with the secondary benefit of slowly improving prediction.

GWAS have an amazing track record in rapidly discovering the genetic contribution to over 700 common and complex diseases and phenotypes. Indeed, the technique may well have mapped out a large proportion of the regulatory variation associated with human traits. Still, by the benchmark of genetic epidemiologists, it has been slow to deliver. Together, the well-replicated SNPs do not account for all of the phenotypic variance that can be attributed to additive genetic variance (narrow-sense heritability). Because of this, the loci are of limited usefulness in risk prediction. We are simply not yet playing with a full deck.

Consortia of genetic epidemiologists now work on very large population samples. For example, in this issue, our Focus on cancer risk loci (p 343) reports the results of genotyping ~200,000 SNPs in a total of ~200,000 individuals. The implications of these studies are explored in two Commentaries (pp 345, 349) and in detail online in our editorial threads (http://nature.com/ng/focuses/icogs) linking the coordinated COGS publications. These publications roughly double the harvest of loci associated with these cancers and finally make clinical prediction a testable reality. For three common cancers, breast (p 353), ovarian (p 362) and prostate (p 385), genetic variants now explain about a third of the familial relative risk (link to Primer 1 online). With each of these studies now able to identify the individuals at the greatest genetic risk, SNP genotypes can be used in stratification approaches and tested in population screening (p 349).

Variants contributing to disease can be found by considering not only the replicated variants of significant effect but also all genotyped variants, using polygenic analytical methods that take into account the much larger set of contributory SNPs (Nat. Genet. 42, 565–569, 2010). In this issue, Nilanjan Chatterjee and colleagues (p 400) show that the predictive accuracy attained by larger studies is limited not only by the samples available to train the polygenic model but also by the distribution of effect sizes of the genetic variants themselves. For some diseases, prediction is readily achievable, but in no case do they anticipate that common SNPs will fully account for all of the genetic variance, even when the thousands of variants with individually undetectable effect sizes are included. For many diseases, the polygenic model suggests that GWAS deliver no more prediction accuracy beyond studies of 100,000–200,000 individuals, but for a few conditions, such as coronary artery disease (p 422), it may be useful to continue studies to five times that size.

Risk prediction has many different aims. For most diseases, it should be possible to identify the individuals with the highest genetic risk. However, if the aim is to identify all individuals with just twice the mean population risk, SNPs cannot currently do that: for most diseases, only a small proportion of such individuals can be identified genetically.

Risk locus discovery can be an iterative process, with subtypes of disease being initially lumped together or discovered in the process. Loci that have larger effect sizes in disease subtypes than in the broader condition can be more valuable in predictive classification.

Although we currently deal with diseases and loci independently, common diseases themselves may not be independently distributed in the population. For example, James Sorace and colleagues (Popul. Health Manag. 14, 161–166, 2011) examined the conditions for which over 32 million US citizens billed the Medicare system in 2008, classifying the population by the combinations of two or more conditions for which they were treated in that one year. They found that the population divided approximately into thirds. One third of the population, with no recorded treatment, accounted for 6% of the expenditure. The next third accounted for 15% of the expenditure on the 100 most common disease combinations, and the final third, with larger numbers of disease combinations, accounted for 79% of the expenditure. The healthcare cross section was also very diverse. Sixty percent of the beneficiaries (and 90% of the expenditure) were accounted for by over 2 million disease combinations comprising one of the 20 most prevalent conditions together with one or more other conditions. To us, this study suggests that, if prediction is to be used in the real world, it will be interesting to examine the genetic risk profiles for common diseases mapped so far by GWAS in a cross section of health care users.


nature genetics | volume 45 | NUMBER 4 | APRIL 2013


FOREWORD

iCOGS collection provides a collaborative model


We are pleased to present this iCOGS Focus comprising a collection of papers by the COGS (Collaborative Oncological Gene-environment Study) Consortium. This represents a significant advance in our understanding of genetic susceptibility to three hormone-related cancers: breast, ovarian and prostate.
2013 Nature America, Inc. All rights reserved.

To put this collection of 13 coordinated research papers from the COGS (Collaborative Oncological Gene-environment Study) Consortium into context, we have commissioned two accompanying Commentaries. On page 345, John Witte and colleagues survey all of the COGS studies in this collection. Hilary Burton and colleagues provide a public health perspective on these studies (page 349). Burton reports on the efforts of the Foundation for Genomics and Population Health (PHG Foundation) to consider the potential for genetic risk prediction, based on currently known genetic susceptibility loci for breast, ovarian and prostate cancers and considerations for population-based risk screening programs. Finally, we have highlighted a selection of the eight coordinated research publications from COGS published in Nature Communications, The American Journal of Human Genetics, Human Molecular Genetics and PLoS Genetics (page 352).

Together, these papers roughly double the number of susceptibility loci associated with breast, ovarian and prostate cancers. In this issue, Douglas Easton and colleagues report 41 loci newly associated with breast cancer (page 353), Rosalind Eeles and colleagues report 23 loci newly associated with prostate cancer (page 385), and Paul Pharoah and colleagues report 3 loci newly associated with ovarian cancer (page 362). Together with the additional publications in this collection, the authors report a combined total of 74 new susceptibility loci for these cancers, as well as fine-mapping and follow-up functional experiments.

Each of these three studies began with a large-scale genome-wide association study (GWAS) and meta-analysis. The five cancer-specific consortia that comprise COGS selected SNPs showing promising association in each GWAS to include on a custom genotyping array, the iCOGS array, which they developed in coordination with Illumina.
The COGS authors also nominated additional variants within regions of particular interest to include on the iCOGS array. They then conducted the replication phase for each of the studies with the shared iCOGS array. This study design provided efficient genotyping for large case-control samples and replication with the high-density iCOGS array designed with content selected from GWAS findings across the three cancers. The benefit of this design is evident in the large yield of new susceptibility loci for each of the cancers studied.

The COGS project serves as an excellent model for collaboration among consortia of consortia. The groups pooled their resources in order to design the single shared custom array in collaboration with Illumina. They also coordinated their ongoing efforts to characterize genetic susceptibility to a range of common cancers. In a similar spirit of cooperation, we were pleased to work with the authors to carry out the coordinated review and publication of this collection of manuscripts from COGS. We are grateful to our sponsor Illumina, whose support has provided for freely available access to the papers in this Focus and an accompanying website for the next 6 months. The iCOGS website accompanying this Focus offers additional content and analysis in the form of five Primers, hypertext essays that provide a guided tour through the entire collection of COGS publications. This new publishing format interlaces editorial analysis with threads (the latter is a format that we first used for the ENCODE website, comprising a series of direct quotations from relevant sections of the original research publications). In the Primers, we discuss the relevance of these studies' primary findings to genetic susceptibility to these three hormone-related cancers and the heritability explained (doi:10.1038/ngicogs.1), provide an analysis of the shared susceptibility regions (doi:10.1038/ngicogs.2), give a guide to subsequent functional annotation and mechanistic interpretation (doi:10.1038/ngicogs.3), examine genetic risk estimates and considerations for the development of population-based screening programs (doi:10.1038/ngicogs.5), and describe the history of the COGS consortium and the development of the iCOGS array and discuss what the authors have planned for future studies (doi:10.1038/ngicogs.4). We hope that you will find this printed Focus, as well as the accompanying website, a useful guide to this milestone in genetic epidemiology.
We look forward to receiving your feedback on how you use these materials and hope that this will enable your own collaborative research.
Orli G Bahcall


COMMENTARY

Turning of COGS moves forward findings for hormonally mediated cancers


Lori C Sakoda1,2, Eric Jorgenson1 & John S Witte3,4

1Division of Research, Kaiser Permanente Northern California, Oakland, California, USA. 2Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 3Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California, USA. 4Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA. Correspondence should be addressed to J.S.W. (jwitte@ucsf.edu).

The large-scale Collaborative Oncological Gene-environment Study (COGS) presents new findings that further characterize the genetic bases of breast, ovarian and prostate cancers. We summarize and provide insights into this collection of papers from COGS and discuss the implications of the results and future directions for such efforts.

Important discoveries from the COGS project are presented in over a dozen coordinated papers1–13, with five appearing in this issue of Nature Genetics1–5. This mega-consortium of >200,000 individuals was assembled to further characterize the genetic and environmental bases of breast, ovarian and prostate cancers. A custom genotyping array with ~211,000 SNPs (termed iCOGS), designed specifically to follow up previous results of genome-wide association studies (GWAS) and candidate gene association studies, was employed as part of this initiative. The resulting papers report over 70 new susceptibility loci for these 3 hormone-related cancers. This accomplishment highlights the value of following up initial discovery efforts with concerted, large-scale collaborative projects such as COGS. Such findings are proving increasingly valuable for clarifying the underlying mechanisms of carcinogenesis and developing clinically relevant cancer prediction models.

Figure 1 Participating consortia in the COGS mega-consortium: BCAC, OCAC, PRACTICAL and CIMBA.

Custom iCOGS array design

Individuals studied in this mega-consortium were collected from four large, established consortia: BCAC (the Breast Cancer Association Consortium), OCAC (the Ovarian Cancer Association Consortium), PRACTICAL (the Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome) and CIMBA (the Consortium of Investigators of Modifiers of BRCA1/2) (Fig. 1). Investigators of each consortium selected SNPs for inclusion on the iCOGS array, in particular, markers (i) associated with cancer susceptibility or survival in previous GWAS, including those specific to particular subtypes (for example, aggressive prostate cancer) and subgroups (for example, BRCA1 and BRCA2 mutation carriers); (ii) for fine mapping genomic regions of interest to each cancer and across cancers (for example, 8q24 region, TERT, CDKN2A–CDKN2B and ESR1); (iii) associated with cancer-related quantitative traits (for example, age at menarche and mammographic density); (iv) in selected candidate genes or pathways; and (v) associated with other cancers (for example, lung, endometrial, melanoma or testicular). These SNPs were classified into one of three categories: GWAS replication, fine-mapping and candidate SNPs. Space on the iCOGS array was shared among the consortia, with approximate initial allocations of 25% each to BCAC, OCAC and PRACTICAL, 17.5% to CIMBA and 7.5% to markers of mutual interest. The iCOGS SNPs were selected to enhance SNP genotyping success, with the majority having Illumina design scores of 0.8. SNPs were chosen preferentially in the following order: (i) SNPs previously genotyped by Illumina (with design scores of 1.1); (ii) SNPs with linkage disequilibrium (LD) of r² = 1 with the index (previously associated) SNP and the best

design score; and (iii) SNPs with r² > 0.8 with the index SNP and the best design score. SNPs in strong LD with other selected SNPs (r² > 0.9) were excluded, although, for GWAS-identified SNPs with association P value < 1 × 10⁻⁵, two surrogate SNPs were also included. The final set of SNPs was compiled by first including the selected fine-mapping SNPs, followed by the addition of selected GWAS replication and candidate SNPs. The penultimate list included 220,123 SNPs, and, of these, 211,155 were successfully included on the iCOGS array.

Overview of the findings

The findings from ten of the coordinated COGS papers are presented in Table 1. Of the five published in this issue of Nature Genetics, four involved GWAS meta-analysis and validation using iCOGS data1–4, and one entailed fine mapping and functional analysis of the TERT region (5p15.33) in relation to mean telomere length and the risk of breast and ovarian cancers5. These studies detected new loci associated with overall risk of breast cancer (n = 41)1, ovarian cancer (n = 2)3 and prostate cancer (n = 23)4. A few additional loci were identified specifically for risk of estrogen receptor (ER)-negative breast cancer (n = 4)2 and for serous ovarian cancer (n = 1)3. In the TERT region, several distinct SNP associations for breast cancer, ovarian cancer and telomere length were found, underscoring the complex interplay of common variants across this genomic region in carcinogenesis and telomere maintenance5.

Table 1 COGS overview: study design, characteristics and published results

GWAS meta-analysis and validation (new genetic loci identified: count, effect size and minor allele frequency)

| Phenotype: cancer subtypes and subgroups | Sample size | SNPs tested by iCOGS | Count | Effect size (b) | Minor allele frequency |
|---|---|---|---|---|---|
| Breast cancer (ref. 1) | Meta-analysis (a): 10,052 cases, 12,575 controls; follow-up (a): 45,290 cases, 41,880 controls | 29,807 | 41 | 1.05–1.26 | 0.008–0.48 |
| Breast cancer: ER negative (ref. 2) | Meta-analysis: 4,193 cases, 35,194 controls; follow-up: 6,514 cases, 41,455 controls | 13,276 | 4 | 1.10–1.14 | 0.24–0.41 |
| Breast cancer: BRCA1 mutation carriers (ref. 6) | Stages 1–2: 11,705 carriers (5,920 affected); stage 3: 2,646 carriers (1,394 affected) | 31,812 | 1 | 1.14 | 0.31 |
| Breast cancer: BRCA2 mutation carriers (ref. 7) | 8,211 carriers (3,881 affected) | 18,086 | 1 | 1.17 | 0.35 |
| Breast cancer: East Asians (ref. 10) | Meta-analysis: 23,637 cases, 25,580 controls | 70 | 16 | 1.06–1.16 | 0.04–0.48 |
| Ovarian cancer (ref. 3) | Meta-analysis: 7,931 cases, 9,216 controls; follow-up: 18,174 cases, 26,134 controls | 22,252 | 2 | 1.10–1.19 | 0.07–0.31 |
| Ovarian cancer: serous (ref. 3) | Follow-up: 10,316 cases, 26,134 controls | 22,252 | 1 | 1.12 | 0.37 |
| Ovarian cancer: BRCA1 mutation carriers (ref. 6) | Stages 1–2: 11,705 carriers (1,839 affected); stage 3: 2,646 carriers (442 affected) | 31,812 | 2 | 1.20–1.27 | 0.19–0.48 |
| Prostate cancer (ref. 4) | Meta-analysis (a): 11,085 cases, 11,463 controls; follow-up (a): 19,662 cases, 19,715 controls | 72,157 | 23 | 1.06–1.15 | 0.08–0.50 |

Fine-mapping studies of the 5p15 locus and fine-mapping analysis of the 11q13 locus (independent association signals: count, effect size and minor allele frequency)

| Phenotype: cancer subtypes and subgroups | Sample size | SNPs typed or imputed | Count | Effect size (b) | Minor allele frequency |
|---|---|---|---|---|---|
| 5p15: telomere length (ref. 5) | 15,567 women | 480 | 2 | 1.010–1.019 (c) | 0.29–0.33 |
| 5p15: breast cancer (ref. 5) | 46,451 cases, 42,599 controls | 480 | 2 | 1.06 | 0.30–0.43 |
| 5p15: breast cancer, ER negative (ref. 5) | 7,435 cases, 41,575 controls | 480 | 2 | 1.10–1.15 | 0.20–0.30 |
| 5p15: breast cancer, ER positive (ref. 5) | 27,074 cases, 41,749 controls | 480 | 1 | 1.05 | 0.28 |
| 5p15: breast cancer, BRCA1 mutation carriers (ref. 5) | 11,705 carriers | 480 | 2 | 1.09–1.16 | 0.26–0.27 |
| 5p15: ovarian cancer, serous (ref. 5) | 8,371 cases, 23,491 controls | 480 | 1 | 1.15 | 0.26 |
| 5p15: ovarian cancer, serous low malignant potential (ref. 5) | 986 cases, 23,491 controls | 480 | 1 | 1.51 | 0.33 |
| 5p15: prostate cancer (ref. 8) | 22,301 cases, 22,320 controls | 1,228 | 4 | 1.06–1.20 | NR |
| 11q13: breast cancer, ER positive (ref. 9) | 89,050 women | 4,405 | 3 | 1.07–1.38 | 0.06–0.26 |

(a) Limited to Europeans. (b) Effect size estimates converted to >1.0. (c) Fold change in telomere length per minor allele. NR, not reported.

The companion papers published in PLoS Genetics, The American Journal of Human Genetics, Human Molecular Genetics and Nature Communications expand this body of work, focusing on identifying susceptibility loci for breast, ovarian and prostate cancers in specific subpopulations or in targeted regions of interest. New risk-modifying loci were identified for breast (n = 1) and ovarian

(n = 2) cancers in BRCA1 mutation carriers and for breast cancer (n = 1) in BRCA2 mutation carriers6,7. Almost half of the 70 known loci associated with overall breast cancer risk in women of European ancestry were also associated with risk in east Asian women10. In addition, certain environmental factors, specifically alcohol consumption and parity, seem to modify the association between some common variants and breast cancer risk12. A fine-mapping analysis of the TERT region identified multiple independent SNP associations with risk of prostate cancer, including one associated with TERT expression in normal prostate tissue8. Other specific regions examined were 11q13, HNF1B (17q12) and microRNA-binding sites across the genome. Fine mapping of the 11q13 region in relation to breast cancer risk identified three independent association signals, with the top SNPs mapping to enhancer and silencer elements that regulate CCND1 expression9. An epigenetic analysis characterizing HNF1B variation in relation to ovarian cancer risk showed that different variants influence susceptibility to the serous and clear-cell subtypes11. Investigating the association between variants in putative microRNA-binding sites and ovarian cancer risk pointed to a new susceptibility locus at 17q21.13 (ref. 13), which was also identified as one of the two new risk-modifying loci for ovarian cancer in BRCA1 mutation carriers6.

The COGS efforts represent a major contribution to the understanding of inherited susceptibility to breast, ovarian and prostate cancers. With the opportunity to examine much larger populations, it is increasingly evident that many common, low-penetrance variants influence interindividual risk for these cancers, with some specific to particular disease subtypes and subgroups. There is
also greater evidence that, even within a single genomic region, different variants can influence risk for distinct histological subtypes of a specific cancer and that the same genomic region can influence risk for cancer at multiple organ sites. The new insights gained can now be leveraged to pursue more focused investigations, particularly of the potential molecular mechanisms underlying these observations.

Pleiotropy and biological mechanisms

A number of findings from COGS (alone or in conjunction with results from previous studies) support the existence of carcinogenic pleiotropy (Fig. 2). Such overlap between genetic susceptibility loci for breast, ovarian and prostate cancers may not be unexpected, as these sex-specific cancers are thought to share a hormonal etiology. The fine-mapping studies of 5p15.33 identified multiple association signals for breast and ovarian cancers, both for overall disease and by disease subtype, and for prostate cancer5,8. Of the variants detected, including highly correlated surrogates, several showed overlap across these three cancers: rs2242652 (or rs10069690) was associated with breast cancer (ER negative and BRCA1 mutation carriers), serous ovarian cancer and prostate cancer, and rs2736107 (or rs2736108 or rs2736109) and rs2853669 were associated with breast and prostate cancers. Slightly complicating matters, however, the minor allele of rs2242652 was associated with higher breast and ovarian cancer risks but with lower prostate cancer risk, whereas the minor alleles of rs2736107 and rs2853669 were associated with lower breast cancer risk but with higher prostate cancer risk. Variants at 5p15.33 have also been associated with other cancers in previous GWAS14–20. Additional insight about the functional relevance of variation in this region could be acquired by examining and comparing the extent to which telomere maintenance and other TERT-mediated functions influence susceptibility across these cancers.

Similarly, the COGS GWAS analyses identified several pleiotropic regions shared by breast and prostate cancers, including 1q32 (MDM4), 4q24 (TET2) and 14q24 (RAD51B, also known as RAD51L1)1,2,4,6. These analyses also provided support for the existence of regions of shared susceptibility between ovarian and breast cancers at 8q24, 10p12 (MLLT10) and 19q13 (MERIT40) and between ovarian and prostate cancers at 17q12 (HNF1B)1–4. On the basis of the strongest candidate genes in these identified regions, potential mechanisms of carcinogenic pleiotropy include inhibition of cell cycle arrest and apoptosis, impaired DNA repair and myelopoiesis regulation. Notably, all are generally known to influence carcinogenesis.
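The shared regions named in the two paragraphs above can be tabulated as a small lookup, which makes the pairwise overlap between cancers easy to read off. The following Python sketch is illustrative only: it encodes just the regions listed in this section (not the full set of loci in Fig. 2), and the names `SHARED_REGIONS` and `regions_by_pair` are hypothetical, introduced here for the example.

```python
from collections import defaultdict
from itertools import combinations

# Regions of shared susceptibility as listed in the text above; this is
# not an exhaustive catalogue of pleiotropic loci.
SHARED_REGIONS = {
    "5p15.33 (TERT)": {"breast", "ovarian", "prostate"},
    "1q32 (MDM4)": {"breast", "prostate"},
    "4q24 (TET2)": {"breast", "prostate"},
    "14q24 (RAD51B)": {"breast", "prostate"},
    "8q24": {"breast", "ovarian"},
    "10p12 (MLLT10)": {"breast", "ovarian"},
    "19q13 (MERIT40)": {"breast", "ovarian"},
    "17q12 (HNF1B)": {"ovarian", "prostate"},
}


def regions_by_pair(regions):
    """Group regions by every pair of cancers that shares them."""
    pairs = defaultdict(list)
    for region, cancers in regions.items():
        for pair in combinations(sorted(cancers), 2):
            pairs[pair].append(region)
    return dict(pairs)


if __name__ == "__main__":
    for pair, shared in sorted(regions_by_pair(SHARED_REGIONS).items()):
        print(" & ".join(pair), "->", ", ".join(shared))
```

A region shared by all three cancers, such as 5p15.33, is counted once under each of the three pairs.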
Figure 2 Pleiotropy among different cancers detected by COGS and previous association studies. Risk-associated loci for each cancer are indicated by chromosomal location, and sharing is indicated by colored lines connecting different cancers. For example, loci at 8q24 are associated with breast, ovarian, prostate, colon and bladder cancers and with chronic lymphocytic leukemia (CLL) (light-blue lines). [Cancer sites labeled in the figure include testis, brain, breast, ovarian, pancreas, prostate, melanoma, CLL, bladder, colon, lung, endometrium and kidney.]

Across the COGS studies, various functional analyses were undertaken to identify potential causal variants and to clarify the underlying biological mechanisms. These analyses were largely conducted in silico using publicly available data (for example, The Cancer Genome Atlas (TCGA), the Encyclopedia of DNA Elements (ENCODE) and the Catalogue of Somatic Mutations in Cancer (COSMIC)). Most commonly, expression and/or methylation quantitative trait locus (QTL) analyses of the top loci were conducted to correlate genotypes with the expression levels and/or methylation patterns of nearby genes in disease-relevant tissues1–3,5–9,11,13. Although some analyses provided more informative results than others, further insight into the key causal genes and mechanisms contributing to individual cancers may be gained in time with the rapid expansion of genomic data in the public domain. However, if one aims to fulfill the recommendations for post-GWAS functional characterization of cancer-associated loci previously published in Nature Genetics, additional steps should be pursued21.

Implications of the results

The newly identified susceptibility loci explain an increasing proportion of the familial risk of these cancers. Taking these new loci into account, the proportion of familial risk explained by common genetic loci is now estimated at 28% for breast cancer1, 4% for ovarian cancer3 and 30% for prostate cancer4. However, because there remains substantial unexplained heritability of these cancers, a number of risk-conferring loci have yet to be discovered. Continuing to increase both sample size and the rarity of the SNPs measured will help, although the marginal contribution of each variant may decrease (that is, larger sample sizes will detect associations with small effects). Findings from the GIANT Consortium suggest that data from approximately 500,000 individuals are required to explain 15% of the variability in height, a highly heritable trait22. The potential for further discovery may also be constrained by sample size, particularly for the less common disease subtypes and subpopulations, although the COGS papers suggest that new associations with modest effects can be detected by refining phenotypic categories. Given the iCOGS array design, the majority of the newly identified risk-associated loci, as expected, were common, low-penetrance SNPs. Therefore, the extent to which rare variants are associated with the three cancers remains unclear. For example, as noted by Bojesen et al.5, the iCOGS array did not type all SNPs in the 5p15.33 (TERT) region, and, despite additional genotyping and imputation, some gaps in coverage likely existed owing to low LD in this region. Further interrogation is thus essential to establish whether the variants
identified in this and other analyses are truly causal. Efforts at designing another array with more variants based on sequencing data, such as those from the 1000 Genomes Project, or direct sequencing will be valuable in addressing this issue.

The results principally pertain to persons of European ancestry, the ancestry group predominantly studied. Some of the papers summarize the results of identified variants and cancer risk in other ancestry groups, but only Zheng et al.10 conducted analyses focused on identifying cancer-associated variants in a non-European population. Because the iCOGS array was designed largely on the basis of GWAS data for European populations, its usefulness to studies of non-European populations remains uncertain. In assessing whether risk loci for breast cancer identified in Europeans extend to east Asians, however, Zheng et al.10 detected similar associations for almost half of the variants examined and confirmed associations for five variants previously identified in east Asians.

Many of the papers also suggest that multiple genomic regions are pleiotropic for cancer, as described above. This intriguing observation, which carries important biological and therapeutic implications, argues for using the iCOGS array and other consortia-developed arrays in conducting genetic susceptibility studies of other cancers.

Finally, identifying new susceptibility loci may help with risk stratification, leading to improved strategies for cancer prevention and early detection, and provide new knowledge about the molecular basis of cancer that ultimately results in better treatment and improved patient outcomes. However, as the implications of such genomic information for clinical practice remain unclear, well-designed studies are critical to establish its translational relevance.

Future directions

The success of COGS in detecting new genetic associations for all three cancers demonstrates the tremendous value of continued large-scale scientific collaboration and the applied usefulness of cross-cutting arrays designed for mega-analysis. These studies also provide model strategies for designing future arrays. For example, a new custom genotyping array, termed the OncoChip, is presently being designed by COGS and the US National Institutes of Health (NIH)-funded GAME-ON project to further characterize the genetic bases of breast, ovarian, prostate, colorectal and lung cancers. The discoveries from COGS so far emphasize the need to fully examine genetic susceptibility related to tumor heterogeneity and pleiotropy and to better characterize the functional impact of identified cancer susceptibility loci. A complementary extension of this work includes conducting focused analyses of interactions between the identified loci and environmental and/or lifestyle factors. Another crucial aspect, going forward, is to undertake studies in populations of more diverse ancestry. Although such populations have been understudied so far, their different LD patterns offer an opportunity to help fine map loci and narrow down the potential causal variants. Incorporating more rare variants into the arrays should also prove valuable for detecting new associations and improving prediction models for cancer. Furthermore, the scope of these efforts would be ideally expanded to investigate the role of genetic susceptibility across the entire disease spectrum, from precursor conditions, such as ductal carcinoma in situ of the breast and endometriosis, to patient outcomes, such as chemotherapy toxicity and response, second cancers and survival. This additional knowledge would not only facilitate strategies to more accurately identify and optimally manage individuals at greatest risk but would also promote the discovery of new targets for the development of effective chemopreventive and therapeutic agents for specific cancers. In summary, COGS is a model of how collaborative, large-scale association studies can advance understanding of the genetic bases underlying cancer and other common, complex diseases.
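The "proportion of familial risk explained" figures quoted under Implications of the results can be made concrete with a small calculation. This commentary does not state how the COGS papers derived their estimates, so the Python sketch below rests on stated assumptions: each locus is biallelic and in Hardy-Weinberg equilibrium, risk is multiplicative per allele, loci combine multiplicatively, and the familial relative risk to a first-degree relative attributable to one locus is (p r² + q) / (p r + q)², where p is the risk-allele frequency, q = 1 − p and r is the per-allele relative risk. The function names, the example loci and the overall familial relative risk of 2.0 are hypothetical inputs chosen for illustration.

```python
from math import log


def locus_frr(p, r):
    """Familial relative risk attributable to one biallelic locus.

    Assumes Hardy-Weinberg equilibrium and a multiplicative per-allele
    relative risk r for a risk allele of frequency p.
    """
    q = 1.0 - p
    return (p * r * r + q) / (p * r + q) ** 2


def fraction_frr_explained(loci, overall_frr):
    """Share of the overall familial relative risk explained, on the log scale.

    loci is an iterable of (risk-allele frequency, per-allele relative risk)
    pairs; loci are assumed to combine multiplicatively.
    """
    return sum(log(locus_frr(p, r)) for p, r in loci) / log(overall_frr)


if __name__ == "__main__":
    # Hypothetical loci, not values taken from the COGS papers.
    example_loci = [(0.30, 1.10), (0.10, 1.20), (0.45, 1.06)]
    share = fraction_frr_explained(example_loci, overall_frr=2.0)
    print(f"Fraction of familial risk explained: {share:.4f}")
```

Under these assumptions each common, low-penetrance SNP contributes only a small amount on its own, which is consistent with the commentary's point that many such loci are needed to account for an appreciable share of familial risk.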
ACKNOWLEDGMENTS
This work is supported by US NIH grants R01 CA088164, U01 CA127298 and U19 CA148537.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

1. Michailidou, K.M. et al. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
2. Garcia-Closas, M. et al. Nat. Genet. published online; doi:10.1038/ng.2561 (27 March 2013).
3. Pharoah, P.D.P. et al. Nat. Genet. published online; doi:10.1038/ng.2564 (27 March 2013).
4. Eeles, R.A. et al. Nat. Genet. published online; doi:10.1038/ng.2560 (27 March 2013).
5. Bojesen, S.E. et al. Nat. Genet. published online; doi:10.1038/ng.2566 (27 March 2013).
6. Couch, F.J. et al. PLoS Genet. 9, e1003212 (2013).
7. Gaudet, M.M. et al. PLoS Genet. 9, e1003173 (2013).
8. Kote-Jarai, Z. et al. Hum. Mol. Genet. published online; doi:10.1093/hmg/ddt086 (27 March 2013).
9. French, J.D. et al. Am. J. Hum. Genet. published online; doi:10.1016/j.ajhg.2013.01.002 (27 March 2013).
10. Zheng, W. et al. Hum. Mol. Genet. published online; doi:10.1093/hmg/ddt089 (27 March 2013).
11. Shen, H. et al. Nat. Comm. published online; doi:10.1038/ncomms2629 (27 March 2013).
12. Nickels, S. et al. PLoS Genet. 9, e1003284 (2013).
13. Permuth-Wey, J. et al. Nat. Comm. published online; doi:10.1038/ncomms2613 (27 March 2013).
14. Barrett, J.H. et al. Nat. Genet. 43, 1108–1113 (2011).
15. McKay, J.D. et al. Nat. Genet. 40, 1404–1406 (2008).
16. Petersen, G.M. et al. Nat. Genet. 42, 224–228 (2010).
17. Rafnar, T. et al. Nat. Genet. 41, 221–227 (2009).
18. Rajaraman, P. et al. Hum. Genet. 131, 1877–1888 (2012).
19. Rothman, N. et al. Nat. Genet. 42, 978–984 (2010).
20. Turnbull, C. et al. Nat. Genet. 42, 604–607 (2010).
21. Freedman, M.L. et al. Nat. Genet. 43, 513–518 (2011).
22. Lango Allen, H. et al. Nature 467, 832–838 (2010).


COMMENTARY

Public health implications from COGS and potential for risk stratification and screening
Hilary Burton1, Susmita Chowdhury1, Tom Dent1, Alison Hall1, Nora Pashayan2 & Paul Pharoah3,4

1PHG Foundation (Foundation for Genomics and Population Health), Cambridge, UK. 2University College London (UCL) Department of Applied Health Research, University College London, London, UK. 3Department of Public Health and Primary Care, Institute of Public Health, University of Cambridge, Cambridge, UK. 4Department of Oncology, University of Cambridge, Cambridge, UK. Correspondence should be addressed to H.B. (hilary.burton@phgfoundation.org).

The PHG Foundation led a multidisciplinary program, which used results from COGS research identifying genetic variants associated with breast, ovarian and prostate cancers to model risk-stratified prevention for breast and prostate cancers. Implementing such strategies would require attention to the use and storage of genetic information, the development of risk assessment tools, new protocols for consent and programs of professional education and public engagement.

The research articles by the multicenter Collaborative Oncological Gene-environment Study (COGS) published in this special collection in Nature Genetics1 are the output of a massive international scientific collaboration aimed at dissecting the genetic factors underlying susceptibility to three common hormone-related cancers: breast, ovarian and prostate cancers. The strength of the studies for all 3 cancers was the large-scale collaboration that was achieved, including over 130 institutions for the primary breast cancer association study2, over 100 for ovarian cancer3 and 70 for prostate cancer4. This large collaboration and combined resources provided the ability to pool large quantities of collected genetic association data sets and prospectively coordinate research.

How can these studies be interpreted for their potential public health impact? Alongside the scientific work, the COGS program included an implementation work package (WP7) led by the Foundation for Genomics and Population Health (PHG Foundation) in Cambridge. This multidisciplinary public health-orientated group focused on how emerging findings of associations with these three hormone-related cancers could enhance disease prevention by enabling the stratification of risk and the fine-tuning of current screening programs according to risk. Self-evidently, this relies on having effective preventive interventions for these cancers. WP7 focused on secondary prevention of breast and prostate cancers by provision of a screening test that identifies early cancers and thereby reduces mortality and morbidity. The evidence for the benefits of mammography in breast cancer detection and of prostate-specific antigen (PSA) testing in prostate cancer detection is not without controversy, and there is currently no screening test for ovarian cancer. From a public health perspective, it would also be highly desirable to devise proven and acceptable primary prevention strategies aimed at reducing the risk of disease; in breast cancer, for example, all women should receive general preventive advice related to alcohol intake, exercise and obesity. Under a risk-stratified program, behavioral interventions might be more intense for those at higher risk, and other measures such as chemoprophylaxis might also be offered. As well as modeling what might be achievable, on the basis of the findings from a series of international workshops, the work package provided an overview of the organizational, ethical, legal and social issues that would be important in implementation through public health programs.

Theories of disease prevention

In his classic paper, the epidemiologist Geoffrey Rose highlighted two approaches to disease prevention: the individual and the population approaches5. The individual

approach focuses on identifying individuals at high risk and providing some individual protection, which might involve controlling the level of exposure to a causal agent or an intervention, such as prophylactic treatment or surveillance for early disease. The population approach focuses on identifying the underlying causes of disease (for example, high dietary intake of fat or salt) and providing a generalized intervention that shifts the whole distribution of risk at the population level. In both approaches, there is acknowledgment of the potential for harm or at least inconvenience for individuals, as well as the possibility of benefit. In the high-risk approach, the benefit-to-harm ratio for individuals is more favorable, albeit at the cost of identifying these individuals in the first place and the potential for long-lasting medicalization or stigmatization. The potential for risk stratification using personal and medical information that may include genetic testing results requires a refinement of these original concepts of disease prevention and suggests a third way that synthesizes elements of the two approaches. In a conceptual paper using the COGS program and the prevention of breast and prostate cancers as examples, the WP7 group argued that stratified prevention could be conceptualized as an enhancement of Rose's high-risk approach. Essentially, it uses a prior assessment of risk, applied to the whole population, followed by the assignment of individuals to a risk stratum and the tailoring of the interventions offered to each group. In so doing, it aims to optimize the benefit-to-harm ratio and the cost-effectiveness of the public health program6.

nature genetics | volume 45 | number 4 | APRIL 2013

COMMENTARY
Risk stratification
The question then arises of whether current understanding of the genetic susceptibility for hormone-related cancers (breast, ovarian and prostate) can provide sufficiently good discrimination between risk groups so that the clinical usefulness gained by the stratification of prevention justifies the complexity that will be added to prevention programs. As reported in the accompanying collection of papers from the COGS scientific program, there were significant gains in identifying common variants associated with each of these cancers: 49 new loci were identified for breast cancer2,7–9, 8 were identified for ovarian cancer3,8,10,11, and 26 were identified for prostate cancer4,12. In their discussion, Michailidou et al.2 claimed that using the current set of loci and assuming that all loci combine multiplicatively could lead to a potential for risk stratification, with risks of breast cancer being approximately 2.3-fold and 3-fold higher for individuals in the top 5% and 1% of the population, respectively, relative to the population average. For prostate cancer, Eeles et al.4 estimated that there would be a 4.7-fold greater risk of prostate cancer for men in the top 1% of the risk distribution relative to the population average. The potential usefulness of population risk stratification for the prevention of breast cancer and prostate cancer was estimated by WP7 using detailed comparisons of risk stratification with current mammographic screening programs in the UK and a hypothetical screening strategy for prostate cancer based on age alone13. For each cancer, the 10-year absolute risk of being diagnosed with disease was estimated, taking into account age and polygenic risk profile, using all known susceptibility variants, including new variants identified in the COGS program (a total of 67 susceptibility variants for breast cancer and 72 variants for prostate cancer).
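To make the multiplicative assumption concrete, the sketch below treats the log relative risk as a sum of per-allele effects, which is approximately normally distributed across a population, and converts a percentile of that distribution into a risk relative to the population average. The variant frequencies and odds ratios are invented for illustration and are not the COGS loci; Hardy-Weinberg equilibrium, independence between loci and the normal approximation are assumptions of the sketch rather than details taken from Michailidou et al.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def log_risk_variance(variants):
    """Variance of the log relative risk under a multiplicative model.

    variants: iterable of (risk_allele_frequency, per_allele_odds_ratio).
    Assumes Hardy-Weinberg equilibrium and independent loci.
    """
    return sum(2 * p * (1 - p) * log(odds_ratio) ** 2 for p, odds_ratio in variants)


def relative_risk_at_percentile(variants, percentile):
    """Risk at a population percentile relative to the population average.

    If the log relative risk is normal with standard deviation sigma, the
    population-average risk is exp(sigma^2 / 2) times the median risk, so the
    ratio at normal quantile z is exp(z * sigma - sigma^2 / 2).
    """
    sigma = sqrt(log_risk_variance(variants))
    z = NormalDist().inv_cdf(percentile)
    return exp(z * sigma - sigma ** 2 / 2)


# Hypothetical panel: 50 variants, each with allele frequency 0.30 and OR 1.10.
HYPOTHETICAL_VARIANTS = [(0.30, 1.10)] * 50

for pct in (0.95, 0.99):
    rr = relative_risk_at_percentile(HYPOTHETICAL_VARIANTS, pct)
    print(f"relative risk at the {pct:.0%} percentile: {rr:.2f}")
```

The ratio is evaluated at the percentile boundary rather than averaged over everyone above it, so it is a lower bound on the mean relative risk of the group above that boundary.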
The number of individuals eligible for screening and the number of cases potentially detectable by screening were estimated in a population undergoing screening on the basis of age alone in comparison to a population undergoing personalized screening. For breast cancer, using the current UK National Health Service (NHS) breast cancer screening program as a comparator, it was found that, compared with existing age-based screening (ages 47–73 years), stratified screening of women in a wider age range (ages 35–79 years) at the same 10-year absolute risk (2.5%) would be expected to result in 24% fewer women being eligible for screening while potentially detecting 3% fewer cases through screening. Similarly, with prostate cancer, in a hypothetical screening strategy comparing risk-stratified screening for men aged 45–79 years with screening of men from age 55 (10-year absolute risk of prostate cancer of 2%), 19% fewer men would be eligible for screening at a cost of 4% fewer cases potentially detected by screening. The advantages of such stratified programs would be increased opportunities to detect cancers in individuals of younger ages, in whom the cancers tend to behave more aggressively14,15, and reduced risk of false positives, with reduction of harm due to unnecessary biopsies and invasive treatments. However, to estimate the true benefits of risk-stratified screening, it will be necessary to understand whether and how tumor subtypes, screening test sensitivity, the natural history of cancers and the probability of over-diagnosis (the diagnosis of indolent cancers that may never have manifested clinically) vary by polygenic profile. This COGS WP7 discussion on the potential effectiveness of risk stratification was based entirely on mathematical modeling. The eventual implementation of stratified prevention programs will require the development of risk prediction tools and empirical as well as modeled evidence of the effectiveness of risk-stratified interventions. Thus, it will be necessary to evaluate risk models using empirical data sets. Demonstration of the benefits of stratified screening, particularly in the reduction of cancer-specific mortality, would ideally be achieved through a randomized trial, but this may not always be feasible. Alternatively, it may be reasonable to implement stratified screening on the basis of the best available evidence, including validated model findings, and then have pragmatic service evaluation.

Implementation of risk-stratified screening
In addition to the evaluation of potential usefulness and cost-effectiveness, the implementation of risk-stratified screening will require attention to a wide range of organizational, ethical, legal and social issues.
The COGS WP7 group investigated these through multidisciplinary stakeholder workshops and further detailed policy research on the key issues identified16,17. Workshop participants included oncologists, breast cancer screening program managers, clinical geneticists, ethicists, health service policy makers, public health specialists and public representatives, as well as scientists and clinicians closely involved as researchers in the wider COGS program.

The use of genetic information
There is increasing interest in the ethical and legal issues generated by DNA sampling and subsequent data use in both clinical and research settings. If DNA is to be sampled as

2013 Nature America, Inc. All rights reserved.

part of a risk stratification process, then, like any other clinical intervention, consent will be required. DNA sampling and analysis will not in itself change the nature of the consent that is sought, unless it is for research or other purposes outside clinical care18. Nevertheless, DNA sampling and analysis may cause concern to patients and the public. Policies for the implementation of risk-stratified screening will need to set out clearly the uses that will be made of the data generated. Participants in testing will need to be informed of whether the data generated can or will be used for other purposes, such as research; the possibility of generating incidental findings and how these will be managed; whether information will be relevant for family members and, if so, whether, how and by whom it will be shared; whether the data will be stored and, if so, with what safeguards; and who might have access to stored data, including the individual, family members, employers, insurance companies, criminal justice agencies and researchers.

Risk assessment
The components of risk assessment are likely to include genetic susceptibility variants, other biomarkers, a data set of personal, clinical and family history information, reproductive information and environmental or lifestyle factors. The precise components of this information and the methods for collection must be studied to determine the most cost-effective approaches. For example, non-genetic information, such as alcohol intake, obesity and family history, might be obtained by paper, electronic or in-person questionnaires. The final set of information will in sum form a tool or instrument for risk estimation, whose effectiveness should be validated, for example, using population cohorts. Policymakers must also decide whether information collection will be a one-time occurrence or take into account changing circumstances (such as family history), being updated over time.
For genetic information, the set of SNPs to be analyzed should be chosen on the basis of evidence from the latest scientific findings, and, again, this set of markers should be subject to external scrutiny and will need to be updated over time. It will also be necessary to debate whether or not to include rare, highly penetrant mutations such as those occurring in BRCA1 and BRCA2, which confer information that is more highly predictive than most susceptibility information and is therefore of greater relevance for patients and family members19.

Personalized screening program
Preliminary risk stratification will add new complexities to the prevention program.

First, appropriate systems for inviting and recalling people for risk assessment and screening need to be in place. Second, there should be a standard protocol for taking consent, performing genetic sampling and using a standardized risk assessment tool to integrate genetic data from an individual with environmental, lifestyle and hormonal data. Third, the level of risk of cancer will dictate the care pathway followed, with different pathways being followed for each risk stratum. Before implementation of a stratified screening program using genetic information, some health professionals will require new competencies to explain the new system, undertake assessment, communicate results and, in some systems, such as the UK screening programs, uphold decisions on NHS screening eligibility.

Other ethical, legal and social issues
An over-riding concern expressed in the multidisciplinary workshops was whether and under what circumstances the public would think that it was fair for screening eligibility to be based on a risk score that includes information from genetic profiling. Public opinion might depend critically on whether arguments were expressed in terms of improving the benefit-harm balance for individuals or greater overall cost-effectiveness for the population. In either case, it would be important that these arguments and the underlying evidence were transparent and clearly communicated. Wider ethical, legal and social issues vary according to the precise model adopted but may include regulation of the risk prediction model and the wider social impact of knowledge of disease susceptibility. With respect to distributive justice, it will be important to ensure that different societal groups have equal access to the program and that it is delivered in a fair and objective fashion so that, as far as possible, different societal groups have an equal chance of benefit.
For example, reasonable steps should be taken to ensure that poorer or less educated individuals are not deterred from participation owing to the added complexity of risk assessment. Similarly, if stratification is known to be less accurate for some ancestry groups, these biases should be acknowledged,

and, wherever possible, deficiencies should be addressed. Using the example of east Asian women, the COGS studies have drawn attention to the fact that most of the research on genetic susceptibility in breast cancer relates to populations of European ancestry, and they noted that there is insufficient evidence on which to build accurate risk models for other population groups20.

Conclusions
Societies are increasingly aware of the possible harms of major population screening programs (see the recent UK breast cancer screening report21) and the importance of cost-effectiveness. As a result of the COGS work, international policymakers, clinicians and researchers at the WP7 workshops were confident that risk stratification using models that include genetic susceptibility would be a component of future public health programs. There is a wide array of factors that will eventually influence the effectiveness of risk stratification for a public health program. These include many technical factors still to be elucidated related to the stratification itself, the organizational complexity of the program and wider issues such as public acceptance among different sectors of the population, the drive for the increasing personalization of prevention and the potential development of new preventive and treatment modalities. It is therefore clear that evaluation of the benefits, harms and costs of implementing risk stratification at a population level will require empirical evidence from clinical trials or pilot studies. For responsible and rapid implementation, policymakers will need to ensure that public health services, healthcare services and experts in ethical, legal and regulatory matters become involved in the assessment of research implications and the development of implementation plans. An international effort similar in scale to that deployed in the research component of COGS should be initiated.
In areas such as this, where most public health and other multidisciplinary professionals struggle to understand the complexities of the science, it will be necessary to develop a cadre of experts able to lead the field.

Complementing this approach, it will be vital that scientists and clinicians actively involved in the research themselves engage with those involved in translation and implementation. Through WP7, the COGS program has provided evidence of the usefulness of such joint work and a strong indication that more will be required.
URL. COGS, http://www.cogseu.org/.

ACKNOWLEDGMENTS
H.B., S.C., T.D. and A.H. were supported by the PHG Foundation. The PHG Foundation is the working name of the Foundation for Genomics and Population Health, a charitable company registered in England and Wales: Charity 118664, Company 5823194. N.P. is supported by a Cancer Research UK Clinician Scientist Fellowship. This work is part of the COGS program, funded by the seventh Framework Programme of the European Commission under grant agreement 223175 (HEALTH-F2-2009-223175).

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
1. Anonymous. Nat. Genet. published online; doi:10.1038/ng.2592 (27 March 2013).
2. Michailidou, K. et al. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
3. Pharoah, P.D.P. et al. Nat. Genet. published online; doi:10.1038/ng.2564 (27 March 2013).
4. Eeles, R.A. et al. Nat. Genet. published online; doi:10.1038/ng.2560 (27 March 2013).
5. Rose, G. Int. J. Epidemiol. 14, 32–38 (1985).
6. Burton, H., Sagoo, G.S., Pharoah, P. & Zimmern, R.L. Ital. J. Public Health 9, 4 (2012).
7. Garcia-Closas, M. et al. Nat. Genet. published online; doi:10.1038/ng.2561 (27 March 2013).
8. Bojesen, S.E. et al. Nat. Genet. published online; doi:10.1038/ng.2566 (27 March 2013).
9. French, J.D. et al. Am. J. Hum. Genet. published online; doi:10.1016/j.ajhg.2013.01.002 (27 March 2013).
10. Shen, H. Nat. Comm. published online; doi:10.1038/ncomms2629 (27 March 2013).
11. Permuth-Wey, J. et al. Nat. Comm. published online; doi:10.1038/ncomms2613 (27 March 2013).
12. Kote-Jarai, Z. et al. Hum. Mol. Genet. published online; doi:10.1093/hmg/ddt086 (27 March 2013).
13. Pashayan, N. et al. Br. J. Cancer 104, 1656–1663 (2011).
14. Fredholm, H. et al. PLoS ONE 4, e7695 (2009).
15. Lin, D.W., Porter, M. & Montgomery, B. Cancer 115, 2863–2871 (2009).
16. Chowdhury, S. et al. Genet. Med. published online; doi:10.1038/gim.2012.167 (14 February 2013).
17. Dent, T. et al. Public Health Genomics published online; doi:10.1159/000345941 (26 January 2013).
18. UK Parliament. Human Tissue Act 2004 Chapter 30 (The Stationery Office, London, 2004).
19. Antoniou, A. et al. Am. J. Hum. Genet. 72, 1117–1130 (2003).
20. Zheng, W. et al. Nat. Genet. 45, 191–196 (2013).
21. Independent UK Panel on Breast Cancer Screening. Lancet 380, 1778–1786 (2012).


research highlights

BRCA1 mutation carriers


BRCA1 mutation carriers are at increased risk for both breast and ovarian cancers. The Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA) have previously reported a two-stage genome-wide association study (GWAS) for modifiers of breast or ovarian cancer risk in BRCA1 mutation carriers (Nat. Genet. 42, 885–892, 2010), identifying a 19p13 locus associated with breast cancer risk for individuals with BRCA1 mutation. Fergus Couch and colleagues now report a large-scale replication study including 11,705 BRCA1 mutation carriers from 45 centers in 25 countries (PLoS Genet. 9, e1003212, 2013). These samples were all genotyped using the custom iCOGS array, which included 32,557 SNPs selected from the original GWAS. The authors identified a new susceptibility locus at 1q32, containing the MDM4 gene, associated with breast cancer risk for BRCA1 mutation carriers and also identified two new loci associated with ovarian cancer risk for BRCA1 mutation carriers at 17q21.31 and 4q32.2, with the latter representing the first locus to modify cancer risk specifically in BRCA1 mutation carriers. The authors also replicated previous associations and report a total of ten and seven loci associated, respectively, with breast and ovarian cancer risk in BRCA1 mutation carriers. On the basis of combined-SNP profiles, they report large differences in the predicted risk of developing breast or ovarian cancer for the 5% of BRCA1 mutation carriers at highest and lowest risk.

67 known susceptibility loci in 23,637 breast cancer cases and 25,579 controls of east Asian ancestry from the Asia Breast Cancer Consortium (ABCC) as well as Asian samples from COGS. They found that variants at 31 of these loci showed nominal association with breast cancer risk and that 21 of these met a Bonferroni-corrected significance level. This study offers the most comprehensive analysis of breast cancer risk variants in east Asians so far, but further studies and fine mapping of these regions in both European and Asian populations are needed to characterize population-specific differences in breast cancer susceptibility.
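As an illustration of the two significance levels mentioned here, the sketch below counts how many P values pass a nominal threshold and how many pass a Bonferroni-corrected one. The alpha of 0.05, the use of the number of P values as the number of tests, and the P values themselves are assumptions for illustration; this summary does not state the exact threshold used in the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def count_significant(p_values, alpha=0.05):
    """Return (nominally significant, Bonferroni-significant) counts."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    nominal = sum(p < alpha for p in p_values)
    corrected = sum(p < threshold for p in p_values)
    return nominal, corrected


# Hypothetical P values, one per tested locus.
example_p_values = [0.04, 0.001, 0.5, 0.012]
print(count_significant(example_p_values))
```

Because the corrected threshold shrinks as the number of tests grows, the set of Bonferroni-significant loci is always a subset of the nominally significant ones.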

Fine mapping at the 11q13 locus


Alison Dunning and colleagues report fine mapping of the 11q13 breast cancer susceptibility locus (Am. J. Hum. Genet. doi:10.1016/j.ajhg.2013.01.002, 27 March 2013). The original genome-wide association study (GWAS)-identified SNP tags a linkage disequilibrium block that spans 683 kb, and the authors selected 731 SNPs from this region to include on the iCOGS array. They genotyped 89,050 individuals of European ancestry and 12,893 individuals of Asian ancestry, all from studies included in BCAC. They identified 204 SNPs associated with overall breast cancer risk, finding that these were all associated with estrogen receptor (ER)-positive but not ER-negative breast cancer. Using stepwise logistic regression, they identified three independently associated SNPs. They selected five promising candidate SNPs for functional studies but did not detect any significant association of these SNPs with the expression of local genes in normal breast tissue or tumor samples. Using chromatin immunoprecipitation with sequencing (ChIP-seq) data from MCF7 cells, they found that these SNPs fell within two putative regulatory elements. Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) and chromosome conformation capture (3C) analyses showed long-range interactions between these regulatory elements and the CCND1 promoter and/or terminator. Further functional studies identified a candidate causal variant in the putative CCND1 enhancer, which affected the binding of the ELK4 transcription factor. A second candidate causal variant, located within a silencer element that physically interacts with the CCND1 enhancer, affected binding of the GATA3 transcription factor.


BRCA2 mutation carriers


A previous study from the Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA) reported a genome-wide association study (GWAS) for breast cancer risk in BRCA2 mutation carriers (PLoS Genet. 6, e1001183, 2010), identifying only known loci previously associated with breast cancer risk in the general population. Kenneth Offit and colleagues now report an extended replication, including genotyping using the iCOGS array, of 3,881 BRCA2 mutation carriers with breast cancer and 4,330 without breast cancer from CIMBA (PLoS Genet. 9, e1003173, 2013). They selected 19,029 SNPs from the initial GWAS for inclusion on the iCOGS array. The authors replicated previous breast cancer susceptibility loci and also identified a new susceptibility locus at 6p24 associated with breast cancer risk for BRCA2 mutation carriers but not with breast cancer for BRCA1 mutation carriers or the general population. This represents the first BRCA2-specific breast cancer association. Using a data set from The Cancer Genome Atlas (TCGA), they found that rs9348512 was associated with increased expression of the nearby gene GCNT2 in breast tumors. The authors used a combined-SNP risk profile of 14 SNPs known to modify risk in BRCA2 mutation carriers, predicting 21–47% risk of developing breast cancer by the age of 80 years for the 5% of the BRCA2 mutation carriers at lowest risk compared to 83–100% risk for the 5% at highest risk.


Fine mapping at the TERT locus


Zsofia Kote-Jarai and colleagues report fine mapping of associations to prostate cancer susceptibility at the TERT locus using high-resolution genotyping and imputation (Hum. Mol. Genet. doi:10.1093/hmg/ddt086, 27 March 2013). The authors genotyped 134 SNPs across the TERT locus using the custom iCOGS array or Sequenom MassArray iPlex in 22,301 cases and 22,320 matched controls from 23 studies included in the PRACTICAL Consortium. They initially genotyped 114 SNPs across 135 kb of the SLC6A18-TERT-CLPTM1L region and then narrowed their focus to a 20-kb interval that included variants with stronger association. They further tested association for an imputed set of 1,094 SNPs. They identified 44 SNPs associated with prostate cancer risk at P < 1 × 10⁻⁵. With stepwise logistic regression, they were able to identify four SNPs showing independent association, suggesting four separate regions influencing susceptibility to prostate cancer. They examined gene expression of TERT and CLPTM1L in 195 normal (histologically benign) prostate tissue samples isolated from men with elevated prostate-specific antigen (PSA) levels. They found protective alleles of four SNPs in one region associated with higher expression of TERT.

Breast cancer associations in east Asians


To extend findings of breast cancer association studies that have primarily been conducted in European-ancestry populations, Wei Zheng and colleagues systematically examined the association of known breast cancer susceptibility loci in east Asian women (Hum. Mol. Genet. doi:10.1093/hmg/ddt089, 27 March 2013). They genotyped 70 SNPs at
Research Highlights written by Orli Bahcall.


Articles

Large-scale genotyping identifies 41 new loci associated with breast cancer risk
Breast cancer is the most common cancer among women. Common variants at 27 loci have been identified as associated with susceptibility to breast cancer, and these account for ~9% of the familial risk of the disease. We report here a meta-analysis of 9 genome-wide association studies, including 10,052 breast cancer cases and 12,575 controls of European ancestry, from which we selected 29,807 SNPs for further genotyping. These SNPs were genotyped in 45,290 cases and 41,880 controls of European ancestry from 41 studies in the Breast Cancer Association Consortium (BCAC). The SNPs were genotyped as part of a collaborative genotyping experiment involving four consortia (Collaborative Oncological Gene-environment Study, COGS) and used a custom Illumina iSelect genotyping array, iCOGS, comprising more than 200,000 SNPs. We identified SNPs at 41 new breast cancer susceptibility loci at genome-wide significance (P < 5 × 10⁻⁸). Further analyses suggest that more than 1,000 additional loci are involved in breast cancer susceptibility.

Breast cancer is the most commonly occurring malignancy among women, with an estimated 1 million new cases and over 400,000 deaths annually worldwide1. Familial aggregation and twin studies have shown the substantial contribution of inherited susceptibility to breast cancer2,3. Many genetic loci are known to contribute to this familial risk, including genes with high-penetrance mutations (notably BRCA1 and BRCA2), moderate-risk alleles in genes such as ATM, CHEK2 and PALB2, and common lower penetrance alleles, of which 27 have been identified so far, principally through genome-wide association studies (GWAS)4–16. In total, these loci explain approximately 30% of the familial risk of breast cancer15. Global analysis of GWAS data suggests that a substantial fraction of the residual aggregation can be explained by other common variants not yet identified, but the relative contributions of common and rare variants are still uncertain.
RESULTS To identify additional susceptibility loci for breast cancer, we first conducted a meta-analysis of 9 breast cancer GWAS in populations of European ancestry, including 10,052 cases and 12,575 controls (Supplementary Table 1). From this analysis, we selected 35,084 SNPs on the basis of evidence of association with breast cancer, derived from a 1-degree-of-freedom trend test, a test weighted for family history, a 2-degrees-of-freedom test and subset analyses based on cases of breast cancer diagnosed before 40 years of age and before 50 years of age (Online Methods). In particular, we were able to select all SNPs or surrogate SNPs with 1-degree-of-freedom Ptrend < 0.008. To evaluate these SNPs, we then designed a custom Illumina iSelect genotyping array (iCOGS) in collaboration with three other consortia studying, in addition to breast cancer risk, susceptibility to ovarian cancer, prostate cancer and breast and ovarian cancers in BRCA1 and BRCA2 mutation carriers (COGS)1720. The array included, in addition to SNPs selected from GWAS, SNPs selected for fine mapping of known susceptibility loci, functional candidate SNPs and SNPs related to other traits (Online Methods and Supplementary Note). The iCOGS array comprised 211,155 SNPs. These arrays were used to genotype 114,255 DNA samples from 52 studies participating in BCAC (Supplementary Table 2). After quality control exclusions (Online Methods and Supplementary Table 3), data were obtained for 199,961 SNPs in 52,675 cases and 49,436 controls. The analyses presented here are based on data from subjects of European ancestry (45,290 cases and 41,880 controls from 41 studies) and focus on 29,807 SNPs that were selected on the basis of the GWAS analysis that were successfully genotyped and were not located in regions previously known to be associated with breast cancer. 
The association between each SNP and breast cancer risk was tested using a 1-degree-of-freedom trend test adjusted for study and seven principal components (Online Methods). There was some evidence for inflation in the test statistics, detected using data from 22,897 uncorrelated SNPs on iCOGS not selected on the basis of breast cancer risk (λ = 1.20, λ₁₀₀₀ = 1.005; Supplementary Fig. 1a). There was, however, clear evidence of an excess of statistically significant associations among the SNPs selected from the GWAS analysis (Table 1 and Supplementary Fig. 1b). Although some excess was also observed among the SNPs not selected from the breast cancer GWAS, the excess of statistically significant associations was much more marked among the GWAS SNPs at all levels of statistical significance. In addition, of 21,128 SNPs not selected for breast cancer association that were also present in the combined GWAS data set, 10,864 (51%) had effects in the same direction in the GWAS and iCOGS data, and, for these SNPs, inflation was 1.26 (λ₁₀₀₀ = 1.007) compared with 1.14 (λ₁₀₀₀ = 1.0035) for SNPs with effects in opposite directions in the two stages. A similar direction of effect was seen for these SNPs in the combined GWAS (λ = 0.87 for SNPs with effects in the same direction versus λ = 0.79 for SNPs with effects in the opposite


A full list of authors and affiliations appears at the end of the paper. Received 10 May 2012; accepted 30 January 2013; published online 27 March 2013; doi:10.1038/ng.2563

disease (based on data from 7,465 ERnegative cases and 27,074 ER-positive cases; Supplementary Table 7a). The most notable Significance SNPs Observed/expected SNPs Observed/expected Relative excess differences were for SNP rs6828523 at 4q34.1 <1 107 142 47,639.8 7 554.0 86.0 (ER-positive OR = 0.87 (95% confidence 1 1071 106 62 2080.0 13 102.9 20.2 interval (CI) = 0.840.90); ER-negative OR = 1 1061 105 108 362.3 25 19.8 18.3 1.01 (95% CI = 0.961.07); P for difference = 1 1051 104 157 52.7 136 10.8 4.9 1 1041 103 360 12.1 348 2.8 4.3 1.2 107) and for rs7072776 at 10p12.31, aAll SNPs excluding those proposed by BCAC and those in four common regions selected for fine mapping (Online Methods). where the estimated effects were in opposite directions (ER-positive OR = 1.09 (95% CI = 1.061.12); ER-negative OR = 0.94 (95% direction, with inflation being <1 because SNPs showing evidence of CI = 0.900.98); P for difference = 3.1 1010). No such difference association were excluded). Taken together, these results suggest that was observed for the neighboring SNP rs11814448, which was assomuch of the inflation in the test statistics for SNPs not selected for ciated with both ER-positive and ER-negative disease in the same breast cancer association is also due to the effect of true associations. direction. For one locus, SNP rs17817449 on chromosome 16, the Moreover, some of the excess of statistically significant associations association was stronger for ER-negative than for ER-positive disseen in the SNPs not selected for breast cancer association was due ease (P for difference = 0.039). All SNPs showed comparable ORs to SNPs close to breast cancerassociated SNPs. 
For example, of the 45 SNPs with significant association at P < 0.00001, 21 were within 1 Mb of one of the breast cancer loci newly identified at our genome-wide significance threshold. Taken together, these results strongly suggest that most of the excess of significant associations among the GWAS-selected SNPs reflects true associations.

Of the 27 previously established breast cancer-associated loci, all but 4 showed clear evidence of association with overall breast cancer risk in the iCOGS stage (P = 2.2 × 10^-5 to P = 5.9 × 10^-125; Supplementary Table 4). Three loci showed weaker evidence for association: rs1045485, encoding an Asp302His variant in CASP8, whose association was previously identified in a candidate gene study (P = 0.054 in the iCOGS stage; P = 0.0013 in combined data from the GWAS and iCOGS stages) [21]; rs2380205 at 10p15, identified in a GWAS but suggested to be a possible false positive association in a previous BCAC analysis [22,23] (iCOGS P = 0.075; combined P = 0.0021); and rs8170 at 19p13.1, for which the association has been shown to be specific to estrogen receptor (ER)-negative breast cancer [24] (P = 0.0027 in iCOGS; combined P = 0.0012). One locus, rs2284378 at 20q11, recently shown to be associated with ER-negative breast cancer, was not selected for the iCOGS array [16].

Identification of new susceptibility loci
When the results from the GWAS and the iCOGS array were combined, 263 SNPs in 37 new regions had associations that reached P < 5 × 10^-8 (Fig. 1, Table 2 and Supplementary Figs. 2 and 3). In four regions (5q11.2, 8q21.11, 10p12.31 and 18q11.2), this set of SNPs included SNPs within 1 Mb of each other that were uncorrelated, such that a second SNP was associated with disease after adjustment for the most significantly associated SNP (Supplementary Fig. 4 and Supplementary Table 5). There was little or no evidence for heterogeneity in the per-allele odds ratios (ORs) among studies for any SNP (per-SNP I^2 and P values are given in Supplementary Fig. 2 and Supplementary Table 6). Genotype-specific OR estimates were consistent with a log-additive (allele dose) model for most SNPs, with the exception of three SNPs (rs616488, rs204247 and rs720475) for which the heterozygotes had an OR similar to that of homozygotes for the high-risk allele and two SNPs (rs11242675 and rs6472903) that were more consistent with a recessive model (Supplementary Table 6). Consistent with the pattern seen for previously established loci, there was strong evidence for specificity of the association to tumor subtype. For 13 of the loci, the per-allele OR was higher for ER-positive disease than for ER-negative disease (case-only P < 0.05), in most instances with little or no evidence of an association with ER-negative [...] for invasive and in situ disease (based on data from 2,335 ductal carcinoma in situ (DCIS) and 42,118 invasive cases), with the exceptions of rs12493607 and rs3903072, for which associations seemed to be restricted to invasive disease (Supplementary Table 7b). Two loci (rs2588809 at 14q24.1 (P = 0.001) and rs941764 at 14q32.12 (P = 0.007)) showed higher per-allele ORs for cases diagnosed at a young age (Supplementary Table 7c). Consistent with the predictions of a polygenic model of susceptibility [25], for 26 of the loci, the estimated OR was higher when restricted to cases with a positive family history for disease (significant at P < 0.05 for 5 loci), whereas for only 6 loci was the OR lower when restricted to cases with a positive family history (Supplementary Table 7d).

Four of the newly associated loci (rs16857609 at 2q35, rs10759243 at 9q31, rs11199914 at 10q26 and rs2588809 at 14q24) lie close to regions previously associated with breast cancer risk. In each locus, however, the lead SNP was not correlated with the most strongly associated known SNP, and the association of the new SNP remained similarly statistically significant after adjustment for the previously associated SNP (Supplementary Table 5). In the case of rs2588809, which lies in RAD51B (also known as RAD51L1), the association was markedly stronger for ER-positive disease (P = 0.011; Supplementary Table 7a), whereas the previously associated SNPs (rs999737 and rs10483813), which lie ~370 kb telomeric, are associated with similar ORs for both ER-positive and ER-negative disease [26].

Figure 1 One-degree-of-freedom trend-test statistics, plotted by chromosome, for 29,807 iCOGS SNPs selected from the combined GWAS, excluding those occurring in known susceptibility regions. The red horizontal line represents P = 5 × 10^-8. The blue horizontal line represents P = 1 × 10^-5.
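As an illustrative aside, the genotype-specific patterns described above (log-additive for most SNPs, heterozygote OR similar to that of high-risk homozygotes for three SNPs, recessive for two) can be written out as expected genotype ORs. This is a minimal sketch and not the authors' analysis code: the function name `genotype_odds_ratios` and the model labels are hypothetical, and the example effect size of 1.26 is used only because it is the highest per-allele OR reported in the study.

```python
def genotype_odds_ratios(effect_or, model="log-additive"):
    """Expected ORs for carriers of 0, 1 and 2 copies of the risk allele.

    Illustrative only: `effect_or` is the per-allele OR under the
    log-additive model and the carrier OR under the other two models.
    """
    if effect_or <= 0:
        raise ValueError("an odds ratio must be positive")
    if model == "log-additive":
        # Each copy multiplies the odds by the same factor (allele dose).
        return (1.0, effect_or, effect_or ** 2)
    if model == "dominant":
        # Heterozygotes carry the same OR as high-risk homozygotes.
        return (1.0, effect_or, effect_or)
    if model == "recessive":
        # Only homozygotes for the risk allele have an altered risk.
        return (1.0, 1.0, effect_or)
    raise ValueError(f"unknown model: {model!r}")


for name in ("log-additive", "dominant", "recessive"):
    print(name, genotype_odds_ratios(1.26, name))
```

Comparing observed genotype-specific ORs against these three patterns is one simple way to see which model a SNP is "more consistent with"; a formal comparison would use the likelihoods from the fitted models.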
Table 1 Summary of SNPs by level of statistical significance in the iCOGS stage
Column groups: Combined GWAS (n = 29,807); Non-BCAC (n = 126,360)

Table 2 Results for 41 SNPs for which association P < 5 × 10^-8 in combined GWAS and iCOGS analysis

Lead SNP | Chr.(a) | Position(b) | Alleles(c) | MAF(d) | GWAS OR (95% CI)(e) | GWAS P(e) | iCOGS OR (95% CI)(f) | iCOGS P(e) | Combined GWAS and iCOGS P(e) | Genes
rs616488 | 1 | 10488802 | A/G | 0.33 | 0.94 (0.90–0.98) | 0.0017 | 0.94 (0.92–0.96) | 3.0 × 10^-8 | 2.0 × 10^-10 | PEX14
rs11552449 | 1 | 114249912 | C/T | 0.17 | 1.08 (1.02–1.14) | 0.0042 | 1.07 (1.04–1.09) | 1.1 × 10^-6 | 1.8 × 10^-8 | PTPN22-BCL2L15-AP4B1-DCLRE1B-HIPK1
rs4849887 | 2 | 120961592 | C/T | 0.098 | 0.90 (0.84–0.96) | 0.0017 | 0.91 (0.88–0.94) | 5.6 × 10^-9 | 3.7 × 10^-11 | None
rs2016394 | 2 | 172681217 | G/A | 0.48 | 0.95 (0.92–0.99) | 0.014 | 0.95 (0.93–0.97) | 2.7 × 10^-7 | 1.2 × 10^-8 | METAP1D-DLX1-DLX2
rs1550623 | 2 | 173921140 | A/G | 0.16 | 0.91 (0.86–0.96) | 0.00027 | 0.94 (0.92–0.97) | 1.2 × 10^-5 | 3.0 × 10^-8 | CDCA7
rs16857609 | 2 | 218004753 | C/T | 0.26 | 1.09 (1.05–1.14) | 4.5 × 10^-5 | 1.08 (1.06–1.10) | 4.4 × 10^-12 | 1.1 × 10^-15 | DIRC3
rs6762644 | 3 | 4717276 | A/G | 0.40 | 1.06 (1.02–1.11) | 0.0016 | 1.07 (1.04–1.09) | 3.5 × 10^-10 | 2.2 × 10^-12 | ITPR1-EGOT
rs12493607 | 3 | 30657943 | G/C | 0.35 | 1.04 (1.00–1.09) | 0.049 | 1.06 (1.03–1.08) | 1.4 × 10^-7 | 2.3 × 10^-8 | TGFBR2
rs9790517 | 4 | 106304227 | C/T | 0.23 | 1.09 (1.04–1.14) | 0.00027 | 1.05 (1.03–1.08) | 1.6 × 10^-5 | 4.2 × 10^-8 | TET2
rs6828523 | 4 | 176083001 | C/A | 0.13 | 0.89 (0.83–0.94) | 0.00011 | 0.90 (0.87–0.92) | 6.6 × 10^-13 | 3.5 × 10^-16 | ADAM29
rs10472076 | 5 | 58219818 | T/C | 0.38 | 1.06 (1.02–1.11) | 0.005 | 1.05 (1.03–1.07) | 1.6 × 10^-6 | 2.9 × 10^-8 | RAB3C
rs1353747 | 5 | 58373238 | T/G | 0.095 | 0.90 (0.84–0.96) | 0.0020 | 0.92 (0.89–0.95) | 2.7 × 10^-6 | 2.5 × 10^-8 | PDE4D
rs1432679 | 5 | 158176661 | T/C | 0.43 | 1.06 (1.02–1.10) | 0.0023 | 1.07 (1.05–1.09) | 2.1 × 10^-12 | 2.0 × 10^-14 | EBF1
rs11242675 | 6 | 1263878 | T/C | 0.39 | 0.97 (0.93–1.01) | 0.12 | 0.94 (0.92–0.96) | 1.2 × 10^-8 | 7.1 × 10^-9 | FOXQ1
rs204247 | 6 | 13830502 | A/G | 0.43 | 1.06 (1.02–1.10) | 0.0057 | 1.05 (1.03–1.07) | 4.2 × 10^-7 | 8.3 × 10^-9 | RANBP9
rs720475 | 7 | 143705862 | G/A | 0.25 | 0.93 (0.89–0.98) | 0.0024 | 0.94 (0.92–0.96) | 7.8 × 10^-9 | 7.0 × 10^-11 | ARHGEF5-NOBOX
rs9693444 | 8 | 29565535 | C/A | 0.32 | 1.07 (1.03–1.12) | 0.00086 | 1.07 (1.05–1.09) | 2.6 × 10^-11 | 9.2 × 10^-14 | None
rs6472903 | 8 | 76392856 | T/G | 0.18 | 0.88 (0.84–0.93) | 2.0 × 10^-6 | 0.91 (0.89–0.93) | 8.4 × 10^-13 | 1.7 × 10^-17 | None
rs2943559 | 8 | 76580492 | A/G | 0.07 | 1.17 (1.09–1.26) | 1.2 × 10^-5 | 1.13 (1.09–1.17) | 6.0 × 10^-11 | 5.7 × 10^-15 | HNF4G
rs11780156 | 8 | 129263823 | C/T | 0.16 | 1.13 (1.07–1.19) | 2.2 × 10^-6 | 1.07 (1.04–1.10) | 5.0 × 10^-7 | 3.4 × 10^-11 | MIR1208
rs10759243 | 9 | 109345936 | C/A | 0.39 | 1.07 (1.02–1.12) | 0.0084 | 1.06 (1.03–1.08) | 4.0 × 10^-7 | 1.2 × 10^-8 | None
rs7072776 | 10 | 22072948 | G/A | 0.29 | 1.11 (1.07–1.16) | 1.3 × 10^-6 | 1.07 (1.05–1.09) | 1.6 × 10^-9 | 4.3 × 10^-14 | MLLT10-DNAJC1
rs11814448 | 10 | 22355849 | A/C | 0.020 | 1.35 (1.17–1.56) | 3.7 × 10^-5 | 1.26 (1.18–1.35) | 3.6 × 10^-12 | 9.3 × 10^-16 | DNAJC1
rs7904519 | 10 | 114763917 | A/G | 0.46 | 1.06 (1.02–1.10) | 0.0059 | 1.06 (1.04–1.08) | 1.5 × 10^-8 | 3.1 × 10^-8 | TCF7L2
rs11199914 | 10 | 123083891 | C/T | 0.32 | 0.94 (0.89–0.98) | 0.0030 | 0.95 (0.93–0.97) | 1.5 × 10^-6 | 1.9 × 10^-8 | None
rs3903072 | 11 | 65339642 | G/T | 0.47 | 0.92 (0.89–0.96) | 5.1 × 10^-5 | 0.95 (0.93–0.96) | 2.0 × 10^-8 | 8.6 × 10^-12 | DKFZp761E198-OVOL1-SNX32-CFL1-MUS81
rs11820646 | 11 | 128966381 | C/T | 0.41 | 0.93 (0.90–0.97) | 0.00068 | 0.95 (0.93–0.97) | 3.2 × 10^-7 | 1.1 × 10^-9 | None
rs12422552 | 12 | 14305198 | G/C | 0.26 | 1.11 (1.05–1.16) | 4.2 × 10^-5 | 1.05 (1.03–1.07) | 2.9 × 10^-5 | 3.7 × 10^-8 | None
rs17356907 | 12 | 94551890 | A/G | 0.30 | 0.89 (0.85–0.93) | 1.7 × 10^-6 | 0.91 (0.89–0.93) | 1.4 × 10^-17 | 1.8 × 10^-22 | NTN4
rs11571833 | 13 | 31870626 | A/T | 0.008 | 1.39 (1.13–1.71) | 0.0016 | 1.26 (1.14–1.39) | 5.7 × 10^-6 | 4.9 × 10^-8 | BRCA2-N4BP2L1-N4BP2L2
rs2236007 | 14 | 36202520 | G/A | 0.21 | 0.88 (0.83–0.93) | 2.0 × 10^-5 | 0.93 (0.91–0.95) | 4.4 × 10^-10 | 1.7 × 10^-13 | PAX9-SLC25A21
rs2588809 | 14 | 67730181 | C/T | 0.16 | 1.07 (1.01–1.13) | 0.017 | 1.08 (1.05–1.11) | 2.3 × 10^-9 | 1.4 × 10^-10 | RAD51L1
rs941764 | 14 | 90910822 | A/G | 0.34 | 1.05 (1.00–1.09) | 0.043 | 1.06 (1.04–1.09) | 2.3 × 10^-9 | 3.7 × 10^-10 | CCDC88C
rs17817449 | 16 | 52370868 | T/G | 0.40 | 0.95 (0.91–0.99) | 0.010 | 0.93 (0.91–0.95) | 1.3 × 10^-12 | 6.4 × 10^-14 | MIR1972-2-FTO
rs13329835 | 16 | 79208306 | A/G | 0.22 | 1.14 (1.09–1.19) | 9.2 × 10^-8 | 1.08 (1.05–1.10) | 5.8 × 10^-11 | 2.1 × 10^-16 | CDYL2
rs527616 | 18 | 22591422 | G/C | 0.38 | 0.91 (0.87–0.95) | 3.0 × 10^-5 | 0.95 (0.93–0.97) | 3.1 × 10^-7 | 1.6 × 10^-10 | None
rs1436904 | 18 | 22824665 | T/G | 0.4 | 0.93 (0.9–0.97) | 0.0008 | 0.96 (0.94–0.98) | 6.9 × 10^-6 | 3.2 × 10^-8 | CHST9
rs4808801 | 19 | 18432141 | A/G | 0.35 | 0.94 (0.90–0.98) | 0.0027 | 0.93 (0.91–0.95) | 3.9 × 10^-13 | 4.6 × 10^-15 | SSBP4-ISYNA1-ELL
rs3760982 | 19 | 48978353 | G/A | 0.46 | 1.06 (1.02–1.10) | 0.0022 | 1.06 (1.04–1.08) | 2.5 × 10^-8 | 2.1 × 10^-10 | C19orf61-KCNN4-LYPD5-ZNF283
rs132390 | 22 | 27951477 | T/C | 0.036 | 1.36 (1.19–1.54) | 3.0 × 10^-6 | 1.12 (1.07–1.18) | 5.9 × 10^-6 | 3.1 × 10^-9 | EMID1-RHBDD3-EWSR1
rs6001930 | 22 | 39206180 | T/C | 0.11 | 1.17 (1.11–1.25) | 2.9 × 10^-7 | 1.12 (1.09–1.16) | 2.0 × 10^-13 | 8.8 × 10^-19 | MKL1

Results for the SNPs showing the strongest association in each region are given.
(a) Chromosome. (b) Build 36 position. (c) Major/minor allele, based on the forward strand and minor allele frequency in Europeans. (d) Mean minor allele frequency over all European controls in iCOGS. (e) One-degree-of-freedom Ptrend. (f) Per-allele OR for the minor allele relative to the major allele.
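As a reading aid for Table 2, the reported OR and 95% CI of a row can be converted back into an approximate standard error, z score and two-sided P value. The sketch below does this for the iCOGS estimate of rs616488 (OR 0.94, 95% CI 0.92–0.96). It assumes a normal approximation on the log-OR scale; the function name `z_and_p_from_ci` is hypothetical, and because the published limits are rounded to two decimals the result only approximates the P value in the table.

```python
import math


def z_and_p_from_ci(or_estimate, ci_low, ci_high, z_crit=1.959964):
    """Approximate SE, z score and two-sided P from an OR and its 95% CI.

    Assumes the CI was built as exp(log(OR) +/- z_crit * SE).
    """
    log_or = math.log(or_estimate)
    # The CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_or / se
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# iCOGS estimate for rs616488, values as printed in Table 2.
se, z, p = z_and_p_from_ci(0.94, 0.92, 0.96)
print(f"SE = {se:.4f}, z = {z:.2f}, P = {p:.1e}")
```

The recovered P value is of the same order of magnitude as the tabulated one; exact agreement should not be expected from rounded inputs.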

Two associated loci lie within or close to known breast cancer susceptibility genes. rs11571833 is a polymorphic variant in BRCA2 that introduces a premature stop codon (p.Lys3326*), previously reported to have no association with breast cancer risk [27]. The results from the current study, however, indicate that this variant is associated with a modestly higher risk of breast cancer. Further work will be required to determine whether this association is due to a higher-risk variant or variants in linkage disequilibrium (LD) with it. SNP rs132390 at 22q12 lies within an intron of EMID1 but is ~500 kb upstream of CHEK2, raising the possibility that this association is mediated through the latter. CHEK2 c.1100delC, the major deleterious CHEK2 variant in European populations [28], occurs more frequently in association with the risk allele at rs132390 (r^2 = 0.06); however, the association between rs132390 and breast cancer risk persisted after adjustment for CHEK2 c.1100delC, although attenuated (unadjusted OR in iCOGS = 1.12, P = 5.9 × 10^-6; adjusted OR = 1.09, P = 0.04). In addition to rs11571833, one further SNP is a coding variant: rs11552449 encodes a missense substitution p.His61Tyr in DCLRE1B (also known as SNM1B), an evolutionarily conserved gene involved in DNA stability and the repair of interstrand cross-links [29].
The remaining loci are either intronic (20) or intergenic (19).

Two loci lie within genes previously proposed as candidate breast cancer susceptibility genes. SNP rs12493607 lies in intron 2 of TGFBR2. An analysis of genes in the transforming growth factor (TGF)-β signaling pathway in European populations found weak evidence of an association between rs4522809 and breast cancer risk (P = 0.02) [30]. This SNP is weakly correlated with rs12493607 (r^2 = 0.25) and also showed some evidence of association in our study, although weaker than that seen for rs12493607 (iCOGS P = 0.00096; combined analysis of GWAS and iCOGS P = 0.0029). A similar analysis of candidate SNPs in Asian populations identified SNP rs1078985 as a potential breast cancer susceptibility variant [31]. This variant, however, was uncorrelated with rs12493607 in Europeans and showed no evidence of association in our study (P = 0.33 in the iCOGS stage). SNP rs7904519 lies in intron 4 of TCF7L2. A previous candidate gene study found weak evidence for an association between a correlated SNP, rs12255372, associated with type 2 diabetes (r^2 = 0.37 with rs7904519), and familial breast cancer (P = 0.04) [32].

The identification of the genes and variants underlying these associations will require more detailed fine mapping and functional analysis. Nevertheless, it is possible to discern some patterns. We identified 53 genes within 50 kb of the lead SNPs in the newly associated regions, totaling 96 genes when including the previously known loci. Analysis using Ingenuity Systems Pathway Analysis (IPA) identified an excess of genes reported to be involved in tumorigenesis (34 genes; P = 0.0005), breast cancer (15 genes; P = 2 × 10^-5) and tumor incidence in model systems (10 genes; P = 2 × 10^-7). The most consistently over-represented functions were cell death (P = 0.0028), differentiation (P = 2 × 10^-5) and expression (P = 2 × 10^-8).

Three loci are located in the vicinity of susceptibility regions for other cancer types.
SNP rs11780156 lies ~400 kb downstream of MYC. Previous GWAS have identified multiple loci upstream of MYC that are associated with different cancer types, including a locus for breast cancer. Functional studies have indicated that these associations might be mediated through transcriptional regulation of MYC. The newly associated locus is ~300 kb centromeric to a previously reported susceptibility locus for ovarian cancer, rs10088218, but is uncorrelated with it (r^2 = 0.02, based on data from European subjects in BCAC), raising the possibility that these loci might also be regulating MYC [33]. SNP rs9790517 at 4q24 lies ~20 kb away from SNP rs7679673, previously reported to be associated with prostate cancer [34], and is correlated with it (r^2 = 0.53). SNP rs9790517 lies in intron 11 of TET2, which encodes a methylcytosine dioxygenase involved in myelopoiesis. Mutations in TET2 are frequent in hematological malignancies but have also been reported in 2 of 47 breast tumors in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. In addition, Pharoah et al. [18] have found an association between rs1243180 and ovarian cancer. This SNP is ~120 kb telomeric to rs7072776 and is partially correlated with it (r^2 = 0.51); both SNPs and the neighboring breast cancer-associated locus rs11814448 lie within the region 400 kb upstream of DNAJC1.

To further investigate the likely genes underlying the susceptibility variants, we examined associations between the lead SNPs and the RNA expression of neighboring genes in 473 primary breast tumors and 61 normal breast tissue samples in The Cancer Genome Atlas (TCGA) database. We found strong evidence for an association between rs616402 (a surrogate for rs616488; r^2 = 0.66) and expression of PEX14 in both tumor (P = 4.7 × 10^-12) and normal tissue (P = 0.00018; Supplementary Table 8), between rs3760983 (a surrogate for rs3760982; r^2 = 1) and expression of both ZNF404 (P = 1.2 × 10^-6 in tumors) and ZNF283 (P = 0.0089) and between rs3903072 and expression of CTSW (P = 4.9 × 10^-5). SNP rs3760982 was also found to be associated with the expression of ZNF45 (P = 0.0077), ZNF283 (P = 0.05) and ZNF222 (P = 0.01) in lymphoblastoid cell lines from HapMap samples using the Genevar database [35] (Supplementary Table 8c). After adjustment for the SNP in the region most strongly associated with expression, SNP rs616488 and PEX14 (P = 0.0071) as well as rs1217396 (a proxy for rs11552449) and PTPN22 (P = 0.0055) and DCLRE1B (P = 0.0067) reached nominal significance at P < 0.01 (Supplementary Table 8a). Although none of these passed Bonferroni correction for multiple testing, the three associations found exceeded the number expected by chance with 46 associations tested (46 × 0.01 ≈ 0.5). This supports some transcriptional effect from the risk-associated SNPs. PEX14 is involved in peroxisome organization and protein and transmembrane transport; mutations in PEX14 have been associated with Zellweger syndrome [36]. The functions of ZNF45, ZNF222 and ZNF283 are unknown but may involve transcriptional regulation.

In addition to the genes described above, plausible candidate genes exist in several of the newly associated regions. MUS81 at 11q13 has a key role in the maintenance of genomic stability and in DNA repair pathways [37,38], and the cofilin gene (CFL1) is required for tumor cell motility and invasion, particularly in mammary tumors [39,40]. Several other genes have been associated with tumor aggressiveness; these include PTH1R at 3p21, FOXQ1 at 6p25, ARHGEF5 at 7q35 and MKL1 at 22q13. PTH1R is the receptor for PTHLH, encoded by a previously identified breast cancer susceptibility locus [15]. PTHLH is required for normal mammary gland function and has been shown to be involved in the metastasis of breast cancer cells to bone [41,42]. FOXQ1 encodes a transcription factor with a key role in cell proliferation and migration and in breast cancer metastasis [43]. Alterations in its expression level induce mesenchymal-epithelial transition [44].
Dysfunctional ARHGEF5 acts as an oncogene specific for human breast tissue, with a crucial role in tumorigenesis and metastasis in breast cancer [45]. MKL1 is also involved in tumor cell invasion and metastasis, particularly in human breast carcinoma [46]. Two of the newly associated SNPs lie within the TCF7L2 and FTO genes, previously associated with type 2 diabetes and/or obesity through GWAS [47–49]. TCF7L2 acts as a proto-oncogene and is involved in the Wnt pathway and in tumor formation [50]. PAX9 at 14q13.3 encodes a transcription factor that regulates cell proliferation, migration and resistance to apoptosis [51,52]. SSBP4 is involved in DNA recombination and repair and has been suggested to have tumor suppressor activity [53,54]. The expression of KREMEN1 at 22q12.1 is lower or absent in human tumors compared to normal tissue [55,56].

Figure 2 Distribution of normalized effect sizes (z scores) in the iCOGS stage, with the direction of effect determined by the direction in the combined GWAS. The blue curve represents the standard normal distribution. The green curve represents the best-fit normal distribution (mean = 0.19, s.d. = 1.22). Axes: z score (horizontal), density (vertical).

This gene encodes a negative regulator of the Wnt/β-catenin pathway, which has a key role in cell fate determination, stem cell regulation and cell differentiation and proliferation. It has been suggested that lack of KREMEN1 would activate the Wnt/β-catenin pathway, thereby enhancing susceptibility to tumorigenesis [55,56]. Finally, NTN4 at 12q22 encodes a secreted growth factor that regulates tumor growth. High levels of NTN4 have been found in ER-positive but not ER-negative breast tumors [57]. NTN4 expression in tumors has also been suggested as a potential prognostic marker for breast cancer [57].

Overall contribution to breast cancer susceptibility
On the assumption that the risks conferred by common susceptibility loci combine multiplicatively (no interaction on a log-additive scale) and on the basis of the per-allele OR estimates from the iCOGS stage, we determined that the 41 newly associated loci explain approximately 5% of the familial risk of breast cancer. However, the overall excess of significant associations for SNPs selected from the breast cancer GWAS for genotyping in the iCOGS stage suggests that a much larger number of loci contribute to susceptibility, although they did not have associations reaching genome-wide levels of significance in the current study.

To assess this hypothesis more formally, we identified a set of 10,668 SNPs selected from the GWAS that were uncorrelated (r^2 < 0.1 between any pair). Of these, the estimated OR was in the same direction as in the combined GWAS for 5,918 SNPs and in the opposite direction for 4,750 SNPs. Assuming that SNPs with effects in opposite directions are not associated with risk, an estimated 1,168 loci selected from the GWAS are associated with risk. However, this is an underestimate because weakly associated SNPs might have effects in opposite directions in the two stages.
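The direction-of-effect estimate above follows from a simple counting argument: if unassociated SNPs fall in the same or the opposite direction with equal probability, and truly associated SNPs always fall in the same direction, then the number of associated SNPs is the same-direction count minus the opposite-direction count. The sketch below reproduces that arithmetic from the counts given in the text. The normal-approximation sign test at the end is an added illustration, not something reported in the paper, and the variable names are hypothetical.

```python
import math

# Counts reported in the text for the 10,668 uncorrelated GWAS-selected SNPs.
n_same = 5918       # iCOGS OR in the same direction as the combined GWAS
n_opposite = 4750   # iCOGS OR in the opposite direction
n_total = n_same + n_opposite

# Null SNPs split 50/50, so they contribute about n_opposite to each side;
# the surplus on the "same direction" side is attributed to true associations.
estimated_associated = n_same - n_opposite
fraction_associated = estimated_associated / n_total

# Illustrative sign test (normal approximation to Binomial(n_total, 0.5)).
z = (n_same - n_total / 2) / math.sqrt(n_total / 4)

print(f"total SNPs: {n_total}")
print(f"estimated associated loci: {estimated_associated}")
print(f"fraction of SNPs: {fraction_associated:.3f}")
print(f"sign-test z score: {z:.1f}")
```

The z score of roughly 11 indicates that the imbalance is far larger than chance would produce, which is consistent with the authors' point that many loci below genome-wide significance are real.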
As an alternative approach, we fitted the distribution of z scores for the iCOGS stage, aligned to the direction of the effect in the GWAS, as a mixture of two normal distributions representing those SNPs that were or were not associated with disease (Fig. 2 and Online Methods) [58]. On the basis of the posterior probabilities from this analysis, an estimated 92% of loci (n = 9,815) were associated with breast cancer risk (95% CI = 85–100%), and these contributed approximately 18% of the familial risk of breast cancer. It should be noted, however, that the large majority of the loci had very small individual effects on risk: for example, the estimated OR was >1.05 for only 10 loci, and 920 loci had an estimated OR of >1.02. When taking into account effects from the previously known loci, these analyses suggest that ~28% of familial risk is explained by common variants selected for iCOGS, of which ~14% can be explained by the 67 established loci (with a further ~20% due to higher penetrance loci).

DISCUSSION
To our knowledge, this is the largest genetic association study in cancer so far. The power of this approach is demonstrated by the fact that we have found evidence, at genome-wide levels of significance, for more than 40 new susceptibility loci, more than doubling the number of susceptibility loci for breast cancer. The effect sizes of the newly identified loci are generally modest (the highest OR was 1.26). However, the very high levels of statistical significance, the lack of heterogeneity among studies, the generally higher effect sizes for familial cases and the fact that most of the excess of significant associations was concentrated among SNPs selected on the basis of an association in the combined breast cancer GWAS all indicate that these are robust associations.
Although the majority of the data are from populations of Northern and Western European ancestry, there was little or no evidence of heterogeneity in the OR estimates between studies, indicating that the associations apply broadly to populations of European ancestry.

With more than 60 established breast cancer susceptibility loci, it is becoming possible to discern some more general patterns among the loci. Although most of the underlying genes and variants remain to be identified, there is a clear excess of genes either known to be involved in tumorigenesis in model systems or involved in processes relevant to cancer, such as cell death and differentiation. However, for other loci, such as PEX14, there is no obvious link to cancer susceptibility. Nine of the new loci lie in chromosomal regions with no known genes, suggesting that these may provide further examples of long-range regulation similar to that seen in the 8q24 region [59]. We have identified three additional examples of loci in the vicinity of susceptibility loci for other cancers (TET2, 8q24 and DNAJC1). These associations might reflect the tissue-specific regulation of key genes, and understanding the functional mechanisms underlying these associations may be particularly informative.

On the basis of the current set of loci and assuming that all loci combine multiplicatively, the currently known loci now define a genetic profile for which 5% of the female population has a risk that is ~2.3-fold higher than the population average and for which 1% of the population has a risk that is ~3-fold higher. However, the large excess of significant associations among the SNPs selected from the GWAS suggests that many more susceptibility loci exist that have not met our threshold for genome-wide-significant association in this study and that these explain a similar fraction of the heritability as the currently known loci. The observation, made by comparing effect sizes in the iCOGS stage with those in the GWAS, that a very large number of loci, perhaps several thousand, contribute to polygenic susceptibility to breast cancer is consistent with results from GWAS in other complex disorders such as schizophrenia, using a different analytical approach [60].
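The risk-profile figures quoted above (~2.3-fold for the top 5% and ~3-fold for the top 1%) can be approximately reproduced with a log-normal polygenic model. The sketch below is an illustration under stated assumptions rather than the authors' calculation: it assumes an overall familial relative risk of 2 for first-degree relatives (a value not given in this section), takes the ~14% of familial risk attributed to established loci from the text, and interprets the quoted fold increases as the mean risk within the top fraction of the population relative to the population average. The function name `mean_relative_risk_in_top` is hypothetical.

```python
import math
from statistics import NormalDist  # Python 3.8+

# Assumptions (labelled): overall familial relative risk and fraction explained.
FAMILIAL_RELATIVE_RISK = 2.0   # assumed; not stated in this section
FRACTION_EXPLAINED = 0.14      # "~14%" for the established loci, from the text

# If log relative risk is normal with variance sigma^2 and first-degree
# relatives share half of it, the familial relative risk is exp(sigma^2 / 2).
# The known loci therefore contribute sigma^2 = fraction * 2 * ln(lambda).
sigma = math.sqrt(FRACTION_EXPLAINED * 2.0 * math.log(FAMILIAL_RELATIVE_RISK))


def mean_relative_risk_in_top(fraction_top, sigma):
    """Mean risk, relative to the population average, in the top fraction
    of a log-normal risk distribution with log-scale s.d. `sigma`."""
    std_normal = NormalDist()
    z_cut = std_normal.inv_cdf(1.0 - fraction_top)
    # E[risk; risk above cut-off] / E[risk] = 1 - Phi(z_cut - sigma).
    return (1.0 - std_normal.cdf(z_cut - sigma)) / fraction_top


top5 = mean_relative_risk_in_top(0.05, sigma)
top1 = mean_relative_risk_in_top(0.01, sigma)
print(f"log-scale s.d. = {sigma:.3f}")
print(f"top 5%: {top5:.2f}-fold, top 1%: {top1:.2f}-fold")
```

Under these assumptions the sketch gives values close to the ~2.3-fold and ~3-fold figures in the text. Raising `FRACTION_EXPLAINED` shows how much the high-risk tail would widen if further loci were incorporated into the profile.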
Incorporating these loci into risk models should substantially improve disease prediction, even if not all loci can be identified individually. Moreover, fine-scale mapping of the identified regions may uncover more of the missing heritability, either through identifying a more strongly associated variant (as found for the CCND1 locus; see French et al. [61]) or by identifying additional signals (exemplified for the TERT region in Bojesen et al. [62]). Genetic profiling using these common susceptibility loci in combination with rarer high-risk loci and other risk factors may provide a rational basis for targeted breast cancer prevention.

URLs. TCGA, http://cancergenome.nih.gov/; IPA, http://www.ingenuity.com/products/ipa; COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/; BCAC, http://ccge.medschl.cam.ac.uk/consortia/bcac/index.html; CIMBA, http://ccge.medschl.cam.ac.uk/consortia/cimba/index.html; OCAC, http://ccge.medschl.cam.ac.uk/consortia/ocac/index.html; PRACTICAL, http://ccge.medschl.cam.ac.uk/consortia/practical/index.html; COGS, http://www.cogseu.org/; iCOGS, http://ccge.medschl.cam.ac.uk/research/consortia/icogs/; Illumina GenCall, http://www.illumina.com/Documents/products/technotes/technote_gencall_data_analysis_software.pdf; SNAP, http://www.broadinstitute.org/mpg/snap/ldplot.php.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
The authors wish to thank all the individuals who took part in these studies and all the researchers, clinicians, technicians and administrative staff who have enabled this work to be carried out. BCAC is funded by Cancer Research UK (C1287/A10118 and C1287/A12014) and by the European Community's Seventh Framework Programme under grant agreement 223175 (HEALTH-F2-2009-223175) (COGS). Meetings of BCAC have been funded by the European Union European Cooperation in Science and Technology (COST) programme (BM0606). Genotyping of the iCOGS array was funded by the European Union (HEALTH-F2-2009-223175), Cancer Research UK (C1287/A10710), the Canadian Institutes of Health Research (CIHR) for the CIHR Team in Familial Risks of Breast Cancer program and the Ministry of Economic Development, Innovation and Export Trade of Quebec (grant PSR-SIIRI-701). Combining the GWAS data was supported in part by the US National Institutes of Health (NIH) Cancer Post-Cancer GWAS initiative grant 1 U19 CA 148065-01 (DRIVE, part of the GAME-ON initiative). A full description of funding and acknowledgments is provided in the Supplementary Note.

AUTHOR CONTRIBUTIONS
K. Michailidou and D.F.E. performed the statistical analysis and drafted the manuscript. D.F.E. conceived and coordinated the synthesis of the iCOGS array and led BCAC. P.H. coordinated COGS. J. Benitez led the iCOGS genotyping working group. A.G.-N., G.P., M.R.A., J. Benitez, D.V., F.B., D.C.T., J. Simard, A.M.D. and C.L. coordinated genotyping of the iCOGS array. M.G.-C., P.D.P.P. and M.K.S. led the BCAC pathology and survival working group. J.C.-C. led the BCAC risk factor working group. A.M.D. and G.C.-T. led the iCOGS quality control working group. J.D., E.D., M. Ghoussaini and A. Lee provided bioinformatics support. M.K.B. and Q. Wang provided data management support for BCAC. S.C. and L.F.A.W. provided analysis of the TCGA expression data. C.T., N.R. and D.F.E. led the UK2 GWAS. O.F., J.P. and I.d.S.S. led the BBCS GWAS. H.N., T.A.M., K. Aittomäki and C.B. led the HEBCS GWAS. P.H., K.C., A.I. and J. Liu led the SASBAC GWAS. Q. Waisfisz, H.M.-H., M.A. and R.B.v.d.L.
led the DFBBCS GWAS. J.C.-C., R.H., N.D. and L. Beckman led the MARIE GWAS. A. Meindl, R.K.S., B.M.-M. and P.L. led the GC-HBOC GWAS. J.L.H., M.C.S., E.M., D.F.S. and H.T. led the ABCFS GWAS. A.G.U. and A. Hofman led the genotyping in the Rotterdam study. D.J.H. and S.J.C. led the CGEMS GWAS. F.J.C. and S. Slager coordinated TNBCC. C.A.H., B.E.H., F.S. and L.L.M. coordinated MEC. P.D.P.P., D.F.E. and M. Shah coordinated SEARCH. R.L. coordinated EPIC-Norfolk. J. Brown coordinated SIBS. P.H., K.C., N.S., K.H. and J. Li coordinated SASBAC and pKARMA. S.E.B., B.G.N., S.F.N. and H.F. coordinated CGPS. F.J.C., X.W., C.V. and K.N.S. coordinated MCBCS. D.L., M.M., R.P. and M.-R.C. coordinated LMBC. J.C.-C., A.R., S.N. and D.F.-J. coordinated MARIE. N.J., L.G. and Z.A. coordinated BBCS. K. Aaltonen and T.H. coordinated HEBCS. M.K.S., A.B., L.J.V.t.V. and C.E.v.d.S. coordinated ABCS. P.G., T.T., P.L.-P. and F. Menegaux coordinated CECILE. F. Marme, A. Schneeweiss, C. Sohn and B. Burwinkel coordinated BSUCH. R.L.M., A.G.-N., M.P.Z., J.I.A.P. and J. Benitez coordinated CNIO-BCS. A.C., I.W.B., S.S.C. and M.W.R.R. coordinated SBCS. E.J.S., I.T., M.J.K. and N.M. coordinated BIGGS. I.L.A., J.A.K., G.G. and A.M.M. coordinated OFBCR. A. Lindblom and S. Margolin coordinated KARBAC. M.J.H., A. Hollestelle, A.M.W.v.d.O. and A. Jager coordinated RBCS. J.L.H., M.C.S., Q.M.B., J. Stone, G.S.D. and C.A. coordinated ABCFS. J.L.H., M.C.S., G.G.G., G.S. and L. Baglietto coordinated MCCS. P.A.F., L.H., A.B.E. and M.W.B. coordinated BBCC. H. Brenner, H. Müller, V.A. and C. Stegmaier coordinated ESTHER. A. Swerdlow, A.A., N.O., M.J. and M.G.-C. coordinated UKBGS. M.G.-C., J.F., J. Lissowska and L. Brinton coordinated PBCS. M.S.G., F.L., M.D. and J. Simard coordinated MTLGEBCS. R.W., K.P., A.J.-V. and M. Grip coordinated OBCS. H. Brauch, U.H. and T.B. coordinated GENICA. P.R., P.P., S. Manoukian and B. Bonanni coordinated MBCSG. P.D., R.A.E.M.T., C. Seynaeve and C.J.v.A. coordinated ORIGO. A.
Jakubowska, J. Lubinski, K.J. and K.D. coordinated SZBCS. A. Mannermaa, V.K., V.-M.K. and J.M.H. coordinated KBCP. N.V.B., N.N.A. and T.D. coordinated HMBCS. V.N.K. coordinated NBCS. H.A.-C. coordinated UCIBCS. A.E.T. coordinated OSU. S.E. coordinated RPCI. F.F. coordinated DEMOKRITOS. D.K., K.-Y.Y. and D.-Y.N. coordinated SEBCS. K. Matsuo, H. Ito, H. Iwata and A. Sueta coordinated HERPACC. A.H.W., C.-C.T., D.V.D.B. and D.O.S. coordinated LAABC. W.Z., X.-O.S., W.L., Y.-T.G. and H.C. coordinated SGBCS. S.H.T., C.H.Y., S.Y.P. and B.K.C. coordinated MYBRCA. M.H., H. Miao, W.Y.L. and J.-H.S. coordinated SGBCC. K. Muir, A. Lophatananon, S.S.-B. and P.S. coordinated ACP. C.-Y.S., C.-N.H., P.-E.W. and S.-L.D. coordinated TWBCS. S. Sangrajrang, V.G., P.B. and J.M. coordinated TBCS. W.J.B., L.B.S., Q.C. and W.Z. coordinated SCCS. W.Z., S.D.-H., M. Shrubsole and J. Long coordinated NBHS. G.C.-T. coordinated the genotyping component of kConFab. All authors provided critical review of the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Kamangar, F., Dores, G.M. & Anderson, W.F. Patterns of cancer incidence, mortality, and prevalence across five continents: defining priorities to reduce cancer disparities in different geographic regions of the world. J. Clin. Oncol. 24, 21372150 (2006). 2. Lichtenstein, P. et al. Environmental and heritable factors in the causation of canceranalyses of cohorts of twins from Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 7885 (2000). 3. Peto, J. & Mack, T.M. High constant incidence in twins and other relatives of women with breast cancer. Nat. Genet. 26, 411414 (2000). 4. Easton, D.F. et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, 10871093 (2007). 5. Hunter, D.J. et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39, 870874 (2007). 6. Stacey, S.N. et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptorpositive breast cancer. Nat. Genet. 39, 865869 (2007). 7. Stacey, S.N. et al. Common variants on chromosome 5p12 confer susceptibility to estrogen receptorpositive breast cancer. Nat. Genet. 40, 703706 (2008). 8. Ahmed, S. et al. Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat. Genet. 41, 585590 (2009). 9. Zheng, W. et al. Genome-wide association study identifies a new breast cancer susceptibility locus at 6q25.1. Nat. Genet. 41, 324328 (2009). 10. Thomas, G. et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat. Genet. 41, 579584 (2009). 11. Turnbull, C. et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504507 (2010). 12. Antoniou, A.C. et al. A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptornegative breast cancer in the general population. Nat. 
Genet. 42, 885892 (2010). 13. Fletcher, O. et al. Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J. Natl. Cancer Inst. 103, 425435 (2011). 14. Haiman, C.A. et al. A common variant at the TERT-CLPTM1L locus is associated with estrogen receptornegative breast cancer. Nat. Genet. 43, 12101214 (2011). 15. Ghoussaini, M. et al. Genome-wide association analysis identifies three new breast cancer susceptibility loci. Nat. Genet. 44, 312318 (2012). 16. Siddiq, A. et al. A meta-analysis of genome-wide association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Hum. Mol. Genet. 21, 53735384 (2012). 17. Eeles, R.A. et al. Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom genotyping array. Nat. Genet. published online; doi:10.1038/ ng.2560 (27 March 2013). 18. Pharoah, P.D.P. et al. GWAS meta-analysis and replication identifies three new susceptibility loci for ovarian cancer. Nat. Genet. published online; doi:10.1038/ ng.2564 (27 March 2013). 19. Couch, F.J. et al. Genome-wide association study in BRCA1 mutation carriers identifies novel loci associated with breast and ovarian cancer risk. PLoS Genet. 9, e1003212 (2013). 20. Gaudet, M.M. et al. Identification of a BRCA2-specific modifier locus at 6p24 related to breast cancer risk. PLoS Genet. 9, e1003173 (2013). 21. Cox, A. et al. A common coding variant in CASP8 is associated with breast cancer risk. Nat. Genet. 39, 352358 (2007). 22. Turnbull, C. et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504507 (2010). 23. Lambrechts, D. et al. 11q13 is a susceptibility locus for hormone receptor positive breast cancer. Hum. Mutat. 33, 11231132 (2012). 24. Stevens, K.N. et al. 19p13.1 is a triple-negative-specific breast cancer susceptibility locus. Cancer Res. 72, 17951803 (2012). 25. Antoniou, A.C. & Easton, D.F. 
Polygenic inheritance of breast cancer: implications for design of association studies. Genet. Epidemiol. 25, 190202 (2003). 26. Figueroa, J.D. et al. Associations of common variants at 1p11.2 and 14q24.1 (RAD51L1) with breast cancer risk and heterogeneity by tumor subtype: findings from the Breast Cancer Association Consortium. Hum. Mol. Genet. 20, 46934706 (2011). 27. Mazoyer, S. et al. A polymorphic stop codon in BRCA2. Nat. Genet. 14, 253254 (1996). 28. Schutte, M. et al. Variants in CHEK2 other than 1100delC do not make a major contribution to breast cancer susceptibility. Am. J. Hum. Genet. 72, 10231028 (2003). 29. Hemphill, A.W. et al. Mammalian SNM1 is required for genome stability. Mol. Genet. Metab. 94, 3845 (2008). 30. Scollen, S. et al. TGF- signaling pathway and breast cancer susceptibility. Cancer Epidemiol. Biomarkers Prev. 20, 11121119 (2011). 31. Ma, X. et al. Pathway analyses identify TGFBR2 as potential breast cancer susceptibility gene: results from a consortium study among Asians. Cancer Epidemiol. Biomarkers Prev. 21, 11761184 (2012). 32. Burwinkel, B. et al. Transcription factor 7like 2 (TCF7L2) variant is associated with familial breast cancer risk: a case-control study. BMC Cancer 6, 268 (2006). 33. Goode, E.L. et al. A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat. Genet. 42, 874879 (2010).

34. Eeles, R.A. et al. Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nat. Genet. 41, 1116–1121 (2009).
35. Yang, T.P. et al. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies. Bioinformatics 26, 2474–2476 (2010).
36. Shimozawa, N. et al. Identification of a new complementation group of the peroxisome biogenesis disorders and PEX14 as the mutated gene. Hum. Mutat. 23, 552–558 (2004).
37. Murfuni, I. et al. The WRN and MUS81 proteins limit cell death and genome instability following oncogene activation. Oncogene 32, 610–620 (2013).
38. Pamidi, A. et al. Functional interplay of p53 and Mus81 in DNA damage responses and cancer. Cancer Res. 67, 8527–8535 (2007).
39. Leong, S., McKay, M.J., Christopherson, R.I. & Baxter, R.C. Biomarkers of breast cancer apoptosis induced by chemotherapy and TRAIL. J. Proteome Res. 11, 1240–1250 (2012).
40. Wang, W. et al. The activity status of cofilin is directly related to invasion, intravasation, and metastasis of mammary tumors. J. Cell Biol. 173, 395–404 (2006).
41. Dunbar, M.E., Wysolmerski, J.J. & Broadus, A.E. Parathyroid hormone-related protein: from hypercalcemia of malignancy to developmental regulatory molecule. Am. J. Med. Sci. 312, 287–294 (1996).
42. Dunbar, M.E. et al. Stromal cells are critical targets in the regulation of mammary ductal morphogenesis by parathyroid hormone-related protein. Dev. Biol. 203, 75–89 (1998).
43. Qiao, Y. et al. FOXQ1 regulates epithelial-mesenchymal transition in human cancers. Cancer Res. 71, 3076–3086 (2011).
44. Kaneda, H. et al. FOXQ1 is overexpressed in colorectal cancer and enhances tumorigenicity and tumor growth. Cancer Res. 70, 2053–2063 (2010).
45. Debily, M.A. et al. Expression and molecular characterization of alternative transcripts of the ARHGEF5/TIM oncogene specific for human breast cancer. Hum. Mol. Genet. 13, 323–334 (2004).
46. Muehlich, S. et al. The transcriptional coactivators megakaryoblastic leukemia 1/2 mediate the effects of loss of the tumor suppressor deleted in liver cancer 1. Oncogene 31, 3913–3923 (2012).
47. Frayling, T.M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).
48. Grant, S.F. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38, 320–323 (2006).
49. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881–885 (2007).
50. Jingushi, K. et al. DIF-1 inhibits the Wnt/β-catenin signaling pathway by inhibiting TCF7L2 expression in colon cancer cell lines. Biochem. Pharmacol. 83, 47–56 (2012).
51. Dantuma, N.P., Heinen, C. & Hoogstraten, D. The ubiquitin receptor Rad23: at the crossroads of nucleotide excision repair and proteasomal degradation. DNA Repair (Amst.) 8, 449–460 (2009).
52. Lee, J.C. et al. Pax9 mediated cell survival in oral squamous carcinoma cell enhanced by c-myb. Cell Biochem. Funct. 26, 892–899 (2008).
53. Castro, P., Liang, H., Liang, J.C. & Nagarajan, L. A novel, evolutionarily conserved gene family with putative sequence-specific single-stranded DNA-binding activity. Genomics 80, 78–85 (2002).
54. Sanchez-Cespedes, M. et al. Chromosomal alterations in lung adenocarcinoma from smokers and nonsmokers. Cancer Res. 61, 1309–1313 (2001).
55. Nakamura, T. et al. Molecular cloning and characterization of Kremen, a novel kringle-containing transmembrane protein. Biochim. Biophys. Acta 1518, 63–72 (2001).
56. Nakamura, T., Nakamura, T. & Matsumoto, K. The functions and possible significance of Kremen as the gatekeeper of Wnt signalling in development and pathology. J. Cell Mol. Med. 12, 391–408 (2008).
57. Esseghir, S. et al. Identification of NTN4, TRA1, and STC2 as prognostic markers in breast cancer in a screen for signal sequence encoding proteins. Clin. Cancer Res. 13, 3164–3173 (2007).
58. Morris, A.P. et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990 (2012).
59. Ahmadiyeh, N. et al. 8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with MYC. Proc. Natl. Acad. Sci. USA 107, 9742–9746 (2010).
60. Purcell, S.M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
61. French, J.D. et al. Functional variants at the 11q13 risk locus regulate cyclin D1 expression through long-range enhancers. Am. J. Hum. Genet. published online; doi:10.1016/j.ajhg.2013.01.002 (27 March 2013).
62. Bojesen, S.E. et al. Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer. Nat. Genet. published online; doi:10.1038/ng.2566 (27 March 2013).

Kyriaki Michailidou1,138, Per Hall2,138, Anna Gonzalez-Neira3, Maya Ghoussaini4, Joe Dennis1, Roger L Milne5, Marjanka K Schmidt6,7, Jenny Chang-Claude8, Stig E Bojesen9,10, Manjeet K Bolla1, Qin Wang1, Ed Dicks4, Andrew Lee1, Clare Turnbull11, Nazneen Rahman11, The Breast and Ovarian Cancer Susceptibility Collaboration12, Olivia Fletcher13, Julian Peto14, Lorna Gibson14, Isabel dos Santos Silva14, Heli Nevanlinna15, Taru A Muranen15, Kristiina Aittomäki16, Carl Blomqvist17, Kamila Czene2, Astrid Irwanto18, Jianjun Liu18, Quinten Waisfisz19, Hanne Meijers-Heijboer19, Muriel Adank19, Hereditary Breast and Ovarian Cancer Research Group Netherlands (HEBON)12, Rob B van der Luijt20, Rebecca Hein8,21, Norbert Dahmen22, Lars Beckman23, Alfons Meindl24, Rita K Schmutzler25,26, Bertram Müller-Myhsok27, Peter Lichtner28, John L Hopper29, Melissa C Southey30, Enes Makalic29, Daniel F Schmidt29, Andre G Uitterlinden31, Albert Hofman32, David J Hunter33, Stephen J Chanock34, Daniel Vincent35, François Bacot35, Daniel C Tessier35, Sander Canisius36, Lodewyk F A Wessels36, Christopher A Haiman37, Mitul Shah4, Robert Luben1, Judith Brown1, Craig Luccarini4, Nils Schoof2, Keith Humphreys2, Jingmei Li18, Børge G Nordestgaard9,10, Sune F Nielsen9,10, Henrik Flyger38, Fergus J Couch39, Xianshu Wang39, Celine Vachon40, Kristen N Stevens40, Diether Lambrechts41,42, Matthieu Moisse41,42, Robert Paridaens43, Marie-Rose Christiaens43, Anja Rudolph8, Stefan Nickels8, Dieter Flesch-Janys8,44,45, Nichola Johnson13, Zoe Aitken14, Kirsimari Aaltonen15–17, Tuomas Heikkinen15, Annegien Broeks6, Laura J Van't Veer6, C Ellen van der Schoot46, Pascal Guénel47,48, Thérèse Truong47,48, Pierre Laurent-Puig49, Florence Menegaux47,48, Frederik Marme50,51, Andreas Schneeweiss50,51, Christof Sohn50, Barbara Burwinkel50,52, M Pilar Zamora53, Jose Ignacio Arias Perez54, Guillermo Pita3, M Rosario Alonso3, Angela Cox55, Ian W Brock55, Simon S Cross56, Malcolm W R Reed55, Elinor J Sawyer57, Ian Tomlinson58,59, Michael J Kerin60, Nicola Miller60, Brian E Henderson37, Fredrick Schumacher37, Loic Le Marchand61, Irene L Andrulis62,63, Julia A Knight64,65, Gord Glendon62, Anna Marie Mulligan66,67, kConFab Investigators12, Australian Ovarian Cancer Study Group12, Annika Lindblom68, Sara Margolin69, Maartje J Hooning70, Antoinette Hollestelle70, Ans M W van den Ouweland71, Agnes Jager70, Quang M Bui29, Jennifer Stone29, Gillian S Dite29, Carmel Apicella29, Helen Tsimiklis30, Graham G Giles29,72, Gianluca Severi29,72, Laura Baglietto29,72, Peter A Fasching73,74, Lothar Haeberle73,
Arif B Ekici75, Matthias W Beckmann73, Hermann Brenner76, Heiko Müller76, Volker Arndt76, Christa Stegmaier77, Anthony Swerdlow11, Alan Ashworth13,78, Nick Orr13,78, Michael Jones11, Jonine Figueroa34, Jolanta Lissowska79, Louise Brinton34, Mark S Goldberg80,81, France Labrèche82, Martine Dumont83, Robert Winqvist84, Katri Pylkäs84, Arja Jukkola-Vuorinen85, Mervi Grip86, Hiltrud Brauch87,88, Ute Hamann89, Thomas Brüning90, The GENICA (Gene Environment Interaction and Breast Cancer in Germany) Network12, Paolo Radice91,92, Paolo Peterlongo91,92, Siranoush Manoukian93, Bernardo Bonanni94, Peter Devilee95,96, Rob A E M Tollenaar97, Caroline Seynaeve98, Christi J van Asperen99, Anna Jakubowska100, Jan Lubinski100, Katarzyna Jaworska100,101, Katarzyna Durda100, Arto Mannermaa102–104, Vesa Kataja104–106, Veli-Matti Kosma102–104, Jaana M Hartikainen102–104, Natalia V Bogdanova107,108, Natalia N Antonenkova109, Thilo Dörk107, Vessela N Kristensen110,111, Hoda Anton-Culver112, Susan Slager40, Amanda E Toland113, Stephen Edge114, Florentia Fostira115, Daehee Kang116, Keun-Young Yoo116, Dong-Young Noh116, Keitaro Matsuo117, Hidemi Ito117, Hiroji Iwata118, Aiko Sueta117, Anna H Wu37, Chiu-Chen Tseng37, David Van Den Berg37, Daniel O Stram37, Xiao-Ou Shu119, Wei Lu120, Yu-Tang Gao121, Hui Cai119, Soo Hwang Teo122,123, Cheng Har Yip123, Sze Yee Phuah122, Belinda K Cornes124, Mikael Hartman125,126, Hui Miao125, Wei Yen Lim125, Jen-Hwei Sng126, Kenneth Muir127, Artitaya Lophatananon127, Sarah Stewart-Brown127, Pornthep Siriwanarangsan128, Chen-Yang Shen129,130, Chia-Ni Hsiung129, Pei-Ei Wu131, Shian-Ling Ding132, Suleeporn Sangrajrang133, Valerie Gaborieau134, Paul Brennan134, James McKay134, William J Blot119,135, Lisa B Signorello119,135, Qiuyin Cai119, Wei Zheng119, Sandra Deming-Halverson119, Martha Shrubsole119, Jirong Long119, Jacques Simard83, Montse Garcia-Closas11,13,78, Paul D P Pharoah1,4, Georgia Chenevix-Trench136, Alison M Dunning4, Javier Benitez3,137 & Douglas F Easton1,4
1Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK. 2Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 3Human Genotyping Unit–Centro Nacional de Genotipado (CEGEN), Human Cancer Genetics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain. 4Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK. 5Genetic & Molecular Epidemiology Group, Human Cancer Genetics Programme, CNIO, Madrid, Spain. 6Division of Molecular Pathology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 7Division of Psychosocial Research and Epidemiology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 8Division of Cancer Epidemiology, Deutsches Krebsforschungszentrum, Heidelberg, Germany. 9Copenhagen General Population Study, Herlev Hospital, Copenhagen University Hospital, Copenhagen, Denmark. 10Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, Copenhagen, Denmark. 11Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, UK. 12A list of members is provided in the Supplementary Note. 13Breakthrough Breast Cancer Research Centre, The Institute of Cancer Research, London, UK. 14Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK. 15Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland. 16Department of Clinical Genetics, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland. 17Department of Oncology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland. 18Human Genetics Division, Genome Institute of Singapore, Singapore.
19Section of Oncogenetics, Department of Clinical Genetics, VU University Medical Center, Amsterdam, The Netherlands. 20Department of Medical Genetics, University Medical Center Utrecht, Utrecht, The Netherlands. 21PMV (Primär Medizinische Versorgung) Research Group at the Department of Child and Adolescent Psychiatry and Psychotherapy, University of Cologne, Cologne, Germany. 22Department of Psychiatry, University of Mainz, Mainz, Germany. 23Institute for Quality and Efficiency in Health Care (IQWiG), Cologne, Germany. 24Division for Gynaecological Tumor Genetics, Clinic of Gynaecology and Obstetrics, Technische Universität München, Munich, Germany. 25Centre of Familial Breast and Ovarian Cancer, University of Cologne, Cologne, Germany. 26Centre for Molecular Medicine (CMMC), University of Cologne, Cologne, Germany. 27Max Planck Institute of Psychiatry, Munich, Germany. 28Institute of Human Genetics, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 29Centre for Molecular, Environmental, Genetic, and Analytic Epidemiology, Melbourne School of Population Health, The University of Melbourne, Melbourne, Victoria, Australia. 30Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Melbourne, Victoria, Australia. 31Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands. 32Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands. 33Program in Molecular and Genetic Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 34Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA. 35McGill University and Génome Québec Innovation Centre, Montreal, Quebec, Canada. 36Division of Molecular Carcinogenesis, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands.
37Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, USA. 38Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Copenhagen, Denmark. 39Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA. 40Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA. 41Vesalius Research Center (VRC), VIB, Leuven, Belgium. 42Laboratory for Translational Genetics, Department of Oncology, University of Leuven, Leuven, Belgium. 43Department of Oncology, University Hospital Gasthuisberg, University of Leuven, Leuven, Belgium. 44Department of Cancer Epidemiology/Clinical Cancer Registry, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 45Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 46Sanquin Research, Amsterdam, The Netherlands. 47INSERM (National Institute of Health and Medical Research), CESP (Center for Research in Epidemiology and Population Health), U1018, Environmental Epidemiology of Cancer, Villejuif, France. 48Unité Mixte de Recherche Scientifique (UMRS) 1018, University Paris-Sud, Villejuif, France. 49UMRS 775, INSERM, Université Paris Sorbonne Cité, Paris, France. 50Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany. 51National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany. 52Molecular Epidemiology Group, German Cancer Research Center (DKFZ), Heidelberg, Germany. 53Servicio de Oncología Médica, Hospital Universitario La Paz, Madrid, Spain. 54Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain. 55Cancer Research UK/Yorkshire Cancer Research Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK. 56Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, Sheffield, UK.
57Division of Cancer Studies, National Institute for Health Research (NIHR) Comprehensive Biomedical Research Centre, Guy's & St. Thomas' National Health Service (NHS) Foundation Trust in partnership with King's College London, London, UK. 58Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK. 59Oxford Biomedical Research Centre, University of Oxford, Oxford, UK. 60Clinical Science Institute, University Hospital Galway, Galway, Ireland. 61University of Hawaii Cancer Center, Honolulu, Hawaii, USA. 62Ontario Cancer Genetics Network, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 63Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada. 64Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 65Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada. 66Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. 67Laboratory Medicine Program, University Health Network, Toronto, Ontario, Canada. 68Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden. 69Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden.
70Department of Medical Oncology, Erasmus University Medical Center, Rotterdam, The Netherlands. 71Department of Clinical Genetics, Erasmus University Medical Center, Rotterdam, The Netherlands. 72Cancer Epidemiology Centre, The Cancer Council Victoria, Melbourne, Victoria, Australia. 73University Breast Center Franconia, Department of Gynecology and Obstetrics, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany. 74Division of Hematology and Oncology, Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA. 75Institute of Human Genetics, Friedrich Alexander University Erlangen-Nuremberg, Erlangen, Germany. 76Division of Clinical Epidemiology and Aging Research, DKFZ, Heidelberg, Germany. 77Saarland Cancer Registry, Saarbrücken, Germany. 78Division of Breast Cancer Research, The Institute of Cancer Research, London, UK. 79Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland. 80Department of Medicine, McGill University, Montreal, Quebec, Canada. 81Division of Clinical Epidemiology, McGill University Health Centre, Royal Victoria Hospital, Montreal, Quebec, Canada. 82Département de Médecine Sociale et Préventive, Département de Santé Environnementale et Santé au Travail, Université de Montréal, Montreal, Quebec, Canada. 83Cancer Genomics Laboratory, Centre Hospitalier Universitaire de Québec and Laval University, Quebec City, Quebec, Canada. 84Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Genetics and Biocenter Oulu, University of Oulu, Oulu University Hospital, Oulu, Finland. 85Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland. 86Department of Surgery, Oulu University Hospital, University of Oulu, Oulu, Finland. 87Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany.
88University of Tübingen, Tübingen, Germany. 89Molecular Genetics of Breast Cancer, DKFZ, Heidelberg, Germany. 90Institute for Prevention and Occupational Medicine of the German Social Accident Insurance, Institute of the Ruhr-Universität Bochum (IPA), Bochum, Germany. 91Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione IRCCS Istituto Nazionale Tumori (INT), Milan, Italy. 92Istituto FIRC di Oncologia Molecolare (IFOM), Fondazione Istituto FIRC di Oncologia Molecolare, Milan, Italy. 93Unit of Medical Genetics, Department of Preventive and Predictive Medicine, Fondazione IRCCS INT, Milan, Italy. 94Division of Cancer Prevention and Genetics, Istituto Europeo di Oncologia, Milan, Italy. 95Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands. 96Department of Pathology, Leiden University Medical Center, Leiden, The Netherlands. 97Department of Surgical Oncology, Leiden University Medical Center, Leiden, The Netherlands. 98Family Cancer Clinic, Department of Medical Oncology, Erasmus Medical Center–Daniel den Hoed Cancer Center, Rotterdam, The Netherlands. 99Department of Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands. 100Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland. 101Postgraduate School of Molecular Medicine, Warsaw Medical University, Warsaw, Poland. 102School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, University of Eastern Finland, Kuopio, Finland. 103Biocenter Kuopio, Cancer Center of Eastern Finland, University of Eastern Finland, Kuopio, Finland. 104Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland. 105School of Medicine, Institute of Clinical Medicine and Oncology, University of Eastern Finland, Kuopio, Finland. 106Cancer Center, Kuopio University Hospital, Kuopio, Finland.
107Department of Obstetrics and Gynaecology, Hannover Medical School, Hannover, Germany. 108Department of Radiation Oncology, Hannover Medical School, Hannover, Germany. 109NN Alexandrov Research Institute of Oncology and Medical Radiology, Minsk, Belarus. 110Institute for Clinical Epidemiology and Molecular Biology (EpiGen), Faculty of Medicine, University of Oslo, Oslo, Norway. 111Group of Cancer Genome Variation, Department of Genetics, Institute for Cancer Research, Rikshospitalet-Radiumhospitalet, Oslo, Norway. 112Department of Epidemiology, University of California–Irvine, Irvine, California, USA. 113Department of Molecular Virology, Immunology and Medical Genetics, Comprehensive Cancer Center, The Ohio State University, Columbus, Ohio, USA. 114Roswell Park Cancer Institute, Buffalo, New York, USA. 115Molecular Diagnostics Laboratory, Institute of Radioisotopes and Radiodiagnostic Products (IRRP), National Centre for Scientific Research Demokritos, Aghia Paraskevi Attikis, Athens, Greece. 116Seoul National University College of Medicine, Seoul, Korea. 117Division of Epidemiology and Prevention, Aichi Cancer Center Research Institute, Nagoya, Japan. 118Department of Breast Oncology, Aichi Cancer Center Hospital, Nagoya, Japan. 119Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 120Shanghai Center for Disease Control and Prevention, Shanghai, China. 121Department of Epidemiology, Shanghai Cancer Institute, Shanghai, China. 122Cancer Research Initiatives Foundation, Sime Darby Medical Centre, Subang Jaya, Malaysia. 123Breast Cancer Research Unit, University Malaya Cancer Research Institute, University Malaya Medical Centre, Kuala Lumpur, Malaysia. 124Singapore Eye Research Institute, National University of Singapore, Singapore. 125Saw Swee Hock School of Public Health, National University of Singapore, Singapore.
126Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. 127Warwick Medical School, University of Warwick, Coventry, UK. 128Ministry of Public Health, Bangkok, Thailand. 129Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan. 130College of Public Health, China Medical University, Taichung, Taiwan. 131Taiwan Biobank, Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan. 132Department of Nursing, Kang-Ning Junior College of Medical Care and Management, Taipei, Taiwan. 133National Cancer Institute, Bangkok, Thailand. 134International Agency for Research on Cancer, Lyon, France. 135International Epidemiology Institute, Rockville, Maryland, USA. 136Department of Genetics, Queensland Institute of Medical Research, Brisbane, Queensland, Australia. 137Centro de Investigación en Red de Enfermedades Raras (CIBERER), Madrid, Spain. 138These authors contributed equally to this work. Correspondence should be addressed to D.F.E. (dfe20@medschl.cam.ac.uk) or P.H. (per.hall@ki.se).

ONLINE METHODS

GWAS analysis. Primary genotype data were obtained for nine breast cancer GWAS in populations of European ancestry (Supplementary Table 1). Standard quality control was performed on all scans as follows. We excluded all individuals with low call rate (<95%) and extremely high or low heterozygosity (P < 1 × 10^-5), as well as all individuals evaluated to be of non-European ancestry (>15% non-European component, as determined by multidimensional scaling using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference). We excluded SNPs with MAF < 1%; call rate < 95%; or call rate < 99% and MAF < 5%, as well as all SNPs with genotype frequencies that departed from Hardy-Weinberg equilibrium at P < 1 × 10^-6 in controls or P < 1 × 10^-12 in cases. For highly significant SNPs, genotype intensity cluster plots were examined manually to judge reliability, either centrally or by contacting the original investigators. Data were imputed for all scans for ~2.6 million SNPs with the HapMap version 2 CEU panel (Utah residents of Northern and Western European ancestry) as a reference, using the program MaCH v1.0. Imputation was conducted separately for each scan. Estimated per-allele ORs and standard errors were generated from the imputed genotypes using ProbABEL63. For two studies (UK2 and HEBCS), estimates were adjusted by the first three principal components, as this was found to materially reduce the inflation of test statistics. Residual inflation was then adjusted for by multiplying the variance by a genomic control adjustment factor, based on the ratio of the median χ² test statistic to its expected value64. BBCS and UK2 used the same control data (WTCCC2) but different genotyping platforms. Data were imputed separately for these studies. For the combined analysis, the control set was divided randomly between the two studies, in proportion to the size of the case series, to provide disjoint strata.
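As a rough illustration of the genomic control step described above, the sketch below computes the adjustment factor as the ratio of the median 1-degree-of-freedom χ² statistic to its expected value (about 0.455) and inflates the standard errors accordingly. The function names are hypothetical, and leaving the estimates untouched when the factor falls below 1 is an assumption made here; the text does not say how that case was handled.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(betas, ses):
    """Ratio of the median observed 1-df chi-square statistic to its expected value."""
    chi2 = [(b / s) ** 2 for b, s in zip(betas, ses)]
    return median(chi2) / CHI2_1DF_MEDIAN


def adjust_standard_errors(ses, lam):
    """Multiply each variance by lambda, i.e. each standard error by sqrt(lambda).

    Assumption: a factor below 1 is treated as 1, so estimates are never sharpened.
    """
    factor = max(lam, 1.0) ** 0.5
    return [s * factor for s in ses]


# Toy per-SNP log(OR) estimates and standard errors for one study (made-up numbers).
betas = [0.1, 0.2, 0.3]
ses = [0.1, 0.1, 0.1]
lam = genomic_control_lambda(betas, ses)
adjusted = adjust_standard_errors(ses, lam)
```

In practice the factor would be estimated from the full set of genome-wide statistics for a study, not from a handful of SNPs as in this toy call.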
Overall significance tests for each SNP were performed using a fixed-effects meta-analysis; data were only included for a given study if the imputation accuracy r² was >0.3.

SNP selection. Details of SNP selection for the iCOGS array are given in the Supplementary Note. For the purpose of the BCAC analyses, we included SNPs on the basis of the analysis of the nine GWAS described above. We ranked SNPs on the basis of the results from five analyses: an overall 1-degree-of-freedom trend test; a 1-degree-of-freedom trend test giving a weight of 2 to those studies selecting cases for a positive family history (UK2, BBCS, DFBBCS and GC-HBOC); a 2-degrees-of-freedom genotype test; and 1-degree-of-freedom tests based on cases diagnosed before the ages of 40 years or 50 years compared with all controls. We also defined lists based on 1-degree-of-freedom trend tests restricted to data from each of the nine component studies. SNPs were also selected from analyses of cases with ER-negative disease, but these are not reported here.

iCOGS genotyping. Samples for the iCOGS stage were drawn from 52 studies participating in BCAC, including 41 from populations of predominantly European ancestry, 9 of Asian ancestry and 2 of African-American ancestry. The majority of studies were population-based or hospital-based case-control studies, but some studies selected samples by age or oversampled for cases with a family history of breast cancer (Supplementary Table 2). Studies were required to provide ~2% of samples in duplicate. Genotyping was conducted using a custom Illumina Infinium array (iCOGS) in seven centers, of which four were used for BCAC. Genotypes were called using Illumina's proprietary GenCall algorithm. Initial calling used a cluster file generated from 270 samples from HapMap 2.
To generate the final calls, we first selected a subset of 3,018 individuals, including samples from each of the genotyping centers, each of the participating consortia and each major ancestry group. Only plates with a consistently high call rate in the initial calling were used. We also included 380 samples of European, Asian or African ancestry genotyped as part of the HapMap Project and 1000 Genomes Project and 160 samples that were known positive controls for rare variants on the array. This subset was used to generate a cluster file that was then applied to call the genotypes for the remaining samples. We also investigated two other calling algorithms: Illumnus65 and GenoSNP66. All three algorithms were >99% concordant in their calling for 91% of the SNPs on the array. However, manual inspection of a sample of the SNPs with discrepancies indicated that the calls from GenCall were almost invariably superior (generally, because Illumnus or GenoSNP attempted to call SNPs that clustered poorly). Therefore, only the genotypes called by GenCall have been used in the analyses reported here.

Quality control. We excluded individuals for any of the following reasons: genotypically not female XX (XY, XXY or XO); overall call rate < 95%; low or high heterozygosity (P < 1 × 10^-6, determined separately for individuals of European, East Asian and African-American ancestry); genotypes discordant with those determined in previous BCAC genotyping such that the individual appeared to be different; genotypes for the duplicate sample that seemed to be from a different individual; and cryptic duplicates where the phenotypic data indicated that the individuals were different. We searched for cryptic duplicates, both within each study and between studies from the same country. For known and cryptic concordant duplicates, the sample with the lower call rate was excluded. We attempted to identify first-degree relative pairs using identity-by-state estimates based on ~37,000 uncorrelated SNPs. For apparent first-degree relative pairs, we removed the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. For the main analyses presented here, we also excluded 1,880 individuals who were included in any of the GWAS to allow the GWAS and iCOGS stages to be combined. Ancestry outliers were identified by multidimensional scaling, combining the iCOGS data with genotypes from the HapMap 2 populations, on the basis of a subset of 37,000 uncorrelated markers that passed quality control (including ~1,000 that were selected as ancestry-informative markers). Most studies were predominantly of a single ancestry (European or East Asian), and individuals with >15% minority ancestry, as determined on the basis of the first two principal components, were excluded.
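The exclusion rule for apparent first-degree relative pairs stated above is simple enough to write down directly. The sketch below is only an illustration: the `Sample` record and its field names are hypothetical, and the tie-break when both call rates are equal (drop the second sample) is an arbitrary assumption, since the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str    # hypothetical identifier
    is_case: bool     # True for a case, False for a control
    call_rate: float  # fraction of SNPs successfully called


def sample_to_exclude(a: Sample, b: Sample) -> Sample:
    """Pick which member of an apparent first-degree relative pair to drop."""
    if a.is_case != b.is_case:
        # Case-control pair: remove the control.
        return b if a.is_case else a
    # Same status: remove the individual with the lower call rate
    # (assumption: on an exact tie, the second sample is dropped).
    return a if a.call_rate < b.call_rate else b


case = Sample("S1", True, 0.96)
control = Sample("S2", False, 0.99)
dropped = sample_to_exclude(case, control)
```

Note that the control is dropped even when its call rate is higher, which preserves cases at the cost of slightly noisier genotypes.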
Two studies from Singapore (SGBCC) and Malaysia (MYBRCA) contained a substantial fraction of individuals of mixed European and Asian ancestry (likely of South Asian ancestry). For these studies, no exclusions for ancestry outliers were made, but principal-components analysis adequately corrected for inflation in these studies. Similarly, for the two African-American studies (NBHS and SCCS), no exclusions for ancestry outliers were made. Principal-components analyses were carried out separately for the European, Asian and African-American subgroups, on the basis of a subset of 37,000 uncorrelated SNPs. For the analyses of European subjects, we included the first six principal components as covariates, together with a seventh component derived specifically for one study (LMBC) for which there was substantial inflation not accounted for by the components derived from the analysis of all studies (this component was set to zero for all other studies). The addition of further principal components did not reduce inflation further. We included two principal components each for the studies in Asian and African-American populations. We excluded SNPs with call rates of <95%. We also excluded SNPs that deviated from Hardy-Weinberg equilibrium in controls at P < 1 × 10^-7, on the basis of a stratified 1-degree-of-freedom test in which the deviations were summed across strata67. We also excluded SNPs for which the genotypes were discrepant in more than 2% of duplicate samples across all COGS consortia. The final analyses were based on data from 199,961 SNPs. Genotype intensity cluster plots were examined manually for SNPs in each new region in which a genome-wide significant association was obtained, and SNPs were eliminated if the clustering was judged to be poor.

Statistical analysis. For each SNP, we estimated a per-allele log(OR) and standard error by logistic regression, including study and principal components as covariates. Genotype-specific ORs were also computed.
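Per-allele log(OR) estimates and standard errors of this kind are what a fixed-effects meta-analysis pools. A minimal sketch, assuming the usual inverse-variance weighting (the text names the method but not its formula), is given below; the function name and the input numbers are illustrative only.

```python
import math


def fixed_effects_meta(estimates):
    """Pool (log_or, se) pairs by inverse-variance weighting.

    Returns the pooled log(OR), its standard error, the z score and the
    two-sided P value (equivalent to a 1-degree-of-freedom chi-square test on z**2).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p


# Two hypothetical stages with equal precision.
pooled, pooled_se, z, p = fixed_effects_meta([(0.10, 0.05), (0.06, 0.05)])
odds_ratio = math.exp(pooled)
```

With equal standard errors the pooled estimate is the simple average of the two log(OR) values, and the pooled standard error shrinks by a factor of √2.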
Overall significance levels were obtained by combining the estimates from the combined GWAS and iCOGS using a fixed-effects meta-analysis to derive a 1-degree-of-freedom test. Inflation of the test statistics (λ) was estimated by dividing the 45th percentile of the test statistic by 0.357 (the 45th percentile for a χ² distribution on 1 degree of freedom). For this purpose, we used a subset of 22,897 SNPs that were uncorrelated (r² < 0.1), which were not selected by BCAC and were not within 1 of the 4 common fine-mapping regions. This subset was used to minimize the selection of SNPs associated with disease, on the assumption that such SNPs are likely to be representative of common


doi:10.1038/ng.2563

SNPs in terms of population structure. The inflation statistic was converted to an equivalent inflation statistic for a study with 1,000 cases and 1,000 controls (λ1,000) by adjusting by effective study size, namely
$$\lambda_{1,000} = 1 + 500(\lambda - 1)\left[\sum_{k}\left(\frac{1}{n_k} + \frac{1}{m_k}\right)^{-1}\right]^{-1}$$

where n_k and m_k are the numbers of cases and controls, respectively, for study k. Heterogeneity in the per-allele OR by ER status, age at diagnosis, family history and tumor invasiveness (DCIS versus invasive) was evaluated using a case-only analysis.

Expression analysis. Gene expression, copy number and genotype data were retrieved from the TCGA breast cancer study. Gene expression profiles were measured by TCGA using a custom Agilent 244K expression array. We downloaded the raw expression data and performed preprocessing using the limma R package. Copy number and germline genotype were both measured using the Affymetrix Genome-Wide Human SNP 6.0 array. We used the segmented copy number and called genotype data as provided by TCGA. Intersecting the different genomic data types, we collected 458 primary tumor samples with germline genotypes from blood and both gene expression and somatic copy number data from the tumor. In addition, for 61 samples, we had germline genotype and gene expression data from normal breast tissue from individuals in the TCGA breast cancer study. Expression quantitative trait locus (eQTL) analysis was performed on both sets separately. For cis-eQTL analysis, we considered all genes 50 kb upstream or downstream of the lead SNP. Fourteen of the risk-associated SNPs are represented directly on the Affymetrix SNP array. For an additional 23, we were able to select proxies on the basis of maximum LD with minimum r² of 0.5. In case of equal LD, we used proximity on the genome to break the tie. LD estimates were extracted from the HapMap data for the CEU population. eQTL analysis was performed by regressing the gene expression of selected candidate genes on the genotype followed by a significance test of the t statistic for the genotype covariate. For both the normal and tumor analyses, the linear regression was adjusted for potential batch effects by including indicator variables for the plate identifier component of the TCGA sample barcode. In addition, the first principal component of the complete gene expression matrix was added as a covariate to adjust for other global, typically non-genetic contributions to the gene expression signal. To prevent spurious associations due to confounding by nearby eQTLs, we corrected the model for the most strongly associated eQTL SNP in the region. For the tumor analysis only, we also added the copy number of the candidate gene as a covariate because apparent associations between germline genotype and tumor expression may be confounded or obscured by somatic copy number alterations. To assess the potential effects of the new SNPs on nearby gene expression in lymphocytes, we identified all genes that lie within a 500-kb window surrounding each of the SNPs and used Genevar (Gene Expression Variation), a public database with gene expression data quantified in lymphocytes from individuals in the HapMap 2 populations35,68.

Estimation of the number of associated loci. To estimate the total number of newly associated loci selected for the iCOGS array, we first used the set of 29,807 SNPs selected from the GWAS and not selected for fine mapping, to exclude previously known loci. We then defined a set of 10,668 SNPs that were uncorrelated (r² < 0.1 between any pair) and determined the number of loci for which the estimated effect size in the iCOGS stage was in the same direction as in the combined GWAS and the number of loci for which the effect was in the opposite direction. Similar results were obtained using cutoffs of r² < 0.05 and r² < 0.2. On the assumption that none of the loci with effects in opposite directions in the two stages were associated with disease, the number of loci associated with disease can be estimated as the difference between the number of loci with effects in the same direction and the number with effects in opposite directions. This, however, is an underestimate because loci with weak effects may have estimated effects in opposite directions in the two stages. To allow for this possibility, we fitted the distribution of z scores as a mixture of a standard normal distribution (representing SNPs with no effect) and a normal distribution with unknown mean and variance, using an expectation-maximization algorithm58. The total contribution to heritability was then computed from the posterior estimates. To allow for the potential effect of residual population stratification, we conducted an additional analysis in which the null distribution was assumed to have variance of 1.2, based on the estimated inflation from the non-BCAC SNPs, but the estimates were essentially identical.

63. Aulchenko, Y.S., Struchalin, M.V. & van Duijn, C.M. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11, 134 (2010).
64. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
65. Teo, Y.Y. et al. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23, 2741–2746 (2007).
66. Giannoulatou, E., Yau, C., Colella, S., Ragoussis, J. & Holmes, C.C. GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics 24, 2209–2214 (2008).
67. Haldane, J.B.S. An exact test for randomness of mating. J. Genet. 52, 631–635 (1954).
68. Stranger, B.E. et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genet. 8, e1002639 (2012).
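The inflation estimate described under Statistical analysis (the 45th percentile of the test statistics divided by 0.357) can be checked with a short sketch. The function names are hypothetical, the percentile uses simple linear interpolation (an assumption; the interpolation rule used in the study is not stated), and the input statistics are synthetic.

```python
from statistics import NormalDist


def chi2_1df_quantile(q):
    """Quantile of a chi-square distribution on 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-square(1), so the q-th quantile is
    the square of the normal quantile at (1 + q) / 2.
    """
    return NormalDist().inv_cdf((1.0 + q) / 2.0) ** 2


def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def inflation(chi2_stats, q=0.45):
    """Observed q-th percentile divided by its expectation under the null."""
    return percentile(chi2_stats, q) / chi2_1df_quantile(q)


# The divisor quoted in the text: the 45th percentile of chi-square(1).
print(round(chi2_1df_quantile(0.45), 3))

# Synthetic statistics: exact null quantiles scaled by 1.1, so the
# estimate should come out close to 1.1.
n = 2000
null_stats = [chi2_1df_quantile((i + 0.5) / n) for i in range(n)]
print(round(inflation([1.1 * s for s in null_stats]), 3))
```

The sketch stops at λ itself; rescaling to λ1,000 additionally needs the case and control counts of every study.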


Articles

GWAS meta-analysis and replication identifies three new susceptibility loci for ovarian cancer
Genome-wide association studies (GWAS) have identified four susceptibility loci for epithelial ovarian cancer (EOC), with another two suggestive loci reaching near genome-wide significance. We pooled data from a GWAS conducted in North America with another GWAS from the UK. We selected the top 24,551 SNPs for inclusion on the iCOGS custom genotyping array. We performed follow-up genotyping in 18,174 individuals with EOC (cases) and 26,134 controls from 43 studies from the Ovarian Cancer Association Consortium. We validated the two loci at 3q25 and 17q21 that were previously found to have associations close to genome-wide significance and identified three loci newly associated with risk: two loci associated with all EOC subtypes at 8q21 (rs11782652, P = 5.5 × 10⁻⁹) and 10p12 (rs1243180, P = 1.8 × 10⁻⁸) and another locus specific to the serous subtype at 17q12 (rs757210, P = 8.1 × 10⁻¹⁰). An integrated molecular analysis of genes and regulatory regions at these loci provided evidence for functional mechanisms underlying susceptibility and implicated CHMP4C in the pathogenesis of ovarian cancer.

Evidence from twin and family studies suggests an inherited genetic component to EOC risk1,2. Rare, high-penetrance alleles of genes such as BRCA1 and BRCA2 account for about 40% of excess familial risk3, and GWAS have recently identified common risk alleles at 9p22, 8q24, 2q31 and 19p13 (refs. 4–6), with two additional loci at 3q25 and 17q21 that approached genome-wide significance6. However, these alleles explain only 4% of excess familial risk, and more risk loci probably exist. We therefore pooled the data from two GWAS to inform the selection of SNPs for a large-scale replication. The North American study comprised 4 independent case-control studies that included 1,952 cases and 2,052 controls. The second study was a 2-phase multicenter GWAS that included 1,817 cases and 2,354 controls in the first phase and 4,162 cases and 4,810 controls in the second phase.
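The pooling of the two GWAS rests on a fixed-effects meta-analysis. A minimal sketch of inverse-variance fixed-effects pooling for one SNP follows; it assumes each study contributes a log(OR) and a standard error, and both the function name and the input values are hypothetical.

```python
import math


def fixed_effects_meta(estimates):
    """Inverse-variance fixed-effects pooling of per-study estimates.

    estimates: list of (log_or, se) pairs, one per study.
    Returns (pooled_log_or, pooled_se, chi2, p), where chi2 is the
    1-degree-of-freedom Wald statistic and p its two-sided P value.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    chi2 = (pooled / pooled_se) ** 2
    # For chi-square on 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return pooled, pooled_se, chi2, p


# Hypothetical per-study estimates for a single SNP.
studies = [(0.12, 0.05), (0.08, 0.04)]
pooled, se, chi2, p = fixed_effects_meta(studies)
print(f"pooled OR = {math.exp(pooled):.3f}, chi2 = {chi2:.2f}, P = {p:.2g}")
```

Because each study is weighted by the inverse of its variance, larger studies dominate the pooled estimate, and the pooled standard error is smaller than that of any single study.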
We carried out a fixed-effects meta-analysis from the two GWAS for ~2.5 million genotyped or imputed SNPs. We selected 24,551 SNPs associated with the risk of either all-histology (11,647 SNPs) or serous (12,904 SNPs) ovarian cancer on the basis of ranked P values. We designed assays for 23,239 SNPs and included them on a custom Illumina Infinium iSelect array (iCOGS) comprising 211,155 SNPs designed by the Collaborative Oncological Gene-environment Study (COGS) to evaluate genetic variants for association with risk of breast, ovarian and prostate cancers. We then genotyped these SNPs in cases and controls from 43 individual studies from the Ovarian Cancer Association Consortium (OCAC) that were grouped into 34 case-control strata (Table 1 and Supplementary Tables 1 and 2). These included most of the samples genotyped in the initial GWAS.

RESULTS
Association analyses
After applying quality control filters (Online Methods), we tested 22,252 SNPs for association with risk of all subtypes of invasive EOC and serous invasive EOC in 18,174 cases (including 10,316 with the serous subtype)
A full list of authors and affiliations appears at the end of the paper. Received 11 May 2012; accepted 30 January 2013; published online 27 March 2013; doi:10.1038/ng.2564


and 26,134 controls. Primary analyses were based on data from the subjects of European ancestry (16,283 cases and 23,491 controls). We confirmed associations of the four SNPs at 2q31, 8q24, 9p22 and 19p13 that were previously reported at genome-wide significance (Supplementary Table 3). We also confirmed SNPs at the two other loci previously reported as being near genome-wide significance (at 3q25 and 17q21)6. The previously reported associated SNP at 3q25 (rs2665390)6 failed assay design, but a correlated SNP, rs7651446 (r² = 0.61), was highly significantly associated with invasive EOC (effect allele frequency = 0.050, per-allele odds ratio (OR) = 1.44, 95% confidence interval (CI) = 1.35–1.53, P = 1.5 × 10⁻²⁸), as was rs9303542 at 17q21 (effect allele frequency = 0.27, OR = 1.12, 95% CI = 1.08–1.16, P = 6.0 × 10⁻¹¹). We generated Manhattan plots for all subtypes of invasive EOC and serous invasive EOC after excluding 176 SNPs from the 6 known loci (Fig. 1). We identified three new loci associated at genome-wide significance (P < 5 × 10⁻⁸), two of which were significant for all subtypes of invasive EOC (8q21 and 10p12) and another that was significant for invasive serous EOC only (17q12) (Table 2). Genotype clusters for the top hits at 8q21 (rs11782652) and 17q12 (rs757210) were distinct, but clusters for the top hit at 10p12 (rs7084454) overlapped (Supplementary Fig. 1). Clusters for a second, highly correlated SNP (r² = 0.86) at this locus (rs1243180) were distinct, and so the results for this SNP are presented instead. The most significant association for all subtypes of invasive EOC was rs11782652 at 8q21 (OR = 1.19, 95% CI = 1.12–1.26, P = 5.5 × 10⁻⁹). We selected this SNP for replication because it was associated with all subtypes of invasive EOC in the combined GWAS data (OR = 1.20, 95% CI = 1.07–1.36, P = 0.0025). This SNP is not correlated with any other SNP in HapMap (Supplementary Fig. 2) and was the only SNP in the region selected for genotyping in COGS.
Effects varied by histological subtype (P = 0.0002), with the strongest effect seen in the serous subtype (Table 2). There was little evidence for heterogeneity in the association by ancestry (P = 0.55) or between the 31 European studies included


VOLUME 45 | NUMBER 4 | APRIL 2013 Nature Genetics

Table 1 Summary of samples and SNPs genotyped
Phase        Study population  Studies (n)  Cases (n)  Controls (n)  Illumina genotyping platform  SNPs genotyped  SNPs passed QC  SNPs imputed
Pooled GWAS  North American    5            1,952      2,042         610K/317K                     620,901         599,179         1,909,565
Pooled GWAS  UK phase 1        6            1,817      2,354         610K/550K                     620,901         507,094         2,001,650
Pooled GWAS  UK phase 2        10           4,162      4,810         Custom iSelect                23,590          21,955          -
COGSa        OCAC              43           18,549     26,134        Custom iSelect                23,239          22,252          -
aIncludes 2,482 samples in the North American GWAS, 1,641 samples in phase 1 of the UK GWAS and 8,463 samples in phase 2 of the UK GWAS. QC, quality control.

(P = 0.13; Supplementary Fig. 3). We selected rs1243180 at 10p12 for replication because it was associated with all subtypes of invasive EOC in the combined GWAS data (OR = 1.11, 95% CI = 1.04–1.19, P = 0.0027) and was also associated with risk of all subtypes of invasive EOC in the replication data (OR = 1.10, 95% CI = 1.06–1.13, P = 1.8 × 10⁻⁸). There is strong linkage disequilibrium in this region, and we selected 32 other SNPs in the region for replication (Supplementary Fig. 4). There was some heterogeneity of effects by tumor subtype (P = 0.0007) but not by study (P = 0.65; Supplementary Fig. 5) or population (P = 0.12). At 17q12, we selected rs757210 for replication because it was associated with serous EOC in the combined GWAS data (OR = 1.13, 95% CI = 1.04–1.23, P = 0.0026) and was most strongly associated with the serous subtype in the replication data (OR = 1.12, 95% CI = 1.08–1.17, P = 9.6 × 10⁻¹⁰). We selected eight other SNPs in the region for replication in COGS (Supplementary Fig. 6). The association with all invasive subtypes of EOC was much weaker in this analysis (OR = 1.05, 95% CI = 1.02–1.09, P = 9 × 10⁻⁴); there was substantial heterogeneity by tumor subtype (P ≤ 0.0001), with the risk allele for serous EOC being associated with a reduced risk of both clear-cell and mucinous EOC (Table 2). Data from a set of fine-mapping SNPs genotyped in this region suggest that this apparent paradox is caused by the presence of two independent loci for serous and clear-cell cancer and that the top hit at each of these loci is correlated with rs757210 (ref. 7). There was also heterogeneity by ancestry for the serous subtype (P = 0.034), with the risk allele being associated with lower disease risk in subjects of mixed ancestry and the presence of some between-study heterogeneity (P = 0.038; Supplementary Fig. 7).
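The ORs, confidence intervals and P values quoted above are linked through the Wald statistic, which gives a quick consistency check on reported results. The sketch below assumes a normal approximation on the log(OR) scale and a symmetric 95% interval; the function name is hypothetical. Because published ORs and limits are rounded to two decimals, the recovered P value only approximates the reported one.

```python
import math


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Recover SE, z and two-sided P from an OR and its 95% CI.

    Assumes the interval is log(OR) +/- z_crit * SE, so the SE is the
    width of the interval on the log scale divided by 2 * z_crit.
    """
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return se, z, p


# rs1243180 in the replication data, as quoted in the text.
se, z, p = wald_from_ci(1.10, 1.06, 1.13)
print(f"SE = {se:.4f}, z = {z:.2f}, P = {p:.1e}")
```

A recovered P value that differs from the published one by orders of magnitude, rather than by a small factor, would point to a transcription error in one of the three numbers.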
Functional and molecular analyses
The most significant risk-associated SNPs for the three new EOC susceptibility loci are located in noncoding DNA sequences, but these may be only markers for the true causal variants, which could be functional coding variants or variants in noncoding DNA elements or noncoding RNAs and might influence the expression of nearby target genes (cis-regulatory effects). They may also act on genes through more distal regulation (trans-regulatory effects)8–12. To identify the possible functional SNP and target gene for each locus, we evaluated the putative functional role in EOC for all genes in a 1-Mb region centered on the most significant risk-associated SNP. We used a combination of locus-specific and genome-wide assays to characterize the transcribed genes (Online Methods and Supplementary Fig. 8) and regulatory elements (Supplementary Fig. 9) within susceptibility regions to evaluate putative functional mechanisms and identify candidate EOC susceptibility gene(s) at each locus. At the 8q21 locus, the strongest associated SNP, rs11782652, is located in the first intron of CHMP4C. We imputed genotypes in the region to 1000 Genomes Project data and tested all variants with minor allele frequency (MAF) > 0.02 for association. We compared the log likelihoods of the regression models and considered eight SNPs with a log likelihood within 6.91 of the most strongly associated SNP (equivalent to odds of 1,000:1) as the possible candidates for the causal variant. Six of these SNPs lie in introns of CHMP4C, but in silico

analysis provided functional evidence for only one (rs74544416), which contains a putative SOX9 binding site. One SNP is an indel (4 nt) at the exon-intron border (rs137960856, alleles -/GTGA), but it is unlikely to have a functional impact because the next four nucleotides are also GTGA. Thus, even in the deleted allele, the correct exonic sequences are retained, and this SNP is not expected to affect splicing. The eighth SNP, rs35094336, is predicted to result in a coding change from alanine to threonine that may be functionally relevant (PolyPhen-2 score of 0.997). This residue is located in a C-terminal amphipathic helix that is conserved in all CHMP4 proteins and is important for binding to ALIX, a protein that is involved in the endosomal sorting complex required for transport13. Further studies will be necessary to determine whether this change is of functional relevance and has an impact on ovarian cancer biology. Encyclopedia of DNA Elements (ENCODE) data from tissues not associated with ovarian cancer, formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq) data and mapping of enhancer elements generated in normal serous ovarian cancer precursor

[Figure 1 image not reproduced. Panels a and b: x axis, chromosome (1–22, X, MT); y axis, −log10 P.]

Figure 1 The associations of SNP genotypes with risk of ovarian cancer. (a,b) Manhattan plots showing association between the genotypes of 22,076 SNPs and risk of all subtypes of invasive EOC (a) and serous subtype invasive epithelial ovarian cancer (b).

Table 2 ORs and tests of association by histological subtype and population for the most strongly associated SNPs at 8q21, 10p12 and 17q12
SNP         Locus  Ref/alt allele  Subtype       Ancestry  RAF   OR    95% CI     P
rs11782652  8q21   A/G             All invasive  European  0.07  1.19  1.12–1.26  5.5 × 10⁻⁹
rs11782652  8q21   A/G             Serous        European  0.07  1.24  1.16–1.33  7.0 × 10⁻¹⁰
rs11782652  8q21   A/G             Endometrioid  European  0.07  1.04  0.92–1.19  0.50
rs11782652  8q21   A/G             Clear cell    European  0.07  1.12  0.95–1.33  0.18
rs11782652  8q21   A/G             Mucinous      European  0.07  0.95  0.78–1.15  0.58
rs11782652  8q21   A/G             Serous        Asian     0.00  -     -          -
rs11782652  8q21   A/G             Serous        African   0.07  1.73  0.92–3.28  0.91
rs11782652  8q21   A/G             Serous        Mixed     0.06  1.15  0.81–1.64  0.43
rs1243180   10p12  T/A             All invasive  European  0.31  1.10  1.06–1.13  1.8 × 10⁻⁸
rs1243180   10p12  T/A             Serous        European  0.31  1.11  1.07–1.15  1.4 × 10⁻⁷
rs1243180   10p12  T/A             Endometrioid  European  0.31  1.08  1.00–1.15  0.038
rs1243180   10p12  T/A             Clear cell    European  0.31  1.09  0.99–1.19  0.091
rs1243180   10p12  T/A             Mucinous      European  0.31  0.97  0.87–1.07  0.50
rs1243180   10p12  T/A             Serous        Asian     0.03  1.41  0.82–2.43  0.21
rs1243180   10p12  T/A             Serous        African   0.06  1.11  0.50–2.48  0.80
rs1243180   10p12  T/A             Serous        Mixed     0.25  0.87  0.72–1.07  0.18
rs757210    17q12  G/A             All invasive  European  0.37  1.05  1.02–1.09  0.00090
rs757210    17q12  G/A             Serous        European  0.37  1.12  1.08–1.17  8.1 × 10⁻¹⁰
rs757210    17q12  G/A             Endometrioid  European  0.37  0.98  0.91–1.04  0.47
rs757210    17q12  G/A             Clear cell    European  0.37  0.80  0.72–0.88  3.9 × 10⁻⁶
rs757210    17q12  G/A             Mucinous      European  0.37  0.89  0.81–0.99  0.027
rs757210    17q12  G/A             Serous        Asian     0.29  1.10  0.87–1.37  0.43
rs757210    17q12  G/A             Serous        African   0.53  1.08  0.75–1.54  0.69
rs757210    17q12  G/A             Serous        Mixed     0.37  0.86  0.73–1.02  0.093
RAF, risk allele frequency; ref/alt allele, reference and alternative allele.

cells suggested that there are two regulatory regions that may be influenced by risk-associated SNPs: one at the CHMP4C promoter and the other in intron 1 of CHMP4C (Fig. 2). We found no evidence of a correlation between rs11782652 genotype and gene expression in normal ovarian or fallopian tube epithelial cells for any of the nine genes in the region (FABP5, PMP2, FABP4, FABP12, IMPA1, SLC10A5, ZFAND1, CHMP4C and SNX16), but there was a highly statistically significant association between rs11782652 and CHMP4C expression in primary EOC tissues (P = 3.9 × 10⁻¹⁴) and transformed lymphocytes (P = 0.012). We also found evidence of association for rs11782652 with methylation status (methylation quantitative trait locus (mQTL)) for three genes in tumor tissue (Supplementary Table 4): ZFAND1 (P = 0.003), CHMP4C (P = 0.001) and SNX16 (P = 0.001). However, gene methylation was only correlated with gene expression for CHMP4C. Three genes in the region, FABP5, CHMP4C and SNX16, were significantly overexpressed in both EOC cell lines compared to normal tissues (P = 0.002, P = 4.8 × 10⁻⁹ and P = 5.9 × 10⁻⁴, respectively; Supplementary Table 5) and, where data were available, in primary EOC tissues. The Catalogue of Somatic Mutations in Cancer (COSMIC) database showed that four genes in the region, IMPA1, ZFAND1, CHMP4C and SNX16, have functionally relevant mutations in cancer, with the last three genes found to be mutated in ovarian carcinoma (Supplementary Fig. 10). These four genes formed a highly connected coexpression network across different experimental conditions (Supplementary Fig. 11). Taken together, these data suggest that several genes at the 8q21 locus may have a role in the somatic development of EOC; however, the cumulative evidence indicates that CHMP4C (encoding chromatin-modifying protein 4C) is the probable candidate susceptibility gene. This is supported by previously published data on the function of CHMP4C. CHMP4C is involved in the final steps of cell division, coordinating midbody resolution with the abscission checkpoint14, and its transcription is regulated by TP53 to enhance exosome production15. A more recent study has shown that CHMP4C is frequently overexpressed in ovarian tumor tissues, with the suggestion that it may be a diagnostic tumor marker and therapeutic target for patients with the disease16.

At the 10p12 locus, six known genes (NEBL, C10orf113, C10orf114, SKIDA1 (also known as C10orf140), MLLT10 and DNAJC1) span the 1-Mb region around rs1243180, which lies in an intron of MLLT10 (Fig. 3). On the basis of data imputed from the 1000 Genomes Project, 57 SNPs are candidates as functionally relevant variants. This includes variants in the 3′ UTR of C10orf114 and the 5′ UTR of SKIDA1 and a synonymous variant in MLLT10. Forty-six SNPs lie in introns of MLLT10, and the remaining eight are intergenic. In silico analyses found little or no evidence that any of these SNPs, including the most highly risk-associated SNP, rs1243180, are functional. However, after FAIRE-seq analysis of normal serous ovarian cancer precursor cells, we found one of these SNPs, rs10828252 (r² = 0.87 with rs1243180), to coincide with a region of open chromatin, which probably corresponds to the promoter of MLLT10 (Fig. 3). Although rs10828252 is not positioned directly at the apex of the signal and is instead within the upstream portion, it is well established that open chromatin at the transcriptional start sites of genes results from the coordinated influence of numerous transcription factors binding within the vicinity; therefore, it is highly plausible that rs10828247 is modulating one of these transcription factor binding sites.
The resulting shape and position of the FAIRE-seq signal may therefore represent the resulting effects of putative transcription factor binding at rs10828247 working in concert with binding at other sites in close proximity. This finding suggests a possible mechanism for susceptibility to EOC at this locus through subtle variations in the promoter regulation of MLLT10. However, expression quantitative trait locus (eQTL) analysis found no significant association between the genotype of either rs1243180 or rs10828252 and MLLT10 expression in normal tissues. We did observe eQTL associations for two other coding genes in the region, NEBL (P = 0.04) and C10orf114 (P = 0.03). C10orf114 expression was also associated with rs1243180 genotype in primary EOC tissues (P = 0.02), as was SKIDA1 (P = 0.02). Methylation at both SKIDA1 and MLLT10 was associated with rs1243180 genotype in primary EOC tissues (P = 0.03 and 0.05, respectively), and both genes showed a significant negative correlation between methylation and expression (P = 0.0016 and 0.002, respectively). SKIDA1 also showed a significant difference in methylation in tumors compared to normal tissue (P = 1.9 × 10⁻⁵). Four genes (NEBL, C10orf114, SKIDA1 and MLLT10) were significantly overexpressed in EOC cell lines compared to normal tissues (P ≤ 0.01), two of which (C10orf114 and MLLT10) also showed overexpression in primary EOC tissues (Fig. 3). Correlations between gene expression and DNA copy number variation at this locus in primary EOC tissues suggest that overexpression of C10orf114 and MLLT10 is driven by copy number variation. NEBL is the only gene with reported mutations in ovarian cancer (Supplementary Fig. 10). Together, these data suggest that NEBL, C10orf114, SKIDA1 and MLLT10 might all
have a role in ovarian cancer development, and any of these four genes could be the target susceptibility gene at this locus. However, there is no other evidence to implicate C10orf114 or SKIDA1 in EOC or to suggest a functional mechanism that may underlie disease susceptibility. More is known about the functions of NEBL and MLLT10, although neither gene has previously been implicated in ovarian cancer. MLLT10 (mixed-lineage leukemia (trithorax homolog, Drosophila) translocated to 10) encodes a transcription factor and has been identified as a partner gene that is involved in several chromosomal rearrangements that result in leukemia17. More than 60 MLL fusion partner genes have
[Figure 2 image not reproduced. Panels: a, genomic map of 8q21 (82,400,000–82,800,000) showing FABP5, PMP2, FABP4, FABP12, IMPA1, SLC10A5, ZFAND1, CHMP4C and SNX16 around rs11782652; b, relative expression in tumor (T) and normal (N) cell lines; c, relative expression of ZFAND1 and CHMP4C in primary tissues; d, relative expression by rs11782652 genotype (AA versus AG/GG); e, methylation β value by genotype; f, relative luciferase activity of tiling clones across CHMP4C (82,805,000–82,835,000), with ENCODE tracks for H3K4me1, H3K27ac and H3K3me3.]

Figure 2 Summary of the functional analyses at the 8q21 locus. The box of the box-and-whisker plots shows the median and interquartile range, and the whiskers represent the 9th and 91st percentiles. (a) Genomic map of a 1-Mb region at 8q21 centered on the most statistically significant SNP, rs11782652. The locations and sizes of all nine known protein-coding genes (gray) in the region are shown relative to the location of rs11782652 (red dotted line). (b) Expression analysis of all genes at the 8q21 locus performed in EOC cell lines (n = 50) and normal ovarian surface epithelial cells (OSECs) plus fallopian tube secretory epithelial cells (FTSECs) (total n = 73), showing the relative levels of expression for each gene in tumor (T) compared to normal (N) cell lines. *P < 0.05, ***P < 0.001. (c) The ZFAND1 results from the cell line studies did not replicate in The Cancer Genome Atlas (TCGA) or the MD Anderson primary tumor expression data sets. However, increased expression of CHMP4C in primary, high-grade serous ovarian tumors compared to normal tissues was confirmed in the expression data for primary tissues (MD Anderson data set). (d) eQTL analysis of gene expression relative to the germline genotypes for individuals carrying the minor or heterozygous allele (AG/GG) or common alleles (AA) for rs11782652. FABP5 and CHMP4C show positive eQTL associations in lymphoblastoid cell lines, and a highly significant eQTL association with rs11782652 genotype and CHMP4C expression is also seen in primary tumors (TCGA data set). (e) mQTL analysis showing methylation status in 277 high-grade serous ovarian cancers relative to genotypes for rs11782652. (f) Functional enhancer mapping of a 40-kb region tested for the presence of enhancer regions by transfection of 2-kb tiling clones into immortalized OSECs31. Activity of the luciferase reporter is shown as a fold change in luciferase activity relative to the pGL3-BP control. *P < 0.05, **P < 0.01, ***P < 0.001. A new enhancer region in OSECs is indicated by the red dashed box. See Supplementary Figures 8 and 9 for additional molecular analyses of all genes at this locus. H3K4me1, methylation of histone H3 at Lys4; H3K27ac, acetylation of histone H3 at Lys27; H3K3me3, trimethylation of histone H3 at Lys3.
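The eQTL comparisons in panel d contrast expression between carriers (AG/GG) and non-carriers (AA) of the minor allele. One simple way to express such a comparison is a linear regression of expression on carrier status, sketched below. This is an illustration under stated assumptions, not the analysis pipeline used for the figure: it includes no covariates, the function name is hypothetical and the expression values are invented.

```python
import math


def eqtl_slope_t(genotype, expression):
    """Ordinary least-squares regression of expression on a genotype code.

    genotype: numeric code per sample (for example 0 = AA, 1 = AG/GG).
    Returns (slope, t, df): the genotype effect, its t statistic and the
    residual degrees of freedom (n - 2).
    """
    n = len(genotype)
    mean_g = sum(genotype) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotype)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotype, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    rss = sum((e - intercept - slope * g) ** 2 for g, e in zip(genotype, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se, n - 2


# Invented expression values for six non-carriers and four carriers.
carrier = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
expr = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.9, 6.1, 5.7, 6.0]
slope, t, df = eqtl_slope_t(carrier, expr)
print(f"effect of carrying the minor allele = {slope:.2f}, t = {t:.2f}, df = {df}")
```

With a 0/1 carrier code the slope equals the difference in group means, and the t statistic matches a pooled-variance two-sample t test; adding covariates such as batch or copy number requires a multiple regression.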

[Figure 3 image not reproduced. Panels: a, genomic map of the region (21,500,000–22,000,000) showing NEBL, C10orf113, MIR1915, C10orf114, SKIDA1, MLLT10 and DNAJC1 around rs1243180; b, relative expression in tumor (T) and normal (N) cell lines; c,d, relative expression by genotype; e, methylation β value in tumors (T, n = 277) and normal samples (N, n = 7); f, gene expression by increasing DNA copy number; g, FAIRE-seq tracks for OSECs and FTSECs around rs10828247 and rs1243180 in MLLT10 (scale bar, 1 kb).]

Figure 3 Summary of the functional analysis of the 10p12 locus. The box of the box-and-whisker plots shows the median and interquartile range, and the whiskers represent the 9th and 91st percentiles. (a) Genomic map of a 1-Mb region at 10p12 centered on the most statistically significant SNP, rs1243180. The locations and approximate sizes of all six known protein-coding genes (gray) and one microRNA (MIR1915, blue) in the region are shown relative to the location of rs1243180 (red dotted line). (b) Expression analysis of all genes at this locus performed in EOC cell lines (T) and normal (N) OSEC and FTSEC lines showing the relative expression levels for each gene. **P < 0.01, ***P < 0.001. (c) eQTL analysis showing significant associations between genotype at rs1243180 and expression of NEBL and C10orf114 in early passage primary OSEC and FTSEC cultures. (d) Positive eQTL association between C10orf114 expression and genotype at rs7098100 (r² = 0.86 with rs1243180) in primary high-grade serous ovarian tumors (TCGA data set). (e) Methylation analysis of 277 high-grade serous ovarian tumors (T) compared to seven normal ovarian tissue samples (N); CpG sites near C10orf114 in the tumors show significantly less methylation than in normal samples. (f) Expression versus copy number in tumors from the TCGA data set. MLLT10 and NEBL show a trend for higher levels of gene expression in tumors with increased DNA copy number at 10p12. 1, heterozygous loss; 2, diploid; 3, copy number gain; 4, amplification. (g) FAIRE-seq performed in normal OSECs and FTSECs relative to all SNPs correlated with r² ≥ 0.8 to rs1243180. The SNP rs10828247 coincides with a region of open chromatin at the 5′ end of MLLT10. See Supplementary Figures 8 and 9 for additional information on all genes at this locus.

been described at the molecular level, including the recently reported fusion NEBL-MLL18. NEBL (nebulette) encodes a nebulin-like protein that is abundantly expressed in cardiac muscle and has been implicated in the genetics of sudden cardiac death syndrome and cardiac remodeling19. This evidence does not directly support a role for these genes in ovarian cancer, but the first common gene fusion in serous ovarian cancers (ESRRA-TEX40 (also known as C11orf20)) was recently reported and provides an underlying hypothesis for the involvement of genes at this locus in EOC development20. At chromosome 17q12, the most significantly associated SNP, rs757210, lies in an intron of HNF1B and is associated with the serous subtype of ovarian cancer. On the basis of data imputed from the 1000 Genomes Project, nine SNPs are candidates as the causal variant. SNPs in this region have been associated with diabetes21, endometrial cancer22 and prostate cancer23. There are 13 genes in the 500-kb regions on either side of this SNP (ACACA, C17orf78, TADA2A (also

known as TADA2L), DUSP14, SYNRG (also known as AP1GBP1), DDX52, HNF1B, TBC1D3F, TBC1D3, MRPL45, GPR179, SOCS7 and ARHGAP23). Seven of these genes (DUSP14, HNF1B, TBC1D3, TBC1D3F, MRPL45, SOCS7 and ARHGAP23) were overexpressed in EOC cell lines and primary tumors compared to normal tissues (Fig. 4 and Supplementary Fig. 8), indicating that they may have a role in EOC. HNF1B is a strong candidate susceptibility gene at this locus; it has been extensively studied in EOC and is used as a biomarker for subtype stratification of EOC tumors (ref. 24), particularly to distinguish the clear-cell subtype from other EOC subtypes. Consistent with this, overexpression of HNF1B in EOC cell lines was driven largely by higher expression in clear-cell EOC cell lines (Fig. 4; ref. 25). However, HNF1B shows lower expression in primary serous EOC tissues compared to normal tissues, which may suggest a different role for this gene in clear-cell compared to serous tumors (ref. 26). The phenotypic consequences of HNF1B knockdown in clear-cell EOC cell lines also

[Figure 4 graphic not reproduced. Panels: (a) genomic map of 17q12 around rs757210, with axis ticks at 35,800,000 and 36,200,000; (b) relative expression in tumor (T) versus normal (N) lines for ACACA, C17orf78, TADA2A, DUSP14, SYNRG, DDX52, HNF1B, TBC1D3F, TBC1D3, MRPL45, GPR179, SOCS7 and ARHGAP23; (c) relative expression by subtype (SC, n = 11; CC, n = 15; EC, n = 2; MC, n = 2; UD/UK, n = 20); (d) methylation β value for SYNRG and HNF1B in T (n = 277) versus N (n = 7); (e) methylation β value for SYNRG (P = 0.4) and HNF1B (P = 0.009) by genotype (GG, n = 154; AA/AG, n = 73).]

Figure 4 Summary of the functional analysis of the 17q12 locus. The box of the box-and-whisker plots shows the median and interquartile range, and the whiskers represent the 9th and 91st percentiles. (a) Genomic map of a 1-Mb region at 17q12 centered on the most statistically significant SNP, rs757210. The locations and approximate sizes of all 13 known protein-coding genes (gray) in the region are shown relative to the location of rs757210 (red dotted line). (b) Expression analysis for all genes at this locus performed in ovarian tumor (T) cell lines and normal (N) OSEC and FTSEC primary cultures showing the relative levels of expression for each gene. **P < 0.01, ***P < 0.001. (c) Overexpression of HNF1B detected in EOC cells compared to OSECs and FTSECs was driven largely by the high expression of this gene in clear-cell EOC cell lines (SC, serous; CC, clear cell; EC, endometrioid; MC, mucinous; UD/UK, undetermined or unknown). HNF1B is an established clear-cell EOC biomarker (ref. 24). (d) Methylation analysis of 277 high-grade serous ovarian tumors (T) compared to seven normal ovarian tissue samples (N) showing significant hypermethylation of CpG sites upstream of SYNRG and HNF1B in tumors compared to normal tissues. (e) mQTL analysis showing methylation status of SYNRG and HNF1B in primary high-grade serous ovarian cancers by germline genotype for individuals carrying one or two copies of the minor allele (AA/AG) and individuals carrying two copies of the major allele (GG) for rs757210. Only methylation at HNF1B showed a significant association with genotype at this locus. See Supplementary Figures 8 and 9 for additional information on all genes at this locus.

suggest that it may behave as an oncogene in the development of this subtype (ref. 27). We found no correlation between HNF1B expression and DNA copy number variation at this locus in primary EOC tissues, but there was a highly statistically significant inverse correlation between HNF1B expression and methylation (P = 2.1 × 10^-6), which implies that the mechanism for overexpression of this gene is epigenetic. RNA sequencing (RNA-seq) analysis of normal ovarian cancer precursor tissues indicates that HNF1B is expressed at extremely low levels (Fig. 4), which restricts the extent to which the function of this gene in normal ovarian cancer precursor tissues can be studied. We found no evidence for eQTL association between rs757210 and the expression of any gene in normal tissues throughout the region, but we observed a strong mQTL association between rs757210 and HNF1B methylation (P = 0.009) (Fig. 4 and Supplementary Table 4) in primary serous EOC tissues. The minor (risk) allele of rs757210 was associated with lower methylation and is therefore predicted to be associated with increased HNF1B expression. In the absence of additional functional data, it is difficult to interpret these findings, but, given the possible role of HNF1B as an oncogene in the development of clear-cell ovarian cancer, it may be that increased HNF1B expression at an early stage in the development of ovarian cancer precursor tissues, driven by the risk variant(s) underlying susceptibility, has increased oncogenic activity in the proportion of individuals with serous ovarian cancer who carry this allele. Overall, the functional data we generated do not point strongly to any one gene at 17q12 as the functionally relevant susceptibility

gene. However, when combined with a large body of previous work implicating HNF1B in ovarian cancer development, the data suggest that this gene is the strongest candidate, and the mQTL and methylation-expression associations suggest a mechanism for genetic variants influencing HNF1B expression and disease susceptibility through epigenetic regulation.

DISCUSSION

In this study, we demonstrated the strength of large-scale collaboration in genetic association studies. We have identified three common alleles newly associated with susceptibility to EOC and confirmed two suggestive loci that had been previously reported at near genome-wide significance. Molecular analyses of genes at these loci, combining publicly available data sets and systematic, large-scale experiments, point to a small number of candidate gene targets that may have a role in EOC initiation and development. However, the effects of the new susceptibility loci were modest, and together they explain less than 1% of the excess familial risk of EOC, with about 4% being explained by all known loci with common susceptibility alleles. The lack of heterogeneity between studies of varying designs carried out in different populations and the high levels of statistical significance indicate that these are robust associations.

Fewer common susceptibility loci have now been found for EOC than for several other common cancers, including breast, colorectal and prostate cancers (ref. 28). It seems unlikely that the underlying genetic architecture for EOC susceptibility is substantially different from those of other cancers. This suggests that a key factor limiting our ability to detect susceptibility loci is sample size: the power of this study to detect risk alleles across a range of effect sizes was modest (Supplementary Fig. 12). However, EOC is less common than these other cancers and has a higher mortality rate, and recruiting extremely high numbers of cases will be difficult. Disease heterogeneity will also reduce power if a substantial proportion of EOC susceptibility alleles are subtype specific. All EOC susceptibility loci so far identified are strongly associated with serous EOC, which is also the most common subtype. Both the discovery and replication phases of this study were weighted toward identifying risk alleles associated with serous EOC. It seems probable that additional common risk loci for clear-cell, endometrioid and mucinous EOC subtypes also exist and await identification.

Several EOC susceptibility alleles have now been identified that increase the risk of multiple cancers. For example, an increased risk of estrogen receptor-negative breast cancer is associated with the EOC susceptibility allele at 19p13 (refs. 5,29), and the EOC susceptibility allele at the 17q12 locus reported in this manuscript is also associated with risk of endometrial (ref. 22) and prostate (ref. 23) cancers. Several of the loci containing EOC susceptibility alleles have been found to harbor different susceptibility alleles for other cancers. For example, Michailidou and colleagues (ref. 30) found an association between rs7072776 at 10p12 and breast cancer. This SNP is ~120 kb centromeric to and partially correlated with rs1243180 (r² = 0.51). Michailidou and colleagues (ref. 30) also report an association of rs11780156 at 8q24 with breast cancer. The new locus lies ~300 kb telomeric of the known locus for ovarian cancer (rs10088218; ref. 6) but is uncorrelated with it (r² = 0.02). Both loci lie ~400 kb 3′ of MYC.
Previous GWAS have identified multiple loci 5′ of MYC that are associated with different cancer types, including a locus for breast cancer. These associations may reflect the tissue-specific regulation of key genes, and understanding the functional mechanisms underlying genetic associations at the same locus for different phenotypes may provide insights into more general mechanisms of disease etiology and cancer development.

Assuming a log-additive model of interaction between loci, the currently known loci (Table 2 and Supplementary Table 3) define a genetic risk profile with a combined variance for the log relative risk distribution of 0.057. Such a distribution has limited discriminatory ability: the estimated relative risks at the 5th and 95th percentiles are 0.63 and 1.48, respectively. However, on the basis of what is known about the architecture of genetic susceptibility for other cancers, it is probable that many more common susceptibility alleles exist. The discovery of genetic association with ovarian cancer may be enhanced by taking advantage of functional annotation data and the analysis of gene-gene and gene-environment interactions using a genome-wide approach. Continued international efforts are needed to establish new case-control studies, expand existing case-control studies and improve the pathological characterization of the cases in these studies to unravel the inherited genetic basis of the disease. In combination with rarer risk alleles and other risk factors, genetic profiling may provide sufficient discrimination to justify targeted ovarian cancer prevention.

URLs. 1000 Genomes Project, http://www.1000genomes.org/page.php/; COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/; GeneMANIA, http://genemania.org/; MACH, http://www.sph.umich.edu/csg/abecasis/MACH/; NCBI Unigene, http://www.ncbi.nlm.nih.gov/unigene; TCGA Project, http://cancergenome.nih.gov/; Cancer Genome Atlas Project data bioportal, http://www.cbioportal.org/; Wellcome Trust Case Control Consortium, http://www.wtccc.org.uk/.
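The risk-profile figures quoted in the Discussion (a combined variance of 0.057 for the log relative risk) can be explored with a short calculation. The sketch below is illustrative only and is not the authors' code: it assumes that log relative risk is normally distributed and, by default, centred on zero; the function name and the `center_on_mean_risk` switch are hypothetical. The text does not state which centring convention was used, so the output need not reproduce the published percentiles exactly.

```python
from math import exp, sqrt
from statistics import NormalDist


def relative_risk_percentile(variance, percentile, center_on_mean_risk=False):
    """Relative risk at a percentile of a normal log-relative-risk distribution.

    Assumption (not stated in the paper): log RR ~ Normal(mu, variance).
    By default mu = 0, i.e. risk is expressed relative to the median individual.
    With center_on_mean_risk=True, mu = -variance / 2 so that the mean RR is 1.
    """
    sigma = sqrt(variance)
    mu = -variance / 2 if center_on_mean_risk else 0.0
    z = NormalDist().inv_cdf(percentile)  # standard normal quantile
    return exp(mu + z * sigma)


VARIANCE = 0.057  # combined variance of log relative risk reported in the text

lower = relative_risk_percentile(VARIANCE, 0.05)
upper = relative_risk_percentile(VARIANCE, 0.95)
print(f"5th percentile RR: {lower:.2f}, 95th percentile RR: {upper:.2f}")
```

Under the zero-centred assumption the 95th percentile comes out close to the 1.48 quoted above; the lower tail is more sensitive to the centring convention, which the text does not specify.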

The statistical software programs used are available at http://ccge.medschl.cam.ac.uk/software/.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank all the individuals who took part in this study and all the researchers, clinicians and technical and administrative staff who made possible the many studies contributing to this work (a full list is provided in the Supplementary Note). The COGS project is funded through a European Commission's Seventh Framework Programme grant (agreement number 223175 - HEALTH-F2-2009-223175). The Ovarian Cancer Association Consortium is supported by a grant from the Ovarian Cancer Research Fund thanks to donations by the family and friends of Kathryn Sladek Smith (PPD/RPCI.07). The scientific development and funding for this project were supported in part by the Genetic Associations and Mechanisms in Oncology (GAME-ON) and a National Cancer Institute Cancer Post-GWAS Initiative (U19-CA148112). Details of the funding of individual investigators and studies are provided in the Supplementary Note. This study made use of data generated by the Wellcome Trust Case Control Consortium; funding for the project was provided by the Wellcome Trust under award 076113. A full list of the investigators who contributed to the generation of the data is available from the website (see URLs). The results published here are based in part on data generated by The Cancer Genome Atlas Pilot Project established by the National Cancer Institute and National Human Genome Research Institute; information about The Cancer Genome Atlas (TCGA) and the investigators and institutions who constitute the TCGA research network can be found on the website (see URLs).

AUTHOR CONTRIBUTIONS
Writing group: P.D.P.P., Y.-Y.T., C.M.P., S.J.R., J.M.S., T.A.S., B.L.F., E.L.G., A.N.A.M. and S.A.G. All authors read and approved the final manuscript. Provision of samples and data from contributing studies: K.L., M.P., J.P.T., H. Shen, R.W., R.K., M.C.L., H. Song, D.C.T., F.B., D.V., J.M.C., J.D., E. Dicks, K.K.A., H.A.-C., N.A., S.M.A., L.B., E.V.B., M.W.B., M.J.B., G.B., N.B., J.D.B., L.A.B., A.B.-W., R. Brown, R. Butzow, I.C., M.E.C., R.S.C., J.C.-C., Y.A.C., Z.C., A.D.-M., E. Despierre, J.A.D., T.D., A.d.B., M.D., D.E., R.E., A.B.E., P.A.F., D.F., J.F., Y.-T.G., M.G.-C., A.G.-M., G.G., A.G., M.G., J.G., Q.G., M.K.H., P. Harter, A.H., F.H., P. Hillemanns, M.H., E.H., C.K.H., S.H., A. Jakubowska, A. Jensen, K.R.K., B.Y.K., L.E.K., L.A.K., S.K.K., G.E.K., C.K., J.K., D.L., S.L., N.D.L., N.L., J. Lee, A.L., B.K.L., J. Lissowska, J. Lubiński, L.L., G.L., L.F.A.G.M., K.M., V.M., J.R.M., U.M., F.M., K.B.M., T.N., S.A.N., R.B.N., H. Nevanlinna, S.N., H. Noushmehr, K.O., S.O., I.O., J.P., T.P., L.M.P., J.P.-W., M.C.P., E.M.P., X.Q., H.A.R., L.R.-R., M.A.R., A.R., I.R., I.K.R., H.B.S., I.S., G.S., H. Shen, V.S., X.-O.S., W.S., M.C.S., P.S., K.T., S.-H.T., K.L.T., P.J.T., A.T., S.S.T., A.M.v.A., D.v.d.B., I.V., R.A.V., A.F.V., S.W.-G., N.W., A.S.W., E.W., B.W., Y.L.W., A.H.W., H.P.Y., W.Z., A.Z., F.Z., M.T.G., P. Hall, D.F.E., C.L.P., A.B., G.C.-T., E.I. and J.M.S. Bioinformatics and data management: J.D., E. Dicks, Z.C. and R.W. Data analysis: J.P.T., Q.G., Y.-Y.T. and B.L.F. Preparation of samples for genotyping: S.J.R. and C.M.P. Genotyping: J.M.C., D.C.T., F.B. and D.V. Functional analyses: S.A.G., M.B., A.N.A.M., B.L.F., K.L., H. Shen, E.L.G., S.J.R., Y.A.C. and M.L.C.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Lichtenstein, P. et al. Environmental and heritable factors in the causation of cancer – analyses of cohorts of twins from Sweden, Denmark and Finland. N. Engl. J. Med. 343, 78–85 (2000).
2. Stratton, J.F., Pharoah, P., Smith, S.K., Easton, D. & Ponder, B.A. A systematic review and meta-analysis of family history and risk of ovarian cancer. Br. J. Obstet. Gynaecol. 105, 493–499 (1998).
3. Antoniou, A.C. & Easton, D.F. Risk prediction models for familial breast cancer. Future Oncol. 2, 257–274 (2006).
4. Song, H. et al. A genome-wide association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nat. Genet. 41, 996–1000 (2009).
5. Bolton, K.L. et al. Common variants at 19p13 are associated with susceptibility to ovarian cancer. Nat. Genet. 42, 880–884 (2010).
6. Goode, E.L. et al. A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat. Genet. 42, 874–879 (2010).
7. Shen, H. et al. Epigenetic analysis leads to identification of HNF1B as a subtype-specific susceptibility gene for ovarian cancer. Nat. Comm. published online; doi:10.1038/ncomms2629 (27 March 2013).
8. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
9. Jia, L. et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet. 5, e1000597 (2009).
10. Kim, M.J. et al. Functional characterization of liver enhancers that regulate drug-associated transporters. Clin. Pharmacol. Ther. 89, 571–578 (2011).
11. Wasserman, N.F., Aneas, I. & Nobrega, M.A. An 8q24 gene desert variant associated with prostate cancer risk confers differential in vivo activity to a MYC enhancer. Genome Res. 20, 1191–1197 (2010).
12. Wright, J.B., Brown, S.J. & Cole, M.D. Upregulation of c-MYC in cis through a large chromatin loop linked to a cancer risk-associated single-nucleotide polymorphism in colorectal cancer cells. Mol. Cell Biol. 30, 1411–1420 (2010).
13. McCullough, J., Fisher, R.D., Whitby, F.G., Sundquist, W.I. & Hill, C.P. ALIX-CHMP4 interactions in the human ESCRT pathway. Proc. Natl. Acad. Sci. USA 105, 7687–7691 (2008).
14. Carlton, J.G., Caballe, A., Agromayor, M., Kloc, M. & Martin-Serrano, J. ESCRT-III governs the Aurora B-mediated abscission checkpoint through CHMP4C. Science 336, 220–225 (2012).
15. Yu, X., Riley, T. & Levine, A.J. The regulation of the endosomal compartment by p53 the tumor suppressor gene. FEBS J. 276, 2201–2212 (2009).
16. Nikolova, D.N. et al. Genome-wide gene expression profiles of ovarian carcinoma: identification of molecular targets for the treatment of ovarian carcinoma. Mol. Med. Report 2, 365–384 (2009).
17. Caudell, D. & Aplan, P.D. The role of CALM-AF10 gene fusion in acute leukemia. Leukemia 22, 678–685 (2008).
18. Cóser, V.M. et al. Nebulette is the second member of the nebulin family fused to the MLL gene in infant leukemia. Cancer Genet. Cytogenet. 198, 151–154 (2010).
19. Ram, R. & Blaxall, B.C. Nebulette mutations in cardiac remodeling: big effects from a small mechanosensor. J. Am. Coll. Cardiol. 56, 1503–1505 (2010).
20. Salzman, J. et al. ESRRA-C11orf20 is a recurrent gene fusion in serous ovarian carcinoma. PLoS Biol. 9, e1001156 (2011).
21. Voight, B.F. et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589 (2010).
22. Spurdle, A.B. et al. Genome-wide association study identifies a common variant associated with risk of endometrial cancer. Nat. Genet. 43, 451–454 (2011).
23. Elliott, K.S. et al. Evaluation of association of HNF1B variants with diverse cancers: collaborative analysis of data from 19 genome-wide association studies. PLoS ONE 5, e10858 (2010).
24. Kato, N., Sasou, S. & Motoyama, T. Expression of hepatocyte nuclear factor-1β (HNF-1β) in clear cell tumors and endometriosis of the ovary. Mod. Pathol. 19, 83–89 (2006).
25. Tsuchiya, A. et al. Expression profiling in ovarian clear cell carcinoma: identification of hepatocyte nuclear factor-1β as a molecular marker and a possible molecular target for therapy of ovarian clear cell carcinoma. Am. J. Pathol. 163, 2503–2512 (2003).
26. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
27. Kato, N. & Motoyama, T. Hepatocyte nuclear factor-1β (HNF-1β) in human urogenital organs: its expression and role in embryogenesis and tumorigenesis. Histol. Histopathol. 24, 1479–1486 (2009).
28. Hindorff, L.A., Junkins, H.A., Hall, P.A., Mehta, J.P. & Manolio, T.A. A catalogue of published genome-wide association studies. Genome.gov: National Institutes of Health National Human Genome Research Institute. <http://www.genome.gov/gwastudies/> (2013).
29. Antoniou, A.C. et al. A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-negative breast cancer in the general population. Nat. Genet. 42, 885–892 (2010).
30. Michailidou, K. et al. Large-scale genotyping identifies 41 new breast cancer susceptibility loci. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
31. Lawrenson, K. et al. Senescent fibroblasts promote neoplastic transformation of partially transformed ovarian epithelial cells in a three-dimensional model of early stage ovarian cancer. Neoplasia 12, 317–325 (2010).


Paul D P Pharoah1,2,107, Ya-Yu Tsai3,107, Susan J Ramus4,107, Catherine M Phelan3,107, Ellen L Goode5, Kate Lawrenson4, Melissa Buckley3, Brooke L Fridley5, Jonathan P Tyrer1, Howard Shen4, Rachel Weber6, Rod Karevan4, Melissa C Larson7, Honglin Song1, Daniel C Tessier8,9, François Bacot8,9, Daniel Vincent8,9, Julie M Cunningham10, Joe Dennis2, Ed Dicks1, Australian Cancer Study11, Australian Ovarian Cancer Study Group11, Katja K Aben12,13, Hoda Anton-Culver14, Natalia Antonenkova15, Sebastian M Armasu7, Laura Baglietto16,17, Elisa V Bandera18, Matthias W Beckmann19, Michael J Birrer20,21, Greg Bloom3, Natalia Bogdanova22, James D Brenton23, Louise A Brinton24, Angela Brooks-Wilson25, Robert Brown26, Ralf Butzow27,28, Ian Campbell29,30, Michael E Carney31, Renato S Carvalho3, Jenny Chang-Claude32, Y Anne Chen3, Zhihua Chen3, Wong-Ho Chow33, Mine S Cicek5, Gerhard Coetzee34, Linda S Cook35, Daniel W Cramer36,37, Cezary Cybulski38, Agnieszka Dansonka-Mieszkowska39, Evelyn Despierre40, Jennifer A Doherty41, Thilo Dörk22, Andreas du Bois42,43, Matthias Dürst44, Diana Eccles45, Robert Edwards46,47, Arif B Ekici48, Peter A Fasching19,49, David Fenstermacher3, James Flanagan26, Yu-Tang Gao50, Montserrat Garcia-Closas51,52, Aleksandra Gentry-Maharaj53, Graham Giles16,17,54, Anxhela Gjyshi3, Martin Gore55, Jacek Gronwald38, Qi Guo1, Mari K Halle56,57, Philipp Harter42,43, Alexander Hein19, Florian Heitz42,43, Peter Hillemanns58, Maureen Hoatlin59, Estrid Høgdall60,61, Claus K Høgdall62, Satoyo Hosono63, Anna Jakubowska38, Allan Jensen60, Kimberly R Kalli64, Beth Y Karlan65, Linda E Kelemen66,67, Lambertus A Kiemeney12,13,68, Susanne Krüger Kjaer60,62, Gottfried E Konecny49, Camilla Krakstad56,57, Jolanta Kupryjanczyk39, Diether Lambrechts69, Sandrina Lambrechts40, Nhu D Le70, Nathan Lee4, Janet Lee4, Arto Leminen28, Boon Kiong Lim71, Jolanta Lissowska72, Jan Lubiński38, Lene Lundvall62, Galina Lurie31, Leon F A G Massuger73, Keitaro Matsuo63, Valerie McGuire74, John R McLaughlin75,76, Usha Menon53, Francesmary Modugno46,77, Kirsten B Moysich78, Toru Nakanishi79, Steven A Narod80, Roberta B Ness81, Heli Nevanlinna28, Stefan Nickels32, Houtan Noushmehr34,82, Kunle Odunsi78, Sara Olson83, Irene Orlow83, James Paul84, Tanja Pejovic85,86, Liisa M Pelttari28, Jenny Permuth-Wey3, Malcolm C Pike4,83, Elizabeth M Poole37,87, Xiaotao Qu3, Harvey A Risch88, Lorna Rodriguez-Rodriguez18, Mary Anne Rossing89,90, Anja Rudolph32, Ingo Runnebaum44, Iwona K Rzepecka39, Helga B Salvesen56,57, Ira Schwaab91, Gianluca Severi16,17, Hui Shen82, Vijayalakshmi Shridhar10, Xiao-Ou Shu92,93, Weiva Sieh74, Melissa C Southey94, Paul Spellman95, Kazuo Tajima63, Soo-Hwang Teo96,97, Kathryn L Terry36,37, Pamela J Thompson31, Agnieszka Timorek98,99, Shelley S Tworoger37,87, Anne M van Altena73, David van den Berg4, Ignace Vergote40, Robert A Vierkant5, Allison F Vitonis36, Shan Wang-Gohrke100, Nicolas Wentzensen24, Alice S Whittemore74, Elisabeth Wik56,57, Boris Winterhoff101, Yin Ling Woo71, Anna H Wu4, Hannah P Yang24, Wei Zheng92,93, Argyrios Ziogas14, Famida Zulkifli96,97, Marc T Goodman31, Per Hall102, Douglas F Easton1,2, Celeste L Pearce4, Andrew Berchuck103, Georgia Chenevix-Trench104, Edwin Iversen105, Alvaro N A Monteiro3, Simon A Gayther4, Joellen M Schildkraut6,106,108 & Thomas A Sellers3,108
1The Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK. 2The Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK. 3Department of Cancer Epidemiology, Division of Population Sciences, Moffitt Cancer Center, Tampa, Florida, USA. 4Department of Preventive Medicine, Keck School of Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, USA. 5Department of Health Science Research, Division of Epidemiology, Mayo Clinic, Rochester, Minnesota, USA. 6Department of Community and Family Medicine, Duke University Medical Center, Durham, North Carolina, USA. 7Division of Biomedical Statistics and Informatics, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, USA. 8Centre Technologiques, McGill University, Montreal, Quebec, Canada. 9McGill University and Génome Québec Innovation Centre, Montreal, Quebec, Canada. 10Department of Laboratory Medicine and Pathology, Division of Anatomic Pathology, Mayo Clinic, Rochester, Minnesota, USA. 11A list of members is provided in the Supplementary Note. 12Department of Epidemiology, Biostatistics and Health Technology Assessment, Radboud University Medical Centre, Nijmegen, The Netherlands. 13Comprehensive Cancer Center The Netherlands, Utrecht, The Netherlands. 14Department of Epidemiology, Center for Cancer Genetics Research and Prevention, School of Medicine, University of California Irvine, Irvine, California, USA. 15N.N. Aleksandrov Byelorussian Institute for Oncology and Medical Radiology, Minsk, Belarus. 16Cancer Epidemiology Centre, The Cancer Council Victoria, Melbourne, Victoria, Australia. 17Centre for Molecular, Environmental, Genetic and Analytical Epidemiology, The University of Melbourne, Melbourne, Victoria, Australia. 18The Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA. 
19Department of Gynecology and Obstetrics, University Hospital Erlangen, Friedrich-Alexander-University Erlangen-Nuremberg, Comprehensive Cancer Center, Erlangen, Germany. 20Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA. 21Massachusetts General Hospital Cancer Center, Massachusetts General Hospital, Boston, Massachusetts, USA. 22Gynaecology Research Unit, Hannover Medical School, Hannover, Germany. 23Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK. 24Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA. 25Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada. 26Department of Surgery and Cancer, Imperial College London, London, UK. 27Department of Pathology, Helsinki University Central Hospital, Helsinki, Finland. 28Department of Obstetrics and Gynecology, Helsinki University Central Hospital, Helsinki, Finland. 29Cancer Genetics Laboratory, Research Division, Peter MacCallum Cancer Centre, Melbourne, Victoria, Australia. 30Department of Pathology, The University of Melbourne, Parkville, Victoria, Australia. 31Cancer Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, USA. 32Division of Cancer Epidemiology, German Cancer Research Center, Heidelberg, Germany. 33Division of Cancer Etiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA. 34Department of Microbiology and Preventive Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, USA. 35Department of Internal Medicine, University of New Mexico, Albuquerque, New Mexico, USA. 36Obstetrics and Gynecology Epidemiology Center, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA. 37Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 
38International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland. 39Department of Molecular Pathology, The Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland. 40Division of Gynecologic Oncology, Department of Obstetrics and Gynaecology, Leuven Cancer Institute, Leuven, Belgium. 41Section of Biostatistics and Epidemiology, The Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, USA. 42Department of Gynecology and Gynecologic Oncology, Dr. Horst Schmidt Kliniken Wiesbaden, Wiesbaden, Germany. 43Department of Gynecology and Gynecologic Oncology, Kliniken Essen-Mitte/Evang. Huyssens-Stiftung/Knappschaft, Essen, Germany. 44Department of Gynecology, Jena University Hospital, Jena, Germany. 45Faculty of Medicine, University of Southampton, University Hospital Southampton, Southampton, UK. 46Department of Obstetrics, Gynecology and Reproductive Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 47Women's Cancer Program, Magee-Womens Research Institute, Pittsburgh, Pennsylvania, USA. 48Institute of Human Genetics, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany. 49Department of Medicine, Division of Hematology and Oncology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA. 50Department of Epidemiology, Shanghai Cancer Institute, Shanghai, China. 51Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, UK. 52Breakthrough Breast Cancer Research Centre, Institute of Cancer Research, London, UK. 53Gynaecological Cancer Research Centre, University College London Elizabeth Garrett Anderson Institute for Women's Health, London, UK. 54Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Victoria, Australia. 55Gynecological Oncology Unit, The Royal Marsden Hospital, London, UK. 
56Department of Gynecology and Obstetrics, Haukeland University Hospital, Bergen, Norway. 57Department of Clinical Medicine, University of Bergen, Bergen, Norway. 58Clinics of Obstetrics and Gynaecology, Hannover Medical School, Hannover, Germany. 59Department of Biochemistry and Molecular Biology, Oregon Health and Science University, Portland, Oregon, USA. 60Virus, Lifestyle and Genes, Danish Cancer Society Research Center, Copenhagen, Denmark. 61Molecular Unit, Department of Pathology, Herlev Hospital, University of Copenhagen, Copenhagen, Denmark. 62The Juliane Marie Centre, Department of Obstetrics and Gynecology, Rigshospitalet, Copenhagen, Denmark. 63Division of Epidemiology and Prevention, Aichi Cancer Center Research Institute, Nagoya, Japan. 64Department of Medical Oncology, Mayo Clinic, Rochester, Minnesota, USA. 65Women's Cancer Program at the Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California, USA. 66Department of Population Health Research, Alberta Health Services Cancer Care, Calgary, Alberta, Canada. 67Department of Medical Genetics, University of Calgary, Calgary, Alberta, Canada. 68Department of Urology, Radboud University Medical Centre, Nijmegen, The Netherlands. 69Vesalius Research Center, University of Leuven, Leuven, Belgium. 70Cancer Control Research, BC Cancer Agency, Vancouver, British Columbia, Canada. 71Department of Obstetrics and Gynaecology, University Malaya Medical Centre, University Malaya, Kuala Lumpur, Malaysia. 72The Maria Sklodowska-Curie Memorial Cancer Center, Warsaw, Poland. 73Department of Gynaecology, Radboud University Medical Centre, Nijmegen, The Netherlands. 74Department of Health Research and Policy - Epidemiology, Stanford University School of Medicine, Stanford, California, USA. 75Dalla Lana School of Public Health, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada. 76Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 
77Department of Epidemiology, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 78Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, USA. 79Department of Gynecologic Oncology, Aichi Cancer Center Central Hospital, Nagoya, Japan. 80Women's College Research Institute, University of Toronto, Toronto, Ontario, Canada. 81The University of Texas School of Public Health, Houston, Texas, USA. 82University of Southern California Epigenome Center, Keck School of Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, USA. 83Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York, USA. 84The Beatson West of Scotland Cancer Centre, Glasgow, UK. 85Department of Obstetrics and Gynecology, Oregon Health and Science University, Portland, Oregon, USA. 86Knight Cancer Institute, Oregon Health and Science University, Portland, Oregon, USA. 87Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA. 88Department of Epidemiology and Public Health, Yale University School of Public Health and School of Medicine, New Haven, Connecticut, USA. 89Department of Epidemiology, University of Washington, Seattle, Washington, USA. 90Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 91Institut für Humangenetik, Wiesbaden, Germany. 92Vanderbilt Epidemiology Center, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 93Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 94Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Melbourne, Victoria, Australia. 95Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, USA. 96Cancer Research Initiatives Foundation, Sime Darby Medical Centre, Subang Jaya, Malaysia. 
97University Malaya Medical Centre, University Malaya, Kuala Lumpur, Malaysia. 98Department of Obstetrics, Gynecology and Oncology, IInd Faculty of Medicine, Warsaw Medical University, Warsaw, Poland. 99Department of Obstetrics and Gynaecology, Brodnowski Hospital, Warsaw, Poland. 100Department of Obstetrics and Gynecology, University of Ulm, Ulm, Germany. 101Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota, USA. 102Department of Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 103Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, North Carolina, USA. 104Cancer Division, Queensland Institute of Medical Research, Herston, Queensland, Australia. 105Department of Statistics, Duke University, Durham, North Carolina, USA. 106Cancer Prevention, Detection and Control Research Program, Duke Cancer Institute, Durham, North Carolina, USA. 107These authors contributed equally to this work. 108These authors jointly directed this work. Correspondence should be addressed to P.D.P.P. (paul.pharoah@medschl.cam.ac.uk).


ONLINE METHODS

SNP selection. We combined the results from two ovarian cancer GWAS from North America and the UK. Details of these studies have been published previously4,32 and are described in the Supplementary Note. To account for different marker sets and improve genome coverage, imputation to HapMap 2 was performed using 60 CEU (Utah residents of Northern and Western European ancestry) founders as a reference. Data on 2,508,744 genotyped SNPs or SNPs imputed with r² > 0.3 were available for analysis. The North American and UK studies were analyzed separately, and the results were combined using a fixed-effects meta-analysis. The 2.5 million SNPs were ranked according to the P values for each of four analyses: North American study only (all invasive and serous histology) and combined GWAS meta-analysis (all invasive and serous histology). Each SNP was then assigned its minimum (best) rank across the four sets of results. SNPs with MAF < 3% and SNPs that were already genotyped or in perfect linkage disequilibrium with UK GWAS phase 2 SNPs were excluded. We acquired the design score for each SNP using the Illumina Assay Design Tool and removed SNPs that were redundant or predicted to perform poorly. In total, 24,552 SNPs were included on the iCOGS custom genotyping array (Supplementary Note).

Study populations. A total of 47,630 samples from 43 studies in OCAC were genotyped, of which 44,308 passed quality control, including 18,174 cases (10,315 with the serous subtype) and 26,134 controls (Supplementary Table 1). All participants provided written informed consent. Each contributing study was approved by the relevant local institutional research ethics committee. The HapMap samples from European (CEU, n = 60), African (YRI, n = 53) and Asian (JPT + CHB, n = 88) populations were also genotyped using the iCOGS array.

SNP genotyping. Genotyping was conducted using an Illumina Infinium iSelect BeadChip in six centers, of which two were used for OCAC: McGill University and Génome Québec Innovation Centre (n = 19,806) and the Mayo Clinic Medical Genome Facility (n = 27,824). Each 96-well plate contained 250 ng of genomic DNA (or 500 ng of whole-genome amplified DNA). Raw intensity data files for all consortia were sent to the COGS data coordination center at the University of Cambridge for centralized genotype calling and quality control. Genotypes were called using GenCall33. Initial calling used a cluster file generated using 270 samples from HapMap 2. These calls were used for ongoing quality control checks during the genotyping. To generate the final calls used for the data analysis, we first selected a subset of 3,018 individuals, including samples from each of the genotyping centers, each of the participating consortia and each major ancestry group. Only plates with a consistently high call rate in the initial calling were used. The HapMap samples and ~160 samples that were known positive controls for rare variants on the array were used to generate a cluster file that was then applied to call the genotypes for the remaining samples. We also investigated two other calling algorithms, Illuminus34 and GenoSNP35, but manual inspection of a sample of SNPs with discrepant calls indicated that GenCall was invariably superior.

Sample quality control. For OCAC, 1,273 samples were genotyped in duplicate. Genotypes were discordant for >40% of SNPs for 22 pairs. For the remaining 1,251 pairs, concordance was >99.6%. In addition, we identified 245 pairs of samples that were unexpected genotypic duplicates. Of these, 137 were phenotypic duplicates and were judged to be from the same individual. We used identity by state to identify 618 pairs of first-degree relatives. We excluded (i) 1,133 samples with a conversion rate of <95%; (ii) 169 samples with heterozygosity >5 s.d. from the intercontinental ancestry-specific mean heterozygosity; (iii) 65 samples with ambiguous sex; (iv) 269 samples with the lowest call rate from a first-degree relative pair; and (v) 1,686 samples that were either duplicate samples that were discordant for genotype or genotypic duplicates that were not concordant for phenotype. Thus, a total of 44,308 subjects, including 18,174 cases and 26,134 controls, were available for analysis. Of these, 2,482 had been in the North American GWAS, 1,641 were in phase 1 of the UK GWAS, and 8,463 were in phase 2 of the UK GWAS.

SNP quality control. Of the 211,155 SNP assays successfully designed and included on the array, we excluded (i) 1,311 SNPs without a genotype call; (ii) 2,857 monomorphic SNPs; (iii) 5,201 SNPs with a call rate of <95% and MAF of >0.05 or a call rate of <99% and MAF of <0.05; (iv) 2,194 SNPs showing evidence of deviation of genotype frequencies from Hardy-Weinberg equilibrium (P < 1 × 10⁻⁷); and (v) 22 SNPs with >2% discordance in duplicate pairs. Overall, 94.5% of SNPs passed quality control. Genotype intensity cluster plots were visually inspected for the most strongly associated SNPs at each newly identified locus.

Statistical methods. We used the program LAMP36 to assign intercontinental ancestry on the basis of genotype frequencies in the European, Asian and African populations. Subjects with >90% European ancestry were defined as European (n = 39,944), and those with >80% Asian or African ancestry were defined as Asian (n = 2,388) or African (n = 387), respectively. All other subjects were defined as being of mixed ancestry (n = 1,770). We then used a set of 37,000 unlinked markers to perform principal-components analysis within each major population subgroup37. To enable this analysis on very large sample sizes, we used an in-house program written in C++ using the Intel MKL libraries for eigenvectors (see URLs). Unconditional logistic regression treating the number of alternate alleles carried as an ordinal variable (log-additive, codominant model) was used to evaluate the association between each SNP and ovarian cancer risk. A likelihood ratio test was used to test for association, and per-allele log ORs and 95% CIs were estimated. The likelihood ratio test has been shown to have greater power than alternatives such as the Wald test and the score test for rare variants38. Separate analyses were carried out for each ancestry group. The model for European subjects was adjusted for study and population substructure by including study-specific indicators and the first five eigenvectors (principal components) from the principal-components analysis in the model.
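As a rough illustration of the per-allele likelihood ratio test described above, the following sketch fits a log-additive logistic model to grouped genotype counts in pure Python. It is a simplified example under stated assumptions, not the analysis code used in the study: it omits the study indicators and principal components, it assumes at least two genotype classes are observed, and the function name `per_allele_lrt` and all counts are hypothetical.

```python
import math


def per_allele_lrt(case_counts, control_counts, n_iter=50, tol=1e-10):
    """Fit logit P(case) = b0 + b1 * g for allele count g = 0, 1, 2.

    case_counts / control_counts: numbers of subjects carrying 0, 1 and 2
    alternate alleles. Returns (per-allele OR, LRT statistic, P value).
    Assumption: no covariates, unlike the adjusted models in the text.
    """
    groups = [(g, case_counts[g], case_counts[g] + control_counts[g])
              for g in range(3)]
    groups = [(g, c, n) for g, c, n in groups if n > 0]

    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Score vector (u0, u1) and 2x2 information matrix for Newton-Raphson.
        u0 = u1 = i00 = i01 = i11 = 0.0
        for g, c, n in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * g)))
            w = n * p * (1.0 - p)
            u0 += c - n * p
            u1 += g * (c - n * p)
            i00 += w
            i01 += g * w
            i11 += g * g * w
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break

    def loglik(a, b):
        total = 0.0
        for g, c, n in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * g)))
            total += c * math.log(p) + (n - c) * math.log(1.0 - p)
        return total

    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    null_b0 = math.log(n_cases / n_controls)  # intercept-only MLE
    lrt = max(2.0 * (loglik(b0, b1) - loglik(null_b0, 0.0)), 0.0)
    # Upper tail of a chi-square distribution on 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lrt / 2.0))
    return math.exp(b1), lrt, p_value


# Hypothetical counts for genotypes carrying 0, 1 and 2 alternate alleles.
odds_ratio, lrt_stat, p_val = per_allele_lrt([80, 80, 0], [100, 50, 0])
```

With only two genotype classes present, as in the hypothetical counts above, the fitted per-allele OR reduces to the ordinary cross-product ratio of the 2 × 2 table, which gives a quick sanity check on the fit.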
For analysis of the Asian population and other ancestry groups, the first five ancestry-specific principal components were included in the model, and one principal component was included in the model for the analysis of subjects of African ancestry. The number of principal components was chosen on the basis of the position of the inflexion of the principal-components scree plot (Supplementary Fig. 13). We tested for subtype-specific heterogeneity by comparing genotype frequencies in the four case subtypes using the Kruskal-Wallis test. We tested for heterogeneity in ORs by study and ancestry using the method of Breslow and Day39. To assess the magnitude of confounding caused by cryptic population substructure, we tested the 147,722 SNPs that had not been selected as candidates for ovarian cancer susceptibility. Inflation in the test statistics (λ) was estimated by dividing the median of the test statistic by 0.455 (the median for the χ² distribution on 1 degree of freedom). The inflation was converted to an equivalent inflation for a study with 1,000 cases and 1,000 controls (λ1000) by adjusting by effective study size:

$$\lambda_{1000} = 1 + \frac{500(\lambda - 1)}{\sum_k \left( \frac{1}{n_k} + \frac{1}{m_k} \right)^{-1}}$$


where n is the number of cases and m is the number of controls in each study stratum, k. In analyses restricted to European subjects and adjusted only for study, there was a small inflation of the test statistics (λ = 1.13, λ1000 = 1.007). This was reduced to λ = 1.078 (λ1000 = 1.004) after adjusting for five principal components. Heterogeneity in ORs between studies was tested with Cochran's Q statistic.

Functional studies. We performed the following assays for each gene in the 1-Mb region centered on the most significantly associated SNP at each locus (Supplementary Note): (i) gene expression in EOC cell lines (n = 50) and normal precursor cells and tissues for ovarian cancers (OSECs and FTSECs) (n = 73) and (ii) CpG island methylation analysis in high-grade serous EOC (n = 106) and normal (n = 7) tissues. We also evaluated these genes in silico using bioinformatics tools to mine publicly available somatic genetic data generated for primary EOCs and other cancer types. These were TCGA data for ~500 high-grade serous EOCs (gene expression, somatic mutation, DNA copy number variation, eQTL and methylation data)26 and the COSMIC40 analysis of mutations in genes curated from the published literature and data from the whole-genome resequencing of cancer samples undertaken by the Cancer Genome Project. We generated coexpression networks for genes in each locus using GeneMANIA with a large data set of gene expression studies (n = 154)41.
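The inflation estimate described under Statistical methods can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' code: λ is the median test statistic divided by 0.455, and λ1000 rescales it by the effective study size, taken here as the sum over strata of (1/n_k + 1/m_k)⁻¹. The function names and the example stratum are hypothetical.

```python
import statistics

# Median of the chi-square distribution on 1 degree of freedom, as quoted
# in the text.
CHI2_1DF_MEDIAN = 0.455


def genomic_inflation(chi2_stats):
    """Inflation factor: median observed statistic over the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def lambda_1000(lam, strata):
    """Rescale lambda to a study of 1,000 cases and 1,000 controls.

    strata: iterable of (n_cases, n_controls) pairs, one per study stratum.
    Assumption: effective size is the sum of 1 / (1/n_k + 1/m_k) over strata.
    """
    effective_size = sum(1.0 / (1.0 / n + 1.0 / m) for n, m in strata)
    return 1.0 + 500.0 * (lam - 1.0) / effective_size


# A single hypothetical stratum of 1,000 cases and 1,000 controls has an
# effective size of 500, so lambda is returned unchanged.
unchanged = lambda_1000(1.13, [(1000, 1000)])
```

The design point is that larger studies inflate λ for a fixed degree of confounding, so dividing by effective size makes values comparable across studies of different size.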

doi:10.1038/ng.2564

Nature Genetics

All these data enabled us to (i) compare gene expression in tumor and normal epithelium (EOC cell lines compared to normal cell lines and TCGA tumors compared to normal tissue); (ii) test for association between copy number alteration and gene expression at each locus; (iii) compare gene methylation status in tumor and normal tissue; (iv) carry out a gene eQTL analysis to evaluate associations between germline genotype and gene expression in lymphoblastoid cell lines, normal serous EOC precursor tissues and tumors; and (v) carry out an mQTL analysis to evaluate associations between germline genotype and gene methylation in tumors (Supplementary Fig. 8).

We used data from ENCODE8 to evaluate the overlap between regulatory elements in noncoding regions and risk-associated SNPs at the three loci. ENCODE describes regulatory DNA elements (for example, enhancers, insulators and promoters) and noncoding RNAs (for example, microRNAs, long noncoding and piwi-interacting RNAs) that may be targets for susceptibility alleles (Supplementary Fig. 9). However, ENCODE does not include data for EOC-associated tissues, and the activity of such regulatory elements often varies in a tissue-specific manner8,42. Therefore, we profiled the spectrum of noncoding regulatory elements in OSECs and FTSECs using a combination of FAIRE-seq and RNA-seq (Supplementary Fig. 9). We also analyzed regulatory regions in early-stage transformed OSECs. For all regulatory biofeatures spanning the 1-Mb region at each locus, we evaluated their overlap with the most strongly associated SNP and all SNPs correlated with it at r² ≥ 0.8.
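The overlap step described above amounts to two operations: collect the index SNP plus its correlated proxies, then test each position against the intervals of the regulatory biofeatures. The sketch below shows one minimal way to do this. The r² threshold of 0.8 is taken as inclusive here by assumption, and all SNP names, positions and feature intervals are hypothetical placeholders rather than data from the study.

```python
def ld_proxies(index_snp, r2_with_index, threshold=0.8):
    """Return the index SNP plus every SNP whose r2 with it meets the threshold."""
    proxies = {snp for snp, r2 in r2_with_index.items() if r2 >= threshold}
    return {index_snp} | proxies


def overlapping_features(snp_positions, features):
    """Map each SNP to the names of the features containing its position.

    snp_positions: dict of SNP name -> base-pair position.
    features: list of (name, start, end) with inclusive coordinates.
    """
    hits = {}
    for snp, pos in snp_positions.items():
        names = [name for name, start, end in features if start <= pos <= end]
        if names:
            hits[snp] = names
    return hits


# Hypothetical inputs for illustration only.
r2_table = {"snp_a": 0.95, "snp_b": 0.81, "snp_c": 0.40}
positions = {"snp_index": 1000, "snp_a": 1500, "snp_b": 9000, "snp_c": 1200}
biofeatures = [("FAIRE_peak_1", 900, 1600), ("enhancer_1", 5000, 6000)]

candidates = ld_proxies("snp_index", r2_table)
candidate_positions = {snp: positions[snp] for snp in candidates}
hits = overlapping_features(candidate_positions, biofeatures)
```

A linear scan over features is adequate for a 1-Mb region; an interval tree would only matter at genome scale.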

32. Permuth-Wey, J. et al. LIN28B polymorphisms influence susceptibility to epithelial ovarian cancer. Cancer Res. 71, 3896–3903 (2011).
33. Kermani, B.G. Artificial intelligence and global normalization methods for genotyping. US patent 7,035,740 (2008).
34. Teo, Y.Y. et al. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23, 2741–2746 (2007).
35. Giannoulatou, E., Yau, C., Colella, S., Ragoussis, J. & Holmes, C.C. GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics 24, 2209–2214 (2008).
36. Sankararaman, S., Sridhar, S., Kimmel, G. & Halperin, E. Estimating local ancestry in admixed populations. Am. J. Hum. Genet. 82, 290–303 (2008).
37. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
38. Xing, G., Lin, C.Y., Wooding, S.P. & Xing, C. Blindly using Wald's test can miss rare disease-causal variants in case-control association studies. Ann. Hum. Genet. 76, 168–177 (2012).
39. Breslow, N.E. & Day, N.E. Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies (International Agency for Research on Cancer, Lyon, 1980).
40. Forbes, S.A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 39, D945–D950 (2011).
41. Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9 (suppl. 1), S4 (2008).
42. Heintzman, N.D. et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112 (2009).


Articles

Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer

TERT-locus SNPs and leukocyte telomere measures are reportedly associated with risks of multiple cancers. Using the Illumina custom genotyping array iCOGS, we analyzed ~480 SNPs at the TERT locus in breast (n = 103,991), ovarian (n = 39,774) and BRCA1 mutation carrier (n = 11,705) cancer cases and controls. Leukocyte telomere measurements were also available for 53,724 participants. Most associations cluster into three independent peaks. The minor allele at the peak 1 SNP rs2736108 associates with longer telomeres (P = 5.8 × 10⁻⁷), lower risks for estrogen receptor (ER)-negative (P = 1.0 × 10⁻⁸) and BRCA1 mutation carrier (P = 1.1 × 10⁻⁵) breast cancers and altered promoter assay signal. The minor allele at the peak 2 SNP rs7705526 associates with longer telomeres (P = 2.3 × 10⁻¹⁴), higher risk of low-malignant-potential ovarian cancer (P = 1.3 × 10⁻¹⁵) and greater promoter activity. The minor alleles at the peak 3 SNPs rs10069690 and rs2242652 increase ER-negative (P = 1.2 × 10⁻¹²) and BRCA1 mutation carrier (P = 1.6 × 10⁻¹⁴) breast and invasive ovarian (P = 1.3 × 10⁻¹¹) cancer risks but not via altered telomere length. The cancer risk alleles of rs2242652 and rs10069690, respectively, increase silencing and generate a truncated TERT splice variant.

A full list of authors and affiliations appears at the end of the paper. Received 6 July 2012; accepted 31 January 2013; published online 27 March 2013; doi:10.1038/ng.2566

Chromosome ends are capped by telomeres, which protect them from inappropriate DNA repair and maintain genomic integrity1. Telomeres consist of structural proteins2 combined with many hundreds of hexanucleotide DNA repeats3,4, which are progressively shortened by normal cell division5–7. Shortening restricts the proliferation of normal somatic cells but not cancer cells, which can maintain long telomeres, usually via telomerase8–10, and may divide indefinitely. The TERT gene at 5p15.33 (NCBI gene 7015) encodes the catalytic subunit of telomerase reverse transcriptase, a key component of telomerase. Germline mutations in TERT cause dyskeratosis congenita, a cancer susceptibility disorder characterized by exceedingly short telomeres11.

Although up to 80% of the variation of telomere length is estimated to be due to heritable factors12,13, association studies of TERT SNPs and differences in leukocyte telomere length have so far been inconclusive14–17. Furthermore, it is unclear whether telomere length, measured in leukocyte DNA, is predictive of cancer risk: retrospective studies report that cancer patients after diagnosis have shorter telomeres than unaffected controls18–21, but prospective studies with DNA taken before diagnosis have been inconclusive19,22,23. SNPs at 5p15.33 are reported to be associated with risks of several human cancers14–16,24–32, including certain subtypes of both ovarian33 and breast34 cancers.

Because of a shared interest, members of each of the constituent consortia in the Collaborative Oncological Gene-environment Study (COGS) nominated SNPs surrounding the TERT locus for inclusion on a genotyping array. Consequently, the iCOGS array design included a combination of individual TERT gene candidate SNPs, as well as a more comprehensive set to fine-scale map the entire locus, for shared use by all consortia. This study had three aims: to assess SNPs across the TERT locus for all detectable associations with mean telomere length and breast and ovarian cancer subtypes; to fine-scale map this locus to identify potentially causal variants for the observed associations; and to evaluate the functional effects of the strongest candidate causative variants.

RESULTS
One hundred and ten SNPs at the 5p15.33 locus (Build 37 positions 1,227,693–1,361,969) passed quality control tests in 103,991 breast cancer cases and controls from 52 Breast Cancer Association Consortium (BCAC) studies, of which 41 studies (89,050 individuals) were of European ancestry, 9 were of Asian ancestry (12,893 individuals) and 2 were of African-American ancestry (2,048 individuals). The same 110 SNPs passed quality control tests in 11,705 BRCA1 mutation carriers of European ancestry, recruited by 45 studies from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA), and 108 SNPs passed quality control tests in 44,308 ovarian cancer cases and controls from 43 Ovarian Cancer Association Consortium (OCAC) studies. For OCAC, analysis was confined to the 39,774 participants of European ancestry, of whom 8,371 cases had invasive epithelial ovarian neoplasia and 986 had serous low-malignant-potential (LMP) neoplasia. For all study participants, genotype imputation, using the 110 genotyped SNPs together with the January 2012 release of the 1000 Genomes Project35–38, was used to increase coverage to ~480 SNPs (imputation r² > 0.3; minor allele frequency (MAF) > 0.02) for each phenotype. Telomere length was initially measured in control subjects from two BCAC studies (Studies of Epidemiology And Risk factors in Cancer Heredity (SEARCH) and the Copenhagen City Heart Study (CCHS); combined n = 15,567) (Supplementary Note).


[Figure 1: seven regional association plots, panels a–g. x axis: Chr. 5 position (Mb), 1.25–1.35. y axis: −log10 P. Gene track: SLC6A18, TERT, MIR4457, CLPTM1L. Legend: imputed, genotyped. Labeled SNPs: rs7705526, rs3215401, rs7734992, rs56963355, rs2736108, rs10069690, rs2242652, rs2736107.]

Figure 1 Association results for all SNPs with seven phenotypes. (a–g) Phenotypes analyzed include telomere length (a), overall breast cancer risk (b), breast cancer risk in BRCA1 mutation carriers (c), ER-negative breast cancer risk (d), ER-positive breast cancer risk (e), serous LMP ovarian cancer risk (f) and serous invasive ovarian cancer risk (g). Directly genotyped SNPs are shown as filled black circles, and imputed SNPs (r² > 0.3, MAF > 0.02) are shown as open red circles, plotted as the negative log of the P value against relative position across the locus. A schematic of the gene structures is shown above a–c. Association peaks are labeled with blue numbers and are shown as gray regions, when appropriate.

Manhattan plots of the genotyped and well-imputed SNPs are shown for the seven phenotypes analyzed, including mean telomere length (Fig. 1a), overall breast cancer risk (Fig. 1b), breast cancer in BRCA1 mutation carriers (Fig. 1c), ER-negative breast cancer (Fig. 1d), ER-positive breast cancer (Fig. 1e), serous LMP ovarian cancer (Fig. 1f) and serous invasive ovarian cancer (Fig. 1g). Conditional analyses within each of these phenotypes identified multiple independent SNP associations each for telomere length, overall breast cancer risk, ER-negative breast cancer and breast cancer in BRCA1 mutation carriers but only one peak each for ER-positive breast cancer, serous LMP ovarian cancer and invasive ovarian cancer (Table 1). Full results of all these SNP analyses are given in Supplementary Tables 1–3. All associations are consistent with a log-additive model.

Associations with telomere length
SNPs in two distinct regions (hereafter denoted peaks 1 and 2) were strongly associated with telomere length (Fig. 1a, Tables 1 and 2 and Supplementary Fig. 1a). Imputed SNP rs7705526 (peak 2, position 1,285,974, TERT intron 2) had the largest effect, with a change in relative telomere length of 1.026-fold per allele (95% confidence interval (CI) = 1.019–1.033; P = 2.3 × 10⁻¹⁴; conditional P = 2.5 × 10⁻¹¹). We confirmed this finding in an additional 20,512 women and 17,645 men from a third study (CGPS) genotyped for rs7726159 (the best directly genotyped SNP; r² = 0.83 with rs7705526). From a joint analysis of all 53,724 individuals, the change in relative telomere length was 1.020-fold per allele (95% CI = 1.016–1.023; P = 7.5 × 10⁻²⁸). A second, independent association was observed with rs2736108 (peak 1, position 1,297,488, TERT promoter) with a per-allele change in relative telomere length of 1.017-fold (95% CI = 1.010–1.024; P = 5.8 × 10⁻⁷; conditional P = 4.0 × 10⁻⁴) (Fig. 1a, Tables 1 and 2 and Supplementary Fig. 1a). SNPs rs7705526 and rs2736108 were only weakly correlated (r² = 0.04 in Europeans). Weak associations between peak 3 SNPs and telomere length became nonsignificant after adjustment for peak 2 SNP rs7705526 (data not shown).

Associations with breast cancer risk
We identified SNPs associated with breast cancer risk (P < 1 × 10⁻⁴) in three distinct regions in subjects from the BCAC studies and in two regions in CIMBA BRCA1 mutation carriers. No significant (P < 1 × 10⁻⁴) evidence for heterogeneity among odds ratios (ORs) or hazard ratios (HRs) between studies for any of the top SNPs was observed (Supplementary Fig. 2). The strongest association with overall breast cancer risk in BCAC was with peak 1 SNP rs3215401 (Fig. 1b, Tables 1 and 2 and Supplementary Fig. 1b). There was also good evidence for an association with SNPs in peak 2 and weaker evidence that an additional SNP, outside the three main association peaks, was independently associated with breast cancer risk (Table 1 and Supplementary Table 1). The most strongly associated SNPs in BRCA1 mutation carriers were located in introns 2–4 (hereafter denoted peak 3), including rs10069690 (Fig. 1c, Tables 1 and 2 and Supplementary Fig. 2c) and rs2242652 (correlation with rs10069690, r² = 0.70). The latter SNP also showed the strongest association with ER-negative breast cancer in BCAC (Fig. 1d, Tables 1 and 2 and Supplementary Fig. 1d) but showed little evidence of association with ER-positive breast cancer (Table 2). Stepwise regression analysis in CIMBA studies indicated two independent associations with breast cancer risk in BRCA1 mutation carriers (conditional P = 5 × 10⁻⁵ for rs2736108 in peak 1 and P = 4.8 × 10⁻¹³ for rs10069690 in peak 3).
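The pairwise r² values quoted above (for example, 0.04 between rs7705526 and rs2736108) are what justify treating peaks as independent or correlated. As a reminder of what the statistic measures, the sketch below computes r² from phased two-locus haplotype counts using the standard definition r² = D² / (p_A(1 − p_A)p_B(1 − p_B)). It assumes phased haplotypes are available and both loci are polymorphic; the function name and the counts are hypothetical, and this is not the procedure used in the study.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation between two biallelic loci from haplotype counts.

    n_AB, n_Ab, n_aB, n_ab: counts of the four two-locus haplotypes.
    Assumption: haplotypes are phased and both loci are polymorphic.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # linkage disequilibrium coefficient D
    return d * d / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))


# Hypothetical counts: alleles always travel together versus independently.
perfect = ld_r2(50, 0, 0, 50)
independent = ld_r2(25, 25, 25, 25)
```

An r² near 0 means one SNP carries almost no information about the other, so a conditional analysis on one leaves the other's signal largely intact.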
Table 1 Independently associated SNPs for each phenotype

SNP | Chr. 5 position | TERT peak | Source | Effect (95% CI) | Ptrend

Telomere length (RTL), BCAC (SEARCH and CCHS), n = 15,567
rs2736108 | 1,297,488 | 1 | Genotyped | 1.010 (1.004–1.016) | 0.0004
rs7705526 | 1,285,974 | 2 | Imputed | 1.019 (1.014–1.025) | 2.47 × 10⁻¹¹

Overall breast cancer, BCAC, 46,451 cases, 42,599 controls
rs3215401 | 1,296,255 | 1 | Imputed | 0.94 (0.91–0.96) | 9.91 × 10⁻¹⁰
rs7734992 | 1,280,128 | 2 | Imputed | 1.06 (1.04–1.08) | 1.73 × 10⁻⁷
rs56963355 | 1,251,503 | None | Imputed | 0.90 (0.84–0.95) | 1.95 × 10⁻⁵

Risk of breast cancer in BRCA1 mutation carriers, CIMBA, n = 11,705
rs2736108 | 1,297,488 | 1 | Genotyped | 0.92 (0.88–0.96) | 5.12 × 10⁻⁵
rs10069690 | 1,279,790 | 3 | Genotyped | 1.16 (1.11–1.21) | 4.83 × 10⁻¹³

Estrogen receptor-negative breast cancer, BCAC, 7,435 cases, 41,575 controls
rs3215401 | 1,296,255 | 1 | Imputed | 0.91 (0.86–0.95) | 6.15 × 10⁻⁶
rs2242652 | 1,280,028 | 3 | Imputed | 1.15 (1.10–1.20) | 4.29 × 10⁻⁹

Estrogen receptor-positive breast cancer, BCAC, 27,074 cases, 41,749 controls
rs2736107 | 1,297,854 | 1 | Imputed | 0.95 (0.92–0.97) | 3.32 × 10⁻⁵

Serous LMP ovarian cancer, OCAC, 986 cases, 23,491 controls
rs7705526 | 1,285,974 | 2 | Imputed | 1.51 (1.36–1.67) | 1.34 × 10⁻¹⁵

Serous invasive ovarian cancer, OCAC, 8,371 cases, 23,491 controls
rs10069690 | 1,279,790 | 3 | Genotyped | 1.15 (1.11–1.20) | 1.25 × 10⁻¹¹

Independently associated SNPs are shown for each phenotype, including overall breast cancer and ER subgroups in European individuals in BCAC and invasive and LMP subgroups in OCAC following forward conditional stepwise logistic regression analysis, relative change in telomere length in SEARCH and CCHS combined data following forward stepwise linear regression analysis and breast cancer in BRCA1 mutation carriers in CIMBA following forward stepwise Cox regression. These analyses were performed on all SNPs with MAF > 0.02 and association P < 1 × 10⁻⁴ in the single-SNP analyses.

A very similar pattern was observed for ER-negative breast cancer in BCAC (conditional P = 6 × 10⁻⁶ for rs3215401 in peak 1 and P = 4.3 × 10⁻⁹ for rs2242652 in peak 3; Table 1). The most strongly associated SNP with ER-positive breast cancer was rs2736107 in peak 1 (Fig. 1e, Tables 1 and 2 and Supplementary Fig. 3e). Weak associations between the key SNPs and risk for BRCA2 mutation carriers were also observed, but the sample size was too small to draw definitive conclusions (data not shown).

Associations with ovarian cancer risk
The strongest association observed for risk of LMP ovarian cancer was with peak 2 SNP rs7705526, and this was the only SNP retained in the stepwise regression analysis (Fig. 1f, Tables 1 and 2 and Supplementary Fig. 1f). The strongest observed association for serous invasive ovarian cancer was with peak 3 SNP rs10069690 (Fig. 1g, Tables 1 and 2 and Supplementary Fig. 1g). No other independent association was observed for serous invasive ovarian cancer (Table 1). We also analyzed SNP associations with endometrioid, mucinous, clear-cell invasive and mucinous LMP ovarian cancers but found no associations at P < 1 × 10⁻⁴ (Supplementary Table 4). We attempted analysis of invasive serous ovarian cancer stratified by grade, but, again, statistical power was low (Supplementary Fig. 3).

Three main peaks of association within the TERT locus
The above results indicate that the majority of observed associations with all seven tested phenotypes fall into association peaks 1–3. Correlated SNPs in the TERT promoter (peak 1) were associated with telomere length, ER-positive breast cancer, ER-negative breast cancer and breast cancer in BRCA1 mutation carriers. SNPs in peak 2, spanning TERT introns 2–4, were independently associated with telomere length, overall breast cancer risk and serous LMP ovarian cancer. SNPs in peak 3, also spanning TERT introns 2–4, showed strong associations with ER-negative breast cancer, breast cancer risk for BRCA1 mutation carriers and serous invasive ovarian cancer but not with telomere length (Tables 1 and 2). Although peaks 2 and 3 overlap physically, they define distinct sets of SNPs that are only partially correlated (for example, correlation between rs10069690 and rs7705526 was weak, r² = 0.33; Fig. 2). Some SNP-phenotype associations in peak 2 were clearly weaker than those in peak 3 (for example, with ER-negative breast cancer) and became nonsignificant after adjustment for SNP rs2242652 in peak 3. Conversely, the associations with telomere length and serous LMP ovarian cancer were stronger for SNPs in peak 2, indicating that the associations in peaks 2 and 3 are not being driven by the same causal variants. The strongest candidates for causation within each peak were identified by computing likelihood ratios; the SNPs listed in Tables 1 and 2 are those that cannot be excluded from being causal candidates at a likelihood ratio of >1:100 relative to the top hit in the peak. The statistical power to exclude SNPs differed between phenotypes: in peak 1, all but seven SNPs could be excluded from being causal for relative telomere length, breast cancer risk in BRCA1 mutation carriers and ER-negative breast cancer risk, but an additional SNP could be excluded for
Overall breast cancer risk OR (95% CI) Ptrend 7.95 1010 0.90 (0.860.95) 0.89 (0.850.94)b 0.88 (0.840.93) 0.88 (0.840.92) 0.88 (0.840.93) 0.88 (0.840.92)b 0.95 0.0012 (0.930.98) 0.96 0.00034 (0.930.98) 0.95 (0.920.97) 4.58 105 0.88 (0.840.92) 0.88 (0.840.93) 2.41 108 2.60 109 1.19 109 1.37 108 0.96 0.00051 (0.930.98) 1.02 (0.921.13) 1.01 (0.911.11) 1.01 (0.911.11) 1.00 (0.911.11) 0.96 0.00083 (0.930.98) 1.02 (0.921.13) 0.89 (0.840.94) 0.90 (0.850.94) 0.90 (0.850.94) 0.89 (0.850.94) 4.22 105 4.26 105 3.05 105 1.90 105 1.04 105 1.21 108 0.762 0.769 0.908 0.893 0.935 1.05 105 0.89 (0.840.93) 0.96 0.00051 (0.930.98) 1.00 (0.911.10) 0.996 1.01 108 0.00013 0.88 (0.830.93) 0.909 6.73 109 1.20 108 1.91 108 4.68 1010 4.01 109 6.65 1010 1.41 108 0.95 3.32 105 1.01 (0.920.97)b (0.911.11) HR (95% CI) Ptrend OR (95% CI) Ptrend OR (95% CI) Ptrend OR (95% CI) Ptrend OR (95% CI) Risk of breast cancer in BRCA1 mutation carriers ER-negative breast cancer risk ER-positive breast cancer risk Serous LMP ovarian cancer risk Serous invasive epithelial ovarian cancer risk Ptrend Ptrend 8.33 107 0.93 (0.910.96) 0.94 (0.920.95) 0.94 (0.920.96) 0.94 (0.920.96) 0.94 (0.910.96)b 0.94 (0.920.96) 0.93 (0.910.96) 5.81 107 7.13 106 7.13 106 1.63 106 2.95 106 1.35 106 0.98 0.301 (0.941.02) 0.98 0.238 (0.941.02) 0.98 0.324 (0.941.02) 0.98 0.351 (0.941.02) 0.98 0.188 (0.941.01) 0.97 0.204 (0.941.01) 0.97 (0.931.01) 0.152 2.32 1014 1.04 (1.021.06) 1.04 (1.021.06) 1.04 (1.021.06) 1.05 (1.031.07) 1.05 (1.031.07) 1.05 (1.031.07)b 1.10 (1.051.15) 2.06 106 0.00017 1.62 105 1.10 (1.041.15) 0.00038 1.29 105 1.09 (1.041.15) 0.00047 1.09 (1.051.13) 1.10 (1.061.14) 1.10 (1.061.14) 1.95 105 1.10 (1.051.16) 1.08 (1.041.12) 7.47 105 4.75 105 1.08 (1.031.14) 0.002 1.08 (1.041.12) 9.68 105 9.29 105 2.23 105 5.02 106 0.00011 1.04 (0.991.10) 0.120 1.06 (1.021.10) 0.0035 1.93 1013 1.10 1010 4.52 1013 1.62 1011 1.04 0.0049 (1.011.06) 1.03 0.014 (1.011.06) 1.03 0.0083 (1.011.06) 1.03 0.0090 (1.011.06) 1.03 0.021 
(1.001.05) 1.57 106 1.03 0.0053 (1.011.06) 1.51 (1.361.67)b 1.49 (1.351.64) 1.42 (1.291.56) 1.44 (1.311.59) 1.44 (1.301.59) 1.45 (1.311.59) 1.34 1015 2.18 1015 7.24 1013 4.50 1014 9.83 1013 1.11 6.38 107 (1.061.15) 1.13 5.38 109 (1.081.17) 1.12 2.08 109 (1.081.17) 1.12 4.59 109 (1.081.17) 1.12 9.46 109 (1.081.17) 5.25 1014 1.12 2.75 109 (1.081.16) 2.23 109 0.0010 0.016 0.013 1.06 (1.041.08) 2.43 108 1.06 (1.031.08) 6.82 106 1.07 (1.041.09) 1.02 107 1.19 (1.121.25) 1.25 (1.181.32) 1.23 (1.161.29)b 5.61 109 6.89 1014 1.60 1014 1.17 (1.121.21) 1.18 (1.141.23)b 1.16 (1.121.20) 2.34 1011 1.23 1012 1.68 1012 1.04 0.014 (1.011.06) 1.02 0.131 (0.991.05) 1.03 0.011 (1.011.06) 1.45 (1.301.61) 1.40 (1.251.56) 1.33 (1.201.47) 1.96 1011 4.45 109 2.49 108 1.15 3.04 109 (1.101.20) 1.17 4.85 1011 (1.121.22) 1.15 1.25 1011 (1.111.20)b

Table 2 Association between TERT SNPs and the seven studied phenotypes

Telomere length change

SNP            Chr. 5 position   Major/minor allele   MAF    RTL (95% CI)

Peak 1: promoter
rs2736107      1,297,854         C/T                  0.28   1.015 (1.009–1.022)
rs2736108^a    1,297,488         C/T                  0.29   1.017 (1.010–1.024)^b
rs72525896     1,297,081         CCA/C                0.27   1.016 (1.009–1.023)
5-1297077      1,297,077         ACC/A                0.27   1.016 (1.009–1.023)
rs3215401      1,296,255         A/AG                 0.30   1.016 (1.009–1.023)
rs2853669      1,295,349         A/G                  0.30   1.016 (1.009–1.024)
rs2736098      1,294,086         C/T                  0.27   1.017 (1.010–1.024)

Peak 2: introns 2–4
rs7705526      1,285,974         C/A                  0.33   1.026 (1.019–1.033)^b
rs4449583      1,284,135         C/T                  0.34   1.025 (1.018–1.031)
rs7725218^a    1,282,414         G/A                  0.35   1.021 (1.015–1.028)
rs7726159^a    1,282,319         C/A                  0.34   1.024 (1.017–1.031)
5-1280940      1,280,940         GAGCCCACC/G          0.37   1.023 (1.016–1.029)
rs7734992      1,280,128         T/C                  0.43   1.019 (1.013–1.025)

Peak 3: introns 2–4
rs72709458     1,283,755         C/T                  0.22   1.012 (1.005–1.020)
rs2242652      1,280,028         G/A                  0.20   1.009 (1.002–1.017)
rs10069690^a   1,279,790         C/T                  0.26   1.009 (1.002–1.016)


Single-SNP estimates for the most significant SNPs per peak per phenotype in the TERT fine-mapping region of chromosome 5, from position 1,227,014 to 1,361,964 (Genome Build 37). The major and minor alleles are those of the genotyping assay and are not necessarily on the coding strand. Change in telomere length per minor allele is given as fold change (RTL) relative to the estimated telomere length for the common homozygote for each SNP, with 95% CI (Supplementary Note). All breast and ovarian cancer risk results are given as ORs with 95% CI and per-allele Ptrend, and the risk results for BRCA1 mutation carriers are given as HRs with 95% CI. All SNPs that cannot be excluded as >100-fold worse than the top hit in the block, using a likelihood ratio test, for any phenotype are listed. Independent peaks 1–3 were localized using forward conditional analyses (Table 1). The most significant hit in each region for each phenotype is shown in bold for each peak.

^a Genotyped SNPs. All three peaks contain at least one directly genotyped variant. ^b Results for the most significant SNPs in each block in the forward conditional analyses (described further in Table 1).
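The likelihood-ratio rule in the Table 2 legend (list every SNP that cannot be excluded as more than 100-fold worse than the top hit) can be sketched as follows. This is a minimal illustration, assuming each SNP has a log-likelihood from a single-SNP model for one phenotype; the function name `credible_snps` and all numeric values are hypothetical and are not taken from the study.

```python
import math

def credible_snps(log_likelihoods, max_ratio=100.0):
    """Return SNPs whose likelihood is within `max_ratio`-fold of the best SNP.

    log_likelihoods: dict mapping SNP name -> log-likelihood of its single-SNP model.
    """
    best = max(log_likelihoods.values())
    threshold = math.log(max_ratio)  # a 100-fold ratio on the log scale
    return sorted(
        snp for snp, ll in log_likelihoods.items()
        if best - ll <= threshold
    )

# Hypothetical log-likelihoods, for illustration only (not values from the study).
example = {
    "snp_top": -1000.0,
    "snp_close": -1002.5,  # about 12-fold worse than the top hit: retained
    "snp_far": -1006.0,    # about 400-fold worse than the top hit: excluded
}
print(credible_snps(example))
```

Working on the log scale avoids underflow when likelihoods are very small; the exact model used to obtain the likelihoods is described in the paper's Methods, not here.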

Figure 2 Associated signals within the TERT gene. Peak regions are labeled: peaks 2 and 3 overlap around introns 2–4, and peak 1 encompasses the promoter. The positions of associated SNPs are shown as black and red lines, representing genotyped and imputed SNPs, respectively. The TERT gene structure is depicted with exons (boxes) joined by introns (lines). The positions of all analyzed iCOGS SNPs are marked. Data from the UCSC Genome Browser are shown, including epigenetic marks for monomethylation of histone H3 at lysine 4 (H3K4me1) and trimethylation of histone H3 at lysine 4 (H3K4me3), evolutionary and sequence pattern extraction through reduced representations (ESPERR) regulatory potential and vertebrate conservation tracks. Regions cloned into reporter constructs are depicted as a green rectangle (TERT promoter) or as blue rectangles (PRE-A and PRE-B). The pattern of linkage disequilibrium based on the BCAC population is shown, with white representing r² = 0 and black representing r² = 1.
[Figure 2 image. Horizontal axis: Chr. 5 position (hg19), 1,255,000–1,300,000. Tracks: associated signals, layered H3K4me1, layered H3K4me3, ESPERR regulatory potential, vertebrate conservation, iCOGS SNPs and BCAC linkage disequilibrium. Labeled features: peak 3 (PRE-B; rs10069690, rs2242652), peak 2 (PRE-A; rs7705526, rs7734992), peak 1 (promoter; rs3215401, rs2736108, rs2736107) and rs56963355.]
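The linkage disequilibrium scale in Figure 2 runs from r² = 0 to r² = 1. As a reminder of what that quantity measures, the sketch below computes r² for two biallelic loci from allele and haplotype frequencies using the standard definition. The function name and the frequencies are hypothetical; the study's own r² values were estimated from BCAC genotype data.

```python
def ld_r_squared(p_a, p_b, p_ab):
    """Squared correlation r^2 between two biallelic loci.

    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2.
    p_ab:     frequency of the haplotype carrying both A and B.
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))

# Hypothetical haplotype frequencies, chosen only to show the two extremes.
print(ld_r_squared(0.3, 0.3, 0.3))   # the two alleles always occur together
print(ld_r_squared(0.3, 0.4, 0.12))  # the two alleles are independent
```

Values near 1 mean one SNP is an almost perfect proxy for the other; values near 0 mean the two SNPs carry largely independent information.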


ER-positive breast cancer risk (Table 2). In peak 2, the greatest power was for the telomere length phenotype, for which all but three SNPs could be excluded, whereas five or six remained for cancer risk. For peak 3, three putative causal SNPs remained for ER-negative breast cancer risk, two for serous invasive ovarian cancer risk and just one for breast cancer risk in BRCA1 mutation carriers. Results in each peak are compatible with a single causative variant being responsible for the multiple phenotype associations (notably, in peak 3, SNPs rs2242652 and rs10069690 were equally compatible with being the single causal variant). However, the possibilities of different causal variants being responsible for different phenotypes or of the associations being due to haplotype effects cannot be ruled out.

Asian and African-American studies
We tested all SNPs (n = 341) with MAF > 0.02 and imputation r² > 0.3 for association with breast cancer in the 9 BCAC Asian studies (comprising 6,269 cases and 6,624 controls), but none reached formal levels of significance. Furthermore, none of the top SNPs in individuals of European ancestry showed more than borderline levels of significance in Asians (Supplementary Table 5). Peak 3 SNP rs10069690 was directly genotyped in 2 BCAC African-American studies (1,116 cases and 932 controls), as well as in the abovementioned Asian studies, and had estimated effects on ER-negative breast cancer similar to those in European populations (per-allele OR = 1.19, 95% CI = 1.06–1.31, P = 0.009 in African-Americans and OR = 1.09, 95% CI = 1.00–1.19, P = 0.07 in Asian women). Within OCAC, there were too few women of Asian and African ancestry to draw meaningful conclusions (Supplementary Table 6).
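The SNP selection thresholds used for the Asian studies (MAF > 0.02 and imputation r² > 0.3) amount to a simple filter. The sketch below shows one way to express it; the `Snp` record, its field names and the example values are hypothetical, and the text does not say how directly genotyped SNPs were scored for imputation quality.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    name: str
    maf: float            # minor allele frequency
    imputation_r2: float  # imputation quality score

def passes_filter(snp, min_maf=0.02, min_r2=0.3):
    """Apply the strict thresholds quoted in the text (both must be exceeded)."""
    return snp.maf > min_maf and snp.imputation_r2 > min_r2

# Hypothetical SNP records, for illustration only.
snps = [
    Snp("snp_common_good", 0.26, 0.95),
    Snp("snp_rare", 0.01, 0.90),
    Snp("snp_poorly_imputed", 0.15, 0.25),
]
kept = [s.name for s in snps if passes_filter(s)]
print(kept)
```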
Chromatin analysis
Analysis of Encyclopedia of DNA Elements (ENCODE) data39 showed no evidence of regulatory elements or open chromatin coinciding with any risk-associated SNPs in normal breast epithelial cells or the other represented tissues (Supplementary Fig. 4). Data for ovarian tissues are not included in ENCODE. We therefore performed site-specific formaldehyde-assisted isolation of regulatory elements (FAIRE)40 in ovarian cancer precursor tissues to identify regulatory elements in a 1-Mb region centered on peak 3. In fallopian tube secretory and ovarian surface epithelial cells, we detected FAIRE peaks coinciding with the CLPTM1L promoter but not the TERT promoter (Supplementary Fig. 4). In silico analyses additionally indicated that TERT introns 4 and 5 (within and beyond peak 3) contained regions showing regulatory potential and vertebrate sequence conservation41. We performed site-specific FAIRE analyses on a ~1-kb region centered on the peak 3 SNP rs10069690 in normal tissue samples from breast reduction mammoplasty (n = 4), ovarian cancer precursor tissues (n = 4) and ovarian cancer cell lines (n = 4). Breast cells from each woman were sorted into four enriched fractions on the basis of differential expression of cell surface markers42 (myoepithelial/stem, luminal progenitor, mature luminal and stromal cells), and assays were performed on each fraction (Fig. 3). Chromatin was in a closed configuration in all ovarian, breast luminal progenitor and mature luminal cell fractions. However, in two of four stromal cell fractions, we detected ~600 bp of open chromatin of varying amplitude, covering the position of SNP rs10069690 but not that of rs2242652, and, in three of four myoepithelial/stem cell fractions, we detected ~800 bp of open chromatin, covering the positions of both rs10069690 and rs2242652.

Luciferase reporter assays
The regulatory capabilities of the DNA in each of the three peaks and the effects of most of the strongest candidate causative variants in each peak were examined in luciferase reporter assays, using a construct containing 3,915 bp of the TERT promoter sequence43. The effects of peak 1 TERT promoter variants were examined via five haplotype constructs differing at rs2736107, rs2736108 and rs2736109 (ref. 25) (Fig. 4a): one with all three major alleles (wild-type TERT), another with all three minor alleles and three others, each with the minor allele at a single SNP. Relative promoter activity was determined in ER-positive (MCF7) and ER-negative (MDA-MB-468) breast cancer cell lines and in an ovarian cancer cell line (A2780). The construct containing all three minor alleles consistently generated the lowest luciferase signals, close to baseline. To determine whether the risk-associated variants in peaks 2 and 3 fell within putative cis-acting regulatory elements (PREs), we cloned ~3 kb of sequence surrounding each SNP. Constructs of PRE-A (peak 2) had no significant effect on the activity of either the wild-type (TERTwt) promoter or the promoter with three minor alleles (TERTh) (Fig. 4a). However, inclusion of the minor
Figure 3 Open chromatin signatures around rs10069690. (a) Map of the PCR amplicon sites A–D used to annotate a 1-kb region surrounding rs10069690 and rs2242652. Primer sequences are listed in Supplementary Table 11. (b) PCR analysis of FAIRE-processed chromatin from FACS fractions enriched for myoepithelial/stem, luminal progenitor, mature luminal and stromal cells derived from the breast tissues of four subjects, Q695, Q723, Q706 and Q674. Error bars represent the standard errors from triplicate PCR runs.

[Figure 3 image. (a) Map of the TERT region showing rs10069690, rs2242652 and PCR amplicon sites A–D (800 bp indicated). (b) Relative chromatin recovery (percent input) at sites A–D for myoepithelial/stem, luminal progenitor, mature luminal and stromal cell fractions from subjects Q695, Q723, Q706 and Q674.]

allele of rs7705526 resulted in ~30% higher TERT promoter activity in all three cell lines, suggesting that it can act as a transcriptional enhancer. Higher promoter activity was also observed with this construct in A2780 ovarian cancer cells but not in the two breast cancer cell lines. Constructs of PRE-B (peak 3) consistently acted as strong transcriptional silencers, leading to 40–50% lower activity, specifically in constructs containing the wild-type TERT promoter. Notably, inclusion of the minor allele of rs2242652 in PRE-B constructs decreased relative wild-type TERT promoter activity by a further ~20% compared to the silencer containing the major allele, but the minor allele of the highly correlated SNP rs10069690 did not generate this effect (Fig. 4a).

Alternative splicing of TERT
Several alternatively spliced variants of TERT have been found to affect telomerase activity44,45. To determine the role of PRE-B (peak 3) SNPs in TERT alternative splicing, we inserted intron 4 sequence into a full-length TERT cDNA mini-gene construct and confirmed accurate splicing. Cancer risk–associated alleles for rs10069690 and rs2242652 were generated individually and in combination within the mini-gene. RT-PCR, using primers spanning intron 4, showed that all SNP permutations in all cell lines produced comparable levels of both wild-type transcript and an INS1 alternatively spliced variant, which includes the first 38 bp of TERT intron 4 (refs. 46,47) (Supplementary Fig. 5a). We also identified a new splice variant of TERT, specifically associated with the minor allele of rs10069690 (termed INS1b; Supplementary Fig. 5a). Sequence analysis confirmed that INS1b includes the first 480 bp of intron 4 and results from the use of an alternative splice donor created by the minor allele of rs10069690 (ref. 48). INS1b has a premature stop codon 16 amino acids into intron 4 and is predicted to generate a severely truncated protein product, which is likely to affect telomerase activity (Supplementary Fig. 5b).

Gene expression and methylation analyses in ovarian tissue
We used The Cancer Genome Atlas (TCGA)49 data to examine gene expression of the 11 protein-coding genes and 1 microRNA (MIR4457) located within 1 Mb of peak 3 SNP rs10069690. Most genes showed higher expression in ovarian tumors compared with normal tissues (Supplementary Fig. 4 and Supplementary Table 7). We observed no association between rs10069690 and the expression levels of any of the genes in any of the cells tested (Supplementary Fig. 5 and Supplementary Tables 7 and 8). There was some evidence of association between rs10069690 and tumor methylation with probes cg23827991 (TERT CpG island, P = 1.3 × 10⁻⁶) and cg06550200 (CLPTM1L, P = 6.9 × 10⁻⁴) out of the 935 probes tested. Both regions showed lower methylation with the minor, cancer risk–associated allele (Supplementary Table 9), but this did not correlate with changes in expression.

DISCUSSION
Our comprehensive examination of the TERT locus has answered some long-standing questions and raised several new ones. We have identified two independent regions associated with telomere length in leukocyte DNA; these provide definitive evidence for genetic control of telomere length by common TERT variants. For rs2736108, the most significant SNP in promoter peak 1, the minor allele is associated with a 1.7% increase in telomere length. This is equal to a telomere length change of ~60 bp, which, because telomere length decreases by approximately 19 bp per year50, is equivalent in magnitude to an age difference of 3.1 years. We estimate that rs2736108 explains 0.08% of the variance in telomere length in men and 0.06% of the variation in women. SNPs in peak 2 have a stronger effect on telomere length, with each additional A (minor) allele of rs7705526 associated with a 2.6% increase. This is equal to a ~90 bp change in telomere length and, correspondingly, to 4.7 years of age. We estimate that rs7705526 explains 0.31% of the variance in telomere length in men and 0.16% of the variance in women. The only other reported associations with telomere length reaching genome-wide significance involve TERC-locus SNP rs1269304 (ref. 51) and OBFC1-locus SNP rs4387287 (ref. 52), which have similar effects on telomere length (75 bp and 115 bp per allele, respectively).

Our only findings consistent with the hypothesis that shorter telomeres predispose to increased cancer risk53 (equivalent to longer telomeres being protective) are those from the peak 1 SNPs. However, a regulatory element construct containing the longer telomere–associated alleles of three highly correlated SNPs, rs2736108, rs2736107 and rs2736109 (reconstructing a haplotype with 25% frequency in Europeans35), virtually abolished promoter activity in a reporter assay. This finding leaves a seemingly paradoxical association

[Figure 4 image. (a) Relative luciferase activity in MDA-MB-468, MCF7 and A2780 cells for pGL2-Control (SV40), pGL2-Basic, TERTwt, TERTh and the single-SNP promoter constructs rs2736107, rs2736108 and rs2736109. (b) Relative luciferase activity in MDA-MB-468, MCF7 and A2780 cells for pGL3-Basic and for TERTwt and TERTh reporters alone or with PRE-A (WT or rs7705526) or PRE-B (WT, rs10069690 or rs2242652). Asterisks mark significant differences.]

Figure 4 TERT promoter and PRE activity. Luciferase reporter assays following transient transfection of ER-negative breast cancer (MDA-MB-468), ER-positive breast cancer (MCF7) and ovarian cancer (A2780) cell lines. Error bars represent standard error from at least three independent experiments. (a) Luciferase reporter assays after transient transfection of cells with pGL2-Control (SV40 promoter and enhancer), pGL2-Basic (no promoter or enhancer) or the TERT reporter vectors with TERTwt (3.9 kb of TERT promoter), TERT promoter with the minor (T) allele of rs2736107, rs2736108 or rs2736109 or TERTh (TERT promoter with the T allele at all sites). The results of comparisons with wild-type TERT performed using one-way ANOVA with post-hoc Dunnett's tests are shown (**P < 0.001, *P < 0.005). (b) PRE-A (with or without the minor allele of rs7705526) or PRE-B (with or without the minor allele of rs10069690 or rs2242652) was cloned downstream of either the TERTwt or TERTh promoter-driven reporter. The results of comparisons with wild-type TERTwt or TERTh performed using one-way ANOVA with post-hoc Dunnett's tests are shown (***P < 0.0001, **P < 0.001). WT, wild type.
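The summary statistics behind Figure 4 (a mean with standard error across at least three independent experiments, and activity expressed relative to a reference construct) can be illustrated as follows. This is a sketch under stated assumptions: the readings are hypothetical, the helper names are invented for illustration, and the one-way ANOVA with Dunnett's post-hoc test used in the paper is not implemented here.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(values):
    """Mean and standard error of the mean for two or more replicate values."""
    return mean(values), stdev(values) / sqrt(len(values))

def percent_change(test_mean, reference_mean):
    """Signed percent difference of a construct relative to the reference."""
    return 100.0 * (test_mean - reference_mean) / reference_mean

# Hypothetical relative luciferase readings from three independent experiments.
readings = {
    "TERTwt": [100.0, 96.0, 104.0],
    "TERTwt + PRE-B WT": [55.0, 52.0, 58.0],
}
ref_mean, _ = mean_and_se(readings["TERTwt"])
for construct, values in readings.items():
    m, se = mean_and_se(values)
    change = percent_change(m, ref_mean)
    print(f"{construct}: {m:.1f} +/- {se:.1f} ({change:+.0f}% vs TERTwt)")
```

With these made-up numbers the silencer construct sits 45% below the reference, inside the 40–50% range the text reports for PRE-B.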


between lower enhancer activity and greater telomere length (Fig. 4). Control of telomerase activity is currently poorly understood, and this finding clearly merits further investigation.

SNPs within peak 3 (TERT introns 2–4) show strong associations with hormone-related cancers: peak 3 SNP rs10069690 is associated with risk of ER-negative breast cancer34 and breast cancer in BRCA1 mutation carriers, consistent with the observation that the majority of breast cancers arising in BRCA1 mutation carriers are ER negative. This variant has been reported to be associated with prostate cancer26,54, and we find it associated with serous invasive ovarian cancer. Although SNPs in peaks 2 and 3 overlap on a physical map, the SNPs most strongly associated with cancer risk or telomere length were not highly correlated with each other (r² between rs10069690 and rs7705526 = 0.33; Fig. 2b). This observation suggests either that the associations observed with multiple cancers and SNPs in peak 3 are mediated via a mechanism distinct from control of telomere length or that telomere length in breast, prostate and ovarian cells is under the control of a different set of SNPs from those controlling telomere length in leukocytes. Luciferase reporter assays show that peak 3 contains a silencer of the TERT promoter and that the minor allele of peak 3 SNP rs2242652 further reduces expression. Consistent with this finding, Kote-Jarai et al.54 report that the minor, risk allele of this SNP is associated with reduced TERT expression in benign prostate tissue. However, we were unable to identify comparable associations in ovarian or breast tumor tissue, possibly because TERT expression is severely dysregulated in most tumors.

Taken together, our luciferase assays indicate that either reduced signal from regulatory elements in peaks 1 and 3 or increased signal from peak 2 increases risk of specific cancer types. It should be noted that the minor allele of rs2242652 is associated with significantly lower risk of prostate cancer54 (OR = 0.84, 95% CI = 0.81–0.87; Ptrend = 1 × 10⁻²³) but with significantly higher risks of breast and ovarian cancers (Tables 1 and 2). Similarly, a nearby SNP, rs401681, is associated with higher risks of cancers of the lung, bladder, testes and cervix and basal cell carcinoma but with lower risk of melanoma28,30,31. Such inverted associations might be due to tissue-specific interactions that need further examination. We have additionally shown that the minor allele of rs10069690 affects splicing and is associated with transcription of a novel truncated isoform resulting from the introduction of a premature stop
codon (Supplementary Fig. 6). We do not yet know whether this isoform affects canonical telomerase activity or how it changes activity. We further identified new open chromatin signatures overlapping rs10069690 in breast stromal and myoepithelial/stem cell fractions but not in progenitor or differentiated luminal epithelial cell fractions. Senescent stromal cells can stimulate malignant transformation of epithelial cells in in vitro and in vivo models55,56, and the mechanisms mediated by these SNPs merit investigation in future studies.

The SNPs originally reported to be associated with risk of lung (rs402710)57 and breast (rs3816659)58 cancers (Supplementary Table 10) were not associated with any cancer in this study. Moreover, SNP rs2736100 in peak 2 has been reported to be associated with glioma and lung and testicular cancers27,28,31,57,59–62, whereas nearby SNP rs2853677 was reported to be associated with glioma in the Han Chinese population63. Despite their physical proximity, these SNPs are not highly correlated with rs7705526 (r² = 0.52 and 0.18, respectively), nor do they show independent associations with telomere length after adjustment for rs7705526. Thus, variants underlying susceptibility to different cancer types are different from the set of variants in the TERT region mediating changes in telomere length.

One limitation of this study is the incomplete representation of all SNPs at 5p15.33 on the iCOGS chip, which was designed in March 2010 using SNPs catalogued in HapMap 3 together with those from the pilot study of the 1000 Genomes Project35. To help fill known gaps on the iCOGS chip, additional SNPs were genotyped from the October 2010 1000 Genomes Project data release, and imputation was based on the most recent January 2012 release. However, several gaps remain across the TERT locus, and the existence of these gaps, coupled with the low linkage disequilibrium across the region (Fig. 2), raises the possibility that there could be more independent associations that we have not yet detected. Furthermore, the incomplete SNP catalog at the time of study design means that we cannot assume with certainty that the true causal variants, directly responsible for the observed association peaks, were captured in our analysis. It is also possible that additional rare variants not specifically investigated in this study could have functional effects within this locus. Further resequencing of this region is needed to uncover the full spectrum of variation and phenotype associations.

Another limitation is that telomere length was measured in DNA from leukocytes rather than from breast or ovarian tissue. Whereas we obtained suitable blood DNA for measurements in >53,000 subjects (a necessary sample size for adequate statistical power), obtaining comparable qualities and quantities of DNA from normal breast or ovary cells would be almost impossible. Telomere lengths measured in different tissues within one individual have been shown to be highly correlated64–66, meaning that leukocyte telomere lengths are likely to be good surrogates for the corresponding lengths in other tissues. Furthermore, one of our aims was to investigate whether the previously reported associations between mean telomere length and cancer risk might be mediated by TERT variants, and such studies have been based on telomere length measured in blood cell DNA. Another limitation was that we were unable to stratify OCAC ovarian cancer cases by BRCA1 and BRCA2 mutation status because this information was not available; nor was there sufficient power to evaluate ovarian cancer risk in mutation carriers in CIMBA.
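The age-equivalent figures quoted earlier in the Discussion follow from simple arithmetic: a per-allele change in base pairs divided by the approximate yearly attrition of 19 bp. The sketch below reproduces that calculation from the rounded values given in the text and Table 2; the function and constant names are illustrative, not the authors'.

```python
ATTRITION_BP_PER_YEAR = 19.0  # approximate yearly telomere shortening quoted in the text

def percent_increase(rtl_fold_change):
    """Convert a relative telomere length (RTL) fold change into a percent increase."""
    return (rtl_fold_change - 1.0) * 100.0

def age_equivalent_years(bp_change, bp_per_year=ATTRITION_BP_PER_YEAR):
    """Express a telomere length difference as years of age-related attrition."""
    return bp_change / bp_per_year

# Per-allele values quoted in the text and Table 2 (bp values are rounded, "~60" and "~90").
effects = {
    "rs2736108": {"rtl": 1.017, "bp": 60.0},
    "rs7705526": {"rtl": 1.026, "bp": 90.0},
}
for snp, e in effects.items():
    pct = percent_increase(e["rtl"])
    years = age_equivalent_years(e["bp"])
    print(f"{snp}: {pct:.1f}% longer per minor allele, about {years:.1f} years")
```

Because the inputs are rounded, the results (roughly 3.2 and 4.7 years) match the quoted 3.1 and 4.7 years only approximately; the authors presumably worked from unrounded estimates.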
Our findings provide evidence relevant to the hypothesis that shorter telomeres increase cancer risks: associations in the TERT promoter (peak 1) fit this hypothesis best, whereas those in peaks 2 and 3 (TERT introns 2–4) and other reported 5p15.33 SNP cancer associations (Supplementary Table 10) do not. Thus, it would seem that the majority of cancer associations within the TERT locus are mediated via alternative mechanisms involving the TERT gene. The protein product of TERT has functions beyond the telomerase-mediated extension of telomeres67. These non-canonical functions of TERT strongly resemble those mediated by MYC and WNT68, which are upstream regulators of proliferation, differentiation and migration. TERT also modulates WNT/β-catenin signaling69, and ectopic TERT expression induces increased cell division and decreased apoptosis in primary mammary cells, independent of telomere elongation70.

In conclusion, this study provides definitive evidence for genetic control of telomere length by common genetic variants in the TERT locus. Additionally, we report multiple, independent TERT SNP associations with breast cancer risk, confirming previously reported associations and identifying new associations in both the general population and in BRCA1 mutation carriers. We also provide, for the first time to our knowledge, highly significant evidence for the association of distinct TERT SNPs with serous LMP and invasive ovarian cancer risks. Our results show that the relationships between TERT genotype, telomere length and cancer risk are complex and that the TERT locus may influence cancer risk through multiple mechanisms.

URLs. HapMap 3 catalog, http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html; Wellcome Trust Case Control Consortium investigators, http://www.wtccc.org.uk/; investigators and institutions constituting the TCGA research network, http://cancergenome.nih.gov/.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank all the individuals who took part in these studies and all the researchers, clinicians, technicians and administrative staff who have enabled this work to be carried out. COGS is funded through a grant from the European Commission's Seventh Framework Programme (agreement 223175–HEALTH-F2-2009-223175). BCAC is funded by Cancer Research UK (C1287/A10118 and C1287/A12014). BCAC meetings have been funded by the European Union Cooperation in Science and Technology (COST) programme (BM0606). Telomere length measurement and analysis were funded by Cancer Research UK project grant C1287/A9540 and Chief Physician Johan Boserup and Lise Boserup's Fund. CIMBA data management and analysis were supported by Cancer Research UK grants C12292/A11174 and C1287/A10118. OCAC is supported by a grant from the Ovarian Cancer Research Fund thanks to the family and friends of Kathryn Sladek Smith (PPD/RPCI.07). Genotyping of the iCOGS array was funded by the European Union (HEALTH-F2-2009-223175), Cancer Research UK (C1287/A10710), the Canadian Institutes of Health Research (CIHR) for the CIHR Team in Familial Risks of Breast Cancer program (J.S. and D.E.) and the Ministry of Economic Development, Innovation and Export Trade of Quebec (grant PSR-SIIRI-701; J.S., D.E. and P. Hall). Scientific development and funding of the OCAC portion of this project were supported by Genetic Associations and Mechanisms in Oncology (GAME-ON; U19-CA148112). CIMBA genotyping was supported by US National Institutes of Health (NIH) grant CA128978, a National Cancer Institute (NCI) Specialized Program of Research Excellence (SPORE) in Breast Cancer (CA116201), a US Department of Defense Ovarian Cancer Idea award (W81XWH-10-1-0341) and grants from the Breast Cancer Research Foundation and the Komen Foundation for the Cure. This study made use of data generated by The Wellcome Trust Case Control Consortium (funding was provided by Wellcome Trust award 076113) and the TCGA Pilot Project established by NCI and the National Human Genome Research Institute.

AUTHOR CONTRIBUTIONS
Manuscript writing group: S.E.B., K.A.P., S.E.J., J. Beesley, K. Michailidou, S.L.E., H.A.P., P.L.M., M.H.G., H.C.S., K.L., A.N.A.M., S.A.G., A. Berchuck, P.D.P.P., E.L.G., R.R.R., D.F.E., A.C.A., G.C.-T. and A.M.D. Locus SNP selection: S.E.B., K.A.P., M.H.G., E. Dicks and A.M.D. Preparation of OCC samples for genotyping: S.J.R. and C.L.P. iCOGS genotyping, calling and quality control: S.E.B., S.F.N., A.G.-N., M.A.

Rossing, J. Beesley, D.C.T., D.V., F.B., A. Swerdlow, M.J.M., C.L., C. Baynes, J.M. Cunningham, J.A.D., A.M.D., G.C.-T., K.A.P., D.F.E., P. Soucy and J.S. Imputation: K. Michailidou, K.B.K., J.P.T., A.C.A. and D.F.E. Telomere length determination and analysis: S.E.B., K.A.P., M.W., A.M.D. and D.F.E. Statistical analyses and programming: K. Michailidou, K.B.K., S.E.B., K.A.P., A.C.A., D.F.E., S.E.J. and Y. Lu. Functional analysis and bioinformatics: S.L.E., J.D.F., K.M.H., H.A.P., R.R.R., H.C.S., K.L., S.A.G., A.N.A.M., B.L.F., E.L.G., S.J.R., M.C.L., J. Beesley, M.D.S., K.L., C.E.S., R.L.J., S.R.L. and G.C.-T. COGS coordination: P. Hall, D.F.E., J. Benitez and A.M.D. BCAC coordination: D.F.E., G.C.-T. and P.D.P.P. BCAC data management: M.K.B. and Q.W. CIMBA coordination: A.C.A., G.C.-T. and F.J.C. OCAC coordination: P.D.P.P., S.J.R. and C.M.P. CIMBA data management: L.M. and D.B. Provided participant samples and phenotype information and read and approved the manuscript: S.E.B., K.A.P., S.E.J., J. Beesley, K. Michailidou, J.P.T., S.L.E., H.A.P., H.C.S., C.E.S., K.M.H., P.L.M., K.L., M.D.S., Y. Lu, R.K., N. Woods, R.L.J., J.D.F., X.C., M.W., S.F.N., M.J.M., M. Ghoussaini, S.A., C. Baynes, M.K.B., Q.W., J.D., L.M., D.B., A. Lee, S. Healey, M.L., D.C.T., D.V., F.B., I.V., S.L., E. Despierre, H.A.R., A.G.-N., M.A. Rossing, G.P., J.A.D., N. lvarez, M.C.L., B.L.F., N. Schoof, J.C.-C., M.S.C., J. Peto, K.R.K., A. Broeks, S.M.A., M.K.S., L.M.B., B. Winterhoff, H.N., G.E.K., D.L., L.R., P.G., A.T., R.L.M., J.J.G., A.C., V.S., B. Burwinkel, F.M., R.H., E.J.S., C.A.H., S.W.-G., I.L.A., K.B.M., J.L.H., K. Odunsi, A. Lindblom, G.G.G., H. Brenner, J.S., G.L., P.A.F., M.E.C., P.R., L.R.W., A. Swerdlow, M.T.G., H. Brauch, M.G.-C., P. Hillemanns, R.W., M. Drst, P.D., I.R., A. Jakubowska, J. Lubinski, A. Mannermaa, R. Butzow, N.V.B., T.D., L.M.P., W.Z., A. Leminen, H.A.-C., C.H.B., V. Kristensen, R.B.N., K. Muir, R.E., A. Meindl, F.H., K. Matsuo, A.d.B., A.H.W., P. 
Harter, S.-H.T., I.S., X.-O.S., W.B., S. Hosono, D.K., T.N., M. Hartman, Y.Y., U.H., B.Y.K., S. Sangrajrang, S.K.K., V.G., A. Jensen, D.E., E.H., C.-Y.S., J. Brown, Y.L.W., M. Shah, M.A.N.A., R.L., S.Z.O., K.C., R.A.V., B.G.N., H.F., C.V., J.E.O., X.W., D.A.L., A.R., R.P.W., D.F.-J., E.I., S.N., J.M.S., I.D.S.S., D.W.C., L.G., K.L.T., O.F., A.F.V., C.E.v.d.S., E.M.P., F.B.L.H., S.S.T., J. Liu, E.V.B., J. Li, S.H.O., K.H., I.O., C. Blomqvist, L.R.-R., K.A., H.B.S., T.A.M., E. Wik, B. Brouwers, C.K., E. Wauters, M.K.H., H.W., L.A.K., C.M., K.K.A., P.L.-P., A.M.v.A., T.T., L.F.A.G.M., J. Benitez, T.P., J.I.A.P., M. Hoatlin, M.P.Z., L.S.C., S.P.B., L.E.K., A. Schneeweiss, N.D.L., C. Sohn, A.B.-W., I.T., M.J. Kerin, N.M., C.C., B.E.H., J. Menkiszak, F.S., N. Wentzensen, L.L.M., H.P.Y., A.M.M., G. Glendon, S.A.E., J.A.K., C.K.H., C.A., M. Gore, H.T., H.S., M.C.S., A. Jager, A.M.W.v.d.O., R. Brown, J.W.M.M., J.M.F., M.K., J. Paul, S. Margolin, N. Siddiqui, G.S., A.S.W., L. Baglietto, V.M., C. Stegmaier, W.S., H. Mller, V.A., F.L., Y.-T.G., M.S.G., G.Y., M. Dumont, J.R.M., A. Hartmann, A.B.E., M.W.B., C.M.P., M.P.L., J.P.-W., B.P., T.A.S., F.F., M. Barile, A.Z., A.A., A.G.-M., M.J., S.J.R., N.O., U.M., C.L.P., T. Brning, M.C.P., Y.-D.K., J. Lissowska, J.F., J.K., S.J.C., A.D.-M., A.J.-V., I.K.R., K.P., M. Bidzinski, S.K., A. Hollestelle, C. Seynaeve, R.A.E.M.T., K.D., K.J., J.M.H., V.-M.K., V. Kataja, N.N.A., J. Long, M. Shrubsole, S.D.-H., A. Lophatananon, P. Siriwanarangsan, S.S.-B., N.D., P.L., R.K.S., H. Ito, H. Iwata, K.T., C.-C.T., D.O.S., D.v.d.B., C.H.Y., M.K.I., Y.-C.T., H.C., W.L., L.B.S., Q.C., D.-Y.N., K.-Y.Y., H. Miao, P.T.-C.I., Y.Y.T., J. McKay, C. Shapiro, F.A., G.F., C.-N.H., J.-C.Y., M.-F.H., C.S.H., C.L., S.P., D.S.-L., P.P., T.R.R., M.P., C.F.S., E. Friedman, M.T., K. Offit, T.V.O.H., S.L.N., C.I.S., I.B., J. Garber, S.A.N., J.N.W., M.M., E.O., A.K.G., D.Y., D.E.G., T. Caldes, E.N.I., L.T., B.K.A., I.C., A.R.M., C.J.v.A., K.E.P.v.R., H.M.-H., J.M. 
Collée, J.C.O., M.J.H., M.A. Rookus, R.B.v.d.L., T.A.M.v.O., D.G.E., D.F., E. Fineberg, J. Barwell, L.W., M.J. Kennedy, R.P., R. Davidson, S.D.E., T. Cole, B.B.-d.P., B. Buecher, F.D., L. Faivre, M.F., O.M.S., O.C., S.G., S. Mazoyer, V.B., V.C.-M., A.T.-G., J. Gronwald, T. Byrski, A.B.S., B. Bonanni, D.Z., G. Giannini, L. Bernard, R. Dolcetti, S. Manoukian, N. Arnold, C.E., H.D., K.R., D.N., H.P., C. Sutter, B. Wappenschmidt, Å. Borg, B.M., J.R., M. Soller, K.L.N., S.M.D., G.C.R., R.S., D.G.K., M.-K.T., S.S.P., Y. Laitman, A.-B.S., T.A.K., U.B.J., M.R., A.-M.G., B.E., L. Foretova, S.A.S., J. Lester, P. Soucy, K.B.K., C.O., J.M. Cunningham, S. Slager, V.S.P., E. Dicks, S.R.L., F.J.C., P. Hall, A.N.A.M., S.A.G., P.D.P.P., R.R.R., E.L.G., M.H.G., D.F.E., A. Berchuck, A.C.A., G.C.-T. and A.M.D.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. McEachern, M.J., Krauskopf, A. & Blackburn, E.H. Telomeres and their control. Annu. Rev. Genet. 34, 331–358 (2000).
2. Palm, W. & de Lange, T. How shelterin protects mammalian telomeres. Annu. Rev. Genet. 42, 301–334 (2008).
3. Baird, D.M. Telomeres. Exp. Gerontol. 41, 1223–1227 (2006).
4. Moyzis, R.K. et al. A highly conserved repetitive DNA sequence, (TTAGGG)n, present at the telomeres of human chromosomes. Proc. Natl. Acad. Sci. USA 85, 6622–6626 (1988).
5. Allsopp, R.C. et al. Telomere length predicts replicative capacity of human fibroblasts. Proc. Natl. Acad. Sci. USA 89, 10114–10118 (1992).
6. Harley, C.B. Telomere loss: mitotic clock or genetic time bomb? Mutat. Res. 256, 271–282 (1991).
7. Levy, M.Z. et al. Telomere end-replication problem and cell aging. J. Mol. Biol. 225, 951–960 (1992).
8. Counter, C.M. et al. Telomere shortening associated with chromosome instability is arrested in immortal cells which express telomerase activity. EMBO J. 11, 1921–1929 (1992).
9. Hiyama, E. et al. Telomerase activity in human breast tumors. J. Natl. Cancer Inst. 88, 116–122 (1996).
10. Stampfer, M.R. & Yaswen, P. Human epithelial cell immortalization as a step in carcinogenesis. Cancer Lett. 194, 199–208 (2003).
11. Alter, B.P. et al. Telomere length is associated with disease severity and declines with age in dyskeratosis congenita. Haematologica 97, 353–359 (2012).
12. Njajou, O.T. et al. Telomere length is paternally inherited and is associated with parental lifespan. Proc. Natl. Acad. Sci. USA 104, 12135–12139 (2007).
13. Slagboom, P.E., Droog, S. & Boomsma, D.I. Genetic determination of telomere size in humans: a twin study of three age groups. Am. J. Hum. Genet. 55, 876–882 (1994).
14. Mirabello, L. et al. The association of telomere length and genetic variation in telomere biology genes. Hum. Mutat. 31, 1050–1058 (2010).
15. Pooley, K.A. et al. No association between TERT-CLPTM1L single nucleotide polymorphism rs401681 and mean telomere length or cancer risk. Cancer Epidemiol. Biomarkers Prev. 19, 1862–1865 (2010).
16. Mocellin, S. et al. Telomerase reverse transcriptase locus polymorphisms and cancer risk: a field synopsis and meta-analysis. J. Natl. Cancer Inst. 104, 840–854 (2012).
17. Soerensen, M. et al. Genetic variation in TERT and TERC and human leukocyte telomere length and longevity: a cross-sectional and longitudinal analysis. Aging Cell 11, 223–227 (2012).
18. McGrath, M. et al. Telomere length, cigarette smoking, and bladder cancer risk in men and women. Cancer Epidemiol. Biomarkers Prev. 16, 815–819 (2007).
19. Pooley, K.A. et al. Telomere length in prospective and retrospective cancer case-control studies. Cancer Res. 70, 3170–3176 (2010).
20. Shen, J. et al. Telomere length, oxidative damage, antioxidants and breast cancer risk. Int. J. Cancer 124, 1637–1643 (2009).
21. Wentzensen, I.M. et al. The association of telomere length and cancer: a meta-analysis. Cancer Epidemiol. Biomarkers Prev. 20, 1238–1250 (2011).
22. De Vivo, I. et al. A prospective study of relative telomere length and postmenopausal breast cancer risk. Cancer Epidemiol. Biomarkers Prev. 18, 1152–1156 (2009).
23. Zee, R.Y. et al. Mean telomere length and risk of incident colorectal carcinoma: a prospective, nested case-control approach. Cancer Epidemiol. Biomarkers Prev. 18, 2280–2282 (2009).
24. Baird, D.M. Variation at the TERT locus and predisposition for cancer. Expert Rev. Mol. Med. 12, e16 (2010).
25. Beesley, J. et al. Functional polymorphisms in the TERT promoter are associated with risk of serous epithelial ovarian and breast cancers. PLoS ONE 6, e24987 (2011).
26. Kote-Jarai, Z. et al. Seven prostate cancer susceptibility loci identified by a multistage genome-wide association study. Nat. Genet. 43, 785–791 (2011).
27. Landi, M.T. et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am. J. Hum. Genet. 85, 679–691 (2009).
28. Rafnar, T. et al. Sequence variants at the TERT-CLPTM1L locus associate with many cancer types. Nat. Genet. 41, 221–227 (2009).
29. Shete, S. et al. Genome-wide association study identifies five susceptibility loci for glioma. Nat. Genet. 41, 899–904 (2009).
30. Stacey, S.N. et al. New common variants affecting susceptibility to basal cell carcinoma. Nat. Genet. 41, 909–914 (2009).
31. Turnbull, C. et al. Variants near DMRT1, TERT and ATF7IP are associated with testicular germ cell cancer. Nat. Genet. 42, 604–607 (2010).
32. Wang, Y. et al. Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat. Genet. 40, 1407–1409 (2008).
33. Johnatty, S.E. et al. Evaluation of candidate stromal epithelial cross-talk genes identifies association between risk of serous ovarian cancer and TERT, a cancer susceptibility hot-spot. PLoS Genet. 6, e1001016 (2010).
34. Haiman, C.A. et al. A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor–negative breast cancer. Nat. Genet. 43, 1210–1214 (2011).
35. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
36. Michailidou, K. et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
37. Pharoah, P.D.P. et al. GWAS meta-analysis and replication identifies three new common susceptibility loci for ovarian cancer. Nat. Genet. published online; doi:10.1038/ng.2564 (27 March 2013).
38. Garcia-Closas, M. et al. Genome-wide association studies identify four ER negative–specific breast cancer risk loci. Nat. Genet. published online; doi:10.1038/ng.2561 (27 March 2013).
39. Rosenbloom, K.R. et al. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Res. 40, D912–D917 (2012).
40. Giresi, P.G. et al. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 17, 877–885 (2007).
41. Taylor, J. et al. ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements. Genome Res. 16, 1596–1604 (2006).
42. Lim, E. et al. Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers. Nat. Med. 15, 907–913 (2009).

43. Chen, Y.J. et al. PAX8 regulates telomerase reverse transcriptase and telomerase RNA component in glioma. Cancer Res. 68, 5724–5732 (2008).
44. Colgin, L.M. et al. The hTERT splice variant is a dominant negative inhibitor of telomerase activity. Neoplasia 2, 426–432 (2000).
45. Saebøe-Larssen, S., Fossberg, E. & Gaudernack, G. Characterization of novel alternative splicing sites in human telomerase reverse transcriptase (hTERT): analysis of expression and mutual correlation in mRNA isoforms from normal and tumour tissues. BMC Mol. Biol. 7, 26 (2006).
46. Kilian, A. et al. Isolation of a candidate human telomerase catalytic subunit gene, which reveals complex splicing patterns in different cell types. Hum. Mol. Genet. 6, 2011–2019 (1997).
47. Wick, M., Zubov, D. & Hagen, G. Genomic organization and promoter characterization of the gene encoding the human telomerase reverse transcriptase (hTERT). Gene 232, 97–106 (1999).
48. Desmet, F.O. et al. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res. 37, e67 (2009).
49. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
50. Weischer, M. et al. Short telomere length, myocardial infarction, ischemic heart disease, and early death. Arterioscler. Thromb. Vasc. Biol. 32, 822–829 (2012).
51. Codd, V. et al. Common variants near TERC are associated with mean telomere length. Nat. Genet. 42, 197–199 (2010).
52. Levy, D. et al. Genome-wide association identifies OBFC1 as a locus involved in human leukocyte telomere biology. Proc. Natl. Acad. Sci. USA 107, 9293–9298 (2010).
53. Willeit, P. et al. Telomere length and risk of incident cancer and cancer mortality. J. Am. Med. Assoc. 304, 69–75 (2010).
54. Kote-Jarai, Z. et al. Fine-mapping identifies multiple prostate cancer risk loci on 5p15, one of which associates with TERT expression. Hum. Mol. Genet. published online; doi:10.1093/hmg/ddt086 (27 March 2013).
55. Krtolica, A. et al. Senescent fibroblasts promote epithelial cell growth and tumorigenesis: a link between cancer and aging. Proc. Natl. Acad. Sci. USA 98, 12072–12077 (2001).
56. Lawrenson, K. et al. Senescent fibroblasts promote neoplastic transformation of partially transformed ovarian epithelial cells in a three-dimensional model of early stage ovarian cancer. Neoplasia 12, 317–325 (2010).
57. McKay, J.D. et al. Lung cancer susceptibility locus at 5p15.33. Nat. Genet. 40, 1404–1406 (2008).
58. Shen, J. et al. Multiple genetic variants in telomere pathway genes and breast cancer risk. Cancer Epidemiol. Biomarkers Prev. 19, 219–228 (2010).
59. Hu, Z. et al. A genome-wide association study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2 in Han Chinese. Nat. Genet. 43, 792–796 (2011).
60. Mushiroda, T. et al. A genome-wide association study identifies an association of a common variant in TERT with susceptibility to idiopathic pulmonary fibrosis. J. Med. Genet. 45, 654–656 (2008).
61. Truong, T. et al. Replication of lung cancer susceptibility loci at chromosomes 15q25, 5p15, and 6p21: a pooled analysis from the International Lung Cancer Consortium. J. Natl. Cancer Inst. 102, 959–971 (2010).
62. Zou, P. et al. The TERT rs2736100 polymorphism and cancer risk: a meta-analysis based on 25 case-control studies. BMC Cancer 12, 7 (2012).
63. Zhao, Y.M. et al. Fine-mapping of a region of chromosome 5p15.33 (TERT-CLPTM1L) suggests a novel locus in TERT and a CLPTM1L haplotype are associated with glioma susceptibility in a Chinese population. Int. J. Cancer 129, 2463–2472 (2011).
64. Takubo, K. et al. Telomere lengths are characteristic in each human individual. Exp. Gerontol. 37, 523–531 (2002).
65. Gadalla, S.M. et al. Telomere length in blood, buccal cells, and fibroblasts from patients with inherited bone marrow failure syndromes. Aging (Albany NY) 2, 867–874 (2010).
66. Thomas, P., O'Callaghan, N.J. & Fenech, M. Telomere length in white blood cells, buccal cells and brain tissue and its variation with ageing and Alzheimer's disease. Mech. Ageing Dev. 129, 183–190 (2008).
67. Greider, C.W. & Blackburn, E.H. Identification of a specific telomere terminal transferase activity in Tetrahymena extracts. Cell 43, 405–413 (1985).
68. Choi, J. et al. TERT promotes epithelial proliferation through transcriptional control of a Myc- and Wnt-related developmental program. PLoS Genet. 4, e10 (2008).
69. Park, J.I. et al. Telomerase modulates Wnt signalling by association with target gene chromatin. Nature 460, 66–72 (2009).
70. Mukherjee, S. et al. Separation of telomerase functions by reverse genetics. Proc. Natl. Acad. Sci. USA 108, E1363–E1371 (2011).


Stig E Bojesen1,2,285, Karen A Pooley3,285, Sharon E Johnatty4,285, Jonathan Beesley4,285, Kyriaki Michailidou3,285, Jonathan P Tyrer5,285, Stacey L Edwards6, Hilda A Pickett7,8, Howard C Shen9, Chanel E Smart10, Kristine M Hillman6, Phuong L Mai11, Kate Lawrenson9, Michael D Stutz7,8, Yi Lu4, Rod Karevan9, Nicholas Woods12, Rebecca L Johnston10, Juliet D French6, Xiaoqing Chen4, Maren Weischer1,2, Sune F Nielsen1,2, Melanie J Maranian5, Maya Ghoussaini5, Shahana Ahmed5, Caroline Baynes5, Manjeet K Bolla3, Qin Wang3, Joe Dennis3, Lesley McGuffog3, Daniel Barrowdale3, Andrew Lee3, Sue Healey4, Michael Lush3, Daniel C Tessier13, Daniel Vincent13, François Bacot13, Australian Cancer Study14, Australian Ovarian Cancer Study14, Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer (kConFab)14, Gene Environment Interaction and Breast Cancer (GENICA)14, Swedish Breast Cancer Study (SWE-BRCA)14, The Hereditary Breast and Ovarian Cancer Research Group Netherlands (HEBON)14, Epidemiological study of BRCA1 & BRCA2 Mutation Carriers (EMBRACE)14, Genetic Modifiers of Cancer Risk in BRCA1/2 Mutation Carriers (GEMO)14, Ignace Vergote15,16, Sandrina Lambrechts15,16, Evelyn Despierre15,16, Harvey A Risch17, Anna González-Neira18, Mary Anne Rossing19,20, Guillermo Pita18, Jennifer A Doherty21, Nuria Álvarez18, Melissa C Larson22, Brooke L Fridley23, Nils Schoof24, Jenny Chang-Claude25, Mine S Cicek26, Julian Peto27, Kimberly R Kalli28, Annegien Broeks29, Sebastian M Armasu22, Marjanka K Schmidt29,30, Linde M Braaf29, Boris Winterhoff31, Heli Nevanlinna32, Gottfried E Konecny33, Diether Lambrechts34,35, Lisa Rogmann31, Pascal Guénel36,37, Attila Teoman31, Roger L Milne38, Joaquin J Garcia39, Angela Cox40, Vijayalakshmi Shridhar39, Barbara Burwinkel41,42, Frederik Marme41,43, Rebecca Hein25,44, Elinor J Sawyer45, Christopher A Haiman9, Shan Wang-Gohrke46, Irene L Andrulis47,48, Kirsten B Moysich49, John L Hopper50, Kunle Odunsi49, Annika Lindblom51, Graham G Giles50,52,53, Hermann Brenner54, Jacques Simard55, Galina Lurie56, Peter A Fasching33,57, Michael E Carney56, Paolo Radice58,59, Lynne R Wilkens56, Anthony Swerdlow60,61, Marc T Goodman62, Hiltrud Brauch63,64, Montserrat Garcia-Closas65, Peter Hillemanns66, Robert Winqvist67,68, Matthias Dürst69, Peter Devilee70,71, Ingo Runnebaum69, Anna Jakubowska72, Jan Lubinski72, Arto Mannermaa73,74, Ralf Butzow32,75, Natalia V Bogdanova76,77, Thilo Dörk76, Liisa M Pelttari32, Wei Zheng78, Arto Leminen32, Hoda Anton-Culver79, Clareann H Bunker80, Vessela Kristensen81,82, Roberta B Ness83, Kenneth Muir84,85, Robert Edwards86, Alfons Meindl87,

Florian Heitz88,89, Keitaro Matsuo90, Andreas du Bois88,89, Anna H Wu9, Philipp Harter88,89, Soo-Hwang Teo91,92, Ira Schwaab93, Xiao-Ou Shu78, William Blot78,94, Satoyo Hosono90, Daehee Kang95, Toru Nakanishi96, Mikael Hartman24,97,98, Yasushi Yatabe99, Ute Hamann100, Beth Y Karlan101, Suleeporn Sangrajrang102, Susanne Krüger Kjaer103,104, Valerie Gaborieau105, Allan Jensen103, Diana Eccles106, Estrid Høgdall103,107, Chen-Yang Shen108,109, Judith Brown3, Yin Ling Woo110, Mitul Shah5, Mat Adenan Noor Azmi110, Robert Luben3, Siti Zawiah Omar110, Kamila Czene24, Robert A Vierkant22, Børge G Nordestgaard1,2, Henrik Flyger111, Celine Vachon112, Janet E Olson112, Xianshu Wang39, Douglas A Levine113, Anja Rudolph25, Rachel Palmieri Weber114, Dieter Flesch-Janys115,116, Edwin Iversen117,118, Stefan Nickels25, Joellen M Schildkraut114,118, Isabel Dos Santos Silva27, Daniel W Cramer119,120, Lorna Gibson27, Kathryn L Terry119,120, Olivia Fletcher65, Allison F Vitonis119, C Ellen van der Schoot121, Elizabeth M Poole120,122, Frans B L Hogervorst123, Shelley S Tworoger120,122, Jianjun Liu124, Elisa V Bandera125, Jingmei Li124, Sara H Olson126, Keith Humphreys24, Irene Orlow126, Carl Blomqvist127, Lorna Rodriguez-Rodriguez125, Kristiina Aittomäki128, Helga B Salvesen129,130, Taru A Muranen32, Elisabeth Wik129,130, Barbara Brouwers131,132, Camilla Krakstad129,130, Els Wauters34,35, Mari K Halle129,130, Hans Wildiers132, Lambertus A Kiemeney133–135, Claire Mulot136, Katja K Aben133,134, Pierre Laurent-Puig136, Anne M van Altena137, Thérèse Truong36,37, Leon F A G Massuger137, Javier Benitez18,138,139, Tanja Pejovic140,141, Jose Ignacio Arias Perez142, Maureen Hoatlin143, M Pilar Zamora144, Linda S Cook145, Sabapathy P Balasubramanian40, Linda E Kelemen146–148, Andreas Schneeweiss41,43, Nhu D Le149, Christof Sohn41, Angela Brooks-Wilson150,151, Ian Tomlinson152,153, Michael J Kerin154, Nicola Miller154, Cezary Cybulski155, Brian E Henderson9, Janusz Menkiszak156, Fredrick Schumacher9, Nicolas Wentzensen157, Loic Le Marchand56, Hannah P Yang157, Anna Marie Mulligan158,159, Gord Glendon160, Svend Aage Engelholm161, Julia A Knight162,163, Claus K Høgdall104, Carmel Apicella50, Martin Gore164, Helen Tsimiklis165, Honglin Song5, Melissa C Southey165, Agnes Jager166, Ans M W van den Ouweland167, Robert Brown168, John W M Martens166, James M Flanagan168, Mieke Kriege166, James Paul169, Sara Margolin170, Nadeem Siddiqui171, Gianluca Severi50,52, Alice S Whittemore172, Laura Baglietto50,52, Valerie McGuire172, Christa Stegmaier173, Weiva Sieh172, Heiko Müller54, Volker Arndt54, France Labrèche174, Yu-Tang Gao175, Mark S Goldberg176,177, Gong Yang78, Martine Dumont55, John R McLaughlin160,178, Arndt Hartmann179, Arif B Ekici180, Matthias W Beckmann57, Catherine M Phelan12, Michael P Lux57, Jenny Permuth-Wey12, Bernard Peissel181, Thomas A Sellers12, Filomena Ficarazzi59,182, Monica Barile183, Argyrios Ziogas184, Alan Ashworth65, Aleksandra Gentry-Maharaj185, Michael Jones60, Susan J Ramus9, Nick Orr65, Usha Menon185, Celeste L Pearce9, Thomas Brüning186, Malcolm C Pike9,126, Yon-Dschun Ko187, Jolanta Lissowska188, Jonine Figueroa157, Jolanta Kupryjanczyk189, Stephen J Chanock157, Agnieszka Dansonka-Mieszkowska189, Arja Jukkola-Vuorinen190, Iwona K Rzepecka189, Katri Pylkäs67,68, Mariusz Bidzinski191, Saila Kauppila192, Antoinette Hollestelle166, Caroline Seynaeve166, Rob A E M Tollenaar193, Katarzyna Durda72, Katarzyna Jaworska72,194, Jaana M Hartikainen73,74, Veli-Matti Kosma73,74, Vesa Kataja74, Natalia N Antonenkova195, Jirong Long78, Martha Shrubsole78, Sandra Deming-Halverson78, Artitaya Lophatananon84,85, Pornthep Siriwanarangsan196, Sarah Stewart-Brown84, Nina Ditsch197, Peter Lichtner198, Rita K Schmutzler199,200, Hidemi Ito90, Hiroji Iwata201, Kazuo Tajima90, Chiu-Chen Tseng9, Daniel O Stram9, David van den Berg9, Cheng Har Yip92, M Kamran Ikram202, Yew-Ching Teh92, Hui Cai78, Wei Lu203, Lisa B Signorello78,94, Qiuyin Cai77, Dong-Young Noh95, Keun-Young Yoo95, Hui Miao98, Philip Tsau-Choong Iau97, Yik Ying Teo98, James McKay105, Charles Shapiro204, Foluso Ademuyiwa205, George Fountzilas206, Chia-Ni Hsiung109, Jyh-Cherng Yu207, Ming-Feng Hou208,209, Catherine S Healey5, Craig Luccarini5, Susan Peock3, Dominique Stoppa-Lyonnet210–212, Paolo Peterlongo58,59, Timothy R Rebbeck213,214, Marion Piedmonte215, Christian F Singer216, Eitan Friedman217,218, Mads Thomassen219, Kenneth Offit220, Thomas V O Hansen221, Susan L Neuhausen222, Csilla I Szabo223, Ignacio Blanco224, Judy Garber225, Steven A Narod226, Jeffrey N Weitzel227, Marco Montagna228, Edith Olah229, Andrew K Godwin230, Drakoulis Yannoukakos231, David E Goldgar232,233, Trinidad Caldes234, Evgeny N Imyanitov235, Laima Tihomirova236, Banu K Arun237,238, Ian Campbell239, Arjen R Mensenkamp240, Christi J van Asperen241, Kees E P van Roozendaal242, Hanne Meijers-Heijboer243, J Margriet Collée167, Jan C Oosterwijk244, Maartje J Hooning166, Matti A Rookus29, Rob B van der Luijt245, Theo A M van Os246, D Gareth Evans247, Debra Frost3, Elena Fineberg3, Julian Barwell248, Lisa Walker249, M John Kennedy250,

Radka Platte3, Rosemarie Davidson251, Steve D Ellis3, Trevor Cole252, Brigitte Bressac-de Paillerets253,254, Bruno Buecher210, Francesca Damiola255, Laurence Faivre256,257, Marc Frenay258, Olga M Sinilnikova255,259, Olivier Caron260, Sophie Giraud259, Sylvie Mazoyer255, Valérie Bonadona261,262, Virginie Caux-Moncoutier210, Aleksandra Toloczko-Grabarek72, Jacek Gronwald72, Tomasz Byrski72, Amanda B Spurdle4, Bernardo Bonanni183, Daniela Zaffaroni181, Giuseppe Giannini263, Loris Bernard182,264, Riccardo Dolcetti265, Siranoush Manoukian181, Norbert Arnold266, Christoph Engel267, Helmut Deissler46, Kerstin Rhiem199,200, Dieter Niederacher268, Hansjoerg Plendl269, Christian Sutter270, Barbara Wappenschmidt199,200, Åke Borg271, Beatrice Melin272, Johanna Rantala273, Maria Soller274, Katherine L Nathanson213,275, Susan M Domchek213,275, Gustavo C Rodriguez276, Ritu Salani277, Daphne Gschwantler Kaulich216, Muy-Kheng Tea216, Shani Shimon Paluch217,218, Yael Laitman217,218, Anne-Bine Skytte278, Torben A Kruse219, Uffe Birk Jensen279, Mark Robson220, Anne-Marie Gerdes280, Bent Ejlertsen281, Lenka Foretova282, Sharon A Savage11, Jenny Lester101, Penny Soucy55, Karoline B Kuchenbaecker3, Curtis Olswold112, Julie M Cunningham39, Susan Slager112, Vernon S Pankratz112, Ed Dicks3, Sunil R Lakhani10,283, Fergus J Couch39,112, Per Hall24, Alvaro N A Monteiro12, Simon A Gayther9, Paul D P Pharoah5, Roger R Reddel7,8, Ellen L Goode26, Mark H Greene11, Douglas F Easton3,5,286, Andrew Berchuck284,286, Antonis C Antoniou3,286, Georgia Chenevix-Trench4,286 & Alison M Dunning5,286

1Copenhagen General Population Study, Herlev Hospital, Copenhagen University Hospital, University of Copenhagen, Copenhagen, Denmark. 2Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, University of Copenhagen, Copenhagen, Denmark. 3Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK. 4Department of Genetics, Queensland Institute of Medical Research, Brisbane, Queensland, Australia. 5Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK. 6School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia. 7Cancer Research Unit, Children's Medical Research Institute, Westmead, New South Wales, Australia. 8Sydney Medical School, University of Sydney, Sydney, New South Wales, Australia. 9Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, USA. 10University of Queensland, UQ Centre for Clinical Research (UQCCR) Royal Brisbane and Women's Hospital, Herston, Queensland, Australia. 11Clinical Genetics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, US National Institutes of Health, Rockville, Maryland, USA. 12Department of Cancer Epidemiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, USA. 13McGill University and Génome Québec Innovation Centre, Montréal, Quebec, Canada. 14A list of members is provided in the Supplementary Note. 15Department of Obstetrics and Gynaecology, Division of Gynecologic Oncology, University Hospitals Leuven, Leuven, Belgium. 16Leuven Cancer Institute, University Hospitals Leuven, Leuven, Belgium. 17Department of Epidemiology and Public Health, Yale University School of Public Health and School of Medicine, New Haven, Connecticut, USA.
18Centro Nacional de Genotipacin, Human Cancer Genetics Program, Spanish National Cancer Research Centre (CNIO), Madrid, Spain. 19Department of Epidemiology, University of Washington, Seattle, Washington, USA. 20Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 21Section of Biostatistics and Epidemiology, The Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, USA. 22Department of Health Science Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA. 23Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, The University of Kansas Cancer Center, Kansas City, Kansas, USA. 24Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 25Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany. 26Department of Health Science Research, Division of Epidemiology, Mayo Clinic, Rochester, Minnesota, USA. 27Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK. 28Department of Medical Oncology, Mayo Clinic, Rochester, Minnesota, USA. 29Division of Molecular Pathology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 30Division of Psychosocial Research and Epidemiology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 31Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota, USA. 32Department of Obstetrics and Gynecology, Helsinki University Central Hospital, University of Helsinki, Helsinki, Finland. 33Division of Hematology and Oncology, Department of Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA. 34Laboratory for Translational Genetics, Department of Oncology, University of Leuven, Leuven, Belgium. 
35Vesalius Research Center (VRC), VIB, Leuven, Belgium. 36Institut National de la Santé et de la Recherche Médicale (INSERM) U1018, CESP (Center for Research in Epidemiology and Population Health), Environmental Epidemiology of Cancer, Villejuif, France. 37University Paris-Sud, Unité Mixte de Recherche Scientifique (UMRS) 1018, Villejuif, France. 38Genetic and Molecular Epidemiology Group, Human Cancer Genetics Program, CNIO, Madrid, Spain. 39Department of Laboratory Medicine and Pathology, Division of Experimental Pathology, Mayo Clinic, Rochester, Minnesota, USA. 40Cancer Research UK/Yorkshire Cancer Research Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK. 41Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany. 42Molecular Epidemiology Group, DKFZ, Heidelberg, Germany. 43National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany. 44Primärmedizinische Versorgung (PMV) Research Group at the Department of Child and Adolescent Psychiatry and Psychotherapy, University of Cologne, Cologne, Germany. 45Division of Cancer Studies, National Institute for Health Research (NIHR) Comprehensive Biomedical Research Centre, Guy's & St. Thomas' National Health Service (NHS) Foundation Trust in partnership with King's College London, London, UK. 46Department of Obstetrics and Gynecology, University of Ulm, Ulm, Germany. 47Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada. 48Ontario Cancer Genetics Network, Fred A. Litwin Center for Cancer Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 49Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, USA. 50Centre for Molecular, Environmental, Genetic and Analytic Epidemiology, University of Melbourne, Melbourne, Victoria, Australia. 51Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden. 
52Cancer Epidemiology Centre, The Cancer Council Victoria, Melbourne, Victoria, Australia. 53Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Victoria, Australia. 54Division of Clinical Epidemiology and Aging Research, DKFZ, Heidelberg, Germany. 55Cancer Genomics Laboratory, Centre Hospitalier Universitaire de Québec and Laval University, Quebec City, Quebec, Canada. 56Cancer Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii, USA. 57Department of Gynecology and Obstetrics, University Breast Center for Franconia Erlangen University Hospital, Erlangen, Germany. 58Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS), Istituto Nazionale Tumori (INT), Milan, Italy. 59Istituto Fondazione Italiana per la Ricerca sul Cancro di Oncologia Molecolare, Milan, Italy. 60Division of Genetics and Epidemiology, The Institute of Cancer Research, Sutton, UK. 61Division of Breast Cancer Research, The Institute of Cancer Research, Sutton, UK. 62Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California, USA. 63Dr. Margarete Fischer-Bosch-Institute of Clinical Pharmacology, Stuttgart, Germany. 64Faculty of Medicine, University of Tübingen, Tübingen, Germany. 65Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, The Institute of Cancer Research, London, UK. 66Clinics of Obstetrics and Gynaecology, Hannover Medical School, Hannover, Germany. 67Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Genetics, University of Oulu, Oulu University Hospital, Oulu, Finland. 68Biocenter Oulu, University of Oulu, Oulu, Finland. 69Department of Gynecology, Jena University Hospital, Jena, Germany. 70Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands. 
71Department of Pathology, Leiden University Medical

Center, Leiden, The Netherlands. 72Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland. 73Department of Clinical Pathology, Imaging Center, Kuopio University Hospital, Kuopio, Finland. 74School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, Biocenter Kuopio, Cancer Center of Eastern Finland, University of Eastern Finland, Kuopio, Finland. 75Department of Pathology, Helsinki University Central Hospital, Helsinki, Finland. 76Department of Obstetrics and Gynaecology, Hannover Medical School, Hannover, Germany. 77Department of Radiation Oncology, Hannover Medical School, Hannover, Germany. 78Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 79Department of Epidemiology, University of CaliforniaIrvine, Irvine, California, USA. 80Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 81Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway. 82Faculty of Medicine (Faculty Division Ahus), Universitetet i Oslo, Oslo, Norway. 83The University of Texas School of Public Health, Houston, Texas, USA. 84Warwick Medical School, Warwick University, Coventry, UK. 85Institute of Population Health, University of Manchester, Manchester, UK. 86Maggee Womens Hospital, Pittsburgh, Pennsylvania, USA. 87Department of Gynecology and Obstetrics, Division of Tumor Genetics, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany. 88Department of Gynecology and Gynecologic Oncology, Dr. Horst Schmidt Klinik Wiesbaden, Wiesbaden, Germany. 89Department of Gynecology and Gynecologic Oncology, Kliniken Essen-Mitte, Essen, Germany. 90Division of Epidemiology and Prevention, Aichi Cancer Center Research Institute, Nagoya, Japan. 
91Cancer Research Initiatives Foundation, Sime Darby Medical Centre, Subang Jaya, Malaysia. 92Breast Cancer Research Unit, University Malaya Cancer Research Institute, University Malaya Medical Centre, Kuala Lumpur, Malaysia. 93Institut für Humangenetik Wiesbaden, Wiesbaden, Germany. 94International Epidemiology Institute, Rockville, Maryland, USA. 95Seoul National University College of Medicine, Seoul, Korea. 96Department of Gynecologic Oncology, Aichi Cancer Center Central Hospital, Nagoya, Japan. 97Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. 98Saw Swee Hock School of Public Health, National University of Singapore, Singapore. 99Department of Pathology and Molecular Diagnostic, Aichi Cancer Center Central Hospital, Nagoya, Japan. 100Molecular Genetics of Breast Cancer, DKFZ, Heidelberg, Germany. 101Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California, USA. 102Research Division, National Cancer Institute, Bangkok, Thailand. 103Virus, Lifestyle and Genes, Danish Cancer Society Research Center, Copenhagen, Denmark. 104The Juliane Marie Centre, Department of Obstetrics and Gynecology, Rigshospitalet, Copenhagen, Denmark. 105International Agency for Research on Cancer, Lyon, France. 106Faculty of Medicine, University of Southampton, University Hospital Southampton, Southampton, UK. 107Molecular Unit, Department of Pathology, Herlev Hospital, University of Copenhagen, Copenhagen, Denmark. 108College of Public Health, China Medical University, Taichung, Taiwan. 109Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan. 110Department of Obstetrics and Gynaecology, Faculty of Medicine, University Malaya Medical Centre, University Malaya, Kuala Lumpur, Malaysia. 111Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Copenhagen, Denmark. 
112Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA. 113Gynecology Service, Department of Surgery, Memorial Sloan-Kettering Cancer Center, New York, New York, USA. 114Department of Community and Family Medicine, Duke University Medical Center, Durham, North Carolina, USA. 115Department of Cancer Epidemiology/Clinical Cancer Registry, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 116Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 117Department of Statistical Science, Duke University, Durham, North Carolina, USA. 118Cancer Prevention, Detection and Control Research Program, Duke Cancer Institute, Durham, North Carolina, USA. 119Obstetrics and Gynecology Epidemiology Center, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA. 120Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 121Sanquin Research, Amsterdam, The Netherlands. 122Channing Division of Network Medicine, Harvard Medical School and Brigham and Women's Hospital, Boston, Massachusetts, USA. 123Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 124Division of Human Genetics, Genome Institute of Singapore, Singapore. 125The Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA. 126Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York, USA. 127Department of Oncology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland. 128Department of Clinical Genetics, Helsinki University Central Hospital, University of Helsinki, Helsinki, Finland. 129Department of Gynecology and Obstetrics, Haukeland University Hospital, Bergen, Norway. 130Department of Clinical Medicine, University of Bergen, Bergen, Norway. 131Laboratory of Experimental Oncology, Department of Oncology, KU Leuven, Leuven, Belgium.
132Department of General Medical Oncology, University Hospitals Leuven, Leuven Cancer Institute, Leuven, Belgium. 133Comprehensive Cancer Center The Netherlands, Utrecht, The Netherlands. 134Department of Epidemiology, Biostatistics and Health Technology Assessment, Radboud University Medical Centre, Nijmegen, The Netherlands. 135Department of Urology, Radboud University Medical Centre, Nijmegen, The Netherlands. 136Université Paris Sorbonne Cité, Unité Mixte de Recherche (UMR) S775, INSERM, Paris, France. 137Department of Gynecology, Radboud University Medical Centre, Nijmegen, The Netherlands. 138Human Genetics Group, CNIO, Madrid, Spain. 139Biomedical Network on Rare Diseases (CIBERER), Madrid, Spain. 140Department of Obstetrics and Gynecology, Oregon Health and Science University, Portland, Oregon, USA. 141Knight Cancer Institute, Oregon Health and Science University, Portland, Oregon, USA. 142Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain. 143Department of Biochemistry and Molecular Biology, Oregon Health and Science University, Portland, Oregon, USA. 144Servicio de Oncología Médica, Hospital Universitario La Paz, Madrid, Spain. 145Division of Epidemiology and Biostatistics, University of New Mexico, Albuquerque, New Mexico, USA. 146Department of Population Health Research, Alberta Health Services–Cancer Care, Calgary, Alberta, Canada. 147Department of Medical Genetics, University of Calgary, Calgary, Alberta, Canada. 148Department of Oncology, University of Calgary, Calgary, Alberta, Canada. 149Cancer Control Research, British Columbia Cancer Agency, Vancouver, British Columbia, Canada. 150Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada. 151Department of Biomedical Physiology and Kinesiology, Simon Fraser University, Burnaby, British Columbia, Canada. 152Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
153Oxford Biomedical Research Centre, University of Oxford, Oxford, UK. 154School of Medicine, National University of Ireland, Galway, Ireland. 155International Hereditary Cancer Center, Department of Genetics and Pathology, Pomeranian Medical Academy, Szczecin, Poland. 156Department of Surgical Gynecology and Gynecological Oncology of Adults and Adolescents, Pomeranian Medical University, Szczecin, Poland. 157Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA. 158Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. 159Laboratory Medicine Program, University Health Network, Toronto, Ontario, Canada. 160Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 161Department of Radiation Oncology, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark. 162Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada. 163Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 164Gynecological Oncology Unit, The Royal Marsden Hospital, London, UK. 165Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Melbourne, Victoria, Australia. 166Department of Medical Oncology, Family Cancer Clinic, Erasmus University Medical Center, Rotterdam, The Netherlands. 167Department of Clinical Genetics, Family Cancer Clinic, Erasmus University Medical Center, Rotterdam, The Netherlands. 168Department of Surgery and Cancer, Imperial College London, London, UK. 169The Beatson West of Scotland Cancer Centre, Glasgow, UK. 170Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden. 171Department of Gynecological Oncology, Glasgow Royal Infirmary, Glasgow, UK. 172Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA. 
173Saarland Cancer Registry, Saarbrücken, Germany. 174Department of Environmental and Occupational Health, Faculty of Medicine, University of Montreal, Montreal, Quebec, Canada. 175Department of Epidemiology, Shanghai Cancer Institute, Shanghai, China. 176Department of Medicine, McGill University, Montréal, Quebec, Canada. 177Division of Clinical Epidemiology, McGill University Health Centre, Royal Victoria Hospital, Montréal, Quebec, Canada. 178Dalla Lana School of Public Health, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada. 179Institute of Pathology, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany. 180Institute of Human Genetics, Friedrich-Alexander University Erlangen-Nuremberg, Erlangen, Germany. 181Unit of Medical Genetics, Department of Preventive and Predictive Medicine, Fondazione IRCCS INT, Milan, Italy. 182Cogentech Cancer Genetic Test Laboratory, Milan, Italy. 183Division of Cancer Prevention and Genetics, Istituto Europeo di Oncologia, Milan, Italy. 184Department of Epidemiology, Center for Cancer Genetics Research and Prevention, School of Medicine, University of California–Irvine, Irvine, California, USA. 185Gynaecological Cancer Research Centre, University College London Elizabeth Garrett Anderson Institute for Women's Health, London, UK. 186Institute for Prevention and Occupational Medicine of the German Social Accident Insurance, Institute of the Ruhr-Universität Bochum, Bochum, Germany. 187Department of Internal Medicine, Evangelische Kliniken Bonn, Johanniter Krankenhaus, Bonn, Germany. 188Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center & Institute of Oncology, Warsaw, Poland. 189Department of Molecular Pathology, The Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland. 190Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland.
191Department of Gynecologic Oncology, The Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland. 192Department of Pathology, Oulu University Hospital, University of Oulu, Oulu, Finland. 193Department of Surgical Oncology, Leiden University Medical Center, Leiden, The Netherlands. 194Postgraduate School of Molecular Medicine, Warsaw Medical University, Warsaw, Poland. 195N.N. Alexandrov Research Institute of Oncology and Medical Radiology, Minsk, Belarus. 196Ministry of Public Health, Bangkok, Thailand. 197Department of Gynecology and Obstetrics, Ludwig-Maximilians-Universität, Munich, Germany.


Nature Genetics VOLUME 45 | NUMBER 4 | APRIL 2013

198Institute of Human Genetics, Technische Universität, Munich, Germany. 199Department of Gynaecology and Obstetrics, Centre of Familial Breast and Ovarian Cancer, University Hospital of Cologne, Cologne, Germany. 200Centre for Molecular Medicine Cologne (CMMC), University Hospital of Cologne, Cologne, Germany. 201Department of Breast Oncology, Aichi Cancer Center Hospital, Nagoya, Japan. 202Singapore Eye Research Institute, National University of Singapore, Singapore. 203Shanghai Center for Disease Control and Prevention, Shanghai, China. 204Wexner Medical Center, Division of Oncology, The Ohio State University, Columbus, Ohio, USA. 205Roswell Park Cancer Institute, Buffalo, New York, USA. 206Department of Medical Oncology, Papageorgiou Hospital, Aristotle University of Thessaloniki School of Medicine, Thessaloniki, Greece. 207Department of Surgery, Tri-Service General Hospital, Taipei, Taiwan. 208Cancer Center, Kaohsiung Medical University Chung-Ho Memorial Hospital, Kaohsiung, Taiwan. 209Department of Surgery, Kaohsiung Medical University Chung-Ho Memorial Hospital, Kaohsiung, Taiwan. 210Institut Curie, Department of Tumour Biology, Paris, France. 211Institut Curie, INSERM U830, Paris, France. 212Université Paris Descartes, Sorbonne Paris Cité, Paris, France. 213Basser Research Center, Abramson Cancer Center, The University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA. 214Center for Clinical Epidemiology and Biostatistics, The University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA. 215Gynecologic Oncology Group Statistical and Data Center, Roswell Park Cancer Institute, Buffalo, New York, USA. 216Department of Obstetrics and Gynecology, Comprehensive Cancer Center, Medical University of Vienna, Vienna, Austria. 217The Susanne Levy Gertner Oncogenetics Unit, Sheba Medical Center, Tel-Hashomer, Israel. 218Institute of Oncology, Sheba Medical Center, Tel-Hashomer, Israel.
219Department of Clinical Genetics, Odense University Hospital, Odense, Denmark. 220Clinical Genetics Service, Memorial Sloan-Kettering Cancer Center, New York, New York, USA. 221Center for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark. 222Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, California, USA. 223Department of Biological Sciences, Center for Translational Cancer Research, University of Delaware, Newark, Delaware, USA. 224Genetic Counseling Unit, Hereditary Cancer Program, l'Institut d'Investigació Biomèdica de Bellvitge–Catalan Institute of Oncology, Barcelona, Spain. 225Center for Cancer Genetics and Prevention, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 226Women's College Research Institute, University of Toronto, Toronto, Ontario, Canada. 227Clinical Cancer Genetics, City of Hope, Duarte, California, USA. 228Immunology and Molecular Oncology Unit, Istituto Oncologico Veneto (IOV), IRCCS, Padua, Italy. 229Department of Molecular Genetics, National Institute of Oncology, Budapest, Hungary. 230Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA. 231Molecular Diagnostics Laboratory, Institute of Radioisotopes and Radiodiagnostic Products, National Centre for Scientific Research Demokritos, Aghia Paraskevi Attikis, Athens, Greece. 232Department of Dermatology, University of Utah School of Medicine, Salt Lake City, Utah, USA. 233Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah, USA. 234Molecular Oncology Laboratory, Hospital Clinico San Carlos, Madrid, Spain. 235N.N. Petrov Institute of Oncology, St. Petersburg, Russia. 236Latvian Biomedical Research and Study Centre, Riga, Latvia. 237Department of Breast Medical Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
238Clinical Cancer Genetics, University of Texas MD Anderson Cancer Center, Houston, Texas, USA. 239Victorian Breast Cancer Research Consortium Cancer Genetics Laboratory, Peter MacCallum Cancer Center, Melbourne, Victoria, Australia. 240Department of Human Genetics, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands. 241Department of Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands. 242Department of Clinical Genetics, Maastricht University Medical Center, Maastricht, The Netherlands. 243Department of Clinical Genetics, VU University Medical Centre, Amsterdam, The Netherlands. 244Department of Genetics, University of Groningen, University Medical Center, Groningen, The Netherlands. 245Department of Medical Genetics, University Medical Center Utrecht, Utrecht, The Netherlands. 246Department of Clinical Genetics, Academic Medical Center, Amsterdam, The Netherlands. 247Genetic Medicine, Manchester Academic Health Sciences Centre, Central Manchester University Hospitals NHS Foundation Trust, Manchester, UK. 248Leicestershire Clinical Genetics Service, University Hospitals of Leicester NHS Trust, Leicester, UK. 249Oxford Regional Genetics Service, Churchill Hospital, Oxford, UK. 250Academic Unit of Clinical and Molecular Oncology, Trinity College Dublin and St James's Hospital, Dublin, Ireland. 251Laboratory Medicine at Southern General Hospital, Glasgow, UK. 252West Midlands Regional Genetics Service, Birmingham Women's Hospital Healthcare NHS Trust, Edgbaston, Birmingham, UK. 253INSERM U946, Fondation Jean Dausset, Paris, France. 254Service de Génétique, Institut de Cancérologie Gustave Roussy, Villejuif, France. 255INSERM U1052, Centre National de Recherche Scientifique (CNRS) Unité Mixte de Recherche (UMR) 5286, Université Lyon 1, Centre de Recherche en Cancérologie de Lyon, Lyon, France. 256Centre de Génétique, Centre Hospitalier Universitaire Dijon, Université de Bourgogne, Dijon, France.
257Centre Georges François Leclerc, Dijon, France. 258Centre Antoine Lacassagne, Nice, France. 259Unité Mixte de Génétique Constitutionnelle des Cancers Fréquents, Hospices Civils de Lyon, Centre Léon Bérard, Lyon, France. 260Consultation de Génétique, Département de Médecine, Institut de Cancérologie Gustave Roussy, Villejuif, France. 261Unité de Prévention et d'Epidémiologie Génétique, Centre Léon Bérard, Lyon, France. 262Université Lyon 1, CNRS UMR 5558, Lyon, France. 263Department of Molecular Medicine, Sapienza University, Rome, Italy. 264Department of Experimental Oncology, Istituto Europeo di Oncologia, Milan, Italy. 265Cancer Bioimmunotherapy Unit, Centro di Riferimento Oncologico, IRCCS, Aviano, Italy. 266Department of Gynecology and Obstetrics, University Hospital of Schleswig-Holstein and University Kiel, Kiel, Germany. 267Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Leipzig, Germany. 268Department of Obstetrics and Gynecology, University Medical Center, Heinrich-Heine University, Düsseldorf, Germany. 269Institute of Human Genetics, University Hospital of Schleswig-Holstein, University Kiel, Kiel, Germany. 270Department of Human Genetics, University of Heidelberg, Heidelberg, Germany. 271Department of Oncology, Lund University, Lund, Sweden. 272Department of Radiation Sciences, Oncology, Umeå University, Umeå, Sweden. 273Department of Clinical Genetics, Karolinska University Hospital, Stockholm, Sweden. 274Department of Clinical Genetics, University and Regional Laboratories, Lund University Hospital, Lund, Sweden. 275Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania, USA. 276Division of Gynecologic Oncology, North Shore University Health System, University of Chicago, Evanston, Illinois, USA. 277Department of Obstetrics and Gynecology, Ohio State University College of Medicine, Columbus, Ohio, USA. 278Department of Clinical Genetics, Vejle Hospital, Vejle, Denmark.
279Department of Clinical Genetics, Aarhus University Hospital, Aarhus, Denmark. 280Department of Clinical Genetics, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark. 281Department of Oncology, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark. 282Department of Cancer Epidemiology and Genetics, Masaryk Memorial Cancer Institute, Brno, Czech Republic. 283Pathology Queensland, The Royal Brisbane and Women's Hospital, Herston, Brisbane, Queensland, Australia. 284Duke Cancer Institute, Duke University Medical Center, Durham, North Carolina, USA. 285These authors contributed equally to this work. 286These authors jointly directed this work. Correspondence should be addressed to S.E.B. (stig.egil.bojesen@regionh.dk), G.C.-T. (Georgia.Trench@qimr.edu.au) or A.M.D. (amd24@medschl.cam.ac.uk).


ONLINE METHODS

SNP selection and genotyping. Most SNPs were genotyped on the iCOGS custom array36,37,71. SNPs at 5p15.33 (Build 36 positions 1,280,000–1,415,000; Build 37 positions 1,227,693–1,361,969) were selected on the basis of published cancer associations, from the March 2010 release of the 1000 Genomes Project35. These included all known SNPs with MAF > 0.02 in Europeans and r² > 0.1 with the then-known cancer-associated SNPs (rs402710 (ref. 57) and/or rs3816659 (ref. 58)), plus a tagging set for all known SNPs in the linkage disequilibrium blocks encompassing the genes in the region (SLC6A18, TERT and CLPTM1L). An additional 30 SNPs in TERT were selected through a telomere length candidate gene approach. In total, 134 SNPs were selected, 121 of which were successfully manufactured; 110 of those passed quality control36 in BCAC and CIMBA, and 108 passed quality control in OCAC (Supplementary Tables 1–3). After genotyping, these SNPs were complemented with 22 SNPs, selected from the October 2010 release of the 1000 Genomes Project to improve coverage. These were genotyped in two BCAC studies, SEARCH72 and CCHS73, using a Fluidigm array according to the manufacturer's instructions. To improve SNP density further, comprehensive genotype data for the locus were imputed for all subjects on the basis of the January 2012 1000 Genomes Project release. The genotype imputation process is described in refs. 36–38. All participants provided written informed consent. Ethical approval for each study/consortium is described in detail in refs. 36–38.

Samples and quality control. Study characteristics, iCOGS methodology and quality control for cancer risk analyses are detailed elsewhere36–38.
We measured telomere length in 6,766 control samples from the SEARCH study; 1,569 of these were accrued by SEARCH itself36, 793 were collected as part of the Sisters in Breast Screening (SIBS) study15, and 4,404 were cancer-free participants in the European Prospective Investigation into Cancer (EPIC)-Norfolk study19. We also measured telomere length in 8,841 participants in CCHS73,74 and in 38,145 participants in the Copenhagen General Population Study (CGPS)75,76. Genotype clusters were visually inspected for the most strongly associated SNPs (Supplementary Fig. 2). For all studies, ancestry was assigned using HapMap (release 22) genotype data for European, African and Asian populations as reference (for BCAC and CIMBA, using multidimensional scaling; for OCAC, using LAMP77). All CIMBA analyses were restricted to individuals of European ancestry. For BCAC, separate estimates for individuals of east Asian and African-American ancestry were also derived. For OCAC, limited analyses of non-European ancestry groups were also performed. A subset of BCAC and OCAC cases and controls was used in previous breast and ovarian cancer association studies of individual SNPs78. However, the associations with the key SNPs (rs10069690, rs2736108 and rs7705526) remained significant after excluding this subset of cases and controls from analysis, demonstrating similar ORs. Telomere length measurement. Telomere length was measured in SEARCH using a modified version of the protocol described elsewhere19,79. Twelve percent of samples were run in duplicate. Failed PCR reactions were not repeated. Telomere length was measured in CCHS and CGPS with a modified version of the protocol described elsewhere50,80. Each individual was measured in quadruplicate. After exclusion of outliers, average cycle threshold (CT) values of the remaining samples were calculated. Failed measurements were repeated up to twice. 
For meta-analysis, telomere length measurements from SEARCH were converted to the same scale as that used for the CCHS and CGPS measurements on the basis of parameters from the linear regression between corresponding 5-percentile groups (including the 5th, 10th, 15th, 20th, 25th, 30th, 35th, 40th, 45th, 50th, 55th, 60th, 65th, 70th, 75th, 80th, 85th, 90th, 95th, 97th and 98th percentiles) in each 10-year age group of women from CCHS and SEARCH (Supplementary Fig. 7). This measure of telomere length was used for all the analyses and then converted into fold change (RTL) to aid interpretation (Supplementary Fig. 7).

Statistical analyses. SNP associations with telomere length were evaluated using linear regression to model the fold change in telomere length per minor allele, adjusted for age, 384-well plate, sex, 7 principal components and study. Each SNP was coded as the number of minor alleles (0, 1 or 2 for genotyped SNPs and the inferred genotype for imputed SNPs). The test of association was based on the 1-degree-of-freedom trend test statistic. We also performed separate analyses (SEARCH, CCHS females, CCHS males, CGPS females and CGPS males) and combined the parameter estimates in a fixed-effect meta-analysis in STATA (StataCorp). Associations with breast and ovarian cancer risks in BCAC and OCAC were evaluated by comparing genotype frequencies in cases and controls using unconditional logistic regression. Analyses were adjusted for study and for seven principal components in BCAC36 and five principal components in OCAC37. Nine OCAC studies with case-only genotype data were paired with case-control studies from similar geographic regions, resulting in 34 analysis study strata. The principal analysis fitted each SNP as an allelic dose and tested for association using a 1-degree-of-freedom trend test, but genotype-specific risks were also obtained. Associations between genotypes and breast cancer risk in CIMBA studies (BRCA1 mutation carriers) were evaluated using a 1-degree-of-freedom per-allele trend score test, based on modeling the retrospective likelihood of the observed genotypes conditional on breast cancer phenotypes81. To allow for non-independence among related individuals, an adjusted version of the score test was used in which the variance of the score was derived, taking into account the correlation between the genotypes by estimating the kinship coefficient for each pair of individuals using the available genotype data82. Per-allele HR estimates were obtained by maximizing the retrospective likelihood. All analyses were stratified by country of residence. US and Canadian strata were further stratified on the basis of reported Ashkenazi Jewish ancestry. Conditional analyses were performed to identify SNPs independently associated with each phenotype. To identify the most parsimonious model, all SNPs with marginal P value < 0.001 were included in forward selection regression analyses with a threshold for inclusion of P < 1 × 10⁻⁴ and with terms for age (for telomere length only), principal components and study.
Similarly, forward selection Cox regression analysis was performed for BRCA1 mutation carriers, stratified by country of residence, using the same P-value thresholds. This approach provides valid association tests, although the estimates can be biased81,83. Parameter estimates for the most parsimonious model were obtained using the retrospective likelihood approach.

FACS. Normal breast tissue was donated by women undergoing reduction mammoplasty surgery. These individuals provided written consent, and all work was performed with full local institutional human ethics approval. Tissue was dissociated as described previously84. Cells were prepared for flow cytometry as described previously42 by staining with a cocktail of Lin+ markers (CD31-PE, CD45-PE and CD235a-PE), EpCAM-FITC, CD49f-PE-Cy5 and Sytox Blue. Cells were then processed by a BD FACSAria II Cell Sorter, and live cells negative for immunostaining of Lin+ markers were sorted into four subpopulations on the basis of their EpCAM-FITC and CD49f-PE-Cy5 fluorescence.

FAIRE analysis. Cell pellets derived from FACS fractionation of breast tissue samples were cross-linked in 1% formaldehyde and lysed in 200 µl of Tris-buffered 1% SDS lysis buffer containing protease inhibitors. Lysates were sonicated using a QSONICA Model Q125 Ultra Sonic Processor to shear chromatin to fragments of 200 bp to 1 kb in length. Insoluble cell material was removed through centrifugation, and supernatants were equally divided into 100-µl input and FAIRE samples. Input samples were incubated overnight at 65 °C to reverse cross-linking. All samples were purified through two rounds of phenol-chloroform extraction, and DNA was recovered through ethanol precipitation and resuspended in water for use as PCR template. Sequences for PCR primers are listed in Supplementary Table 11.

Plasmid construction and luciferase assays. TERT promoter variants were introduced into pGL3-TERT-3915 (ref. 43) by site-directed mutagenesis (Agilent Technologies).
TERT PRE-A (hg19; chr. 5: 1,284,900–1,287,087) and PRE-B (chr. 5: 1,279,401–1,282,763) were PCR amplified using KAPA HiFi DNA polymerase (Geneworks) and cloned into pGL3-TERT-3915 or the vector encoding the minor alleles of rs2736107, rs2736108 and rs2736107. Individual SNPs were incorporated using overlap extension PCR. Sequences for PCR primers are listed in Supplementary Table 11. Cells were transfected with equimolar amounts of luciferase reporter plasmids and 50 ng of pRLTK using siPORT NeoFX Transfection Agent (Ambion), according to the manufacturer's instructions, and harvested after 48 h. Luminescence activity was measured with a Wallac Victor3 1420 multilabel counter, and data from three replicates per construct were analyzed by one-way ANOVA with post-hoc Dunnett's tests.
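Returning to the fixed-effect meta-analysis described under Statistical analyses: combining per-stratum regression estimates in this way is conventionally done with inverse-variance weighting. The following is a minimal Python sketch of that general technique, not the STATA procedure used in the study. The function name `fixed_effect_meta` and the numbers in the usage note are hypothetical and purely illustrative.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance fixed-effect pooling of per-study effect estimates.

    estimates  -- per-study effect sizes (e.g. per-allele regression betas)
    std_errors -- matching standard errors, all strictly positive
    Returns (pooled estimate, pooled standard error, z statistic, two-sided P).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need equal-length, non-empty inputs")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value
```

With made-up inputs such as `fixed_effect_meta([0.1, 0.3], [0.2, 0.2])`, the two equally precise studies receive equal weight, so the pooled estimate is their simple mean and the pooled standard error shrinks by a factor of √2.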


doi:10.1038/ng.2566


Mini-gene construction and quantitative RT-PCR analysis. TERT intron 4 was synthesized by GenScript and subcloned into pIRES-TERT44. The minor alleles at rs10069690 and rs2242652 were introduced by site-directed mutagenesis (Agilent Technologies). The resultant plasmids, designated pIRES-TERTint4-WT (wild-type intron 4), pIRES-TERTint4-rs10069690, pIRES-TERTint4-rs2242652 and pIRES-TERTint4-DM (minor alleles at both sites), were transfected into cells using siPORT NeoFX Transfection Agent, and cells were harvested after 24 h. Total RNA was extracted using the RNeasy Mini kit (Qiagen) and digested with DNase I (Invitrogen). cDNA was synthesized from 1 µg of RNA by random priming using SuperScript III reverse transcriptase (Invitrogen). Samples were screened for the presence of TERT splice variants by RT-PCR. Sequences for PCR primers are listed in Supplementary Table 11.

Molecular correlations at the 5p15.33 locus. For each gene within 1 Mb of the TERT locus, we performed the following assays: (i) gene expression analysis in ovarian cancer cell lines (n = 50) compared to ovarian surface epithelial and fallopian tube secretory cell lines (n = 73) and tissues from high-grade serous ovarian cancers; (ii) methylation analysis in high-grade serous ovarian cancers compared to normal tissues and methylation quantitative trait locus (mQTL) analysis; and (iii) expression quantitative trait locus (eQTL) analysis to evaluate genotype–gene expression associations in normal high-grade serous ovarian cancer precursor tissues. We also evaluated these genes in silico in the somatic data from TCGA49. We profiled the spectrum of noncoding regulatory elements in ovarian surface epithelial and fallopian tube secretory cell lines using a combination of FAIRE sequencing (FAIRE-seq40) and RNA sequencing (RNA-seq).


71. Gaudet, M.M. et al. Identification of a BRCA2-specific modifier locus at 6p24 related to breast cancer risk. PLoS Genet. 9, e1003173 (2013).
72. Azzato, E.M. et al. Association between a germline OCA2 polymorphism at chromosome 15q13.1 and estrogen receptor–negative breast cancer survival. J. Natl. Cancer Inst. 102, 650–662 (2010).
73. Bojesen, S.E., Tybjaerg-Hansen, A. & Nordestgaard, B.G. Integrin β3 Leu33Pro homozygosity and risk of cancer. J. Natl. Cancer Inst. 95, 1150–1157 (2003).
74. Allin, K.H., Bojesen, S.E. & Nordestgaard, B.G. Baseline C-reactive protein is associated with incident cancer and survival in patients with cancer. J. Clin. Oncol. 27, 2217–2224 (2009).
75. Allin, K.H. et al. C-reactive protein and the risk of cancer: a mendelian randomization study. J. Natl. Cancer Inst. 102, 202–206 (2010).
76. Zacho, J. et al. Genetically elevated C-reactive protein and ischemic vascular disease. N. Engl. J. Med. 359, 1897–1908 (2008).
77. Sankararaman, S. et al. Estimating local ancestry in admixed populations. Am. J. Hum. Genet. 82, 290–303 (2008).
78. Terry, K.L. et al. Telomere length and genetic variation in telomere maintenance genes in relation to ovarian cancer risk. Cancer Epidemiol. Biomarkers Prev. 21, 504–512 (2012).
79. Cawthon, R.M. Telomere measurement by quantitative PCR. Nucleic Acids Res. 30, e47 (2002).
80. Cawthon, R.M. Telomere length measurement by a novel monochrome multiplex quantitative PCR method. Nucleic Acids Res. 37, e21 (2009).
81. Barnes, D.R. et al. Evaluation of association methods for analysing modifiers of disease risk in carriers of high-risk mutations. Genet. Epidemiol. 36, 274–291 (2012).
82. Antoniou, A.C. et al. A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor–negative breast cancer in the general population. Nat. Genet. 42, 885–892 (2010).
83. Antoniou, A.C. et al. A weighted cohort approach for analysing factors modifying disease risks in carriers of high-risk susceptibility genes. Genet. Epidemiol. 29, 1–11 (2005).
84. Eirew, P. et al. A method for quantifying normal human mammary epithelial stem cells with in vivo regenerative ability. Nat. Med. 14, 1384–1389 (2008).


letters

Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom genotyping array
Prostate cancer is the most frequently diagnosed cancer in males in developed countries. To identify common prostate cancer susceptibility alleles, we genotyped 211,155 SNPs on a custom Illumina array (iCOGS) in blood DNA from 25,074 prostate cancer cases and 24,272 controls from the international PRACTICAL Consortium. Twenty-three new prostate cancer susceptibility loci were identified at genome-wide significance (P < 5 × 10⁻⁸). More than 70 prostate cancer susceptibility loci, explaining ~30% of the familial risk for this disease, have now been identified. On the basis of combined risks conferred by the new and previously known risk loci, the top 1% of the risk distribution has a 4.7-fold higher risk than the average of the population being profiled. These results will facilitate population risk stratification for clinical studies.

Epidemiological studies provide strong evidence for genetic predisposition to prostate cancer. Most susceptibility loci identified thus far are common, low-penetrance variants found through genome-wide association studies (GWAS; reviewed in ref. 1). Fifty-four loci have been identified so far1–6. Because the risks associated with common susceptibility alleles are modest (per-allele odds ratios, ORs, ranging from 1.10 to 1.25), it is likely that other predisposition loci for prostate cancer have been missed by previous studies and that such loci should be detectable by studies with larger sample sizes7. Here, we report the findings from an extensive follow-up of GWAS conducted as part of a collaborative study with the Breast Cancer Association Consortium (BCAC), Ovarian Cancer Association Consortium (OCAC) and The Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA) as part of the COGS initiative.
We first conducted a meta-analysis of 4 GWAS conducted in populations of European ancestry that included 11,085 cases and 11,463 controls: UK/Australia, Cancer Genetic Markers of Susceptibility (CGEMS); Cancer of the Prostate in Sweden (CAPS) and the Breast and Prostate Cancer Cohort Consortium (BPC3). Genotype data from these GWAS were imputed using the HapMap 2 CEU panel (Utah residents of Northern and Western European ancestry) as a reference, and combined tests of association were then performed for ~2.6 million SNPs (Online Methods). From this meta-analysis, we selected 74,001 SNPs showing evidence of association with overall prostate cancer, aggressive prostate cancer or prostate cancer diagnosed at <55 years of age (Online Methods). Specifically, we included all SNPs with
A full list of authors and affiliations appears at the end of the paper. Received 10 May 2012; accepted 28 January 2013; published online 27 March 2013; doi:10.1038/ng.2560

significant association at P < 0.01 for overall prostate cancer. These SNPs were genotyped as part of a custom array that included 211,155 SNPs (the iCOGS chip), 85,278 of which were specifically chosen for their potential relevance to prostate cancer (74,001 were from GWAS top hits as described, 13,739 were from fine mapping of known susceptibility regions at the time of the chip design and 1,398 were from candidate gene studies in key pathways (for example, hormone metabolism, HOX genes, the cell cycle and DNA repair; Fig. 1 and Online Methods); some SNPs were in more than one category). The results of the GWAS component are presented here. The details of the iCOGS array can be found on the COGS website (see URLs). The iCOGS array was used for the genotyping of 25,074 prostate cancer cases and 24,272 controls from 32 studies participating in the PRACTICAL Consortium (Online Methods). Of these, 39,337 samples of European ancestry and 1,192 of African-American or mixed African origin passed quality control and did not overlap with the GWAS sample sets. Only the results from samples of European ancestry are reported here (19,662 prostate cancer cases and 19,715 controls; Supplementary Table 1 and Supplementary Note). Of the 201,598 SNPs that passed quality control, 72,157 were selected for replication of the combined GWAS (Online Methods). Associations between SNP genotypes and prostate cancer were evaluated by logistic regression, adjusted for study and six principal components. Evidence for association was assessed using a 1-degree-of-freedom test for trend in risk by allele dose. When considering those SNPs not selected for association with prostate cancer, there was little evidence of inflation in the test statistics (λ = 1.136, equivalent to λ1,000 = 1.008). There was, however, clear evidence of an excess of significant association for SNPs selected for replication of the prostate cancer GWAS (Supplementary Fig. 1).
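The rescaling of the genomic inflation factor to a study of 1,000 cases and 1,000 controls can be sketched as follows. This is a minimal illustration, assuming the widely used convention that the excess inflation scales with the inverse sample sizes; the function name and the choice of the European-ancestry replication counts as inputs are assumptions, since the text does not state which counts were used.

```python
def lambda_1000(lam, n_cases, n_controls):
    """Rescale a genomic inflation factor to 1,000 cases and 1,000 controls."""
    # Assumption: inflation above 1 shrinks in proportion to
    # (1/n_cases + 1/n_controls) relative to (1/1000 + 1/1000).
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale


# Inflation factor from the text, with the European-ancestry replication
# counts (19,662 cases and 19,715 controls) assumed as the sample sizes.
print(round(lambda_1000(1.136, 19662, 19715), 3))
```

With these inputs the sketch gives a value of about 1.007, close to the reported 1.008; the small difference may reflect the exact sample counts or rounding used by the authors.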
Results from the iCOGS replication stage were then combined with those from the GWAS to provide overall tests for association. After exclusion of SNPs in regions containing previously known loci associated with prostate cancer, 23 SNPs in 23 regions showed evidence of association in the combined GWAS and iCOGS replication stage analysis at P < 5 × 10^-8 (Fig. 2 and Table 1). There was no strong evidence for heterogeneity in the per-allele ORs between studies (Supplementary Fig. 2). All alleles are common (minor allele frequencies of 8–50%; Table 1) and conferred estimated per-allele ORs from 1.06 to 1.15. All but two of the autosomal SNPs associated with prostate cancer risk showed a pattern of association consistent with a log-additive model, as observed for most common cancer

3.5 × 10^-8). WNT, FGF and IGF signaling also showed significant levels of enrichment (P = 1.69 × 10^-4 to 9.41 × 10^-5). The overall inflation in the test statistics for those SNPs selected for GWAS replication suggests that the number of susceptibility loci may be much larger. To address this possibility more formally, we identified 22,662 SNPs selected for replication of the prostate cancer GWAS that were uncorrelated (r2 < 0.1 for any pair) and examined the directions of the estimated ORs in the iCOGS replication data set. The estimated effects were in the same direction as in the GWAS for 12,278 SNPs and in the opposite direction for 10,384 SNPs. On the basis of this analysis, an estimated 1,894 (95% CI = 1,600–2,188) of the selected SNPs, the excess of concordant over discordant directions, reflect true associations with disease. We have found 23 new loci associated with prostate cancer, 16 of which are associated with aggressive as well as non-aggressive disease, although none of the new loci are associated exclusively with the latter. This finding is, however, notable, as aggressive disease requires radical treatment, and, previously, the loci associated with prostate cancer were associated exclusively with non-aggressive disease, which is less likely to require clinical intervention. All of the newly associated loci lie in linkage disequilibrium (LD) blocks that include plausible causative genes (Fig. 3a–d and Supplementary Fig. 3). LD regions vary greatly in the genome; here, we defined LD blocks as regions with SNPs with r2 > 0.2 or took a 500-kb window around the lead SNPs. The list of genes in these 23 new susceptibility regions is given in Supplementary Table 6. Fifteen of the 23 SNPs are either intronic (12 SNPs) or in the promoter region of a gene (3 SNPs). As described below, there are data in the literature that suggest that two of the newly associated SNPs impart direct functional effects that result in allele-specific alterations to the expression of the associated genes.
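The estimate of 1,894 true associations can be read as the excess of concordant over discordant effect directions: truly associated SNPs are expected to keep their direction in replication, whereas null SNPs agree only about half the time. A minimal sketch of that arithmetic is given here. The binomial standard error used for the interval is an assumption, as the text does not say how the interval was derived; it is used because it yields an interval matching the quoted one.

```python
import math


def excess_concordance(n_same, n_opposite, z=1.96):
    """Estimate the number of true associations from sign concordance."""
    n_total = n_same + n_opposite
    estimate = n_same - n_opposite
    # Assumed binomial standard error of (2 * n_same - n_total).
    p_same = n_same / n_total
    se = 2.0 * math.sqrt(n_total * p_same * (1.0 - p_same))
    return estimate, estimate - z * se, estimate + z * se


# Counts quoted in the text for the 22,662 uncorrelated SNPs.
estimate, lower, upper = excess_concordance(12278, 10384)
print(estimate, round(lower), round(upper))  # 1894 1600 2188
```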
These reported allele-specific effects raise the possibility that the SNPs could themselves represent causative variants, although further fine-mapping studies and analysis of expression in primary prostate tissue would be needed to confirm this. Of the new loci identified in this study, SNP rs4245739 at 1q32 is situated in the 3′ UTR of the MDM4 gene, 32 bp downstream of the stop codon. MDM4 is a negative regulator of TP53, thereby acting to inhibit cell cycle arrest and apoptosis, and is frequently overexpressed in a number of tumor types. rs4245739 is correlated with rs7556371 (r2 = 0.89),

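[Figure 1: Venn diagram of the three SNP groups. GWAS only, 70,247; fine mapping only, 10,470; candidates only, 793; GWAS and fine mapping, 3,163; GWAS and candidates, 499; fine mapping and candidates, 14; all three groups, 92.]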

Figure 1 Composition of the prostate part of the iCOGS chip. There were 74,001 SNPs chosen from a meta-analysis of GWAS (Online Methods), 13,739 SNPs for fine mapping of previously published regions before the development of the iCOGS chip and 1,398 candidate SNPs. Shown is the overlap between the three groups.
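The seven region counts of the Figure 1 Venn diagram can be checked against the category totals given in the caption and the 85,278 prostate-specific SNPs quoted in the text. The sketch below does this arithmetic; the assignment of each overlap count to a particular pair of groups is an inference from the totals rather than something the extracted figure states directly.

```python
# Region counts from the Figure 1 Venn diagram. Which overlap belongs to
# which pair of groups is inferred from the category totals (an assumption).
regions = {
    ("GWAS",): 70247,
    ("Fine mapping",): 10470,
    ("Candidates",): 793,
    ("GWAS", "Fine mapping"): 3163,
    ("GWAS", "Candidates"): 499,
    ("Fine mapping", "Candidates"): 14,
    ("GWAS", "Fine mapping", "Candidates"): 92,
}


def category_total(category):
    """Sum every Venn region that belongs to the given category."""
    return sum(count for groups, count in regions.items() if category in groups)


for name in ("GWAS", "Fine mapping", "Candidates"):
    print(name, category_total(name))
print("Distinct SNPs", sum(regions.values()))
```

Under this assignment the totals come to 74,001, 13,739 and 1,398, and the union to 85,278, consistent with the caption and the main text.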


susceptibility alleles. For rs11902236 on chromosome 2, the estimated OR in the iCOGS replication stage for the heterozygote genotype was 1.04 (95% confidence interval (CI) = 0.99–1.08), which is smaller than expected under a log-additive model (P = 0.05), and, for rs7141529 on chromosome 14, the estimated OR in the iCOGS replication stage for the heterozygote genotype was 1.16 (95% CI = 1.10–1.21), which is greater than expected under a log-additive model (P = 0.004). Aggressive disease was defined as that with Gleason score ≥8, prostate-specific antigen (PSA) >100 ng/ml, disease stage of 'distant' (outside the pelvis) or death from prostate cancer. When aggressive disease was thus defined, three of the SNPs (rs3771570, rs2273669 and rs1270884) showed a significant difference in per-allele OR between aggressive and non-aggressive disease, in each case with a higher OR for non-aggressive disease and little or no association with aggressive disease (Supplementary Table 2). A similar pattern of association with respect to aggressive disease has been observed for SNPs in the KLK3 region8. The majority of SNPs, however, showed clear association when analysis was restricted to aggressive disease (for example, 13 SNPs showed significant associations at P < 0.01 and 16 at P < 0.05), and, for 22 of the 23 SNPs, the estimated ORs were in the same direction for aggressive and non-aggressive disease. Two SNPs, rs6869841 and rs1270884, were associated with PSA levels (Supplementary Table 3). Two of the SNPs showed a significantly higher OR in cases with a first- or second-degree relative with prostate cancer (rs3771570 and rs11135910; Supplementary Table 4). Six SNPs showed a trend in OR with respect to age at diagnosis, with a higher OR at younger ages (rs3771570, rs7611694, rs6869841, rs3096702, rs684232 and rs7241993; Supplementary Table 5). This age effect has been seen previously for four prostate cancer susceptibility SNPs9.
We have also conducted an analysis of possible pathway enrichment for the previously reported susceptibility regions and those newly reported by extracting all genes overlapping a 500-kb or a 1-Mb window flanking each lead SNP (72 regions, 589 or 960 genes, respectively). GeneGo pathway enrichment analysis was used to identify any canonical pathways that were over-represented within this gene set. The most strongly associated pathways identified (false discovery rate < 0.05) were cell adhesion and extracellular matrix (ECM) remodeling (P = 1.31 × 10^-6 to 3.6 × 10^-9) and transcriptional regulation by the androgen receptor (P = 3.5 × 10^-6 to
Figure 2 Manhattan plot of associations for new iCOGS loci (x axis, chromosome; y axis, −log10 (P value)). Previously reported loci are not included. The blue line represents P = 1 × 10^-5, and the red line represents P = 5 × 10^-8, which is the genome-wide significance level.

Table 1 Summary results for 23 SNPs identified as associated in samples of European ancestry

| Marker | Chr. | Position | Allele | MAFa | Per-allele ORb (95% CI) | P value, combined GWASc | P value, iCOGS replication | P value, combined | Candidate gene |
|---|---|---|---|---|---|---|---|---|---|
| rs1218582 | 1 | 153100807 | AG | 0.45 | 1.06 (1.03–1.09) | 6.0 × 10^-5 | 5.2 × 10^-5 | 1.95 × 10^-8 | KCNN3 |
| rs4245739 | 1 | 202785465 | AC | 0.25 | 0.91 (0.88–0.95) | 2.6 × 10^-5 | 1.7 × 10^-7 | 2.01 × 10^-11 | MDM4 |
| rs11902236 | 2 | 10035319 | GA | 0.27 | 1.07 (1.03–1.10) | 1.8 × 10^-5 | 1.1 × 10^-4 | 2.84 × 10^-8 | TAF1B, GRHL1 |
| rs3771570 | 2 | 242031537 | GA | 0.15 | 1.12 (1.08–1.17) | 2.0 × 10^-2 | 2.9 × 10^-8 | 5.22 × 10^-9 | FARP2 |
| rs7611694 | 3 | 114758314 | AC | 0.41 | 0.91 (0.88–0.93) | 7.1 × 10^-4 | 6.3 × 10^-11 | 3.80 × 10^-13 | SIDT1 |
| rs1894292 | 4 | 74568022 | GA | 0.48 | 0.91 (0.89–0.94) | 9.8 × 10^-5 | 9.9 × 10^-10 | 5.02 × 10^-13 | AFM, RASSF6 |
| rs6869841 | 5 | 172872032 | GA | 0.21 | 1.07 (1.04–1.11) | 1.2 × 10^-4 | 7.8 × 10^-5 | 4.63 × 10^-8 | BOD1 (FAM44B) |
| rs3096702 | 6 | 32300309 | GA | 0.40 | 1.07 (1.04–1.10) | 1.0 × 10^-4 | 1.1 × 10^-5 | 4.78 × 10^-9 | NOTCH4 |
| rs2273669 | 6 | 109391882 | AG | 0.15 | 1.07 (1.03–1.11) | 1.6 × 10^-7 | 1.0 × 10^-3 | 7.91 × 10^-9 | ARMC2, SESN1 |
| rs1933488 | 6 | 153482772 | AG | 0.41 | 0.89 (0.87–0.92) | 1.9 × 10^-5 | 2.6 × 10^-14 | 4.34 × 10^-18 | RGS17 |
| rs12155172 | 7 | 20961016 | GA | 0.23 | 1.11 (1.07–1.15) | 3.1 × 10^-5 | 3.5 × 10^-9 | 4.95 × 10^-13 | SP8 |
| rs11135910 | 8 | 25948059 | GA | 0.16 | 1.11 (1.07–1.16) | 7.1 × 10^-5 | 2.6 × 10^-7 | 8.16 × 10^-11 | EBF2 |
| rs3850699 | 10 | 104404211 | AG | 0.29 | 0.91 (0.89–0.94) | 3.5 × 10^-3 | 2.6 × 10^-8 | 4.87 × 10^-10 | TRIM8 |
| rs11568818 | 11 | 101906871 | AG | 0.44 | 0.91 (0.88–0.94) | 1.0 × 10^-2 | 2.4 × 10^-10 | 1.56 × 10^-11 | MMP7 |
| rs1270884 | 12 | 113169954 | GA | 0.49 | 1.07 (1.04–1.10) | 4.1 × 10^-7 | 1.2 × 10^-5 | 6.75 × 10^-11 | TBX5 |
| rs8008270 | 14 | 52442080 | GA | 0.18 | 0.89 (0.86–0.93) | 3.1 × 10^-7 | 9.4 × 10^-9 | 1.78 × 10^-14 | FERMT2 |
| rs7141529 | 14 | 68196497 | AG | 0.50 | 1.09 (1.06–1.12) | 3.3 × 10^-3 | 1.3 × 10^-8 | 2.77 × 10^-10 | RAD51B |
| rs684232 | 17 | 565715 | AG | 0.36 | 1.10 (1.07–1.14) | 4.7 × 10^-6 | 2.2 × 10^-10 | 5.17 × 10^-15 | VPS53, FAM57A |
| rs11650494 | 17 | 44700185 | GA | 0.08 | 1.15 (1.09–1.22) | 1.5 × 10^-3 | 3.4 × 10^-7 | 1.97 × 10^-9 | HOXB13, PRAC, SPOP, ZNF652 |
| rs7241993 | 18 | 74874961 | GA | 0.30 | 0.92 (0.89–0.95) | 1.6 × 10^-3 | 3.6 × 10^-7 | 2.19 × 10^-9 | SALL3 |
| rs2427345 | 20 | 60449006 | GA | 0.37 | 0.94 (0.91–0.97) | 4.4 × 10^-4 | 2.1 × 10^-5 | 3.64 × 10^-8 | GATA5, CABLES2 |
| rs6062509 | 20 | 61833007 | AC | 0.30 | 0.89 (0.86–0.92) | 1.6 × 10^-4 | 4.1 × 10^-13 | 3.57 × 10^-16 | ZGPAT |
| rs2405942 | X | 9774135 | AG | 0.21 | 0.88 (0.83–0.92) | 1.4 × 10^-4 | 2.8 × 10^-7 | 2.37 × 10^-10 | SHROOM2 |

Chr., chromosome. aAllele frequency of the second allele in iCOGS replication stage. bPer-allele OR in iCOGS replication stage for the second allele. cCombined GWAS: stage 1 and 2 UK/Australia, CGEMS, CAPS and BPC3 results.
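As an illustration of how the per-allele ORs in Table 1 combine under a multiplicative (log-additive) model, the sketch below computes a relative risk for one genotype profile, normalized so that the population average is 1. It is a simplified example, not the authors' risk model: it uses only three SNPs from Table 1, treats ORs as relative risks, assumes Hardy-Weinberg equilibrium and independence between SNPs, and the genotype profile is hypothetical.

```python
import math

# Per-allele ORs and second-allele frequencies for three SNPs from Table 1.
SNPS = {
    "rs4245739": {"or": 0.91, "freq": 0.25},
    "rs11568818": {"or": 0.91, "freq": 0.44},
    "rs11650494": {"or": 1.15, "freq": 0.08},
}


def relative_risk(dosages):
    """Risk relative to the population average for a set of allele counts."""
    log_rr = 0.0
    for snp, dose in dosages.items():
        odds_ratio = SNPS[snp]["or"]
        freq = SNPS[snp]["freq"]
        # Population mean of OR**dose under Hardy-Weinberg equilibrium.
        mean_rr = (1.0 - freq + freq * odds_ratio) ** 2
        log_rr += dose * math.log(odds_ratio) - math.log(mean_rr)
    return math.exp(log_rr)


# Hypothetical profile: number of copies of the second allele at each SNP.
example = {"rs4245739": 0, "rs11568818": 1, "rs11650494": 2}
print(round(relative_risk(example), 2))
```

Dividing by the population mean at each SNP keeps the score on the "relative to the population average" scale used in the text when risk strata are compared.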

which showed some evidence of association with prostate cancer in a candidate gene study10, and with rs1380576 (r2 = 0.86), which has previously been reported to be associated with prostate cancer aggressiveness in a case-only analysis of candidate SNPs in the TP53 pathway11. rs4245739 has been shown to create an illegitimate binding site for miR-191 that results in the downregulation of MDM4 expression12; this is in agreement with our analysis using mirsnpscore13, which predicted that the risk allele creates a binding site for miR-191, miR-887 and miR-3669. However, rs4245739 is also highly correlated with a number of other

MDM4 variants that overlap functional elements identified by the Encyclopedia of DNA Elements (ENCODE) Project13,14. Other analyses using the iCOGS array have found that rs4245739 and correlated SNPs are associated with estrogen receptor (ER)-negative breast cancer15 and breast cancer in BRCA1 mutation carriers16. In addition, the risk allele (C) of rs4245739 has been associated with increased aggressiveness in individuals with ovarian cancer17. rs11568818 at 11q22 lies within a small LD region containing a single gene, MMP7, encoding a matrix metalloproteinase. Matrix
Figure 3 Regional association plots. (a–d) Plots of the four SNPs associated with the MDM4 (a), MMP7 (b) and RAD51B (c) genes and the 17q region containing HOXB13 and ZNF652 (d) detailed in Table 1. Panel titles: (a) Chr. 1: 202,785,465, rs4245739 (P = 2.01 × 10^-11); (b) Chr. 11: 101,906,871, rs11568818 (P = 1.56 × 10^-11); (c) Chr. 14: 68,196,497, rs7141529 (P = 2.77 × 10^-10); (d) Chr. 17: 44,700,185, rs11650494 (P = 1.97 × 10^-9). Axes: position on the chromosome (Mb), −log10 (P value) and recombination rate (cM/Mb). Plots show the genomic regions associated with prostate cancer and the −log10 association P values of SNPs. Also shown are SNP Build 36/hg18 coordinates, recombination rates and genes in the regions. SNP color indicates the strength of LD (r2) with the index SNP. Plots were drawn using LocusZoom command-line options (University of Michigan; see URLs).

metalloproteinases are implicated in metastasis, and elevated MMP7 expression itself has been reported as a potential biomarker for metastatic prostate cancer and poor disease prognosis18. This SNP is situated 181 bp upstream of the transcriptional start site in the promoter region, within an area of high sequence conservation that overlaps strong DNase hypersensitivity and transcription factor binding sites13,14. rs11568818 itself has been established as a functional promoter variant, with the risk allele (A) having been shown to create a binding site for the FOXA2 transcription factor and result in higher MMP7 expression19. Increased expression of MMP7 may represent a plausible mechanism responsible for the greater prostate cancer risk associated with this SNP; rs11568818 is correlated at r2 > 0.5 with only four other variants and seems to be the most likely candidate for a causal variant. rs7141529 at 14q24 lies within the last intron of the longest isoform of RAD51B (also known as RAD51L1). Members of the RAD51 family are involved in the repair of double-stranded DNA breaks by homologous recombination, and their loss is potentially oncogenic. A variant in RAD51B (rs999737) has previously been associated with breast cancer20, and a second breast cancer susceptibility locus in intron 7 has also been identified in the iCOGS replication stage study21. However, there is no correlation between rs7141529 and any of the breast cancer-associated SNPs.

rs11650494 is located at 17q21, a gene-dense locus that contains several genes that have been proposed as potential prostate cancer susceptibility or somatically altered genes, including HOXB13, PRAC, SPOP and ZNF652. rs11650494 is highly correlated with a number of other variants that overlap functional motifs identified in the ENCODE Project13,14. This signal appears to center around the ZNF652 gene, with rs11650494 itself situated downstream of the gene within a long noncoding RNA (lincRNA) sequence. rs7210100 in intron 1 of ZNF652 has previously been identified as a prostate cancer susceptibility variant in African-American men22; however, this variant is rare among individuals of European ancestry, and the correlation between rs11650494 and rs7210100 is modest in the YRI (Yoruba from Ibadan, Nigeria) population (r2 = 0.22), suggesting that rs11650494 represents an independent or European-specific prostate cancer risk association. In addition, ZNF652 has been reported to be highly expressed in the majority of prostate tumors and is associated with higher risk of relapse23. The HOXB13, PRAC and SPOP genes are all situated approximately 500 kb upstream or downstream of rs11650494; however, all are considered candidate prostate cancer genes, and, therefore, the possibility of a trans-regulatory element or locus control region associated with the rs11650494 association signal cannot be excluded. HOXB13 is one of a cluster of homeobox domain-containing genes at this locus.
These genes are essential for vertebrate embryonic development, and HOXB13 is important for normal prostate development and is a key regulator of the response to androgens24. A rare variant in HOXB13 (rs138213197; encoding a p.Gly84Glu alteration) has recently been shown to significantly increase prostate cancer risk, occurring in families with multiple cases of prostate cancer25, and HOXB13 expression levels have been proposed as a marker of prostate cancer26. Analysis of 1,927 cases and 987 control samples from the CAPS study in which both rs11650494 and rs138213197 were genotyped showed that these SNPs are not correlated (r2 = 0.001) and that the OR for rs11650494 was not altered by adjustment for rs138213197 (Supplementary Table 7). SPOP encodes a protein that may modulate the transcriptional repression activities of death-associated protein 6 (encoded by DAXX), which interacts with histone deacetylase, core histones and other histone-associated proteins. SPOP is reported to be frequently mutated in prostate tumors, and it has been suggested that SPOP mutations may anchor a distinct genetic subtype of ETS fusion-negative prostate cancers27. In addition to the presence of plausible candidate genes, most of the 23 newly associated loci harbor several transcription factor binding sites within their LD regions. With the identification of these new loci, 77 susceptibility loci for prostate cancer have now been identified. On the basis of an overall twofold familial relative risk for the first-degree relatives of prostate cancer cases and on the assumption that SNPs combine multiplicatively, the new loci reported here, together with those already known, explain approximately 30% of the familial risk of prostate cancer. Taking into consideration these SNPs and this risk model, the top 1% of men in the highest risk stratum have a 4.7-fold greater risk relative to the population average, and the top 10% of men have a 2.7-fold greater risk.
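The risk-stratum figures above can be approximated with a simple log-normal polygenic model. The sketch below is an illustration under stated assumptions rather than the authors' calculation (their Online Methods are not reproduced here): log relative risk is taken to be normally distributed, the familial relative risk to first-degree relatives is taken to equal exp(σ²/2) for a log-risk variance σ², and the known loci are taken to account for 30% of the log familial relative risk of 2. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def mean_relative_risk_top(fraction, familial_rr=2.0, explained=0.30):
    """Mean risk, relative to the population average, in the top fraction."""
    # Log-risk standard deviation attributable to the known loci, assuming
    # familial relative risk = exp(sigma**2 / 2) for first-degree relatives.
    sigma = math.sqrt(2.0 * explained * math.log(familial_rr))
    # Standard normal quantile that cuts off the top fraction.
    z = NormalDist().inv_cdf(1.0 - fraction)
    # Mean of a log-normal risk (population mean 1) above that quantile.
    return (1.0 - NormalDist().cdf(z - sigma)) / fraction


print(round(mean_relative_risk_top(0.01), 1))
print(round(mean_relative_risk_top(0.10), 1))
```

Under these assumptions the sketch gives values of roughly 4.6 for the top 1% and 2.6 for the top 10%, in line with the 4.7-fold and 2.7-fold figures quoted in the text.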
For comparison, the 4.7-fold risk estimate for the top 1% is similar to that conferred by deleterious mutations in BRCA2 (ref. 28), and such mutation carriers are undergoing targeted screening in trials, for example, in the IMPACT (Identification of Men with a genetic predisposition to ProstAte Cancer: Targeted screening in men at higher genetic risk and controls) Study (see URLs). The SNP-based prostate cancer risk profile now available should therefore be able to distinguish men at a clinically meaningful level of risk. To evaluate the combined effect of the loci associated with prostate cancer risk, we included 68 of the known loci in a logistic regression (59 of which were on iCOGS and 9 for which a surrogate with r2 > 0.76 was available). The parameters from this model were used to generate polygenic risk scores (Online Methods). On the basis of these scores, the estimated risk for men in the top 1% of the risk distribution was 4.4-fold greater than the population average risk (Supplementary Table 8), very close to the theoretical estimate predicted under a simple polygenic model (4.7-fold). Furthermore, under a polygenic genetic risk model29, an unaffected man aged 50 who has a father with prostate cancer diagnosed at 60 years of age would have a predicted lifetime risk of prostate cancer from his family history alone of just over 20%. However, if family history is taken into consideration along with the explicit effects of all known common prostate cancer susceptibility alleles, this predicted risk would rise to just over 60% if he were in the top 1% of the known polygenic risk score distribution (A. Antoniou, personal communication). Such differences in predicted risks will be important for facilitating risk stratification in targeted screening and prevention programs.
URLs. COGS website, http://ec.europa.eu/research/health/medicalresearch/cancer/fp7-projects/cogs_en.html; IMPACT Study, http://www.impact-study.co.uk/; SNAP plots from the University of Michigan, http://csg.sph.umich.edu/locuszoom/; SNPTEST, https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html; MACH 1.0, http://www.sph.umich.edu/csg/abecasis/MACH/; PRACTICAL, http://ccge.medschl.cam.ac.uk/consortia/practical/index.html; GeneGo (now Thomson Reuters), http://thomsonreuters.com/products_services/science/systems-biology/; CGEMS Project, http://dceg.cancer.gov/research/how-we-study/genomic-studies/cgems-summary; BPC3, http://epi.grants.cancer.gov/BPC3/cohorts.html; CAPS, http://ki.se/ki/jsp/polopoly.jsp?d=13809&a=29862&l=en; SCCS, http://www.southerncommunitystudy.org/.
Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.
Acknowledgments
Acknowledgments are detailed in the Supplementary Note.
AUTHOR CONTRIBUTIONS
R.A.E. and D.F.E. designed the study. R.A.E. is principal investigator of PRACTICAL. D.F.E. is Scientific Director of the COGS initiative. Z.K.-J. is co-investigator of PRACTICAL. R.A.E., D.F.E., Z.K.-J. and A.A.A.O. wrote the manuscript; the following named coauthors commented on the manuscript. A.A.A.O. and D.F.E. performed the statistical analyses; S.B. collated the data set. J.D. managed the database. Z.K.-J., E.J. Saunders, D.A.L. and M.T. coordinated sample collation and quality control for iCOGS PRACTICAL genotyping. S.J.-L. carried out pathway analysis and constructed regional plots, and T. Dadaev, K.G., M. Guy, R.A.W., E.J. Sawyer and A.M. managed the UKGPCS database and manifests for genotyping. C.L., A.M.D., C.B., D. Conroy, M.J.M., S.A., E.D., A. Lee, D.C.T., F.B. and D.V. carried out iCOGS PRACTICAL genotyping and set quality control standards. M. Ghoussaini selected the iCOGS PRACTICAL SNPs for fine-scale mapping. K.M. and A. Lophatananon collected some of the UKGPCS samples and controls. F.C.H., D.E.N. and J.L.D. are joint principal investigators of ProtecT. B.E.H. and L.L.M. are principal investigators of MEC; C.A.H. and F.S. are co-investigators. S.I.B. and D.A. are principal investigators of the PLCO study; G.A. is the principal investigator for the St. Louis screening center for PLCO; and S.J.C. and M.Y. led the genotyping for PLCO. S.G., R.B.H. and W.R.D. provided samples for PLCO. D.J.H. directs and P. Kraft coordinates data collection and management/analysis for HPFS. M.W. is the principal investigator of CPCS1 and CPCS2. B.G.N., S.F.N., S.E.B., P. Klarskov and M.A.R. have collected samples and data, and contributed to genotyping in this study. J.L.S. is principal investigator of the Fred Hutchinson-based study; E.A.O. collaborated on the study; L.M.F. and S.K.
coordinated data collation; and E.M.K. and D.M.K. coordinated the preparation of samples. L.C.-A. is principal investigator of the Utah study; C.T. is the analyst; and R.A.S. is the surgeon. S.L. is a co-investigator of the BPC3 Consortium. H.G. is principal investigator of the CAPS and STHM1 study; J.A., M.A., F.W., S.L.Z. and J.X. have contributed to sample collection, clinical data retrieval, analyses and molecular work. S.A.I. is principal investigator of the USC study, and E.M.J. is principal investigator of SFPCS; M.C. Stem and R.C. led the genotyping of both studies. A.D.J. and A. Shahabi were both involved in genotype data production for the USC and SFPCS studies. A.S.K. is principal investigator of WUGS. B.D. and G.C. collected and collated clinical data and performed sample selection. M.R.T. is the principal investigator of the IPO-Porto study; S.M. and P.P. collected familial and molecular data on cases. L.B.S. and W.J.B. are the principal investigators of SCCS; L.B.S., W.J.B., W.Z. and Q.C. were responsible for the original collection of the samples. W.Z. and Q.C. coordinated sample retrieval, DNA extraction and genotyping. L.B.S. oversaw the assembly of the phenotype data. J.B. and J.A.C. are principal investigators of the Queensland study with input from A.B.S., F.L. and S.S. coordinated the data collation. K.A.C. and E.L. provided imputed data for genotyping in carriers of the mutation encoding the p.Gly84Glu alteration in the HOXB13 region. G.G.G., J.L.H., D.R.E. and G.S. are principal investigators of the Australian studies; M.C. Southey manages the molecular work. J.S. is principal investigator of the Tampere study; T.W. collected and collated clinical data and performed sample selection. T.L.J.T. coordinated sample collection. H.B. is principal investigator of the ESTHER study; D.R. and C.S. contributed to design and data collection; and H.M. is study coordinator. J.Y.P. is principal investigator of the Moffitt study; T.A.S. and H.-Y.L. 
are contributors to this study. R. Kaneva is principal investigator of the PCMUS study; C.S. provided the samples in the PCMUS study; V.M. oversaw the data collation.

C.C. and J.L. are principal investigators of the Poland study; C.C. and D.W. collated the samples. C.M. and W.V. are principal investigators of the Ulm study; A.E.R. identified and collected clinical material, processed samples, undertook genotyping and/or collated data. E.R. is principal investigator of EPIC; F.C., R. Kaaks and D. Campa are investigators in Germany. T.J.K. is principal investigator of the EPIC-Oxford cohort and collected clinical material. R.C.T. collated data. K.-T.K. is principal investigator of the EPIC-Norfolk study. S.N.T. and D.S. are principal investigators of the Mayo Clinic study; S.K.M. coordinated data collation. M.M.G. provided samples for the ACS study. P.D.P.P. and N.P. provided samples for the East Anglia SEARCH study. C.S.C. gave advice about results and contributed to the manuscript. A.C.A. undertook risk prediction analysis for clinical application. D.P.D., A.H., R.A.H., V.S.K., C.C.P., N.J.V.A., C.J.W., A.T., T. Dudderidge, C.O., A.A., A.C., J.V. and A. Siddiq identified and collected clinical material. Other members of the UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons Section of Oncology, the UK ProtecT Study Collaborators and the PRACTICAL Consortium (membership lists provided in the Supplementary Note) collected clinical samples, assisted in genotyping and provided data management. Members of the COGS–Cancer Research UK GWAS–ELLIPSE (part of GAME-ON) Initiatives, the Australian Prostate Cancer Bioresource, the UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons Section of Oncology, the UK ProtecT Study Collaborators, the PRACTICAL Consortium and CSC collected clinical samples and/or assisted in genotyping and/or provided data management and/or discussion of the data.
9. Kote-Jarai, Z. et al. Multiple novel prostate cancer predisposition loci confirmed by an international study: the PRACTICAL Consortium. Cancer Epidemiol. Biomarkers Prev. 17, 2052–2061 (2008).
10. Koutros, S. et al. Pooled analysis of phosphatidylinositol 3-kinase pathway variants and risk of prostate cancer. Cancer Res. 70, 2389–2396 (2010).
11. Sun, T. et al. Single-nucleotide polymorphisms in p53 pathway and aggressiveness of prostate cancer in a Caucasian population. Clin. Cancer Res. 16, 5244–5251 (2010).
12. Wynendaele, J. et al. An illegitimate microRNA target site within the 3′ UTR of MDM4 affects ovarian cancer progression and chemosensitivity. Cancer Res. 70, 9641–9649 (2010).
13. ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
14. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
15. Garcia-Closas, M. et al. Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nat. Genet. published online; doi:10.1038/ng.2561 (27 March 2013).
16. Couch, F.J. et al. Genome-wide association study in BRCA1 mutation carriers identifies novel loci associated with breast and ovarian cancer risk. PLoS Genet. 9, e1003212 (2013).
17. Volinia, S. et al. A microRNA expression signature of human solid tumors defines cancer gene targets. Proc. Natl. Acad. Sci. USA 103, 2257–2261 (2006).
18. Szarvas, T. et al. Elevated serum matrix metalloproteinase 7 levels predict poor prognosis after radical prostatectomy. Int. J. Cancer 128, 1486–1492 (2011).
19. Richards, T.J. et al. Allele-specific transactivation of matrix metalloproteinase 7 by FOXA2 and correlation with plasma levels in idiopathic pulmonary fibrosis. Am. J. Physiol. Lung Cell. Mol. Physiol. 302, L746–L754 (2012).
20. Thomas, G. et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat. Genet. 41, 579–584 (2009).
21. Michailidou, K. et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
22. Haiman, C.A. et al. Genome-wide association study of prostate cancer in men of African ancestry identifies a susceptibility locus at 17q21. Nat. Genet. 43, 570–573 (2011).
23. Callen, D.F. et al. Co-expression of the androgen receptor and the transcription factor ZNF652 is related to prostate cancer outcome. Oncol. Rep. 23, 1045–1052 (2010).
24. Norris, J.D. et al. The homeodomain protein HOXB13 regulates the cellular response to androgens. Mol. Cell 36, 405–416 (2009).
25. Ewing, C.M. et al. Germline mutations in HOXB13 and prostate-cancer risk. N. Engl. J. Med. 366, 141–149 (2012).
26. Edwards, S. et al. Expression analysis onto microarrays of randomly selected cDNA clones highlights HOXB13 as a marker of human prostate cancer. Br. J. Cancer 92, 376–381 (2005).
27. Barbieri, C.E. et al. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat. Genet. 44, 685–689 (2012).
28. Breast Cancer Linkage Consortium. Cancer risks in BRCA2 mutation carriers. The Breast Cancer Linkage Consortium. J. Natl. Cancer Inst. 91, 1310–1316 (1999).
29. Macinnis, R.J. et al. A risk prediction algorithm based on family history and common genetic variants: application to prostate cancer with potential clinical impact. Genet. Epidemiol. 35, 549–556 (2011).


COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.


Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Goh, C.L. et al. Genetic variants associated with predisposition to prostate cancer and potential clinical implications. J. Intern. Med. 271, 353–365 (2012).
2. Akamatsu, S. et al. Common variants at 11q12, 10q26 and 3p11.2 are associated with prostate cancer susceptibility in Japanese. Nat. Genet. 44, 426–429 (2012).
3. Gudmundsson, J. et al. Genome-wide association and replication studies identify four variants associated with prostate cancer susceptibility. Nat. Genet. 41, 1122–1126 (2009).
4. Xu, J. et al. Genome-wide association study in Chinese men identifies two new prostate cancer risk loci at 9q31.2 and 19q13.4. Nat. Genet. 44, 1231–1235 (2012).
5. Amin Al Olama, A. et al. A meta-analysis of genome-wide association studies to identify prostate cancer susceptibility loci associated with aggressive and nonaggressive disease. Hum. Mol. Genet. 22, 408–415 (2013).
6. Gudmundsson, J. et al. A study based on whole-genome sequencing yields a rare variant at 8q24 associated with prostate cancer. Nat. Genet. 44, 1326–1329 (2012).
7. Park, J.H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42, 570–575 (2010).
8. Kote-Jarai, Z. et al. Identification of a novel prostate cancer susceptibility variant in the KLK3 gene transcript. Hum. Genet. 129, 687–694 (2011).


Rosalind A Eeles1,2,74, Ali Amin Al Olama3,74, Sara Benlloch3,73, Edward J Saunders1,73, Daniel A Leongamornlert1,73, Malgorzata Tymrakiewicz1,73, Maya Ghoussaini3,73, Craig Luccarini3,73, Joe Dennis3,73, Sarah Jugurnauth-Little1,73, Tokhir Dadaev1,73, David E Neal4,5,73, Freddie C Hamdy6,7,73, Jenny L Donovan8,73, Ken Muir9,10,73, Graham G Giles11,12,73, Gianluca Severi11,12,73, Fredrik Wiklund13,73, Henrik Gronberg13,73, Christopher A Haiman14,73, Fredrick Schumacher14,73, Brian E Henderson14,73, Loic Le Marchand15,73, Sara Lindstrom16,73, Peter Kraft16,73, David J Hunter16,73, Susan Gapstur17,73, Stephen J Chanock18,73, Sonja I Berndt18,73, Demetrius Albanes19,73, Gerald Andriole20,73, Johanna Schleutker21,22,73, Maren Weischer23,73, Federico Canzian24,73, Elio Riboli25,73, Tim J Key26,73, Ruth C Travis26,73, Daniele Campa24,73, Sue A Ingles14,73, Esther M John27–29,73, Richard B Hayes30,73, Paul D P Pharoah3,73, Nora Pashayan3,73, Kay-Tee Khaw31,73, Janet L Stanford32,33,73, Elaine A Ostrander34,73, Lisa B Signorello35,36,73, Stephen N Thibodeau37,73, Dan Schaid38,73, Christiane Maier39,40,73, Walther Vogel40,73, Adam S Kibel41,73, Cezary Cybulski42,73, Jan Lubinski42,73, Lisa Cannon-Albright43,44,73, Hermann Brenner45,73, Jong Y Park46,73, Radka Kaneva47,73, Jyotsna Batra48,73, Amanda B Spurdle49,73, Judith A Clements48,73, Manuel R Teixeira50,51,73, Ed Dicks3, Andrew Lee3, Alison M Dunning3, Caroline Baynes3, Don Conroy3, Melanie J Maranian3, Shahana Ahmed3, Koveela Govindasami1, Michelle Guy1, Rosemary A Wilkinson1, Emma J Sawyer1, Angela Morgan1, David P Dearnaley1,2, Alan Horwich1,2, Robert A Huddart1,2, Vincent S Khoo1,2, Christopher C Parker1,2, Nicholas J Van As2, Christopher J Woodhouse2, Alan Thompson2, Tim Dudderidge2, Chris Ogden2, Colin S Cooper1,52, Artitaya Lophatananon9, Angela Cox53, Melissa C Southey54, John L Hopper12, Dallas R English11,12, Markus Aly13,55, Jan Adolfsson56, Jiangfeng Xu57, Siqun L Zheng58, Meredith Yeager18,59, Rudolf Kaaks60, W Ryan Diver17, Mia M Gaudet17, Mariana C Stern14, Roman Corral14, Amit D Joshi14, Ahva Shahabi14, Tiina Wahlfors21, Teuvo L J Tammela61, Anssi Auvinen62, Jarmo Virtamo63, Peter Klarskov64, Børge G Nordestgaard23, M Andreas Røder65, Sune F Nielsen23, Stig E Bojesen23, Afshan Siddiq66, Liesel M FitzGerald32, Suzanne Kolb32, Erika M Kwon34, Danielle M Karyadi34, William J Blot35,36, Wei Zheng36, Qiuyin Cai36, Shannon K McDonnell37, Antje E Rinckleb39,40, Bettina Drake20, Graham Colditz20, Dominika Wokolorczyk42, Robert A Stephenson44,67, Craig Teerlink43, Heiko Muller45, Dietrich Rothenbacher45, Thomas A Sellers46, Hui-Yi Lin46, Chavdar Slavov68, Vanio Mitev47, Felicity Lose49, Srilakshmi Srinivasan48, Sofia Maia50,51, Paula Paulo50,51, Ethan Lange69, Kathleen A Cooney70, Antonis C Antoniou3, Daniel Vincent71, François Bacot71, Daniel C Tessier71, The COGS–Cancer Research UK GWAS–ELLIPSE (part of GAME-ON) Initiative72, The Australian Prostate Cancer Bioresource72, The UK Genetic Prostate Cancer Study Collaborators/British Association of Urological Surgeons' Section of Oncology72, The UK ProtecT (Prostate testing for cancer and Treatment) Study Collaborators72, The PRACTICAL (Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome) Consortium72, Zsofia Kote-Jarai1,74 & Douglas F Easton3,74


1The Institute of Cancer Research, Sutton, UK. 2Royal Marsden National Health Service (NHS) Foundation Trust, London and Sutton, UK. 3Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK. 4Surgical Oncology (Uro-Oncology: S4), University of Cambridge, Addenbrooke's Hospital, Cambridge, UK. 5Cancer Research UK Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK. 6Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK. 7Faculty of Medical Science, University of Oxford, John Radcliffe Hospital, Oxford, UK. 8School of Social and Community Medicine, University of Bristol, Bristol, UK. 9Warwick Medical School, University of Warwick, Coventry, UK. 10Institute of Population Health, University of Manchester, Manchester, UK. 11Cancer Epidemiology Centre, The Cancer Council Victoria, Carlton, Victoria, Australia. 12Centre for Molecular, Environmental, Genetic and Analytic Epidemiology, The University of Melbourne, Melbourne, Victoria, Australia. 13Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden. 14Department of Preventive Medicine, Keck School of Medicine, University of Southern California/Norris Comprehensive Cancer Center, Los Angeles, California, USA. 15University of Hawaii Cancer Center, Honolulu, Hawaii, USA. 16Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 17Epidemiology Research Program, American Cancer Society, Atlanta, Georgia, USA. 18Division of Cancer Epidemiology and Genetics, National Cancer Institute, US National Institutes of Health (NIH), Bethesda, Maryland, USA. 19Nutritional Epidemiology Branch, National Cancer Institute, US NIH, Bethesda, Maryland, USA. 20Division of Urologic Surgery, Washington University School of Medicine, St. Louis, Missouri, USA.
21Institute of Biomedical Technology (BioMediTech), University of Tampere and FimLab Laboratories, Tampere, Finland. 22Department of Medical Biochemistry and Genetics, University of Turku, Turku, Finland. 23Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, Herlev, Denmark. 24Genomic Epidemiology Group, German Cancer Research Center (DKFZ), Heidelberg, Germany. 25Department of Epidemiology & Biostatistics, School of Public Health, Imperial College London, London, UK. 26Cancer Epidemiology Unit, Nuffield Department of Clinical Medicine, University of Oxford, Oxford, UK. 27Cancer Prevention Institute of California, Fremont, California, USA. 28Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California, USA. 29Department of Health Research & Policy, Division of Epidemiology, Stanford University School of Medicine, Stanford, California, USA. 30Department of Environmental Medicine, New York University (NYU) Langone Medical Center, Division of Epidemiology, NYU Cancer Institute, New York, New York, USA. 31Clinical Gerontology Unit, University of Cambridge, Cambridge, UK. 32Fred Hutchinson Cancer Research Center, Division of Public Health Sciences, Seattle, Washington, USA. 33Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington, USA. 34National Human Genome Research Institute, US NIH, Bethesda, Maryland, USA. 35International Epidemiology Institute, Rockville, Maryland, USA. 36Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt-Ingram Cancer Center, Division of Epidemiology, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 37Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA. 38Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA. 39Department of Urology, University Hospital Ulm, Ulm, Germany. 
40Institute of Human Genetics, University Hospital Ulm, Ulm, Germany. 41Division of Urologic Surgery, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 42Department of Genetics and Pathology, International Hereditary Cancer Center, Pomeranian Medical University, Szczecin, Poland. 43Division of Genetic Epidemiology, Department of Medicine, University of Utah School of Medicine, Salt Lake City, Utah, USA. 44George E. Wahlen Department of Veterans Affairs Medical Center, Salt Lake City, Utah, USA. 45Division of Clinical Epidemiology and Aging Research, DKFZ, Heidelberg, Germany. 46Division of Cancer Prevention and Control, H. Lee Moffitt Cancer Center, Tampa, Florida, USA. 47Department of Medical Chemistry and Biochemistry, Molecular Medicine Center, Medical University of Sofia, Sofia, Bulgaria. 48Australian Prostate Cancer Research Centre, Queensland Institute of Health and Biomedical Innovation and School of Biomedical Science, Queensland University of Technology, Brisbane, Queensland, Australia. 49Molecular Cancer Epidemiology Laboratory, Queensland Institute of Medical Research, Brisbane, Queensland, Australia. 50Department of Genetics, Portuguese Oncology Institute, Porto, Portugal. 51Abel Salazar Biomedical Sciences Institute (ICBAS), Porto University, Porto, Portugal. 52School of Biological Sciences, University of East Anglia, Norwich, UK. 53Cancer Research UK/Yorkshire Cancer Research Sheffield Cancer Research Centre, University of Sheffield, Sheffield, UK. 54Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Parkville, Victoria, Australia. 55Department of Clinical Sciences, Division of Urology, Danderyd Hospital, Karolinska Institute, Stockholm, Sweden. 56Regional Cancer Centre, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute, Stockholm, Sweden. 57Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA.
58Genomics Genotyping Laboratory, Center for Cancer Genomics, Wake Forest University Health Sciences, Winston-Salem, North Carolina, USA. 59Core Genotyping Facility, SAIC-Frederick, National Cancer Institute, US NIH, Gaithersburg, Maryland, USA. 60Division of Cancer Epidemiology, DKFZ, Heidelberg, Germany. 61Department of Urology, Tampere University Hospital and Medical School, University of Tampere, Tampere, Finland. 62Department of Epidemiology, School of Health Sciences, University of Tampere, Tampere, Finland. 63Department of Chronic Disease Prevention, National Institute for Health and Welfare, Helsinki, Finland. 64Department of Urology, Herlev Hospital, Copenhagen University Hospital, Herlev, Denmark. 65Department of Urology, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark. 66Department of Genomics of Common Disease, School of Public Health, Imperial College London, London, UK. 67Huntsman Cancer Institute, Salt Lake City, Utah, USA. 68Department of Urology, Alexandrovska University Hospital, Medical University Sofia, Sofia, Bulgaria. 69Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA. 70University of Michigan Comprehensive Cancer Center, Division of Hematology/Oncology, University of Michigan Medical School, Ann Arbor, Michigan, USA. 71McGill University and Génome Québec Innovation Centre, Montreal, Quebec, Canada. 72A full list of members is provided in the Supplementary Note. 73These authors contributed equally to this work. 74These authors jointly directed this work. Correspondence should be addressed to R.E. (rosalind.eeles@icr.ac.uk).


ONLINE METHODS


GWAS analysis. Primary genotype data were obtained for three prostate cancer GWAS (CGEMS, UK/Australia stages 1 and 2, and CAPS). Standard quality control was performed on all scans; all individuals with low call rate (<95%), extremely high or low heterozygosity (P < 1 × 10⁻⁵) or non-European ancestry (>15% non-European component by multidimensional scaling, using the three HapMap 2 populations (European (CEU), Asian (CHB and JPT) and African (YRI)) as a reference) were excluded. SNPs were excluded if they had a call rate < 95%, a call rate < 99% together with MAF < 5%, or MAF < 1%, or if their genotype frequencies departed from Hardy-Weinberg equilibrium at P < 1 × 10⁻⁶ in controls or P < 1 × 10⁻¹² in cases. For BPC3, quality control was performed as previously described30. Genotypes in all four GWAS were imputed for ~2.6 million SNPs using the HapMap phase 2 CEU population as a reference. UK/Australia stages 1 and 2 and CGEMS were imputed using MACH 1.0 (see URLs) for autosomal markers and IMPUTE v1 (ref. 31) for chromosome X markers. Imputation for the BPC3 study used MACH 1.0. The CAPS study used IMPUTE v1. We included imputed data from a SNP in the combined analysis if the estimated correlation between the genotype scores and the true genotypes (r²) was >0.3 (MACH) or if the quality information was >0.3 (IMPUTE). For UK stages 1 and 2 and CGEMS, the imputed genotype probabilities were used to derive a 1-degree-of-freedom association score statistic and its corresponding variance for each SNP. The test statistic for UK/Australia stage 2 was stratified by population as previously described32. In the BPC3 study, estimated β values and standard errors were calculated for each component study, including one principal component as a covariate to adjust for population structure using ProbABEL33, and the results were combined to generate overall β values and standard errors using a fixed-effects meta-analysis. CAPS used SNPTEST (see URLs) to estimate β values and standard errors.
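The SNP-level exclusion rules for the GWAS scans combine several thresholds, so a small sketch may help make the logic explicit. The following Python is only one reading of the criteria stated in the text (call rate, MAF, Hardy-Weinberg P values, imputation quality); the class and function names are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpQcStats:
    """Per-SNP summary statistics; all field names are illustrative."""
    call_rate: float        # fraction of samples with a called genotype (0-1)
    maf: float              # minor allele frequency (0-0.5)
    hwe_p_controls: float   # Hardy-Weinberg P value in controls
    hwe_p_cases: float      # Hardy-Weinberg P value in cases


def passes_snp_qc(s: SnpQcStats) -> bool:
    """Return True if the SNP survives every exclusion rule described in the text."""
    if s.call_rate < 0.95:                       # call rate < 95%
        return False
    if s.call_rate < 0.99 and s.maf < 0.05:      # call rate < 99% and MAF < 5%
        return False
    if s.maf < 0.01:                             # MAF < 1%
        return False
    if s.hwe_p_controls < 1e-6 or s.hwe_p_cases < 1e-12:
        return False                             # departure from HWE
    return True


def passes_imputation_qc(quality: float) -> bool:
    """Imputed SNPs are kept if r^2 (MACH) or the info score (IMPUTE) exceeds 0.3."""
    return quality > 0.3
```

Under this reading, a SNP with a 97% call rate is kept only if its MAF is at least 5%, whereas a SNP with a call rate of 99% or more is kept down to a MAF of 1%.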
We converted the results from all studies into test scores and variances and hence derived a combined χ² trend statistic for each SNP (equivalent to the Mantel extension test or as in a fixed-effects meta-analysis) in R. All studies were approved by the appropriate national ethics committees, and informed consent was obtained.

SNP selection. SNPs were selected for the iCOGS array separately by each consortium. Each consortium was given a share of the array: nominally, 25% of the SNPs each for BCAC, PRACTICAL and OCAC and 17.5% for CIMBA; 7.5% were of general interest (COMMON area). In practice, the allocations were larger as a result of overlaps. In each consortium, the allocation was divided into three categories: GWAS replication, fine-mapping and candidate SNPs. The GWAS replication category consisted of a series of lists for each analysis (see the PRACTICAL website for a full description of the lists). In general, we considered only SNPs with an Illumina design score of 0.8 or greater (some OCAC and CIMBA SNPs with lower design scores were included). Where possible, preference was given to SNPs previously genotyped by Illumina (design score = 1.1). For each category, we defined a series of ranked lists of SNPs. For the GWAS SNPs, these were merged in the following way to generate a single list. We selected SNPs in priority order from each list according to predefined weightings. When a SNP (or a surrogate) was selected on the basis of more than one list, the SNP counted toward the tally for each list. For each SNP, we preferentially accepted the SNP if it had a design score of 1.1 (meaning it had previously been genotyped on an Illumina platform). If this was not the case, we sought SNPs with r² = 1 with the chosen SNP and selected the SNP with the best design score. If no such SNP was available, we selected SNPs with r² > 0.8 with the chosen SNP and selected the SNP with the best design score.
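The fallback order just described (accept the chosen SNP at design score 1.1, otherwise look for a perfect surrogate, otherwise a surrogate at r² > 0.8) can be sketched as follows. This is a minimal illustration rather than the consortium's actual code: the `Candidate` type and the function name are hypothetical, and two points are assumptions made here, namely that the chosen SNP itself competes in the r² = 1 pool and that the general design-score floor of 0.8 applies to every candidate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

EPS = 1e-9
MIN_DESIGN_SCORE = 0.8   # general floor stated in the text (assumed to apply to all candidates)


@dataclass(frozen=True)
class Candidate:
    snp_id: str
    r2_with_target: float   # LD (r^2) with the chosen SNP; 1.0 for the chosen SNP itself
    design_score: float     # Illumina design score; 1.1 means previously genotyped


def pick_snp_or_surrogate(target: Candidate,
                          neighbours: Sequence[Candidate]) -> Optional[Candidate]:
    """Apply the three-step preference order; return None if nothing qualifies."""
    # Step 1: accept the chosen SNP outright if it was previously genotyped.
    if target.design_score >= 1.1 - EPS:
        return target

    pool = [c for c in (target, *neighbours) if c.design_score >= MIN_DESIGN_SCORE]

    # Step 2: among SNPs in perfect LD (r^2 = 1), take the best design score.
    perfect = [c for c in pool if c.r2_with_target >= 1.0 - EPS]
    if perfect:
        return max(perfect, key=lambda c: c.design_score)

    # Step 3: otherwise fall back to SNPs with r^2 > 0.8, again by best design score.
    strong = [c for c in pool if c.r2_with_target > 0.8]
    if strong:
        return max(strong, key=lambda c: c.design_score)
    return None
```

Separating the perfect-LD pool from the r² > 0.8 pool keeps a well-designed but imperfect proxy from displacing a perfect surrogate, which is how the ordering in the text reads.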
We excluded SNPs that were in strong LD with a previously selected SNP (r² > 0.9). However, for SNPs that were highly significant in each list (P < 0.00001), we required two surrogate SNPs. The candidate lists were merged in the same way, giving equal weight to lists from each study. The only differences were that (i) there was no provision for additional surrogates and (ii) SNPs were excluded if there was an existing surrogate at r² = 1. To merge the three categories, we first included all the selected fine-mapping SNPs and then included SNPs from the merged GWAS and candidate lists in priority order. COMMON SNPs were selected in a similar way. Finally, lists from each of the constituent consortia were merged, in priority order and in proportion to the allocated share of each consortium. SNPs selected by one consortium and subsequently selected by another counted toward both lists. The process continued until the maximum 240,000 attempted beadtypes had been reached. The final list comprised 220,123 SNPs. Of these, 211,155 were successfully manufactured on the array.

iCOGS genotyping. Samples for the iCOGS replication stage were drawn from 32 studies participating in the PRACTICAL Consortium. The majority of studies were population-based or hospital-based case-control studies or were nested case-control studies, but some studies selected samples by age or oversampled for cases with a family history of disease; in the latter instance, only one case per family was genotyped (Supplementary Table 1 and Supplementary Note). Studies were required to provide ~2% of samples in duplicate. Genotyping was conducted using a custom Illumina Infinium array (iCOGS) in seven centers, of which five were used for PRACTICAL samples. Genotypes were called using Illumina's proprietary GenCall algorithm. Initial calling used a cluster file generated with 270 samples from HapMap 2. To generate the final calls, we first selected a subset of 3,018 individuals, including samples from each of the genotyping centers, each of the participating consortia and each major ancestry group. Only plates with consistently high call rates in the initial calling were used. We also included 380 samples of European, Asian or African ancestry genotyped as part of the HapMap Project and 1000 Genomes Project and 160 samples that were known positive controls for rare variants on the array. This subset was used to generate a cluster file that was then applied to call the genotypes for the remaining samples. We also investigated two other calling algorithms: Illuminus34 and GenoSNP35. All three algorithms were >99% concordant in their calling for 91% of the SNPs on the array.
However, manual inspection of a sample of the discrepant SNPs indicated that the calls from GenCall were almost invariably superior (generally because Illuminus or GenoSNP attempted to call SNPs that clustered poorly). Therefore, only the genotypes called by GenCall have been used in the analyses reported here.

Quality control. We excluded individuals for any of the following reasons: genotypically not male XY (XX or XXY); overall call rate < 95%; low or high heterozygosity (P < 1 × 10⁻⁶, separately for individuals of European and African-American ancestry); not concordant with previous genotyping within PRACTICAL; genotypes for the duplicate sample that appeared to be from a different individual; and cryptic duplicates where the phenotypic data indicated that the individuals were different. We searched for cryptic duplicates both within each study and between studies from the same country. For known and cryptic duplicates, the sample with the lower call rate was excluded. We attempted to identify first-degree relative pairs using identity-by-state estimates based on data from ~37,000 uncorrelated SNPs. For apparent first-degree relative pairs, we removed the control from a case-control pair; otherwise, we removed the individual with the lower call rate. For all analyses presented here, we also excluded 6,766 individuals who were included in any of the GWAS to allow the GWAS and iCOGS replication stages to be combined. Ancestry outliers were identified by multidimensional scaling, combining the iCOGS replication stage data with those from the three HapMap 2 populations, based on a subset of 37,000 uncorrelated markers that passed quality control (including ~1,000 selected as ancestry-informative markers). Most studies included individuals predominantly of single, European ancestry, and individuals with >15% minority ancestry were excluded.
One study (SCCS) primarily contained individuals of African-American ancestry, and two studies, FHCRC and MOFFITT, contained substantial fractions of individuals of both African-American and European ancestry. After exclusion of ancestry outliers, we used principal-components analysis to correct for inflation. Principal-components analyses were carried out separately for the European and African-American subgroups on the basis of a subset of 37,000 uncorrelated SNPs. We included the first six principal components as covariates in both the European and African-American subgroups. Addition of further principal components did not reduce inflation further. Only the European data are reported here. We excluded SNPs with call rates of <95%. We also excluded SNPs that deviated from Hardy-Weinberg equilibrium in controls at P < 1 × 10⁻⁷, on the basis of a stratified 1-degree-of-freedom test in which the deviations were summed across strata36. We also excluded SNPs for which the genotypes were


discrepant in more than 2% of duplicate samples, across all COGS consortia. The final analyses were based on data from 201,598 SNPs. Genotype intensity cluster plots were examined manually (Supplementary Fig. 4) for SNPs in each new region in which an association at genome-wide significance was obtained, and SNPs whose clustering was judged to be poor were eliminated.

Statistical analysis. For each SNP, we estimated a per-allele log(OR) and standard error by logistic regression, including study and principal components as covariates. Overall significance levels were obtained by combining the estimates from the combined GWAS and the iCOGS replication stage using a fixed-effects meta-analysis. Homogeneity of the ORs across strata and populations was assessed using likelihood ratio tests. Modification of the ORs by disease aggressiveness and family history was assessed using a case-only analysis. Modification of the ORs by age was examined using a case-only analysis assessing the association between age and SNP genotype in the cases using polytomous regression. The associations between SNP genotypes and PSA levels were assessed using linear regression, after log transformation of PSA levels (ng/ml) to correct for their positively skewed distribution. Analyses were performed in R, principally using GenABEL37, SNPTEST, ProbABEL33 and Stata. The contribution of the known SNPs to the familial risk of prostate cancer, under a multiplicative model, was computed using the formula

$$\frac{\sum_k \log \lambda_k}{\log \lambda_0}$$

where $\lambda_0$ is the observed familial risk to first-degree relatives of prostate cancer cases, assumed to be 2, and $\lambda_k$ is the familial relative risk due to locus k, given by

$$\lambda_k = \frac{p_k r_k^2 + q_k}{(p_k r_k + q_k)^2}$$

where $p_k$ is the frequency of the risk allele for locus k, $q_k = 1 - p_k$, and $r_k$ is the estimated per-allele OR.

Inflation. We estimated the inflation ($\lambda$) for each analysis on the basis of the 45th percentile of the test statistic for SNPs not selected by PRACTICAL and not in the COMMON fine-mapping regions. The inflation was 1.136 for the subgroup of European ancestry and 1.001 for the subgroup of African-American ancestry. Inflation was converted to an equivalent inflation for a study with 1,000 cases and 1,000 controls ($\lambda_{1,000}$) by adjusting for effective study size, namely

$$\lambda_{1,000} = 1 + 500(\lambda - 1)\left[\sum_k \left(\frac{1}{n_k} + \frac{1}{m_k}\right)^{-1}\right]^{-1}$$

where $n_k$ and $m_k$ were the number of cases and controls, respectively, for study k.

Estimation of the number of associated loci. To estimate the total number of newly associated loci selected for the iCOGS replication stage, we identified a set of 22,662 SNPs that had been selected for replication of the GWAS, had not been selected for fine mapping (to exclude previously known loci) and were uncorrelated (r² < 0.1 for any pair). We then determined the number of loci for which the estimated effect sizes in the iCOGS replication were in the same direction as in the combined GWAS or in the opposite direction. Similar results were obtained using cutoffs of r² < 0.05 and r² < 0.2.

Pathway analysis. GeneGo pathway enrichment was used to determine whether any canonical pathway was significantly enriched with false discovery rate < 0.05.

30. Schumacher, F.R. et al. Genome-wide association study identifies new prostate cancer susceptibility loci. Hum. Mol. Genet. 20, 3867–3875 (2011).
31. Marchini, J. et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
32. Eeles, R.A. et al. Multiple newly identified loci associated with prostate cancer susceptibility. Nat. Genet. 40, 316–321 (2008).
33. Aulchenko, Y.S., Struchalin, M.V. & van Duijn, C.M. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11, 134 (2010).
34. Teo, Y.Y. et al. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23, 2741–2746 (2007).
35. Giannoulatou, E. et al. GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics 24, 2209–2214 (2008).
36. Haldane, J.B. & Slater, E. Assortative mating. Eugen. Rev. 38, 103 (1946).
37. Aulchenko, Y.S., Ripke, S., Isaacs, A. & van Duijn, C.M. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294–1296 (2007).
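As a numerical companion to the Online Methods, the sketch below evaluates the two quantities defined there: the share of familial risk explained by a set of loci under a multiplicative model, with the per-locus familial relative risk taken as (p r² + q)/(p r + q)², and the rescaling of an inflation factor to a study of 1,000 cases and 1,000 controls. The function names are hypothetical, any inputs are for illustration only, and the inflation rescaling assumes that the effective study size is the sum over studies of (1/n_k + 1/m_k)⁻¹.

```python
import math
from typing import Iterable, Tuple


def locus_familial_rr(p: float, r: float) -> float:
    """Familial relative risk due to one locus: risk-allele frequency p, per-allele OR r."""
    q = 1.0 - p
    return (p * r * r + q) / (p * r + q) ** 2


def fraction_of_familial_risk(loci: Iterable[Tuple[float, float]],
                              lambda0: float = 2.0) -> float:
    """Sum of log(lambda_k) over loci, divided by log(lambda_0); lambda_0 = 2 as in the text."""
    return sum(math.log(locus_familial_rr(p, r)) for p, r in loci) / math.log(lambda0)


def rescale_inflation(lam: float, study_sizes: Iterable[Tuple[int, int]]) -> float:
    """Convert inflation lam to the equivalent for 1,000 cases and 1,000 controls.

    study_sizes holds (cases, controls) per study; the effective size is assumed
    to be the sum over studies of 1 / (1/n_k + 1/m_k).
    """
    effective = sum(1.0 / (1.0 / n + 1.0 / m) for n, m in study_sizes)
    return 1.0 + 500.0 * (lam - 1.0) / effective
```

Two sanity checks follow from the definitions: a locus with a per-allele OR of 1 contributes a familial relative risk of exactly 1, and an inflation factor measured in a single study of 1,000 cases and 1,000 controls is returned unchanged.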


letters

Genome-wide association studies identify four ER negative–specific breast cancer risk loci
Estrogen receptor (ER)-negative tumors represent 20–30% of all breast cancers, with a higher proportion occurring in younger women and women of African ancestry1. The etiology2 and clinical behavior3 of ER-negative tumors are different from those of tumors expressing ER (ER positive), including differences in genetic predisposition4. To identify susceptibility loci specific to ER-negative disease, we combined in a meta-analysis 3 genome-wide association studies of 4,193 ER-negative breast cancer cases and 35,194 controls with a series of 40 follow-up studies (6,514 cases and 41,455 controls), genotyped using a custom Illumina array, iCOGS, developed by the Collaborative Oncological Gene-environment Study (COGS). SNPs at four loci, 1q32.1 (MDM4, P = 2.1 × 10⁻¹² and LGR6, P = 1.4 × 10⁻⁸), 2p24.1 (P = 4.6 × 10⁻⁸) and 16q12.2 (FTO, P = 4.0 × 10⁻⁸), were associated with ER-negative but not ER-positive breast cancer (P > 0.05). These findings provide further evidence for distinct etiological pathways associated with invasive ER-positive and ER-negative breast cancers.

ER-negative tumors are associated with a worse short-term prognosis3 and have weaker associations with reproductive risk factors2 than ER-positive tumors. There are also important differences in genetic susceptibility to these two types of tumors. BRCA1 mutations predispose primarily to ER-negative disease, whereas most known common susceptibility loci for breast cancer show stronger associations with ER-positive than with ER-negative tumors4. Exceptions are three loci tagged by rs10069690 on chromosome 5p15 (ref. 5) (TERT-CLPTM1L), rs8170 at 19p13 (ref. 6) (BABAM1, also known as MERIT40) and rs2284378 at 20q11 (ref. 7), which predispose primarily to ER-negative tumors, and loci at 6q25 (ref. 8) that confer higher risk for ER-negative than for ER-positive tumors.
With the aim of identifying susceptibility loci specific for invasive ER-negative disease, we analyzed three genome-wide association studies (GWAS) in populations of European ancestry and followed up promising signals from each GWAS in the Breast Cancer Association Consortium (BCAC). The 3 GWAS included a total of 4,193 ER-negative breast cancer cases and 35,194 controls of European ancestry drawn from 23 studies participating in the National Cancer Institute Breast and Prostate Cancer Cohort Consortium (BPC3), the Triple-Negative Breast Cancer Consortium (TNBCC) and the Combined BCAC ER-negative GWAS (C-BCAC) (Online Methods and Supplementary Table 1). We selected 13,276 SNPs on the basis of rank P values from the 3 GWAS, and these were genotyped in an independent set of 6,514
A full list of authors and affiliations appears at the end of the paper. Received 14 May 2012; accepted 29 January 2013; published online 27 March 2013; doi:10.1038/ng.2561

ER-negative cases and 41,455 controls of European ancestry from 40 BCAC studies forming part of the COGS Project (Online Methods and Supplementary Table 1). Samples were genotyped using the iCOGS custom Illumina Infinium array that included a total of 211,155 SNPs selected in collaboration with other cancer consortia (Online Methods). We performed a fixed-effects meta-analysis of odds ratio (OR) estimates from the GWAS and follow-up studies (quantile-quantile plot shown in Supplementary Fig. 1) and identified four loci newly associated with ER-negative disease at P < 5 × 10⁻⁸ (Fig. 1 and Table 1; cluster plots shown in Supplementary Fig. 2). Two independently associated loci were located on chromosome 1q32.1 and were tagged by two uncorrelated (r² < 0.001) markers (from reference sequence NCBI Build 36): rs4245739 (P = 2.1 × 10⁻¹², OR = 1.14, 95% confidence interval (CI) = 1.10–1.18) and rs6678914 (P = 1.4 × 10⁻⁸, OR = 1.10, 95% CI = 1.06–1.13). Conditional analyses of the two SNPs in BCAC follow-up data showed comparable estimates, indicating that these are two distinct signals (Supplementary Table 2). The other two loci were located at 2p24.1 (rs12710696, P = 4.6 × 10⁻⁸, OR = 1.10, 95% CI = 1.06–1.13) and 16q12.2 (rs11075995, P = 4.0 × 10⁻⁸, OR = 1.11, 95% CI = 1.07–1.15). For each region, there was little evidence for heterogeneity of effect by study (Table 1 and Supplementary Fig. 3a–d), and genotype-specific risks for rs4245739, rs6678914 and rs12710696 were consistent with a log-additive model. For rs11075995, departure from a log-additive model was significant (P = 0.039), and genotype-specific estimates suggested a recessive effect (Supplementary Table 3). The strength of the association for each SNP differed significantly by ER status, and none of the SNPs showed significant associations with ER-positive disease in the analysis of 25,227 ER-positive cases and 41,455 controls of European ancestry in BCAC (Supplementary Tables 4 and 5).
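To illustrate how per-study OR estimates are pooled in a fixed-effects meta-analysis, the sketch below applies standard inverse-variance weighting to log(OR) values and derives a two-sided P value from the normal distribution. It is a generic textbook formulation, not the authors' code; the function names are hypothetical, and recovering a standard error from a reported 95% CI assumes the interval is symmetric on the log scale.

```python
import math
from typing import Sequence, Tuple


def fixed_effects_meta(estimates: Sequence[Tuple[float, float]]
                       ) -> Tuple[float, float, float, float]:
    """Pool (log_or, standard_error) pairs by inverse-variance weighting.

    Returns (pooled_log_or, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    return pooled, pooled_se, z, p


def log_or_and_se_from_ci(odds_ratio: float, lower: float, upper: float,
                          z_crit: float = 1.959964) -> Tuple[float, float]:
    """Recover log(OR) and its standard error from a reported 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    return math.log(odds_ratio), se
```

For example, `log_or_and_se_from_ci(1.14, 1.10, 1.18)` gives an approximate log(OR) and standard error for rs4245739; because the published OR and CI are rounded to two decimals, a P value recomputed this way only roughly matches the reported one.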
Notably, we observed no significant differences in ORs for ER-negative tumors with and without the triple-negative phenotype (defined as ER-negative, progesterone receptor (PR)-negative and HER2-negative) for rs6678914 (1q32.1, LGR6), rs12710696 (2p24.1) and rs11075995 (16q12.2). However, rs4245739 (1q32.1, MDM4) seemed to be specific to triple-negative tumors (case-only heterogeneity P value (Phet) by triple-negative status = 0.005; Supplementary Table 5). None of the four SNPs showed significant (P < 0.05) associations in studies of Asian ancestry in BCAC, and only the 16q12.2 (FTO) variant was associated at P = 0.05 in combined analyses of studies of African-American ancestry in BCAC and the African-American Breast Cancer Consortium5 (AABC; Supplementary Table 6). However, estimates


Figure 1 Association and recombination plots. (a–d) Results are shown for the 1q32.1 (rs4245739; MDM4) (a), 1q32.1 (rs6678914; LGR6) (b), 2p24.1 (rs12710696) (c) and 16q12.2 (rs11075995; FTO) (d) loci in populations of European ancestry. Data from ER-negative breast cancer GWAS are plotted as circles; LD between each SNP and the top SNP (blue) is indicated by the color of the symbol. Estimates from the combined analysis of GWAS and BCAC replication data are plotted as squares, with the top SNP shown in blue. Recombination rates, plotted in light blue, are based on the HapMap CEU samples (Utah residents of Northern and Western European ancestry), and genomic coordinates are based on GRCh37 of the human genome. Axes: observed −log10 P (left), recombination rate in cM/Mb (right) and chromosome position in kb (horizontal); LD color bins: >0.8, 0.5–0.8, 0.2–0.5 and <0.2.

for Asian and African-American populations were not significantly different from those in Europeans (P > 0.05), and larger studies in these populations are needed to determine whether risk associations exist. None of the markers were significantly associated with increasing age at the onset of ER-negative disease in the BCAC follow-up data (Ptrend ≥ 0.314), although there were some differences in age-specific estimates (Supplementary Table 7). Furthermore, OR estimates were not significantly different for women with and without a family history of any breast cancer in at least one first-degree relative, and risk alleles were not over-represented in cases with a positive family history (Supplementary Table 8).
rs4245739 (1q32.1) is located in the 3′ region of the MDM4 oncogene. MDM4 is a repressor of TP53 and TP73 transcription and is important for cell cycle regulation and apoptosis. rs4245739 resides in a linkage disequilibrium (LD) block of approximately 230 kb (Supplementary Fig. 4a) that also contains the tRNALys transcript and the genes PIK3C2B and LRRN2 (Supplementary Fig. 5a). MDM4, tRNALys and PIK3C2B but not LRRN2 are expressed in normal breast epithelium, breast cancer cell lines and breast tumors (refs. 9–11). There are no nonsynonymous SNPs correlated with rs4245739 in the 1000 Genomes Project populations of European ancestry (r² > 0.10); however, correlated SNPs are located in the promoter region of PIK3C2B (rs3014606, r² = 0.94 and rs2926534, r² = 0.94) and in the tRNALys transcript (rs11240753, r² = 0.78 and rs4951389, r² = 0.78). Variants in the MDM4 locus correlated with rs4245739 have also been associated with breast cancer in BRCA1 mutation carriers who have predominantly ER-negative tumors (ref. 12). Thus, this region seems to be specifically associated with ER-negative disease and not with overall breast cancer risk, as suggested by a previous, smaller candidate gene study (ref. 13). To our knowledge, no studies before the COGS collaboration have evaluated rs4245739 in relation to the risk of ER-negative disease.
rs6678914 on chromosome 1q32.1 is located in intron 1 of the LGR6 gene (Supplementary Fig. 4b). LGR6 and several other genes in this region, including UBE2T and PTPN7, are expressed in breast tumors (ref. 9). A correlated SNP (rs12032424, r² = 0.96) is located in a putative enhancer region in the same intron of LGR6 in normal breast epithelial cells, although not in the triple-negative breast cancer cell line MDA-MB-231 (Supplementary Fig. 5b). The rs6678914 SNP is not correlated with nonsynonymous SNPs in LGR6 (r² > 0.10 in 1000 Genomes Project populations of European ancestry).
The SNP rs12710696 on chromosome 2p24.1 is located in an intergenic region, more than 200 kb from the nearest gene (OSR1) (Supplementary Fig. 4c). It is possible that the allele marked by rs12710696 could influence a set of active enhancers, as the region contains multiple overlapping chromatin marks in normal breast epithelial cells and the MDA-MB-231 triple-negative breast cancer cell line (Supplementary Fig. 5c).
The signal found on chromosome 16q12.2 is located in the fat mass and obesity-associated gene FTO (Supplementary Fig. 4d). This signal is tagged by rs11075995, located in a ~40-kb LD block in intron 1 of FTO, within an enhancer region that appears to be active in both normal and triple-negative breast cancer cells (Supplementary Fig. 5d). rs11075995 is located ~40 kb distal to a region in intron 1 that contains multiple SNPs associated with obesity in the Genetic Investigation of ANthropometric Traits (GIANT) Consortium (refs. 14,15), as well as a SNP associated with overall breast cancer risk (rs17817449; ref. 8). rs11075995 is not correlated with any of the previously reported SNPs associated with obesity at genome-wide significant levels in GIANT or with rs17817449 (P = 3.7 × 10⁻⁶⁰, based on 123,864 subjects in GIANT; ref. 15). However, rs11075995 is associated with body mass index (BMI), both in GIANT (P = 1.51 × 10⁻⁶, based on 121,427 subjects) and our control population (P = 2.8 × 10⁻⁵, based on 20,952 controls in iCOGS; data not shown). Analyses adjusting and stratifying by BMI on the basis of 3,071 ER-negative cases and 20,130 controls from 19 studies genotyped on the iCOGS array indicated that the association between rs11075995 and ER-negative disease is not explained or modified by our measure of BMI (BMI-adjusted OR = 1.16, 95% CI = 1.09–1.24, P = 1.1 × 10⁻⁵; Pinteraction = 0.912; data not shown). Furthermore, conditional analyses indicated that the ER-negative disease–specific signal (rs11075995) is independent of rs17817449 (Supplementary Table 2). This finding adds to the increasing evidence of distinct signals at the same locus for different subtypes of cancers occurring at the same site, including, for example, 5p15.33 (TERT-CLPTM1L; ref. 16) and 14q24.1
(RAD51B, also known as RAD51L1; ref. 8) in breast cancer and 5p15.33 (TERT-CLPTM1L; ref. 16) and HNF1B (ref. 17) in ovarian cancer. Detailed fine mapping of known and newly identified breast cancer–associated regions will be required to determine whether additional subtype-specific signals exist in these regions.
In an attempt to investigate the likely genes responsible for the observed risk associations, we examined associations between SNPs with available genotype (rs4245739, rs12710696 and rs6678914) and RNA expression in data from 382 primary breast tumors, including 81 ER-negative samples in The Cancer Genome Atlas (TCGA) database. None of the associations were significant after Bonferroni adjustment for multiple comparisons, whether considering only the immediately neighboring genes or all genes within a 1-Mb window of the lead SNP (data not shown).
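The Bonferroni adjustment mentioned above can be sketched as follows. The gene labels and P values are hypothetical placeholders chosen for illustration, not results from this study; each raw P value is multiplied by the number of tests and capped at 1.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P values: min(1, p * number_of_tests)."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw P values for SNP-expression tests (illustrative only).
raw = {"gene_A": 0.004, "gene_B": 0.03, "gene_C": 0.20, "gene_D": 0.65}

adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
significant = [gene for gene, p in adjusted.items() if p < 0.05]
print(adjusted, significant)
```

Widening the window from neighboring genes to all genes within 1 Mb increases the number of tests, so the same raw P value faces a stricter threshold.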

Table 1 Associations of SNPs and ER-negative breast cancer risk in populations of European ancestry

Stage                   T/I(b)  Studies  Cases    Controls  RAF   OR (95% CI)        P             Phet   I² study het. (%)(c)

rs4245739; cytoband 1q32.1; gene MDM4; position(a) 202785465
GWAS, BPC3              I       7        2,069    25,385    0.28  1.07 (0.97–1.17)   0.177         -      -
GWAS, TNBCC             I       11       1,562    3,399     -     1.20 (1.08–1.32)   4.6 × 10⁻⁴    -      -
GWAS, C-BCAC            I       5        562      6,410     0.27  1.17 (1.02–1.35)   0.024         -      -
Follow-up, BCAC/iCOGS   -       40       6,512    41,451    0.26  1.13 (1.09–1.18)   8.5 × 10⁻⁹    -      -
Meta-analysis           -       63       10,705   76,645    0.26  1.14 (1.10–1.18)   2.1 × 10⁻¹²   0.413  3.2

rs6678914; cytoband 1q32.1; gene LGR6; position(a) 20045399
GWAS, BPC3              I/T     7        2,069    25,385    0.59  1.12 (1.03–1.22)   0.007         -      -
GWAS, TNBCC             T       11       1,562    3,399     0.59  1.16 (1.05–1.27)   0.003         -      -
GWAS, C-BCAC            I/T     5        562      6,410     0.59  1.15 (1.01–1.30)   0.032         -      -
Follow-up, BCAC/iCOGS   -       40       6,514    41,452    0.59  1.08 (1.04–1.12)   1.8 × 10⁻⁴    -      -
Meta-analysis           -       63       10,707   76,646    0.59  1.10 (1.06–1.13)   1.4 × 10⁻⁸    0.481  0.0

rs12710696; cytoband 2p24.1; non-genic; position(a) 19184284
GWAS, BPC3              I       7        2,069    25,385    0.37  1.05 (0.96–1.14)   0.304         -      -
GWAS, TNBCC             I       11       1,562    3,399     -     1.17 (1.06–1.29)   0.001         -      -
GWAS, C-BCAC            I       5        562      6,410     0.37  1.00 (0.88–1.14)   0.947         -      -
Follow-up, BCAC/iCOGS   -       40       6,512    41,453    0.36  1.10 (1.06–1.15)   1.4 × 10⁻⁶    -      -
Meta-analysis           -       63       10,705   76,647    0.36  1.10 (1.06–1.13)   4.6 × 10⁻⁸    0.464  0.0

rs11075995; cytoband 16q12.2; gene KIAA1752-FTO; position(a) 52412792
GWAS, BPC3              I       7        2,069    25,385    0.24  1.15 (1.04–1.28)   0.008         -      -
GWAS, TNBCC             I       11       1,562    3,399     -     1.15 (1.03–1.28)   0.010         -      -
GWAS, C-BCAC            I       5        562      6,410     0.24  1.09 (0.92–1.28)   0.328         -      -
Follow-up, BCAC/iCOGS   -       40       6,513    41,453    0.24  1.10 (1.05–1.15)   4.2 × 10⁻⁵    -      -
Meta-analysis           -       63       10,706   76,647    0.24  1.11 (1.07–1.15)   4.0 × 10⁻⁸    0.079  24.3

Results are shown for the SNPs showing the strongest association in four loci reaching association P < 5 × 10⁻⁸ in meta-analyses of GWAS and follow-up data. RAF, risk allele frequency; freq., frequency.
(a) NCBI Build 36. (b) Imputed (I) and typed (T) SNPs: rs6678914 was typed in one BPC3 study (WGHS), three C-BCAC studies (ABCFS, SASBCAC, UK2) and all TNBCC studies and imputed in all other GWAS studies. (c) Result of Q test for heterogeneity of estimated ORs.
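The heterogeneity columns rest on a Q test. As a hedged sketch (not the authors' code), the block below computes Cochran's Q and the I² statistic, taken here as max(0, (Q − df)/Q) × 100, from the four stage-level estimates listed for rs11075995 in Table 1. The function name is illustrative. The published values were computed across the individual studies, so this stage-level calculation is not expected to reproduce the 24.3% shown in the table.

```python
import math

def cochran_q_i2(ors_with_ci, z=1.96):
    """Cochran's Q, degrees of freedom and I^2 (%) for (OR, lower, upper) tuples."""
    betas = [math.log(o) for o, _, _ in ors_with_ci]
    ses = [(math.log(hi) - math.log(lo)) / (2 * z) for _, lo, hi in ors_with_ci]
    weights = [1.0 / s ** 2 for s in ses]
    # Fixed-effect pooled log OR, used as the reference point for Q.
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2

# Stage-level ORs (95% CI) for rs11075995 as listed in Table 1.
rs11075995 = [
    (1.15, 1.04, 1.28),  # GWAS, BPC3
    (1.15, 1.03, 1.28),  # GWAS, TNBCC
    (1.09, 0.92, 1.28),  # GWAS, C-BCAC
    (1.10, 1.05, 1.15),  # follow-up, BCAC/iCOGS
]

q, df, i2 = cochran_q_i2(rs11075995)
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.1f}%")
```

Turning Q into a heterogeneity P value requires the chi-square distribution with df degrees of freedom, which is left out here to keep the sketch within the standard library.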

To provide a comprehensive analysis of common genetic loci for ER-negative breast cancer, we also evaluated associations between 67 known loci for overall breast cancer risk (26 previously reported and 41 newly identified; ref. 8) and ER-negative disease. On the basis of our meta-analysis of 10,707 ER-negative cases and 76,649 controls, 7 regions influenced risk of ER-negative disease at P < 5 × 10⁻⁸: 1p36.22 (PEX14), 5p15 (TERT-CLPTM1L), 2 independent loci at 6q25.1 (ESR1), 12p11.22 (PTHLH), 16q12.1 (TOX3) and 19p13.1 (BABAM1) (Supplementary Table 9). Only seven loci identified so far, the four reported here and the three previously reported located at 5p15 (ref. 5), 19p13.1 (ref. 6) and 20q11 (ref. 7), are specific to ER-negative disease.
In summary, our analyses provide further evidence for distinct etiological pathways for invasive ER-positive and ER-negative breast cancers. Fine mapping and functional studies of the susceptibility loci for ER-negative disease should provide important insights into the biological mechanisms of ER-negative breast cancer, potentially leading to the identification of new targets for therapy and prevention of this aggressive form of breast cancer.
URLs. BCAC, http://www.srl.cam.ac.uk/consortia/bcac/index.html; CIMBA, http://www.srl.cam.ac.uk/consortia/cimba/index.html/; COGS, http://www.cogseu.org/; GIANT, http://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium; OCAC, http://www.srl.cam.ac.uk/consortia/ocac/index.html; PRACTICAL, http://www.srl.cam.ac.uk/consortia/practical/index.html; TCGA, http://www.cancergenome.nih.gov/; 1000 Genomes Project, http://www.1000genomes.org/; GLU (Genotype Library and Utilities), http://code.google.com/p/glu-genetics/; UCSC Genome Browser, http://genome.ucsc.edu/.
Methods
Methods and any associated references are available in the online version of the paper.
Accession codes. Reference sequences for the human genome of the regions containing the following genes are available at NCBI under the indicated accessions: LRRN2, NC_000001.10; UBE2T, NC_000001.10; PTPN7, NC_000001.10; PEX14, NC_000001.10; LGR6, NC_000001.10; MDM4, NC_000001.10; TP73, NC_000001.10; PIK3C2B, NC_000001.10; OSR1, NC_000002.11; TERT, NC_000005.9; CLPTM1L, NC_000005.9; ESR1, NC_000006.11; PTHLH, NC_000012.11; RAD51B, NC_000014.8; FTO, NC_000016.9; TOX3, NC_000016.9; HNF1B, NC_000017.10; BRCA1, NC_000017.10; TP53, NC_000017.10; MERIT40, NC_000019.9.
Note: Supplementary information is available in the online version of the paper.
Acknowledgments
The authors wish to thank all the individuals who took part in these studies and all the researchers, clinicians and administrative staff who have enabled this work to be carried out. We are very grateful to Illumina, in particular J. Stone, S. McBean, J. Hadlington, A. Mustafa and K. Cook, for their help with designing the array. BCAC is funded by Cancer Research UK (C1287/A10118 and C1287/A12014) and by the European Community's Seventh Framework Programme under grant agreement 223175 (HEALTH-F2-2009-223175) (COGS). Meetings of BCAC have been funded by the European Union European Cooperation in Science and Technology (COST) programme (BM0606). BPC3 is funded by US National Cancer Institute cooperative agreements U01-CA98233, U01-CA98710, U01-CA98216 and U01-CA98758 and the Intramural Research Program of the US National Institutes of Health (NIH)/National Cancer Institute, Division of Cancer Epidemiology and Genetics. TNBCC is supported by Mayo Clinic Breast Cancer Study (MCBCS) (US NIH grants CA122340 and a Specialized Program of Research Excellence (SPORE) in Breast Cancer (CA116201)), grants from the Komen Foundation for the Cure and the Breast Cancer Research Foundation. Genotyping on the iCOGS array was funded by the European Union (HEALTH-F2-2009-223175), Cancer Research UK (C1287/A10710), US NIH grant CA122340, the Komen Foundation for the Cure, the Breast Cancer Research Foundation, the Canadian Institutes of Health Research (CIHR) for the CIHR Team in Familial Risks of Breast Cancer program (J. Simard and D.E.) and Ministry of Economic Development, Innovation and Export Trade of Quebec grant PSR-SIIRI701 (J. Simard, D.E. and P.H.). J. Simard holds the Canada Research Chair in Oncogenetics. Combination of the GWAS data was supported in part by US NIH Cancer Post-Cancer GWAS initiative grant U19 CA 148065-01 (DRIVE, part of the GAME-ON initiative) and Breakthrough Breast Cancer Research.
AUTHOR CONTRIBUTIONS
M.G.-C., F.J.C., S.L., K. Michailidou, M.K.S., P.D.P.P., C.V., D.F.E., C.A.H. and P. Kraft formed the writing group and drafted the manuscript. M.G.-C. coordinated the writing group. F.J.C., S.L., K. Michailidou, D.F.E., C.A.H. and P. Kraft performed statistical analyses of GWAS data. M.G.-C. and M.N.B. performed statistical analyses of BCAC follow-up studies and meta-analyses. P. Kraft coordinated the BPC3 GWAS, and M.G.-C., E.R., H.S.F., L.L.M., J.E.B., W.C.W., D.J.H. and S.J.C. led individual studies in the BPC3 scan. F.J.C. and C.V. coordinated the TNBCC GWAS, and D.E., P. Miron, P.A.F., J.C.-C., J.C., A.A., H.N., H. Brauch and G.G.G. led individual studies in the TNBCC scan. D.E. coordinated the C-BCAC GWAS, and H.N., J.L.H., J.C.-C. and P.H. led individual studies in the C-BCAC scan. D.F.E. conceived and coordinated the synthesis of the iCOGS array and led BCAC. P.H. coordinated COGS, and J.B. led the BCAC genotyping working group. A.G.-N., G.P., M.R.A., D.V., F.B., D.C.T. and F.J.C. coordinated genotyping of the iCOGS array. M.G.-C., P.D.P.P. and M.K.S. led the pathology working group in BCAC. M.E.S. was the lead pathologist in BCAC. W.J.H. performed automated scoring of tissue microarrays. A.M.D. and G.C.-T. led the quality control working group. J.D. and N.O. provided bioinformatics support. S.K.R. and G.A.C. performed FunciSNP bioinformatics analyses. M.K.B. and Q. Wang provided data management support for BCAC. G.G., A.A., A. Broeks, A.B.E., A.C., U.H., A.-S.D., A.G.U., A.H., A.H.W., A.I., the ABCTB Investigators, A.J.-V., A.J., A.K.G., R.W., A. Lindblom, A. Lophatananon, A.M.D., A.M.M., A.M.W.v.d.O., A.R., A. Swerdlow, A.
Schneeweiss, B.B., B.E.H., B.G.N., B.M.-M., B.P., C.B., C.B.A., C.-Y.C., C.C., C.D.B., C.-N.H., C.H.M.v.D., C.H.Y., C.J., C.M., C.M.S., C.O., C.R., C.-Y.S., C. Sohn, C. Stegmaier, C.-C.T., C.T., C.W.C., D.C., D.C.T., D.F.-J., D.G., D.I.C., D.J.P., D.J.S., D.K., D.L., D.O.S., D.S., D.T., D.V.D.B., E.D., C.V., E.J.R., E.J.S., E.M., E.M.J., E.V.B., E.W., F.A., FBCS, F.C.-C., F.C., F.H., F.L., F.M., F.R., F.S., G.A.C., G.C.-T., G.K.C., G.S., G.W.M., H.A.-C., H.C., H.F., H. Ito, H. Iwata, H. Mller, H. Miao, H.M.-H., H.P., H.T., H.W., I.d.S.S., I.K., I.L.A., I.T., J.A.K., J.D.F., J.E.O., J.I.A.P., J.J.H., J. Long, J. Lubinski, J. Liu, J. Lissowska, J.L.R.-G., J.M.H., J.P., J. Stone, J. Simard, J.W., J.-C.Y., K. Aittomki, K. Aaltonen, K.C., K.D., K.J., K.-T.K., K.L., K. Muir, K. Matsuo, K.P., K.S., K.S.C., L. Bernard, L. Baglietto, L. Bernstein, L. Beckmann, L.D., L.G., L.J.V.V., L.N.K., L.S., M.B., M.C.S., M.D., M.F.P., M.G.S., M. Jones, M. Johansson, M.J.H., M.J.K., M.K., M.K.B., M.L., M.M.G., M.P.L., M. Shrubsole, M. Shah, M.W.B., M.W.R.R., N.A.M.T., N.D., N.G.M., N.J., N.M., N.N.A., N.R., N.S., N.V.B., O.F., P.G., P.H., P.H.P., P. Kerbrat, P.L.-P., P.L., P. Mennde, P.N., P.P., P.R., P. Siriwanarangsan, P. Sharma, P.-E.W., Q.C., Q. Wang, Q. Waisfisz, R.B., R.G.Z., R.H., R.K., R.K.S., R.L.M., R.M.M., R.N.H., R.P., R.A.E.M.T., R. Tumino, R. Travis, S.A.I., S.E.B., S.E.H., S.F.N., S.G., S.H.T., S.K., S.K.R., S.L.D.-H., S.M., S.M.J., S. Nickels, S. Nyante, S.P.B., S. Sangrajrang, S.S.-B., S. Slager, S.S.C., T.A.M., T.B., T.D., T.H., T.T., V.A., V. Kristensen, V. Kataja, V.-M.K., W.B., W.L., W.R.D., W.T., X.-O.S., X.W., Y.F., Y.-T.G., Y.-D.K. and Y.Y. contributed to GWAS and/or BCAC follow-up studies. M.G.-C., F.J.C., S.L., K. Michailidou, M.K.S., P.D.P.P., C.V., D.F.E., C.A.H., P. Kraft, M.N.B., E.R., H.S.F., L.L.M., J.E.B., W.C.W., D.J.H., S.J.C., D.E., P.A.F., J.C.-C., J.C., A. Broeks, H.N., H. Brauch, H. Brenner, G.P., G.G.G., J.L.H., P. 
Miron, J.B., A.G.-N., M.R.A., D.V., F.B., M.E.S., W.J.H., G.G., A.A., A. Beck, A.B.E., A.C., U.H., A.-S.D., A.G.U., A.H., A.H.W., A.I., the ABCTB Investigators, A.J.-V., A.J., A.K.G., R.W., A. Lindblom, A. Lophatananon, A.M.D., A.M.M., A.M.W.v.d.O., A.R., A. Swerdlow, A. Schneeweiss, B.B., B.E.H., B.G.N., B.M.-M., B.P., C.B., C.B.A., C.-Y.C., C.C., C.D.B., C.-N.H., C.H.M.v.D., C.H.Y., C.J., C.M., C.M.S., C.O., C.R., C.-Y.S., C. Sohn, C. Stegmaier, C.-C.T., C.T., C.W.C., D.C., D.C.T., D.F.-J., D.G., D.I.C., D.J.P., D.J.S., D.K., D.L., D.O.S., D.S., D.T., D.V.D.B., E.D., C.V., E.J.R., E.J.S., E.M., E.M.J., E.V.B., E.W., F.A., FBCS, F.C.-C., F.C., F.H., F.L., F.M., F.R., F.S., G.A.C., G.C.-T., G.K.C., G.S., G.W.M., H.A.-C., H.C., H.F., H. Ito, H. Iwata, H. Mller, H. Miao, H.M.-H., H.P., H.T., H.W., I.d.S.S., I.K., I.L.A., I.T., J.A.K., J.D., J.D.F., J.E.O., J.I.A.P., J.J.H., J. Long, J. Lubinski, J. Liu, J. Lissowska, J.L.R.-G., J.M.H., J.P., J. Stone, J. Simard, J.W., J.-C.Y., K. Aittomki, K. Aaltonen, K.C., K.D., K.J., K.-T.K., K.L., K. Muir, K. Matsuo, K.P., K.S., K.S.C., L. Bernard, L. Baglietto, L. Bernstein, L. Beckmann, L.D., L.G., L.J.V.V., L.N.K., L.S., M.B., M.C.S., M.D., M.F.P., M.G.S., M.H., M. Jones, M. Johansson, M.J.H., M.J.K., M.K., M.K.B., M.L., M.M.G., M.P.L., M. Shrubsole, M. Shah, M.W.B., M.W.R.R., N.A.M.T., N.D., N.G.M., N.J., N.M., N.N.A., N.O., N.R., N.S., N.V.B., O.F., P.G., P.H., P.H.P., P. Kerbrat, P.L.-P., P.L.,

P. Menéndez, P.N., P.P., P.R., P. Siriwanarangsan, P. Sharma, P.-E.W., Q.C., Q. Wang, Q. Waisfisz, R.B., R.G.Z., R.H., R.K., R.K.S., R.L.M., R.M.M., R.N.H., R.P., R.A.E.M.T., R. Tumino, R. Travis, S.A.I., S.E.B., S.E.H., S.F.N., S.G., S.H.T., S.K., S.K.R., S.L.D.-H., S.M., S.M.J., S. Nickels, S. Nyante, S.P.B., S. Sangrajrang, S.S.-B., S. Slager, S.S.C., T.A.M., T.B., T.D., T.H., T.T., V.A., V. Kristensen, V. Kataja, V.-M.K., W.B., W.L., W.R.D., W.T., X.-O.S., X.W., Y.F., Y.-T.G., Y.-D.K., A. Mannermaa, A. Meindl, W.Z., P.D., M.S.G. and Y.Y. provided critical review of the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Chu, K.C. & Anderson, W.F. Rates for breast cancer characteristics by estrogen and progesterone receptor status in the major racial/ethnic groups. Breast Cancer Res. Treat. 74, 199–211 (2002).
2. Yang, X.R. et al. Associations of breast cancer risk factors with tumor subtypes: a pooled analysis from the Breast Cancer Association Consortium studies. J. Natl. Cancer Inst. 103, 250–263 (2011).
3. Blows, F.M. et al. Subtyping of breast cancer by immunohistochemistry to investigate a relationship between subtype and short and long term survival: a collaborative analysis of data for 10,159 cases from 12 studies. PLoS Med. 7, e1000279 (2010).
4. Mavaddat, N., Antoniou, A.C., Easton, D.F. & Garcia-Closas, M. Genetic susceptibility to breast cancer. Mol. Oncol. 4, 174–191 (2010).
5. Haiman, C.A. et al. A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor–negative breast cancer. Nat. Genet. 43, 1210–1214 (2011).
6. Stevens, K.N. et al. 19p13.1 is a triple negative–specific breast cancer susceptibility locus. Cancer Res. 72, 1795–1803 (2012).
7. Siddiq, A. et al. A meta-analysis of genome-wide association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Hum. Mol. Genet. 21, 5373–5384 (2012).
8. Michailidou, K. et al. Large-scale genotyping identifies 41 new breast cancer susceptibility loci. Nat. Genet. published online; doi:10.1038/ng.2563 (27 March 2013).
9. Turashvili, G. et al. Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis. BMC Cancer 7, 55 (2007).
10. Graham, K. et al. Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br. J. Cancer 102, 1284–1293 (2010).
11. Wang, H. & Yan, C. A small-molecule p53 activator induces apoptosis through inhibiting MDMX expression in breast cancer cells. Neoplasia 13, 611–619 (2011).
12. Couch, F.J. et al. Genome-wide association study in BRCA1 mutation carriers identifies novel loci associated with breast and ovarian cancer risk. PLoS Genet. 9, e1003212 (2013).
13. Atwal, G.S. et al. Altered tumor formation and evolutionary selection of genetic variants in the human MDM4 oncogene. Proc. Natl. Acad. Sci. USA 106, 10236–10241 (2009).
14. Frayling, T.M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).
15. Speliotes, E.K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948 (2010).
16. Bojesen, S.E. et al. Multiple independent TERT variants associated with telomere length and risks of breast and ovarian cancer. Nat. Genet. published online; doi:10.1038/ng.2566 (27 March 2013).
17. Shen, H. et al. Epigenetic analysis leads to identification of HNF1B as a subtype-specific susceptibility gene for ovarian cancer. Nat. Comm. published online; doi:10.1038/ncomms2629 (27 March 2013).


Montserrat Garcia-Closas1,2,180, Fergus J Couch3,180, Sara Lindstrom4,180, Kyriaki Michailidou5,180, Marjanka K Schmidt6,180, Mark N Brook1, Nick Orr2, Suhn Kyong Rhie7, Elio Riboli8, Heather S Feigelson9, Loic Le Marchand10, Julie E Buring11, Diana Eccles12, Penelope Miron13, Peter A Fasching14,15, Hiltrud Brauch16,17, Jenny Chang-Claude18, Jane Carpenter19, Andrew K Godwin20, Heli Nevanlinna21, Graham G Giles22,23, Angela Cox24, John L Hopper25, Manjeet K Bolla5, Qin Wang5, Joe Dennis5, Ed Dicks5, Will J Howat26, Nils Schoof27, Stig E Bojesen28, Diether Lambrechts29,30, Annegien Broeks6, Irene L Andrulis31,32, Pascal Guénel33,34, Barbara Burwinkel35,36, Elinor J Sawyer37, Antoinette Hollestelle38, Olivia Fletcher2, Robert Winqvist39, Hermann Brenner40, Arto Mannermaa41–43, Ute Hamann44, Alfons Meindl45,46, Annika Lindblom47, Wei Zheng48, Peter Devillee49,50, Mark S Goldberg51,52, Jan Lubinski53, Vessela Kristensen54,55, Anthony Swerdlow1, Hoda Anton-Culver56, Thilo Dörk57, Kenneth Muir58,59, Keitaro Matsuo60, Anna H Wu7, Paolo Radice61,62, Soo Hwang Teo63,64, Xiao-Ou Shu48, William Blot48,65, Daehee Kang66, Mikael Hartman67,68, Suleeporn Sangrajrang69, Chen-Yang Shen70,71, Melissa C Southey72, Daniel J Park72, Fleur Hammet72, Jennifer Stone25, Laura J Van't Veer6, Emiel J Rutgers6, Artitaya Lophatananon58, Sarah Stewart-Brown58, Pornthep Siriwanarangsan73, Julian Peto74, Michael G Schrauder14, Arif B Ekici75, Matthias W Beckmann14, Isabel dos Santos Silva74, Nichola Johnson2, Helen Warren74, Ian Tomlinson76,77, Michael J Kerin78, Nicola Miller78, Federick Marme35,79, Andreas Schneeweiss35,79, Christof Sohn35, Therese Truong33,34, Pierre Laurent-Puig80, Pierre Kerbrat81, Børge G Nordestgaard28, Sune F Nielsen28, Henrik Flyger82, Roger L Milne83, Jose Ignacio Arias Perez84, Primitiva Menéndez85, Heiko Müller40, Volker Arndt40, Christa Stegmaier86, Peter Lichtner87,88, Magdalena Lochmann46, Christina Justenhoven16,17, Yon-Dschun Ko89, The Gene ENvironmental Interaction and breast CAncer (GENICA) Network90, Taru A Muranen21, Kristiina Aittomäki91, Carl Blomqvist92, Dario Greco21, Tuomas Heikkinen21, Hidemi Ito60, Hiroji Iwata93, Yasushi Yatabe94, Natalia N Antonenkova95, Sara Margolin96, Vesa Kataja42,43,97, Veli-Matti Kosma41–43, Jaana M Hartikainen41–43, Rosemary Balleine98,99, kConFab Investigators90, Chiu-Chen Tseng7, David Van Den Berg7, Daniel O Stram7, Patrick Neven100, Anne-Sophie Dieudonné100, Karin Leunen100, Anja Rudolph18, Stefan Nickels18, Dieter Flesch-Janys101,102, Paolo Peterlongo61,62, Bernard Peissel103, Loris Bernard104,105, Janet E Olson3, Xianshu Wang3,106, Kristen Stevens3, Gianluca Severi22,25, Laura Baglietto22,25, Catriona McLean107, Gerhard A Coetzee7,108, Ye Feng7, Brian E Henderson7, Fredrick Schumacher7, Natalia V Bogdanova57,109, France Labrèche110, Martine Dumont111, Cheng Har Yip64, Nur Aishah Mohd Taib64, Ching-Yu Cheng67,68,112, Martha Shrubsole48, Jirong Long48, Katri Pylkäs39, Arja Jukkola-Vuorinen113, Saila Kauppila114, Julia A Knight32,115,116, Gord Glendon32, Anna Marie Mulligan117,118, Robertus A E M Tollenaar119, Caroline M Seynaeve38, Mieke Kriege38, Maartje J Hooning38, Ans M W van den Ouweland120, Carolien H M van Deurzen50, Wei Lu121, Yu-Tang Gao122, Hui Cai48, Sabapathy P Balasubramanian24, Simon S Cross123, Malcolm W R Reed24, Lisa Signorello48, Qiuyin Cai52, Mitul Shah124, Hui Miao68, Ching Wan Chan125, Kee Seng Chia68, Anna Jakubowska53, Katarzyna Jaworska53, Katarzyna Durda53, Chia-Ni Hsiung71, Pei-Ei Wu126, Jyh-Cherng Yu127, Alan Ashworth2, Michael Jones1, Daniel C Tessier128, Anna González-Neira129, Guillermo Pita129, M Rosario Alonso129, Daniel Vincent128, Francois Bacot128, Christine B Ambrosone130, Elisa V Bandera131, Esther M John132,133, Gary K Chen7, Jennifer J Hu134,135, Jorge L Rodriguez-Gil134,135, Leslie Bernstein136, Michael F Press137, Regina G Ziegler138, Robert M Millikan139,179, Sandra L Deming-Halverson48, Sarah Nyante139, Sue A Ingles7, Quinten Waisfisz140, Helen Tsimiklis141, Enes Makalic23,25, Daniel Schmidt23,25, Minh Bui23,25, Lorna Gibson74, Bertram Müller-Myhsok142, Rita K Schmutzler143,144, Rebecca Hein18,145, Norbert Dahmen146, Lars Beckmann147, Kirsimari Aaltonen21,91,92, Kamila Czene27, Astrid Irwanto148, Jianjun Liu148, Clare Turnbull1, Familial Breast Cancer Study (FBCS)90, Nazneen Rahman1, Hanne Meijers-Heijboer140, Andre G Uitterlinden149, Fernando Rivadeneira149, Australian Breast Cancer Tissue Bank (ABCTB) Investigators90, Curtis Olswold3, Susan Slager3, Robert Pilarski150, Foluso Ademuyiwa151, Irene Konstantopoulou152, Nicholas G Martin153, Grant W Montgomery153, Dennis J Slamon15,154, Claudia Rauh14, Michael P Lux14, Sebastian M Jud14, Thomas Bruning155, JoEllen Weaver156, Priyanka Sharma157, Harsh Pathak20, Will Tapper12, Sue Gerty12, Lorraine Durcan12, Dimitrios Trichopoulos4,158,159, Rosario Tumino160, Petra H Peeters161, Rudolf Kaaks18, Daniele Campa18, Federico Canzian18, Elisabete Weiderpass27,162–164, Mattias Johansson165, Kay-Tee Khaw166, Ruth Travis167, Françoise Clavel-Chapelon33,34, Laurence N Kolonel110, Constance Chen4, Andy Beck168,169, Susan E Hankinson170,171, Christine D Berg172, Robert N Hoover138, Jolanta Lissowska173, Jonine D Figueroa138, Daniel I Chasman11, Mia M Gaudet174, W Ryan Diver174, Walter C Willett175, David J Hunter4, Jacques Simard111, Javier Benitez129,176,177, Alison M Dunning124, Mark E Sherman138, Georgia Chenevix-Trench178, Stephen J Chanock138, Per Hall27, Paul D P Pharoah124,181, Celine Vachon3,181, Douglas F Easton5,181, Christopher A Haiman7,181 & Peter Kraft4,181
1Division

2013 Nature America, Inc. All rights reserved.

of Genetics and Epidemiology, Institute of Cancer Research, Sutton, UK. 2Breakthrough Breast Cancer Research Centre, The Institute of Cancer Research, London, UK. 3Mayo Clinic College of Medicine, Mayo Clinic, Rochester, Minnesota, USA. 4Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 5Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK. 6Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands. 7Department of Preventive Medicine, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California, USA. 8School of Public Health, Imperial College, London, UK. 9Kaiser Permanente, Institute for Health Research, Denver, Colorado, USA. 10Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, Hawaii, USA. 11Division of Preventive Medicine, Brigham and Womens Hospital, Boston, Massachusetts, USA. 12Faculty of Medicine, University of Southampton, Southampton, UK. 13Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 14Department of Gynecology and Obstetrics, University Breast Center Franconia, University Hospital Erlangen, Erlangen, Germany. 15Jonsson Comprehensive Cancer Center, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA. 16Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany. 17University of Tbingen, Tbingen, Germany. 18Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany. 19Australian Breast Cancer Tissue Bank, University of Sydney at the Westmead Millennium Institute, Westmead, New South Wales, Australia. 20Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA. 
21Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland. 22Cancer Epidemiology Centre, The Cancer Council Victoria, Melbourne, Victoria, Australia. 23School of Population Health, The University of Melbourne, Melbourne, Victoria, Australia. 24Cancer Research UK/Yorkshire Cancer Research Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK. 25Centre for Molecular, Environmental, Genetic and Analytic Epidemiology, The University of Melbourne, Melbourne, Victoria, Australia. 26Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK. 27Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 28Copenhagen General Population Study, Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, University of Copenhagen, Copenhagen, Denmark. 29Vesalius Research Center (VRC), VIB, Leuven, Belgium. 30Department of Oncology, University of Leuven, Leuven, Belgium. 31Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada. 32Ontario Cancer Genetics Network, Fred A. Litwin Center for Cancer Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 33University ParisSud, Unit Mixte de Recherche Scientifique (UMRS) 1018, Villejuif, France. 34INSERM (National Institute of Health and Medical Research), CESP (Center for Research in Epidemiology and Population Health), Environmental Epidemiology of Cancer, Villejuif, France. 35Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany. 36Molecular Epidemiology Group, DKFZ, Heidelberg, Germany. 37Division of Cancer Studies, National Institute for Health Research (NIHR) Comprehensive Biomedical Research Centre, Guys & St. Thomas National Health Service (NHS) Foundation Trust in partnership with Kings College London, London, UK. 
38Department of Medical Oncology, Erasmus University Medical Center–Daniel Den Hoed Cancer Center, Rotterdam, The Netherlands. 39Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Genetics, Biocenter Oulu, University of Oulu, Oulu University Hospital, Oulu, Finland. 40Division of Clinical Epidemiology and Aging Research, DKFZ, Heidelberg, Germany. 41Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland. 42School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, Kuopio, Finland. 43Biocenter Kuopio, Cancer Center of Eastern Finland, University of Eastern Finland, Kuopio, Finland. 44Molecular Genetics of Breast Cancer, DKFZ, Heidelberg, Germany. 45Division for Gynaecological Tumor Genetics, Clinic of Gynaecology and Obstetrics, Technische Universität München, Munich, Germany. 46Division of Gynaecology and Obstetrics, Technische Universität München, Munich, Germany. 47Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden. 48Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt-Ingram Cancer Center, Division of Epidemiology, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 49Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands. 50Department of Pathology, Erasmus University Medical Center, Rotterdam, The Netherlands. 51Department of Medicine, McGill University, Montreal, Quebec, Canada. 52Division of Clinical Epidemiology, McGill University Health Centre, Royal Victoria Hospital, Montreal, Quebec, Canada. 53Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland. 54Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway. 55Faculty of Medicine (Faculty Division Ahus), Universitetet i Oslo, Oslo, Norway. 56Department of Epidemiology, University of California–Irvine, Irvine, California, USA. 57Department of Obstetrics and Gynaecology, Hannover Medical School, Hannover, Germany. 58Warwick Medical School, Warwick University, Coventry, UK. 59Institute of Population Health, University of Manchester, Manchester, UK. 60Division of Epidemiology and Prevention, Aichi Cancer Center Research Institute, Nagoya, Japan. 61Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione IRCCS Istituto Nazionale Tumori (INT), Milan, Italy. 62IFOM, Fondazione Istituto FIRC di Oncologia Molecolare, Milan, Italy. 63Cancer Research Initiatives Foundation, Sime Darby Medical Centre, Subang Jaya, University Malaya Cancer Research Institute, University Malaya, Kuala Lumpur, Malaysia. 64Breast Cancer Research Unit, University Malaya Cancer Research Institute, University Malaya, Kuala Lumpur, Malaysia. 65International Epidemiology Institute, Rockville, Maryland, USA. 66Seoul National University College of Medicine, Seoul, Korea. 67Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. 68Saw Swee Hock School of Public Health, National University of Singapore, Singapore. 69National Cancer Institute, Bangkok, Thailand. 70College of Public Health, China Medical University, Taichung, Taiwan. 71Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan. 72Department of Pathology, The University of Melbourne, Melbourne, Victoria, Australia. 73Ministry of Public Health, Bangkok, Thailand. 74Non-communicable Disease Epidemiology Department, London School of Hygiene and Tropical Medicine, London, UK. 75Institute of Human Genetics, Friedrich Alexander University Erlangen-Nuremberg, Erlangen, Germany.
76Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK. 77Oxford Biomedical Research Centre, University of Oxford, Oxford, UK. 78Department of Surgery, Clinical Science Institute, University Hospital and National University of Ireland, Galway, Ireland. 79National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany. 80INSERM, Université Paris Sorbonne Cité, UMRS 775, Paris, France. 81Centre Eugène Marquis, Department of Medical Oncology, Rennes, France. 82Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Copenhagen, Denmark. 83Genetic & Molecular Epidemiology Group, Human Cancer Genetics Program, Spanish National Cancer Research Centre (CNIO), Madrid, Spain. 84Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain. 85Servicio de Anatomía Patológica, Hospital Monte Naranco, Oviedo, Spain. 86Saarland Cancer Registry, Saarbrücken, Germany. 87Institute of Human Genetics, Technische Universität München, Munich, Germany. 88Institute of Human Genetics, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 89Department of Internal Medicine, Evangelische Kliniken Bonn, Johanniter Krankenhaus, Bonn, Germany. 90A list of members is provided in the Supplementary Note. 91Department of Clinical Genetics, Helsinki University Central Hospital, Helsinki, Finland. 92Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland. 93Department of Breast Oncology, Aichi Cancer Center Hospital, Nagoya, Japan. 94Department of Pathology and Molecular Diagnostics, Aichi Cancer Center Hospital, Nagoya, Japan. 95N.N. Alexandrov Research Institute of Oncology and Medical Radiology, Minsk, Belarus. 96Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden. 97Cancer Center, Kuopio University Hospital, Kuopio, Finland.
98Western Sydney Local Health District, Westmead Millennium Institute for Medical Research, University of Sydney, Sydney, New South Wales, Australia. 99Nepean Blue Mountains Local Health District, Westmead Millennium Institute for Medical Research, University of Sydney, Sydney, New South Wales, Australia. 100Multidisciplinary Breast Center, University Hospital Gasthuisberg, Department of Oncology, University of Leuven, Leuven, Belgium. 101Department of Cancer Epidemiology/Clinical Cancer Registry, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 102Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany. 103Unit of Medical Genetics, Department of Preventive and Predictive Medicine, Fondazione IRCCS INT, Milan, Italy. 104Department of Experimental Oncology, Istituto Europeo di Oncologia, Milan, Italy. 105Cogentech Cancer Genetic Test Laboratory, Milan, Italy. 106Department of Laboratory Medicine and Pathology, Division of Experimental Pathology, Mayo Clinic, Rochester, Minnesota, USA. 107Department of Anatomical Pathology, The Alfred Hospital, Melbourne, Victoria, Australia. 108Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, California, USA. 109Department of Radiation Oncology, Hannover Medical School, Hannover, Germany. 110Département de Médecine Sociale et Préventive, Département de Santé Environnementale et Santé au Travail, Université de Montréal, Montreal, Quebec, Canada. 111Cancer Genomics Laboratory, Centre Hospitalier Universitaire de Québec and Laval University, Quebec City, Quebec, Canada. 112Singapore Eye Research Institute, National University of Singapore, Singapore. 113Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland. 114Department of Pathology, Oulu University Hospital, University of Oulu, Oulu, Finland. 115Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
116Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada. 117Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada. 118Department of Laboratory Medicine, Keenan Research Centre of the Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, Ontario, Canada. 119Department of Surgical Oncology, Leiden University Medical Center, Leiden, The Netherlands. 120Department of Clinical Genetics, Erasmus University Medical Center, Rotterdam, The Netherlands. 121Shanghai Center for Disease Control and Prevention, Shanghai, China. 122Department of Epidemiology, Shanghai Cancer Institute, Shanghai, China. 123Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, Sheffield, UK. 124Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK. 125Department of Surgery, National University Health System, Singapore. 126Taiwan Biobank, Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan. 127Department of Surgery, Tri-Service General Hospital, Taipei, Taiwan. 128McGill University and Génome Québec Innovation Centre, Montreal, Québec, Canada. 129Human Genotyping Unit–CEGEN, Human Cancer Genetics Programme, CNIO, Madrid, Spain. 130Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, USA. 131The Cancer Institute of New Jersey, New Brunswick, New Jersey, USA. 132Cancer Prevention Institute of California, Fremont, California, USA. 133Department of Health Research and Policy, Division of Epidemiology, Stanford Cancer Institute, Stanford University School of Medicine, Stanford, California, USA. 134Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, Florida, USA. 135Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, Florida, USA.
136Division of Cancer Etiology, Department of Population Science, Beckman Research Institute, City of Hope, Duarte, California, USA. 137Department of Pathology, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California, USA. 138Epidemiology and Biostatistics Program, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA. 139Department of Epidemiology, Gillings School of Global Public Health, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, USA. 140Section of Oncogenetics, Department of Clinical Genetics, VU University Medical Center, Amsterdam, The Netherlands. 141Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne, Melbourne, Victoria, Australia. 142Statistical Genetics Research Group, Max Planck Institute of Psychiatry, Munich, Germany. 143Centre of Hereditary Breast and Ovarian Cancer, University Hospital, Cologne, Germany. 144Centre of Integrated Oncology, University Hospital, Cologne, Germany. 145PMV (Primärmedizinische Versorgung) Research Group, Department of Child and Adolescent Psychiatry and Psychotherapy, University of Cologne, Cologne, Germany. 146Department of Psychiatry, University of Mainz, Mainz, Germany. 147Institute for Quality and Efficiency in Health Care (IQWiG), Cologne, Germany. 148Human Genetics Division, Genome Institute of Singapore, Singapore. 149Department of Internal Medicine and Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands. 150Department of Internal Medicine, James Comprehensive Cancer Center, Ohio State University, Columbus, Ohio, USA. 151Roswell Park Cancer Institute, Buffalo, New York, USA. 152Molecular Diagnostics Laboratory, Institute of Radioisotopes and Radiodiagnostic Products (IRRP), National Centre for Scientific Research Demokritos, Aghia Paraskevi Attikis, Athens, Greece.
153QIMR GWAS Collective, Queensland Institute of Medical Research, Brisbane, Queensland, Australia. 154Department of Medicine, Division of Hematology and Oncology, University of California, Los Angeles, Los Angeles, California, USA. 155Institute for Prevention and Occupational Medicine of the German Social Accident Insurance (IPA), Bochum, Germany. 156Biosample Repository, Fox Chase Cancer Center, Philadelphia, Pennsylvania, USA. 157Division of Hematology and Oncology, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA. 158Bureau of Epidemiologic Research, Academy of Athens, Athens, Greece. 159Hellenic Health Foundation, Athens, Greece. 160Cancer Registry, Histopathology Unit, Civile M.P. Arezzo Hospital, Ragusa, Italy. 161Julius Center, University Medical Center Utrecht, Utrecht, The Netherlands. 162Department of Community Medicine, University of Tromsø, Tromsø, Norway. 163Folkhälsan Research Cancer Centre, Helsinki, Finland. 164Cancer Registry of Norway, Oslo, Norway. 165Genetic Epidemiology Group, International Agency for Research on Cancer, World Health Organization, Lyon, France. 166Clinical Gerontology Unit, University of Cambridge, Cambridge, UK. 167Cancer Epidemiology Unit, Nuffield Department of Clinical Medicine, University of Oxford, Oxford, UK. 168Department of Pathology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA. 169Department of Pathology, Harvard Medical School, Boston, Massachusetts, USA. 170Division of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts, Amherst, Massachusetts, USA. 171Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA. 172Division of Cancer Prevention, National Cancer Institute, Bethesda, Maryland, USA. 173Department of Cancer Epidemiology and Prevention, M Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland.
174Epidemiology Research Program, American Cancer Society, Atlanta, Georgia, USA. 175Department of Nutrition, Harvard School of Public Health, Boston, Massachusetts, USA. 176Human Genetics Group, CNIO, Madrid, Spain. 177Centro de Investigacion en Red de Enfermedades Raras (CIBERER), Madrid, Spain. 178Department of Genetics, Queensland Institute of Medical Research, Brisbane, Queensland, Australia. 179Deceased. 180These authors contributed equally to this work. 181These authors jointly directed this work. Correspondence should be addressed to M.G.-C. (montse.garciaclosas@icr.ac.uk).

ONLINE METHODS

ER-negative breast cancer GWAS. Three GWAS of ER-negative breast cancer were conducted in populations of European ancestry by National Cancer Institute (NCI) BPC3 (refs. 7,18), TNBCC (refs. 5,6) and C-BCAC. ER-negative status for BPC3 and C-BCAC cases was determined from review of medical records or state cancer registry information. TNBCC focused on triple-negative cases, defined as individuals with ER-negative, PR-negative and HER2-negative breast cancer using data from medical records (refs. 5,6). The BPC3 GWAS included 2,188 ER-negative cases and 26,477 controls from 8 studies (CPSII, EPIC, MEC, NHS, NHSII, PLCO, PBCS and WGHS), genotyped using different versions of Illumina SNP arrays (refs. 7,18). A total of 1,718 triple-negative cases from 11 studies (ABCTB, BBCC, DFCI, FCCC, GENICA, HEBCS, MARIE, MCBCS, MCCS, POSH and SBCS) were genotyped for the TNBCC GWAS using Illumina SNP arrays (ref. 5). Data for TNBCC controls (N = 3,670) were obtained from a Finnish study (HEBCS) and publicly available controls of European ancestry from the United States (CGEMS), Germany (KORA), Australia (QIMR) and the UK (Wellcome Trust Case Control Consortium 2, WTCCC2) genotyped using Illumina arrays (ref. 5). Samples from the four latter studies are not counted in the total number of TNBCC studies because they only provided controls for other studies. C-BCAC performed a meta-analysis of 9 GWAS that included data on 10,052 breast cancer cases and 12,575 controls (ref. 8). Five studies (ABCFS, MARIE, HEBCS, SASBAC and UK2) provided data on ER status from medical records or cancer registries and contributed data on 702 ER-negative cases and 7,713 controls of European ancestry. All C-BCAC studies were genotyped with versions of Illumina arrays. Control data for C-BCAC were obtained from individual studies or publicly available data. Standard genotyping quality control procedures were performed for each GWAS as previously described (refs. 5,7,8).

Estimated per-allele log(OR) and standard error were calculated for each SNP using unconditional logistic regression on allele counts (dosages), as implemented in ProbABEL (ref. 19). Analyses were adjusted by study, country of origin or principal components as previously described (refs. 5,7,8). Analyses assumed a log-additive genetic model, and P values were based on the 1-degree-of-freedom Wald test. Quantile-quantile plots from each GWAS showed no substantial evidence for cryptic population substructure or differential genotype calling between cases and controls. The estimated inflation factor (λ) was 1.02 for BPC3 (ref. 7), 1.04 for TNBCC (ref. 6) and 0.98 for C-BCAC (Supplementary Fig. 1). SNPs were selected for the iCOGS custom genotyping array separately by each participating group (see details in Michailidou et al. (ref. 8)). BPC3 nominated independent SNPs with a 1-degree-of-freedom log-additive trend test P < 0.02 or with P < 0.02 for one of several auxiliary tests, including tests for dominant or recessive effects of the minor allele and case-only tests comparing PR-positive to PR-negative tumors. SNPs from C-BCAC were selected on the basis of the 1-degree-of-freedom trend test for ER-negative disease. TNBCC nominated SNPs on the basis of log-additive trend test P < 0.01. Subsequent analyses that combined OR estimates across GWAS and follow-up samples only included SNPs that had been directly genotyped on the iCOGS array and had passed genotyping quality control. SNPs successfully genotyped on iCOGS but not included on the chips used for the GWAS were imputed within each GWAS before combining results with iCOGS data. Imputation was performed within each study and genotyping array using the HapMap Phase 2 CEU reference panel and MACH software package v1.0. SNPs with low imputation quality (r² < 0.3) or minor allele frequency (MAF) < 1% were excluded.

iCOGS genotyping. Samples for follow-up analyses were drawn from 50 studies participating in BCAC (40 from populations of predominantly European ancestry (including CTS, DEMOKRITOS, NBCS, NBHS, OSUCCG, RPCI and SKKDKFZS from TNBCC), 9 of Asian ancestry and 1 of African-American ancestry) with information on ER status. Most breast cancer cases in BCAC studies have not been tested for BRCA1 mutations; however, the frequency of mutations in the studied populations is expected to be low. Samples were genotyped as part of the COGS Project using a custom Illumina Infinium array (iCOGS) at four genotyping centers (Supplementary Table 1). The most common source of data for ER, PR and HER2 status was medical records, followed by immunohistochemistry performed on tumor tissue microarrays (TMAs) or whole-section tumor slides. Breast cancer cases in the BCAC follow-up with missing data on ER status and cases from one study (PBCS) that included only ER-positive cases are excluded from this report. Studies were required to provide ~2% of samples in duplicate. The iCOGS chip included a total of 211,155 SNPs selected in collaboration with other consortia of BRCA1 and BRCA2 mutation carriers (CIMBA), ovarian cancer (OCAC) and prostate cancer (PRACTICAL). Genotype calling and quality control analyses were conducted by a single analysis center at the University of Cambridge (ref. 8). A total of 13,276 SNPs proposed by the combined ER-negative GWAS yielded high-quality genotype data (5,738 from BPC3, 4,628 from TNBCC and 2,910 from C-BCAC).

Statistical analysis. After quality control exclusions (ref. 8), BCAC follow-up data were analyzed using the Genotype Library and Utilities (GLU) package to estimate per-allele ORs and standard errors for each SNP using unconditional logistic regression. Analyses were stratified by ancestry (European, Asian or African). For samples of European ancestry, BCAC follow-up analyses were adjusted for seven principal components (the first six plus an additional component to reduce inflation for the LMBC study). GWAS and BCAC follow-up results were combined using inverse variance–weighted fixed-effects meta-analysis, as implemented in METAL (ref. 20). Forest plots showing study-specific estimates and fixed-effects meta-analysis for SNPs showing genome-wide significance were drawn using the command metan in STATA v.12. Samples that overlapped among the three GWAS and the BCAC follow-up were identified by concordance of genotypes and removed from either the GWAS or follow-up data set before this analysis so that each data set contributing to the meta-analysis was independent of the others (see Supplementary Table 1 for the counts of cases and controls included in the analyses after removing overlapping samples). Heterogeneity by study was evaluated using the Q statistic.
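The combination step described above can be illustrated with a minimal sketch. It assumes only the textbook definitions of an inverse variance-weighted fixed-effects estimate and of Cochran's Q; it is not METAL's code, and the function name and the example inputs are hypothetical rather than values from this report.

```python
import math


def fixed_effects_meta(log_ors, std_errs):
    """Combine per-study log(OR) estimates by inverse variance weighting.

    Returns the pooled log(OR), its standard error, the pooled OR, a
    two-sided normal P value and Cochran's Q with its degrees of freedom.
    """
    if len(log_ors) != len(std_errs) or not log_ors:
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errs]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    # Under homogeneity it is compared with a chi-square on k - 1 df.
    q_stat = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    return {
        "log_or": pooled,
        "se": pooled_se,
        "or": math.exp(pooled),
        "p": p_value,
        "q": q_stat,
        "df": len(log_ors) - 1,
    }


# Hypothetical per-study estimates, for illustration only.
result = fixed_effects_meta([0.10, 0.14, 0.08], [0.03, 0.05, 0.04])
print(result)
```

Studies with smaller standard errors dominate the pooled estimate, which is why each contributing data set must be independent of the others: overlapping samples would be counted twice in the weights.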
Analyses in this report focused first on the 13,276 SNPs proposed by the ER-negative breast cancer GWAS. For SNPs showing evidence of association with ER-negative breast cancer at P < 1 × 10⁻⁶, we also evaluated correlated SNPs in the rest of COGS and reported on the most significant SNP in the region. For the regions that reached genome-wide statistical significance (P < 5 × 10⁻⁸), we performed additional analyses examining heterogeneity in the associated effect by tumor type and subject characteristics using the most significant SNP in the region. The associations between these SNPs and ER-positive breast cancer were assessed using 25,227 ER-positive cases of European ancestry in BCAC who had been genotyped as part of the COGS Project. Differences in the strength of the associations with ER-positive and ER-negative breast cancers were assessed using case-only analyses (Supplementary Table 5). Stratum-specific estimates of per-allele OR by categories of age and family history of disease were obtained from logistic regression models (Supplementary Tables 6 and 7), and differences in ORs across strata were tested using an ordinal-product interaction term. We also assessed associations between the most significant markers and ER-negative breast cancer in Asian and African-American populations. The Asian-ancestry analyses included 1,547 ER-negative cases and 6,624 controls in 9 studies from BCAC. The African-American analyses included 91 ER-negative cases and 252 controls in 1 BCAC study and 988 ER-negative cases and 2,745 controls in 9 studies from AABC (ref. 5) (Supplementary Table 1). Both the Asian-ancestry and African-American analyses adjusted for the first two principal components of genetic variation, calculated separately in each ancestry group. Differences by ancestry were tested by a χ² test comparing summary ORs across the three ancestry groups.

Bioinformatics. In an attempt to identify functionality in regions of interest, we used the open-source R/Bioconductor package FunciSNP version 0.1.14 (Functional Integration of SNPs) (ref. 21) (S.K.R., S.G. Coetzee, H. Noushmehr, C. Yan, J.M. Kim et al., unpublished data), which systematically integrates 1000 Genomes Project SNP data (June 2011 data release) with chromatin features of interest. For each of the four newly associated ER-negative breast cancer markers, we analyzed all SNPs within a 1-Mb window that were in LD (r² > 0.5) with the index marker (according to the 1000 Genomes Project CEU panel). We assessed whether these SNPs colocalized with 13 different chromatin features that capture open chromatin regions and enhancers across the genome, using data generated by next-generation sequencing technologies. Information on open chromatin states (H3K9ac and H3K14ac), nucleosome-depleted regions (DNase I and FAIRE), enhancers (H3K4me1) and active/engaged enhancers (H3K27ac) was either generated by the Coetzee Laboratory (S.K.R. et al., unpublished data) or harvested from the Encyclopedia of DNA Elements (ENCODE) Project. All chromatin features were identified in normal human mammary epithelial cells (HMECs) and triple-negative breast cancer cells (MDA-MB-231). We used the UCSC Genome Browser (see URLs) with potentially functional SNPs identified using FunciSNP and chromatin feature tracks to generate images (Supplementary Fig. 5).

Ethics. All women in participating studies provided written consent for the research, and approval for the study was obtained from the local ethical review board relevant to each institution. Collection of blood samples and clinical data from subjects was performed in accordance with local guidelines and regulations.
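As an illustration of the single-SNP statistics named under "ER-negative breast cancer GWAS" above (the 1-degree-of-freedom Wald test and the inflation factor λ), the following sketch may help. It assumes the common median-based definition of λ; the Methods do not state which estimator was used. The function names and the example inputs are hypothetical.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def wald_chi2(log_or, std_err):
    """1-df Wald chi-square statistic for a per-allele log(OR)."""
    return (log_or / std_err) ** 2


def wald_p(log_or, std_err):
    """Two-sided P value for the 1-df Wald test (normal approximation)."""
    return math.erfc(abs(log_or / std_err) / math.sqrt(2.0))


def inflation_factor(chi2_stats):
    """Median-based genomic inflation factor (an assumed definition).

    Values near 1 suggest little inflation of the test statistics from
    population substructure or differential genotype calling.
    """
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical estimates for three SNPs, for illustration only.
estimates = [(0.05, 0.04), (-0.02, 0.03), (0.12, 0.05)]
stats = [wald_chi2(b, se) for b, se in estimates]
print([wald_p(b, se) for b, se in estimates], inflation_factor(stats))
```

In practice λ is computed over all tested SNPs in a scan, so the three-SNP input here only demonstrates the call pattern.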
18. Hunter, D.J. et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39, 870–874 (2007).
19. Aulchenko, Y.S., Struchalin, M.V. & van Duijn, C.M. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11, 134 (2010).
20. Willer, C.J., Li, Y. & Abecasis, G.R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
21. Coetzee, S.G., Rhie, S.K., Berman, B.P., Coetzee, G.A. & Noushmehr, H. FunciSNP: an R/bioconductor tool integrating functional non-coding data sets with genetic association studies to identify candidate regulatory SNPs. Nucleic Acids Res. 40, e139 (2012).

doi:10.1038/ng.2561

ANALYSIS

Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies
Nilanjan Chatterjee1, Bill Wheeler2, Joshua Sampson1, Patricia Hartge1, Stephen J Chanock1 & Ju-Hyun Park1,3
1Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Rockville, Maryland, USA. 2Information Management System, Rockville, Maryland, USA. 3Department of Statistics, Dongguk University–Seoul, Seoul, South Korea. Correspondence should be addressed to N.C. (chattern@mail.nih.gov). Received 10 May 2012; accepted 8 February 2013; published online 3 March 2013; doi:10.1038/ng.2579

We report a new method to estimate the predictive performance of polygenic models for risk prediction and assess predictive performance for ten complex traits or common diseases. Using estimates of effect-size distribution and heritability derived from current studies, we project that although 45% of the variance of height has been attributed to SNPs, a model trained on one million people may only explain 33.4% of variance of the trait. Models based on current studies allow for identification of 3.0%, 1.1% and 7.0% of the populations at twofold or higher than average risk for type 2 diabetes, coronary artery disease and prostate cancer, respectively. Tripling of sample sizes could elevate these percentages to 18.8%, 6.1% and 12.2%, respectively. The utility of polygenic models for risk prediction will depend on achievable sample sizes for the training data set, the underlying genetic architecture and the inclusion of information on other risk factors, including family history.

For quite some time, many have predicted that the identification of heritable disease susceptibility markers, such as common genetic variants, could eventually lead to stable models for prediction of risk with important individual and public health implications (ref. 1). Even for a trait such as breast cancer, which manifests a modest degree of familial aggregation, a polygenic model based on a comprehensive set of genetic variants could achieve sufficient discriminatory power and thus be applied in targeted screening programs (ref. 2). To date, genome-wide association studies (GWAS) have identified thousands of common susceptibility variants for a wide spectrum of complex traits. Recent studies, however, indicate that for most individual traits, the loci discovered so far explain only a small fraction of heritability and thus collectively have low predictive power (refs. 3–11).

Although the phenomenon of missing heritability (refs. 12,13) can be due to many factors, such as an overestimation of heritability itself, lack of knowledge of gene-gene and gene-environment interactions and contributions from rare variants, there is increasing recognition that a substantial part of the heritability comes from a large number of common SNPs, each of which individually has too small an effect to be detected at the stringent genome-wide significance level with current sample sizes (refs. 14–18). Recent studies, for example, have indicated that although about 200 loci identified through a large GWAS involving more than 100,000 subjects can explain only ~10% of the variance of adult height (ref. 6), a set of common SNPs included in existing GWAS platforms can explain up to 45% of the variance of the same trait (ref. 16). There have also been similar studies for several other complex traits (refs. 17,19–21).

The gap between estimates of heritability based on known loci and those based on the comprehensive set of common susceptibility variants raises the possibility of substantially improving prediction performance of risk models by using a polygenic approach, one that includes many SNPs that do not reach the stringent threshold for genome-wide significance. A major factor that determines how well such a model can predict a trait value in an independent sample will be the sample size of the training data set on which the prediction model is built. Intuitively, as the sample size for the training data set increases, effects of underlying SNPs can be more precisely estimated. Correspondingly, the underlying true polygenic model, which harnesses the full predictive power associated with the total heritability attributable to the SNPs, will be more accurately approximated.

In this report, we measure the ability of models based on current as well as future GWAS to improve the prediction of individual traits. We develop a new theoretical framework that characterizes the relationship between sample size and predictive performance of a polygenic model based on the number and distribution of effect sizes for the underlying susceptibility SNPs and the optimal balance of type I and type II error associated with the underlying criterion of SNP selection.
Based on this, we provide a realistic assessment of the predictive performance of a polygenic model for each of ten complex traits, namely, the quantitative traits height, body mass index (BMI), total cholesterol, high-density lipoprotein (HDL) and low-density lipoprotein (LDL), and the disease traits Crohns disease, type 1 diabetes (T1D), type 2 diabetes (T2D), coronary artery disease (CAD) and prostate cancer. We used a range of effect-size distributions that are consistent with both known discoveries, 412 in total, reported from the largest GWAS of these traits and recent estimates of the narrow-sense heritability, that is, the total heritability of the traits attributable to additive effects of common SNPs. The results provide several insights into the predictive ability of polygenic models based on existing GWAS, the marginal utility of an increase in sample size, the sample-size threshold beyond which

npg

2013 Nature America, Inc. All rights reserved.

400

VOLUME 45 | NUMBER 4 | APRIL 2013 Nature Genetics

a n a ly s i s
the predictive ability of the models may reach a plateau, the optimal threshold for SNP selection, and the joint utility of family history information and polygenic risks. The general theoretical framework we provide can be used to make projections for the predictive utility of different polygenic model-building strategies that may use alternate statistical algorithms and/or could incorporate other types of effects, such as those due to gene-gene interactions and rare variants.

RESULTS
Throughout, we assess the predictive performance of a model based on its predictive correlation coefficient (PCC), which, for a continuous outcome, is equivalent to Pearson's correlation coefficient between true and predicted outcomes for the underlying population of subjects. For a binary disease outcome, we show that PCC has a one-to-one mathematical correspondence to the area under the curve (AUC) statistic and other standard measures for discriminatory performance of risk models. In deriving this formula, we assumed a simple but popular (ref. 22) model-building algorithm in which SNPs are first selected for inclusion in the model depending on whether the corresponding individual tests of association achieve a specified significance threshold (α), and then a polygenic score is built by weighting the selected SNPs based on their estimated regression coefficients. Details of the underlying models and assumptions are available in Online Methods.

The relationship between predictive performance of the model and the sample size (N) for the training data set is shown in equation (1) in Online Methods, which forms the basis of our analytical calculations. Simulation studies confirmed the accuracy of this equation (Supplementary Fig. 1). According to this formula, the predictive performance of a model depends on (i) the number of true susceptibility SNPs (M_1) compared to the total number of SNPs under study (M), (ii) the true effect sizes (β_m values) for the underlying susceptibility SNPs, (iii) the chosen significance level (α) for SNP selection, (iv) the power of the underlying association test to reach that significance level, and (v) the expected value of the estimated regression coefficients and their squared values for the selected SNPs. The sample size of the training data set (N) influences both the power of the association test statistics and the deviations of the estimated regression coefficients from their true values (Online Methods). Given an effect-size distribution, because the number of underlying susceptibility SNPs (M_1) determines the total variability of the trait explainable by the underlying model, equation (1) can be rewritten in terms of narrow-sense heritability (h_g^2), which is defined for the purpose of this report to be the heritability of a trait owing to additive effects of common tagging SNPs included on current, commercially available SNP microarrays (equation (2) in Online Methods). In all our subsequent analyses, we assume that the genotyping platforms on which most current GWAS have been conducted contain, on average, approximately M = 200,000 independent SNPs.

Figure 1 PCC for polygenic models and corresponding optimal significance level for SNP selection under three models for polygenic architectures for adult height. (a,b) Expected value of PCC^2 (a) and corresponding optimal significance level (α_opt; b) as a function of sample size (N). (c) PCC values reported in a predictive analysis of the GIANT study (dashed line) versus corresponding theoretical expected values under the three different models. Each model assumes a total of 45% of phenotypic variance of adult height can be explained by common SNPs included in standard GWAS platforms involving M = 200,000 independent SNPs. Effect-size distribution for susceptibility SNPs was assumed to follow an exponential distribution (black line), a mixture of two exponential distributions (red line) or a mixture of three exponential distributions (blue line).

To model a complex trait, we first investigated the predictive performance of polygenic models for adult height. In Figure 1 we show that the predictive accuracy of polygenic models greatly depends on the distribution of effect sizes even when all distributions result in a total heritability of 45% (ref. 16). Predictive performance of the model for all sample sizes was the highest when an exponential distribution underlies the effect sizes. Predictive performance of the model decreased substantially under a two-component, exponential-mixture model, which, compared to the exponential model, provided a much better fit to the observed effect sizes of the known SNPs by allowing for the presence of more SNPs, each with smaller effect (Supplementary Table 1). Finally, the performance of the model was the lowest under a three-component exponential-mixture distribution, which allows an even larger number of SNPs with smaller effects and produces results that are most consistent with the observed discoveries in the GIANT study (ref. 6; Supplementary Table 1). Our methods reproduced results from a predictive analysis reported in the GIANT study in which distinct polygenic models had been built with different significance thresholds for SNP selection, and their predictive performance was empirically

Table 1 Characteristics of ten complex traits and associated GWAS used in reported analysis

Trait    h_g^2   Effective sample size    Number of        Heritability explained
                 for the largest GWAS     detected SNPs    by detected SNPs
Height   0.45    133,000                  108              0.066
BMI      0.14    162,000                  31               0.014
TC       –       100,000                  45               0.063
HDL      0.12    100,000                  35               0.046
LDL      –       95,000                   36               0.059
CD       0.22    25,000                   64               0.066
T1D      0.30    22,000                   30               0.053
T2D      0.51    36,000                   22               0.034
PrCA     0.22    28,000                   20               0.061
CAD      –       73,000                   21               0.024

TC, total cholesterol; CD, Crohn's disease; PrCA, prostate cancer. Estimates of h_g^2, that is, phenotype variability owing to total additive effects of common SNPs, for height, BMI, HDL, CD, T1D and T2D are from published studies (refs. 20, 21, 35), and h_g^2 for PrCA is based on internal analysis of a new GWAS at the National Cancer Institute involving ~5,000 cases and 5,000 controls genotyped on the Illumina Omni 2.5M platform. For qualitative traits, estimates of h_g^2 are shown in the liability-threshold scale. Characteristics of largest GWAS and associated discoveries were obtained from published reports (refs. 6–8, 10, 36–39). For each trait, an effective sample size was calculated for a single-stage study that has equivalent power as the original study, taking into account multistage genotyping and selective sampling by family history for PrCA. For height, sample size and reported discoveries correspond to only the first stage of the GIANT study. The number of discoveries reported accounts for any genomic control adjustment used in the original study.

Figure 2 Expected PCC for polygenic models at optimal significance level for SNP selection for four quantitative traits. (a–d) For HDL and BMI, range of performance is shown corresponding to estimate of h_g^2 (yellow line) and associated 95% confidence interval (dark blue region). For LDL and total cholesterol, for which direct estimate of h_g^2 was not available, a range of values were chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution was assumed to follow a mixture of three exponential distributions, which together with h_g^2 was calibrated to explain observed discoveries from the largest GWAS (Online Methods). Vertical dotted line corresponds to the sample size for the current largest genome-wide scans.
Panels: (a) BMI; (b) total cholesterol; (c) HDL; (d) LDL. Axes: PCC^2 versus sample size N.

assessed using independently held out datasets. Our method, when applied to the three-component mixture exponential distribution at the given sample size of the GIANT study (N = 130,000), provided an accurate approximation for the entire profile of the observed predictive performance of these polygenic models (Fig. 1).

Equation (1) in Online Methods illustrates how the tradeoff between specificity and sensitivity of the SNP selection criterion affects the predictive performance of the model. With a more liberal significance threshold (α), the PCC value will increase through the power of the association tests but will decrease as a function of the underlying type I error (α). In Figure 1 we illustrate the optimal threshold for SNP selection that would maximize predictive performance of a model for adult height. Under both the two- and three-component mixture distributions for effect sizes, the optimal significance level initially increased with an increase in sample size, then it plateaued and subsequently remained constant or decreased slightly. In contrast, under the single-exponential distribution that corresponds to stronger effect sizes, the optimal significance level becomes more stringent as sample size increases.

We next examined the potential predictive performance of polygenic models for a variety of traits that include both quantitative (BMI, total cholesterol, HDL and LDL) and qualitative phenotypes (Crohn's disease, T1D, T2D, CAD and prostate cancer) that together demonstrate a spectrum of estimated heritability (Table 1).
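The threshold tradeoff described for SNP selection can be illustrated with a small Monte Carlo sketch. This is not the paper's equation (1); it is a simplified stand-in under assumptions introduced here: independent SNPs standardized to unit variance, a unit-variance trait, marginal estimates distributed roughly as N(β_m, 1/N), and a hypothetical architecture in which per-SNP variance contributions are exponential and rescaled to sum to h_g^2. The function name `simulate_pcc2` and its default arguments are illustrative only.

```python
import math
import random
from statistics import NormalDist


def simulate_pcc2(n_train, alpha, h2=0.45, m_total=20_000, m_causal=500, seed=1):
    """Monte Carlo PCC^2 of a p-value-thresholded polygenic score.

    Assumes independent SNPs standardized to unit variance and a trait with
    unit variance, so each marginal estimate is roughly N(beta_m, 1/n_train).
    """
    rng = random.Random(seed)
    # Hypothetical architecture: exponential per-SNP variance contributions,
    # rescaled so that they sum to h2.
    raw = [rng.expovariate(1.0) for _ in range(m_causal)]
    scale = h2 / sum(raw)
    betas = [rng.choice((-1.0, 1.0)) * math.sqrt(v * scale) for v in raw]
    betas += [0.0] * (m_total - m_causal)  # null SNPs

    z_cut = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided z threshold
    se = 1.0 / math.sqrt(n_train)

    cov = 0.0  # covariance between the trait and the score
    var = 0.0  # variance of the score
    for beta in betas:
        beta_hat = beta + rng.gauss(0.0, se)  # noisy training estimate
        if abs(beta_hat) / se >= z_cut:       # SNP passes the selection test
            cov += beta * beta_hat
            var += beta_hat ** 2
    return 0.0 if var == 0.0 else cov * cov / var


if __name__ == "__main__":
    for alpha in (5e-8, 1e-4, 1e-2, 1.0):
        print(f"alpha={alpha:g}  PCC^2={simulate_pcc2(50_000, alpha):.3f}")
```

Because the same seed is reused, every threshold is evaluated on the same simulated effects and noise, so differences between rows reflect only the selection rule. With α = 1 every null SNP enters the score and its noise inflates the score variance, which is why including all SNPs can perform poorly at realistic sample sizes.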

Figure 3, panels a–d: (a) Crohn's disease; (b) T1D with MHC region; (c) T1D without MHC region; (d) T2D. Axes: AUC versus sample size N.

For most traits, we consider a range for the underlying effect-size distributions that are in accord with both reported discoveries from the largest GWAS and recent estimates of h_g^2 (Online Methods and Supplementary Tables 2 and 3). For a few traits for which external estimates of h_g^2 are not available, we considered a range of its values within the limits of total heritability and effect-size distributions that can produce results consistent with the observed discoveries in the largest GWAS. For all traits, the expected performance of the polygenic models built based on current GWAS (sample size = N) can be predicted fairly accurately (Figs. 2 and 3). Although it may be possible to improve the performance of these models by including SNPs that do not achieve strict genome-wide significance levels, the models are expected to have low to modest predictive power even after optimization of the SNP selection criterion (Table 2). As sample sizes of future studies increase, the projected performance of the models will have a wider range, reflecting the uncertainty associated with estimates of heritability. Nevertheless, it is evident that only very large sample sizes can substantially improve the performance of the models, even in some of the best-case scenarios. For prostate cancer, for example, although a polygenic model built based on the current largest GWAS can be expected to achieve an AUC statistic of about 63%, in the future, a model built based on as many as three times that sample size is expected to yield an AUC statistic of only 64–70% (Fig. 3). For all disease traits except CAD, it appears that the marginal utility of additional samples can be quite small after the size of GWAS reaches 100,000–200,000 subjects. In contrast, for CAD, BMI, and the lipid traits total cholesterol and LDL, the performance of predictive models may continue to improve gradually over a much wider range of sample sizes, as high as 500,000 to one million subjects.
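To give the AUC values quoted above a more tangible scale, the following sketch converts an AUC into the standard deviation of the underlying score on the log relative-risk scale. It relies on an assumption that is not taken from the paper's Online Methods: a rare disease with a log-linear risk model, so that the score is normally distributed with the same variance in cases and controls. The function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def auc_from_log_risk_sd(sigma):
    """AUC of a score that is normal on the log-risk scale with SD sigma.

    Assumes a rare disease and a log-linear risk model, so the score is
    N(mu, sigma^2) in controls and N(mu + sigma^2, sigma^2) in cases.
    """
    return _STD_NORMAL.cdf(sigma / math.sqrt(2.0))


def log_risk_sd_from_auc(auc):
    """Inverse mapping: SD of the log-risk score implied by a given AUC."""
    return math.sqrt(2.0) * _STD_NORMAL.inv_cdf(auc)


if __name__ == "__main__":
    for auc in (0.63, 0.64, 0.70):  # prostate cancer values quoted in the text
        sd = log_risk_sd_from_auc(auc)
        print(f"AUC {auc:.2f} -> SD of log relative risk ~ {sd:.2f}")
```

Under these assumptions, moving from an AUC of 0.63 to 0.70 corresponds to the score's spread on the log-risk scale growing from roughly 0.47 to roughly 0.74, which is one way to read how much extra signal the larger studies would need to capture.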
Figure 3 Expected AUC statistics at optimal significance level for SNP selection for five disease traits. (a–f) For all diseases except CAD, range of performance is shown corresponding to the estimate of h_g^2 (yellow line) and associated 95% confidence intervals (dark blue region). For CAD, for which direct estimate of h_g^2 was not available, a range of its values were chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution was assumed to follow a mixture of two- or three-exponential distributions, which together with h_g^2 was calibrated to explain observed discoveries from the largest GWAS (Online Methods). Vertical dotted line corresponds to the sample size for the current largest genome-wide scans.
Panels (continued): (e) prostate cancer; (f) CAD. Axes: AUC versus sample size N.

Table 2 Projected discriminatory performance (AUC statistic) for polygenic risk models

                                            N                         3N                        5N                        10N
Trait  AUC with  Current sample  Model      α = 10^-7    α_opt        α = 10^-7    α_opt        α = 10^-7    α_opt        α = 10^-7    α_opt
       FH alone  size (N)
CD     0.612     17,000          SNPs       0.71         0.74         0.77         0.82         0.81         0.84         0.84         0.86
                                 SNPs + FH  0.79         0.81         0.83         0.87         0.86         0.89         0.89         0.90
T1D    0.533     16,000          SNPs       0.84 (0.67)  0.84 (0.69)  0.85 (0.71)  0.86 (0.73)  0.86 (0.73)  0.86 (0.75)  0.86 (0.75)  0.87 (0.75)
                                 SNPs + FH  0.94 (0.70)  0.94 (0.71)  0.95 (0.74)  0.96 (0.76)  0.96 (0.76)  0.96 (0.77)  0.96 (0.77)  0.96 (0.78)
T2D    0.595     22,000          SNPs       0.57         0.60         0.62         0.71         0.67         0.76         0.74         0.79
                                 SNPs + FH  0.63         0.66         0.67         0.74         0.71         0.78         0.77         0.81
PrCA   0.552     24,000          SNPs       0.63         0.63         0.64         0.66         0.66         0.69         0.69         0.71
                                 SNPs + FH  0.65         0.66         0.66         0.68         0.68         0.71         0.71         0.73
CAD    0.601     57,000          SNPs       0.58         0.59         0.59–0.60    0.62–0.64    0.61–0.62    0.64–0.67    0.64–0.66    0.67–0.69
                                 SNPs + FH  0.65         0.65         0.66         0.67–0.69    0.66–0.68    0.69–0.71    0.68–0.71    0.71–0.73

Results are shown for models including SNPs at genome-wide significance level (α = 10^-7) and at optimized significance threshold (α_opt). FH, presence of any family history in first-degree relatives. Prevalences of FH for CAD, prostate cancer (PrCA) and T2D are 0.14 (ref. 40), 0.07 (ref. 41) and 0.143 (ref. 42), respectively. Prevalences of FH for T1D and Crohn's disease (CD) are taken to be 0.005 and 0.01, which are the same as the disease prevalence (ref. 35). For all diseases except PrCA, the current sample size is shown for the first stage of the respective largest GWAS. For PrCA, where a large number of SNPs were followed to stage 2, an effective sample size is shown for stages 1 and 2 combined. Results for T1D are shown with or without (in parentheses) contribution of the MHC region. For all diseases except CAD, AUC values are shown corresponding to point estimates of h_g^2 in Table 1. For CAD, for which direct estimate of h_g^2 was not available, a range of values were chosen based on constraints imposed by the observed discoveries. For all traits, the underlying effect-size distribution was assumed to follow a mixture of two- or three-exponential distributions, which together with h_g^2 was appropriately calibrated to explain observed discoveries from the largest GWAS to date.


Predictive performance of a model strongly depends on the extent of heritability of the trait. For any given sample size, more accurate prediction is possible for more heritable traits, such as Crohn's disease and T1D, than for less heritable traits such as CAD, prostate cancer and T2D, which is in accord with classical estimates of heritability based on sibling and twin studies. Accordingly, the ability of the models to identify individuals likely to develop the disease among high-risk groups varies (Table 3). For example, using models based on current GWAS, the proportion of future cases that could be identified among the top 20% of subjects with highest polygenic risk is 71% for T1D and about 32% for T2D. If the sample size for a future GWAS is tripled, then the proportion would be expected to increase to 75% and 48%, respectively. For the three common chronic diseases, the proportion of the population that can be identified to have twofold or higher risk than an average person ranged from 1.1% (CAD) to 7.0% (prostate cancer) for models built based on current sample sizes (Supplementary Table 4). If the sample size in future studies could be tripled, then these proportions could be 6.1% (CAD) and 18.8% (T2D).
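The link between an AUC and the share of cases captured in a top-risk group can be sketched as follows. The calculation assumes a rare disease and a polygenic score that is normal on the log-risk scale with equal variance in cases and controls; this is a simplification introduced here, not the paper's own derivation, and the function name is illustrative. Under that assumption the SNP-only AUCs in Table 2 (current sample size, optimized threshold) map to values close to the corresponding SNP-only entries in Table 3.

```python
import math
from statistics import NormalDist


def cases_in_top_fraction(auc, top=0.20):
    """Share of cases whose score falls in the top `top` fraction of the population.

    Assumes a rare disease and a normal log-risk score with equal variance in
    cases and controls, so that AUC = Phi(sigma / sqrt(2)).
    """
    std = NormalDist()
    sigma = math.sqrt(2.0) * std.inv_cdf(auc)  # SD of the log-risk score
    cutoff = std.inv_cdf(1.0 - top)            # population cutoff in SD units
    # Cases sit sigma SD units above the population mean on this scale.
    return 1.0 - std.cdf(cutoff - sigma)


if __name__ == "__main__":
    # SNP-only AUCs at the current sample size and optimized threshold (Table 2).
    table2_auc = {"CD": 0.74, "T1D": 0.84, "T2D": 0.60, "PrCA": 0.63, "CAD": 0.59}
    for trait, auc in table2_auc.items():
        print(f"{trait}: {cases_in_top_fraction(auc):.2f}")
```

Because the tabulated AUCs are rounded to two decimals, agreement with Table 3 should only be expected to within about 0.01.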

For all diseases, family history information alone provides low discriminatory ability. However, models that include both family history and polygenic scores can perform substantially better than models that use polygenic scores alone, especially for rare, highly familial conditions such as Crohn's disease and T1D. Even if polygenic scores could be built in the future based on very large sample sizes (for example, sample size = 5N), family history is expected to remain an important variable for identifying high-risk subjects (Tables 2 and 3).

DISCUSSION
Our analysis demonstrated that the predictive ability of polygenic models depends not only on total heritability but also on the underlying effect-size distributions. Effect-size distributions from large GWAS suggest that although risk prediction models will continue to improve as total sample size increases, the improvement will be slow and modest even when common SNPs account for a large proportion of heritability of the underlying traits. Our analysis also shows that under the most likely effect-size distributions, the optimal significance threshold for selecting SNPs for prediction models


Table 3 Proportion of cases followed among 20% of subjects with highest polygenic risk

                                  N                         3N                        5N                        10N
Trait  Current sample  Model      α = 10^-7    α_opt        α = 10^-7    α_opt        α = 10^-7    α_opt        α = 10^-7    α_opt
       size (N)
CD     17,000          SNPs       0.48         0.52         0.58         0.65         0.62         0.72         0.72         0.75
                       SNPs + FH  0.61         0.65         0.70         0.77         0.75         0.80         0.81         0.83
T1D    16,000          SNPs       0.71 (0.42)  0.71 (0.44)  0.73 (0.48)  0.75 (0.51)  0.75 (0.51)  0.76 (0.54)  0.76 (0.54)  0.77 (0.55)
                       SNPs + FH  0.91 (0.46)  0.92 (0.48)  0.94 (0.52)  0.95 (0.56)  0.95 (0.56)  0.95 (0.58)  0.95 (0.59)  0.96 (0.60)
T2D    22,000          SNPs       0.28         0.32         0.34         0.48         0.41         0.55         0.52         0.63
                       SNPs + FH  0.40         0.42         0.43         0.54         0.48         0.60         0.57         0.66
PrCA   24,000          SNPs       0.35         0.35         0.37         0.40         0.39         0.44         0.44         0.48
                       SNPs + FH  0.40         0.40         0.41         0.44         0.43         0.47         0.47         0.51
CAD    57,000          SNPs       0.29         0.30         0.31         0.34–0.37    0.32–0.34    0.38–0.41    0.36–0.40    0.42–0.45
                       SNPs + FH  0.42         0.42         0.42–0.43    0.44–0.46    0.43–0.44    0.46–0.49    0.46–0.48    0.49–0.52

Results are shown for models including SNPs at genome-wide significance level (α = 10^-7) and at optimized significance threshold (α_opt). FH, presence of any family history in first-degree relatives. Prevalences of FH for CAD, prostate cancer (PrCA) and T2D are 0.14 (ref. 40), 0.07 (ref. 41) and 0.143 (ref. 42), respectively. Prevalences of FH for T1D and Crohn's disease (CD) are taken to be 0.005 and 0.01, which are the same as the disease prevalence (ref. 35). For all diseases except PrCA, the current sample size is shown for the first stage of the respective largest GWAS. For PrCA, where a large number of SNPs were followed to stage 2, an effective sample size is shown for stages 1 and 2 combined. Results for T1D are shown with or without (in parentheses) contribution of the MHC region. For all diseases except CAD, values are shown corresponding to point estimates of h_g^2 available from GWAS studies. For CAD, for which direct estimate of h_g^2 was not available, a range of values were chosen based on constraints imposed by observed discoveries. For all traits, the underlying effect-size distribution was assumed to follow a mixture of two- or three-exponential distributions, which together with h_g^2 was appropriately calibrated to explain observed discoveries from the largest GWAS to date.

in large GWAS can be more liberal than the standard threshold (for example, P < 5 × 10^-8) used for discovery.

We observed that for less common, highly familial conditions, such as T1D and Crohn's disease, risk models that include family history and optimal polygenic scores based on current GWAS can identify a large majority of cases by targeting a small group of high-risk individuals (for example, subjects who fall in the highest quintile of risk). In contrast, for more common conditions with modest familial components, such as T2D, CAD and prostate cancer, risk models based on GWAS with current sample sizes (N) or foreseeable sample sizes in the near future (for example, 3N) can miss a large proportion (>50%) of cases by targeting a small group of high-risk individuals. For these common diseases, polygenic models using current GWAS data can identify only a small minority of the population with elevated risk. Based on our model, we suggest that it is necessary to at least triple the sample size of current GWAS to substantially increase the proportion of high-risk populations identified by polygenic models. Perhaps one day GWAS or sequencing will be carried out as part of standard clinical care, and such information together with electronic medical records could then be used to build polygenic models based on sufficiently large studies.

Consistent with a previous report (ref. 23), our analysis of T1D with and without contribution of the major histocompatibility complex (MHC) region highlights the limited incremental discriminatory ability of polygenic scores for diseases that have established common and strong risk factors. Nevertheless, for most diseases, polygenic scores are expected to contribute substantially in addition to family history.
One could also expect that in the foreseeable future even crude family history information, such as the presence or absence of the disease in any first-degree relative, will remain an important contributing factor for predicting disease risk in the general population. More detailed information on extended family history, including age-at-onset information, could enhance the predictive utility of these models, especially for applications in high-risk families.

Our analysis extends beyond prior reports that projected the predictive performance of polygenic models (refs. 24–27), most of which relied on simulation studies. A previous report (ref. 25) had noted that predictive performance of models that include all GWAS SNPs in a polygenic score without SNP selection depends only on the sample size of the training data set and h_g^2. More general theory shows that an algorithm that includes all SNPs in a model, that is, uses the significance level of α = 1, could be poor, and the predictive performance of more efficient algorithms is expected to depend on the underlying effect-size distribution. Previous simulation studies often have relied on hypothetical effect-size distributions. Here we used the effect-size distributions that are implied by constraints imposed by both known discoveries reported from some of the largest GWAS to date and recent estimates of heritability to realistically depict the future of genetic-risk prediction.

Our results are generally consistent with a recent analysis (ref. 28) that used information on risk in monozygotic twins to examine the absolute limits of personalized medicine achievable by genome sequencing under the assumption that such technology can ultimately lead to an ideal model that can capture the full spectrum of genetic risk without possibility of any error.
In this report, we provide much sharper bounds for what can be achieved in practice using current or future GWAS by taking into account the likely error associated with estimation of underlying risk that is inevitable because of constraints on sample sizes. Emerging effect-size distributions suggest that GWAS will require huge sample sizes to approach the ideal predictive power associated with additive effects of common SNPs. Using a metric used

in this report together with the assumption of independent susceptibility alleles across traits, for example, we predict that although GWAS in principle can identify 55.1% of the population that might have twofold or higher risk than average for at least one of the three common diseases, CAD, T2D and prostate cancer, the actual proportion achievable using current GWAS data is only 10.7% and that tripling the sample size could increase this to 33.1%. If the susceptibility alleles across these traits are related, however, these proportions could be higher.

Here we made projections based on a simple GWAS polygenic model-building algorithm (refs. 6, 22) after its optimization with respect to the criteria for SNP inclusion. The general framework we constructed (Supplementary Note), however, can be used to assess the likely performance of other, possibly even more efficient, model-building strategies. Using this framework, for example, we project that an algorithm that uses least absolute shrinkage and selection operator (LASSO)-type (ref. 29) thresholds and can analyze all SNPs simultaneously may outperform the standard GWAS polygenic model-building algorithm. This may be particularly interesting for large sample sizes and highly heritable traits such as height, but we also note that the gains are generally modest in scope (Supplementary Fig. 2). Simultaneous modeling of correlated SNPs in small genomic regions can unmask allelic heterogeneity, possibly adding to the overall predictive strength of the models (refs. 8, 30). Other strategies may include linear mixed modeling (ref. 16) and Bayesian methods (refs. 31, 32) that can construct polygenic scores based on shrinkage estimates for SNP coefficients using specific priors for the effect-size distribution.
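The difference between the standard select-then-weight rule and a LASSO-type threshold can be shown on a single SNP coefficient. The paper's LASSO-type algorithm is described in its Supplementary Note and is not reproduced here; the sketch below only uses the textbook fact that, for independent standardized predictors, the LASSO estimate reduces to soft thresholding of the marginal estimate. The function names and the cutoff value are illustrative.

```python
import math


def hard_threshold(beta_hat, cutoff):
    """Standard rule: keep the estimate unchanged if it clears the cutoff."""
    return beta_hat if abs(beta_hat) >= cutoff else 0.0


def soft_threshold(beta_hat, cutoff):
    """LASSO-type rule: shrink toward zero by `cutoff`, dropping small estimates."""
    return math.copysign(max(abs(beta_hat) - cutoff, 0.0), beta_hat)


if __name__ == "__main__":
    cutoff = 0.02  # hypothetical threshold on the coefficient scale
    for beta_hat in (-0.05, -0.01, 0.01, 0.03, 0.05):
        print(beta_hat, hard_threshold(beta_hat, cutoff), soft_threshold(beta_hat, cutoff))
```

Both rules drop the same SNPs; the soft rule additionally shrinks the retained weights, which partly offsets the upward bias of estimates that were selected because they happened to be large.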
Although the absolute performance of different algorithms could be somewhat different across settings, the main results we highlight regarding the order of sample sizes required to improve risk prediction are intrinsically related to the underlying effect sizes and are likely to be observed with other algorithms as well.

Our proposed theoretical framework can be used to speculate on the predictive performance of polygenic models that could be built based on rare variants. In an additional illustration (Supplementary Fig. 3), under a model that allows a large number of susceptibility loci each containing sets of low-penetrance rare variants, we assessed how polygenic models might perform if variants are included in a model as individual cofactors versus using a gene-collapsing strategy that has been advocated for improving power for association tests (ref. 33). We observed that up to a certain range of sample sizes for the training data set, models based on collapsed variables often can perform better, apparently because of the improved power for detection of the underlying susceptibility loci. For larger sample sizes, however, their performance might fall short compared to models based on individual variants, because collapsed variables, which may include neutral variants, can substantially dilute the effects of the susceptibility loci; the magnitude of such dilution may not diminish with increasing sample size for naive collapsing methods. In the future, it will be of great importance to determine the sample sizes at which such an inflection point would occur for different traits depending on the underlying genetic architecture.

Here we used a flexible class of mixture-exponential models to specify effect-size distributions. One could specify effect-size distributions using alternate parametric models such as Weibull, gamma or beta distributions, all of which can generate L-shaped distributions that appear to be natural for specification of effect sizes of common SNPs.
Although the performance of polygenic models could differ widely in principle under different effect-size distributions, additional analyses (data not shown) indicate that when such models were restricted so that they can also explain discoveries and estimates of heritabilities reported from current GWAS, each produced results that
are qualitatively similar to what we report using the mixture of exponential distributions. For future studies of rare variants, however, the range of plausible models for effect-size distributions is substantial, and thus, evaluating the likely performance of polygenic models based on such variants remains challenging (Supplementary Fig. 3).

In conclusion, we used a newly developed model together with empirical observations from large GWAS to comprehensively evaluate future polygenic risk models using common susceptibility SNPs. Although our analysis points to challenges for achieving high discriminatory power for polygenic risk models, especially for common diseases, it is noteworthy that even models with modest discriminatory power can provide important stratification for absolute risk, thus providing a rationale for potential public health applications such as weighing risks and benefits for a treatment or an intervention (ref. 34). For most common diseases, existing models based on established environmental risk factors, if any, also have modest discriminatory power and face additional challenges for long-term risk prediction, as risk-factor history, unlike susceptibility status, can change over the lifetime of an individual. In the future, development of robust prediction models will need to integrate a spectrum of alleles, from rare to common variants, and other risk factors as well. The framework outlined in this paper could be used to identify challenges and opportunities for public health application as well as the resources needed to develop such models.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
This research was supported by the intramural program of the US National Cancer Institute.

AUTHOR CONTRIBUTIONS
N.C. led the development of the statistical methods and drafted the manuscript. J.-H.P. contributed to the development of the methods and performed the illustrative analyses. B.W. implemented simulation studies. J.S., P.H. and S.J.C. contributed to designs of various analyses and interpretation of results. N.C., B.W., J.S., P.H., S.J.C. and J.-H.P. reviewed and revised the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Bowles Biesecker, B. & Marteau, T.M. The future of genetic counselling: an international perspective. Nat. Genet. 22, 133–137 (1999).
2. Pharoah, P.D. et al. Polygenic susceptibility to breast cancer and implications for prevention. Nat. Genet. 31, 33–36 (2002).
3. van Hoek, M. et al. Predicting type 2 diabetes based on polymorphisms from genome-wide association studies: a population-based study. Diabetes 57, 3122–3128 (2008).
4. Pharoah, P.D., Antoniou, A.C., Easton, D.F. & Ponder, B.A. Polygenes, risk prediction, and targeted prevention of breast cancer. N. Engl. J. Med. 358, 2796–2803 (2008).
5. Wacholder, S. et al. Performance of common genetic variants in breast-cancer risk models. N. Engl. J. Med. 362, 986–993 (2010).
6. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
7. Speliotes, E.K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948 (2010).
8. Teslovich, T.M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).
9. Jostins, L. & Barrett, J.C. Genetic risk prediction in complex disease. Hum. Mol. Genet. 20, R182–R188 (2011).
10. Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).
11. Kraft, P. & Hunter, D.J. Genetic risk prediction – are we there yet? N. Engl. J. Med. 360, 1701–1703 (2009).
12. Manolio, T.A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
13. Zuk, O., Hechter, E., Sunyaev, S.R. & Lander, E.S. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA 109, 1193–1198 (2012).
14. Park, J.H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42, 570–575 (2010).
15. Park, J.H. et al. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc. Natl. Acad. Sci. USA 108, 18026–18031 (2011).
16. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
17. Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525 (2011).
18. Park, J.H. & Dunson, D.B. Bayesian generalized product partition model. Statist. Sinica 20, 1203–1226 (2010).
19. Lee, S.H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 44, 247–250 (2012).
20. Stahl, E.A. et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat. Genet. 44, 483–489 (2012).
21. Vattikuti, S., Guo, J. & Chow, C.C. Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS Genet. 8, e1002637 (2012).
22. Purcell, S.M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
23. Clayton, D.G. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 5, e1000540 (2009).
24. Wray, N.R., Goddard, M.E. & Visscher, P.M. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 17, 1520–1528 (2007).
25. Daetwyler, H.D., Villanueva, B. & Woolliams, J.A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE 3, e3395 (2008).
26. Janssens, A.C. et al. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet. Med. 8, 395–400 (2006).
27. Mihaescu, R., Moonesinghe, R., Khoury, M.J. & Janssens, A.C. Predictive genetic testing for the identification of high-risk groups: a simulation study on the impact of predictive ability. Genome Med. 3, 51 (2011).
28. Roberts, N.J. et al. The predictive capacity of personal genome sequencing. Sci. Transl. Med. 4, 133ra58 (2012).
29. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996).
30. Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44, 369–375 (2012).
31. Goddard, M.E., Wray, N.R., Verbyla, K. & Visscher, P.M. Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 24, 517–529 (2009).
32. Guan, Y. & Stephens, M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5, 1780–1815 (2011).
33. Li, B. & Leal, S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).
34. Gail, M.H. Personalized estimates of breast cancer risk in clinical practice and public health. Stat. Med. 30, 1090–1104 (2011).
35. Lee, S.H., Wray, N.R., Goddard, M.E. & Visscher, P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88, 294–305 (2011).
36. Barrett, J.C. et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat. Genet. 41, 703–707 (2009).
37. Voight, B.F. et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589 (2010).
38. Eeles, R.A. et al. Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nat. Genet. 41, 1116–1121 (2009).
39. Schunkert, H. et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43, 333–338 (2011).
40. Scheuner, M.T. Genetic evaluation for coronary artery disease. Genet. Med. 5, 269–285 (2003).
41. Mai, P.L., Wideroff, L., Greene, M.H. & Graubard, B.I. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study. Public Health Genomics 13, 495–503 (2010).
42. Annis, A.M., Caulder, M.S., Cook, M.L. & Duquette, D. Family history, diabetes, and other demographic and risk factors among participants of the National Health and Nutrition Examination Survey 1999–2002. Prev. Chronic Dis. 2, A19 (2005).


ONLINE METHODS

Underlying polygenic model. We assume $Y$ is the outcome variable and $X_1, \ldots, X_M$ are a set of independent covariates that are potentially predictive of $Y$. Without loss of generality, we will assume all variables are standardized, so that $E(Y) = 0$ and $\mathrm{Var}(Y) = 1$ and similarly $E(X_m) = 0$ and $\mathrm{Var}(X_m) = 1$ for each $m$. We assume that the true relationship between outcome and the set of covariates can be described by the underlying model ($\mathcal{M}$)

$$Y = \sum_{m=1}^{M_1} \beta_m X_m + \sum_{m=M_1+1}^{M} 0 \times X_m + \varepsilon$$

where $M_1$ out of the $M$ covariates are truly predictive of $Y$. We also assume $\varepsilon$, the residual term, to be independently distributed of $X = (X_1, \ldots, X_M)$.

Measure of predictive performance of a model. Now suppose an estimated prediction model ($\hat{\mathcal{M}}$) is built based on a training data set of sample size $N$ to predict $Y$ using the formula

$$\hat{Y} = \sum_{m=1}^{M} \hat\beta_m \gamma_m X_m$$

where $\gamma_m$ is an indicator of whether the variable is selected ($\gamma_m = 1$) or not ($\gamma_m = 0$) and $\hat\beta_m$ is the estimate of $\beta_m$ for selected variables. We will denote $\lambda$ to be a generic threshold parameter for the underlying model selection algorithm. We define the predictive correlation for the model $\hat{\mathcal{M}}$ to be

$$R_N(\hat{\mathcal{M}}) = \mathrm{cor}_{\varepsilon,X}(Y, \hat{Y}) = \frac{\sum_{m=1}^{M_1} \beta_m \hat\beta_m \gamma_m}{\sqrt{\sum_{m=1}^{M} \hat\beta_m^2 \gamma_m}}$$

where the subscripts $X$ and $\varepsilon$ signify that the correlation coefficient is computed with respect to the distribution of $X$ and $\varepsilon$ in the underlying population for which prediction is desired while the estimated model $\hat{\mathcal{M}}$ and its associated parameter estimates ($\hat\beta_m$ and $\gamma_m$, $m = 1, 2, \ldots, M$) are held fixed. The only source of variation of $R_N(\hat{\mathcal{M}})$ is due to the randomness of the original training data set based on which $\hat{\mathcal{M}}$ is built. For any fixed $N$ and $\lambda$, the expected value of $R_N(\hat{\mathcal{M}})$ can be approximated as (see Supplementary Note)

$$\mu_N(\lambda) = \frac{\sum_{m=1}^{M_1} \beta_m\, e_m(N,\lambda)\, p_m(N,\lambda)}{\sqrt{\sum_{m=1}^{M} \nu_m(N,\lambda)\, p_m(N,\lambda)}}$$

where

$$e_m(N,\lambda) = E_{N,\lambda}(\hat\beta_m \mid \gamma_m = 1), \qquad p_m(N,\lambda) = \Pr\nolimits_{N,\lambda}(\gamma_m = 1)$$

and

$$\nu_m(N,\lambda) = E_{N,\lambda}(\hat\beta_m^2 \mid \gamma_m = 1)$$

GWAS polygenic model-building algorithm. Suppose in a GWAS study, independent SNPs are included in a prediction model depending on whether the corresponding marginal trend test for association achieves a specified significance level $\alpha$ or not. Let $Z_m$ denote the association test statistic for the $m$th SNP and $C_{\alpha/2}$ denote the critical level for a two-sided test at level $\alpha$. For any SNP that achieves the required significance level, that is, $\gamma_m = 1$, its corresponding coefficient in the prediction model could be taken as $\hat\beta_m$, that is, the estimated regression coefficient from the marginal analysis of the SNP. Based on general theory developed in the Supplementary Note, we show that in the above setting the expected value of the PCC of the above polygenic model-building algorithm over different GWAS data sets of sample size $N$ can be written as

$$\mu_N(\alpha) = \frac{\sum_{m=1}^{M_1} \beta_m\, e_N(\beta_m)\, \mathrm{pow}(N,\beta_m,\alpha)}{\sqrt{\sum_{m=1}^{M_1} \nu_N(\beta_m)\, \mathrm{pow}(N,\beta_m,\alpha) + (M - M_1)\,\alpha\,\nu_N(0)}} \qquad (1)$$

where $\mathrm{pow}(N,\beta_m,\alpha)$ denotes the power of the study of size $N$ for detecting an effect size of $\beta_m$ at level $\alpha$,

$$e_N(\beta_m) = E(\hat\beta_m \mid |Z_m| > C_{\alpha/2})$$

and

$$\nu_N(\beta_m) = E(\hat\beta_m^2 \mid |Z_m| > C_{\alpha/2})$$

Based on the formula for $e_N(\beta_m)$ and $\nu_N(\beta_m)$ given in the Supplementary Note, it is easy to see that as $N \to \infty$, $e_N(\beta_m) \to \beta_m$ and $\nu_N(\beta_m) \to \beta_m^2$. Thus, it follows that as $N \to \infty$,

$$\mu_N(\alpha) \to \mu_{\max}(\alpha) = \mu_{\max} = \sqrt{\sum_{m=1}^{M_1} \beta_m^2} \qquad (2)$$

Because $\sum_{m=1}^{M_1} \beta_m^2$ is the variance of the trait owing to the total additive effects of all susceptibility SNPs, $\mu_{\max} = h_g$, where $h_g^2$ is the total heritability in narrow sense.

Evaluation of AUC statistics and other performance measures for binary disease outcomes. Previously, several reports2,43,44 have established the relationship between measures of discriminatory ability of risk models and the genetic variance explained by the true underlying polygenic score associated with a set of SNPs. To generalize such results when the polygenic score associated with a set of SNPs may be estimated with error, we assume that the true relationship between the risk of a binary disease outcome $D$ and a set of covariates $X_1, \ldots, X_M$ is given by an underlying logistic model

$$\mathrm{logit}\{\mathrm{pr}(D = 1 \mid X)\} = \alpha + \sum_{m=1}^{M_1} \beta_m X_m + \sum_{m=M_1+1}^{M} 0 \times X_m$$

We assume that a risk-prediction model is built based on a training data set of sample size $N$ using the formula $\mathrm{logit}\{\widehat{\mathrm{pr}}(D = 1 \mid X)\} = \hat\alpha + \sum_{m=1}^{M} \hat\beta_m \gamma_m X_m$, where $\gamma_m$ is an indicator of whether the variable is selected ($\gamma_m = 1$) or not ($\gamma_m = 0$) and $\hat\beta_m$ is the estimate of $\beta_m$ for selected variables. Let

$$\hat{U} = \sum_{m=1}^{M} \hat\beta_m \gamma_m X_m$$

be the estimated risk for a person with covariate profile $X$ in the underlying logistic scale. Without loss of generality, we assume each covariate $X_m$ has been standardized with respect to its mean and variance of disease free population so that $E(X_m \mid D = 0) = 0$ and $\mathrm{Var}(X_m \mid D = 0) = 1$. In the Supplementary Note, we show that the distribution of $\hat{U}$ in controls ($D = 0$) and cases ($D = 1$) for large $M$, $M_1$ and $N$ can be approximated by normal distributions as

$$\mathrm{pr}(\hat{U} \mid D = 0) \sim N(0, S_N^2) \quad \text{and} \quad \mathrm{pr}(\hat{U} \mid D = 1) \sim N(C_N, S_N^2)$$

where

$$S_N^2 = \mathrm{Var}(\hat{U} \mid D = 0) = \sum_{m=1}^{M} \hat\beta_m^2 \gamma_m$$

and

$$C_N = \mathrm{Cov}(\hat{U}, U \mid D = 0) = \sum_{m=1}^{M_1} \beta_m \hat\beta_m \gamma_m$$
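As an illustrative aside, the threshold-based polygenic model described in this section can be explored by simulation. The sketch below is not the authors' procedure (their closed-form expressions are given in the Supplementary Note). It assumes that each marginal estimate is approximately normal around its true coefficient with standard error 1/sqrt(N), which is a simplification for standardized variables, and the function name, effect sizes and sample sizes are all hypothetical. It keeps SNPs with |Z_m| > C_{α/2} and returns the ratio C_N / S_N, that is, the predictive correlation of the resulting score.

```python
import math
import random
from statistics import NormalDist


def predictive_correlation(betas, n_null, n_train, alpha, rng):
    """Simulate one GWAS of size n_train and return R_N = C_N / S_N.

    betas   : list of true standardized effects of the susceptibility SNPs
    n_null  : number of SNPs with no effect (M - M1)
    alpha   : two-sided significance level used to select SNPs
    Assumption (not from the paper): beta_hat ~ Normal(beta, 1 / n_train).
    """
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value C_{alpha/2}
    se = 1 / math.sqrt(n_train)                 # assumed standard error
    c_n = 0.0   # sum of beta_m * beta_hat_m over selected SNPs
    s2_n = 0.0  # sum of beta_hat_m ** 2 over selected SNPs
    for beta in list(betas) + [0.0] * n_null:
        beta_hat = rng.gauss(beta, se)
        if abs(beta_hat / se) > crit:           # SNP passes the threshold
            c_n += beta * beta_hat
            s2_n += beta_hat ** 2
    # If no SNP is selected the score is constant; report zero correlation.
    return c_n / math.sqrt(s2_n) if s2_n > 0 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical architecture: 200 equal-effect SNPs explaining h_g^2 = 0.4.
    betas = [math.sqrt(0.4 / 200)] * 200
    for n in (10_000, 100_000, 1_000_000):
        r = predictive_correlation(betas, 10_000, n, 5e-8, rng)
        print(n, round(r, 3))
```

By the Cauchy-Schwarz inequality the simulated value cannot exceed h_g (here sqrt(0.4)), and under these made-up settings it should move toward that bound as N grows, in line with the limiting behavior stated in this section.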

Nature Genetics

doi:10.1038/ng.2579

It is noteworthy that although the characterization of the distributions of true risk $U$ for cases and controls requires a single parameter, namely the variance of $U$ (refs. 2,43,44), the characterizations for the corresponding distributions for estimated risk $\hat{U}$ requires two parameters, namely the variance of $\hat{U}$ and its covariance with the true risk $U$. The AUC, that is, the probability that value of risk score will be greater for a randomly selected case than that of a randomly selected control, can be approximated as

$$\mathrm{AUC}_N = \mathrm{pr}(\hat{U}_1 > \hat{U}_0) = \Phi(\sqrt{0.5}\, R_N)$$

where $R_N = C_N / S_N$ is the predictive correlation measure defined earlier for continuous outcome. Similarly, using above results, other measures of discriminatory performance of models, such as proportion of cases followed (PCF)2, can be also characterized in terms of $R_N$ (Supplementary Note). In the Supplementary Note, we show that the distribution of estimated risk $\hat{U}$ for subjects conditional on both his/her own disease status, $D$, and that of a relative, $D_R$, can be approximately characterized as

$$\mathrm{pr}(\hat{U} \mid D = 0, D_R = 0) \sim N(0, S_N^2)$$
$$\mathrm{pr}(\hat{U} \mid D = 0, D_R = 1) \sim N(k_R C_N, S_N^2)$$
$$\mathrm{pr}(\hat{U} \mid D = 1, D_R = 0) \sim N(C_N, S_N^2)$$

and

$$\mathrm{pr}(\hat{U} \mid D = 1, D_R = 1) \sim N((1 + k_R) C_N, S_N^2)$$

where $k_R = 2^{-R}$ is the coefficient of relationship. Based on these distributions, we derive discriminatory ability of risk models that include both polygenic risk scores and family history.

Estimation of effect-size distribution. We extended our previous methods14,15,45 to obtain realistic estimates of effect-size distribution for all underlying susceptibility SNPs for individual traits by combining information from both known discoveries from largest GWAS and estimates of $h_g^2$ that have recently become available for most of the traits we studied. The major steps are: (i) identify the largest GWAS, termed the current study, for each of the traits and list observed susceptibility SNPs that are discovered through these studies; (ii) following the design of the discovery studies (Supplementary Table 2), compute the power to detect SNPs with given effect sizes; (iii) obtain an estimate of effect-size distribution by fitting parametric mixture-exponential distribution to observed susceptibility SNPs after accounting for statistical power for their discovery and (iv) incorporate an additional mixture component to the effect-size distribution that can allow a larger number of SNPs with very small effects so that the overall distribution can explain both estimate of heritability owing to common variants ($h_g^2$) and the number of observed discoveries and genetic variances explained in current studies. Below we describe the details for each step.

In step (i), for each trait, we identified the largest GWAS to date (Supplementary Table 2) and constructed a list of observed susceptibility SNPs that could be considered to have been detected from this study. All independent SNPs that reach genome-wide significance according to specified criteria for these studies are included in the list of known susceptibility SNPs. Some studies used multistage designs and did not follow up previously established susceptibility SNPs beyond the first stage. We included such previously established SNPs in our list if they reached the required threshold for follow-up in the first stage of the current study, on the assumption that these SNPs would have reached genome-wide significance had they been followed up like all other SNPs meeting the same criterion. For each observed susceptibility SNP, we obtained the effect size as $es = 2\psi^2 f(1 - f)$, where $\psi$ is linear or logistic regression coefficient depending on quantitative or qualitative traits and $f$ is the allele frequency. In the GWAS context, a covariate $X$ in a polygenic model is the number of risk alleles associated with a SNP and thus following the notation in the main text where a covariate $X$ is assumed to be standardized, it follows that $\beta = \psi\sqrt{2f(1 - f)}$ and $es = \beta^2$. To minimize bias from the winner's curse, we estimated effect sizes by excluding discovery-stage data whenever replication-phase data were available. Otherwise, we corrected for possible bias using statistical techniques46.

In step (ii), we evaluated power for detection for each susceptibility SNP at their observed effect sizes following the exact design of the original discovery studies (Supplementary Table 2).

In step (iii), we obtained estimate of effect-size distribution by fitting a parametric model to the effect sizes for observed susceptibility SNPs. In our previous work14,15,45, we have described nonparametric methods for estimating effect-size distribution in the range of effect sizes for observed susceptibility SNPs. In this report, we considered the use of parametric models that can be used to describe distribution of effect sizes beyond the range of known discoveries. Specifically, we used the class of mixture of exponential distributions that allows specification of effect-size distribution in a flexible, weakly parametric fashion. The model is very natural as it allows for increasingly large number of susceptibility SNPs with decreasingly smaller effects, a common pattern that is emerging from GWAS. Mathematically, we assumed that the distribution of effect sizes for all underlying susceptibility SNPs is given by

$$f(es \mid \theta) = \sum_{h=1}^{H} p_h\, g(es \mid \lambda_h)$$

where $\theta = (p_1, \ldots, p_H, \lambda_1, \ldots, \lambda_H)$, with $p_h$ being the mixture weight for the $h$th component, $h = 1, \ldots, H$ and $g(es \mid \lambda_h)$ is an exponential distribution with mean $1/\lambda_h$. Noting that the set of $K$ observed susceptibility SNPs can be viewed as a random sample from the set of all underlying susceptibility SNPs, with probability of sampling for each SNP proportional to its power for discovery, we constructed a likelihood as

$$L(\theta) = \frac{\prod_{i=1}^{K} f(es_i \mid \theta)\, \mathrm{pow}_{\mathrm{study}}(es_i \mid N, \alpha)}{\left\{\int f(es \mid \theta)\, \mathrm{pow}_{\mathrm{study}}(es \mid N, \alpha)\, d\,es\right\}^{K}}$$

where $\mathrm{pow}_{\mathrm{study}}(es_i \mid N, \alpha)$ is the power to detect a SNP with effect size $es_i$ in the current GWAS of size $N$ at a significance level of $\alpha$. We used Bayesian methods to estimate the parameters of the mixture model based on the above likelihood and non-informative priors for the parameter vectors $p = (p_1, \ldots, p_H)$ and $\lambda = (\lambda_1, \ldots, \lambda_H)$. Specifically, we assumed a discrete Dirichlet distribution for $p$ that leads to uniform prior for each of the $p_h$, $h = 1, \ldots, H$ marginally. We assumed $\lambda_h$, $h = 1, \ldots, H$ to be independently distributed each following a gamma distribution with shape and scale parameters a = 0.5 and b = 2 104, respectively. Posterior means for all parameters were obtained based on Markov chain Monte Carlo algorithms. For each trait, among several fitted mixture models with varying $H$ (up to 3), we selected the best mixture model on the basis of the deviance information criterion47. For all traits except prostate cancer (PrCA) and CAD, a two-component ($H = 2$) mixture model was the best fitted distribution. For PrCA and CAD, a single exponential distribution ($H = 1$) was adequate.

In step (iv), we incorporated an additional mixture component to the effect-size distribution estimated in step (iii) so that the overall distribution can be used to describe the effect sizes for all SNPs that contribute to $h_g^2 = \sum_{m=1}^{M_1} \beta_m^2$. We observed that if we had assumed that the parametric effect-size distribution estimated based on known loci can be extrapolated to describe the effect sizes for all susceptibility loci explaining $h_g^2$, then the expected number of discoveries and the corresponding heritabilities explained in the current GWAS will be substantially larger than those empirically observed in these studies (Supplementary Table 1). Thus, it is very likely that the true effect-size distribution for all susceptibility SNPs contributing to narrow-sense heritability is more skewed toward smaller effects. To obtain a properly calibrated effect-size distribution for all susceptibility SNPs, we thus added an additional mixture component to the fitted effect-size distribution that we estimated based on known loci. We assumed

$$f(es \mid \theta) = p_{H+1}\, g(es \mid \lambda_{H+1}) + (1 - p_{H+1}) \sum_{h=1}^{H} \hat{p}_h\, g(es \mid \hat\lambda_h)$$

where the summation in the right side corresponds to the fitted mixture model based on known loci. For any given value of $h_g^2$, we found the value of parameters $p_{H+1}$ and $\lambda_{H+1}$ for the additional component by equating the expected and observed number of discoveries and the corresponding heritability explained in the current largest GWAS by solving the equations

$$M_{\mathrm{obs}} = \sum_{m=1}^{M_1} 1(|Z_m| > C_{\alpha/2}) = M_1 \int \mathrm{pow}_{\mathrm{study}}(es \mid N, \alpha)\, f(es \mid \theta)\, d\,es \qquad (3)$$

and

$$GV_{\mathrm{obs}} = \sum_{m=1}^{M_1} \beta_m^2\, 1(|Z_m| > C_{\alpha/2}) = M_1 \int es\; \mathrm{pow}_{\mathrm{study}}(es \mid N, \alpha)\, f(es \mid \theta)\, d\,es \qquad (4)$$

where $\alpha$ is the genome-wide significance level used for discovery and $M_1$ is defined by

$$h_g^2 = \sum_{m=1}^{M_1} \beta_m^2 = M_1 \int es\, f(es \mid \theta)\, d\,es$$

We solved for $p_{H+1}$ and $\lambda_{H+1}$ by performing a grid-search within the ranges $0.01 \le p_{H+1} \le 0.99$ and $\hat\lambda_H \le \lambda_{H+1} \le 20\,\hat\lambda_H$, where the latter constraint is imposed to allow the mean of the new component to be smaller than that of the smallest component of the fitted distribution by a factor of up to 20-fold. For traits for which estimates of $h_g^2$ and associated confidence intervals were available, values of $h_g^2$ were chosen to be at their point estimates (Tables 2 and 3) or varied within the range of their confidence intervals (Figs. 2 and 3), and for each such value of $h_g^2$ a corresponding effect-size distribution was obtained by solving the above equations. For total cholesterol (TC), LDL and CAD, for which direct estimates of $h_g^2$ were not available, we varied the value of $h_g^2$ to be within 20–80% of the range of total heritability of these traits that are available from family studies. For CAD, however, the range of $h_g^2$ for which solutions could be found for the equations (3) and (4) were severely restricted. In particular, it appears that the limited number of findings (21 SNPs) from the very large existing GWAS (N = 75,000) of this trait automatically imposes major constraint on the upper bound of $h_g^2$, at least for the class of effect-size distributions we considered.

43. Wray, N.R., Yang, J., Goddard, M.E. & Visscher, P.M. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet. 6, e1000864 (2010).
44. So, H.C., Kwan, J.S., Cherny, S.S. & Sham, P.C. Risk prediction of complex diseases from family history and known susceptibility loci, with applications for cancer screening. Am. J. Hum. Genet. 88, 548–565 (2011).
45. Park, J.H., Gail, M.H., Greene, M.H. & Chatterjee, N. Potential usefulness of single nucleotide polymorphisms to identify persons at high cancer risk: an evaluation of seven common cancers. J. Clin. Oncol. 30, 2157–2162 (2012).
46. Ghosh, A., Zou, F. & Wright, F.A. Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am. J. Hum. Genet. 82, 1064–1074 (2008).
47. Spiegelhalter, D.J., Best, N.G., Carlin, B.R. & van der Linde, A. Bayesian measures of model complexity and fit. J. R. Stat. Soc. Series B Stat. Methodol. 64, 583–616 (2002).


Using population admixture to help complete maps of the human genome


Giulio Genovese1–4, Robert E Handsaker1,2,4, Heng Li1,2, Nicolas Altemose2, Amelia M Lindgren5, Kimberly Chambert1,4, Bogdan Pasaniuc6, Alkes L Price1,6, David Reich2, Cynthia C Morton1,5,7, Martin R Pollak1,3, James G Wilson8 & Steven A McCarroll1,2,4

1Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 2Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. 3Division of Nephrology, Department of Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, Massachusetts, USA. 4Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 5Department of Obstetrics, Gynecology and Reproductive Biology, Brigham and Women's Hospital, Boston, Massachusetts, USA. 6Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA. 7Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA. 8Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi, USA. Correspondence should be addressed to G.G. (giulio.genovese@gmail.com) or S.A.M. (mccarroll@genetics.med.harvard.edu).

Received 24 July 2012; accepted 31 January 2013; published online 24 February 2013; doi:10.1038/ng.2565

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

Physical maps of the human genome, including the sequence of most of its euchromatic portions1,2, are basic resources in human genetics and genomics research: they provide the framework for the analysis of sequence data, and they enable genome-scale analysis of SNPs, copy number variants (CNVs), epigenetic phenomena and gene expression. Yet, physical maps of the human genome remain incomplete. Almost 30 Mb of euchromatic genome sequence that are apparently human (observed in human whole-genome sequence data3,4, containing human ESTs5,6 and homologous to other mammalian genome sequences) are either absent from or have no assigned locations in current assemblies of the human genome7,8. These missing pieces of the reference human genome are a likely source of mistaken inference in today's analyses of genome sequence data9. Sequence reads arising from the missing pieces may be discarded as non-alignable or incorrectly assumed to arise from paralogous sequences in the known, assembled part of the human genome. Sequences missing from the reference human genome might also help answer questions in human genetics research, such as what is the source of the genetic signals that have been ascertained (but not yet fine mapped to causal variation or causal genes) by linkage, association and CNVs.

Here, we describe an approach for applying admixture mapping to localize the human genome's missing pieces at megabase-pair scales using the patterns of sequence variation that have been created by the isolation and subsequent remixture of human populations. We report the successful mapping of ~5 Mb of unplaced human euchromatic sequences, including many protein-coding genes. We find that most of these sequences are euchromatic islands within the genome's heterochromatic oceans, including centromeres and the short arms of the acrocentric chromosomes, and that they almost always consist of segmental duplications (sometimes recent, sometimes millions of years old) of sequence present elsewhere in the reference genome.

The construction of large-scale genome models (or assemblies) uses physical sequence overlaps between genomic clones10. Clones are assembled into larger scaffolds on the basis of overlapping sequences at their ends. By contrast, mapping based on statistical relationships among variants can provide information that is complementary to physical mapping, as it does not require a continuous path of sequences to be cloned and uniquely assembled. Before physical mapping was feasible, linkage among alleles was used to construct the first genetic maps of the human genome based on RFLPs11,12 and later to build and improve genetic maps based on microsatellite markers13,14.

A unique kind of long-range information, finer in resolution than linkage in families yet longer in reach than linkage disequilibrium (LD) in populations, is present in many of the world's admixed populations. Whenever human populations have been reproductively isolated for long periods of time (such as Africans and Europeans) and then remixed (such as African Americans), the genomes of the descendants are mosaics of segments that derive from ancestors from the two ancestral populations (Fig. 1a). The divergence of the sequences in the ancestral populations gives rise to sequence variation that is informative about the ancestry of each segment. Long-range admixture LD has been used to map genetic factors that segregate at different frequencies in different populations15,16 and to identify genomic sites of recombination in African Americans17,18.

We reasoned that population admixture could also be used to map the locations of unmapped human genome sequences. Provided that the sequence in a genomic missing piece is variable, that this variation was subject to genetic drift and that the extent of this drift is known in the two ancestral populations, we could infer the ancestral origin of a missing piece (whether it has been inherited from each individual's European or African ancestors) with varying levels of statistical certainty, in a large panel of admixed individuals. By comparing such ancestry profiles for the genome's missing pieces to similar determinations across the known mapped and assembled sequences that make up the majority of these individuals' genomes, each missing piece could in principle be connected to the genomic location at which it resides, even if we lack a continuous path of cloned, assembled sequence with which to make such a connection (Fig. 1b).

Specifically, we can test ancestry-informative SNPs for correlation between their genotypes and inferred local ancestry across the genome, estimated using available genome-wide genotypes19. This is different from and potentially much more powerful than detecting LD between genotypes at two SNPs, as the correlation between genotypes and local ancestry is expected to be much stronger (than that between SNPs) at genetic distances up to a few cM, and the distance between unmapped missing pieces and the nearest parts of the reference genome may be substantial. Furthermore, we estimated statistical mapping power from allele frequencies in the ancestral populations and found that it was substantial, even for admixed population samples of even a few hundred individuals (Supplementary Figs. 1–3 and Supplementary Note). Thus, admixture mapping could in principle connect sequences that are physically farther apart than the size of most genomic clones (20–180 kb) and LD blocks (15–50 kb).

Figure 1 Admixture mapping of the human genome's missing pieces. (a) Chromosomes of West African descent have recombined with chromosomes of European descent through admixture to form mosaic genomes in African Americans. (b) Localization of genomic missing pieces, including unlocalized scaffolds and cryptic segmental duplications, by admixture mapping. Wherever allele frequencies have been influenced by genetic drift in the ancestral populations, statistically significant correlation between genotype and local ancestry allows the unplaced genomic sequence to be mapped to its correct location.

RESULTS

Sources of the missing pieces
We used 3 sources of unplaced genome sequence: (i) the current reference genome (hg19), which contains 59 unplaced contigs (~5 Mb of euchromatic sequence) for which the correct location is either only known at the chromosomal level or not known at all; (ii) the HuRef genome20, assembled by random shotgun sequencing of a single individual, containing an even larger number of unplaced scaffolds (~3.5 Mb of euchromatic sequence in 28 scaffolds >100 kb in length and ~7 Mb of euchromatic sequence in 698 scaffolds >10 kb in length); and (iii) sequence from BAC and fosmid clones available from GenBank21 (Online Methods).

Mapping the human genome's missing pieces
If an ancestry-informative SNP resides on an unmapped contig, we can map the location of the contig by admixture mapping of the SNP. We (i) aligned all unmapped sequence reads from the 1000 Genomes Project22,23 to unplaced scaffolds from HuRef, (ii) identified polymorphic sites across these unplaced sequences and (iii) computed genotypes at each locus in all European (CEU) and West African (YRI) samples (Online Methods). We selected 314 ancestry-informative SNPs whose genotypes had Pearson's correlation r2 > 15% with local ancestry. We then genotyped these SNPs in a cohort of 380 African-American participants from the Jackson Heart Study24 (JHS), selecting this sample size on the basis of initial analyses of the predicted power to map each SNP as a function of the number of available genotypes (Online Methods and Supplementary Fig. 3). We successfully performed admixture mapping of 139 SNPs (Supplementary Fig. 4 and Supplementary Table 1), assigning locations for 70 previously unlocalized scaffolds (Fig. 2 and Supplementary Table 2). We never observed SNPs from the same scaffold mapping to different locations, as could be the case if the scaffold were itself misassembled. Sequences mapped by this approach comprised a total of ~4 Mb of euchromatic sequence that had not been included or mapped in hg19.
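The core idea of admixture mapping an unplaced SNP, as described above, can be sketched in a few lines. The example below is a toy illustration rather than the pipeline used in the study: the simulated cohort, the allele frequencies, the ancestry-switching model and every function name are assumptions made for the sketch. It correlates one SNP's genotypes with inferred local ancestry at each candidate locus and reports the locus with the highest squared Pearson correlation.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def map_snp(genotypes, local_ancestry):
    """Return (best locus index, r2 per locus).

    local_ancestry[j][i] is the number of European-ancestry chromosomes
    (0, 1 or 2) carried by individual i at candidate locus j.
    """
    r2 = [pearson_r2(genotypes, anc) for anc in local_ancestry]
    best = max(range(len(r2)), key=r2.__getitem__)
    return best, r2


def simulate(n=380, n_loci=40, true_locus=17, eur_frac=0.2, switch=0.3,
             f_eur=0.9, f_afr=0.05, seed=1):
    """Toy admixed cohort (all parameter values are hypothetical).

    Each haplotype's ancestry is redrawn with probability `switch` between
    adjacent loci; the SNP physically sits at `true_locus`, and its allele
    frequency depends on the ancestry of the haplotype at that locus.
    """
    rng = random.Random(seed)
    ancestry = [[0] * n for _ in range(n_loci)]
    genotypes = [0] * n
    for i in range(n):
        for _hap in range(2):
            state = rng.random() < eur_frac
            for j in range(n_loci):
                if j > 0 and rng.random() < switch:
                    state = rng.random() < eur_frac
                ancestry[j][i] += state
                if j == true_locus:
                    freq = f_eur if state else f_afr
                    genotypes[i] += rng.random() < freq
    return genotypes, ancestry


if __name__ == "__main__":
    genotypes, ancestry = simulate()
    best, r2 = map_snp(genotypes, ancestry)
    print("best locus:", best, "r2:", round(r2[best], 2))
```

In this toy setting the correlation peaks at the locus where the SNP truly resides and decays at neighboring loci as ancestry switches accumulate, which is the signal the approach relies on. Real local-ancestry tracts are much longer and are inferred with error, so the sketch says nothing about realistic power.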
Figure 2 Approximate locations of previously unplaced genome sequence scaffolds that were mapped by our approach. Contigs from hg19 are labeled with three digits and stand for GL000###, and scaffolds from HuRef are labeled with five digits and stand for SCAF_11032791#####. Scaffolds with available chromosomal assignment or FISH data are denoted in blue; other scaffolds are denoted in red. Green indicates BAC clones that we mapped through SNPs from the CARe, ICDB or HapMap data sets. No scaffold was mapped to a location incompatible with FISH data. Mappings in the pericentromeric regions of acrocentric chromosomes indicate any location either in the pericentromeric regions or the short arms.


Identifying additional cryptic missing pieces An additional set of cryptic missing pieces might be entirely missing from human genome reference sequences (might not even be described as unlocalized sequences nor present in HuRef) but exist instead as cryptic segmental duplications (or paralogs) of known genomic sequences and have been incorrectly assumed to represent the same genomic sequence as their known paralogs. We reasoned that admixture mapping could also be used to identify cryptic segmental duplications. A SNP that is annotated in the assembled part of the human genome might in fact exist on a cryptic paralogous sequence elsewhere. Therefore, the identification of SNPs that admixture map to a different genomic location than their annotated location might indicate the presence of these SNPs at another genomic location on a cryptic segmental duplication. To identify mismapped SNPs, we analyzed genome-wide SNP data from two large African-American cohorts. Among the 906,703 SNPs from the Affymetrix 6.0 array genotyped in ~7,800 individuals from the Candidate gene Association Resource (CARe) cohort 25 and the 566,714 SNPs from the Illumina HumanHap550 array genotyped in

~1,800 individuals from the Illumina iControlDB (ICDB) cohort, we identified, respectively, 121 and 15 SNPs that admixture mapped to genomic locations far from their HapMap26 annotations of physical location (Supplementary Table 3 and Supplementary Note). Approximately half of these mismapped SNPs belonged to a single region, an approximately 360-kb segmental duplication from 16q22.2 to 1q21.1 involving the HYDIN gene27–29, confirmed by FISH and previously found to give rise to false genome-wide association signals at 16q22.2 that in fact arose from true association at the Duffy locus at 1q23.2 (ref. 30) (Supplementary Tables 4 and 5). Excluding the HYDIN paralog, incorrect mapping for ~30 SNPs can be explained by known segmental duplications31–37, whereas, for the remaining ~40 mismapped SNPs, the most likely explanation is that they reside on sequence missing from the reference genome. (Of the ~30 SNPs that we simply remapped from one known segmental duplication copy to another, 10 corresponded to sites previously used as single unique nucleotides38 (SUNs) to distinguish known segmental duplications. By definition, none of the remapped SNPs with which we identified novel segmental duplications corresponded to SUNs.)
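The idea of flagging a SNP whose admixture-mapped location disagrees with its annotation can be illustrated with a minimal sketch. It is not the authors' pipeline: the correlation score, the window representation, the 10-Mb distance cutoff and all function names are assumptions made for illustration, and the data are simulated.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def admixture_map(genotypes, ancestry_by_window):
    """Return the index of the window whose local ancestry best tracks the
    SNP genotypes, together with the per-window |correlation| scores."""
    scores = [abs(pearson(genotypes, anc)) for anc in ancestry_by_window]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def is_mismapped(best, windows, annot_chrom, annot_pos, max_dist=10_000_000):
    """Flag a SNP whose best window is on another chromosome or farther than
    max_dist bp from its annotated position (the cutoff is hypothetical)."""
    chrom, midpoint = windows[best]
    return chrom != annot_chrom or abs(midpoint - annot_pos) > max_dist

# Simulated toy data: 5 windows as (chromosome, midpoint), 1,000 individuals.
rng = random.Random(0)
windows = [("16", 71_000_000), ("16", 20_000_000), ("1", 50_000_000),
           ("1", 146_000_000), ("6", 57_000_000)]
n = 1000
# Local ancestry per window: copies (0, 1 or 2) of one ancestral population.
ancestry_by_window = [[rng.randint(0, 2) for _ in range(n)] for _ in windows]

def draw_genotype(k):
    """Allele count for an individual carrying k chromosomes of ancestry A,
    with allele frequency 0.9 on ancestry A and 0.1 on the other ancestry."""
    return (sum(rng.random() < 0.9 for _ in range(k))
            + sum(rng.random() < 0.1 for _ in range(2 - k)))

true_window = 3  # the SNP physically resides in the chr. 1 window
genotypes = [draw_genotype(k) for k in ancestry_by_window[true_window]]

best, scores = admixture_map(genotypes, ancestry_by_window)
flagged = is_mismapped(best, windows, annot_chrom="16", annot_pos=71_000_000)
print(best, flagged)
```

In real data, local ancestry is correlated between neighboring windows, so the best window only bounds the location to a region rather than pinpointing it.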
Table 1 Segmental duplications localized by admixture mapping

Chr.  Position              Band      Gene           Size (kb)  Chr.′  Position′             Band′     Scaffold′            Divergence  CARe(a)  ICDB(b)
1     83598160–83955427     1p31.1    POMZP3         ~400       7      76182346–76575579     7q11.23   NA                   ~1.4%       6        1
1     206072708–206558788   1q32.1    FAM72/SRGAP2   ~240       1      14388000 1440957834   1q21.1    NA                   ~0.6%       3        0
2     37958019–38003219     2p22.2    NA             ~45        22     NA                    22q11.1   SCAF_1103279187616   ~4.0%       3        0
2     91737476–91880745     2p11.1    OTOP1          ~140       1      NA                    1q21.1    RP11-247L13          ~1.2%       2        0
2     133120083             2q21.2    NA             ~115       20     NA                    20q11.21  RP11-462H3           >2.0%       1        1
3     612223–663367         3p26.3    NA             ~50        22     NA                    22q11.1   GL000217             ~2.0%       1        0
3     75761051–75871577     3p12.3    ZNF717         >110       21     NA                    21q11.2   RP4-813B7            >5.0%       1        0
4     25709–68702           4p16.3    ZNF595         ~40        22     NA                    22q11.1   RP11-85C8            ~0.5%       1        0
4     3536207–3636136       4p16.3    FLJ35424       ~100       9      NA                    9p11.2    SCAF_1103279188214   ~3.0%       1        0
4     190470115–190684480   4q35.2    NA             ~215       21     NA                    21q11.2   GL000193             >2.0%       2        0
5     21506326–21573437     5p14.3    NA             ~65        6      58137660–58139549     6p11.2    CH17-92N24           ~1.5%       0        0
6     256518–382461         6p25.3    DUSP22         ~125       16     NA                    16p11.2   NA                   ~0.1%       0        1
6     57204729–57435462     6p11.2    PRIM2          ~230       6      NA                    6p11.2    SCAF_1103279188350   ~2.0%       0        0
6     57204729–57608453     6p11.2    PRIM2          ~400       6      NA                    6q11.1    SCAF_1103279188263   ~2.0%       0        0
6     57369236–57608453     6p11.2    PRIM2          ~240       3      NA                    3p11.1    SCAF_1103279180085   ~2.0%       3        0
6     57401565–57570618     6p11.2    PRIM2          >170       3      NA                    3p11.1    RP1-216J23           ~2.0%       3        0
6     57447574–57575919     6p11.2    PRIM2          ~130       6      NA                    6p11.2    SCAF_1103279188406   ~2.0%       0        0
12    147380–188194         12p13.33  FAM138         >40        20     62947067–62965512     20q13.33  SCAF_1103279187960   ~1.2%       1        0
13    19020001–19167977     13q11     ANKRD30BP2     ~200       21     14447204–14594419     21q11.2   NA                   ~0.8%       3        0
14    19817857–20194548     14q11.2   POTEH/POTEM    ~400       22     16085071–16459525     22q11.1   NA                   ~0.6%       8        0
16    70845287–71202573     16q22.2   HYDIN          ~360       1      146341167–146400000   1q21.1    GL000192             ~0.6%       58       8
21    10971951–11032242     21p11.1   TPTE           >60        13     NA                    13q11     RP5-1039L24          ~0.2%       1        1
21    11083847–11156072     21p11.1   BAGE           >80        13     NA                    13q11     NA                   NA          2        0

HapMap(c) and FISH(d) columns: 20 "+" entries in total; their row positions are not recoverable from this copy of the table.

Chr., chromosome; NA, not available. Genomic positions and bands are based on hg19 coordinates and on the localization of the ancestral copy of the duplication, respectively. Protein-coding gene(s) overlapping the duplication are shown. The estimated size of the duplication is given. Column titles marked with prime symbols give information on the derived copy of the duplication; the genomic scaffold containing the sequence of the derived copy is indicated. The estimated sequence divergence between the ancestral and derived copies of the duplication is given.
(a) Number of Affymetrix 6.0 SNPs remapped in the CARe data set. (b) Number of Illumina SNPs remapped in the ICDB data set. (c) Whether independent evidence of the cryptic duplication was confirmed by interchromosomal LD from HapMap genotypes. (d) Whether a FISH experiment was performed to validate the duplication.

To understand the relationships between these cryptic paralogs and unplaced scaffolds from large sequencing efforts, we cross-referenced the locations of these SNPs with alignments of unlocalized sequence from HuRef and GenBank. We identified 18 sequences >40 kb in length, each containing one or more of the mismapped SNPs. Twelve of these 18 regions (spanning ~1.3 Mb of euchromatic sequence) could not be explained by segmental duplications already annotated in the reference genome; these indicate the presence of cryptic segmental duplications. To critically evaluate these findings by an independent method, we used the principle that cryptic segmental duplications should give rise, for SNPs called from sequencing data, to excess heterozygosity that does not follow simple models of Hardy-Weinberg equilibrium

between pairs of alleles. We searched for such a signal, annotated SNPs that behave more like paralogous sequence variants (PSVs), in data from the 1000 Genomes Project pilot and confirmed all of these regions (Online Methods and Supplementary Table 6). For 8 of the 12 cryptic segmental duplications, we could find no mention in the literature. We further confirmed six of them by interchromosomal LD analysis using HapMap genotypes (Table 1). We determined for each region whether the alternate allele of any of the mismapped SNPs was present in any of the BAC clones aligning to that region, by aligning sequences from BAC clones retrieved from GenBank to the hg19 reference genome. For SNPs in six of these regions, we could identify BAC clones carrying the alternate allele, suggesting that these clones harbor the sequence where these SNPs
[Figure 3a: normalized read depth along chr. 6 (kb) across PRIM2 for sample HG00155 (GBR), sample NA18541 (CHB) and the median of all samples. Regions 1, 2 and 3 are marked, together with HuRef scaffolds SCAF_1103279188350, SCAF_1103279188263, SCAF_1103279180085 and SCAF_1103279188406 and BAC clones RP3-422B11, RP1-71H19, RP3-401D24 and RP11-343D24.]

Figure 3 Cryptic paralogs of the PRIM2 gene. (a) Analysis of sequencing coverage depth in data from the 1000 Genomes Project suggests the presence of three segments (blue arrows) with higher copy number. Although the copy number of each segment seems to be fixed in most genomes, at least two genomes show extra copy number gain at two of the three segments (HG00155 GBR at regions 1 and 2 and NA18541 CHB at regions 2 and 3), suggesting a model in which there are two additional copies of this locus in most human genomes, one copy containing regions 1 and 2 and another copy containing regions 2 and 3. Blue arrows indicate the regions, black arrows indicate alignment of HuRef scaffolds within these regions, and green arrows indicate the BAC clones overlapping these regions and used in the reference assembly. (b) FISH analysis of PRIM2 and its cryptic paralogs. Fosmid clone WI2-0569M11 overlapping PRIM2 (G248P8956G6 aligned to chr. 6: 57,417,677–57,467,167) hybridized to two distinct locations in the pericentromeric region of chromosome 6, 6p11.2 and 6q11.1, and to a third location in the pericentromeric region of chromosome 3, confirming the two additional partial copies of the PRIM2 gene missing from the reference genome.


actually reside (Table 1). For one of these regions containing the gene PRIM2, further analysis indicated an intrachromosomal duplication in the pericentromeric region of chromosome 6 and an additional interchromosomal duplication in the pericentromeric region of chromosome 3 (Supplementary Note). We confirmed the existence of


this triplication by the presence of excess sequence read depth across this region in low-coverage data from the 1000 Genomes Project (Fig. 3a and Supplementary Fig. 5) and by FISH analysis (Fig. 3b). We also observed that the copy in the reference genome is a hybrid of the two copies on chromosome 6 owing to a misassembly (Supplementary Fig. 6 and Supplementary Note).

Pericentromeric locations of the missing pieces
Although most of the 300 or so gaps8 in the reference human genome exist in interstitial regions, most of the sequence we were able to localize mapped not to interstitial gaps but to cytogenetically defined heterochromatic regions of the human genome. Among the mapped scaffolds, 57 of 70 mapped to pericentromeric regions (Fig. 2 and Supplementary Table 2). Among the remapped SNPs
Figure 4 FISH analysis confirmed the presence of cryptic segmental duplications. (a) Fosmid clone WI2-1750D05 (G248P87673B3 aligned to chr. 2: 133,062,362–133,104,847) hybridized to 19q13.31 and to the centromeric region of chromosome 20, as predicted by admixture mapping. (b) WI2-1656E10 (G248P83226C5 aligned to chr. 3: 613,680–650,737) hybridized to the centromeric/acrocentric regions of chromosomes 14 and 22, as predicted by admixture mapping. (c) WI2-0903H06 (G248P8635D3 aligned to chr. 4: 3,573,606–3,614,890) hybridized to the centromere of chromosome 9, as predicted by admixture mapping. (d) WI2-1022I06 (G248P82546E3 aligned to chr. 5: 21,531,026–21,568,722) hybridized to 6p11.2.

Figure 5 Expression of cryptic gene paralogs from pericentromeric regions of the human genome. PSVs were used to distinguish the expression of the DUSP22, PRIM2, HYDIN and MAP2K3 genes from the expression of their cryptic paralogs in RNA-seq data from diverse human tissues. PSVs in the UTRs are represented by blue text, PSVs predicted to change the protein product of the paralog are shown in red, and synonymous PSVs are shown in green. The color of each box indicates the number of RNA-seq reads that can be assigned to one paralog or the other using the PSV.

identifying cryptic segmental duplications, 40 of 70 mapped to pericentromeric regions. (In all these cases, the resolution of the mapping was limited to the pericentromeric region identified.)

We sought to confirm these pericentromeric mappings using both published and new cytogenetic data. Of the 70 scaffolds we mapped successfully, 17 were among 29 scaffolds that were previously analyzed by FISH (Supplementary Information of ref. 39 and Supplementary Table 8 of ref. 20). All 17 of these admixture mappings were consistent with one of the often multiple locations suggested by FISH (Fig. 2 and Supplementary Table 2). Although confirmatory, this result also emphasizes the discerning power of admixture mapping over techniques based on hybridization, as the latter can yield ambiguous results when clones contain segmental duplications or other kinds of repeats. We also performed additional FISH experiments to critically evaluate the mappings of five novel cryptic paralogous sequences for which no previous FISH data existed. In all (5/5) cases, FISH confirmed the presence of the additional copy in the predicted pericentromeric region (Fig. 4 and Online Methods).

A further prediction of these mappings to pericentromeric regions involves the sequence content of the respective scaffolds. If these genomic missing pieces are indeed euchromatic islands in heterochromatic oceans, then they might frequently contain heterochromatic beaches consisting of the satellite sequences associated with human centromeres. To evaluate this prediction, we measured the amount of sequence classified as heterochromatic satellite on each scaffold. The great majority of the scaffolds that admixture mapped to pericentromeric regions (50/57) contained more than 5% satellite sequence (Online Methods, Supplementary Fig. 4 and Supplementary Table 2), compared with almost none (1/13) of the scaffolds that admixture mapped to interstitial regions (P = 0.003).
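The satellite-content measurement described above can be sketched as follows. The sketch assumes that satellite annotations are available as half-open (start, end) intervals on each scaffold, for example parsed from repeat-annotation output; the function names are hypothetical, and only the 5% threshold comes from the text.

```python
def merged_length(intervals):
    """Total number of bases covered by possibly overlapping
    half-open [start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def satellite_fraction(scaffold_length, satellite_intervals):
    """Fraction of the scaffold annotated as satellite sequence."""
    return merged_length(satellite_intervals) / scaffold_length

def has_satellite_signature(scaffold_length, satellite_intervals, threshold=0.05):
    """True if more than `threshold` of the scaffold is satellite sequence."""
    return satellite_fraction(scaffold_length, satellite_intervals) > threshold

# Toy example: two overlapping annotations covering 60 of 1,000 bases.
print(satellite_fraction(1000, [(0, 30), (20, 60)]))
print(has_satellite_signature(1000, [(0, 30), (20, 60)]))
```

Merging intervals before summing avoids counting a base twice when two satellite annotations overlap.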
Another prediction of these pericentromeric mappings is that, given earlier data indicating that recombination within centromeres is likely to be heavily repressed40, scaffolds mapping to the same pericentromeric regions might show LD with one another. We identified pairs of SNPs (from distinct scaffolds) with LD not due to admixture and ~500 SNP pairs from distinct scaffolds for which both SNPs mapped to the same genomic regions (Supplementary Table 7). In no instance did these LD-based relationships among scaffolds disagree with our mappings from admixture.

To understand how the pericentromeric missing pieces relate to the known human genome, we aligned their sequences to hg19; virtually all scaffolds mapping to pericentromeric regions were found to consist of one or more segmental duplications of mapped euchromatic sequence, with 2–5% sequence divergence (Supplementary Table 2). This suggests that a large fraction of these sequences arrived at their current locations by a process of segmental duplication in primate ancestors41. Our mapping of these cryptic segmental duplications to centromeric regions is consistent with an earlier finding that most chromosome arms (35/43) have greater density of known interchromosomal duplications in the proximity of centromeres than is observed farther away from centromeres42; both results seem to reflect a tendency of interchromosomal duplications to deposit sequence at and around centromeres.
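One way to separate LD that reflects physical proximity from correlation induced by admixture is to adjust the genotype correlation for ancestry. The sketch below uses a partial correlation given a single ancestry covariate; this is an assumed stand-in, since the text does not state how LD "not due to admixture" was computed, and the toy genotypes are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def r2(x, y):
    """Unadjusted squared genotype correlation between two SNPs."""
    return pearson(x, y) ** 2

def ancestry_adjusted_r2(x, y, ancestry):
    """Squared partial correlation of genotypes x and y given ancestry."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, ancestry), pearson(y, ancestry)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        return 0.0  # a genotype fully determined by ancestry carries no extra signal
    return ((rxy - rxz * ryz) / denom) ** 2

# Toy data: both SNPs differ in frequency between ancestries but are
# independent of each other within each ancestry group.
ancestry = [0, 0, 0, 0, 1, 1, 1, 1]
snp_x = [0, 1, 0, 1, 1, 2, 1, 2]
snp_y = [0, 0, 1, 1, 1, 1, 2, 2]
print(r2(snp_x, snp_y), ancestry_adjusted_r2(snp_x, snp_y, ancestry))
```

In this toy case the unadjusted r² is 0.25, whereas the adjusted value is essentially zero, which is the behavior expected for a SNP pair correlated only through shared ancestry.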

[Figure 5 panels: DUSP22, PRIM2, HYDIN and MAP2K3. For each gene, heat maps show the number of RNA-seq reads assigned to the reference paralog and to the cryptic paralog at each PSV (labeled by rs ID) across 16 tissues: adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid and white blood cells.]
Are the missing pieces copy number variable?
Although the cryptic, pericentromeric euchromatic regions described here have not been purposefully interrogated in earlier CNV studies, they may have been indirectly interrogated via assays that targeted paralogous sequences in the known, assembled parts of the human genome. This seems the likely scenario, as almost all of the mismapped SNPs we identified from genotyping arrays (63/70, not including the HYDIN locus) fell within CNVs reported in the Database of Genomic Variants (DGV)43 (Supplementary Table 3), even though DGV CNVs together cover less than a third of the human genome. Given the sequence divergence over the identified cryptic paralogs (often greater than 2%), these additional copies are likely to have fixed in the ancestors of all humans. Identifying CNVs over these sequences at a greater rate than for the rest of the genome might therefore indicate the instability of sequences in pericentromeric regions rather than a persistent state of polymorphism of these additional copies in the human population after the duplication event. To evaluate the copy number variability of four selected paralogous region pairs, we analyzed the read depth of coverage and paralogous sequence variation using data from the 1000 Genomes Project (Online Methods). We identified common CNVs affecting the segmental duplications from the 2p22.2, 4q35.2 and DUSP22 loci (Supplementary Figs. 7–9), and we found evidence for CNVs affecting either of the PRIM2 cryptic paralogs (Fig. 3a and Supplementary Fig. 5). In each case, we could confirm, using PSVs, that the cryptic paralogs rather than the paralogs present in the reference genome accounted for the observed copy number variability (Supplementary Note), consistent with CNVs having arisen in the pericentromeric paralogs.
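A read-depth analysis of this kind can be sketched in a few lines. The sketch assumes that each sample's depth over the region of interest is normalized by its depth over a control region and that the median sample carries the typical copy number; the function names, sample labels and depth values are hypothetical toy inputs, not data from the study.

```python
from statistics import median

def copy_number_estimates(region_depth, control_depth, typical_copies=2):
    """Estimate per-sample copy number from read depth.

    region_depth / control_depth: dicts mapping sample -> mean read depth over
    the region of interest and over a control region of known copy number.
    The median normalized depth is assumed to correspond to typical_copies.
    """
    ratios = {s: region_depth[s] / control_depth[s] for s in region_depth}
    med = median(ratios.values())
    return {s: typical_copies * r / med for s, r in ratios.items()}

def samples_with_gain(estimates, typical_copies=2):
    """Samples whose rounded copy number exceeds the typical value."""
    return sorted(s for s, cn in estimates.items() if round(cn) > typical_copies)

# Toy depths for five hypothetical samples; S5 has 1.5x the usual relative depth.
region = {"S1": 30.0, "S2": 30.0, "S3": 30.0, "S4": 30.0, "S5": 45.0}
control = {"S1": 10.0, "S2": 10.0, "S3": 10.0, "S4": 10.0, "S5": 10.0}
estimates = copy_number_estimates(region, control)
print(estimates["S5"], samples_with_gain(estimates))
```

Normalizing against the median sample keeps the estimate meaningful when the reference genome under-represents the true number of copies, because every sample's depth is then inflated by the same unseen paralogs.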
Expression of protein-coding genes from pericentromeric regions
Cryptic, pericentromeric paralogs of known protein-coding genes could in principle be either pseudogenes or expressed, intact genes. To test whether cryptic paralogs of coding genes are expressed at the RNA level, we analyzed RNA sequencing (RNA-seq) data from the Human BodyMap 2.0 project. We focused on reads aligning to the DUSP22, PRIM2, HYDIN, MAP2K3 and KCNJ12 genes, all of which appear to have cryptic paralogs in pericentromeric regions (Fig. 3 and Supplementary Figs. 5, 9 and 10). To distinguish RNA arising from reference gene copies from RNA arising from the cryptic paralogs, we focused on reads covering PSVs identifiable from genomic DNA sequence (many of which were previously misannotated as SNPs); this makes it extremely likely that sequence differences observed in RNA have a genomic origin (Fig. 5 and Online Methods). We identified expressed RNA for all of the paralogs except MAP2K3 (Fig. 5). The expression of cryptic, pericentromeric gene copies showed several kinds of relationship to the expression of their paralogs. Both DUSP22 and its recently duplicated paralog were expressed and showed similar distribution across tissues. In contrast, the cryptic paralogs of PRIM2, which contain only exons 6–14 of the original transcript (Fig. 3a), gave rise to shorter transcripts that were expressed exclusively in the brain and testes (Fig. 5). For HYDIN, which is expressed in brain and several other tissues, this analysis indicated that the cryptic paralog at 1q21.1 was expressed in the brain, consistent with its earlier observation in a brain cDNA library28. For KCNJ12, we detected expression of the pericentromeric paralog KCNJ18 in testes (Supplementary Fig. 11); KCNJ18 is also expressed in skeletal muscle and is essential to muscle function44.
The tissue specificity observed for paralogous copies is also evidence that these observations are not the result of sequencing errors at putative PSV sites. These results suggest that many of these cryptic, pericentromeric gene paralogs are expressed genes and that their expression patterns can differ from those of their known paralogs.
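Assigning RNA-seq reads to one paralog or the other from the base observed at each PSV can be sketched as below. The sketch assumes ungapped alignments given as (start, sequence) pairs on the reference copy and a table of PSVs with one reference base and one paralog base per site; the representation, the function name and the toy reads are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def assign_reads(reads, psvs):
    """Count reads supporting the reference paralog or the cryptic paralog.

    reads: iterable of (start, sequence), 0-based ungapped alignments.
    psvs:  dict mapping position -> (reference_base, paralog_base).
    Reads covering no PSV are ignored; reads with conflicting PSV evidence
    are counted as 'ambiguous'.
    """
    counts = Counter()
    for start, seq in reads:
        votes = set()
        for pos, (ref_base, par_base) in psvs.items():
            offset = pos - start
            if 0 <= offset < len(seq):
                if seq[offset] == ref_base:
                    votes.add("reference")
                elif seq[offset] == par_base:
                    votes.add("paralog")
        if len(votes) == 1:
            counts[votes.pop()] += 1
        elif votes:
            counts["ambiguous"] += 1
    return counts

# Toy example with two PSVs and four reads.
psvs = {5: ("A", "G"), 12: ("C", "T")}
reads = [
    (0, "TTTTTAGG"),    # carries A at position 5: reference paralog
    (3, "CCGTT"),       # carries G at position 5: cryptic paralog
    (4, "TGTTTTTTCA"),  # G at position 5 but C at position 12: ambiguous
    (20, "AAAA"),       # covers no PSV: ignored
]
counts = assign_reads(reads, psvs)
print(dict(counts))
```

Counting per tissue with a function of this kind would yield the read tallies that a heat map such as Figure 5 summarizes.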

DISCUSSION
We have described a population-based approach for helping to assemble the rest of the euchromatic human genome, even when missing pieces are separated from known euchromatic sequence by extensive heterochromatic sequence. Because our approach uses data that are widely available or are quickly becoming so, its power will increase quickly in the coming years. We anticipate that this approach will help complete physical maps of the human genome. Analysis of ancestry-informative markers in unlocalized scaffolds can be used to map the genomic locations of these scaffolds with a physical resolution comparable to that of FISH but with unambiguous mapping to individual loci, in a highly scalable way that will become inherently more powerful as sequence data sets grow. (Many aspects of the genome assembly will continue to require other methods; for example, our approach does not determine the physical orientation of novel sequence with respect to the chromosome.) Using this approach, we mapped ~4 Mb of unplaced euchromatic sequence, most of which we found to be embedded in the heterochromatic regions of the genome. These regions are not included in the current human reference genome, and, with two exceptions, they do not overlap with any of the current patches included in the latest revision (Supplementary Table 8).

One limitation of our approach is that it relies on novel sequence having been correctly assembled and distinguished from paralogous sequence. Most sequences from HuRef unplaced scaffolds have a divergence greater than 2% from their closest paralogs; owing to limitations of shotgun sequencing assembly, paralogous segments with <2% sequence divergence are likely to be under-represented in human genome assemblies45. Unfortunately, owing to their short read lengths, current whole-genome next-generation sequencing approaches do not provide better assemblies for such regions than those obtained with capillary-based sequencing approaches46.
Nonetheless, we showed that admixture mapping of the SNPs ascertained in such regions can still allow the discovery and mapping of these cryptic paralogous sequences. Our results have several potential implications for the mapping of disease-relevant genes in humans, particularly wherever genetic signals map near pericentromeric regions, assembly gaps and segmental duplications. (i) CNVs frequently straddle or are flanked by ambiguous regions of the genome assembly. For example, deletions and duplications at 1q21.1 reported to affect ~1.5 Mb of genomic sequence associate with cardiac developmental defects47, schizophrenia48,49, mental retardation, autism, congenital anomalies50 and abnormal head size51. Fully defining the gene content of these CNVs will require interrogating the missing sequence hidden in the assembly gaps at 1q21.1. (ii) Some regions implicated in genome-wide association studies may require reanalysis in light of the results here. For example, human height associates with rs17511102 and other markers in a long noncoding RNA (lincRNA)-containing segment of 2p22.2 (ref. 52) for which we found a cryptic segmental duplication (and paralogous lincRNA) in the pericentromeric region of chromosome 22. Following up this association will require that markers throughout the region be reassigned to the correct paralogous gene copies. (iii) The SERPINB6 gene was associated with a clinical phenotype through homozygosity mapping by the identification of a homozygous region terminated by the heterozygous genotype of the rs7762811 SNP53, which our results suggest is incorrectly assigned to 6p25.3 and in fact resides at 16p11.2, leading to a slight underestimation of the correct homozygous region. (iv) The genes affected by cryptic segmental duplications may be functionally important and critical to include and explicitly model in exome sequencing studies. For example, mutations in KCNJ18,
a gene missing from the reference genome, have been shown to cause thyrotoxic hypokalemic periodic paralysis44. (v) An admixture mapping study found that African Americans with multiple sclerosis are more likely than healthy African Americans to have European ancestry around the centromere of chromosome 1 (ref. 15), a region to which our work has assigned more than a megabase of novel sequence.

We showed that CNVs are more common over cryptic paralogs missing from the reference genome, most likely owing to the physical instability of pericentromeric regions. We also showed that paralogous genes in these cryptic, pericentromeric duplications are transcribed, sometimes with patterns of expression that diverge from those of their paralogs, and therefore potentially serve unique biological functions.

The presence of duplicated regions complicates genome assemblies and SNP and CNV discovery (Supplementary Figs. 12–24). Notably, HYDIN and PRIM2 are among the most difficult genes to reconstruct using de novo assembly from short sequence reads54. PRIM2 and KCNJ12 are among the genes with the largest number of misidentified nonsynonymous SNPs55, most likely owing to the identification of PSVs as SNPs. Approximately 6% of the human genome reference is currently considered unreliable for variant discovery by the 1000 Genomes Project23, owing to a dearth or excess of read coverage or to poor alignment of sequence reads. Most of the regions we identified as harboring a cryptic segmental duplication (Table 1 and Supplementary Table 6) fall in this inaccessible part of the human genome. While waiting for a more complete version of the human genome reference, the 1000 Genomes Project now aligns sequence data to an expanded genome reference that includes additional unlocalized sequences (termed decoy sequences) to reduce false alignments in regions with cryptic segmental duplications.
These additional sequences consist mainly of sequenced clones discarded by the Human Genome Project and sequence from the HuRef assembly (~30% of decoy sequences consist of HuRef unlocalized scaffolds). Of course, the eventual goal of such projects will be the alignment of all human sequence reads to their actual physical locations.

In completing maps of the human genome, the important remaining challenges include mapping the human genome's structure at all scales, fully cataloging the genome's sequence content and appreciating how sequences are ordered and arranged along chromosomes. As the scientific community works toward a complete reference assembly of the human genome56, the analysis of genome-wide data from admixed populations will add unique value and help complete understanding of the human genome's structure and evolution.

URLs. HuRef unplaced scaffolds, ftp://ftp.tigr.org/pub/data/huref/; GenBank database, ftp://ftp.ncbi.nih.gov/genbank/; database of Genotypes and Phenotypes (dbGaP), http://www.ncbi.nlm.nih.gov/gap; Illumina iControlDB, http://www.illumina.com/science/icontroldb.ilmn; HapMap interchromosomal LD, ftp://ftp.ncbi.nlm.nih.gov/hapmap/inter_chr_ld/; Illumina Human BodyMap 2.0 data, http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611; decoy sequences, ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/; UCSC Genome Browser, http://genome.ucsc.edu/; RepeatMasker, http://www.repeatmasker.org/.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
This study was supported by grants RC1 GM091332-01 (S.A.M. and J.G.W.), R01 HG006855 (S.A.M.) and R01DK54931 (G.G. and M.R.P.) from the US National Institutes of Health and by a Smith Family Foundation Award for Excellence in Biomedical Research (S.A.M.). The Jackson Heart Study is supported and conducted in collaboration with Jackson State University (N01-HC-95170), University of Mississippi Medical Center (N01-HC-95171) and Tougaloo College (N01-HC-95172) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD), with additional support from the National Institute on Biomedical Imaging and Bioengineering (NIBIB). The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by NHLBI contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C and HHSN268201100012C). The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the NHLBI in collaboration with the University of Alabama at Birmingham (N01-HC95095 and N01-HC48047), the University of Minnesota (N01-HC48048), Northwestern University (N01-HC48049) and the Kaiser Foundation Research Institute (N01-HC48050). MESA, MESA Family and the MESA SHARe project are conducted and supported by the NHLBI in collaboration with the MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and RR-024156. Funding for MESA Family is provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258 and R01-HL-071259. MESA Air is funded by the US Environmental Protection Agency (EPA) Science to Achieve Results (STAR) Program Grant RD831697.
Funding for genotyping was provided by NHLBI contracts N02-HL-6-4278 and N01-HC-65226. This manuscript does not necessarily reflect the opinions or views of ARIC, CARDIA, JHS, MESA or the NHLBI.

AUTHOR CONTRIBUTIONS
G.G. and S.A.M. conceived the project, designed the analyses and wrote the manuscript. G.G. performed the analysis of the CARe, ICDB, JHS and BodyMap 2.0 data sets. R.E.H. performed the sequence read depth analysis of selected regions. H.L. performed the alignments of HuRef scaffolds and GenBank clones. N.A. contributed the analysis of the HuRef unplaced scaffolds. A.M.L. performed the FISH experiments. K.C. organized and contributed to the design of the Sequenom experiment. B.P., A.L.P. and D.R. provided advice for the local ancestry inference. C.C.M. participated in the interpretation of the FISH experiments. M.R.P. participated in planning discussions for the linkage analysis. J.G.W. participated in planning discussions, coordinated interactions with JHS and edited the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
3. Li, R. et al. Building the sequence map of the human pan-genome. Nat. Biotechnol. 28, 57–63 (2010).
4. Kidd, J.M. et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat. Methods 7, 365–371 (2010).
5. Kirsch, S. et al. Interchromosomal segmental duplications of the pericentromeric region on the human Y chromosome. Genome Res. 15, 195–204 (2005).
6. Lyle, R. et al. Islands of euchromatin-like sequence and expressed polymorphic sequences within the short arm of human chromosome 21. Genome Res. 17, 1690–1696 (2007).
7. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
8. Lander, E.S. Initial impact of the sequencing of the human genome. Nature 470, 187–197 (2011).
9. Pickrell, J.K., Gaffney, D.J., Gilad, Y. & Pritchard, J.K. False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions. Bioinformatics 27, 2144–2146 (2011).
10. Eichler, E.E., Clark, R.A. & She, X. An assessment of the sequence gaps: unfinished business in a finished human genome. Nat. Rev. Genet. 5, 345–354 (2004).
11. Botstein, D., White, R.L., Skolnick, M. & Davis, R.W. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32, 314–331 (1980).

npg

2013 Nature America, Inc. All rights reserved.

Nature Genetics VOLUME 45 | NUMBER 4 | APRIL 2013

413

Articles
12. Donis-Keller, H. et al. A genetic linkage map of the human genome. Cell 51, 319337 (1987). 13. Weissenbach, J. et al. A second-generation linkage map of the human genome. Nature 359, 794801 (1992). 14. Kong, A. et al. A high-resolution recombination map of the human genome. Nat. Genet. 31, 241247 (2002). 15. Reich, D. et al. A whole-genome admixture scan finds a candidate locus for multiple sclerosis susceptibility. Nat. Genet. 37, 11131118 (2005). 16. Winkler, C.A., Nelson, G.W. & Smith, M.W. Admixture mapping comes of age. Annu. Rev. Genomics Hum. Genet. 11, 6589 (2010). 17. Hinch, A.G. et al. The landscape of recombination in African Americans. Nature 476, 170175 (2011). 18. Wegmann, D. et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat. Genet. 43, 847853 (2011). 19. Seldin, M.F., Pasaniuc, B. & Price, A.L. New approaches to disease mapping in admixed populations. Nat. Rev. Genet. 12, 523528 (2011). 20. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007). 21. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. & Sayers, E.W. GenBank. Nucleic Acids Res. 39, D32D37 (2011). 22. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 10611073 (2010). 23. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 5665 (2012). 24. Taylor, H.A. Jr. et al. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study. Ethn. Dis. 15, S6-4-17 (2005). 25. Musunuru, K. et al. Candidate gene association resource (CARe): design, methods, and proof of concept. Circ. Cardiovasc. Genet. 3, 267275 (2010). 26. International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851861 (2007). 27. Martin, J. et al. The sequence and analysis of duplication-rich human chromosome 16. 
Nature 432, 988994 (2004). 28. Doggett, N.A. et al. A 360-kb interchromosomal duplication of the human HYDIN locus. Genomics 88, 762771 (2006). 29. Kim, J.I., Ju, Y.S., Kim, S., Hong, D. & Seo, J.S. Detection of HYDIN gene duplication in personal genome sequence data. Genomics Inform. 7, 159162 (2009). 30. Reiner, A.P. et al. Genome-wide association study of white blood cell count in 16,388 African Americans: the Continental Origins and Genetic Epidemiology Network (COGENT). PLoS Genet. 7, e1002108 (2011). 31. Guipponi, M. et al. Genomic structure of a copy of the human TPTE gene which encompasses 87 kb on the short arm of chromosome 21. Hum. Genet. 107, 127131 (2000). 32. Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J. & Eichler, E.E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 10051017 (2001). 33. Bailey, J.A. et al. Recent segmental duplications in the human genome. Science 297, 10031007 (2002). 34. Bailey, J.A. et al. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70, 83100 (2002). 35. Golfier, G. et al. The 200-kb segmental duplication on human chromosome 21 originates from a pericentromeric dissemination involving human chromosomes 2, 18 and 13. Gene 312, 5159 (2003). 36. Ruault, M., Ventura, M., Galtier, N., Brun, M.E. & Archidiacono, N. BAGE genes generated by juxtacentromeric reshuffling in the Hominidae lineage are under selective pressure. Genomics 81, 391399 (2003). 37. Dennis, M.Y. et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912922 (2012). 38. Sudmant, P.H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641646 (2010). 39. BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409, 953958 (2001). 40. Mahtani, M.M. & Willard, H.F. 
Physical and genetic mapping of the human X chromosome centromere: repression of recombination. Genome Res. 8, 100110 (1998). 41. Samonte, R.V. & Eichler, E.E. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet. 3, 6572 (2002). 42. She, X. et al. The structure and evolution of centromeric transition regions within the human genome. Nature 430, 857864 (2004). 43. Zhang, J., Feuk, L., Duggan, G.E., Khaja, R. & Scherer, S.W. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res. 115, 205214 (2006). 44. Ryan, D.P. et al. Mutations in potassium channel Kir2.6 cause susceptibility to thyrotoxic hypokalemic periodic paralysis. Cell 140, 8898 (2010). 45. Eichler, E.E. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17, 661669 (2001). 46. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108, 15131518 (2011). 47. Christiansen, J. et al. Chromosome 1q21.1 contiguous gene deletion is associated with congenital heart disease. Circ. Res. 94, 14291435 (2004). 48. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237241 (2008). 49. Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232236 (2008). 50. Mefford, H.C. et al. Recurrent rearrangements of chromosome 1q21.1 and variable pediatric phenotypes. N. Engl. J. Med. 359, 16851699 (2008). 51. Brunetti-Pierri, N. et al. Recurrent reciprocal 1q21.1 deletions and duplications associated with microcephaly or macrocephaly and developmental and behavioral abnormalities. Nat. Genet. 40, 14661471 (2008). 52. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832838 (2010). 53. 
Srmac, A. et al. A truncating mutation in SERPINB6 is associated with autosomalrecessive nonsyndromic sensorineural hearing loss. Am. J. Hum. Genet. 86, 797804 (2010). 54. Alkan, C., Sajjadian, S. & Eichler, E.E. Limitations of next-generation genome sequence assembly. Nat. Methods 8, 6165 (2011). 55. Ju, Y.S. et al. Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals. Nat. Genet. 43, 745752 (2011). 56. Church, D.M. et al. Modernizing reference genome assemblies. PLoS Biol. 9, e1001091 (2011).


ONLINE METHODS

Alignment of HuRef genome and GenBank BAC and fosmid clones. To align the HuRef genome and sequenced BAC and fosmid clones to the hg19 reference genome, we first downloaded all available sequence from The Institute for Genomic Research and GenBank websites (downloading scaffold-not-inchromosome.fasta files and all gbpri* files, respectively; see URLs), and we then used Burrows-Wheeler Aligner (BWA)57 (with bwa bwasw) for alignments against hg19. We identified repeats classified as satellite sequences on HuRef unplaced scaffolds using RepeatMasker (see URLs). Satellite sequence consists of large arrays of tandemly repeated units of noncoding DNA. The amount of satellite and missing sequence is reported for each unplaced scaffold (Supplementary Fig. 4 and Supplementary Table 2). To identify within these resources the presence of cryptic segmental duplications (that is, sequence missing from the current reference genome but present in a diverged, duplicated form), we aligned all available contigs from HuRef and GenBank clones against hg19 (Supplementary Table 2).

Alignment and variant calls for 1000 Genomes Project data. For genotyping from sequence reads, we selected all the CEU and YRI samples available in the 1000 Genomes Project22,23. Unmapped reads were aligned against the HuRef unplaced scaffolds using BWA58 (with bwa aln/sampe). Genotype calling in the unplaced contigs was performed using the Genome Analysis Toolkit59 (GATK) with default settings for the UnifiedGenotyper walker.

Strategy for admixture mapping. To map the location of a SNP, genotypes were first adjusted by regressing out the amount of global West African ancestry for each sample. The adjusted genotypes were then tested for correlation with local ancestry across the genome using a one-tailed Pearson's correlation test. If the correlation of the genotypes with global West African ancestry was positive, a right-tailed test was chosen; otherwise, a left-tailed test was chosen. The location corresponding to the smallest P value was then recorded for each SNP, together with the location corresponding to the smallest P value on a different chromosome. All these steps were performed using custom scripts in MATLAB (2011b, The MathWorks). One might expect that genotypes of SNPs over paralogous sequences, only one of which is polymorphic, will often be incorrect: the homozygous state for the alternate allele cannot be correctly inferred, so the called genotypes fail, among other things, to satisfy Hardy-Weinberg equilibrium. This is not always so for genotyping arrays, however, as the genotyping of SNPs is often based on a two-dimensional Gaussian mixture model over summarized probe intensities for each of the two alleles60, enabling the correct distinction of the three possible genotypes, even without modeling the presence of a cryptic paralog.

SNP selection, sample selection and Sequenom genotyping. From all detected SNPs in hg19 unplaced contigs and HuRef unplaced scaffolds, we filtered out SNPs at loci for which the number of reads with mapping quality of 0 was at least four and at least 10% of all reads covering the site. We also filtered out clusters of four SNPs within a window size of 10 bp. The rationale is that, in loci with ambiguous alignment, it is possible to call SNPs that actually belong to a paralogous region of the genome. Variants called in loci where many SNPs cluster together have a higher chance of being an artifact of misaligned reads originating from paralogous regions that are not present in the reference genome used for alignment. This filtering increases the chance that a SNP belongs to the unplaced scaffold where it is called. From the filtered list, up to seven ancestry-informative SNPs were chosen for each contig, requiring that the genotype have an estimated Pearson's correlation coefficient with the amount of local European ancestry satisfying r² > 15%. SNPs were further filtered to fit within ten Sequenom plexes, prioritizing the degree of correlation with ancestry. We selected 380 samples from JHS24, which had been genotyped with the Affymetrix 6.0 array and analyzed with HAPMIX61. To achieve the maximum possible mapping resolution, we exclusively selected samples with at least 62 detected crossovers between ancestry groups (maximum of 115). Most likely owing to the repetitiveness of the flanking sequences for which primers were designed, 86 assays failed completely; of the remainder, 53 failed the Hardy-Weinberg equilibrium test (P < 1 × 10⁻⁶), and 175 passed. Nevertheless, we could still reliably identify the locations of 139 SNPs (Pearson's correlation test P < 1 × 10⁻⁶), 106 of which had passed and 33 of which had failed the Hardy-Weinberg equilibrium test, showing that SNPs with unreliable genotypes can still be informative for mapping purposes (Supplementary Fig. 4 and Supplementary Table 1). By analyzing for each successfully mapped SNP the best correlation between the adjusted genotype and local ancestry on chromosomes other than the one where the SNP mapped, we estimated that the selected conservative P-value threshold of 1 × 10⁻⁶ gives a false discovery rate lower than 1%.

Analysis of cryptic paralogs from 1000 Genomes Project pilot data. To identify regions with an excess of PSVs suggesting the presence of large cryptic segmental duplications, we searched for SNPs across the reference genome whose probabilistic genotype from 1000 Genomes Project pilot low-pass sequencing data failed the Hardy-Weinberg equilibrium test62 (using bcftools view -c). We identified variants that failed the equilibrium test (P < 1 × 10⁻⁶) in CEU and YRI samples, grouped them together if they were <5 kb apart (using custom MATLAB scripts) and listed all resulting regions of >40 kb in size (Supplementary Table 6).

FISH. Peripheral blood mononuclear cells were stimulated with phytohemagglutinin and harvested. Metaphase spreads were prepared by standard protocols. Fosmid clones spanning the regions of interest were selected for FISH mapping using the UCSC Genome Browser (see URLs). Fosmids were labeled with either SpectrumOrange- or SpectrumGreen-conjugated dUTP using a nick translation kit (Abbott Molecular). Labeled pairs were hybridized overnight to metaphase chromosome preparations. After washes with 4× SSC/0.1% Tween, 2× SSC/0.3% Tween and phosphate-buffered detergent, chromosomes were counterstained with DAPI and analyzed by epifluorescence with a Zeiss Axioplan2 microscope and Applied Imaging CytoVision software.

Analysis of sequence read depth from 1000 Genomes Project data.
To assess the copy number variability of the missing reference segments, we used an updated version of Genome STRiP63 to analyze read depth. Normalized read depth was measured by comparing the number of DNA fragments with sequencing reads aligned to the reference genome in a given region to the expected read depth per haploid copy on the basis of (i) the total sequencing depth for each sample, (ii) the alignability of each position, based on whether it would be uniquely mapped by a perfect 36-bp read and (iii) sequencing bias due to GC content. We performed normalization for GC bias empirically, similar to the method described in ref. 38. We first identified a 588-Mb subset of the autosomal reference sequence with no known evidence of copy number variability to use as a baseline. We removed all positions within 200 bp of the annotated CNV regions listed in DGV and segmental duplications listed in the UCSC browser, repeats annotated by RepeatMasker and assembly gaps, yielding a subset that is highly likely to be copy number invariant in the majority of people. This reference subset was divided into 400-bp windows and stratified by the GC fraction within each window, and the observed read depth at each GC fraction was compared to the total read depth across all windows to yield a GC normalization curve for each sequencing library. Given a genomic locus, the estimation of diploid copy number for each sample was performed by fitting a Gaussian mixture model with sample-specific variance to the observed and expected read depth for each sample63, allowing the model to fit as many copy number classes as needed at each locus. To analyze genome regions with known paralogs in sequences not in the hg19 reference (notably, 2p22.2), we used BWA58 (with bwa aln/sampe) to realign the 1000 Genomes Project reads from the genomic region to a synthetic reference containing the original reference sequence plus the sequence for the extra paralog. 
Estimation of copy number was then carried out as described above.

Analysis of RNA sequence expression data. To compare the expression of different paralogs of the DUSP22, PRIM2, HYDIN, MAP2K3 and KCNJ12 genes, we first identified PSVs over the predicted mRNA for these genes, looking at all heterozygous loci called for 1000 Genomes Project pilot high-coverage samples NA12878 CEU, NA12891 CEU, NA12892 CEU, NA19238 YRI, NA19239 YRI and NA19240 YRI, and then determined, when possible, which allele belonged to each paralog (Supplementary Tables 9–13). Once we obtained a list of all PSVs, we counted reads from the Illumina Human BodyMap 2.0 project for each of the alleles observed at the locus using GATK59 (with default settings for the UnifiedGenotyper walker and custom scripts). To validate the findings and filter out possible artifacts, sequence reads were further manually analyzed using the Integrative Genomics Viewer64 (IGV).
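
The admixture-mapping strategy described in the Online Methods can be sketched in a few lines. The Python below is an illustrative reconstruction, not the authors' MATLAB code: the function names, the data layout (a dictionary keyed by chromosome and position) and the toy inputs are all assumptions. Because the one-tailed P value of Pearson's test decreases monotonically with the signed correlation when the sample size is fixed, the sketch ranks loci by signed r rather than computing P values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def adjust_for_global_ancestry(genotypes, global_ancestry):
    """Residuals of a simple least-squares regression of genotype on global ancestry."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_a = sum(global_ancestry) / n
    saa = sum((a - mean_a) ** 2 for a in global_ancestry)
    sag = sum((a - mean_a) * (g - mean_g) for a, g in zip(global_ancestry, genotypes))
    slope = sag / saa if saa else 0.0
    return [g - (mean_g + slope * (a - mean_a)) for g, a in zip(genotypes, global_ancestry)]


def map_snp(genotypes, global_ancestry, local_ancestry):
    """Return the best-matching locus and the best locus on a different chromosome.

    local_ancestry maps (chromosome, position) to per-sample local ancestry
    (hypothetical layout chosen for this sketch).
    """
    # The tail of the test follows the sign of the genotype/global-ancestry correlation.
    sign = 1.0 if pearson_r(genotypes, global_ancestry) > 0 else -1.0
    adjusted = adjust_for_global_ancestry(genotypes, global_ancestry)
    # Largest signed r corresponds to the smallest one-tailed P value for fixed n.
    ranked = sorted(
        ((sign * pearson_r(adjusted, ancestry), locus)
         for locus, ancestry in local_ancestry.items()),
        reverse=True,
    )
    best_locus = ranked[0][1]
    elsewhere = next((locus for _, locus in ranked if locus[0] != best_locus[0]), None)
    return best_locus, elsewhere


# Toy data for six samples (invented for illustration only).
global_ancestry = [0.6, 0.7, 0.8, 0.9, 0.8, 0.7]
genotypes = [0, 1, 2, 2, 1, 0]
local_ancestry = {
    ("chr2", 500): [1, 1, 2, 1, 1, 0],
    ("chr2", 900): [1, 2, 2, 1, 0, 0],
    ("chr5", 300): [2, 1, 0, 1, 1, 1],
    ("chr7", 40): [0, 0, 1, 1, 0, 1],
}
best, best_elsewhere = map_snp(genotypes, global_ancestry, local_ancestry)
```

Recording the best locus on a different chromosome mirrors the step the text uses to gauge the false discovery rate: a real signal should be much stronger at the mapped locus than anywhere on other chromosomes.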
57. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
58. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
59. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
60. Korn, J.M. et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 40, 1253–1260 (2008).
61. Price, A.L. et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5, e1000519 (2009).
62. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
63. Handsaker, R.E., Korn, J.M., Nemesh, J. & McCarroll, S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet. 43, 269–276 (2011).
64. Robinson, J.T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).


doi:10.1038/ng.2565

Articles

OPEN

Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution
Jeramiah J Smith1,2, Shigehiro Kuraku3,4, Carson Holt5,37, Tatjana Sauka-Spengler6,37, Ning Jiang7, Michael S Campbell5, Mark D Yandell5, Tereza Manousaki4, Axel Meyer4, Ona E Bloom8,9, Jennifer R Morgan10, Joseph D Buxbaum11–14, Ravi Sachidanandam11, Carrie Sims15, Alexander S Garruss15, Malcolm Cook15, Robb Krumlauf15,16, Leanne M Wiedemann15,17, Stacia A Sower18, Wayne A Decatur18, Jeffrey A Hall18, Chris T Amemiya2,19, Nil R Saha2, Katherine M Buckley20,21, Jonathan P Rast20,21, Sabyasachi Das22,23, Masayuki Hirano22,23, Nathanael McCurley22,23, Peng Guo22,23, Nicolas Rohner24, Clifford J Tabin24, Paul Piccinelli25, Greg Elgar25, Magali Ruffier26, Bronwen L Aken26, Stephen M J Searle26, Matthieu Muffato27, Miguel Pignatelli27, Javier Herrero27, Matthew Jones6, C Titus Brown28,29, Yu-Wen Chung-Davidson30, Kaben G Nanlohy30, Scot V Libants30, Chu-Yin Yeh30, David W McCauley31, James A Langeland32, Zeev Pancer33, Bernd Fritzsch34, Pieter J de Jong35, Baoli Zhu35,37, Lucinda L Fulton36, Brenda Theising36, Paul Flicek27, Marianne E Bronner6, Wesley C Warren36, Sandra W Clifton36,37, Richard K Wilson36 & Weiming Li30
Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ~500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms. The fossil record shows that, during the Cambrian period, there was a great elaboration in the diversity of animal body plans. This included the emergence of a species with several characteristics shared with modern vertebrates, such as a cartilaginous skeleton that encases the central nervous system (cranium and vertebral column) and provides a support structure for the branchial arches and median fins. The cartilaginous cranium of this species housed a tripartite brain, with a forebrain for regulating neuroendocrine signaling via the pituitary gland, a midbrain (including an optic tectum) for processing sensory information from paired sensory organs and a segmented hindbrain for controlling unconscious functions, such as respiration and heart rate. 
These features in adults suggest that the corresponding embryos must have already possessed uniquely vertebrate cell types such as the skeletogenic neural crest and ectodermal placodes, both defining characters of modern-day vertebrates. Subsequent diversification of this lineage gave rise to the jawed vertebrates (gnathostomes), hagfish (for which genome-scale sequence data are currently limited), lamprey and several extinct lineages (Fig. 1 and Supplementary Note).
A full list of affiliations appears at the end of the paper. Received 20 July 2012; accepted 31 January 2013; published online 24 February 2013; doi:10.1038/ng.2568


Recent progress in developmental genetics methods for the lamprey and hagfish has advanced the reconstruction of several aspects of vertebrate evolution, although the interpretation of many of these findings is contingent on an understanding of genome structure, gene content and the history of gene and genome duplication events, areas that remain largely unresolved1. Given the critical phylogenetic position of the lamprey as an outgroup to the gnathostomes (Fig. 1), comparing the lamprey genome to gnathostome genomes holds the promise of providing insights into the structure and gene content of the ancestral vertebrate genome. Questions remain about the timing and subsequent elaboration of ancient genome duplication events and about the genetic innovations that may have contributed to the evolution and development of modern vertebrate features, including jaws, myelinated nerve sheaths, an adaptive immune system and paired appendages or limbs.

RESULTS
Sequencing, assembly and annotation
Approximately 19 million sequence reads were generated from genomic DNA derived from the liver of a single wild-captured adult female sea

Ab initio searches for repetitive DNA sequences showed that the lamprey genome contained abundant repetitive elements with high sequence identity. We identified 7,752 distinct families of repetitive elements, accounting for 34.7% of the assembly (Supplementary Fig. 4, Supplementary Tables 3 and 4 and Supplementary Note). Notably, this proportion is expected to be a significant underestimate, owing to the collapsing of repetitive elements during genome assembly. The large diversity of lamprey repetitive elements and the abundance of high-identity (presumably young) repeats represent a potentially rich resource for studies of the evolution and transposition of repetitive sequences.

The location of genes was determined by combining RNA sequencing (RNA-seq) mapping and exon linkage data with gene homologies and the prediction of coding sequences, splicing signals and repetitive elements using the MAKER pipeline5 (Supplementary Table 5 and Supplementary Note). The final set of annotated protein-coding genes contained a total of 26,046 genes. This number is similar to the numbers of predicted protein-coding genes in the other vertebrate genomes reported so far.

Conserved noncoding elements (CNEs) were identified by homology to published sequences6,7. Searches identified a limited number of homologous CNEs in lamprey, 337 (5.0% of 6,670; ref. 6) and 287 (6.0% of 4,782; ref. 5), in close agreement with previous analyses8. For those lamprey CNEs that were linked to conserved homologous regions in the lamprey and gnathostome genomes, sequence identity typically extended over approximately half the length (53%) of the homologous gnathostome CNE (Supplementary Table 6 and Supplementary Note). Thus, either the lamprey lineage diverged from jawed vertebrates before most gnathostome CNE sequences became highly constrained or these CNEs have evolved much more rapidly in the lamprey genome than in jawed vertebrate genomes.
Future work on additional lamprey and hagfish genomes should ultimately distinguish between these possibilities.

Variation in nucleotide content and substitution can strongly influence intragenomic functionality and intergenomic comparative analyses. The GC content of the lamprey genome assembly was higher than that of most other vertebrate genome sequences that have been reported. Overall, 46% of the assembly was composed of GC bases, similar to the GC content of raw whole-genome sequencing reads (Supplementary Fig. 5 and Supplementary Note). Genome-wide analyses also showed patterns of intragenomic heterogeneity in GC content, similar to those of amniote species that possess isochore structures, but less variable. Moreover, the GC content of protein-coding regions (61%) was markedly higher than that of noncoding and repetitive regions. As expected, this content was highest in the third position of codons (75%) (Supplementary Fig. 6). Patterns of GC bias strongly affect codon usage and the amino-acid composition of lamprey proteins, imparting an underlying structure to lamprey coding sequences that differs substantially from those of all other sequenced vertebrate and invertebrate genomes (Fig. 2). Notably, we did not detect a significant

[Figure residue: timeline marked at 550, 250 and 65 MYA across the Precambrian, Paleozoic, Mesozoic and CZ; other labels: Lamprey, Fruitfly, Outgroups.]

Figure 1 An abridged phylogeny of the vertebrates. Shown is the timing of major radiation events within the vertebrate lineage. Extinct lineages and some extant lineages (for example, coelacanths, lungfish and hagfish) have been omitted for simplicity. Here, reptile is synonymous with sauropsid, ray-finned fish is synonymous with actinopterygian, and osteichthyan is synonymous with euteleostome. CZ, Cenozoic; MYA, million years ago.

lamprey (P. marinus) (Supplementary Note). The lamprey genome project was initiated well before the discovery that the lamprey undergoes programmed genome rearrangements during early embryogenesis, which result in the deletion of ~20% of germline DNA from somatic tissues2,3; the effects of rearrangement on the genic component of the genome are not fully understood. We used raw sequence reads to examine large-scale sequence content and the repetitive structure of the lamprey genome. These analyses indicated that the lamprey genome is highly repetitive, rich in GC bases and highly heterozygous (Supplementary Figs. 1–3 and Supplementary Note). Although these features tend to encumber the assembly of long contiguous sequences, analyses of broad-scale structure enabled the optimization of the parameters used in assembly algorithms (Supplementary Note). The current assembly was generated using Arachne4 and consisted of 0.816 Gb of sequence distributed across 25,073 contigs. Half of the assembly was in 1,219 contigs of 174 kb or longer, and the longest contig was 2.4 Mb. This assembly resolved multikilobase- to megabase-scale structure over a majority of single-copy genomic regions (Supplementary Tables 1 and 2 and Supplementary Note), permitting the annotation of repetitive elements, genes and conserved intergenic features (Supplementary Note). Detection of extensive conserved synteny with gnathostome genomes indicates that the lamprey scaffolds accurately reflect the chromosomal organization of the lamprey genome. This assembly therefore provides unparalleled resolution of the gene content and structure of this evolutionarily informative genome.
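
The statement that half of the assembly lies in 1,219 contigs of 174 kb or longer corresponds to what are commonly called the N50 contig length (174 kb) and the L50 contig count (1,219). A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions, not values from the lamprey assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50): the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches half of the total,
    and the number of contigs needed to get there."""
    total = sum(contig_lengths)
    running = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy assembly of five contigs (lengths in kb, invented for illustration).
toy_n50, toy_l50 = n50_l50([100, 80, 60, 40, 20])
```

For the toy input the total is 300 kb, and the two longest contigs (100 + 80 kb) already cover half of it, so the function reports an N50 of 80 kb reached with 2 contigs.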
Figure 2 Genome-wide deviation of lamprey coding sequence properties from patterns observed in other vertebrate and invertebrate genomes. (a) Codon usage bias. Correspondence analysis (CA) on relative synonymous codon usage (RSCU) values was performed using the nucleotide sequences of all predicted genes concatenated for individual species. (b) Amino-acid composition. Red, lamprey; gray, invertebrates; green, jawed vertebrates.

[Figure 1 graphic labels: Lamprey, Cartilaginous fish, Ray-finned fish, Amphibians, Reptiles, Mammals; nodes: Ancestral vertebrate, Ancestral gnathostome, Ancestral osteichthyan.]

[Figure 2 graphic: panels a and b plot CA axis 1 against CA axis 2 for Lamprey, S. mansoni, C. intestinalis, C. savignyi, F. rubripes, N. vectensis, Stickleback, Amphioxus, T. nigroviridis, Zebrafish, Platypus, Pig, X. tropicalis, Chicken, Dog, Human, Opossum, Mouse, Zebra finch, Sea urchin and Fruitfly.]

Figure 3 Conserved synteny and duplication in the lamprey and gnathostome (chicken) genomes. (a–d) The locations of presumptive lamprey-chicken orthologs (including duplicates) are plotted relative to their physical positions on chromosomes and scaffolds and are connected by colored lines. (a,b) Pairs of chicken chromosomes that correspond to a series of lamprey scaffolds. (a) Ten lamprey loci are present as duplicate copies in the chicken genome, and 59 are present as single copies. (b) Twelve lamprey loci are present as duplicate copies in the chicken genome, and 54 are present as single copies. (c,d) Pairs of lamprey scaffolds that correspond to individual chicken chromosomes. (c) Three chicken loci are present as duplicate copies on syntenic lamprey scaffolds. (d) Two chicken loci are present as duplicate copies on syntenic lamprey scaffolds. Asterisks indicate duplicates.

[Figure 3 graphic labels: chicken chromosomes GG2, GG3, GG5, GG7, GG20 and GG27; lamprey scaffolds PM9, PM90, PM229, PM468 and PM2226; asterisks mark duplicates.]

correlation between the GC content of the third position of codons and the GC content of adjacent noncoding regions (Supplementary Fig. 7). Thus, it seems that the processes that lead to the patterns of intragenomic heterogeneity in lamprey GC content differ fundamentally from those in species that possess isochore structures. This raises a question regarding the adaptive value or other biological role of the observed variation of GC content within and among genomes. To further explore the biological basis of high GC content and its intragenomic heterogeneity, we examined the relationship between the GC content of protein-coding regions and codon usage bias, amino-acid composition and the levels of gene expression. The results showed that genomic GC content strongly correlated with codon usage bias and amino-acid composition but not with the levels of gene expression (Supplementary Figs. 8–11, Supplementary Table 7 and Supplementary Note). These observations are consistent with a scenario in which high GC content results from broad-scale substitution bias rather than selection for specific GC-rich codons. As the lamprey is clearly an outlier among vertebrates, further dissection of coding GC content in the sea lamprey and other lamprey and hagfish species will help to identify the causes and consequences of the intragenomic heterogeneity of GC content in vertebrate genomes.

Duplication structure of the genome
It is generally accepted that two rounds of whole-genome duplication occurred early in the history of vertebrate evolution9. However, the timing of these defining duplication events has not been well supported by genome-wide sequence data thus far10. As the proximate outgroup to jawed vertebrates, the lamprey genome is uniquely suited for addressing several questions regarding the occurrence, timing and outcome of whole-genome duplication events.
To identify gene and genome duplication events in the ancestral vertebrate lineage, we analyzed patterns of duplication within conserved syntenic regions of the lamprey and gnathostome genomes and compared these patterns to the entire lamprey genome assembly. We estimated duplication frequencies by aligning all predicted lamprey protein-coding genes from the MAKER5 data set to the human (GRCh37, GCA_000001405.1) and chicken (Gallus_gallus-2.1, GCA_000002315.1) whole-genome assemblies. To account for the possibility that paralogs have been retained on one or both genomes, in a way that bypasses many confounding aspects of phylogenetic reconstruction (Supplementary Figs. 12–17, Supplementary Table 8 and Supplementary Note), regions were considered putative orthologs if they yielded the highest-scoring alignment between the two genomes or an alignment score (bit score) within 90% of the top-scoring alignment (Supplementary Note). Strong patterns of conserved synteny were observed between the lamprey and both the human and chicken genomes (Supplementary Figs. 18–21, Supplementary Tables 9–13 and Supplementary Note). For simplicity, we present comparisons

to the chicken genome, as this genome is known to have undergone substantially fewer interchromosomal rearrangements than have mammalian genomes11,12. Our analyses indicate that most lamprey and gnathostome genes currently do not possess two copies in their respective genomes resulting from the two rounds of whole-genome duplication (Supplementary Note), presumably owing to the frequent loss of one paralog after duplication. Accordingly, we used the lamprey genome to search for a signature of large-scale duplication that does not rely on the retention of duplicated genes but can be informed by their presence. Specifically, we searched for cases in which a single lamprey scaffold contained interdigitated homologies from two distinct regions of a gnathostome genome (Fig. 3). Such patterns are consistent with large-scale duplication followed by random loss of either paralogous copy. Nearly all lamprey scaffolds showed patterns of interdigitated conserved synteny of gnathostome orthologs (Supplementary Tables 9 and 10). Moreover, homologs from individual pairs of gnathostome chromosomes were recurrently observed in interdigitated syntenic blocks on several lamprey scaffolds. Notably, some of the individual homologous markers that contributed to these conserved syntenic blocks were mapped to duplicate positions within gnathostome genomes, being present on the two homologous gnathostome chromosomes. Although these duplicates constituted a relatively modest fraction of the conserved syntenic homologs (14.5%, Fig. 3a; 18.2%, Fig. 3b; not counting redundant copies), we interpret these as strong evidence that large-scale (whole-genome) duplication has had a major role in shaping gnathostome genome architecture. Similar duplication patterns on lamprey scaffolds also seem to support the notion that large-scale (whole-genome) duplication has had a major role in shaping lamprey genome architecture.
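As a minimal sketch of the putative-ortholog rule described above (keep the top-scoring alignment plus any alignment whose bit score is within 90% of it), the following Python fragment filters a list of alignment hits. The tuple layout, the function name `putative_orthologs`, the example gene and region labels, and the reading of "within 90%" as "at least 0.9 times the top bit score" are illustrative assumptions; this is not the authors' pipeline.

```python
from collections import defaultdict


def putative_orthologs(hits, fraction=0.9):
    """Group hits by query gene and keep near-top-scoring target regions.

    hits: iterable of (query_gene, target_region, bit_score) tuples
          (a hypothetical layout chosen for this sketch).
    fraction: a hit is kept if bit_score >= fraction * top bit score
              for that query (assumed reading of "within 90%").
    """
    by_query = defaultdict(list)
    for query, region, bits in hits:
        by_query[query].append((region, bits))

    kept = {}
    for query, scored in by_query.items():
        top = max(bits for _, bits in scored)
        kept[query] = sorted(
            region for region, bits in scored if bits >= fraction * top
        )
    return kept


# Hypothetical hits of two lamprey genes against chicken regions.
hits = [
    ("lamprey_gene_1", "GG4:1000", 200.0),
    ("lamprey_gene_1", "GG6:5000", 185.0),  # kept: 185 >= 0.9 * 200
    ("lamprey_gene_1", "GG2:7000", 120.0),  # dropped: 120 < 0.9 * 200
    ("lamprey_gene_2", "GG7:300", 95.0),    # kept: only hit, so it is the top
]
result = putative_orthologs(hits)
```

Allowing more than one target per query is what lets retained paralogs on the gnathostome side show up as duplicate positions, rather than being discarded by a strict best-hit rule.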
Figure 4 The effect of genome duplication and independent paralog loss on the evolution of lamprey-gnathostome conserved syntenic regions. (a) Conserved synteny among the GnRH2, GnRH3 and (previously proposed) GnRH4 genes in lamprey, chicken and humans, including the medaka region for GnRH3, which is absent in tetrapods. The orientation of each chromosome (chr.) and scaffold (scf.) is indicated with line arrows. A pointed box represents the orientation of each gene. Open rectangles with red Xs indicate lost GnRH loci. The presumptive ancestral state of the gene region is shown at the bottom. (b) Assembled lamprey Hox scaffolds and patterns of conserved synteny relative to human Hox clusters (human Hox clusters rather than chicken are used because all four human Hox syntenic regions are integrated into the human genome assembly). Three additional conserved syntenic genes, located adjacent to the Pm2Hox cluster, are omitted owing to space limitations (retinoic acid receptor (RAR), heterogeneous nuclear ribonucleoprotein (HNRNP) and thyroid hormone receptor (THR)). Symbols indicate representative family members of lamprey-gnathostome homology groups.

Although lamprey scaffolds do not yet provide chromosome-scale resolution, several cases were identified in which two large lamprey scaffolds contained predicted paralogs and patterns of interdigitated conserved synteny (two defining signatures of large-scale duplication; Fig. 3c,d and Supplementary Note). To further assay for patterns indicative of ancient whole-genome duplication events (for example, two rounds) within the lamprey genome, we manually examined all lamprey scaffolds that possessed ten or more gnathostome homologs. These 83 scaffolds accounted for 10% of the comparative map (10% of homology-informative genes) and possessed a duplication frequency (0.463, including redundant copies of duplicates) that was similar to that of the genome at large (0.448). Among these scaffolds, we identified 29 gene pairs that were present as duplicates on two large scaffolds and one trio that was present on three large scaffolds. For a majority of duplicates, scaffolds contained at least one additional ortholog on the chicken chromosome that harbored an ortholog of the duplicate (specifically, both scaffolds (59.3%), one scaffold (29.6%) and no scaffold (11.1%) contained an additional syntenic ortholog). On average, these scaffolds contained 2.98 additional conserved syntenic genes for each individual lamprey duplicate (including the 11.1% with no syntenic markers). These patterns are consistent with the existence of patterns of interdigitated synteny in the lamprey genome that are highly similar to those in gnathostome genomes, indicating that the most recent (two-round) whole-genome duplication event likely occurred in the common ancestral lineage of lampreys and gnathostomes.
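The notion of interdigitated conserved synteny used above can be illustrated with a toy check: given the chromosome assignments of homologs in their order along one scaffold, two chromosomes are called interdigitated here if their homologs alternate at least once (A ... B ... A). The function name, the input layout and the "three or more runs" criterion are assumptions made for this sketch; the analysis reported in the text involved manual examination and is not reproduced by it.

```python
from itertools import combinations, groupby


def interdigitated_pairs(chrom_order):
    """Return chromosome pairs whose homologs interleave along a scaffold.

    chrom_order: list of chromosome labels, one per homolog, in the order
                 the homologs occur on a single scaffold (hypothetical input).
    A pair (a, b) is reported when, after discarding all other chromosomes,
    the labels form at least three alternating runs, e.g. a, b, a.
    """
    pairs = []
    for a, b in combinations(sorted(set(chrom_order)), 2):
        restricted = [c for c in chrom_order if c in (a, b)]
        runs = [label for label, _ in groupby(restricted)]
        if len(runs) >= 3:
            pairs.append((a, b))
    return pairs


# Homologs alternate between two chicken chromosomes: interdigitated.
mixed = interdigitated_pairs(["GG4", "GG6", "GG4", "GG6"])

# Homologs form two clean blocks: not interdigitated under this criterion.
blocks = interdigitated_pairs(["GG4", "GG4", "GG6", "GG6"])
```

Two clean blocks are what a simple fusion or translocation would leave behind, whereas alternation is the pattern expected after duplication followed by loss of one or the other paralogous copy.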
Additional genome-wide analyses showed that (i) the number of ancestral loci with retained duplicates in gnathostome genomes was not significantly different from the number with retained duplicates in lamprey (lamprey = 0.271, chicken = 0.262; χ² = 2.94, P = 0.08; Supplementary Note); (ii) the frequency of shared duplications was higher than would be expected by chance (observed = 0.150, expected = 0.022; χ² = 6179, P(χ²) < 1 × 10⁻¹⁰⁰, P(Fisher's exact test) < 1 × 10⁻¹⁰⁰; Supplementary Note); (iii) a model invoking recurrent selection against small-scale duplicates across a majority of the genome

was not sufficient to explain genome-wide patterns of shared duplication (Supplementary Figs. 18–21 and Supplementary Note); and (iv) inclusion of the lamprey in phylogenetic analyses resolved gene families consistent with two rounds of whole-genome duplication (Supplementary Figs. 12–17 and Supplementary Note). Moreover, targeted analyses of Hox clusters and gonadotropin-releasing hormone (GnRH) syntenic regions showed that the loss of paralogs after duplication occurred largely independently in the lamprey and gnathostome genomes, consistent with the divergence of the two lineages shortly after the last whole-genome duplication event (Fig. 4, Supplementary Figs. 22–24, Supplementary Table 14 and Supplementary Note). Although the less parsimonious scenario involving one or two independent and ancient whole-genome duplication events in gnathostome and lamprey lineages cannot be completely ruled out, neither a gnathostome-specific genome duplication nor persistent selection to retain a subset of independent duplicates is likely to explain the subtle differences in the duplication structures of the lamprey and gnathostome genomes. It seems exceedingly unlikely that such genomic arrangements and distributions of synteny blocks would arise by chance or mechanisms other than an ancient shared whole-genome duplication event. We therefore propose that genome-wide patterns of duplication are indicative of a shared history of two rounds of genome-wide duplication before lamprey-gnathostome divergence.

Ancestral vertebrate biology

It has been suggested that many of the morphological and physiological features that characterize vertebrates evolved through the modification of preexisting regulatory regions and gene networks13.
However, we reasoned that the lamprey genome might enable us to identify genes that evolved within the ancestral vertebrate lineage and infer how these new genes might have contributed to specific innovations in ancestral vertebrates that contributed to their arguably successful evolutionary trajectory. Toward this end, we searched for lamprey genes that (i) had homologs in at least one sequenced gnathostome genome and (ii) had no identifiable invertebrate homolog in annotated sequence databases and genome project-based resources (including but not limited to invertebrate deuterostomes: sea urchin, sea limpet, acorn worm, lancelet and sea squirt). In total, this search identified 224 gene families that presumably trace their evolutionary origin to the ancestral vertebrate lineage (Supplementary Table 15 and Supplementary Note). Notably, these included many gene families whose taxonomic distribution was previously thought to be more restricted (for example, APOBEC4 was previously reported to be a tetrapod-specific gene)14. Thus, roughly 1.2–1.5% of the protein-coding
Figure 5 Enrichment of gene ontologies among vertebrate-specific gene families. Horizontal bars show the frequencies of ontology classes among vertebrate-specific gene families and in the entire set of lamprey gene models. Data are shown for all ontologies that are over-represented with P < 0.005 (Fisher's exact test). Most over-represented ontologies are related to neural development and neurohormone signaling. [Plot not reproduced. Ontology classes shown: neuropeptide signaling pathway, neuropeptide hormone activity, synaptic vesicle targeting, negative regulation of appetite, adult feeding behavior, internode region of axon, chemokine activity and compact myelin. x axis: proportion of genes (0 to 0.09). Series: vertebrate-specific families; all lamprey genes.]

landscape in the human genome (263 genes from 224 families out of ~20,000 genes) originated from new genes that emerged at the base of vertebrate evolution. Phylogenetic analyses also showed expansions and reductions of gene families within vertebrate lineages (Supplementary Table 8 and Supplementary Note). These included the specific loss of clotting-related genes in the lamprey lineage and the differential contraction and expansion of gene families related to neural function and inflammation in the lamprey versus gnathostome lineages, which reflect broad parallels in the evolution of lamprey and gnathostome immunity (Supplementary Figs. 25–30, Supplementary Tables 16–22 and Supplementary Note).

To better understand how new genes might have contributed to the evolution of the vertebrate ancestor, we collected gene ontology (functional) information for the 224 vertebrate-specific gene families (Supplementary Fig. 31 and Supplementary Note). Comparing these gene ontologies to the genome-wide distribution of lamprey ontologies showed that these vertebrate-specific gene families were significantly enriched in functions related to myelination and neuropeptide and neurohormone signaling (Fig. 5). These findings suggest that the elaboration of signaling in the vertebrate central nervous system might have been facilitated by the advent of new vertebrate genes. Ontology analyses were also consistent with the broadly held view that most genes involved in the regulation of morphogenesis are of ancient origin and are common throughout animals.

In all extant gnathostomes, myelinating oligodendrocytes wrap axons in a layer of proteins and lipids, increasing the efficiency and speed of neuronal conduction. In humans, disorders of myelination have many manifestations that range from cognitive to movement disorders.
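The ontology enrichment comparison described above (Fig. 5 names Fisher's exact test) can be sketched as a one-sided over-representation test, which for a 2 × 2 table is the upper tail of the hypergeometric distribution. The sketch below uses only the Python standard library. The function name, the one-sided formulation and all counts are assumptions chosen for illustration; the source does not state the per-ontology counts or whether a one- or two-sided test was used.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric upper-tail) P value.

    N: total number of genes in the background set
    K: background genes annotated with the ontology term
    n: genes in the subset of interest (e.g. vertebrate-specific families)
    k: subset genes annotated with the term
    Returns P(X >= k) when n genes are drawn without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so infeasible table configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts, not taken from the paper: 10 of 224 subset genes
# carry a term that annotates 100 of 10,000 background genes.
p_value = enrichment_p_value(k=10, n=224, K=100, N=10_000)
significant = p_value < 0.005  # threshold quoted in the Fig. 5 caption
```

Exact integer arithmetic is used until the final division, so the tail sum does not suffer from floating-point underflow even when the individual terms are very small.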
Notably, analysis of the lamprey genome identified the specific enrichment of genes associated with myelin formation in the central and peripheral nervous systems of jawed vertebrates (Fig. 5, Supplementary Fig. 32, Supplementary Tables 15, 23 and 24 and Supplementary Note), despite the fact that extant jawless vertebrates are thought to completely lack myelinating oligodendrocytes15. These genes include Pmp22 (encoding peripheral myelin protein 22) and Mpz (encoding myelin protein zero), as well as Plp (encoding myelin proteolipid protein), Mal (encoding myelin and lymphocyte protein) and Myt1l (encoding myelin transcription factor 1-like). Homologs of Mal and Pmp22 were reported to be present in Ciona intestinalis, an invertebrate chordate16, and putative Ciona homologs of Myt1l

and Plp1 are identifiable in Ensembl17. Unexpectedly, analysis of the lamprey genome identified three myelination-related genes that might have evolved specifically within the ancestral vertebrate lineage (Mbp (encoding myelin basic protein), Mpz and CNP (encoding 2′,3′-cyclic nucleotide 3′-phosphodiesterase); Supplementary Tables 15 and 23 and Supplementary Note). This suggests that the molecular components of myelin already existed in the vertebrate ancestor and were later recruited in the evolution of myelinating oligodendrocytes in the gnathostome lineage, perhaps through the evolution of regulatory systems18. Alternatively, oligodendrocyte-like cells might have been present in the vertebrate ancestor but were secondarily lost in the lamprey lineage, although it retained genes encoding myelin proteins. Dissecting the function of myelination-related genes in lamprey and hagfish should continue to shed light on the origin of gnathostome myelin.

By virtue of its basal phylogenetic position, the lamprey also serves as a key comparative model for understanding the evolution of the vertebrate immune system. Lamprey possess two major immune cell types that are similar to the T and B lymphocytes of gnathostomes but possess adaptive immune receptors that are unrelated to gnathostome immunoglobulins, perhaps instead reflecting the receptor of the ancestral vertebrate19,20. The lamprey genome harbors several genes that impart unique functionality to gnathostome T and B lymphocytes. Annotation of other components of the immune system showed that the reduced complexity in vertebrate innate immune receptors might have coincided with the evolution of adaptive immune receptors (Supplementary Figs. 25–30, Supplementary Tables 16–22 and Supplementary Note). Analysis of the lamprey genome assembly and end-mapped BAC clones showed that each rearranging lamprey immune receptor locus (encoding variable lymphocyte receptors, VLRs) extends for several hundred contiguous kilobases.
For example, the VLRB locus extends for at least 717 kb, with components of the
Figure 6 Absence of sequence conservation for a limb Shh enhancer in lamprey. Comparison of an intronic region in the Lmbr1 gene, focusing on the intron containing the Shh cis-regulatory element (ShARE, also known as MFCS1)22,24. Note that two genomic regions were identified in the lamprey harboring potential Lmbr1 orthologs. The lengths of this intron for individual species are listed on the right. ND, not determined. [Plot not reproduced. Reference: mouse Lmbr1 intron between exons 5 and 6, containing ShARE. Species compared: chicken, anole lizard, clawed frog, medaka, little skate, spotted ratfish and lamprey (two regions). x axis: base position in the mouse intron (kb); y axis: sequence identity (50% to 100%); right-hand column: intron length (kb).]
receptor face being drawn from regions distributed across practically the entire length of the current scaffold (Supplementary Fig. 25).

The lamprey genome also sheds light on the evolutionary events that occurred early in the evolution of the gnathostome lineage, after the lamprey-gnathostome split. Paired appendages (pelvic and pectoral fins in fish, hind- and forelimbs in tetrapods) are a major evolutionary innovation of gnathostome vertebrates, as they permitted additional forms of locomotion and behavior. The lamprey has well-developed dorsal and caudal fins but lacks paired fins. Despite different embryonic origins, the signaling pathways involved in the development and positioning of median fins were reused for paired fin development21, raising the question of whether these pathways were already present in the limbless ancestral vertebrate (Supplementary Note). During fin and limb development, Shh is required to pattern the anteroposterior axis of appendages. It has been shown that the limb-specific expression of Shh is coordinated by a long-range cis-acting enhancer. This Shh appendage-specific regulatory element (ShARE) is found in homologous positions in tetrapods, teleosts and chondrichthyans22–24. In all vertebrate species analyzed so far, this element is found in intron 5 of the Lmbr1 gene (encoding limb region 1) that lies up to 1 Mb away from the transcription start site of Shh. Notably, the presence of ShARE is correlated with the presence of paired appendages, at least within the tetrapod lineage, as snakes and caecilians seem to have lost this element secondarily25. Because of the conserved genomic position of the element in other vertebrates, we focused our analysis on the lamprey orthologs of the Lmbr1 gene. Directed analysis of intron 5 in the Lmbr1 orthologs showed that these introns were much shorter and had no similarity to ShAREs (Fig. 6 and Supplementary Fig. 33).
Searches of the entire genome assembly and raw sequence reads also did not detect any regions similar to ShARE, suggesting that this regulatory region evolved within the gnathostome lineage.

DISCUSSION

The lamprey genome provides unique insight into the origin and evolution of the vertebrate lineage. Here, we present a few examples of its use in dissecting the evolution of vertebrate genomes and aspects of ancestral vertebrate biology. As examples, we (i) provide genome-wide evidence for two whole-genome duplication events in the common ancestral lineage of lampreys and gnathostomes, (ii) identify new genes that evolved within this ancestral lineage, (iii) link vertebrate neural signaling features to the advent of new genes, (iv) uncover parallels in immune receptor evolution and (v) provide evidence that a key regulatory element in limb development evolved within the gnathostome lineage. This genomic resource holds the promise of providing insights into many other aspects of vertebrate biology, especially with continued refinements in the assembly and the capacity for direct functional analysis in lamprey26,27.

URLs. CodonW, http://codonw.sourceforge.net/; RECON, http://www.repeatmasker.org/; Repbase, http://www.girinst.org/repbase; Rebuilder, http://www.broadinstitute.org/crd/wiki/index.php/Improving_Assemblies.

Methods

Methods and any associated references are available in the online version of the paper.

Accession codes. The lamprey genome assembly has been deposited under GenBank accession AEFG01. Improved assemblies for Hox clusters have been deposited under GenBank accessions JQ706314–JQ706327. Transcript sequencing data have been deposited

under GenBank Short Read Archive accessions SRX109761.3, SRX109762.3, SRX109764.3, SRX109765.3, SRX109766.3, SRX109767.3, SRX109768.3, SRX109769.3, SRX109770.3, SRX110023.2, SRX110024.2, SRX110025.2, SRX110026.2, SRX110027.2, SRX110028.2, SRX110029.2, SRX110030.2, SRX110031.2, SRX110032.2, SRX110033.2, SRX110034.2 and SRX110035.2. Additional information is provided in Supplementary Table 5.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments

We thank the Genome Institute, Washington University School of Medicine, Production Sequencing group for all sample procurement and genome sequencing work, the Michigan State University Genomic Core for transcriptome sequencing and the US Geological Survey, Lake Huron Biological Station for providing lamprey samples for sequencing. We thank F. Antonacci and E.E. Eichler (University of Washington) for performing FISH and providing access to computational facilities, respectively. We thank M. Robinson for bioinformatic analysis of immune system-related genes and conversion of GFF files for BAC end mapping. A portion of this research was conducted at the Marine Biological Laboratory (Woods Hole, Massachusetts). We acknowledge the support of the Stowers Institute for Medical Research (SIMR) and technical support from the SIMR Molecular Biology Core, particularly K. Staehling, A. Perera and K. Delventhal for BAC screening and sequencing. We acknowledge the Center for High-Performance Computing at the University of Utah for the allocation of computational resources toward gene annotation. We recognize all the important work that could not be cited owing to space limitations. The lamprey genome project was funded by the National Human Genome Research Institute (U54HG003079 (R.K.W.)). Additional support was provided by grants from the US National Institutes of Health (R24GM83982 (W.L.)) and the Great Lakes Fisheries Commission (W.L.). Partial funding was provided by several additional sources, including grants from the US National Institutes of Health (F32GM087919 and T32HG00035 (J.J.S.); DE017911 (M.E.B.); R03NS078519 (O.E.B.); R01HG004694 (M.D.Y.); GM079492, GM090049 and RR014085 (C.T.A.); and R37HD032443 (C.J.T.)), the National Science Foundation (MCB-0719558 (C.T.A.); IOS-0849569 (S.A.S.); IBN-0208138 (L. Holland); and IOS-1126998 (M.D.Y.)), the New Hampshire Agricultural Experiment Station (Scientific Contribution Number 2471 (S.A.S.)), the Charles Evans Research Award (O.E.B., J.D.B. and J.R.M.), the Wellcome Trust (WT095908 (P.F.) and WT098051), the Canadian Institutes of Health Research (MOP74667 (J.P.R.)) and the Natural Sciences and Engineering Research Council of Canada (312221 (J.P.R.)).

AUTHOR CONTRIBUTIONS

J.J.S. developed the assembly, coordinated analyses, performed analyses of genome structure and conserved synteny, coordinated the manuscript, and wrote and edited the manuscript. S.K. contributed to analyses of GC content, assembly completeness, vertebrate-specific genes, myelin-related genes and limb development, and to preparation of the manuscript and supplements. C.H. compiled molecular data sets and developed the consortium gene annotations and annotation pipeline. T.S.-S. developed the protocol for the preparation of BACs, identified the sequenced individual, and prepared genomic DNA for sequencing and BAC library construction. N.J. performed computational identification and analysis of transposable elements. M.D.Y. and M.S.C. contributed to the development of the consortium gene annotations and the annotation pipeline. T.M. and A.M. performed analysis of vertebrate-specific gene families, codon usage bias and amino-acid composition, and contributed to the writing of the manuscript. S.D. and M.H. contributed to analysis of codon usage bias and amino-acid composition. O.E.B., J.R.M., J.D.B. and R.S. performed experiments generating neuronal transcriptomes and data, and sequence analysis related to the vertebrate central nervous system. C.S., L.M.W., A.S.G., M.C. and R.K. performed experiments and data analysis related to the identification and annotation of Hox genes, led and prepared by L.M.W. S.A.S., W.A.D. and J.A.H. performed analyses related to the evolution of neuroendocrine genes, led by S.A.S. and prepared by W.A.D.
C.T.A., N.R.S., K.M.B., J.P.R., S.D. and M.H. performed analyses related to the evolution of immune system genes, led and prepared by C.T.A., K.M.B., J.P.R. and M.H. N.R. and C.J.T. performed analyses related to the evolution and development of appendages. P.P. performed BLAST analyses of the noncoding portion of the lamprey genome, and G.E. analyzed BLAST output and wrote the corresponding sections. M.R., B.L.A. and S.M.J.S. developed the Ensembl gene set, led and prepared by M.R. M.M., M.P. and J.H. performed GeneTree and CAFE analysis for the study of whole-genome duplications at the stem of the vertebrate lineage and prepared the corresponding sections. T.S.-S., M.J., J.A.L. and D.W.M. developed the

protocol for the preparation of cDNA. N.M. and P.G. provided isolated leukocyte RNA. C.T.B. and K.G.N. performed transcriptome assemblies. W.L., Y.-W.C.-D., S.V.L., C.-Y.Y. and D.W.M. contributed to next-generation transcriptome sequencing. Z.P. provided lamprey leukocyte RNA and cDNA samples and libraries, and evaluated the first draft assembly of the genome. B.F. contributed to the development of neurodevelopment-related text. P.J.d.J. and B.Z. generated the BAC library used for genome sequencing and assembly. L.L.F., W.C.W. and S.W.C. contributed to sequencing project management. B.T. coordinated the cDNA sequencing projects. P.F. supervised the Ensembl annotation efforts. M.E.B. contributed to the conception of the sea lamprey genome project and the development of the manuscript. R.K.W. provided supervision of the genome sequencing project. W.L. provided coordination of the consortium and analysis of the assembly, and contributed to the development of the manuscript.

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

1. Shimeld, S.M. & Donoghue, P.C. Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish). Development 139, 2091–2099 (2012).
2. Smith, J.J., Baker, C., Eichler, E.E. & Amemiya, C.T. Genetic consequences of programmed genome rearrangement. Curr. Biol. 22, 1524–1529 (2012).
3. Smith, J.J., Antonacci, F., Eichler, E.E. & Amemiya, C.T. Programmed loss of millions of base pairs from a vertebrate genome. Proc. Natl. Acad. Sci. USA 106, 11212–11217 (2009).
4. Jaffe, D.B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
5. Cantarel, B.L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
6. Woolfe, A. et al. CONDOR: a database resource of developmentally associated conserved non-coding elements. BMC Dev. Biol. 7, 100 (2007).
7. Venkatesh, B. et al. Ancient noncoding elements conserved in the human genome. Science 314, 1892 (2006).
8. McEwen, G.K. et al. Early evolution of conserved regulatory sequences associated with development in vertebrates. PLoS Genet. 5, e1000762 (2009).
9. Ohno, S. Gene duplication and the uniqueness of vertebrate genomes circa 1970–1999. Semin. Cell Dev. Biol. 10, 517–522 (1999).
10. Kuraku, S., Meyer, A. & Kuratani, S. Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 26, 47–59 (2009).
11. International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716 (2004).
12. Smith, J.J. & Voss, S.R. Gene order data from a model amphibian (Ambystoma): new perspectives on vertebrate genome structure and evolution. BMC Genomics 7, 219 (2006).
13. Carroll, S.B. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134, 25–36 (2008).
14. Rogozin, I.B., Basu, M.K., Jordan, I.K., Pavlov, Y.I. & Koonin, E.V. APOBEC4, a new member of the AID/APOBEC family of polynucleotide (deoxy)cytidine deaminases predicted by computational analysis. Cell Cycle 4, 1281–1285 (2005).
15. Bullock, T.H., Moore, J.K. & Fields, R.D. Evolution of myelin sheaths: both lamprey and hagfish lack myelin. Neurosci. Lett. 48, 145–148 (1984).
16. Gould, R.M., Morrison, H.G., Gilland, E. & Campbell, R.K. Myelin tetraspan family proteins but no non-tetraspan family proteins are present in the ascidian (Ciona intestinalis) genome. Biol. Bull. 209, 49–66 (2005).
17. Flicek, P. et al. Ensembl 2011. Nucleic Acids Res. 39, D800–D806 (2011).
18. Newbern, J. & Birchmeier, C. Nrg1/ErbB signaling networks in Schwann cell development and myelination. Semin. Cell Dev. Biol. 21, 922–928 (2010).
19. Saha, N.R., Smith, J. & Amemiya, C.T. Evolution of adaptive immune recognition in jawless vertebrates. Semin. Immunol. 22, 25–33 (2010).
20. Guo, P. et al. Dual nature of the adaptive immune system in lampreys. Nature 459, 796–801 (2009).
21. Freitas, R., Zhang, G. & Cohn, M.J. Evidence that mechanisms of fin development evolved in the midline of early vertebrates. Nature 442, 1033–1037 (2006).
22. Dahn, R.D., Davis, M.C., Pappano, W.N. & Shubin, N.H. Sonic hedgehog function in chondrichthyan fins and the evolution of appendage patterning. Nature 445, 311–314 (2007).
23. Lettice, L.A. et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum. Mol. Genet. 12, 1725–1735 (2003).
24. Sagai, T., Hosoya, M., Mizushina, Y., Tamura, M. & Shiroishi, T. Elimination of a long-range cis-regulatory module causes complete loss of limb-specific Shh expression and truncation of the mouse limb. Development 132, 797–803 (2005).
25. Sagai, T. et al. Phylogenetic conservation of a limb-specific, cis-acting regulator of Sonic hedgehog (Shh). Mamm. Genome 15, 23–34 (2004).
26. Nikitina, N., Sauka-Spengler, T. & Bronner-Fraser, M. Dissecting early regulatory relationships in the lamprey neural crest gene network. Proc. Natl. Acad. Sci. USA 105, 20083–20088 (2008).
27. Nikitina, N., Bronner-Fraser, M. & Sauka-Spengler, T. The sea lamprey Petromyzon marinus: a model for evolutionary and developmental biology. in Emerging Model Organisms: A Laboratory Manual Vol. 1 (eds. Behringer, R.R., Johnson, A.D. & Krumlauf, E.E.) 405–429 (CSHL Press, Cold Spring Harbor, New York, 2009).


1Department of Biology, University of Kentucky, Lexington, Kentucky, USA. 2Benaroya Research Institute at Virginia Mason, Seattle, Washington, USA. 3Genome Resource and Analysis Unit, Center for Developmental Biology, RIKEN, Kobe, Japan. 4Department of Biology, University of Konstanz, Konstanz, Germany. 5Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, USA. 6Division of Biology, California Institute of Technology, Pasadena, California, USA. 7Department of Horticulture, Michigan State University, East Lansing, Michigan, USA. 8The Feinstein Institute for Medical Research, Manhasset, New York, USA. 9The Hofstra North Shore–Long Island Jewish (LIJ) School of Medicine, Hempstead, New York, USA. 10Marine Biological Laboratory, Woods Hole, Massachusetts, USA. 11Department of Genetics and Genomics Sciences, Mount Sinai School of Medicine, New York, New York, USA. 12Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, USA. 13Department of Neuroscience, Mount Sinai School of Medicine, New York, New York, USA. 14Friedman Brain Institute, Mount Sinai School of Medicine, New York, New York, USA. 15Stowers Institute for Medical Research, Kansas City, Missouri, USA. 16Department of Anatomy & Cell Biology, The University of Kansas School of Medicine, Kansas City, Kansas, USA. 17Department of Pathology and Laboratory Medicine, University of Kansas School of Medicine, Kansas City, Kansas, USA. 18Center for Molecular and Comparative Endocrinology, University of New Hampshire, Durham, New Hampshire, USA. 19Department of Biology, University of Washington, Seattle, Washington, USA. 20Department of Immunology, University of Toronto, Sunnybrook Research Institute, Toronto, Ontario, Canada. 21Department of Medical Biophysics, University of Toronto, Sunnybrook Research Institute, Toronto, Ontario, Canada. 22Emory Vaccine Center, Emory University, Atlanta, Georgia, USA. 23Department of Pathology and Laboratory Medicine, Emory University, Atlanta, Georgia, USA. 24Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA. 25Medical Research Council (MRC) National Institute for Medical Research, London, UK. 26Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. 27European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. 28Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, USA. 29Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, USA. 30Department of Fisheries & Wildlife, Michigan State University, East Lansing, Michigan, USA. 31Department of Zoology, University of Oklahoma, Norman, Oklahoma, USA. 32Department of Biology, Kalamazoo College, Kalamazoo, Michigan, USA. 33Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland, USA. 34Department of Biology, University of Iowa, Iowa City, Iowa, USA. 35Children's Hospital Oakland, Oakland, California, USA. 36The Genome Institute, Washington University School of Medicine, St. Louis, Missouri, USA. 37Present addresses: Ontario Institute for Cancer Research, Informatics and Bio-Computing, Toronto, Ontario, Canada (C.H.), The Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK (T.S.-S.), Institute of Microbiology, Chinese Academy of Sciences, Beijing, China (B.Z.) and The Advanced Center for Genome Technology, Norman, Oklahoma, USA (S.W.C.). Correspondence should be addressed to W.L. (liweim@msu.edu) or J.J.S. (jjsmit3@uky.edu).


ONLINE METHODS

Genome sequencing. Sea lamprey DNA for whole-genome shotgun sequencing and fosmid and BAC libraries was derived from a liver dissected from a single female lamprey captured from the Great Lakes. Production of BAC library CHORI-303 was described previously28. Other libraries were cloned into bacterial vectors, arrayed individually into the wells of growth trays and sequenced as previously described11,29–31.

Preassembly analyses. Several analyses were performed before initiating the assembly to inform the selection of the assembler. Initial characterization of the repetitive content of the genome was performed by selecting a subset of 10,000 high-quality shotgun sequence reads (>500 bp at Q20) and aligning these to the complete data set of 18.5 million whole-genome shotgun sequence reads (Q20 trimmed). A complementary analysis was also performed by aligning 10,000 trimmed whole-genome shotgun sequence reads from a single human genome32 to a complete data set of 12.1 million whole-genome shotgun sequence reads (Q20 trimmed). All reads were downloaded from the NCBI Trace Archives in .scf format and processed with phred33,34 to generate base calls and quality scores. Alignments to human and lamprey whole-genome shotgun sequence data sets were performed using Megablast35. To gain insight into the potential influence of allelic polymorphism, we estimated the depth of coverage by processing Megablast35 alignments between a subset of reads and the entire whole-genome shotgun sequencing effort, as described above, but with varying thresholds for percent nucleotide identity between aligning sequences. Distributions of coverage depth were estimated using sequence identity thresholds of 90%, 95%, 97% and 99%.

Genome assembly. Assembly of the lamprey genome was performed using a total of ~19 million sequence reads with Arachne36 parameterized for the assembly of an outbred diploid genome (Supplementary Note).
After assembly by the Assemblez module, contigs corresponding to divergent haplotypes were assembled together using the Rebuilder module, parameterized with liberal settings that permitted the merger of divergent haplotypes (see URLs), and haplotypes were then joined using linkage information from end-read mapping. End-mapping information was incorporated via the ExtendHaploSupers module in a series of steps that prioritized the number of end reads supporting linkages between contigs and the source of end-mapping information (shotgun reads versus large-insert clones). Specifically, paired-end mapping information was incorporated in the following steps, where subsequent linkages could not supplant linkages that had been previously identified at a more stringent threshold: at least four paired-end linkages from large-insert clones, at least four paired-end linkages from large-insert clones or whole-genome shotgun sequence clones, three paired-end linkages from large-insert clones, three paired-end linkages from large-insert clones or whole-genome shotgun sequence clones, two paired-end linkages from large-insert clones, two paired-end linkages from large-insert clones or whole-genome shotgun sequence clones, a single paired-end linkage from a large-insert clone and, finally, a single paired-end linkage from a whole-genome shotgun sequence clone.

Characterization of repetitive sequences. Repetitive sequences were collected with RECON (v1.06; see URLs)37, with a cutoff of ten copies, and sequences were further curated to verify their identity, individuality and 5′ and 3′ boundaries. Each sequence was searched against the sea lamprey genomic sequences, and at least ten hits (BLASTN38 E < 1 × 10^-10) plus 100 bp of 3′ and 5′ flanking sequence were recovered. If a particular lamprey sequence was found to be similar to a known transposon at the nucleotide or protein level (BLASTN or BLASTX, respectively; E < 1 × 10^-5; Repbase 14.12), it was assigned to that repeat class.
Recovered sequences were then aligned using DIALIGN 2 (ref. 39), with the resulting output examined for the presence of possible boundaries between putative elements and the possible presence of target site duplications. Repeats were additionally searched for homology to known repeat classes in Repbase 14.12 (see URLs)40, using RepeatMasker and BLAST (BLASTX E < 1 × 10^-5) to identify elements similar to other known transposable elements.

Gene annotation. Annotations for the lamprey genome assembly were generated using the automated genome annotation pipeline MAKER5, which aligns and filters EST and protein homology evidence, identifies repeats, produces ab initio gene predictions, infers 5′ and 3′ UTRs and integrates these data to produce final downstream gene models along with quality control statistics. Inputs for MAKER included the P. marinus genome assembly, P. marinus ESTs, a species-specific repeat library and protein databases containing all annotated proteins for 14 metazoans (Supplementary Note) combined with the UniProt/Swiss-Prot41 protein database and all sequences for Chondrichthyes (cartilaginous fishes) and Myxinidae (hagfishes) in the NCBI protein database42,43. Ab initio gene predictions were produced inside of MAKER by the programs SNAP44 and Augustus45. MAKER was also passed P. marinus RNA-seq data processed by the programs TopHat and Cufflinks (Supplementary Note)46.

Identification of CNEs. The lamprey assembly was searched for sequences homologous to conserved noncoding sequences previously identified in comparisons between human and Fugu47 and human and Callorhinchus milii6 genomes. BLASTN (2.2.25+) was used with the word size set to 5 and with gap existence and extension penalties of 1.

Codon usage. Genome-wide assessment of codon usage bias and amino-acid composition in lamprey genes was performed using predicted coding sequences after discarding all but the longest transcript variant for each gene. To avoid any bias imparted by small sequences, sequences shorter than 300 bp were excluded from analyses of GC content, leaving a total of 18,444 coding sequences. Overall GC content and GC content at third codon positions were calculated for each protein-coding gene, and the GC content was calculated for the 10-kb fragment harboring the gene(s). To investigate the possible influence of gene expression levels on codon usage bias and amino-acid composition, we compared the GC content of 50 highly expressed and 50 lowly expressed genes on the basis of RNA-seq reads.
To analyze codon usage bias and amino-acid composition, we performed correspondence analysis (COA) on RSCU values48 and on amino-acid composition values using the software CodonW49 (see URLs). To assess the possible deviation of the sequence properties of lamprey protein-coding regions relative to other species, we downloaded genome-wide protein-coding sequences for diverse vertebrates and invertebrates from Ensembl17 and the archives for individual genome projects. Using species-by-species concatenated protein-coding sequences, we calculated RSCU values and performed a correspondence analysis.

Phylogenetic analysis of lamprey genes. A genome-wide phylogenetic analysis including 50 vertebrate genomes, 2 additional chordates and 3 outgroups was performed using the Ensembl tree reconstruction pipeline and the Ensembl Compara database, Build 64 (ref. 50). All genes were clustered with hcluster_sg51 according to their sequence similarity52. A multiple-sequence alignment was built for each cluster using MCoffee53, and TreeBeST51 was then used to reconstruct a consensus tree for each family using two maximum-likelihood and three neighbor-joining trees. The software package CAFE54 was used to study the evolution of gene families in the lamprey and the gnathostomes.

Comparative genomics. First, regions were considered putative orthologs if they yielded the highest-scoring alignment between the two genomes or an alignment score (bit score) within 90% of the top-scoring alignment (TBLASTN38; comparison of lamprey gene models to the human or chicken genomes). This convention permits some variation in the divergence rate and can be applied uniformly to the genome but may not identify some duplicates that have undergone exceedingly rapid diversification after duplication.
Second, analyses were limited to single-copy genes and duplicates that were broadly distributed throughout the genome and present at relatively low copy number by removing redundant copies of tandemly duplicated genes (lineage-specific gene amplifications) and homology groups that contained more than six homologs in either of the two species being compared in any pairwise analysis.

Hox genes. To supplement the assembly of Hox gene–containing regions, we selected a series of BACs via hybridization to a Hox2 probe designed from a known lamprey transcript (GenBank accession AY497314). Another series of BACs was selected by hybridization to Hox4 or Hox9 homeodomain probes, and these BACs were pooled and sequenced by 454 sequencing.
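
The two filtering conventions described under Comparative genomics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the tuple layout of the hits and the example scores are hypothetical, "within 90%" is read here as a bit score of at least 0.9 times the top score, and the copy-number filter is simplified to counting retained regions per query gene.

```python
from collections import defaultdict

def putative_orthologs(hits, fraction=0.9):
    """Keep, per query gene, the top-scoring region plus any region whose
    bit score is at least `fraction` of the top score.

    hits: iterable of (query_gene, target_region, bit_score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_query = defaultdict(list)
    for query, target, score in hits:
        by_query[query].append((target, score))
    kept = {}
    for query, targets in by_query.items():
        top = max(score for _, score in targets)
        kept[query] = sorted(t for t, s in targets if s >= fraction * top)
    return kept

def drop_large_groups(groups, max_homologs=6):
    """Simplified copy-number filter: discard groups with more than
    `max_homologs` retained regions."""
    return {q: t for q, t in groups.items() if len(t) <= max_homologs}

# Made-up scores for illustration only.
example_hits = [
    ("geneA", "chr1:100", 200.0),
    ("geneA", "chr7:500", 185.0),  # 185 >= 0.9 * 200, so it is retained
    ("geneA", "chr9:900", 120.0),  # below the cutoff, so it is dropped
    ("geneB", "chr2:300", 80.0),
]
orthologs = drop_large_groups(putative_orthologs(example_hits))
print(orthologs)
```

Applying the relative cutoff per query gene, rather than a fixed score, is what lets the rule tolerate variation in divergence rate across genes.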


doi:10.1038/ng.2568

Identification of vertebrate-specific genes. All P. marinus predicted peptides were aligned to peptides of all gnathostome species (Ensembl version 58; ref. 55) using BLASTP38. All gnathostome peptide sequences that showed a maximal bit score of no less than 50 were used as queries in a BLASTP search against invertebrate peptide sequences. This invertebrate database included all sequences available in GenBank and Ensembl for invertebrates, as well as all peptides predicted in the genomes of Schistosoma japonicum56, Schistosoma mansoni57 and Lottia gigantea42. All gnathostome query sequences with identifiable homologs in lamprey but not in any invertebrate were considered candidate vertebrate-specific genes. Candidates with bit scores between 50 and 60 were regarded as valid if the best hit from a reciprocal BLASTP search was the starting query sequence itself or its homolog with a bit score of no less than 50.

Immunity-related gene families. To understand the relationships among members of individual gene families, neighbor-joining trees were constructed in MEGA5 (ref. 58) using complete gap deletion.

The Shh enhancer ShARE. The genomic sequences of jawed vertebrates and the lamprey were compared with mVISTA59 using the mouse as a reference.
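
The decision rule for candidate vertebrate-specific genes can be written as a small predicate. The sketch below is an illustration only: the function name and its boolean inputs are hypothetical, the BLASTP searches themselves are assumed to have been run elsewhere, and the range "between 50 and 60" is treated as inclusive at both ends, which the text does not specify.

```python
def is_vertebrate_specific_candidate(lamprey_bit_score,
                                     has_invertebrate_homolog,
                                     reciprocal_best_hit_ok):
    """Apply the bit-score rules described in the text.

    lamprey_bit_score: best bit score between a gnathostome peptide and any
        lamprey peptide.
    has_invertebrate_homolog: True if the invertebrate search found an
        identifiable homolog (the criterion for this is not given here).
    reciprocal_best_hit_ok: True if the reciprocal BLASTP best hit is the
        starting query or its homolog with a bit score of at least 50.
    """
    if lamprey_bit_score < 50:
        return False  # no identifiable lamprey homolog
    if has_invertebrate_homolog:
        return False  # not restricted to vertebrates
    if lamprey_bit_score <= 60:
        return reciprocal_best_hit_ok  # borderline scores need confirmation
    return True

print(is_vertebrate_specific_candidate(55, False, True))
```

Requiring reciprocal confirmation only in the borderline range keeps the extra search cost confined to the hits most likely to be spurious.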
28. Osoegawa, K. et al. An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8 (1998).
29. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
30. Warren, W.C. et al. The genome of a songbird. Nature 464, 757–762 (2010).
31. Warren, W.C. et al. Genome analysis of the platypus reveals unique signatures of evolution. Nature 453, 175–183 (2008).
32. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
33. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
34. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
35. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
36. Jaffe, D.B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
37. Bao, Z. & Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002).
38. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).


39. Morgenstern, B. DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res. 32, W33–W36 (2004).
40. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005).
41. UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38, D142–D148 (2010).
42. Simakov, O. et al. Insights into bilaterian evolution from three spiralian genomes. Nature 493, 526–531 (2013).
43. Pruitt, K.D., Tatusova, T., Klimke, W. & Maglott, D.R. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 37, D32–D36 (2009).
44. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
45. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
46. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
47. Kenyon, E.J., McEwen, G.K., Callaway, H. & Elgar, G. Functional analysis of conserved non-coding regions around the short stature hox gene (shox) in whole zebrafish embryos. PLoS ONE 6, e21498 (2011).
48. Sharp, P.M. & Li, W.H. The Codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).
49. Peden, J.F. Analysis of codon usage. in DNA Repair (University of Nottingham, 2000).
50. Vilella, A.J. et al. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19, 327–335 (2009).
51. Ruan, J. et al. TreeFam: 2008 Update. Nucleic Acids Res. 36, D735–D740 (2008).
52. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
53. Wallace, I.M., O'Sullivan, O., Higgins, D.G. & Notredame, C. M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 34, 1692–1699 (2006).
54. De Bie, T., Cristianini, N., Demuth, J.P. & Hahn, M.W. CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006).
55. Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res. 37, D690–D697 (2009).
56. Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium. The Schistosoma japonicum genome reveals features of host-parasite interplay. Nature 460, 345–351 (2009).
57. Berriman, M. et al. The genome of the blood fluke Schistosoma mansoni. Nature 460, 352–358 (2009).
58. Tamura, K., Dudley, J., Nei, M. & Kumar, S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596–1599 (2007).
59. Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M. & Dubchak, I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279 (2004).


letters

Identification of seven loci affecting mean telomere length and their association with disease
Interindividual variation in mean leukocyte telomere length (LTL) is associated with cancer and several age-associated diseases. We report here a genome-wide meta-analysis of 37,684 individuals with replication of selected variants in an additional 10,739 individuals. We identified seven loci, including five new loci, associated with mean LTL (P < 5 × 10^-8). Five of the loci contain candidate genes (TERC, TERT, NAF1, OBFC1 and RTEL1) that are known to be involved in telomere biology. Lead SNPs at two loci (TERC and TERT) associate with several cancers and other diseases, including idiopathic pulmonary fibrosis. Moreover, a genetic risk score analysis combining lead variants at all 7 loci in 22,233 coronary artery disease cases and 64,762 controls showed an association of the alleles associated with shorter LTL with increased risk of coronary artery disease (21% (95% confidence interval, 5–35%) per standard deviation in LTL, P = 0.014). Our findings support a causal role of telomere-length variation in some age-related diseases.

Telomeres are the protein-bound DNA repeat structures at the ends of chromosomes that are important in maintaining genomic stability1. They are critical in regulating cellular replicative capacity2. During somatic-cell replication, telomere length progressively shortens because of the inability of DNA polymerase to fully replicate the 3′ end of the DNA strand. Once a critically short telomere length is reached, the cell is triggered to enter replicative senescence, which subsequently leads to cell death1,2. Conversely, in germ cells and other stem cells that require renewal, telomere length is maintained by the enzyme telomerase, a ribonucleoprotein that contains the RNA template TERC and a reverse transcriptase TERT3. Both longer and shorter telomere length are associated with increased risk of certain cancers4,5, and reactivation of telomerase, which bypasses cellular senescence, is a common requirement for oncogenic progression6.
Therefore, telomere length is an important determinant of telomere function. Mean telomere length exhibits considerable interindividual variability and has high heritability, with estimates varying between 44% and 80% (refs. 7–9). Most of these studies have measured mean telomere length in blood leukocytes. However, there is evidence that, within an individual, mean LTL and telomere length in other tissues are highly correlated10,11. In cross-sectional population studies, mean LTL is longer in women than in men and is inversely associated with age (declining by 20–40 bp per year)9,12–14. Shorter age-adjusted and sex-adjusted mean LTL has been found to be associated with risk of several age-related diseases, including coronary artery disease (CAD)12–15, and has been advanced as a marker of biological aging16. However, the extent to which the association of shorter LTL with age-related disorders is causal in nature remains unclear. Identifying genetic variants that affect telomere length and testing their association with disease could clarify any causal role. So far, common variants at two loci on chromosome 3q26 (TERC)17–19 and chromosome 10q24.33 (OBFC1)18, which explain <1% of the variance in telomere length, have shown a replicated association with mean LTL in genome-wide association studies (GWAS).

A full list of authors and affiliations appears at the end of the paper. Received 26 June 2012; accepted 19 December 2012; published online 27 March 2013; doi:10.1038/ng.2528

To identify other genetic determinants of LTL, we conducted a large-scale GWAS meta-analysis of 37,684 individuals from 15 cohorts, followed by replication of selected variants in an additional 10,739 individuals from 6 more cohorts. Details of the studies included in the GWAS meta-analysis and in the replication phase are provided in the Supplementary Note, and key characteristics are summarized in Supplementary Table 1. All subjects were of European descent, the majority of the cohorts were population based and three of the replication cohorts were additional subjects from studies used in the meta-analysis. The genotyping platforms and the imputation method (to HapMap 2 build 36) used by each GWAS cohort are summarized in Supplementary Table 2. We measured mean LTL in each cohort using a quantitative PCR method and expressed it as a ratio of telomere repeat length to copy number of a single-copy gene (T/S ratio; Online Methods and Supplementary Note). Then we analyzed LTL, adjusted for age, sex and any study-specific covariates, for association with genotype using linear regression in each study and adjusted the results for genomic inflation control factors (Supplementary Table 2).
We performed an inverse variance weighted meta-analysis for 2,362,330 SNPs (Online Methods) with correction for the overall genomic inflation control factor (λ = 1.007; quantile-quantile plot for the meta-analysis is shown in Supplementary Fig. 1).

SNPs in seven loci exhibited association with mean LTL at genome-wide significance (P < 5 × 10^-8; Figs. 1, 2, Table 1 and Supplementary Fig. 2). The association of the lead SNP on chromosome 2p16.2 (rs11125529) was very close to the threshold for genome-wide significance, and the lead SNP in a locus on 16q23.3 (rs2967374) fell just short of this threshold (Table 1). We therefore sought replication of results for these two loci. We confirmed the association of rs11125529 but not of rs2967374 (Table 1). The combined P value from the GWAS meta-analyses and replication cohorts for rs11125529 was 7.50 × 10^-10. There was no evidence of sex-dependent effects or additional independent signals at any of these loci (Online Methods and Supplementary Tables 3, 4). Details of key genes in each locus associated with LTL and their location in relation to the lead SNP are provided in Supplementary Table 5.

Figure 1 Signal-intensity plot of genotype association with telomere length. Data are displayed as −log10(P values) against chromosomal location for the 2,362,330 SNPs that were tested. The dotted line represents a genome-wide level of significance at P = 5 × 10^-8. Loci that showed an association at this level are plotted in red. Labeled peaks include TERC, TERT, NAF1, OBFC1 and ACYP2.

The most significantly associated locus we found was the previously reported TERC locus on 3q26 (Figs. 1, 2 and Table 1)17. Four additional loci, 5p15.33 (TERT), 4q32.2 (NAF1, nuclear assembly factor 1), 10q24.33 (OBFC1, oligonucleotide/oligosaccharide-binding fold containing 1)18 and 20q13.3 (RTEL1, regulator of telomere elongation helicase 1), harbor genes that encode proteins with known function in telomere biology3,20–23. NAF1 protein is required for assembly of H/ACA box small nucleolar RNA, the RNA family to which TERC belongs20. Thus, the three most significantly associated loci (3q26, 5p15.33 and 4q32.2) harbor genes involved in the formation and activity of telomerase. We therefore examined whether the lead SNPs at these loci as well as the other identified loci associate with
leukocyte telomerase activity in available data from 208 individuals. We did not find an association of any of the variants with telomerase activity (Supplementary Table 6). However, the study only had 80% power (α of 0.05) to detect a SNP effect that explained 3.7% of the variance in telomerase activity, and therefore smaller effects are likely to have been missed in this exploratory analysis.

We also found a significant association (P = 6.90 × 10^-11) at the previously reported OBFC1 locus18. OBFC1 is a component of the telomere-binding CST complex that also contains CTC1 and TEN1 (ref. 21). In yeast, this complex binds to the single-stranded guanine overhang at the telomere and functions to promote telomere replication. RTEL1 is a DNA helicase that has been shown to have important roles in setting telomere length, telomere maintenance and DNA repair in mice22,23. However, it should be noted that the lead SNP is 94 kb from RTEL1. The remaining two loci (19p12 and 2p16.2) do not harbor obvious candidate genes related to telomere biology. The locus on 19p12 contains a cluster of genes encoding zinc-finger proteins, and the locus on 2p16.2 spans both the ACYP2 gene, which encodes a muscle-specific acylphosphatase, and TSPYL6, a gene in intron 3 of ACYP2 that has homology with nucleosome assembly factor genes. There is evidence that ACYP2 is linked to stress-induced apoptosis in rat muscle24.

Figure 2 Regional association plots for the associated loci. (a–g) For each SNP, −log10(P value) is plotted against base-pair position for each of the loci. Regional plots are shown in order of strongest association: 3q26 (a), 5p15.33 (b), 4q32.2 (c), 10q24.33 (d), 19p12 (e), 20q13.3 (f) and 2p16.2 (g). In each locus, the lead SNP is represented in purple, and the linkage disequilibrium relationship (r²) of other SNPs to this is indicated. Blue peaks represent recombination rates (cM/Mb; HapMap 2), and the RefSeq genes in each region are provided at the bottom. Lead SNPs by panel: rs10936599 (a), rs2736100 (b), rs7675998 (c), rs9420907 (d), rs8105767 (e), rs755017 (f) and rs11125529 (g).

Table 1 Results of telomere length genome-wide association meta-analysis and replication analysis

| SNP | Chr. | Position | Gene | N | Effect allele | Other allele | Effect allele frequency | Effect (β) | Standard error | P value | Explained variance (%) | Effect on LTL: equivalent age-related attrition (years)a | Effect on LTL: base pairsb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GWAS meta-analysis | | | | | | | | | | | | | |
| rs10936599 | 3 | 170974795 | TERC | 37,669 | T | C | 0.252 | −0.097 | 0.008 | 2.54 × 10^-31 | 0.36 | 3.91 | 117.3 |
| rs2736100 | 5 | 1339516 | TERT | 25,842 | A | C | 0.514 | −0.078 | 0.009 | 4.38 × 10^-19 | 0.31 | 3.14 | 94.2 |
| rs7675998 | 4 | 164227270 | NAF1 | 34,694 | A | G | 0.217 | −0.074 | 0.009 | 4.35 × 10^-16 | 0.19 | 2.99 | 89.7 |
| rs9420907 | 10 | 105666455 | OBFC1 | 37,653 | A | C | 0.865 | −0.069 | 0.010 | 6.90 × 10^-11 | 0.11 | 2.76 | 82.8 |
| rs8105767 | 19 | 22007281 | ZNF208 | 37,499 | A | G | 0.709 | −0.048 | 0.008 | 1.11 × 10^-9 | 0.09 | 1.92 | 57.6 |
| rs755017 | 20 | 61892066 | RTEL1 | 37,113 | A | G | 0.869 | −0.062 | 0.011 | 6.71 × 10^-9 | 0.09 | 2.47 | 74.1 |
| rs11125529 | 2 | 54329370 | ACYP2 | 37,653 | C | A | 0.858 | −0.056 | 0.010 | 4.48 × 10^-8 | 0.08 | 2.23 | 66.9 |
| rs2967374 | 16 | 80767362 | MPHOSPH6 | 37,437 | G | A | 0.790 | −0.045 | 0.009 | 2.70 × 10^-7 | NA | NA | NA |
| Selective replication | | | | | | | | | | | | | |
| rs11125529 | 2 | 54329370 | ACYP2 | 10,254 | C | A | 0.864 | see note | see note | 4.70 × 10^-3 | NA | NA | NA |
| rs2967374 | 16 | 80767362 | MPHOSPH6 | 9,063 | G | A | 0.790 | see note | see note | 7.80 × 10^-1 | NA | NA | NA |

Note: for the two replication rows, the source lists the values 0.053, 0.070, 0.004 and 0.031 for the effect and standard error columns; their assignment to columns cannot be recovered from the extracted layout.

N refers to the number of individuals meta-analyzed for each SNP and, for rs11125529 and rs2967374, the additional samples used in replication. The sample size for rs2736100 is smaller than for other loci as this SNP is only present on certain genotyping platforms and, because of weak LD structure in the region, cannot be imputed reliably. Effect allele indicates the allele that is associated with shorter telomere length, explaining why all the estimates are negative. NA, not applicable.
aEstimates of the per-allele effect on average age-related telomere attrition in years (based on data in Supplementary Fig. 3). bEstimates of the per-allele effect on LTL in base pairs calculated from the equivalent age-related attrition in T/S ratio.

To gain functional insight into the associated loci, we undertook various bioinformatics analyses (Online Methods). Details of the findings are provided in the Supplementary Note and in Supplementary Table 7. SNPs in high linkage disequilibrium with the lead SNP were within potential regulatory elements of TERC, NAF1 and OBFC1. However, similar SNPs were also present for other genes in some of the loci. These findings emphasize that, although strong candidate genes are located in some of the loci, at this stage we cannot overlook the potential involvement of other genes in each region.

Each of the identified loci explains a relatively small proportion of the total variance in LTL (Table 1). To put this in context, we calculated the effect of the lead SNP at each locus in terms of equivalent age-related shortening of LTL based on an estimate of age-related attrition in T/S ratio calculated across all cohorts (Supplementary Fig. 3). We saw per-allele effects using this measure equivalent to 1.9–3.9 years of age-related attrition in T/S ratio (Table 1). The quantitative PCR method we used here to measure LTL cannot be used to directly calculate the effect on LTL in base pairs. However, many prior studies that have used DNA blotting to measure LTL have shown that mean LTL attrition rate is ~30 bp per year8,12–14,25. This suggests that the per-allele effect of the different SNPs on LTL in base pairs ranges from ~57 bp to 117 bp (Table 1).
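
The conversion from years of equivalent attrition to base pairs is a single multiplication, which the short sketch below makes explicit. It assumes the approximate rate of 30 bp per year quoted in the text and uses the per-allele attrition values listed in Table 1; the variable names are illustrative only.

```python
# Approximate mean LTL attrition rate quoted in the text (bp per year).
ATTRITION_BP_PER_YEAR = 30

# Per-allele effect expressed as equivalent years of age-related attrition
# (values as listed in Table 1).
years_equivalent = {
    "rs10936599 (TERC)": 3.91,
    "rs2736100 (TERT)": 3.14,
    "rs7675998 (NAF1)": 2.99,
    "rs9420907 (OBFC1)": 2.76,
    "rs8105767 (ZNF208)": 1.92,
    "rs755017 (RTEL1)": 2.47,
    "rs11125529 (ACYP2)": 2.23,
}

# Per-allele effect in base pairs = years of attrition x bp lost per year.
bp_effect = {
    snp: round(years * ATTRITION_BP_PER_YEAR, 1)
    for snp, years in years_equivalent.items()
}

for snp, bp in bp_effect.items():
    print(f"{snp}: {bp} bp per allele")
```

Because the 30 bp per year figure is itself an approximation drawn from earlier studies, the resulting base-pair values are best read as rough magnitudes.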
As both shorter and longer mean LTL have been linked to increased risk of various diseases, we searched genetic-association databases for disease associations with the LTL-associated SNPs (Supplementary Table 8). The rs10936599 (TERC) allele associated with longer LTL associates with increased risk of colorectal cancer19 and with two autoimmune diseases, multiple sclerosis (longer LTL allele) and celiac disease (shorter LTL allele). The lead SNP for the 5p15.33 (TERT) locus is associated with different cancer types (both shorter and longer LTL alleles) and with increased risk of idiopathic
pulmonary fibrosis (shorter LTL allele), a disease that has previously been shown to be associated with shorter LTL26. One of the most widely reported associations for LTL to date has been that between shorter mean LTL and CAD1214,25. Because LTL is also affected by other risk factors for CAD such as oxidative stress2729, it has been unclear whether the association of shorter LTL with CAD is primary or secondary. To investigate whether the association could be causal, we examined the association of both individual lead SNPs and a genetic risk score based on a combination of all 7 SNPs (adjusted for their effect size) with CAD in the CARDIoGRAM GWAS meta-analysis comprising 22,233 CAD cases and 64,762 controls
Figure 3 Telomere length variants and risk of CAD. Forest plot showing the effect of telomere length on CAD risk obtained for each SNP using a risk score analysis31 for each SNP. Effect sizes are plotted with 95% confidence intervals. The overall estimate is from a fixed-effects meta-analysis over all SNPs, where the odds ratio (OR) relates to the change in CAD risk for a s.d. change in telomere length.

Gene      OR (95% CI)
ACYP2     0.71 (0.37, 1.37)
TERC      0.76 (0.54, 1.07)
NAF1      1.43 (0.89, 2.30)
TERT      0.77 (0.50, 1.17)
OBFC1     0.41 (0.22, 0.76)
ZNF208    0.67 (0.36, 1.26)
RTEL1     0.93 (0.43, 2.00)
Overall   0.79 (0.65, 0.95)

of European descent30, using the approach recently described by the ICBP Consortium31. Although the results for individual variants were not significant, 6 of 7 variants showed consistency in direction, and the combined genetic risk score analysis showed a significant association (P = 0.014) of the allele associated with shorter LTL with increased risk of CAD (Fig. 3). Shorter mean LTL equivalent to one standard deviation in LTL was associated with a 21% (95% confidence interval, 5–35%) higher risk of CAD.

Here we report five new and confirm two previously reported loci that associate with mean LTL in humans. A specific motivation for our study was the observation that variation in LTL is associated with several age-related diseases and the desire to establish whether this link is causal. This is particularly challenging to disentangle because other environmental and lifestyle factors also affect telomere length29,32–34. The most persuasive evidence for a causal role comes from in vitro and in vivo manipulation of telomerase activity, which affects telomere length and has been shown to enhance or reverse senescence and aging-associated phenotypes35–39. Here we show that some of the genetic variants associated with LTL are also associated with risk of specific cancers as well as other diseases, some of which have previously been shown to be associated with shorter LTL, suggesting a causal link. An interesting finding was that alleles associated with both shorter and longer telomeres showed associations with specific cancers, suggesting that variation in LTL in either direction may contribute to the development of specific cancers. As an example of a complex disease that has been shown to be associated with shorter LTL, we examined CAD.
Through an analysis of a large GWAS database of CAD30, we found that, although individually the lead SNPs at each of the telomere length–associated loci were not significantly associated with risk of CAD (probably at least in part reflecting their weak individual effects on LTL and low power), in a combined analysis, alleles associated with shorter LTL were associated with a significantly higher risk of CAD. Because the variants at each of the loci could have other biological effects that could affect their association with CAD through LTL (and possibly explain why the NAF1 locus may be trending in the opposite direction), some caution is required in the interpretation of this association. Nonetheless, the finding is consistent with that in the prospective WOSCOPS study where, after adjustment for other CAD risk factors, baseline LTL was associated with a 44% higher risk of CAD over the ensuing mean 5.5 years of follow-up in individuals in the tertile with the shortest LTL compared to that with the longest LTL13. Our finding here therefore supports a causal association of shorter LTL with CAD, and mechanistic investigation of this relationship is warranted.

In summary, we provide insights into the genetic determination of a structure that is critically involved in genomic stability and cellular function. Our findings suggest that several candidate genes encoding proteins with known function in telomere biology contribute to the LTL associations. The findings provide a framework for a genetic approach to investigating the causal role of telomere length in aging-related diseases.

URLs. R software, http://www.r-project.org/; 1000 Genomes Project, http://www.1000genomes.org/; Genotype-Tissue Expression Project, http://www.genome.gov/gtex/; and UCSC Genome Browser, http://genome.ucsc.edu/.

Methods
Methods and any associated references are available in the online version of the paper.

Note: Supplementary information is available in the online version of the paper.

Acknowledgments
This study was undertaken under the framework of European Union Framework 7 ENGAGE Project (HEALTH-F4-2007-201413). A full list of acknowledgments, including support for each study, is provided in the Supplementary Note.

AUTHOR CONTRIBUTIONS
V.C. and N.J.S. supervised the overall study. V.C., M.M., T.D.S., P.v.d.H. and N.J.S. designed the study. M.M., T.E., D.R.N., R.A.d.B., G.D.N., D.S., N.A., A.J.B., P.S.B., P.R.B., K.D., M.D., J.G.E., K.G., A.-L.H., A.K.H., L.C. Karssen, J.K., N.K., V.L., I.M.L., E.M.v.L., P.A.M., R.M., P.K.E.M., S.M., M.I.M., S.E.M., E.M., G.W.M., B.A.O., J.P., A. Palotie, A. Peters, Anneli Pouta, I.P., S.R., V.S., A.M.V., N.V., A.V., H.-E.W., E.W., G.W., M.J.W., K.X., X.X., D.J.v.V., A.L.C., M.D.T., A.S.H., A.I.F.B., P.J.T., N.L.P., M.P., J.D., W.O., J. Kaprio, N.G.M., C.M.v.D., C.G., A.M., D.I.B., M.-R.J., W.H.v.G., P.E.S., T.D.S., P.v.d.H. and N.J.S. contributed to recruitment, study and data management, genotyping and/or imputation of individual studies. V.C., J.L.B., M.K.M., R.A.d.B., J.P., E.D., L.K., H.P., P.T.J. and I.H. measured telomere length. C.P.N., E.A., M.M., J.D., J.L.B., J.J.H., K.F., T.E., I.S., L.B., D.R.N., R.A.d.B., P.S., S.H., G.D.N., P.F.O., I.M.L., S.E.M. and P.v.d.H. undertook association analysis of individual studies; C.P.N., E.A. and J.R.T. carried out the meta-analysis and the additional reported analyses. H.Z., X.W., D.G. and Y.D. provided data on telomerase activity and genotypes. J.E., M.P.R., S.K. and H.S. contributed CAD association data on behalf of CARDIoGRAM. V.C. and N.J.S. prepared the paper together with C.P.N., E.A., M.M. and P.v.d.H., and all authors reviewed the paper.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Blackburn, E.H., Greider, C.W. & Szostak, J.W. Telomeres and telomerase: the path from maize, Tetrahymena and yeast to human cancer and aging. Nat. Med. 12, 1133–1138 (2006).
2. Allsopp, R.C. et al. Telomere length predicts replicative capacity of human fibroblasts. Proc. Natl. Acad. Sci. USA 89, 10114–10118 (1992).
3. Wang, C. & Meier, U.T. Architecture and assembly of mammalian H/ACA small nucleolar and telomerase ribonucleoproteins. EMBO J. 23, 1857–1867 (2004).
4. Ma, H. et al. Shortened telomere length is associated with increased risk of cancer: a meta-analysis. PLoS ONE 6, e20466 (2011).
5. Wentzensen, I.M., Mirabello, L., Pfeiffer, R.M. & Savage, S.A. The association of telomere length and cancer: a meta-analysis. Cancer Epidemiol. Biomarkers Prev. 20, 1238–1250 (2011).
6. Chang, S., Khoo, C.M., Naylor, M.L., Maser, R.S. & DePinho, R.A. Telomere-based crisis: functional differences between telomerase activation and ALT in tumour progression. Genes Dev. 17, 88–100 (2002).
7. Slagboom, P.E., Droog, S. & Boomsma, D.I. Genetic determination of telomere size in humans: a twin study of three age groups. Am. J. Hum. Genet. 55, 876–882 (1994).
8. Njajou, O.T. et al. Telomere length is paternally inherited and is associated with parental lifespan. Proc. Natl. Acad. Sci. USA 104, 12135–12139 (2007).
9. Vasa-Nicotera, M. et al. Mapping of a major locus that determines telomere length in humans. Am. J. Hum. Genet. 76, 147–151 (2005).
10. Wilson, W.R. et al. Blood leukocyte telomere DNA content predicts vascular telomere DNA content in humans with and without vascular disease. Eur. Heart J. 29, 2689–2694 (2008).
11. Okuda, K. et al. Telomere length in the newborn. Pediatr. Res. 52, 377–381 (2002).
12. Brouilette, S., Singh, R.K., Thompson, J.R., Goodall, A.H. & Samani, N.J. White cell telomere length and risk of premature myocardial infarction. Arterioscler. Thromb. Vasc. Biol. 23, 842–846 (2003).
13. Brouilette, S. et al. Telomere length, risk of coronary heart disease, and statin treatment in the West of Scotland Primary Prevention Study: a nested case-control study. Heart 94, 422–425 (2008).
14. Fitzpatrick, A.L. et al. Leukocyte telomere length and cardiovascular disease in the cardiovascular health study. Am. J. Epidemiol. 165, 14–21 (2007).
15. Benetos, A. et al. Short telomeres are associated with increased carotid atherosclerosis in hypertensive subjects. Hypertension 43, 182–185 (2004).
16. Samani, N.J. & van der Harst, P. Biological ageing and cardiovascular disease. Heart 94, 537–539 (2008).
17. Codd, V. et al. Common variants near TERC are associated with mean telomere length. Nat. Genet. 42, 197–199 (2010).
18. Levy, D. et al. Genome-wide association identifies OBFC1 as a locus involved in human leukocyte telomere biology. Proc. Natl. Acad. Sci. USA 107, 9293–9298 (2010).
19. Jones, A.M. et al. Terc polymorphisms are associated both with susceptibility to colorectal cancer and with longer telomeres. Gut 61, 248–254 (2012).

20. Egan, E.D. & Collins, K. An enhanced H/ACA RNP assembly mechanism for human telomerase RNA. Mol. Cell. Biol. 32, 2428–2439 (2012).
21. Miyake, Y. et al. RPA-like mammalian Ctc1-Stn1-Ten1 complex binds to single-stranded DNA and protects telomeres independently of the Pot1 pathway. Mol. Cell 36, 193–206 (2009).
22. Ding, H. et al. Regulation of murine telomere length by Rtel1: an essential gene encoding a helicase-like protein. Cell 117, 873–886 (2004).
23. Barber, L.J. et al. RTEL1 maintains genomic stability by suppressing homologous recombination. Cell 135, 261–271 (2008).
24. Kim, J.W., Kwon, O.Y. & Kim, M.H. Differentially expressed genes and morphological changes during lengthened immobilization in rat soleus muscle. Differentiation 75, 147–157 (2007).
25. Farzaneh-Far, R. et al. Telomere length trajectory and its determinants in persons with coronary artery disease: longitudinal findings from the heart and soul study. PLoS ONE 5, e8612 (2010).
26. Alder, J.K. et al. Short telomeres are a risk factor for idiopathic pulmonary fibrosis. Proc. Natl. Acad. Sci. USA 105, 13051–13056 (2008).
27. Richter, T. & von Zglinicki, T. Continuous correlation between oxidative stress and telomere length shortening in fibroblasts. Exp. Gerontol. 41, 1039–1042 (2007).
28. Valdes, A.M. et al. Obesity, cigarette smoking, and telomere length in women. Lancet 366, 662–664 (2005).
29. Bekaert, S. et al. Telomere length and cardiovascular risk factors in a middle-aged population free of overt cardiovascular disease. Aging Cell 6, 639–647 (2007).
30. Schunkert, H. et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43, 333–338 (2011).
31. International Consortium for Blood Pressure Genome-Wide Association Studies. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103–109 (2011).
32. Lin, J., Epel, E. & Blackburn, E. Telomeres and lifestyle factors: roles in cellular aging. Mutat. Res. 730, 85–89 (2012).
33. Farzaneh-Far, R. et al. Association of marine omega-3 fatty acid levels with telomeric aging in patients with coronary heart disease. J. Am. Med. Assoc. 303, 250–257 (2010).
34. Epel, E.S. et al. Accelerated telomere shortening in response to life stress. Proc. Natl. Acad. Sci. USA 101, 17312–17315 (2004).
35. Minamino, T. et al. Endothelial cell senescence in human atherosclerosis: role of telomere in endothelial dysfunction. Circulation 105, 1541–1544 (2002).
36. Oh, H. et al. Telomerase reverse transcriptase promotes cardiac muscle cell proliferation, hypertrophy, and survival. Proc. Natl. Acad. Sci. USA 98, 10308–10313 (2001).
37. Wong, K.K. et al. Telomere dysfunction and Atm deficiency compromises organ homeostasis and accelerates ageing. Nature 421, 643–648 (2003).
38. Samper, E., Flores, J.M. & Blasco, M.A. Restoration of telomerase activity rescues chromosomal instability and premature aging in Terc−/− mice with short telomeres. EMBO Rep. 2, 800–807 (2001).
39. Jaskelioff, M. et al. Telomerase reactivation reverses tissue degeneration in aged telomerase deficient mice. Nature 469, 102–106 (2011).

Veryan Codd1,2,67, Christopher P Nelson1,2,67, Eva Albrecht3,67, Massimo Mangino4,67, Joris Deelen5,6, Jessica L Buxton7, Jouke Jan Hottenga8, Krista Fischer9, Tõnu Esko9, Ida Surakka10,11, Linda Broer6,12,13, Dale R Nyholt14, Irene Mateo Leach15, Perttu Salo11, Sara Hägg16, Mary K Matthews1, Jutta Palmen17, Giuseppe D Norata18–20, Paul F O'Reilly21,22, Danish Saleheen23,24, Najaf Amin12, Anthony J Balmforth25, Marian Beekman5,6, Rudolf A de Boer15, Stefan Böhringer26, Peter S Braund1, Paul R Burton27, Anton J M de Craen28, Matthew Denniff1, Yanbin Dong29, Konstantinos Douroudis9, Elena Dubinina1, Johan G Eriksson11,30–32, Katia Garlaschelli19, Dehuang Guo29, Anna-Liisa Hartikainen33, Anjali K Henders14, Jeanine J Houwing-Duistermaat6,26, Laura Kananen34,35, Lennart C Karssen12, Johannes Kettunen10,11, Norman Klopp36,37, Vasiliki Lagou38, Elisabeth M van Leeuwen12, Pamela A Madden39, Reedik Mägi9, Patrik K E Magnusson16, Satu Männistö11, Mark I McCarthy38,40,41, Sarah E Medland14, Evelin Mihailov9, Grant W Montgomery14, Ben A Oostra12, Aarno Palotie42,43, Annette Peters36,44,45, Helen Pollard1, Anneli Pouta33,46, Inga Prokopenko38, Samuli Ripatti10,11,42, Veikko Salomaa11, H Eka D Suchiman5, Ana M Valdes4, Niek Verweij15, Ana Viñuela4, Xiaoling Wang29, H-Erich Wichmann47–49, Elisabeth Widen10, Gonneke Willemsen8, Margaret J Wright14, Kai Xia50, Xiangjun Xiao51, Dirk J van Veldhuisen15, Alberico L Catapano18,52, Martin D Tobin27, Alistair S Hall25, Alexandra I F Blakemore7, Wiek H van Gilst15, Haidong Zhu29, CARDIoGRAM consortium53, Jeanette Erdmann54, Muredach P Reilly55, Sekar Kathiresan56–58, Heribert Schunkert54, Philippa J Talmud17, Nancy L Pedersen16, Markus Perola9–11, Willem Ouwehand42,59,60, Jaakko Kaprio10,61,62, Nicholas G Martin14, Cornelia M van Duijn6,12,13, Iiris Hovatta34,35,62, Christian Gieger3, Andres Metspalu9, Dorret I Boomsma8, Marjo-Riitta Jarvelin21,22,63–65, P Eline Slagboom5,6, John R Thompson27, Tim D Spector4, Pim van der Harst1,15,66,67 & Nilesh J Samani1,2

1Department of Cardiovascular Sciences, University of Leicester, Leicester, UK. 2National Institute for Health Research Leicester Cardiovascular Biomedical Research Unit, Glenfield Hospital, Leicester, UK. 3Institute of Genetic Epidemiology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 4Department of Twin Research and Genetic Epidemiology, King's College London, London, UK. 5Section of Molecular Epidemiology, Leiden University Medical Center, Leiden, The Netherlands. 6Netherlands Consortium for Healthy Aging, Leiden University Medical Center, Leiden, The Netherlands. 7Section of Investigative Medicine, Imperial College London, London, UK. 8Netherlands Twin Register, Department of Biological Psychology, Vrije Universiteit, Amsterdam, The Netherlands. 9Estonian Genome Center, University of Tartu, Tartu, Estonia. 10Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland. 11Public Health Genomics Unit, Department of Chronic Disease Prevention, National Institute for Health and Welfare, Helsinki, Finland. 12Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands. 13Centre for Medical Systems Biology, Leiden, The Netherlands. 14Queensland Institute of Medical Research, Brisbane, Australia. 15Department of Cardiology, University of Groningen, University Medical Center, Groningen, The Netherlands. 16Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 17Institute of Cardiovascular Science, University College London, London, UK. 18Department of Pharmacological and Biomolecular Sciences, Università degli Studi di Milano, Milan, Italy. 19Centro Società Italiana per lo Studio dell'Aterosclerosi, Bassini Hospital, Cinisello B, Italy. 20The Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University, London, UK. 21Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London, UK. 22Medical Research Council–Health Protection Agency Centre for Environment and Health, Faculty of Medicine, Imperial College London, London, UK. 23Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK. 24Center for Non-Communicable Diseases, Karachi, Pakistan. 25Division of Epidemiology, Leeds Institute of Genetics, Health and Therapeutics, School of Medicine, University of Leeds, Leeds, UK. 26Section of Medical Statistics, Leiden University Medical Center, Leiden, The Netherlands. 27Department of Health Sciences, University of Leicester, Leicester, UK. 28Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden, The Netherlands. 29Georgia Prevention Institute, Georgia Health Sciences University, Augusta, Georgia, USA. 30University of Helsinki, Department of General Practice and Primary Health Care, Helsinki, Finland. 31Folkhälsan Research Center, Helsinki, Finland. 32Unit of General Practice, Helsinki University Central Hospital, Helsinki, Finland. 33Institute of Clinical Medicine/Obstetrics and Gynecology, University of Oulu, Oulu, Finland. 34Research Programs Unit, Molecular Neurology, Biomedicum Helsinki, University of Helsinki, Helsinki, Finland. 35Department of Medical Genetics, Haartman Institute, University of Helsinki, Helsinki, Finland. 36Research Unit of Molecular Epidemiology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 37Hanover Unified Biobank, Hanover Medical School, Hanover, Germany. 38Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, UK. 39Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA. 40Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK. 41Oxford National Institute for Health Research Biomedical Research Centre, Churchill Hospital, Oxford, UK. 42Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. 43Department of Medical Genetics, University of Helsinki and the Helsinki University Hospital, Helsinki, Finland. 44Institute of Epidemiology II, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 45Munich Heart Alliance, Munich, Germany. 46National Institute for Health and Welfare, Oulu, Finland. 47Institute of Epidemiology I, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany. 48Institute of Medical Informatics, Biometry and Epidemiology, Chair of Epidemiology, Ludwig Maximilians Universität, Munich, Germany. 49Klinikum Grosshadern, Munich, Germany. 50Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, USA. 51Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA. 52Instituto di Ricovero e Cura a Carattere Scientifico Multimedica, Milan, Italy. 53A full list of members is provided in the Supplementary Note. 54Universität zu Lübeck, Medizinische Klinik II, Lübeck, Germany. 55The Cardiovascular Institute, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 56Cardiovascular Research Center and Cardiology Division, Massachusetts General Hospital, Boston, Massachusetts, USA. 57Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, USA. 58Program in Medical and Population Genetics, Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA. 59Department of Hematology, University of Cambridge, Cambridge, UK. 60National Health Service Blood and Transplant, Cambridge, UK. 61University of Helsinki, Hjelt Institute, Department of Public Health, Helsinki, Finland. 62Department of Mental Health and Substance Abuse Services, National Institute for Health and Welfare, Helsinki, Finland. 63Institute of Health Sciences, University of Oulu, Oulu, Finland. 64Biocenter Oulu, University of Oulu, Oulu, Finland. 65Department of Lifecourse and Services, National Institute for Health and Welfare, Oulu, Finland. 66Department of Genetics, University of Groningen, University Medical Center, Groningen, The Netherlands. 67These authors contributed equally to this work. Correspondence should be addressed to N.J.S. (njs@le.ac.uk).

ONLINE METHODS

Subjects. A total of 37,684 individuals from 15 cohorts were used in the GWAS meta-analysis, along with an additional 10,739 individuals from six cohorts for replication of selected variants. All individuals were of European descent. Full details of the discovery and replication cohorts are given in the Supplementary Note, and key characteristics are summarized in Supplementary Table 1.

Measurements of telomere length and quality control analysis. Mean LTL was measured using a quantitative PCR-based technique40,41 in all samples. This method expresses telomere length as a ratio (T/S) of telomere repeat length (T) to copy number of a single-copy gene (S) in each sample. To standardize across plates, either a calibrator sample or a standard curve was used for quantification. LTL measurements were made in five separate laboratories. Laboratories used are listed for each cohort in Supplementary Table 1, and details for the methods used are provided in the Supplementary Note. The majority of the samples (67% of the total) were run in a single laboratory with mean inter-run coefficients of variation for LTL measurements in individual cohorts ranging between 2.7% and 3.9%. The remaining samples were run in four other laboratories (Supplementary Note). Mean LTL was first assessed for age-related shortening and for an association of longer LTL with female sex in all cohorts, and showed expected associations (Supplementary Table 1a,b). Ranges in T/S ratios were found to vary between cohorts measured in different laboratories (Supplementary Table 1), largely owing to differences in the calibrator or standard DNA used. We therefore standardized LTL in each cohort using a Z-transformation approach. The Z-transformation was performed separately for males and females for sex-stratified analysis.
Effects of age, adjusted for sex, on LTL were estimated in a multiple-regression model on untransformed and Z-transformed telomere length in each study separately and combined using a random-effects meta-analysis in STATA (version 11.2, Supplementary Fig. 3).

Genotyping, GWAS analysis and study-level quality control. All discovery cohorts had genome-wide genotype information generated on a standard genotyping platform and include imputed genotypes based on HapMap II CEU build 36 as a reference. Detailed information about individual genotyping platforms, imputation methods and analysis software is provided in Supplementary Table 2. Within each cohort, SNP associations with LTL were analyzed by linear regression assuming additive effects with adjustment for age and sex as well as study-specific covariates where appropriate, such as adjustments for family and population structure (Supplementary Table 2). All study-specific files underwent extensive quality control procedures before meta-analysis. All files were checked for completeness and plausible descriptive statistics on all variables, partly supported by the gwasqc function in the program R. Allele frequencies were checked for compliance with HapMap. In addition to the study-specific quality control filters, we included SNP results of a study in our meta-analysis only if the SNP imputation quality score was >0.5 and if the minor allele frequency was >1%. Only SNPs that were available in >50% of the total sample size over all studies were analyzed, resulting in a total number of 2,362,330 SNPs in the meta-analysis.

Meta-analyses. Meta-analysis of all individual study associations was conducted using inverse variance weighting in STATA. As a measure of between-study heterogeneity, I² was calculated42. For SNPs with I² ≤ 40%, fixed-effects models were applied; random-effects models were applied for SNPs with I² > 40%. Fixed-effects results were verified by an independent analyst using METAL43.
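The model-selection rule just described (fixed effects when I² is at most 40%, random effects otherwise) can be sketched in a few lines. This is a minimal illustration, not the STATA or METAL code used in the study: the function name is hypothetical, and the DerSimonian-Laird estimator for the between-study variance is an assumption, since the text does not say which random-effects estimator was applied.

```python
import math


def inverse_variance_meta(betas, ses, i2_threshold=40.0):
    """Pool per-study estimates for one SNP; switch model on I^2 (in percent)."""
    weights = [1.0 / se ** 2 for se in ses]
    sum_w = sum(weights)
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum_w

    # Cochran's Q and the I^2 heterogeneity measure (Higgins et al.).
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    if i2 <= i2_threshold:
        return {"model": "fixed", "beta": beta_fixed,
                "se": math.sqrt(1.0 / sum_w), "i2": i2}

    # Assumption: DerSimonian-Laird estimate of between-study variance tau^2.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    sum_re = sum(re_weights)
    beta_random = sum(w * b for w, b in zip(re_weights, betas)) / sum_re
    return {"model": "random", "beta": beta_random,
            "se": math.sqrt(1.0 / sum_re), "i2": i2}


# Made-up per-study effects for illustration only.
homogeneous = inverse_variance_meta([0.10, 0.10, 0.10], [0.05, 0.05, 0.05])
heterogeneous = inverse_variance_meta([0.0, 0.5], [0.05, 0.05])
print(homogeneous["model"], heterogeneous["model"])
```

The random-effects branch widens the standard error by adding tau² to each study's variance, which is why heterogeneous SNPs are less likely to reach genome-wide significance under this rule.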
Before meta-analysis, standard errors of each study were genomic control corrected using study-specific lambda estimates as provided in Supplementary Table 2. The overall inflation factor lambda of the meta-analyzed results was 1.007. Results were further corrected for this. SNPs showing association with telomere length with P < 5 × 10⁻⁸, which corresponds to a Bonferroni correction of one million independent tests, were considered to be statistically significant44.

Replication study. Replication was sought for two SNPs reaching borderline significant P values in the discovery analysis. Additional subsets of NTR and ECGUT along with the Leiden 85-plus study had LTL measurements performed. LTL measurements were available for the GRAPHIC and PLIC cohorts and for additional samples of PREVEND. De novo genotyping was performed either using a commercial genotyping service (GRAPHIC, PREVEND, KBioscience) or by Taqman genotyping as described previously45. In these studies, the same model was applied as in the discovery studies. Single study results were meta-analyzed using inverse variance–weighted fixed-effects models in STATA.

Sex-stratified analysis. Genome-wide associations were additionally conducted separately in women and men to investigate whether sex-specific signals existed. Furthermore, all top SNPs from the overall discovery GWAS were tested for differences between women and men by means of the normally distributed test statistic $(\beta_w - \beta_m)/\sqrt{se_w^2 + se_m^2}$, where β is the effect estimate, se represents standard error, w women and m men. The results of this analysis are given in Supplementary Table 3.

Conditional association analysis. Regional association plots were generated using LocusZoom46 for each of the loci containing significantly associated SNPs. These were assessed to check that additional SNPs in high linkage disequilibrium (LD) with the lead SNP also showed some degree of association with telomere length. This was confirmed, but it was evident that some regions (5p15.33, 10q24.33 and 20q13.3) contained SNPs in low LD with the lead SNP that also showed association with LTL. To assess whether independent signals existed at these loci, conditional analyses were carried out. In a subset of studies, a multiple regression model was calculated for each locus including both SNPs. Adjustments were made in the same way as in the single SNP models. Individual study results were meta-analyzed using fixed effects in R and compared to the meta-analysis results of single SNP models within the same subset of studies. Independence was defined as the percentage change in the effect estimate between the single and the multiple SNP model being ≤25%. The data are provided in Supplementary Table 4.

Calculations of explained variances.
Explained variances were calculated based on the effect estimates (β) and allele frequencies (EAF) of each single SNP by $2 \times \mathrm{EAF}(1 - \mathrm{EAF}) \times \beta^2/\mathrm{var}$, as suggested before47. The phenotypic variance (var) is equal to 1 as the analysis was performed using Z-transformed telomere length.

Genetic risk scores. To assess the impact of these variants on risk of CAD, we performed a multiple-SNP risk score analysis as previously described31. This method is equivalent to a fixed-effects inverse variance–weighted meta-analysis of the ratio between the two traits. Lookups were performed in CARDIoGRAM30 ($\beta_1$) to obtain the effect sizes for the seven SNPs along with the standard errors for CAD risk. These were then converted to a ratio ($\beta_3$) along with its standard error using the estimates from the telomere meta-analysis ($\beta_2$). We removed the BHF-FHS and NBS data from this analysis because they were included in the CARDIoGRAM analysis and to avoid the possibility of reverse causation given the nature of the BHF-FHS sample. The single SNP results were then meta-analyzed using fixed effects with inverse-variance weighting. The pooled estimate can be interpreted as the effect of a standard deviation increase in telomere length on the risk of CAD.

Leukocyte telomerase activity assays. Details of the cohort are provided in the Supplementary Note. Peripheral blood mononuclear cells (PBMCs) were freshly isolated from whole blood by Ficoll-Paque Premium (Sigma-Aldrich) gradient centrifugation within 1 h after blood draw. Isolated PBMCs were stored in a cryopreservation medium composed of RPMI1640, 10% dimethyl sulfoxide and 10% FBS in a liquid nitrogen tank until additional processing. Telomerase activity was assayed by the Telo TAGGG Telomerase PCR ELISA kit (Roche Applied Science) (TRAP assay) per the manufacturer's protocol using 2 × 10⁵ cells per assay. An extract from 2,000 cells was used for TRAP reactions.
Sample telomerase activity was expressed as the ratio of the telomerase activity value to the control HK293 telomerase activity value from 1,000 cells. Intra-assay coefficient of variance (CV) was 5.9% and inter-assay CV was 4.8%. Telomerase activity was log-transformed to obtain better approximations of the normal distribution before analysis. Association analyses with genotype were performed using regression and an additive model with adjustment for age, sex and ethnicity. The interaction between SNP and ethnicity was also included in the regression model to test whether the effect of the SNP on telomerase activity is ethnicity-dependent. The power of the study to detect a SNP effect on telomerase activity was computed using the Genetic Power Calculator48.

Bioinformatics analyses. For all analyses, we tested lead SNPs and SNPs with an r² > 0.7 to the lead SNP identified through the 1000 Genomes study at each locus. Functional predictions of any identified coding variants were carried out using PolyPhen2 (ref. 49) and SIFT50. To assess whether any variants influenced gene expression, we searched two available genome-wide gene expression databases, the monocyte genome-wide gene expression data from the Gutenberg Heart Study51 and the Genotype-Tissue Expression Project (GTEx) database, which includes liver, brain and lymphoblastoid cell types. To identify regulatory variants, we searched ENCODE data in the UCSC Genome Browser database52 to examine whether any SNPs were located within promoter, enhancer or insulator regions (Chromatin State Segmentation), methylation sites (predicted CpG islands and methylation status of the CpG site using Methyl 450K Bead array data and bisulfite sequencing), conserved elements, conserved transcription factor binding sites and regions of known transcription factor binding as shown by transcription factor chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq).

doi:10.1038/ng.2528
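The multiple-SNP risk score described under "Genetic risk scores" can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the per-SNP odds ratios are transcribed from the Figure 3 values, standard errors are back-calculated from the rounded 95% confidence limits, and the ratio step uses a first-order standard error that ignores uncertainty in the telomere effect estimate (the text does not specify how the standard error of the ratio was derived).

```python
import math

# (OR, lower 95% limit, upper 95% limit) for the seven SNPs, transcribed from
# the Figure 3 values. Rounding in the figure limits the precision here.
SNP_ODDS_RATIOS = [
    (0.71, 0.37, 1.37),
    (0.76, 0.54, 1.07),
    (1.43, 0.89, 2.30),
    (0.77, 0.50, 1.17),
    (0.41, 0.22, 0.76),
    (0.67, 0.36, 1.26),
    (0.93, 0.43, 2.00),
]


def wald_ratio(beta_cad, se_cad, beta_ltl):
    """Ratio beta_3 = beta_1 / beta_2 with a first-order standard error.

    Assumption: uncertainty in beta_ltl (beta_2) is ignored.
    """
    return beta_cad / beta_ltl, se_cad / abs(beta_ltl)


def log_or_with_se(odds_ratio, lower, upper, z=1.96):
    """Recover a log odds ratio and its standard error from a 95% CI."""
    return math.log(odds_ratio), (math.log(upper) - math.log(lower)) / (2 * z)


def pool_fixed_effects(estimates):
    """Fixed-effects inverse variance-weighted pooling of (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


per_snp = [log_or_with_se(*row) for row in SNP_ODDS_RATIOS]
pooled_log_or, pooled_se = pool_fixed_effects(per_snp)
pooled_or = math.exp(pooled_log_or)
print(f"pooled OR per s.d. longer LTL: {pooled_or:.2f}")
```

Pooling the seven per-SNP estimates this way comes out close to the overall odds ratio of 0.79 shown in Figure 3; small differences in the confidence limits are expected because the inputs are rounded to two decimals.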
40. Cawthon, R.M. Telomere measurement by quantitative PCR. Nucleic Acids Res. 30, e47 (2002).
41. Cawthon, R.M. Telomere length measurement by a novel monochrome multiplex quantitative PCR method. Nucleic Acids Res. 37, e21 (2009).
42. Higgins, J.P. et al. Measuring inconsistency in meta-analyses. Br. Med. J. 327, 557–560 (2003).
43. Willer, C.J. et al. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
44. Pe'er, I., Yelensky, R., Altshuler, D. & Daly, M.J. Estimation of the multiple testing burden for genomewide association studies of nearly all common variants. Genet. Epidemiol. 32, 381–385 (2008).
45. Salpea, K.D. et al. Association of telomere length with type 2 diabetes, oxidative stress and UCP2 gene variation. Atherosclerosis 209, 42–50 (2010).
46. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).
47. Heid, I.M. et al. Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat. Genet. 42, 949–960 (2010).
48. Purcell, S., Cherny, S.S. & Sham, P.C. Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19, 149–150 (2003).
49. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
50. Kumar, P., Henikoff, S. & Ng, P.C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).
51. Zeller, T. et al. Genetics and beyond – the transcriptome of human monocytes and disease susceptibility. PLoS ONE 5, e10693 (2010).
52. Rosenbloom, K.R. et al. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Res. 40, D912–D917 (2012).

doi:10.1038/ng.2528

Nature Genetics

letters

A variant in FTO shows association with melanoma risk not due to BMI
Mark M Iles1, Matthew H Law2, Simon N Stacey3, Jiali Han4–6, Shenying Fang7, Ruth Pfeiffer8, Mark Harland1, Stuart MacGregor2, John C Taylor1, Katja K Aben9,10, Lars A Akslen11,12, Marie-Françoise Avril13, Esther Azizi14,15, Bert Bakker16, Kristrun R Benediktsdottir17,18, Wilma Bergman19, Giovanna Bianchi Scarrà20,21, Kevin M Brown22, Donato Calista23, Valérie Chaudru24–26, Maria Concetta Fargnoli27, Anne E Cust28, Florence Demenais24,26,29, Anne C de Waal10,30, Tadeusz Dębniak31, David E Elder32, Eitan Friedman15, Pilar Galan33, Paola Ghiorzo20,21, Elizabeth M Gillanders34, Alisa M Goldstein8, Nelleke A Gruis19, Johan Hansson35, Per Helsing36, Marko Hočevar37, Veronica Höiom35, John L Hopper38, Christian Ingvar39, Marjolein Janssen40, Mark A Jenkins38, Peter A Kanetsky41,42, Lambertus A Kiemeney9,10,43, Julie Lang44, G Mark Lathrop26,45, Sancy Leachman46, Jeffrey E Lee47, Jan Lubiński31, Rona M Mackie48, Graham J Mann49, Nicholas G Martin2, Jose I Mayordomo50, Anders Molven11,12, Suzanne Mulder40, Eduardo Nagore51, Srdjan Novaković52, Ichiro Okamoto53, Jon H Olafsson54, Håkan Olsson55, Hubert Pehamberger56, Ketty Peris27, Maria Pilar Grasa50, Dolores Planelles56, Susana Puig57,58, Joan Anton Puig-Butille57,58, Q-MEGA and AMFS Investigators59, Juliette Randerson-Moor1, Celia Requena51, Licia Rivoltini60, Monica Rodolfo60, Mario Santinami61, Bardur Sigurgeirsson54, Helen Snowden1, Fengju Song4,62, Patrick Sulem3, Kristin Thorisdottir54, Rainer Tuominen35, Patricia Van Belle63, Nienke van der Stoep16, Michelle M van Rossum30, Qingyi Wei64, Judith Wendt53, Diana Zelenika45, Mingfeng Zhang4, Maria Teresa Landi8, Gudmar Thorleifsson3, D Timothy Bishop1, Christopher I Amos7,65, Nicholas K Hayward2, Kari Stefansson3,17, Julia A Newton Bishop1 & Jennifer H Barrett1, for the GenoMEL Consortium59
We report the results of an association study of melanoma that is based on the genome-wide imputation of the genotypes of 1,353 cases and 3,566 controls of European origin conducted by the GenoMEL consortium. This revealed an association with several SNPs in intron 8 of the FTO gene, including rs16953002, which replicated using 12,313 cases and 55,667 controls of European ancestry from Europe, the USA and Australia (combined P = 3.6 × 10^-12, per-allele odds ratio for allele A = 1.16). In addition to identifying a new melanoma-susceptibility locus, this is to our knowledge the first study to identify and replicate an association with SNPs in FTO not related to body mass index (BMI). These SNPs are not in intron 1 (the BMI-related region) and exhibit no association with BMI. This suggests FTO's function may be broader than the existing paradigm that FTO variants influence multiple traits only through their associations with BMI and obesity.

Cutaneous melanoma is a disease predominantly of fair-skinned individuals. Established risk factors include a family history of melanoma1, pigmentation phenotypes such as an inability to tan2–5 and many melanocytic nevi6,7. Established genetic risk factors include
A full list of affiliations appears at the end of the paper. Received 23 October 2012; accepted 5 February 2013; published online 3 March 2013; doi:10.1038/ng.2571


rare, highly penetrant variants, at least 11 common variants of lower effect identified by genome-wide association studies (GWAS)8,9 (many related to pigmentation or nevus count10,11) and mutations of intermediate effect in the MITF gene identified through a candidate-gene approach in individuals affected with melanoma and renal-cell carcinoma12 and sequencing genomes of multiply affected melanoma families13.

The FTO gene was first found to be associated with obesity in GWAS of type 2 diabetes14 and obesity15,16. Most14,17–21 but not all22,23 studies found no association between FTO and type 2 diabetes risk after adjustment for BMI. The strongest associations were with variants in intron 1 of FTO, but linkage disequilibrium (LD) stretches across introns 1 and 2 and exon 2. No SNP outside intron 1 has been previously associated with any trait, and no SNP in intron 1 has been associated with any trait unrelated to BMI.

The GenoMEL consortium focuses on genetic susceptibility to melanoma and has conducted two melanoma GWAS (Phase 1 and Phase 2) using samples from populations of European or Israeli ancestry9,11. Genotypes of the 1,373 cases and 3,571 controls from Phase 1 of the GenoMEL GWAS of melanoma9 were imputed, giving 2.6 million SNPs, each tested for association with melanoma risk using

Figure 1 Results of stratified trend tests of imputed data for association with melanoma in the region around FTO in GenoMEL Phase 1 and 2 data combined. -log10 P values for association between SNPs in the region of FTO and melanoma case-control status are shown adjusted for geographic region. Color of points indicates extent of LD (r^2) with rs16953002 (indicated by purple square). SNPs genotyped in all GenoMEL samples are plotted as circles, SNPs imputed in all samples as crosses and SNPs genotyped in some samples and imputed in others (as a result of chip differences) as squares. Positions of genes are given under the graph, and estimated recombination rates are given by the blue line along the bottom, with scale on the right. Plot was produced using LocusZoom39.
[Plot axes: x, position on Chr. 16 (53.5–55 Mb); left y, -log10 (P value); right y, recombination rate (cM/Mb). Genes shown: CHD9, RBL2, AKTIP, RPGRIP1L, FTO, IRX3, CRNDE, IRX5.]

geographic region as a covariate (Online Methods). The most significant SNP in a region not previously associated with melanoma was in FTO. Three SNPs in intron 8 of FTO were significant at P < 10^-5, the most significant being rs16953002 (P = 5.59 × 10^-6, per-allele odds ratio (OR) = 1.33, risk allele A, risk allele frequency = 0.19) and rs12596638 (P = 4.43 × 10^-6, per-allele OR = 1.34, risk allele A, risk allele frequency = 0.19; in strong LD, r^2 = 0.96). We confirmed imputation quality by subsequent genotyping (Online Methods).

Following this finding, we imputed a region 1 Mb either side of rs16953002 for 1,449 cases and 4,043 controls in GenoMEL melanoma GWAS Phase 2 (ref. 11) and regressed SNP dosage on melanoma case-control status with geographic region as a covariate. In this analysis, we genotyped rs16953002 (P = 0.015, OR = 1.16) and imputed rs12596638 (P = 0.023, OR = 1.15). Combining all GenoMEL GWAS data gave five SNPs within 18 kb with P < 10^-4 in intron 8 of FTO and over 250 kb from the closest SNP associated with BMI (Fig. 1).

We sought replication (mainly using existing GWAS data) using other samples of European ancestry from Europe, Australia and the United States, totaling 10,865 cases and 51,624 controls (Supplementary Table 1). All replication samples combined exhibited association between rs16953002 and melanoma with an allelic OR of 1.14, P = 4.8 × 10^-9, with all sample sets showing OR estimates in the same direction as the original finding and with no evidence of heterogeneity. When we combined these data with the GenoMEL sample data, we observed strong evidence of association with melanoma: P = 3.6 × 10^-12, per-allele OR = 1.16, 95% confidence interval (1.11, 1.20), and no evidence of heterogeneity (I^2 = 0; Online Methods, Fig. 2 and Table 1).

BMI has, at best, a weak effect on risk of melanoma24,25.
Given the clear association between variants in FTO and BMI, we investigated whether the melanoma-associated SNPs showed any association with BMI or, conversely, whether the known BMI-associated SNPs showed any association with melanoma. BMI data were available for 37% of cases and 59% of controls (many of the GenoMEL samples and seven of the replication sets; Supplementary Table 1), with additional controls collected in Iceland to give 63,518 samples from Iceland with BMI data and 14,222 from elsewhere with BMI data. Adjusting log(BMI) for age and age-squared, and regressing this on SNP genotype, with case-control status and sex as covariates, there was no significant association between rs16953002 and BMI with a combined P value of 0.15 (Supplementary Fig. 1). A more powerful

data set for assessing BMI-SNP associations is that of the GIANT consortium26 (http://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files). In the GIANT consortium data, allele A of rs16953002 was very weakly associated with decreased BMI (P = 0.0156 in 123,852 individuals, indicating at most a very small effect size). In contrast, the genotyped SNP in the FTO region that was most strongly associated with BMI in the GenoMEL data was rs8050136 (P = 8.7 × 10^-56 in all our data sets combined; Supplementary Figs. 2 and 3). In the GIANT data set this association with BMI reached P = 1 × 10^-59. We also found very little LD between the two SNPs (r^2 = 0.000039 in 35,583 Icelandic controls and r^2 < 0.006 in every other control set). In a recent study in which FTO was sequenced, only SNPs in intron 1 were associated with BMI27. It could be that the rs16953002-BMI association in the GIANT data is due to a very well-powered data set picking up on slight LD. The great difference in the strength of association with BMI between rs8050136 and rs16953002 can clearly be seen in a plot of the GIANT results (Supplementary Fig. 4). rs8050136 was not associated with melanoma, having a combined meta-analysis P value with GenoMEL of 0.19 (per-allele OR = 1.02; Supplementary Fig. 5). Therefore, from our data, the known BMI-related SNPs were associated with BMI but not with melanoma risk, and the melanoma-associated SNPs exhibited no evidence of association with BMI (Table 1). We also found no association between melanoma risk and adjusted BMI in the GenoMEL data (P = 0.96).
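The LD statistic r^2 quoted above can be illustrated with a minimal sketch. It assumes r^2 is approximated by the squared Pearson correlation between per-individual allele dosages (0, 1 or 2) at two SNPs; the text does not state how the authors computed r^2, and the function name and toy genotypes below are hypothetical.

```python
from math import fsum


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between allele dosages at two SNPs.

    This is a genotype-based approximation to the LD statistic r^2,
    not necessarily the estimator used in the paper.
    """
    if len(dosage_a) != len(dosage_b) or len(dosage_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 individuals")
    n = len(dosage_a)
    mean_a = fsum(dosage_a) / n
    mean_b = fsum(dosage_b) / n
    cov = fsum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = fsum((a - mean_a) ** 2 for a in dosage_a)
    var_b = fsum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)


# Toy dosages (hypothetical): identical SNPs and an uncorrelated pair.
perfect_ld = ld_r2([0, 1, 2, 1], [0, 1, 2, 1])
no_ld = ld_r2([0, 0, 2, 2], [0, 2, 0, 2])
print(perfect_ld, no_ld)
```

Values near zero, such as those reported between rs16953002 and rs8050136, indicate that genotype at one SNP carries almost no information about genotype at the other.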


Study            OR     P
GenoMEL P1       1.32   1.3 × 10^-5
GenoMEL P2       1.16   0.015
Leeds cohort     1.21   0.021
Harvard          1.05   0.634
Australia        1.15   0.005
Italy            1.21   0.135
Houston          1.16   0.043
Iceland          1.19   0.006
Netherlands      1.12   0.151
Vienna           1.08   0.347
Milan            1.09   0.476
Sweden           1.08   0.435
Valencia         1.01   0.916
Zaragoza         1.18   0.173
All rep          1.14   4.8 × 10^-9
Overall          1.16   3.6 × 10^-12
[Plot axis: OR for rs16953002, scale 0.5 to 1.5.]

Figure 2 Forest plot of estimated per-allele ORs and P values for effect of rs16953002 on melanoma risk. Horizontal bars indicate 95% confidence intervals. Results are shown for GenoMEL Phase 1 discovery data and subsequent replication data with meta-analysis for replication data only (All rep) and all data (Overall).

Table 1 Association between rs16953002 and melanoma, and the BMI-related SNP rs8050136 and melanoma

rs16953002 and melanoma
                   Minor allele frequency   Number of cases (controls)   OR (95% CI)         P
GenoMEL Phase 1    0.16                     1,353 (3,566)                1.32 (1.17, 1.50)   1.3 × 10^-5
All replicates     0.17                     12,314 (55,667)              1.14 (1.09, 1.19)   4.8 × 10^-9
Overall            0.17                     13,667 (59,233)              1.16 (1.11, 1.20)   3.6 × 10^-12

rs8050136 and melanoma
                   Number of cases (controls)   OR (95% CI)         P
GenoMEL Phase 1    1,353 (3,566)                1.09 (0.99, 1.19)   0.08
All replicates     11,707 (57,160)              1.01 (0.98, 1.05)   0.45
Overall            13,060 (60,726)              1.03 (0.97, 1.10)   0.37
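As an illustration of how the discovery and replication estimates in Table 1 relate to the overall estimate, the sketch below pools the two rs16953002 odds ratios with a generic inverse-variance fixed-effect model. This is an assumption made for illustration: the authors' actual meta-analysis procedure is described in the Online Methods, which are not reproduced here, and the standard errors are back-calculated from the rounded 95% confidence intervals in the table. All function names are hypothetical.

```python
from math import exp, log, sqrt

Z95 = 1.959964  # two-sided 95% normal quantile


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover the log odds ratio and its standard error from a 95% CI."""
    return log(odds_ratio), (log(ci_high) - log(ci_low)) / (2 * Z95)


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of (OR, CI low, CI high) tuples."""
    total_weight = 0.0
    weighted_sum = 0.0
    for odds_ratio, ci_low, ci_high in estimates:
        beta, se = log_or_and_se(odds_ratio, ci_low, ci_high)
        weight = 1.0 / se ** 2  # precise studies get more weight
        total_weight += weight
        weighted_sum += weight * beta
    beta_pooled = weighted_sum / total_weight
    se_pooled = sqrt(1.0 / total_weight)
    return (
        exp(beta_pooled),
        exp(beta_pooled - Z95 * se_pooled),
        exp(beta_pooled + Z95 * se_pooled),
    )


# rs16953002 rows of Table 1: GenoMEL Phase 1 and all replicates.
table1_rows = [(1.32, 1.17, 1.50), (1.14, 1.09, 1.19)]
pooled_or, ci_low, ci_high = pool_fixed_effect(table1_rows)
print(f"pooled OR = {pooled_or:.2f} ({ci_low:.2f}, {ci_high:.2f})")
```

Under these assumptions the pooled estimate lands close to the overall OR of 1.16 reported in the table; small differences in the interval are expected because the inputs are rounded and the paper's overall figure was computed from the individual studies rather than from these two summary rows.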

The association between rs16953002 and melanoma risk was consistent across geographic regions (Fig. 2 and Supplementary Fig. 6), and we found no significant difference in effect across subsets of the GenoMEL data defined by sex, tumor site, family history, early onset of disease and multiple primary tumors or association with any established melanoma-related trait (nevus count or sun sensitivity; data not shown).

The association between rs16953002 and melanoma risk persisted in the subset of samples with BMI recorded even after adjusting for BMI (P = 0.01) despite a substantial reduction in sample size (Supplementary Table 2 and Supplementary Note). We split the GenoMEL data into quartiles defined by adjusted BMI of controls and regressed case/control status on rs16953002 with sex as a covariate in each quartile. The association was stronger for those samples in the first quartile (lowest BMI) than those in the other quartiles (OR = 1.66, P = 3.00 × 10^-5 versus maximum OR = 1.03, minimum P = 0.82; Supplementary Fig. 7), a difference that was significant (P = 0.0005). This is consistent with rs16953002 only being associated with melanoma risk in those people with low BMI. When we attempted to replicate the results defining BMI quartiles in each population, samples collected in Australia exhibited a similar effect (P = 0.003), but samples collected in other countries outside of the UK gave more equivocal results (Supplementary Fig. 7; P = 0.6 for all replicate samples and P = 0.06 with GenoMEL samples included). However, in the nine replication studies for which BMI data were available, rs16953002 always had the greatest association with melanoma risk for those in quartile 1 or 2.

Although the functional effect(s) of FTO is far from understood, evidence points to a variety of possible effects on BMI-related traits.
However, a loss-of-function mutation in FTO caused gross developmental defects in nine members of a Palestinian family, suggesting a broader function for FTO28. FTO has been associated with end-stage renal disease29, acute coronary syndrome30, myocardial infarction31, all-cause mortality32, Alzheimer's disease33 and osteoarthritis34. Even after adjustment for BMI, some BMI-related traits exhibit association with FTO variants, but it may be that BMI simply correlates with a weight-related factor that acts more directly on the trait of interest. Given that BMI is a risk factor for many cancers, the BMI-related SNPs in intron 1 of FTO have been studied in some of these cancers. A study of lung, kidney and upper aero-digestive cancers revealed no significant effect overall after correction for multiple testing35. The largest study of FTO and endometrial cancer found an association with a known BMI-associated SNP (P = 0.01)36 that disappears after adjustment for BMI. Thus, there is little evidence of variants in FTO being associated with any trait unrelated to BMI. It may be that the melanoma-associated SNPs are in LD with functional SNPs outside of FTO, but given the low level of LD in the region (Fig. 1) this seems unlikely. It should be noted that our most significant SNP, rs16953002, is only 31 kb from exon 9 of FTO, over 146 kb from exon 8 of FTO and over 202 kb from the nearest other gene, IRX3. SNPs overlapping regulatory elements, such as transcription factor-binding sites, can be identified using the recent Encyclopedia of DNA Elements (ENCODE) data as well as other data sources37,38. In these data for the FTO gene, 2,148 SNPs have

been identified, only eight of which reach the highest score possible without expression quantitative trait locus (eQTL) data (score 2a: likely to affect binding). Six of these SNPs are in intron 1, the location of most of the BMI-associated SNPs, five of these in a 5.4-kb region less than 1 kb from rs8050136. The other two SNPs are 13 kb apart from one another in intron 8 and, notably, one of these is rs16953002, the melanoma-associated SNP (Supplementary Note).

In conclusion, this is the first time to our knowledge that any variant in FTO has been shown to have a replicable association with a trait without being associated with BMI. It is also the first time that any variant in FTO outside intron 1 has been shown to have any association with any trait. As such, this will be of interest to researchers in the fields of both cancer genetics and obesity research.

URLs. GenoMEL, http://www.genomel.org/; Wellcome Trust Case Control Consortium, http://www.wtccc.org.uk/; RegulomeDB, http://regulomedb.org/; and Epidemiological Study on the Genetics and Environment of Asthma study, https://egeanet.vjf.inserm.fr/.

Methods
Methods and any associated references are available in the online version of the paper.
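The BMI-quartile stratification described in the main text (quartiles defined by the adjusted BMI of controls, then applied to all samples) can be sketched as follows. This is an illustrative outline only: the adjusted-BMI values are toy numbers, the function names are hypothetical, and the per-quartile logistic regression of case/control status on rs16953002 with sex as a covariate is not reproduced.

```python
from statistics import quantiles


def quartile_cutpoints(control_values):
    """Three cut points that split the control distribution into quartiles."""
    return quantiles(control_values, n=4)


def assign_quartile(value, cutpoints):
    """Return 1 (lowest BMI) to 4 (highest BMI) for one sample."""
    return 1 + sum(value > cut for cut in cutpoints)


def stratify(samples, cutpoints):
    """Group (sample_id, adjusted_bmi) pairs by control-defined quartile."""
    strata = {1: [], 2: [], 3: [], 4: []}
    for sample_id, adjusted_bmi in samples:
        strata[assign_quartile(adjusted_bmi, cutpoints)].append(sample_id)
    return strata


# Toy adjusted-BMI values (hypothetical, arbitrary units).
controls = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cases = [("case_a", 0.5), ("case_b", 3.1), ("case_c", 5.2), ("case_d", 9.0)]

cuts = quartile_cutpoints(controls)
strata = stratify(cases, cuts)
print(cuts, strata)
```

Defining the cut points on controls alone, as the text describes, keeps the strata from being shaped by any BMI difference between cases and controls; the association test would then be run separately within each stratum.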
Acknowledgments
We thank M.I. McCarthy and C.M. Lindgren for assistance with the results of the GIANT study. The GenoMEL study was funded by the European Commission under the 6th Framework Programme (contract number LSHC-CT-2006-018702), by Cancer Research UK Programme Awards (C588/A4994 and C588/A10589), by a Cancer Research UK Project Grant (C8216/A6129), by the Leeds Cancer Research UK Centre (C37059/A11941) and by a grant from the US National Institutes of Health (NIH; CA83115). This research was also supported by the Intramural Research Program of the NIH, US National Cancer Institute (NCI), Division of Cancer Epidemiology and Genetics. Genotyping of most of the samples collected in France that were included in GenoMEL was done at Centre National de Génotypage, Institut de Génomique–Commissariat à l'Energie Atomique and was supported by the Ministère de l'Enseignement Supérieur et de la Recherche and Institut National du Cancer (INCa). This study used data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from their website (see URLs); funding for the project was provided by the Wellcome Trust under award 076113. We thank the EGEA cooperative group for giving access to data of the EGEA study (see URLs). We acknowledge that the biological specimens of the French Familial Melanoma Study Group were obtained from the Institut Gustave Roussy and Fondation Jean Dausset–CEPH Biobanks. Work in Stockholm was funded by the Swedish Cancer Society and Karolinska Institutet research funds. Work in Lund was funded by the Swedish Cancer Society, the Gunnar Nilsson Foundation and the European Research Council (ERC-2011-AdG 294576-risk factors cancer). Work in Genoa was funded by the Italian Ministry of Education, University and Research Progetti di Ricerca di Interesse Nazionale (2008W8JTPA_002), Intergruppo Melanoma Italiano and Mara Naum foundation.
Work in Paris was funded by grants from INCa (INCa-PL016) and Ligue Nationale Contre Le Cancer (PRE05/FD and PRE 09/FD) to F. Demenais, and Programme Hospitalier de Recherche Clinique (AOM-07-195) to M.-F. Avril and F. Demenais. Work in Leiden was funded by a grant provided by European Biobanking and Biomolecular Resources Research Infrastructure Netherlands hub (CO18). Research at the Melanoma Unit in Barcelona is partially funded by grants from Fondo de Investigaciones Sanitarias P.I. 09/01393, Spain; by the Centro de Investigaciones Biomédicas en Red (CIBER) de Enfermedades Raras of the Instituto de Salud Carlos III, Spain; and by the Agència de Gestió d'Ajuts Universitaris i de Recerca 2009 SGR-1337 of the Catalan Government, Spain. Work in Norway was funded by grants from

the Comprehensive Cancer Center, Oslo University Hospital (SE0728) and the Norwegian Cancer Society (71512-PR-2006-0356). Work in Vienna was supported by the Jubiläumsfonds of the Österreichische Nationalbank (project numbers 12161 and 13036) and the Hans und Blanca Moser Stiftung. The Italian study was partially supported by a NIH RO1 grant CA65558-02 (to M.T. Landi), Department of Health and Human Services and by the Intramural Research Program of NIH, NCI Division of Cancer Epidemiology and Genetics. Work at the MD Anderson Cancer Center was supported by the NIH NCI (P30CA023108 and 2P50CA093459) and by the Marit Peterson Fund for Melanoma Research. A. Cust is supported by fellowships from the Cancer Institute New South Wales and the National Health and Medical Research Council. Work in Nijmegen, The Netherlands, was funded by the Dutch Cancer Society Koningin Wilhelmina Fonds (KWF) Kankerbestrijding and by the Radboud University Medical Centre. The Q-MEGA study was supported by the Melanoma Research Alliance, the NIH NCI (CA88363, CA83115, CA122838, CA87969, CA055075, CA100264, CA133996 and CA49449), the National Health and Medical Research Council of Australia (NHMRC) (200071, 241944, 339462, 380385, 389927, 389875, 389891, 389892, 389938, 443036, 442915, 442981, 496610, 496675, 496739, 552485, 552498), the Cancer Councils New South Wales, Victoria and Queensland, the Cancer Institute New South Wales, the Cooperative Research Centre for Discovery of Genes for Common Human Diseases (CRC), Cerylid Biosciences (Melbourne), the Australian Cancer Research Foundation, The Wellcome Trust (WT084766/Z/08/Z) and donations from Neville and Shirley Hawkins. N.K.H. was supported by the NHMRC Fellowships scheme. S. MacGregor was supported by a Career Development award from the NHMRC (496674, 613705). M.H.L. is supported by Cancer Australia grant 1011143.

AUTHOR CONTRIBUTIONS
M.M.I. led, designed and carried out the statistical analysis and wrote the manuscript. M. Harland was involved in the Leeds replication genotyping design. J.C.T. carried out statistical analysis. H.S., J.R.-M., M.J., S. Mulder and N.v.d.S. carried out genotyping and contributed to the interpretation of genotyping data. B.B. contributed to the design of the GWAS and supervised processing of GWAS samples. J.A.N.B. led the overall consortium and contributed to study design. N.A.G. was deputy lead of the consortium and contributed to study design. D.T.B. and J.H.B. designed and led the overall study. N.K.H., S. MacGregor and M.H.L. led and carried out statistical analysis of the Australian replication data. K.S., S.N.S., P.S. and G.T. led and carried out statistical analysis of the Icelandic, Dutch, Viennese, Milanese, Valencian and Zaragozan replication data. J. Han carried out statistical analysis of the Harvard replication data. C.I.A. and S.F. led and carried out statistical analysis of the Houston replication data. M.T.L. and R.P. led and carried out statistical analysis of the Italian replication data. D.Z. and G.M.L. interpreted and contributed genotype data. A.M.G., P.A.K., E.M.G. and F.D. advised on statistical analysis. F.D. and G.M.L. contributed to the design of the study of the French component of GenoMEL. K.M.B. and D.E.E. contributed to the design of the GWAS. K.K.A., L.A.A., M.-F.A., E.A., K.R.B., W.B., G.B.S., D.C., V.C., M.C.F., A.E.C., A.C.d.W., T.D., E.F., P. Galan, P. Ghiorzo, J. Hansson, P.H., Marko Hočevar, V.H., J.L.H., C.I., M.A.J., L.A.K., J. Lang, S.L., J.E.L., J. Lubiński, R.M.M., G.J.M., N.G.M., J.I.M., A.M., E.N., S.N., I.O., J.H.O., H.O., H.P., K.P., M.P.G., D.P., S.P., J.A.P.-B., C.R., L.R., M.R., M.S., B.S., F.S., K.T., R.T., P.V.B., M.M.v.R., Q.W., J.W. and M.Z. contributed to the design and sample collection of either the initial GWAS or one of the replication studies.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Cannon-Albright, L.A., Bishop, D.T., Goldgar, C. & Skolnick, M.H. Genetic predisposition to cancer. Important Adv. Oncol. 1991, 39–55 (1991).
2. Naldi, L. et al. Cutaneous malignant melanoma in women. Phenotypic characteristics, sun exposure, and hormonal factors: a case-control study from Italy. Ann. Epidemiol. 15, 545–550 (2005).
3. Titus-Ernstoff, L. et al. Pigmentary characteristics and moles in relation to melanoma risk. Int. J. Cancer 116, 144–149 (2005).
4. Holly, E.A., Aston, D.A., Cress, R.D., Ahn, D.K. & Kristiansen, J.J. Cutaneous melanoma in women. I. Exposure to sunlight, ability to tan, and other risk factors related to ultraviolet light. Am. J. Epidemiol. 141, 923–933 (1995).
5. Holly, E.A., Aston, D.A., Cress, R.D., Ahn, D.K. & Kristiansen, J.J. Cutaneous melanoma in women. II. Phenotypic characteristics and other host-related factors. Am. J. Epidemiol. 141, 934–942 (1995).
6. Bataille, V. et al. Risk of cutaneous melanoma in relation to the numbers, types and sites of naevi: a case-control study. Br. J. Cancer 73, 1605–1611 (1996).
7. Chang, Y.M. et al. A pooled analysis of melanocytic nevus phenotype and the risk of cutaneous melanoma at different latitudes. Int. J. Cancer 124, 420–428 (2009).
8. Brown, K.M. et al. Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat. Genet. 40, 838–840 (2008).
9. Bishop, D.T. et al. Genome-wide association study identifies three loci associated with melanoma risk. Nat. Genet. 41, 920–925 (2009).
10. Gudbjartsson, D.F. et al. ASIP and TYR pigmentation variants associate with cutaneous melanoma and basal cell carcinoma. Nat. Genet. 40, 886–891 (2008).
11. Barrett, J.H. et al. Genome-wide association study identifies three new melanoma susceptibility loci. Nat. Genet. 43, 1108–1113 (2011).
12. Bertolotto, C. et al. A SUMOylation-defective MITF germline mutation predisposes to melanoma and renal carcinoma. Nature 480, 94–98 (2011).
13. Yokoyama, S. et al. A novel recurrent mutation in MITF predisposes to familial and sporadic melanoma. Nature 480, 99–103 (2011).
14. Frayling, T.M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889–894 (2007).
15. Dina, C. et al. Variation in FTO contributes to childhood obesity and severe adult obesity. Nat. Genet. 39, 724–726 (2007).
16. Scuteri, A. et al. Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genet. 3, e115 (2007).
17. Zeggini, E. et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 316, 1336–1341 (2007).
18. Scott, L.J. et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316, 1341–1345 (2007).
19. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
20. Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 40, 638–645 (2008).
21. Voight, B.F. et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat. Genet. 42, 579–589 (2010).
22. Hertel, J.K. et al. FTO, type 2 diabetes, and weight gain throughout adult life: a meta-analysis of 41,504 subjects from the Scandinavian HUNT, MDC, and MPP studies. Diabetes 60, 1637–1644 (2011).
23. Li, H. et al. Association of genetic variation in FTO with risk of obesity and type 2 diabetes with data from 96,551 East and South Asians. Diabetologia 55, 981–995 (2012).
24. Renehan, A.G., Tyson, M., Egger, M., Heller, R.F. & Zwahlen, M. Body-mass index and incidence of cancer: a systematic review and meta-analysis of prospective observational studies. Lancet 371, 569–578 (2008).
25. Pothiawala, S., Qureshi, A.A., Li, Y. & Han, J. Obesity and the incidence of skin cancer in US Caucasians. Cancer Causes Control 23, 717–726 (2012).
26. Speliotes, E.K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948 (2010).
27. Sällman Almén, M. et al. Determination of the obesity-associated gene variants within the entire FTO gene by ultra-deep targeted sequencing in obese and lean children. Int. J. Obes. (Lond.) advance online publication, doi:10.1038/ijo.2012.57 (24 April 2012).
28. Boissel, S. et al. Loss-of-function mutation in the dioxygenase-encoding FTO gene causes severe growth retardation and multiple malformations. Am. J. Hum. Genet. 85, 106–111 (2009).
29. Hubacek, J.A. et al. The FTO gene polymorphism is associated with end-stage renal disease: two large independent case-control studies in a general population. Nephrol. Dial. Transplant. 27, 1030–1035 (2012).
30. Hubacek, J.A. et al. A FTO variant and risk of acute coronary syndrome. Clin. Chim. Acta 411, 1069–1072 (2010).
31. Doney, A.S. et al. The FTO gene is associated with an atherogenic lipid profile and myocardial infarction in patients with type 2 diabetes: a genetics of diabetes audit and research study in Tayside Scotland (Go-DARTS) study. Circ. Cardiovasc. Genet. 2, 255–259 (2009).
32. Zimmermann, E. et al. Fatness-associated FTO gene variant increases mortality independent of fatness in cohorts of Danish men. PLoS ONE 4, e4428 (2009).
33. Keller, L. et al. The obesity related gene, FTO, interacts with APOE, and is associated with Alzheimer's disease risk: a prospective cohort study. J. Alzheimers Dis. 23, 461–469 (2011).
34. arcOGEN Consortium and arcOGEN Collaborators. Identification of new susceptibility loci for osteoarthritis (arcOGEN): a genome-wide association study. Lancet 380, 815–823 (2012).
35. Brennan, P. et al. Obesity and cancer: Mendelian randomization approach utilizing the FTO genotype. Int. J. Epidemiol. 38, 971–975 (2009).
36. Lurie, G. et al. The obesity-associated polymorphisms FTO rs9939609 and MC4R rs17782313 and endometrial cancer risk in non-Hispanic white women. PLoS ONE 6, e16756 (2011).
37. Boyle, A.P. et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797 (2012).
38. Schaub, M.A. et al. Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).
39. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).

1Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, Leeds Cancer Research UK Centre, St. James's University Hospital, Leeds, UK. 2Queensland Institute of Medical Research, Brisbane, Australia. 3deCODE Genetics, Reykjavik, Iceland. 4Department of Dermatology, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA. 5Channing Laboratory, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, USA. 6Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA. 7Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA. 8Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, USA. 9Comprehensive Cancer Center, The Netherlands, Nijmegen, The Netherlands. 10Department for Health Evidence, Radboud University Medical Centre, Nijmegen, The Netherlands. 11Centre for Cancer Biomarkers, The Gade Institute, University of Bergen, Bergen, Norway. 12Department of Pathology, Haukeland University Hospital, Bergen, Norway. 13Assistance Publique–Hôpitaux de Paris, Hôpital Cochin, Service de Dermatologie, Université Paris Descartes, Paris, France. 14Department of Dermatology and the Oncogenetics Unit, Sheba Medical Center, Tel Hashomer, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel. 15Oncogenetics Unit, Sheba Medical Center, Tel Hashomer, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel. 16Department of Clinical Genetics, Center of Human and Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands. 17Faculty of Medicine, University of Iceland, Reykjavik, Iceland. 18Department of Pathology, Landspitali University Hospital, Reykjavik, Iceland. 19Department of Dermatology, Leiden University Medical Centre, Leiden, The Netherlands. 20Department of Internal Medicine and Medical Specialties, University of Genoa, Genoa, Italy. 21Laboratory of Genetics of Rare Hereditary Cancers, Istituto di Ricovero e Cura a Carattere Scientifico, San Martino-Istituto Scientifico Tumori, Genoa, Italy. 22Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Gaithersburg, Maryland, USA. 23Dermatology Unit, Maurizio Bufalini Hospital, Cesena, Italy. 24Institut National de la Santé et de la Recherche Médicale (INSERM), Unité Mixte de Recherche (UMR) 946, Genetic Variation and Human Diseases Unit, Paris, France. 25Université d'Evry Val d'Essonne, Evry, France. 26Fondation Jean Dausset, Centre d'Etude du Polymorphisme Humain (CEPH), Paris, France. 27Department of Dermatology, University of L'Aquila, L'Aquila, Italy. 28Cancer Epidemiology and Services Research, Sydney School of Public Health, The University of Sydney, Australia. 29Université Paris Diderot, Sorbonne Paris Cité, Institut Universitaire d'Hématologie, Paris, France. 30Department of Dermatology, Radboud University Medical Centre, Nijmegen, The Netherlands. 31International Hereditary Cancer Center, Pomeranian Medical University, Szczecin, Poland. 32Department of Pathology and Laboratory Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania, USA. 33Research Unit on Nutritional Epidemiology, Université Paris 13, Sorbonne, Paris Cité, INSERM (U557), Institut Scientifique de Recherche Agronomique (INRA U1125), Conservatoire National des Arts et Métiers (CNAM), Bobigny, France. 34Epidemiology and Genetics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, USA. 35Department of Oncology-Pathology, Karolinska Institutet, Karolinska University Hospital, Solna, Stockholm, Sweden. 36Department of Dermatology, Oslo University Hospital, Rikshospitalet, Oslo, Norway. 37Department of Surgical Oncology, Institute of Oncology Ljubljana, Ljubljana, Slovenia. 38Centre for Molecular, Environmental, Genetic and Analytic (MEGA) Epidemiology, Melbourne School of Population Health, University of Melbourne, Melbourne, Australia. 39Department of Surgery, Clinical Sciences, Lund University, Lund, Sweden. 40ServiceXS B.V., Leiden, The Netherlands. 41Centre for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 42Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 43Department of Urology, Radboud University Medical Centre, Nijmegen, The Netherlands. 44Department of Medical Genetics, University of Glasgow, Glasgow, UK. 45Commissariat à l'Energie Atomique, Institut de Génomique, Centre National de Génotypage, Evry, France. 46Department of Dermatology, University of Utah School of Medicine, Huntsman Cancer Institute, Salt Lake City, Utah, USA. 47Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA. 48Department of Public Health, University of Glasgow, UK. 49Westmead Institute of Cancer Research, University of Sydney at Westmead Millennium Institute and Melanoma Institute Australia, Sydney, Australia. 50University of Zaragoza, Zaragoza, Spain. 51Department of Dermatology, Instituto Valenciano de Oncología, Valencia, Spain. 52Department of Molecular Diagnostics, Institute of Oncology Ljubljana, Ljubljana, Slovenia. 53Department of Dermatology, Medical University of Vienna, Vienna, Austria. 54Department of Dermatology, Faculty of Medicine, University of Iceland, Reykjavik, Iceland. 55Departments of Oncology and Cancer Epidemiology, Clinical Sciences, Lund University, Lund, Sweden. 56Laboratory of Histocompatibility–Molecular Biology, Center for Blood Transfusion, Valencia, Spain. 57Melanoma Unit, Dermatology Department, Hospital Clinic, Institut de Investigació Biomèdica August Pi Suñe, Universitat de Barcelona, Barcelona, Spain. 58Centro de Investigaciones Biomédicas en Red (CIBER) de Enfermedades Raras, Instituto de Salud Carlos III, Barcelona, Spain. 59A full list of members is provided in the Supplementary Note. 60Immunotherapy of Human Tumors Unit, Fondazione Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Istituto Nazionale dei Tumori, Milan, Italy. 61Melanoma and Sarcoma Surgery Unit, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy. 62Department of Epidemiology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China. 63Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 64Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA. 65Department of Community and Family Medicine, Geisel College of Medicine, Dartmouth College, Hanover, New Hampshire, USA. Correspondence should be addressed to M.M.I. (m.m.iles@leeds.ac.uk).
2Queensland

ONLINE METHODS

Samples. Approval for these studies was obtained at each recruiting center. Informed consent was obtained from all participants. Phase 1 of the original GenoMEL GWAS consisted of cases and controls collected from eight centers in six European countries. These were supplemented with controls from the Wellcome Trust Case Control Consortium (WTCCC)19. Standard quality-control measures were applied to both samples and SNPs, giving a total of 1,353 cases and 3,571 controls. Phase 2 of the GenoMEL GWAS consisted of cases and controls from ten centers (four not in Phase 1) in eight European countries and in Israel, supplemented again by samples from the WTCCC. In both phases, cases were preferentially selected to have a family history of melanoma, multiple primary tumors or an early age of onset. After quality control, 1,450 cases and 4,047 controls remained (quality control and samples are described in ref. 11). We obtained 680 supplementary UK cases and 1,785 controls from a population-based study of incident melanoma cases diagnosed between September 2000 and December 2006 in a geographically defined area of Yorkshire and the northern region of the UK9,40,41. Controls were ascertained by contacting general practitioners to identify eligible individuals. These controls were frequency-matched with cases for age and sex and were drawn from the registers of general practitioners who also had cases among their patients. An additional 220 controls were sex- and age-matched to, and from the same primary care practice as, incident cases of colorectal cancer recruited from hospitals in Leeds42. The only GenoMEL center that collected BMI data was Leeds. In Leeds, two studies were used: a family-based study that did not collect BMI and a case-control study that did collect BMI (see Supplementary Table 1). For details of replication samples, see Supplementary Note.

Genotyping. Most GenoMEL Phase 1 samples were genotyped on the Illumina HumanHap300 BeadChip version 2 duo array (with 317,000 tagging SNPs), with the exception of the French cases, which were genotyped on the Illumina HumanCNV370k array. The GenoMEL Phase 2 samples were genotyped on the Illumina 610k array. In the UK case-control samples, rs16953002 and rs12596638 were genotyped using the TaqMan assays C__34511379_10 and C__11776446_10, respectively (Applied Biosystems). We performed 2-µl PCR in 384-well plates using 10 ng of dried DNA, 0.05 µl assay mix and 1 µl Universal Master Mix (Applied Biosystems) according to the manufacturer's instructions. End-point reading of the genotypes was performed using an ABI 7900HT Real-time PCR system (Applied Biosystems).

Imputation. Imputation was conducted genome-wide on the GenoMEL Phase 1 samples, excluding SNPs with minor allele frequency (MAF) < 0.03, Hardy-Weinberg equilibrium (HWE) P < 10^-4 (in controls) and missingness > 0.03. IMPUTEv2 (refs. 43,44) was used, and the reference panel consisted of 120 European samples from HapMap release #24 (NCBI build 36, November 2008). After the initial genome-wide imputation had identified the FTO region as a candidate region, additional imputation of this region (1 Mb either side of rs16953002, chromosome 16: 53114824–55114824) was conducted based on the 1000 Genomes Phase 1 integrated variant set (March 2012 release, excluding SNPs with MAF < 0.001 in the CEU European samples). The number of well-imputed SNPs (IMPUTE quality metric (info) score > 0.8) in the region increased from 1,245 to 4,874, although the most significant three SNPs remained the same. The first P values quoted for rs16953002 and rs12596638 (P = 5.59 × 10^-6 and P = 4.43 × 10^-6, respectively) were from the genome-wide imputation, but all subsequent analyses are based on the FTO regional imputation. Imputed genotypes were analyzed as expected genotype counts based on the posterior probabilities (gene dosage) using SNPTEST2 (ref. 45), assuming an additive model with geographic region as a covariate. Only SNPs with an info score > 0.8 are considered to be of sufficient quality. The FTO region was imputed and analyzed in the GenoMEL Phase 2 data in the same way. Imputation quality was confirmed by genotyping 3,694 of the previously imputed samples from GenoMEL Phase 1 at rs16953002. The imputed genotype with the highest posterior probability was correct in 97% of cases (increasing to 98% if we only consider those genotypes where the maximum posterior probability is > 0.8). Given this strong confirmation of the quality of the imputation, unless otherwise stated we present the result using the imputed Phase 1 results, rather than interleaving imputed and genotyped data indiscriminately. In the Supplementary Note and Supplementary Table 2, results are presented using only genotyped data for comparison with the imputed results. In the replication samples, rs16953002 and rs8050136 were genotyped, with the exception of rs8050136 being imputed in the samples collected at Harvard.

Meta-analysis. Meta-analyses assume fixed effects unless otherwise stated. In all cases, heterogeneity between studies is measured with the I^2 metric; it has been suggested that values below 31% are of little concern and those above 56% should induce considerable caution46. Where I^2 is > 31%, a random-effects meta-analysis is applied. Here the method published in ref. 47 was used to estimate the between-studies variance, $\tau^2$. An overall random-effects estimate was then calculated using the weights $1/(\nu_i + \tau^2)$, where $\nu_i$ is the variance of the estimated effect in study $i$. $\tau^2 = 0$ for the fixed-effects analyses.
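The meta-analysis procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes per-study log odds ratios and their variances as inputs, uses the standard moment estimator of DerSimonian and Laird (ref. 47) for the between-studies variance, and the function name and example numbers are hypothetical.

```python
import math

def random_effects_meta(betas, variances):
    """Inverse-variance meta-analysis with I^2 and a DerSimonian-Laird tau^2.

    betas: per-study effect estimates (for example log odds ratios)
    variances: per-study variances of those estimates
    """
    w = [1.0 / v for v in variances]                       # fixed-effects weights
    fixed = sum(wi * b for wi, b in zip(w, betas)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effects estimate
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # heterogeneity in percent
    # Moment estimator of the between-studies variance, truncated at zero
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights 1 / (variance_i + tau^2)
    w_re = [1.0 / (v + tau2) for v in variances]
    random = sum(wi * b for wi, b in zip(w_re, betas)) / sum(w_re)
    se_random = math.sqrt(1.0 / sum(w_re))
    return {"fixed": fixed, "Q": q, "I2": i2, "tau2": tau2,
            "random": random, "se_random": se_random}

# Hypothetical two-study example (values chosen only for illustration)
result = random_effects_meta([0.0, 0.4], [0.01, 0.01])
# Rule stated in the text: switch to random effects when I^2 exceeds 31%
estimate = result["random"] if result["I2"] > 31 else result["fixed"]
print(result["I2"], result["tau2"], estimate)
```

Setting the between-studies variance to zero makes the random-effects weights collapse to the fixed-effects weights, which matches the closing remark of the paragraph above.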
40. Newton-Bishop, J.A. et al. Melanocytic nevi, nevus genes, and melanoma risk in a large case-control study in the United Kingdom. Cancer Epidemiol. Biomarkers Prev. 19, 2043–2054 (2010).
41. Newton-Bishop, J.A. et al. Relationship between sun exposure and melanoma risk for tumours in different body sites in a large case-control study in a temperate climate. Eur. J. Cancer 47, 732–741 (2011).
42. Barrett, J.H. et al. Investigation of interaction between N-acetyltransferase 2 and heterocyclic amines as potential risk factors for colorectal cancer. Carcinogenesis 24, 275–282 (2003).
43. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
44. Howie, B.N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
45. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies via imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
46. Higgins, J.P. & Thompson, S.G. Quantifying heterogeneity in a meta-analysis. Stat. Med. 21, 1539–1558 (2002).
47. DerSimonian, R. & Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials 7, 177–188 (1986).

doi:10.1038/ng.2571

letters

Seven new loci associated with age-related macular degeneration


The AMD Gene Consortium*
Age-related macular degeneration (AMD) is a common cause of blindness in older individuals. To accelerate the understanding of AMD biology and help design new therapies, we executed a collaborative genome-wide association study, including >17,100 advanced AMD cases and >60,000 controls of European and Asian ancestry. We identified 19 loci associated at P < 5 × 10^-8. These loci show enrichment for genes involved in the regulation of complement activity, lipid metabolism, extracellular matrix remodeling and angiogenesis. Our results include seven loci with associations reaching P < 5 × 10^-8 for the first time, near the genes COL8A1-FILIP1L, IER3-DDR1, SLC16A8, TGFBR1, RAD51B, ADAMTS9 and B3GALTL. A genetic risk score combining SNP genotypes from all loci showed similar ability to distinguish cases and controls in all samples examined. Our findings provide new directions for biological, genetic and therapeutic studies of AMD.

AMD is a highly heritable, progressive neurodegenerative disease that leads to loss of central vision through death of photoreceptors1,2. In developed countries, AMD is the leading cause of blindness in those >65 years of age3. Genes in the complement pathway4–11 and a region of chromosome 10 (refs. 12,13) have now been implicated as the major genetic contributors to disease. Association has also been shown with several additional loci14–20, each providing an entry point into AMD biology and potential therapeutic targets.

To accelerate the pace of discovery in macular degeneration genetics, 18 research groups from across the world formed the AMD Gene Consortium in early 2010, with support from the National Eye Institute of the US National Institutes of Health (Table 1, Supplementary Table 1 and Supplementary Note). To extend the catalog of disease-associated common variants, we first organized a meta-analysis of genome-wide association studies (GWAS), combining data for >7,600 cases with advanced disease (with geographic atrophy, neovascularization or both) and >50,000 controls. Each study was first subjected to GWAS quality control filters (customized to take into account study-specific features21 as detailed in Supplementary Table 2) and analyzed at sites in the HapMap reference panel using statistical genotype imputation22–25. Results were combined through meta-analysis26, and 32 variants representing loci with promising evidence of association were genotyped in an additional >9,500 cases and >8,200 controls (Supplementary Tables 1–3; summary meta-analysis results are available online, see URLs). Our overall analysis of the most promising variants thus included >17,100 cases and >60,000 controls.

Our meta-analysis evaluated evidence for association at 2,442,884 SNPs (Fig. 1). Inspection of quantile-quantile plots (Supplementary Fig. 1) and the genomic control value (λ_GC = 1.06) suggested that unmodeled population stratification did not significantly affect our findings (Supplementary Table 4). Joint analysis of discovery and follow-up studies27 resulted in the identification of 19 loci with associations reaching P < 5 × 10^-8 (Fig. 1, Table 2 and Supplementary Table 5). These 19 loci include all susceptibility loci previously reaching P < 5 × 10^-8, except the 4q12 gene cluster, for which association was reported in a Japanese population. In addition, the set included seven loci reaching P < 5 × 10^-8 for the first time.

We evaluated heterogeneity between studies using the I^2 statistic, which compares the variability in effect size estimates between studies to that expected by chance28. We observed significant (P < 0.05/19) heterogeneity only for loci near ARMS2 (I^2 = 75.7%, P < 1 × 10^-6) and near CFH (I^2 = 85.4%, P < 1 × 10^-6). Although these two loci were significantly associated in every sample examined, the magnitude of association varied more than expected.

To explore sources of heterogeneity, we carried out a series of subanalyses: we repeated the genome-wide meta-analysis, (i) adding an age adjustment, (ii) in neovascularization and geographic atrophy cases separately, (iii) in men and women separately, and (iv) in samples of European and Asian ancestry separately (Fig. 2 and Supplementary Fig. 2). These subanalyses of the full GWAS data set did not identify additional loci with associations reaching P < 5 × 10^-8; furthermore, heterogeneity near CFH and ARMS2 remained significant in all subanalyses (I^2 > 65%, P < 0.001). Consistent with previous reports17,29,30, separate analysis of neovascularization and geographic atrophy cases showed ARMS2 risk alleles preferentially associated with risk of neovascularization disease (OR_NV = 2.97, OR_GA = 2.50, P_difference = 0.0008), whereas CFH risk alleles preferentially associated with risk of geographic atrophy (OR_NV = 2.34, OR_GA = 2.80, P_difference = 0.0006). We also observed large differences in effect sizes when stratifying by ancestry, with variants near CFH showing stronger evidence for association in Europeans (P = 1 × 10^-7) and those near TNFRSF10A showing stronger association among east Asians (P = 0.002). Potential explanations for these observations include differences in linkage disequilibrium (LD) between populations or differences in environmental or diagnostic factors that modify genetic effects.

*A full list of authors and affiliations appears at the end of the paper.

Received 29 May 2012; accepted 7 February 2013; published online 3 March 2013; doi:10.1038/ng.2578

Table 1 Summary of samples used in genome-wide discovery and targeted follow-up analyses
Analysis | Contributing study groups | N cases | Female (%) | Neovascular disease (%) | N controls | Female (%)
Genome-wide discovery | 15 | 7,650 | 53.9 | 59.2 | 51,844 | 45.2
Targeted follow-up | 18 | 9,531 | 56.3 | 57.8 | 8,230 | 53.8
Overall | 33 | 17,181 | 55.2 | 58.4 | 60,074 | 46.3
Additional details, including a breakdown of the numbers of cases and controls in individual samples, are provided in Supplementary Table 1. N cases includes only cases with geographic atrophy, choroidal neovascularization or both.

Figure 1 Summary of GWAS results. Summary of genome-wide association results in the discovery GWAS sample (-log10 P plotted by chromosome, 1 to 22). Previously described loci with associations reaching P < 5 × 10^-8 are labeled in blue; new loci with associations reaching P < 5 × 10^-8 for the first time after follow-up analysis are labeled in green. The dashed horizontal lines represent thresholds for follow-up (P < 1 × 10^-5, orange) and genome-wide significance (P < 5 × 10^-6, red) as well as a discontinuity in the y axis (at P < 1 × 10^-16, gray). Loci labeled in the plot: ARMS2-HTRA1, C2-CFB, CFH, C3, APOE, COL15A1-TGFBR1, IER3-DDR1, VEGFA, FRK-COL10A1, TNFRSF10A, ADAMTS9, COL8A1, CETP, B3GALTL, RAD51B, LIPC, SLC16A8, CFI and TIMP3.

Identifying the full spectrum of allelic variation that contributes to disease in each locus will require sequencing of AMD cases and controls. To conduct an initial evaluation of the evidence for multiple AMD risk alleles in the 19 loci described here, we repeated genome-wide association analyses, conditioning on the risk alleles listed in Table 2. We then examined each of the 19 implicated loci for variants with independent association (at P < 0.0002, corresponding to an estimate of ~250 independent variants per locus). This analysis resulted in the identification of the previously well-documented independently associated variants near CFH and C2-CFB8,10,31,32 and of additional independent signals near C3, CETP, LIPC, FRK-COL10A1, IER3-DDR1 and RAD51B (Supplementary Table 6). In four of these loci, the independently associated variants mapped very close to (within 60 kb of) the original signal. This shows that each locus can harbor multiple susceptibility alleles, encouraging searches for rare variants that elucidate disease-related gene function in these regions33,34.

To prioritize our search for likely causal variants, we examined each locus in detail (see LocusZoom35 plots in Supplementary Fig. 3) and investigated whether risk alleles for AMD were associated with changes in protein sequence, copy number variation or insertion-deletion (indel) polymorphisms. One quarter of associated variants altered protein sequence, either directly (N = 2) or through LD (r^2 > 0.6; N = 3) with a nearby nonsynonymous variant (Supplementary Table 7). Some coding variants implicate well-studied genes (ARMS2, C3 and APOE), whereas others helped prioritize nearby genes for further study. On chromosome 4q25, index SNP rs4698775 is in strong LD (r^2 = 0.88) with a potentially protein-damaging variant in CCDC109B36, encoding a coiled-coil domain–containing protein that might be involved in the regulation of gene expression. On chromosome 6q22, index SNP rs3812111 is a perfect proxy for a coding variant in COL10A1, encoding a collagen protein that could be important in maintaining the structure and function of the extracellular matrix. Notably, rs1061170 (encoding a p.His402Tyr alteration in CFH; NP_000177.2) was not in disequilibrium with rs10737680, the most strongly associated SNP in the CFH region but, instead, was tagged by a secondary, weaker association signal (Supplementary Tables 6 and 7). This is consistent with previous haplotype analyses of the locus10,31,32,34,37.

We used publicly available data38,39 to determine whether any of our index SNPs might be proxies for copy number variants or indels, which are hard to directly interrogate with genotyping arrays. This analysis identified a single strong association (r^2 = 0.99) between rs10490924, a coding variant in the ARMS2 gene that is the peak of association at 10q26, and a 3′ UTR indel polymorphism associated with ARMS2 mRNA instability40. Because index SNP rs10490924 is also in strong disequilibrium (r^2 = 0.90) with a nearby SNP, rs11200638, which regulates HTRA1 (ref. 41), our data do not directly answer whether HTRA1 or ARMS2 is the causal gene in this locus. Although a common deletion of the CFHR1 and CFHR3 genes has been associated with AMD42,43, there was only modest signal in this study, potentially due to LD with our most significantly associated variants in the locus (r^2 = 0.31 between rs10737680 and 1000 Genomes Project MERGED_DEL_2_6731)34.

Using RNA sequencing44, we examined the mRNA levels of 85 genes within 100 kb of our index SNPs in postmortem human retina (Supplementary Table 8). Of 19 independent risk-associated loci, 3 had no genes with expressed transcripts in retina tissue from either young or elderly individuals. Two genes showed differential expression in the postmortem retina of young (ages 17–35) and elderly (ages 75 and 77) individuals: CFH (P = 0.009) and VEGFA (P = 0.003), both with higher expression in the older individuals. Using previously published data45, we also examined the expression of associated genes in fetal and adult retinal pigment epithelium (RPE). This analysis showed higher C3 expression in adult RPE compared to fetal RPE (P = 0.0008). In addition to C3 and CFH, all the complement genes with detectable expression in the retina or RPE experiments showed higher expression levels in tissue from the older individuals.

To identify biological relationships among our genetic association signals, we catalogued the genes within 100 kb of the variants in each association peak (r^2 > 0.8 with the index SNP listed in Table 1). Ingenuity Pathway Analysis (IPA, Ingenuity Systems) highlighted several biological pathways, particularly the complement system and atherosclerotic signaling, that were enriched in the resulting set of 90 genes (Table 3 and Supplementary Table 9). To account for features of GWAS (such as the different number of SNPs representing each gene), we carried out two additional analyses. First, we repeated our analysis for 50 sets of 19 control loci drawn from the National Human Genome Research Institute (NHGRI) GWAS catalog46. In these 50 control sets, Ingenuity enrichment P values for the complement system and for atherosclerosis signaling genes were exceeded 16% and 32% of the time, respectively (although these 2 specific pathways were never implicated in a control data set). Second, we repeated our
Table 2 Summary of loci with associations reaching genome-wide significance
SNP | Risk allele | Chr. | Position | Nearby genes | EAF | Discovery P | Discovery OR | Follow-up P | Follow-up OR | Combined P | Combined OR (95% CI)
Loci previously reported at P < 5 × 10^-8
rs10490924 | T | 10 | 124.2 Mb | ARMS2-HTRA1 | 0.30 | 4 × 10^-353 | 2.71 | 2.8 × 10^-190 | 2.88 | 4 × 10^-540 | 2.76 (2.72–2.80)
rs10737680 | A | 1 | 196.7 Mb | CFH | 0.64 | 1 × 10^-283 | 2.40 | 2.7 × 10^-152 | 2.50 | 1 × 10^-434 | 2.43 (2.39–2.47)
rs429608 | G | 6 | 31.9 Mb | C2-CFB | 0.86 | 2 × 10^-54 | 1.67 | 2.4 × 10^-37 | 1.89 | 4 × 10^-89 | 1.74 (1.68–1.79)
rs2230199 | C | 19 | 6.7 Mb | C3 | 0.20 | 2 × 10^-26 | 1.46 | 3.4 × 10^-17 | 1.37 | 1 × 10^-41 | 1.42 (1.37–1.47)
rs5749482 | G | 22 | 33.1 Mb | TIMP3 | 0.74 | 6 × 10^-13 | 1.25 | 9.7 × 10^-17 | 1.45 | 2 × 10^-26 | 1.31 (1.26–1.36)
rs4420638 | A | 19 | 45.4 Mb | APOE | 0.83 | 3 × 10^-15 | 1.34 | 4.2 × 10^-7 | 1.25 | 2 × 10^-20 | 1.30 (1.24–1.36)
rs1864163 | G | 16 | 57.0 Mb | CETP | 0.76 | 8 × 10^-13 | 1.25 | 8.7 × 10^-5 | 1.17 | 7 × 10^-16 | 1.22 (1.17–1.27)
rs943080 | T | 6 | 43.8 Mb | VEGFA | 0.51 | 4 × 10^-12 | 1.18 | 1.6 × 10^-5 | 1.12 | 9 × 10^-16 | 1.15 (1.12–1.18)
rs13278062 | T | 8 | 23.1 Mb | TNFRSF10A | 0.48 | 7 × 10^-10 | 1.17 | 6.4 × 10^-7 | 1.14 | 3 × 10^-15 | 1.15 (1.12–1.19)
rs920915 | C | 15 | 58.7 Mb | LIPC | 0.48 | 2 × 10^-9 | 1.14 | 0.004 | 1.10 | 3 × 10^-11 | 1.13 (1.09–1.17)
rs4698775 | G | 4 | 110.6 Mb | CFI | 0.31 | 2 × 10^-10 | 1.16 | 0.025 | 1.08 | 7 × 10^-11 | 1.14 (1.10–1.17)
rs3812111 | T | 6 | 116.4 Mb | COL10A1 | 0.64 | 7 × 10^-8 | 1.13 | 0.022 | 1.06 | 2 × 10^-8 | 1.10 (1.07–1.14)
Loci reaching P < 5 × 10^-8 for the first time
rs13081855 | T | 3 | 99.5 Mb | COL8A1-FILIP1L | 0.10 | 4 × 10^-11 | 1.28 | 6.0 × 10^-4 | 1.17 | 4 × 10^-13 | 1.23 (1.17–1.29)
rs3130783 | A | 6 | 30.8 Mb | IER3-DDR1 | 0.79 | 1 × 10^-6 | 1.15 | 3.5 × 10^-6 | 1.16 | 2 × 10^-11 | 1.16 (1.11–1.20)
rs8135665 | T | 22 | 38.5 Mb | SLC16A8 | 0.21 | 8 × 10^-8 | 1.16 | 5.6 × 10^-5 | 1.13 | 2 × 10^-11 | 1.15 (1.11–1.19)
rs334353 | T | 9 | 101.9 Mb | TGFBR1 | 0.73 | 9 × 10^-7 | 1.13 | 6.7 × 10^-6 | 1.13 | 3 × 10^-11 | 1.13 (1.10–1.17)
rs8017304 | A | 14 | 68.8 Mb | RAD51B | 0.61 | 9 × 10^-7 | 1.11 | 2.1 × 10^-5 | 1.11 | 9 × 10^-11 | 1.11 (1.08–1.14)
rs6795735 | T | 3 | 64.7 Mb | ADAMTS9 | 0.46 | 9 × 10^-8 | 1.13 | 0.0066 | 1.07 | 5 × 10^-9 | 1.10 (1.07–1.14)
rs9542236 | C | 13 | 31.8 Mb | B3GALTL | 0.44 | 2 × 10^-6 | 1.12 | 0.0018 | 1.08 | 2 × 10^-8 | 1.10 (1.07–1.14)
All results reported here include a genomic control correction for individual studies and also for the final meta-analysis51. A summary of all gene name abbreviations used in this table and elsewhere in the manuscript is provided in Supplementary Table 5. EAF is the allele frequency of the risk-increasing allele. Chr., chromosome.

enrichment analyses using the Interval-based Enrichment Analysis Tool for Genome-Wide Association Studies (INRICH)47, which is specifically designed for the analysis of GWAS but accesses a more limited set of annotations. The INRICH analyses showed enrichment for genes encoding collagen and extracellular region proteins (both with P = 1 × 10^-5; P_adjusted = 0.0006 after adjustment for multiple testing), complement and coagulation cascades (P = 0.0002; P_adjusted = 0.03), lipoprotein metabolism (P = 0.0003; P_adjusted = 0.04) and regulation of apoptosis (P = 0.0009; P_adjusted = 0.09) (Supplementary Table 10).
To explore the connections between our genetic association signals, we tested for interaction between pairs of risk alleles, looking for situations where joint risk was different than the expectation based on marginal effects. This analysis comprised 171 pairwise tests of interaction, of which 9 were nominally significant (P < 0.05; Supplementary Table 11), consistent with expectations by chance. The strongest observed interaction involved risk alleles at rs10737680 (near CFH) and rs429608 (near C2-CFB), the only association that remained significant after adjusting for multiple testing (P = 0.000052 < 0.05/171 = 0.00029).
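The multiple-testing arithmetic in this paragraph can be checked directly. The short sketch below only restates numbers given in the text (19 loci, a nominal threshold of 0.05 and the reported interaction P value); the variable names are illustrative.

```python
from math import comb

n_loci = 19
alpha = 0.05

# Every unordered pair of the 19 risk loci is tested once
n_pairs = comb(n_loci, 2)

# Bonferroni-corrected per-test threshold
bonferroni_threshold = alpha / n_pairs

# Number of nominally significant pairs expected by chance if no interactions exist
expected_by_chance = alpha * n_pairs

# Interaction P value reported for rs10737680 (near CFH) and rs429608 (near C2-CFB)
p_cfh_c2cfb = 0.000052

print(n_pairs, round(bonferroni_threshold, 5), round(expected_by_chance, 2))
print(p_cfh_c2cfb < bonferroni_threshold)
```

With 171 tests, roughly eight or nine nominally significant results are expected under the null, which is why nine observed hits are described as consistent with chance.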
Figure 2 panel axes: OR (unadjusted) and OR (age adjusted); OR (male) and OR (female); OR (NV) and OR (GA); OR (European) and OR (Asian). Loci labeled in the panels: ARMS2-HTRA1, CFH, CETP, LIPC and TNFRSF10A.

Figure 3 plot axes: sensitivity and specificity. Curves: 19 SNPs (AUC = 0.734), 12 SNPs (AUC = 0.736) and 7 SNPs (AUC = 0.519).

Figure 2 Sensitivity analysis. Top left, estimated effect sizes for the original analysis are compared to those for an age-adjusted analysis (where age was included as a covariate and samples of unknown age were excluded). Top right, comparison of analyses stratified by sex. Bottom left, comparison of analyses stratified by disease subtype. GA, geographic atrophy; NV, neovascularization. Bottom right, comparison of disease stratified by ancestry. The size of each marker reflects confidence intervals (with height reflecting the confidence interval along the y axis and width reflecting the confidence interval along the x axis). Comparisons reaching P < 0.05 are labeled and colored in red.

Figure 3 Risk score analysis. We calculated a risk score for each individual, defined as the product of the number of risk alleles at each locus and the associated effect size for each allele (measured on the log-odds scale). The plot summarizes the ability of these overall genetic risk scores to distinguish cases and controls. Analyses were carried out using the 19 SNPs that reached P < 5 × 10^-8 here, the 12 SNPs previously reaching this threshold and the 7 new variants.
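A risk score of the kind described in this legend, and the AUC used to summarize it, can be sketched as follows. This is an illustration under stated assumptions rather than the consortium's pipeline: the score is read here as the sum over loci of risk-allele count times log odds ratio, the three odds ratios are the combined estimates from Table 2 for ARMS2-HTRA1, CFH and C2-CFB, and the genotypes are invented for the example.

```python
import math

def risk_score(allele_counts, odds_ratios):
    """Sum over loci of (number of risk alleles) x (log odds ratio)."""
    return sum(n * math.log(o) for n, o in zip(allele_counts, odds_ratios))

def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as one half; this equals the area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Combined odds ratios from Table 2 (ARMS2-HTRA1, CFH, C2-CFB)
odds_ratios = [2.76, 2.43, 1.74]

# Hypothetical genotypes: risk-allele counts (0, 1 or 2) per locus
cases = [[2, 1, 2], [1, 2, 2], [1, 1, 1]]
controls = [[0, 1, 2], [1, 0, 1], [1, 1, 1]]

case_scores = [risk_score(g, odds_ratios) for g in cases]
control_scores = [risk_score(g, odds_ratios) for g in controls]
print(auc(case_scores, control_scores))
```

The pairwise definition of AUC is quadratic in sample size; it is used here only because it makes the meaning of the statistic explicit.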

Table 3 Pathway analysis
Ingenuity canonical pathway | Enrichment nominal P value | Enrichment FDR q value | Molecules | Pathway size (N genes)
Complement system | 0.000012 | 0.0015 | CFI, CFH, C3, CFB^a, C2^a, C4A^a, C4B^a | 35
Atherosclerosis signaling | 0.00014 | 0.009 | PLA2G12A, APOC1^b, APOE^b, APOC2^b, APOC4^b, TNFSF14, COL10A1, PLA2G6 | 129
VEGF family ligand-receptor interactions | 0.0042 | 0.150 | VEGFA, PLA2G12A, PLA2G6 | 84
Dendritic cell maturation | 0.0046 | 0.150 | RELB, ZBTB12, DDR1, COL10A1 | 185
Phospholipid degradation | 0.0058 | 0.151 | PLA2G12A, LIPC, PLA2G6 | 102
MIF-mediated glucocorticoid regulation | 0.0088 | 0.153 | PLA2G12A, PLA2G6 | 42
Inhibition of angiogenesis by TSP1 | 0.0093 | 0.153 | VEGFA, TGFBR1 | 39
FcεRI signaling | 0.0098 | 0.153 | VAV1, PLA2G12A, PLA2G6 | 111
p38 MAPK signaling | 0.011 | 0.153 | PLA2G12A, TGFBR1, PLA2G6 | 106
FDR, false discovery rate.
^a All flank rs429608 and are thus counted as a single hit when determining the significance of enrichment. ^b All flank rs4420638 and are thus counted as a single hit when determining the significance of enrichment.

Individuals carrying risk alleles at both these loci were at slightly higher risk of disease than expected.

The proportion of variability in the risk of AMD that is due to genes, or heritability, has been estimated at 45–70% (ref. 2). Estimating the proportion of disease risk explained by the susceptibility loci identified48 depends greatly on the disease prevalence, which is difficult to estimate in our sample, as it includes cases and controls of different ages collected through a variety of ascertainment schemes. Using a model that assumes an underlying normally distributed but unobserved disease risk score or liability49, the 19 loci described here accounted for between 10% (if AMD prevalence was close to 1%) and 30% (if AMD prevalence was closer to 10%) of the variability in disease risk (corresponding to 15–65% of the total genetic contribution to AMD). The variants representing the association peaks at loci previously reaching genome-wide significance accounted for the bulk of this variability: the new loci identified here accounted for 0.5–1.0% of the total heritability of AMD, whereas secondary signals at new and known loci accounted for 1.5–3.0% of the total heritability.

We report here the most comprehensive genetic association study of macular degeneration yet conducted, involving 18 international research groups and a large set of cases and controls. Our data identify 19 susceptibility loci, including 7 loci with associations reaching P < 5 × 10^-8 for the first time, nearly doubling the number of known AMD-associated loci outside the complement pathway. Our results show that some susceptibility alleles show different associations across ancestry groups and might be preferentially associated with specific subtypes of disease. As with other GWAS meta-analyses, differences in genotyping methods, quality control steps and imputation strategies between samples might have a minor effect on our results; future studies with more uniform approaches across larger sample sizes might uncover more association signals.

A conundrum of macular degeneration genetics remains that the loci identified so far contribute to both geographic atrophy and neovascular disease, two different phenotypes of advanced disease. In our sample, subtype-specific GWAS analyses considering geographic atrophy or neovascular cases only did not identify additional risk loci. Consistent with observations for other complex diseases39, the majority of common disease susceptibility alleles do not alter protein sequences and are not associated with indels of coding sequences or with copy number variation. We expect that the loci identified here will provide an ideal starting point for studying the contribution of rare variation in AMD33,34.

In contrast to most other complex diseases, a risk score combining information across our 19 loci can distinguish cases and controls relatively well (area under the receiver operator curve (AUC) = 0.52, including only new loci, or AUC = 0.74, including new and previously reported loci; Fig. 3 and Supplementary Fig. 4). It might be possible to use similar scores to identify and prioritize at-risk individuals so they receive preventative treatment before the onset of disease50. Monotherapies are increasingly used to manage neovascularization disease but offer only a limited repertoire of treatment options to patients. The identification of novel genes and pathways involved in disease enables the pursuit of a larger range of disease-specific targets for the development of new therapeutic interventions. We expect that future therapies directed at earlier stages of the disease process will allow patients to retain visual function for longer periods, improving the quality of life for individuals with AMD.

URLs. METAL, http://www.sph.umich.edu/csg/abecasis/Metal/; R, http://www.r-project.org; gee (Generalized Estimation Equation solver), http://CRAN.R-project.org/package=gee; Single Nucleotide Polymorphism Spectral Decomposition Lite (SNPSpD), http://gump.qimr.edu.au/general/daleN/SNPSpDlite/; prespecified R scripts, http://www.epi-regensburg.de/wp/genepi-downloads; 1000 Genomes Project, http://www.1000genomes.org/; HapMap Project, http://www.hapmap.org/genotypes/; PolyPhen-2, http://genetics.bwh.harvard.edu/pph2/; Ingenuity Systems, http://www.ingenuity.com/; NHGRI GWAS catalog, http://www.genome.gov/gwastudies/; INRICH, http://atgu.mgh.harvard.edu/inrich; full result set, http://www.sph.umich.edu/csg/abecasis/public/amdgene2012/.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We are indebted to all the participants who volunteered their time, DNA and information to make this research study possible. We are also in great debt to the clinicians, nurses and research staff who participated in patient recruitment and phenotyping. We thank H. Chin for constant support and encouragement, which helped us bring this project to completion. We thank S. Miller and J. Barb for access to RPE expression data and the MIGEN study group for use of their genotype data. We thank C. Pappas, N. Miller, J. Hageman, W. Hubbard, L. Lucci, A. Vitale, P. Bernstein and N. Amin for technical and clinical assistance. We thank E. Rochtchina, A.C. Viswanathan, J. Xie, M. Inouye, E.G. Holliday, J. Attia and R.J. Scott for contributions to the Blue Mountains Eye Study GWAS. We thank members of the Genetic Factors in AMD Study Group, the Scottish Macula Society Study Group and the Wellcome Trust Clinical Research Facility at Southampton General Hospital. We thank T. Peto and colleagues at the Reading Centre, Moorfields Eye Hospital and C. Brussee and A. Hooghart for help in patient recruitment and phenotyping. Full details of funding sources can be found in the Supplementary Note.

AUTHOR CONTRIBUTIONS
AMD Gene Analysis Committee: L.G.F., W.C., M.S., B.L.Y., Y.Y., L.A.F., I.M.H. (co-lead) and G.R.A. (co-lead). AMD Gene Phenotype Committee: R.K., C.C.W.K., T.L., J.M.S. (lead) and J.J.W. (co-lead). AMD Gene Steering Committee: B.H.F.W. (chair, senior executive committee), G.R.A. (senior executive committee), M.M.D. (senior executive committee), J.L.H. (senior executive committee), S.K.I. (senior executive committee), M.A.P.-V. (senior executive committee), R.A., P.N. Baird, C.C.W.K., B.E.K.K., M.L.K., M.K., T.L., J.M.S., U.T., D.E.W., J.R.W.Y. and K.Z. AMD-EU-JHU: D.J.Z., I.A., M. Benchaboune, A.C.B., P.A.C., I.C., F.G.H., Y. Kamatani, N.K., A.J.L., S.M.-S., O.P., R. Ripp, J.-A.S., H.P.N.S., E.H.S., A.R.W., D.Z., G.M.L. and T.L. contributed phenotypes, genotypes and analyses for the AMD-EU-JHU study. BDES: R.P.I., B.E.K.K., R.K., K.E.L., C.E.M., T.A.S., B.J.T. and S.K.I. contributed phenotypes, genotypes and analyses for the BDES study. Blue Mountains Eye Study: X.S., P.M., T.Y.W. and J.J.W. contributed phenotypes, genotypes and analyses for BMES. BU/Utah: M.S., G.S.H., G.J., I.K.K., D.J.M., M.A.M., C.P., K.H.P., D.A.S., G.S., E.E.T., M.M.D. and L.A.F. contributed phenotypes, genotypes and analyses for the BU/Utah study. CCF/VAMC: S.A.H., P.J., G.J.T.P., N.S.P., G.M.S.-S., R.P.I. and S.K.I. contributed phenotypes, genotypes and analyses for the CCF/VAMC study. CEI: P.J.F. and M.L.K. contributed phenotypes, genotypes and analyses for the CEI study. Columbia: J.E.M., G.R.B., R.T.S. and R.A. contributed phenotypes, genotypes and analyses for the Columbia study. deCODE: A.G., G.T., H. Sigurdsson, H. Stefansson, K.S. and U.T. contributed phenotypes, genotypes and analyses for the deCODE study. Japan Age-Related Eye Diseases Study: S.A., T.I., Y. Kiyohara, Y.N., Y.O., A.T. and M.K. contributed phenotypes, genotypes and analyses for JAREDS. Melbourne: R.H.G., M.S.C., A.J.R. and P.N. Baird contributed phenotypes, genotypes and analyses for the Melbourne study. Miami/Vanderbilt: B.L.Y., A.A., W.H.C., J.L.K., A.C.N., S.G.S., W.K.S., M.A.P.-V. and J.L.H. contributed phenotypes, genotypes and analyses for the Miami/Vanderbilt study. MMAP/NEI: W.C., K.E.B., M. Brooks, A.J.B., C.-C.C., E.Y.C., R.C., A.O.E., J.S.F., N.G., J.R.H., A.O., M.I.O., R.R.P., E.R., D.E.S., N.T., A.S. and G.R.A. contributed phenotypes, genotypes and analyses for the MMAP/NEI study. Rotterdam: G.H.S.B., A.G.U., C.M.v.D., J.R.V. and C.C.W.K. contributed phenotypes, genotypes and analyses for the Rotterdam study. SAGE: T.A., C.-Y.C., B.K.C. and E.N.V. contributed phenotypes, genotypes and analyses for the SAGE study. Southern German AMD Study: L.G.F., C.G., C.H., C.N.K., P.L., T.M., G.R., H.-E.W., T.W.W., B.H.F.W. and I.M.H. contributed phenotypes, genotypes and analyses for the Southern German AMD Study. Tufts/Massachusetts General Hospital: Y.Y., S.R., K.A.C., M.J.D., E.E., J.F., J.P.A.I., R. Reynolds, L.S. and J.M.S. contributed phenotypes, genotypes and analyses for the Tufts/MGH study. UK Cambridge/Edinburgh: V.C., A.M.A., P.N. Bishop, D.G.C., B.D., S.P.H., J.C.K., A.T.M., H. Shahid, A.F.W. and J.R.W.Y. contributed phenotypes, genotypes and analyses for the UK Cambridge/Edinburgh study. University of Pittsburgh/UCLA: D.E.W., Y.P.C., M.C.O. and M.B.G. contributed phenotypes, genotypes and analyses for the University of Pittsburgh/UCLA study. UCSD: G. Hannum, H.A.F., G. Hughes, I.K., C.J.L., M.Z., L.Z. and K.Z. contributed phenotypes, genotypes and analyses for the UCSD study. VRF: R.J.G., L.V., R.P.I. and S.K.I. contributed phenotypes, genotypes and analyses for the VRF study. Gene expression and RNA sequencing data: data were contributed and analyzed by M. Brooks, J.S.F., N.G., R.R.P. and A.S.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Swaroop, A., Chew, E.Y., Rickman, C.B. & Abecasis, G.R. Unravelling a late-onset multifactorial disease: from genetic susceptibility to disease mechanisms for age-related macular degeneration. Annu. Rev. Genomics Hum. Genet. 10, 19–43 (2009).
2. Seddon, J.M., Cote, J., Page, W.F., Aggen, S.H. & Neale, M.C. The US twin study of age-related macular degeneration: relative roles of genetic and environmental influences. Arch. Ophthalmol. 123, 321–327 (2005).
3. Friedman, D.S. et al. Prevalence of age-related macular degeneration in the United States. Arch. Ophthalmol. 122, 564–572 (2004).
4. Edwards, A.O. et al. Complement factor H polymorphism and age-related macular degeneration. Science 308, 421–424 (2005).
5. Haines, J.L. et al. Complement factor H variant increases the risk of age-related macular degeneration. Science 308, 419–421 (2005).
6. Klein, R.J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389 (2005).
7. Yates, J.R. et al. Complement C3 variant and the risk of age-related macular degeneration. N. Engl. J. Med. 357, 553–561 (2007).
8. Gold, B. et al. Variation in factor B (BF) and complement component 2 (C2) genes is associated with age-related macular degeneration. Nat. Genet. 38, 458–462 (2006).
9. Fagerness, J.A. et al. Variation near complement factor I is associated with risk of advanced AMD. Eur. J. Hum. Genet. 17, 100–104 (2009).
10. Hageman, G.S. et al. A common haplotype in the complement regulatory gene factor H (HF1/CFH) predisposes individuals to age-related macular degeneration. Proc. Natl. Acad. Sci. USA 102, 7227–7232 (2005).
11. Maller, J.B. et al. Variation in complement factor 3 is associated with risk of age-related macular degeneration. Nat. Genet. 39, 1200–1201 (2007).
12. Rivera, A. et al. Hypothetical LOC387715 is a second major susceptibility gene for age-related macular degeneration, contributing independently of complement factor H to disease risk. Hum. Mol. Genet. 14, 3227–3236 (2005).
13. Jakobsdottir, J. et al. Susceptibility genes for age-related maculopathy on chromosome 10q26. Am. J. Hum. Genet. 77, 389–407 (2005).
14. Klaver, C.C. et al. Genetic association of apolipoprotein E with age-related macular degeneration. Am. J. Hum. Genet. 63, 200–206 (1998).
15. Souied, E.H. et al. The ε4 allele of the apolipoprotein E gene as a potential protective factor for exudative age-related macular degeneration. Am. J. Ophthalmol. 125, 353–359 (1998).
16. McKay, G.J. et al. Evidence of association of APOE with age-related macular degeneration: a pooled analysis of 15 studies. Hum. Mutat. 32, 1407–1416 (2011).
17. Chen, W. et al. Genetic variants near TIMP3 and high-density lipoprotein-associated loci influence susceptibility to age-related macular degeneration. Proc. Natl. Acad. Sci. USA 107, 7401–7406 (2010).
18. Neale, B.M. et al. Genome-wide association study of advanced age-related macular degeneration identifies a role of the hepatic lipase gene (LIPC). Proc. Natl. Acad. Sci. USA 107, 7395–7400 (2010).
19. Yu, Y. et al. Common variants near FRK/COL10A1 and VEGFA are associated with advanced age-related macular degeneration. Hum. Mol. Genet. 20, 3699–3709 (2011).
20. Arakawa, S. et al. Genome-wide association study identifies two susceptibility loci for exudative age-related macular degeneration in the Japanese population. Nat. Genet. 43, 1001–1004 (2011).
21. McCarthy, M.I. et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356–369 (2008).
22. Li, Y., Willer, C.J., Sanna, S. & Abecasis, G.R. Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406 (2009).
23. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906–913 (2007).
24. Browning, B.L. & Browning, S.R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
25. Li, Y., Willer, C.J., Ding, J., Scheet, P. & Abecasis, G.R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).
26. Willer, C.J., Li, Y. & Abecasis, G.R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
27. Skol, A.D., Scott, L.J., Abecasis, G.R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209–213 (2006).
28. Higgins, J.P., Thompson, S.G., Deeks, J.J. & Altman, D.G. Measuring inconsistency in meta-analyses. Br. Med. J. 327, 557–560 (2003).
29. Sobrin, L. et al. ARMS2/HTRA1 locus can confer differential susceptibility to the advanced subtypes of age-related macular degeneration. Am. J. Ophthalmol. 151, 345–352 (2011).
30. Seddon, J.M. et al. Association of CFH Y402H and LOC387715 A69S with progression of age-related macular degeneration. J. Am. Med. Assoc. 297, 1793–1800 (2007).
31. Li, M. et al. CFH haplotypes without the Y402H coding variant show strong association with susceptibility to age-related macular degeneration. Nat. Genet. 38, 1049–1054 (2006).
32. Maller, J. et al. Common variation in three genes, including a noncoding variant in CFH, strongly influences risk of age-related macular degeneration. Nat. Genet. 38, 1055–1059 (2006).
33. Nejentsev, S., Walker, N., Riches, D., Egholm, M. & Todd, J.A. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324, 387–389 (2009).
34. Raychaudhuri, S. et al. A rare penetrant mutation in CFH confers high risk of age-related macular degeneration. Nat. Genet. 43, 1232–1236 (2011).
35. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).
36. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
37. Sivakumaran, T.A. et al. A 32 kb critical region excluding Y402H in CFH mediates risk for age-related macular degeneration. PLoS ONE 6, e25598 (2011).
38. Wellcome Trust Case Control Consortium. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720 (2010).
39. 1000 Genomes Project Consortium. A map of human genome variation from population scale sequencing. Nature 467, 1061–1073 (2010).
40. Fritsche, L.G. et al. Age-related macular degeneration is associated with an unstable ARMS2 (LOC387715) mRNA. Nat. Genet. 40, 892–896 (2008).
41. Dewan, A. et al. HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 314, 989–992 (2006).

42. Hughes, A.E. et al. A common CFH haplotype, with deletion of CFHR1 and CFHR3, is associated with lower risk of age-related macular degeneration. Nat. Genet. 38, 1173–1177 (2006).
43. Fritsche, L.G. et al. An imbalance of human complement regulatory proteins CFHR1, CFHR3 and factor H influences risk for age-related macular degeneration (AMD). Hum. Mol. Genet. 19, 4694–4704 (2010).
44. Brooks, M.J., Rajasimha, H.K., Roger, J.E. & Swaroop, A. Next-generation sequencing facilitates quantitative analysis of wild-type and Nrl−/− retinal transcriptomes. Mol. Vis. 17, 3034–3054 (2011).
45. Strunnikova, N.V. et al. Transcriptome analysis and molecular signature of human retinal pigment epithelium. Hum. Mol. Genet. 19, 2468–2486 (2010).
46. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106, 9362–9367 (2009).
47. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
48. Manolio, T.A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
49. So, H.C., Gui, A.H., Cherny, S.S. & Sham, P.C. Evaluating the heritability explained by known susceptibility variants: a survey of ten complex diseases. Genet. Epidemiol. 35, 310–317 (2011).
50. Seddon, J.M., Reynolds, R., Yu, Y., Daly, M.J. & Rosner, B. Risk models for progression to advanced age-related macular degeneration using demographic, environmental, genetic, and ocular factors. Ophthalmology 118, 2203–2211 (2011).
51. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

Lars G Fritsche1,2,128, Wei Chen2,3,128, Matthew Schu4,128, Brian L Yaspan5,6,128, Yi Yu7,128, Gudmar Thorleifsson8, Donald J Zack9–12, Satoshi Arakawa13, Valentina Cipriani14,15, Stephan Ripke16,17, Robert P Igo Jr18, Gabriëlle H S Buitendijk19,20, Xueling Sim2,21, Daniel E Weeks22,23, Robyn H Guymer24, Joanna E Merriam25, Peter J Francis26, Gregory Hannum27, Anita Agarwal28,29, Ana Maria Armbrecht30, Isabelle Audo10,15,31,32, Tin Aung33,34, Gaetano R Barile25, Mustapha Benchaboune35, Alan C Bird14,15, Paul N Bishop36,37, Kari E Branham38, Matthew Brooks39, Alexander J Brucker40, William H Cade41,42, Melinda S Cain24, Peter A Campochiaro11,43, Chi-Chao Chan44, Ching-Yu Cheng33,34,45,46, Emily Y Chew47, Kimberly A Chin7, Itay Chowers48, David G Clayton49, Radu Cojocaru39, Yvette P Conley50, Belinda K Cornes33, Mark J Daly16, Baljean Dhillon30, Albert O Edwards51, Evangelos Evangelou52, Jesen Fagerness53,54, Henry A Ferreyra55,56, James S Friedman39, Asbjorg Geirsdottir57, Ronnie J George58, Christian Gieger59, Neel Gupta39, Stephanie A Hagstrom60, Simon P Harding61, Christos Haritoglou62, John R Heckenlively38, Frank G Holz63, Guy Hughes55,56,64, John P A Ioannidis65–67, Tatsuro Ishibashi68, Peronne Joseph18, Gyungah Jun4,69,70, Yoichiro Kamatani71, Nicholas Katsanis72–74, Claudia N Keilhauer75, Jane C Khan49,76,77, Ivana K Kim78,79, Yutaka Kiyohara80, Barbara E K Klein81, Ronald Klein81, Jaclyn L Kovach82, Igor Kozak55,56, Clara J Lee55,56,64, Kristine E Lee81, Peter Lichtner83, Andrew J Lotery84, Thomas Meitinger83,85, Paul Mitchell86, Saddek Mohand-Saïd30,32,35,87, Anthony T Moore14,15, Denise J Morgan88, Margaux A Morrison88, Chelsea E Myers81, Adam C Naj41,42, Yusuke Nakamura89, Yukinori Okada90, Anton Orlin91, M Carolina Ortube92,93, Mohammad I Othman38, Chris Pappas94, Kyu Hyung Park95, Gayle J T Pauer60, Neal S Peachey60,96, Olivier Poch97, Rinki Ratna Priya39, Robyn Reynolds7, Andrea J Richardson24, Raymond Ripp97, Guenther Rudolph62, Euijung Ryu98, José-Alain Sahel10,15,31,32,35,99,100, Debra A Schaumberg78,101, Hendrik P N Scholl43,63, Stephen G Schwartz82, William K Scott41,42, Humma Shahid49,102, Haraldur Sigurdsson57,103, Giuliana Silvestri104, Theru A Sivakumaran105, R Theodore Smith25,106, Lucia Sobrin78,79, Eric H Souied107, Dwight E Stambolian108, Hreinn Stefansson8, Gwen M Sturgill-Short96, Atsushi Takahashi90, Nirubol Tosakulwong98, Barbara J Truitt18, Evangelia E Tsironi109, André G Uitterlinden19,110, Cornelia M van Duijn19, Lingam Vijaya58, Johannes R Vingerling19,20, Eranga N Vithana33,34, Andrew R Webster14,15, H-Erich Wichmann111–114, Thomas W Winkler115, Tien Y Wong24,33,34, Alan F Wright116, Diana Zelenika117, Ming Zhang55,56,64,118,119, Ling Zhao55,56,64, Kang Zhang55,56,64,118,119, Michael L Klein26, Gregory S Hageman94, G Mark Lathrop71,117, Kari Stefansson8,103, Rando Allikmets25,120,129, Paul N Baird24,129, Michael B Gorin92,93,121,129, Jie Jin Wang24,86,129, Caroline C W Klaver19,20,129, Johanna M Seddon7,122,129, Margaret A Pericak-Vance41,42,129, Sudha K Iyengar18,123–125,129, John R W Yates14,15,49,129, Anand Swaroop38,39,129, Bernhard H F Weber1,129, Michiaki Kubo13,129, Margaret M DeAngelis88,129, Thierry Léveillard10,31,32,129, Unnur Thorsteinsdottir8,103,129, Jonathan L Haines5,6,129, Lindsay A Farrer4,69,70,126,127,129, Iris M Heid59,115,129 & Gonçalo R Abecasis2,129
1Institute of Human Genetics, University of Regensburg, Regensburg, Germany. 2Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA. 3Division of Pediatric Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children's Hospital of Pittsburgh of University of Pittsburgh Medical Center (UPMC), University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA. 4Section of Biomedical Genetics, Department of Medicine, Boston University Schools of Medicine and Public Health, Boston, Massachusetts, USA. 5Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA. 6Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 7Ophthalmic Epidemiology and Genetics Service, Tufts Medical Center, Boston, Massachusetts, USA. 8deCODE Genetics, Reykjavik, Iceland. 9Department of Molecular Biology and Genetics, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA. 10Department of Genetics, Institut de la Vision, Université Pierre et Marie Curie–Université Paris 6, Unité Mixte de Recherche Scientifique (UMRS) 968, Paris, France. 11Department of Neuroscience, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA. 12Institute of Genetic Medicine, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA. 13Laboratory for Genotyping Development, Research Group for Genotyping, Center for

Genomic Medicine (CGM), RIKEN, Yokohama, Japan. 14Moorfields Eye Hospital, London, UK. 15Institute of Ophthalmology, University College London, London, UK. 16Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts, USA. 17Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA. 18Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio, USA. 19Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands. 20Department of Ophthalmology, Erasmus Medical Center, Rotterdam, The Netherlands. 21Centre for Molecular Epidemiology, National University of Singapore, Singapore. 22Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 23Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 24Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital, East Melbourne, Victoria, Australia. 25Department of Ophthalmology, Columbia University, New York, New York, USA. 26Macular Degeneration Center, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon, USA. 27Department of Bioengineering, University of California, San Diego, La Jolla, California, USA. 28Vanderbilt Eye Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA. 29Department of Ophthalmology & Visual Sciences, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. 30Department of Ophthalmology, University of Edinburgh and Princess Alexandra Eye Pavilion, Edinburgh, UK. 31Institut National de la Santé et de la Recherche Médicale (INSERM) U968, Paris, France. 32Centre National de la Recherche Scientifique (CNRS), UMR 7210, Paris, France. 33Singapore Eye Research Institute, Singapore National Eye Centre, Singapore.
34Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. 35Centre Hospitalier National d'Ophtalmologie des Quinze-Vingts, INSERM–Direction de l'Hospitalisation et de l'Organisation des Soins, Centres d'Investigation Clinique 503, Paris, France. 36Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, Manchester, UK. 37Central Manchester University Hospitals National Health Service (NHS) Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK. 38Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, Michigan, USA. 39Neurobiology Neurodegeneration & Repair Laboratory (N-NRL), National Eye Institute, US National Institutes of Health, Bethesda, Maryland, USA. 40Scheie Eye Institute, Penn Presbyterian Medical Center, Philadelphia, Pennsylvania, USA. 41John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida, USA. 42Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, Florida, USA. 43Department of Ophthalmology, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA. 44Immunopathology Section, Laboratory of Immunology, National Eye Institute, US National Institutes of Health, Bethesda, Maryland, USA. 45Saw Swee Hock School of Public Health, National University of Singapore, Singapore. 46Centre for Quantitative Medicine, Office of Clinical Sciences, Duke–National University of Singapore Graduate Medical School, Singapore. 47Division of Epidemiology and Clinical Applications, Clinical Trials Branch, National Eye Institute, US National Institutes of Health, Bethesda, Maryland, USA. 48Department of Ophthalmology, Hadassah-Hebrew University Medical Center, Jerusalem, Israel. 49Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK.
50Department of Health Promotion and Development, School of Nursing, University of Pittsburgh, Pittsburgh, Pennsylvania, USA. 51Institute for Molecular Biology, University of Oregon, Eugene, Oregon, USA. 52Department of Hygiene and Epidemiology, University of Ioannina Medical School, Ioannina, Greece. 53Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, USA. 54Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA. 55Department of Ophthalmology, University of California, San Diego, La Jolla, California, USA. 56Shiley Eye Center, University of California, San Diego, La Jolla, California, USA. 57Department of Ophthalmology, National University Hospital, Reykjavik, Iceland. 58Glaucoma Project, Vision Research Foundation, Sankara Nethralaya, Chennai, India. 59Institute of Genetic Epidemiology, Helmholtz Zentrum München–Deutsches Forschungszentrum für Gesundheit und Umwelt, Neuherberg, Germany. 60Cole Eye Institute, Cleveland Clinic Foundation, Cleveland, Ohio, USA. 61Department of Eye and Vision Science, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK. 62Augenklinik, Ludwig-Maximilians-Universität München, Munich, Germany. 63Department of Ophthalmology, University of Bonn, Bonn, Germany. 64Institute for Genomic Medicine, University of California, San Diego, La Jolla, California, USA. 65Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, California, USA. 66Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA. 67Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, USA. 68Department of Ophthalmology, Graduate School of Medical Science, Kyushu University, Fukuoka, Japan. 69Department of Ophthalmology, Boston University Schools of Medicine and Public Health, Boston, Massachusetts, USA.
70Department of Biostatistics, Boston University Schools of Medicine and Public Health, Boston, Massachusetts, USA. 71Fondation Jean Dausset, Centre d'Etude du Polymorphisme Humain (CEPH), Paris, France. 72Center for Human Disease Modeling, Duke University, Durham, North Carolina, USA. 73Department of Cell Biology, Duke University, Durham, North Carolina, USA. 74Department of Pediatrics, Duke University, Durham, North Carolina, USA. 75Department of Ophthalmology, Julius-Maximilians-Universität, Würzburg, Germany. 76Department of Ophthalmology, Royal Perth Hospital, Perth, Western Australia, Australia. 77Centre for Ophthalmology and Visual Science, University of Western Australia, Perth, Western Australia, Australia. 78Department of Ophthalmology, Harvard Medical School, Boston, Massachusetts, USA. 79Massachusetts Eye and Ear Infirmary, Boston, Massachusetts, USA. 80Department of Environmental Medicine, Graduate School of Medical Science, Kyushu University, Fukuoka, Japan. 81Department of Ophthalmology and Visual Sciences, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA. 82Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, Florida, USA. 83Institute of Human Genetics, Helmholtz Zentrum München–Deutsches Forschungszentrum für Gesundheit und Umwelt, Neuherberg, Germany. 84Faculty of Medicine, Clinical and Experimental Sciences, University of Southampton, Southampton, UK. 85Institute of Human Genetics, Technische Universität München, Munich, Germany. 86Centre for Vision Research, Department of Ophthalmology and the Westmead Millennium Institute, University of Sydney, Sydney, New South Wales, Australia. 87Department of Therapeutics, Institut de la Vision, Université Pierre et Marie Curie–Université Paris 6, UMRS 968, Paris, France. 88Department of Ophthalmology and Visual Sciences, University of Utah, John A. Moran Eye Center, Salt Lake City, Utah, USA.
89Laboratory of Molecular Medicine, Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan. 90Laboratory for Statistical Analysis, CGM, RIKEN, Yokohama, Japan. 91Department of Ophthalmology, Weill Cornell Medical College, New York, New York, USA. 92Department of Ophthalmology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA. 93Jules Stein Eye Institute, Los Angeles, California, USA. 94Moran Center for Translational Medicine, John A. Moran Eye Center, University of Utah, Salt Lake City, Utah, USA. 95Department of Ophthalmology, Seoul National University Bundang Hospital, Kyeounggi, Republic of Korea. 96Research Service, Louis Stokes Veteran Affairs Medical Center, Cleveland, Ohio, USA. 97Laboratory of Integrative Bioinformatics and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Illkirch, France. 98Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA. 99Fondation Ophtalmologique Adolphe de Rothschild, Paris, France. 100Académie des Sciences–Institut de France, Paris, France. 101Division of Preventive Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA. 102Department of Ophthalmology, Addenbrooke's Hospital, Cambridge, UK. 103Faculty of Medicine, University of Iceland, Reykjavik, Iceland. 104Centre for Vision and Vascular Science, Queen's University, Belfast, UK. 105Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA. 106Department of Biomedical Engineering, Columbia University, New York, New York, USA. 107Centre de Recherche Clinique d'Ophthalmologie, Hôpital Intercommunal de Créteil, Hôpital Henri Mondor, Université Paris Est, Créteil, France. 108Department of Ophthalmology and Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 109Department of Ophthalmology, University of Thessaly School of Medicine, Larissa, Greece.
110Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands. 111Institute of Epidemiology I, Helmholtz Zentrum München–Deutsches Forschungszentrum für Gesundheit und Umwelt, Neuherberg, Germany. 112Institute of Medical Informatics, Ludwig-Maximilians-Universität and Klinikum Großhadern, Munich, Germany. 113Institute of Biometry, Ludwig-Maximilians-Universität and Klinikum Großhadern, Munich, Germany. 114Institute of Epidemiology, Ludwig-Maximilians-Universität and Klinikum Großhadern, Munich, Germany. 115Department of Epidemiology and Preventive Medicine, University of Regensburg, Regensburg, Germany. 116Medical Research Council Human Genetics Unit, Institute of Genetics and Molecular Medicine, Edinburgh, UK. 117Centre National de Génotypage, Centre d'Energie Atomique–Institut de Génomique (IG), Evry, France. 118Molecular Medicine Research Center, West China Hospital, Sichuan University, Chengdu, China. 119Department of Ophthalmology, West China Hospital, Sichuan University, Chengdu, China. 120Department of Pathology & Cell Biology, Columbia University, New York, New York, USA. 121Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California, USA. 122Department of Ophthalmology, Tufts University School of Medicine, Boston, Massachusetts, USA. 123Department of Genetics, Case Western Reserve University, Cleveland, Ohio, USA. 124Department of Ophthalmology and Visual Sciences, Case Western Reserve University, Cleveland, Ohio, USA. 125Center for Clinical Investigation, Case Western Reserve University, Cleveland, Ohio, USA. 126Department of Neurology, Boston University Schools of Medicine and Public Health, Boston, Massachusetts, USA. 127Department of Epidemiology, Boston University Schools of Medicine and Public Health, Boston, Massachusetts, USA. 128These authors contributed equally to this work. 129These authors jointly directed this work. Correspondence should be addressed to G.R.A. (goncalo@umich.edu), I.M.H. (iris.heid@klinik.uni-regensburg.de), L.A.F. (farrer@bu.edu) or J.L.H. (jonathan@chgr.mc.vanderbilt.edu).


ONLINE METHODS

Genome-wide scan for advanced AMD association including follow-up analysis. All participating studies were reviewed and approved by local institutional review boards. In addition, subjects gave informed consent before enrollment.

Study-specific association analysis for discovery. Genotyping was performed on a variety of platforms, summarized in Supplementary Table 2. Each study group submitted results from association tests using genotyped and imputed data with the allelic dosages computed with either MACH25, IMPUTE23, BEAGLE24 or snpStats52 using the HapMap 2 reference panels. The CEU panel (Utah residents of Northern and Western European ancestry) was used as a reference for imputation-based analyses for most samples (predominantly of European ancestry), with two exceptions: for the JAREDS samples (predominantly of east Asian ancestry), the Han Chinese in Beijing, China (CHB) and Japanese in Tokyo, Japan (JPT) panels were used as a reference, and, for the VRF samples (predominantly of south Asian ancestry), the combined CEU, CHB and JPT panels were used22,53. For most data sets, association tests were run under a logistic regression model using either PLINK54, Mach2dat25, ProbABEL55 or snpStats52, although, for one data set containing related individuals, the generalized estimating equations algorithm56 was implemented in R57. In addition to the primary analysis that tested for SNP associations with advanced AMD unadjusted for age, an age-adjusted sensitivity analysis was conducted by each group with available age information. Each group also provided stratified results by sex or AMD subtype (geographic atrophy or neovascularization disease), as long as the sample size per stratum exceeded 50 subjects. For all analyses, study-specific control for population stratification was conducted (Supplementary Table 4).

Study-specific association analysis for follow-up. Genotyping of the selected SNPs was performed on different platforms.
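The discovery-stage association test described above can be illustrated with a minimal sketch. It is not the implementation used by PLINK, Mach2dat, ProbABEL or snpStats. It assumes an additive model with a single SNP and no covariates, takes the allelic dosage to be the expected count of the coded allele given the imputed genotype probabilities, and fits the logistic regression by Newton-Raphson. All function names and data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def allelic_dosage(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected count (0 to 2) of the coded allele from genotype probabilities."""
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p_het + 2.0 * p_hom_alt


def _score_and_information(a: float, b: float,
                           x: Sequence[float], y: Sequence[int]):
    """Gradient and Fisher information of the log-likelihood at (a, b)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # fitted case probability
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logistic(x: Sequence[float], y: Sequence[int],
                 iterations: int = 25) -> Tuple[float, float]:
    """Return (beta, standard error) for logit P(case) = a + beta * dosage."""
    a = b = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = _score_and_information(a, b, x, y)
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det  # Newton step: inverse information
        b += (h00 * g1 - h01 * g0) / det  # times the gradient
    _, _, h00, h01, h11 = _score_and_information(a, b, x, y)
    se_b = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b, se_b


# Hypothetical data: 40 individuals with dosage 0 and 40 with dosage 1.
dosages = [0.0] * 40 + [1.0] * 40
status = [0] * 30 + [1] * 10 + [0] * 20 + [1] * 20  # 1 = case, 0 = control
print(allelic_dosage(0.1, 0.3, 0.6))
print(fit_logistic(dosages, status))
```

The estimate and its standard error are the per-study quantities that a meta-analysis would later pool. Covariates such as age or ancestry components would add further columns to the model and are omitted here for brevity.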
The same models and sensitivity and stratified analyses were computed by each follow-up partner, and SNPs with insufficient call rate were excluded on the basis of study-specific thres holds. If the index SNP could not be genotyped, a highly correlated proxy was used whenever possible (Supplementary Tables 2 and 3). Quality control before meta-analysis. Before meta-analysis, all study-specific files underwent quality control procedures to check for completeness and plausible descriptive statistics on all variables as well as for compliance of allele frequencies with HapMap58. In addition, we excluded individual study SNP results from meta-analysis (i) for discovery if imputation quality measures were too low (MACH and PLINK < 0.3; SNPTEST < 0.4) or if effect size (||) or standard error was too extreme ( 5), indicating instability of the estimates, or (ii) for follow-up if Hardy-Weinberg equilibrium was violated (P < 0.05/32). Meta-analyses. For both discovery and follow-up, we performed meta-analyses using the inverse varianceweighted fixed-effect model, which pools the effect sizes and standard errors from each participating GWAS. Using an alternative-weighted z-score method, which is based on a weighted sum of z-score statistics, we obtained a very similar set of test statistics (correlation of log10 (P value) > 0.98). All analyses were performed using METAL26 and R. For the discovery stage, we applied two rounds each of genomic control corrections to the individual GWAS results and the combined meta-analysis results 51. All results were analyzed and validated among four independent teams. Extended analyses for the identified AMD-associated loci. Extended analyses were conducted on the identified loci and particularly on the top SNP at each locus. Second signal analysis. 
To detect potential independent signals within the identified AMD-associated loci, each study partner with genotypes for all identified SNPs available reanalyzed their data for all SNPs in the respective loci (index SNP ± 1 Mb) using a logistic regression model containing all identified index SNPs. Quality control procedures were performed as before. Meta-analysis was performed on the estimate for each SNP, applying the effective sample size-weighted z-score method and two rounds of genomic control correction. The significance threshold (P < 0.05) for an independent association signal within any of the identified loci was Bonferroni adjusted using the average effective number of SNPs involved across the identified loci as determined by SNPSpD59. Thirteen studies contributed to this analysis, including 7,489 cases and 51,562 controls.

Interaction analysis. Using prespecified R scripts (see URLs), GWAS partners performed 171 logistic regression analyses modeling the pairwise interactions of the 19 index SNPs, assuming an additive model for main and interaction effects. Study-specific covariates were included in the models, if required. For each study, quality control included a check for consistency of the main SNP effects between discovery and interaction analyses. SNPs with low imputation quality measure and pairs with |β| > 5 or standard error > 5 were excluded before meta-analysis was performed on the interaction effects with the inverse variance-weighted fixed-effect model in METAL. Twelve studies contributed to this analysis, including 6,645 cases and 49,410 controls.

Genetic risk score. Effect sizes, $\beta_j$, for each of the 19 SNPs were calculated in the meta-analysis described above and normalized by

$$\tilde{\beta}_j = \frac{\beta_j}{\sum_{k=1}^{19} \beta_k}$$

where j = 1,…,19. Using these values as weights, each study partner with data available for all 19 SNPs computed the genetic risk score for an individual as a normalized weighted sum of the AMD risk-increasing alleles among the identified SNPs, with

$$S_i = \sum_{j} \tilde{\beta}_j x_{ij}$$

where $x_{ij}$ is the genotype of the ith individual at the jth SNP, such that $S_i$ ranges from 0 to 2. Data for these calculations were available from 12 studies, including 7,195 cases and 49,149 controls. For each study, we used leave-one-out cross-validation to assess the prediction of the risk score. For the kth subject, we fitted a logistic regression model from all subjects in the study excluding the kth subject as

$$\log\left(\frac{y_i}{1 - y_i}\right) = \alpha + \gamma S_i, \quad i \neq k$$

where $\alpha$ is the intercept and $\gamma$ is the effect of the genetic risk score. The fitted probability of the kth subject was then estimated as

$$\hat{y}_k = \frac{1}{1 + e^{-(\alpha + \gamma S_k)}}$$
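The weighting, scoring and leave-one-out steps described above can be illustrated with a minimal Python sketch. It is not the code used by the study partners: the function names, the Newton-Raphson solver and its iteration count are assumptions made for illustration, and genotypes are assumed to be coded as counts (0, 1 or 2) of the risk-increasing allele.

```python
import math


def normalize_weights(betas):
    """Scale per-SNP effect sizes so that the weights sum to 1."""
    total = sum(betas)
    return [b / total for b in betas]


def risk_score(genotypes, weights):
    """Weighted sum of risk-allele counts; lies between 0 and 2."""
    return sum(w * x for w, x in zip(weights, genotypes))


def fit_logistic(scores, labels, iterations=25):
    """Fit logit(y) = alpha + gamma * S by Newton-Raphson (illustrative solver)."""
    alpha, gamma = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(alpha + gamma * s)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to alpha
            g1 += (y - p) * s      # gradient with respect to gamma
            h00 += w               # entries of the information matrix
            h01 += w * s
            h11 += w * s * s
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:       # stop if the update is ill-conditioned
            break
        alpha += (h11 * g0 - h01 * g1) / det
        gamma += (h00 * g1 - h01 * g0) / det
    return alpha, gamma


def leave_one_out_probabilities(scores, labels):
    """Fitted probability for each subject from a model fitted without that subject."""
    fitted = []
    for k in range(len(scores)):
        train_scores = scores[:k] + scores[k + 1:]
        train_labels = labels[:k] + labels[k + 1:]
        alpha, gamma = fit_logistic(train_scores, train_labels)
        fitted.append(1.0 / (1.0 + math.exp(-(alpha + gamma * scores[k]))))
    return fitted
```

Because the normalized weights sum to 1 and each genotype is at most 2, `risk_score` cannot exceed 2, which matches the stated range of the score. A production analysis would more likely rely on an established statistics package than on a hand-written solver.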

We sorted the fitted probabilities and calculated sensitivity and specificity by varying the risk threshold (the value compared with the fitted probability to dichotomize the subjects into cases or controls) from 0 to 1. These estimates of sensitivity and specificity were used to compute the AUC of the receiver operating characteristic curve.

Identification of correlated coding variants and tagged non-SNP variation. LD estimates were calculated using genotype data from the identified risk loci (index SNPs ± 500 kb) in individuals with European ancestry from the 1000 Genomes Project (March 2012 release)60 or from HapMap (release 28)58. Variants correlated (r² > 0.6) with one of the GWAS index SNPs were identified using PLINK54. To identify coding variants, all correlated variants were mapped against RefSeq transcripts using ANNOVAR61.

Gene expression. We evaluated the expression in retina of genes within 100 kb of 1 of the 19 index SNPs, as well as of several retina-specific, RPE-specific and housekeeping genes unrelated to AMD for comparison (RNA sequencing data from 3 young (17–35 years) and 2 elderly (75 and 77 years) individuals). We also analyzed expression in fetal and adult RPE (data in the Gene Expression Omnibus database45; GSE18811). Expression was analyzed using previously described protocols44 (Supplementary Table 8).
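For the risk-score evaluation described under Genetic risk score, the threshold sweep and AUC calculation can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the trapezoidal rule is an assumed way of integrating the curve, and the input is assumed to contain at least one case and one control.

```python
def roc_points(probabilities, labels):
    """(1 - specificity, sensitivity) pairs as the threshold sweeps from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every fitted probability
    for threshold in sorted(set(probabilities), reverse=True):
        true_pos = sum(1 for p, y in zip(probabilities, labels)
                       if p >= threshold and y == 1)
        false_pos = sum(1 for p, y in zip(probabilities, labels)
                        if p >= threshold and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A score that ranks every case above every control gives an area of 1, and a score that cannot distinguish the two groups gives an area of 0.5.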

doi:10.1038/ng.2578

Pathway analyses. Functional enrichment analysis was performed using IPA software. Any gene located within 100 kb of a SNP in high LD (r² > 0.8) with one of the index SNPs was considered a potential AMD risk-associated gene and was included in subsequent pathway enrichment analysis. LD estimates were calculated as described above. Applying these inclusion filters, 90 genes were implicated by our 19 replicated AMD-associated SNPs (Supplementary Table 8). Because genes with related function sometimes cluster in the same locus, we trimmed gene lists during analysis so that only one gene per locus was used to evaluate enrichment for each pathway. The P value of the association between our gene list and any of the canonical pathways and/or functional gene sets as annotated by IPA's Knowledge Base was computed using a one-sided Fisher's exact test. The Benjamini-Hochberg method was used to estimate FDR. To evaluate the significance of observed enrichment, we repeated our Ingenuity analysis starting with 50 lists of 19 SNPs randomly drawn from the NHGRI GWAS catalog46 and, again, using the INRICH tool62. When using INRICH, we used gene sets defined in the Molecular Signatures Database (MSigDB)47 (ver3.0) representing manually curated canonical pathway, gene ontology (GO) biological process and cellular component and molecular function gene sets (C2:CP, C5:BP, C5:CC and C5:MF). We provided INRICH with our full GWAS SNP list and allowed it to carry out 100,000 permutations, matching selected loci in terms of gene count, SNP density and total number of SNPs.
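The two statistical steps named above, a one-sided Fisher's exact test followed by Benjamini-Hochberg adjustment, can be sketched in plain Python. This is not IPA's implementation; the function names and the choice of background gene count are assumptions for illustration.

```python
from math import comb


def fisher_one_sided(overlap, list_size, pathway_size, background_size):
    """Enrichment P value: probability of at least `overlap` pathway genes
    in a list of `list_size` genes drawn from `background_size` genes."""
    upper = min(list_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(background_size, list_size)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, one P value would be computed per pathway from the trimmed gene list, and the full vector of P values would then be passed to `benjamini_hochberg`.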

52. Wallace, C. et al. The imprinted DLK1-MEG3 gene region on chromosome 14q32.2 alters susceptibility to type 1 diabetes. Nat. Genet. 42, 68–71 (2010).
53. Huang, L. et al. Genotype-imputation accuracy across worldwide human populations. Am. J. Hum. Genet. 84, 235–250 (2009).
54. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
55. Aulchenko, Y.S., Struchalin, M.V. & van Duijn, C.M. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11, 134 (2010).
56. Zeger, S.L. & Liang, K.Y. Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42, 121–130 (1986).
57. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2012).
58. International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
59. Nyholt, D.R. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am. J. Hum. Genet. 74, 765–769 (2004).
60. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
61. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
62. Lee, P.H., O'Dushlaine, C., Thomas, B. & Purcell, S.M. INRICH: interval-based enrichment analysis for genome-wide association studies. Bioinformatics 28, 1797–1799 (2012).


letters

Somatic mutations in ATP1A1 and ATP2B3 lead to aldosterone-producing adenomas and secondary hypertension

Felix Beuschlein1, Sheerazed Boulkroun2,3, Andrea Osswald1, Thomas Wieland4, Hang N Nielsen5, Urs D Lichtenauer1, David Penton6, Vivien R Schack5, Laurence Amar2,3,7, Evelyn Fischer1, Anett Walther4, Philipp Tauber6, Thomas Schwarzmayr4, Susanne Diener4, Elisabeth Graf4, Bruno Allolio8, Benoit Samson-Couterie2,3, Arndt Benecke9, Marcus Quinkler10, Francesco Fallo11, Pierre-Francois Plouin2,3,7, Franco Mantero12, Thomas Meitinger4,13,14, Paolo Mulatero15, Xavier Jeunemaitre2,3,7, Richard Warth6, Bente Vilsen5, Maria-Christina Zennaro2,3,7,16, Tim M Strom4,13,16 & Martin Reincke1
1Medizinische Klinik und Poliklinik IV, Ludwig-Maximilians-Universität München, Munich, Germany. 2Institut National de la Santé et de la Recherche Médicale (INSERM), Unité Mixte de Recherche Scientifique (UMRS) 970, Paris Cardiovascular Research Center, Paris, France. 3Université Paris Descartes, Sorbonne Paris Cité, Paris, France. 4Institute of Human Genetics, Helmholtz Zentrum München, Neuherberg, Germany. 5Department of Biomedicine, Aarhus University, Aarhus, Denmark. 6Medizinische Zellbiologie, Universität Regensburg, Regensburg, Germany. 7Assistance Publique-Hôpitaux de Paris, Hôpital Européen Georges Pompidou, Paris, France. 8Department of Medicine I, Endocrine and Diabetes Unit, University Hospital Würzburg, Würzburg, Germany. 9Centre National de la Recherche Scientifique (CNRS), Institut des Hautes Etudes Scientifiques, Bures sur Yvette, France. 10Clinical Endocrinology, Campus Mitte, University Hospital Charité, Berlin, Germany. 11Department of Medicine, University of Padova, Padova, Italy. 12Endocrine Unit, Department of Medicine, University of Padova, Padova, Italy. 13Institute of Human Genetics, Technische Universität München, Munich, Germany. 14DZHK (German Centre for Cardiovascular Research), partner site, Munich Heart Alliance, Munich, Germany. 15Department of Medical Sciences, Division of Internal Medicine and Hypertension, University of Torino, Turin, Italy. 16These authors contributed equally to this work. Correspondence should be addressed to M.R. (martin.reincke@med.uni-muenchen.de) or F.B. (felix.beuschlein@med.uni-muenchen.de).

Received 15 October 2012; accepted 9 January 2013; published online 17 February 2013; doi:10.1038/ng.2550

Primary aldosteronism is the most prevalent form of secondary hypertension. To explore molecular mechanisms of autonomous aldosterone secretion, we performed exome sequencing of aldosterone-producing adenomas (APAs). We identified somatic hotspot mutations in the ATP1A1 (encoding an Na+/K+ ATPase α subunit) and ATP2B3 (encoding a Ca2+ ATPase) genes in three and two of the nine APAs, respectively. These ATPases are expressed in adrenal cells and control sodium, potassium and calcium ion homeostasis. Functional in vitro studies of ATP1A1 mutants showed loss of pump activity and strongly reduced affinity for potassium. Electrophysiological ex vivo studies on primary adrenal adenoma cells provided further evidence for inappropriate depolarization of cells with ATPase alterations. In a collection of 308 APAs, we found 16 (5.2%) somatic mutations in ATP1A1 and 5 (1.6%) in ATP2B3. Mutation-positive cases showed male dominance, increased plasma aldosterone concentrations and lower potassium concentrations compared with mutation-negative cases. In summary, dominant somatic alterations in two members of the ATPase gene family result in autonomous aldosterone secretion.

Excessive autonomous aldosterone secretion by the adrenal gland, called primary aldosteronism, causes drug-resistant and often life-threatening arterial hypertension accompanied by severe hypokalemia. Long-term consequences include higher risk of stroke, myocardial infarction and atrial fibrillation. Primary aldosteronism is present in up to 7% of hypertensive individuals in population-based studies1 and in up to 20% of individuals with therapy resistance referred to specialized centers2. Primary aldosteronism can be caused by bilateral adrenal hyperplasia or a unilateral APA. Depending on the population and applied diagnostic procedures, the proportion of primary aldosteronism cases with APA can be as high as 60% (ref. 3). Recent reports identified mutations in the potassium channel gene KCNJ5 as a cause of familial and sporadic forms of primary aldosteronism and estimated the proportion of APAs caused by KCNJ5 mutations at 30–40% (refs. 4,5).

To identify further genetic determinants of primary aldosteronism, we performed exome sequencing in tumor and matched control tissue from nine males affected by hypokalemic primary aldosteronism without somatic KCNJ5 mutations (Supplementary Table 1). Sequencing identified a low number of protein-altering mutations (0–13 per adenoma; mean of 4.1 ± 1.4; Supplementary Table 2). Notably, within this small set of genetic alterations, we found multiple somatic variants in two members of the P-type ATPase gene family, ATP1A1 (encoding an Na+/K+ ATPase α subunit) and ATP2B3 (encoding the plasma membrane Ca2+ ATPase). Missense variants in ATP1A1 (NM_000701.7) were present in three out of nine adenomas (c.311T>G in two cases and c.995T>G in one case), leading to p.Leu104Arg and p.Val332Gly substitutions, respectively. In-frame deletions of ATP2B3 (NM_021949.3) were present in two adenomas (c.1272_1277delGCTGGT and c.1273_1278delCTGGTC), in both

[Figure 1 labels: ATP1A1: Leu104Arg, Phe100_Leu104del, Val332Gly; ATP2B3: Leu425_Val426del, Val426_Val427del.]

Figure 1 Summary of somatic alterations in ATP1A1 and ATP2B3 identified in APAs. The ten transmembrane segments (M1–M10) are shown with the most N terminal (M1) on the left. The transported ions are bound by specific residues in M4, M5, M6 and M8, with M4 being of particular importance9,11. ATP and phosphate are bound in the large cytoplasmic loop between M4 and M5. The smaller cytoplasmic loop between M2 and M3 has a major role in conformational changes associated with energy transduction.

instances resulting in deletion of two amino acids at positions 425 and 426 (p.Leu425_Val426del; Fig. 1). Sequence comparisons showed that the affected amino acids are highly conserved among species as well as among different members of the P-type ATPase family (Supplementary Fig. 1). We next sequenced the entire coding regions of ATP1A1 and ATP2B3 in 100 additional adenomas (Supplementary Table 3) and identified 6 further somatic mutations in ATP1A1 and 2 in ATP2B3. Four of the ATP1A1 mutations occurred at the same nucleotide position (c.311T>G), and two deletions including position 311 led to an in-frame deletion of five amino acids (c.299_313delTCTCAATGTTACTGT; p.Phe100_Leu104del). Two adenomas had an in-frame deletion in ATP2B3 resulting in the loss of two amino acids that overlapped with the other two deletions (c.1277_1282delTCGTGG; p.Val426_Val427del). Targeted sequencing of the 6 affected genomic
Figure 2 Structural positions of altered residues in Na+/K+ ATPase and Ca2+ ATPase. (a–c) Na+/K+ ATPase. (a) Leu104 in transmembrane helix M1 and Val332 in M4 are shown in relation to Glu334 (E334) in M4, which is crucial for potassium ion binding and gating of the binding pocket. Leu104 positions Glu334 for the occlusion of potassium ions9,11. (b) By replacing the hydrophobic leucine by a large, positively charged arginine, the p.Leu104Arg substitution likely alters the position of Glu334, thus disturbing the gating mechanism. (c) In the Val332Gly mutant, the lack of a side chain for glycine introduces flexibility owing to surrounding space, which might also influence the position of Glu334. Shown as a dashed line is a likely hydrogen bond linking the two residues together (distance between the backbone oxygen and nitrogen moieties of 3.09 Å). (d) Ca2+ ATPase. Representation of one of the two calcium ion-binding sites in SERCA, which is equivalent to the calcium ion-binding site in ATP2B3. The calcium ion is shown together with the liganding residues (Val304, Glu309, Asn796 and Asp800). Val304 in M4 is equivalent to the valine that, together with a juxtaposed leucine, is deleted in the ATP2B3 mutant. Glu309 is equivalent to Glu334 of the Na+/K+ ATPase.

positions in 199 additional adenomas identified 7 further somatic mutations in ATP1A1 and 1 in ATP2B3. The complete collection of APA samples (n = 308) contained 21 (6.8%) ATP1A1 or ATP2B3 mutations and 118 (38.3%) KCNJ5 mutations (Supplementary Table 4). Concomitant KCNJ5 and ATP1A1 or ATP2B3 mutations within the same tumor were not observed. None of the six different mutations were present in 1,600 in-house exomes or in the 1000 Genomes Project data set. In addition, the two missense mutations were not present in the Exome Variant Server data set (v.0.0.14). Although one of the ATP1A1 mutations, c.311T>G, is listed in dbSNP, it is only described as a non-validated candidate SNP derived from EST data and thus is rather unlikely to represent a germline variant.

Exome and Sanger sequencing of the somatic mutations indicated that both the reference and alternative alleles were present in tumor tissue. This observation is consistent with a heterozygous state in individuals of both sexes with the ATP1A1 mutations and in females with the ATP2B3 mutations. Because ATP2B3 is located on the X chromosome, these findings suggest a polyclonal tumor composition in the case of the two males with ATP2B3 mutations. In fact, polyclonal composition has been well recognized as a feature of small adrenal adenomas6,7. No ATP1A1 and ATP2B3 mutations were found in germline or adjacent normal adrenal samples within our group of APA cases. Of note, no germline alterations in these ATPases were found in a cohort of 18 subjects with familial aldosteronism type 2 (ref. 8) or in 91 sporadic cases with bilateral adrenal hyperplasia.

In the normal adrenal samples, ATP1A1 was highly expressed in the zona glomerulosa and, to a lesser extent, in zona fasciculata, whereas ATP2B3 immunoreactivity was similarly detectable in all three layers of the adrenal cortex. ATP1A1 and ATP2B3 expression at the mRNA and protein levels was similar in APAs and in normal adrenal tissue. Similarly, within the APA group, ATP1A1 and ATP2B3 mRNA levels were found to be independent of the tumor's mutational status (Supplementary Fig. 2).

The function and structure of the Na+/K+ ATPase, of which the α subunit is encoded by ATP1A1, have been unraveled in exquisite detail over the last decades9. For each ATP being hydrolyzed, ATP1A1 exchanges three cytoplasmic sodium ions for two extracellular potassium ions10. The potassium and sodium gradients created drive the


[Figure 3 graphic; axis labels: Na+/K+ ATPase activity (%), Phosphorylation (%), Potassium ion concentration (mM), Vm (mV), Effect of sodium ion removal (ΔVm).]

Figure 3 Functional and electrophysiological examination of transfected cells and adenoma primary cultures. (a) Plasma membranes were isolated from COS cells transfected with cDNA encoding Leu104Arg, Val332Gly or wild-type rat Atp1a1 and siRNA knocking down the endogenous ATP1A1 mRNA. The maximal Na+/K+ ATPase activity of the exogenous rat enzyme determined in the presence of 130 mM Na+ and 20 mM K+ at 37 °C is shown relative to that of wild-type rat Atp1a1, which was 0.58 ± 0.03 μmol/mg of membrane protein per hour. Error bars, s.e.m.; n = 6. (b) Quantification of relative phosphorylation from [γ-32P] MgATP in the presence of 100 mM Na+. (c) Potassium ion inhibition of phosphorylation from [γ-32P] MgATP in the presence of 50 mM Na+ for wild-type (open circles), Leu104Arg (filled circles) and Val332Gly (filled diamonds) Atp1a1 protein. Error bars, s.e.m.; n = 6. The phosphorylation level in the absence of potassium ion was taken as 100%. (d) Membrane voltages (Vm) of primary cultured adrenal cells measured by whole-cell patch clamp. Cells from normal adjacent tissue (65 cells from 14 cases) showed the most hyperpolarized membrane voltages; cells from adenomas without alterations of KCNJ5 or the two ATPases (46 cells from 7 cases) and cells from adenomas with KCNJ5 alterations (33 cells from 5 cases) were slightly depolarized. Adenoma cells with alterations of ATP2B3 (7 cells from 2 cases) or ATP1A1 (8 cells from 2 cases) were significantly depolarized compared to adjacent tissue cells. (e) Effect of removal of bath sodium ions on membrane voltage (shown as ΔVm). Adenoma cells with the sodium ion-permeable mutant KCNJ5 channel showed the strongest effect of sodium ion removal on membrane voltage, whereas less pronounced effects were present in adenoma cells with alterations of ATP2B3 and ATP1A1. (f) HEK cells transfected with cDNA encoding the Leu104Arg mutant of Atp1a1 were depolarized compared to cells overexpressing wild-type Atp1a1 (control). After removal of bath sodium ions, the membrane voltage was still more depolarized in cells transfected with cDNA for the Leu104Arg mutant. In d–f, the numbers of analyzed cells are shown in parentheses; asterisks indicate statistically significant differences (P < 0.05) from adjacent tissue in d,e or cells expressing wild-type Atp1a1 in f. Error bars, s.e.m.

ion fluxes that generate resting membrane potential and action potentials. Through site-directed mutagenesis and in vitro assays, individual domains within the Na+/K+ ATPase α1 subunit have been associated with specific functional properties11. Notably, there is good evidence for a direct link between Na+/K+ ATPase and the regulation of aldosterone secretion: blockade of Na+/K+ ATPases with the specific antagonist ouabain results in dose-dependent stimulation of aldosterone release from glomerulosa cells12 and glomerulosa cell growth in vivo13. Furthermore, angiotensin II lowers Na+/K+ ATPase activity, indicating the potential contribution of this enzyme to angiotensin-dependent aldosterone release14. Notably, heterozygous Atp1a1 knockout mice are characterized by higher serum aldosterone concentrations compared to wild-type littermates15, although indirect effects might well apply to this phenotype. However, up until now, no germline or somatic ATP1A1 mutations have been associated with human disease.

ATP2B3 also belongs to the ATPase gene family and encodes a plasma membrane Ca2+ ATPase (ATP2B3, also known as PMCA3) that is essential to clear calcium ions from the cytoplasm of eukaryotic cells and thereby has a critical role in intracellular calcium homeostasis16. Although several mutations of the genes encoding ATP2B2 (also known as PMCA2) have been identified in mouse models17 and humans with hearing loss18, no mutations in ATP2B3 have been described as causative for human disease.

Projection of the ATP1A1 alterations onto the resolved crystal structure of the orthologous Squalus acanthias Na+/K+ ATPase (Fig. 2 and Supplementary Fig. 1b) showed that all three alterations are located either in the transmembrane helix M1 or in the juxtaposed helix M4, which have been suggested to interact and cooperate in potassium ion binding and gating by interaction of Glu334 with Leu104 (ref. 11). As the atomic structure of the plasma membrane Ca2+ ATPase ATP2B3 has not yet been determined, we used the known structure of the homologous rabbit sarcoplasmic reticulum type Ca2+ ATPase (SERCA) to localize the ATP2B3 alterations. Notably, the APA-associated deletion mutations in ATP2B3 also alter the M4 transmembrane helix in the same region where the glutamate homologous to Glu334 of ATP1A1 is positioned (Fig. 2). In SERCA, this glutamate is a crucial residue in calcium ion binding (Fig. 2d). Therefore, the deletion in ATP2B3 is predicted to cause a major distortion of the calcium ion binding site. The recurrence of mutations affecting these highly conserved regions involved in interaction with the transported cations in two paralogs is suggestive of a gain-of-function effect.

To examine the functional consequences of the p.Leu104Arg and p.Val332Gly alterations of ATP1A1, we transfected COS cells with the corresponding mutated ouabain-insensitive rat cDNA and subjected them to selection in the presence of ouabain, thereby inhibiting the endogenous enzyme in COS cells11. No viable colonies were detected, indicating that the physiological Na+/K+ pump activity of the mutant rat enzymes was very low or nonexistent. We therefore expressed the mutants transiently in COS cells in the presence of small interfering RNA (siRNA) binding specifically to the endogenous COS cell Na+/K+ ATPase subunit mRNA, thus knocking down the endogenous Na+/K+ ATPase. This allowed us to carry out in vitro enzymatic studies of expressed mutant and wild-type Atp1a1. The ATPase activity determined under optimal conditions for wild-type Atp1a1 was indeed undetectable for the Leu104Arg variant and very low for the Val332Gly variant (Fig. 3a). However, the mutant enzymes were well expressed and able to react with ATP and be phosphorylated in a sodium ion-dependent reaction (Fig. 3b). Notably, the detected affinity for potassium ion was conspicuously lower in the mutants relative to wild-type
Figure 4 Proposed mechanism for autonomous aldosterone secretion in APAs with somatic ATPase alterations. (a) Glomerulosa cell with hyperpolarized membrane potential under baseline conditions. (b) Angiotensin II (AngII)-induced inhibition of K+ channels22 and Na+/K+ ATPases14 leading to cell depolarization, activation of voltage-gated Ca2+ channels and higher cytoplasmic calcium ion levels. (c) Mutation of ATP1A1 resulting in alteration of potassium ion binding and loss of function followed by cell depolarization. (d) Mutation of ATP2B3 associated with impaired clearance of cytoplasmic calcium ions. CYP11B2, aldosterone synthase gene.


Atp1a1 (~0.2% and 2% for the Leu104Arg and Val332Gly mutants, respectively; Fig. 3c). The effect of the p.Leu104Arg mutation supports the previous suggestion that Leu104 by van der Waals interaction positions Glu334, which is essential for potassium ion binding and gating of the binding pocket11. Substitution of the almost-juxtaposed Val332 with a glycine, which lacks a side chain, may create flexibility, likewise disturbing the function of Glu334 (Fig. 2a–c).

Under ex vivo conditions, electrophysiological examination of primary cultured adenoma cells with different underlying mutations showed substantially higher levels of depolarization in ATPase (ATP1A1 or ATP2B3)-mutant cells compared to cells from normal adjacent tissue. This indicates that adenoma cells with mutations in ATPase genes have profoundly altered electrophysiological properties. The finding was specific for ATPase-mutant adenoma samples, as such a strong depolarization was not observed in KCNJ5-mutant adenoma samples or in adenoma cells without known mutation (Fig. 3d). When extracellular sodium ions were removed, primary cells hyperpolarized, suggesting a disturbed intracellular ion composition and/or loss of net charge transport by the mutated pump (Fig. 3e). As expected, the hyperpolarization was most pronounced in cells from adenomas with mutant sodium ion-permeable KCNJ5 (ref. 4). The effect of sodium ion removal in primary adenoma cells with mutant ATP1A1 was less marked. This suggests that the strong depolarization observed in cells with mutant ATP1A1 is not primarily caused by higher channel-like sodium ion conductance but possibly by disturbed intracellular ion composition (Fig. 3e). In HEK cells transfected with cDNA for the Leu104Arg variant of Atp1a1, the membrane voltage was depolarized compared to cells expressing wild-type Atp1a1, suggesting that the depolarization of primary adenoma cells is a specific consequence of the mutation of ATP1A1 (Fig. 3f).

Whereas KCNJ5 mutations have been consistently reported to be more prevalent in female cases5,19,20, ATPase alterations were predominantly found in males (81.0% (17/21) of cases with either ATP1A1 or ATP2B3 mutation were male versus 25.4% (30/118) of cases with KCNJ5 mutation; P < 0.0001). Consistent with a more severe phenotype, individuals with ATPase-mutant tumors had higher preoperative aldosterone concentrations (median of 397.8 ng/l, interquartile range of 555.0 ng/l versus median of 286.6 ng/l, interquartile range of 287.2 ng/l in cases without mutations; P = 0.02) and significantly lower serum potassium concentrations (median of 2.6 mM, interquartile range of 0.8 mM versus median of 3.1 mM, interquartile range of 0.8 mM in cases with KCNJ5 mutations; P = 0.013). Other clinical endpoints, including tumor size, blood pressure, serum sodium concentration and urinary albumin secretion, were similar between the groups (Supplementary Table 5).

We have been unable to detect germline mutations in ATP1A1 or ATP2B3 in individuals with familial forms or with the bilateral form of primary aldosteronism. Given the central role of the Na+/K+ ATPases and Ca2+ ATPases in generating the electrochemical gradients required for electrical excitability and the cellular uptake of ions, nutrients and neurotransmitters, as well as for the regulation of cell volume and intracellular pH, germline mutations in these genes are predicted to be under strong purifying selection. Notably, however, ATP2B1, encoding one of the four mammalian plasma membrane Ca2+ ATPase isoforms, was significantly associated with systolic and diastolic blood pressure and hypertension in a recent genome-wide association study21.

In summary, we show here that somatic mutations in ATP1A1 and ATP2B3 are present in roughly 7% of aldosterone-producing adenomas. In both instances, inactivation of the pump function, either indirectly (in the case of ATP1A1) or directly (in the case of ATP2B3), is predicted to increase intracellular calcium ion concentrations, which in turn prime calcium-dependent signaling and aldosterone output (Fig. 4). As the alterations identified in both ATPases are constrained to specific and highly conserved functional domains, further gain-of-function mechanisms, such as a pathological transport mode, might be potential contributors to the molecular phenotype. Thus, these findings expand the spectrum of somatic alterations leading to APAs to two members of the P-type ATPase pump family, extend knowledge of the molecular mechanism leading to APAs and indicate new potential therapeutic targets for the most frequent secondary form of arterial hypertension.

URLs. National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project Exome Variant Server, http://evs.gs.washington.edu/EVS/; ClinVar, http://www.ncbi.nlm.nih.gov/clinvar/; European Network for the Study of Adrenal Tumors (ENS@T), http://www.ensat.org/; Primer3, http://frodo.wi.mit.edu/primer3/input.htm;

Mutant + 2+ Ca2+ Ca2+ Na+/K+ Na /Ca ATPase exchanger channel ATPase Ca2+ Ca2+ Na+

Mutant Na+/K+ Na+/Ca2+ Ca2+ Ca2+ ATPase exchanger channel ATPase Ca2+ Ca2+ Na+

npg

2013 Nature America, Inc. All rights reserved.

letters
ExonPrimer, http://ihg.helmholtz-muenchen.de/ihg/ExonPrimer. html; ClustalW2, http://www.ebi.ac.uk/Tools/msa/clustalw2/; PyMOL, http://www.pymol.org/. Methods Methods and any associated references are available in the online version of the paper. Accession codes. Disease-causing variants will be submitted to ClinVar. Exome data are available on request within a scientific cooperation.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
Munich and Regensburg. This work has been made possible by a grant of the Else Kröner-Fresenius-Stiftung in support of the German Conn's Registry-Else-Kröner Hyperaldosteronism Registry (to M.R.). Additional funding was received from the Deutsche Forschungsgemeinschaft to F.B. and M.R. (Re 752/17-1) and R.W. (FOR1086). The work was also supported by the German Ministry of Education and Research (01GR0802 and 01GM0867), the European Commission's Seventh Framework Programme (261123, GEUVADIS) and the DZHK.
Paris. We thank H. Lefebvre and E. Louiset (INSERM U982 and University Hospital of Rouen) and M. Sibony (Assistance Publique-Hôpitaux de Paris, Hôpital Cochin) for providing control adrenal samples. We thank the COMETE (COrtico et MEdullo-surrénale: les Tumeurs Endocrines) network for providing tissue samples from individuals with APA. This work was funded through institutional support from INSERM and by the Agence Nationale pour la Recherche (ANR Physio 2007, 013-01; Genopat 2008, 08-GENO-021), the Fondation pour la Recherche sur l'Hypertension Artérielle (AO 2007), the Fondation pour la Recherche Médicale (ING20101221177), the Programme Hospitalier de Recherche Clinique (PHRC grant AOM 06179) and by grants from INSERM and the Ministère Délégué à la Recherche et des Nouvelles Technologies.
Aarhus. This work was supported in part by grants to B.V. from the Danish Medical Research Council, the Novo Nordisk Foundation (Fabrikant Vilhelm Pedersen og Hustrus Legat) and the Lundbeck Foundation.
Turin. This study was supported by grants from the Fondi Ricerca Ex-60% MIUR (Ministry of University, Scientific and Technological Research) 2012 and the Compagnia di San Paolo.

AUTHOR CONTRIBUTIONS
S.B., H.N.N., U.D.L., D.P., V.R.S., A.W., P.T., S.D. and B.S.-C. performed the experiments. A.O., T.W., L.A., E.F., T.S., T.M.S., E.G. and A.B. performed statistical analysis and analyzed the data. B.A., M.Q., F.F., P.-F.P., F.M. and P.M. contributed materials. F.B., T.M., X.J., R.W., B.V., M.-C.Z., T.M.S. and M.R. jointly supervised research, conceived and designed the experiments, analyzed the data, contributed reagents, materials and/or analysis tools and wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/ reprints/index.html.
1. Hannemann, A. et al. Screening for primary aldosteronism in hypertensive subjects: results from two German epidemiological studies. Eur. J. Endocrinol. 167, 7–15 (2012).
2. Eide, I.K., Torjesen, P.A., Drolsum, A., Babovic, A. & Lilledahl, N.P. Low-renin status in therapy-resistant hypertension: a clue to efficient treatment. J. Hypertens. 22, 2217–2226 (2004).
3. Rossi, G.P. et al. A prospective study of the prevalence of primary aldosteronism in 1,125 hypertensive patients. J. Am. Coll. Cardiol. 48, 2293–2300 (2006).
4. Choi, M. et al. K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Science 331, 768–772 (2011).
5. Boulkroun, S. et al. Prevalence, clinical, and molecular correlates of KCNJ5 mutations in primary aldosteronism. Hypertension 59, 592–598 (2012).
6. Beuschlein, F. et al. Clonal composition of human adrenocortical neoplasms. Cancer Res. 54, 4927–4932 (1994).
7. Gicquel, C. et al. Clonal analysis of human adrenocortical carcinomas and secreting adenomas. Clin. Endocrinol. 40, 465–477 (1994).
8. Mulatero, P. et al. KCNJ5 mutations in European families with nonglucocorticoid remediable familial hyperaldosteronism. Hypertension 59, 235–240 (2012).
9. Morth, J.P. et al. Crystal structure of the sodium-potassium pump. Nature 450, 1043–1049 (2007).
10. Kaplan, J.H. Biochemistry of Na,K-ATPase. Annu. Rev. Biochem. 71, 511–535 (2002).
11. Einholm, A.P., Andersen, J.P. & Vilsen, B. Importance of Leu99 in transmembrane segment M1 of the Na+,K+-ATPase in the binding and occlusion of K+. J. Biol. Chem. 282, 23854–23866 (2007).
12. Yingst, D.R., Davis, J., Krenz, S. & Schiebinger, R.J. Insights into the mechanism by which inhibition of Na,K-ATPase stimulates aldosterone production. Metabolism 48, 1167–1171 (1999).
13. Neri, G. et al. Ouabain chronic infusion enhances the growth and steroidogenic capacity of rat adrenal zona glomerulosa: the possible involvement of the endothelin system. Int. J. Mol. Med. 18, 315–319 (2006).
14. Hajnóczky, G. et al. Angiotensin-II inhibits Na+/K+ pump in rat adrenal glomerulosa cells: possible contribution to stimulation of aldosterone production. Endocrinology 130, 1637–1644 (1992).
15. Moseley, A.E. et al. Genetic profiling reveals global changes in multiple biological pathways in the hearts of Na,K-ATPase α1 isoform haploinsufficient mice. Cell Physiol. Biochem. 15, 145–158 (2005).
16. Di Leva, F., Domi, T., Fedrizzi, L., Lim, D. & Carafoli, E. The plasma membrane Ca2+ ATPase of animal cells: structure, function and regulation. Arch. Biochem. Biophys. 476, 65–74 (2008).
17. Street, V.A., McKee-Johnson, J.W., Fonseca, R.C., Tempel, B.L. & Noben-Trauth, K. Mutations in a plasma membrane Ca2+-ATPase gene cause deafness in deafwaddler mice. Nat. Genet. 19, 390–394 (1998).
18. Schultz, J.M. et al. Modification of human hearing loss by plasma-membrane calcium pump PMCA2. N. Engl. J. Med. 352, 1557–1564 (2005).
19. Åkerström, T. et al. Comprehensive re-sequencing of adrenal aldosterone producing lesions reveal three somatic mutations near the KCNJ5 potassium channel selectivity filter. PLoS ONE 7, e41926 (2012).
20. Azizan, E.A. et al. Microarray, qPCR, and KCNJ5 sequencing of aldosterone-producing adenomas reveal differences in genotype and phenotype between zona glomerulosa- and zona fasciculata-like tumors. J. Clin. Endocrinol. Metab. 97, E819–E829 (2012).
21. Ehret, G.B. et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103–109 (2011).
22. Spät, A. & Hunyady, L. Control of aldosterone secretion: a model for convergence in cellular signaling pathways. Physiol. Rev. 84, 489–539 (2004).


ONLINE METHODS

Subjects. Individuals with primary aldosteronism were recruited from seven different centers of the APA working group of the European Network for the Study of Adrenal Tumors (ENS@T). Case detection and subtype identification were in accordance with institutional guidelines. The diagnosis of adrenocortical adenoma was histologically confirmed after surgical resection. All subjects gave written informed consent for genetic investigation within each individual institution (University Hospital Munich; Centre de Protection des Personnes de Paris-Cochin; Department of Medical Sciences, University of Torino; University Hospital Charité; University Hospital Würzburg; Department of Medicine, University of Padova). For initial exome sequencing, we selected nine affected individuals from the Munich cohort without germline or somatic KCNJ5 mutations. Baseline clinical and biochemical characteristics of these individuals are summarized in Supplementary Table 1.

Nucleic acid extraction. DNA or RNA was extracted from a total of 308 APAs with 17 paired peritumoral adrenal cortices and 16 peripheral DNA samples in addition to 91 peripheral DNA samples from subjects with bilateral adrenal hyperplasia. Tumor DNA was extracted using the RNeasy DNA extraction kit (Qiagen); DNA from peripheral blood leukocytes was prepared using salt extraction. Total RNA was isolated from frozen tissue using TRIzol (Invitrogen) and then cleaned on silica columns using the RNeasy Mini kit (Qiagen). RNA integrity and quality were systematically checked using an Agilent 2100 Bioanalyzer with the RNA6000 Nano Assay (Agilent Technologies). After DNase I treatment (Invitrogen), 500 ng of total RNA was reverse transcribed using Superscript II reverse transcriptase (Invitrogen) and random hexamers (Promega).

Exome sequencing. Exomes were enriched in solution and indexed with SureSelect XT Human All Exon 50 Mb kits (Agilent). Sequencing was performed as 100-bp paired-end runs on HiSeq2000 systems (Illumina). Pools of 12 indexed libraries were sequenced on 4 lanes. Image analysis and base calling were performed using Illumina Real Time Analysis. CASAVA 1.8 was used for demultiplexing.

Variant detection. Burrows-Wheeler Aligner (BWA v 0.5.9) with standard parameters was used for read alignment against the human genome assembly hg19 (GRCh37). We performed single-nucleotide variant and small insertion and deletion (indel) calling specifically for the regions targeted by the exome enrichment kit using SAMtools (v 0.1.18). Subsequently, variant quality was determined using the SAMtools varFilter script. We used default parameters, with the exception of the maximum read depth (-D) and the minimum P value for base quality bias (-2), which we set to 9,999 and 1 × 10^−400, respectively. Additionally, we applied a custom script to mark all variants with adjacent bases of low median base quality. All variants were then annotated using custom Perl scripts. Annotation included information about known transcripts (UCSC Known Genes and RefSeq genes), known variants (dbSNP v135), type of mutation and, if applicable, amino-acid change in the corresponding protein. Annotated variants were then added to our in-house database. To discover putative somatic variants, we queried our database to show only those variants of a tumor that were not found in the corresponding control tissue. To reduce the number of false positives, we filtered out variants that were already present in our database, had variant quality of less than 40 or did not pass one of the filters from the filter scripts. We then manually investigated the raw read data of the remaining variants using the Integrative Genomics Viewer (IGV).

ATP1A1 and ATP2B3 sequencing. DNA was amplified using intron-spanning primers. Bidirectional Sanger sequencing was performed using the ABI BigDye Terminator v.3.1 Cycle Sequencing kit. Primer sequences (available on request) were designed using Primer3 or ExonPrimer software.

Protein sequence analysis. Similarity analysis was performed using the ClustalW2 program. Human ATP1A1 (NP_000692.2) and ATP2B3 (NP_068768) sequences were compared with sequences from other species and with those of other human ATPases, respectively.
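The somatic-variant filter described under Variant detection can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which used custom Perl scripts and an in-house database); the `Variant` record, the function name and all example coordinates are hypothetical, and only the three stated criteria are modeled: absence from the matched control tissue, absence from the existing database, and variant quality of at least 40 with all filter scripts passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not the authors' schema)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float
    passed_filters: bool  # True if no filter script flagged the call

    @property
    def key(self):
        # Identity of a variant independent of its per-sample quality
        return (self.chrom, self.pos, self.ref, self.alt)


MIN_QUALITY = 40  # threshold stated in the Methods text


def putative_somatic(tumor, control, known_keys, min_quality=MIN_QUALITY):
    """Keep tumor variants absent from the matched control and from the
    set of previously seen variants, with adequate quality and filters."""
    control_keys = {v.key for v in control}
    return [
        v for v in tumor
        if v.key not in control_keys
        and v.key not in known_keys
        and v.quality >= min_quality
        and v.passed_filters
    ]


# Made-up example calls (coordinates are illustrative only)
tumor = [
    Variant("chr1", 1000, "T", "G", 99, True),   # kept
    Variant("chr1", 2000, "A", "C", 35, True),   # quality below 40
    Variant("chrX", 3000, "G", "A", 80, True),   # also in control tissue
    Variant("chr2", 4000, "C", "T", 90, False),  # failed a filter script
    Variant("chr3", 5000, "G", "C", 70, True),   # already in the database
]
control = [Variant("chrX", 3000, "G", "A", 75, True)]
known_keys = {("chr3", 5000, "G", "C")}

somatic = putative_somatic(tumor, control, known_keys)
print([v.key for v in somatic])
```

In the study, candidates surviving such a filter were additionally inspected by hand in IGV, a step that this sketch does not attempt to reproduce.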

Modeling of protein structures. Structural representations were prepared using PyMOL software. The Na+/K+ ATPase and sarcoplasmic reticulum Ca2+ ATPase structures shown have Protein Data Bank (PDB) accessions 2ZXE and 1SU4, respectively.

Immunohistochemistry. Immunohistochemistry was performed on sections deparaffinized in xylene and rehydrated through graded ethanol. For antigen unmasking, slides were incubated in antigen unmasking solution (Vector Laboratories) for 30 min at 98 °C. Endogenous peroxidases were inhibited by incubation in 3% hydrogen peroxide (Sigma-Aldrich) in water for 10 min, and nonspecific staining was blocked with normal goat serum. Primary antibodies (ATP1A1: Abgent, AJ1524a; 1:1,000 dilution; ATP2B3: Sigma-Aldrich, HPA001583; 1:100 dilution) were incubated with sections overnight at 4 °C. Sections were washed, incubated for 30 min with affinity-purified goat secondary antibody to rabbit (Vector Laboratories; 1:400 dilution), washed and incubated with an avidin-biotin-peroxidase complex (Vectastain ABC Elite, Vector Laboratories) for 30 min. Slides were developed using diaminobenzidine (Vector Laboratories) and counterstained with hematoxylin (Sigma). In the negative control reactions, primary antibodies were omitted from the dilution buffer, which in all instances resulted in a complete absence of staining. All microscopic examinations were carried out on a Leica microscope. Quantification was performed on 3 different fields per tumor in 3 ATP1A1-mutant APAs and 14 non-mutant APAs. Results represent the means ± s.e.m. of 135–215 cells expressed as the percentage for each expression pattern.

Quantification of mRNA expression. mRNA expression data for 91 samples with genotype data were retrieved from a pangenomic transcriptome analysis performed on 123 samples collected through the COMETE network from patients operated on for APA between 1994 and 2008 in the Hypertension Unit at the Hôpital Européen Georges Pompidou in Paris. Procedures for data acquisition and calculation have been described in detail elsewhere5.

Quantification of Na+/K+ ATPase activity in transfected cells. For in vitro studies, mutations were introduced into the full-length cDNA encoding the ouabain-resistant rat α1 isoform of the Na+/K+ ATPase Atp1a1, and the mutant and wild-type constructs were expressed in COS-1 cells. The >100-fold difference between the ouabain affinities of exogenous rat Atp1a1 and the endogenous COS-1 cell ATP1A1 enzyme allows isolation of stable cell lines under ouabain selection pressure, provided that the exogenous enzyme is functional in sodium and potassium ion pumping. Because the mutants studied here were unable to support cell growth in the presence of ouabain, thus indicating lack of pump function, we also used siRNA cotransfection to knock down the endogenous enzyme, thereby allowing studies of transiently expressed enzyme as an alternative. Leaky plasma membranes were assayed functionally using previously described methods11. Na+/K+ ATPase activity was determined at 37 °C by following the liberation of inorganic phosphate (Pi) in the presence of 130 mM NaCl, 20 mM KCl, 3 mM MgATP, 30 mM histidine buffer (pH 7.5) and 1 mM EGTA, together with 100 µM ouabain to ensure complete knockout of the ouabain-sensitive endogenous enzyme. Phosphorylation was carried out for 10 s at 0 °C with 2 µM [γ-32P]ATP in the presence of 100 mM NaCl, 3 mM MgCl2, 20 mM Tris (pH 7.5), 100 µM ouabain and 20 µg/ml oligomycin or in the presence of 50 mM NaCl, 3 mM MgCl2, 20 mM Tris (pH 7.5), 100 µM ouabain and varying concentrations of KCl, with choline chloride added to maintain constant ionic strength. The [32P]-labeled Na+/K+ ATPase was separated by acid SDS gel electrophoresis, and radioactivity was quantified by phosphorimaging. For quantification, the number of replicates was n = 4–6. Background activity, represented by the inactive phosphorylation site Atp1a1 mutant Asp376Asn, was subtracted in all cases.
Preparation of primary cell cultures. For primary cultures from APAs and adjacent normal adrenal gland tissue, samples were cleaned of surrounding fat, connective tissue and blood vessels. Then, tissue samples were minced into pieces smaller than 0.5 mm using a razor blade. Minced samples were transferred to 15-ml Falcon tubes and spun down at 225g for 5 min. The pellet was resuspended in 10 ml of digestion buffer containing 2 mg/ml of collagenase II (Biochrom) in PBS and incubated at 37 °C for no longer than 50 min in a shaking water bath. Every 15 min, tube content was pipetted up and down several times using 25-ml and, subsequently, 10-ml pipettes to support digestion. Collagenase was inactivated by adding pure FCS to a minimum total concentration of 10%, and cells were sequentially filtered through 100-µm and 70-µm nylon mesh and centrifuged as above. The pellet was resuspended in erythrocyte lysis buffer and incubated for 7 min at room temperature. After another centrifugation step, cells were resuspended in 1–2 ml of culture medium, depending on the expected cell count (DMEM/F12 with 10% FCS, 3.1 g/l glucose and 10 µl/ml penicillin-streptomycin, all from Gibco) and filtered through 70-µm nylon mesh. Cell number was determined using a Neubauer counting chamber.

Electrophysiological evaluation of primary cell cultures and transfected cells. Patch clamp recordings were performed using an EPC-10 amplifier without leak subtraction (HEKA). The solution (pH 7.4) for primary adenoma cells contained 5 mM HEPES, 124.5 mM NaCl, 1.6 mM Na2HPO4, 0.4 mM NaH2PO4, 5 mM glucose, 1 mM MgCl2, 1.3 mM CaCl2 and 4.1 mM KCl. The extracellular solution for voltage measurements in HEK cells contained 5 mM HEPES, 145 mM NaCl, 1.6 mM K2HPO4, 0.4 mM KH2PO4, 5 mM glucose, 1 mM MgCl2 and 1.3 mM CaCl2. For the sodium ion-free solution, sodium was replaced by N-methyl-D-glucamine. The pipette solution (pH 7.2) contained 95 mM potassium gluconate, 30 mM KCl, 4.8 mM Na2HPO4, 1.2 mM NaH2PO4, 5 mM glucose, 2.3 mM MgCl2, 0.762 mM CaCl2, 1 mM EGTA and 3 mM Na2ATP. For functional expression in HEK cells, the full-length cDNAs of wild-type rat Atp1a1 and mutant Atp1a1 Leu104Arg were subcloned into the bicistronic pIRES-CD8 expression vector. One day before voltage measurements, HEK cells were transfected with the constructs for wild-type rat Atp1a1 or the Atp1a1 Leu104Arg mutant using Lipofectamine. Dynabeads labeled with antibody to CD8 (Invitrogen) were used to identify transfected cells.

Statistical analysis. If not stated otherwise, group results are expressed as median values with interquartile ranges. Data between groups were compared using the Kruskal-Wallis test followed by a two-sided test for the pairwise comparison of two groups. P < 0.05 was considered to indicate statistical significance. Statistical analysis was performed using standard statistical software (SPSS 20).
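As a worked illustration of one group comparison reported in the main text (17 of 21 ATPase-mutant cases were male versus 30 of 118 KCNJ5-mutant cases; P < 0.0001), the following sketch computes a two-sided Fisher's exact test in plain Python. The choice of Fisher's exact test is an assumption made here for illustration: the Statistical analysis paragraph does not state which test was applied to this proportion, and the authors used SPSS rather than the hypothetical function below.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


# Counts taken from the main text: males / females per genotype group
atpase_male, atpase_female = 17, 21 - 17
kcnj5_male, kcnj5_female = 30, 118 - 30

pct_atpase = round(100 * atpase_male / 21, 1)
pct_kcnj5 = round(100 * kcnj5_male / 118, 1)
p_value = fisher_exact_two_sided(atpase_male, atpase_female,
                                 kcnj5_male, kcnj5_female)
print(pct_atpase, pct_kcnj5, p_value < 0.0001)
```

Under this assumed test, the percentages match those in the text (81.0% and 25.4%) and the P value falls below the reported bound of 0.0001.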


letters

De novo mutations in the autophagy gene WDR45 cause static encephalopathy of childhood with neurodegeneration in adulthood

Hirotomo Saitsu1,10, Taki Nishimura2,3,10, Kazuhiro Muramatsu4,10, Hirofumi Kodera1, Satoko Kumada5, Kenji Sugai6, Emi Kasai-Yoshida5, Noriko Sawaura4, Hiroya Nishida7, Ai Hoshino7, Fukiko Ryujin8, Seiichiro Yoshioka8, Kiyomi Nishiyama1, Yukiko Kondo1, Yoshinori Tsurusaki1, Mitsuko Nakashima1, Noriko Miyake1, Hirokazu Arakawa4, Mitsuhiro Kato9, Noboru Mizushima2,3 & Naomichi Matsumoto1
1Department of Human Genetics, Graduate School of Medicine, Yokohama City University, Yokohama, Japan. 2Department of Physiology and Cell Biology, Graduate School and Faculty of Medicine, Tokyo Medical and Dental University, Tokyo, Japan. 3Department of Biochemistry and Molecular Biology, Graduate School and Faculty of Medicine, The University of Tokyo, Tokyo, Japan. 4Department of Pediatrics, Gunma University Graduate School of Medicine, Gunma, Japan. 5Department of Neuropediatrics, Tokyo Metropolitan Neurological Hospital, Tokyo, Japan. 6Department of Child Neurology, National Center of Neurology and Psychiatry, Tokyo, Japan. 7Department of Pediatrics, National Rehabilitation Center for Children with Disabilities, Tokyo, Japan. 8Department of Pediatrics, Shiga University of Medical Science, Shiga, Japan. 9Department of Pediatrics, Yamagata University Faculty of Medicine, Yamagata, Japan. 10These authors contributed equally to this work. Correspondence should be addressed to H.S. (hsaitsu@yokohama-cu.ac.jp), N. Mizushima (nmizu@m.u-tokyo.ac.jp) or N. Matsumoto (naomat@yokohama-cu.ac.jp).

Received 24 October 2012; accepted 29 January 2013; published online 24 February 2013; doi:10.1038/ng.2562

Static encephalopathy of childhood with neurodegeneration in adulthood (SENDA) is a recently established subtype of neurodegeneration with brain iron accumulation (NBIA)1–3. By exome sequencing, we found de novo heterozygous mutations in WDR45 at Xp11.23 in two individuals with SENDA, and three additional WDR45 mutations were identified in three other subjects by Sanger sequencing. Using lymphoblastoid cell lines (LCLs) derived from the subjects, aberrant splicing was confirmed in two, and protein expression was observed to be severely impaired in all five. WDR45 encodes WD-repeat domain 45 (WDR45). WDR45 (also known as WIPI4) is one of the four mammalian homologs of yeast Atg18, which has an important role in autophagy4,5. Lower autophagic activity and accumulation of aberrant early autophagic structures were demonstrated in the LCLs of the affected subjects. These findings provide direct evidence that an autophagy defect is indeed associated with a neurodegenerative disorder in humans.

NBIA is a heterogeneous group of neurodegenerative diseases that are characterized by a prominent extrapyramidal movement disorder, intellectual deterioration and deposition of iron in the basal ganglia1–3. Mutations in several genes involved in diverse cellular processes cause NBIA6. SENDA is a recently established subtype of NBIA. SENDA begins with early childhood psychomotor retardation, which remains static until adulthood. Then, during their twenties to early thirties, affected individuals develop sudden-onset progressive dystonia-parkinsonism and dementia. In addition to iron deposition in the globus pallidus and substantia nigra, individuals with SENDA have a distinct pattern on brain magnetic resonance images (MRI) of T1-weighted signal hyperintensity of the substantia nigra, with a central band of hypointensity1–3,6,7.

SENDA is always sporadic6,7, suggesting the involvement of de novo mutations or autosomal recessive traits. To identify de novo or recessive mutations, family-based exome sequencing was performed including the affected individual, an unaffected sibling and the unaffected parents. A total of 180 and 187 rare protein-altering and splice-site variants were identified per affected subject, which were absent in dbSNP135 data and in 88 in-house control exomes (Supplementary Table 1). All genes in each subject were surveyed for de novo mutations and compound heterozygous or homozygous mutations that were consistent with an autosomal recessive trait in each family (Supplementary Table 2). Two de novo and one autosomal recessive candidate mutations were found in subject 1, and a de novo candidate mutation was found in subject 2. Only mutations in WDR45 at Xp11.23, encoding WDR45 (referred to here as WIPI4), were common in the two subjects. A canonical splice-site mutation (c.439+1G>T) was found in subject 1, and a synonymous mutation located at the last base of exon 8 (c.516G>C) was found in subject 2, both of which occurred de novo (Fig. 1a). Sanger sequencing of WDR45 in three other affected subjects identified one nonsense and two frameshift mutations (Fig. 1a). The c.1033_1034dupAA mutation in subject 5 occurred de novo. Parental samples for the other two subjects were unavailable. None of the five mutations were found in 6,500 National Heart, Lung, and Blood Institute (NHLBI) exomes or among our 212 in-house control exomes. All subjects with a WDR45 mutation are female.

To examine the effects of the mutations on WDR45 transcription, RT-PCR and sequencing were performed on total RNA extracted from the LCLs of subjects. The c.439+1G>T mutation in subject 1 and the c.516G>C mutation in subject 2 caused 24-bp in-frame and 22-bp

Figure 1 Heterozygous WDR45 mutations in individuals with SENDA. (a) Schematic of WDR45, which comprises 12 exons (rectangles). The UTRs and coding region are shown in white and black, respectively. Three mutations were confirmed as de novo; the others were unable to be confirmed because parental samples were unavailable. Blue and green arrows indicate the locations of the two sets of primers used in mRNA analysis. (b) RT-PCR analysis using the blue primer set (left) and green primer set (right) from a. Whereas control cDNA samples showed a single product corresponding to the wild-type allele (WT), an apparently longer product was observed in subjects 1, 2 and 5, indicating that only the transcripts from the mutant allele were expressed. In subject 3, both wild-type and mutant alleles were expressed. Template without reverse transcriptase was used as a negative control, RT(−). (c) Schematic of the mutant transcript resulting from the c.439+1G>T mutation (red) in subject 1. A 24-bp insertion caused by the use of a cryptic splice donor site within intron 7 was observed, resulting in a p.Gly147Val substitution followed by an in-frame eight-amino-acid insertion (p.[Gly147Val; Val147_Leu148ins8]). (d) Schematic of the mutant transcript resulting from the c.516G>C mutation (red) in subject 2. A 22-bp insertion from the use of a cryptic splice donor site within intron 8 was observed, leading to a frameshift (p.Asp174Valfs*29). (e) Schematic of mutant WIPI4 proteins (wild-type WIPI4, 361 aa). β-propeller structures and additional residues caused by mutations are colored in blue and red, respectively. The amino-acid residues of the mutant protein predicted from cDNA sequences are shown in relation to seven propeller structures13–15. An asterisk indicates the position of the FRRG motifs.

[Panel a labels, in exon order: subject 3, c.437dupA, p.Leu148Alafs*3, no parental sample; subject 1, c.439+1G>T, p.[Gly147Val; Val147_Leu148ins8], de novo; subject 2, c.516G>C, p.Asp174Valfs*29, de novo; subject 4, c.637C>T, p.Gln213*, no parental sample; subject 5, c.1033_1034dupAA, p.Asn345Lysfs*67, de novo.]
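The reading-frame consequences described in panels c and d follow from simple arithmetic on the insertion length: an insertion whose length is a multiple of three adds whole codons, whereas any other length shifts the frame. The short sketch below illustrates this; the function name is hypothetical, and it deliberately ignores sequence-level effects such as the junctional p.Gly147Val substitution or a stop codon arising inside an in-frame insertion.

```python
def classify_insertion(inserted_bp: int) -> str:
    """Classify a coding insertion or duplication by its length alone."""
    if inserted_bp <= 0:
        raise ValueError("insertion length must be a positive number of bases")
    if inserted_bp % 3 == 0:
        # Whole codons are added, so the downstream frame is preserved
        return f"in-frame insertion of {inserted_bp // 3} codons"
    # Length not divisible by three: downstream codons are misread
    return "frameshift"


# Insertion and duplication lengths taken from the text and Figure 1
events = {
    "subject 1, cryptic donor in intron 7 (24 bp)": 24,
    "subject 2, cryptic donor in intron 8 (22 bp)": 22,
    "subject 3, c.437dupA (1 bp)": 1,
    "subject 5, c.1033_1034dupAA (2 bp)": 2,
}
for label, length in events.items():
    print(label, "->", classify_insertion(length))
```

The 24-bp insertion corresponds to the eight inserted amino acids reported for subject 1, while the 22-bp insertion and the two duplications are classified as frameshifts, in agreement with the protein-level annotations in the caption.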


frameshift insertions, respectively (Fig. 1b–d and Supplementary Fig. 1). The c.437dupA, c.637C>T and c.1033_1034dupAA mutations were confirmed in the transcripts (Fig. 1b and Supplementary Fig. 1). Theoretically, mutant WIPI4 would be severely truncated in subjects 2, 3 and 4 and relatively conserved in subjects 1 and 5 (Fig. 1e). As human female cells are subject to X-chromosome inactivation, subjects with a WDR45 mutation may have two cell populations: one expressing a wild-type allele and the other expressing a mutant allele. Notably, whereas both wild-type and mutant alleles were expressed in the LCLs of subject 3, the LCLs of the other four affected subjects exclusively expressed mutant transcripts, suggesting that the wild-type alleles underwent X inactivation in most cells (Fig. 1b and Supplementary Fig. 1). In fact, X-inactivation analysis with genomic DNA from peripheral leukocytes showed a skewed pattern in subjects 2, 4 and 5 (analysis was non-informative in subject 1) (Supplementary Table 3). However, it is unknown whether the wild-type allele underwent X inactivation in brain tissues as in LCLs and leukocytes from the subjects.

The clinical features of the individuals with SENDA possessing WDR45 mutations are summarized in Table 1 (see also the Supplementary Note). Subjects 1 and 3 have been described recently7,8. These individuals showed psychomotor developmental delay from infancy and severe intellectual disability, while their motor function gradually developed. In adulthood, severe progressive dystonia-parkinsonism and dementia developed. Four of the subjects became bedridden within a few years of onset of cognitive decline. In all subjects, blood concentrations of ceruloplasmin, copper, iron, ferritin and lactic acid were normal. Brain MRI showed T1-weighted signal hyperintensity in the substantia nigra with a central T1-weighted hypointensity band (Fig. 2a–e) and T2-weighted signal hypointensity, suggesting iron deposition in the globus pallidus and substantia nigra (Fig. 2f–h), which are characteristic of SENDA. In addition, significant cerebral atrophy was found (Fig. 2i,j). Substantial differences in the severity of clinical findings were not observed among the five subjects.

WIPI1, WIPI2, WIPI3 and WIPI4, mammalian Atg18 homologs, have an important role in the autophagy pathway4,5. Autophagy is the major intracellular degradation system by which cytoplasmic materials are enclosed by double-membrane structures called autophagosomes and subsequently delivered to lysosomes for degradation9. More than 30 autophagy-related (ATG) genes have been identified in yeast10,11, many of which are conserved in higher eukaryotes and are essential for the formation of the autophagosome10,12. These factors include subunits of the class III phosphatidylinositol 3-kinase complex, and generation of the lipid phosphatidylinositol 3-phosphate is essential for autophagosome formation. Atg18 in yeast and WIPI subunits in mammals associate with membranes through a phosphoinositide-binding motif (FRRG) within a seven propeller structure13–15. Atg18 and WIPI proteins also interact with Atg2 and its homologs in yeast and mammalian cells, respectively16,17.

Autophagic activity in relation to WIPI4 expression was examined using LCLs from the subjects. Immunoblot analysis of WIPI4 showed lower expression in all five subjects compared to unaffected individuals (Fig. 3a). Although mutant WIPI4 protein sequence was relatively conserved in subjects 1 and 5, the expression of mutant WIPI4 in both subjects was severely decreased, similar to that of subjects 2, 3 and 4, in whom mutant WIPI4 was truncated. This suggests that all the mutant proteins are structurally unstable and undergo degradation. To examine the effect of the WDR45 mutations on autophagy, an autophagic flux assay was performed using LCLs. When lysosomal degradation was blocked by the lysosomal inhibitor chloroquine, the amount of LC3-II (the membrane-bound form) was higher than in cells without the inhibitor, as for control LCLs (Fig. 3b and Supplementary Fig. 2)18. The differences in LC3-II amounts between samples with and without chloroquine represent the amount of LC3 on autophagic structures delivered to lysosomes for degradation18. In the LCLs from affected subjects, accumulation of LC3-II was observed, even under normal conditions, which was more apparent when autophagy was induced by the mTORC1 inhibitor Torin1 (Supplementary Fig. 2a–d). The increase in the LC3-II amount by concomitant chloroquine treatment was significantly suppressed, or tended to be suppressed, in the LCLs from affected subjects, suggesting that the autophagic flux was blocked, probably incompletely, at an intermediate step of autophagosome formation (Fig. 3b and Supplementary Fig. 2e). Consistent with the immunoblot analysis, immunofluorescence microscopy showed the accumulation of LC3-containing autophagic
Table 1 Clinical features of subjects with SENDA with a WDR45 mutation

Feature | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
Age | 33 years | 28 years | 40 years | 51 years | 33 years
Sex | Female | Female | Female | Female | Female
Mutation | c.439+1G>T | c.516G>C | c.437dupA | c.637C>T | c.1033_1034dupAA
Protein alteration | p.[Gly147Val; Val147_Leu148ins8] | p.Asp174Valfs*29 | p.Leu148Alafs*3 | p.Gln213* | p.Asn345Lysfs*67

Neurological symptoms
Current status | Bedridden | Wheelchair | Bedridden | Bedridden | Bedridden
Initial symptom | Psychomotor retardation | Psychomotor retardation | Psychomotor retardation | Psychomotor retardation | Psychomotor retardation
Initial walking | 3 years | 2 years 7 months | 2 years 2 months | 1 year 6 months | 1 year 6 months
Speech ability | No word | One word | No word | Two-word sentences | Few words
Cognitive dysfunction during childhood | Nonprogressive | Nonprogressive | Nonprogressive | Nonprogressive | Nonprogressive
Start of cognitive decline | 26 years | 25 years | 30 years | 24 years | 23 years
Period until bedridden after decline | 4 years | | 3 years | 1 year | 6 years
Dystonia | + | + | + | + | +
Parkinsonism | Rigidity, akinesia | Rigidity, akinesia | Rigidity | Rigidity | Rigidity, tremor, impairment of postural reflex
Progressive dementia during adulthood | + | + | + | + | +
Psychiatric symptoms | Aggressive behaviors | Aggressive behaviors | None | None | Anxiety
Epileptic seizure | + | + | FS | None | +

Radiological features
MRI
Iron deposition | Globus pallidus, substantia nigra | Globus pallidus, substantia nigra | Globus pallidus, substantia nigra | Globus pallidus, substantia nigra | Globus pallidus, substantia nigra
Central band of T1 hypointensity | + | + | + | + | +
Cerebral atrophy | Moderate at 25 years, remarkable at 32 and 33 years | Moderate at 25 and 27 years | Mild at 33 years, remarkable at 39 years | Mild at 27 years, remarkable at 46 years | Remarkable at 33 years
Eye of the tiger sign | | | | |
White matter involvement | | | | |
Cerebellar atrophy | Mild at 25, 32 and 33 years | Mild at 25 and 27 years | Mild at 33 and 39 years | Mild at 27 and 46 years | Mild at 33 years
CT findings | High density in globus pallidus | Mild high density in substantia nigra | High density in substantia nigra | High density in ventral midbrain | High density in globus pallidus

Neurophysiological examination
EEG | Bilateral frontal spike | Bilateral frontal spike, low voltage, slow wave | Low voltage | Abnormal | Abnormal
EMG | NE | NE | Dystonic pattern | Normal | NE
VEP | Normal | NE | Prolonged P100 latency | NE | Normal
ABR | Low amplitude, normal latency | NE | No response at 100 dB | NE | NE

FS, febrile seizure; EEG, electroencephalogram; EMG, electromyogram; VEP, visual evoked potential; ABR, auditory brainstem response; NE, not examined.

structures in the LCLs from affected subjects, some of which were abnormally enlarged compared with those observed in control LCLs (Fig. 3c,d). Therefore, we examined whether these LC3-positive
Figure 2 Brain MRIs at 3.0 T and 1.5 T. (a–e) T1-weighted imaging shows hyperintensity of the substantia nigra with a central band of T1-weighted hypointensity (arrowheads). Images are shown for subject 1 at 33 years (a), subject 2 at 25 years (b), subject 3 at 39 years (c), subject 4 at 46 years (d) and subject 5 at 33 years (e). (f–h) T2-weighted imaging shows marked hypointensity of the globus pallidus (arrows), suggesting iron deposition. Cerebral atrophy and mild cerebellar atrophy are also seen. Images are shown for subject 1 (f), subject 2 (g) and subject 3 (h). (i,j) The fluid-attenuated inversion recovery (FLAIR) image of subject 1 (i) and the T1-weighted FLAIR coronal image of subject 2 (j) also show cerebral atrophy.

structures in fact included premature or abnormal autophagic structures. A recent study showed that knockdown of Wdr45 in rat kidney cells and mutation in epg-6 (encoding a WIPI4 homolog)


Figure 3 Defective autophagy in LCLs derived from subjects with SENDA. (a) Immunoblot analysis of the WIPI4 protein (apparent mobility at ~35 kDa) in LCLs (top). Truncated forms were not detected (middle). HSP90 was used as a loading control (bottom). An asterisk indicates nonspecific immunoreactive bands. (b) Cells were treated with 250 nM Torin1 in the presence or absence of 20 µM chloroquine for 2 h. Cell lysates were analyzed by SDS-PAGE and immunoblotting using antibodies to LC3, WIPI4 and HSP90. The positions of LC3-I (cytosolic) and LC3-II (membrane bound) are indicated. (c–e) Cells were cultured in the presence of Torin1 for 2 h. (c) Cytospun cells were fixed and analyzed by immunofluorescence microscopy using antibodies to LC3 and ATG9A. Abnormal colocalization of LC3 with ATG9A was observed in the LCLs of affected subjects. Scale bars, 10 µm and 1 µm in the inset. (d,e) The numbers of LC3+ (d) and LC3+ATG9A+ (e) foci were quantified from more than 20 images from 3 independent samples (Online Methods). Data are presented as mean ± s.e.m. *P < 0.05, ANOVA followed by Bonferroni-Dunn post hoc test.


in Caenorhabditis elegans cause accumulation of early autophagic structures5. One supposed function of WIPI4 (Epg-6) is to regulate the distribution of ATG9A-marked vesicles5, which transiently localize to the autophagosome formation site and induce autophagosome formation19,20. ATG9A is absent from completed autophagosomes in mammalian cells; therefore, colocalization of ATG9A and LC3 is rare. However, enlarged structures positive for both ATG9A and LC3 accumulated in LCLs from all five subjects (Fig. 3c,e), indicating improper autophagosome formation. The importance of the housekeeping activity of autophagy in neurons, as well as the ubiquitin-proteasome system, has been demonstrated in mice. Mice lacking autophagy in the central nervous system developed progressive motor and behavioral deficits21,22. Histologically, inclusion bodies containing polyubiquitinated proteins were observed in neurons, and their size and number increased with age21,22. Neuronal cell death was observed in subsets of neurons21,22, implying that the impairment of autophagy contributes to the pathogenesis of neurodegenerative disorders. Indeed, dysregulation of autophagy has been suggested in various neurodegenerative disorders in humans23. In addition, mutations in PARK2 and PINK1, both of which cause familial Parkinson's disease24,25, impair the selective autophagic degradation of damaged mitochondria, called mitophagy (PARK2, also called Parkin, is recruited to damaged mitochondria in a PINK1-dependent manner)26,27. However, a direct link between the core autophagy machinery and human neurodegenerative disorders has not been reported. Here, we showed that mutations in WDR45, a core autophagy gene, result in a neurodegenerative disorder. Notably, the autophagy defects were partial, implying that some autophagic activity could be maintained in the neurons of affected subjects.
We hypothesize that this might explain why childhood intellectual disability in individuals with SENDA remains static until adulthood, unlike in other forms of NBIA1–3. In contrast to heterozygous WDR45 mutations in females, hemizygous germline mutations in males, leading to the expression of mutant WDR45 in all cells, possibly cause lethal phenotypes from complete loss of WDR45 function, as mice defective in autophagy die shortly after birth28–32. While this paper was under review, Haack et al. reported WDR45 mutations in 20 subjects, including 3 males, 1 of whom possessed

a mutation that was somatic mosaic, supporting the idea that male germline mutations could be lethal33. WDR45 is widely expressed in human tissues, with the highest expression found in skeletal muscle34. Nevertheless, SENDA phenotypes seem to be limited to the brain. These facts may reflect cell type-dependent differences: autophagy could be more important in neurons (non-dividing, terminally differentiated cells) than in LCLs (rapidly dividing cells). In addition, it is possible that the other WIPI homologs (WIPI1, WIPI2 and WIPI3) could compensate for the deficiency in WIPI4 in a cell type-dependent manner, and the relative contribution of WIPI4 among WIPI factors may be high in neurons. In conclusion, heterozygous mutations of X-linked WDR45, a core autophagy gene, were identified in SENDA, providing direct evidence that an autophagy defect is indeed associated with a neurodegenerative disorder in humans.

URLs. NHLBI Exome Sequencing Project, http://evs.gs.washington.edu/EVS/; Picard, http://picard.sourceforge.net/; SAMtools, http://samtools.sourceforge.net/; dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/.

Methods
Methods and any associated references are available in the online version of the paper.

Accession codes. Reference sequences are available from GenBank for Homo sapiens WDR45 transcript variant 1 mRNA (NM_007075.3) and WIPI4 isoform 1 (NP_009006.2).
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We would like to thank the individuals with SENDA and their families for their participation in this study. We thank M. Shiina and K. Ogata for their helpful comments on the protein structure. This work was supported by research grants from the Ministry of Health, Labour and Welfare (H.S., N. Miyake and N. Matsumoto), the Japan Science and Technology Agency (N. Matsumoto) and the Strategic Research Program for Brain Sciences (N. Matsumoto) and by a Grant-in-Aid for Scientific Research on Innovative Areas (Transcription Cycle) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (N. Matsumoto), a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (N. Matsumoto), a Grant-in-Aid for Young Scientists from the Japan Society for the Promotion of Science (H.S. and N. Miyake), the Funding Program for Next-Generation World-Leading Researchers (N. Mizushima) and a grant from the Takeda Science Foundation (N. Miyake, N. Mizushima and N. Matsumoto).

AUTHOR CONTRIBUTIONS
H.S., N. Mizushima and N. Matsumoto designed and directed the study. H.S., T.N., K.M., N. Mizushima and N. Matsumoto wrote the manuscript. K.M., S.K., K.S., E.K.-Y., N.S., H.N., A.H., F.R., S.Y., H.A. and M.K. collected samples and provided the subjects' clinical information. H.S., H.K., K.N., Y.T., M.N. and N. Miyake performed exome sequencing and Sanger sequencing. H.S. and K.N. performed the RNA analysis. Y.K. performed the X-inactivation analysis. T.N. and N. Mizushima analyzed protein expression and autophagic activity.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Gregory, A., Polster, B.J. & Hayflick, S.J. Clinical and genetic delineation of neurodegeneration with brain iron accumulation. J. Med. Genet. 46, 73–80 (2009).
2. Kruer, M.C. et al. Neuroimaging features of neurodegeneration with brain iron accumulation. AJNR Am. J. Neuroradiol. 33, 407–414 (2012).
3. Schneider, S.A. & Bhatia, K.P. Syndromes of neurodegeneration with brain iron accumulation. Semin. Pediatr. Neurol. 19, 57–66 (2012).
4. Polson, H.E. et al. Mammalian Atg18 (WIPI2) localizes to omegasome-anchored phagophores and positively regulates LC3 lipidation. Autophagy 6, 506–522 (2010).
5. Lu, Q. et al. The WD40 repeat PtdIns(3)P-binding protein EPG-6 regulates progression of omegasomes to autophagosomes. Dev. Cell 21, 343–357 (2011).
6. Gregory, A. & Hayflick, S.J. Genetics of neurodegeneration with brain iron accumulation. Curr. Neurol. Neurosci. Rep. 11, 254–261 (2011).
7. Kimura, Y. et al. MRI, MR spectroscopy, and diffusion tensor imaging findings in patient with static encephalopathy of childhood with neurodegeneration in adulthood (SENDA). Brain Dev. published online; doi:10.1016/j.braindev.2012.07.008 (11 August 2012).
8. Kasai-Yoshida, E. et al. First video report of static encephalopathy of childhood with neurodegeneration in adulthood. Mov. Disord. published online; doi:10.1002/mds.25158 (6 February 2013).
9. Mizushima, N. & Komatsu, M. Autophagy: renovation of cells and tissues. Cell 147, 728–741 (2011).
10. Nakatogawa, H., Suzuki, K., Kamada, Y. & Ohsumi, Y. Dynamics and diversity in autophagy mechanisms: lessons from yeast. Nat. Rev. Mol. Cell Biol. 10, 458–467 (2009).
11. Xie, Z. & Klionsky, D.J. Autophagosome formation: core machinery and adaptations. Nat. Cell Biol. 9, 1102–1109 (2007).
12. Mizushima, N., Yoshimori, T. & Ohsumi, Y. The role of Atg proteins in autophagosome formation. Annu. Rev. Cell Dev. Biol. 27, 107–132 (2011).
13. Baskaran, S., Ragusa, M.J., Boura, E. & Hurley, J.H. Two-site recognition of phosphatidylinositol 3-phosphate by PROPPINs in autophagy. Mol. Cell 47, 339–348 (2012).
14. Krick, R. et al. Structural and functional characterization of the two phosphoinositide binding sites of PROPPINs, a β-propeller protein family. Proc. Natl. Acad. Sci. USA 109, E2042–E2049 (2012).
15. Watanabe, Y. et al. Structure-based analyses reveal distinct binding sites for Atg2 and phosphoinositides in Atg18. J. Biol. Chem. 287, 31681–31690 (2012).
16. Suzuki, K., Kubota, Y., Sekito, T. & Ohsumi, Y. Hierarchy of Atg proteins in preautophagosomal structure organization. Genes Cells 12, 209–218 (2007).
17. Velikkakath, A.K., Nishimura, T., Oita, E., Ishihara, N. & Mizushima, N. Mammalian Atg2 proteins are essential for autophagosome formation and important for regulation of size and distribution of lipid droplets. Mol. Biol. Cell 23, 896–909 (2012).
18. Mizushima, N., Yoshimori, T. & Levine, B. Methods in mammalian autophagy research. Cell 140, 313–326 (2010).
19. Itakura, E., Kishi-Itakura, C., Koyama-Honda, I. & Mizushima, N. Structures containing Atg9A and the ULK1 complex independently target depolarized mitochondria at initial stages of Parkin-mediated mitophagy. J. Cell Sci. 125, 1488–1499 (2012).
20. Orsi, A. et al. Dynamic and transient interactions of Atg9 with autophagosomes, but not membrane integration, are required for autophagy. Mol. Biol. Cell 23, 1860–1873 (2012).
21. Hara, T. et al. Suppression of basal autophagy in neural cells causes neurodegenerative disease in mice. Nature 441, 885–889 (2006).
22. Komatsu, M. et al. Loss of autophagy in the central nervous system causes neurodegeneration in mice. Nature 441, 880–884 (2006).
23. Menzies, F.M., Moreau, K. & Rubinsztein, D.C. Protein misfolding disorders and macroautophagy. Curr. Opin. Cell Biol. 23, 190–197 (2011).
24. Valente, E.M. et al. Hereditary early-onset Parkinson's disease caused by mutations in PINK1. Science 304, 1158–1160 (2004).
25. Kitada, T. et al. Mutations in the parkin gene cause autosomal recessive juvenile parkinsonism. Nature 392, 605–608 (1998).
26. Youle, R.J. & van der Bliek, A.M. Mitochondrial fission, fusion, and stress. Science 337, 1062–1065 (2012).
27. Youle, R.J. & Narendra, D.P. Mechanisms of mitophagy. Nat. Rev. Mol. Cell Biol. 12, 9–14 (2011).
28. Kuma, A. et al. The role of autophagy during the early neonatal starvation period. Nature 432, 1032–1036 (2004).
29. Saitoh, T. et al. Loss of the autophagy protein Atg16L1 enhances endotoxin-induced IL-1β production. Nature 456, 264–268 (2008).
30. Saitoh, T. et al. Atg9a controls dsDNA-driven dynamic translocation of STING and the innate immune response. Proc. Natl. Acad. Sci. USA 106, 20842–20846 (2009).
31. Sou, Y.S. et al. The Atg8 conjugation system is indispensable for proper development of autophagic isolation membranes in mice. Mol. Biol. Cell 19, 4762–4775 (2008).
32. Komatsu, M. et al. Impairment of starvation-induced and constitutive autophagy in Atg7-deficient mice. J. Cell Biol. 169, 425–434 (2005).
33. Haack, T.B. et al. Exome sequencing reveals de novo WDR45 mutations causing a phenotypically distinct, X-linked dominant form of NBIA. Am. J. Hum. Genet. 91, 1144–1149 (2012).
34. Proikas-Cezanne, T. et al. WIPI-1α (WIPI49), a member of the novel 7-bladed WIPI protein family, is aberrantly expressed in human cancer and is linked to starvation-induced autophagy. Oncogene 23, 9314–9325 (2004).


ONLINE METHODS

Subjects. We analyzed five Japanese individuals with SENDA. Diagnosis was made on the basis of clinical features, including psychomotor retardation in early childhood that was static for decades and severe progressive dystonia-parkinsonism and dementia after several decades, as well as characteristic findings on brain MRI scans. Genomic DNA was isolated from blood leukocytes according to standard protocols. The Institutional Review Board of Yokohama City University approved the experimental protocols. Informed consent was obtained for all individuals included in this study in agreement with the requirements of Japanese regulations. Clinical information on the subjects with a WDR45 mutation is presented in Table 1 and in the Supplementary Note.

Mutation screening. Mutation screening of exons 3–12 covering the WDR45 coding region (of transcript variant 1, GenBank accession NM_007075.3) was performed by direct sequencing. PCR was performed in a 20-µl mixture containing 1 µl of genomic DNA, 1× PCR Buffer for KOD FX NEO, 0.4 mM of each dNTP, 0.3 µM of each primer and 0.4 U of KOD FX NEO polymerase (Toyobo). Details on PCR conditions and primer sequences are given in Supplementary Table 4.

Exome sequencing. Genomic DNA was captured using the SureSelect Human All Exon v4 kit (51 Mb; Agilent Technologies) and sequenced with four samples per lane on an Illumina HiSeq2000 with 101-bp paired-end reads. Image analysis and base calling were performed by sequence control software with real-time analysis and CASAVA software v1.8 (Illumina). Reads were aligned to GRCh37 with Novoalign (Novocraft Technologies). Duplicate reads were marked using Picard (see URLs) and excluded from downstream analysis. After merging the BAM files of all members in each family using SAMtools, local realignments around indels and base quality score recalibration were performed with the Genome Analysis Toolkit (GATK)35. Single-nucleotide variants and small indels were identified using the GATK UnifiedGenotyper and filtered according to the Broad Institute's best-practice guidelines (version 3). Variants registered in dbSNP135, which were not flagged as clinically associated, were excluded. Variants that passed the filters were annotated using ANNOVAR36.

RNA analysis. LCLs were established from five affected subjects and their family members. RT-PCR using total RNA extracted from LCLs was performed as previously described37. Briefly, 4 µg of total RNA extracted with an RNeasy Plus Mini kit (Qiagen) was subjected to reverse transcription, and 2 µl of cDNA was used for PCR. Details on primer sequences and PCR conditions are given in Supplementary Table 4. PCR products were electrophoresed in a 10% polyacrylamide gel and sequenced.

X-inactivation analysis. The X-inactivation pattern was studied using the human androgen receptor (HUMARA) assay and a fragile X mental retardation (FRAXA) locus methylation assay as previously described38–40. Briefly, genomic DNA from the subjects, a control male and a control female was digested with two methylation-sensitive enzymes, HpaII and HhaI. Details on PCR conditions and primer sequences are given in Supplementary Table 4. Fluorescently labeled products were analyzed on an ABI PRISM 3500 Genetic Analyzer with GeneMapper Software version 4.0 (Applied Biosystems). X-inactivation ratios of less than or equal to 80:20 were considered to represent a random pattern, ratios greater than 80:20 were considered to represent a skewed pattern, and ratios greater than 90:10 were considered to represent a markedly skewed pattern38.

Cell culture. LCLs were cultured in RPMI 1640 supplemented with 10% FBS, L-glutamine, tylosin and antibiotic-antimycotic solution in a 5% CO2 incubator.

Immunoblotting. An affinity-purified rabbit polyclonal antibody against WIPI4 peptide antigen (CFPDNPRKLFEFDTRDNP, amino acids 129–145) was generated by Medical & Biological Laboratories. The specificity of the antibody was tested using lysate from HeLa cells in which WDR45 was knocked down. For immunoblot analysis, cells were lysed with lysis buffer (50 mM Tris-HCl, pH 7.5, 150 mM NaCl, 1 mM EDTA, 1% Triton X-100, 1 mM phenylmethanesulfonyl fluoride and a protease inhibitor cocktail (Complete EDTA-free protease inhibitor, Roche)). Cell lysates were clarified by centrifugation at 12,000g for 20 min and analyzed by SDS-PAGE and immunoblotting using antibodies to WIPI4, LC3 (ref. 41) and HSP90 (BD Transduction Laboratories, 610418). Signal intensities were analyzed using a LAS-3000 mini imaging analyzer and Multi Gauge software version 3.0 (Fujifilm). Contrast and brightness adjustments were applied to the images using Photoshop 7.0.1 (Adobe Systems).

Fluorescence microscopy. LCLs were spun onto a glass slide at 500 RPM (28g) for 1 min in a Shandon Cytospin 4 cytofuge (Thermo Electron). Cells were fixed with 4% paraformaldehyde, permeabilized using 50 µg/ml digitonin and then stained with antibodies to LC3 (clone 1703, Cosmo Bio) and Atg9A (ref. 19). Cells were observed with a confocal laser microscope (FV1000D IX81, Olympus) using a 60× PlanApoN oil immersion lens (1.42 numerical aperture (N.A.), Olympus). For final output, images were processed using Adobe Photoshop 7.0.1 software. The number of staining foci was determined as follows: foci were extracted using the top hat operation (parameter of 300 × 300 pixel area), and a binary image was created. Small foci (with an area of less than 10 × 10 pixels) were removed using an open operation. The number of foci was counted using the integrated morphometry analysis program. False foci were removed by comparison with the original image.

Statistical analysis. Differences were analyzed statistically using unpaired t tests or analysis of variance (ANOVA) with a Bonferroni-Dunn post hoc test.
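The X-inactivation thresholds stated in the methods above (random at or below 80:20, skewed above 80:20, markedly skewed above 90:10) can be written as a small classifier. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, and the two inputs are assumed to be allele peak areas from the digested sample that have already been corrected as the assay requires.

```python
def major_allele_percent(peak_a: float, peak_b: float) -> float:
    """Percentage of the total signal carried by the stronger allele (50 to 100)."""
    total = peak_a + peak_b
    if peak_a < 0 or peak_b < 0 or total <= 0:
        raise ValueError("peak areas must be non-negative with a positive sum")
    return 100.0 * max(peak_a, peak_b) / total


def classify_x_inactivation(percent: float) -> str:
    """Apply the thresholds from the text to a major-allele percentage."""
    if not 50.0 <= percent <= 100.0:
        raise ValueError("major-allele percentage must lie between 50 and 100")
    if percent > 90.0:   # ratio greater than 90:10
        return "markedly skewed"
    if percent > 80.0:   # ratio greater than 80:20
        return "skewed"
    return "random"      # ratio of 80:20 or less


# Hypothetical peak areas, for illustration only.
for peaks in [(50, 50), (80, 20), (85, 15), (95, 5)]:
    print(peaks, classify_x_inactivation(major_allele_percent(*peaks)))
```

The boundaries are deliberately strict inequalities, so a ratio of exactly 80:20 is called random and exactly 90:10 is called skewed, matching the wording of the thresholds.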
35. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
36. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
37. Saitsu, H. et al. STXBP1 mutations in early infantile epileptic encephalopathy with suppression-burst pattern. Epilepsia 51, 2397–2405 (2010).
38. Kondo, Y. et al. A family of oculofaciocardiodental syndrome (OFCD) with a novel BCOR mutation and genomic rearrangements involving NHS. J. Hum. Genet. 57, 197–201 (2012).
39. Allen, R.C., Zoghbi, H.Y., Moseley, A.B., Rosenblatt, H.M. & Belmont, J.W. Methylation of HpaII and HhaI sites near the polymorphic CAG repeat in the human androgen-receptor gene correlates with X chromosome inactivation. Am. J. Hum. Genet. 51, 1229–1239 (1992).
40. Carrel, L. & Willard, H.F. An assay for X inactivation based on differential methylation at the fragile X locus, FMR1. Am. J. Med. Genet. 64, 27–30 (1996).
41. Hosokawa, N., Hara, Y. & Mizushima, N. Generation of cell lines with tetracycline-regulated autophagy and a role for autophagy in controlling cell size. FEBS Lett. 580, 2623–2629 (2006).


doi:10.1038/ng.2562

letters

Sequencing ancient calcified dental plaque shows changes in oral microbiota with dietary shifts of the Neolithic and Industrial revolutions
Christina J Adler1–3, Keith Dobney4, Laura S Weyrich1,2, John Kaidonis5, Alan W Walker6, Wolfgang Haak1,2, Corey J A Bradshaw2,7,8, Grant Townsend5, Arkadiusz Sołtysiak9, Kurt W Alt10, Julian Parkhill6 & Alan Cooper1,2

The importance of commensal microbes for human health is increasingly recognized1–5, yet the impacts of evolutionary changes in human diet and culture on commensal microbiota remain almost unknown. Two of the greatest dietary shifts in human evolution involved the adoption of carbohydrate-rich Neolithic (farming) diets6,7 (beginning ~10,000 years before the present6,8) and the more recent advent of industrially processed flour and sugar (in ~1850)9. Here, we show that calcified dental plaque (dental calculus) on ancient teeth preserves a detailed genetic record throughout this period. Data from 34 early European skeletons indicate that the transition from hunter-gatherer to farming shifted the oral microbial community to a disease-associated configuration. The composition of oral microbiota remained unexpectedly constant between Neolithic and medieval times, after which (the now ubiquitous) cariogenic bacteria became dominant, apparently during the Industrial Revolution. Modern oral microbiotic ecosystems are markedly less diverse than historic populations, which might be contributing to chronic oral (and other) disease in postindustrial lifestyles.

Commensal microbiota comprise the majority of cells in the body and have a key role in human health1–5,10. However, their evolution remains poorly understood, and detailed genetic records from commensal bacteria have yet to be recovered from the archaeological record. Dental calculus is ubiquitous in both present-day and ancient human populations11, and microscopic analysis has shown that it accurately preserves bacterial morphology over millennia12–14. Dental calculus develops when dental plaque, an extremely dense bacterial biofilm15, becomes mineralized with calcium phosphate16. Bacteria in calculus become locked in a crystalline matrix similar to bone16 (Supplementary Fig. 1), with deposits occurring both above
and below the gum or gingiva (supra- and subgingivally)17. Calculus represents one of the few sources of preserved human and hominid microbiota, and genetic analysis has the potential to create a powerful new record of dietary impacts, health changes and oral pathogen genomic evolution deep into the past. In addition, oral bacteria are transferred vertically from the primary caregiver(s) in early childhood18 and horizontally between family members later in life18,19, making archaeological dental calculus a potentially unique means of tracing population structure, movement and admixture in ancient cultures, as well as the spread of diseases. The increased consumption of domesticated cereals (wheat and barley in the Near East) beginning with the Neolithic period was associated with a marked increase in the prevalence of dental calculus and oral pathology20. These oral diseases include dental caries (tooth decay)1 and periodontal disease (an infection causing damage to the supporting connective tissues of the tooth and resorption of bone)21, both of which were rare in pre-Neolithic hunter-gatherer societies20 and early hominins22. Caries and periodontal disease are both polymicrobial, plaque-mediated infections, thought to result from perturbation of a healthy, ecologically balanced oral biofilm23,24 that can occur because of dietary changes, such as the increased consumption of fermentable carbohydrates25,26. Caries has become a major endemic disease, affecting 60–90% of school-aged children in industrialized countries, whereas periodontal disease occurs in 5–20% of the adult population worldwide27. Notably, oral bacteria are also associated with many systemic diseases, including arthritis28, cardiovascular disease3 and diabetes4, in addition to other diseases of the oral cavity1,2.
We collected a mixture of supra- and subgingival calculus samples (determined morphologically; Supplementary Note) from the teeth of 34 prehistoric European human skeletons (11 males, 11 females and 12 of unknown sex, ranging in age from <20 to >60 years at death; Supplementary Table 1), dating from before the Mesolithic period

1Australian Centre for Ancient DNA, School of Earth and Environmental Sciences, The University of Adelaide, Adelaide, South Australia, Australia. 2Environment Institute, The University of Adelaide, Adelaide, South Australia, Australia. 3Institute of Dental Research, Westmead Millennium Institute, Faculty of Dentistry, University of Sydney, Sydney, New South Wales, Australia. 4Department of Archaeology, School of Geosciences, University of Aberdeen, Aberdeen, UK. 5School of Dentistry, The University of Adelaide, Adelaide, South Australia, Australia. 6The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. 7School of Earth and Environmental Sciences, The University of Adelaide, Adelaide, South Australia, Australia. 8South Australian Research and Development Institute, Henley Beach, South Australia, Australia. 9Department of Bioarchaeology, Institute of Archaeology, University of Warsaw, Warsaw, Poland. 10Institute for Anthropology, Johannes Gutenberg University of Mainz, Mainz, Germany. Correspondence should be addressed to A.C. (alan.cooper@adelaide.edu.au). Received 2 May 2012; accepted 29 December 2012; published online 17 February 2013; doi:10.1038/ng.2536

Table 1 Archaeological and anthropological samples used in the study
Sample ID | Museum | Group or culture | Period (years BP) | Location
12011; 12012; 12013; 12015; 12016; 12017 | VI-1 B; VI-7 A; VI-14 A; VI-2 D; VI-1 A; VI-7 A | Hunter-gatherer | Mesolithic/Paraneolithic (7,550–5,450) | Dudka, Poland
8215; 8240; 8247; 8275; 8277 | HK2000:4083a, 613.1; HK2000:4228a, 861; HK2000:4233a, 870; HK2000:7374a, 1324; HK2000:4014b, 413.1 | LBK | Neolithic (7,400–6,725) | Halberstadt-Sonntagsfeld, Germany
4331 | HK2004:9463a, 6255.1 | Bell Beaker | Neolithic (4,450–4,000) | Quedlinburg XII, Germany
9436 | HK, 43 | LN/BA | Late Neolithic/Bronze Age (4,150–3,600) | Benzingerode-Heimburg, Germany
8890; 8891; 8894 | T82GF 14Barrow 163 T98 | Bronze Age | Bronze Age (4,100–2,800) | Yorkshire, England
8326; 8330; 8332; 8477; 8482; 8814; 8863 | 2095; 2106; 4440; 2357; 2654; 4161; 4485 | Jewbury | Late Medieval (750–650) | York, England
8333; 8335; 8337; 8341; 8868; 8869 | R5287; R5252; R5136; R5206; R5157; R5229 | Raunds Furnells | Early Medieval (1,100–850) | Northamptonshire, England
8873; 8874; 8877; 8878; 8883 | 5228; 5241; 5113; 5203; 5244 | St. Helen-on-the-Walls | Late Medieval (1,000–400) | York, England
1; 2; 3; 4; 5; 6; 7; 8; 9; 10 | NA | European descent | 0 | Adelaide, Australia


Mesolithic/Paraneolithic is the terminology used for the transitional cultures of the forest zone of eastern Europe. Further information about the ancient calculus and modern plaque and calculus samples is provided in Supplementary Table 1.

(before farming) to the medieval period. The samples were collected from the remains of the last hunter-gatherers in Poland and the earliest farming culture in Europe (the Linear Pottery Culture, LBK), as well as late Neolithic (Bell Beaker Culture), early and later Bronze Age, and medieval rural and urban populations (Table 1; description provided in the Supplementary Note). All work on ancient DNA was conducted in a physically isolated, specialist laboratory dedicated to ancient environmental and bacterial DNA research at the Australian Centre for Ancient DNA, using strict decontamination and authentication protocols (Supplementary Note). We extracted bacterial DNA from sterilized ancient calculus samples (n = 34) and generated PCR amplicon libraries of the 16S rRNA gene, targeting three hypervariable regions (V1, V3 and V6) with barcoded primers (Supplementary Tables 2 and 3). In addition, primers specific to Streptococcus mutans and Porphyromonas gingivalis were used to detect oral pathogens in ancient dental calculus (Supplementary Table 2). We compared the ancient samples to modern calculus (n = 6) and plaque (n = 13) samples that were extracted and sequenced in an analogous manner (Online Methods and Supplementary Tables 3 and 4). We also extracted and sequenced bacterial DNA from within the teeth that provided the ancient calculus samples to determine the background bacterial contribution of the postmortem depositional environment (n = 6; Supplementary Note). Amplicons generated from extracted samples and multiple extraction blanks were sequenced using both conventional and pyrosequencing technology. Of the 998,575 sequences generated, we discarded ~50% after quality filtering and denoising to remove sequences containing PCR and sequencing errors29, leaving 451,241 sequences (Supplementary Table 4).

At the phylum level, the bacterial composition of ancient calculus was similar to that of modern oral samples and sequences from the Human Oral Microbiome Database (HOMD)30 but markedly distinct from the compositions identified for laboratory reagents (extraction blanks) and environmental samples (soils, sediments and water) and within the ancient teeth themselves (Fig. 1, Supplementary Fig. 2 and Supplementary Note). The archaeological calculus was dominated by Firmicutes (33% for the V3 region, which was the most phylogenetically informative fragment; Supplementary Note), which was found at a frequency comparable to those in both the HOMD (37%) and modern oral samples (average 50%; Supplementary Fig. 3)1,2,21,31. Again, the distribution was clearly distinguishable from those generated with the bacterial sequences obtained from extraction blanks (6% Firmicutes, P = 0.003), environmental samples (1.6% Firmicutes, P < 0.001) and within the ancient teeth (8% Firmicutes, P < 0.001), which were all dominated by Proteobacteria (73%, P = 0.2; 56%, P < 0.001; 31%, P = 0.005, respectively). The sequences from the extraction blanks were typical of bacterial communities found in clean-room environments32 and non-template controls33. In addition to Firmicutes, the ancient dental calculus samples contained all 15 phyla commonly found in the modern human oral cavity30, with high percentages of Actinobacteria (19%), as is observed in modern calculus deposits (7%). We have shown that dental calculus from samples that are thousands of years old preserves representative and informative microbial signatures of past human-associated microbiota.

Phylogenetic analyses of the β diversity, which measures the number of operational taxonomic units (OTUs) that are unique between the groups (Supplementary Note), confirmed that the V3 sequences from ancient calculus were clearly more similar to those from modern dental calculus, plaque and saliva samples1,2,21,31 than to those from environmental samples34–40 (Fig. 2a,b, Supplementary Fig. 4 and Supplementary Table 5). Similar patterns were observed for sequences from V1 and V6 (Supplementary Fig. 5 and Supplementary Note). Furthermore, the bacterial sequences of the ancient calculus samples clustered separately from the sequences present within the ancient teeth (P = 0.002; Supplementary Fig. 6 and Supplementary Table 6). Overall, these results strongly suggest that DNA sequences from ancient calculus samples are not the result of contamination from the postmortem environment.

Figure 1 Phylum-level microbial composition of ancient dental calculus deposits. The distribution is similar to that of modern oral samples and distinct from those of non-template controls, ancient human teeth and environmental samples. The phylum frequencies for the V3 region are presented for the ancient calculus samples (BB, Bell Beaker), modern oral samples, which included pyrosequenced (calculus, plaque and saliva31) and cloned (plaque1,2,21) data, non-template controls (or extraction blanks), ancient human teeth and environmental samples (freshwater, sediments and soils34–40) (Supplementary Table 1). Phylum frequencies from HOMD were generated from partial and full-length sequences of the 16S rRNA gene. The phyla with a frequency of <1% include ABY1_OD1, AD3, Armatimonadetes, BRC1, CCM11b, Chlamydiae, Chlorobi, Cyanobacteria, Elusimicrobia, Euryarchaeota, Fibrobacteres, GAL15, Gemmatimonadetes, GN02, GN04, GOUTA4, KSB1, Lentisphaerae, NC10, Nitrospirae, NKB19, OP11, OP3, OP9, PAUC34f, Planctomycetes, SBR1093, SC3, SC4, SM2F11, SPAM, Spirochaetes, SR1, Tenericutes, Thermi, TM6, Verrucomicrobia, WPS-2, WS3 and ZB2.

[Figure 1 labels. Periods: Mesolithic, Neolithic, Bronze Age, Medieval, Modern. Legend: Unclassified bacteria, Acidobacteria, Actinobacteria, Bacteroidetes, Chloroflexi, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes, Synergistetes, TM7, Under 1%. y-axis: Frequency (%).]

[Figure 2a,b axes: PC1 (13.17% of variation explained), PC2 (12.57%) and PC3 (7.40%). Legend: Mesolithic (n = 6), Neolithic (n = 6), Bronze Age (n = 4), Medieval (n = 18), extraction blanks (n = 2), modern calculus (n = 6), modern plaque (n = 13), caries plaque (n = 5), healthy plaque (n = 5), periodontal plaque (n = 10), saliva (n = 5), freshwater (n = 1), sediment (n = 4), soil, pyrosequenced (n = 8), soil, United States (n = 3), soil, Germany (n = 1), soil, Scotland (n = 1). Figure 2c,d legend: Mesolithic (n = 6), Neolithic (n = 6), Bronze Age (n = 4), Medieval (n = 18), modern calculus (n = 6), modern plaque (n = 13).]

Figure 2 Principal-components plot of β diversity. Principal-components analysis (PCA) shows a close phylogenetic relationship between ancient dental calculus and modern oral samples, both of which are distinct from the non-template controls and environmental samples. β diversity was calculated for all samples (Supplementary Note) using the UniFrac metric for the V3 region, and PCA was applied to the unweighted UniFrac distances. (a,b) Plots of the first and second components (PC1 and PC2) (a) and the second and third components (PC2 and PC3) (b) from PCA clustered the ancient dental calculus samples with the modern oral pyrosequenced data (calculus, plaque and saliva), which were separated from the environmental samples and extraction blanks. (c,d) Restricted PCA plots of PC1 and PC2 (c) and PC2 and PC3 (d) that only include ancient and modern oral pyrosequencing samples separated the hunter-gatherer (Mesolithic) samples from modern, medieval and Neolithic samples.

[Figure 2c,d axes: PC1 (18.43% of variation explained), PC2 (7.94%) and PC3 (5.37%).]
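The unweighted UniFrac distance named in the Figure 2 caption can be illustrated on a toy tree. The sketch below is not the QIIME implementation used in the study; the tree, branch lengths and sample contents (`PARENT`, `LENGTH`, `calculus`, `plaque`, `soil`) are hypothetical. It computes the fraction of branch length that leads to OTUs found in only one of two samples.

```python
# Toy rooted tree (hypothetical): each node maps to its parent; the root has no entry.
PARENT = {"otu1": "n1", "otu2": "n1", "otu3": "n2", "otu4": "n2",
          "n1": "root", "n2": "root"}
# Length of the branch directly above each node (hypothetical values).
LENGTH = {"otu1": 1.0, "otu2": 1.0, "otu3": 1.0, "otu4": 1.0,
          "n1": 0.5, "n2": 0.5}


def branches_to_root(tip):
    """Return the branches (named by their lower node) from a tip up to the root."""
    path = set()
    node = tip
    while node in PARENT:  # stop at the root, which has no parent
        path.add(node)
        node = PARENT[node]
    return path


def covered_branches(tips):
    """Return every branch that leads to at least one OTU observed in a sample."""
    covered = set()
    for tip in tips:
        covered |= branches_to_root(tip)
    return covered


def unweighted_unifrac(tips_a, tips_b):
    """Branch length unique to one sample divided by branch length in either sample."""
    a = covered_branches(tips_a)
    b = covered_branches(tips_b)
    unique = a ^ b   # branches leading to only one of the two samples
    either = a | b   # branches leading to at least one of the two samples
    return sum(LENGTH[x] for x in unique) / sum(LENGTH[x] for x in either)


calculus = {"otu1", "otu2"}  # hypothetical OTU presence per sample
plaque = {"otu2", "otu3"}
soil = {"otu4"}

print(unweighted_unifrac(calculus, plaque))  # 0.625
print(unweighted_unifrac(calculus, soil))    # 1.0
```

A matrix of such pairwise distances is the input to the ordination shown in the figure; samples that share most of their branch length sit close together.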

Figure 3 Changes in the diversity and composition of oral microbiota. (a) For the V3 region sequences, we estimated the phylogenetic diversity50 (Supplementary Note) of the archaeological dental calculus samples (n = 34) and compared them to modern calculus (n = 6) and plaque (n = 13). We estimated phylogenetic diversity from only classified, Gram-positive bacterial sequences to minimize the influence of taphonomic bias (Supplementary Note). Diversity was calculated at a depth of 34 sequences and bootstrapped to assess the robustness of the pattern. Error bars represent bootstrapped diversity values generated by sampling 255 replicates without replacement. BP, years before the present. (b) Specific primers were used to amplify sequences unique to the oral pathogens S. mutans and P. gingivalis. Error bars represent bootstrapped frequencies generated by sampling 255 replicates without replacement.

[Figure 3 panels: (a) phylogenetic diversity by period (Mesolithic, Neolithic, Bronze Age, medieval, modern); (b) S. mutans and P. gingivalis.]
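The Figure 3 procedure (phylogenetic diversity at a depth of 34 sequences, repeated over 255 subsamples drawn without replacement) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tree, branch lengths and read counts are hypothetical, and reading "255 replicates" as 255 independent subsamples of 34 reads is an assumption. Phylogenetic diversity here is Faith's measure, the total branch length connecting the observed OTUs to the root.

```python
import random
import statistics

# Toy rooted tree (hypothetical): node -> parent; the root has no entry.
PARENT = {"otu1": "n1", "otu2": "n1", "otu3": "n2", "otu4": "n2",
          "n1": "root", "n2": "root"}
# Length of the branch directly above each node (hypothetical values).
LENGTH = {"otu1": 1.0, "otu2": 1.0, "otu3": 1.0, "otu4": 1.0,
          "n1": 0.5, "n2": 0.5}


def faith_pd(observed_otus):
    """Sum the lengths of all branches on the paths from observed OTUs to the root."""
    covered = set()
    for tip in observed_otus:
        node = tip
        # Stop early at a branch already counted: everything above it is counted too.
        while node in PARENT and node not in covered:
            covered.add(node)
            node = PARENT[node]
    return sum(LENGTH[b] for b in covered)


def rarefied_pd(reads, depth=34, replicates=255, seed=0):
    """Mean and s.d. of Faith's PD over subsamples drawn without replacement."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        subsample = rng.sample(reads, depth)  # random.sample never reuses a read
        values.append(faith_pd(set(subsample)))
    return statistics.mean(values), statistics.stdev(values)


# One OTU label per read for a single hypothetical sample (40 reads in total).
reads = ["otu1"] * 20 + ["otu2"] * 15 + ["otu3"] * 4 + ["otu4"] * 1
mean_pd, sd_pd = rarefied_pd(reads)
print(mean_pd, sd_pd)
```

Subsampling every sample to the same depth keeps deeply sequenced samples from appearing more diverse simply because more reads were drawn from them.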


The temporal transect of ancient dental calculus samples provides the first indication of the timing and nature of change in human oral bacterial composition and diversity over the last 7,500 years. The composition of oral microbiota underwent a distinct shift with the introduction of farming in the early Neolithic period (Fig. 2c,d), with the earlier hunter-gatherer groups having fewer caries- and periodontal disease-associated taxa (Fig. 3). This is consistent with skeletal evidence showing marked increases in periodontal disease41 following the transition to an agricultural diet, suggesting a major impact on the human oral ecosystem around this time. This increase is thought to be caused by increased amounts of soft carbohydrate foods compared with hunter-gatherer diets26. After the transition to agriculture in the early Neolithic period, there was a notable consistency in the composition of bacteria through the medieval period (~400 years before the present) (Fig. 3), in parallel with the broad similarity of food-processing technologies during these times9. In contrast, today's oral environment is much less biodiverse and is dominated by potentially cariogenic bacteria (for example, S. mutans; Fig. 3a,b, Supplementary Figs. 7–9 and Supplementary Table 7).

Random forest analysis was used to identify the taxa that discriminate the different time periods (Supplementary Note). This analysis showed that Clostridia taxa, such as Clostridiales (importance score = 0.014 ± 0.002) and the non-pathogenic oral microbial family Ruminococcaceae (importance score = 0.0035 ± 0.0009), were predictive of hunter-gatherer microbial communities compared to early agriculturists (ratio of baseline error to observed error = 3.5; Supplementary Table 8). Farming groups from the Neolithic and medieval periods were discriminated by both non-pathogenic taxa, such as Clostridiales Incertae Sedis (importance score = 0.014 ± 0.003), and decay-associated Veillonellaceae (importance score = 0.012 ± 0.0038). Farming populations also had more periodontal disease-associated taxa, including P. gingivalis and members of the Tannerella and Treponema genera, than did hunter-gatherers. Although there is also a strong association between periodontal disease and individual age at death42, we found periodontal disease-associated taxa across a range of ages, including the youngest individual in the study (34 years old, ID 8247). Random forest analysis showed that only a limited number of taxa distinguished modern oral environments from farming groups in the medieval and Neolithic periods (ratio of baseline error to observed error = 4.0; Supplementary Table 9). These taxa include decay-associated Veillonellaceae (importance score = 0.021 ± 0.004), in addition to Lachnospiraceae (importance score = 0.019 ± 0.007) and Actinomycetales (importance score = 0.0013 ± 0.0005).

Notably, the frequency of S. mutans is significantly higher in modern samples than in preindustrial agricultural samples (P < 0.0001; Fig. 3b and Supplementary Note), indicating that caries-associated bacteria became dominant only after medieval times. This change is most likely associated with the onset of the Industrial Revolution, which began some 200 years ago and represents the largest change in food production and processing technology since the shift to farming9. The Industrial Revolution saw the production of refined grain and concentrated sugar from processed sugar beet and cane9, generating mono- and disaccharides, which are the main substrates for the microbial fermentation that lowers plaque pH and causes enamel demineralization26.

[Figure 4 labels. y-axis: Function 2, 5.7% (P < 0.001). Groups: LN/BA, LBK, BB, Jewbury, StHW, HG, Raunds.]
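The "ratio of baseline error to observed error" reported for the random forest analyses can be read as a measure of how much better the classifier does than guessing. The sketch below assumes a common convention, in which the baseline error is the error of always predicting the most frequent class; the paper's exact definition is in its Supplementary Note, and the labels and predictions here are hypothetical.

```python
from collections import Counter


def baseline_error(labels):
    """Error rate of always guessing the most common class (assumed convention)."""
    counts = Counter(labels)
    return 1 - max(counts.values()) / len(labels)


def observed_error(labels, predictions):
    """Fraction of samples whose predicted class differs from the true class."""
    wrong = sum(truth != guess for truth, guess in zip(labels, predictions))
    return wrong / len(labels)


def error_ratio(labels, predictions):
    """Baseline error divided by observed error; larger means a better classifier."""
    observed = observed_error(labels, predictions)
    if observed == 0:
        return float("inf")  # a perfect classifier has no finite ratio
    return baseline_error(labels) / observed


# Hypothetical example: 6 hunter-gatherer and 6 farmer samples, 2 misclassified.
labels = ["HG"] * 6 + ["farmer"] * 6
predictions = ["HG"] * 5 + ["farmer"] + ["farmer"] * 5 + ["HG"]
print(error_ratio(labels, predictions))
```

Under this reading, a ratio of 3.5 would mean the classifier made roughly 3.5 times fewer errors than the majority-class guess.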
Figure 4 Discriminant analysis of β diversity. Discriminant analysis was applied to the principal coordinates generated from the unweighted UniFrac distances calculated from the V3 region sequences. Each individual is represented by a circle and colored according to archaeological grouping (HG, hunter-gatherer; LN/BA, late Neolithic/Bronze Age; StHW, St. Helen-on-the-Walls). The majority of phylogenetic variation (91.2%) was described by the first discriminant function, showing that individuals from the same archaeological groups cluster according to microbial composition.

[Figure 4 x-axis: Function 1, 91.2% (P < 0.001).]

Overall, modern Europeans have much lower oral microbial diversity than either Mesolithic or preindustrial Neolithic groups (P < 0.001; Supplementary Table 7): they carry fewer bacteria associated with good health (Ruminococcaceae), periodontal disease-associated taxa (for example, P. gingivalis and members of the Tannerella and Treponema genera) at levels similar to those of early agriculturists, and a markedly higher abundance of (now ubiquitous) pathogens such as S. mutans (Fig. 2 and Supplementary Note). Perhaps more notably, the decline in overall oral microbial diversity indicates that, over the past few hundred years, the human mouth has become a substantially less biodiverse ecosystem. In both human-associated microbiota43–45 and macroecological46,47 contexts, higher phylogenetic diversity is associated with greater ecosystem resilience and productivity. Therefore, the modern oral environment is likely to be less resilient to perturbations48 in the form of dietary imbalances or invasion49 by pathogenic bacterial species.

Major changes in carbohydrate intake in human history seem to have affected the ecosystem of the mouth, opening up pathological niches for periodontal disease in the early Neolithic period and caries in the recent past. These data are potentially important for assessing current associations with systemic diseases: for example, it has been proposed that periodontal disease might contribute to the development of diabetes and heart disease25 through the production of a prolonged inflammatory state3. However, although the frequency of these systemic diseases has risen over the last few decades25, our data show that the abundance of periodontal disease-associated bacteria has been relatively stable since the introduction of farming (for example, P. gingivalis; Fig. 3). This indicates that, although periodontal disease might contribute to pathogenesis, it is probably not a factor in the rising incidence of these systemic diseases.

Our research has identified a powerful new avenue for bioanthropological research, which promises to provide the first detailed genetic records of the evolution of human microbiota. This provides the potential to directly examine the effects of nutritional and cultural transitions (Fig. 4, Supplementary Figs. 10 and 11, and Supplementary Note) on human health through time and to record the genomic evolution of human commensals and pathogens.

URLs. Human Microbiome Project, http://www.hmpdacc.org/tools_protocols/tools_protocols.php.

Methods. Methods and any associated references are available in the online version of the paper.

Accession codes. Sequence data have been deposited in GenBank under accession ERP002107.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank D. Brothwell for original inspiration, N. Gully and S. Bent for critical discussions and J. Soubrier for bioinformatics assistance. We thank H. Meller from the State Heritage Museum of Saxony-Anhalt, Germany, and W. Gumiński from the Institute of Archaeology, University of Warsaw, Poland, for prehistoric samples and members of the Australian Centre for Ancient DNA for practical help and providing samples of plaque and calculus. We thank several anonymous reviewers whose comments have considerably improved the manuscript. We thank the Australian Research Council, the Wellcome Trust (WT092799/Z/10/Z and WT098051) and the Sir Mark Mitchell Foundation for funding support.

AUTHOR CONTRIBUTIONS
C.J.A., A.C., K.D., A.W.W., J.P., K.W.A., G.T., J.K. and W.H. designed the study. C.J.A., K.D., K.W.A., A.S., W.H., A.C. and J.K. collected samples. C.J.A. and L.S.W. extracted and amplified DNA from dental calculus. C.J.A. and L.S.W. analyzed sequence data. A.W.W. performed 454 sequencing. C.J.A.B. performed diversity bootstrapping analyses. C.J.A., A.C. and K.D. wrote the manuscript. All authors discussed the results and contributed to writing the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Aas, J.A. et al. Bacteria of dental caries in primary and permanent teeth in children and young adults. J. Clin. Microbiol. 46, 1407–1417 (2008).
2. Aas, J.A., Paster, B.J., Stokes, L.N., Olsen, I. & Dewhirst, F.E. Defining the normal bacterial flora of the oral cavity. J. Clin. Microbiol. 43, 5721–5732 (2005).
3. Davé, S. & Van Dyke, T. The link between periodontal disease and cardiovascular disease is probably inflammation. Oral Dis. 14, 95–101 (2008).
4. Grossi, S.G. & Genco, R.J. Periodontal disease and diabetes mellitus: a two-way relationship. Ann. Periodontol. 3, 51–61 (1998).
5. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
6. Braidwood, R.J., Howe, B. & Reed, C.A. The Iranian Prehistoric Project: new problems arise as more is learned of the first attempts at food production and settled village life. Science 133, 2008–2010 (1961).
7. Oelze, V.M. et al. Early Neolithic diet and animal husbandry: stable isotope evidence from three Linearbandkeramik (LBK) sites in Central Germany. J. Archaeol. Sci. 38, 270–279 (2011).
8. Childe, V.G. The Dawn of European Civilisation (Kegan Paul, London, 1925).
9. Cordain, L. et al. Origins and evolution of the Western diet: health implications for the 21st century. Am. J. Clin. Nutr. 81, 341–354 (2005).
10. Savage, D.C. Microbial ecology of the gastrointestinal tract. Annu. Rev. Microbiol. 31, 107–133 (1977).
11. Scott, G.R. & Poulson, S.R. Stable carbon and nitrogen isotopes of human dental calculus: a potentially new non-destructive proxy for paleodietary analysis. J. Archaeol. Sci. 39, 1388–1393 (2012).
12. Lilley, J., Stroud, G. & Brothwell, D. The Jewish burial ground at Jewbury. in The Archaeology of York Vol. 12 (eds. Addyman, P.V. & Kinsler, V.A.) 291–578 (Council for British Archaeology, York, UK, 1994).
13. Preus, H.R., Marvik, O.J., Selvig, K.A. & Bennike, P. Ancient bacterial DNA (aDNA) in dental calculus from archaeological human remains. J. Archaeol. Sci. 38, 1827–1831 (2011).
14. Vandermeersch, B. et al. Middle Palaeolithic dental bacteria from Kebara, Israel. C.R. Acad. Sci. Paris 319, 727–731 (1994).
15. Socransky, S.S. & Haffajee, A.D. Dental biofilms: difficult therapeutic targets. Periodontol. 2000 28, 12–55 (2002).
16. Jin, Y. & Yip, H.K. Supragingival calculus: formation and control. Crit. Rev. Oral Biol. Med. 13, 426–441 (2002).
17. Lieverse, A.R. Diet and the aetiology of dental calculus. Int. J. Osteoarchaeol. 9, 219–232 (1999).
18. Asikainen, S., Chen, C. & Slots, J. Likelihood of transmitting Actinobacillus actinomycetemcomitans and Porphyromonas gingivalis in families with periodontitis. Oral Microbiol. Immunol. 11, 387–394 (1996).
19. Van Steenbergen, T.J., Menard, C., Tijhof, C.J., Mouton, C. & De Graaff, J. Comparison of three molecular typing methods in studies of transmission of Porphyromonas gingivalis. J. Med. Microbiol. 39, 416–421 (1993).
20. Aufderheide, A.C., Rodriguez-Martin, C. & Langsjoen, O. The Cambridge Encyclopedia of Human Paleopathology (Cambridge University Press, Cambridge, 1998).
21. Faveri, M. et al. Microbiological diversity of generalized aggressive periodontitis by 16S rRNA clonal analysis. Oral Microbiol. Immunol. 23, 112–118 (2008).
22. Grine, F.E., Gwinnett, A.J. & Oaks, J.H. Early hominid dental pathology: interproximal caries in 1.5 million-year-old Paranthropus robustus from Swartkrans. Arch. Oral Biol. 35, 381–386 (1990).
23. Marsh, P.D. Sugar, fluoride, pH and microbial homeostasis in dental plaque. Proc. Finn. Dent. Soc. 87, 515–525 (1991).
24. Marsh, P.D. Are dental diseases examples of ecological catastrophes? Microbiology 149, 279–294 (2003).
25. Hujoel, P. Dietary carbohydrates and dental-systemic diseases. J. Dent. Res. 88, 490–502 (2009).
26. Marsh, P.D. Microbiology of dental plaque biofilms and their role in oral health and caries. Dent. Clin. North Am. 54, 441–454 (2010).
27. Petersen, P.E., Bourgeois, D., Ogawa, H., Estupinan-Day, S. & Ndiaye, C. The global burden of oral diseases and risks to oral health. Bull. World Health Organ. 83, 661–669 (2005).
28. Mercado, F.B., Marshall, R.I., Klestov, A.C. & Bartold, P.M. Relationship between rheumatoid arthritis and periodontitis. J. Periodontol. 72, 779–787 (2001).
29. Quince, C. et al. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat. Methods 6, 639–641 (2009).
30. Dewhirst, F.E. et al. The human oral microbiome. J. Bacteriol. 192, 5002–5017 (2010).
31. Lazarevic, V., Whiteson, K., Hernandez, D., Francois, P. & Schrenzel, J. Study of inter- and intra-individual variations in the salivary microbiota. BMC Genomics 11, 523 (2010).
32. La Duc, M.T., Kern, R. & Venkateswaran, K. Microbial monitoring of spacecraft and associated environments. Microb. Ecol. 47, 150–158 (2004).
33. D'Costa, V.M. et al. Antibiotic resistance is ancient. Nature 477, 457–461 (2011).
34. Beier, S., Witzel, K.P. & Marxsen, J. Bacterial community composition in Central European running waters examined by temperature gradient gel electrophoresis and sequence analysis of 16S rRNA genes. Appl. Environ. Microbiol. 74, 188–199 (2008).
35. Schloss, P.D. & Handelsman, J. Toward a census of bacteria in soil. PLoS Comput. Biol. 2, e92 (2006).
36. Ellis, R.J., Morgan, P., Weightman, A.J. & Fry, J.C. Cultivation-dependent and -independent approaches for determining bacterial diversity in heavy-metal-contaminated soil. Appl. Environ. Microbiol. 69, 3223–3230 (2003).
37. Nogales, B. et al. Combined use of 16S ribosomal DNA and 16S rRNA to study the bacterial community of polychlorinated biphenyl-polluted soil. Appl. Environ. Microbiol. 67, 1874–1884 (2001).
38. Elshahed, M.S. et al. Novelty and uniqueness patterns of rare members of the soil biosphere. Appl. Environ. Microbiol. 74, 5422–5428 (2008).
39. Tringe, S.G. et al. Comparative metagenomics of microbial communities. Science 308, 554–557 (2005).
40. Will, C. et al. Horizon-specific bacterial community composition of German grassland soils, as revealed by pyrosequencing-based analysis of 16S rRNA genes. Appl. Environ. Microbiol. 76, 6751–6759 (2010).
41. Kerr, N.W. Prevalence and natural history of periodontal disease in prehistoric Scots (pre-900 AD). J. Periodontal Res. 33, 131–137 (1998).
42. Albandar, J.M., Brunelle, J.A. & Kingman, A. Destructive periodontal disease in adults 30 years of age and older in the United States, 1988–1994. J. Periodontol. 70, 13–29 (1999).
43. Bailey, M.T. et al. Stressor exposure disrupts commensal microbial populations in the intestines and leads to increased colonization by Citrobacter rodentium. Infect. Immun. 78, 1509–1519 (2010).
44. Lawley, T.D. et al. Antibiotic treatment of Clostridium difficile carrier mice triggers a supershedder state, spore-mediated transmission, and severe disease in immunocompromised hosts. Infect. Immun. 77, 3661–3669 (2009).
45. Lozupone, C.A., Stombaugh, J.I., Gordon, J.I., Jansson, J.K. & Knight, R. Diversity, stability and resilience of the human gut microbiota. Nature 489, 220–230 (2012).
46. Cadotte, M., Dinnage, R. & Tilman, G.D. Phylogenetic diversity promotes ecosystem stability. Ecology 93, S223–S233 (2012).
47. Zhang, Y., Chen, H.Y.H. & Reich, P.B. Forest productivity increases with evenness, species richness and trait variation: a global meta-analysis. J. Ecol. 100, 742–749 (2012).
48. Petchey, O. & Gaston, K. Effects on ecosystem resilience of biodiversity, extinctions, and the structure of regional species pools. Theor. Ecol. 2, 177–187 (2009).
49. Loreau, M. et al. A new look at the relationship between diversity and stability. in Biodiversity and Ecosystem Functioning. Synthesis and Perspectives (eds. Loreau, M., Naeem, S. & Inchausti, P.) 79–91 (Oxford University Press, Oxford, 2002).
50. Faith, D.P. Conservation evaluation and phylogenetic diversity. Biol. Conserv. 61, 1–10 (1992).


ONLINE METHODS

Further details for the ancient dental calculus samples, including archaeological information, preparation methodology and authentication criteria, are provided in the Supplementary Note.

DNA extraction, PCR, cloning and sequencing. We extracted DNA from 0.05–0.2 g of sterilized and powdered ancient dental calculus and included a non-template control for every three extractions. Ancient dental calculus deposits, modern calculus and plaque samples, and non-template controls were lysed in 1 ml of lysis buffer containing 0.5 M EDTA (pH 8), SDS (10%) and Proteinase K (20 mg/ml). Samples in lysis buffer were rotated for 24 h at 55 °C. After sample lysis, DNA was isolated using the QIAamp DNA Investigator kit (Qiagen). DNA was eluted in a final volume of 100 µl, and extracts were stored at 4 °C. Tooth samples were extracted using protocols described previously51. Independent extractions were not possible owing to the small size of samples; commonly, only one calculus deposit per individual was available for DNA analysis.

PCR was used to amplify microbial DNA in the ancient dental calculus samples, modern oral samples, tooth samples and extraction controls using both primers for the universal microbial 16S rRNA gene and primers specific for the oral pathogens S. mutans (GtfB gene) and P. gingivalis (16S rRNA gene) (Supplementary Table 2). We also attempted (unsuccessfully) to amplify human mitochondrial DNA from the ancient dental calculus samples. For all primer sets, the PCR conditions included 2 U of AmpliTaq Gold (Applied Biosystems) in a 25-µl volume using 1× Buffer Gold, 2.5 mM MgCl2, 0.25 mM of each dNTP (Fermentas), 400 µM of each primer, 1 mg/ml rabbit serum albumin (RSA, Sigma-Aldrich), ShrimpDNase (Affymetrix) at 0.002 U/µl and 2 µl of DNA extract. ShrimpDNase was used to remove microbial contamination from PCR reagents before the amplification reaction: it was added to the PCR mixture (minus the extract), the mixture was incubated at 37 °C for 15 min, and the enzyme was then inactivated by heating the mixture to 65 °C for 15 min. For the specific primers, the thermocycling conditions consisted of an initial enzyme activation step at 95 °C for 6 min, followed by 45 cycles of denaturation at 94 °C for 30 s, annealing at 58 °C for 30 s and elongation at 72 °C for 30 s, with a single final extension step at 60 °C for 10 min. For the universal 16S rRNA primers, we used 40 cycles and an annealing temperature of 50 °C. Each set of PCRs included multiple extraction and PCR blanks. All PCR products were visually examined by electrophoresis on 3.5% agarose TBE gels. Specific PCR products were purified using 5 µl of amplified product, exonuclease I (0.8 U/µl) and shrimp alkaline phosphatase (1 U/µl). Mixtures were heated to 37 °C for 40 min and then heat inactivated at 80 °C for 10 min. Purified amplicons were sequenced bidirectionally using PCR primers and the BigDye Terminator 3.1 kit (Applied Biosystems) according to the manufacturer's instructions. Sequencing products were purified using a MultiscreenHTS Vacuum Manifold (Millipore) according to the manufacturer's protocol. Sequencing products were separated on the 3130xl Genetic Analyzer (Applied Biosystems), and the resulting sequences were edited using Sequencher (version 4.7).

We cloned the 16S rRNA gene universal amplicons to monitor contamination within the ancient samples and non-template controls and to assess the suitability of calculus samples for 454 sequencing. PCR products were purified using Agencourt AMPure (Beckman Coulter) according to the manufacturer's instructions and cloned using a StrataClone PCR cloning kit (Stratagene). Clones were added directly to the colony PCR mix, which contained 10× HotMaster Buffer (Eppendorf), 0.5 U/µl HotMaster Taq (5Prime) and 10 µM of forward and reverse M13 primers (Supplementary Table 2) in a 25-µl reaction. The thermocycling conditions consisted of an initial step at 94 °C for 10 min and 35 cycles of 94 °C for 20 s, 55 °C for 10 s and 65 °C for 45 s, with a single extension of 65 °C for 10 min. Colony PCR products were visually inspected on 2% agarose TBE gels, and products were purified and sequenced using the same protocols as described for the products of the specific primers.
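As a worked check on the two thermocycling programmes described above, the short sketch below adds up their programmed hold times. The function name and layout are illustrative only, and the totals ignore ramping between temperatures, so real run times on a thermocycler would be somewhat longer.

```python
def programme_minutes(initial_s, cycles, step_seconds, final_s):
    """Total programmed hold time in minutes, ignoring temperature ramp times."""
    return (initial_s + cycles * sum(step_seconds) + final_s) / 60


# Specific-primer PCR: 95 C for 6 min, 45 x (30 s + 30 s + 30 s), final 10 min.
specific_pcr_min = programme_minutes(6 * 60, 45, [30, 30, 30], 10 * 60)

# Colony PCR: 94 C for 10 min, 35 x (20 s + 10 s + 45 s), final 10 min.
colony_pcr_min = programme_minutes(10 * 60, 35, [20, 10, 45], 10 * 60)

print(specific_pcr_min)  # 83.5
print(colony_pcr_min)    # 63.75
```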

454 GS FLX Titanium sequencing. Pyrosequencing (GS-FLX Titanium) was used to examine 34 ancient dental calculus samples, 19 modern oral samples (6 calculus and 13 plaque), 6 tooth samples and 2 extraction blanks. For the ancient calculus samples, modern oral samples and non-template controls, three hypervariable regions of the 16S rRNA gene (V1, V3 and V6) were amplified using the described conditions. For the tooth samples, only the V3 region was amplified, using the same conditions. The forward and reverse primers contained 454 Lib-L kit A and B adaptors, respectively, at the 5′ end. The forward primer also contained sample-specific barcodes (Supplementary Table 3) that were developed by the Human Microbiome Project. The barcode sequences had not previously been used in either the Australian Centre for Ancient DNA or the Wellcome Trust Sanger Institute, where the 454 sequencing was performed. Hence, all sequences retrieved that did not contain a barcode were assumed to be contaminants and were discarded. Each region of the 16S rRNA gene was amplified twice (on different days), and duplicates were pooled for 454 sequencing to minimize the potential impact of preferential sequence amplification.

Filtering, OTU selection, alignment and taxonomic assignment of 454 sequences. Sequences from the GS FLX Titanium platform were processed using the QIIME (version 1.5.0) software package52. Quality filtering was performed to remove sequences that were either under 60 bp in length (potential primer dimers), contained ambiguous bases, had primer or barcode mismatches, contained homopolymers that exceeded 6 bases or had an average quality score below 25. The remaining sequences ranged between 60 and 210 bp in length.
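The quality-filtering criteria listed above can be expressed as a single predicate. The sketch below is illustrative rather than a reproduction of QIIME's behaviour: the function name is hypothetical, reads are assumed to be uppercase and already stripped of barcode and primer (so the barcode and primer mismatch check is omitted), and quality scores are assumed to be supplied as one integer per base.

```python
import re

AMBIGUOUS_BASE = re.compile(r"[^ACGT]")  # anything other than an unambiguous base


def passes_quality_filter(seq, quals, min_len=60, max_homopolymer=6, min_mean_q=25):
    """Return True if a read survives the length, base, homopolymer and quality checks."""
    if len(seq) < min_len:
        return False  # under 60 bp: potential primer dimer
    if AMBIGUOUS_BASE.search(seq):
        return False  # contains an ambiguous base such as N
    # A base followed by max_homopolymer or more copies of itself is a run that is too long.
    if re.search(r"(.)\1{%d,}" % max_homopolymer, seq):
        return False
    return sum(quals) / len(quals) >= min_mean_q  # average quality of at least 25


good_read = "ACGT" * 20  # 80 bp, no homopolymer runs
print(passes_quality_filter(good_read, [30] * 80))                           # True
print(passes_quality_filter("ACGT" * 10, [30] * 40))                         # False
print(passes_quality_filter("A" * 7 + "CGT" * 25, [30] * 82))                # False
print(passes_quality_filter(good_read, [20] * 80))                           # False
```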
The quality-filtered sequences were denoised53 and chimera checked to remove sequences containing errors produced during pyrosequencing and PCR, respectively, which resulted in the removal of ~50% of the sequences that were identified as having ambiguous flow data (Supplementary Table 4). However, we found that sequence classifications and diversity analyses were comparable between the data set on which only quality filtering had been performed and the denoised data set, as has previously been shown53. Similar sequences were binned into OTUs using optimal UCLUST54 at 95% similarity. Clustering is more commonly performed at 97%; however, a 95% cutoff has been found to classify OTUs more accurately for closely related, short sequences55. Representative sequences from each OTU were aligned using PyNAST52 against the GreenGenes core set, with a minimum length of 60 bp and identity of 75%. PyNAST aligns the short GS FLX-generated sequences (60–210 bp) against the full 16S rRNA gene. Columns that solely contained gaps were removed from the alignment before building phylogenetic trees. To overcome the difficulty in aligning highly variable 16S rRNA gene sequences, it is common to hide or 'lane mask' regions where at least 50% of the base composition is not conserved56. We did not hide variable regions because lane-masked alignments can mute the phylogenetic diversity observed55. The gap-filtered sequences were taxonomically assigned using the RDP classifier and nomenclature57. Detailed descriptions of the analyses performed on the ancient dental calculus, modern oral and extraction blank sequences are provided in the Supplementary Note, including information on α-diversity, β-diversity, random forest and discriminant analyses.
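The gap-filtering step described above (removing alignment columns that contain only gaps) can be sketched as follows. This is an illustrative stand-in rather than the QIIME or PyNAST code; the function name and the choice of `-` and `.` as gap characters are assumptions.

```python
def drop_gap_only_columns(aligned, gap_chars="-."):
    """Remove every alignment column in which all sequences carry a gap.

    aligned -- list of equal-length aligned sequences (strings)
    Returns a new list of strings with the gap-only columns removed.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column if at least one sequence has a non-gap character there.
    keep = [i for i in range(width)
            if any(s[i] not in gap_chars for s in aligned)]
    return ["".join(s[i] for i in keep) for s in aligned]
```

Unlike lane masking, this filter discards no sequence information: a column is dropped only when no sequence has a base in it, so variable regions are retained.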
51. Haak, W. et al. Ancient DNA from European early neolithic farmers reveals their near eastern affinities. PLoS Biol. 8, e1000536 (2010).
52. Caporaso, J.G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
53. Reeder, J. & Knight, R. Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat. Methods 7, 668–669 (2010).
54. Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
55. Schloss, P.D. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput. Biol. 6, e1000844 (2010).
56. Weisburg, W.G., Barns, S.M., Pelletier, D.A. & Lane, D.J. 16S ribosomal DNA amplification for phylogenetic study. J. Bacteriol. 173, 697–703 (1991).
57. Wang, Q., Garrity, G.M., Tiedje, J.M. & Cole, J.R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).

Nature Genetics

doi:10.1038/ng.2536

letters

OPEN

The draft genome of the fast-growing non-timber forest species moso bamboo (Phyllostachys heterocycla)
Zhenhua Peng1,4, Ying Lu2,4, Lubin Li1,4, Qiang Zhao2,4, Qi Feng2,4, Zhimin Gao3,4, Hengyun Lu2, Tao Hu3, Na Yao1, Kunyan Liu2, Yan Li2, Danlin Fan2, Yunli Guo2, Wenjun Li2, Yiqi Lu2, Qijun Weng2, CongCong Zhou2, Lei Zhang2, Tao Huang2, Yan Zhao2, Chuanrang Zhu2, Xinge Liu3, Xuewen Yang3, Tao Wang1, Kun Miao1, Caiyun Zhuang1, Xiaolu Cao1, Wenli Tang3, Guanshui Liu3, Yingli Liu3, Jie Chen1, Zhenjing Liu1, Licai Yuan3, Zhenhua Liu1, Xuehui Huang2, Tingting Lu2, Benhua Fei3, Zemin Ning2, Bin Han2 & Zehui Jiang1,3
Bamboo represents the only major lineage of grasses that is native to forests and is one of the most important non-timber forest products in the world. However, no species in the Bambusoideae subfamily has been sequenced. Here, we report a high-quality draft genome sequence of moso bamboo (P. heterocycla var. pubescens). The 2.05-Gb assembly covers 95% of the genomic region. Gene prediction modeling identified 31,987 genes, most of which are supported by cDNA and deep RNA sequencing data. Analyses of clustered gene families and gene collinearity show that bamboo underwent whole-genome duplication 7–12 million years ago. Identification of gene families that are key in cell wall biosynthesis suggests that the whole-genome duplication event generated more gene duplicates involved in bamboo shoot development. RNA sequencing analysis of bamboo flowering tissues suggests a potential connection between drought-responsive and flowering genes.

Bamboo is one of the most important non-timber forest products in the world. About 2.5 billion people depend economically on bamboo, and international trade in bamboo amounts to over 2.5 billion US dollars per year1. Bamboo has a rather striking life history, characterized by a prolonged vegetative phase lasting decades before flowering, thereby inhibiting genetic improvement. Recent genomic studies in bamboo have included genome-wide full-length cDNA sequencing2, chloroplast genome sequencing3, identification of syntenic genes between bamboo and other grasses4 and phylogenetic analysis of Bambusoideae subspecies5. Fifty-nine simple sequence repeat markers from rice and sugarcane were used in the genetic diversity analyses of 23 bamboo species6, and 2 species-specific sequence-characterized amplified region markers were developed in the identification of different bamboo species7.
Here, we report the draft genome of moso bamboo, a large woody bamboo that has ecological, economic and cultural value in Asia and accounts for ~70% of the total bamboo growth area. Comparative genome-wide analyses of bamboo and other grass species, including rice, maize and sorghum, yielded new genetic insights into the rapid and marked phenotypic and ecological divergence of bamboo and closely related grasses.

The moso bamboo genome contains 24 pairs of chromosomes8 (2n = 48) and is characteristic of a diploid (Supplementary Fig. 1a). We conducted a flow cytometry analysis and estimated that it had a genome size of 2.075 Gb (2C = 4.24 pg; Supplementary Fig. 1b), which was very close to that estimated in a previous report9. Because it is difficult to generate an inbred line of moso bamboo, owing to its infrequent sexual reproduction and the long intervals between flowering events, we selected five plants from a single individual rhizome of the moso bamboo ecotype (P. heterocycla var. pubescens) and performed whole-genome shotgun sequencing. We generated 295 Gb of raw sequence data (approximately 147-fold coverage), including Illumina short reads and 10,327 pairs of BAC end sequences (Supplementary Table 1a). The final assembly of 2.05 Gb was generated using the de novo Phusion-meta assembly pipeline that was developed in this study (Supplementary Fig. 2). The N50 length of the assembled scaffolds was over 328 kb, and about 80% of the assembly mapped to 5,499 scaffolds of greater than 62 kb in length (Table 1 and Supplementary Table 1b). The scaffolds assembled using the Phusion-meta assembly method were much longer than the scaffolds generated using the SOAPdenovo program10 (Fig. 1a and Supplementary Table 1c). Given the presence of small fragments in the assembly, the estimated size of the moso bamboo genome was approximately 2.07 to 2.10 Gb, which was supported by the analysis of the distribution of 51-mer frequencies (Supplementary Fig. 3). Hence, the final scaffolds of 2.05 Gb and initial contigs of 1.86 Gb covered approximately 95% and 88% of the genomic region, respectively.
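The N50 and N80 scaffold statistics quoted above can be illustrated with a minimal sketch. It assumes the usual definition: the length L such that scaffolds of length at least L together contain the stated percentage of the total assembly. The function name `nxx` and the toy scaffold lengths are hypothetical and are not taken from the moso bamboo assembly.

```python
def nxx(lengths, percent=50):
    """Return (Nxx length, number of scaffolds needed to reach it).

    lengths -- iterable of scaffold lengths in bp
    percent -- integer percentage of the total assembly (50 for N50, 80 for N80)
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


# Toy example (hypothetical lengths, in bp).
toy = [100_000, 80_000, 60_000, 40_000, 20_000]
n50_length, n50_count = nxx(toy, 50)
n80_length, n80_count = nxx(toy, 80)
```

In the terms used in the text, the second returned value for `percent=80` corresponds to the number of scaffolds that together hold about 80% of the assembly (5,499 scaffolds longer than 62 kb for moso bamboo).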
1Research Institute of Forestry, Chinese Academy of Forestry, Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Beijing, China. 2National Center for Gene Research, Shanghai Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China. 3International Center for Bamboo and Rattan, Beijing, China. 4These authors contributed equally to this work. Correspondence should be addressed to B.H. (bhan@ncgr.ac.cn) or Z.J. (jiangzehui@icbr.ac.cn). Received 20 July 2012; accepted 1 February 2013; published online 24 February 2013; doi:10.1038/ng.2569

Table 1 Statistics of assembly and annotation for the moso bamboo genome

Total length                               2,051,719,643 bp
N50 length (contigs)                       11,882 bp
N50 length (scaffolds)                     328,698 bp
N80 length (scaffolds)                     62,052 bp
Number of scaffolds (>N80 length)          5,499
Largest scaffold                           4,869,017 bp
GC content                                 43.9%
Number of protein-coding genes             31,987
Average length of protein-coding genes     3,350 bp
Total size of transposable elements        1,210,862,930 bp
Content of transposable elements           59.0%

Final scaffolds with less than 500 bp were excluded.

Sequence comparison of the assembled scaffolds to existing cDNA and survey sequences in the database and eight BAC sequences individually determined through Sanger sequencing showed good agreement in genomic coverage at over 88% of the initial contigs and 98% of the scaffolds (Supplementary Figs. 4,5 and Supplementary Tables 2–4). The frequencies of single-base differences and insertions and/or deletions (indels) in the alignment using BAC sequences were as low as 0.19 and 0.09 instances per kilobase, respectively, which were much lower than those determined for the SOAPdenovo assemblies (Supplementary Fig. 6 and Supplementary Table 5). Alignment of all of the reads used to build the assembly identified 2,009,487 heterozygous SNPs and 51,223 short indels (6 nucleotides in length or less) (Supplementary Table 6). The overall rate of heterozygous SNPs and short indels was estimated at approximately 1.0 polymorphism per kilobase, which was lower than that (2.6 per kilobase) of the poplar genome11 and that (4.2 per kilobase) of the grape genome12.
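As a rough check on the polymorphism rate, the reported counts of heterozygous SNPs and short indels can be divided by the assembly length. This sketch assumes that the total scaffold length reported in Table 1 is an acceptable denominator, which the text does not state explicitly; the function name is illustrative.

```python
def polymorphisms_per_kb(n_snps, n_indels, assembly_bp):
    """Heterozygous variants (SNPs plus short indels) per kilobase of assembly."""
    return (n_snps + n_indels) / (assembly_bp / 1_000)


# Values taken from the text and Table 1.
rate = polymorphisms_per_kb(
    n_snps=2_009_487,
    n_indels=51_223,
    assembly_bp=2_051_719_643,  # assumed denominator: total scaffold length
)
print(f"{rate:.2f} polymorphisms per kb")  # prints "1.00 polymorphisms per kb"
```

Under this assumption the result agrees with the stated estimate of approximately 1.0 polymorphism per kilobase.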

Figure 1 Assemblies and comparative genomics. (a) Comparison of the lengths of assembled scaffolds by the pure SOAPdenovo and Phusion-meta assembly methods. (b) Venn diagram of shared orthologous gene families among five grass genomes. The gene family number is listed in each component. The number of genes within the families is noted in parentheses. (c) Genome duplication in grass genomes. The calculated KS values of the 2-member gene clusters were converted to divergence time, using a substitution rate of 6.5 × 10^-9 mutations per site per year34. The y axis shows the percentage of the two-member gene clusters. MYA, million years ago. (d) Evolution of orthologous gene clusters. The black numbers above and below each branch indicate the quantity of expanded (+) or contracted (−) orthologous clusters after the corresponding speciation, respectively. The estimated numbers of clusters in the common ancestors are indicated in the rectangles. The dN/dS ratio of each branch is shown in blue. (e) Divergence time between bamboo and grass species from different subfamilies (mean KS values are given in Supplementary Table 11). (f) Gene synteny between rice, sorghum and moso bamboo. The collinear region is located on rice chromosome 1 (40,565 to 40,983 kb; MSU RGAP 6.1; ref. 35), sorghum chromosome 3 (71,771 to 72,334 kb; ref. 36) and bamboo scaffold PH01000002 (1,890 to 2,862 kb). Non-hypothetical genes (blue), hypothetical genes (gray), LTR retrotransposons (orange), DNA transposons (purple), miniature inverted-repeat transposable elements (MITEs) (green) and other transposable elements (pink) are represented by boxes. Syntenic loci are connected by gray lines between the genomes.

Figure 2 Recent duplication and the expression of bamboo CesA and Csl genes. (a) Phylogenetic neighbor-joining tree of the CesA genes. Red branches indicate a recent duplication after speciation. Filled circles indicate the tissues where the gene had high expression. Clades A, B, C, E and G correspond to the phylogenetic tree in Supplementary Figure 12a. The divergence time of the corresponding duplication is shown in blue. The scale bar represents the bootstrap percentage of each branch. (b) Phylogenetic tree of the Csl genes. The clustered CslA, CslC, CslD, CslE and CslF genes were derived from the phylogenetic tree in Supplementary Figure 12b. Filled circles indicate the tissues where the gene had high expression. The divergence time of recent duplications is shown in blue beside the corresponding branch in red. (c) History of recent duplication for the CesA and Csl genes. Each bracket indicates a duplication event of the CesA or Csl genes. Divergence time is shown along a bar ranging from 0 to 50 million years ago. Filled red circles indicate genes highly expressed in the shoot.

We predicted 31,987 protein-coding genes in the moso bamboo genome, with the support of RNA sequencing (RNA-seq) data (127 Gb) obtained from 7 bamboo tissues and 8,253 bamboo full-length cDNA sequences2 (Online Methods, Supplementary Figs. 7,8 and Supplementary Table 7). Most basic metabolic pathways among the grass species were compared by aligning the annotated protein sequences to the KEGG data set13, which showed high similarity between bamboo and rice (Supplementary Table 8). We also annotated 1,167 tRNA (Supplementary Table 9a), 279 rRNA, 321 small nucleolar RNA, 173 small nuclear RNA and 225 microRNA (miRNA) genes (Supplementary Table 9b). A total of 241 miRNA-targeted genes were predicted by the alignment of conserved miRNAs to our gene models (Supplementary Table 9c).

De novo repeat annotation showed that approximately 59% of the moso bamboo genome consists of transposable elements (Online Methods and Supplementary Table 10a), a proportion that was much higher than the previous estimation (23.3%) in the analysis of survey sequences9. The most abundant repeats were long-terminal repeat elements (LTRs), including 24.6% Gypsy-type LTRs and 12.3% Copia-type LTRs (Supplementary Table 10b,c). When we used the sequences of the eight moso bamboo BACs, we observed that 52% of the BAC sequence consisted of transposable elements (Supplementary Table 4).

Comparing gene families among the four grass subfamilies, including Pooideae (Brachypodium), Ehrhartoideae (rice), Panicoideae (maize, sorghum and foxtail millet) and Bambusoideae (moso bamboo), and two dicots (Arabidopsis thaliana and the woody plant poplar), we identified 21,730 bamboo genes in 14,030 families, with 9,451 gene families shared by maize, sorghum, rice and Brachypodium (Fig. 1b). There were 492 unique gene families in bamboo, some of which were potentially involved in important biological processes (for example, the control of flowering time or secondary metabolism). Approximately 70 gene families were shared by Arabidopsis, poplar and moso bamboo. In comparative analysis of single-copy genes and gene families containing two to four gene members in moso bamboo and five other Poaceae plants, we found that the bamboo genome had the fewest single-member gene families, whereas it had the most two-member families among grasses (Supplementary Fig. 9).

The timing of gene duplication events in grass genomes was estimated by calculating the synonymous substitution rate (KS) and the divergence time between homologous genes within the two-member gene families in which only a single divergence might have occurred. The divergence within most gene clusters occurred around 7 to 12 million years ago in both the moso bamboo and maize genomes (Fig. 1c), suggesting the occurrence of a putative whole-genome duplication event. The estimated time of the duplication at 11 to 15 million years ago in maize is consistent with the reported divergence time of two progenitor genomes at about 11.9 million years ago14, suggesting that there might have been a similar tetraploidization event during bamboo history. Investigation of collinear orthologs in bamboo and rice not only reinforced the occurrence of the whole-genome duplication event but also supported a tetraploid origin of bamboo, as the most recent whole-genome duplication was likely linked to polyploidy events15 (Supplementary Fig. 10a). The divergence time of two progenitors was estimated at 7 to 15 million years ago (Supplementary Fig. 10b), consistent with the divergence time estimated using two-member gene families. For other grass species, such as rice and sorghum, there was no obvious evidence of whole-genome duplication occurring later than the divergence time of grasses at 50 million years ago16–19.
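The conversion from KS to divergence time used for the two-member gene families can be sketched as follows. The substitution rate of 6.5 × 10^-9 mutations per site per year is the one stated for Figure 1c; the relation T = KS / (2r), in which the factor 2 accounts for substitutions accumulating along both diverging copies, is the conventional molecular-clock formula and is an assumption here, because the text does not spell out the formula. Function names are illustrative.

```python
RATE = 6.5e-9  # substitutions per synonymous site per year (Figure 1c)


def ks_to_mya(ks, rate=RATE):
    """Convert a synonymous substitution rate (KS) to divergence time in
    million years, assuming T = KS / (2 * rate)."""
    return ks / (2 * rate) / 1e6


def mya_to_ks(mya, rate=RATE):
    """Inverse conversion: the KS value expected after `mya` million years."""
    return mya * 1e6 * 2 * rate


# Under these assumptions, the 7 to 12 million-year window reported for the
# bamboo duplication corresponds to KS values of roughly 0.09 to 0.16.
ks_low = mya_to_ks(7)
ks_high = mya_to_ks(12)
```

Because the conversion is linear, any uncertainty in the assumed substitution rate carries over proportionally to the estimated duplication time.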
Using 968 one-to-one single-copy genes from the 5 fully sequenced grass genomes as well as the bamboo genome, we reconstructed a phylogenetic tree to show the relationships among four subfamilies: Panicoideae, Pooideae, Ehrhartoideae and Bambusoideae (Fig. 1d). The analyzed grasses were divided into two sister groups, the BEP clade (Bambusoideae, Ehrhartoideae and Pooideae) and the Panicoideae clade, consistent with the phylogeny and classification of grass subfamilies stated in earlier studies20–22. The tree supported the idea that the closest relationship exists between Brachypodium and bamboo, agreeing with the result from the analysis of chloroplast genome sequences3. The dN/dS value (the ratio of the rate of nonsynonymous substitution to the rate of synonymous substitution) of the bamboo lineage was the highest among the compared species, suggesting strong selection pressure on bamboo genes. The estimated times for the divergence of bamboo from Brachypodium, rice, foxtail millet, sorghum and maize were approximately 46.9, 48.6, 53.9, 58.5 and 64.6 million years ago, respectively (Fig. 1e, Supplementary Fig. 11 and Supplementary Table 11), indicating that the relationship between Brachypodium and bamboo was closer than that between rice and bamboo.

To investigate the evolutionary dynamics of the gene families, expansion and contraction were correlated with copy number. For Arabidopsis and six grass genomes, the number of gene families with gene contraction was greater than that of families with gene expansion, except in foxtail millet (Fig. 1d). Variance of family sizes occurred in a large number of gene families in bamboo (Supplementary Table 12). Gene families involved in the biosynthesis of carbohydrates, such as cellulose, glucan and sucrose, showed significant expansion in bamboo (P value < 0.01) relative to other grass species.

With alignment of the 30,379 gene models located on the large-sized scaffolds (>50 kb in length) to the rice and sorghum gene models, we identified 1,617 rice-bamboo and 1,539 sorghum-bamboo syntenic gene blocks, which consisted of 17,735 and 15,746 bamboo genes, respectively (Supplementary Table 13). The average gene number per block was approximately 11. The large number of syntenic blocks suggested good gene collinearity between bamboo and grass genomes (Fig. 1f). Sequence comparison indicated that approximately 85% of the bamboo genes were aligned to rice or sorghum homologs. In analysis of gene collinearity between bamboo and rice, we identified 5,370 gene losses after the whole-genome duplication event, representing approximately 28% of the total genes in the collinear regions.

A recent proteomics study showed that many metabolic processes of cell wall structure were employed in the fast growth of bamboo culms23. The bamboo genome sequence made it possible to investigate the genes that might affect the formation of the cell wall structure. We detected 19 cellulose synthase (CesA) and 38 cellulose synthase-like (Csl) genes24,25 in the bamboo genome, representing nearly the highest copy number of these genes among the 7 sequenced plant genomes (Supplementary Table 14a). A neighbor-joining tree showed seven recent duplications of the CesA genes (Fig. 2a) and eight duplications of the Csl genes in bamboo after speciation (Fig. 2b). The CesA, CslA and CslC gene families greatly expanded in the bamboo genome, similar to what was observed in the maize genome26. For CesA genes, the four most recent duplications were identified in the grass-specific clades B and G at 8.0 to 13.2 million years ago. Of the 15 CesA gene duplications, 9 occurred later than 20 million years ago (Fig. 2c). Transcriptome analysis showed that the recently occurring duplicates of the CesA and Csl genes had relatively high expression levels in the shoot (Fig. 2a,b and Supplementary Table 15). It was also found that there were few tandem duplicates in these recent duplicates, implying that the duplications might have resulted from large-scale chromosome reconstruction. We observed that the ancient duplicated genes had high expression in the root, leaf and rhizome (Fig. 2). It was concluded that most of the duplications of the CesA and Csl genes were derived from whole-genome duplication, suggesting that tetraploidization was critical for the evolution of these genes.

Figure 3 Gene expression at flowering time. (a) Clustered transcription factor and stress-responsive genes with high expression in panicles. Gene expression was measured by quantified transcription levels (reads per kilobase of exon model per million mapped reads, RPKM37) derived from transcriptome analysis. The gene expression levels in the tip of a 20-cm-long shoot (S20), the tip of a 50-cm-long shoot (S50), the rhizome (RH), the root (RT), the panicle at the early stage (P1) and the panicle at the flowering stage (P2) were normalized to the fold change over the expression levels in the leaf (LF) and are indicated by color. The abbreviations indicating the conserved domains encoded by flowering genes are listed in Supplementary Table 16. (b) Predicted pathway in the control of flowering time in bamboo. Blue arrows indicate that the involved genes are more highly expressed in the floral tissues, whereas red double-headed arrows indicate that the genes are not activated. Single dashed arrows represent pathways that were not used during flowering. The double dashed arrow represents stronger connections between drought-responsive and FMI genes.

To identify the genes involved in the biosynthesis of lignin, a structural component of the secondary cell wall, we investigated the analogous set of genes involved in the phenylpropanoid and lignin biosynthetic pathways27,28 (Supplementary Fig. 12c,d and Supplementary Table 14b). The bamboo genome contained high copy numbers of HCT (hydroxycinnamoyl-CoA shikimate/quinate hydroxycinnamoyltransferase) and CCR (cinnamoyl CoA reductase) genes, which were similar to those found in poplar.
The estimated divergence time of bamboo CCR and HCT gene duplications was from 17.5 to 52.1 million years ago, earlier than the whole-genome duplication event. Both HCT and CCR family genes encode key enzymes in catalyzing the conversion of phenylpropanoid pathway products into the material for lignin biosynthesis27,29. Although the functions of bamboo CCR and HCT genes have not yet been identified, the duplicated copies might provide multiple pathways to channel phenylpropanoid metabolism into lignin biosynthesis.

The switch to flowering after a very long period of vegetative growth and the rapid growth of spring shoots are unique characteristics of bamboo. To compare gene expression between flowering and vegetative tissues, we collected flowering (panicle) and vegetative tissues from moso bamboo plants for RNA-seq data analysis. More than 600 bamboo genes were highly expressed in the 2 panicle tissues (with at least a 2-fold difference in the expression level in panicles relative to the levels in 5 vegetative tissues; Online Methods). Over 30% of the identified flowering genes could be categorized as transcription factor genes, heat shock protein genes or other stress-responsive genes (Fig. 3a and Supplementary Table 16). The transcription factor genes that are homologs of OsMADS1, OsMADS2, OsMADS3 and OsMADS14 in rice30 were determined to be involved in floral meristem identity (FMI), which converts the vegetative meristem to a flowering fate. However, the genes employed in typical flowering promotion pathways (such as those in the photoperiod, gibberellin, ambient-temperature or light-quality pathways) and floral pathway integrator (FPI) genes31,32 were not highly expressed in these floral tissues in bamboo. Repeat insertions were found in the genic or regulatory region of most homologs encoding CONSTANS (CO)33 and FPI genes, which might result in low gene expression in floral tissues (Supplementary Tables 17 and 18). The CO and FPI genes constitute the critical link between the flowering promotion pathways and the FMI in the flowering gene network. Low expression of CO and FPI genes and high expression of genes involved in FMI suggested that activation of FMI might not depend primarily on these known promotion pathways in bamboo flowering (Fig. 3b).
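The expression measure (RPKM) and the two-fold criterion used above to select panicle-enriched genes can be sketched as follows. The RPKM formula is the standard definition (reads per kilobase of exon model per million mapped reads). Reading the criterion as "at least two-fold higher than in each of the five vegetative tissues" is an assumption, as the exact rule is given only in the Online Methods; function names and the example values are hypothetical.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def panicle_enriched(panicle_rpkm, vegetative_rpkms, min_fold=2.0):
    """True if panicle expression is at least `min_fold` times the level in
    every vegetative tissue (assumed interpretation of the criterion)."""
    return all(panicle_rpkm >= min_fold * v for v in vegetative_rpkms)


# Hypothetical gene: 500 reads on a 2,000-bp exon model in a library of
# 10 million mapped reads.
panicle_value = rpkm(500, 2_000, 10_000_000)

# Hypothetical RPKM values for five vegetative tissues.
vegetative = [10.0, 12.5, 3.0, 1.0, 8.0]
is_enriched = panicle_enriched(panicle_value, vegetative)
```

Normalizing by both exon length and library size is what makes expression values comparable across genes and across the seven tissue libraries.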
Contrasting with the expression pattern of flowering pathway genes, over 100 stress-responsive genes (15% of 600) showed high expression levels in panicles, being on average 11.1-fold more highly expressed in panicle tissues. Sequence alignment showed that a total of 70 bamboo genes shared high identity with known rice genes, which were mainly involved in the abscisic acid pathway, the ethylene-responsive pathway, sugar metabolism and the calcium-dependent signal transduction pathway, in addition to the FMI or FPI pathways (Supplementary Table 19). Of these genes, 45 (65% of 70) were involved in the response to drought stress or to other correlative stresses (such as oxidative stress), and 10 (15%) were involved in flowering pathways. Some FMI-related genes and their upstream regulatory drought-responsive genes were observed to have high expression during flowering (Supplementary Fig. 13), suggesting a potential connection between severe drought stress and flowering (Fig. 3b). It is noteworthy that the bamboo panicles were collected in southern China, where a severe drought occurred just 2 months before the collection of our samples. However, further experiments are necessary to identify the mechanisms underlying the activation of bamboo flowering.
developed the de novo assembly pipeline and performed de novo genome assembly. L.Z. performed BAC sequence assembly. L.L., Z.G., X.Y., T.W., K.M., C. Zhuang, X.C., W.T., G.L., Y. Liu, J.C., Zhenjing Liu, L.Y. and Zhenhua Liu collected bamboo samples and performed cytogenetics studies and functional analysis. T. Huang, Y.Z. and C. Zhu provided IT support. B.F. and X.L. coordinated the project. Ying Lu, B.H., Z.P. and Z.J. analyzed the data as a whole and wrote the manuscript. COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/ reprints/index.html.
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

1. Lobovikov, M., Paudel, S., Piazza, M., Ren, H. & Wu, J. World Bamboo Resources: A Thematic Study Prepared in the Framework of the Global Forest Resources Assessment 2005 (Food and Agriculture Organization of the United Nations, Rome, 2007).
2. Peng, Z. et al. Genome-wide characterization of the biggest grass, bamboo, based on 10,608 putative full-length cDNA sequences. BMC Plant Biol. 10, 116 (2010).
3. Zhang, Y.J., Ma, P.F. & Li, D.Z. High-throughput sequencing of six bamboo chloroplast genomes: phylogenetic implications for temperate woody bamboos (Poaceae: Bambusoideae). PLoS ONE 6, e20596 (2011).
4. Gui, Y.J. et al. Insights into the bamboo genome: syntenic relationships to rice and sorghum. J. Integr. Plant Biol. 52, 1008–1015 (2010).
5. Sungkaew, S., Stapleton, C.M., Salamin, N. & Hodkinson, T.R. Non-monophyly of the woody bamboos (Bambuseae; Poaceae): a multi-gene region phylogenetic analysis of Bambusoideae s.s. J. Plant Res. 122, 95–108 (2009).
6. Sharma, R.K. et al. Evaluation of rice and sugarcane SSR markers for phylogenetic and genetic diversity analyses in bamboo. Genome 51, 91–103 (2008).
7. Das, M., Bhattacharya, S. & Pal, A. Generation and characterization of SCARs by cloning and sequencing of RAPD products: a strategy for species-specific marker development in bamboo. Ann. Bot. (Lond.) 95, 835–841 (2005).
8. Chen, R. et al. Chromosome Atlas of Major Economic Plants Genome in China, Tomus IV: Chromosome Atlas of Various Bamboo Species (Science Press, Beijing, 2003).
9. Gui, Y. et al. Genome size and sequence composition of moso bamboo: a comparative study. Sci. China C Life Sci. 50, 700–705 (2007).
10. Li, R. et al. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
11. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
12. Velasco, R. et al. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS ONE 2, e1326 (2007).
13. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40 Database issue, D109–D114 (2012).
14. Swigoňová, Z. et al. Close split of sorghum and maize genome progenitors. Genome Res. 14, 1916–1923 (2004).
15. Wendel, J.F. Genome evolution in polyploids. Plant Mol. Biol. 42, 225–249 (2000).
16. Gaut, B.S. Evolutionary dynamics of grass genomes. New Phytol. 154, 15–28 (2002).
17. Kellogg, E.A. Relationships of cereal crops and other grasses. Proc. Natl. Acad. Sci. USA 95, 2005–2010 (1998).
18. Goff, S.A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100 (2002).
19. Guyot, R. & Keller, B. Ancestral genome duplication in rice. Genome 47, 610–614 (2004).
20. Barker, N.P. et al. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann. Mo. Bot. Gard. 88, 373–457 (2001).
21. Sánchez-Ken, J.G., Clark, L.G., Kellogg, E.A. & Kay, E.E. Reinstatement and emendation of subfamily Micrairoideae (Poaceae). Syst. Bot. 32, 71–80 (2007).
22. Bouchenak-Khelladi, Y. et al. Large multi-gene phylogenetic trees of the grasses (Poaceae): progress towards complete tribal and generic level sampling. Mol. Phylogenet. Evol. 47, 488–505 (2008).
23. Cui, K., He, C.Y., Zhang, J.G., Duan, A.G. & Zeng, Y.F. Temporal and spatial profiling of internode elongation-associated protein expression in rapidly growing culms of bamboo. J. Proteome Res. 11, 2492–2507 (2012).
24. Somerville, C. Cellulose synthesis in higher plants. Annu. Rev. Cell Dev. Biol. 22, 53–78 (2006).
25. Yin, Y., Huang, J. & Xu, Y. The cellulose synthase superfamily in fully sequenced plants and algae. BMC Plant Biol. 9, 99 (2009).
26. Schnable, P.S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115 (2009).
27. Humphreys, J.M. & Chapple, C. Rewriting the lignin roadmap. Curr. Opin. Plant Biol. 5, 224–229 (2002).

URLs. KEGG, http://www.genome.jp/kegg/; SMALT, http://www.sanger.ac.uk/resources/software/smalt/; SOAPdenovo, http://soap.genomics.org.cn/; Repbase, http://www.girinst.org/repbase/; cell wall genomics, http://cellwall.genomics.purdue.edu/families/; PHYLIP version 3.69, http://evolution.genetics.washington.edu/phylip.html; PLAZA Comparative Genomics Platform, http://bioinformatics.psb.ugent.be/plaza/; RepeatModeler, http://www.repeatmasker.org/RepeatModeler.html; RepeatMasker, http://www.repeatmasker.org/; EMBL, http://www.ebi.ac.uk/; GenBank, http://www.ncbi.nlm.nih.gov/nuccore/.

Methods
Methods and any associated references are available in the online version of the paper.

Accession codes. Short-read sequencing data from this whole-genome shotgun project have been deposited at the European Molecular Biology Laboratory (EMBL) under the accession ERP001340. RNA-seq data have also been deposited at EMBL under accession ERP001341. Data from the Sanger sequencing of BACs were deposited at EMBL and GenBank under the accessions included in parentheses: B001E05 (FO203447), B001G05 (FO203436), B001I05 (FO203448), B001I13 (FO203437), B015M02 (FO203443), B019A14 (FO203439), B031C15 (FO203444) and B035L11 (FO203441). All bamboo data have been released at the official website of the National Center for Gene Research (http://www.ncgr.ac.cn/bamboo). The entire data set includes genome assemblies, BAC end sequences and annotation of genes and lists of repeat elements, heterozygous SNPs, tRNAs, miRNAs and gene clusters. The current version of the data set is the first version.
Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank C. Xu for her technological support in cytogenetic analysis. This work was supported by the Forestry Project of the Ministry of Science and Technology of the People's Republic of China (grant 200704001 to Z.J.) and the Chinese Academy of Sciences (KSCX2-YW-G-034 to B.H.).

AUTHOR CONTRIBUTIONS
Z.J., Z.P. and B.H. conceived the project and its components, designed the studies and contributed to the original concept of the project. Q.F., D.F., Y.G., W.L., Yiqi Lu, T. Hu, N.Y., C. Zhou and Q.W. performed DNA preparation and genome sequencing. Ying Lu, Y. Li, K.L., T.L. and X.H. performed genome data analysis. Ying Lu and T.L. performed transcriptome (RNA-seq and cDNA) analyses. Z.N., H.L. and Q.Z.

28. Boerjan, W., Ralph, J. & Baucher, M. Lignin biosynthesis. Annu. Rev. Plant Biol. 54, 519–546 (2003).
29. Hamberger, B. et al. Genome-wide analyses of phenylpropanoid-related genes in Populus trichocarpa, Arabidopsis thaliana, and Oryza sativa: the Populus lignin toolbox and conservation and diversification of angiosperm gene families. Can. J. Bot. 85, 1182–1201 (2007).
30. Arora, R. et al. MADS-box gene family in rice: genome-wide identification, organization and expression profiling during reproductive development and stress. BMC Genomics 8, 242 (2007).
31. Ehrenreich, I.M. et al. Candidate gene association mapping of Arabidopsis flowering time. Genetics 183, 325–335 (2009).
32. Fornara, F., Montaigu, A. & Coupland, G. SnapShot: control of flowering in Arabidopsis. Cell 141, 550.e1–550.e2 (2010).
33. Putterill, J., Robson, F., Lee, K., Simon, R. & Coupland, G. The CONSTANS gene of Arabidopsis promotes flowering and encodes a protein showing similarities to zinc finger transcription factors. Cell 80, 847–857 (1995).
34. Gaut, B.S., Morton, B.R., McCaig, B.C. & Clegg, M.T. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc. Natl. Acad. Sci. USA 93, 10274–10279 (1996).
35. Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, D883–D887 (2007).
36. Paterson, A.H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
37. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).


ONLINE METHODS

DNA library preparation and sequencing. Moso bamboo samples for shotgun sequencing were collected in the Tianmu Mountain National Nature Reserve in Zhejiang Province, China, from five plants that were determined to be a single individual when they were found to share the same rhizome system. Using the DNeasy Plant Mini kit (Qiagen), we extracted total DNA from moso bamboo leaves. Genomic DNA was purified according to the protocol for the isolation of high-molecular-weight nuclear DNA (ref. 38). We applied an amplification-free approach to prepare sequencing libraries with a short insert size of 350 to 400 bp for paired-end reads, following a modified version of the manufacturer's protocol (Illumina) and methods described previously (ref. 39). For construction of libraries with insert sizes of 3, 8 and 16 kb for mate-paired reads, we used combined protocols from the Mate Pair Library v2 Sample Preparation Guide (Illumina) and the Paired-End Library Preparation Method Manual (Roche). Raw data from paired-end libraries with read lengths of 2 × 120 bp and 2 × 100 bp were generated by an Illumina Genome Analyzer IIx sequencer and a HiSeq 2000 sequencer, respectively. The mate-paired reads (2 × 50 bp and 2 × 76 bp) were generated by the Illumina Genome Analyzer IIx sequencer.

Sequence assembly. We developed a de novo assembly pipeline to assemble the Illumina short reads (Supplementary Fig. 2), which integrated the existing assemblers Phusion2 (ref. 40), SOAPdenovo, Abyss (ref. 41) and SSPACE (ref. 42). Before assembling sequences, paired-end reads were screened to remove low-quality reads that contained ten or more unique K-mers. Screened paired-end reads were then clustered into thousands of groups by Phusion2 with a K-mer of 51 bp. During clustering, K-tuples (contiguous DNA sequences that are K bases long) were merged and sorted into a table, and shared K-mer words were linked in a relation matrix. The reads in each cluster were assembled into SOAP_contigs and Abyss_contigs by SOAPdenovo and Abyss, respectively. Contigs derived from both assemblers were then merged to generate the initial contigs by GAP5 (ref. 43). Mate-paired reads were mapped to the initial contigs by the aligner SMALT. To reduce redundancy, when two or more mate-paired reads were mapped to the same location, only one pair was kept for the subsequent assembly. The average insert size of each mate-pair library was estimated by determining the distance between mate-paired reads that were well mapped to the same contig (Supplementary Fig. 14). Using paired-end and mate-paired reads, preliminary scaffolds were assembled by SOAPdenovo with a K-mer of 61 bp. Scaffolds were rearranged by mapping the initial contigs to the primary scaffolds. The final scaffolding was performed by SSPACE, using mate-paired reads and BAC end sequences. Scaffolds less than 500 bp in length were excluded from the statistics and from the subsequent annotation.

Transcriptome sequencing with an amplification-free library preparation method. Five vegetative tissues (young leaves, rhizomes, roots, tips of 20-cm-high shoots and tips of 50-cm-high shoots) were collected from the same individual used in genome sequencing. Flowering tissues were collected from the plants of a single individual growing in Guangxi Province in southern China (Supplementary Note). Up to 400 µg of total RNA was isolated from each tissue using a TRIzol-based method at the beginning of the preparation of cDNA sequencing libraries. Libraries were constructed with Illumina sequencing technology and an amplification-free method (ref. 39). Briefly, after treatment with DNase, mRNA was isolated from total RNA with the Oligotex mRNA Midi kit (Qiagen). Fragmentation of mRNA followed the protocol of the Ambion RNA Fragmentation Reagents kit. Sequencing libraries of cDNA were constructed using the same amplification-free approach as used in genomic sequencing.

Annotation of protein-coding genes.
Protein-coding gene models were derived from evidence-based FgeneSH++ (Softberry) pseudomolecules (Supplementary Fig. 7). To facilitate gene modeling and address interesting biological questions, a total of 110 billion RNA-seq reads were generated from 7 libraries, and a select group of 8,253 cDNA sequences was used. Each potential gene model was supported by the expressed sequences from the moso bamboo cDNA or transcriptome sequences (Supplementary Note). Using amplification-free RNA-seq data, each library detected over 24,000 loci matching our requirement that candidate gene models be supported by the full-length cDNA or 2 or more uniquely matched RNA-seq sequences (Supplementary Fig. 15). The coverage of RNA-seq reads on the coding regions of annotated loci indicated that up to 27,000 predicted gene models were strongly supported by transcriptome sequences (RNA-seq data coverage in coding regions of >70%; Supplementary Fig. 16). In combination with ab initio gene prediction and alignments of the transcriptome and cDNA data, a total of 31,987 high-confidence genes were identified in the annotation.

Identification of genes involved in cell wall biosynthesis. To investigate the genes involved in cell wall biosynthesis, we compared the CesA, Csl and phenylpropanoid-lignin biosynthesis genes in bamboo and other grass genomes, as well as in the Arabidopsis and poplar genomes. We used sequences encoded by the identified CesA or Csl genes in Arabidopsis (ref. 25), poplar (refs. 44,45), rice (ref. 46), maize (ref. 26) and sorghum (ref. 24) for alignment to those encoded by the gene models of Brachypodium and bamboo by BLASTP with E values under 1 × 10^-10. Aligned hits with at least 200 amino acids of matched length and over 50% protein sequence identity were considered to be homologs of the CesA or Csl genes. For the phenylpropanoid-lignin genes, the reported homologs of Arabidopsis (ref. 47), poplar, rice (ref. 29) and maize downloaded from the cell wall genomics browser (see URLs) were used as the seed sequences to detect the bamboo, Brachypodium and sorghum gene models by BLASTP with E values under 1 × 10^-10 and with over 50% identity over the whole protein sequence. Detected homologs consisted of not only phenylpropanoid-lignin genes but also many phenylpropanoid-lignin–like genes, which might be involved in different pathways, even though they share high sequence identity (such as At4CL-like genes; ref. 48). To remove these phenylpropanoid-lignin–like genes, we used the phenylpropanoid-lignin genes from Arabidopsis, maize, rice and poplar to build an initial neighbor-joining tree to cluster the phenylpropanoid-lignin and phenylpropanoid-lignin–like genes into different clades.
According to this cluster information, we manually filtered the top BLASTP hits of each homolog to include only phenylpropanoid-lignin genes in our phylogenetic analysis. Consensus neighbor-joining trees were generated using PHYLIP (version 3.69) on the basis of 100 bootstrap trees.

Identification of flowering genes. Use of the amplification-free approach for the preparation of transcriptome sequencing libraries eliminated much of the redundancy in transcripts introduced by the amplification of templates during library construction. Generated RNA-seq reads were aligned to the gene model set with the SMALT aligner. The quantity of reads uniquely mapped to the gene models was converted to a quantification of the transcript levels in RPKM. We then used the R package DEGseq (ref. 49) to digitally measure the differential expression at annotated loci. A gene with expression that was more than twofold higher (Q value < 0.001; ref. 50) in panicles relative to any other vegetative tissue and that had at least five mapped transcripts was considered to be a potential flowering gene in moso bamboo. Both amino-acid sequences and the conserved Interpro function domains encoded by the loci were compared to those of known Arabidopsis (TAIR10; ref. 51) and rice (MSU RGAP 6.1) genes, the outputs of which were manually checked to determine the putative functions of the loci involved in the flowering pathways.

Construction of gene families among fully sequenced grass genomes. We applied OrthoMCL (ref. 52) clustering to identify gene families enriched in the Pooideae, Ehrhartoideae, Panicoideae and Bambusoideae subfamilies. The bamboo gene predictions and rice (MSU RGAP 6.1), Brachypodium (MIPS1.2), sorghum (JGI 1.4), maize (5b.60), foxtail millet (v8.0), poplar (JGI 2.0) and Arabidopsis (TAIR10) gene sequences downloaded from the PLAZA comparative genome database (version 2.0; ref. 53) were used to infer potential orthologous families of genes. The rice genome represented Ehrhartoideae; the maize, sorghum and foxtail millet genomes represented Panicoideae; the Brachypodium genome represented Pooideae; and the bamboo genome represented Bambusoideae. The transposable element–derived genes in the genomes from the PLAZA database were removed before they were added to the alignment. An all-against-all comparison was then performed using BLASTP with an E value of 1 × 10^-10. We then used the standard settings to compute gene similarities across all eight genomes. A total of 194,376 protein sequences were grouped into 27,294 gene clusters. OrthoMCL clustered a total of 968 single-copy gene families, which were subjected to phylogenetic analyses by MrBayes (ref. 54). The expansion and contraction of the gene clusters were determined by a CAFE calculation (version 2.1; ref. 55) on the basis of changes in gene family size in generated phylogenetic history.
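The selection of single-copy gene families from the clustering output can be illustrated with a short sketch. It assumes that "single-copy" means exactly one member from each of the genomes compared, which the text does not state explicitly; the function name, the input layout (cluster ID mapped to a list of species and gene pairs) and the toy data are hypothetical and are not taken from OrthoMCL or from the authors' pipeline.

```python
from collections import Counter


def single_copy_families(clusters, species):
    """Return IDs of clusters holding exactly one gene from every listed species.

    clusters: dict mapping a cluster ID to a list of (species, gene_id) tuples
              (hypothetical layout, not the native OrthoMCL output format).
    species:  iterable of all species names that must be represented.
    """
    wanted = set(species)
    selected = []
    for cluster_id, members in clusters.items():
        counts = Counter(sp for sp, _gene in members)
        # Every species present, none missing, and one copy each.
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(cluster_id)
    return selected


# Toy data for illustration only (three species instead of eight).
species = ["bamboo", "rice", "sorghum"]
clusters = {
    "fam1": [("bamboo", "b1"), ("rice", "r1"), ("sorghum", "s1")],
    "fam2": [("bamboo", "b2"), ("bamboo", "b3"), ("rice", "r2"), ("sorghum", "s2")],
    "fam3": [("bamboo", "b4"), ("rice", "r3")],
}
print(single_copy_families(clusters, species))
```

In this toy input only fam1 qualifies: fam2 has two bamboo members and fam3 lacks a sorghum member. Families selected this way are the kind of input that a downstream phylogenetic analysis would start from.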


doi:10.1038/ng.2569

Repeat annotation. A de novo repeat prediction for the moso bamboo genome was carried out by successively using RepeatModeler (version 1.0.3) and RepeatMasker (version 3.3.0) (see URLs). We first constructed a moso bamboo repeat library using RepeatModeler with default parameters. Two complementary programs, RECON and RepeatScout (refs. 56,57), were configured at the center of RepeatModeler and were employed in the identification of repeat family sequences in the genome. The consensus sequences for the families were manually examined by aligning them to the known Repbase transposable element library (version 16.0), and known gene and genome sequences downloaded from the NCBI database (nt and nr; released 9 September 2011). The moso bamboo transposable element library was composed of a total of 1,403 generated consensus sequences and their classification information, and the library was used to run RepeatMasker on the whole-genome assemblies. Full-length LTR retrotransposons were predicted using LTRharvest (ref. 58) and LTR_FINDER (ref. 59).
38. Peterson, D.G., Tomkins, J.P., Frisch, D.A., Wing, R.A. & Paterson, A.H. Construction of plant bacterial artificial chromosome (BAC) libraries: an illustrated guide. J. Agric. Genomics 5, 34–40 (2000).
39. Kozarewa, I. et al. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6, 291–295 (2009).
40. Mullikin, J.C. & Ning, Z. The Phusion assembler. Genome Res. 13, 81–90 (2003).
41. Simpson, J.T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
42. Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
43. Bonfield, J.K. & Whitwham, A. Gap5: editing the billion fragment sequence assembly. Bioinformatics 26, 1699–1703 (2010).
44. Djerbi, S., Lindskog, M., Arvestad, L., Sterky, F. & Teeri, T.T. The genome sequence of black cottonwood (Populus trichocarpa) reveals 18 conserved cellulose synthase (CesA) genes. Planta 221, 739–746 (2005).
45. Suzuki, S., Li, L., Sun, Y.H. & Chiang, V.L. The cellulose synthase gene superfamily and biochemical functions of xylem-specific cellulose synthase–like genes in Populus trichocarpa. Plant Physiol. 142, 1233–1245 (2006).
46. Hazen, S.P., Scott-Craig, J.S. & Walton, J.D. Cellulose synthase–like genes of rice. Plant Physiol. 128, 336–340 (2002).
47. Ehlting, J. et al. Global transcript profiling of primary stems from Arabidopsis thaliana identifies candidate genes for missing links in lignin biosynthesis and transcriptional regulators of fiber differentiation. Plant J. 42, 618–640 (2005).
48. Costa, M.A. et al. Characterization in vitro and in vivo of the putative multigene 4-coumarate:CoA ligase network in Arabidopsis: syringyl lignin and sinapate/sinapyl alcohol derivative formation. Phytochemistry 66, 2072–2091 (2005).
49. Wang, L., Feng, Z., Wang, X., Wang, X. & Zhang, X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26, 136–138 (2010).
50. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., B 57, 289–300 (1995).
51. Childs, K.L. et al. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 35 Database issue, D846–D851 (2007).
52. Li, L., Stoeckert, C.J. Jr. & Roos, D.S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
53. Van Bel, M. et al. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600 (2012).
54. Huelsenbeck, J.P. & Ronquist, F. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17, 754–755 (2001).
55. De Bie, T., Cristianini, N., Demuth, J.P. & Hahn, M.W. CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006).
56. Bao, Z. & Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002).
57. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 21 (suppl. 1), i351–i358 (2005).
58. Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
59. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).


letters

OsLG1 regulates a closed panicle trait in domesticated rice


Takashige Ishii1, Koji Numaguchi1, Kotaro Miura2,3, Kentaro Yoshida4,5, Pham Thien Thanh1, Than Myint Htun1, Masanori Yamasaki1, Norio Komeda2, Takashi Matsumoto6, Ryohei Terauchi4, Ryo Ishikawa1 & Motoyuki Ashikari2
Reduction in seed shattering was an important phenotypic change during cereal domestication (refs. 1,2). Here we show that a simple morphological change in rice panicle shape, controlled by the SPR3 locus, has a large impact on seed-shedding and pollinating behaviors. In the wild genetic background of rice, we found that plants with a cultivated-like type of closed panicle had significantly reduced seed shedding through seed retention. In addition, the long awns in closed panicles disturbed the free exposure of anthers and stigmas on the flowering spikelets, resulting in a significant reduction of the outcrossing rate. We localized the SPR3 locus to a 9.3-kb genomic region, and our complementation tests suggest that this region regulates the liguleless gene (OsLG1). Sequencing analysis identified reduced nucleotide diversity and a selective sweep at the SPR3 locus in cultivated rice. Our results suggest that a closed panicle was a selected trait during rice domestication.

Rice (Oryza sativa L.) is an important crop and a major source of food for more than one-third of the world's population (ref. 3). Cultivated rice is derived from the Asian wild species Oryza rufipogon Griff. (ref. 4). Ancient humans are thought to have initiated rice domestication about 10,000 years ago (ref. 5). Compared with cultivated rice, wild O. rufipogon maintains several propagation-related traits such as prostrate growth habit, seed-shattering habit, open panicles, seed awning and strong seed dormancy. Among the traits needed for survival under natural conditions, seed-shattering behavior guarantees the success of propagation through seed dispersal. This character, however, may have made seed collection difficult.

In rice, two loci, qSH1 and sh4, have been reported to have a strong influence on seed shattering (refs. 6,7). These loci had been fine-mapped, and the genes had been isolated (refs. 6,8). The wild alleles from O. rufipogon at both loci are responsible for the formation and development of an abscission layer between the pedicel and spikelet in the genetic background of the cultivars. Previously, we had produced two backcross populations with reciprocal genetic backgrounds of cultivated rice (O. sativa Japonica cv. Nipponbare) and wild rice (O. rufipogon acc. W630), and we had examined allelic effects at both loci sh4 and qSH1 (ref. 9). In the genetic background of cultivated rice, we had confirmed the wild qSH1 and sh4 alleles to be responsible for seed shattering. In the genetic background of O. rufipogon W630, however, the backcrossed plants with nonfunctional alleles from cultivated plants at either qSH1 or sh4 shed all of their seeds (ref. 9). In addition, among Asian O. rufipogon accessions, several shattering accessions have nonfunctional alleles at sh4 (ref. 10). This suggests that the nonshattering behavior was not obtained through a single mutation at either locus implicated in seed shattering in O. rufipogon.

Here we found that a simple morphological change in panicle shape had a large impact on seed shedding. Morphological differences in panicles between cultivated O. sativa and wild O. rufipogon are explained by a single locus (SPR3): O. rufipogon has the dominant allele Spr3 for spreading panicles (ref. 11). To investigate gene effects, we selected a cultivar of O. sativa Nipponbare (with closed panicles) and a wild accession of O. rufipogon W630 (with spreading panicles) (Fig. 1a–g). Their panicle shapes are formed by the basal structure of primary branches (Fig. 1c,f). At the cellular level, O. rufipogon W630 has tissue resembling a bump structure between the main and primary branches (Fig. 1d,g).

To confirm the chromosomal location of the panicle-spreading locus, we performed quantitative trait locus (QTL) analysis using 161 BC2F8 plants between O. sativa Nipponbare (a recurrent parent) and O. rufipogon W630 (a donor parent). We detected only one strong QTL, which explained 80.1% of phenotypic variance, between molecular markers S-E3 and RM5506 on chromosome 4 (Fig. 1h, Supplementary Fig. 1 and Supplementary Table 1). The region overlaps with the putative position of SPR3 (ref. 12), and we developed a near-isogenic line named NIL(SPR3-Npb) that contained a small chromosomal segment from O. sativa Nipponbare at the SPR3 region in the genetic background of O. rufipogon W630 (Fig. 1h–n and Supplementary Fig. 2). The NIL(SPR3-Npb) had closed panicles with no bump structure tissue at the basal parts of primary branches (Supplementary Fig. 3). This line exhibited the shattering habit, but it had a tendency to retain upper mature seeds on the panicles through support from long awns in the lower immature seeds (Fig. 1n). In addition, the long awns in closed panicles disturbed the free exposure of anthers and stigmas of the flowering spikelets (Fig. 1m). These observations suggest that a simple morphological change in the panicle may have influenced seed-shedding and pollinating behaviors.

1Graduate School of Agricultural Science, Kobe University, Kobe, Japan. 2Bioscience and Biotechnology Center, Nagoya University, Nagoya, Japan. 3Faculty of Biotechnology, Fukui Prefectural University, Yoshida, Japan. 4Iwate Biotechnology Institute, Kitakami, Japan. 5The Sainsbury Laboratory, Norwich Research Park, Norwich, UK. 6National Institute of Agrobiological Sciences, Tsukuba, Japan. Correspondence should be addressed to T.I. (tishii@kobe-u.ac.jp).

Received 27 August 2012; accepted 31 January 2013; published online 24 February 2013; doi:10.1038/ng.2567


Figure 1 Plant morphology of O. sativa Nipponbare, O. rufipogon W630 and NIL(SPR3-Npb). (a) Plant morphology of O. sativa Nipponbare and O. rufipogon W630 in the vegetative growth stage. (b–g) Panicles of O. sativa Nipponbare (b–d) and O. rufipogon W630 (e–g) in the heading stage (b,e) with magnification of the boxed regions showing the basal structure of the primary branches (c,f) and further magnification showing longitudinal sections of the basal parts of primary branches (d,g; arrows mark the bump structure tissue). (h) Chromosomal location of SPR3 and graphical genotypes of Nipponbare, W630 and NIL(SPR3-Npb). Chromosomal segments of Nipponbare and W630 are shown in yellow and green, respectively. (i–n) Panicles of O. sativa Nipponbare (i,j), O. rufipogon W630 (k,l) and NIL(SPR3-Npb) (m,n) in the flowering stage (i,k,m) and maturing stage (j,l,n). Scale bars, 20 cm (a); 5 cm (b,e,i–n); 1 mm (c,f); and 100 µm (d,g).


To evaluate the effects of the cultivated allele on seed shedding, we also produced two near-isogenic lines for qSH1 and sh4, that is, NIL(qSH1-Npb) and NIL(sh4-Npb), in the same combination of parents (Supplementary Fig. 2). These lines had the cultivated nonfunctional alleles at qSH1 and sh4 loci, respectively. We subjected the three near-isogenic lines, together with the wild parental accession, to a seed-gathering experiment in the field (Supplementary Fig. 4). As all of the plants exhibited the seed-shattering habit, we collected their seeds directly from the panicles by hand at the maturing stage and calculated seed-gathering rates (Fig. 2a and Supplementary Table 2). We collected significantly more seeds from NIL(SPR3-Npb) (mean gathering rate of 30.6%) than from NIL(qSH1-Npb), NIL(sh4-Npb) and the wild parental accession O. rufipogon W630 (mean gathering rates of 21.9%, 23.1% and 19.8%, respectively), indicating that mature seeds could be gathered efficiently from the closed panicles. To confirm that plants with closed panicles can retain mature seeds longer than those with open panicles, we evaluated days to seed shedding with NIL(SPR3-Npb) and the wild parent. We grew plants in the natural environment and recorded the number of days from flowering to seed shedding under the following two observation conditions: we detached seeds by tapping by hand (hand-tapping condition), and we allowed the seeds to fall by themselves (natural condition). In the hand-tapping condition, the average number of days to shedding was not significantly different between O. rufipogon W630 and NIL(SPR3-Npb) (Fig. 2b), indicating that the formation of abscission layers in NIL(SPR3-Npb) and O. rufipogon W630 was almost completed at the same time. In contrast, we observed a significant difference between the average values under the natural condition: 13.8 d and 14.8 d for O. rufipogon W630 and NIL(SPR3-Npb), respectively (Fig. 2b). These results indicate that plants with closed panicles retained mature seeds for about 1 d longer than those with open panicles and explain why we could collect mature seeds efficiently from the plants with closed panicles.

We examined the outcrossing rates for three near-isogenic lines: NIL(qSH1-Npb), NIL(sh4-Npb) and NIL(SPR3-Npb). They exhibited a similar plant and floral morphology to that of recurrent O. rufipogon W630 (Supplementary Table 3), except for the closed panicles observed in NIL(SPR3-Npb). We planted the lines in a paddy field surrounded by O. rufipogon W630, and checked the fertilized seeds to assess self-pollination or outcrossing (Supplementary Fig. 5 and Supplementary Table 4). We observed outcrossing rates of more than 10% for NIL(qSH1-Npb) and NIL(sh4-Npb), whereas the average for NIL(SPR3-Npb) was 2.82%. A significant reduction in outcrossing rate was caused by a single cultivated allele at SPR3 that changed panicle structure from open to closed (Fig. 2c). This morphological change may have had a large impact on pollination behavior during rice domestication.

Figure 2 Effects of closed panicles on seed-shedding and pollinating behaviors of O. rufipogon. (a) Seed-gathering rates (mean ± s.d. of six plot replicates, with nine plants in each plot) for O. rufipogon W630 and three indicated near-isogenic lines. Genetic backgrounds, locus genotypes, seed-shattering behaviors and panicle shapes are listed under the graph. W and Npb indicate O. rufipogon W630 and O. sativa Nipponbare, respectively. (b) Days to seed shedding observed for O. rufipogon W630 and NIL(SPR3-Npb) under hand-tapping and natural conditions. Error bars, s.e.m. of four (hand-tapping condition) and eight panicles (natural condition), with 40 spikelets in each panicle. n.s. and ** indicate not significant and significant at 1% level by unpaired Student's t-test, respectively. (c) Outcrossing rates (mean ± s.d. of six plants) estimated for NIL(qSH1-Npb), NIL(sh4-Npb) and NIL(SPR3-Npb). In a and c, mean values labeled with different letters are significantly different, whereas those with same letters are not (Tukey's test, P < 0.05).

To identify the chromosomal location of SPR3, we conducted a large-scale survey with 2,358 plants segregating between S-E3 and RM5506 in the genetic background of O. rufipogon W630 (Fig. 3a and Supplementary Table 1). We obtained four plants (plant numbers 595, 645, 1016 and 1069) containing recombination between two marker loci, S-K3 and S-G4. Three exhibited open panicles with bump structure tissue in the basal parts of primary branches, whereas plant number 645 exhibited an intermediate phenotype: most of the primary branches were closed but they had bump structure tissue (Fig. 3a and Supplementary Figs. 6–8). For three recombinant plants (plant numbers 595, 1016 and 1069), we estimated that the region responsible for the change in the panicle phenotype and formation of the bump structure is located in the 9.3 kb between S-K3 and S-G4 (Fig. 3a). We detected 118 polymorphic sites in the 9.3-kb region for W630 versus Nipponbare (Supplementary Table 5). However, no coding sequences were predicted in either genomic region. We noted that the rice OsLG1 gene is located 10 kb away from the responsible region (Fig. 3b). The OsLG1 gene encodes an SBP (SQUAMOSA promoter-binding protein) domain and controls laminar joint and ligule development (ref. 13). OsLG1 alleles in both Nipponbare and W630 were functional because leaf and ligule morphology was normal for Nipponbare, W630 and NIL(SPR3-Npb) (Supplementary Table 6 and Supplementary Fig. 9). In the basal parts of primary branches, OsLG1 expression was greater in W630 than in Nipponbare (Fig. 3c). To investigate whether this flanking gene is associated with the upstream 9.3-kb region, we performed complementation tests by transforming two W630 fragments into Nipponbare (Fig. 3b and Supplementary Table 7). One transgenic line with the W630 fragment (construct number 9) possessing both the 9.3-kb region and the OsLG1 region exhibited

[Figure 3 image. Panel a: marker map of chromosome 4 (markers shown: S-E3, S-F1, S-F3, S-F4, S-K3, S-K4, RM17578, RM17579, S-F6, S-G4 and RM5506; scale bar, 10 kb), with Nipponbare and wild segments for recombinant plants 595 (panicle type open), 1016 (open), 645 (closed, shown in parentheses) and 1069 (open); bump structure present in all four plants. Panel b: the 9.3-kb region, Os04g0656100, OsLG1 and Os04g0656800, with constructs 9 and 81. Panel c: OsLG1/ACT1 expression in the basal part of the primary branch for N, W and F1. Panel d: OsLG1/ACT1 expression for N, vector, plant 9 and plant 81.]

Figure 3 Fine mapping and complementation tests of SPR3. (a) Molecular markers between S-E3 and RM5506 on the long arm of chromosome 4 were used to detect recombinants near the SPR3 locus. Numbers of recombinants are shown below the positions of the 11 markers. Chromosomal constitutions of four recombinants (plant numbers 595, 1016, 645 and 1069) are shown with their panicle phenotypes and bump structure. Green and yellow bars indicate chromosomal segments of O. rufipogon W630 and O. sativa Nipponbare, respectively. The SPR3 locus was estimated in the 9.3-kb region between S-K3 and S-G4 (orange). (b) Positions of the genomic regions inserted in two constructs (numbers 9 and 81). OsLG1 and two predicted genes were located near the 9.3-kb region. (c) Expression of OsLG1 relative to that of ACT1 in the basal parts of primary branches of Nipponbare (N), W630 (W) and their F1 plant (mean ± s.d., n = 36). (d) Expression of OsLG1 relative to that of ACT1 and panicle phenotypes observed for three transformants with vector control, constructs 9 and 81 (mean ± s.d., n = 34). Only transgenic plants with the wild fragment (construct 9) covering both 9.3-kb and OsLG1 regions had open panicles. Scale bars, 5 cm.

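[Figure 4 image. Top: chromosomal position (kb) from 33,350 to 33,550 on chromosome 4, marking the 9.3-kb region and OsLG1, with the intron 1 and intergenic sites labeled. Bottom: nucleotide diversity ratio of O. sativa to O. rufipogon on a scale of 0 to 0.7.]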

We conclude that a mutation at SPR3 in O. rufipogon changes the panicle structure from open to closed. This simple morphological change likely had a large impact on the seed-shedding and pollinating behaviors of O. rufipogon. Our results suggest that a closed panicle was a selected trait during rice domestication.

Methods
Methods and any associated references are available in the online version of the paper.

Accession codes. Sequences of the rice OsLG1 gene in O. sativa Nipponbare and O. rufipogon W630 have been deposited in the DNA Data Bank of Japan (DDBJ): AB776991 and AB776992, respectively.

Note: Supplementary information is available in the online version of the paper.

Acknowledgments
We thank the National Institute of Genetics (National Bioresource Project), Japan, and the National Institute of Agrobiological Sciences, Japan, for providing the seeds of wild and cultivated rice, Y. Takezaki and P.D.T. Phuong for helping with field experiments and H. Fukaki for supporting expression analysis. This work was supported in part by a Grant-in-Aid from the Japan Society for the Promotion of Science to T.I. (20580005, 23580006, 23.01390) and by the Japan Science and Technology Agency-Japan International Cooperation Agency within the framework of the Science and Technology Research Partnership for Sustainable Development to M.A.

AUTHOR CONTRIBUTIONS
T.I., K.N. and P.T.T. performed the field experiments and analyzed the results. T.M.H. and R.I. conducted the histological analysis. T.M. produced constructs, and K.M., N.K., R.I. and M.A. generated and analyzed transformants. K.N., M.Y., K.Y. and R.T. participated in sequence analysis, and T.I. and M.A. designed the research and wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

Figure 4 Nucleotide diversity for seven noncoding sites around the 9.3-kb region observed in O. rufipogon and O. sativa. Chromosomal positions of seven noncoding sites and the 9.3-kb region on chromosome 4 (top; Supplementary Table 12). Black boxes indicate predicted gene regions. Ratio of nucleotide diversity in O. sativa relative to that in O. rufipogon (bottom). The plot is annotated "Very low diversity in O. sativa".

open panicles similar to those of the wild parent, but the other transgenic line with the fragment covering only the OsLG1 region (construct number 81) had closed panicles (Fig. 3d). The panicle phenotypes corresponded to the expression levels of OsLG1 in the basal parts of the primary branches (Fig. 3d). The elevated expression of OsLG1 in the transgenic line containing construct number 9 was associated with higher expression of the W630 allele transformed into Nipponbare (Supplementary Fig. 10). These results suggest that the 9.3-kb region contains regulatory sequences of OsLG1.

To identify the regulatory sequences for panicle shape, we compared nucleotide sequences in the 9.3-kb region among 12 accessions of O. rufipogon and 31 landraces of O. sativa from diverse geographical locations in Asia (Supplementary Table 8). We detected several fixed polymorphic sites between O. rufipogon and O. sativa in the 9.3-kb region excluding some repeat sequences (Supplementary Table 9). A database search identified some transcription factor binding motifs in the fixed polymorphic sites (Supplementary Table 10). If a single base substitution disturbs binding of the transcription factor, one of these may be responsible for OsLG1 expression in the basal parts of the primary branches.

We also used the above sequence data to assess the impact of artificial selection at SPR3 during rice domestication. We observed a 20-fold reduction in genetic variation at the SPR3 locus of the 9.3-kb region in O. sativa (average proportion of pairwise differences per base pair (π) = 0.00043) compared to O. rufipogon (π = 0.00883) (Supplementary Table 11). Both coalescent simulation and Hudson-Kreitman-Aguadé (HKA) tests confirmed that O. sativa genetic diversity in this region was significantly lower (P < 0.032) than that for O. rufipogon. We also examined the levels of DNA polymorphism for seven noncoding sites (ranging from 607 bp to 753 bp in O. sativa) around the 9.3-kb region using 14 accessions of O. rufipogon and 31 landraces of O. sativa (Fig. 4 and Supplementary Table 8). In O. sativa, we observed very low values of nucleotide diversity in both noncoding sites adjacent to SPR3, with one or no polymorphic nucleotide sites among O. sativa landraces (Supplementary Table 12). We calculated diversity ratios as the nucleotide diversity in O. sativa (π(O. sativa)) relative to that in O. rufipogon (π(O. rufipogon)) (Fig. 4). We observed a selective sweep with a notable reduction in relative diversity in a small segment (<50 kb) that includes putative regulatory and coding regions of OsLG1. These results indicate that a strong selection pressure was imposed on the SPR3 locus during rice domestication14,15.
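As an illustration of the diversity statistic used above, the following minimal sketch computes the average proportion of pairwise differences per base pair (π) from a set of aligned sequences and then forms the O. sativa to O. rufipogon ratio. The function name `nucleotide_diversity` and the toy alignments are hypothetical and are not data from the study; the sketch also assumes gap-free alignments of equal length, which real sequence data would not guarantee.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average proportion of pairwise differences per site (pi).

    Assumes equal-length aligned sequences with no gaps or missing data.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    pairs = list(combinations(seqs, 2))
    # Count mismatched sites for every pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


# Toy alignments (hypothetical, for illustration only).
wild = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC", "ACGTACGTCC"]
cultivated = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]

pi_wild = nucleotide_diversity(wild)
pi_cultivated = nucleotide_diversity(cultivated)
diversity_ratio = pi_cultivated / pi_wild

# Fold reduction implied by the values reported in the text.
reported_fold = 0.00883 / 0.00043
```

With the reported values, `reported_fold` comes to roughly 20.5, consistent with the 20-fold reduction stated in the text.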

1. Harlan, J.R., de Wet, J.M. & Price, E.G. Comparative evolution of cereals. Evolution 27, 311–325 (1973).
2. Flannery, K.V. The origins of agriculture. Annu. Rev. Anthropol. 2, 271–310 (1973).
3. Khush, G.S. Origin, dispersal, cultivation and variation of rice. Plant Mol. Biol. 35, 25–34 (1997).
4. Oka, H.I. Origin of Cultivated Rice (Elsevier, Amsterdam, 1988).
5. Mannion, A.M. Domestication and the origins of agriculture: an appraisal. Prog. Phys. Geogr. 23, 37–56 (1999).
6. Konishi, S. et al. An SNP caused loss of seed shattering during rice domestication. Science 312, 1392–1396 (2006).
7. Li, C., Zhou, A. & Sang, T. Genetic analysis of rice domestication syndrome with the wild annual species, Oryza nivara. New Phytol. 170, 185–193 (2006).
8. Li, C., Zhou, A. & Sang, T. Rice domestication by reducing shattering. Science 311, 1936–1939 (2006).
9. Ishikawa, R. et al. Allelic interaction at seed-shattering loci in the genetic backgrounds of wild and cultivated rice species. Genes Genet. Syst. 85, 265–271 (2010).
10. Thurber, C.S. et al. Molecular evolution of shattering loci in U.S. weedy rice. Mol. Ecol. 19, 3271–3284 (2010).
11. Eiguchi, M. & Sano, Y. A gene complex responsible for seed shattering and panicle spreading found in common wild rices. Rice Genet. Newsletter 7, 105–107 (1990).
12. Luo, J.J., Hao, W., Jin, J., Gao, J.P. & Lin, H.X. Fine mapping of Spr3, a locus for spreading panicle from African cultivated rice (Oryza glaberrima Steud.). Mol. Plant 1, 830–838 (2008).
13. Lee, J., Park, J.J., Kim, S.L., Yim, J. & An, G. Mutations in the rice liguleless gene result in a complete loss of the auricle, ligule, and laminar joint. Plant Mol. Biol. 65, 487–499 (2007).
14. Clark, R.M., Linton, E., Messing, J. & Doebley, J.F. Pattern of diversity in the genomic region near the maize domestication gene tb1. Proc. Natl. Acad. Sci. USA 101, 700–707 (2004).
15. Purugganan, M.D. & Fuller, D.Q. The nature of selection during plant domestication. Nature 457, 843–848 (2009).


ONLINE METHODS

Plant materials. A rice cultivar, O. sativa Nipponbare, a wild accession of O. rufipogon from Myanmar (accession number W630), and their crossed progenies were used in this study. O. sativa Nipponbare has closed panicles with nonshattering seeds, whereas O. rufipogon W630 has open panicles and a seed-shattering behavior. The wild accession was provided by the National Institute of Genetics, Japan.

QTL analysis for panicle type. Backcross inbred lines consisting of 161 BC2F8 plants between O. sativa Nipponbare (a recurrent parent) and O. rufipogon W630 (a donor parent) were previously produced16. Their panicle types were examined on a scale of 1–3 (1, closed; 2, intermediate; 3, open) according to the standard evaluation system for rice17, and the genotypes at 181 simple sequence repeat marker loci covering 12 chromosomes were determined18. Based on these data, QTL analysis was performed using qGene software (version 3.06)19.

Development of near-isogenic lines. Three near-isogenic lines named NIL(SPR3-Npb), NIL(qSH1-Npb) and NIL(sh4-Npb) were developed by backcrossing in the genetic background of O. rufipogon W630 (Supplementary Fig. 2). NIL(SPR3-Npb) contained a small segment on chromosome 2 from O. sativa Nipponbare together with the target locus of SPR3. However, QTL analysis indicated that the region on chromosome 2 was not responsible for panicle spreading.

Histological analysis. Panicle tissues were collected and fixed under vacuum in 5% (v/v) formaldehyde, 5% (v/v) acetic acid and 63% (v/v) ethanol. Samples were dehydrated through an ethanol series and embedded in Technovit 7100 resin (Kulzer) according to the manufacturer's instructions. The sections were stained with Toluidine O (Waldeck) and photographed using a microscope (Eclipse 80i, Nikon).

Estimation of seed-gathering rate. Ancient gatherers collected wild O. rufipogon seeds for a long time. Even today, various methods are used to collect seeds from O. rufipogon4,20. Among them, the simplest method is to beat the panicles to collect mature seeds. Therefore, we used this method to estimate seed-gathering rate in this study. A wild accession of O. rufipogon W630 and three near-isogenic lines were planted in a paddy field at Kobe University, Japan (34°43′ N, 135°14′ E). One experimental plot was placed in a 40-cm square block with nine plants: 3 plants × 3 plants at intervals of 20 cm (Supplementary Fig. 4). The space between the plots was 80 cm. Four genotypes were separately planted in the field, with six replicates. In the seed-maturing stage, all of the panicles in each plot were tapped by hand, and the shattered seeds were collected in a plastic vat on alternate days. To estimate the total number of seeds produced in each plot, the following parameters were examined: collected seed weight (CSW; total weight of fully filled seeds), 1,000-seed weight (1000SW; weight of 1,000 fully filled seeds collected), seed setting rate (SSR; average of ten panicles), number of seeds per panicle (NSP; average of 30 panicles), number of panicles (NP; total number of all effective panicles), number of seeds collected ((CSW/1000SW) × 1,000) and total number of seeds produced (SSR × NSP × NP). The seed-gathering rate was estimated as the percentage of the number of seeds collected relative to the total number of seeds produced (Supplementary Table 2). Tukey's test was performed to determine the significance of differences in seed-gathering rates among the wild accession and the three near-isogenic lines using R software (version 2.13, R Development Core Team, http://www.R-project.org/).

Evaluation of seed shedding. In wild O. rufipogon, panicle flowering begins from spikelets in the upper part of the panicle and is complete in about 1 week. As it takes about 2 weeks for each seed to mature, different degrees of seed maturation are observed in a single panicle after flowering.
Mature seeds with complete formation of abscission layers shed by themselves, and almost-matured seeds can be easily detached from the panicle by weak contact. In this study, evaluation of seed shedding was carried out as follows. Four plants each of O. rufipogon W630 and NIL(SPR3-Npb), which headed in the period of 14–22 August 2008, were chosen. After heading, three panicles of each plant were selected, and flowering dates of 40 fertile spikelets per panicle were recorded. In the seed-maturing stage, seeds of one panicle were detached by hand-tapping, and seeds of the other two panicles were allowed to fall under natural conditions. The average value was taken as days to seed shedding for each plant. Significance of the difference in days to shedding between O. rufipogon W630 and NIL(SPR3-Npb) was tested using a nonpaired, two-tailed Student's t-test. There were no storms or typhoons on the evaluation days.

Measurement of plant and floral morphological traits. Plant and floral traits are considered to have an important role in outcrossing ability. The following six traits were measured for O. rufipogon W630 and the three near-isogenic lines: culm length, panicle length, awn length, anther length, stigma length and style length. Of these, floral traits (awn, anther, stigma and style lengths) were evaluated from 10 spikelets per plant. Average values of these traits were calculated from 10 plants from each line.

Estimation of outcrossing rate. The outcross experiment was conducted in the paddy field at Kobe University, Japan. A single plant of each of the three near-isogenic lines was surrounded by two rows of O. rufipogon W630 (Supplementary Fig. 5). A total of 18 plants (3 near-isogenic lines × 6 replicates) and 268 wild plants were planted at intervals of 20 cm. About 2 weeks after heading of O. rufipogon W630, the near-isogenic line plants were transferred from the paddy field to pots. Mature seeds (276 seeds on average) were collected by hand from each plant on every other day for 11 d. The near-isogenic line plants had a homozygous target segment of O. sativa Nipponbare in the genetic background of O. rufipogon W630, and some microsatellite markers are known to be located in this region. Therefore, self-pollinated and outcrossed seeds could be distinguished based on their marker genotypes, that is, Nipponbare homozygote and heterozygote, respectively.
The outcrossing rate of each near-isogenic line plant was determined as the proportion of seeds with heterozygous alleles among the total number of seeds examined (Supplementary Table 4). Tukey's test was performed to determine the significance of differences in outcrossing rates among the three near-isogenic lines using R software (version 2.13).

Fine mapping. A strong QTL for panicle shape was detected between molecular markers S-E3 and RM5506. We screened the recombinants between these marker loci from 2,358 BC2F3 segregating plants. A total of 30 recombinant plants were examined for the genotypes at nine marker loci in this region (Supplementary Table 1). As one recombinant (number 645) showed an intermediate panicle phenotype, it was not included in fine mapping. The candidate region of SPR3 was estimated with the recombinants showing a different panicle phenotype and bump-structure formation (Fig. 3a).

Complementation test. Two constructs (numbers 9 and 81) containing genomic fragments of O. rufipogon W630 were used for transformation (Supplementary Table 7). We first selected a BAC clone, W630-19P07, carrying the SPR3 region in pIndigoBAC-5. Candidate regions surrounding SPR3 were obtained from partially digested W630-19P07 DNA and were cloned into a binary vector pYLTAC7 (provided by RIKEN BioResource Center, Ibaraki, Japan). These binary vectors were introduced into Agrobacterium tumefaciens strain EHA105, which was used to transform O. sativa Nipponbare. Control plants were produced by transformation of the empty vector. All transgenic plants were grown in a closed greenhouse. Panicle phenotypes for two constructs were confirmed using T0 and T1 plants.

RNA isolation and quantitative reverse-transcriptase PCR analysis. Total RNA was isolated using an RNeasy plant mini kit (Qiagen) and treated with DNase I (Invitrogen).
cDNAs were synthesized from 200 ng of total RNA using PrimeScript reverse transcriptase (Takara) with oligo(dT) primer according to the manufacturer's instructions. Quantitative RT-PCR was performed using SYBR Green Thunderbird qPCR Mix (TOYOBO Life Science), and data were collected using a Thermal Cycler DICE Real-time System (Takara). Relative expression levels by quantitative (q)RT-PCR analysis were normalized against ACT1. Allele-specific RT-PCR was carried out using dCAPS primers designed in the coding region of OsLG1, where a SNP was detected between O. sativa Nipponbare and O. rufipogon W630 (Supplementary Fig. 9). The PCR product from the W630 allele was sensitive to AvaII digestion. The primers used in the RT-PCR analysis, allele-specific RT-PCR (dCAPS assay) and qRT-PCR are listed in Supplementary Table 13.

Sequence analysis. Nucleotide sequences in the 9.3-kb region were determined for 12 accessions of O. rufipogon and 31 landraces of O. sativa from diverse geographical locations in Asia (Supplementary Table 8). Further, seven noncoding regions (ranging from 607 bp to 753 bp in O. sativa) around the SPR3 locus were sequenced using 14 accessions of O. rufipogon and 31 landraces of O. sativa from diverse geographical locations in Asia (Supplementary Table 8). The number of polymorphic sites (S), number of haplotypes (h) and average proportion of pairwise differences per base pair (π) were calculated using DnaSP (version 5) (ref. 21) (Supplementary Tables 11 and 12).

Testing for selection. To test for selection of the 9.3-kb region, we used HKA tests and coalescent simulations to determine whether the observed genetic variation in O. sativa was significantly smaller than expected under neutrality. The HKA tests were conducted with p-VATPase22, Lhs1 (ref. 23) and four regions surrounding SPR3 as control genes and with sequences from O. glumaepatula W1169 (originating from Cuba) or O. barthii W652 (Sierra Leone) as an outgroup using MLHKA software24. The Markov chain was run for 100,000 cycles. The likelihood ratio tests in a species (O. sativa or O. rufipogon) were performed between the neutral model and the model assuming that the 9.3-kb region was artificially selected during rice domestication. We obtained small probability values (P < 0.0012) in O. sativa and high probability values (P > 0.43) in O. rufipogon.

We conducted coalescent simulations with a commonly used two-population model of domestication as described before25,26. This model assumes that there is a large stable population with constant size, representing the wild progenitor species, O. rufipogon. The founder population of the Asian cultivated rice species, O. sativa, was domesticated from O. rufipogon, and the population before O. sativa formed was affected by a bottleneck effect. Using this model, we assumed N_rufipogon = N_sativa = 125,000. For the time of the domestication event, we used several values (T_domestication = 7,500, 9,000, 10,000 and 12,000). The selfing rate of O. sativa was assumed to be 95% and that of O. rufipogon was the weighted average of 60% in our simulation26. The recombination rate was assumed to be 4 cM/Mb across the genome. Both selection and a bottleneck reduce the genetic diversity of O. sativa. To distinguish the two factors, we ran 10,000 simulation replicates under the two-population model with a bottleneck (as the neutral model). We tested whether the low nucleotide diversity observed in O. sativa could be explained by a population bottleneck alone, as a bottleneck would have reduced nucleotide diversity throughout the genome. We compared the simulations with the observed nucleotide diversities for O. rufipogon and O. sativa (π_rufipogon = 0.00883; π_sativa = 0.00043). Small probability values (P < 0.032) were obtained. Overall, our HKA tests and coalescent simulations supported artificial selection of the 9.3-kb region during rice domestication.
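The comparison between simulated and observed diversity can be illustrated with a short sketch of a one-sided empirical P value: the fraction of neutral (bottleneck-only) replicates whose π is at or below the observed value. The function name `empirical_p_value` and the list of simulated values are hypothetical placeholders; the coalescent model itself is not reproduced here, and whether P was computed in exactly this way in the study is an assumption.

```python
def empirical_p_value(simulated, observed):
    """One-sided empirical P: share of simulated values <= observed."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    hits = sum(1 for value in simulated if value <= observed)
    return hits / len(simulated)


# Placeholder pi values standing in for neutral simulation replicates
# (hypothetical; a real analysis would use thousands of replicates).
simulated_pi = [0.0002, 0.0011, 0.0019, 0.0025, 0.0031,
                0.0038, 0.0044, 0.0052, 0.0067, 0.0080]
observed_pi_sativa = 0.00043  # value reported for O. sativa in the text

p_value = empirical_p_value(simulated_pi, observed_pi_sativa)
```

A small P value under this definition means that few bottleneck-only replicates reach diversity as low as that observed, which is the logic behind attributing the reduction to selection rather than to the bottleneck alone.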
16. Thanh, P.T., Phan, P.D.T., Mori, N., Ishikawa, R. & Ishii, T. Development of backcross recombinant inbred lines between Oryza sativa Nipponbare and O. rufipogon and QTL detection on drought tolerance. Breed. Sci. 61, 76–79 (2011).
17. IRRI. Standard evaluation system for rice (International Rice Research Institute, Philippines, 2002).
18. McCouch, S.R. et al. Development and mapping of 2240 new SSR markers for rice (Oryza sativa L.). DNA Res. 9, 257–279 (2002).
19. Nelson, J.C. qGENE: Software for marker-based genomic analysis and breeding. Mol. Breed. 3, 239–245 (1997).
20. Vaughan, D.A., Balazs, E. & Heslop-Harrison, J.S. From crop domestication to super-domestication. Ann. Bot. 100, 893–901 (2007).
21. Librado, P. & Rozas, J. DnaSP v5: A software for comprehensive analysis of DNA polymorphism data. Bioinformatics 25, 1451–1452 (2009).
22. Londo, J.P., Chiang, Y.C., Hung, K.H., Chiang, T.Y. & Schaal, B.A. Phylogeography of Asian wild rice, Oryza rufipogon, reveals multiple independent domestications of cultivated rice, Oryza sativa. Proc. Natl. Acad. Sci. USA 103, 9578–9583 (2006).
23. Zhu, Q., Zheng, X., Luo, J., Gaut, B.S. & Ge, S. Multilocus analysis of nucleotide variation of Oryza sativa and its wild relatives: severe bottleneck during domestication of rice. Mol. Biol. Evol. 24, 875–888 (2007).
24. Wright, S.I. & Charlesworth, B. The HKA test revisited: a maximum-likelihood-ratio test of the standard neutral model. Genetics 168, 1071–1076 (2004).
25. Gao, L.Z. & Innan, H. Nonindependent domestication of the two rice subspecies, Oryza sativa ssp. indica and ssp. japonica, demonstrated by multilocus microsatellites. Genetics 179, 965–976 (2008).
26. Asano, K. et al. Artificial selection for a green revolution gene during japonica rice domestication. Proc. Natl. Acad. Sci. USA 108, 11034–11039 (2011).

doi:10.1038/ng.2567
