Abstract
Crohn's disease1, 2 and ulcerative colitis, the two main types of chronic inflammatory bowel disease, are multifactorial conditions of unknown aetiology. A susceptibility locus for Crohn's disease has been mapped3 to chromosome 16. Here we have used a positional-cloning strategy, based on linkage analysis followed by linkage disequilibrium mapping, to identify three independent associations for Crohn's disease: a frameshift variant and two missense variants of NOD2, encoding a member of the Apaf-1/Ced-4 superfamily of apoptosis regulators that is expressed in monocytes. These NOD2 variants alter the structure of either the leucine-rich repeat domain of the protein or the adjacent region. NOD2 activates nuclear factor NF-kB; this activating function is regulated by the carboxy-terminal leucine-rich repeat domain, which has an inhibitory role and also acts as an intracellular receptor for components of microbial pathogens. These observations suggest that the NOD2 gene product confers susceptibility to Crohn's disease by altering the recognition of these components and/or by over-activating NF-kB in monocytes, thus documenting a molecular model for the pathogenic mechanism of Crohn's disease that can now be further investigated.
Crohn's disease (CD; MIM 266600) occurs primarily in young adults, with an estimated prevalence of 1 in 1,000 in western countries1. Its incidence has increased markedly over the past half century, arguing for the involvement of recent, unidentified, environmental factors2. Familial aggregation of the disease suggests that genetic factors may also be involved—an hypothesis that was substantiated in 1996 by the discovery of a susceptibility locus for CD, IBD1, on chromosome 16 (ref. 3). Identification of the exact nature of the genetic changes that are implicated in CD susceptibility would provide a specific approach to understanding this common disorder.
Because candidate genes previously localized on chromosome 16 failed to show an association with CD (refs 4, 5), we refined the localization of the IBD1 susceptibility locus by typing 26 microsatellite markers spaced at an average distance of 1 cM in the pericentromeric region of chromosome 16. Model-free linkage analyses performed on 77 multiplex families indicated a probability of 0.7 that the susceptibility locus lies between D16S541 and D16S2623 (Fig. 1a). We constructed bacterial artificial chromosome (BAC) contigs spanning this region (Fig. 1b), which supported linkage disequilibrium mapping6. The transmission disequilibrium test7, performed on a single trio from each of 108 (77 multiplex and 31 simplex) families, showed a borderline significant association (P < 0.05) between the disease phenotype and the 207-base-pair (bp) allele of D16S3136. This observation was replicated with another set of 76 families, although with a different allele (the 205-bp allele; P < 0.01). These two observations may be due to type I errors. Alternatively, they may reflect true association in two sets of families drawn from genetically different populations.
Figure 1: Strategy used to identify the IBD1 locus.

a, Profile for the multipoint non-parametric linkage (NPL) scores. Approximate cytogenetic localizations are shown for selected microsatellite markers used in the study3 that localized IBD1 to the pericentromeric region of chromosome 16. Subsequent linkage analyses were focused on the region between SPN and D16S408. The highest NPL score (maximum NPL score, 3.49; P = 2.37 × 10^-4) was in the region between markers D16S541 and D16S2623 (ref. 6). b, Physical map of the IBD1 region6. White and black boxes correspond to the two BAC contigs and BAC clone hb87b10, respectively. Five yeast artificial chromosomes (YACs) bridge a gap of 100 kb between these two contigs. The position on the physical map of the microsatellite markers used in the linkage analysis is indicated. Distance between D16S541 and D16S2623 is 2 Mb. c, Representation of the sequenced region containing the IBD1 candidate gene. Unigene clusters and 11 exons of the IBD1 candidate gene are indicated by white and black boxes, respectively. Bold horizontal arrow indicates direction of transcription. Positions of SNP 1–13, D16S3035 and D16S3136, which were typed in 235 CD families for linkage disequilibrium studies, are shown.
The latter hypothesis led to the following strategy: a 164-kb BAC clone (hb87b10) from the CEPH-BAC library containing D16S3136 was sequenced completely (EMBL accession number AJ303140). A public database search extended the sequence of the corresponding region to 260 kb but did not identify characterized genes, with the exception of KIAA0849, which codes for a ubiquitin C-terminal hydrolase homologue in Caenorhabditis elegans. However, analysis by GRAIL and an expressed sequence tag (EST) homology search identified many putatively transcribed regions (Fig. 1c).
Eleven single-nucleotide polymorphisms (SNP 1–11) selected from these regions were genotyped, together with microsatellite markers D16S3035 and D16S3136, in a total of 235 available CD families (Table 1). Strong linkage disequilibrium was observed among most markers (data not shown). Several SNPs showed significant association with CD by the pedigree disequilibrium test8 (PDT), confirming that the disease locus is in linkage disequilibrium with markers across the investigated region (especially SNP 2, nominal P value 0.00002; Table 1).
These observations prompted the characterization of neighbouring Unigene clusters (Fig. 1c). Eleven overlapping clones, isolated from a human leukocyte complementary DNA library, extended Unigene cluster hs135201 and identified 11 exons of a single gene. The previously identified SNP 5–8 were contained in exon 3 of this gene and shown to be non-synonymous variants. To find additional disease-related variants, all exons of this gene were sequenced in 50 unrelated CD patients—each a member of an affected sibling pair identical by descent for both chromosome 16 homologous regions. Two additional non-synonymous SNPs (SNP 12 and 13), with rare-allele frequencies greater than 0.03, were identified and subsequently used to type the 235 CD families.
The PDT was most significant for SNP 13 (P = 6 × 10^-6). Families were divided into two groups: those with at least one member carrying the rare allele of SNP 13 and those without this allele. The latter group of families failed to show association between CD and SNP 4–6, and showed considerable decrease in the significance of the SNP 2 association. This result indicates that the associations of these four loci with CD were not independent of SNP 13 (Table 1). In contrast, significance of the CD associations with SNP 8 and 12 decreased modestly in these families, indicating a minimal contribution of the rare SNP 13 allele to these associations.
The eight intragenic SNPs that were initially identified defined 41 different haplotypes. Three of these showed preferential transmission to affected individuals (Table 2). Each of these three haplotypes carries one rare allele of SNP 8, 12 or 13 on a common background. Notably, the haplotype defined by the same background and by the absence of these rare alleles did not show such transmission distortion (Table 2). Furthermore, the rare alleles of SNP 8, 12 and 13 were never found on the same haplotype, indicating independent association of CD susceptibility with each of three non-synonymous variants of the same gene.
As a result of these associations, the allele frequencies of SNP 8, 12 and 13 differed in the group of CD patients as compared with controls (Table 3). Average risks for CD, computed for genotypes containing zero, one or two variants (Table 3), revealed a gene-dosage effect. The major increase in risk associated with two variant alleles confirms the recessive nature of CD susceptibility, which was suggested by previous segregation analysis9, 10, 11 and linkage studies3. It may also help to explain the unusual precision of the affected sibling-pair analysis in mapping the susceptibility gene.
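A gene-dosage effect of this kind is often summarized by comparing each genotype class with the class carrying no variant allele. The sketch below computes odds ratios relative to the zero-variant class. It is an illustration only: the function name and all counts are hypothetical placeholders, not the values of Table 3, and the risk estimates reported here were not necessarily obtained in this way.

```python
def genotype_odds_ratios(cases, controls, reference=0):
    """Odds ratio of each genotype class relative to a reference class.

    cases, controls: dicts mapping the number of variant alleles carried
    (0, 1 or 2) to the count of individuals in that class.
    """
    if cases[reference] == 0 or controls[reference] == 0:
        raise ValueError("reference class must be non-empty in both groups")
    reference_odds = cases[reference] / controls[reference]
    ratios = {}
    for n_variants in cases:
        if controls[n_variants] == 0:
            raise ValueError("cannot form odds with zero controls in a class")
        odds = cases[n_variants] / controls[n_variants]
        ratios[n_variants] = odds / reference_odds
    return ratios


# Hypothetical counts chosen only to illustrate the arithmetic.
example_cases = {0: 200, 1: 100, 2: 50}
example_controls = {0: 80, 1: 20, 2: 1}
example_ratios = genotype_odds_ratios(example_cases, example_controls)
print(example_ratios)
```

A much larger ratio for carriers of two variants than for carriers of one is the pattern expected under a largely recessive model.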
The rare allele of SNP 8 was associated positively with the 205-bp allele of D16S3136 and negatively with the 207-bp allele. The inverse association was noted for the rare alleles of SNP 12 and 13, thus providing a rationale for the initial observations made with this microsatellite marker (data not shown). Genotype frequencies were comparable in CD patients originating from uniquely and multiply affected kindred—an observation compatible with the close clinical similarity of the sporadic and familial diseases12. The observed linkage of CD to chromosome 16 could not be entirely explained by the present associations, because GeneHunter analysis of 85 multiplex families without SNP 8, 12 and 13 revealed a component of linkage (nonparametric lod score (NPL) 1.6, pointwise significance P < 0.02). Thus, other variants of this gene or additional genes on chromosome 16 may be involved in CD susceptibility.
Genotyping of 167 patients with ulcerative colitis revealed genotype frequencies comparable to those of controls, indicating that these SNPs were not associated with susceptibility to ulcerative colitis—an observation in agreement with its lack of linkage to the IBD1 locus13.
The candidate IBD1 gene is highly expressed in leukocytes, but shows low or no expression in the other investigated tissues, including normal colon and small intestine (results not shown). It encodes a 1,013-amino-acid protein that is identical to NOD2, a member of the CED4/APAF1 superfamily of apoptosis regulators14. From its amino terminus to its carboxy terminus, NOD2 is composed of two caspase recruitment domains (CARD), a nucleotide-binding domain (NBD) and a leucine-rich repeat (LRR) region (Fig. 2b). The LRR domain of NOD2 has binding activity for bacterial lipopolysaccharides15 (LPS) and its deletion stimulates the NF-kB pathway16, 17, 18.
Figure 2: Representation of the IBD1/NOD2 protein variants.

The translation product deduced from the cDNA sequence of the candidate IBD1 gene is identical to that of NOD2 (ref. 14). The polypeptide contains two caspase recruitment domains (CARD), a nucleotide-binding domain (NBD) and ten 27-amino-acid, leucine-rich repeats (LRRs). Black circle indicates the consensus sequence of the ATP/GTP-binding site motif A (P-loop) of the NBD. The sequence changes encoded by the three main variants associated with CD are SNP 8 (R675W), SNP 12 (G881R) and SNP 13 (980 frameshift). This frameshift changes a leucine to a proline at position 980, and is immediately followed by a stop codon. SNP 5 is described in Table 1. The allele frequencies of the V928I polymorphism were not significantly different (0.92:0.08) in the three groups, and the corresponding genotypes were in Hardy–Weinberg equilibrium. The positions of the rarer missense variants, observed in 457 CD patients, 159 ulcerative colitis patients and 103 unaffected unrelated individuals, are indicated for these groups. Left scale indicates the number of each identified variant in the investigated groups; right scale measures the mutation frequency.
The rare allele of SNP 13 corresponds to a 1-bp insertion in exon 10 (980fs) predicted to truncate NOD2 in the LRR region. Those of SNP 8 and 12 cause non-conservative substitutions in the region proximal to the LRR domain (R675W) and in the LRR domain itself (G881R), respectively (Fig. 2a). Systematic sequencing of the coding sequence of NOD2 revealed additional very rare missense variants, which together were observed in 5% of controls and 4% of patients with ulcerative colitis. This percentage rose to 17% for CD patients, where the most frequent variants tended to cluster in the LRR and its adjacent regions (Fig. 2b). This excess suggests that, in addition to SNP 8, 12 and 13, more variants in this part of the NOD2 protein may be associated with CD susceptibility. Thus, the LRR domain of CD-associated variants is likely to be impaired, possibly to various degrees, in its recognition of microbial components and/or in the physiological inhibition of NOD2 dimerization, resulting in the inappropriate activation of NF-kB in monocytes.
Much evidence supports bacteria-induced NF-kB disregulation in CD. First, susceptibility to spontaneous inflammatory bowel disease (IBD) in mice has been associated with mutations in Toll-like receptor 4 (TLR4)—a member of a family of NF-kB activators that is known to bind LPS through its LRR domain19, 20. Second, antibiotic therapy causes transient improvement of CD patients, supporting the hypothesis that enteric bacteria may have an aetiological role in CD21. Third, NF-kB has a pivotal role in IBD and is activated in mononuclear cells of the intestinal lamina propria in CD22. Last, CD treatment is based on the use of sulphasalazine and glucocorticoids—two known NF-kB inhibitors23, 24.
Genetic susceptibility to CD is not limited to chromosome 16 and at least five additional loci have been implicated25, 26, 27, 28, 29. The recognition of a transduction pathway that, when disregulated, contributes to the pathogenesis of CD will accelerate the discovery of additional susceptibility genes. It will also contribute to the identification of associated environmental factors and focus the search for specific therapies.
Methods
Families, microsatellite markers and contig construction
A total of 235 CD families (117 simplex nuclear families, 96 multiplex nuclear families, and 22 extended pedigrees, corresponding to a total of 179 CD patients and 261 unaffected relatives) was progressively recruited according to published diagnostic criteria30. In addition, 100 multiplex and 59 simplex ulcerative colitis families were recruited from the same hospitals. Written informed consent was obtained from all participants. All relatives from 77 multiplex families were typed for 26 mapped microsatellite markers with an average resolution of 1 cM between SPN and D16S408. We constructed contigs using seven previously localized sequence tag sites (STSs; D16S541, D16S3035, D16S3136, D16S3117, D16S770, D16S416, D16S2623) and subsequently eight additional ones (wi-9288, wi-16305, shgc-17274, sgc-31023, sgc-32374, stSG-30035, wi-5812, D16S766) and 79 new STSs derived from the end sequences of the isolated BAC clones6.
Clones, sequencing and SNPs
The DNA of BAC clone hb87b10 containing D16S3136 was fragmented by sonication and subcloned in bacteriophage M13. We used sequences from both strands of 706 subclones and from direct primer walking to reconstruct the initial BAC sequence using PhredPhrap (http://www.phrap.org). Identity search in DNA databases identified two overlapping sequenced BACs (AC007334, GenBank; AC007728, GenBank). Homology search, performed on the extended sequence with BLAST v1.4 in GenBank release 114, identified 10 Unigene clusters. The following EST clones corresponding to some of these clusters were obtained from the American Type Culture Collection (http://www.atcc.org) and sequenced completely to identify additional transcribed regions: AI125217, AA417810, AI375427, AA021341, AI090427, AA910520, AA731089. Clones AI090427 and AA910520, corresponding to hs135201, were used to screen a blood leukocyte cDNA library (no. 938202; Stratagene), and retrieved 11 clones of the IBD1 candidate gene.
A total of 123 STSs, mostly selected from putatively transcribed sequences (EST homologies and GRAIL v2.0 predicted exons), including 11 exons of KIAA0849, was sequenced following amplification from the DNA of ten CD patients and two unaffected individuals. Of 35 identified SNPs, SNP 1–11, selected for their rare-allele frequencies greater than 0.08, were typed on 1,272 members of the 235 CD families. SNP 12 and 13 were further identified by sequencing the 11 exons of the candidate IBD1 gene in 50 CD patients (SNP 1–13, GenBank accession numbers G67943–G67955, are submitted to dbSNP of the National Center for Biotechnology Information) and typed on the same group of individuals. To search for rare variant alleles, we subsequently investigated the 11 exons of 457 CD patients, 159 ulcerative colitis patients and 103 unaffected unrelated individuals. All variant alleles were confirmed by sequencing a second independent amplification product.
Data analysis
Genotypic data were analysed for linkage using the NPL score of GeneHunter v2.0. Data from linkage disequilibrium mapping of CD were analysed initially with the transmission disequilibrium test7 using a single trio (one affected individual and both parents) per family. Subsequently, the pedigree disequilibrium test was performed using the PDT 2.11 program8 to analyse data from all family relatives. We estimated allele frequencies for three groups: 418 unrelated CD patients, 159 ulcerative colitis patients and 103 controls (including 78 unaffected, unrelated spouses of CD patients and 25 unrelated CEPH family members).
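For readers unfamiliar with the transmission disequilibrium test, its standard biallelic form can be sketched as follows. Among heterozygous parents, the number of transmissions of the allele of interest to the affected child is compared with the number of non-transmissions, using a McNemar-type chi-square statistic with one degree of freedom. This is a minimal sketch of the textbook statistic, not the software used in this study; the function name and the counts are hypothetical.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Biallelic TDT: chi-square (1 d.f.) and its P value.

    transmitted:   heterozygous parents who passed the allele of
                   interest to the affected child.
    untransmitted: heterozygous parents who passed the other allele.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # Upper tail of a chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
example_chi2, example_p = tdt_statistic(30, 15)
print(example_chi2, example_p)
```

Using a single trio per family, as described above, keeps the transmissions independent. The pedigree disequilibrium test extends the same idea to all informative relatives within a family.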

