The human Nramp2 gene: characterization of the gene structure, alternative splicing, promoter region and polymorphisms.
Nramp2 is a gene encoding a transmembrane protein that is important in metal transport, in particular iron transport. Mutations in nramp2 have been shown to be associated with microcytic anemia in mk/mk mice and defective iron transport in Belgrade rats. Nramp2 contains a classical iron-responsive element in the 3' untranslated region that confers iron-dependent mRNA stabilization. In this report, we describe a splice variant form of human nramp2 that has the carboxyl-terminal 18 amino acids substituted with 25 novel amino acids and has a new 3' untranslated region lacking a classical iron-responsive element. This splice form of nramp2, nramp2 non-IRE, was found to be derived from splicing of an additional exon into the terminal coding exon. The nramp2 gene comprises 17 exons and spans more than 36 kb. It contains an additional 5' exon and intron (exon and intron 1) and an additional 3' exon (exon 17) and intron (intron 16) as compared to nramp1, a homologous gene. The additional exons and introns account for much of the difference in length between nramp2 (> 36 kb) and nramp1 (12 kb). The exon-intron borders of nramp2 exons 3-15 are homologous to those of nramp1 exons 2-14. The nramp2 5' regulatory region contains two CCAAT boxes but lacks a TATA box. The 5' regulatory region of nramp2 also contains five potential metal response elements (MREs) that are similar to the MREs found in the metallothionein-IIA gene, three potential SP1 binding sites and a single gamma-interferon regulatory element. Five single nucleotide mutations or polymorphisms were identified within the nramp2 gene. One of these, 1303C-->A, occurs in the coding region of nramp2 and results in an amino acid change from leucine to isoleucine. A polymorphism, 1254T/C, also occurs in the coding region of nramp2 but does not cause an amino acid change. The other 3 polymorphisms are within introns (IVS2 + 11A/G, IVS4 + 44C/A, and IVS6 + 538G/Gdel). 
In addition, a polymorphic microsatellite TATATCTATATATC (TA)6-7 (CA)10-11 CCCCCTATA (TATC)3 (TCTG)5 TCCG (TCTA)6 was identified in intron 3. Analysis of cDNA derived by direct amplification of reverse-transcribed RNA or cDNA clones isolated from a library provides evidence of skipping of exons 10 and 12 of nramp2. Deletion of either of these exons would result in a sequence that remains in frame yet would generate a protein lacking transmembrane-spanning region 7 or 8, respectively. The deletion of a single transmembrane domain would have severe topological consequences. The coding region of the nramp2 gene of hemochromatosis patients with or without mutations in the hemochromatosis gene, HFE, was examined and found to be normal. One hemochromatosis patient, with a normal HFE genotype, was heterozygous for the 1303C-->A mutation. Furthermore, in an examination of hemochromatosis patients with mutant HFE and normal HFE genes, we did not observe linkage disequilibrium of either group with a particular nramp2 haplotype. These data suggest that mutations in nramp2 are not commonly associated with hemochromatosis.
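The statement that skipping exon 10 or exon 12 leaves the sequence in frame rests on simple arithmetic: the downstream reading frame is preserved only when the total length of the skipped exons is a multiple of three. The sketch below illustrates that check. The function name and the exon lengths are hypothetical and chosen for illustration only; they are not the actual nramp2 exon lengths, which the abstract does not give.

```python
def skipping_keeps_frame(exon_lengths, skipped):
    """Return True if removing the exons at the given 0-based indices
    leaves the downstream reading frame unchanged."""
    removed = sum(exon_lengths[i] for i in skipped)
    return removed % 3 == 0


# Hypothetical exon lengths in nucleotides (NOT the real nramp2 values).
toy_exon_lengths = [120, 90, 77, 150]

in_frame = skipping_keeps_frame(toy_exon_lengths, [1])      # 90 nt removed
out_of_frame = skipping_keeps_frame(toy_exon_lengths, [2])  # 77 nt removed
print(in_frame, out_of_frame)
```

An in-frame skip removes a whole number of codons, so the protein loses a block of residues (here, one transmembrane region) without scrambling the sequence that follows.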
Lee PL, Gelbart T, West C, Halloran C, Beutler E
《BLOOD CELLS MOLECULES AND DISEASES》
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
By BLAST searching our cloned genes against the non-redundant (nr) database, we found that computer-annotated human genome coding regions in the public domain contain many errors of different kinds, including insertions, deletions or mutations of a single base pair or of a whole segment at the cDNA level, and various combinations of these errors. We used three primary means to validate and identify errors in the model genes of the NCBI GENOME ANNOTATION PROJECT REFSEQS: (1) Evaluating the degree of support from human EST clustering and from BLAST against the draft human genome. (2) Chromosomal mapping of our verified genes and analysis of their genomic organization. All exon/intron boundaries should be consistent with the GT/AG rule, and the consensus sequences surrounding the splice boundaries should be found as well. (3) Experimental verification of the in silico cloned genes by RT-PCR, followed by cDNA sequencing. We then used three further means as reference: (1) Web searching or in silico cloning of the genes in different species, especially the homologous mouse and rat genes, and thus judging the existence of the gene by ontology. (2) Using released genes in the public domain as a standard, which should be highly homologous to our verified genes, especially the released human genes in the NCBI GENOME ANNOTATION PROJECT REFSEQS, we tried to clone a complete gene highly homologous to each released gene according to the strategy developed in this paper. If we could not obtain it, our verified gene may be correct and the released gene in the public domain may be wrong. (3) To find more evidence, we verified our cloned genes by RT-PCR or hybridization techniques. Here we list some of the errors we found in the NCBI GENOME ANNOTATION PROJECT REFSEQs: (1) A base mistakenly inserted in the ORF, which causes a frame shift in the encoded amino acid sequence. 
In detail, a base in the ORF of a gene is a redundant insertion, which causes a reading frame shift and the translation of an alternative protein; for example, LOC124919 is a wrong form of C17orf32 (with mouse and rat orthologs determined by us). (2) Sequences joined by mistake (forced together). This is a wrong assembly of unrelated cDNA segments; for example, LOC147007 is a wrong form of C17orf32. (3) A base or a section of cDNA mistakenly inserted in the ORF, which causes it to terminate prematurely, so that the incomplete sequence encodes only the N-terminal amino acids. For example, LOC123722 is a wrong form of SPRYD1, and even the human hypothetical gene LOC126250 or PDCD5 is a wrong form of our PDCD5 (TFAR19). (4) An incomplete sequence encoding only the C-terminal amino acids. For example, human LOC149076 and mouse LOC230761 are wrong forms of our verified human ZNF362 and mouse Zfp362, respectively. (5) An incomplete sequence encoding only one section of the correct ORF, lacking both the N-terminal and C-terminal amino acid sequences, in which the first amino acid of the incomplete protein, which is not encoded by an initiation codon, is mistakenly predicted as the initiation codon, e.g. predicting L as M. For example, LOC200084 is a wrong form of ZNF362. (6) A base or a section of cDNA mistakenly inserted in the ORF, wrongly creating a termination codon before the insertion, so that the encoded protein lacks the first part of its amino acids. For example, GenBank Acc. No. AL096883 (LOCUS No. HS323M22B) is a wrong form of the experimentally verified human NM_012263, for which the mouse ortholog BC010510 has been determined. (7) Contaminating genomic sequence may be regarded as a complete gene cDNA sequence and predicted as a so-called single-exon gene, even as a real one, with only a small ORF in a very long single-exon mRNA, although there is an in-frame termination codon upstream of the ORF initiation codon and no other features meet the criteria for a gene. 
For example, LOC91126 is a wrong form of ZNF362. (8) The predicted gene consists only of an ORF that has no EST support at either end. Starting from this ORF, a complete gene cDNA can be obtained with double support from ESTs and the human genome (with in-frame termination codons upstream of the ORF), which indicates that the predicted ORF reference sequence may be incorrect. For example, LOC164395 may be a wrong form of the novel human gene bankit4590055. (9) A similar but smaller protein-coding gene is predicted within a region of the human genome sequence that has EST experimental support, so the other newly predicted gene may be incorrect. For example, LOC167563 may be a wrong form of CMYA5. However, these errors can be corrected or avoided by using our strategy. Here we give one example in detail: comparison of the SPRYD1 sequence with the human hypothetical gene LOC123722. The TAA bases at positions 478-480 of the LOC123722 cDNA are redundant, which causes a reading frame shift and the translation of an alternative protein. The redundant GTAAA of LOC123722 is not supported by our experimental clone, is almost fully rejected by human EST alignment, and is shown by genomic GT/AG organization analysis to belong to the next intron sequence. Verification of the cDNA or genomic DNA sequence of SPRYD1 implies that LOC123722 has a wrong stop codon within its ORF, introduced by the prediction program, and is therefore not a complete cds. 
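The effect described in the SPRYD1/LOC123722 comparison, where redundant bases in a predicted cDNA introduce a premature stop codon or shift the reading frame, can be illustrated with a minimal sketch. The sequences below are short toy strings invented for illustration, not the real SPRYD1 or LOC123722 sequences, and the helper names `first_inframe_stop` and `insert_bases` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(cdna, orf_start=0):
    """Return the 0-based index of the first stop codon that is in frame
    with orf_start, or None if no in-frame stop codon is found."""
    seq = cdna.upper()
    for i in range(orf_start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def insert_bases(cdna, position, extra):
    """Return a copy of cdna with extra bases inserted before position."""
    return cdna[:position] + extra + cdna[position:]


# Toy reference ORF: ATG GCT GGA AAA TGC TAA (stop codon at index 15).
reference = "ATGGCTGGAAAATGCTAA"

# A redundant in-frame TAA creates a premature stop codon at index 6.
with_extra_stop = insert_bases(reference, 6, "TAA")

# A redundant 5-base GTAAA shifts the frame, so the original stop is lost.
with_frameshift = insert_bases(reference, 6, "GTAAA")

print(first_inframe_stop(reference))        # 15
print(first_inframe_stop(with_extra_stop))  # 6
print(first_inframe_stop(with_frameshift))  # None
```

Comparing the position of the first in-frame stop codon between an experimentally verified cDNA and a model sequence is one simple way to flag a truncated prediction for closer inspection; it does not replace the EST and genomic evidence described in the abstract.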
To sum up, by combining bioinformatics analyses with experimental verification, and through BLAST of our cloned genes against the non-redundant database, we have found many errors of at least nine kinds in the NCBI GENOME ANNOTATION PROJECT REFSEQs, and our strategy is helpful in correcting them. Examples are LOC149076, LOC200084 and LOC91126 (all of which should be ZNF362 but are three different kinds of wrong forms of ZNF362), three model reference sequences predicted from NCBI contig NT_004511 by automated computational analysis using a gene prediction method, and LOC124919 and LOC147007 (both of which should be C17orf32 but are two different kinds of wrong forms of C17orf32), two model reference sequences predicted from NCBI contig NT_010808 by automated computational analysis using a gene prediction method. Therefore, the correct identification and annotation of novel human genes may still be a heavy task that can only be completed over a long period of time. Human genome coding regions annotated by computer should therefore be used with caution. Previously published articles did not clearly point out the existence of mistakes in the NCBI human gene model reference sequences. At the Seventh International Human Genome Conference held in April 2002, we first reported our results on this subject in the form of a poster.
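The GT/AG consistency check listed among the validation steps can be sketched as follows. This is a minimal illustration only, assuming exons are given as 0-based, half-open (start, end) coordinates on the sense strand of a genomic sequence. The function name, the coordinate convention and the toy sequence are assumptions made for this example and are not taken from the paper.

```python
def introns_follow_gt_ag(genomic, exons):
    """For each intron between consecutive exons, report whether it
    starts with GT and ends with AG (the GT/AG rule).

    exons: list of (start, end) pairs, 0-based and half-open, sorted
    by position on the sense strand.
    """
    seq = genomic.upper()
    results = []
    for (_, exon_end), (next_start, _) in zip(exons, exons[1:]):
        intron = seq[exon_end:next_start]
        results.append(intron.startswith("GT") and intron.endswith("AG"))
    return results


# Toy gene: exon 1 (6 nt), intron (12 nt, GT...AG), exon 2 (12 nt).
toy_genomic = "ATGGCT" + "GTAAGTTTTCAG" + "GGAAAATGCTAA"

good_model = [(0, 6), (18, 30)]  # exon 1 ends exactly at the intron start
bad_model = [(0, 8), (18, 30)]   # exon 1 wrongly extended into the intron

print(introns_follow_gt_ag(toy_genomic, good_model))  # [True]
print(introns_follow_gt_ag(toy_genomic, bad_model))   # [False]
```

The second model mimics the kind of error discussed for LOC123722, where bases that belong to the following intron are retained in the predicted cDNA, so the implied intron no longer begins with GT.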
Zhang DL, Ji L, Li YD