Orphan Genes Bioinformatics: Identification and properties of de novo created genes
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics.
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Even today, many genes are without any known homolog. These "orphans" are found in all species, from Viruses to Prokaryotes and Eukaryotes. For a portion of these genes, we might simply not have enough data to find homologs yet. Some of them are imported from taxonomically distant organisms via lateral transfer; others have homologs that have mutated beyond the point of recognition.

However, a sizeable fraction of orphan genes is unambiguously created via "de novo" mechanisms. The study of such novel genes can contribute to our understanding of the emergence of functional novelty and the adaptation of species to new ecological niches.

In this work, we first survey the field of orphan studies and illustrate some of the common issues. Next, we analyze some of the intrinsic properties of orphan proteins, including secondary structure elements and Intrinsic Structural Disorder; specifically, we observe that in young proteins the relationship between these properties and the G+C content of their coding sequence is stronger than in older proteins.

We then tackle some of the methodological problems often found in orphan studies. We find that using evolutionarily close species together with sensitive, state-of-the-art homology recognition methods is instrumental to identifying a set of orphans enriched in de novo created ones.

Finally, we compare how intrinsic disorder is distributed in bacteria versus eukaryota. Eukaryotic proteins are longer and more disordered; the difference is to be attributed primarily to eukaryotic-specific domains and linker regions. In these sections of the proteins, a higher frequency of the disorder-promoting amino acid Serine can be observed in Eukaryotes.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2017.
Keywords [en]
bioinformatics, de novo, orphans, evolutionary genetics
National Category
Biological Sciences
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-149168; ISBN: 978-91-7797-085-9 (print); ISBN: 978-91-7797-086-6 (electronic); OAI: oai:DiVA.org:su-149168; DiVA, id: diva2:1158369
Public defence
2018-01-12, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:00 (English)
Note

At the time of the doctoral defense, the following papers were unpublished and had the following status: Paper 3: Submitted. Paper 4: Manuscript.

Available from: 2017-12-20. Created: 2017-11-20. Last updated: 2017-12-20. Bibliographically approved
List of papers
1. Orphans and new gene origination, a structural and evolutionary perspective
2014 (English). In: Current Opinion in Structural Biology, ISSN 0959-440X, E-ISSN 1879-033X, Vol. 26, p. 73-83. Article in journal (Refereed), Published
Abstract [en]

The frequency of de novo creation of proteins has been debated. Early on, it was assumed that de novo creation should be extremely rare and that the vast majority of all protein-coding genes were created early in the history of life. However, the early genomics era led to the insight that protein-coding genes do appear to be lineage-specific. Today, with thousands of completely sequenced genomes, this impression remains. It has even been proposed that the creation of novel genes, a continuous process in which most de novo genes are short-lived, is as frequent as gene duplication. There are reports with strongly indicative evidence for de novo gene emergence in many organisms, ranging from Bacteria, where genes are sometimes generated through bacteriophages, to humans, where orphans appear to be overexpressed in brain and testis. In contrast, research on protein evolution indicates that many very distantly related proteins appear to share partial homology. Here, we discuss recent results on de novo gene emergence, as well as important technical challenges that limit our ability to get a definitive answer on the extent of de novo protein creation.

National Category
Biological Sciences
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-107638 (URN); 10.1016/j.sbi.2014.05.006 (DOI); 000340852000012 ()
Note

AuthorCount:3;

Available from: 2014-09-22. Created: 2014-09-22. Last updated: 2017-11-20. Bibliographically approved
2. High GC content causes orphan proteins to be intrinsically disordered
2017 (English). In: PLoS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 13, no 3, article id e1005375. Article in journal (Refereed), Published
Abstract [en]

De novo creation of protein-coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. These orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should, for instance, not aggregate. Therefore, although the creation of short ORFs could be truly random, their fixation should be subject to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila, young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge, no valid explanation for this difference has been proposed. To solve this riddle, we studied the structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in properties between proteins of different ages. However, when we take GC content into account, we find that it can explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead, these proteins largely resemble random proteins given a particular GC level. During evolution, the properties of a protein change faster than the GC level, causing the relationship between disorder and GC to gradually weaken.
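
As an illustration of the quantity this abstract relies on, the GC content of a coding sequence can be computed as the fraction of G and C bases among its unambiguous bases. The sketch below is only a minimal example and not the paper's pipeline; the function name gc_content and the choice to ignore ambiguous bases such as N are assumptions made here.

```python
def gc_content(cds: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence.

    Assumption for this sketch: ambiguous symbols (for example N) are skipped
    rather than counted in the denominator.
    """
    counted = [base for base in cds.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


if __name__ == "__main__":
    # Toy sequences, not real genes: an AT-rich and a GC-rich example.
    for name, seq in [("at_rich", "ATGAAATTTTAA"), ("gc_rich", "ATGGCGCCGTGA")]:
        print(name, round(gc_content(seq), 2))
```

Under the standard genetic code, GC-rich codons such as GCN, GGN, CCN and CGN encode alanine, glycine, proline and arginine, which is one way to see why the GC level of a sequence constrains the amino acid composition of its translation.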

National Category
Biological Sciences Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-142711 (URN); 10.1371/journal.pcbi.1005375 (DOI); 000398031900014 (); 28355220 (PubMedID)
Available from: 2017-05-05. Created: 2017-05-05. Last updated: 2018-01-13. Bibliographically approved
3. The classification of orphans is improved by combining searches in both proteomes and genomes
2017 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203. Article in journal (Refereed), Submitted
Abstract [en]

The detection of genes without homologs (“orphans”) in other species is important, as it provides a glimpse of the evolutionary processes that create novel genes. However, for an unbiased view of such de novo gene creation, the detection of these genes needs to be accurate. The estimation of conservation, and in general the age determination of any gene, depends on two factors: (i) a method to detect homologs in a genome and (ii) a set of related genomes. Here, we set out to investigate how the detection of orphans is influenced by these factors. We show that when using multiple genomes and six-frame translations of complete genomes, the number of orphans is significantly reduced compared with earlier studies. Given these premises, we obtain a strict set of 34 orphan Saccharomyces cerevisiae genes, and show that the number of orphans in Drosophila melanogaster and Drosophila pseudoobscura can be reduced to only 30 and 17, respectively.
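
A six-frame translation, as mentioned in this abstract, translates a nucleotide sequence in the three reading frames of the forward strand and the three of the reverse complement, so that homologs missing from the annotated proteome can still be found in the genome. The following is a minimal sketch under stated assumptions: it uses the standard genetic code, writes unknown codons as X, and the names translate and six_frame_translation are illustrative, not taken from the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict:
    """Map frame labels +1..+3 and -1..-3 (as ints) to translated strings."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence; stop codons are written as '*'.
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```

In practice the translated frames would be handed to a homology search tool; that step, and any splitting of frames at stop codons, is left out of this sketch.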

National Category
Biological Sciences
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-149079 (URN)
Available from: 2017-11-20. Created: 2017-11-20. Last updated: 2017-11-29. Bibliographically approved
4. Difference in disorder between eukaryotes and prokaryotes is largely due to Serine in linker regions
(English). Manuscript (preprint) (Other academic)
Abstract [en]

In this study, we ask which molecular properties make eukaryotic proteins more disordered than prokaryotic ones. First, we show that on average eukaryotic proteins contain more disorder-promoting amino acids. In particular, the fraction of Serine residues is close to 8% of all residues in eukaryotes and less than 6% in prokaryotes. Second, we show that domains unique to eukaryotes and linker regions in eukaryotes are both more disordered and more abundant than the corresponding regions in prokaryotic proteins. Serine is an important residue for post-translational modification and regulatory mechanisms. Therefore, we conclude that it is not unlikely that both the need for regulation in a complex eukaryotic cell and the increased number of longer multi-domain proteins contribute to the higher intrinsic structural disorder in eukaryotic proteins.
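
The Serine percentages quoted above are residue fractions taken over a whole set of proteins. A minimal sketch of that calculation follows; the function name residue_fraction and the toy input are assumptions for illustration, and pooling all residues across proteins (rather than averaging per protein) is a choice made here, not necessarily the one used in the manuscript.

```python
from collections import Counter


def residue_fraction(proteins, residue="S"):
    """Return the fraction of `residue` among all residues in `proteins`.

    `proteins` is an iterable of amino acid strings in one-letter code.
    Residues are pooled across proteins before the fraction is taken.
    """
    counts = Counter()
    for sequence in proteins:
        counts.update(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues to count")
    return counts[residue.upper()] / total


if __name__ == "__main__":
    # Toy sequences, not real proteomes.
    print(residue_fraction(["MSSAK", "MKLVS"]))
```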

National Category
Biological Sciences
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-149167 (URN)
Available from: 2017-11-20. Created: 2017-11-20. Last updated: 2017-11-20. Bibliographically approved

Author
Basile, Walter