Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers

Virus discovery from high-throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association with disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have...

Bibliographic Details
Main Authors: Jens Friis-Nielsen, Kristín Rós Kjartansdóttir, Sarah Mollerup, Maria Asplund, Tobias Mourier, Randi Holm Jensen, Thomas Arn Hansen, Alba Rey-Iglesia, Stine Raith Richter, Ida Broman Nielsen, David E. Alquezar-Planas, Pernille V. S. Olsen, Lasse Vinner, Helena Fridholm, Lars Peter Nielsen, Eske Willerslev, Thomas Sicheritz-Pontén, Ole Lund, Anders Johannes Hansen, Jose M. G. Izarzugaza, Søren Brunak
Format: Article
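Language: English
Published: MDPI AG 2016-02-01
Series: Viruses
Subjects: sequence clustering; taxonomic characterisation; novel sequence identification; next generation sequencing; cancer causing viruses; oncoviruses; assay contamination
Online Access: http://www.mdpi.com/1999-4915/8/2/53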
collection DOAJ
description Virus discovery from high-throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association with disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species-independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated with biological, methodological or technical features with the aim of identifying novel pathogens or plausible contaminants that may be associated with a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified.
first_indexed 2024-12-22T15:35:57Z
format Article
id doaj.art-75912c2609a04407bf20a51d474a1a95
institution Directory Open Access Journal
issn 1999-4915
language English
last_indexed 2024-12-22T15:35:57Z
publishDate 2016-02-01
publisher MDPI AG
record_format Article
series Viruses
spelling doaj.art-75912c2609a04407bf20a51d474a1a95; 2022-12-21T18:21:15Z; eng; MDPI AG; Viruses; ISSN 1999-4915; 2016-02-01; 8(2), 53; doi:10.3390/v8020053
affiliation Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark: Jens Friis-Nielsen, Thomas Sicheritz-Pontén, Ole Lund, Jose M. G. Izarzugaza, Søren Brunak
affiliation Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, DK-1350 Copenhagen, Denmark: Kristín Rós Kjartansdóttir, Sarah Mollerup, Maria Asplund, Tobias Mourier, Randi Holm Jensen, Thomas Arn Hansen, Alba Rey-Iglesia, Stine Raith Richter, Ida Broman Nielsen, David E. Alquezar-Planas, Pernille V. S. Olsen, Lasse Vinner, Helena Fridholm, Eske Willerslev, Anders Johannes Hansen
affiliation Department of Autoimmunology and Biomarkers, Statens Serum Institut, DK-2300 Copenhagen S, Denmark: Lars Peter Nielsen
title Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers
topic sequence clustering
taxonomic characterisation
novel sequence identification
next generation sequencing
cancer causing viruses
oncoviruses
assay contamination
url http://www.mdpi.com/1999-4915/8/2/53