iPRESTO: Automated discovery of biosynthetic sub-clusters linked to specific natural product substructures.

Microbial specialised metabolism is full of valuable natural products that are applied clinically, agriculturally, and industrially. The genes that encode their biosynthesis are often physically clustered on the genome in biosynthetic gene clusters (BGCs). Many BGCs consist of multiple groups of co-...

Bibliographic Details
Main Authors: Joris J R Louwen, Satria A Kautsar, Sven van der Burg, Marnix H Medema, Justin J J van der Hooft
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-02-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1010462
collection DOAJ
description Microbial specialised metabolism is full of valuable natural products that are applied clinically, agriculturally, and industrially. The genes that encode their biosynthesis are often physically clustered on the genome in biosynthetic gene clusters (BGCs). Many BGCs consist of multiple groups of co-evolving genes called sub-clusters that are responsible for the biosynthesis of a specific chemical moiety in a natural product. Sub-clusters therefore provide an important link between the structures of a natural product and its BGC, which can be leveraged for predicting natural product structures from sequence, as well as for linking chemical structures and metabolomics-derived mass features to BGCs. While some initial computational methodologies have been devised for sub-cluster detection, current approaches are not scalable, have only been run on small and outdated datasets, or produce an impractically large number of possible sub-clusters to mine through. Here, we constructed a scalable method for unsupervised sub-cluster detection, called iPRESTO, based on topic modelling and statistical analysis of co-occurrence patterns of enzyme-coding protein families. iPRESTO was used to mine sub-clusters across 150,000 prokaryotic BGCs from antiSMASH-DB. After annotating a fraction of the resulting sub-cluster families, we could predict a substructure for 16% of the antiSMASH-DB BGCs. Additionally, our method was able to confirm 83% of the experimentally characterised sub-clusters in MIBiG reference BGCs. Based on iPRESTO-detected sub-clusters, we could correctly identify the BGCs for xenorhabdin and salbostatin biosynthesis (which had not yet been annotated in BGC databases), as well as propose a candidate BGC for akashin biosynthesis. Additionally, we show for a collection of 145 actinobacteria how substructures can aid in linking BGCs to molecules by correlating iPRESTO-detected sub-clusters to MS/MS-derived Mass2Motifs substructure patterns. This work paves the way for deeper functional and structural annotation of microbial BGCs by improved linking of orphan molecules to their cognate gene clusters, thus facilitating accelerated natural product discovery.
first_indexed 2024-04-10T06:03:42Z
id doaj.art-419ecef4d2274796a4183bc7e93ac229
institution Directory Open Access Journal
issn 1553-734X
1553-7358
last_indexed 2024-04-10T06:03:42Z
spelling doaj.art-419ecef4d2274796a4183bc7e93ac229; 2023-03-03T05:31:03Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2023-02-01; vol. 19, no. 2, e1010462; doi:10.1371/journal.pcbi.1010462