OpenMS  2.7.0
AhoCorasickAmbiguous Class Reference

Extended Aho-Corasick algorithm capable of matching ambiguous amino acids in the searched text (i.e. proteins).

#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>

Public Types

typedef ::seqan::StringSet< ::seqan::AAString > PeptideDB
 
typedef ::seqan::Pattern< PeptideDB, ::seqan::FuzzyAC > FuzzyACPattern
 

Public Member Functions

 AhoCorasickAmbiguous ()
 Default Ctor; call setProtein() before using findNext().
 
 AhoCorasickAmbiguous (const String &protein_sequence)
 Prepare to start searching for hits in a new protein sequence.
 
void setProtein (const String &protein_sequence)
 Reset to new protein sequence. All previous data is forgotten.
 
bool findNext (const FuzzyACPattern &pattern)
 Enumerate hits.
 
Size getHitDBIndex ()
 Get index of hit into peptide database of the pattern.
 
Int getHitProteinPosition ()
 Offset into protein sequence where hit was found.
 

Static Public Member Functions

static void initPattern (const PeptideDB &pep_db, const int aaa_max, const int mm_max, FuzzyACPattern &pattern)
 Construct a trie from a set of peptide sequences (which are to be found in a protein).
 

Private Types

typedef FuzzyACPattern::KeyWordLengthType KeyWordLengthType
 

Private Attributes

::seqan::Finder< seqan::AAString > finder_
 locate the next peptide hit in protein
 
::seqan::AAString protein_
 the protein sequence - we need to store it since the finder only keeps a pointer to protein when constructed
 
::seqan::PatternAuxData< PeptideDB > dh_
 auxiliary data to hold a state after searching
 

Detailed Description

Extended Aho-Corasick algorithm capable of matching ambiguous amino acids in the searched text (i.e. proteins).

...

Features:
+ blazingly fast
+ low memory usage
+ number of allowed ambiguous amino acids can be capped by the user (default 3).

This implementation is based on the original Aho-Corasick (AC) implementation in SeqAn.
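
To make the matching rule concrete, the following is a minimal sketch of how a single peptide could be compared against a protein window when ambiguous residues and mismatches are capped. It is a naive position-by-position check, not the trie-based algorithm used by this class. The helper names (isAmbiguous, ambiguousMatch, matchesAt), the assumed ambiguity codes (B = D or N, Z = E or Q, X = any residue) and the greedy way the two budgets are spent are illustrative assumptions, not part of the OpenMS API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed ambiguity codes (standard IUPAC): B = D or N, Z = E or Q, X = any residue.
bool isAmbiguous(char c)
{
  return c == 'B' || c == 'Z' || c == 'X';
}

// True if the ambiguous protein residue `amb` can stand for the peptide residue `aa`.
bool ambiguousMatch(char amb, char aa)
{
  switch (amb)
  {
    case 'B': return aa == 'D' || aa == 'N';
    case 'Z': return aa == 'E' || aa == 'Q';
    case 'X': return true;
    default:  return false;
  }
}

// Naive check (hypothetical helper): does `peptide` match `protein` at offset `pos`
// while using at most `aaa_max` ambiguous residues and `mm_max` mismatches?
bool matchesAt(const std::string& protein, std::size_t pos, const std::string& peptide,
               int aaa_max, int mm_max)
{
  assert(aaa_max >= 0 && mm_max >= 0);
  if (pos > protein.size() || peptide.size() > protein.size() - pos) return false;

  int aaa_used = 0;
  int mm_used = 0;
  for (std::size_t i = 0; i < peptide.size(); ++i)
  {
    const char p = protein[pos + i];
    const char q = peptide[i];
    if (p == q) continue; // exact match costs nothing
    if (isAmbiguous(p) && ambiguousMatch(p, q) && aaa_used < aaa_max)
    {
      ++aaa_used; // resolved through an ambiguous residue
    }
    else if (mm_used < mm_max)
    {
      ++mm_used; // otherwise spend a mismatch, if any are left
    }
    else
    {
      return false; // both budgets exhausted
    }
  }
  return true;
}
```

With these assumptions, matchesAt("AAXKR", 1, "ADK", 1, 0) succeeds because the X is allowed to stand for D, while the same call with an ambiguity cap of 0 fails.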

Member Typedef Documentation

◆ FuzzyACPattern

typedef ::seqan::Pattern<PeptideDB, ::seqan::FuzzyAC> FuzzyACPattern

◆ KeyWordLengthType

typedef FuzzyACPattern::KeyWordLengthType KeyWordLengthType
private

◆ PeptideDB

typedef ::seqan::StringSet<::seqan::AAString> PeptideDB

Constructor & Destructor Documentation

◆ AhoCorasickAmbiguous() [1/2]

Default Ctor; call setProtein() before using findNext().

◆ AhoCorasickAmbiguous() [2/2]

AhoCorasickAmbiguous ( const String & protein_sequence )
inline

Prepare to start searching for hits in a new protein sequence.

This only sets the sequence. No computation is performed. Use findNext() to enumerate the hits.

Parameters
protein_sequence    Sequence (ambiguous characters allowed)

References AhoCorasickAmbiguous::setProtein().

Member Function Documentation

◆ findNext()

bool findNext ( const FuzzyACPattern & pattern )
inline

Enumerate hits.

Parameters
pattern    The pattern (i.e. trie) created with initPattern().
Returns
False if end of protein is reached. True if a hit is found.

References AhoCorasickAmbiguous::dh_, seqan::find(), and AhoCorasickAmbiguous::finder_.

Referenced by PeptideIndexing::addHits_().

◆ getHitDBIndex()

Size getHitDBIndex ( )
inline

Get index of hit into peptide database of the pattern.

Only valid if findNext() returned true before.

References AhoCorasickAmbiguous::dh_, and seqan::position().

Referenced by PeptideIndexing::addHits_().

◆ getHitProteinPosition()

Int getHitProteinPosition ( )
inline

Offset into protein sequence where hit was found.

Only valid if findNext() returned true before.

References AhoCorasickAmbiguous::finder_, and seqan::position().

Referenced by PeptideIndexing::addHits_().
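
The call protocol formed by findNext(), getHitDBIndex() and getHitProteinPosition() can be sketched with a small stand-in class. ToyMatcher and collectHits are hypothetical names introduced only for illustration: the stand-in does exact brute-force matching on std::string, takes the peptide list directly instead of a FuzzyACPattern, and returns std::size_t for both accessors. It only mirrors the shape of the loop a caller would write, not the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AhoCorasickAmbiguous: same call protocol, but exact
// brute-force matching and no SeqAn pattern object.
class ToyMatcher
{
public:
  explicit ToyMatcher(std::string protein) : protein_(std::move(protein)) {}

  // Advance to the next hit; returns false once the end of the protein is reached.
  bool findNext(const std::vector<std::string>& peptides)
  {
    has_hit_ = false;
    for (; pos_ < protein_.size(); ++pos_, next_pep_ = 0)
    {
      for (; next_pep_ < peptides.size(); ++next_pep_)
      {
        const std::string& pep = peptides[next_pep_];
        if (!pep.empty() && protein_.compare(pos_, pep.size(), pep) == 0)
        {
          hit_pep_ = next_pep_;
          hit_pos_ = pos_;
          ++next_pep_; // resume after this peptide on the next call
          has_hit_ = true;
          return true;
        }
      }
    }
    return false;
  }

  // Both accessors are only valid after findNext() returned true.
  std::size_t getHitDBIndex() const { assert(has_hit_); return hit_pep_; }
  std::size_t getHitProteinPosition() const { assert(has_hit_); return hit_pos_; }

private:
  std::string protein_;
  std::size_t pos_ = 0;      // current offset into the protein
  std::size_t next_pep_ = 0; // next peptide to test at pos_
  std::size_t hit_pep_ = 0;
  std::size_t hit_pos_ = 0;
  bool has_hit_ = false;
};

// Drain all hits as (peptide index, protein offset) pairs.
std::vector<std::pair<std::size_t, std::size_t>>
collectHits(const std::string& protein, const std::vector<std::string>& peptides)
{
  std::vector<std::pair<std::size_t, std::size_t>> hits;
  ToyMatcher matcher(protein);
  while (matcher.findNext(peptides))
  {
    hits.emplace_back(matcher.getHitDBIndex(), matcher.getHitProteinPosition());
  }
  return hits;
}
```

The important part is the while loop in collectHits: the accessors are read only inside the loop body, that is, only after findNext() has reported a hit.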

◆ initPattern()

static void initPattern ( const PeptideDB & pep_db,
const int  aaa_max,
const int  mm_max,
FuzzyACPattern & pattern 
)
inline static

Construct a trie from a set of peptide sequences (which are to be found in a protein).

Peptides must not contain ambiguous characters or unknown characters (such as J or U); an exception is thrown otherwise. Ambiguous characters are only allowed in protein sequences.

Usage: Build the pattern only once and use it multiple times when running findNext().

Parameters
pep_db     Set of peptides
aaa_max    Maximum allowed ambiguous characters in the matching protein sequence
mm_max     Maximum allowed mismatches in the matching protein sequence
pattern    The pattern to be created
Exceptions
Exception::InvalidValue    if a peptide contains an unknown (U,J,...) or ambiguous character

Referenced by PeptideIndexing::run().
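
The precondition on the peptide set can be illustrated with a small pre-check. This is a sketch under stated assumptions: firstInvalidPeptide and validatePeptides are hypothetical helpers, the set of accepted characters is taken to be the 20 standard residues, and std::invalid_argument stands in for Exception::InvalidValue. The check that initPattern() itself performs may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed set of accepted peptide characters: the 20 standard amino acids.
// Ambiguous (B, Z, X) and unknown (e.g. J, U) characters are rejected.
static const std::string kStandardResidues = "ACDEFGHIKLMNPQRSTVWY";

// Index of the first peptide containing a rejected character,
// or pep_db.size() if every peptide is clean.
std::size_t firstInvalidPeptide(const std::vector<std::string>& pep_db)
{
  for (std::size_t i = 0; i < pep_db.size(); ++i)
  {
    if (pep_db[i].find_first_not_of(kStandardResidues) != std::string::npos)
    {
      return i;
    }
  }
  return pep_db.size();
}

// Throws std::invalid_argument (stand-in for Exception::InvalidValue)
// if any peptide contains an ambiguous or unknown character.
void validatePeptides(const std::vector<std::string>& pep_db)
{
  const std::size_t idx = firstInvalidPeptide(pep_db);
  if (idx == pep_db.size()) return; // nothing to report

  const std::string& pep = pep_db[idx];
  const std::size_t bad = pep.find_first_not_of(kStandardResidues);
  assert(bad != std::string::npos); // guaranteed by firstInvalidPeptide()
  throw std::invalid_argument("peptide '" + pep + "' contains invalid character '" +
                              pep[bad] + "'");
}
```

Running such a check once, before building the pattern, keeps the error close to the offending input; the pattern itself is then built a single time and reused for every protein.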

◆ setProtein()

void setProtein ( const String & protein_sequence )
inline

Reset to new protein sequence. All previous data is forgotten.

Member Data Documentation

◆ dh_

::seqan::PatternAuxData<PeptideDB> dh_
private

auxiliary data to hold a state after searching

Referenced by AhoCorasickAmbiguous::findNext(), AhoCorasickAmbiguous::getHitDBIndex(), and AhoCorasickAmbiguous::setProtein().

◆ finder_

::seqan::Finder<seqan::AAString> finder_
private

locate the next peptide hit in protein

◆ protein_

::seqan::AAString protein_
private

the protein sequence - we need to store it since the finder only keeps a pointer to protein when constructed

Referenced by AhoCorasickAmbiguous::setProtein().