Detects features in MS1 data based on peptide identifications.

pot. predecessor tools: PeakPickerHiRes (optional), IDFilter
→ FeatureFinderIdentification →
pot. successor tools: ProteinQuantifier

Reference:
Weisser & Choudhary: Targeted Feature Detection for Data-Dependent Shotgun Proteomics (J. Proteome Res., 2017, PMID: 28673088).

This tool detects quantitative features in MS1 data based on information from peptide identifications (derived from MS2 spectra). It uses algorithms for targeted data analysis from the OpenSWATH pipeline.

The aim is to detect features that enable the quantification of (ideally) all peptides in the identification input. This is based on the following principle: When a high-confidence identification (ID) of a peptide was made based on an MS2 spectrum from a certain (precursor) position in the LC-MS map, this indicates that the particular peptide is present at that position, so a feature for it should be detectable there.

Note: It is important that only high-confidence (i.e. reliable) peptide identifications are used as input!

Targeted data analysis on the MS1 level uses OpenSWATH algorithms and follows roughly the steps outlined below.

Use of inferred ("external") IDs

The situation becomes more complicated when several LC-MS/MS runs from related samples of a label-free experiment are considered. In order to quantify a larger fraction of the peptides/proteins in the samples, it is desirable to infer peptide identifications across runs. Ideally, all peptides identified in any of the runs should be quantified in each and every run. However, for feature detection of inferred ("external") IDs, the following problems arise: First, retention times may be shifted between the run being quantified and the run that gave rise to the ID. Such shifts can be corrected (see MapAlignerIdentification), but only to an extent. Thus, the RT location of the inferred ID may not necessarily lie within the RT range of the correct feature. Second, since the peptide in question was not directly identified in the run being quantified, it may not actually be present in detectable amounts in that sample, e.g. due to differential regulation of the corresponding protein. There is thus a risk of introducing false-positive features.

FeatureFinderIdentification deals with these challenges by explicitly distinguishing between internal IDs (derived from the LC-MS/MS run being quantified) and external IDs (inferred from related runs). Features derived from internal IDs give rise to a training dataset for an SVM classifier. The SVM is then used to predict which feature candidates derived from external IDs are most likely to be correct. See steps 4 and 5 below for more details.

1. Assay generation

Feature detection is based on assays for identified peptides, each of which incorporates the retention time (RT), mass-to-charge ratio (m/z), and isotopic distribution (derived from the sequence) of a peptide. Peptides with different modifications are considered different peptides. One assay will be generated for every combination of (modified) peptide sequence, charge state, and RT region that has been identified. The RT regions arise by pooling all identifications of the same peptide, considering a window of size extract:rt_window around every RT location that gave rise to an ID, and then merging overlapping windows.
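The pooling and merging of RT windows described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `merge_rt_windows` is hypothetical, and it assumes that each window is centred on the RT of the identification and that touching windows are merged.

```python
def merge_rt_windows(id_rts, rt_window):
    """Pool the ID retention times (in seconds) of one peptide into RT regions."""
    half = rt_window / 2.0  # assumption: window is centred on the ID
    # One window of size rt_window around every RT location that gave rise to an ID.
    windows = sorted((rt - half, rt + half) for rt in id_rts)
    regions = []
    for start, end in windows:
        if regions and start <= regions[-1][1]:
            # Overlaps (or touches) the previous region: extend that region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# IDs of one peptide at 100 s, 130 s and 400 s, with a 60 s window.
regions = merge_rt_windows([100.0, 130.0, 400.0], rt_window=60.0)
print(regions)  # [(70.0, 160.0), (370.0, 430.0)]
```

In this example the first two identifications fall into one RT region and the third forms its own, so two assays would result for this peptide and charge state.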

2. Ion chromatogram extraction

Ion chromatograms (XICs) are extracted from the LC-MS data (parameter in). One XIC per isotope in an assay is generated, with the corresponding m/z value and RT range (variable, depending on the RT region of the assay).

See also: OpenSwathChromatogramExtractor
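To make the per-isotope extraction concrete, the sketch below computes one m/z window per isotope of an assay. It is an illustration under stated assumptions: the function name is hypothetical, isotope spacing is approximated by the 13C-12C mass difference divided by the charge, and the window given by extract:mz_window is assumed to be a total width centred on the isotope's m/z. The unit rule (ppm if 1 or greater, else Da/Th) follows the parameter description.

```python
C13_C12_DIFF = 1.0033548  # mass difference between 13C and 12C in Da


def isotope_mz_windows(mono_mz, charge, n_isotopes=2, mz_window=10.0):
    """Return one (lower, upper) m/z window per isotope of a peptide assay."""
    windows = []
    for k in range(n_isotopes):
        # Assumption: isotope k is shifted by k times the 13C-12C spacing.
        mz = mono_mz + k * C13_C12_DIFF / charge
        if mz_window >= 1.0:
            width = mz * mz_window * 1e-6  # value is interpreted as ppm
        else:
            width = mz_window  # value is interpreted as Da/Th
        # Assumption: the window is centred on the isotope's m/z.
        windows.append((mz - width / 2.0, mz + width / 2.0))
    return windows


windows = isotope_mz_windows(500.0, charge=2)
for lower, upper in windows:
    print(f"{lower:.4f} - {upper:.4f}")
```

With the default of 10 ppm, the window around m/z 500 is 0.005 Th wide in total.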

3. Feature detection

Next, feature candidates (typically several per assay) are detected in the XICs and scored. A variety of scores for different quality aspects are calculated by OpenSWATH.

See also: OpenSwathAnalyzer

4. Feature classification

Feature candidates derived from assays with "internal" IDs are classed as "negative" (candidates without matching internal IDs), "positive" (the single best candidate per assay with matching internal IDs), or "ambiguous" (other candidates with matching internal IDs). If "external" IDs were given as input, features based on them are initially classed as "unknown". In that case, a support vector machine (SVM) is also trained on the "positive" and "negative" candidates to distinguish between the two classes based on the different OpenSWATH quality scores (plus an RT deviation score). After parameter optimization by cross-validation, the resulting SVM is used to predict the probability that each "unknown" feature candidate is a positive.
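The labelling of candidates for a single assay can be sketched as below. This is a simplified illustration, not the tool's code: the function names and the dictionary keys (`rt_start`, `rt_end`, `score`) are hypothetical, and the generic `score` used to pick the "best" candidate is an assumption, since the criterion is not spelled out here. The tolerance rule follows the description of detect:mapping_tolerance (absolute seconds if 1 or greater, else relative to the RT span of the feature).

```python
def mapping_tolerance_sec(tolerance, rt_start, rt_end):
    """Convert a mapping tolerance value into seconds for one candidate."""
    if tolerance >= 1.0:
        return tolerance  # absolute value in seconds
    return tolerance * (rt_end - rt_start)  # relative to the feature's RT span


def classify_candidates(candidates, internal_id_rts, tolerance=0.0):
    """Label the feature candidates of one assay."""
    if not internal_id_rts:
        # Assumption: an assay without internal IDs is based on external IDs only.
        return ["unknown"] * len(candidates)
    labels, matching = [], []
    for index, cand in enumerate(candidates):
        tol = mapping_tolerance_sec(tolerance, cand["rt_start"], cand["rt_end"])
        hit = any(cand["rt_start"] - tol <= rt <= cand["rt_end"] + tol
                  for rt in internal_id_rts)
        labels.append("ambiguous" if hit else "negative")
        if hit:
            matching.append(index)
    if matching:
        # Assumption: "best" means the highest value of a generic score.
        best = max(matching, key=lambda i: candidates[i]["score"])
        labels[best] = "positive"
    return labels


candidates = [
    {"rt_start": 90.0, "rt_end": 110.0, "score": 0.4},
    {"rt_start": 95.0, "rt_end": 120.0, "score": 0.9},
    {"rt_start": 300.0, "rt_end": 320.0, "score": 0.7},
]
labels = classify_candidates(candidates, internal_id_rts=[100.0])
print(labels)  # ['ambiguous', 'positive', 'negative']
```

Only the "positive" and "negative" candidates would then serve as SVM training data; the "ambiguous" ones are left out because their status is unclear.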

5. Feature filtering

Feature candidates are filtered so that at most one feature per peptide and charge state remains. For assays with internal IDs, only candidates previously classed as "positive" are kept. For assays based solely on external IDs, the feature candidate with the highest SVM probability is selected and kept (possibly subject to the svm:min_prob threshold).
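The filtering rule can be illustrated with a short sketch. The function name and dictionary keys are hypothetical, and two simplifications are assumed: each combination of peptide and charge state has a single assay, and a "positive" candidate takes precedence over any "unknown" candidate for the same combination.

```python
def filter_candidates(candidates, min_prob=0.0):
    """Keep at most one feature candidate per (peptide, charge) pair."""
    kept = {}
    for cand in candidates:
        key = (cand["peptide"], cand["charge"])
        if cand["label"] == "positive":
            # Assay with internal IDs: only the positive candidate is kept.
            kept[key] = cand
        elif cand["label"] == "unknown" and cand["svm_prob"] >= min_prob:
            # Assay with external IDs only: keep the highest SVM probability.
            current = kept.get(key)
            if current is None or (current["label"] == "unknown"
                                   and cand["svm_prob"] > current["svm_prob"]):
                kept[key] = cand
    return kept


candidates = [
    {"peptide": "PEPA", "charge": 2, "label": "positive"},
    {"peptide": "PEPA", "charge": 2, "label": "ambiguous"},
    {"peptide": "PEPB", "charge": 2, "label": "unknown", "svm_prob": 0.3},
    {"peptide": "PEPB", "charge": 2, "label": "unknown", "svm_prob": 0.8},
    {"peptide": "PEPC", "charge": 3, "label": "unknown", "svm_prob": 0.1},
]
kept = filter_candidates(candidates, min_prob=0.5)
print(sorted(kept))  # [('PEPA', 2), ('PEPB', 2)]
```

Here the only candidate for PEPC falls below the threshold, so no feature is reported for that peptide.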

6. Elution model fitting

Elution models can be fitted to the features to improve the quantification. For robustness, one model is fitted to all isotopic mass traces of a feature in parallel. A symmetric (Gaussian) and an asymmetric (exponential-Gaussian hybrid) model type are available. The fitted models are checked for plausibility before they are accepted.
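For orientation, the two model shapes can be written as simple functions. The sketch uses the commonly used forms of a Gaussian and of an exponential-Gaussian hybrid (EGH); whether the tool parameterizes its models in exactly this way is an assumption, and the function and parameter names are illustrative.

```python
import math


def gaussian(t, height, apex, sigma):
    """Symmetric elution model."""
    return height * math.exp(-((t - apex) ** 2) / (2.0 * sigma ** 2))


def egh(t, height, apex, sigma, tau):
    """Asymmetric elution model (exponential-Gaussian hybrid)."""
    dt = t - apex
    denominator = 2.0 * sigma ** 2 + tau * dt
    if denominator <= 0.0:
        return 0.0  # the model is defined as zero outside its support
    return height * math.exp(-(dt ** 2) / denominator)


# With tau > 0 the peak tails towards later retention times.
front = egh(90.0, height=1000.0, apex=100.0, sigma=5.0, tau=2.0)
tail = egh(110.0, height=1000.0, apex=100.0, sigma=5.0, tau=2.0)
print(front < tail)  # True
```

With `tau` set to zero, the EGH reduces to the Gaussian, which is why the symmetric model can be seen as a special case of the asymmetric one.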

Finally, the results (feature maps, parameter out) are returned.

Note: Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

FeatureFinderIdentification -- Detects features in MS1 data based on peptide identifications.
Full documentation: http://www.openms.de/doxygen/release/3.1.0/html/TOPP_FeatureFinderIdentification.html
Version: 3.1.0 Oct 18 2023, 10:27:18, Revision: 17a07f8
To cite OpenMS:
 + Rost HL, Sachsenberg T, Aiche S, Bielow C et al. OpenMS: a flexible open-source software platform for 
   mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.
To cite FeatureFinderIdentification:
 + Weisser H, Choudhary JS. Targeted Feature Detection for Data-Dependent Shotgun Proteomics. J. Proteome 
   Res. 2017; 16, 8:2964-2974. doi:10.1021/acs.jproteome.7b00248.

Usage:
  FeatureFinderIdentification <options>

Options (mandatory options marked with '*'):
  -in <file>*                         Input file: LC-MS raw data (valid formats: 'mzML')
  -id <file>*                         Input file: Peptide identifications derived directly from 'in' (valid 
                                      formats: 'idXML')
  -id_ext <file>                      Input file: 'External' peptide identifications (e.g. from aligned runs)
                                       (valid formats: 'idXML')
  -out <file>*                        Output file: Features (valid formats: 'featureXML')
  -lib_out <file>                     Output file: Assay library (valid formats: 'traML')
  -chrom_out <file>                   Output file: Chromatograms (valid formats: 'mzML')
  -candidates_out <file>              Output file: Feature candidates (before filtering and model fitting) 
                                      (valid formats: 'featureXML')
  -quantify_decoys                    Whether decoy peptides should be quantified (true) or skipped (false).
  -min_psm_cutoff <text>              Minimum score for the best PSM of a spectrum to be used as seed. Use 
                                      'none' for no cutoff. (default: 'none')
  -add_mass_offset_peptides <value>   If set, an offset peptide is also extracted for every peptide (or seed).
                                      Can be used downstream to determine MBR false transfer rates.
                                      (0.0 = disabled) (default: '0.0') (min: '0.0')

Parameters for ion chromatogram extraction:
  -extract:batch_size <number>        Nr of peptides used in each batch of chromatogram extraction. Smaller 
                                      values decrease memory usage but increase runtime. (default: '5000') 
                                      (min: '1')
  -extract:mz_window <value>          M/z window size for chromatogram extraction (unit: ppm if 1 or greater,
                                       else Da/Th) (default: '10.0') (min: '0.0')
  -extract:n_isotopes <number>        Number of isotopes to include in each peptide assay. (default: '2') 
                                      (min: '2')

Parameters for detecting features in extracted ion chromatograms:
  -detect:peak_width <value>          Expected elution peak width in seconds, for smoothing (Gauss filter).
                                      Also determines the RT extraction window, unless set explicitly via
                                      'extract:rt_window'. (default: '60.0') (min: '0.0')
  -detect:mapping_tolerance <value>   RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute
                                       value in seconds if 1 or greater, else relative to the RT span of the 
                                      feature. (default: '0.0') (min: '0.0')

Parameters for scoring features using a support vector machine (SVM):
  -svm:samples <number>               Number of observations to use for training ('0' for all) (default: '0')
                                       (min: '0')
  -svm:no_selection                   By default, roughly the same number of positive and negative
                                      observations, with the same intensity distribution, are selected for
                                      training. This aims to reduce biases, but also reduces the amount of
                                      training data. Set this flag to skip this procedure and consider all
                                      available observations (subject to 'svm:samples').
  -svm:xval_out <choice>              Output file: SVM cross-validation (parameter optimization) results
                                      (valid formats: 'csv')
  -svm:kernel <choice>                SVM kernel (default: 'RBF') (valid: 'RBF', 'linear')
  -svm:xval <number>                  Number of partitions for cross-validation (parameter optimization)
                                      (default: '5') (min: '1')
  -svm:log2_C <values>                Values to try for the SVM parameter 'C' during parameter optimization.
                                      A value 'x' is used as 'C = 2^x'. (default: '[-5.0 -3.0 -1.0 1.0 3.0
                                      5.0 7.0 9.0 11.0 13.0 15.0]')
  -svm:log2_gamma <values>            Values to try for the SVM parameter 'gamma' during parameter
                                      optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'.
                                      (default: '[-15.0 -13.0 -11.0 -9.0 -7.0 -5.0 -3.0 -1.0 1.0 3.0]')
  -svm:log2_p <values>                Values to try for the SVM parameter 'epsilon' during parameter
                                      optimization (epsilon-SVR only). A value 'x' is used as
                                      'epsilon = 2^x'. (default: '[-15.0 -12.0 -9.0 -6.0 -3.32192809489 0.0
                                      3.32192809489 6.0 9.0 12.0 15.0]')

Parameters for fitting elution models to features:
  -model:type <choice>                Type of elution model to fit to features (default: 'symmetric') (valid:
                                       'symmetric', 'asymmetric', 'none')

Parameters for fitting exp. mod. Gaussians to mass traces:
  -EMGScoring:max_iteration <number>  Maximum number of iterations for EMG fitting. (default: '100') (min: 
                                      '1')
  -EMGScoring:init_mom                Alternative initial parameters for fitting through method of moments.

                                      
Common TOPP options:
  -ini <file>                         Use the given TOPP INI file
  -threads <n>                        Sets the number of threads allowed to be used by the TOPP tool
                                      (default: '1')
  -write_ini <file>                   Writes the default configuration file
  --help                              Shows options
  --helphelp                          Shows all options (including advanced)
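
As a usage illustration, the sketch below assembles a command line from the options listed above. The helper name `build_ffid_command` and all file names are hypothetical; only option names that appear in the help text are used. The command is merely built and printed here, not executed.

```python
def build_ffid_command(mzml, idxml, out, id_ext=None, extra=None):
    """Assemble an argument list for FeatureFinderIdentification."""
    cmd = ["FeatureFinderIdentification", "-in", mzml, "-id", idxml, "-out", out]
    if id_ext is not None:
        # Optional 'external' identifications, e.g. from aligned runs.
        cmd += ["-id_ext", id_ext]
    for name, value in (extra or {}).items():
        # Further options are given as "section:name" keys, e.g. "model:type".
        cmd += [f"-{name}", str(value)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_ffid_command(
    "run1.mzML", "run1.idXML", "run1.featureXML",
    id_ext="other_runs.idXML",
    extra={"extract:mz_window": 10.0, "model:type": "asymmetric"},
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the tool, provided it is installed and on the search path.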

INI file documentation of this tool:

+ FeatureFinderIdentification: Detects features in MS1 data based on peptide identifications.

version (default: '3.1.0'): Version of the tool that generated this parameters file.

++ 1: Instance '1' section for 'FeatureFinderIdentification'

in: Input file: LC-MS raw data (input file, *.mzML)

id: Input file: Peptide identifications derived directly from 'in' (input file, *.idXML)

id_ext: Input file: 'External' peptide identifications (e.g. from aligned runs) (input file, *.idXML)

out: Output file: Features (output file, *.featureXML)

lib_out: Output file: Assay library (output file, *.traML)

chrom_out: Output file: Chromatograms (output file, *.mzML)

candidates_out: Output file: Feature candidates (before filtering and model fitting) (output file, *.featureXML)

candidates_in: Input file: Feature candidates from a previous run. If set, only feature classification and elution model fitting are carried out, if enabled. Many parameters are ignored. (input file, *.featureXML)

debug (default: '0'): Sets the debug level (range: 0:∞)

quantify_decoys (default: 'false'): Whether decoy peptides should be quantified (true) or skipped (false). (valid: true, false)

min_psm_cutoff (default: 'none'): Minimum score for the best PSM of a spectrum to be used as seed. Use 'none' for no cutoff.

add_mass_offset_peptides (default: '0.0'): If set, an offset peptide is also extracted for every peptide (or seed). Can be used downstream to determine MBR false transfer rates. (0.0 = disabled) (range: 0.0:∞)

log: Name of log file (created only when specified)

threads (default: '1'): Sets the number of threads allowed to be used by the TOPP tool

no_progress (default: 'false'): Disables progress logging to command line (valid: true, false)

force (default: 'false'): Overrides tool-specific checks (valid: true, false)

test (default: 'false'): Enables the test mode (needed for internal use only) (valid: true, false)

+++ extract: Parameters for ion chromatogram extraction

batch_size (default: '5000'): Nr of peptides used in each batch of chromatogram extraction. Smaller values decrease memory usage but increase runtime. (range: 1:∞)

mz_window (default: '10.0'): m/z window size for chromatogram extraction (unit: ppm if 1 or greater, else Da/Th) (range: 0.0:∞)

n_isotopes (default: '2'): Number of isotopes to include in each peptide assay. (range: 2:∞)

isotope_pmin (default: '0.0'): Minimum probability for an isotope to be included in the assay for a peptide. If set, this parameter takes precedence over 'extract:n_isotopes'. (range: 0.0:1.0)

rt_quantile (default: '0.95'): Quantile of the RT deviations between aligned internal and external IDs to use for scaling the RT extraction window (range: 0.0:1.0)

rt_window (default: '0.0'): RT window size (in sec.) for chromatogram extraction. If set, this parameter takes precedence over 'extract:rt_quantile'. (range: 0.0:∞)

+++ detect: Parameters for detecting features in extracted ion chromatograms

peak_width (default: '60.0'): Expected elution peak width in seconds, for smoothing (Gauss filter). Also determines the RT extraction window, unless set explicitly via 'extract:rt_window'. (range: 0.0:∞)

min_peak_width (default: '0.2'): Minimum elution peak width. Absolute value in seconds if 1 or greater, else relative to 'peak_width'. (range: 0.0:∞)

signal_to_noise (default: '0.8'): Signal-to-noise threshold for OpenSWATH feature detection (range: 0.1:∞)

mapping_tolerance (default: '0.0'): RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute value in seconds if 1 or greater, else relative to the RT span of the feature. (range: 0.0:∞)

+++ svm: Parameters for scoring features using a support vector machine (SVM)

samples (default: '0'): Number of observations to use for training ('0' for all) (range: 0:∞)

no_selection (default: 'false'): By default, roughly the same number of positive and negative observations, with the same intensity distribution, are selected for training. This aims to reduce biases, but also reduces the amount of training data. Set this flag to skip this procedure and consider all available observations (subject to 'svm:samples'). (valid: true, false)

xval_out: Output file: SVM cross-validation (parameter optimization) results (output file, *.csv)

kernel (default: 'RBF'): SVM kernel (valid: RBF, linear)

xval (default: '5'): Number of partitions for cross-validation (parameter optimization) (range: 1:∞)

log2_C (default: '[-5.0, -3.0, -1.0, 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]'): Values to try for the SVM parameter 'C' during parameter optimization. A value 'x' is used as 'C = 2^x'.

log2_gamma (default: '[-15.0, -13.0, -11.0, -9.0, -7.0, -5.0, -3.0, -1.0, 1.0, 3.0]'): Values to try for the SVM parameter 'gamma' during parameter optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'.

log2_p (default: '[-15.0, -12.0, -9.0, -6.0, -3.32192809489, 0.0, 3.32192809489, 6.0, 9.0, 12.0, 15.0]'): Values to try for the SVM parameter 'epsilon' during parameter optimization (epsilon-SVR only). A value 'x' is used as 'epsilon = 2^x'.

epsilon (default: '1.0e-03'): Stopping criterion (range: 0.0:∞)

cache_size (default: '100.0'): Size of the kernel cache (in MB) (range: 1.0:∞)

no_shrinking (default: 'false'): Disable the shrinking heuristics (valid: true, false)

predictors (default: 'peak_apices_sum,var_xcorr_coelution,var_xcorr_shape,var_library_sangle,var_intensity_score,sn_ratio,var_log_sn_score,var_elution_model_fit_score,xx_lda_prelim_score,var_ms1_isotope_correlation_score,var_ms1_isotope_overlap_score,var_massdev_score,main_var_xx_swath_prelim_score'): Names of OpenSWATH scores to use as predictors for the SVM (comma-separated list)

min_prob (default: '0.0'): Minimum probability of correctness, as predicted by the SVM, required to retain a feature candidate (range: 0.0:1.0)

+++ model: Parameters for fitting elution models to features

type (default: 'symmetric'): Type of elution model to fit to features (valid: symmetric, asymmetric, none)

add_zeros (default: '0.2'): Add zero-intensity points outside the feature range to constrain the model fit. This parameter sets the weight given to these points during model fitting; '0' to disable. (range: 0.0:∞)

unweighted_fit (default: 'false'): Suppress weighting of mass traces according to theoretical intensities when fitting elution models (valid: true, false)

no_imputation (default: 'false'): If fitting the elution model fails for a feature, set its intensity to zero instead of imputing a value from the initial intensity estimate (valid: true, false)

each_trace (default: 'false'): Fit elution model to each individual mass trace (valid: true, false)

++++ check: Parameters for checking the validity of elution models (and rejecting them if necessary)

min_area (default: '1.0'): Lower bound for the area under the curve of a valid elution model (range: 0.0:∞)

boundaries (default: '0.5'): Time points corresponding to this fraction of the elution model height have to be within the data region used for model fitting (range: 0.0:1.0)

width (default: '10.0'): Upper limit for acceptable widths of elution models (Gaussian or EGH), expressed in terms of modified (median-based) z-scores. '0' to disable. Not applied to individual mass traces (parameter 'each_trace'). (range: 0.0:∞)

asymmetry (default: '10.0'): Upper limit for acceptable asymmetry of elution models (EGH only), expressed in terms of modified (median-based) z-scores. '0' to disable. Not applied to individual mass traces (parameter 'each_trace'). (range: 0.0:∞)

+++ EMGScoring: Parameters for fitting exp. mod. Gaussians to mass traces.

max_iteration (default: '100'): Maximum number of iterations for EMG fitting. (range: 1:∞)

init_mom (default: 'false'): Alternative initial parameters for fitting through method of moments. (valid: true, false)
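
The svm:log2_C and svm:log2_gamma lists documented above define the candidate values for parameter optimization. The sketch below expands the default lists into actual parameter values, using the rule that a value 'x' is used as 2^x. That every combination of C and gamma is evaluated (an exhaustive grid) is an assumption made for illustration, and the function name is hypothetical.

```python
from itertools import product

# Default values of svm:log2_C and svm:log2_gamma.
LOG2_C = [-5.0, -3.0, -1.0, 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
LOG2_GAMMA = [-15.0, -13.0, -11.0, -9.0, -7.0, -5.0, -3.0, -1.0, 1.0, 3.0]


def svm_parameter_grid(log2_c, log2_gamma, kernel="RBF"):
    """Expand log2 values into SVM parameter combinations."""
    if kernel == "linear":
        # 'gamma' applies to the RBF kernel only.
        return [{"C": 2.0 ** c} for c in log2_c]
    # Assumption: every C is combined with every gamma.
    return [{"C": 2.0 ** c, "gamma": 2.0 ** g}
            for c, g in product(log2_c, log2_gamma)]


grid = svm_parameter_grid(LOG2_C, LOG2_GAMMA)
print(len(grid))  # 110
print(grid[0])    # {'C': 0.03125, 'gamma': 3.0517578125e-05}
```

Under this assumption, each of the 110 combinations would be evaluated with svm:xval partitions, which is why shortening the lists can reduce runtime noticeably.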