OpenPepXL: An Open source Peptide Cross-Link identification tool

OpenPepXL is a protein-protein cross-link identification tool implemented in C++ as part of The OpenMS Proteomics Pipeline (TOPP). It is applicable to all labeled and label-free cross-linkers. With labeled linkers it can denoise spectra by comparing the spectra containing light and heavy linkers. The tool is applicable to CID and HCD spectra (although it works optimally with high-resolution HCD data) and can make use of high-resolution instruments by deisotoping fragment peaks and considering their charge for matching to theoretical peaks.
Although no heuristics are used to reduce the search space, it still has a reasonable memory usage and can be used effectively on a desktop computer with 16 GB of memory for almost all applications.
OpenPepXL supports the mzIdentML 1.2 format for cross-link identification data, and a visualization of matched peaks based on this format is implemented in TOPPView.
The tool settings can be changed with the INIFileEditor GUI. The input files should be in mzML format with centroided MS2 spectra (centroided either on acquisition, on conversion to mzML or using the TOPP tool PeakPickerHiRes). For label-free cross-linkers the executable OpenPepXLLF (OpenPepXL Label-Free) must be used. To make use of labeled linkers, the executable OpenPepXL must be used and a consensusXML file produced by the TOPP tool FeatureFinderMultiplex is needed for each mzML file.

OpenPepXL output is compatible with several tools for further processing, visualization and publishing of XL-MS data:

OpenPepXL supports the output format of xQuest (Rinner et al., 2008). This means the output is compatible with any post-processing and visualization tools developed for the xQuest pipeline, such as xProphet (Leitner et al., 2014) for FDR estimation and xTract (Walzthoeni et al., 2015) for quantification, as well as the UCSF Chimera plug-in Xlink Analyzer (Kosinski et al., 2015) for visualizing and analyzing cross-links on structures.
OpenMS internal tools for FDR estimation (XFDR) and quantification (various FeatureFinder tools, such as FeatureFinderMultiplex for labeled data, or FeatureFinderIdentification for targeted quantification of identified cross-links) are also available and make it possible to build complex automated data processing pipelines.
Output of cross-link identification data in mzIdentML 1.2 format (Vizcaíno et al., 2017) will allow complete submissions of Cross-Linking MS data to the PRIDE database and ProteomeXchange (Ternent et al., 2014).

Installation

OpenPepXL is integrated into the OpenMS Proteomics Pipeline. Version 1.0 is available with OpenMS 2.4. OpenPepXL 1.1 will be available as part of OpenMS 2.5.
To use the current version now, download a pre-release of the OpenMS 2.5 installers for Windows, Linux and Mac OS X from GitHub. It includes fully functioning versions of OpenPepXL 1.1 and all related OpenMS tools:
github.com/enetz/OpenMS/releases
If you want to use OpenPepXL 1.1 in KNIME, install the newest KNIME release (see www.openms.de/getting-started/creating-workflows/) and then install OpenMS 2.4 through the KNIME community contributions. You can then update the OpenMS nodes to the 2.5 pre-release using the archive ending with ...KNIME-plugin.zip among the OpenMS 2.5 pre-release installers linked above. To do that, go to the Install New Software menu, click the Add... button at the repository selection, then click the Archive... button in the pop-up window and select the downloaded ...KNIME-plugin.zip file directly. You should now see the Generic Workflow Nodes (GWNs) and OpenMS in the list of plugins, under the category Uncategorized. Select OpenMS and ignore the GWNs (updating the GWNs is not necessary, and attempting to do so using this archive will abort the installation with an error message). Click Next at the bottom and continue the installation of the OpenMS nodes. This will update all OpenMS nodes to the newer version. You can then import the KNIME workflow (*.knwf file), which you can download from the OpenMS 2.5 pre-release page, as a starting point.
If you have any questions, suggestions, or bug reports, please visit the support page, open an issue on GitHub, or write an email to eugen.netz@tuebingen.mpg.de.

Tutorials

Below you will find instructions on how to use OpenPepXL, starting with command line usage in combination with some OpenMS GUI tools. Running OpenMS tools this way is possible on all platforms, including remote computing environments and computing clusters. This section also explains the most important parameters and contains general information about these tools. Further down is an explanation of how to use OpenPepXL with the KNIME workflow manager GUI. This is the preferred way of running OpenMS tools on a desktop computer and does not require experience with terminals or command line tools. KNIME is also convenient for further processing of the output from OpenPepXL.

Introduction to TOPP INI files, tool settings for command line use

All the TOPP tools of the OpenPepXL workflow share a common framework of setting up and running the tools in the command line using INI files. To generate a tool specific INI file with default settings, call the tool executable with the parameter -write_ini filename.ini.
Example:

OpenPepXLLF -write_ini OPXL.ini

To edit the settings, open the .ini file in the TOPP tool INIFileEditor. The INIFileEditor shows a description of each parameter at the bottom and helps to fill out many parameters, e.g. by offering a file browser to select input and output files and by showing the possible choices for parameters with limited options. To see the full list of parameters, including advanced parameters that should not be necessary for most users, check the box for advanced parameters at the bottom. The INI files can also be edited using a text editor when opening a GUI is not possible, e.g. when working on a remote server.
To run a tool using the edited INI file, call the tool executable with the parameter -ini filename.ini.

Example:

OpenPepXLLF -ini path/to/OPXL.ini

One INI file can be reused for several runs with different parameters (e.g. another input and output file) by explicitly giving the tool additional parameters on the command line. These command line parameters have a higher priority, and the values written in the INI file for these parameters are ignored for this run. This lets you keep a fixed set of search parameters in an INI file and change only a subset of variable parameters for each run.
Examples:

OpenPepXLLF -ini path/OPXL.ini -in path/input_file_01.mzML -out_idXML path/output_file_01.idXML
OpenPepXLLF -ini path/OPXL.ini -in path/input_file_02.mzML -out_idXML path/output_file_02.idXML

All parameters adjustable through an INI file are also adjustable through the command line. Many parameters are grouped into categories, e.g. all OpenPepXL parameters concerning precursor masses are in the precursor group. To adjust these through the command line, you have to use the group name as a prefix, e.g. -precursor:mass_tolerance 10 and -precursor:mass_tolerance_unit ppm. The examples shown here assume OpenMS with OpenPepXL is installed on a Linux computer and the binaries are in the PATH. On Windows computers the binaries of the tools would end with .exe, otherwise everything should work in the same way.
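
As a minimal sketch of the override mechanism described above, the following shell function prints one OpenPepXLLF call per input file. It reuses a single INI file and overrides the input file, the output file and the two precursor tolerance parameters on the command line. The function name opxl_batch and the output naming scheme are assumptions made for illustration, and file names are assumed to contain no spaces.

```shell
#!/bin/sh
# Print one OpenPepXLLF command per mzML file, sharing a single INI file.
# Hypothetical helper: the name opxl_batch and the naming of the output
# files (input name with .idXML ending) are illustrative choices.
opxl_batch() {
    ini=$1
    shift
    for mzml in "$@"; do
        out="${mzml%.mzML}.idXML"   # replace the .mzML ending
        # Grouped parameters are addressed with the group name as prefix.
        echo "OpenPepXLLF -ini $ini -in $mzml -out_idXML $out -precursor:mass_tolerance 10 -precursor:mass_tolerance_unit ppm"
    done
}
```

Calling opxl_batch OPXL.ini run1.mzML run2.mzML prints two commands for review. Piping the output to sh would execute them.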

Setting up and running OpenPepXLLF (label-free linkers):

Quickstart:

1. Generate a default parameter file for OpenPepXLLF:

OpenPepXLLF -write_ini OPXL.ini

2. Edit the parameter file using INIFileEditor (or a text editor) and run the tool using the parameter file:

OpenPepXLLF -ini OPXL.ini -in filename.mzML -out_idXML filename.idXML

Full Explanation:

Generate an INI file for OpenPepXLLF (OpenPepXLLF -write_ini OPXL.ini).
Open the generated INI file with the INIFileEditor. Choose your mzML input file (parameter: -in) and output files in any of the three supported formats (-out_xquestxml, -out_idXML, -out_mzid). If you do not specify any of these, the tool will run through to the end, but will not write out any results! With the parameter -threads you can choose the number of CPU threads the tool will use for this run with the given input file. Alternatively, you can start the tool with different parameters or input files in parallel using multiple terminals.
Hint: Processing multiple files in parallel is somewhat more time efficient, but the memory required by the individual runs adds up. Using the -threads parameter to process one input file in parallel uses the same amount of memory as processing it on one thread.
Adapt the precursor and fragment mass tolerances to your MS instrument (for Orbitrap data, usually a precursor mass tolerance of 10 ppm and a fragment mass tolerance of 20 ppm) and add the fixed and variable modifications that you expect in your samples (aside from the cross-linker).
In the cross-linker category you can define your cross-linker. The default settings are for DSS and are also correct for BS3. The -residue1 and -residue2 parameters accept lists of residues for each reactive group of the linker, so that you can define any heterobifunctional cross-linker. N-term and C-term are also valid entries for these two parameters and will link protein termini. The preferred output format is idXML, the internal identification format of OpenMS. This format is most compatible with XFDR and other post-processing and filtering tools within OpenMS, like IDFilter and IDMerger. Finally, results in idXML can be converted into other formats, e.g. into mzIdentML using IDFileConverter or into a CSV table using TextExporter. Before running the tool, make sure you set up the protein database in a way that is compatible with XFDR (see the next section).
Run the tool using the command line:

OpenPepXLLF -ini OPXL.ini

or in case of multiple input files and/or parameter sets

OpenPepXLLF -ini OPXL1.ini -in path/file1.mzML -out_idXML path/file1.idXML
OpenPepXLLF -ini OPXL2.ini -in path/file2.mzML -out_idXML path/file2.idXML
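
The conversion of idXML results into other formats mentioned above could look roughly like the sketch below. It assumes that IDFileConverter infers the target format from the .mzid file ending and that TextExporter accepts a .csv output file; both should be checked against the --help output of your installed version. The function name convert_cmds is hypothetical, and file names are assumed to contain no spaces.

```shell
#!/bin/sh
# Print the commands that convert one idXML result file to mzIdentML and CSV.
# Hypothetical helper; -in and -out are assumed to be the input and output
# parameters of IDFileConverter and TextExporter in the installed version.
convert_cmds() {
    base="${1%.idXML}"   # strip the .idXML ending
    echo "IDFileConverter -in $1 -out $base.mzid"
    echo "TextExporter -in $1 -out $base.csv"
}
```

For example, convert_cmds path/file1.idXML prints the two conversion commands for that result file.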

Setting up and running XFDR (FDR estimation for XL-MS):

Quickstart:

1. Generate a default parameter file for XFDR:

XFDR -write_ini XFDR.ini

2. Edit the parameter file using INIFileEditor (or a text editor) and run XFDR:

XFDR -ini XFDR.ini -in filename.idXML -out_idXML filename_XFDR.idXML

Full Explanation:

XFDR is a reimplementation of xProphet (Leitner et al., 2014). It reads in XL-MS identifications (IDs) and computes score distributions for decoys and targets from the first ranked ID of each MS2 spectrum. Using these score distributions, it assigns an FDR value to every hit. XFDR divides the set of IDs into intra- and inter-protein cross-links as well as mono-links/dead-end-links and loop-links and computes separate score distributions and FDRs for each of these groups. To be able to count the number of intra-protein target-decoy hybrids, XFDR expects that the OpenPepXL/LF search has been done with a corresponding decoy protein for each target protein in the database. The name of each decoy protein should be the name of the target protein with a decoy prefix, e.g. target: “Protein1”, decoy: “decoy_Protein1” (the prefix here is “decoy_”). Most importantly, removing the decoy prefix must yield exactly the target name. You can choose any prefix you wish, but it has to be set consistently in the -decoy_string parameter of both OpenPepXL/LF and XFDR.
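
The naming requirement for decoys can be checked before a search. The sketch below lists every target accession in a FASTA file that has no decoy entry with the given prefix. It is not part of OpenMS. It assumes that the accession is the first whitespace-delimited token of each header line directly after the ">" character, and the function name missing_decoys is hypothetical.

```shell
#!/bin/sh
# List target accessions that lack a matching decoy entry.
# Usage: missing_decoys database.fasta [prefix]   (prefix defaults to decoy_)
# Assumption: headers look like ">accession optional description".
missing_decoys() {
    fasta=$1
    prefix=${2:-decoy_}
    awk -v p="$prefix" '
        /^>/ {
            acc = substr($1, 2)                 # drop the leading ">"
            if (index(acc, p) == 1)             # accession starts with prefix
                decoy[substr(acc, length(p) + 1)] = 1
            else
                target[acc] = 1
        }
        END {
            for (t in target)
                if (!(t in decoy)) print t      # target without decoy
        }
    ' "$fasta"
}
```

An empty output means that every target has a decoy whose name matches exactly after removing the prefix.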

Some of the parameters of the tool influence which IDs are used to compute the score distributions for FDR estimation. By default all first ranked IDs are used. The -minborder and -maxborder parameters allow you to set a range of precursor mass errors (the default values of -1 will not apply this filter), e.g. you can compute FDRs from IDs with a precursor mass error between -5 ppm and 5 ppm even if the OpenPepXL/LF search used a precursor tolerance of 10 or 15 ppm. The parameter -mindeltas applies a filter to the score difference between the first and second ranked IDs of an MS2 spectrum. A value of 0.95 will only use IDs with a score ratio of at most 95% between the second and first ranked IDs (or in other words will enforce a score difference of at least 5%). Setting the parameter -uniquexl to “true” will only consider the top scoring ID among those IDs recognized as equal by XFDR. Equal IDs are those that link the same positions in the same digested peptide sequences and are not what one would consider unique cross-links on the protein level.
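
To make the -mindeltas description concrete, here is a small worked check of the score ratio as it is described above. This only illustrates the arithmetic and is not XFDR's actual code. The function name passes_mindeltas is hypothetical, and the first ranked score is assumed to be positive.

```shell
#!/bin/sh
# Succeeds if the ratio of the second to the first ranked score is at most
# the given threshold, following the description of -mindeltas in the text.
# Usage: passes_mindeltas first_score second_score threshold
passes_mindeltas() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { exit !(b / a <= t) }'
}
```

With a threshold of 0.95, a spectrum with scores 20 and 18 has a ratio of 0.9 and would be used, while scores of 20 and 19.5 give a ratio of 0.975 and would be excluded.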

Setting up and running OpenPepXL (labeled linkers):

Quickstart:

1. Generate a default parameter file for FeatureFinderMultiplex (FFM):

FeatureFinderMultiplex -write_ini FFM.ini

2. Edit the parameter file using INIFileEditor (or a text editor) and run FFM:

FeatureFinderMultiplex -ini FFM.ini -in filename.mzML -out_multiplets filename.consensusXML

3. Generate a default parameter file for OpenPepXL:

OpenPepXL -write_ini OPXL.ini 

4. Edit the parameter file using INIFileEditor (or a text editor) and run the tool using the parameter file:

OpenPepXL -ini OPXL.ini -in filename.mzML -consensus filename.consensusXML -out_idXML filename.idXML

Full Explanation:

All of the parameters as well as the FDR estimation procedure described for OpenPepXLLF above are also applicable to OpenPepXL, so this section mainly describes the additional steps necessary to run OpenPepXL to search for labeled linkers.
To use the additional preprocessing features for labeled cross-links, OpenPepXL requires additional information to link together the correct MS2 spectra. This information is provided by a consensusXML file, an OpenMS format containing the boundaries of MS1 features and connections between them. For every .mzML input file OpenPepXL requires a .consensusXML with linked MS1 features from that .mzML file. Generating a .consensusXML file is a two-step process. The first step is finding the MS1 features, each defined as a group of isotopic mass traces showing a characteristic bell-shaped intensity curve along the retention time axis. Each MS1 feature represents one species of molecules, e.g. one peptide or cross-linked peptide pair, localized to a contiguous patch of the two-dimensional m/z vs. retention time map. The second step is linking features at a mass distance that corresponds to the mass difference of the labeled cross-linkers. There are multiple ways of achieving that with different TOPP tools. The recommended way is to use FeatureFinderMultiplex. This tool takes care of both steps at once. It reports only paired features and discards any unpaired features from the first step.

Setting up FeatureFinderMultiplex (finding feature pairs on the MS1 level):

To run FeatureFinderMultiplex (FFM), generate an *.ini file (FeatureFinderMultiplex -write_ini FFM.ini) and open it with the INIFileEditor. FFM expects an .mzML input file with centroided MS1 spectra. Choose your .mzML input file and a .consensusXML output file (parameter: -out_multiplets). The parameter value for -algorithm:labels should look like this: [0][mass_difference], e.g. [0][12.07573] for a mixture of labeled DSS linker, DSS-d0 and DSS-d12. Change the -algorithm:charge parameter to the charge range you want to search for (for XL-MS usually 3:7, meaning the range from +3 to +7) and set -algorithm:mz_tolerance and -algorithm:mz_unit to your MS instrument's MS1 tolerance. This can be set to the same value as the precursor mass tolerance for the OpenPepXL search. Set the parameter -algorithm:rt_band to at least 2. This makes the search more robust and increases the sensitivity at the cost of specificity. Because in this context the tool is mainly used to generate candidate spectrum pairs to search through, sensitivity is more important than specificity.
Run FeatureFinderMultiplex on the command line with

FeatureFinderMultiplex -ini FFM.ini

You can now run OpenPepXL using the command line in the same way as OpenPepXLLF with the additional input parameter -consensus filename.consensusXML defined in the INI file or directly as a command line parameter.
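
The two steps of the labeled search can be chained per input file, as in the following sketch. It prints the FeatureFinderMultiplex and OpenPepXL calls from the quickstart for each mzML file and derives the .consensusXML and .idXML names from the input name. The function name labeled_cmds and this naming scheme are assumptions for illustration, and file names are assumed to contain no spaces.

```shell
#!/bin/sh
# Print the FFM and OpenPepXL commands for each mzML input file.
# Usage: labeled_cmds FFM.ini OPXL.ini file1.mzML [file2.mzML ...]
# Hypothetical helper; output names are derived from the input file name.
labeled_cmds() {
    ffm_ini=$1
    opxl_ini=$2
    shift 2
    for mzml in "$@"; do
        base="${mzml%.mzML}"   # strip the .mzML ending
        # Step 1: find paired MS1 features and write them to consensusXML.
        echo "FeatureFinderMultiplex -ini $ffm_ini -in $mzml -out_multiplets $base.consensusXML"
        # Step 2: search the same mzML using the linked features.
        echo "OpenPepXL -ini $opxl_ini -in $mzml -consensus $base.consensusXML -out_idXML $base.idXML"
    done
}
```

The printed commands can be reviewed first and then executed by piping them to sh. Each OpenPepXL call depends on the consensusXML file written by the FFM call printed directly before it, so the order must be kept.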

Setting up PeakPickerHiRes (centroiding / peak picking of MS1 and/or MS2 spectra):

FeatureFinderMultiplex expects centroided MS1 spectra, while OpenPepXL and OpenPepXLLF expect centroided MS2 spectra. If you have .mzML files that are not centroided, or if you are not sure whether they are, you can use PeakPickerHiRes.
Generate an INI file (PeakPickerHiRes -write_ini picker.ini) and open it with the INIFileEditor. Choose your input and output file and the MS levels that you want to centroid using the -algorithm:ms_levels parameter. This will force peak picking on these levels, even if they are already picked. You can also leave the -algorithm:ms_levels parameter empty to pick peaks at all MS levels that are not centroided yet. The other settings do not need to be changed from the default for most applications.
Run the tool using the command line:

PeakPickerHiRes -ini picker.ini
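
As with the other tools, the input, the output and the MS levels can also be given on the command line instead of in the INI file. The sketch below prints such a call. It assumes that the input and output parameters of PeakPickerHiRes are named -in and -out and that list parameters such as -algorithm:ms_levels take space-separated values on the command line. The function name pick_cmd and the _picked suffix are hypothetical.

```shell
#!/bin/sh
# Print a PeakPickerHiRes command that centroids the given MS levels.
# Usage: pick_cmd input.mzML level [level ...]
# Assumptions: -in/-out name the input and output file, and the list
# parameter -algorithm:ms_levels takes space-separated values.
pick_cmd() {
    src=$1
    shift
    echo "PeakPickerHiRes -ini picker.ini -in $src -out ${src%.mzML}_picked.mzML -algorithm:ms_levels $*"
}
```

For example, pick_cmd raw.mzML 1 2 prints a call that forces peak picking on MS1 and MS2 spectra and writes raw_picked.mzML.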

Visualizing spectra and matched peaks with TOPPView:

First, open a spectrum in TOPPView. Then go to Tools->Annotate with identification and select the .mzid or .idXML file produced by OpenPepXL.
You can select an identified cross-link in the table on the right side of the TOPPView window to visualize it.
TOPPView allows you to zoom in and out freely. Peak annotations are read from the identification file but can be edited, moved, added or removed, e.g. to prepare clean images for publication. These custom labels can also be stored in and retrieved from .idXML files, the internal OpenMS identification data format.

The features and feature pairs identified by FeatureFinderMultiplex (FFM) can also be visualized in TOPPView. Load the .mzML file first, then add the .featureXML output file and the .consensusXML file as additional layers through File->Open file, selecting new layer in the pop-up window that appears. The .featureXML file (FFM parameter: -out) is optional and not needed by OpenPepXL, but it contains detailed information about each feature.

Setting up and running OpenPepXL using KNIME

First take a look at our getting started page for KNIME workflows to see what KNIME is all about:
https://www.openms.de/getting-started/creating-workflows/
After installing the OpenMS plugin in KNIME, you can build workflows by finding the tools in the tool list (e.g. by text search) and dragging and dropping their nodes onto the workflow pane. The output of one node can be connected to the input of another by dragging the mouse from one port to another. The ports are the rectangular or triangular symbols on the left and right sides of tool nodes. Inputs are always on the left, outputs on the right. Each node can be configured by double-clicking on it or by right-clicking and selecting the configure option. A pop-up window will show up that looks very similar to the OpenMS INIFileEditor GUI and has essentially the same function. Here the parameters of each tool can be set up. The figures below show simple example workflows for OpenPepXL and OpenPepXLLF that can be extended.

For reading in input files there are the nodes Input File and Input Files. Input File has one file path as its parameter and can be used for single files, like a protein database that is reused between different mzML input files. The Input Files node takes a list of file paths for other nodes to iterate over. The ZipLoopStart and ZipLoopEnd nodes allow for a section of the workflow to iterate over multiple entries in a list. In this case FeatureFinderMultiplex and OpenPepXL or OpenPepXLLF are run multiple times, once for each input file defined in the Input Files node. The Output Folder nodes have a path to a directory as their parameter and write out all incoming data as files into that directory. The names of these files will be equal to the names of the mzML input files, but with a different file ending. The ZipLoopEnd node collects all the results of the loop before sending them further to the output nodes, so the results are written once after all processing is finished. Putting an Output Folder node inside the loop, e.g. right after OpenPepXL, will write out the results of the current input file right after OpenPepXL is finished with it. This type of workflow will go through the input files iteratively one by one and parallelization for each run can be set using the -threads parameter of OpenPepXL. To run a KNIME workflow, click on the green arrow button above the workflow pane.

Below you can see a more complex workflow with FDR estimation and additional filtering. This is the label-free XL-MS workflow available for download at the OpenMS 2.5 pre-release page. It will run OpenPepXLLF on all mzML input files, run the FDR estimation, filter by 5% FDR, remove decoys and write out idXML files for visualization in TOPPView, as well as a CSV file with a table of all 1st ranked PSMs with FDR < 5% from all input mzML files. To run FeatureFinderMultiplex and OpenPepXL for a labeled cross-link search, remove the OpenPepXLLF node and drag those two in. Connect the inputs and outputs in a similar way to the workflow above. You can also add a PeakPickerHiRes node for peak picking in front of FeatureFinderMultiplex or OpenPepXLLF if needed and adapt the filters as needed. The IDFilter node has a lot of options, but the only function used in this workflow is filtering by specific meta values stored for each ID. The parameter -remove_peptide_hits_by_metavalue accepts three text strings that make up a condition. All IDs that fulfill this condition are accepted, others are filtered out. Look at the conditions set in the nodes of the workflow to get an idea of how this works. Most of the columns in the table returned by the TextExporter can be used to set up a filter.

References

Rinner O, Seebacher J, Walzthoeni T, Mueller L, Beck M, Schmidt A, Mueller M, Aebersold R (2008) Identification of cross-linked peptides from large sequence databases.

Leitner A, Walzthoeni T, Aebersold R. (2014) Lysine-specific chemical cross-linking of protein complexes and identification of cross-linking sites using LC-MS/MS and the xQuest/xProphet software pipeline.

Walzthoeni T, Joachimiak LA, Rosenberger G, Röst HL, Malmström L, Leitner A, Frydman R, Aebersold R (2015) xTract: software for characterizing conformational changes of protein complexes by quantitative cross-linking mass spectrometry.

Kosinski, J., et al. (2015) Xlink Analyzer: Software for analysis and visualization of cross-linking data in the context of three-dimensional structures. J. Struct. Biol.

Vizcaíno, J. A., Mayer, G., Perkins, S. R., Barsnes, H., Vaudel, M., Perez-Riverol, Y., … & Rappsilber, J. (2017). The mzIdentML data standard version 1.2, supporting advances in proteome informatics. Molecular & Cellular Proteomics, mcp-M117.

Ternent, T., Csordas, A., Qi, D., Gómez‐Baena, G., Beynon, R.J., Jones, A.R., Hermjakob, H. and Vizcaíno, J.A., 2014. How to submit MS proteomics data to ProteomeXchange via the PRIDE database. Proteomics, 14(20), pp.2233-2241.