MLE Charge Determination in ESI Mass Spectrometry

Calculate mass and charge from an ESI-mass spectrum
Author

Colin Kinz-Thompson

Published

February 1, 2024

Modified

March 17, 2024

Editted 2024/03/17: to fix math

Open in MS MLE tools in Colab: Open In Colab

Background

Often in an ESI-mass spectrum, you can see many peaks corresponding to different charge-states of the same molecule (i.e., with the same mass). Unfortunately, we only know the \(m/z\) value of these peaks, and do not know the charge corresponding to each peak. While neighboring peaks should have a \(\pm 1\) relationship, we do not know if the first peak (i.e., highest \(m/z\) we can see) is the \(z=+1\) peak or some other charge. Charge determination must be performed in order to figure out the charge of each peak, and thus determine the mass of the molecule. Advanced deconvolution techinques exist for this purpose (including max-entropy based techniques), but here I’ll derive an accessible formula using maximum-likelihood estimation to calculate the charge and mass of the molecule simultaneously from several peak locations.

Nomenclature

  • A list of peaks \(\{x_i\}\). This is just several \(m/z\) values you have identified as belonging to the same series in a mass spectrum (i.e., via their common envelope).
  • A reference index, \(0\). Pick one, it doesn’t matter which – it could be the largest, or the first.
  • The mass of the neutral molecule, \(m\).
  • The charge of the reference molecule, \(z_0\). Drop the \(0\) for ease of typing (\(z\)).
  • \(n_i\) is the index relative to the reference peak (i.e., \(n_0=0\), \(n_1=1\), \(n_{-1}=-1, ...\)).
  • \(\langle X \rangle \equiv \frac{1}{N} \sum_{i=1}^{N} X_i\)

Assumptions

Peaks are located at: \[x_i = \frac{m+(z_0+n_i)* M_{H}}{z_0+n_i},\] where \(M_H = 1.00784\) Da.

MLE Equations

Rearrange to find \[m = z(x_i-M_H)+n_i(x_i-M_H) = zh_i + n_ih_i,\] where \(h_i\equiv x_i-M_H\).

Note that this implies we can transform the peaks into mass if we know the charge and index. If the we assume normal distributed errors, and a common, constant variance, we can write \[\mathcal{L} = \prod_i \mathcal{N}(zh_i + n_ih_i \vert m, \sigma^2).\]

We’ll use the maximum-likelihood approach to estimate \(m\) and \(z\). But first, note that \[ \ln \mathcal{L} \propto \sum_i (zh_i + n_ih_i- m)^2\]

Estimate \(m\)

The MLE estimate is obtained by setting the derivative to zero \[0=\frac{d \ln \mathcal{L}}{d m}.\] We can solve for m to get \[0 = \sum_i 2(zh_i-n_ih_i-m)(-1) = \sum_i (zh_i-n_ih_i-m) = z\langle h \rangle + \langle nh\rangle - m\]

\[\implies m = z \langle h\rangle + \langle nh\rangle\]

Estimate \(z\)

The MLE estimate is obtained by setting the derivate to zero \[0=\frac{d \ln \mathcal{L}}{d z}.\]

Solving, we obtain \[0= \sum_i 2(zh_i + n_ih_i - m)h_i\] \[\implies z = \frac{m\langle h \rangle - \langle nh^2\rangle}{\langle h^2 \rangle } \]

Decouple the results

The equations above for \(m\) and \(z\) are not independent. By plugging the equation for \(m\) into the equation for \(z\) we obtain

\[ z = \frac{\langle h\rangle \langle nh \rangle - \langle nh^2 \rangle}{\langle h^2 \rangle - \langle h\rangle ^2}, \]

which is just in terms of the statistics of the peaks and (relative) charge assignment indexing. The equation for \(m\) is then obtained by plugging in to yield

\[ m = \frac{\langle nh \rangle \langle h^2 \rangle - \langle nh^2 \rangle \langle h \rangle}{\langle h^2 \rangle - \langle h \rangle^2} \]

Thoughts

Therefore, we have two separate equations that only depend on the values of \(\{x\}\) and \(\{n\}\) and we can estimate \(m\) and \(z\) from all the datapoints simultaneously. Note this is better (i.e., statistically speaking) than the typical approach of taking two neighboring peaks and doing: \[ x_1=\frac{m+1}{z+1} \approx \frac{m}{z+1} \rightarrow \frac{1}{x_1}= \frac{z}{m} + \frac{1}{m} = \frac{1}{x_0} + \frac{1}{m}\] which leads to

\[ m\approx \frac{x_0x_1}{x_0-x_1},\] and \[ z_0 \approx \frac{x_1}{x_0-x_1}.\]

Example

Here’s some Python code showing how to do this calculation. Note: it assumes you didn’t skip any peaks (i.e., …+14,15,17,18…); better to use a smaller peak list with no skipped peaks.

import numpy as np

## Myoglobin peaks from Fig. 5: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1853331/
x = [1305.0,1212,1131,1060,998,943,893,849,808,772,738,707,679,653]
x = np.array(x)
x = np.sort(x)[::-1]

n = np.arange(x.size)

H = 1.00784
hi = x-H
E_h = np.mean(hi)
E_h2 = np.mean(hi*hi)
E_nh = np.mean(n*hi)
E_nh2 = np.mean(n*hi*hi)

z = (E_h*E_nh-E_nh2)/(E_h2-E_h**2.)
m = z*E_h+E_nh


print("Mass: %.3f"%(m))
print('Charge_0:',z)
print('----------------')
print('z, Peak (m/z)')
for i in range(x.size):
    print(int(z+n[i]),x[i])
Mass: 16951.387
Charge_0: 13.00020082077729
----------------
z, Peak (m/z)
13 1305.0
14 1212.0
15 1131.0
16 1060.0
17 998.0
18 943.0
19 893.0
20 849.0
21 808.0
22 772.0
23 738.0
24 707.0
25 679.0
26 653.0