PSP - TMScore(US-align)、RMSD、Sequence 源码

article/2025/7/18 10:51:03

欢迎关注我的CSDN：https://spike.blog.csdn.net/
本文地址：https://blog.csdn.net/caroline_wendy/article/details/129125467

参考文档：Nature Methods | 蛋白、RNA、DNA及其复合物结构的比对算法US-align

官网地址：https://zhanggroup.org/US-align/

TMScore

TMScore，即 Template Modeling Score，参考Wikipedia、ZhangGroup：

US-align (Universal Structural alignment) is a unified protocol to compare 3D structures of different macromolecules (proteins, RNAs and DNAs) in different forms (monomers, oligomers and heterocomplexes) for both pairwise and multiple structure alignments.

US-align（通用结构比对）是一种统一协议，用于比较不同形式（单体、低聚物和异源复合物）的不同大分子（蛋白质、RNA 和 DNA）的 3D 结构，用于成对和多结构比对。

The core alogrithm of US-align is extended from TM-align and generates optimal structural alignments by maximizing TM-score of compared strucures through heuristic dynamic programming iterations.

US-align 的核心算法是从TM-align扩展而来的，通过最大化TM-score来生成最优结构比对通过启发式动态编程迭代比较结构。

Large-scale benchmark tests showed that US-align can generate more accurate structural alignments with significantly reduced CPU time, compared to the state-of-the-art methods developed for specific structural alignment tasks.

大规模基准测试表明，与为特定结构对齐任务开发的最先进方法相比，US-align 可以生成更准确的结构对齐，同时显着减少 CPU 时间。

TM-score has values in (0,1] with 1 indicating an identical structure match, where a TM-score ≥0.5 (or 0.45) means the structures share the same global topology for proteins (or RNAs).

TM-score 的值在 (0,1] 范围内，其中 1 表示相同的结构匹配，其中 TM-score ≥0.5（或 0.45）表示结构共享相同的蛋白质（或 RNA）全局拓扑结构。

TMScore的计算，一般使用US-align，来源于

US-align: Universal Structure Alignment of Proteins, Nucleic Acids and Macromolecular Complexes
蛋白质、核酸和大分子复合物的通用结构比对

全部比对功能都统一使用的TM-score作为打分函数。支持：

单链结构的两两比对，Monomeric Structure Alignment
复合物(低聚物)结构的两两比对，Oligomeric Structure Alignment
多结构比对，Multiple Structure Alignment，MSTA
基于模板的分子对接，Template-based Docking

RMSD

Root Mean Squared Deviation，均方根偏移，是重叠后的两个蛋白质的原子（一般是主链原子）之间的平均距离的度量。完全重合为0A，一般认为小于3A时，两个结构相似。

计算公式：
$RMSD(X,X^{ref})=\min_{R,t}\sqrt{\frac{1}{N}\sum_{i=1}^{N}[(R \cdot X_{i}(t)+t)-X^{ref}_{i}]^2}$

$X$ 和 $X^{ref}$ 为两个要对比的蛋白， $X_i$ 和 $X_{i}^{ref}$ 为用于比对的两个原子， $N$ 为比对的序列长度
蛋白重叠对齐的过程就是，找到一组旋转矩阵R和向量t使得两个用于比较的蛋白 $X$ 和 $X^{ref}$ 有最小的RMSD

具体步骤：

下载USalign.cpp：https://zhanggroup.org/US-align/bin/module/USalign.cpp
编译CPP脚本：
1. g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
2. macOS，不使用’-static’，即g++ -O3 -ffast-math -lm -o USalign USalign.cpp

注意：如果编译TMalign.cpp，无法编译，将#include <malloc.h>替换为#include <stdlib.h>

源码：

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/2/16
"""import os
import subprocessfrom root_dir import DATA_DIR, ROOT_DIRclass TMScoreAndRMSD(object):"""TMScore，来源于USalign"""def __init__(self):self.bin_dir = os.path.join(ROOT_DIR, "bin")self.usalign = os.path.join(self.bin_dir, "USalign")def get_tms_and_rmsd_with_usalign(self, pdb1_path="", pdb2_path=""):"""计算 TM Score参考: https://zhanggroup.org/US-align/"""if not pdb1_path or not pdb2_path:return -1.0cmd = " ".join([f"{self.usalign}", "-TMscore 1", "-outfmt 2", pdb1_path, pdb2_path])res = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE).stdout.read()res = res.decode("utf-8")# [PDBchain1 PDBchain2 TM1 TM2 RMSD ID1 ID2 IDali L1 L2 Lali]# IDali 一致ID/对齐ID# Lali 对齐长度items = res.split("\n")[1].split("\t")tm_score = (float(items[2]) + float(items[3])) * 0.5  # TM1和TM2的均值rmsd = float(items[4])return tm_score, rmsddef main():pdb1_path = os.path.join(DATA_DIR, "test_7roa.pdb")pdb2_path = os.path.join(DATA_DIR, "7roa.pdb")tsu = TMScoreAndRMSD()tm_score, rmsd = tsu.get_tms_and_rmsd_with_usalign(pdb1_path, pdb2_path)print(f"[Info] tm_score: {tm_score}, rmsd: {rmsd}")if __name__ == '__main__':main()

Sequence

标准FASTA序列，例如7ROA：

https://www.rcsb.org/structure/7ROA

>7ROA_1|Chain A|EntV|Enterococcus faecalis OG1RF (474186)
SDQLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDASTVVQARFNWNALGSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCGLS

源码：

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/2/17
"""
import os.path
import warningsfrom Bio.Data.PDBData import protein_letters_3to1_extended as d3to1
from Bio.PDB import PDBParserfrom root_dir import DATA_DIRdef get_seq_from_pdb(pdb_path):"""从PDB中获取序列:param pdb_path: PDB路径:return: PDB结构"""# 常见的20种结构# d3to1 = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',#          'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',#          'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',#          'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M',}record = pdb_path  # PDB路径# 解析器parser = PDBParser(QUIET=True)structure = parser.get_structure('struct', record)res_str = ""for model in structure:for chain in model:header = chain.idseq = []for residue in chain:if residue.resname in d3to1.keys():seq.append(d3to1[residue.resname])elif residue.resname == "HOH":continueelse:warnings.warn(f"{residue.resname} don't support!")seq_str = "".join(seq)res_str += f">{header}\n{seq_str}\n"return res_strdef main():pdb_path = os.path.join(DATA_DIR, "7roa.pdb")res_str = get_seq_from_pdb(pdb_path)print(f"[Info] res: \n{res_str}")if __name__ == '__main__':main()