Converting Gene Symbol Aliases to the Current Symbol for All Species
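In data analysis, a gene of interest is often missing from the matrix. The likely causes are that the gene was not measured, or that the matrix uses an outdated Symbol. We therefore need the correspondence between outdated Symbols (aliases) and the current Symbol.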
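As a quick check before any conversion, the snippet below lists which genes of interest are absent from a matrix index. It is only an illustrative sketch: the toy matrix and the variable names are assumptions, and the two Symbols are borrowed from the example output in this post.

```python
import pandas as pd

# Toy expression matrix indexed by Symbol (values are placeholders).
matrix = pd.DataFrame({"sample_1": [1.0, 2.0]}, index=["HIST1H2BA", "UBB"])

genes_of_interest = ["H2BC1", "UBB"]

# Genes not found in the index were either not measured
# or are listed under an older Symbol.
missing = [gene for gene in genes_of_interest if gene not in matrix.index]
print(missing)  # ['H2BC1']
```

Here H2BC1 is reported as missing only because the matrix still carries its older name, HIST1H2BA.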
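bq.tl.current_symbol updates the Symbols in an (expression) matrix to their current versions.
Its first parameter is a DataFrame (indexed by Symbol), the second is the path to the Symbol-to-Alias mapping file, and the third is the species tax_id, for example 9606 for human. To obtain SymbolAlias_20230317.feather, send an email to victor@bioquest.cn.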
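To make the idea concrete, here is a minimal pandas sketch of how an alias-to-current-Symbol lookup could work. It is not the bioquest implementation: the function name `to_current_symbol`, its handling of ambiguous aliases, and the toy data are all assumptions made for illustration. It only assumes a mapping table with the columns tax_id, Symbol and Alias.

```python
import pandas as pd


def to_current_symbol(frame: pd.DataFrame, mapping: pd.DataFrame, tax_id: int) -> pd.DataFrame:
    """Hypothetical helper: rename index entries that are known aliases."""
    ref = mapping[mapping["tax_id"] == tax_id]
    current = set(ref["Symbol"])

    # Drop empty aliases, then keep only aliases that point to a single
    # Symbol so that an ambiguous alias is never renamed.
    ref = ref[ref["Alias"] != ""].drop_duplicates(["Symbol", "Alias"])
    unique = ref[~ref["Alias"].duplicated(keep=False)]
    alias_to_symbol = dict(zip(unique["Alias"], unique["Symbol"]))

    out = frame.copy()
    # Names that are already current Symbols, or unknown, stay unchanged.
    out.index = pd.Index(
        [n if n in current else alias_to_symbol.get(n, n) for n in frame.index],
        name=frame.index.name,
    )
    return out


# Toy mapping and matrix; every name and value here is made up.
mapping = pd.DataFrame({
    "tax_id": [9606, 9606, 9606, 10090],
    "Symbol": ["NEW1", "NEW1", "NEW2", "Mnew"],
    "Alias": ["OLD1", "OLD1B", "", "OLD1"],
})
expr = pd.DataFrame({"sample_1": [1.0, 2.0, 3.0]}, index=["OLD1", "NEW2", "UNKNOWN"])

converted = to_current_symbol(expr, mapping, tax_id=9606)
print(list(converted.index))  # ['NEW1', 'NEW2', 'UNKNOWN']
```

Filtering by tax_id first matters because the same alias can belong to different genes in different species, as the mouse row in the toy mapping shows.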
Download the latest gene information from NCBI: /gene/DATA/gene_info.gz
Building the Symbol-to-Alias mapping
import numpy as np
import pandas as pd
import bioquest as bq
# Read only the columns needed from the NCBI gene_info dump
g=pd.read_csv("gene_info_20230317.gz",sep='\t',usecols=['#tax_id','GeneID','Symbol','Synonyms'])
g.rename(columns={"#tax_id":"tax_id"},inplace=True)
# Synonyms are pipe-separated; split them and emit one row per alias
g.loc[:,"Alias"]=g.Synonyms.str.split('|')
g = g.explode("Alias")
g = bq.tl.select(g,columns=["tax_id","GeneID","Symbol","Alias"])
g.reset_index(drop=True,inplace=True)
# gene_info marks "no synonym" with "-"; store it as an empty string
g.replace({'Alias': {'-':''}},inplace=True)
g.to_feather("SymbolAlias_20230317.feather",compression='zstd',compression_level=1)
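The split-and-explode step above can be tried on a tiny in-memory table. This is a sketch under stated assumptions: the rows are invented placeholders in the gene_info column layout, and plain column selection stands in for bq.tl.select.

```python
import pandas as pd

# Toy rows in the gene_info layout; all values are made up for illustration.
toy = pd.DataFrame({
    "tax_id": [9606, 9606],
    "GeneID": [1, 2],
    "Symbol": ["GENE1", "GENE2"],
    "Synonyms": ["OLD1A|OLD1B", "-"],
})

# Split the pipe-separated synonyms into lists, then emit one row per alias.
toy["Alias"] = toy["Synonyms"].str.split("|")
toy = toy.explode("Alias")[["tax_id", "GeneID", "Symbol", "Alias"]]
toy = toy.reset_index(drop=True)

# "-" means "no synonym"; store it as an empty string.
toy = toy.replace({"Alias": {"-": ""}})
print(toy)
```

GENE1 ends up on two rows, one per alias, while GENE2 keeps a single row with an empty Alias.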
Printing g gives:
           tax_id    GeneID       Symbol  Alias
0               7   5692769     NEWENTRY
1               9   2827857     NEWENTRY
2              11  10823747     NEWENTRY
3              14   6951813     NEWENTRY
4              19   3758873     NEWENTRY
...           ...       ...          ...    ...
44205723  3032134  60460443          ND6
44205724  3032134  60460444          ND1
44205725  3032134  60460445  I9997_mgr02
44205726  3032134  60460446  I9997_mgt22
44205727  3032134  60460447  I9997_mgr01
[44205728 rows x 4 columns]
Usage example
Example data
df = pd.read_csv("BLCA.csv",index_col="Gene Symbol")
#              Gene Name                                           Species
# Gene Symbol
# ATP2B1       ATPase, Ca transporting, plasma membrane 1          Homo sapiens
# MYL6         myosin, light chain 6, alkali, smooth muscle a...   Homo sapiens
# RPS16        ribosomal protein S16                               Homo sapiens
# HIST1H2BA    histone cluster 1, H2ba                             Homo sapiens
# H2AFY2       H2A histone family, member Y2                       Homo sapiens
# ...          ...                                                 ...
# UBB          ubiquitin B                                         Homo sapiens
# PYGB         phosphorylase, glycogen; brain                      Homo sapiens
# HLA-A        major histocompatibility complex, class I, A        Homo sapiens
# HSPA1A       heat shock 70kDa protein 1A                         Homo sapiens
# HSP90AB1     heat shock protein 90kDa alpha (cytosolic), cl...   Homo sapiens
Conversion
bq.tl.current_symbol(frame=df,reference="SymbolAlias_20230317.feather", tax_id=9606)
#            Gene Name                                           Species        Alias
# H2BC1      histone cluster 1, H2ba                             Homo sapiens   HIST1H2BA
# MACROH2A2  H2A histone family, member Y2                       Homo sapiens   H2AFY2
# H3-3B      H3 histone, family 3B (H3.3B)                       Homo sapiens   H3F3B
# H1-5       histone cluster 1, H1b                              Homo sapiens   HIST1H1B
# DARS1      aspartyl-tRNA synthetase                            Homo sapiens   DARS
# ...        ...                                                 ...            ...
# UBB        ubiquitin B                                         Homo sapiens   NaN
# PYGB       phosphorylase, glycogen; brain                      Homo sapiens   NaN
# HLA-A      major histocompatibility complex, class I, A        Homo sapiens   NaN
# HSPA1A     heat shock 70kDa protein 1A                         Homo sapiens   NaN
# HSP90AB1   heat shock protein 90kDa alpha (cytosolic), cl...   Homo sapiens   NaN
# [378 rows x 3 columns]
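In the output above, the Alias column holds the old Symbol for renamed rows and NaN for rows that were left unchanged. Assuming the converted table has that shape, the renamed genes can be listed as follows. The small frame below is rebuilt by hand from three rows of the output, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above: two renamed, one unchanged.
result = pd.DataFrame(
    {"Alias": ["HIST1H2BA", "H2AFY2", np.nan]},
    index=["H2BC1", "MACROH2A2", "UBB"],
)

# Rows with a non-missing Alias are the ones whose Symbol was updated.
renamed = result.loc[result["Alias"].notna(), "Alias"]
old_to_new = dict(zip(renamed, renamed.index))
print(old_to_new)  # {'HIST1H2BA': 'H2BC1', 'H2AFY2': 'MACROH2A2'}
```

Keeping such an old-to-new record makes it easier to trace a gene back to the name used in the original data.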