13 02 2010 chemoinformatics
Trying out bayon for a chemoinformatics use case.
The dataset is the 1153 compounds that were active in "Primary screen for compounds that inhibit Insulin promoter activity in TRM-6 cells."
Convert the downloaded SDF file to fingerprints with openbabel:
babel -imol pc_sample.sdf -ofpt pc_sample.fpt -xh -xfFP2
Convert this to the TSV format that bayon takes as clustering input:
python f2bayon.py pc_sample.fpt > pc_sample.tsv
Source of f2bayon.py:
def hex2bin(fingerprint):
    """Convert a line of hex digits to a bit string, ignoring other characters."""
    h2b = {"0":"0000","1":"0001","2":"0010","3":"0011",
           "4":"0100","5":"0101","6":"0110","7":"0111",
           "8":"1000","9":"1001","a":"1010","b":"1011",
           "c":"1100","d":"1101","e":"1110","f":"1111",
           }
    bf = ""
    for c in fingerprint:
        b = h2b.get(c)
        if b:
            bf += b
    return bf

def convert(path):
    """Return one "ID bitstring" line per record of the .fpt file."""
    result = ""
    # Records are separated by a ">" at the start of a line.
    for data in open(path, "r").read().split("\n>"):
        lines = data.split("\n")
        fp = ""
        for line in lines[1:]:
            fp += hex2bin(line)
        # The first record still carries its leading ">", so strip it.
        result += lines[0].lstrip(">").split(" ")[0] + " " + fp + "\n"
    return result

if __name__ == "__main__":
    import sys
    c = convert(sys.argv[1])
    for l in c.split("\n")[:-1]:
        mol_id, fp = l.split()
        fps = ""
        # Emit a tab-separated "bit index, value" pair for each set bit.
        for num, bit in enumerate(fp):
            if int(bit) > 0:
                fps += "\t%d\t%s" % (num, bit)
        print(mol_id + fps)
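To make the output format concrete, here is a minimal sketch of the per-record conversion: one hex fingerprint becomes one line consisting of the ID followed by tab-separated index/value pairs for the set bits. The helper name `fp_to_bayon_line`, the sample ID, and the sample fingerprint are made up for illustration. It uses `int(c, 16)` rather than the lookup table above and, unlike the script, also accepts uppercase hex digits.

```python
def fp_to_bayon_line(mol_id, hex_fp):
    """Build one bayon input line from a hex fingerprint (hypothetical helper)."""
    # Keep only hex digits; spaces between hex words are dropped.
    digits = [c for c in hex_fp.lower() if c in "0123456789abcdef"]
    # Expand each hex digit to 4 bits, most significant bit first.
    bits = "".join(format(int(c, 16), "04b") for c in digits)
    # Emit "index<TAB>1" for every set bit.
    pairs = ["%d\t1" % i for i, b in enumerate(bits) if b == "1"]
    return "\t".join([mol_id] + pairs)


if __name__ == "__main__":
    # Made-up ID and fingerprint, for illustration only.
    print(fp_to_bayon_line("mol1", "8001 00c0"))
```

Only the set bits are written, so each line stays short even though the fingerprint itself is long and mostly zeros.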
Then split them into 10 clusters with bayon:
$ time bayon -n 10 pc_sample.tsv > pc_sample.clust
real 0m0.859s
user 0m0.839s
sys 0m0.015s
Under a second for 1153 compounds. Extremely fast.
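A quick way to inspect the result is to count how many compounds landed in each cluster. The sketch below assumes that each line of pc_sample.clust holds a cluster ID followed by the tab-separated IDs of its members; that layout is an assumption about bayon's default output and should be checked against the actual file. The function name `cluster_sizes` and the sample lines are made up for illustration.

```python
def cluster_sizes(lines):
    """Map cluster ID -> member count (assumes 'cluster<TAB>id<TAB>id...' lines)."""
    sizes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # Everything after the first field is assumed to be a member ID.
        sizes[fields[0]] = len(fields) - 1
    return sizes


if __name__ == "__main__":
    # Made-up sample in the assumed layout; for real data, pass
    # open("pc_sample.clust") instead of this list.
    sample = ["1\tmolA\tmolB\tmolC\n", "2\tmolD\n"]
    print(cluster_sizes(sample))
```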