Drkcore

28032010 chemoinformatics Scala MX

chemoinformaics in Scala

朝から、openbabelのjavaインターフェースを使えるようにしようといじっていたのだけど、java.lang.UnsatisfiedLinkErrorを解決することができないまま夕方になってしまった。

んーCDK使うことになるのかなぁと思ったがMXのほうが楽そうなので今回はこっちを使った。

import com.metamolecular.mx.calc.MassCalculator
import com.metamolecular.mx.io.daylight.SMILESReader

object MxTest {
  def main(args: Array[String]) {
    val calc = new MassCalculator
    for (mol <- args) {
      val mw = calc.findAveragedMass(SMILESReader.read(mol))
      println(mol + ": " + mw)
    }
  }
}

実行してみる

scalac mxtest.scala 
scala MxTest "CCC" "CO" "CNC"
CCC: 44.09562
CO: 32.04186
CNC: 45.08368

Liftを使って構造関係を扱えるWikiを作ってみたい。

Scalaスケーラブルプログラミング[コンセプト&コーディング] (Programming in Scala)
Martin Odersky,Lex Spoon、Bill Venners
インプレスジャパン / ￥ 4,830 ()
在庫あり。

18032010 chemoinformatics Python

CouchDBにsdfを突っ込む

ほとんど化合物情報をTokyo Cabinetで管理してみると同じノリで出し入れできそう。

#!/usr/bin/env python
# -*- encoding:utf-8 -*-

import pybel
import couchdb

sdffile = "pc_sample.sdf"
couch = couchdb.Server()
db = couch.create('pubchem')

mols = pybel.readfile("mol",sdffile)

for mol in mols:
    title = mol.OBMol.GetTitle()
    molstring = mol.write("mol")
    db[title] = {"mol":molstring}

化合物情報みたいな、RDBで管理しやすいようなデータよりは、in vivoの薬理試験とか動態試験みたいな、プロトコル間で比較があまりなくて、かつ所見とかの文章データが重要なもののほうが向いているのかもしれんなぁと思った。PKデータなんてIDで探して、時点と個体のデータがJSONで引っ張ってこれるようにしとけばナイスすぎる！

javascriptからRの関数呼べんかな？そうすればJSONでデータ受け取ってPKfitとか使えんのにな。それかbrewとか

書籍もいくつか出てるけど、日本語ないなぁ。そのうち出るんかな？

CouchDB: The Definitive Guide
J. Chris Anderson
Oreilly & Associates Inc / 3390円 ( 2010-01-15 )

Beginning CouchDB
Joe Lennon
Apress / 3783円 ( 2009-12-30 )

Couchdb in Action
Christopher Chandler
Manning Pubns Co / 4180円 ( 2010-12-28 )

10032010 chemoinformatics Python network matplotlib

LeadOptimizationをシミュレーションしたい

ふと、つぶやいた。

例えば

プロジェクトで500検体合成
ひとつの親あたり5-25化合物くらいの子を合成する
プロジェクトの人員の制限により、一度に10ラインまでしか同時に走らない
- 11番目以降はそれ以降のオプティマイゼーションは行われない

その時知りたいことは、

{計画、合成、アッセイ、解析}というサイクルは何回まわるのか、
有望そうな化合物だけど、人的リソースの関係で選択されなかった歴史はどこにあるか？
バックアップ候補の化合物はどこら辺で現れるか（初期？中期？）

あたり。

現実の系に近づけるために、親化合物にはポテンシャルがあって、近傍探索すると、ポテンシャルの近くで活性が変化するモデル。あと同じライン（ブランチ）を継続して合成すると、ポテンシャル曲線が底に近づくのと、合成もSARに関する知識がついてくるので、活性は上がりやすくなって、かつ変動幅も小さくなるようにweightはだんだん小さくなるようにしてみた。

#!/usr/bin/env python
# -*- encoding:utf-8 -*-

# kzfm <kerolinq@gmail.com>

from random import random
import networkx as nx
import matplotlib.pyplot as plt


class Branch(object):
    """LeadOptimization flow"""

    count = 0

    @classmethod
    def inc_count(cls):
        cls.count = cls.count + 1
        return cls.count

    @classmethod
    def get_count(cls): return cls.count

    def __cmp__(self, other):
        return cmp(self.act, other.act)

    def __init__(self,potency,weight):
        self.num = Branch.inc_count()
        self.potency = potency
        self.weight  = weight
        self.act = self.potency + self.weight * random()

    def make_child(self,num_childs,potency,weight):
        return [Branch(self.potency + self.weight * random(),self.weight * 0.7) for i in range(num_childs)]

if __name__ == "__main__":
    max_comps        = 500
    initial_branches = 3
    nodesize         = []
    nrange           = []
    erange           = []
    generation       = 1
    heads = [Branch(0.3,1) for i in range(initial_branches)]
    G=nx.Graph()

    for h in heads: 
        nodesize.append(h.act * 30)
        nrange.append(generation)

    while Branch.get_count() < max_comps:
        new_heads = []
        generation += 1
        for branch in heads[:10]:
            for new_comp in branch.make_child(int(5+20*random()),branch.potency,branch.weight):
                nodesize.append(new_comp.act * 30)
                nrange.append(generation)
                erange.append(generation)
                G.add_edge(branch.num,new_comp.num)
                if new_comp.act > 0.75:
                    new_heads.append(new_comp)
        heads = new_heads
        heads.sort()

    nx.draw_spring(G, node_size=nodesize, node_color=nrange, edge_color=erange,alpha=0.7, \
cmap=plt.cm.hot, edge_cmap=plt.cm.hot, with_labels=False)
    plt.savefig("proj.png")

project

んーなんかいまいち。一応活性の強弱にあわせてノードのサイズが変わるようにしてみたんだけど、大きさがわからん。

05032010 chemoinformatics DMPK

PK-PDの論文

Pharmacokinetic-Pharmacodynamic Modeling of Biomarker Response and Tumor Growth Inhibition to an Orally Available cMet Kinase Inhibitor in Human Tumor Xenograft Mouse Models という論文が良かった。というか、最近ボトムのあたりを這っていた僕のモチベーションがかなり上がった。

ケモインフォマティストとして、このレベルのPK-PD解析を出来るようにしたい。

PK-PD

この本の23,24章に最適化フェーズにおけるPK-PDモデリングとかPBPKモデリングの意義とかがあって、その内容が良い。

Preclinical Development Handbook: ADME and Biopharmaceutical Properties (Pharmaceutical Development Series)

Wiley-Interscience / ￥ 18,701 ()
通常2～4週間以内に発送

そして、ケモインフォマティストなのでSASとかNONMEMではなくてRを使う。

「非線形混合モデルを理解するために」というpdfも見つけた。

04032010 chemoinformatics

化合物情報をTokyo Cabinetで管理してみる

単にIDをキーにしてsdf呼び出すだけだったらKVSで良くね？と思ったところ、そういえばTokyoCabinetのPythonバインディングがあることを思い出したので、ちょっと試してみた。

データはこの時に落としておいたsdf

import pybel
import pytc

sdffile = "pc_sample.sdf"
db = pytc.HDB('compounds.db', pytc.HDBOWRITER | pytc.HDBOCREAT)
mols = pybel.readfile("mol",sdffile)

for mol in mols:
    title = mol.OBMol.GetTitle() 
    db[title] = mol.write("mol")

100万化合物くらいなら全部メモリに載せられそうな感じ。そしたら速いよねー、CouchDBも気になってきた。

Software Design ( ソフトウェアデザイン ) 2010年 02月号 [雑誌]

技術評論社 / ￥ 970 ()

03032010 chemoinformatics DMPK

DMPKの本

最近の僕の興味は探索フェーズで最も複雑な事象であろうPK/PDとかTK/TDのようなものに移ってしまっている。SBDDを量子化学計算の枠組みで解釈して行くのも楽しいんだけど、どうしても基礎研究寄り過ぎてしまうのと、成果主義的な主張をしづらい(理解されにくい)ので、時間に余裕のある時の知的欲求を満たすための余興的なあつかいになってしまっている。

上司に紹介してもらった本が素晴らしい。

Preclinical Development Handbook: ADME and Biopharmaceutical Properties (Pharmaceutical Development Series)

Wiley-Interscience / ￥ 18,701 ()
通常2～4週間以内に発送

概論多めだけど、1200p超え。

この二冊も良かった。

Pharmacokinetic-Pharmacodynamic Modeling and Simulation
Peter Bonate
Springer / ￥ 8,626 ()
通常1～3週間以内に発送

Pharmacometrics: The Science of Quantitative Pharmacology

Wiley-Interscience / ￥ 18,105 ()
在庫あり。

臨床系のための解析的な本は多いんだけど、それを探索段階のDMPK解析、コンピュテーショナルケミストまたはインフォマティストが使うための本があればいいなぁと。

実際、探索段階だと実験計画とかいい加減だったりとか、データポイントがやたら少なかったりして、ハードル高いうえに、そういう知見はノウハウ的なものなので外に出にくくなっているんだろうなぁと。そこが強いというのは、最終的な強さにつながるのかなぁと、最近強く思う。

ぶっちゃけSBDDなんておまけだよな。QSAR,QSPkR,QSTkRとか分析能力を如何に強くするかが重要なんじゃないかなぁと。

how to makeよりもwhat's to make

そういう気概で来年度は頑張る。

13022010 chemoinformatics

bayonで化合物クラスタリング

chemoinformaticsな用途でbayonを使ってみる。

データセットはPrimary screen for compounds that inhibit Insulin promoter activity in TRM-6 cells.でアクティブだった1153検体

ダウンロードしてきたsdfをopenbabelでフィンガープリントに変換

babel -imol pc_sample.sdf -ofpt pc_sample.fpt -xh -xfFP2

これをbayonでクラスタリングにするためのTSVに変換

python f2bayon.py pc_sample.fpt > pc_sample.tsv

f2bayon.pyのソース

def hex2bin(fingerprint):
   bf = ""
   h2b = {"0":"0000","1":"0001","2":"0010","3":"0011",
          "4":"0100","5":"0101","6":"0110","7":"0111",
          "8":"1000","9":"1001","a":"1010","b":"1011",
          "c":"1100","d":"1101","e":"1110","f":"1111",
          }

   for l in fingerprint:
       for c in l:
           b = h2b.get(c)
           if b: bf += b
   return bf

def convert(file):
   result = ""
   for data in open(file,"r").read().split("\n>"):
       fp = ""
       for list in data.split("\n")[1:]:
           fp += hex2bin(list)
       result += data.split("\n")[0].split(" ")[0] + " " + fp + "\n"
   return result

if __name__ == "__main__":
   import sys
   file = sys.argv[1]
   c = convert(file)
   for l in c.split("\n")[:-1]:
      id,fp = l.split()
      fps = ""
      for num,bit in enumerate(fp):
         if int(bit) > 0:
            fps += "\t%d\t%s" % (num,bit)
      print id + fps

でbayonで10クラスターに分割

$ time bayon -n 10 pc_sample.tsv > pc_sample.clust

real    0m0.859s
user    0m0.839s
sys 0m0.015s

超速い。

12012010 chemoinformatics DMPK

Evaluation of Drug Candidates for Preclinical Development

新しい本が出るらしい。

Evaluation of Drug Candidates for Preclinical Development: Pharmacokinetics, Metabolism, Pharmaceutics, and Toxicology (Wiley Series in Drug Discovery and Development)

Wiley / ￥ 9,585 ()
近日発売　予約可

Cover drug transporters, cytochrome P-450 and drug-drug interactions, plasma protein binding, stability, drug formulation, preclinical safety assessment, toxicology, and toxicokinetics
Address developability issues that challenge pharma companies, moving beyond isolated experimental results
Reveal connections between the key scientific areas that are critical for successful drug discovery and development
Inspire forward-thinking strategies and decision-making processes in preclinical evaluation to maximize the potential of drug candidates to progress through development efficiently and meet the increasing demands of the marketplace

Evaluation of Drug Candidates for Preclinical Development serves as an introductory reference for those new to the pharmaceutical industry and drug discovery in particular. It is especially well suited for scientists and management teams in small- to mid-sized pharmaceutical companies, as well as academic researchers and graduate students concerned with the practical aspects related to the evaluation of drug developability.

12012010 chemoinformatics bioinformatics Python

python2.5でBioPythonを使う

python2.5に入れる場合はNumericがsourceforgeになかったり、MxTextToolsの古いバージョンを入れたりとか、いろいろとあれなので、ドキュメント読みましょうってことです。

で、RCSBからgz圧縮されたpdbファイルをとりにいってBioPythonで扱うサンプル

import urllib2, StringIO, gzip
from Bio.PDB.PDBParser import PDBParser

def fetch_pdb(id):
    url = 'http://www.rcsb.org/pdb/files/%s.pdb.gz' % id
    content = urllib2.urlopen(url).read()
    sf = StringIO.StringIO(content)
    return gzip.GzipFile(fileobj=sf)

if __name__ == "__main__":
    p=PDBParser(PERMISSIVE=1)
    s=p.get_structure("1bgw", fetch_pdb("1bgw"))

    for model in s.get_list():
        for chain in model.get_list():
            for residue in chain.get_list():
                if residue.has_id("CA"):
                    ca_atom=residue["CA"]
                    print ca_atom.get_coord()

ところで、製薬系のドラッグデザインっていうかchemoinformaticsなSoftwareはpythonで拡張できるものが多いんだけれども、その割にはpythonでガリガリ書くよっていうヒトはあんまし見かけたことないんですよね。

12012010 chemoinformatics Python

化合物の最適化のグラフ表現

化合物の最適化の過程というもをグラフ表現にしてとっておけば、あとあと色々と使いまわせていいんじゃないかと思っている。

で、エッヂは化合物でいいとして、ノードは有向グラフで表したいが、部分構造を内包するかどうかで親子関係を決めるのがいいのか、それとも、時系列で古いほうを親にするのがいいのかそれとももっといいやり方があるのか悩ましい。

可視化は、Cytoscapeが丁度よさそうだったんだけど、Python から Graphviz を使うを見て、NetworkXというものがあるのを知った。

matplotlibで綺麗に出力できるし、これはなかなか楽しそうだ。

もう5年目(wishlistありマス♡)
最近はPythonとDeepLearning
日本酒自粛中
ドラムンベースからミニマルまで
ポケモンGOゆるめ