class: center, middle, inverse, title-slide

# Extracting topics 📚 from
## 🎗cancer patients’ mutational profiles
### Zhi Yang
### 2019-04-30

---

## About this talk

* ### For those who've never heard of it

  ❓ what makes a topic model (e.g. Latent Dirichlet Allocation)

  📚 its general application

* ### For those who know about it

  🎗️ its use in cancer research

  📊 data, modeling, software

  🏥 medical implications

---

## We usually extract topics from:

### books, customer reviews, tweets, ...

### scientific journals, medical records, ...

### .red[what if they can't be "read"?]

--

<center>
<img src="imgs/tcga.png" width=70%>
</center>

---

## Why study somatic mutations?

![](imgs/mutations.png)

---

## What do we need computers for?

.center[
### Volume
### Velocity
### Variety
### Variability
]

.footnote[[Understanding big data themes from scientific biomedical literature through topic modeling](https://link.springer.com/article/10.1186/s40537-016-0057-0)]

---

# Data sources

### 10,952 exomes

### 1,048 whole genomes

### 40 distinct types of human cancer

#### The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), data from peer-reviewed journals

.footnote[[Signatures of Mutational Processes in Human Cancer](https://cancer.sanger.ac.uk/cosmic/signatures)]

---

# What is in a topic model?

.pull-left[
### documents
### topics
### words
]

.pull-right[
![](imgs/str.png)]

---

# What is in a topic model?
.pull-left[
### documents
### .grey[topics] .red[--> latent]
### words
]

.pull-right[
![](imgs/str.png)]

---

# In a single document

.center[<img src="imgs/str.png" width=50% hspace="30"> <img src="imgs/topics.png" width=40%>]

.pull-bottom[
* ### every document is unique
* ### each topic has its own fraction in `[0, 1]`
]

---

# In a topic

.center[<img src="imgs/str.png" width=50% hspace="30"> <img src="imgs/words.png" width=40%>]

* ### Topics are consistent across documents
* ### choosing the number of topics can be a headache 😟

---

# The hierarchical structure

.center[<img src="imgs/topics.png" width=40%> <img src="imgs/words.png" width=40%>]

* ### a single .red[document] contains .red[topics] with .red[weights]
* ### a .red[topic] is a unique distribution of .red[words]

---

# LDA Topic Modeling on emojis

<img src="https://github.com/lettier/lda-topic-modeling/raw/master/static/png/screenshot.png">

.footnote[https://lettier.com/projects/lda-topic-modeling/]

---

# Data collection

<br>

.center[<img src="imgs/dataprocess.png" width=60%>

### .red[Tissue ➡ Sequencing ➡ input]]

--

.pull-left[

|sample|chr|position|ref|alt|
|---|:---:|:---:|:--:|:--:|
|sample1|chr1|100|A|C|
|sample1|chr2|100|G|T|
|sample2|chr1|300|T|C|
|sample3|chr3|400|T|C|

]

--

.pull-right[
### .center[.red[? ?] A>C .red[? ?] ]
### .center[G T A>C T C]
]

---

## A glimpse of mutational profiles

<br>

![](imgs/spectrum.png)

.footnote[[Yang et al. HiLDA: a statistical approach to investigate differences in mutational signatures, 2019](https://www.biorxiv.org/content/10.1101/577452v1)]

---

# Where are those "words"?
<img src="imgs/1536.png">

---

## Topics made of words

<img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2018/05/22/sagemaker-ntm-6.gif">

.footnote[[Introduction to the Amazon SageMaker Neural Topic Model](https://aws.amazon.com/blogs/machine-learning/introduction-to-the-amazon-sagemaker-neural-topic-model/)]

---

## Topics made of mutations

.center[<img src="imgs/sigs.png" width=60%>]

.footnote[[Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005657)]

---

## Topics made of mutations

.center[<img src="imgs/Signature_patterns.png" width=65%>]

.footnote[[Signatures of Mutational Processes in Human Cancer](https://cancer.sanger.ac.uk/cosmic/signatures)]

---

.center[<img src="imgs/membership.PNG" width="80%">]

.footnote[[Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005657)]

---

## Association with known risk factors

.center[.pull-top[
<img src="imgs/smoking.png" width=30%> <img src="imgs/sunlight.png" width=30%> <img src="imgs/pole.png" width=30%>
]]

.center[
.pull-bottom[
![](imgs/smoking2.png) ![](imgs/sunlight2.png) ![](imgs/pole2.png)
]
]

.footnote[[Shiraishi et al. A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005657)]

---

## Software implementation

* R package and Shiny app: **pmsignature**

<img src="imgs/panel.png" width=25%> <img src="imgs/analysis.png" width=70%>

.footnote[[Shiraishi et al.
A simple model-based approach to inferring and visualizing cancer mutation signatures, PLoS Genetics, 2015](https://friend1ws.shinyapps.io/pmsignature_shiny/)]

---

## Hierarchical topic model

.pull-left[
### documents
### topics .red[--> weights]
### words
]

.pull-right[
![](imgs/single.png)
]

---

## After applying the topic model,

![](imgs/compare.png)

.center[Are they different?]

---

## Statistical inference

.center[<img src="imgs/twogroups.png" width=90%>]

---

## Group differences

Fractions `\(p_k\)` and concentration parameters `\(\alpha_k\)`:

`$$p_1, \cdots, p_K \sim \text{Dirichlet}(\alpha_1,\cdots,\alpha_K)$$`

`$$\mu_k = \frac{\alpha_k}{\sum_{j=1}^{K}\alpha_j}$$`

To capture the group difference for signature `\(i\)`,

`$$\Delta_i = \mu^{(1)}_i - \mu^{(2)}_i$$`

What we'd like to test:

`$$H_0: \Delta_i = 0, \quad i = 1, 2, \ldots, K$$`

<br>

.red[Why don't we just compare the estimated values?]

`$$\hat{\Delta}_i = \hat{\mu}^{(1)}_i - \hat{\mu}^{(2)}_i$$`

---

## Software implementation

.left-column[
<br>
<img src="https://greta-stats.org/logo.png" width=50%>
<br>
<img src="imgs/edward.png" width=60%>
<br>
<img src="https://mc-stan.org/images/stan_logo.png" width=55%>
]

.right-column[
## R packages: HiLDA, greta
<br>
## Python library: edward
<br>
## Bayesian software: Stan
]

---

## Potential applications

<br>

![](imgs/kk.png)

---

## Personalized treatment

![](imgs/subgroups.jpg)

.footnote[[Proposed subclassification of EAC based on mutational signatures](https://www.nature.com/articles/ng.3659)]

---

class: center, middle

# Thanks, and keep in touch!

<br>

###
[@zhiiiyang](https://twitter.com/zhiiiyang) ###
[zhiiiyang](https://github.com/zhiiiyang) ###
[zyang895@gmail.com](mailto:zyang895@gmail.com)

<br>

Slides created via the R packages [**xaringan**](https://github.com/yihui/xaringan) and [**xaringanthemer**](https://github.com/gadenbuie/xaringanthemer)
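
---

## Appendix: group differences, sketched

The quantities on the "Group differences" slide can be illustrated with a minimal Python sketch. It is not the HiLDA implementation: the function names and the concentration values are hypothetical, chosen only to show how `\(\mu_k\)` and `\(\Delta_i\)` follow from `\(\alpha_k\)`, and how per-sample fractions vary around the mean.

```python
import random


def dirichlet_mean(alpha):
    """Mean fractions mu_k = alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


def group_difference(alpha_1, alpha_2):
    """Delta_i = mu_i^(1) - mu_i^(2) for every signature i."""
    mu_1 = dirichlet_mean(alpha_1)
    mu_2 = dirichlet_mean(alpha_2)
    return [m1 - m2 for m1, m2 in zip(mu_1, mu_2)]


def sample_dirichlet(alpha, rng):
    """One Dirichlet draw, built from normalized Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical concentration parameters for two groups, K = 3 signatures.
alpha_group1 = [2.0, 1.0, 1.0]
alpha_group2 = [1.0, 1.0, 2.0]

delta = group_difference(alpha_group1, alpha_group2)
print(delta)  # [0.25, 0.0, -0.25]

# Fractions for one simulated sample in group 1; they vary from draw to draw.
rng = random.Random(2019)
fractions = sample_dirichlet(alpha_group1, rng)
print(fractions)
```

Because individual samples scatter around `\(\mu_k\)`, a difference in point estimates alone does not say whether `\(H_0\)` is plausible. The uncertainty has to be modeled as well.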