
Knowledge in Data Science Symposium (KIDS23)

Abstracts

 
 

Poster Abstracts

 

Bootstrap Evaluation of Association Matrices (BEAM) for Integrating Multiple Omics Profiles with Multiple Outcomes

Author: Anna Eames Seffernick, PhD

Co-Authors: Xueyuan Cao, PhD; Charles Mullighan, MBBS (Hons), MSc, MD; Stanley Pounds, PhD

Abstract: Genome-wide association study (GWAS) is a contemporary research method for detecting associations between individual genotypes, or groups of genotypes, and phenotypes, potentially across the entire genome. This short course on GWAS consists of two parts. In the first part, we will discuss essential statistical principles underlying GWAS, including determination of proper statistical models, statistical inference and testing of associations, genome-wide significance and inference methods for massive multiple hypothesis tests, considerations in variable/model selection, and, if time permits, the issue of population structure or ancestry. Most of these statistical principles are also applicable to other omics-wide analyses beyond GWAS. In the second part, we introduce the most popular and effective tool for performing a GWAS, the PLINK software package. Introductions to installation, setup, data file types and formats, features and functions, R coding, visualization (e.g., the Manhattan plot), and reporting will be provided. Other relevant software tools will also be briefly introduced. Both parts of the lecture will be supported by concrete examples and publications.


Evaluation of a Pooling Chemoproteomics Strategy with an FDA-approved Drug Library

Author: Huan Sun, PhD

Abstract: Chemoproteomics is a key platform for characterizing the mode of action (MoA) of compounds, especially targeted protein degraders such as proteolysis-targeting chimeras (PROTACs) and molecular glues. With deep proteome coverage, multiplexed tandem mass tag-mass spectrometry (TMT-MS) can tackle up to 18 samples in a single experiment. Here, we present a pooling strategy to further enhance throughput and apply the strategy to an FDA-approved drug library (95 best-in-class compounds). The TMT-MS-based pooling strategy was evaluated in the following steps. First, we demonstrated the capability of TMT-MS by analyzing over 15,000 unique proteins (>12,000 gene products) in HEK293 cells treated with five PROTACs (two BRD/BET degraders and three degraders for FAK, ALK, and BTK kinases). We then introduced a rationalized pooling strategy to separate structurally similar compounds into different pools, and identified the proteomic response to 14 pools from the drug library. Finally, we validated the proteomic response from one pool by re-profiling the cells under individual drug treatment with sufficient replicates. Interestingly, numerous proteins were found to change upon drug treatment, including AMD1, ODC1, PRKX, PRKY, EXO1, AEN and LRRC58 by 7-Hydroxystaurosporine; C6orf64, HMGCR and RRM2 by Sorafenib; SYS1 and ALAS1 by Venetoclax; and ATF3, CLK1 and CLK4 by Palbociclib. Thus, pooling chemoproteomics screening provides an efficient method for dissecting the molecular targets of compound libraries.
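
The abstract does not detail the pooling algorithm itself; as a purely illustrative sketch, "rationalized" pooling can be framed as a greedy assignment over a precomputed compound-compound structural similarity matrix (e.g., Tanimoto scores), so that near-analogs never share a pool. The function name, threshold, and similarity values below are assumptions, not the authors' implementation.

```python
import numpy as np

def assign_pools(similarity, n_pools, threshold=0.4):
    """Greedily assign compounds to pools so that no pool contains two
    compounds whose pairwise structural similarity exceeds `threshold`."""
    n = similarity.shape[0]
    pools = [[] for _ in range(n_pools)]
    assignment = [-1] * n
    for i in range(n):
        for p in range(n_pools):
            # A pool is acceptable if compound i is dissimilar to all members.
            if all(similarity[i, j] <= threshold for j in pools[p]):
                pools[p].append(i)
                assignment[i] = p
                break
        else:
            raise ValueError("No conflict-free pool found; increase n_pools.")
    return assignment

# Toy example: 6 compounds, where compounds 0 and 1 are near-analogs.
sim = np.eye(6)
sim[0, 1] = sim[1, 0] = 0.9
print(assign_pools(sim, n_pools=2))  # 0 and 1 land in different pools
```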


Computational Reconstruction and Segmentation of Whole Mouse Heart Vasculature

Author: Mia Panlilio, PhD

Co-Authors: Shahinur Alam; Joelle Magne; Nicolas Denans; Aaron Pitre; Aaron Taylor; Douglas Green, PhD; Khaled Khairy, PhD

Abstract: Light sheet fluorescence microscopy (LSFM) enables whole-organ and whole-organism imaging at cellular and even subcellular resolution, and represents a major breakthrough in overcoming the traditional trade-off between image resolution and image volume size. It is particularly useful for studies of vascularization, where the structures of interest must be highly resolved while spanning tissue at the tens-of-mm scale. However, the technique routinely produces extremely large datasets, terabytes to tens of terabytes in size, which complicates downstream processing. For example, LSFM covers very large volumes by acquiring images of adjacent sub-volumes that must be assembled seamlessly in a post-acquisition step. Stitching of terabyte-scale volumes is computationally demanding and requires specialized hardware and often manual intervention. The latter is especially tedious when the data do not fit into the memory of a typical workstation. Even after stitching, the dataset's size still demands reformatting for efficient visualization and special data traversal strategies for image analysis.

The above challenges motivated (a) a computational engineering effort to develop and streamline a highly parallelized image post-acquisition processing pipeline that can be executed on high-performance computing (HPC) hardware with minimal human intervention, and (b) an image analysis effort for fully automated vasculature quantification. For (a), tiled images are stitched using St. Jude's HPC cluster with a modified version of an established volume registration package (BigStitcher). Its output is converted into a hierarchical, chunked file format for efficient multi-threaded read access. Step (b) proceeds block-wise along the stitched volume, with filters for background removal, vessel enhancement, and level-set segmentation, resulting in a full digital separation of the vasculature from the remainder of the tissue. This is followed by downstream quantitative vasculature analyses. The pipeline is highly parallel and user-friendly, with even intermediate results readily visualized using common scientific software. For a 0.5 TB volume, it reduces end-to-end processing time from multiple days to a few hours.
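
As a rough illustration of the block-wise design in step (b), the sketch below iterates over the chunks of a hierarchical, chunked volume (zarr is used as a stand-in for whatever chunked format the pipeline emits) and applies placeholder filters; the file names, array sizes, and filters are assumptions, not the pipeline's actual components.

```python
import numpy as np
import zarr
from scipy.ndimage import gaussian_filter

# Stand-ins for the stitched volume and the segmentation output.
vol = zarr.open("stitched.zarr", mode="a", shape=(1024, 1024, 1024),
                chunks=(256, 256, 256), dtype="uint16")
out = zarr.open("vessels.zarr", mode="a", shape=vol.shape,
                chunks=vol.chunks, dtype="uint8")

cz, cy, cx = vol.chunks
for z in range(0, vol.shape[0], cz):
    for y in range(0, vol.shape[1], cy):
        for x in range(0, vol.shape[2], cx):
            block = vol[z:z+cz, y:y+cy, x:x+cx].astype(np.float32)
            # Placeholders for background removal, vessel enhancement,
            # and level-set segmentation; each block fits in memory.
            background = gaussian_filter(block, sigma=8)
            foreground = np.clip(block - background, 0, None)
            out[z:z+cz, y:y+cy, x:x+cx] = (foreground > foreground.mean()).astype(np.uint8)
```

In practice the loop nests would be dispatched as independent HPC jobs, which is what makes the block-wise layout parallel-friendly.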


Image Data Management System: Accelerating Scientific Discovery & Collaboration

Author: Michael Root, MSBME

Co-Authors: Khaled Khairy, PhD

Abstract: Research, innovation, and discovery at St. Jude rely substantially on cutting-edge imaging technology that spans data-intense modalities from light and electron microscopy to MRI, CT, and others. This wide range of modalities is accompanied by a multitude of image data file formats, on-disk data organization conventions, and image data engines (backend tools) optimized for the image data access patterns typical of the corresponding field. Such optimized engines and formats are essential for seamless downstream processing and visualization but generate data silos that pose a significant challenge when a more general, unified data access path is needed, for example, to facilitate collaboration, efficient visualization, computer code consolidation, data discovery, reproducibility, and cross-modality data integration.

In this work, we develop the Image Data Management System (IDMS), which seeks to leverage existing image data engines and data file formats, largely developed and maintained by the biomedical image analysis community, to unify and simplify access and visibility of biomedical image datasets at St. Jude. This effort consists of development of backend tools to provide a consolidated REST API (Application Programming Interface) for access to image data and metadata. REST APIs are simple interfaces that make it convenient for researchers to consume image data in a variety of computational environments, computational tools, and programming languages.
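
As an illustration only, consuming such a REST API from Python might look like the following; the base URL, endpoint paths, query parameters, and response fields are hypothetical, since the abstract does not specify the IDMS routes.

```python
import requests

# Hypothetical IDMS gateway address and routes.
BASE = "https://idms.example.stjude.org/api/v1"

# Discover image collections for a project, then fetch metadata for one.
collections = requests.get(f"{BASE}/collections",
                           params={"project": "vasculature"}, timeout=30).json()
meta = requests.get(f"{BASE}/collections/{collections[0]['id']}/metadata",
                    timeout=30).json()
print(meta["name"], meta["modality"], meta["shape"])
```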

Therefore, the IDMS will provide a generic and consistent interface, where possible, across existing backend tools for ease of image data access. For users who wish to use the full power of any backend tool, a Passthru API is provided.

In either case, the IDMS, specifically the IDMS Gateway, provides an interface that hides from the user where these backend tools are located and how they are accessed. In combination with an appropriate user interface (in development), the user can conveniently browse their data. Image collections are identified by a unique combination of name, project name, username, and other related metadata tracked by the IDMS database. It is important to note that the IDMS is designed to provide a thin layer around access to the various image collections, not to duplicate image data or the functionality provided by existing tools – simplify, not duplicate.

Another aspect is the opportunity to add “plugins” that are managed by, and callable from, the IDMS API. These plugins will provide programmatic functionality developed by image data scientists and power users.


Validation of FFPE-derived DNA in target-enriched Enzymatic Methyl-seq profiling for brain tumor classification

Author: Nhu Quynh Tran, PhD

Co-Authors: Sujuan Jia, PhD; Md Zahangir Alom, PhD; Ruth Tatevossian, PhD; Brent Orr, PhD

Abstract: Purpose: DNA methylation profiling using Infinium Methylation BeadChips has led to significant advances in tumor diagnosis and molecular classification of clinical samples. These microarrays show good performance on both fresh-frozen (FF) and formalin-fixed, paraffin-embedded (FFPE) tissues, but FFPE dominates clinical profiling because it is uniformly produced as part of standard clinical workflows. A significant barrier to adopting DNA methylation arrays in the clinical environment is the requirement for specialized scanners, resulting in high additional cost and an increased laboratory footprint. DNA sequencing-based alternatives are attractive because most clinical molecular pathology laboratories already use sequencers for other molecular assays. To address these clinical needs, we explored the newly developed enzymatic methyl-seq (EM-seq) using the Twist Human Methylome panel. We developed a bioinformatic pipeline to analyze the data and further validated these methods for classification and copy number profiling of brain tumors from FFPE material.

Methods: In this pilot study, we used DNA from FF and FFPE tissues collected from 4 known diagnostic brain cancer patients and 1 control with technical replicates (n=14). DNA libraries were constructed using EM-seq. Methylation detection was targeted to 3.98M CpG sites through 123 Mb of genomic content using the Twist Human Methylome Panel. The EM-seq analysis pipeline was developed based on several open-source tools and packages. To validate the clinical utility of these methylation profiles, the data were used to predict the tumor molecular class using a central nervous system brain-tumor neural net classifier. Copy number data were also evaluated and compared to array-based copy number abnormalities.

Results: The coverage depth ranged from 66x to 346x. Mean per-base coverage for each tumor sample was at least 45x. CpG calls between FFPE replicates were highly correlated (pairwise correlation coefficients > 0.98). Methylation profiles from 2 different batches clustered together, indicating consistency and reproducibility. Tumor and control samples were classified into their expected molecular classes with high classification scores (> 0.82). Replicate samples from a tumor type not represented in the classifier’s training set yielded subthreshold classification scores, supporting high classification specificity. Copy number abnormalities were consistent between the sequencing and array-based pipelines.

Conclusions: We successfully developed and validated a protocol to generate and analyze EM-seq data from FFPE samples using the Twist Human Methylome Panel. The EM-seq bioinformatic pipeline showed good concordance with existing methods for tumor classification and copy number profiling. This approach has the potential to complement existing methods, lowering the barrier to implementing DNA methylation profiling in more clinical laboratories.


Multi-modality learning with copy number and methylation data from DNA methylation arrays improves classification and risk stratification

Author: Md Zahangir Alom, PhD

Co-Authors: Quynh Tran, PhD; Brent Orr, PhD

Abstract: Background: DNA methylation arrays are an important tool for clinically classifying brain tumors. Methylation-based classification models are trained on differentially methylated probes from a reference series of tumors and then applied to diagnostic test samples. While copy number profiles are routinely captured from DNA methylation array data, copy number data are reported independently of classification models.

Certain molecular tumor classes exhibit variability in clinical risk, but current models do not output a within-class risk assessment analogous to histopathologic grading. Multimodality models have not been established despite the known relationship of specific copy number changes to both tumor class and outcome.

Methods: We established a workflow to extract copy number features from the output of the array-based ‘conumee’ copy number algorithm to evaluate the utility of combined multimodality methylation and copy number data for computational modeling. Using a cohort of diffuse gliomas from The Cancer Genome Atlas, we trained deep neural net models on two tasks: tumor classification and survival prediction. We used either methylation data alone, copy number data alone, or combined copy number and methylation data for the models. The accuracy, precision, recall, and F1 score of the classification models were compared in cross-validation and holdout test sets. The c-index (CI) and prognostic index (PI) were used to compare survival models within and across the tumor classes. A schematic of the combined-feature setup is sketched below.
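
As a schematic of the early-fusion setup described above (not the authors' deep neural net), the sketch below compares cross-validated accuracy for methylation-only, copy-number-only, and concatenated feature matrices; the data are random stand-ins, so only the workflow, not the scores, is meaningful.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
methylation = rng.random((n, 500))  # stand-in for selected methylation probes
copy_number = rng.random((n, 40))   # stand-in for extracted copy number features
y = rng.integers(0, 3, n)           # stand-in for tumor subclass labels

for name, X in [("methylation", methylation),
                ("copy number", copy_number),
                ("multimodality", np.hstack([methylation, copy_number]))]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```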

Results: Our findings indicate that training on copy number data alone is less effective than using methylation or multimodality data for tumor subclass classification (84.7% top accuracy, compared to 95.1% and 95.83%, respectively). Additionally, our multimodality modeling approach improved survival prediction compared to using DNA methylation or copy number data alone (CI = 86.2, compared to 83.5 and 82.2, respectively). Furthermore, our risk stratification methods accurately demonstrated associated risks within and across the tumor classes.

Conclusions: Our study demonstrates a systematic approach to integrating copy number data from methylation arrays into clinically relevant computational models. Multimodality models improve model performance, and our survival modeling represents a potential method for molecular grading within tumor classes.


Integrative analysis of differential expression and differential network for ‘omics data analysis

Author: Yonghui Ni, PhD

Co-Authors: Prabhakar Chalise, PhD; Jianghua He, PhD

Abstract: Differential expression (DE) analysis has been commonly used to identify molecular features that are statistically significantly different between distinct biological groups. Differential network (DN) analysis explores molecular network structure changes from one biological condition to the other. Both approaches are important and complementary to each other. However, they have often been considered separately in ‘omics data analysis. We developed an integrative method, DNrank, to perform joint DE and DN analysis by incorporating a modified version of Google's PageRank algorithm. The purpose of DNrank is to identify disease-associated features through feature ranking techniques. The resampling-based cross-validation scheme within the DNrank algorithm optimizes the feature ranking by leveraging both the DE level and the DN structure of the features. We illustrate DNrank using both simulated data and three real-life datasets. DNrank shows better performance in identifying important molecular features with respect to predictive discrimination. Also, compared to existing feature selection methods, the top-ranked features from DNrank have higher stability in selection. DNrank allows researchers to identify disease-associated key features accounting for both DE and DN.
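
DNrank's exact formulation is not reproduced here, but the core idea of PageRank-style ranking with a DE-derived personalization vector can be sketched as follows; the toy network, edge weights, and DE scores are illustrative stand-ins, not the published algorithm.

```python
import networkx as nx

# Toy differential network: nodes are molecular features; edge weights reflect
# the magnitude of network change between the two biological conditions.
G = nx.Graph()
G.add_weighted_edges_from([("g1", "g2", 0.9), ("g2", "g3", 0.4),
                           ("g3", "g4", 0.7), ("g1", "g4", 0.2)])

# Personalization derived from DE evidence (e.g., scaled -log10 p-values).
de_scores = {"g1": 3.2, "g2": 0.5, "g3": 2.1, "g4": 0.8}
total = sum(de_scores.values())
personalization = {g: s / total for g, s in de_scores.items()}

# A feature ranks highly if it is both differentially expressed and well
# connected in the differential network.
ranks = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```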


Ontology-guided navigation of somatic variants, mutational signatures, gene expression and histology images for pediatric cancer

Author: Alex Gout, PhD

Co-Authors: Stephanie Sandor, MS; Delaram Rahbarinia, MS; Jobin Sunny, MS; Lucian Vacaroiu; James Madson; Kevin Benton, MS; Michael Macias, MS; Samuel Brady; Wentao Yang, PhD; Jian Wang, PhD

Abstract: The PeCan knowledge base on St. Jude Cloud (pecan.stjude.cloud), initially developed as a resource of curated genomic variants in pediatric cancer (PeCan), has now been significantly expanded into a hub of interconnected data facets for over 9,000 hematologic (heme), CNS, and non-CNS solid (solid) tumor patient samples from around the world. The new data facets, which include gene expression, mutational signatures, and histology, can be explored alongside our existing variants data facet, inspiring new hypothesis generation. The Variants facet shows a genomic landscape view (i.e., oncoprint view) and gene- or genome-level view (i.e., ProteinPaint view). Expression data, generated from normalized RNA-seq of ~3,000 samples, can be explored via interactive 2-dimensional maps, revealing distinct subtypes relevant for patient stratification for precision therapy. Mutational signatures, identified from ~2,000 WGS/WES samples, are presented as a heatmap across subtypes, in addition to a summary view for a user-defined cohort or individual sample, for which we also display a mutation profile frequency plot together with identified signatures. The Histology facet enables review of histological slide images and associated clinical notes for ~3,000 solid tumors via a searchable interface. As all samples on PeCan have been mapped to a custom pediatric cancer classification-based ontology, the user can customize the view of each data facet presented in PeCan by selecting a specific cancer subtype. Integrative analysis between the data facets has enabled new insights into pediatric cancer biology, as demonstrated in the following two examples. First is the discovery of two potential subtypes of adamantinomatous craniopharyngioma. These were initially identified via expression analysis, which revealed two distinct groups that were confirmed by examination of associated histology slide data, which revealed delineation by brain invasion. Second is an analysis that computationally identifies homologous recombination-deficient (HRD) pediatric tumors using data from the mutational signatures data facet. This provides new insights on the applicability of therapies targeting HRD in the pediatric cancer population. These examples demonstrate the potential value of PeCan in advancing clinical diagnostic classification of pediatric cancer and exploration of new therapeutic opportunities. PeCan is an evolving knowledge base, and we are continuously expanding the platform and adding data over time to foster scientific discovery for the global research community, with the goal of improving treatments for pediatric cancer.


DeepMRIRec: Deep-learning-based four-fold acceleration of MRI data acquired with RT coil configurations

Author: Shahinur Alam

Co-Authors: Jinsoo Uh; Chia-Ho Hua, PhD; Khaled Khairy, PhD

Abstract: Magnetic Resonance Imaging (MRI) is a powerful technique for discovering and monitoring neurological, musculoskeletal, and oncological diseases. However, MRI is slow and stressful for patients, who must remain motionless in a prolonged imaging procedure that prioritizes reduction of imaging artifacts and an increase in data quality. Prolonged imaging time is particularly challenging for pediatric patients and those who are already vulnerable due to their health conditions. It is therefore important to seek methods that reduce imaging time. One approach is to record fewer measurements and then digitally recover the full information from this limited set of measurements in a post-acquisition reconstruction step. Towards that end, we developed a deep learning-based method called DeepMRIRec for MRI reconstruction from highly under-sampled raw data acquired with RT-specific receiver coils. We demonstrate our method on fully sampled k-space data of T1-weighted MRI acquired from 73 pediatric brains with tumors/surgical beds using loop and posterior coils (maximum 14 channels). We thoroughly evaluated DeepMRIRec and state-of-the-art deep learning-based reconstruction methods using our dataset, with and without transfer learning from publicly available models. DeepMRIRec reduces image acquisition time by a factor of four, surpassing previous methods (structural similarity (SSIM) score of 0.95, PSNR of 33). Although our model's output is satisfactory and outperforms state-of-the-art approaches, it still produces images that suffer from significant smoothing and will need further improvement. Importantly, our approach enables a significant reduction in acquisition time for children undergoing therapy, especially those who undergo routine MRI. As our method improves, we plan to perform a thorough evaluation of its effectiveness in the field.
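
For intuition, retrospective four-fold under-sampling and the reported quality metrics can be illustrated as below; the Cartesian mask and the zero-filled baseline are generic stand-ins for the actual acquisition scheme and for DeepMRIRec itself.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
image = rng.random((128, 128))  # stand-in for a fully sampled slice

# Keep every 4th phase-encode line plus a fully sampled low-frequency band.
kspace = np.fft.fftshift(np.fft.fft2(image))
mask = np.zeros(kspace.shape, dtype=bool)
mask[::4, :] = True
mask[60:68, :] = True  # center of k-space
zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * mask)))

# A trained network would replace this zero-filled baseline; the same
# metrics then quantify reconstruction quality against the full acquisition.
print("SSIM:", structural_similarity(image, zero_filled, data_range=1.0))
print("PSNR:", peak_signal_noise_ratio(image, zero_filled, data_range=1.0))
```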


Predicting Cellular Condensation by Fusion Oncoproteins

Author: Swarnendu Tripathi

Co-Authors: Hazheen Shirnekhi, PhD; Bappaditya Chandra, PhD; David Baggett, PhD; Cheon-Gil Park; Benjamin Lang; Hadi Hosseini; Brittany Pioso; Snigdha Maiti; Aaron Phillips; Jina Wang

Abstract: Fusion oncoproteins (FOs) that arise from in-frame chromosomal translocations are drivers of many aggressive pediatric cancers. Several FOs are known to promote oncogenesis by undergoing phase separation (PS) to form aberrant biomolecular condensates. To test the generality of PS, 166 FOs derived from diverse cancers were expressed in HeLa cells, and 58% were identified to form condensates. The condensate-forming FOs displayed physicochemical features different from those that displayed diffuse localization. We therefore hypothesized that distinct patterns of physicochemical features encoded in the amino acid sequences of the FOs could be used to predict their cellular condensation behavior. To test this hypothesis, we applied supervised automated machine learning (ML); a Gradient Boosting Machine model with 65 trees performed best among the 220 models tested. The model accuracy on the cross-validated training set of 149 FOs (96 condensate-positive and 53 condensate-negative) was 0.79. We experimentally tested 12 additional FOs for condensate formation in cells and obtained 0.92 predictive accuracy from the ML model. We determined the contributions of the physicochemical features to prediction of FO condensation behavior using the Shapley Additive exPlanations (SHAP) algorithm, based on game-theoretically optimal Shapley values. SHAP analysis uncovered how physicochemical features contributed to ML predictions, which guided mutagenesis experiments to rationally reverse the behavior of 9 condensate-forming FOs. Our ongoing studies seek to understand how the LLPS-prone intrinsically disordered regions within FOs contribute to their cellular condensation behavior and promote oncogenesis.
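
As a minimal sketch of this modeling setup, a gradient boosting classifier with 65 trees can be paired with SHAP attributions as follows; the features and labels are synthetic stand-ins, not the actual physicochemical feature set.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 149                 # size of the training set reported in the abstract
X = rng.random((n, 8))  # stand-in physicochemical features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.2, n) > 0.8).astype(int)

# 65 estimators echoes the best-performing model described above.
model = GradientBoostingClassifier(n_estimators=65).fit(X, y)

# SHAP values attribute each prediction to the input features, which is what
# guides feature-level interpretation and mutagenesis design.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```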



Repulsive Surfaces for Modeling Complex Biomembrane Morphologies

Author: Meghdad Razizadeh, PhD

Co-Authors: Khaled Khairy, PhD

Abstract: Lipid bilayers are essential components of biological membranes, playing a crucial role in maintaining the structural integrity of cells and organelles. The study of lipid bilayer vesicles, as models for biological membranes, has been an active area of research for the past four decades. The computational modeling of vesicles strives to relate morphology, membrane structure, and membrane energetics. The Canham-Helfrich energy functional and its variants represent a well-established approach to computationally investigate the mechanics and dynamics of lipid bilayers in terms of the local principal curvatures of the surface. However, even with the help of such powerful formulations, simulation of lipid vesicles, complex biomembranes, and realistic membrane-bound organelles such as the Golgi, ER, or mitochondria still poses significant computational challenges. For example, the evaluation of bending forces requires approximations of derivatives of surface coordinates up to the fourth order, which can lead to numerical challenges on discretized shapes. Importantly, published works on computational membrane modeling lack reliable mechanisms to prevent self-intersections, making it challenging to extend computational predictions of biomembrane shapes to more intricate, highly convoluted forms. This issue is especially prevalent for shapes with high surface-area-to-volume ratios, observed for many cell organelles, where minimized shapes can become unphysical due to self-intersections. In this study, we adopt a recently developed method for self-intersection prevention (repulsive surfaces), originally developed for computer graphics applications. We extend its application to model realistic biomembranes by adding a tangent-point energy term to the Helfrich functional. Our method yields shape predictions even at high surface-to-volume ratios, a regime that cannot be reliably explored without self-intersection treatments. Finally, our study demonstrates the ability of this model to predict complex biomembrane morphologies, where lipid bilayers are forced to form tubular or planar shapes, features commonly observed in cell organelles, for example, the ER and inner mitochondrial membrane.
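
For reference, a standard form of the Canham-Helfrich bending energy, together with a schematic tangent-point repulsion of the kind adopted here (exact exponents and weights vary across formulations), is:

```latex
% Bending energy over a closed surface S, with mean curvature H, Gaussian
% curvature K, spontaneous curvature c_0, and bending moduli \kappa, \bar{\kappa}:
E_{\mathrm{bend}} = \int_S \left[ \tfrac{\kappa}{2}\,(2H - c_0)^2 + \bar{\kappa}\,K \right] \mathrm{d}A
% Tangent-point repulsion: r(x,y) is the radius of the sphere tangent to S at x
% and passing through y; the energy blows up as distant surface regions approach:
E_{\mathrm{tp}} = \int_S \int_S \frac{\mathrm{d}A_x\,\mathrm{d}A_y}{r(x,y)^{p}},
\qquad
r(x,y) = \frac{\lvert x - y \rvert^{2}}{2\,\lvert \langle n(x),\, x - y \rangle \rvert}
```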


Defining the Condensate Landscape of Fusion Oncoproteins

Author: David Baggett, PhD

Co-Authors: Swarnendu Tripathi, PhD; Hazheen Shirnekhi, PhD; Brittany Pioso; Bappaditya Chandra, PhD; Richard Kriwacki, PhD

Abstract: Fusion oncoproteins (FOs) occur in ~17% of cancers and are often drivers of oncogenesis through one of two molecular mechanisms. Many FOs promote oncogenesis by changing gene expression, while others alter cellular signaling. Notably, formation of aberrant biomolecular condensates is implicated in the oncogenic mechanism of several FOs.

To better understand the relationship between FOs and condensate formation, we compiled a database of >3,000 FO sequences (FOdb). We then transfected HeLa cells with mEGFP-tagged forms of 166 FOs selected from the FOdb and found that 58% formed condensates. These studies show that FOs that formed condensates have physicochemical feature patterns different from those that do not. We find that these patterns also correlate with sub-cellular localization. Further, using conserved domain annotation, we found that FOs localized to the nucleus contained functional terms predominantly associated with regulation of gene expression, while those localized to the cytoplasm were associated with cell signaling terms.

We then trained a machine learning model that analyzes an FO sequence to accurately predict whether it will form biomolecular condensates in cells. This model predicts that 59% of the untested FOs in the FOdb form cellular condensates, many of which have physicochemical features indicative of gene regulation and nuclear localization. Conversely, a substantial portion of FOs predicted not to form condensates have sequence features that suggest cytoplasmic localization and cell signaling functions. We conclude that most FOs that drive aberrant gene transcription function in part through condensate formation, while those that alter cell signaling do so without forming condensates.


Epigenetic regulators TET2 and polycomb repressive complex 2 (PRC2) coordinate gene expression in Myelodysplastic Syndrome (MDS)

Author: Jacquelyn Myers, MS

Co-Authors: Elodie Henriet, PhD; Ana Leal-Cervantes, PhD; LaShanale Wallace, PhD; Amina Metidji, PhD; Jordan Skorupa, BS; Ilaria Iacobucci, PhD; Shondra Pruett-Miller, PhD; Yiping Fan, PhD; Esther Obeng, MD, PhD

Abstract: Myelodysplastic syndrome (MDS) is a pre-leukemic disorder arising from somatic driver mutations that cause the expansion of mutant hematopoietic stem cells (HSCs) and, ultimately, transformation into acute myeloid leukemia (AML). Epigenetic regulators, including TET2, are frequently mutated in MDS. TET2 loss-of-function mutations enhance HSC self-renewal, resulting in myeloproliferation and extramedullary hematopoiesis. We know TET2 plays a role in active DNA demethylation, but how disruption of this process mechanistically results in AML transformation remains unknown. We hypothesize that TET2 loss causes gains in DNA methylation, triggering an aberrant epigenetic cascade that dysregulates gene expression in HSCs.

We collaborated with the St. Jude CAGE Core to develop TET2-knockout K562 cells to evaluate the effects of TET2 loss on DNA methylation and transcription. We identified >5,000 regions that gain DNA methylation upon TET2 loss, residing within repressed intronic regions. These gains in DNA methylation occur within genes that are up-regulated upon TET2 loss. Taken together, these data lead us to hypothesize that polycomb repressive complex 2 (PRC2) recruitment is blocked upon TET2 loss, resulting in gene activation. To test this, we conducted H3K27me3 CUT&RUN and identified a significant loss of H3K27me3 at up-regulated genes. Additionally, we evaluated gene expression changes in TET2-mutant AML samples and identified a conserved set of gene candidates that are putatively regulated by DNA methylation and PRC2. These findings begin to unravel an epigenetic regulatory mechanism responsible for maintaining normal gene expression patterns in HSCs.


Computational design of multi-specific chimeric antigen receptors

Author: Kalyan Immadisetty, PhD

Co-Authors: Jaquelyn Zoine; Vikas Trivedi; Stephen Gottschalk, PhD; Giedre Krenciute, PhD; Paulina Velasquez, PhD; M. Madan Babu, PhD; Michaela Meehl

Abstract: Chimeric antigen receptor (CAR) T-cell immunotherapy has the potential to revolutionize the treatment of pediatric brain tumors and leukemias. However, insufficient T-cell activation and/or immune escape remain significant obstacles. To address these challenges, bispecific CARs were designed to recognize two antigens, IL13Ra2 and/or B7H3, both expressed in pediatric brain tumors. Initial attempts to generate tandem CARs utilizing two single-chain variable fragments (scFvs) as a bispecific antigen domain resulted in a lack of cell surface expression of the CAR. To test whether computational evaluation of the structural dynamics of scFv position can predict and/or improve cell surface expression, we developed a computational approach built on the state-of-the-art structure prediction tool AlphaFold2. By optimizing linker length, type, and flexibility, we rationally designed several novel tandem and loop (diabody) bispecific IL13Ra2-B7H3 CARs. Preliminary testing of the three tandem and one loop CARs in HEK cells showed modest cell surface expression (~20%) for the tandem CARs, while the loop CAR showed nearly full cell surface expression (~90%). We then applied this strategy to optimize several peptide-scFv bispecific CARs, including GRP78-B7H3 CARs, and to design a novel trispecific peptide-scFv-scFv construct (GRP78-B7H3-CD123 CARs). Validation of their cell surface expression and antitumor activity against acute myeloid leukemia is currently underway. In summary, we have designed an innovative computational approach for multi-specific CAR generation, and this strategy can be applied to a broad range of pediatric cancers requiring novel immunotherapeutic options.


Cancer dependency prediction using deep learning

Author: Hadi Hosseini, PhD

Co-Authors: Duccio Malinverni, PhD; Balaji Santhanam, PhD; Bálint Mészáros, PhD; M. Madan Babu, PhD

Abstract: Recent advances in genomic technologies have allowed for determining the genetic dependencies (i.e., gene essentiality) in various cancer cell lines. Cancer dependency is determined by comparing the growth behaviors of cells with and without the knock-out of the target gene of interest. Large-scale efforts have allowed us to determine the dependency of 18,000 genes in ~900 cell lines, resulting in the creation of the DepMap portal. This resource has created unprecedented large-scale opportunities to study cancer vulnerabilities at the genomic level. Despite the immense potential of the DepMap dataset to understand cancer dependencies and vulnerabilities at the cell line level, bringing the power of cancer dependency to personalized medicine requires overcoming significant practical barriers. Determining the dependency profile of a specific patient would require performing ~18K knock-out experiments on all human genes and corresponding cell-growth assays, which is currently out of reach of routine clinical settings. To overcome these limitations, we propose to build an AI-based approach to predict cancer dependency at the individual patient level. Our approach leverages the availability of the DepMap dataset and combines it with the Cancer Cell Line Encyclopedia (CCLE) and protein-protein interaction networks, which are publicly available in the STRING, IntAct, and BioGRID portals.

We are developing a deep learning model for omics data to identify novel cancer-dependent genes and make predictions about their role in disease. The question is how to predict cancer dependency using bulk RNA-seq expression and protein-protein interaction data. Predicting the cancer dependencies of the entire human genome will be useful for further research, such as target discovery and designing specific drugs that modulate cancer-dependent genes. We are designing a graph-based neural network (GNN). This model can consider the functionality of several genes together, the assumption being that if genes interact, they function together. The advantage of GNN models is that they retain the relationships between nodes in each layer. As genes work together in a gene network, the relationships between genes are important for prediction quality. The GNN will predict cancer dependency from particular gene-expression profiles, together with accurate uncertainty estimates.
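
A minimal sketch of the graph-convolution idea, on a toy PPI network with one expression feature per gene, might look like the following (plain PyTorch, untrained; the architecture and uncertainty estimation of the actual model are not specified here).

```python
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    """Two-layer graph convolution: H' = ReLU(A_hat @ H @ W), where A_hat is
    the symmetrically normalized PPI adjacency with self-loops and node
    features are gene expression values."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, 1)

    def forward(self, a_hat, h):
        h = torch.relu(a_hat @ self.w1(h))
        # One dependency probability per gene.
        return torch.sigmoid(a_hat @ self.w2(h)).squeeze(-1)

# Toy PPI graph over 4 genes, one expression feature per gene.
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
adj = adj + torch.eye(4)                               # self-loops
deg = adj.sum(1)
a_hat = adj / torch.sqrt(deg[:, None] * deg[None, :])  # D^-1/2 A D^-1/2
expr = torch.rand(4, 1)

model = SimpleGCN(in_dim=1, hidden_dim=8)
print(model(a_hat, expr))  # untrained per-gene dependency scores
```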

We aim to apply this model to all the St. Jude pediatric transcriptomics datasets to investigate the resulting genes and determine target dependency. We will specifically focus on AML, T-ALL, and B-ALL using the St. Jude pediatric transcriptomics dataset.


Computational approach for discovery of molecular glues: Expanding the Repertoire of Neo-Substrates for Targeted Protein Degradation 

Author: Vyoma Sheth, BDS, MS

Co-Authors: Besian I. Sejdiu, PhD; Gisele Nishiguchi, PhD; Balaji Santhanam, PhD; Benjamin Leslie, PhD; Zoran Rankovic, PhD; M. Madan Babu, PhD

Abstract: Drug development is not only difficult, time-consuming, and expensive but also an inherently unpredictable process. The discovery and validation of an appropriate drug target is crucial to the success of drug discovery efforts. Still, this task is challenging given that approximately 85% of proteins are considered "undruggable." Our aim is to overcome this limitation by adopting the strategy of “Targeted Protein Degradation (TPD)” to target undruggable proteins. In TPD, a small molecule (e.g., a molecular glue) brings a target of interest into proximity with an E3 ubiquitin ligase, thereby ubiquitinating the target (neo-substrate) and degrading it. A critical aim of our project is to develop an integrative, data-driven approach to identify lead compounds that can be developed into molecular glues to target therapeutically relevant neo-substrates. This approach provides scope to expand the repertoire of “druggable” proteins.

To identify potential molecular glues for pediatric leukemia and medulloblastoma*, we initiated our research by focusing on the C2H2 zinc finger family of transcriptional regulators, as these diseases are characterized by the deregulation of several transcription factors containing C2H2 zinc finger motifs. We considered the structure of the DDB1-CRBN-C2H2 ZNF domain of IKZF1 bound to pomalidomide as a reference starting point (PDB ID: 6H0F). Our structural bioinformatics analyses enabled us to define the signatures of small molecules from the perspective of CRBN and the neo-substrate to predict C2H2 proteins that can be targeted via a molecular glue strategy. We developed a virtual screening pipeline that consists of screening the Chemical Biology and Therapeutics (CBT) library (~600k compounds) and the Enamine molecular glue library (~5k compounds). Our pipeline was benchmarked using the above-mentioned reference structure and discriminatory analyses to separate potential binders from non-binders. Further post-docking analysis was conducted on the screening hits to validate novel binders.

We found small-molecule hits with low chemical similarity to thalidomide and its analogs while maintaining the same binding pose and higher binding energy. We also rigorously benchmarked our approach, as this enhances confidence in the discovered compounds, which can be prioritized for further experimental validation and investigation.

The next step is to create a web portal that would disseminate our findings with stratified confidence-score levels for various potential molecular glues against any desired C2H2 zinc finger target. We hope this accelerates future experimental efforts to target therapeutically relevant novel C2H2 zinc fingers. 

*This project is part of the Crazy8 initiative titled “Small Molecule Degraders for Targeting Transcription Factor Drivers of Childhood Cancers” that involves the groups of Charles Mullighan, Zoran Rankovic, M. Madan Babu, Marcus Fischer, Jeffery Klco, Paul Northcott, and Martine Roussel.


A transmission model of avian influenza in central Chile

Author: Lauren Lazure

Co-Authors: Pedro Jimenez-Bluhm, PhD; Christopher Hamilton-West, PhD; Stacey Schultz-Cherry, PhD

Abstract: Select avian influenza virus (AIV) strains continue to cause significant morbidity and mortality worldwide in mammals and birds. A high diversity of AIV subtypes has been identified in central Chile over the past decade, including highly pathogenic avian influenza viruses. Migratory waterfowl are likely the source of novel AIV subtypes in the region, as central Chile lies along several migratory bird flyways that link the Americas. The region also contains a high number of backyard poultry and swine farms that are susceptible to spillover infections and pose a risk for transmission to other mammals, including humans. The purpose of this study is to determine factors that contribute to AIV prevalence and to characterize transmission dynamics between migratory birds, non-migratory wild birds, and domestic poultry through agent-based modelling. The development of this model and the outcomes of these studies are critical for pandemic preparedness.


St. Jude Survivorship Portal: A data portal for storing, analyzing, and sharing large and complex cancer survivorship datasets

Author: Gavriel Matt

Co-Authors: Edgar Sioson; Jian Wang, PhD; Congyu Lu; Airen Zaldivar; Karishma Gangwani; Alex Acic; Jaimin Patel; Robin Paul; Colleen Reilly; Kyla Shelton

Abstract: Background/Purpose: Cancer survivorship research relies on large-scale cohort studies that capture a wide range of outcomes after the completion of cancer therapy, along with demographic, genetic, and clinical data. To maximize the utility of these comprehensive datasets, we must be able to store and share the data in a web-based environment that can be accessed by the broader survivorship research community. Furthermore, this environment should be integrated with analytical tools for performing statistical analyses on the stored data without needing to download the data and import it into third-party analytical software. To address this need, we created the St. Jude Survivorship Portal (https://survivorship.stjude.cloud), a web-based data portal for exploring, sharing, and analyzing data from survivors of pediatric cancer.

Methods: The portal hosts data collected from two large cohorts of pediatric cancer survivors, the St. Jude Lifetime Cohort (SJLIFE) and the Childhood Cancer Survivor Study (CCSS). All SJLIFE survivors (5,053) were included in the portal. For the CCSS cohort, only those survivors with whole genome sequencing data (2,688) were included, with a future plan for expansion to include all CCSS participants.

Results: The portal hosts both the phenotypic and genetic data of survivors. Phenotypic data include demographic data (e.g., sex, age, race/ethnicity, education, employment), clinical data (e.g., cancer diagnosis, cancer treatment exposures/doses, clinically assessed and survey-based chronic health conditions), and patient-reported data (e.g., healthcare utilization, insurance, quality-of-life assessments). Genetic data include whole-genome-sequencing-derived genotypes for over 160 million variants and polygenic risk scores computed for thousands of traits. These data can be easily explored through a hierarchically organized data dictionary for phenotypic data and a genome browser for genetic data. Charts and plots of variables can be quickly created, customized, and downloaded, all within the portal environment. Statistical analyses, including cumulative incidence analysis and regression analysis, may also be performed within the portal. In cumulative incidence analysis, users can analyze the incidences of various CTCAE-graded adverse events (e.g., cardiovascular dysfunction, neurological disorders, subsequent neoplasms) in survivors. In regression analysis, users can perform linear, logistic, or Cox regression and can assign any of the variables on the portal as either outcome or explanatory variables. Lastly, users who obtain the necessary permissions may download data from the portal for local, in-depth analyses. The St. Jude Survivorship Portal provides a comprehensive, powerful, and easy-to-use data portal for sharing and analyzing childhood cancer survivorship data and will serve as a valuable research tool for the broader survivorship research community.
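
As an offline illustration of the portal's regression capability, a Cox proportional hazards fit on a hypothetical survivor table might look like this; the column names and values are placeholders, not actual portal fields or data.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical survivor-level table.
df = pd.DataFrame({
    "years_followup":     [5.2, 12.1, 8.4, 3.3, 15.0, 9.8],
    "event":              [1, 0, 1, 1, 0, 0],  # 1 = adverse event observed
    "anthracycline_dose": [120, 0, 250, 300, 0, 60],
    "age_at_diagnosis":   [4, 12, 7, 2, 15, 9],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followup", event_col="event")
cph.print_summary()  # hazard ratios for each explanatory variable
```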


Tug-of-war between sticker and zipper interactions in network fluid condensates determines the kinetics of amyloid fibril formation

Author: Tapojyoti Das, MBBS, MMST, PhD

Co-Authors: Fatima Zaidi, PhD; James Messing, PhD; J. Paul Taylor, MD, PhD; Tanja Mittag, PhD

Abstract: Recurrent mutations in the C-terminal low-complexity domain of the stress granule protein hnRNPA1 (A1-LCD) lead to familial amyotrophic lateral sclerosis and a spectrum of related neurodegenerative diseases. These mutations create strong steric zipper motifs that drive fibrillization in vitro and lead to poorly dynamic stress granules in cells. A1-LCD also phase separates in vitro, driven by multivalent sticker-sticker interactions between aromatic residues. Phase-separated condensates of hnRNPA1 mutants and some other RNA-binding proteins give rise to fibrils over time. Together with genetic and cell biological data, this suggests that stress granules are crucibles for neurodegenerative disease. We find that disease mutants of A1-LCD have a lower driving force for phase separation than the wild-type and higher fibril stability, reflecting a higher thermodynamic driving force for conversion of condensates to fibrils. Our data further revealed that nucleation is promoted at the condensate interface, but subsequent fibril growth is kinetically inhibited in wild-type condensates. Increasing the aromatic sticker strength in a disease mutant background increased phase separation propensity and markedly slowed fibril formation in vitro, reflecting a molecular tug-of-war between the two processes. Moreover, using a custom pipeline for automated stress granule detection and analysis of live-cell imaging data, we found that strong stickers in a disease mutant background rescued the disease mutant phenotype of delayed stress granule dissolution in cultured human cells. A1-LCD constructs with strong stickers can also prevent fibrillization in trans, indicating therapeutic potential. Our work suggests that stress granules may suppress fibril formation of constituent proteins in the short term but can become a liability in the presence of fibrillization-prone disease mutants.


JUMP Software Suite (JUMPsuite): An Extensive and Modularized Toolbox for Large-scale Mass-spectrometry-based Proteomics and Metabolomics Data Analysis

Author: Xusheng Wang, PhD

Co-Authors: Zuo-Fei Yuan, Yingue Fu, Suresh Poudel, Abhijit Dasgupta, Yuxin Li, Ji-Hoon Cho, Handy High, Vishwajeeth Pagala, Haiyan Tan, Ashutosh Mishra, Kiran Kodali, Suiping Zhou, Kaiwen Yu, Huan Sun, Junmin Peng

Abstract: Mass spectrometry-based proteomics and metabolomics are increasingly powerful techniques for identifying and quantifying proteins and metabolites in a wide range of biological samples. However, the analysis of large proteomics and metabolomics datasets requires advanced computational tools and strategies. To address this need, we present the JUMP software suite (JUMPsuite), an extensive and modularized platform specifically designed for processing, analyzing, and interpreting large-scale mass spectrometry-based proteomics and metabolomics data. JUMP enables not only the accurate identification of peptides but also the quantification of differentially expressed proteins across multiple experimental conditions, providing efficient and comprehensive data interpretation capabilities. Additionally, the software suite offers stringent false discovery rate control mechanisms, ensuring accuracy and reliability in protein identification and quantification. The sophisticated clustering and network analysis features within JUMP aid in uncovering functional modules and pathways present in the biological samples under investigation. Moreover, JUMP can effectively prioritize candidate biomarkers and potential therapeutic targets based on their functional relevance to specific diseases. In parallel, we have extended JUMPsuite to process metabolomics data. Importantly, JUMPsuite is subject to consistent, ongoing enhancement, including the integration of novel analytical methods to expand its existing functionalities and its deployment on the St. Jude Cloud platform. Collectively, JUMPsuite serves as a valuable resource for the proteomics and metabolomics research communities, delivering a comprehensive, all-in-one solution for processing and analyzing large-scale mass spectrometry datasets.


Exploring homologous recombination deficiency in pediatric high-grade gliomas

Author: Corey Xu, PhD

Co-Authors: Evan Savage; Jason Myers; Evadnie Rampersaud; Ti-Cheng Chang; Gang Wu; Anang Shelat; Christopher Tinkle, MD, PhD

Abstract: Homologous recombination deficiency (HRD) has been extensively studied in breast and ovarian cancer, given its correlation with sensitivity to poly (ADP-ribose) polymerase (PARP) inhibitors. In these cancers, HRD is most frequently caused by biallelic inactivation of BRCA1/2, and tumors often exhibit ‘genomic scars’ such as an elevated HRD.sum score and specific DNA mutational signatures, including specific single base substitution (SBS), small insertion/deletion (ID), and structural rearrangement (RS) signatures. Despite its clinical relevance, HRD has not been systematically studied in pediatric cancers, and its therapeutic predictive value is unknown. Here, we evaluated HRD and genomic scars in 277 paired tumor and normal samples of pediatric high-grade glioma (pHGG) profiled with whole-genome sequencing (WGS), with 23 tumor-normal paired breast cancer (BC) WGS samples as controls. We profiled variants from over 500 DNA damage response genes and found that the most frequently mutated gene with biallelic loss was TP53, followed by ATM, an essential gene for genome stability. We found that 11% of the samples had an HRD.sum score greater than 40, an indicator of HRD in several adult cancers. However, while mutational signature analyses identified the HRD signature SBS3 in 25% of the pHGG samples, other HRD-related signatures, including ID6, RS3, and RS5, were not identified. In comparison, we identified SBS3 and ID6 in all BC samples with BRCA1/2 loss. Furthermore, we computed two recently reported composite HRD scores, HRDetect and CHORD, on the pHGG and BC samples. While all BC samples with BRCA1/2 loss were predicted to be HRD by both scores, none of the pHGG samples were predicted to be HRD by either score. Finally, analysis of RNA-seq data from a subset of pHGG samples (n=90) revealed that samples with elevated HRD.sum scores showed upregulation of pathways related to homology-directed DNA repair. Overall, our study suggests that HRD exists in a fraction of pHGG samples but is likely driven by variants in genes other than BRCA1/2, resulting in a different set of genomic scars compared to those in breast cancer. Functional analysis of sensitivity to PARP inhibitors and related DNA damage inhibitors is ongoing to evaluate the potential therapeutic relevance of these genomic scars.
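
One simple ingredient of signature analysis, scoring a tumor's 96-channel SBS profile against reference signatures by cosine similarity, can be sketched as follows; full attribution pipelines fit signature mixtures rather than taking a single best match, and the vectors below are random stand-ins, not real signature definitions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
tumor_profile = rng.random(96)  # stand-in 96-channel SBS mutation profile
reference = {"SBS1": rng.random(96), "SBS3": rng.random(96), "SBS5": rng.random(96)}

scores = {name: cosine(tumor_profile, sig) for name, sig in reference.items()}
print(max(scores, key=scores.get), scores)
```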


Radiomic features from apparent diffusion coefficient maps provide additional prognostic value in rhabdomyosarcoma

Author: Silu Zhang, PhD

Co-Authors: Chris Goode; Diana Storment; Zachary Abramson, MD; Matthew Krasin, MD; Alberto Pappo, MD; Mary McCarville, MD; Matthew Scoggins, PhD

Abstract: Rhabdomyosarcoma (RMS) is a rare cancer that arises from skeletal muscle tissue and mainly affects children and young adults. The 5-year overall survival (OS) rate for RMS is about 70%. Currently, prognosis depends on clinical factors including age, disease stage, histologic subtype, and tumor location. However, about 30% of patients with intermediate-risk RMS experience recurrence with local or metastatic disease and have a poor prognosis. Thus, identifying imaging features that can accurately predict clinical outcome could help improve risk stratification. Radiomics is a technique that extracts quantitative features, undetectable to the human eye, from diagnostic medical images. In this study, we aimed to identify radiomic features on routine MRI apparent diffusion coefficient (ADC) maps that might provide prognostic value in RMS.

Materials and Methods: We analyzed baseline MR imaging of 67 subjects enrolled in the RMS13 protocol (NCT01871766). A tumor volume of interest (VOI) was manually segmented and approved by a radiologist. Additionally, we generated a tumor-tissue interface VOI by dilating the tumor VOI and then subtracting the original (sketched below). Radiomic features were extracted from both the tumor mask and the tumor-tissue interface mask, resulting in a total of 2,446 features. We included 6 clinical variables in the statistical analysis: age at diagnosis, race, sex, primary site category (13 sites), clinical risk group (low, intermediate, high-Oberlin-low, and high-Oberlin-high), and FOXO1 gene translocation status.
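
The dilate-and-subtract construction of the interface mask can be written directly; the toy mask and the dilation depth below are illustrative choices, not the study's parameters.

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Toy 3D tumor mask; in practice this is the radiologist-approved segmentation.
tumor = np.zeros((64, 64, 64), dtype=bool)
tumor[24:40, 24:40, 24:40] = True

# Dilate the tumor VOI, then subtract the original to leave a shell covering
# the tumor-tissue interface.
dilated = binary_dilation(tumor, iterations=3)
interface = dilated & ~tumor
print(tumor.sum(), interface.sum())
```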

Results: Age at diagnosis, risk group, and translocation status were associated with 3-year OS. When controlling for these factors, 46 radiomic features were associated with 3-year OS. Specifically, larger tumor size, larger ADC range within a single tumor, and other features related to heterogeneity were associated with worse outcome. Interestingly, 27 of the 46 features associated with outcome were identified from the interface mask rather than the tumor mask, suggesting potentially useful information in the interaction between the tumor and host tissue.

Conclusions: Although further studies are needed to validate these features, radiomic features extracted from routine clinical MRI may provide important prognostic information to guide treatment and improve outcomes.


Identifying disease-relevant signaling sites by studying their structural evolution using AlphaFold-based predictions across 200 species

Author: Benjamin Lang

Co-Authors: Bálint Mészáros, PhD; Besian I. Sejdiu, PhD; Duccio Malinverni, PhD; M. Madan Babu, PhD

Abstract: Post-translational modifications (PTMs) can dynamically diversify the proteome in response to intracellular and extracellular signals. They are a fundamental means of biological information processing with important functions in development, homeostasis and disease. Since hundreds of thousands of modified residues as well as entirely new modification types are being discovered in proteins, elucidating their biological functions and how they evolve is a fundamental problem. I will present a study of the evolution of modified amino acids in human proteins in their structural context. Using sequence, polymorphism, mutation, and structural data at the species, population and individual levels, we identified significant evolutionary constraints. We also observed an overrepresentation of amino acids which mimic modified residues at orthologous positions, representing clues to their evolution and structural roles.

AlphaSync (http://mbsql.stjude.org/alphasync/) is a still-unpublished internal St. Jude data processing pipeline and web server that enriches DeepMind’s AlphaFold structural database with residue-level accessible surface area and residue–residue contact data. It is also intended to synchronize the infrequently updated public AlphaFold database with the latest UniProt release, which it does by re-predicting the structures of proteins with updated sequences using the St. Jude high-performance GPU computing cluster. Importantly, the pipeline also includes prediction of intrinsically disordered residues from their relative solvent-accessible surface area, a recently pioneered approach, and builds on a novel contact-detection Python package from the Babu lab. A MySQL database server serves as the data repository for our refined version of AlphaFold DB. We envision that AlphaSync will be helpful to computational biologists at St. Jude and beyond.
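
A minimal sketch of disorder flagging from relative solvent accessibility, with stand-in ASA values and an assumed smoothing window and cutoff (not AlphaSync's actual parameters), could look like:

```python
import numpy as np

# Per-residue accessible surface area (ASA) from a predicted structure, and
# the theoretical maximum ASA for each residue type (values are stand-ins).
asa     = np.array([95.0, 130.0, 20.0, 15.0, 110.0, 140.0, 125.0])
max_asa = np.array([163.0, 198.0, 167.0, 160.0, 196.0, 198.0, 198.0])
rel_asa = asa / max_asa

# Smooth along the sequence, then flag residues whose smoothed relative
# accessibility exceeds a cutoff as putatively disordered.
window = 5
smoothed = np.convolve(rel_asa, np.ones(window) / window, mode="same")
print(smoothed > 0.55)
```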

By tracing the structural evolution of human post-translationally modified residues across 200 closely and distantly related species from the Ensembl Compara comparative genomics pipeline using AlphaSync, we aim to identify disease-relevant signaling sites that exert a high structural impact when modified or mutated.


Global Depths for Irregularly Observed Multivariate Functional Data

Author: Zhuo Qu, PhD

Abstract: Two frameworks for multivariate functional depth based on multivariate depths are introduced in this paper. The first framework is multivariate functional integrated depth, and the second involves multivariate functional extremal depth, an extension of the extremal depth for univariate functional data. In each framework, global and local multivariate functional depths are proposed. The properties of population multivariate functional depths and the consistency of finite-sample depths to their population versions are established. In addition, finite-sample depths under irregularly observed time grids are estimated. As a by-product, the simplified sparse functional boxplot and the simplified intensity sparse functional boxplot are proposed for visualization without data reconstruction. A simulation study demonstrates the advantages of global multivariate functional depths over local multivariate functional depths in outlier detection and running time for big functional data. An application of our frameworks to cyclone track data demonstrates the excellent performance of our global multivariate functional depths.
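
For context, a multivariate functional integrated depth is commonly defined by integrating a pointwise multivariate depth over the time domain, as in the following generic form (weighting and depth choices vary across frameworks, including the ones proposed here):

```latex
% f : I -> R^p is a multivariate function, D(.; F_t) a multivariate depth
% (e.g., halfspace or simplicial depth) at time t, and w a weight function
% with \int_I w(t)\,dt = 1:
\mathrm{MFID}(f; F) = \int_I D\!\left(f(t);\, F_t\right) w(t)\, \mathrm{d}t
```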


St. Jude Neuro-Oncology Data Portal delivers genomic and clinical information from a large pediatric medulloblastoma cohort

Author: Congyu Lu, PhD

Co-Authors: Sandeep Dhanda; Edgar Sioson; Airen Zaldivar; Karishma Gangwani; Gavriel Matt, PhD; Colleen Reilly; Alex Acic; Giles Robinson, MD; Xin Zhou, PhD

Abstract: Background/purpose: Medulloblastoma is the most common malignant brain tumor of childhood and is now known to be a highly heterogeneous disease. It comprises diverse molecular subgroups and subtypes, each associated with different clinical characteristics, genetic alterations, risk factors, and survival outcomes, necessitating a comprehensive and multidimensional approach to their investigation. In addition, advances in high-throughput genomic analyses have yielded an unprecedented number of genetic alterations in medulloblastoma tumors, making analysis across multiple variables challenging. Integrating various types of genetic alterations with phenotypic data and presenting them in easily accessible and intuitive visualizations is essential to extrapolate novel insights from this vast dataset. To this end, we developed the Neuro-Oncology Data Portal, a web application for multidimensionally visualizing and analyzing data from medulloblastoma patients.

Methods: The portal hosts data collected from three studies comprising 898 patients with medulloblastoma (SJMB03, ACNS0331, and ACNS0332). DNA methylome profiling and whole-exome and transcriptome sequencing are available for most of the tumors.

Results: The portal presents demographic, clinical, and genomic data from 898 medulloblastoma patients as hierarchically organized variables inside an interactive data dictionary. Variables are searchable and can be visualized and analyzed with multiple plots. The summary plots (bar chart and violin plot) visualize the distribution of a given variable. The survival plot presents survival outcomes of the represented patients. The matrix plot provides a compact graphical representation of the overall annotations and genomic alterations from the patients and their samples. The tSNE plot visualizes tumor clustering patterns based on methylomic state, where users can lasso a cluster of interest and save it as a new sample group to be explored in other plots. Each of the plots is embedded with a multitude of point-and-click functions to facilitate visualization and analysis. By using the variables jointly in the plots, users can stratify the samples based on simultaneous consideration of these variables and perform multidimensional exploration and integrated analyses. To facilitate the filtering of samples, we implemented a filtering system within which any of the variables can be utilized as a filter, and multiple filters can be combined with AND/OR operators into a single filter. The portal also enables users to create novel sample groups using the filtering system and compare the sample groups they have created.

Conclusion: The St. Jude Neuro-Oncology Data Portal provides the research community with a powerful tool to multidimensionally visualize and analyze data from medulloblastoma patients. The portal will be made freely accessible upon manuscript publication.


 

Invited Oral Presentation Abstracts

Unlocking ultrahigh-throughput drug combination screens using machine learning

Author: Charlie Wright, PhD

Abstract: Most complex diseases such as cancer are treated with combinations of drugs, but arriving at these combinations is extremely challenging and can take decades. One major challenge in finding novel combination therapies stems from the immense throughput needed to screen drugs in sufficient density (number of concentrations of each drug). Additionally, the current pool of single agents to potentially combine is far too large to brute-force screen, and purely computational predictions of synergy have performed poorly. There is an urgent and unmet need to perform unbiased and dense drug-drug screens to expedite the search for novel synergistic therapies.

Here, we present Combocat: an end-to-end platform that scales drug combination screening to ultrahigh-throughput levels with the assistance of machine learning. Combocat has two modes of screening, dense mode and sparse mode. In dense mode, 10×10 drug combination matrices are experimentally tested on a 384-well microplate and subsequently analyzed for synergy. In sparse mode, the same 10×10 matrix is produced, but only the 10 concentration pairs along the matrix diagonal are tested, and machine learning imputes the remaining 90 values.
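
To make the sparse-mode bookkeeping concrete, here is a toy R sketch (ours, not Combocat's code): only the 10 diagonal responses are "measured", and a simple stand-in predictor fills the remaining 90 cells where Combocat would apply its trained regression model.

  # Toy sketch of sparse mode. The geometric-mean fill is chosen only
  # because it reproduces the measured diagonal exactly; the real model
  # is a regression trained on dense-mode data.
  n <- 10
  diag_resp <- seq(0.95, 0.10, length.out = n)  # measured diagonal responses
  full <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      full[i, j] <- sqrt(diag_resp[i] * diag_resp[j])  # stand-in predictor
    }
  }
  # diag(full) equals diag_resp; the other 90 cells are imputed values.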

The Combocat regression-based machine learning model was trained on hundreds of samples of dense mode data generated at St. Jude and allows for accurate capture of synergy. By further experimental miniaturization (using 1536-well microplates, sparse matrix formats, and cross-plate normalization), we were able to increase the throughput nearly 300-fold. Combocat represents a union between state-of-the-art experimental and computational methods, and dramatically increases the capability of screening for new synergistic drug combinations. 


An unsupervised learning approach uncovers divergent mesenchymal-like gene expression programs across human neuroblastoma tumors and preclinical models

Author: Richard Chapple

Co-Authors: Xueying Liu, Sivaraman Natarajan, Yuna Kim, Anand Patel, Christy LaFlamme, Min Pan, Charlie Wright, Hyeong-Min Lee, Yinwen Zhang, Meifen Lu

Abstract: Neuroblastoma is the most common pediatric solid tumor, and only 50% of high-risk patients survive. Preclinical studies suggest a mesenchymal-like gene expression program is a key driver of chemotherapy resistance. However, the lack of clinical progress suggests we need a better understanding of the relationship between patient tumors and preclinical models. Here, we generated single-cell RNA-seq maps of neuroblastoma cell lines, patient-derived xenograft models (PDX), and a genetically engineered mouse model (GEMM). We developed an NMF-based unsupervised machine learning approach to compare these preclinical data to human neuroblastoma tumors. We discovered that the dominant adrenergic gene expression programs were well conserved between patient tumors and preclinical models but, contrary to previous reports, do not unambiguously map to an obvious cell of origin. The mesenchymal-like program, however, was less clearly conserved, with highly expressed mesenchymal-like programs restricted to cancer-associated fibroblasts and Schwann-like cells in vivo. Surprisingly, we identified a subtle, weakly enriched mesenchymal-like gene expression program in cancer cells in rare, drug-pretreated high-risk tumors, which was maintained in PDX, could be chemotherapy-induced in our GEMM, and may represent a new therapy-escape mechanism. Collectively, our findings improve the understanding of neuroblastoma heterogeneity in patient tumors and preclinical models, which can facilitate the development of new treatments.
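
As background, the core NMF step for extracting gene expression programs can be sketched in R as follows (toy data, and the CRAN package NMF is our assumption; the study's actual pipeline is more involved):

  # install.packages("NMF")
  library(NMF)
  set.seed(1)
  expr <- matrix(rpois(2000, lambda = 5) + 1, nrow = 200)  # genes x cells (toy counts)
  res <- nmf(expr, rank = 4)   # factorize into 4 candidate programs
  W <- basis(res)              # genes x programs: gene weights per program
  H <- coef(res)               # programs x cells: per-cell program usage
  # Programs recovered in tumors and in models can then be matched,
  # e.g., by correlating the columns of their respective W matrices.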


Innovations in Clinical Image Visualization

Author: Zachary Abramson, MD, DMD

Abstract: Three-dimensional modeling of anatomy allows the radiologist and clinician to move beyond the standard orthogonal imaging planes to gain a better understanding of the spatial relationships of structures within the body. Pre-operative planning meetings in particular offer several well-documented advantages. Notably, a visual preview provides the surgeon with a realistic sense of what he or she may encounter in the operating room and promotes confidence by allowing the surgeon to simulate surgical approaches and anticipate problems in the mind’s eye.

Creating high-quality 3D images begins with high-quality 2D or volumetric image acquisition. Both pre- and post-processing techniques, including artificial intelligence and deep learning, can be employed to maximize image quality and allow for accurate segmentation of tissues of interest. Once segmentation is complete, 3D models can be created and subsequently rendered to look realistic. Models must then be registered across space and time, and this registration process must be optimized to maintain the veracity of the 3D relationships. Finally, the models must be displayed in the manner most appropriate to the questions posed.

Over the past five years, advanced image processing specialists within the Department of Diagnostic Imaging have created approximately 200 virtual and 20 3D-printed pre-operative models. Our work focuses on the optimization of image acquisition for 3D modeling and the incorporation of available hardware and software technologies into the 3D workflow. Beginning with 3D rendering techniques on 2D displays, we have expanded our efforts to include the use of 3D devices including VR headsets and holographic displays.

Traditionally, virtual 3D models are displayed as 3D renderings on a 2D display, conveying depth and dimensionality through linear perspective and simulated lighting. These rendering effects rely on an inherent sense of object size to infer distance. However, when rendering unfamiliar structures, such as an individual patient tumor, our inferences can be misleading, hindering our assessment of an object's size and its relationship to other structures.

Stereoscopic vision and motion parallax can resolve the ambiguity of linear perspective. The display technologies incorporating both stereoscopy and parallax come in two flavors: head-mounted and free-standing. Head-mounted displays include virtual and augmented reality headsets, while free-standing 3D displays include multi-view flat panel and volumetric displays. The use of these innovative display technologies not only enhances virtual surgical planning practices but also paves the way for surgical simulation and training as well as intra-operative navigation.  


A hybrid single cell demultiplexing strategy that increases both cell recovery rate and calling accuracy

Author: Lei Li

Abstract: Recent advances in single-cell RNA sequencing allow users to pool multiple samples into one run and demultiplex them in downstream analysis, greatly increasing experimental efficiency and cost-effectiveness. However, expensive cell-labeling reagents, limited pooling capacity, and suboptimal cell recovery rates and calling accuracy remain major challenges for this approach. To date, there are two major demultiplexing methods, antibody-based cell hashing and Single Nucleotide Polymorphism (SNP)-based genomic signature profiling, and each has its own advantages and limitations. Here, we propose a hybrid demultiplexing strategy that increases calling accuracy and cell recovery at the same time. We first develop a computational algorithm that significantly increases the calling accuracy of cell hashing. Next, we cluster all single cells based on their SNP profiles. Finally, we integrate results from both methods to make corrections and retrieve cells that are identifiable by one method but not the other. By testing on several real-world datasets, we demonstrate that this hybrid strategy combines the advantages of both methods, resulting in increased cell recovery and calling accuracy at a lower cost.
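
The integration step can be illustrated with a minimal consensus rule in R (our simplified logic, for intuition only; the actual algorithm also corrects calls):

  # Combine per-cell sample calls from hashing and from SNP clustering:
  # agree -> keep; one missing -> rescue from the other; conflict -> drop.
  consensus_call <- function(hash_call, snp_call) {
    ifelse(!is.na(hash_call) & !is.na(snp_call) & hash_call == snp_call,
           hash_call,                                      # both agree: keep
    ifelse(is.na(hash_call) & !is.na(snp_call), snp_call,  # rescue from SNPs
    ifelse(!is.na(hash_call) & is.na(snp_call), hash_call, # rescue from hashing
           NA)))                                           # conflict: drop
  }
  hash <- c("S1", "S2", NA,   "S3")
  snp  <- c("S1", NA,   "S2", "S4")
  consensus_call(hash, snp)   # "S1" "S2" "S2" NA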


VirtualFlow 2.0 - The Next Generation Drug Discovery Platform Enabling Adaptive Screens of 69 Billion Molecules

Author: Christoph Gorgulla, PhD

Abstract: Early-stage drug discovery has been limited by initial hit identification and lead optimization and their associated costs. Ultra-large virtual screens (ULVSs), which virtually evaluate massive numbers of molecules for their ability to engage a macromolecular target, can significantly alleviate these problems, as was recently demonstrated in multiple studies. Despite their potential, ULVSs have so far explored only a tiny fraction of the chemical space and of available docking programs. Notably, standard ULVS approaches become prohibitively expensive when scaled up to larger ligand libraries containing tens of billions of compounds. Here, we present VirtualFlow 2.0, the next generation of the first open-source drug discovery platform dedicated to ultra-large virtual screens. VirtualFlow 2.0 provides the REAL Space from Enamine containing 69 billion drug-like molecules in a "ready-to-dock" format, the largest library of its kind available to date. We provide an 18-dimensional matrix for intuitive exploration of the library through a web interface, where each dimension corresponds to a molecular property of the ligands. Additionally, VirtualFlow 2.0 supports multiple techniques that dramatically reduce computational costs, including a new method called Adaptive Target-Guided Virtual Screening (ATG-VS). By sampling a representative sparse version of the library, ATG-VS identifies the sections of the ultra-large chemical space that harbor the highest potential to engage the target site, reducing computational costs by up to a factor of 1,000. In addition, VirtualFlow 2.0 supports the latest deep learning- and GPU-based docking methods, allowing further speed-ups of up to two orders of magnitude. Due to its open-source nature and versatility, we expect that VirtualFlow 2.0 will play a key role in the future of early-stage drug discovery.


Image Analysis for Clinical Research: Using Neuroimaging to Improve Quality of Survivorship for Children Treated for Cancer

Author: Heather Conklin, PhD

Abstract: Recent advances in cancer detection and cancer-directed therapy have contributed to a significant improvement in survival rates for children diagnosed with cancer. However, there is frequently a cost of cure with survivors experiencing life-long increased risk of morbidity associated with their curative, often aggressive therapies. There is also a cost associated with missed normative developmental experiences, especially in early childhood. Clinical neuroimaging has significantly contributed to the increase in survival rates for children treated for cancer by facilitating earlier diagnosis and contributing to differential diagnosis, as well as assisting with monitoring treatment response, identifying disease progression/recurrence, and differentiating disease progression from treatment-related changes. Childhood cancer survivors who received CNS-directed therapy are at significant risk for cognitive late effects of their disease and treatment, which contribute to reduced quality of life due to a decreased likelihood of attaining important academic, vocational, and social milestones. Neuroimaging is critical to clinical investigations that seek to improve cognitive outcomes for childhood cancer survivors. Incorporation of neuroimaging into collaborative research has greatly facilitated:  1. identifying those patients at greatest cognitive risk, 2. elucidating biological mechanisms underlying neurotoxicity, and 3. demonstrating neuroplasticity associated with cognitive interventions. This presentation will provide examples of collaborative research in each of these three areas that demonstrate the use of multiple imaging modalities (i.e., structural imaging, diffusion tensor imaging [DTI], and functional magnetic resonance imaging [fMRI]), and the impact of findings on clinical care (e.g., clinical surveillance, treatment planning, caregiver education, and development of cognitive interventions). Specific challenges to using neuroimaging in clinical investigations that seek to improve cognitive outcomes among childhood cancer survivors and future directions will be discussed.


Deep learning-based image synthesis for adaptive proton therapy

Author: Jinsoo Uh, PhD

Abstract: Proton beam therapy is a preferred radiotherapy option for pediatric tumors because of its dosimetric advantage in sparing non-targeted critical structures and its potential to minimize late toxicities. However, it is susceptible to uncertainties associated with anatomic changes during the treatment course, such as tumor regression and gain or loss of body weight. This creates a need for on-treatment verification of the proton dose and potentially adaptive replanning, for which multiple imaging modalities are often desired because of modality-specific advantages and limitations, consequently increasing the burden on the patient and clinical staff. MRI provides superior soft tissue contrast, facilitating delineation of the target volume and non-targeted tissues, but its relatively long scan time is unfavorable for motion management, including anesthesia. A CT scan is relatively fast, and the images provide information for proton dose calculation that is not readily available from MRI, but repeated on-treatment scans increase exposure to ionizing radiation. Cone-beam CT (CBCT) is a special type of CT with a lower radiation dose that can be mounted onboard in the treatment room. As such, CBCT is useful for daily verification of patient positioning and anatomic consistency with the treatment plan, but its compromised image quality does not allow an accurate calculation of proton dose.

Recent advances in deep learning-based image synthesis offer unprecedented opportunities for efficient imaging. A deep network trained on previously acquired data enables reconstruction of a high-quality image from limited data acquisition, as well as image-to-image translation across modalities, thereby reducing scan time and the need for multi-modality imaging. While numerous deep learning methods have been proposed to synthesize medical images, few studies have addressed adaptive proton therapy for pediatric patients. This presentation will provide an overview of recent deep learning-based imaging studies at the St. Jude Radiation Oncology Department in the context of adaptive proton therapy: (1) synthesis of CT from CBCT using a cycle-consistent generative adversarial network (CycleGAN) to estimate proton dose from daily CBCT, (2) reconstruction of MRI from undersampled k-space data using a U-net-based method to accelerate MR scanning, and (3) synthesis of relative proton stopping power maps from MRI using an enhanced CycleGAN with an additional consistency loss for MR-guided adaptive proton therapy. Developments addressing unique challenges in radiotherapy-specific pediatric imaging, such as age-dependent variation in body size and the use of compromised MR receiver coils to accommodate immobilization devices, will be highlighted. An automated curation process to identify incoming images on which the deep network is likely to fail will also be discussed.
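
For reference, the standard CycleGAN objective underlying methods (1) and (3) couples two adversarial losses with a cycle-consistency term (textbook formulation; the enhanced variant in (3) adds a further consistency loss on top of this):

  \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{\mathrm{cyc}}(G, F),
  \mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_x\!\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_y\!\left[\lVert G(F(y)) - y \rVert_1\right],

where G: X → Y and F: Y → X are the two generators (e.g., CBCT→CT and CT→CBCT) and λ weights cycle consistency against the adversarial terms.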


COMET: Self-supervised learning and vision transformers improve tumor classification of histopathological images

Author: Ben Lansdell, PhD

Co-Authors: Abbas Shirinifard, PhD

Abstract: Characterizing histopathologic features in whole-slide images (WSIs) is an important problem in computational pathology. Yet whole-slide histopathology images are huge, often a gigapixel or more, which poses challenges to traditional computer vision methods. How can relevant features be extracted from such images, at the relevant scale, in order to serve machine learning tasks like tumor subtyping, survival prediction, and response-to-treatment prediction? Additionally, such tasks are generally only weakly supervised: only a slide-level label is available.

Two promising approaches are: self-supervised learning (SSL), which provides highly data-efficient representation learning, making it appropriate for this weakly supervised case; and vision transformers, which, like their natural language counterparts, have provided state-of-the-art performance across many applications, particularly where context-dependent attention mechanisms are needed to provide targeted feature extraction. In this study we show that these two approaches can be used to improve slide-level tumor classification, demonstrating the utility of these methods in histopathology.

First, using the COMET histology dataset, comprising over 3000 WSIs, we use a contrastive learning framework, based on SimCLR, to extract patch-level features. We show that this approach improves slide-level separability as measured by a simple kNN classifier, when compared to a pretrained feature extraction method. Thus self-supervised learning can be used to extract features relevant for downstream tasks. 
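
For background, SimCLR-style frameworks train with the NT-Xent contrastive loss (the standard formulation; training details in this study may differ): for a positive pair of embeddings (z_i, z_j) from two augmented views of the same patch,

  \ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)},

where sim denotes cosine similarity, τ is a temperature, and the sum runs over the other embeddings in a batch of N augmented pairs.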

We then train a BERT-like, encoder-only transformer to improve tumor classification. The transformer takes as input patch-level features extracted with our SSL model and randomly samples a set of patches from each slide, which serves as a form of data augmentation. We show that this transformer improves performance over our baseline, highlighting the importance of context-dependent feature extraction in tumor classification.

Our study demonstrates the effectiveness of self-supervised learning and vision transformers in improving slide-level tumor classification, offering a promising direction for further research in computational pathology. 


COMET: Collaborative AI in Image Analysis: Accelerating Research and Enhancing User Experiences 

Author: Abbas Shirinifard, PhD

Co-Authors: Ben Lansdell, PhD

Abstract: The COMET dataset, the largest collection of pediatric histopathological slides and methylation data, presents unparalleled opportunities for clinical research and advancements. In this talk, we will delve into the main objectives of our team in this project, which encompass: 1) developing AI-powered tools to enable image content search, thereby enhancing the user experience within the COMET portal, and 2) creating tailored image analysis methods to address specific biological questions and benchmark their performance. We are committed to sharing these tools and methods with the broader research community and actively seeking collaboration. Histopathological slides have proven invaluable for a wide range of clinical applications, including tumor classification and subtyping, predicting patient outcomes, and risk stratification. By leveraging the COMET dataset and fostering a collaborative environment, our team aims to accelerate research and drive new discoveries in pediatric histopathology.


COMET: A Comprehensive Resource for Histological Imaging and Methylation Profiling Data from Pediatric Solid Tumors in the PeCan Knowledgebase on St. Jude Cloud

Author: Clay McLeod, MSE

Co-Authors: Alex Gout, PhD; Stephanie Sandor, MS; Delaram Rahbarinia, MS; Jobin Sunny, MS; Lucian Vacaroiu; James Madison; Kevin Benton, MS; Michael Macias, MS; Samuel Brady, PhD; Wentao Yang, PhD

Abstract: The Comprehensive Methylation Database (COMET) represents a highly valuable resource for pediatric oncology research, comprising histological imaging and methylation profiling data from over 5,000 pediatric solid tumors. To share these data, we have developed an interactive histology slide image resource within the Pediatric Cancer Knowledge Base (PeCan) available on St. Jude Cloud. This resource enables exploration of the entire compendium of COMET histological slide imaging data to facilitate unique insights into tumor biology. Researchers can search for slides of interest via our search/filtering interface, which enables stratification of slides based on tumor diagnosis, subject age, tumor status, and other clinical features. Following filtering, returned slides are displayed and, upon selection of a slide of interest, associated sample metadata can be viewed alongside the image, which can be further explored via our browser-based slide viewer. Importantly, viewer state is preserved in shared links to the slide viewer page, allowing seamless collaborative investigation among researchers. We are currently developing further features to enhance the analysis of COMET histology slides, including an image-based slide search feature to enable analysis of histological characteristics across the entire collection of solid tumor slide images. Further, we are developing an interactive resource for COMET methylation data, which will soon be available on the St. Jude Cloud Genomics Platform (https://platform.stjude.cloud), a cloud-based ecosystem for delivering and investigating large sets of biological data. Access to the COMET dataset by KIDS23 researchers can be requested by emailing support@stjude.cloud. As new data are released by the COMET project, we will continue to update the data and develop associated resources within PeCan. In providing this invaluable resource to the scientific community, we aim to foster innovation, collaboration, and discovery in pursuit of breakthroughs in the diagnosis, therapy, and etiological understanding of pediatric solid tumors.


Data-driven insights into the pathogenesis of cerebellar mutism syndrome

Author: Samuel Stuart McAfee

Abstract: Cerebellar mutism syndrome (CMS) is a surgical complication seen in some patients with posterior fossa tumors that affects speech, motor control, and emotional regulation. While CMS is known to be caused by surgical injury of some type, the exact neurologic cause of the disorder is not completely understood. In theory, data-driven analyses of postoperative MR images could help identify common injuries associated with the disorder, but the rarity of CMS necessitates careful planning in how these analyses are carried out, generally precluding the effective use of machine learning approaches on unprocessed brain images.

We aimed to identify common injury in CMS patients by transforming raw MR images into normalized, binarized, low-dimensional representations of estimated surgical injury prior to analysis. Here we will discuss the steps taken to achieve this image transformation (including customized normalization and lesion detection algorithms for postoperative MR images), the analyses performed after transformation (including voxel-wise, map-wise, and graph-theoretic analyses), and the new insights into CMS pathogenesis that this study has helped reveal.
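
A generic voxel-wise association test on such binarized injury maps can be sketched in R as follows (an illustrative approach with toy data; the study's specific analyses may differ):

  # lesions: patients x voxels binary matrix; cms: outcome per patient.
  set.seed(1)
  lesions <- matrix(rbinom(20 * 100, 1, 0.2), nrow = 20)  # 20 patients, 100 voxels
  cms <- rbinom(20, 1, 0.4) == 1
  pvals <- apply(lesions, 2, function(v) {
    tab <- table(factor(v, levels = 0:1), factor(cms, levels = c(FALSE, TRUE)))
    fisher.test(tab)$p.value          # injury vs. CMS status at this voxel
  })
  sig <- which(p.adjust(pvals, method = "fdr") < 0.05)    # multiplicity control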


Genomic Variants in Pediatric Cancer

Author: Jinghui Zhang

Abstract: Genomic variants define population diversity, confer genetic predisposition, and form molecular drivers of cancer and other diseases. Innovative method development aimed at achieving high sensitivity and accuracy for variant detection is fundamental to our understanding of the pediatric cancer genome, which has a landscape distinct from that of adult cancer. These approaches have led to discoveries of new targets, mutagenesis and drug resistance mechanisms, and therapy-related clonal hematopoiesis through the Pediatric Cancer Genome Project (PCGP), NCI TARGET, Genome4Kids clinical genomics, and St. Jude Life survivorship programs. We will also share lessons learned along the way, as well as a perspective on emerging technologies that can shape future investigation.


Challenges of linking multi-platform microbiomes and longitudinal bulk transcriptomes to autoimmunity 

Author: Qian Li, PhD

Abstract: Background: Many clinical studies in immunology use multi-platform sequencing to profile participants’ gut microbiota, seeking both robustness and high resolution in analysis. Existing community-level association tests are based on selected distance matrices constructed from phylogenetic tree information and/or taxa abundance, but their power may vary across distance metrics and profiling approaches. Another high-throughput technology commonly used in large-scale studies is bulk RNA sequencing. Bulk transcriptomes are mosaics of signals from multiple purified cell types. Despite the numerous algorithms developed for cell type deconvolution, inter-patient heterogeneity in pure-cell reference profiles has rarely been considered in longitudinal RNA-seq transcriptomes.

Methods: We employed the angle-based joint and individual variation framework to integrate microbiotas profiled by both 16S rRNA and shotgun metagenomic sequencing, using multiple phylogenetic distance metrics. The sample-wise information shared by the two platforms and the distinct distance matrices is extracted as normalized scores free of platform- or distance-specific noise. For longitudinal RNA-seq deconvolution, we developed ISLET (Individual Specific celL typE referencing Tool) to recover the pure-cell gene expression signature matrix for each participant via a mixed-effects model, and then perform cell-type-specific differentially expressed gene (csDEG) testing from bulk data.
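
Schematically, a subject-specific mixed-effects deconvolution model of this kind can be written as (our notation; ISLET's exact specification may differ)

  y_{gj} = \sum_{k=1}^{K} \pi_{kj}\left(\mu_{gk} + b_{gk}^{(i(j))}\right) + \epsilon_{gj},

where y_{gj} is the bulk expression of gene g in sample j, \pi_{kj} are the cell-type proportions, \mu_{gk} is the population reference for cell type k, b_{gk}^{(i(j))} is a random effect for the subject i contributing sample j, and \epsilon_{gj} is noise; repeated samples per subject in a longitudinal design are what make the subject-specific references recoverable.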

Results: The integration of multi-platform microbiomes successfully stratified the risk of early-onset autoimmunity and type 1 diabetes in a pediatric study, while ISLET improved the detection of csDEGs for autoimmunity and the quantification of Natural Killer cell abundance in this study.


Gene regulatory networks of the developing human brain: integrative multiomic analysis

Author: Jasmine Plummer, PhD

Abstract: Gene dysregulation in key brain structures during development can result in neurodevelopmental disorders (NDD). Interpreting NDD risk genes is further complicated by the heterogeneity of their clinical presentation. Hence, understanding their regulation in relation to where NDD genes are expressed spatially during brain development is crucial to resolving how their varied clinical phenotypes emerge. To understand NDD genes and their regulation during normal brain development, we used a multiomic and computational approach to construct a gene regulatory network (GRN) of NDD by integrating epigenomic profiling (ChIP-seq, eY1H, and snATAC-seq) with single-cell and spatial RNA-seq from the developing brain. These multiomic approaches provide a framework for understanding how single-cell epigenome/transcriptome and spatial data give valuable insight into the spatial representation of cell populations that influence NDD and disease pathogenesis in general.


Prediction of tumor types using large multiclass digital pathology models

Author: Brent Orr, MD, PhD

Co-Authors: Md Zhangir Alom, PhD; Quynh Tran, PhD

Abstract: Hematoxylin and eosin-stained sections remain the gold standard for histologic diagnosis of cancer. With the introduction of whole slide image scanners and advances in computer vision, methods have emerged for classification and risk prediction directly from digital pathology images. These approaches could be used to help triage additional testing or even as low-cost surrogates for molecular testing in resource-poor centers. Unfortunately, the adoption of these methods in the clinical setting has been limited to isolated use cases due to technical challenges. The greatest clinical benefit would be realized by training and implementing large multiclass classification models with representation from most tumor types encountered by practicing pathologists. However, a major hurdle for production and implementation of such models is the size of digital images, which may exceed 1 gigabyte per slide and result in multiple terabytes of training data. We have established methods to train large histopathology image models and applied them to a 74-class brain tumor dataset using multiple state-of-the-art architectures. We discuss the potential application of similar methods to the St. Jude COMET dataset, a large corpus of whole slide images from pediatric solid tumors. We anticipate that our approach will enable the development of robust and accurate cancer classification models that can be used to augment clinical diagnosis.
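
The scale problem can be made concrete with a quick tiling calculation in R (illustrative numbers, not the actual training configuration):

  # Gigapixel slides are processed as tiles because whole images exceed memory.
  slide_w <- 80000; slide_h <- 60000   # ~4.8 gigapixels
  tile <- 512
  grid <- expand.grid(x = seq(0, slide_w - tile, by = tile),
                      y = seq(0, slide_h - tile, by = tile))
  nrow(grid)                           # ~18,000 tiles for this one slide
  # Each tile is encoded to a feature vector; slide-level labels then
  # weakly supervise an aggregator over the tiles.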


Challenges in infectious disease data analysis

Author: Peijun Ma, PhD

Abstract: In this short presentation, we will discuss the challenges that researchers face in the data analysis of infectious diseases. Two specific examples will be explored. Firstly, clinical microbial isolates are known to exhibit high levels of diversity, even within the same species. This can make it extremely challenging to compare transcriptomics data from different isolates or strains. Secondly, in microbiome research, there is often a disconnection between metagenomic and transcriptomics data, particularly in the context of single-cell and metatranscriptomic analysis. These challenges present significant obstacles to the data analysis of infectious diseases, and therefore require careful consideration and resolution. By examining these challenges, we hope to gain a deeper understanding of the complexities involved in microbiological data analysis, and ultimately find ways to overcome them.


Modularized Molecule Making

Author: Daniel Blair, PhD

Abstract: Modular automated platforms have enabled significant advances in biopolymer synthesis, including RNA, DNA, proteins, and glycans. For instance, automated DNA synthesis has revolutionized the field of genomics, enabling the production of millions of DNA sequences in parallel for applications such as gene editing and synthetic biology. Automated protein synthesis has similarly expanded the toolbox of molecular biology, allowing for the production of proteins with precise amino acid sequences and modifications for use in drug discovery and materials science. Glycan synthesis has also benefited from modular automation, providing access to diverse glycans for studying their roles in biology and developing glycan-based therapeutics.

However, despite the importance of small organic molecules as drugs, functional materials, and crop protectants, a similar revolution in their automated synthesis has not yet been realized. Current methods for small molecule synthesis rely heavily on manual labor and require significant time and resources. Developing modular automated platforms for small molecule synthesis has the potential to significantly accelerate the discovery and production of new molecules with diverse functions and properties. This will enable the development of new drugs for treating diseases, materials for advanced technologies, and crop protectants for sustainable agriculture.

Our quest for developing modular automated platforms for small organic molecule synthesis has recently taken a major step forward. In this talk, we are excited to describe our development of a self-contained automated synthesis platform, which can efficiently create small organic molecules from bench-stable building blocks. The evolution of such platforms represents a crucial inflection point in the study of small organic molecules, transitioning from a labor-intensive, data-poor environment to one that empowers researchers to ask big, bold, and data-driven questions.

Wider implementation of such automated synthesis platforms will bring about a significant change in how we approach the study and development of small organic molecules. By streamlining the synthesis process, these platforms will make it possible to explore a much larger chemical space, enabling researchers to discover and optimize new compounds with unprecedented speed and efficiency. Ultimately, this could lead to the development of new drugs, materials, and crop protectants with enhanced properties and capabilities.


Exploring new frontiers in chemical space

Author: Anang Shelat, PhD

Abstract: The number of molecules that possess "drug-like" properties, and therefore have the potential to manipulate biological function, is on the order of the number of atoms in the observable universe. How can we efficiently navigate this vast chemical space? In the Computational and Chemical Biology session, we will learn how scientists at St. Jude are developing new methods to explore chemical space both virtually and synthetically, and how they interrogate the combinatorially larger space of drug-drug interactions.

Understanding Leukemia Evolution at Single-Cell and Single-Gene Resolution

Author: Jian Xu, PhD

Abstract: Cancers evolve as a consequence of the accumulation of somatically acquired mutations, and their malignant properties reflect the functional cooperation of these mutations. Genetic interactions are central to the selection of variant subclones during cancer evolution, resulting in the acquisition of biologic attributes that drive cancer progression and pathogenesis. This is evident in acute myeloid leukemia (AML) and the preleukemic myelodysplastic/myeloproliferative neoplasms (MDS/MPN), a group of genetically and clinically heterogeneous hematological diseases. Although co-occurring somatic mutations are frequently detected in MDS/MPN and AML patients, it remains elusive how distinct oncogenic drivers cooperate to dysregulate gene expression, cellular differentiation, and disease progression. We developed genetically engineered mouse models harboring mutations in signaling (NRasG12D) and epigenetic (EZH2) regulators commonly found in human hematopoietic malignancies. These preclinical models allow not only the identification of molecular pathways controlling MPN progression to acute leukemia, but also the analysis of the functional cooperation between distinct oncogenic drivers in disease pathophysiology. We employed single-cell transcriptomic profiling to map cellular composition and gene expression alterations in healthy or diseased bone marrow during leukemogenesis. At the cellular level, NRasG12D induces myeloid lineage-biased differentiation and EZH2 deficiency impairs myeloid cell maturation, whereas the two cooperate to promote myeloid neoplasms with dysregulated transcriptional programs. At the gene level, NRasG12D and EZH2 deficiency independently and synergistically deregulate gene expression. We integrated results from histopathology, leukemia repopulation, and leukemia-initiating cell assays to validate transcriptome-based cellular profiles. We used this resource to relate developmental hierarchies to leukemia phenotypes, evaluate oncogenic cooperation at single-cell and single-gene levels, and identify GEM as a new regulator of leukemia-initiating cells. Our studies establish an integrative approach to evaluate the functional cooperation between distinct oncogenic drivers at single-cell and single-gene resolution in vivo.


Workshops

Visual Block-Based R Programming for Data Science

Workshop leaders:  Andrew Olney; Franz Parkins, PhD; Motomi Mori, PhD

Abstract: Participants in this workshop will learn and use a blocks-based programming language to solve data science problems in the JupyterLab computational notebook environment. Blocks-based programming allows users unfamiliar with programming to manipulate blocks, which fit together like puzzle pieces, to generate R code. This removes some of the burden of learning to program (memorizing syntax, fixing syntax errors, etc.) and allows users to focus on solving data science problems. This workshop covers introductory topics: loading data from files, basic data manipulation, plotting, and descriptive statistics. A free companion online course that takes the same approach to more advanced topics will be offered in the near future.
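
For a sense of the kind of R code the blocks generate (illustrative; actual generated code, file names, and column names will vary):

  df <- read.csv("measurements.csv")   # load data from a file (hypothetical file)
  head(df)                             # inspect the first rows
  summary(df)                          # descriptive statistics
  hist(df$value)                       # a basic plot (assumes a numeric 'value' column)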


Data Manipulation using R dplyr

Workshop leaders: Andrew Olney; Motomi Mori, PhD; Franz Parkins, PhD

Abstract: Participants in this workshop will learn a variety of data manipulation techniques using the popular R package dplyr (and friends), including selecting rows, selecting columns, pivoting, and summarizing data, all inside the JupyterLab computational notebook environment. Participants highly familiar with R will work directly with R code; participants less familiar with R can use a blocks-based programming language that generates R code. Blocks-based programming allows users unfamiliar with programming to manipulate blocks, which fit together like puzzle pieces, removing some of the burden of learning to program (memorizing syntax, fixing syntax errors, etc.) and allowing users to focus on solving data science problems. A free companion online course that takes the same approach to more advanced topics will be offered in the near future.
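
A small taste of the techniques covered, with invented data and column names:

  library(dplyr)
  library(tidyr)
  df <- tibble(id = 1:4, group = c("a", "a", "b", "b"),
               t1 = rnorm(4), t2 = rnorm(4))
  df %>%
    filter(group == "a") %>%                    # selecting rows
    select(id, t1, t2) %>%                      # selecting columns
    pivot_longer(c(t1, t2),
                 names_to = "time",
                 values_to = "value") %>%       # pivoting
    group_by(time) %>%
    summarize(mean_value = mean(value))         # summarizing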


Advanced Image Segmentation with ilastik and Omnipose

Workshop leaders: Mia Panlilio, PhD; Krishnan Venkataraman, MSE; 

Abstract: This mini-course introduces two state-of-the-art interactive software applications for image segmentation: ilastik [1] and Omnipose [2]. Both packages have been designed with biomedical imaging modalities in mind, provide high performance, and are community developed and supported. In this workshop, participants will use classical machine learning, deep learning, and data analytics to easily segment, classify, and extract useful features from biomedical images. No coding is required. Familiarity with basic image manipulation/analysis is preferred but not required.

The workshop has two parts:

  1. Motivation and theory behind ilastik and Omnipose with demonstration on instructor-provided datasets
  2. Hands-on image segmentation using ilastik and Omnipose on datasets provided by participants

References:

[1] Berg, S. et al. (2019) ilastik: interactive machine learning for (bio)image analysis. Nat Meth 16: 1226–1232. https://www.ilastik.org/index.html

[2] Cutler, K.J. et al. (2022) Omnipose: A high-precision morphology-independent solution for bacterial cell segmentation. Nat Meth 19: 1438–1448. https://doi.org/10.1101/2021.11.03.467199


An introduction to developing re-usable interactive reports and dashboards via R Shiny

Workshop leader: Jared Andrews, PhD

Abstract: This workshop will provide an introduction to the development of re-usable interactive reports and dashboards using R Shiny.

The learning goals of the workshop are for participants to:

  1. recognize how interactive, re-usable tooling can help them work more efficiently and effectively, 
  2. understand how such tooling can empower laboratory scientists to better understand and investigate their data to derive meaningful insights, 
  3. and gain a strong foundational knowledge of basic R Shiny functionality and development that they can incorporate into their own work.

To fulfill these goals, the workshop will contain four hands-on modules:

Module 1 (30 minutes) - A brief, high-level overview of Shiny's capabilities and limitations and a primer on the concept of reactivity.

Module 2 (30 minutes) - Construction of a simple Shiny application from scratch.

Module 3 (90 minutes, breaks included) - Developing a more complex, re-usable Shiny application to investigate PCA results capable of being deployed as a standalone application or embedded within an Rmd notebook as part of a larger analysis.

Module 4 (30 minutes) - Application/report deployment and sharing via Posit Connect and shinyapps.io.

To best benefit from this workshop, basic R knowledge is expected, but no experience with Shiny is required. 
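
In the spirit of Module 2, a minimal reactive Shiny app looks like this (an illustrative sketch, not workshop material):

  library(shiny)
  ui <- fluidPage(
    sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
    plotOutput("scatter")
  )
  server <- function(input, output) {
    output$scatter <- renderPlot({
      plot(rnorm(input$n), rnorm(input$n))  # re-renders whenever n changes
    })
  }
  shinyApp(ui, server)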


Introduction to St Jude’s Machine Learning and Artificial Intelligence Platform

Workshop leader: Chad Burdyshaw, PhD

Abstract: This workshop will introduce users to St Jude’s purpose-built architecture and operating environment for Machine Learning and Artificial Intelligence (ML/AI) applications. In this workshop you will learn how to access the ML/AI environment and how to transfer files to and from our HPC filesystem, and we will work through building an application with a BERT Natural Language Processing (NLP) model on GPU hardware to demonstrate the capabilities and features of our Cnvrg.io MLOps platform.

These features include a web-based GUI, a command-line interface, file and dataset versioning, interactive and batch processing, pipelining of multiple applications into a workflow, and endpoint serving and deployment.

To ensure we have sufficient resources available at the time of this workshop, we are limiting enrollment to 24 attendees. 

Please contact chad.burdyshaw@stjude.org for questions, and to ensure that you have an account on both the HPC and ML/AI clusters.


Single Cell Analysis for Bench Scientists

Workshop leaders: Susanna Downing, MS; Lei Li, PhD

Abstract: This workshop will cover various analysis techniques used for single cell data. We will provide introductions to methods for quality control of data, clustering and cluster annotation, dimensionality reduction, and differential expression analysis. Participants will gain exposure to popular software tools such as Seurat and Scanpy, as well as Bioconductor workflows. The first half will cover background information and basic theory behind common analysis techniques. During the second half, we will split into groups based on interest. Suzy Downing will guide users through a sample analysis of cortical brain tissue utilizing Jupyter notebooks to work through basic analysis workflows. Lei Li will guide users through BCR/TCR analysis using graphical interface tools, while introducing useful R packages so that advanced users can customize their research. No experience is necessary to join, but we ask that in-person participants bring laptops.
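
As a preview, a condensed Seurat-style pass over the steps covered might look like the following ('counts' stands in for a genes-by-cells matrix you would load yourself; this is an illustration, not the workshop notebook):

  library(Seurat)
  obj <- CreateSeuratObject(counts = counts)   # 'counts' is your own matrix
  obj <- NormalizeData(obj)                    # normalization after QC
  obj <- FindVariableFeatures(obj)
  obj <- ScaleData(obj)
  obj <- RunPCA(obj)                           # dimensionality reduction
  obj <- FindNeighbors(obj, dims = 1:20)
  obj <- FindClusters(obj)                     # clustering
  obj <- RunUMAP(obj, dims = 1:20)             # 2D embedding for visualization
  markers <- FindAllMarkers(obj)               # differential expression per cluster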


Genome-wide Association Study (GWAS): Basic Principles and Tools

Workshop leaders: Cheng Cheng, PhD; Wenjian Yang, PhD

Abstract: Genome-wide Association Study (GWAS) is a contemporary research method that deals with detecting associations between individual or groups of genotypes and phenotypes, across possibly the entire genome. This short course on GWAS consists of two parts. In the first part, we will discuss essential statistical principles underlying GWAS, including determination of proper statistical models, statistical inference and testing of associations, genome-wide significance and inference methods for massive multiple hypothesis tests, considerations in variable/model selection, and, if time permits, the issue of population structure or ancestry. Most of these statistical principles are also applicable to other omics-wide analyses beyond GWAS. In the second part, we introduce the most popular and effective tool for performing a GWAS, the PLINK software package. Introductions to installation, setup, data file types and formats, features and functions, R coding, visualization (e.g., the Manhattan plot), and reporting will be provided. Other relevant software tools will also be briefly introduced. Both parts of the lecture will be supported by concrete examples and publications.
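
As one example of the visualization covered, a Manhattan plot can be drawn in R with the qqman package (our choice of package, not necessarily the course's; shown with qqman's bundled example data frame gwasResults, which has SNP, CHR, BP, and P columns):

  # install.packages("qqman")
  library(qqman)
  manhattan(gwasResults, genomewideline = -log10(5e-8))  # genome-wide threshold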

 
 