St. Jude Family of Websites
Explore our cutting edge research, world-class patient care, career opportunities and more.
St. Jude Children's Research Hospital Home
Title
Automated data archival/retrieval for Azure Cloud pipelines
Category
Data Management
Challenge
The Clinical Genomics analysis pipelines have been ported to the Azure cloud, where a hot filesystem is expensive. To minimize cost, we'd like automation that keeps the active data needed for analysis and reporting hot, and automatically initiates archival and deletion for data that has been reported. We'd also like to handle ad-hoc requests from analysts who need to look at older data. We have manual solutions and wrappers around azcopy, etc., but nothing fully automated and easy to manage.
Benefit
Moving data only after an analysis has been attempted and the data found missing entails a 10-12 hour delay in the analysis while cold storage is rehydrated. An automated system would eliminate those delays by preemptively starting the rehydration as soon as the order for an analysis is placed.
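The decision logic described above can be sketched independently of any Azure SDK. The following is a minimal illustration in Python (the challenge itself suggests Go for the real service); the `Dataset` fields, the `GRACE_PERIOD` value, and the action names are all hypothetical and would need to be replaced by whatever the pipeline actually tracks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical policy knob: how long reported data stays hot before archival.
GRACE_PERIOD = timedelta(days=14)


@dataclass
class Dataset:
    name: str
    tier: str                        # "hot" or "archive"
    reported_at: Optional[datetime]  # None until the case has been reported
    has_open_order: bool             # an analysis order or ad-hoc request is pending


def plan_action(ds: Dataset, now: datetime) -> str:
    """Return "rehydrate", "archive", or "keep" for one dataset."""
    if ds.has_open_order:
        # Start rehydration when the order is placed, not when the analysis fails.
        return "rehydrate" if ds.tier == "archive" else "keep"
    if ds.tier == "hot" and ds.reported_at is not None:
        if now - ds.reported_at >= GRACE_PERIOD:
            return "archive"
    return "keep"
```

A scheduler could call `plan_action` for every tracked dataset and hand the resulting actions to the existing azcopy wrappers or to SDK calls.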
Helpful Tools, Packages, or Software
Compbio has a set of tools already built for managing archives and recalls manually. The best software for coding up the Azure calls is probably the MS Go library. We'd like the automation to run as a server on our Azure cluster, so Go is likely the best tool to use for the project.
Test Data
NA
Title
Image data management system
Category
Data Management
Challenge
This challenge provides an opportunity for 1 to 3 teams to develop a prototypical Image Data Management System (IDMS). The challenge includes the following components:
Sub-Challenge 1: IDMS Front-end Dashboard Prototype
Using React and MaterialUI (or Angular), develop a prototypical dashboard that a user can log in to and access their image collections and corresponding metadata. The prototype should support visualization of images provided by any underlying tool (such as the CBI renderer) by displaying a TIFF, JPEG, or PNG file. Metadata should be accessible to some degree in the front end. List of possible use cases:
1. Use case 1: As a user I want to log in to the IDMS system and review my image collections and corresponding metadata.
a. This base use case requires using the API developed as part of Challenge 2. To facilitate development of the dashboard, the Challenge 1 team should be able to take the idms_data.json file (to be provided) and populate the dashboard with that information while Challenge Team 2 is developing the database and API. Abstract out the acquisition of the data (file-based or API-based) used to populate the interface to facilitate this parallel development.
2. Use case 2: As a user I want to update select metadata in the interface and populate the backend database.
a. Metadata tied to a backend tool (such as the renderer) should be updated in that tool's data backend using that tool's RESTful API. The backend update is to be coordinated and provided by Challenge Team 2.
3. Use case 3: As a user I want to retrieve an image from a collection based upon a specific image or tile id.
a. View an image provided by the backend tool within the user interface. The user should be able to select an image collection and specify retrieval of a specific image via the RESTful API for display within the dashboard.
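The idea in use case 1 of abstracting data acquisition so the dashboard can be fed from a file first and from the API later can be sketched as follows. The sketch is in Python purely for illustration (the dashboard itself would express the same pattern in JavaScript), and the JSON layout with a top-level `collections` list and a `name` field is an assumption, since the real `idms_data.json` schema is not given here.

```python
import json
from typing import Any, List, Protocol
from urllib.request import urlopen


class CollectionSource(Protocol):
    """Anything that can hand back the parsed IDMS payload."""

    def load(self) -> Any: ...


class FileSource:
    """Reads the payload from a local JSON file such as idms_data.json."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> Any:
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)


class ApiSource:
    """Reads the same payload from a REST endpoint (URL is supplied by the caller)."""

    def __init__(self, url: str) -> None:
        self.url = url

    def load(self) -> Any:
        with urlopen(self.url) as resp:
            return json.load(resp)


def collection_names(source: CollectionSource) -> List[str]:
    # Hypothetical schema: {"collections": [{"name": ...}, ...]}
    data = source.load()
    return [c["name"] for c in data["collections"]]
```

Because `collection_names` depends only on `load()`, swapping `FileSource` for `ApiSource` once the backend is ready requires no change to the display code.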
Sub-Challenge 2: IDMS Backend Database and RESTful API Prototype
In collaboration with the team on Challenge 1, create a backend database and REST API using Django and the Django Rest Framework (or similar). This API will be consumed by the team developing the frontend dashboard. It should also communicate, via an existing REST API, with a tool backend for accessing image data. The use cases to develop are listed below.
1. Use case 1: As a developer, I need to provide a REST API to be utilized by Challenge Team 1.
a. This REST API should provide the following support:
i. Retrieval of the list of users, projects, image collections, and image collection metadata for display in the dashboard.
ii. Update of metadata in the database and in the backend tool database.
iii. Retrieval of image data from the backend tool based upon a generic frontend API query.
2. Use case 2 (optional): As a developer, I want to containerize the application for ease of deployment via Docker/Singularity.
Sub-Challenge 3 (optional): GraphQL API
Similar to challenge 2 but the team must develop a GraphQL API instead of a REST API. Compare the tradeoffs of using a GraphQL API approach vs a REST API approach. Use cases:
1. Use case 1: As a developer I need to provide a GraphQL API similar to the API in Challenge 2.
a. Work with both Challenge Team 1 and Challenge Team 2 to design and implement the API.
2. Use case 2: As a developer I want to work with Challenge Team 1 to implement this API in the dashboard.
Benefit
Provide uniform access to image data across a variety of backend toolsets.
Helpful Tools, Packages, or Software
Python, Django, Django Rest Framework, React, MaterialUI, Javascript, HTML, CSS, CBI Renderer, Docker, SQLite, PostgreSQL
Test Data
Access to the CBI renderer tool, or a stand-in, is needed.
Title
St. Jude Travel Together
Category
DevOps and Community
Challenge
St. Jude Children’s Research Hospital is one of the largest employers in Memphis. Every day, thousands of employees flock to downtown Memphis to support its mission. However, rising gas prices, impromptu car issues, and environmental concerns present challenges for this commute.
To help the St. Jude community address this challenge, I suggest developing infrastructure that offers two alternative options to St. Jude employees through a new app called St. Jude Travel Together (SJTT):
1. Group ride. SJTT will support a network of bike commuters in defining pick up points/times along bike paths to/from St. Jude so people can join a biking group and feel safer riding together to work.
2. Ride Share. This branch of SJTT will function like popular, commercial ride-share apps but will be tailored to the St. Jude community:
Benefit
See above
Helpful Tools, Packages, or Software
Android SDK, Swift, etc.
Test Data
NA
Title
Scaling CBI image analysis pipeline by leveraging HPC resources
Category
DevOps and Community
Challenge
The CBI currently provides a JupyterHub service where users can run image analysis pipelines on one of the two CBI workstations that host the JupyterHub instances. We would like to scale up the computation by using the HPC while keeping the interactivity of Jupyter notebooks and minimizing the additional time spent packaging/containerizing the analysis pipelines.
Benefit
For CBI members: minimize the time spent on packaging/containerizing and focus on development of analysis pipelines; for St. Jude researchers: after scaling up, more users will be able to make use of the image analysis tools/pipelines developed by the CBI.
Helpful Tools, Packages, or Software
Dask, Ray, JupyterLab on the HPC
Test Data
An example pipeline will be provided.
Title
Increasing the reliability and resiliency of the Image Processing Pipeline (IPP) plugin
Category
DevOps and Community
Challenge
We have developed an Image Processing Pipeline (IPP) to run any workflow and access it via a web browser. IPP helps research staff who do not have a computational background run a workflow without knowing the underlying details or programming. It also monitors jobs and allows one to view job progress in real time. This is a unique solution, but we need to increase its reliability and resiliency by integrating a load balancer and replicated instances.
Benefit
This will allow the St. Jude Scientific Community to run imaging analysis workflows from a web browser. It will eliminate the burden of needing computational skills to run these analyses. The proposed improvements to the pipeline will improve analysis speed and increase robustness of the system.
Helpful Tools, Packages, or Software
Load balancer, Kubernetes. The platform is built with Python, Flask, and MongoDB, so knowledge of those technologies will be helpful.
Test Data
Access to the IPP will be provided to those interested.
Title
Customizable Fiji menu interface
Category
GUI Tool Development
Challenge
FIJI is an open-source, extensible image processing and analysis application geared towards the life sciences. FIJI is very powerful due to a large developer community. As a result, FIJI's core menu system is now three layers deep, encompassing ~600 commands. The ‘Plugins’ menu provides an additional ~500 commands. Unfortunately, this myriad of functions is a double-edged sword: novice end users quickly become lost and frustrated by the many options, most of which are irrelevant for basic needs. As a result, many end users (>50%) insist on using more user-friendly, but also far more expensive, commercial packages.
The challenge is to make the FIJI menu system fully customizable. The goal is not to remove any functionality, but simply hide rarely used commands from the end user. Thus, based on user preference, FIJI could be run in an 'Expert Mode' where all commands are shown (the current default) or a 'Simple Mode' which displays only a simplified (and customizable) menu system. The ‘Simple Mode’ would then be curated for the specific user base in question. This would be especially useful in any of our imaging cores, or even for lab-based setups where many different levels of users are present.
While customizable toolsets and buttons (ActionButtons plugin) are currently possible, these are icon-based and so work well as shortcuts for only a very few commands. A fully graphical GUI has also been proposed (http://www.imagejfx.net/) but was seemingly never completed (and is not backwards compatible). Making FIJI feel more accessible to the end user will greatly help its full power and potential efficiencies to be realized!
Benefit
Decrease cost and managerial overhead. Make users' imaging-related research more efficient.
Helpful Tools, Packages, or Software
Fiji
Test Data
NA
Title
Summary reporting of estimated radiation toxicity risks
Category
GUI Tool Development
Challenge
Radiation therapy treatment planning is a delicate balance between delivering enough dose to the tumor to provide disease control while not delivering too much dose to nearby healthy tissues. Most organs have well-established guidelines on maximum allowable doses given some acceptable risk (as determined by the MD). There is currently no way to visualize that guidance, estimate the risk given the current radiation therapy treatment plan, and provide a summary for the patient's chart. We would like to develop a simple GUI and reporting system that estimates radiation toxicity risks and prints summary reports to be used in survivorship plans for patients. A good start is developing a prototype that is capable of doing this for one or two organs with published toxicity models that can be easily implemented.
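As one example of a published toxicity model that is easy to implement, the sketch below uses the Lyman-Kutcher-Burman (LKB) formulation: a differential dose-volume histogram is reduced to a generalized equivalent uniform dose (gEUD), which is mapped to a complication probability through a normal cumulative distribution. The challenge does not name a specific model, so the choice of LKB is an assumption, and any parameter values passed in (`td50`, `m`, `n`) must come from the literature for the organ in question; none are supplied here.

```python
import math
from typing import Sequence, Tuple

DVH = Sequence[Tuple[float, float]]  # (dose in Gy, fractional volume) bins


def geud(dvh: DVH, n: float) -> float:
    """Generalized equivalent uniform dose of a differential DVH.

    n is the volume-effect parameter (small n: serial organ, n near 1: parallel).
    Bin volumes are normalized so they sum to 1.
    """
    total = sum(v for _, v in dvh)
    return sum((v / total) * d ** (1.0 / n) for d, v in dvh) ** n


def lkb_ntcp(dvh: DVH, td50: float, m: float, n: float) -> float:
    """Normal tissue complication probability under the LKB model.

    td50: uniform dose giving a 50% complication probability.
    m:    slope parameter of the dose-response curve.
    """
    t = (geud(dvh, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

A report generator could evaluate `lkb_ntcp` per organ from mock dose data and print the result next to the corresponding dose guideline.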
Benefit
Setting up successful and resourceful survivorship care plans requires appropriate assessment of therapy risks. A more detailed accounting of the risk coming from radiation therapy could help to best direct care to radiation oncology patients.
Helpful Tools, Packages, or Software
There are many ways this could be approached depending on the skill set of the team: a Shiny app, a Python Dash app, etc. Everything we would need is available.
Test Data
We would generate some mock radiation dose data for testing. The actual toxicity models can be pulled from the literature, likely something from St. Jude Life or the Childhood Cancer Survivorship Series.
Title
Predicted pathway analysis for experimental compounds
Category
GUI Tool Development
Challenge
Succinct Summary: Given a database linking compounds, genes, and pathways, build a tool that will:
Descriptive Summary: The compound discovery team needs a tool that enables querying compounds using a user-provided list of compounds to identify enriched pathways and activity in target genes within those given pathways. In addition, the tool should enable querying of pathway-specific targets to identify a list of compounds that target all start, intermediate, and end-points within a given pathway and their activities on those targets. For example, querying for TLR pathway targets yields a list of compounds that target genes within that pathway + a visual overlay indicates those targets are activated or inactivated by the compound. Bonus points if you can incorporate ranking pathways/compounds based on increased or decreased activity in the pathway!
The tool should provide a visual overlay of the genes targeted by a compound after querying for a pathway and vice versa: 1) querying compound list yields list of pathways + provides an enrichment analysis of the targeted genes to determine which compounds offer the most coverage, activation, or inactivation in a given pathway. In the results of the query, enriched pathways should indicate which targets in that pathway are activated/repressed by a given compound, 2) querying a pathway results in a list of compounds + a coverage/enrichment score for each compound based on the number of genes in that pathway that are targeted and their activities, and 3) after querying, the tool identifies related compounds that target the same pathway and provides a score/readout that allows users to quickly identify if there are any additional compounds to the current query that offer additional coverage for a given pathway.
* There are existing databases linking compounds and gene targets, and we don’t want to replicate these, as keeping them up-to-date is a full time job better left for others. Likewise, we have good tools that quantify pathway enrichment using a gene list. What we are seeking here is an exploratory data analysis tool that lets users upload their list of compounds to check for pathway enrichment, and to visually explore their hits in those pathways.
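The coverage and enrichment scores mentioned above can be illustrated with a small sketch. It assumes compound-to-target and pathway-to-gene mappings are already available as Python sets (in practice they would come from the existing databases), computes per-compound pathway coverage, and applies a one-sided hypergeometric test to the combined target list. The function names and the input layout are hypothetical, and activation versus inactivation is not modeled.

```python
from math import comb
from typing import Dict, Set


def hypergeom_pvalue(hits: int, pathway_size: int, n_targets: int, universe: int) -> float:
    """P(X >= hits) when n_targets genes are drawn at random from `universe`
    genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, n_targets)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, n_targets - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(universe, n_targets)


def pathway_report(compound_targets: Dict[str, Set[str]], pathway: Set[str], universe: int) -> dict:
    """Per-compound coverage of one pathway plus enrichment of the pooled targets."""
    coverage = {
        cpd: len(targets & pathway) / len(pathway)
        for cpd, targets in compound_targets.items()
    }
    all_targets = set().union(*compound_targets.values())
    hits = len(all_targets & pathway)
    return {
        "coverage": coverage,
        "hits": hits,
        "p_value": hypergeom_pvalue(hits, len(pathway), len(all_targets), universe),
    }
```

Sorting compounds by `coverage` gives the "which compound offers the most coverage" view, and comparing the union of two compounds' targets against each alone shows whether a related compound adds coverage.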
Benefit
The solutions to this challenge would greatly benefit researchers attempting to identify compounds that target various biomarkers/markers/pathways in any given experiment and vice versa.
Helpful Tools, Packages, or Software
R, Python, SQL, Spotfire, etc.
Test Data
MSIGDB pathway databases, Novartis set of compounds, St. Jude-provided set of compounds currently available in compound discovery database
Title
ML dashboard for real-time model building using natural language processing on biological sequence inputs
Category
GUI Tool Development
Challenge
ML dashboard for interactive real-time model building on biological sequence inputs
Natural language processing is a highly promising field in artificial intelligence. Efforts are currently underway to extend these language-specific models to the biological domain. Biological sequences (DNA, RNA, protein sequence) can benefit from such models but the current software landscape is scarce. Our goal is to build an interactive dashboard based on Python, that will implement some of these language models on biological sequence data. Users will be able to provide any biological sequences as input (e.g., from a multiple sequence alignment) and then open the interactive dashboard where they can interactively select the language model, define the parameters, and start calculations. Depending on the model selected (and this is something that can be prioritized for the purpose of the hackathon), the user can in real-time change parameters and have the results updated on the screen.
The only solution that currently exists, to my knowledge, is BioSeq-BLM. However, the server interface is quite archaic, web-based, expects users to know all input parameters, and is not at all interactive. Our goal for the hackathon is to select only a subset of the BioSeq-BLM models (perhaps 2 or 3), which we think is sensible given the timeline, and develop a prototype. A prototype for this project would entail a working dashboard that accepts a biological sequence as input, processes it using standard Python packages on the backend, and provides a few applications to interactively explore the results (e.g., two or three interactive plots).
Because building the backend is an involved task, we think users should aim to code the initial solution on Jupyter notebooks and then use programs such as voila to serve the dashboard. This would also provide participants with the unique experience of learning about interactive data exploration, which I assume is not that common in hackathon challenges.
Benefit
We currently have a working 3D sequence clustering dashboard webserver in our group. We would like to add the capability of making selections on these clusters, and then provide the above Python-based dashboard to users to interactively explore the different models. We also think that solutions supported by BioSeq-BLM are useful to computational proteomics and genomics and would find beneficial use by other groups here at St. Jude as well.
Helpful Tools, Packages, or Software
Jupyter Notebook, numpy, scipy, scikit-learn, BioSeq-BLM (the downloaded standalone package), bokeh/plotly.
Test Data
BioSeq-BLM can be used as a reference solution to test and compare.
Title
Reusable R Shiny modules for common plots and data types
Submitter
Jared Andrews
Category
GUI Tool Development
Challenge
R Shiny applications have become quite popular recently, with many publications using publicly available applications to share data and results in an interactive way. These applications can also be extremely useful for exploratory data analysis and figure generation. With proper design and user training, they empower bench scientists to better interact with, interpret, and explore their data.
The development of these applications can be time-consuming, particularly when designing flexible, customizable, interactive visualizations that work with many types of data. Fortunately, there is already a way to create reusable sets of inputs and outputs for Shiny applications: Shiny modules. These modules function as building blocks of an application, making applications much faster to develop and reducing code redundancy across them. As of yet, there is no package that includes a robust, flexible collection of modules for creating common plot types from typical R matrices, dataframes, and Bioconductor classes.
This challenge is to create an R package containing functions for multiple Shiny modules that generate fully customizable, flexible, interactive plots. Examples could include: scatter plots, box & violin plots, bar charts, line charts, etc. Plots should be fully aesthetically customizable (color, text size, axis limits, titles, etc.) using Shiny inputs and preferably interactive via plotly or d3 (with the data shown on hover being selectable). Bonus points for label addition/removal on click. Specific examples of module use would be highly useful (e.g. a volcano plot for DGE results from DESeq2, edgeR, and/or limma-voom). Modules could have useful pre-defined input defaults for specific data types and classes (e.g. SummarizedExperiment, DGEList, etc.) but should also work with generic S3 objects like matrices and dataframes.
Benefit
This would be a large boon to Shiny power users and developers, as it'd simplify application development significantly. This could easily be submitted to Bioconductor for use by the greater scientific community.
Helpful Tools, Packages, or Software
Shiny, Plotly, ggplot2, r2d3
Test Data
Lots of toy R datasets for various viz available - cars, iris, airway, etc. Or just use mock matrices.
Title
Automatic detection, masking, and quantitation of cells using AI
Category
Image Analysis
Challenge
One of the critical features of cancer progression is the cancer cells’ potential to undergo cell transition. Hence, automatically discovering, detecting, and classifying the cell transition would be of immense utility in tracking cancer progression. To this end, automatically quantifying and stratifying signals from various cellular compartments based on cell morphology (e.g., membrane, mitochondria) data would aid in the above goal.
The available solutions are either one-size-fits-all commercial software with automated workflows or open-source software that requires time-consuming trial and error and manual steps to adapt a workflow to the exact experiment. Given recent developments in ML/AI-based image analysis, a solution that uses pattern recognition to explore images based on the patterns they contain would be beneficial.
Benefit
We face this issue directly because we have been using and analyzing the image data in our lab to better understand cancer states. Also, it would be helpful to many other labs at St. Jude to have a standardized one-stop shop for image analysis.
Helpful Tools, Packages, or Software
R (ggplot2, Shiny, Bioconductor), Python (scikit-learn, H2O, TensorFlow, PyTorch, Matplotlib, Django, FastAPI), JavaScript (D3.js, web frameworks, database)
Test Data
MNIST, Cell Image Library, The Broad Bioimage Benchmark Collection (BBBC)
Title
Restoration/imputation to improve analysis of low-resolution legacy MRI image data
Category
Image Analysis
Challenge
Utilize digital super resolution to improve analyses of low-resolution imaging data, especially MRI data. It is very challenging to segment smaller brain structures accurately from low-res MRI. If we can reconstruct the images and make internal details more clear then developing automated solutions to analyze them and detect specific phenotypes becomes much more viable.
Benefit
Enriching low resolution images will enable biologists to perform more granular analyses.
Helpful Tools, Packages, or Software
pytorch, torchvision, sklearn
Test Data
https://github.com/dama-lab/mouse-brain-atlas/tree/master/NeAt/ex_vivo/template
Title
Add ML-assisted image annotation to napari
Category
Image Analysis
Challenge
Add ML-assisted image annotation to napari. Computer-assisted volume annotation helps speed up the generation of expert-curated ground truth for downstream machine learning. Currently, only a few GUI clients support it, most notably 3D Slicer. napari (https://napari.org) is a very flexible and hackable GUI to view and annotate images. Compared to Slicer, napari is easy to customize and to integrate tightly with specific workflows.
Benefit
Adding support for ML-assisted image annotation to napari would provide a simple and accessible platform for future development of customized annotation workflows.
Helpful Tools, Packages, or Software
Test Data
BraTS (https://www.med.upenn.edu/cbica/brats2020/data.html, https://www.kaggle.com/awsaf49/brats20-dataset-training-validation)
Title
Allele counter for SNV, MNV across sequencing targets using SAM API for Rust
Category
Processing Pipelines and Methods
Challenge
We frequently want to confirm SNV/MNV (multi-nucleotide variant)/indel variant calls across different sequencing targets for the same sample (i.e. we want to see that the same variant occurs in the whole genome, exome, and RNA sequences). We typically do this by examining the region in each target and counting the occurrences of the alleles. This is currently done in Perl code for SNVs, and in Python for MNVs and indels, but none of these packages is particularly fast. We now have access to a native SAM API for Rust, and this challenge would be to implement an allele counter for SNVs and MNVs using that package.
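The counting step can be sketched without any SAM library. The example below is in Python only to show the logic (the challenge targets Rust), and it assumes reads have already been decoded into a 0-based start position, a list of CIGAR operations, and the read sequence; the function names are hypothetical. Alleles are extracted per read over the whole variant span, so the bases of an MNV are counted together rather than position by position.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Cigar = List[Tuple[str, int]]      # e.g. [("S", 2), ("M", 8)]
Read = Tuple[int, Cigar, str]      # (0-based reference start, CIGAR, sequence)


def bases_over(pos: int, cigar: Cigar, seq: str, start: int, end: int) -> Optional[str]:
    """Read bases aligned to reference [start, end), or None if not fully covered.

    Deleted reference bases are reported as "-". Inserted bases are skipped,
    so this sketch handles SNV/MNV alleles, not insertions.
    """
    ref, qry = pos, 0
    out = {}
    for op, length in cigar:
        if op in ("M", "=", "X"):            # consumes reference and read
            for i in range(length):
                if start <= ref + i < end:
                    out[ref + i] = seq[qry + i]
            ref += length
            qry += length
        elif op == "D":                      # deletion: reference only
            for i in range(length):
                if start <= ref + i < end:
                    out[ref + i] = "-"
            ref += length
        elif op == "N":                      # skipped region (e.g. intron): no coverage
            ref += length
        elif op in ("I", "S"):               # insertion / soft clip: read only
            qry += length
        # H and P consume neither the reference nor the read sequence.
    if len(out) != end - start:
        return None
    return "".join(out[p] for p in range(start, end))


def count_alleles(reads: Iterable[Read], start: int, end: int) -> Counter:
    """Count the haplotype observed over [start, end) in each covering read."""
    counts: Counter = Counter()
    for pos, cigar, seq in reads:
        allele = bases_over(pos, cigar, seq, start, end)
        if allele is not None:
            counts[allele] += 1
    return counts
```

Running `count_alleles` once per sequencing target (whole genome, exome, RNA) over the same span gives directly comparable allele counts.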
Benefit
We would incorporate this into the St. Jude Clinical Genomics pipeline, which would make it simpler and faster. The MNV case is actually not covered well by third-party tools, so publishing this on GitHub would be good for any group trying to do automated confirmation of variants across targets.
Helpful Tools, Packages, or Software
Noodles -- in github at umccr / htsget-rs
Test Data
St. Jude cloud genomic data has variants and bam files for multiple sequencing targets.
Title
Enhanced machine learning-based analysis of gene regulatory networks
Category
Processing Pipelines and Methods
Challenge
A significant volume of biomedical literature explains gene dysregulation in numerous diseases. However, there needs to be a comprehensive understanding of the relationship between the gene regulatory networks and their associated molecular functions and phenotypes. Many machine-learning approaches have been considered to reconstruct and analyze gene regulatory networks, and classical text-mining approaches have been recognized to produce inconsistent outcomes.
Although currently available Natural Language Processing (NLP) approaches still need to include the critical step of the relationship between gene regulatory networks and phenotypes, it's worth an attempt to use NLP by adopting a graph-neural network framework using biological data such as molecular interactome data. After reconstructing gene networks using existing relevant literature, it can be further integrated with gene ontology and GWAS data to improve the gene network's usability and reliability.
Benefit
This project will help build a better understanding of which genes are regulated by other genes, which would eventually help researchers identify targets for specific diseases and the molecular mechanisms in which those targets are involved. This project could be very important for drug discovery projects, specifically for those that are based on the proteomics data available at St. Jude.
Helpful Tools, Packages, or Software
R (ggplot2, Shiny, Bioconductor), Cytoscape, Python (scikit-learn, H2O, TensorFlow, PyTorch, Matplotlib, Django, FastAPI, NetworkX), JavaScript (D3.js, web frameworks, database)
Test Data
GEO, GWAS, DepMap
Title
Predicting destabilizing point mutations making full use of the structure in AlphaFold DB
Category
Processing Pipelines and Methods
Challenge
One very exciting use case which AlphaFold 2 sadly fails at is predicting destabilizing point mutations, since it is not sensitive enough to alter its structural prediction based on a single residue change. Can we develop a pipeline that addresses this question, perhaps by extending PolyPhen-2 to make full use of the structures in AlphaFold DB, which cover the entire human proteome? Currently, this predictor only uses structures available in the PDB structure database. Adding AlphaFold structure predictions to this should be an easy improvement to make.
Benefit
This might directly help us assess the effects of missense mutations. It could be released as an “improved” version of a variant effect predictor such as PolyPhen-2, and it could ultimately even be incorporated into the St. Jude Medal Ceremony pipeline.
Helpful Tools, Packages, or Software
AlphaFold 2; AlphaFold DB; Variant effect predictors used in the St. Jude Medal Ceremony: PolyPhen2 (HVAR), SIFT, CADD, REVEL, FATHMM, MutationAssessor, and LRT; Additional variant effect predictor: Ensembl VEP
Test Data
ClinVar, ASHG pathogenicity classification, possibly HGMD
Title
Toolbox for convenient manipulation of AlphaFold output
Category
Processing Pipelines and Methods
Challenge
AlphaFold produces up to 5 predicted structures that contain slightly different conformations of the same modeled sequence. It then names the files by appending an incremental value (from 0 to 4) according to the average pLDDT confidence score. However, in some cases, there is a specific region or domain of interest for which we will want to maximize this confidence score. AlphaFold also provides aligned error values for all residue-residue distances. Working with these values, however, is not straightforward. Our objective with this challenge is to develop a library containing “convenience functions” for working with AlphaFold-generated models. The goal is to write several independent scripts that perform specific functions using AlphaFold models as input. Depending on the interest and the number of desired functionalities, it’s also tempting to build an object-oriented API to store and interact with AlphaFold output in a programmatic way.
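One such convenience function might re-rank models by the confidence of a region of interest instead of the whole chain. The sketch below relies on the fact that AlphaFold writes per-residue pLDDT into the B-factor column of its PDB output; it parses only the fixed-width `ATOM` records, reads one value per residue from the CA atom, and assumes a single chain. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, List


def residue_plddt(pdb_lines: Iterable[str]) -> Dict[int, float]:
    """Map residue number -> pLDDT, taken from the B-factor field of CA atoms."""
    scores: Dict[int, float] = {}
    for line in pdb_lines:
        # PDB fixed columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def region_plddt(pdb_lines: Iterable[str], first: int, last: int) -> float:
    """Mean pLDDT over residues first..last (inclusive).

    Raises statistics.StatisticsError if no residue in the range is present.
    """
    scores = residue_plddt(pdb_lines)
    return mean(scores[i] for i in range(first, last + 1) if i in scores)


def rank_models(models: Dict[str, List[str]], first: int, last: int) -> List[str]:
    """Model names sorted by regional pLDDT, most confident first."""
    return sorted(models, key=lambda name: region_plddt(models[name], first, last), reverse=True)
```

Here `models` maps a model name to the lines of its PDB file, for example `{"ranked_0": open("ranked_0.pdb").read().splitlines()}`.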
Benefit
This would help us make the most of AlphaFold’s structure predictions by refining them further and extending the use-cases by allowing us to make better decisions.
Helpful Tools, Packages, or Software
AlphaFold 2, AlphaFold DB, Standard Python packages + numpy.
Test Data
PDB structures and AlphaFold models.
Title
Establishing a workflow for identifying important structural features of a protein of interest integrating information from multiple sources
Category
Processing Pipelines and Methods
Challenge
Next-generation sequencing has expanded our collective ability to identify common gene variants in the population, but we are still working to improve our ability to predict pathogenicity for these mutations. Along with the expansion of sequencing capabilities, GPU-enabled technologies have led to an expanded capacity to perform molecular dynamics (MD) simulations on protein structures. These MD simulations can identify features of proteins that are crucial for canonical function by identifying residue interaction networks that mediate intra- and inter-protein interactions in each protein state. The challenge is integrating these sources of information in a simplified workflow to identify the important structural features of a protein of interest based on annotated sequence information and protein dynamics information. The key deliverable from this challenge is a workflow that can combine the dynamic residue interaction network of a protein with annotated sequence information to display the potential sites of pathogenicity on a protein structure.
Protein dynamics and contact network analysis:
Visualization of dynamics and sequence-based annotations:
Benefit
The solutions developed by this challenge will help to prioritize mutations identified at the population level based on their pathogenicity. The workflow will enable sequence-based annotation of protein functional regions identified through the MD analysis. This will integrate multiple levels of evidence to highlight the most likely pathological mechanisms that affect the function of a given protein of interest. This can be used in clinical diagnosis as a tool to score the pathogenicity effects of variants. It can be used in structural studies when designing protein mutations to stabilize proteins in non-cellular environments, by informing regions in which to avoid mutations.
Helpful Tools, Packages, or Software
MDTraj, MDAnalysis, pytraj, numpy, scipy, seaborn, BioPython, pandas.
Test Data
MD simulation trajectories (or) PDB structures from RCSB database, AlphaFold2 database, ELM (Eukaryotic Linear Motif resource), NCBI sequence databases, UniProt (for annotated features, PTMs, processing sites)
Title
Machine learning pipeline to predict locations of mutations in different cancer types
Category
Processing Pipelines and Methods
Challenge
Can we predict which missense mutations will likely appear (at the residue level) for a given cancer type? The challenge would consist of building an ML pipeline to predict the locations and substitution of mutations in genes in samples of different cancer types. Current ML solutions either aim at identifying cancer driver genes (Malebary et al, Sci.Rep2021) or used older technologies (SVM, 10.1109/iarwisoci.2014.7034632). The aim of the challenge would be to assess with state-of-the art ML methods how accurately we are currently able to estimate the appearance of cancer mutations.
The input features for each sample could consist of amino-acid sequences of a target gene and a cancer type label, and the participants would be asked to predict the location and amino-acid substitution observed in that gene and cancer type. This data could be obtained from available public databases (I.e. COSMIC, TCGA, etc).
Teams would be free to pull in additional genomic data (e.g. homologous sequences, natural variant data), structural data, network data, etc., with the exception of explicitly using cancer mutational data to build a predictor. The teams’ pipelines would be benchmarked on a held-out set of mutated cancer genes.
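One way the held-out benchmark could be scored is sketched below. This is only an illustration: the top-k hit rate metric, the `top_k_hit_rate` function, and the gene and cancer-type labels are assumptions made for this example and are not part of the challenge description.

```python
from typing import Dict, List, Set, Tuple

# A mutation is (1-based residue position, substituted amino acid).
Mutation = Tuple[int, str]
# A sample key is (gene, cancer type); the labels used below are made up.
SampleKey = Tuple[str, str]


def top_k_hit_rate(
    predictions: Dict[SampleKey, List[Mutation]],
    observed: Dict[SampleKey, Set[Mutation]],
    k: int = 10,
) -> float:
    """Fraction of held-out samples with at least one observed mutation
    among the k highest-ranked predictions (hypothetical metric)."""
    if not observed:
        raise ValueError("held-out set is empty")
    hits = 0
    for key, truth in observed.items():
        ranked = predictions.get(key, [])[:k]  # missing samples count as misses
        if any(mutation in truth for mutation in ranked):
            hits += 1
    return hits / len(observed)


# Toy, fabricated-for-illustration data; not taken from any database.
observed = {
    ("GENE_A", "cancer_1"): {(12, "D"), (13, "V")},
    ("GENE_B", "cancer_2"): {(600, "E")},
}
predictions = {
    ("GENE_A", "cancer_1"): [(61, "H"), (12, "D")],
    ("GENE_B", "cancer_2"): [(601, "E"), (469, "A")],
}
print(top_k_hit_rate(predictions, observed, k=2))
```

A ranked-list metric like this rewards predictors that place observed residue changes near the top, which suits a setting where many substitutions are possible but few are seen. Organizers might equally prefer per-residue probabilities scored with a calibration-aware metric.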
Benefit
This challenge could allow us to assess how effective current state-of-the-art ML methods (DL, transformers, protein language models, etc.) are at predicting the likelihood of mutations in cancers. Such ML pipelines could be beneficial both for understanding which genomic and biological features affect the likely location of mutations in cancer genes and for helping to analyze and process the large amount of data on the mutational landscape of pediatric cancers available in the St. Jude PECAN database.
Helpful Tools, Packages, or Software
ML: free choice of tools (TensorFlow, Keras, PyTorch, JAX, etc.). Pretrained models: MSA Transformer (FAIR resource), UniRep.
Test Data
Cancer genomic datasets: TCGA, COSMIC, SJ. PECAN?
Title
Putative core regulatory circuitry identification
Submitter
Jared Andrews
Category
Processing Pipelines and Methods
Challenge
Core regulatory circuitries (CRCs) have been revealed as enticing therapeutic targets in the context of cancer. These circuitries are defined as cliques of self- and mutual-regulating transcription factors (TFs) driven by super enhancers (SEs) that drive an enhanced network of genes necessary for maintenance of disease state.
In existing methodologies, putative CRCs are identified by first performing super enhancer calling from H3K27ac ChIP or CUTandRUN data and associating them with nearby genes. Next, motif scanning within SE subpeaks or SE-contained ATAC-seq peaks is performed. Lastly, network analysis is used to identify in/out degree of each TF with other SE-associated TFs via TF motif presence. Optionally, active TFs and SE-gene association can be informed via incorporation of expression data (e.g. removing all genes with TPM < 1).
Current tools to perform these steps are no longer maintained, difficult to install, and out of date (Python 2, old annotations/motif databases, etc.). This challenge would be to create a new, simplified tool that incorporates these methods to be used on semi-processed datasets: matched ATAC-seq, RNA-seq, and SE calls for a given sample. In addition, there now exist methods to predict bound TF motifs from ATAC-seq data via motif footprinting, which should improve the accuracy and robustness of this approach.
In short, given a BED file of predicted bound TF motifs, a BED file of SE calls, and an expression matrix of counts/TPMs for a given sample, identify putative CRCs based on network analysis. Bonus points if the resulting networks can be visualized easily or if downstream gene target information is provided.
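The network-analysis step described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method used by coltron, CRCmapper, or crc. It assumes the fourth BED column of each motif record names the TF, the fourth column of each SE record names the gene already assigned to that SE, and expression is a gene-to-TPM mapping. The function names `build_tf_edges` and `putative_crcs` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

# BED-like record: chrom, start, end, name (4th column; see assumptions above).
Interval = Tuple[str, int, int, str]


def build_tf_edges(
    motifs: List[Interval],
    super_enhancers: List[Interval],
    tpm: Dict[str, float],
    min_tpm: float = 1.0,
) -> Dict[str, Set[str]]:
    """Return {regulator TF: SE-associated TFs whose SE contains its motif}."""
    expressed = {gene for gene, value in tpm.items() if value >= min_tpm}
    tfs = {name for _, _, _, name in motifs} & expressed
    edges: Dict[str, Set[str]] = defaultdict(set)
    for se_chrom, se_start, se_end, target in super_enhancers:
        if target not in tfs:
            continue  # keep only SEs assigned to expressed TFs
        # Quadratic scan for clarity; real data would want an interval index.
        for chrom, start, end, tf in motifs:
            inside = chrom == se_chrom and start >= se_start and end <= se_end
            if inside and tf in tfs:
                edges[tf].add(target)
    return edges


def putative_crcs(edges: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Maximal cliques (size >= 2) of self- and mutually regulating TFs."""
    auto = {tf for tf, targets in edges.items() if tf in targets}
    mutual = {
        a: {b for b in auto if b != a and b in edges[a] and a in edges[b]}
        for a in auto
    }
    cliques: List[FrozenSet[str]] = []

    def bron_kerbosch(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            if len(r) > 1:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & mutual[v], x & mutual[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(auto), set())
    return sorted(cliques, key=lambda c: (-len(c), sorted(c)))


# Toy inputs with placeholder TF names and coordinates.
motifs = [
    ("chr1", 100, 110, "TF1"), ("chr1", 150, 160, "TF2"), ("chr1", 170, 180, "TF3"),
    ("chr2", 1200, 1210, "TF1"), ("chr2", 1300, 1310, "TF2"),
    ("chr3", 10, 20, "TF3"), ("chr3", 30, 40, "TF1"),
]
super_enhancers = [
    ("chr1", 50, 500, "TF1"), ("chr2", 1000, 2000, "TF2"), ("chr3", 0, 900, "TF3"),
]
tpm = {"TF1": 20.0, "TF2": 5.0, "TF3": 0.2}  # TF3 falls below the TPM cutoff

print(putative_crcs(build_tf_edges(motifs, super_enhancers, tpm)))
```

In-degree and out-degree per TF follow directly from the `edges` mapping, and the same mapping can be handed to a graph library for the visualization mentioned as a bonus.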
Benefit
A solution to this problem would allow us to include putative CRC identification into our standard workflow for sample analysis. It would also kickstart the field with a much needed update to these methodologies and provide a mechanism to better integrate these multi-modal datasets into cohesive, holistic analyses.
Helpful Tools, Packages, or Software
Example methodologies can be seen in coltron, CRCmapper, and crc. All of these are quite out of date and difficult to install, but crc is the clearest. TOBIAS can be used for ATAC-seq footprinting.
Test Data
Any matched ATAC-seq, RNA-seq, and H3K27ac ChIP-seq or CUTandRUN. Motif databases from TRANSFAC, JASPAR, HOCOMOCO, etc. ROSE for super enhancer calling.
Title
Inference of enriched miRNA target sites in alternative/differential transcripts
Submitter
Gang Wu
Category
Processing Pipelines and Methods
Challenge
MicroRNAs (miRNAs) regulate gene expression largely by recognizing target sites in the 3' UTR of mRNAs. Although bulk RNA-seq is readily available, miRNA expression is often not measured directly by miRNA sequencing. Tools such as Sylamer infer enriched miRNA target sites in a list of differentially expressed genes. With bulk and single-cell RNA-seq datasets readily available, similar functionality can be extended to detect miRNA target sites enriched in transcripts with alternative/differential polyA usage, either between bulk RNA-seq samples under different conditions or between cell types in scRNA-seq.
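The core enrichment idea can be illustrated with a simple one-sided hypergeometric test on seed-match presence. This is a deliberately simplified sketch and not Sylamer's algorithm. The function names, the arbitrary 7-mer seed, and the toy sequences are assumptions made for illustration. The foreground set stands in for transcripts flagged by an upstream step, for example those with differential polyA usage.

```python
from math import comb
from typing import Dict, Iterable


def seed_site(seed: str) -> str:
    """DNA-alphabet reverse complement of a miRNA seed written 5'->3'."""
    pairs = {"A": "T", "U": "A", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seed.upper()))


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when drawing n items from N, of which K are successes."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def site_enrichment(utrs: Dict[str, str], foreground: Iterable[str], seed: str) -> float:
    """P-value for over-representation of seed-match sites in foreground UTRs."""
    site = seed_site(seed)
    has_site = {tid for tid, seq in utrs.items() if site in seq.upper()}
    fg = set(foreground) & set(utrs)  # ignore IDs without a UTR sequence
    k = len(fg & has_site)
    return hypergeom_upper_tail(k, len(has_site), len(fg), len(utrs))


# Toy 3' UTR sequences (DNA alphabet) with made-up transcript IDs.
utrs = {
    "t1": "AAACTACCTCAAA",
    "t2": "GGCTACCTCGG",
    "t3": "CTACCTC",
    "t4": "AAAAAAA",
    "t5": "GGGGGGG",
    "t6": "ACGTACGT",
}
foreground = ["t1", "t2", "t3"]  # e.g. transcripts flagged by an upstream analysis
print(site_enrichment(utrs, foreground, seed="GAGGUAG"))
```

A fuller tool would scan all seeds, correct for multiple testing, and account for UTR length and sequence composition, which this sketch ignores.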
Benefit
A solution to this challenge would allow many published datasets to be re-analyzed to determine how miRNA expression changes may impact various conditions and cell states, in addition to providing useful functionality in RNA-seq workflows.
Helpful Tools, Packages, or Software
Sylamer (https://www.nature.com/articles/nmeth.1267) & (https://github.com/micans/sylamer)
Test Data
TCGA RNAseq and matched miRNAseq data; Bulk RNAseq and miRNAseq from miR-451/144 knockout mice; Public scRNAseq