St. Jude KIDS23 BioHackathon Challenges

Data Management

Automated data archival/retrieval for Azure Cloud pipelines

Title

Automated data archival/retrieval for Azure Cloud pipelines

Category

Data Management

Challenge

The Clinical Genomics analysis pipelines have been ported to the Azure cloud, where a hot filesystem is expensive. To minimize cost, we'd like automation that keeps the active data needed for analysis and reporting hot, and automatically initiates archival and deletion for data that has been reported. We'd also like to handle ad hoc requests from analysts who need to look at older data. We have manual solutions and wrappers around azcopy and similar tools, but nothing fully automated and easy to manage.
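
As a rough illustration of the hot/archive/delete decision described above, the following Python sketch encodes one possible policy. Everything in it is an assumption made for illustration: the `Dataset` fields, the `Action` names, and the 30-day grace period are hypothetical, and no Azure calls are made. A real service would translate each action into the corresponding storage operation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_HOT = "keep_hot"
    ARCHIVE = "archive"
    DELETE_HOT_COPY = "delete_hot_copy"


@dataclass
class Dataset:
    name: str
    reported_on: Optional[date]  # None means the case has not been reported yet
    archived: bool               # True once a cold copy is confirmed


def plan_action(ds: Dataset, today: date, grace_days: int = 30) -> Action:
    """Decide what to do with one dataset (hypothetical policy)."""
    if ds.reported_on is None:
        return Action.KEEP_HOT            # still needed for analysis/reporting
    if today - ds.reported_on < timedelta(days=grace_days):
        return Action.KEEP_HOT            # recently reported, keep for follow-up
    if not ds.archived:
        return Action.ARCHIVE             # make the cold copy first
    return Action.DELETE_HOT_COPY         # cold copy exists, free hot storage
```

Archiving before deleting is a deliberate ordering in this sketch: the hot copy is only removed once the cold copy is confirmed.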

Benefit

Today, when an analyst attempts an analysis and discovers that the data is missing, the analysis is delayed 10-12 hours while cold storage is rehydrated. An automated system would eliminate those delays by preemptively starting the rehydration as soon as the order for an analysis is placed.

Helpful Tools, Packages, or Software

Compbio has a set of tools already built for managing archives and recalls manually. The best software for coding up the Azure calls is probably the MS Go library. We'd like the automation to run as a server on our Azure cluster, so Go is likely the best tool to use for the project.

Test Data

NA
Image data management system
Title

Image data management system

Category

Data Management

Challenge

This challenge provides an opportunity for 1 to 3 teams to develop a prototypical Image Data Management System (IDMS). The challenge includes the following components:
1. Front-end dashboard prototype in React (preferred) or Angular, consuming the REST API developed as part of Challenge 2.
2. Back-end database and RESTful API developed in Django and Django Rest Framework.
3. (Optional) GraphQL API developed in Django or a similar technology, either reusing the database from Challenge 2 or creating a new one. Teams for Challenges 2 and 3 should be prepared to talk about the tradeoffs of using a RESTful API vs. a GraphQL API.
Sub-Challenge 1: IDMS Front-end Dashboard Prototype

Using React and MaterialUI (or Angular), develop a prototypical dashboard that a user can log in to in order to access their image collections and corresponding metadata. The prototype should support visualization of images provided by any underlying tool (such as the CBI renderer) by displaying a TIFF, JPEG, or PNG file. Metadata should be accessible to some degree in the front end. Possible use cases:

1. Use case 1: As a user, I want to log in to the IDMS system and review my image collections and corresponding metadata.

a. This base use case requires using the API developed as part of Challenge 2. To facilitate development of the dashboard, the Challenge 1 team should be able to take the idms_data.json file (to be provided) and populate the dashboard with that information while Challenge Team 2 is developing the database and API. Abstract the acquisition of the data used to populate the interface (file-based or API-based) to facilitate this parallel development.

2. Use case 2: As a user, I want to update select metadata in the interface and populate the backend database.

a. Metadata tied to a backend tool (such as the renderer) should be updated in that tool's data backend using that tool's RESTful API. The backend update is to be coordinated and provided by Challenge Team 2.

3. Use case 3: As a user, I want to retrieve an image from a collection based upon a specific image or tile ID.

a. View an image provided by the backend tool within the user interface. The user should be able to select an image collection and specify retrieval of a specific image via the RESTful API for display within the dashboard.
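
The data-acquisition abstraction asked for in use case 1 could look roughly like the sketch below. It is written in Python only to show the shape of the idea; the dashboard itself would express the same pattern in its own front-end language. The class names, the `{"collections": [...]}` file layout, and the `/collections/` endpoint are all hypothetical, since the real idms_data.json schema and API routes are not given here.

```python
import json
from typing import Any, Callable, Protocol


class CollectionSource(Protocol):
    """Anything the dashboard can ask for a list of image collections."""

    def list_collections(self) -> list[dict[str, Any]]: ...


class FileSource:
    """Reads collections from a local JSON file such as idms_data.json."""

    def __init__(self, path: str) -> None:
        self.path = path

    def list_collections(self) -> list[dict[str, Any]]:
        with open(self.path, encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumed layout: {"collections": [...]}; adjust to the real file.
        return data["collections"]


class ApiSource:
    """Reads collections from the REST API through an injected fetch function."""

    def __init__(self, fetch_json: Callable[[str], Any], base_url: str) -> None:
        self.fetch_json = fetch_json
        self.base_url = base_url.rstrip("/")

    def list_collections(self) -> list[dict[str, Any]]:
        # "/collections/" is a hypothetical endpoint path.
        return self.fetch_json(f"{self.base_url}/collections/")


def collection_names(source: CollectionSource) -> list[str]:
    """Dashboard-side code depends only on the CollectionSource interface."""
    return [c["name"] for c in source.list_collections()]
```

Because `collection_names` never checks which source it was given, the front-end team can develop against the file and switch to the API once Challenge Team 2 delivers it.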

Sub-Challenge 2: IDMS Backend Database and RESTful API Prototype

In collaboration with the team on Challenge 1, create a backend database and REST API using Django and the Django Rest Framework (or similar). This API will be consumed by the team developing the frontend dashboard. It should also communicate, via an existing REST API, with a tool backend for accessing image data. The use cases to develop are listed below.

1. Use case 1: As a developer, I need to provide a REST API to be utilized by Challenge Team 1.

a. This REST API should provide the following support:

i. Retrieval of the list of users, projects, image collections, and image collection metadata for display in the dashboard.

ii. Update of metadata in the database and in the backend tool database.

iii. Retrieval of image data from the backend tool based upon a generic frontend API query.

2. Use case 2 (optional): As a developer, I want to containerize the application for ease of deployment via Docker/Singularity.

Sub-Challenge 3 (optional): GraphQL API

Similar to challenge 2 but the team must develop a GraphQL API instead of a REST API. Compare the tradeoffs of using a GraphQL API approach vs a REST API approach. Use cases:

1. Use case 1: As a developer I need to provide a GraphQL API similar to the API in Challenge 2.

a. Work with both Challenge Team 1 and Challenge Team 2 to design and implement the API.

2. Use case 2: As a developer, I want to work with Challenge Team 1 to implement this API in the dashboard.

Benefit

Provide uniform access to image data across a variety of backend toolsets.

Helpful Tools, Packages, or Software

Python, Django, Django Rest Framework, React, MaterialUI, Javascript, HTML, CSS, CBI Renderer, Docker, SQLite, PostgreSQL

Test Data

Use / access to CBI renderer tool or stand-in needed

DevOps and Community

St. Jude Travel Together
Title

St. Jude Travel Together

Category

DevOps and Community

Challenge

St. Jude Children’s Research Hospital is one of the largest employers in Memphis. Every day, thousands of employees flock to downtown Memphis to support its mission. However, rising gas prices, impromptu car issues, and environmental concerns present challenges for this commute.

To help the St. Jude community address this challenge, I suggest developing infrastructure that offers St. Jude employees two alternative options through a new app called St. Jude Travel Together (SJTT):

1. Group ride. SJTT will support a network of bike commuters in defining pick up points/times along bike paths to/from St. Jude so people can join a biking group and feel safer riding together to work.
- step 1: SJTT will connect with people biking to/from work (perhaps by finding who requested a St. Jude bike sticker) and ask for their bike route and approximate time.
- step 2: SJTT will create a map of the most commonly used bike routes and pick-up times.
- step 3: SJTT will create an app so that, at the start of a ride, the biker starts SJTT and their position is displayed in the app in real time. A new rider will then be able to join the ride.
2. Ride Share. This branch of SJTT will function like popular, commercial ride-share apps but will be tailored to the St. Jude community:
- step 1: SJTT will connect groups of people living in the same area or on a specific path to/from work who are interested in ride-share options.
- step 2: SJTT will create a map of the commonly used car routes / times.
- step 3: SJTT will host an agenda so people can request a ride (one time or recurrent). People in the same area (within x miles) will receive an e-mail that a ride is posted and will be able to pick it up (like an open shift in Teams-Shifts).
- step 4: SJTT will create an app to replace the agenda.
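
The "within x miles" notification in the ride-share steps above could be prototyped with a great-circle distance filter. The sketch below uses the standard haversine formula; the function names and the rider data layout (a name mapped to a latitude/longitude pair) are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def riders_to_notify(ride_origin, riders, max_miles):
    """Return names of riders whose home location is within max_miles of the ride."""
    lat, lon = ride_origin
    return [
        name
        for name, (rlat, rlon) in riders.items()
        if haversine_miles(lat, lon, rlat, rlon) <= max_miles
    ]
```

Straight-line distance is only a first approximation; a later version might prefer distance along the commonly used routes mapped in step 2.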
Benefit

See above

Helpful Tools, Packages, or Software

Android SDK, Swift, etc.

Test Data

NA
Scaling CBI image analysis pipeline by leveraging HPC resources

Title

Scaling CBI image analysis pipeline by leveraging HPC resources

Category

DevOps and Community

Challenge

The CBI currently provides a JupyterHub service where users can run image analysis pipelines on one of the two CBI workstations that run the JupyterHub instances. We would like to scale up the computation by using the HPC while keeping the interactivity of Jupyter notebooks and minimizing the additional time spent packaging/containerizing the analysis pipelines.

Benefit

For CBI members: minimize the time spent on packaging/containerizing and focus on development of analysis pipelines; for St. Jude researchers: after scaling up, more users will be able to make use of the image analysis tools/pipelines developed by the CBI.

Helpful Tools, Packages, or Software

Dask, Ray, JupyterLab on the HPC

Test Data

An example pipeline will be provided.
Increasing the reliability and resiliency of the Image Processing Pipeline (IPP) plugin

Title

Increasing the reliability and resiliency of the Image Processing Pipeline (IPP) plugin

Category

DevOps and Community

Challenge

We have developed an Image Processing Pipeline (IPP) that can run any workflow and is accessed via a web browser. IPP helps research staff who do not have a computational background run a workflow without knowing the underlying details or programming. It also monitors jobs and lets users view job progress in real time. This is a unique solution, but we need to increase its reliability and resiliency by integrating a load balancer and replicated instances.

Benefit

This will allow the St. Jude Scientific Community to run imaging analysis workflows from a web browser. It will eliminate the burden of needing computational skills to run these analyses. The proposed improvements to the pipeline will improve analysis speed and increase robustness of the system.

Helpful Tools, Packages, or Software

Load balancer, Kubernetes. The platform is built with Python, Flask, and MongoDB, so knowledge of those technologies will be helpful.

Test Data

Access to the IPP will be provided to those interested.

GUI Tool Development

Customizable Fiji menu interface

Title

Customizable Fiji menu interface

Category

GUI Tool Development

Challenge

FIJI is an open-source and extensible image processing and analysis application geared towards life science applications. FIJI is very powerful due to a large developer community. As a result, FIJI's core menu system is now three layers deep, encompassing roughly 600 commands. The ‘Plugins’ menu provides an additional ~500 commands. Unfortunately, this myriad of functions is a double-edged sword: novice end users quickly become lost and frustrated by these many options, most of which are irrelevant for basic needs. As a result, many end users (>50%) insist on using more user-friendly, but also far more expensive, commercial packages.

The challenge is to make the FIJI menu system fully customizable. The goal is not to remove any functionality, but simply hide rarely used commands from the end user. Thus, based on user preference, FIJI could be run in an 'Expert Mode' where all commands are shown (the current default) or a 'Simple Mode' which displays only a simplified (and customizable) menu system. The ‘Simple Mode’ would then be curated for the specific user base in question. This would be especially useful in any of our imaging cores, or even for lab-based setups where many different levels of users are present.

While customizable toolsets and buttons (ActionButtons plugin) are currently possible, these are icon-based and so work well as shortcuts for only a very few commands. A fully graphical GUI has also been proposed (http://www.imagejfx.net/) but was seemingly never completed (and is not backwards compatible). Making FIJI feel more accessible to the end user will greatly help its full power and potential efficiencies to be realized!

Benefit

Decrease cost and managerial overhead. Make user's imaging-related research more efficient.

Helpful Tools, Packages, or Software

Fiji

Test Data

NA
Summary reporting of estimated radiation toxicity risks

Title

Summary reporting of estimated radiation toxicity risks

Category

GUI Tool Development

Challenge

Radiation therapy treatment planning is a delicate balance between delivering enough dose to the tumor to provide disease control and not delivering too much dose to nearby healthy tissues. Most organs have well-established guidelines on maximum allowable doses given some acceptable risk (as determined by the MD). There is currently no way to visualize that guidance, estimate the risk given the current radiation therapy treatment plan, and provide a summary for the patient's chart. We would like to develop a simple GUI and reporting system that estimates radiation toxicity risks and prints summary reports to be used in survivorship plans for patients. A good start is developing a prototype that can do this for one or two organs with published toxicity models that can be easily implemented.
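
As one example of the kind of published toxicity model a prototype might wrap, the sketch below implements the Lyman-Kutcher-Burman (LKB) normal tissue complication probability model on a differential dose-volume histogram. The choice of LKB is an assumption, not something the challenge prescribes, and the parameter values used in any demonstration (`td50`, `m`, `n`) must come from the literature for the organ in question; none are supplied here.

```python
import math


def geud(doses_gy, volumes, n):
    """Generalized equivalent uniform dose from a differential DVH.

    doses_gy: dose of each DVH bin (Gy); volumes: volume in each bin
    (any unit, normalized internally); n: volume-effect parameter (> 0).
    """
    total = sum(volumes)
    return sum((v / total) * d ** (1.0 / n) for d, v in zip(doses_gy, volumes)) ** n


def lkb_ntcp(doses_gy, volumes, td50, m, n):
    """LKB complication probability: the standard normal CDF of t,
    where t = (gEUD - TD50) / (m * TD50)."""
    t = (geud(doses_gy, volumes, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
```

A uniform dose equal to `td50` gives a probability of 0.5 by construction, which is a convenient sanity check for the mock dose data.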

Benefit

Setting up successful and resourceful survivorship care plans requires appropriate assessment of therapy risks. A more detailed accounting of the risk coming from radiation therapy could help to best direct care to radiation oncology patients.

Helpful Tools, Packages, or Software

There are many ways to approach this, depending on the skill set of the team: a Shiny app, a Python Dash app, etc. Everything we would need is available.

Test Data

We would generate some mock radiation dose data for testing. The actual toxicity models can be pulled from the literature, likely something from St. Jude Life or the Childhood Cancer Survivorship Series.
Predicted pathway analysis for experimental compounds
Title

Predicted pathway analysis for experimental compounds

Category

GUI Tool Development

Challenge

Succinct Summary: Given a database linking compounds, genes, and pathways, build a tool that will:
1. Allow users to input a list of compounds and retrieve a list of pathways which are enriched in the targets (genes) of those compounds.
2. Allow users to visually explore these pathways. Targets hit by compounds in the list should be color coded by activity (i.e., red for increased activity, blue for decreased activity).
3. Targets which have a compound not in the list (weren’t screened or were not hits), should be annotated.
Descriptive Summary: The compound discovery team needs a tool that enables querying compounds using a user-provided list of compounds to identify enriched pathways and activity in target genes within those given pathways. In addition, the tool should enable querying of pathway-specific targets to identify a list of compounds that target all start, intermediate, and end-points within a given pathway and their activities on those targets. For example, querying for TLR pathway targets yields a list of compounds that target genes within that pathway, plus a visual overlay indicating whether those targets are activated or inactivated by the compound. Bonus points if you can incorporate ranking pathways/compounds based on increased or decreased activity in the pathway!

The tool should provide a visual overlay of the genes targeted by a compound after querying for a pathway, and vice versa:
1. Querying a compound list yields a list of pathways and provides an enrichment analysis of the targeted genes to determine which compounds offer the most coverage, activation, or inactivation in a given pathway. In the results of the query, enriched pathways should indicate which targets in that pathway are activated/repressed by a given compound.
2. Querying a pathway results in a list of compounds and a coverage/enrichment score for each compound, based on the number of genes in that pathway that are targeted and their activities.
3. After querying, the tool identifies related compounds that target the same pathway and provides a score/readout that lets users quickly see whether any compounds beyond the current query offer additional coverage for a given pathway.

* There are existing databases linking compounds and gene targets, and we don’t want to replicate these, as keeping them up-to-date is a full time job better left for others. Likewise, we have good tools that quantify pathway enrichment using a gene list. What we are seeking here is an exploratory data analysis tool that lets users upload their list of compounds to check for pathway enrichment, and to visually explore their hits in those pathways.
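
For prototyping the exploratory tool before it is wired to an existing enrichment tool, a simple over-representation test can stand in. The sketch below uses a one-sided hypergeometric test, which is one common choice and not necessarily what the group's existing tools use. The function names and the dictionary layout for pathways are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_targets genes are drawn at random
    from a universe containing pathway_size pathway genes."""
    upper = min(pathway_size, n_targets)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, n_targets - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(universe_size, n_targets)


def rank_pathways(targets, pathways, universe):
    """Return (pathway, hit genes, p-value) tuples, most enriched first.

    targets: set of genes hit by the user's compounds
    pathways: dict mapping pathway name -> set of genes
    universe: set of all genes considered
    """
    targets = targets & universe
    rows = []
    for name, genes in pathways.items():
        genes = genes & universe
        hits = targets & genes
        p = enrichment_p_value(len(universe), len(genes), len(targets), len(hits))
        rows.append((name, sorted(hits), p))
    return sorted(rows, key=lambda row: row[2])
```

The returned hit lists are what the visual overlay would color by activity; the p-values are not corrected for multiple testing in this sketch.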

Benefit

The solutions to this challenge would greatly benefit researchers attempting to identify compounds that target various biomarkers/markers/pathways in any given experiment and vice versa.

Helpful Tools, Packages, or Software

R, Python, SQL, Spotfire, etc.

Test Data

MSIGDB pathway databases, Novartis set of compounds, St. Jude-provided set of compounds currently available in compound discovery database
ML dashboard for real-time model building using natural language processing on biological sequence inputs

Title

ML dashboard for real-time model building using natural language processing on biological sequence inputs

Category

GUI Tool Development

Challenge

ML dashboard for interactive real-time model building on biological sequence inputs

Natural language processing is a highly promising field in artificial intelligence. Efforts are currently underway to extend these language-specific models to the biological domain. Biological sequences (DNA, RNA, protein sequence) can benefit from such models but the current software landscape is scarce. Our goal is to build an interactive dashboard based on Python, that will implement some of these language models on biological sequence data. Users will be able to provide any biological sequences as input (e.g., from a multiple sequence alignment) and then open the interactive dashboard where they can interactively select the language model, define the parameters, and start calculations. Depending on the model selected (and this is something that can be prioritized for the purpose of the hackathon), the user can in real-time change parameters and have the results updated on the screen.

The only solution that currently exists, to my knowledge, is BioSeq-BLM. However, the server interface is quite archaic, web-based, expects users to know all input parameters, and is not at all interactive. Our goal for the hackathon is to select only a subset of the BioSeq-BLM models (perhaps 2 or 3), which we think is sensible given the timeline, and develop a prototype. A prototype for this project would entail a working dashboard that accepts a biological sequence as input, processes it using standard Python packages on the backend, and provides a few applications to interactively explore the results (e.g., two or three interactive plots).
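
To make the "sequence as language" idea concrete, the sketch below turns a sequence into overlapping k-mer "words" and a fixed-length frequency vector, using only the standard library. This is a generic bag-of-words style representation; whether it corresponds to any particular BioSeq-BLM model chosen for the prototype is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers ("words") in a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(sequence, k, alphabet="ACGT"):
    """Fixed-length frequency vector over all k-mers of the alphabet,
    in lexicographic order of itertools.product."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]
```

In a dashboard, `k` is the kind of parameter a user could change interactively, with the downstream plots recomputed from the new vectors.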

Because building the backend is an involved task, we think users should aim to code the initial solution on Jupyter notebooks and then use programs such as voila to serve the dashboard. This would also provide participants with the unique experience of learning about interactive data exploration, which I assume is not that common in hackathon challenges.

Benefit

We currently have a 3D sequence clustering dashboard webserver working in our group. We would like to add the capability of making selections on these clusters, and then provide the above Python-based dashboard to users to interactively explore the different models. We also think that solutions supported by BioSeq-BLM are useful to computational proteomics and genomics and would find beneficial use by other groups here at St. Jude as well.

Helpful Tools, Packages, or Software

Jupyter Notebook, numpy, scipy, scikit-learn, BioSeq-BLM (the downloaded standalone package), bokeh/plotly.

Test Data

BioSeq-BLM can be used as a reference solution to test and compare.
Reusable R Shiny modules for common plots and data types

Title

Reusable R Shiny modules for common plots and data types

Submitter

Jared Andrews

Category

GUI Tool Development

Challenge

R Shiny applications have become quite popular recently, with many publications using publicly available applications to share data and results in an interactive way. These applications can also be extremely useful for exploratory data analysis and figure generation. With proper design and user training, they empower bench scientists to better interact with, interpret, and explore their data.

The development of these applications can be time-consuming, particularly when designing flexible, customizable, interactive visualizations that work with many types of data. Fortunately, there is already a way to create reusable sets of inputs and outputs for Shiny applications: Shiny modules. These modules function as building blocks of an application, making applications much faster to develop and reducing code redundancy across them. As of yet, there is no package that includes a robust, flexible collection of modules for creating common plot types from typical R matrices, dataframes, and Bioconductor classes.

This challenge is to create an R package containing functions for multiple Shiny modules that generate fully customizable, flexible, interactive plots. Examples could include scatter plots, box and violin plots, bar charts, line charts, etc. Plots should be fully aesthetically customizable (color, text size, axis limits, titles, etc.) using Shiny inputs and preferably interactive via plotly or d3 (with the data shown on hover being selectable). Bonus points for label addition/removal on click. Specific examples of module use would be highly useful (e.g. a volcano plot for DGE results from DESeq2, edgeR, and/or limma-voom). Modules could have useful pre-defined input defaults for specific data types and classes (e.g. SummarizedExperiment, DGEList, etc.) but should also work with generic S3 objects like matrices and dataframes.

Benefit

This would be a large boon to Shiny power users and developers, as it'd simplify application development significantly. This could easily be submitted to Bioconductor for use by the greater scientific community.

Helpful Tools, Packages, or Software

Shiny, Plotly, ggplot2, r2d3

Test Data

Lots of toy R datasets for various viz available - cars, iris, airway, etc. Or just use mock matrices.

Image Analysis

Automatic detection, masking, and quantitation of cells using AI

Title

Automatic detection, masking, and quantitation of cells using AI

Category

Image Analysis

Challenge

One of the critical features of cancer progression is the cancer cells’ potential to undergo cell transition. Hence, automatically discovering, detecting, and classifying the cell transition would be of immense utility in tracking cancer progression. To this end, automatically quantifying and stratifying signals from various cellular compartments based on cell morphology (e.g., membrane, mitochondria) data would aid in the above goal.

The available solutions are either one-size-fits-all commercial software with automated workflows or open-source software that requires time-consuming trial and error and manual steps to adapt the workflow to the exact experiment. Given recent developments in image analysis using ML/AI, a solution based on pattern recognition that explores each image according to the patterns found within it would be beneficial.

Benefit

We face this issue directly because we have been using and analyzing the image data in our lab to better understand cancer states. Also, it would be helpful to many other labs at St. Jude to have a standardized one-stop shop for image analysis.

Helpful Tools, Packages, or Software

R (ggplot2, Shiny, Bioconductor), Python (scikit-learn, H2O, TensorFlow, PyTorch, Matplotlib, Django, FastAPI), JavaScript (D3.js, web frameworks, database)

Test Data

MNIST, Cell Image Library, The Broad Bioimage Benchmark Collection (BBBC)
Restoration/imputation to improve analysis of low-resolution legacy MRI image data

Title

Restoration/imputation to improve analysis of low-resolution legacy MRI image data

Category

Image Analysis

Challenge

Utilize digital super-resolution to improve analyses of low-resolution imaging data, especially MRI data. It is very challenging to segment smaller brain structures accurately from low-resolution MRI. If we can reconstruct the images and make internal details clearer, then developing automated solutions to analyze them and detect specific phenotypes becomes much more viable.

Benefit

Enriching low resolution images will enable biologists to perform more granular analyses.

Helpful Tools, Packages, or Software

pytorch, torchvision, sklearn

Test Data

https://github.com/dama-lab/mouse-brain-atlas/tree/master/NeAt/ex_vivo/template
Add ML-assisted image annotation to napari

Title

Add ML-assisted image annotation to napari

Category

Image Analysis

Challenge

Add ML-assisted image annotation to napari. Computer-assisted volume annotation helps speed up the generation of expert-curated ground truth for downstream machine learning. Currently, only a few GUI clients support it, most notably 3D Slicer. napari (https://napari.org) is a very flexible and hackable GUI to view and annotate images. Compared to Slicer, napari is easy to customize and to integrate tightly with specific workflows.

Benefit

Adding support for ML-assisted image annotation to napari would provide a simple and accessible platform for future development of customized annotation workflows.

Helpful Tools, Packages, or Software

NVIDIA Clara AIAA, napari

Test Data

BraTS (https://www.med.upenn.edu/cbica/brats2020/data.html, https://www.kaggle.com/awsaf49/brats20-dataset-training-validation)

Processing Pipelines and Methods

Allele counter for SNV, MNV across sequencing targets using SAM API for Rust

Title

Allele counter for SNV, MNV across sequencing targets using SAM API for Rust

Category

Processing Pipelines and Methods

Challenge

We frequently want to confirm SNV/MNV (multi-nucleotide variant)/indel variant calls across different sequencing targets for the same sample (i.e. we want to see that the same variant occurs in the whole genome, exome, and RNA sequences). We typically do this by examining the region in each target and counting the occurrences of the alleles. This is currently done in Perl code for SNVs, and Python for MNVs and indels, but none of these packages are particularly fast. We now have access to a native SAM API for Rust, and this challenge is to implement an allele counter for SNVs and MNVs using that package.
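
The counting logic itself is small. The Python sketch below shows it on a deliberately simplified read model, where each read is an ungapped `(start, sequence)` pair in 0-based reference coordinates. That model is an assumption made for illustration: a real implementation on top of a SAM API would walk each record's CIGAR to handle indels, clipping, and splicing, and would apply base and mapping quality filters.

```python
from collections import Counter


def count_alleles(reads, pos, length=1):
    """Count the alleles observed at reference positions [pos, pos + length).

    reads: iterable of (start, sequence) pairs, ungapped, 0-based start.
    length=1 counts an SNV site; length>1 counts an MNV as a single
    multi-base allele per read, so the bases are known to be in phase.
    """
    counts = Counter()
    for start, seq in reads:
        offset = pos - start
        if offset < 0 or offset + length > len(seq):
            continue  # read does not span the whole variant
        counts[seq[offset:offset + length]] += 1
    return counts
```

Counting the MNV as one allele per read, instead of counting each position independently, is what distinguishes a true MNV from two nearby SNVs that occur on different reads.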

Benefit

We would incorporate this into the St. Jude Clinical Genomics pipeline, which would make it simpler and faster. The MNV case is not covered well by third-party tools, so publishing this on GitHub would be good for any group trying to do automated confirmation of variants across targets.

Helpful Tools, Packages, or Software

Noodles -- in github at umccr / htsget-rs

Test Data

St. Jude Cloud genomic data has variants and BAM files for multiple sequencing targets.
Enhanced machine learning-based analysis of gene regulatory networks

Title

Enhanced machine learning-based analysis of gene regulatory networks

Category

Processing Pipelines and Methods

Challenge

A significant volume of biomedical literature explains gene dysregulation in numerous diseases. However, a comprehensive understanding of the relationship between gene regulatory networks and their associated molecular functions and phenotypes is still lacking. Many machine-learning approaches have been considered to reconstruct and analyze gene regulatory networks, and classical text-mining approaches have been recognized to produce inconsistent outcomes.

Although currently available Natural Language Processing (NLP) approaches do not yet capture the critical relationship between gene regulatory networks and phenotypes, it is worth attempting to use NLP within a graph neural network framework that draws on biological data such as molecular interactome data. After gene networks are reconstructed from the existing relevant literature, they can be further integrated with gene ontology and GWAS data to improve the networks' usability and reliability.

Benefit

This project will help build a better understanding of which genes are regulated by other genes, which would eventually help researchers identify targets for specific diseases and the molecular mechanisms in which those targets are involved. This project could be very important for drug discovery projects, specifically those based on the proteomics data available at St. Jude.

Helpful Tools, Packages, or Software

R (ggplot2, Shiny, Bioconductor), Cytoscape, Python (scikit-learn, H2O, TensorFlow, PyTorch, Matplotlib, Django, FastAPI, NetworkX), JavaScript (D3.js, web frameworks, database)

Test Data

GEO, GWAS, DepMap
Predicting destabilizing point mutations making full use of the structure in AlphaFold DB

Title

Predicting destabilizing point mutations making full use of the structure in AlphaFold DB

Category

Processing Pipelines and Methods

Challenge

One very exciting use case which AlphaFold 2 sadly fails at is predicting destabilizing point mutations, since it is not sensitive enough to alter its structural prediction based on a single residue change. Can we develop a pipeline that addresses this question, perhaps by extending PolyPhen-2 to make full use of the structures in AlphaFold DB, which cover the entire human proteome? Currently, this predictor only uses structures available in the PDB structure database. Adding AlphaFold structure predictions to this should be an easy improvement to make.

Benefit

This might directly help us assess the effects of missense mutations. It could be released as an “improved” version of a variant effect predictor such as PolyPhen-2, and it could ultimately even be incorporated into the St. Jude Medal Ceremony pipeline.

Helpful Tools, Packages, or Software

AlphaFold 2; AlphaFold DB; Variant effect predictors used in the St. Jude Medal Ceremony: PolyPhen2 (HVAR), SIFT, CADD, REVEL, FATHMM, MutationAssessor, and LRT; Additional variant effect predictor: Ensembl VEP

Test Data

ClinVar, ASHG pathogenicity classification, possibly HGMD
Toolbox for convenient manipulation of AlphaFold output

Title

Toolbox for convenient manipulation of AlphaFold output

Category

Processing Pipelines and Methods

Challenge

AlphaFold produces up to 5 predicted structures that contain slightly different conformations of the same modeled sequence. It then names the files by appending an incremental rank (from 0 to 4) according to the average pLDDT confidence score. However, in some cases, there is a specific region or domain of interest for which we will want to maximize this confidence score. AlphaFold also provides aligned error values for all residue-residue distances. Working with these values, however, is not straightforward. Our objective with this challenge is to develop a library containing “convenience functions” for working with AlphaFold-generated models. The goal is to write several independent scripts that each perform a specific function, taking AlphaFold models as input. Depending on the interest and the number of desired functionalities, it’s also tempting to build an object-oriented API to store and interact with AlphaFold output in a programmatic way.
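
One such convenience function, re-ranking models by confidence within a region of interest, might be sketched as follows. It relies on the fact that AlphaFold PDB output stores per-residue pLDDT in the B-factor column, and parses the fixed-width ATOM records directly so that no third-party package is needed. The function names are hypothetical, and the sketch assumes a single-chain model (chain IDs are ignored).

```python
def residue_plddt(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, residue number 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def mean_plddt(scores, start, end):
    """Average pLDDT over residues start..end (inclusive)."""
    values = [s for resnum, s in scores.items() if start <= resnum <= end]
    if not values:
        raise ValueError("no residues found in the requested range")
    return sum(values) / len(values)


def best_model(models, start, end):
    """Pick the model name with the highest mean pLDDT in the region.

    models: dict mapping a model name to the text of its PDB file.
    """
    return max(models, key=lambda name: mean_plddt(residue_plddt(models[name]), start, end))
```

With this, a model that ranks lower on whole-protein average pLDDT can still be selected when it is the most confident one for the domain of interest.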

Benefit

This would help us make the most of AlphaFold’s structure predictions by refining them further and extending the use-cases by allowing us to make better decisions.

Helpful Tools, Packages, or Software

AlphaFold 2, AlphaFold DB, Standard Python packages + numpy.

Test Data

PDB structures and AlphaFold models.
Establishing a workflow for identifying important structural features of a protein of interest integrating information from multiple sources
Title

Establishing a workflow for identifying important structural features of a protein of interest integrating information from multiple sources

Category

Processing Pipelines and Methods

Challenge

Next-generation sequencing has expanded our collective ability to identify common gene variants in the population, but we are still working to improve our ability to predict pathogenicity for these mutations. Along with the expansion of sequencing capabilities, GPU-enabled technologies have led to an expanded capacity to perform molecular dynamics (MD) simulations on protein structures. These MD simulations can identify features of proteins that are crucial for canonical function by identifying residue interactions networks that mediate intra- and inter- protein interactions in each protein state. The challenge is integrating these sources of information in a simplified workflow to identify the important structural features of a protein of interest based on annotated sequence information and protein dynamics information. The key deliverable from this challenge is to develop a workflow that can combine the dynamic residue interaction network of a protein with annotated sequence information to display the potential sites of pathogenicity onto a protein structure.

Protein dynamics and contact network analysis:
1. First, we suggest implementing an object interface for the calculation of inter-atomic distances. It should support the most common types of interactions (e.g., hydrogen bonds, ionic interactions, etc.). Various tools already exist that carry out such tasks, but with widely varying levels of user control, software dependencies, and supported APIs.
2. We then suggest using cpptraj to conduct dynamic cross-correlation analysis to identify correlations between dynamic regions of the receptor, and plotting the data with the matplotlib or seaborn package. Cpptraj supports all formats of MD trajectories generated via NAMD, AMBER, GROMACS or CHARMM, and analysis can be conducted in parallel. Additionally, we will quantify secondary structural elements to assess the impact of a mutation on the structure. We will use the ‘secstruct’ module from the cpptraj package, which in turn uses the DSSP algorithm.
Sequence-based domain and feature annotation of proteins:
To understand the potential effects of variants on the protein-protein interactions (PPIs) of a protein, it is important to be able to map the mutations/variants onto structural features, including those that may alter interactions. An example of this is the SARS-CoV-2 coat protein, where mutations may alter an antibody binding site or change a modification such as a glycosylation site.
1. Using nucleotide or amino acid sequence alignments alongside feature information from, e.g., UniProt, the user should be able to map potentially important protein features that are disrupted in certain sequence variants. In addition, resources such as ELM can be used to predict the locations of potential linear motifs or modification sites, which can provide additional insight into the potential epitope changes on viral coat proteins. This can be expanded in a general way to map variants in protein features to any structure to identify potentially important functional variants. The user should be able to input a large nucleotide sequence alignment and a structure, and be able to map where mutations are and which protein features are potentially disrupted.
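The feature-mapping step can be sketched as follows. This is a minimal illustration, assuming the alignment has already been translated to amino acids and that the features (for example, ones exported from UniProt or ELM) are available as 1-based inclusive intervals in reference coordinates. The `Feature` type and both function names are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 1-based, inclusive, in reference coordinates
    end: int    # 1-based, inclusive


def aligned_differences(ref_aln, var_aln):
    """List (ref_position, ref_residue, variant_residue) for mismatching columns.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    Columns where the reference has a gap (insertions) are skipped because
    they have no reference coordinate; deletions are reported with '-'.
    """
    if len(ref_aln) != len(var_aln):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    pos = 0
    for ref_res, var_res in zip(ref_aln, var_aln):
        if ref_res == "-":
            continue
        pos += 1
        if var_res != ref_res:
            diffs.append((pos, ref_res, var_res))
    return diffs


def disrupted_features(diffs, features):
    """Group differences by the features whose interval contains them."""
    hits = {}
    for pos, ref_res, var_res in diffs:
        for feature in features:
            if feature.start <= pos <= feature.end:
                hits.setdefault(feature.name, []).append(f"{ref_res}{pos}{var_res}")
    return hits
```

The reference positions returned here could then be passed to the visualization step to highlight the affected residues on the structure, provided the structure uses the same residue numbering as the reference sequence.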
Visualization of dynamics and sequence-based annotations:
1. The workflow should deliver a visual representation of the annotated structural features displayed on a three-dimensional protein model. We suggest using PyMOL to render 3D images of protein structures found in the RCSB PDB or the AlphaFold database. The annotations from structural dynamics analysis should be visualized as a “scene” separate from the annotations from sequence-based analysis. A third “scene” should be used to overlay the combined annotations.
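One way to produce the three scenes is to generate a PyMOL command script (.pml) from the two residue lists. The sketch below is an assumption about how this could be organized, not a prescribed design: it emits only the standard PyMOL commands `fetch`, `hide`, `show`, `color`, and `scene`, and the function names, scene names, and color choices are illustrative.

```python
def residue_selection(residues):
    """PyMOL selection string for residue numbers, e.g. 'resi 10+25'."""
    return "resi " + "+".join(str(r) for r in sorted(set(residues)))


def build_pml(structure_id, dynamics_residues, sequence_residues):
    """Return the text of a PyMOL script that stores three scenes.

    structure_id: identifier passed to PyMOL's fetch command
    dynamics_residues: residues flagged by the MD-based analysis
    sequence_residues: residues flagged by the sequence-based analysis
    """
    lines = [f"fetch {structure_id}, async=0", "hide everything", "show cartoon"]

    def add_scene(name, colored):
        lines.append("color grey80")  # reset to a neutral background color
        for color, residues in colored:
            if residues:
                lines.append(f"color {color}, {residue_selection(residues)}")
        lines.append(f"scene {name}, store")

    overlap = set(dynamics_residues) & set(sequence_residues)
    add_scene("dynamics", [("red", dynamics_residues)])
    add_scene("sequence", [("blue", sequence_residues)])
    add_scene(
        "combined",
        [("red", dynamics_residues), ("blue", sequence_residues), ("purple", overlap)],
    )
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and opening it with PyMOL should make the three scenes available for switching between views. Residues flagged by both analyses are colored last in the combined scene so that they stand out.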
Benefit

The solutions developed by this challenge will help to prioritize mutations identified at the population level based on their pathogenicity. The workflow will enable sequence-based annotation of protein functional regions identified through the MD analysis. This will integrate multiple levels of evidence to highlight the most likely pathological mechanisms that affect the function of a given protein of interest. This can be used in clinical diagnosis as a tool to score the pathogenic effects of variants. It can also be used in structural studies when designing protein mutations to stabilize proteins in non-cellular environments, by indicating regions in which to avoid mutations.

Helpful Tools, Packages, or Software

MDTraj, MDAnalysis, pytraj, numpy, scipy, seaborn, BioPython, pandas.

Test Data

MD simulation trajectories or PDB structures from the RCSB database, AlphaFold2 database, ELM (Eukaryotic Linear Motif resource), NCBI sequence databases, UniProt (for annotated features, PTMs, processing sites)
Machine learning pipeline to predict locations of mutations in different cancer types

Title

Machine learning pipeline to predict locations of mutations in different cancer types

Category

Processing Pipelines and Methods

Challenge

Can we predict which missense mutations will likely appear (at the residue level) for a given cancer type? The challenge would consist of building an ML pipeline to predict the locations and substitutions of mutations in genes in samples of different cancer types. Current ML solutions either aim at identifying cancer driver genes (Malebary et al., Sci. Rep. 2021) or use older technologies (SVM, doi:10.1109/iarwisoci.2014.7034632). The aim of the challenge would be to assess, with state-of-the-art ML methods, how accurately we are currently able to estimate the appearance of cancer mutations.

The input features for each sample could consist of the amino-acid sequence of a target gene and a cancer type label, and the participants would be asked to predict the location and amino-acid substitution observed in that gene and cancer type. This data could be obtained from available public databases (e.g., COSMIC, TCGA).

Teams would be free to pull in additional genomic data (e.g., homologous sequences, natural variant data), structural data, network data, etc., with the exception that cancer mutational data may not be used explicitly to build a predictor. The teams’ pipelines would be benchmarked on a held-out set of mutated cancer genes.
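The challenge text does not specify a benchmark metric. As one possible choice, the sketch below computes a top-k hit rate: the fraction of held-out samples for which at least one observed mutation appears among a pipeline's k highest-ranked predictions. The data layout (samples keyed by gene and cancer type, mutations as position and substituted residue) and the function name are assumptions made for illustration.

```python
def top_k_hit_rate(predictions, observed, k=5):
    """Fraction of held-out samples with an observed mutation in the top k.

    predictions: {sample_key: [(position, substituted_residue), ...]}
        ranked from most to least likely by the pipeline under test
    observed: {sample_key: set of (position, substituted_residue)}
        the mutations actually seen in the held-out data
    Samples with no predictions count as misses.
    """
    if not observed:
        raise ValueError("no held-out samples to score")
    hits = 0
    for sample_key, true_mutations in observed.items():
        ranked = predictions.get(sample_key, [])
        if any(mutation in true_mutations for mutation in ranked[:k]):
            hits += 1
    return hits / len(observed)
```

A stricter variant could require the exact substitution to rank first, and a more lenient one could score only the position and ignore the substituted residue.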

Benefit

This challenge could allow us to assess how effective current state-of-the-art ML methods (DL, transformers, protein language models, etc.) are at predicting the likelihood of mutations in cancers. Such ML pipelines could be beneficial both for understanding what genomic and biological features impact the likely location of mutations in cancer genes, and for helping to analyze and process the large amount of data available in the St. Jude PECAN database related to the mutational landscape in pediatric cancers.

Helpful Tools, Packages, or Software

ML: Free choice of tools (TensorFlow, Keras, PyTorch, JAX, etc.)
Pretrained models: MSA Transformer (FAIR resource), UniRep

Test Data

Cancer genomic datasets: TCGA, COSMIC, SJ. PECAN?
Putative core regulatory circuitry identification

Title

Putative core regulatory circuitry identification

Submitter

Jared Andrews

Category

Processing Pipelines and Methods

Challenge

Core regulatory circuitries (CRCs) have been revealed as enticing therapeutic targets in the context of cancer. These circuitries are defined as cliques of self- and mutual-regulating transcription factors (TFs) driven by super enhancers (SEs) that drive an enhanced network of genes necessary for maintenance of disease state.

In existing methodologies, putative CRCs are identified by first performing super enhancer calling from H3K27ac ChIP or CUTandRUN data and associating them with nearby genes. Next, motif scanning within SE subpeaks or SE-contained ATAC-seq peaks is performed. Lastly, network analysis is used to identify in/out degree of each TF with other SE-associated TFs via TF motif presence. Optionally, active TFs and SE-gene association can be informed via incorporation of expression data (e.g. removing all genes with TPM < 1).

Current tools to perform these steps are no longer maintained, difficult to install, and out of date (Python 2, old annotations/motif databases, etc.). This challenge would be to create a new, simplified tool that incorporates these methods to be used on semi-processed datasets: matched ATAC-seq, RNA-seq, and SE calls for a given sample. In addition, there now exist methods to predict bound TF motifs from ATAC-seq data via motif footprinting, which should improve the accuracy and robustness of this approach.

In short, given a BED file of predicted bound TF motifs, a BED file of SE calls, and an expression matrix of counts/TPMs for a given sample, identify putative CRCs based on network analysis. Bonus points if the resulting networks can be visualized easily or if downstream gene target information is provided.
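The network-analysis step can be sketched as follows. This is a simplified illustration under several assumptions: the BED files have already been parsed into tuples, each SE has been assigned to a single gene, an edge from TF A to TF B means that a bound motif for A overlaps an SE assigned to B, and candidate CRCs are taken to be maximal cliques of self-regulating, mutually connected TFs. All function names are hypothetical, and the existing tools may score circuitries differently.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open BED intervals overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def build_tf_edges(motif_sites, super_enhancers, tpm, min_tpm=1.0):
    """Directed edges {regulator: set(targets)} among expressed SE-associated TFs.

    motif_sites: (chrom, start, end, tf_name) for predicted bound motifs
    super_enhancers: (chrom, start, end, associated_gene)
    tpm: {gene: TPM}; TFs below min_tpm are ignored
    """
    motif_sites = list(motif_sites)
    expressed = {tf for *_, tf in motif_sites if tpm.get(tf, 0.0) >= min_tpm}
    edges = defaultdict(set)
    for se_chrom, se_start, se_end, gene in super_enhancers:
        if gene not in expressed:
            continue
        for chrom, start, end, tf in motif_sites:
            if tf in expressed and chrom == se_chrom and overlaps(start, end, se_start, se_end):
                edges[tf].add(gene)
    return dict(edges)


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for the TF network."""
    out_deg = {tf: len(targets) for tf, targets in edges.items()}
    in_deg = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            in_deg[target] += 1
    return dict(in_deg), out_deg


def candidate_crcs(edges):
    """Maximal sets of TFs that regulate themselves and every other member."""
    auto = {tf for tf, targets in edges.items() if tf in targets}
    mutual = {
        tf: {o for o in auto if o != tf and o in edges[tf] and tf in edges[o]}
        for tf in auto
    }
    cliques = []

    def expand(clique, candidates, excluded):
        # Bron-Kerbosch enumeration of maximal cliques (no pivoting).
        if not candidates and not excluded:
            if clique:
                cliques.append(sorted(clique))
            return
        for tf in sorted(candidates):
            expand(clique | {tf}, candidates & mutual[tf], excluded & mutual[tf])
            candidates = candidates - {tf}
            excluded = excluded | {tf}

    expand(set(), set(auto), set())
    return sorted(cliques, key=lambda c: (-len(c), c))
```

The edge dictionary can also be exported directly to a graph library or an edge list for visualization, and the targets of each CRC member that are not themselves in the clique give a first approximation of downstream gene targets.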

Benefit

A solution to this problem would allow us to include putative CRC identification into our standard workflow for sample analysis. It would also kickstart the field with a much needed update to these methodologies and provide a mechanism to better integrate these multi-modal datasets into cohesive, holistic analyses.

Helpful Tools, Packages, or Software

Example methodologies can be seen in coltron, CRCmapper, and crc. All of these are quite out of date and difficult to install, but crc is the clearest. TOBIAS can be used for ATAC-seq footprinting.

Test Data

Any matched ATAC-seq, RNA-seq, and H3K27ac ChIP-seq or CUTandRUN. Motif databases from TRANSFAC, JASPAR, HOCOMOCO, etc. ROSE for super enhancer calling.
Inference of enriched miRNA target sites in alternative/differential transcripts

Title

Inference of enriched miRNA target sites in alternative/differential transcripts

Submitter

Gang Wu

Category

Processing Pipelines and Methods

Challenge

MicroRNAs (miRNAs) regulate gene expression largely by recognizing target sites in the 3'-UTR of mRNAs. Although bulk RNA-seq is readily available, miRNA expression is not often measured directly using miRNA sequencing. Tools such as Sylamer can infer the enriched miRNA target sites in a differentially expressed gene list. With bulk and single-cell RNA-seq datasets readily available, similar functionality can be extended to detect miRNA target sites enriched in transcripts with alternative/differential polyA usage between bulk RNA-seq under different conditions or between different cell types in scRNA-seq.
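The core enrichment test can be sketched in a few lines. The example below is a simplification and not Sylamer's algorithm: it tests a single foreground set with a hypergeometric tail probability, whereas Sylamer evaluates word enrichment across a ranked list. It assumes the 7-mer site complementary to miRNA nucleotides 2 to 8 as the target-site definition, and the function names are hypothetical. For alternative polyA usage, the sequences passed in could be the UTR segments between proximal and distal polyA sites, although that choice is also an assumption.

```python
from math import comb


def seed_site(mirna):
    """7-mer target site (DNA alphabet, 5'->3') complementary to miRNA nt 2-8."""
    seed = mirna.upper().replace("U", "T")[1:8]
    return seed.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are successes."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def site_enrichment(utrs, selected, mirna):
    """Test whether transcripts in `selected` are enriched for the seed site.

    utrs: {transcript_id: 3'-UTR sequence}, the background set
    selected: transcript ids in the foreground, e.g. transcripts with
        differential polyA usage
    Returns (number of selected transcripts with a site, p-value).
    """
    site = seed_site(mirna)
    with_site = {t for t, seq in utrs.items() if site in seq.upper().replace("U", "T")}
    foreground = set(selected) & set(utrs)
    k = len(with_site & foreground)
    return k, hypergeom_sf(k, len(utrs), len(with_site), len(foreground))
```

In practice the p-values would need correction for testing many miRNAs, and UTR length and composition biases, which Sylamer addresses, are ignored here.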

Benefit

A solution to this challenge would allow many published datasets to be re-analyzed to determine how miRNA expression changes may impact various conditions and cell states, in addition to providing useful functionality in RNA-seq workflows.

Helpful Tools, Packages, or Software

Sylamer: paper (https://www.nature.com/articles/nmeth.1267) and source code (https://github.com/micans/sylamer)

Test Data

TCGA RNAseq and matched miRNAseq data; Bulk RNAseq and miRNAseq from miR-451/144 knockout mice; Public scRNAseq
