Projects

Recommendation Engine for Ballotpedia Users

Sponsor: Ballotpedia

Contact: Ken Carbullido <ken.carbullido@ballotpedia.org>

Abstract: Ballotpedia connects people with politics by changing the way they access the information they need to be informed about federal, state, local, and territorial politics. Our content includes neutral, accurate, and verifiable information on government officials and the offices they hold, political issues and public policy, elections, candidates, and the influencers of politics. Ballotpedia currently has over 320,000 encyclopedic articles. Students will access Google Analytics data for the Ballotpedia website in order to build a recommendation engine that helps Ballotpedia users locate related articles of interest.

Restrictions: Data Use agreement (protect privacy of IP addresses)

Q: Can you discuss the type of data that students will work with in more detail (e.g. browser history, location, etc.)? Do you have any existing recommendation engines or any other methods of customizing content for readers?

A: "Students will work with the data that Google Analytics (GA) captures. Here are links that can explain how GA works and the data it captures.
https://www.bounteous.com/insights/2016/06/22/how-does-google-analytics-collect-information/
https://www.mattlane.co.nz/2014/10/09/an-idiots-guide-to-google-analytics/
https://www.digitalthirdcoast.com/blog/how-does-google-analytics-actually-work
Ballotpedia does not currently use a recommendation engine. All pages on Ballotpedia are built by the Ballotpedia editorial staff using the MediaWiki platform."

Q: Ballotpedia -- have you tried to create a recommendation system before? If so, can you elaborate on the success (or lack thereof) you had? Recommendation systems are difficult and dependent on the richness of information you have on the people you are recommending to. How do you know if your recommendation is helpful?

A: "Ballotpedia has never tried to create a recommendation engine. With over 288 million page views in 2020, and over 160 million page views this year, there is a great deal of Google Analytics data captured that shows how users are navigating into and to the various Ballotpedia pages. Ballotpedia does not currently have users sign-on so there is no user demographic or other data other than what Google captures if a user has signed into Google. If a user has signed in, then that demographic data may be available, although some research may be necessary to validate that as we have never sought out that data. A recommendation that is based on where users typically navigate from and to, even without demographic data, may still prove to be a very useful recommendation. I would expect that if the recommended links are being clicked/used, then that is a measure of success."

Q: Is there an opportunity/need in this project for the team to provide recommendations for the implementation of additional Google Analytics event tracking tags on the site?

A: "Yes, we are able add GA event tracking tags to the site.

Classify Mortgage Documents with CV & NLP

Sponsor: Capital Center LLC (CapCenter)

Contact: Adrian Mead <amead@capcenter.com>

Abstract: CapCenter is a Richmond-based full-service mortgage company with a focus on cost-savings for clients. At its core, the mortgage industry is a document-driven business. The typical underwriting process is robust and spans hundreds of different forms including W2s, paystubs, liens, assets, insurance policies, etc. Given the wide variety of documents, in addition to significant differences across financial institutions, the current day-to-day workflow is demanding and relies heavily on humans to judge the nature of submitted documents and reliably extract structured data from them. We see the potential for Data Science to play a role in automating away this manual process with the aid of Computer Vision and NLP. The major deliverables would be threefold: first, a well-trained, reliable neural network classifier capable of ingesting document images and returning a reasonably good classification of document type (W2, etc.); second, an additional NLP-based classifier leveraging text data we’ve been able to successfully extract ourselves from a subset of the documents, which returns its own independent prediction of the document type; and finally, an ensembling algorithm for combining the predictions of the previous two models into a single, more robust prediction. Additional outcomes would include model documentation, predictions on an unlabeled holdout dataset, and high-level performance metrics.

Restrictions: Students must sign data use agreement
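
As a minimal illustration of the third deliverable described in the abstract, the sketch below combines the class-probability outputs of a hypothetical image classifier and a hypothetical text classifier by weighted soft voting; the class names and weights are placeholders, not CapCenter's actual taxonomy or method.

    import numpy as np

    # Hypothetical per-class probabilities from the two models for one document,
    # over the same ordered list of document-type classes (placeholder names).
    classes = ["W2", "paystub", "insurance_policy"]
    p_vision = np.array([0.70, 0.20, 0.10])  # image-based CNN output
    p_text = np.array([0.55, 0.35, 0.10])    # NLP classifier output

    def ensemble(p1, p2, w1=0.5, w2=0.5):
        """Weighted soft-voting combination of two probability vectors."""
        combined = w1 * p1 + w2 * p2
        return combined / combined.sum()

    p_final = ensemble(p_vision, p_text, w1=0.6, w2=0.4)
    print(classes[int(np.argmax(p_final))], p_final)

The weights themselves could be tuned on a validation set, or replaced by a learned meta-model (stacking) once both base classifiers exist.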

Q: "Can you discuss the size of your dataset for the various document types (specifically the labeled data)? What format(s) are the different documents in?"

A: ""Yes, I have some numbers here. I'd say we're looking at an ~65-class problem. Cardinality is good across classes, ranging from ~25K-75K instances of each document type. In terms of physical memory, I think the dataset size is >2TB
Formats vary: overwhelming PDF (73%), then TIF (17%). Additionally we see some HTML, HTM, JPG, TXT, and proprietary formats."

Q: How will the final neural net model be deployed? Will it make classification inferences in batches? Or streaming as the documents come in?

A: Deployment is an open question but likely cloud-deployed. Operation would be streaming.

Q: Will part of this project require development work to automatically download the multiple data sources as part of the extraction discussed?

A: Not beyond import code in Python. We'll provide the data. We'd also handle the MLOps around model deployment / retraining / production runtime. I'd like you to focus on the ML piece.

Q: 1) Since some of this extraction and classification process is part of the underwriting decision process, will the group be educated on how underwriting decisions are made, in order to validate data extracted, as well as find additional qualitative data previously overlooked by underwriters to assist their decisions? 2) Would there be an opportunity for the group to develop a probability score on how likely an individual would meet the underwriting qualifications and compare that to actual approved or unapproved applications?

A: "1) Probably not. The scope of our problem is to ingest documents and return a classification and text guesses from the document based on that classification. This is pretty far removed from the actual approve/decline decision that's made during underwriting.
2) No that's out of scope."

Q: "My concern is that I'm not strong enough to do the work. We did not get into computer vision (I assume PyTorch?) or extracting text from documents (PDF or otherwise). And despite only having had two classes in R, I'm much better in R.
So, can you help set expectations: is this as difficult as it seems to be?
Any thoughts would be appreciated."

A: "Thanks for reaching out! I’d say this is hard, yes. But that was done on purpose. This is also totally feasible with where ML is today. I’m 100% sure of that. In some ways I think this is the ultimate project given the clearly-defined goal, the supervised nature of the problem, the importance it has to me as the stakeholder (also a fluent data scientist), and the “sexiness” of the material (CV and NLP). If the material is interesting to you and if you feel like you are someone that thrives when challenged, this could be a great fit. My goal is to set you all up for success with the data and engineering work while letting you guys totally handle the ML piece (and I can give feedback there/you’ll have a professor supporting you). I will push you guys to do your best and can guarantee the same courtesy in return. And to try and be somewhat explicitly more helpful, here’s my opinion. I’d rather have someone who’s passionate about either the technology piece/the application than not. And I’d rather have someone that will work hard and doesn’t know Python that well than someone who knows Python but won’t work hard. I think those combinations will give us the best chance of success."

Transparency and Justice in the Virginia Court System

Sponsor: Code for Charlottesville

Contact: Jonathan Kropko <jkropko@virginia.edu>; Ben Frye <benfrye@gmail.com>

Abstract: Virginia has a public, web-based system through which anyone can look up information on any court case that occurred in district or circuit criminal, civil, or traffic courts. Civic Tech volunteer Ben Schoenfeld has been running a web-scraper to collect all cases from all Virginia courts since 2000. There are about 175 CSVs on this website, excluding civil cases, comprising about 10 GB of data. Personally identifiable information has been replaced by an anonymous, unique ID code for each individual in the data. This project is part of Code for Charlottesville’s ongoing collaboration with the Legal Aid Justice Center (LAJC), a nonprofit based in Charlottesville. The LAJC needs help wrangling and analyzing the data to provide answers to questions such as:

  • How many cases are eligible to be expunged from public view according to a new law that takes effect in 2025?
  • Are there racial disparities in who gets charged with what offenses, and what the final outcomes are?
  • Are there geographic disparities?
  • Under what conditions is probation applied as opposed to a fine or a jail sentence?

Restrictions: none

ML for Asset Management – Build and manage portfolios of financial assets.

Sponsor: Conlan Scientific

Contact: Chris Conlan <chris@conlan.io>

Abstract: The goal will be to build a machine-learning-driven asset management strategy. Given a set of assets within a financial market (commodities, stocks, currencies) and relevant related data (fundamental, macro, and alternative data streams that may have a causal relationship to changes in asset prices), develop a machine-learned approach to allocating a limited amount of capital to each asset across time.

Restrictions: Students must sign data use agreement

Q: Are we limited to the data that is provided or can we use other resources? Is there a specific time frame you are looking at for the investments to mature (i.e. long or short term)?

A: You will be provided historical equity price and fundamental data, metadata, index data, and ETF data. It is up to the group whether or not alternative data integrations will be explored, and the decision should be made in the interest of optimizing the strategy's performance and in consideration of any budgetary constraints.

Q: How do you measure success of the asset-management strategy? Is it by performance vs. specific benchmarks? Do you have the alternative data mentioned or is feature exploration and determination (including collection) part of the project?

A: The success of the strategy will be determined by the performance of your trading simulations relative to the appropriate benchmarks using risk-adjusted return metrics like the Sharpe and Sortino Ratios. The core dataset will be provided (historical equity prices, fundamentals, indexes, ETFs, etc.) and whether or not any alternative data will be integrated will be determined by budgetary constraints and the group's desire to explore such data as potentially effective features. The goal of developing a successful ML-driven strategy can be achieved with or without alternative data.
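
For reference, the sketch below shows one common way to compute annualized Sharpe and Sortino ratios from a simulated daily return series; the conventions used here (risk-free rate, annualization factor, downside-deviation definition) are assumptions and would need to match whatever benchmark methodology the sponsor prefers.

    import numpy as np

    def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
        """Annualized Sharpe ratio of a series of periodic returns."""
        excess = np.asarray(returns) - risk_free / periods_per_year
        return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

    def sortino_ratio(returns, risk_free=0.0, periods_per_year=252):
        """Annualized Sortino ratio; penalizes only downside volatility."""
        excess = np.asarray(returns) - risk_free / periods_per_year
        downside = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))
        return np.sqrt(periods_per_year) * excess.mean() / downside

    daily_returns = np.random.default_rng(0).normal(0.0005, 0.01, 252)  # toy data
    print(sharpe_ratio(daily_returns), sortino_ratio(daily_returns))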

Improving neonatal care with deep learning

Sponsor: Columbia University

Contact: Cait Dreisbach <c.dreisbach@columbia.edu>

Abstract: External fetal monitoring (EFM) is the most common method of assessing maternal contractions and fetal distress during labor. EFM use is associated with adverse outcomes including higher rates of cesarean surgery, instrumental vaginal births, and maternal infection. Further, inconsistent EFM interpretations, high false positive rates, and poor inter-observer reliability of fetal heart rate decelerations contribute to a lack of clarity for care decision-making and an inability to accurately predict fetal hypoxia. Few studies have used data science approaches to predict poor fetal heart rate tracings, and those that exist lack diverse data inputs, use cross-sectional designs, and often rely on proprietary software. The capstone team will use deep learning methods to dynamically monitor fetal heart rate, maternal physiological data, and nursing input to shape care delivery at the bedside. The main objectives of this research are to: 1) Build a convolutional neural network (CNN) to classify fetal heart rate (e.g. acceleration, late deceleration) and heart rate variability (e.g. minimal, moderate, marked) in relation to maternal contraction patterns. 2) Explore how we can use fetal heart rate tracings to predict neonatal outcomes (e.g. umbilical cord blood pH, Apgar score).
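
A minimal PyTorch sketch of objective 1, treating a windowed fetal-heart-rate trace and contraction trace as a two-channel 1-D signal fed to a small CNN; the channel count, window length, and number of classes are placeholders, not the study's actual specification.

    import torch
    import torch.nn as nn

    class FHRClassifier(nn.Module):
        """Toy 1-D CNN for classifying windowed FHR/contraction traces."""
        def __init__(self, n_channels=2, n_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 16, kernel_size=7, padding=3), nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):                      # x: (batch, channels, samples)
            return self.classifier(self.features(x).squeeze(-1))

    model = FHRClassifier()
    dummy = torch.randn(8, 2, 2400)                # 8 windows, 2 signals, 2400 samples
    print(model(dummy).shape)                      # torch.Size([8, 5])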

Restrictions: none

High-throughput Predictive Modeling of in vivo Chemical Transcriptomics

Sponsor: CUNY 1

Contact: Xie Lie <lxie@iscb.org>

Abstract: In this project, students will develop an innovative computational platform for high-throughput predictive modeling of in vivo chemical transcriptomics by integrating deep learning with systems pharmacology. The premise of the approach is based on a systematic view of drug actions. Drugs commonly interact with not only their intended protein target (i.e., the on-target) but also multiple other proteins (i.e., off-targets). These on-targets and off-targets collectively induce the phenotypic response of a biological system via biological networks, which can be characterized by transcriptomics. Thus, a successful compound screening requires the deconvolution of drug actions on a multi-scale, from molecular interactions to network perturbations. Recent advances in high-throughput technology and machine learning provide us with new opportunities for multi-scale modeling of drug actions for precision drug discovery. By bridging target-based and phenotypic screening, the completion of this project will overcome a critical barrier between chemical potency in vitro and drug efficacy in patients.

Restrictions: none

Q: What is the metric of success for the model proposed? For example, is one trying to predict the transcriptome from known drug binding sites? Drug binding sites from perturbations to the transcriptome? Also, will this project be relying on a publicly available database or databases?
A: The target is to predict the drug-induced transcriptomic changes of a cell line or a patient. The inputs are a chemical structure and the basal gene expression of a disease state. All data needed for the project come from public databases and have been pre-processed and are ready to use.
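
A toy sketch of the stated prediction task, framed as multi-output regression from a chemical feature vector concatenated with basal expression to a vector of expression changes; the feature counts and random data are placeholders, and the actual project would use the pre-processed public data and a deep-learning model rather than ridge regression.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical pre-processed inputs: chemical descriptors (e.g. fingerprints)
    # concatenated with basal gene expression of the disease state.
    n_compounds, n_chem_feats, n_genes = 500, 256, 100
    X_chem = rng.random((n_compounds, n_chem_feats))
    X_basal = rng.random((n_compounds, n_genes))
    X = np.hstack([X_chem, X_basal])
    y = rng.normal(size=(n_compounds, n_genes))   # drug-induced expression changes

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)      # Ridge handles multi-output y
    print("R^2 on held-out compounds:", model.score(X_te, y_te))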

Q: Are these projects achievable in the time frame? What would success look like in these projects?
A: Because the data and framework are already available, we do not expect major hurdles. Success can be measured by (1) beating the state-of-the-art in the benchmark studies and (2) experimentally validating compounds that have the desired effects.

Genome-wide prediction of microbiome metabolite-host protein interactions

Sponsor: CUNY 2

Contact: Xie Lie <lxie@iscb.org>

Abstract: Students will develop computational tools to identify genome-scale interactions between microbiome metabolites and host proteins. There are two challenges in predicting molecular interactions using machine learning. The first one is the out-of-distribution problem, i.e., the molecule of interest may be significantly different from the molecules used as the training set. A conventional machine learning model cannot make reliable predictions, as unlabeled data are outside the domain of a trained machine learning model. We require novel approaches to detect unrecognized functional relationships between molecules. The second challenge is that it is often infeasible to obtain a large number of samples with coherent annotations. It is necessary to integrate multiple noisy and heterogeneous omics data and combine multiple methods in machine learning. To address the above challenges, this project will integrate two active areas in deep learning: self-supervised learning and semi-supervised learning.

Restrictions: none

Q: Which proteins/protein families are of interest in the study?

A: We will screen metabolites against all proteins in the human genome (~30,000). All data needed for the project come from public databases and have been pre-processed and are ready to use.

Multi-relationship Multi-layered Network Model for Omics Data Integration and Analysis

Sponsor: CUNY 3

Contact: Xie Lie <lxie@iscb.org>

Abstract: Although the multi-relationship, multi-layered network (MMN) is a potentially powerful means to integrate heterogeneous biological data and model biological systems on a multi-scale, its application is seriously hindered by the lack of efficient, accurate, and robust algorithms for inferring missing relationships, especially when the relationships are signed, i.e. when two relationships between two entities have opposite consequences. For example, gene mutations can be loss-of-function or gain-of-function, and a gene can inhibit or activate a pathway. Explicit incorporation of the signed information in the MMN may significantly enhance the capability of data-driven multi-scale modeling of biological systems, and facilitate disentangling causality from correlation, a lofty goal of applying data science to biological problems.

Restrictions: none

Impact of weather on search and rescue success

Sponsor: dbS Productions

Contact: Robert J. Koester <Robert@dbs-sar.com>

Abstract: The impact of weather on predicting the location of missing persons in search and rescue is largely unknown, although initial data collection and common sense predict that weather has a profound effect on how missing people behave when lost. The International Search and Rescue Incident Database (ISRID) is the worldwide gold standard for predicting the location of missing people. It takes a statistical approach and currently reports six different models: ring, dispersion, track offset, mobility, elevation, and find location. In addition, other models have been developed, including watershed, point, and revised point last seen. While these models take into consideration subject category, terrain, and ecoregion domain, they fail to take into account the weather. Analysis of the ISRID data would allow us to determine whether weather makes a significant difference and, if so, how much. This new information would be integrated into predictive software to guide search and rescue mission planners.

Restrictions: Students must sign IP release

Q: It sounded as though dbsProductions sponsored a capstone project during a previous semester. Would it be possible to contact one of the students that worked on it?

A: I can’t share contact information for students or alums, but you can find the last group’s presentation here: https://edas.info/p28115

Predicting Survivability in Lost Person Cases

Michael Pajewski, Chirag A Kulkarni, Nikhil Daga and Ronak Rijhwani (University of Virginia, USA)

Over 600,000 people go missing each year in the United States. These events can cover situations anywhere from a young child going missing in a park to a group of hikers getting lost on a trail. dbS Productions has collected data on 16,863 searches over the past 30 years to generate an international database for use by search and rescue teams. The data recorded include a variety of fields such as subject category, terrain, sex, weight, and search hours. The data set is currently being underutilized by search and rescue teams due to a lack of applicable predictive tools built upon the aforementioned data. These search and rescue teams are also often volunteer-based and face great resource limitations in their operations. A tool is needed to predict the probability of a missing person's survival for the operation's coordinator to aid in resource allocation and the decision to continue or terminate search missions, which can be costly. This paper details an effort to create such a survivability predictor to help with this goal. We applied a Boosted Tree implementation of an Accelerated Failure Time (AFT) model to estimate the probability that a lost person would be found over time, given personal information about the subject, the location, and weather. We engineered several categorical variables and obtained weather data through the National Weather Service API to improve the model performance. Our engineered model recorded a C-index score of 0.67, which indicates a relatively robust model where industry standard considers 0.7 as "good" and 0.5 on par with random guessing. An analysis of the feature weights suggested that subject age, temperature, population density, mental fitness, and sex are the most critical indicators of survival in a missing person incident. Future work should involve incorporating more specific weather data, such as wind speeds and precipitation, into the model to improve prediction accuracy. Further research directions may include building a geo-spatial model to predict potential paths taken by a missing person based on initial location and the same predictors used in the survivability model.
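
A rough sketch of the accelerated-failure-time setup the previous group describes, using XGBoost's survival:aft objective on toy data and scoring with a concordance index from the lifelines package (an assumed dependency); the features, censoring logic, and hyperparameters here are placeholders, not the prior team's actual pipeline.

    import numpy as np
    import xgboost as xgb
    from lifelines.utils import concordance_index

    rng = np.random.default_rng(0)
    n = 1000
    X = rng.random((n, 5))                       # stand-in subject/terrain/weather features
    time_found = rng.exponential(24, n)          # hours until found (toy data)
    found = rng.random(n) < 0.8                  # ~20% of records treated as censored

    # XGBoost's AFT objective takes interval labels: [t, t] for observed events
    # and [t, +inf) for right-censored records.
    dtrain = xgb.DMatrix(X)
    dtrain.set_float_info("label_lower_bound", time_found)
    dtrain.set_float_info("label_upper_bound",
                          np.where(found, time_found, np.inf))

    params = {"objective": "survival:aft", "aft_loss_distribution": "normal",
              "aft_loss_distribution_scale": 1.0, "learning_rate": 0.05}
    bst = xgb.train(params, dtrain, num_boost_round=200)

    pred = bst.predict(dtrain)                   # predicted survival (time-to-find) values
    print("C-index:", concordance_index(time_found, pred, found))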

Q: Can you describe what kind of information/data is available in more detail (subject category, terrain, and ecoregion domain)?

A: The subject category field contains 40+ types (hunter, hiker, dementia, etc.); terrain is mountainous vs. non-mountainous; ecoregions are the Bailey ecoregions used by the USDA.

Q: Can you talk a bit about some of the current variables used in each of the models? And also the output from those models? Basically, I am looking for some more details into what sounds like a really interesting project (I couldn't find the database, so couldn't look into this myself).

A: The major outputs are survivability, distance traveled from the initial point, distance traveled from a significant clue, direction of travel (dispersion), change in elevation, mobility time, track offset from a linear feature, land feature found, watershed contingency, and distance from destination. There are lots of variables, such as subject category, terrain, population density, ecoregion, weather variables (the chief goal of this project), subject variables, cognitive level, etc.

Q: How many years of weather data will be made available, and what level (hourly, daily, etc.) and what is the source? Will the team be required to engineer the weather data to match to the intervals of information available about the missing persons and activity?

A: The database spans from 1980-2019. Most of the data is from 2010-2015. Some records have weather data, some might require obtaining weather data. In the past we used NOAA data. It will be worthwhile to look at daily vs hourly data to determine what works the best. Need to balance requirements for the end user to input data versus increased accuracy of increased data. So I would say yes to the team selecting the best interval after looking at various tradeoffs.
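
As a small illustration of matching weather observations to incident records at a chosen interval, the sketch below does a daily join on hypothetical station and date keys; the real ISRID and NOAA column names will differ, and an hourly design would simply add the hour to the join key.

    import pandas as pd

    # Hypothetical incident and daily-weather tables; real column names will differ.
    incidents = pd.DataFrame({
        "incident_id": [1, 2],
        "date": pd.to_datetime(["2014-06-01", "2015-01-15"]),
        "station": ["KCHO", "KROA"],
    })
    weather = pd.DataFrame({
        "station": ["KCHO", "KROA"],
        "date": pd.to_datetime(["2014-06-01", "2015-01-15"]),
        "tmax_c": [28.0, -2.0],
        "precip_mm": [0.0, 5.1],
    })

    # Daily join; switching to hourly data only changes the key granularity.
    merged = incidents.merge(weather, on=["station", "date"], how="left")
    print(merged)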

Q: It sounded as though you are working with or have your own proprietary software. What is your preferred language for this project (R, Python, something else)? Is the end result simply a predictive/inferential model, or are you hoping to roll it out in an interactive app?

A: The ISRID database is in R. However, the relevant data can be delivered via Excel. The software I use is written in HTML5 for delivery over the web and takes advantage of Python scripts in a Docker container, so Python would be the preferred language. I have written apps that present the data to users, so an interactive app would also be considered down the line.

Prevalence and drivers of disease risk

Sponsor: Deloitte

Contact: Pittman, John Mark <jpittman@deloitte.com>; Boxley, Ab <aboxley@deloitte.com>; Howard, Brock Michael <brohoward@deloitte.com>

Abstract: The students will have access to an anonymized dataset that will contain (1) demographics, (2) information on social determinants of health, and (3) risk of developing certain health conditions.

Restrictions: Must sign NDA

Additional info:

This capstone project is using data from Deloitte's product, HealthPrism. HealthPrism™ is a predictive population health analytics platform designed to help organizations identify and support populations at greater risk for various health conditions. The solution is a flexible platform designed to deliver insights into who is at elevated risk for a specified health condition, what barriers to treatment exist, and what interventions might best help that population.

HealthPrism is an exciting Deloitte asset that brings together disease risk data, demographics, and other social determinants of health to offer a holistic view of populations in the United States. The dataset for the capstone is large and incredibly rich. An example of a project through Deloitte that has used HealthPrism data is to embed equity considerations into Virginia’s COVID-19 response by identifying and engaging vulnerable and underserved populations.

As part of their capstone, Deloitte expects the students to help them solve complex data science problems around health and social inequities. More specifically, the students may be asked to go through discovery, ideation, and solution development for identifying vulnerable populations that experience transportation insecurity across the country and for recommending solutions that public health officials can use to address such inequities. Further, the solution will identify those populations that are more susceptible to other disease states because of such inequities.

Due to the proprietary nature of the data, we’ll need students on this capstone to sign an NDA as well as an Intellectual Property Release before things kick off; however, there shouldn’t be any additional restrictions.

Q: Can you talk about how these analyses will be used? E.g., will they be used for insurance clients? Do you have more details into the dataset? Is it static, or is the idea for the students to determine additional features, collect data, and incorporate into their analyses?

A: Your analysis will contribute to the HealthPrism product, which has been used to help inform public health policy, particularly with regard to COVID-19 response and health equity. I did a career talk earlier in the month about a use case for HealthPrism; Wendell Collins should have the recording of the talk if you're interested. We have a static dataset; however, you may consider using additional, publicly available datasets to improve or validate your models/analysis. So, to answer your question, it will likely be a little bit of both.

Q: Is this project focused on prediction or analytics? The abstract mentions the dataset, but not concrete goals for its use. Could you elaborate more on the focus of the project?

A: The core of the project will be predictive in nature and focused on developing a new modeling approach that could help the HealthPrism data science team develop new features for the product. Assuming that the modeling objectives are accomplished, there may be opportunities to explore the data set further and develop visualizations, if the capstone team wants to go in that direction. See my answer for one of the other questions for how HealthPrism actually gets used.

Q: Is this project meant to be specifically AI-driven, or is it open to any data-driven methods? Is this a new data set that needs to be cleaned, or is the data already cleaned?

A: See a previous question for my answer to the first part of your question. The core dataset will be well-structured and relatively clean; however, the capstone team will need to do some additional cleaning and feature engineering. There's also the possibility of using other publicly available data to augment what we have available; it would be up to the capstone team to do the cleaning and aggregation of those datasets (if used).

Synthetic Data: Faking It Till we Make it, but What’s the Catch?

Sponsor: LMI 1

Contact: Brant Horio <bhorio@lmi.org>

Abstract: Artificial intelligence is driven by data to identify patterns, detect anomalies, make predictions, and generate insights. Often, the applications of AI are challenged by lack of access to real-world data due to privacy or legal concerns, or we don’t have enough of it to robustly train our models. Synthetic data increasingly is filling this gap; however, there is a data value-utility tradeoff that must be considered for each use case—it is not possible to preserve every statistical property of the underlying data while still preserving a privacy guarantee. This project consists of two parts. The first component seeks to understand the state-of-the-art for validating synthetic data, with an emphasis on covariate relationships between variables. The result will be the development of a multi-faceted evaluation framework that consolidates the research and provides guidance for appropriate validation methods given the end use case, AI approach, data types, and other context such as value-utility needs. Secondly, we hope to validate the framework with a test synthetic data set. The data set is open to discussion with the team, but our intention is to provide ground truth data to the team that is aligned to LMI support for an actual government agency. The team will use this data to generate (using whatever generation method they prefer) a synthetic data set that will be measured against the original according to the guiding framework developed in the first part of the project.
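
One simple facet such an evaluation framework might include is a comparison of pairwise covariate relationships between the real and synthetic tables, as in the hypothetical sketch below; a full framework would add per-variable distribution tests, utility metrics for the downstream AI task, and privacy measures.

    import numpy as np
    import pandas as pd

    def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
        """Mean absolute difference between pairwise correlation matrices,
        one simple facet of a covariate-relationship validation framework."""
        cols = real.columns
        diff = real[cols].corr() - synthetic[cols].corr()
        return diff.abs().values[np.triu_indices(len(cols), k=1)].mean()

    rng = np.random.default_rng(0)
    real = pd.DataFrame(rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], 1000),
                        columns=["x", "y"])
    synth = real + rng.normal(0, 0.5, real.shape)   # stand-in "synthetic" data
    print(correlation_gap(real, synth))

A smaller gap indicates the synthetic data better preserves the covariate structure, which could then be weighed against whatever privacy guarantee the generation method provides.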

Restrictions: Work must be performed in the U.S. by U.S. residents.

Q: Is there a possibility to use DoN weapon system data? I currently work for the DoN and would love the opportunity to work on something that would be directly applicable to what I do. I also understand though that it may be difficult due to security classifications. Thanks!

A: It's a possibility, but you are correct that its security classification could limit what we might be able to use. The value proposition of this project, however, is essentially domain agnostic, since the data type and format we seek to synthesize and validate could just as easily be related to population health as to weapon systems.

Automating Test and Evaluation Methods for AI/ML Models

Sponsor: LMI 2

Contact: Michael Lujan <mlujan@lmi.org>; Joseph Ritzko <jritzko@lmi.org>

Abstract: Government and businesses are relying on AI/ML more every day; however, as we become better enabled by and more reliant on AI/ML capabilities, we must be aware of and prepared to mitigate unintended effects at all stages of the lifecycle of developing and applying AI/ML. In order to mitigate these risks, a rigorous test and evaluation (T&E) process needs to be in place, and ideally automated, so that AI/ML models can be quickly validated against “ground-truth” data. One of the first steps in validating a model is identifying the correct test(s) that a model should go through. This is usually done via a manual process, which becomes cumbersome and inefficient when the number of models and data sets becomes large. For example, in a current Defense Advanced Research Projects Agency (DARPA) project there are 20+ AI/ML models using various training data and producing output with differing structures (e.g. point estimates, distributions, time series, etc.). Each of these AI/ML models goes through a tedious and manual process in order to determine what evaluation method should be used and what validation data is necessary. When performed manually, this takes valuable time, burns limited funds, and impacts DARPA’s ability to rapidly deliver solutions and insights that enable the U.S. military. In order to overcome the challenges associated with manual T&E, we’re pursuing the development of an algorithm that can automate the identification of testing methods and validation data. The desired output is an algorithm which takes as input the description of the model along with the data ontology and schema, and which can then produce an output such as: "use test X and data set Z" for evaluation. As a use case, we propose that students utilize tutorials that have already-developed models and validation data (see https://scikit-learn.org/stable/tutorial/index.html as an example).
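
At its simplest, the desired algorithm maps a model's declared output structure to candidate evaluation methods and validation data, as in the toy dispatch sketch below; the method names and the model_card fields are illustrative placeholders, not a specification from DARPA or LMI.

    # A toy dispatch sketch: map a declared model-output type to candidate
    # evaluation methods. A real version would parse the model description
    # and data ontology/schema instead of a hand-written dictionary.
    EVALUATION_METHODS = {
        "point_estimate": ["MAE", "RMSE"],
        "distribution": ["KL divergence", "CRPS"],
        "time_series": ["MASE", "rolling-origin backtest"],
        "classification": ["accuracy", "F1", "ROC AUC"],
    }

    def select_tests(model_card: dict) -> dict:
        """Return candidate tests and validation data for one model description."""
        output_type = model_card["output_type"]
        return {"tests": EVALUATION_METHODS[output_type],
                "validation_data": model_card["ground_truth_source"]}

    print(select_tests({"output_type": "time_series",
                        "ground_truth_source": "dataset_Z"}))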

Restrictions: Work must be performed in the U.S. by U.S. residents.

Automated Ontology Development for Advanced Analytics

Sponsor: LMI 3

Contact: Michael Anderson <manderson@lmi.org>; John Allison <jallison@lmi.org>

Abstract: Back in 2017 the FCC proposed the now infamous “net neutrality” rulemaking for public comment. In the months that followed nearly 22 million comments were submitted – many of which were either bot generated or submitted hundreds of thousands of times. The abuse was so large and disruptive that the U.S. Senate eventually launched an investigation into the matter. Although most rulemaking proposed by the federal government doesn’t attract nearly the same volume of comments, online comment submission is forcing the federal government to find more automated solutions in order to review these comments. In order to facilitate the review process and relieve pressure created by this influx of comments, LMI is seeking to create a text-matching tool that can pair a public comment with any associated content in the rule being commented on. This project will utilize existing text-matching methods and focus on automated ontology generation - creating domain-specific dictionaries that will be used to boost the accuracy and repeatability of text-matching algorithms. As a use case, each year in accordance with the 2002 E-Government Act, the Centers for Medicare and Medicaid Services (CMS) submits proposed rule changes for public comment on Regulations.gov. Labeled datasets of rules and several thousand comment submissions from the 2018 and 2019 CMS rule change cycles will be provided, as well as sample health-care ontologies.
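
A baseline text-matching sketch, pairing a comment with rule sections by TF-IDF cosine similarity; the rule text and comment below are made up, and an ontology-driven version would expand domain terms (e.g. synonyms from the provided health-care ontologies) before vectorizing.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Made-up rule sections and comment text for illustration only.
    rule_sections = [
        "Payment rates for outpatient services ...",
        "Quality reporting program requirements ...",
    ]
    comment = "I object to the proposed changes to outpatient payment rates."

    # Fitting on the rule text keeps the vocabulary domain-specific; an ontology
    # could later expand terms with synonyms before vectorizing.
    vec = TfidfVectorizer(stop_words="english")
    section_matrix = vec.fit_transform(rule_sections)
    scores = cosine_similarity(vec.transform([comment]), section_matrix)[0]
    best = scores.argmax()
    print(best, scores[best])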

Restrictions: Work must be performed in the U.S. by U.S. residents.

Explore OpenStreetMap Data to Improve the Map & Grow the Community

Sponsor: OpenStreetMap

Contact: Maggie Cawley <maggie@openstreetmap.us>, Jess Beutler <jess@openstreetmap.us>

Abstract: OpenStreetMap is a collaborative project to create a free editable geographic database of the world. As an open, community-produced data project it provides map data for thousands of web sites, mobile apps, and hardware devices. Students have the option to complete any of the following projects to enhance this dataset and support the community of contributors. Goals include:
1. Identify stale / out of date areas of the map in the US: Which localities have stale data, based on the historic growth of data in the locality or localities similar to it
2. Identify under-mapped areas in the US: Which localities lack detail based on the historic growth of data in the locality or localities similar to it
3. Identify what and where US mappers are mapping: What are people mapping across the US and where are they mapping? Are there any patterns that emerge to understand what (and where) mappers are motivated to map?

Restrictions: none

Q: Is this a more data analytics project or data science? Will we be using Machine learning?
A: The method of the analysis is open. Machine learning could be used on satellite imagery and compared to OSM data, for example.

Q: Will you require this project to make recommendations to increase more accurate crowd-sourced contributions to the data?
A: Not a requirement. We're interested in insights into the current state of the data. Suggesting how to fix the gaps is a possible next step that could be approached or not, depending on the interests of the students.

Unsupervised Model Optimization for eCommerce Product Recommendations

Sponsor: Skafos, LLC

Contact: Tyler Hutcherson <tyler@skafos.ai>; Chris Patrick <chris@skafos.ai>

Abstract: Skafos is a Charlottesville-based startup focused on eCommerce. Our “Product Discovery” app assists online shoppers in the search for relevant products. Unlike traditional recommendation engines, the app is interactive & conversational, allowing shoppers to indicate their explicit intent (i.e. upvote/downvote), rather than mining personal data or historical transactions. Students will use Skafos analytics data, collected from production app usage, to help improve the quality of recommendations provided to shoppers:
1. Develop an approach to quantify the quality of product recommendations produced by our search engine algorithm.
2. Develop an approach to customize the blend of many KNN-style models.
3. Develop an approach to customize product interaction weights at the shop level, or even the user level, if possible.
4. Build collaborative filtering models with interaction streams to feed the search engine algorithm.
5. Identify and synthetically generate additional data required to accomplish the above tasks.
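
For task 1, one simple starting point is to score recommendations against shoppers' explicit upvotes, e.g. precision@k as sketched below with made-up SKU identifiers; a production definition of "quality" would likely also weigh downvotes, ranking position, and conversion.

    def precision_at_k(recommended, upvoted, k=10):
        """Fraction of the top-k recommended products the shopper upvoted."""
        top_k = recommended[:k]
        return sum(1 for item in top_k if item in upvoted) / max(len(top_k), 1)

    # Hypothetical session: the engine's ranked list vs. explicit upvotes.
    recommended = ["sku_12", "sku_7", "sku_33", "sku_2", "sku_90"]
    upvoted = {"sku_7", "sku_2"}
    print(precision_at_k(recommended, upvoted, k=5))   # 0.4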

Restrictions: NDA, IP release

Q: "Can you describe the size of the dataset students will be working with? How many users/transactions/etc. per month?
Do Shopify merchants provide any data that can or will be utilized (in addition to user data generated by Skafos)?
Will the project focus on specific products (e.g. clothing, furniture, etc.)? Can you discuss the level of specification in algorithm you would expect necessary for different products from the work you’ve done to date?"

A: [answered in live session; please see recording ~22-25 minute mark, perhaps?]

Q: What programming language(s) is/are your existing models and work written in? What is expected for this project?

A: [answered in live session; please see recording ~22-25 minute mark, perhaps?]

Visual neuroscience single-cell stimulus representation

Sponsor: Stanford Neurobiology

Contact: Carl Wienecke <wienecke@stanford.edu>

Abstract: Visual neuroscience studies the relationship between visual stimulus and response. Defining this relationship advances our understanding of how neurons represent sensory information. The experimental tractability of vision coupled with the genetic accessibility and molecular tools available in Drosophila permit a deep exploration of the physical and algorithmic bases of stimulus selectivity. Driven by advances in in vivo imaging and genetic manipulation, the Drosophila visual-processing field has progressed rapidly in recent years, characterizing the relevant cells and circuits with exquisite anatomical and functional detail. We will supply movies of the neural responses of a group of visual neurons in the Drosophila brain as they respond to various visual stimuli. We will also supply the visual stimuli eliciting these responses. Our goal is to understand the linear relationship between stimulus and response, using various techniques.
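
A toy sketch of estimating a linear stimulus-response (receptive-field-style) filter by ridge regression on lagged stimulus values; the simulated stimulus, filter shape, and lag count are placeholders, and the real analysis would operate on the supplied stimuli and calcium-response traces.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Simulated data: a 1-D stimulus time series and one neuron's response trace.
    T, lags = 5000, 30
    stimulus = rng.normal(size=T)
    true_filter = np.exp(-np.arange(lags) / 5.0)
    response = np.convolve(stimulus, true_filter, mode="full")[:T]
    response += rng.normal(0, 0.1, T)

    # Design matrix of lagged stimulus values (a linear receptive-field model).
    X = np.column_stack([np.roll(stimulus, lag) for lag in range(lags)])
    X[:lags] = 0.0                     # zero out wrapped-around samples
    model = Ridge(alpha=1.0).fit(X, response)
    estimated_filter = model.coef_     # compare with true_filter
    print(np.corrcoef(estimated_filter, true_filter)[0, 1])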

Restrictions: IP release required

Visual neuroscience single-cell identification

Sponsor: Stanford Neurobiology

Contact: Carl Wienecke <wienecke@stanford.edu>

Abstract: We will supply movies of the neural responses of a group of visual neurons in the Drosophila brain as they respond to various visual stimuli. Neural responses in our data take the form of changes in intracellular calcium concentration, which are in turn reported by a fluorescent indicator, GCaMP6f. We will also supply the visual stimuli eliciting these responses. Our goal is to identify individual neurons amid a population of neurons that occupies the imaging field of view. Students will identify individual neurons amid a dense network of neurons in the neural response movies. The quality of the source extraction can be assessed by comparing the response properties of the extracted regions of interest (ROIs) to those of ROIs extracted from existing movies where the anatomical segregation of individual neurons is clear (i.e., control movies whose individual neurons have been identified with high confidence because the source extraction is less challenging).

Restrictions: IP release required

Measuring the Impact and Diffusion of Open Source Software Innovation Using Network Analysis

Sponsor: UVA Biocomplexity Institute

Contact: Gizem Korkmaz <gkorkmaz@virginia.edu>; Brandon Kramer <kb7hp@virginia.edu>

Abstract: Open Source Software (OSS) is computer software with its source code shared under a license in which the copyright holder provides the rights to study, change, and distribute the software to anyone and for any purpose. Examples include the Linux operating system, Apache server software, and the R statistical programming language. Despite its extensive use, reliable measures of the scope and impact of OSS are scarce. The creation and use of OSS highlight an aspect of technology diffusion and flow that is not captured in science and technology indicators. Supported by the National Science Foundation (NSF) and building on research conducted over the last couple of years, we aim to measure the production, impact, and diffusion of OSS in specific sectors, institutions, and geographic areas using data scraped from multiple hosting platforms (e.g., GitHub, GitLab, SourceForge). We will generate and analyze networks of contributors (through collaborations between software developers) and networks of OSS projects (through reuses across projects and shared contributors), and will identify key/influential players in this ecosystem.
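
A minimal networkx sketch of the contributor-network idea: build a bipartite contributor-project graph from (contributor, project) pairs, project it onto contributors, and rank people by a centrality measure. The edge list here is made up; the real analysis would use the scraped platform data.

    import networkx as nx

    # Toy bipartite edges: (contributor, project) pairs scraped from a platform.
    edges = [("alice", "projA"), ("bob", "projA"), ("bob", "projB"),
             ("carol", "projB"), ("carol", "projC")]

    B = nx.Graph()
    B.add_nodes_from({c for c, _ in edges}, bipartite="contributor")
    B.add_nodes_from({p for _, p in edges}, bipartite="project")
    B.add_edges_from(edges)

    # Project onto contributors: an edge means two people share a project,
    # weighted by the number of shared projects.
    contributors = {c for c, _ in edges}
    G = nx.algorithms.bipartite.weighted_projected_graph(B, contributors)
    print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]))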

Restrictions: none

Q: What is the preferred language (R, Python, something else?)

A: We use R and Python for data analysis, as well as Julia and Python for scraping data from GitHub. In the project prospective students would be working on, we will be using R to classify users into economic sectors and/or countries and then using Python or R depending on what aspect of the network analysis we are conducting. All of the data is stored in a PostgreSQL database, so some familiarity with SQL would also be beneficial.

Deep learning and protein structure: Can the ‘Urfold’ model detect domain swapping-like phenomena?

Sponsor: UVA BME 1

Contact: Cameron Mura <cmura@virginia.edu>

Abstract: In this Capstone project, we seek to further develop a Deep Learning (DL) approach that we have created to both ‘define’ the Urfold (as a bona fide protein structural entity) and detect instances of urfolds (in protein structure space). Specifically, we propose to examine the phenomenon of “3D domain swapping” (whereby structural elements ‘swap’ between two copies of a protein) as it relates to the Urfold, as we have reason to believe swapping can be detected by the DL framework we have been developing (dubbed ‘deepUrfold’). A core question is whether deepUrfold can distinguish between CATH superfamilies that are enriched in domain-swapped structures and those that are not. Examining this and related questions will ‘stress-test’ the deepUrfold approach, giving us a sense of its domain of applicability. Ultimately, the work may also (i) uncover subtle principles of protein structure (related to the “biophysical/structural integrity” of domains), and (ii) help assess the utility of deepUrfold as a systematic, automated approach for detecting anomalous structural features (such as domain swapping).

Restrictions: none

Machine Learning & Structural Bioinformatics to Assess the Likely Impact of COVID Variants

Sponsor: UVA BME 2

Contact: Cameron Mura <cmura@virginia.edu>

Abstract: This Capstone project will attempt to decipher statistical correlates between these two levels: (i) the molecular-scale data that are (and that continue to become) available, and (ii) the population-wide characteristics of viral pathogenesis. We will do this by first using Deep Learning-based methods--including those being actively developed in our lab (for predicting protein-protein interactions)--to map out the complete “interactomes” (protein-{protein, DNA, RNA}) of the 29 viral proteins. Importantly, we will do this not in a purely static (single-snapshot) manner, but rather with temporal resolution (on a timescale of months of viral progression, worldwide). This will be achieved by considering (i) the many known SARS-CoV-2 mutations that have arisen (via neutral drift, selection pressure, and so on), (ii) their likely structural and biophysical impacts (which we can model using approaches from computational molecular biophysics), and (iii) the associated phenotypic properties (apparent transmissibility, prevalence among populations, severity of illnesses [from real-world evidence], any comorbidities or other statistical correlates, and so on).

Restrictions: none

Primary Care Patient Analysis at UVA Health: From descriptive to predictive analytics

Sponsor: UVA Health

Contact: Christian Wernz <cwernz@virginia.edu>

Abstract: The goal of this project is to improve resource allocation and population health with a focus on equity and inclusion across different race, ethnicity, gender, and age groups. The capstone team will develop managerial and medical insights and prediction capabilities for primary care patients at UVA Health by analyzing 2-4 million patient visit records from 25K patients. Python is preferred for data cleaning, analysis, and machine learning. Students may also explore the use of Tableau for data visualization, geo-tagging, and population health comparison. Students will begin by defining the problem and submitting a protocol to the Institutional Review Board for approval.

Restrictions: IP release; Data use agreement

Q: For the Primary Care Patient Analysis , what are some examples of the resources that need to be better allocated?

A: Resources include number and qualification of physicians, nurses, admins and other support staff for the clinic. In addition, this could include office space, medical equipment etc. Further, it will be part of the group's effort to determine where opportunities lie. So, you can and will propose ideas based on your data findings.

Q: Is there a preference of R or python for this project?

A: Python would be preferred, but R is possible too.

An Expert-Sourced Measure of Judicial Ideology

Sponsor: UVA Law

Contact: Kevin Cope <kcope@law.virginia.edu>

Abstract: How can we measure the judicial ideology of judges? This dataset will comprise the first ideology measure covering every non-Supreme-Court Article III judge on a single scale. The dataset will comprise dynamic, interval-level, and potentially multi-dimensional data on every federal district and appellate judge serving since 1985. The measure will be derived from many thousands of qualitative evaluations by a representative sample of legal experts familiar with those judges’ approaches to judging. The data project involves developing a method for extracting hierarchical ideology n-grams (as well as biographical info and other data) from 35 years of natural language contained in volumes of the Almanac of the Federal Judiciary. This process will involve writing programs (likely in R or Python) to: (1) recognize and parse the relevant hierarchical terms; (2) using a dictionary method, convert the terms into quantitative data; (3) measure the validity of the quantified data by testing it on a set of case outcomes; and (4) convert the corpus into quantitative data via machine learning. The data exist in corpus (prose) form in a series of approximately 100 PDFs. The corpus comprises approximately 11 million words.
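
Step (2), the dictionary method, might look roughly like the sketch below: score each qualitative evaluation by the signed weights of lexicon terms it contains. The lexicon shown is a made-up placeholder, not the Almanac-derived dictionary the project would actually build.

    import re

    # Hypothetical ideology lexicon; the real one would come from the Almanac's
    # hierarchical evaluation terms.
    LEXICON = {"conservative": +1, "strict": +1, "liberal": -1, "lenient": -1}

    def ideology_score(evaluation_text: str) -> float:
        """Average signed weight of dictionary terms found in one evaluation."""
        tokens = re.findall(r"[a-z]+", evaluation_text.lower())
        hits = [LEXICON[t] for t in tokens if t in LEXICON]
        return sum(hits) / len(hits) if hits else 0.0

    print(ideology_score("Attorneys describe the judge as strict but fair."))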

Restrictions: IP release (open access)

Geospatial analysis of counterinsurgency warfare

Sponsor: UVA MSDS

Contact: Michael L. Davies <mld9s@virginia.edu>

Abstract: Each year, the Armed Conflict Location & Event Data Project identifies 10 conflicts or crisis situations around the world that are likely to worsen or evolve in the coming months. That analysis is descriptive rather than predictive, and it does not explore more complex methods or questions, such as the level or type of influence insurgent violence has on a conflict via conflict diffusion, predictive modeling of which actors control territory, modeling the likelihood of changes in territorial control, trends in the latent underlying character of the conflict, or predicting the likelihood or nature of threats to friendly forces. As such, this project aims to build on existing methods for evaluating one or a combination of the following questions for one or more ongoing insurgencies or civil wars:

  • Conflict diffusion (~contagion effect) – assess the spatial-temporal impact of specific event types (Reference) (Reference)
  • Measuring territorial control – leveraging features such as frequency of conflict events, types of conflict events, and number of belligerents to create a predictive model of territorial control (Reference)
  • Text analysis – topic analysis to predict likelihood of changes in territorial control (building on my project for ETA)
  • Text analysis – predicting the probability of a threat to friendly forces leveraging social media sentiments and topics
  • Change point detection – modeling change points, or abrupt variations in time series data that may represent transitions between different states of the conflict (see the sketch after this list)
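
For the change-point option, the ruptures package (an assumed, commonly used Python library) offers a compact starting point; the toy signal below stands in for a weekly conflict-event count series.

    import numpy as np
    import ruptures as rpt   # assumption: the `ruptures` change-point library

    # Toy weekly event-count series with a shift in level midway through.
    rng = np.random.default_rng(0)
    signal = np.concatenate([rng.poisson(5, 100), rng.poisson(15, 100)]).astype(float)

    algo = rpt.Pelt(model="rbf").fit(signal)
    breakpoints = algo.predict(pen=10)   # indices where the regime appears to change
    print(breakpoints)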

Restrictions: none
