MSIS Capstone Projects

 


Master of Science capstone projects in the College of Information Science provide an opportunity for students to showcase what they have mastered in the program.

The capstone project is based on a project plan that includes project goals, master's competencies addressed by the project, system design, implementation schedule, assessment plan, and milestones. The project contributes to the development and reinforcement of the student's knowledge and skill sets in the fields of data science and information science.

The capstone project must exercise all competencies required for the master's degree and must also have a software development component.

Recent Capstone Projects

View recent College of Information Science master's capstone projects completed by students and student teams (descriptions are provided by students):

PROJECT TITLE & DESCRIPTION
FACULTY ADVISOR
NumEval: Numeral-Aware Language Understanding and Generation
In previous SemEval competitions, most tasks have focused on analyzing the words in a text, with scant consideration given to numerical data. But comprehension of numerical values can significantly enhance performance in certain tasks, as numbers carry important information in text, which matters especially to me as a working CPA. Numeracy seems to be one of the recent hot topics in Natural Language Processing, and quantitative understanding in NLP is new to me. SemEval 2024 offers NumEval as one of its tasks. NumEval consists of 3 tasks, which in turn consist of various subtasks. I am particularly interested in Task 1, which is further divided into 3 subtasks: Quantitative Prediction (QP), Quantitative Natural Language Inference (QNLI), and Quantitative Question Answering (QQA). QP is the task of predicting the correct magnitude of a masked numeral, QNLI is the task of making natural language inferences based on quantitative clues, and QQA is another format for testing whether models can understand numerals and semantics.
Steven Bethard
Visual Word Sense Disambiguation
The goal of this project is to train a machine learning model to match words to images based on their semantic meaning within the context of a phrase or sentence. Natural Language Processing models often struggle with identical words that have different meanings depending on the context. In this project we aim to show that the model can identify the correct semantic meaning of ambiguous word tokens by having it choose the correct image representation.
Steven Bethard

Plant Based Predictions
Prompted by the climate crisis and ethical concerns of animal-based diets, recent trends demonstrate a strong push for plant-based dietary options. This project will act as market research while uncovering potential gaps in product availability meeting consumer expectations, company profitability, or sustainability goals.

GitHub Repository

Sarah Bratt
Snorpheus
Snoring is a symptom of Sleep Related Breathing Disorder (SRBD). 57% of men and 40% of women in the US snore. If left untreated, it can lead to Obstructive Sleep Apnea (OSA); one in five adults in the US suffers from some form of OSA. Untreated OSA can result in a number of health problems, including hypertension, stroke, arrhythmias, and cardiomyopathy. For mild to moderate OSA, the use of oral appliances designed by qualified dentists is considered the first line of therapy. Depending on the severity of snoring related to body position (supine or on the side), positional therapy in combination with an oral appliance can be a useful treatment to help improve quality of life measures. Identifying and documenting snoring events is challenging for patients, especially those with no bed partner. Associating snoring events with the patient's body position over the duration of the sleep period can provide valuable information for recommending appropriate therapy. There is currently no solution that allows this combination of positional and audio data to be easily collected, and there is no software that allows this highly related data to be visualized together.
Winslow Burleson

Brawl the Shrimp
My goal for this project is to explore how people's music choices on Spotify relate to their political affiliations. From the data, I want to create a model that predicts political leanings based on a user's top genres, artists, history, and playlists. Through this project, I hope to understand more about the connections between personal music taste and political beliefs, which will also provide insight into societal dynamics. The project also aims to challenge or confirm cultural stereotypes. The overall goal is to understand how individuals' data, such as music preferences, affects their identities.

GitHub Repository  |  Website

Greg Chism

Terrorism Observability
The Global Terrorism Database (GTD) is an open-source database containing information on over 200,000 terrorist events from 1970-2022 (and beyond). For each GTD incident, a multitude of information is available. My goal is to build an end-to-end data solution that processes and cleans the GTD dataset, stores it in an analytics-ready SQL database, performs analytics on top of the database, and then presents key metrics mined from the dataset in an observability dashboard. The dashboard will aspire to be a centralized place for hypothetical corporate, NGO, and private sector decision makers to assess terrorism risk across the globe and across time. I want to take the GTD dataset, transform it, mine it, possibly do some predictive analytics, and then present it in a digestible way outside of the cumbersome spreadsheet they provide.

GitHub Repository

Greg Chism

The Crossroads of Risk: A Comprehensive Analysis of Traffic Accidents
This project aims to delve into the critical issue of road safety by analyzing traffic accident data. The goal is to uncover patterns and factors contributing to accidents, which is of paramount importance for enhancing traffic safety measures, informing policy decisions, and ultimately saving lives. By understanding the circumstances leading to accidents, we can better predict and prevent future incidents, making roads safer for everyone.

GitHub Repository

Hong Cui
Leveraging Generative AI for Business Applications
The project aims to explore the diverse applications of generative AI within business contexts. It will address the growing relevance of AI-generated content, its impact on business operations, marketing strategies, and customer engagement. Additionally, the project seeks to tackle challenges such as ethical considerations, data privacy, and the potential biases inherent in AI-generated content.
Hong Cui
Analyzing Trader Sentiment through Cloud-Based Data Engineering
This capstone project will focus on the development of a data pipeline hosted on the AWS cloud platform to analyze trader sentiment based on notes shared on a note-taking platform. The primary objective is to empower traders to make rational decisions by presenting historical data insights on the project's frontend.
Hong Cui
Mini Twitter - Microblogging Service
This project creates a platform on which users can register an account, publish posts, and interact with other users' posts. Registered users can comment on, create, modify, and delete posts.
Hong Cui
Using Machine Learning to De-duplicate Massive Material Sample Records
SESAR is a repository hosting millions of earth science samples, like rocks, minerals, water samples from the ocean, etc. SESAR has a challenging task: Use metadata of a given sample to identify duplicates or suggest to a user that there may be a duplicate in a database. Many samples in data systems for legacy data will never have unique identifiers, so we need to use sample metadata, sample name, and related information such as authors or keywords of papers that have data for the sample to decide or at least suggest that it is the same sample.
Hong Cui
iVoices
iVoices allows students to create stories about technology. We process these stories through research and data analysis and provide the resulting information and data to other researchers or interested groups through digital or paper magazines. This information helps us better understand how technology is changing our society and the social problems that arise, and it provides research results that can inform future development.
Diana Daly

PROJECT Apollo
We are performing a thorough study of the field of Astronomy and Astrophysics using topic modeling and network analysis. This is a novel project; surprisingly, there has not been a topic modeling and network study of Astronomy before. The project aims to reveal global research trends over time. This is an ongoing, long-term project that is about 40% complete. The research paper write-up has not started yet, but it should begin sometime in the Spring term. By the end of Fall 2023, we had successfully implemented an STM model for topic modeling, weak supervision with the Snorkel framework (single-label classifier), network analysis of the citation network, and the Citation Disruption Index. For the Spring 2024 term, we plan to complete the tasks listed in the proposed methods.

GitHub Repository

Charles Gomez

Identifying Leaf Phenology Patterns
The main objective of this project is to identify leaf phenology patterns of deciduous broadleaf forests and to predict the species as well as the different phenophases and their regions of interest. By observing and analyzing past PhenoCam images, I propose a method that uses an unsupervised machine learning algorithm to build a model of how the changes observed in these images can help predict future PhenoCam images. Previously, Professor Heidorn and other students worked on Color Cluster Analysis (clustering). I aim to build on that research by identifying these pattern changes in the images and working on image segmentation as well as feature extraction.

GitHub Repository

Bryan Heidorn
Analyzing Federal Grant Programs: Insights from BERT
Federal research grant programs have undergone significant changes in their funding distributions across different topics in the past quarter-century. It's essential to understand these shifts, anticipate future topics of interest, and evaluate the effectiveness of these programs. This project expands on previous work that delved into National Science Foundation's data, broadening the horizon to encompass data from other influential agencies such as NASA and the Department of Energy.
Bryan Heidorn

Topic Analysis (NLP)
This project aims to look at how grants are awarded to drone projects. As drones become more important, it's crucial to understand where the money goes. This research helps teachers, leaders, and people who invest in projects see how money is used. It also helps find important trends that can help the robotics industry grow.

GitHub Repository

Bryan Heidorn
Pima Animal Care Center - Animal Database and Adopt/Foster Look Up
I have fostered a dog from Pima Animal Care Center (PACC) by looking through their online database of over 500 dogs, cats, and other small animals. The online database is not user friendly and is not easy to navigate, especially if you are looking for specific animal traits and characteristics. I am proposing to create a new website/database that stores the animal information (such as breed, weight, gender, crate trained, house broken, etc.) so prospective foster/adopt parents can find the animal they are looking for more efficiently. I would incorporate the skills I've learned in creating web pages (INFO 515 and 578), information presentation (INFO 578), and more with this project. The problem it tackles goes beyond the capstone project: it also helps the community foster/adopt animals in the shelter. It will allow users easy access to animal profiles and potentially help fostering/adopting become more efficient.
Bryan Heidorn
Analysis and Financial Modeling of Federal Grant Programs
This capstone project seeks to unravel the intricacies of scientific research funding over the past 25 years. With technological advancements and evolving research priorities shaping the landscape, understanding federal funding patterns is crucial. The project aims to address key questions, including shifts in funding over time, the distribution of funds across diverse research topics, and predictions for future funding trends. Through advanced modeling techniques such as Doc2Vec, BERT, and correlated topic models, the project intends to offer a nuanced analysis beyond traditional methods like Latent Dirichlet Allocation. Additionally, by leveraging supercomputing networks, the project optimizes scalability, enabling the exploration of larger datasets for a more comprehensive understanding. The significance of this endeavor lies in informing policymakers, researchers, and funding agencies, facilitating data-driven decisions, aligning research efforts with current priorities, and optimizing resource allocation for future innovation in scientific research.
Bryan Heidorn
Analyzing Airline Reviews for Customer Feedback
Skytrax Airline Quality (airlinequality.com) is a platform where customers can share their experiences with airlines they have used for travel. Airlines can use this information to determine how they can improve the experiences and services they offer. Since reviews are constantly added, it can take time to go through each review manually and understand the overall issues. Instead, we can use Sentiment Analysis and Topic Modeling to build an automated system that generates charts and other important information, giving airlines better insights and supporting more precise decisions about improving their services.
Xuan Lu
Maternal Higher Education and Child Nutrition Status: Case of Uzbekistan
As a country that has historically been closed off, Uzbekistan has lacked extensive data collection by international organizations such as the World Bank and UN clusters. This scarcity of data has limited the understanding of various social and health dynamics within the country. The recent openness of Uzbekistan to the world and the subsequent publication of new datasets, such as the Multiple Indicator Cluster Surveys (MICS) by UNICEF, present a unique opportunity. This research is among the first to utilize these new and comprehensive data sources to explore critical social issues in Uzbekistan. Moreover, understanding the impact of maternal education in a patriarchal society like Uzbekistan, where sons often receive preferential treatment in families, is crucial. This research can shed light on the broader social implications of gender and education on child health. By investigating the hypothesis that higher-educated mothers are more likely to provide healthy nutrition to their children, this research could support and inform government efforts to reduce gender disparities and improve child health outcomes. This is particularly relevant as the government is now actively working to bridge gender gaps influenced by cultural traits and systems.
Xuan Lu
Sentiment Analysis on Social Media Comments: Evaluating Public Perception for Reforms in Uzbekistan
Over the last 7 years, Uzbekistan has undergone significant reforms in areas such as governance, economics, and social policy. While these reforms are critical at the policy level, their success is equally measured by the acceptance and perception of the general public. This project aims to analyze public sentiment on these reforms using comments extracted from social media platforms. This analysis would help identify the areas of reform that are well received and those that need more public engagement. My primary research question is: "How does the online community perceive the reforms implemented in Uzbekistan?"
Xuan Lu
Spatial Bayesian Network for the Timely Identification of Contamination Events in Water Distribution Networks
Approximately 18% of outbreaks that occurred in the European region between 2000 and 2010 were associated with water. However, the real burden of waterborne diseases is unknown given a lack of proper surveillance protocols, as well as limited laboratory capacity. Thus, the World Health Organization (2019) has encouraged the strengthening of surveillance systems around Acute Gastrointestinal Illness (AGI) to better identify ongoing waterborne outbreaks. More specific to urban infrastructure, after 9/11, the deliberate introduction of harmful substances into Water Distribution Systems (WDS) became a threat given the potential for severe public health consequences. More recently, these concerns have focused on unintentional events resulting from pathogenic, chemical, or microbial agents introduced into the network due to cross-contamination with non-potable water. The Surveillance Response Systems (SRS) have relied mainly on online water quality monitoring, measuring surrogate parameters that indicate abnormal water quality. However, given this specificity, some contaminants may go undetected, limiting the detection capabilities of the framework. Thus, my project proposes a framework to integrate multiple data streams that may indicate an AGI, including reports from public health, customer complaints from water utilities, and the status of the system (failures and reported maintenance). These additional streams will supplement online water quality measurements to enhance the detection capabilities of SRS.
Clayton Morrison
Deep Learning for Closed-Loop Communication Detection
Good teamwork leads to a high level of productivity and job satisfaction. Effective communication among team members is crucial in facilitating cooperation, trust, and efficient problem-solving. One key aspect of effective team communication is closed-loop communication (CLC), which has been proposed in the literature as a coordinating mechanism for effective teamwork. CLC is a feedback process in which the receiver of a message sends a response or confirmation back to the sender. CLC has three components: call-out, check-back, and closing of the loop. The feedback process ensures that messages are accurately transmitted and understood and has been demonstrated to improve team efficiency in various domains. However, most existing research on CLC is conducted post-hoc, for example, by watching videos of sessions after they occur and recording only the parts that researchers are interested in (such as CLC categories and task completion time). There is a need for an automated method for detecting CLC. With the use of automated detection, real-time monitoring of communication can be achieved, allowing for immediate feedback and quick adjustments to be made and largely improving team communication.
Adarsh Pyarelal
Rule-based Detection of Closed-Loop Communication in Multi-Party Communication
Closed-loop communication (CLC) is often recommended in the team research literature as a communication behavior that can guarantee the accuracy of information exchange. Currently, CLC in spoken dialogue is identified via retrospective analyses involving manual transcription and annotation. Moreover, most real-time dialogue systems are limited to conversing with a single human at a time. On the other hand, there are numerous analyses of multi-participant spoken dialogue in the academic literature; however, these are primarily performed offline rather than in real time, and the communicative events in their multi-party conversations are manually coded rather than automatically extracted using information extraction (IE) methods. To address this limitation, I propose to develop a separate downstream CLC detection component that utilizes the outputs of the existing dialog agent, but also reasons about context and state more deeply.
Adarsh Pyarelal

Escherichia coli Predictions in the Upper Santa Cruz River
This project will continue work done to predict Escherichia coli (E. coli) levels within the Upper Santa Cruz River. E. coli levels within water may directly impact human and animal health and are important indicators of overall water quality. While there has been extensive research on E. coli prediction in aquatic settings, this unique stretch of river requires acute modelling to increase predictive accuracy. By finding an appropriate model and means of communicating this information to the public, the community surrounding the upper Santa Cruz River may be both safer and better informed of the overall health of the ecosystem.

GitHub Repository

Cristian Román-Palacios

Machine Learning for Phylogenetic Reconstructions
This project integrates machine learning with evolutionary biology to help in phylogenetic reconstruction. This project is worth doing, because AI can help with tree search while maintaining accurate results.

GitHub Repository

Cristian Román-Palacios
Analysis of Online Course Performance and Activity Metrics
In 2020, a worldwide pandemic broke out. COVID-19 caused mass quarantines across the world, including in the United States. Without the ability to meet in person, schools looked to online course structures and platforms to host their curriculum. At many universities, including the University of Arizona, the platform Desire2Learn, better known as D2L, is used as a central location for students to access, interact with, and submit class content. D2L also reports automated activity metrics to course instructors so they can track the progress of not only the class, but also individual students. I believe these activity metrics, some tracked across time, might be worthwhile indicators for estimating the level of effort and presumed subsequent performance of individual students. Other uses of the analytics could pertain to the effectiveness of different teaching styles and content organization. If the analysis provides conclusive or suggestive results, this could spark more interest in the data and give educators a reliable tool for tracking student progress.
Cristian Román-Palacios
Identifying Environmentally Comparable Cities Across the Globe for Urban Evolutionary Ecology Research
The goal of the project is to define a quantitative framework for comparing urban areas based on climate- and human-related features. Defining these areas and their comparability is a key task in urban ecology, the study of ecosystems in and around humans and urbanizing landscapes. There have been efforts in the field of urban geography to classify and compare cities but work from an urban ecology perspective is lacking. Previous urban ecology studies in this area have focused on using specific species of plant and animal life to compare responses to urbanization in different localities. This work will use human features such as city size, population, and infrastructure, as well as climatic features such as temperature, precipitation, and other geographic traits to provide a more general framework for comparing cities across different regions. The resulting framework will enable researchers in urban ecology to better understand the relationships between drivers of ecological and evolutionary patterns in urban areas.
Cristian Román-Palacios
The Color Palette of Neighborhoods
This project aims to extract and analyze the color palettes of building facades in US neighborhoods, addressing urban design, cultural, and environmental insights. 
Cristian Román-Palacios
Using Machine Learning Models to Support the Department's Student Admission Process and to Determine the most Effective Admissions Drivers
Machine learning algorithms have been used in the past on admissions data to make the admission process more efficient, and this has the potential to improve the selection process. Instead of replacing human decision makers, these algorithms can be used to assist them. Human oversight is crucial to address individual cases that may not fit the model. With new high-demand programs developing, such as Data Science, Machine Learning, and Artificial Intelligence, the School of Information has been witnessing an increased number of applications. This has greatly increased the time spent going through applicant information, and the fear is that the admissions staff could soon get overwhelmed by the number of applications to the graduate college, especially during peak intake seasons. This process could be automated using historical data by studying the department's decision-making process: evaluating previously admitted candidates and identifying the pivotal application metrics historically employed for admissions. Such data can further serve as a basis for predicting both the quantity and characteristics of prospective students likely to excel in the program, as well as estimating total enrollments per semester. By focusing a significant amount of attention on promising applications, this approach ultimately minimizes waste and enhances the efficiency of the selection process.
Cristian Román-Palacios
Water Demand Dashboard for the City of Flagstaff, Arizona
As a provider of municipal water, the City of Flagstaff, Arizona is governed by the Adequate Water Supply Program as defined by the Arizona Department of Water Resources. Part of this designation includes both Physical Water Availability and Continuous Water Availability. To demonstrate these criteria, the city needs to be able to predict future water use based on current consumption versus future anticipated growth, including current water use, committed water use, projected water use, and future needs. In early 2021, the City of Flagstaff received a pro bono publico demonstration dashboard from a consulting firm (EHS-Support). The dashboard attempted to summarize the past five years of water consumption for each water meter that the City of Flagstaff supplies. The dashboard then attempted to assign an average water use per meter and summarized the average by multiple geographic factors.
Cristian Román-Palacios
Effect of the Pandemic on Pollution Levels
I will look at multiple data sets covering several pollutant markers and then compare the pandemic-affected years with the years before and after lockdowns took place.
Cristian Román-Palacios
Applying Parallel Computing for Managing Customized Surveys Designed for Nested Mixed Method Study
The project is currently ongoing and is entitled "a nested mixed-methods approach to armed non-state actor governance and the rule of law" (PI: Javier Osorio). Recently, conflict scholars have advanced different theoretical frameworks for classifying insurgent and criminal governance structures. As there are no existing validated measures, the project relies on an online survey of hundreds of local experts. To conduct the online survey, the research will rely on the institutional license of the online survey system Qualtrics. However, as the number of local experts increases, the time required to process the data becomes substantial (up to 16 hours). The current capstone project proposes applying parallel computing to improve processing performance.
Cristian Román-Palacios

Systematic Review of Models Used for Clumped Isotope Thermometry
The project aims to contribute to the scientific understanding of this field by investigating and evaluating various regression models to identify the most robust model under different conditions. Identifying these effective models can lead to methodological advancements in Clumped Isotope Thermometry, and since we are using datasets with different levels of error, we can also understand any potential bias.

GitHub Repository

Cristian Román-Palacios
An Analysis of the 3D Shape of Cities
This project focuses on analyzing the 3D shape of cities worldwide using a dataset that compiles building height information globally. By incorporating shapefiles, such as the one defining city limits, the project aims to conduct a regulation analysis, exploring the interplay between building shapes and urban. This project aims to contribute to the ongoing discourse on sustainable urbanization by providing a detailed examination of the 3D shapes of cities. By utilizing a globally sourced dataset and established methodologies, our exploration is poised to offer valuable insights into the intricate dynamics between urban forms, regulations, and sustainable urban development. This research is valuable because it provides a more detailed understanding of cities beyond traditional 2D studies, offering insights into three-dimensional aspects like building shapes and relationships for improved urban planning.
Cristian Román-Palacios
The Effect of Type of Scientific Communication Method on Retaining Math Information
Colleges and high schools in the state want to know what factors are impacting graduation rates. I am going to explore graduation rates in public and private high schools in Virginia at the county and individual high school level. I will look at factors such as race, household income, number below the poverty line, size of school, student/teacher ratio, and location. I will explore these relationships using data visualizations as well as multiple linear regression, principal component analysis, and factor reduction analysis. I will also create visualizations to explain these trends to the wider public and find an accurate way to predict high school graduation rates.
Meaghan Wetherell
Environmental Racism: Superfund Sites and Native American Reservations
The global healthcare landscape is rapidly evolving, with a significant focus on home-based medical devices, particularly blood glucose monitors. In 2022, the market for these devices was valued at USD 12.5 billion, and it's projected to grow at an impressive 8.13% CAGR from 2023 to 2030. The primary drivers include the increasing incidence of diabetes and a growing aging population. The International Diabetes Federation warns that global diabetes cases are set to rise dramatically, from 537 million in 2021 to 643 million by 2030 and a staggering 783 million by 2045. Given this alarming trend, there's a need for a holistic application with visual capabilities to record data generated by home-based medical devices, especially for diabetes management. This project aims to provide a unified platform for individuals to collect, visualize, and manage their health data, ultimately improving the quality of life for millions affected by diabetes worldwide. This project goes beyond glucose tracking, enabling users to record their meals, mood, and personal notes in one place. It aims to simplify wellness management by offering an integrated solution, bridging the gap between data collection and actionable insights. Ultimately, our goal is to empower users to make informed decisions to improve their health.
 

Ready to transform your future in information science?

Learn more about the Master of Science in Information Science by contacting us at infosci-grad@arizona.edu, or review the admissions process and begin your application now:

Start Your Application