Open Datasets for the 2024 Hackathon
The Positive and Negative Affect Dataset [Mental Health]
-
The Positive and Negative Affect Dataset covers university students from Southern California and was collected continuously over an average of 7.8 months before, during, and after the COVID-19 lockdown. The study used smartwatches, rings, mobile apps, and personalized surveys to gather both objective and subjective data. The objective data includes metrics such as heart rate, sleep stages, and physical activity. The subjective data covers emotions, mental health, and responses to major events such as the COVID-19 lockdown, the Black Lives Matter movement, and the 2020 U.S. presidential election. The dataset allows researchers to study how young adults' emotions, activities, and lifestyles changed over time during these events, which makes it useful for understanding long-term mental and physical health trends.
-
-
The dataset is provided by the UCI Institute for Future Health.
Multimodal Ingredient Substitution Knowledge Graph [Food]
-
The Multimodal Ingredient Substitution Knowledge Graph (MISKG) is a dataset featuring over 80,000 ingredient substitution pairs. The growing emphasis on personalized dietary preferences, medical conditions, and ingredient availability has increased the need for food personalization systems. Ingredient substitution plays a central role in this process, allowing individuals to adapt recipes while preserving nutritional value and sensory qualities. The challenge lies in identifying substitutes that maintain flavor, texture, and nutritional content while addressing individual preferences and requirements. MISKG covers over 16,000 ingredients and incorporates semantic, nutritional, and flavor data to support personalized ingredient substitutions. The dataset is designed to accommodate ingredient availability and sensory preferences, and the knowledge graph supports both text- and image-based substitution queries. To address gaps in existing datasets, it includes visual representations and contextual ingredient relationships. By integrating semantic information from sources such as ConceptNet, Wikidata, Edamam, and FlavorDB, the dataset provides a resource for culinary research and recipe adaptation.
-
-
The dataset is provided by the AI Institute, University of South Carolina.
DOMINO dataset [Mental & Physical Health]
-
The DOMINO dataset includes data from 39 participants aged 18–65, each of whom took part for 28 days between September 2022 and June 2023. Every participant used a Samsung Watch and an Oura Ring to monitor cardiovascular health, sleep, and physical activity, alongside the AWARE app for passive smartphone data sensing. Daily ecological momentary assessments and weekly surveys, delivered via push notifications from a custom-designed mobile app, captured emotional states including loneliness, depression, and stress. Participants were recruited through purposive and snowball sampling. All participants met the inclusion criteria: fluency in English, a UCLA Loneliness Scale score ≥28, and ≤5 years of residence in Finland.
-
-
The dataset is provided by the Department of Nursing Science and Digital Health Technology Group, University of Turku.
FoodKG: A Semantics-Driven Knowledge Graph for Food Recommendation
-
FoodKG combines over 1 million recipes from Recipe1M with nutritional data from the USDA Nutrient Database, linking ingredients and nutritional content from multiple sources, including the Cook’s Thesaurus and the FoodOn ontology. Each assertion in FoodKG is traceable, with provenance and publication subgraphs linking back to the original sources. FoodKG also addresses common issues in food data (different units of measurement, ambiguous terminology, and non-standard expressions like “to taste”) to provide users with consistent and accurate information about recipes, their ingredients, and their nutritional content.
-
The FoodKG is available as a SPARQL endpoint at https://inciteprojects.idea.rpi.edu/foodkg
Information about how to query the FoodKG is available at https://foodkg.github.io/endpoint.html
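As a minimal sketch of how a SPARQL endpoint like this is typically queried, the Python below sends a GET request following the standard SPARQL protocol, using only the standard library. It assumes the URL listed above accepts protocol requests directly, which the listing does not confirm; the query documentation linked above is the authority on the exact endpoint path. The function names and the sample query are illustrative, and the query deliberately uses no FoodKG-specific vocabulary.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the URL from the listing accepts SPARQL protocol requests as-is.
# Check the FoodKG query documentation for the exact endpoint path.
FOODKG_ENDPOINT = "https://inciteprojects.idea.rpi.edu/foodkg"

# Generic probe query: returns any five triples, no FoodKG vocabulary assumed.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query, endpoint=FOODKG_ENDPOINT):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(query, endpoint=FOODKG_ENDPOINT, timeout=30):
    """Send the query and return the list of result bindings."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_query(SAMPLE_QUERY) would return a list of dictionaries, one per result row, if the endpoint responds in the standard SPARQL JSON results format.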
-
The FoodKG is provided by Rensselaer Polytechnic Institute.
NCCT ICH dataset [Medical Imaging]
-
Modality: non-contrast computed tomography (NCCT)
File format: NIfTI (fully deidentified)
Number of cases: 141
Disease: hemorrhagic stroke (the diagnosis in every case was confirmed by a neuroradiologist)
Age: 18+
Data collection window: from January 2024
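Since the cases are distributed as NIfTI files, a quick sanity check before modeling is to read each file's header and confirm the volume shape and voxel spacing. The sketch below parses the fixed 348-byte NIfTI-1 header with the Python standard library only. It assumes the files are NIfTI-1 (.nii or .nii.gz), which the listing does not specify; the function name is illustrative. For loading the voxel data itself, a dedicated library such as nibabel is the usual choice.

```python
import gzip
import struct


def read_nifti1_header(path):
    """Return shape, voxel size, and data type code from a NIfTI-1 file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        raw = handle.read(348)
    if len(raw) < 348:
        raise ValueError("file is too short to hold a NIfTI-1 header")

    # sizeof_hdr (first 4 bytes) must equal 348; use it to detect byte order.
    for order in ("<", ">"):
        if struct.unpack(order + "i", raw[0:4])[0] == 348:
            break
    else:
        raise ValueError("sizeof_hdr is not 348; not a NIfTI-1 file")

    if raw[344:348] not in (b"n+1\x00", b"ni1\x00"):
        raise ValueError("NIfTI-1 magic string not found")

    dim = struct.unpack(order + "8h", raw[40:56])        # dim[0] = number of axes
    datatype, bitpix = struct.unpack(order + "2h", raw[70:74])
    pixdim = struct.unpack(order + "8f", raw[76:108])    # pixdim[1..] = spacing

    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError("invalid number of dimensions in header")
    return {
        "shape": dim[1:1 + ndim],
        "voxel_size": pixdim[1:1 + ndim],
        "datatype": datatype,
        "bitpix": bitpix,
    }
```

For a hypothetical file such as case_001.nii.gz, read_nifti1_header("case_001.nii.gz")["shape"] gives the slice dimensions and slice count, which helps spot inconsistent acquisitions across the 141 cases.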
-
-
The dataset is provided by CerebraAI Inc.
Suicide datasets
-
Two Reddit-based datasets on suicide risk are available. The first comprises 448 annotated users from r/SuicideWatch and related mental health subreddits. It is annotated with the Columbia Suicide Severity Rating Scale (C-SSRS), which places each user in one of four groups: Supportive, Suicide Ideation, Suicide Behavior, and Suicide Attempt. Inter-rater agreement for the annotations is 0.76. The second dataset covers 2,181 Redditors, including a gold-standard subset of 500 users annotated by four psychiatrists. It uses a 5-label classification scheme that extends the traditional 4-label approach. Its annotations reached a pairwise agreement of 0.79 and a group-wise agreement of 0.73. The dataset incorporates medical knowledge bases, language modeling, and entity recognition.
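For teams that add their own annotations and want a comparable reliability figure, the sketch below computes Cohen's kappa between two annotators and averages it over all annotator pairs. The listing does not say which agreement statistic produced the figures above, so Cohen's kappa is an assumption here, chosen only as one common measure; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same users."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Each element of ratings is one annotator's list of labels, in the same user order, drawn from categories such as Supportive or Suicide Ideation.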
-
Dataset 1: https://zenodo.org/records/2667859
Dataset 2: https://zenodo.org/records/4543776
-
The datasets are provided by the Knowledge-infused AI and Inference Lab, University of Maryland Baltimore County.