Guodong Liu, Ph.D.
Professor of Public Health Sciences, Neurology, Pediatrics, Psychiatry & Behavioral Health
Director of Center for Applied Studies for Health Economics (CASHE)
Psychometric Assessments via an AI-Powered Conversational Journal: Transforming Measurement-Based Care for Mood and Anxiety Disorders
Project Location: Penn State University College of Medicine
Project Timeline: July 1, 2024 – June 30, 2025
This unique research award gave me complete freedom to conduct my proposed study, without the micromanagement of details that is often the case with other external awards. Our goal was to assess the use of large language model (LLM)-based tools in customizing the Patient Health Questionnaire-9 (PHQ-9) for individuals living with bipolar disorder.
Bipolar disorder poses a major challenge for mental-health diagnosis and treatment. This category of disorders affects approximately 2.8% of adults in the United States and around 2.4% of the world population. Assessment tools such as the PHQ-9 and GAD-7 are widely used in clinical practice. However, their static nature and other shortcomings underscore the need for an integrated, individualized approach that can collect continuous, high-quality data on patients' mental states without being insensitive or intrusive.
We used ChatGPT®, Gemini®, Microsoft Copilot®, and Claude® to adapt the standard PHQ-9 questions. First, we created 50 simulated individual cases, for which personalized questions were AI-rendered and analyzed. These simulated cases varied in demographics and other characteristics, such as age, gender, socioeconomic status, bipolar disorder type, interests/hobbies, and social/emotional characteristics and tendencies. We ran the four AI chatbots to render personalized PHQ-9 questions, producing four full 9-component questionnaires for every simulated individual, one from each chatbot. In total, the chatbots rendered 1,800 PHQ-9 component questions, each tailored to the individual’s unique demographic background, context, and circumstances. Two independent evaluators then carried out a qualitative analysis to assess the quality of the PHQ-9 adaptation, contextual relevance, linguistic sensitivity, clarity, and level of personalization.
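As a quick check on the size of this design, the combinations described above can be enumerated in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function name, variable names, and case numbering are hypothetical.

```python
from itertools import product

# Design parameters taken from the description above.
CHATBOTS = ["ChatGPT", "Gemini", "Microsoft Copilot", "Claude"]
N_CASES = 50        # simulated individuals
N_PHQ9_ITEMS = 9    # component questions in the PHQ-9


def design_grid(n_cases=N_CASES, chatbots=CHATBOTS, n_items=N_PHQ9_ITEMS):
    """Enumerate every (case, chatbot, item) combination in the design.

    Case and item identifiers are simple 1-based integers (an assumption
    made for illustration only).
    """
    case_ids = range(1, n_cases + 1)
    item_ids = range(1, n_items + 1)
    return list(product(case_ids, chatbots, item_ids))


grid = design_grid()
questionnaires = {(case, bot) for case, bot, _ in grid}

print(len(questionnaires))  # full questionnaires: 50 cases x 4 chatbots
print(len(grid))            # component questions: 50 x 4 x 9
```

The two counts correspond to the 200 full questionnaires (four per simulated individual) and the 1,800 component questions reported above.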
Our preliminary findings from this pilot study showed that ChatGPT had the most success, followed by Claude, then Gemini and Copilot. Adaptations by ChatGPT were highly tailored to the individual’s background, sensitive in tone, more conversational, and faithful to the original questions. For example, consider the simulated case of Alice, a female college freshman from an upper-middle-class background with Bipolar I, majoring in finance and a member of the college track and field team. The PHQ-9 item “Little interest or pleasure in doing things” was personalized by ChatGPT as “Alice, have you felt like you’re not really enjoying things you normally do? Like, have you been feeling less excited about practice, hanging out with friends, or even school activities lately?” Questions generated by Claude had a more empathetic tone but deviated stylistically from the originals. Gemini and Copilot achieved little or no personalization. In conclusion, our study demonstrated the utility of LLM-based tools in personalizing the PHQ-9 questionnaire. As our next step, we will investigate whether this AI-based alternative improves patient engagement and the accuracy of assessment.
The research community has shown great enthusiasm for our study. Our research abstract, entitled “Leveraging AI to personalize symptom assessment for bipolar disorder: a comparative study of LLM-driven PHQ-9 tailoring”, was accepted for presentation this fall (September 17-19, 2025) at the International Society for Bipolar Disorders (ISBD) Annual Conference (Chiba, Japan), one of the top venues in bipolar disorder research. Perhaps at least partially due to this study, I was invited to the inaugural Thought Summit: Everyday AI & Mental Health – Navigating a Tipping Point (June 16-20, 2025) to participate in a discussion among leading experts and thought leaders from academia, industry, and the entrepreneurial world about the future of everyday AI and mental health.
We have continued working in this area and are exploring the feasibility of tracking and analyzing people’s daily rhythms with smart wearable sensing devices, in order to assess whether it is possible to predict the onset of mental illness, or of adverse episodes and crises among those already diagnosed with mental health disorders. We have analyzed a publicly available dataset (the StudentLife Study) from Dartmouth College to examine the association between daily sensing patterns and the weekly Ecological Momentary Assessment (EMA). In addition, we have purchased several smart wearable devices, including OURA® Rings and Fitbit® Smart Watches, to collect wearable sensing data ourselves. We will use our combined pilot data in multiple grant proposals on the use of AI and smart wearable devices for timely mental health assessment among high school and college students.
Learning Achievements:
Generative AI technology and large language models have triggered a paradigm shift. Over the last couple of years, ideas previously considered possible only in theory have become not only feasible but certain to materialize. The support from SSRI has allowed us to keep up with advances in new technologies to improve health and healthcare.
Program Strengths:
The program supports bold ideas, and this is where it really stands out compared to other funding mechanisms. I would not have had a fair chance of getting this pilot study supported through regular NIH funding mechanisms. In addition, the turnaround time (from RFA to grant application to award decision) was short. SSRI has also done a great job of publicizing the award. The announcement was made formally to a very broad audience, with a news item and picture on its website that was subsequently picked up by other media. As a result, many people had a chance to read what the award was about. I also think it is a great idea to rotate the award among the major Penn State colleges, which not only makes the award more prestigious but also balances it across many disciplinary areas, where new and organic ideas are incubated and cross-pollinated.
Program Shortcomings:
The program is outstanding. I only wish the funding period were a bit longer. New research takes considerable effort to get off the ground. For example, it takes time to assemble a team, recruit RAs, and get the data ready (which can involve IRB approval, recruiting patients as research subjects, etc.). In addition, presenting and publishing the findings naturally takes longer. The 12 months passed by like a racing car, even though I tried my best to move quickly.
Impact:
The Lloyd Prize has rekindled my interest in exploring provocative ideas at the intersection of AI, smart wearable device technology, and measurement-based care for people with mental health disorders. A tremendous amount of data has been collected and/or simulated, thanks to the Lloyd Prize. This study will lay a solid foundation for several major grant applications to NIH, NSF, and other major foundations that support mental health-related research. For this, I am very grateful to the Lloyd Prize, to SSRI, and to Dr. Lloyd personally.