Article

MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models

1
Autonomous Systems and Biomechatronics Laboratory (ASBLab), Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada
2
KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
3
Rotman Research Institute, Baycrest Health Sciences, North York, ON M6A 2E1, Canada
*
Authors to whom correspondence should be addressed.
Robotics 2025, 14(8), 102; https://doi.org/10.3390/robotics14080102
Submission received: 22 April 2025 / Revised: 6 July 2025 / Accepted: 25 July 2025 / Published: 28 July 2025
(This article belongs to the Section Intelligent Robots and Mechatronics)

Abstract

Robotic search for people in human-centered environments, including healthcare settings, is challenging, as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans, or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLMs) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method that provides robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints as a topological graph and regions with semantic labels. This is incorporated into an MLLM with a region planner that selects the next search region based on its semantic relevance to the search scenario and a waypoint planner that generates a search path by considering the semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the-art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to new and unseen environments.

1. Introduction

Autonomous mobile robots can be used to search for specific people in human-centered environments to engage in human–robot interactions. For example, robots need to locate people in: (1) multi-room homes to assist with daily tasks such as meal preparation and exercise [1,2,3,4], (2) office and university buildings to deliver packages or messages [5,6,7,8], and (3) public venues such as shopping malls and amusement parks to locate lost people [9,10,11,12]. In healthcare settings, robots search in long-term care homes to find residents to remind them of meal times or appointments [3,13,14,15] and in hospitals to find medical professionals to deliver supplies [5,16,17,18] or to guide visitors to their destinations [19].
Existing person search methods, such as hidden Markov model (HMM)-based [5,6,20] and Markov decision process (MDP)-based [21,22,23] planners have been used by robots to search for individuals with known user models. These models are generated from user location patterns, such as daily schedules [1,22] or past locations [20,21]. However, user schedules may be unavailable or incomplete, especially for new users, and may change unexpectedly due to real-time events (e.g., weather, delays in appointments). As these methods rely on past user behavior patterns, they are unable to generalize to new scenarios.
Recently, multimodal large language models (MLLMs) [24,25,26,27,28,29,30] have been proposed for robotic tasks such as robot navigation to unknown static objects. As MLLMs are trained on extensive data obtained from the internet [31], they have generalist reasoning capabilities [32,33]. However, the existing MLLMs for mobile robots have limited spatial reasoning capabilities due to being trained on image-caption pairs, which contain limited spatial information [34,35]. To address this, visual prompting methods have been developed [36,37,38,39,40,41]. These methods overlay coordinates representing locations of objects for general visual Q&A tasks [36,37,38] directly onto RGB images to improve spatial understanding of the local scene. However, these visual prompting methods cannot be directly applied to the robotic person search problem, as they lack the spatial understanding of the overall environment needed when searching within a known environment. MLLMs have the potential to be applied to robotic person search problems by leveraging their reasoning capabilities [32] to infer the location of a dynamic individual from incomplete schedules. Moreover, they can incorporate new/additional information within the MLLM’s context window, enabling zero-shot person search without retraining.
In this paper, we present a novel multimodal language model, MLLM-Search, to address the robotic person search problem for the first time under event-driven scenarios where user schedules are incomplete, unavailable, or deviate due to real-time events. The main contributions of this paper are as follows:
(1) The formulation of the robotic person search problem and the development of the first zero-shot person search method, which incorporates language models for generalist reasoning and spatial understanding of the environment.
(2) The introduction of a novel visual prompting method that generates a topological graph with semantic region labels by extracting spatial information from metric maps. The novelty lies in generating a semantically and spatially grounded waypoint map that uniquely enables MLLMs to perform search planning by providing spatial reasoning of the overall global environment.
(3) The development of an MLLM-based search planner, which incorporates a region planner and a waypoint planner. The region planner considers the semantic relevance of each search region with respect to the event-driven search scenario using region-based score prompting. The waypoint planner uses semantically relevant objects in the environment during planning within our new spatial chain-of-thought (SCoT) prompting method. Our search planner is able to generalize to scenarios with varying user schedules where historical data are unavailable, in order to optimize the likelihood of finding a person of interest. Overall, our planner bridges the gap between generalist language models and embodied spatial reasoning, showing specifically how MLLMs, trained on 2D internet-scale data, can be structured to perform real-world robotic tasks without domain-specific training.

2. Related Works

In this section, we present and discuss existing (1) person search methods developed for robotic search of a dynamic person in human-centered environments and (2) visual prompting methods used to improve spatial reasoning.

2.1. Person Search by Robots

Existing person search methods for robotic applications have consisted of (1) lookahead [5,6,20,42], (2) MDP-based [21,22,23], or (3) graph-based [1,2] planners.

2.1.1. Lookahead Planners

Lookahead planners have either used HMMs [5] or predefined likelihood functions [6,20] to navigate a robot to the most probable user locations. In particular, HMMs identify user locations based on past location and activity data. For example, in [5], an HMM-based person search method was used by a robot to find a dynamic person in an indoor office setting. The HMM predicted the person's movements in the environment based on past observed locations. In [6,20,42], lookahead planners used likelihood functions defined by human experts [6], past user locations [20], or spatial models of person occurrence [42] to determine the next search region. Namely, in [6], a robot searched for static people in an indoor laboratory, navigating to locations with the highest likelihood assigned by human experts. In [20], a robot searched for a dynamic resident in an apartment by navigating to the highest likelihood location based on past user location frequencies. Finally, in [42], a lookahead planner based on behavior trees was introduced for person search, where a mobile robot required human assistance to complete physical tasks, such as opening closed doors or using elevators. A spatial Poisson process was used to model where people are likely to appear, and candidate sequences of wait and search actions were evaluated using a discrete time Markov chain representation of the tree.

2.1.2. MDP-Based Planners

MDP-based planners [21,22,23] have optimized robotic search actions by maximizing the likelihood of finding a person. For example, in [21], an MDP-based method was used by a robot to search for static people on the floor of a building. A sequence of search regions was determined based on the expected proportion of people in each region to minimize search time. In [22], a partially observable MDP approach was used for a robot to find a dynamic person in a multi-room home environment. The approach determined search actions (e.g., searching a room) by incorporating user activity data (e.g., meal preparation) into a Bayesian network to determine the highest likelihood user locations. In [23], an MDP-based method was also used for a mobile robot to find an elderly person in a home. The MDP determined the next location to visit based on the probability of the person’s current location.

2.1.3. Graph-Based Planners

Graph-based planners have used activity probability density functions (APDF) to plan search paths based on user schedules. For example, in [1], a people search method was proposed for the assistive robot Blueberry to search for dynamic people in long-term care homes. APDF was utilized to predict user locations by considering their complete schedules. These schedules included activity types and duration, time of day, and specific regions. The work in [1] was extended in [2] to consider multiple robots searching.

2.2. Visual Prompting Methods

Visual prompting methods [36,37,38,39] for MLLMs improve spatial reasoning over standard prompting methods by overlaying visual coordinates onto RGB images of a scene. In [36,37,38], visual prompting methods were introduced for visual Q&A tasks to infer object positions [36], to identify regions based on text [37], or to answer queries related to the size and distance of objects in a scene [38]. For example, in [36], the scaffold method placed visual coordinates evenly across an RGB image of a scene, allowing MLLMs to associate visual data with textual data. Similarly, both [37] and [38], placed visual coordinates on segmented objects within an RGB image of a scene to associate these objects with their corresponding visual coordinates. Visual prompting methods have also been used in robot manipulation [39]. In [39], the PIVOT method applied visual prompting to robot manipulation tasks by placing coordinates within an RGB image of a scene, corresponding to potential robot manipulation actions. The MLLM was then used to select the next robot action based on the coordinates from the image.

2.3. Summary of Limitations

Existing lookahead planners [5,6,20] prioritize the next user region to search with the highest likelihood. However, this can result in increased search times as search plans may select further away regions [1]. On the other hand, MDP-based planners [21,22,23] rely on the Markov assumption that search decisions are based solely on the current region. This can result in redundant searches, where a robot revisits a recently searched region. Graph-based planners, along with MDP-based and lookahead planners, all require complete user schedules [1,22], past observed user locations [20,21], and/or last known user locations [23]. However, in real-world scenarios, such user information may be unavailable or incomplete due to insufficient knowledge about the user. Furthermore, user behaviors can deviate from expected schedules due to real-time events such as emergency situations or schedule changes, availability of locations, etc. Thus, existing robotic person search methods cannot generalize to these real-world scenarios.
MLLMs have the potential to infer the region a person is in from incomplete or changing information, by leveraging knowledge learned from extensive internet data [31]. They can also consider additional search information beyond user schedules, such as building/room schedules, or activity-specific data, without requiring retraining or specialized models for each data type. However, MLLMs have not yet been applied to robotic person search tasks. While existing MLLMs have incorporated visual prompting methods [36,37,38,39], these methods have focused on improving spatial understanding of local scenes from RGB images. However, spatial understanding of the entire environment is needed when searching for individuals within known environments.
To address the aforementioned limitations, we propose MLLM-Search, the first robotic person search method that leverages MLLMs for event-driven scenarios where user schedules vary. MLLM-Search provides spatial reasoning of the global environment by generating a semantically and spatially grounded waypoint map. Our search method incorporates both region and waypoint planners to generalize to scenarios with user schedules that are varying in completeness or have changed.

3. Person Search Problem Under Event-Driven Scenarios with Varying User Schedules

Problem Definition

The robot person search problem under event-driven scenarios requires a mobile robot to search for a dynamic person in a known environment with incomplete, partial, or no a priori knowledge of their schedule. A search query $q_s$, provided by the search operator to the robot, includes natural language instructions containing the person to search for and their physical description $q_a$. The search query can optionally contain information such as their name, role, tasks, last known location, etc. The robot has access to an information database $Q_{db}$, which consists of (1) the user schedule $Q_u$, if available, and (2) the system database $Q_s$, if available, containing textual data relevant to the search, such as building/room schedules, visitor logs, EHRs (Electronic Health Records), etc. During the search at time $t$, images $x_t$ are obtained from the robot's camera. The function $f$ is defined to output a sequence of robot actions $u_t$ given the robot position $p_i$, the metric map $\mathcal{M}$, the search query $q_s$, and the information database $Q_{db}$:

$$u_t = f(x_t, p_i, \mathcal{M}, q_s, Q_{db}).$$

The objective is to minimize the expected distance traveled $d$ between the robot's start location $p_s$ and the target location $p_{tg}$:

$$\min \; \mathbb{E}[d(p_s, p_{tg})].$$

4. MLLM-Search Architecture

The proposed MLLM-Search architecture, shown in Figure 1, consists of two subsystems: (1) the Map Generation Subsystem (MGS), and (2) the Person Search Subsystem (PSS). The goal of the MGS is to generate both a semantic metric map $\mathcal{M}_{sem}$ and a waypoint metric map $\mathcal{M}_{wp}$ of the environment. Once the environment is mapped, the PSS leverages MLLMs to search for the user.

4.1. Map Generation Subsystem (MGS)

The MGS consists of two main modules: the Semantic Map Generation (SMG) module and the Waypoint Map Generation (WMG) module.

4.1.1. Semantic Map Generation (SMG)

The SMG module builds a semantic map of the environment. It consists of three modules: (1) Object Discovery VLM (OD-VLM), (2) Open Segmentation (OS), and (3) Semantic Simultaneous Localization and Mapping (S-SLAM). The OD-VLM module utilizes an MLLM to identify the objects in the environment. Namely, it takes an input image $x_{RGB}$ and generates a list of detected object labels $L_o$ in the image. The OS module takes these object labels $L_o$ and uses the grounded segment anything model (Grounded SAM) [43] to generate corresponding segmentation masks $x_{seg}$ for each object. The S-SLAM module [44] takes an RGB image $x_{RGB}$ and a depth image $x_D$, and produces a semantic map $\mathcal{M}_{sem}$, as shown in Figure 2a. Specifically, it takes the segmented portions of $x_{RGB}$ and $x_D$ and projects them into a 3D point cloud $x_{SEG\text{-}PCL}$ using the pinhole camera model [45]. The point cloud $x_{SEG\text{-}PCL}$ is then converted into a voxel representation and summed over the height dimension to obtain the semantic map [44]. The semantic map is updated during the search to represent new objects and existing objects that have changed locations.
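To make the projection step concrete, the following is a minimal sketch of how segmented RGB-D pixels can be back-projected with the pinhole camera model and accumulated into a top-down semantic grid. It assumes known camera intrinsics (fx, fy, cx, cy) and a fixed grid resolution, and it omits the camera-to-map transform; the function and parameter names are illustrative, not those of the S-SLAM implementation [44].

```python
import numpy as np

def project_semantic_mask(depth, mask, fx, fy, cx, cy):
    """Project the depth pixels covered by a segmentation mask into a 3D
    point cloud using the pinhole camera model (camera frame)."""
    v, u = np.nonzero(mask)                 # pixel rows/cols inside the mask
    z = depth[v, u]                         # depth in meters
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) points in the camera frame

def update_semantic_map(sem_map, points, label_id, resolution=0.05, origin=(0.0, 0.0)):
    """Accumulate a top-down semantic grid by discretizing the points over the
    ground plane (the transform from camera to map frame is omitted here)."""
    gx = ((points[:, 0] - origin[0]) / resolution).astype(int)   # camera x = right
    gy = ((points[:, 2] - origin[1]) / resolution).astype(int)   # camera z = forward
    inside = (gx >= 0) & (gx < sem_map.shape[1]) & (gy >= 0) & (gy < sem_map.shape[0])
    sem_map[gy[inside], gx[inside]] = label_id                   # illustrative label id
    return sem_map
```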

4.1.2. Waypoint Map Generation (WMG)

The WMG module generates a semantically and spatially grounded waypoint map $\mathcal{M}_{wp}$ and consists of three sub-modules: (1) Occupancy Grid SLAM (OG-SLAM), (2) Topological Map Generation (TMG), and (3) Waypoint Visual Prompt Generation (WVPG). The OG-SLAM sub-module creates an occupancy map $\mathcal{M}_{occ}$ using odometry $\rho$ and point clouds $x_{PCL}$ with particle filters [46]. The TMG sub-module uses the occupancy map $\mathcal{M}_{occ}$ to generate a topological map, represented as a graph $\mathcal{G}_{top} = (V, E)$, where nodes $V$ represent navigable waypoints, and edges $E$ represent the traversable paths between waypoints. The nodes $V$ are obtained from the free space in $\mathcal{M}_{occ}$. First, the distance transform $D$ is computed, which measures the distance from each point $p$ in $\mathcal{M}_{occ}$ to the nearest obstacle $o \in O$:

$$D(p) = \min_{o \in O} \| p - o \|.$$

Safe points $S = \{p \mid D(p) \geq \sigma_{min}\}$ are identified as those at least a distance $\sigma_{min}$ away from obstacles. K-means clustering [47] is used to generate waypoints $w_i$ from these safe points. K-means was selected for its computational efficiency over other clustering methods. As K-means does not account for obstacles when selecting cluster centroids (i.e., waypoints), we apply a post-processing step where any $w_i$ with $D(w_i) < \sigma_{min}$ is moved to the nearest safe point:

$$w_i = \arg\min_{p \in S} \| p - w_i \|.$$

Edges $E$ are determined using a KD-tree to find neighboring nodes within a distance $\sigma_{max}$, and Bresenham's algorithm [48] is used to check whether the path between waypoints is obstacle-free. The WVPG sub-module uses both $\mathcal{M}_{occ}$ and $\mathcal{G}_{top}$ to generate a waypoint map $\mathcal{M}_{wp}$, as shown in Figure 2b, where navigation waypoints $w_i$ and high-level region labels $L_r$ are directly overlaid on top of the occupancy map. Waypoints $w_i$ are represented as numbered markers (i.e., $w_1$ is labeled as "1"), and region labels $L_r$ are represented as text (e.g., "Main Lobby"), as shown in Figure 2b. The map $\mathcal{M}_{wp}$ is passed into the PSS.
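The waypoint extraction described above can be sketched as follows, assuming a 2D occupancy grid with distances measured in cells; the number of clusters k and the thresholds sigma_min and sigma_max are illustrative values, and the straight-line check uses a simple raster approximation in place of Bresenham's algorithm [48].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def line_is_free(occ, a, b):
    """Approximate raster check that the straight line between two cells is obstacle-free."""
    n = int(np.hypot(*(np.array(b) - np.array(a)))) + 1
    rows = np.linspace(a[0], b[0], n).round().astype(int)
    cols = np.linspace(a[1], b[1], n).round().astype(int)
    return not occ[rows, cols].any()

def build_topological_map(occ, k=40, sigma_min=5, sigma_max=60):
    """occ: 2D occupancy grid (1 = obstacle, 0 = free); distances are in cells."""
    # Distance transform: distance from each free cell to the nearest obstacle.
    dist = distance_transform_edt(occ == 0)

    # Safe points are free cells at least sigma_min away from any obstacle.
    safe = np.argwhere(dist >= sigma_min)                     # (N, 2) row/col coordinates

    # Cluster the safe points; cluster centroids become candidate waypoints.
    centers = KMeans(n_clusters=k, n_init=10).fit(safe.astype(float)).cluster_centers_

    # Snap any centroid that landed too close to an obstacle onto the nearest safe point.
    waypoints = []
    for c in centers:
        r, col = int(round(c[0])), int(round(c[1]))
        if dist[r, col] < sigma_min:
            r, col = safe[np.argmin(np.linalg.norm(safe - c, axis=1))]
        waypoints.append((r, col))
    waypoints = np.array(waypoints)

    # Connect waypoints within sigma_max whose straight-line path is obstacle-free.
    tree = cKDTree(waypoints)
    edges = [(i, j) for i, j in tree.query_pairs(r=sigma_max)
             if line_is_free(occ, waypoints[i], waypoints[j])]
    return waypoints, edges
```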

4.2. Person Search Subsystem

The Person Search Subsystem is used to search for individuals within a dynamic environment using the semantic and waypoint maps generated by MGS. It contains the following modules: MLLM Region Planner, MLLM Waypoint Planner, Target Tracking, and Navigation.

4.2.1. Multimodal LLM Region Planner

The MLLM Region Planner ($\mathrm{MLLM}_{RP}$) module determines the region $r_{t+1} \in R$ to search by considering the semantic relevance of each search region with respect to the search scenario. The inputs to this module include the robot's waypoint position $w_t$, the search query $q_s$, the information database $Q_{db}$, and the robot's memory $\mathcal{M}$. $\mathcal{M}$ consists of the previous search histories of regions visited, $\mathcal{M}_r = \{r_i\}_{i=1}^{t}$, and waypoints visited, $\mathcal{M}_w = \{w_i\}_{i=1}^{t}$, namely, $\mathcal{M} = \mathcal{M}_r \cup \mathcal{M}_w$. Region-to-object assignments are obtained from the semantic map $\mathcal{M}_{sem}$ by assigning each object to the closest region: $O_r = \{(r_i, \{o_j\}_{j \in J_i})\}_i$, where $J_i$ is the index set of objects $o_j$ assigned to region $r_i$. The contextual database $Q_{db}^{*}$, representing semantically relevant information for the search, is retrieved through Retrieval-Augmented Generation (RAG) [49,50]. RAG uses the cosine similarity between the search query embedding $e(q_s)$ and the database embeddings $e(q_{db_i})$, where the database is divided into chunks $q_{db_i} \in Q_{db}$:

$$Q_{db}^{*} = \arg\max_{q_{db_i} \in Q_{db}} \frac{e(q_s) \cdot e(q_{db_i})}{\| e(q_s) \| \, \| e(q_{db_i}) \|}.$$

We use region-based score prompting to assign semantic scores, $S_{t+1} = \{(s_l^i, s_p^i, s_r^i)\}$, to the potential search regions $r_i$. $\mathrm{MLLM}_{RP}$ outputs the semantic scores for each region $r_i$:

$$S_{t+1} = \mathrm{MLLM}_{RP}(q_s, Q_{db}, w_t, \mathcal{M}_t, O_r, \mathcal{M}_{wp}).$$

In particular, $S_{t+1}$ consists of: (1) the likelihood score $s_l$, (2) the proximity score $s_p$, and (3) the recency score $s_r$, shown in Figure 3. These scores were added iteratively during development, and their effectiveness was validated in the ablation study presented in Section 5.4. Furthermore, we use Chain-of-Thought (CoT) prompting [51,52] to provide explicit and sequential step-by-step reasoning. The text and visual prompts are also presented in Figure 3. The robot selects the region with the highest sum of semantic scores as the next region $r_{t+1}$ to search:

$$r_{t+1} = \arg\max_{i} \left( s_l^i + s_p^i + s_r^i \right).$$
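The retrieval and region-selection steps can be illustrated with the following sketch, which ranks database chunks by cosine similarity to the query embedding and then picks the region with the highest summed semantic score; the embedding source, score scale, and example values are assumptions made for illustration only.

```python
import numpy as np

def retrieve_context(query_emb, chunk_embs, chunks, top_k=3):
    """Rank database chunks by cosine similarity to the search query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    best = np.argsort(-sims)[:top_k]            # indices of the most relevant chunks
    return [chunks[i] for i in best]

def select_next_region(scores):
    """scores: {region: (s_l, s_p, s_r)} returned by the region-planner MLLM.
    The next region is the one with the highest summed semantic score."""
    return max(scores, key=lambda r: sum(scores[r]))

# Hypothetical 0-10 scores for three regions, for illustration only.
scores = {"ICU": (8, 6, 9), "Exam Room": (7, 9, 9), "Lobby": (3, 8, 5)}
print(select_next_region(scores))               # -> "Exam Room"
```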

4.2.2. Multimodal LLM Waypoint Planner

The MLLM Waypoint Planner ($\mathrm{MLLM}_{WP}$) module plans a sequence of waypoints $p_{t+1}$ to a search region $r_{t+1}$ while prioritizing the likelihood of encountering the person along the path. For example, when searching for a student, it may plan a route near tables where students work. A* [53] is used to generate several candidate paths $p_i$ from the current waypoint $w_t$ to the destination waypoint $w_{t+1}$. Waypoint-to-object assignments are obtained from $\mathcal{M}_{sem}$ by associating each object with the closest waypoint: $O_w = \{(w_i, \{o_j\}_{j \in J_i})\}_i$. This allows $\mathrm{MLLM}_{WP}$ to consider semantically relevant objects during planning. We have developed a novel spatial CoT (SCoT) prompting method that improves the spatial awareness of MLLMs over standard prompting methods by decomposing the planning task into sequential steps, with each step uniquely considering the semantically relevant objects (the parameter "objects" in Figure 4) as well as the local spatial context (the parameter "next_waypoints" in Figure 4). SCoT was designed to reduce model hallucinations, such as the selection of unreachable or disconnected waypoints, by grounding each decision step with a set of valid next-waypoint candidates. $\mathrm{MLLM}_{WP}$ finds $p_{t+1}$ by optimizing the path to region $r_{t+1}$:

$$p_{t+1} = \mathrm{MLLM}_{WP}(q_s, Q_{db}, w_t, r_{t+1}, p_i, \mathcal{M}_t, \mathcal{M}_{wp}, O_w).$$

The output waypoint sequence $p_{t+1}$ is checked for feasibility against the topological graph $\mathcal{G}_{top} = (V, E)$:

$$\forall (w_i, w_{i+1}) \in p_{t+1}, \; (w_i, w_{i+1}) \in E.$$
If constraints are violated, the above steps are repeated.
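A minimal sketch of this feasibility check and re-prompting loop is shown below; the wrapper around the MLLM call and the retry limit are hypothetical.

```python
def path_is_feasible(path, edges):
    """Check that every consecutive waypoint pair in the MLLM output
    corresponds to an edge of the topological graph (treated as undirected)."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((a, b)) in edge_set for a, b in zip(path, path[1:]))

def plan_with_retries(query_waypoint_planner, edges, max_retries=3):
    """Re-prompt the waypoint planner until it returns a feasible path
    (a simplified view of the constraint check described above)."""
    for _ in range(max_retries):
        path = query_waypoint_planner()        # assumed wrapper around the MLLM call
        if path_is_feasible(path, edges):
            return path
    return None                                # caller falls back to replanning
```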

4.2.3. Target Tracking

The Target Tracking module identifies and tracks the target person in the environment. It takes as input an RGB image $x_{RGB}$ and a text description $q_a$ of the person's appearance. The Person Tracker, which runs throughout the search, detects and tracks individuals using LDTrack [7], which we previously developed. LDTrack leverages diffusion models [54] to capture temporal embeddings of people. When a person is identified, the Open Detection (OpenDet) module, using G-DINO [55], detects the individual based on their description $q_a$. Each bounding box $b_i$ is associated with a label $c_i$ and a confidence score $p(c_i)$, where $c_i \in q_a$. An MLLM is then used to evaluate whether the detected individual matches the search target by comparing $c_i$ with $q_a$. If it matches, a target waypoint $w_{t+1}$ is passed to the Navigation module.
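The confirmation step can be sketched as follows, where open-vocabulary detections are filtered by confidence and a hypothetical MLLM callback confirms the match against the textual description $q_a$; the threshold and the function names are assumptions, not the actual OpenDet or LDTrack interfaces.

```python
def confirm_target(detections, query_desc, mllm_confirm, conf_thresh=0.5):
    """detections: list of (box, label, confidence) from the open-vocabulary
    detector; mllm_confirm: assumed callable that asks the MLLM whether a
    detected label/crop matches the textual description q_a."""
    for box, label, conf in detections:
        if conf >= conf_thresh and mllm_confirm(label, query_desc):
            return box          # target found; its waypoint is handed to Navigation
    return None                 # no confirmed match in this frame
```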

4.2.4. Navigation

The Navigation module converts a target waypoint $w_{t+1}$ from the Target Tracking module or a sequence of waypoints $p_{t+1}$ from the $\mathrm{MLLM}_{WP}$ module into robot velocities $(v, \omega)$ for navigation. The A* algorithm [53] and the TEB planner [56] were used as the global and local planners, respectively. The TEB local planner also provides real-time obstacle avoidance by dynamically adjusting the robot's trajectory in response to static and moving obstacles. AMCL [57] was used to localize the robot within $\mathcal{M}_{occ}$.
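As an illustration of how a planned waypoint sequence can be dispatched to such a navigation stack, the sketch below assumes a ROS move_base-style action interface; whether the system uses this exact interface is an assumption, and the frame name and coordinates are illustrative.

```python
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def follow_waypoints(waypoints_xy):
    """Send each (x, y) waypoint in the planned sequence as a navigation goal."""
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()
    for x, y in waypoints_xy:
        goal = MoveBaseGoal()
        goal.target_pose.header.frame_id = "map"
        goal.target_pose.header.stamp = rospy.Time.now()
        goal.target_pose.pose.position.x = x
        goal.target_pose.pose.position.y = y
        goal.target_pose.pose.orientation.w = 1.0   # fixed heading for simplicity
        client.send_goal(goal)
        client.wait_for_result()                    # block until the waypoint is reached

if __name__ == "__main__":
    rospy.init_node("waypoint_follower")
    follow_waypoints([(1.0, 2.0), (3.5, 2.0)])      # hypothetical waypoint coordinates
```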

5. Simulated Experiments

We conducted extensive simulated experiments to evaluate the performance of our MLLM-Search architecture for robotic person search under event-driven scenarios with varying user schedules. Namely, we first conducted an ablation study to investigate the contributions of the design choices of our MLLM-Search architecture and then performed a benchmark comparison study with state-of-the-art (SOTA) person search methods. GPT-4o [33] was used as the MLLM. These experiments were conducted using a Clearpath Jackal robot.

5.1. Environment

Two 3D photorealistic environments from AWS RoboMaker [58] were used in the ROS Gazebo 3D simulator: (1) a hospital environment, and (2) an office environment. The hospital is 25 m × 55 m in size and includes 11 regions, such as patient rooms, intensive care units, and patient wards, with objects such as beds, chairs, and medical equipment, as shown in Figure 5a. The office is 22 m × 48 m in area and includes 14 regions, such as rooms, cubicles, and conference rooms, with objects such as tables, chairs, and TVs, as shown in Figure 5b. Both environments include dynamic people, whose movements have been modeled using the social force model [59].

5.2. Performance Metrics

We use three metrics to evaluate robot search performance:
  • Mean Success Rate (SR): the proportion of searches where the robot successfully locates the target user.
  • Success weighted by Path Length (SPL): the efficiency of the search method, computed as $\frac{1}{N}\sum_{i=1}^{N} S_i \frac{l_i}{\max(p_i, l_i)}$, where $N$ is the total number of search trials, $S_i$ indicates whether trial $i$ was successful, $l_i$ is the shortest-path length, and $p_i$ is the length of the path taken by the robot (an illustrative computation of all three metrics follows this list).
  • Mean Search Time (ST): the average time to complete the search and locate the target person across all trials.
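For clarity, the three metrics can be computed from per-trial logs as in the following sketch; the field names are illustrative assumptions.

```python
def evaluate_search(trials):
    """trials: list of dicts with keys 'success' (bool), 'shortest' (m),
    'path' (m), and 'time' (min); field names are illustrative."""
    n = len(trials)
    sr = sum(t["success"] for t in trials) / n                       # Mean Success Rate
    spl = sum(t["success"] * t["shortest"] / max(t["path"], t["shortest"])
              for t in trials) / n                                   # Success weighted by Path Length
    st = sum(t["time"] for t in trials) / n                          # Mean Search Time
    return sr, spl, st
```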

5.3. Search Scenarios

For each environment type, we generated four types of scenarios based on people's schedules:
  • Complete Schedules: scenarios involving schedules with all information provided;
  • Shifted Schedules: scenarios where schedules have been shifted back or forward due to real-time events (i.e., meetings running late or emergencies occurring);
  • Partial/Incomplete Schedules: scenarios with partial schedules involving a 1–2 h gap. Scenarios with incomplete schedules involve larger time gaps of more than 2 h;
  • No Schedules: scenarios with no prior schedule information available.
GPT-4o [33] was used to generate the above scenarios using the waypoint map $\mathcal{M}_{wp}$ and the object locations $O_w$ and $O_r$. Handcrafted example scenarios of each schedule type were also provided to GPT-4o for in-context learning to generate 10 scenarios for each schedule type.

5.4. Ablation Methods

We considered the following for our ablation study:
1. MLLM-Search: our proposed method;
2. MLLM-Search with (w/) Single Stage (SS): a variant of MLLM-Search using a single MLLM to perform both region and waypoint planning.
We also ablated the score variables of the region planner $\mathrm{MLLM}_{RP}$ to determine their influence:
3. MLLM-Search without (w/o) likelihood $s_l$: a variant with no likelihood score $s_l$;
4. MLLM-Search w/o proximity $s_p$: a variant with no proximity score $s_p$;
5. MLLM-Search w/o recency $s_r$: a variant with no recency score $s_r$; and
6. MLLM-Search w/o scores $\Sigma s$: a variant with no scores.
Furthermore, we ablated the waypoint planner $\mathrm{MLLM}_{WP}$:
7. MLLM-Search w/o SCoT: a variant with no SCoT prompting.

5.5. SOTA Methods

The SOTA methods we compared our MLLM-Search method with were as follows:
  • MDP-based planner [23]: This planner selects a region to search based on the expected reward, which is determined using transition probabilities between regions and the user location PDF. It was selected as a representative decision-making approach.
  • HMM-based planner: This planner predicts the target person’s region by modeling movement as a sequence of hidden states with transition probabilities between regions. It was selected as a representative probabilistic inference approach.
  • Random walk planner (RW): RW selects a region to search uniformly at random. It was selected as a naïve baseline.
For all methods, GPT-4o was used to generate the transition and user location probabilities for each scenario.
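To illustrate how such a baseline uses these probabilities, the sketch below propagates a belief over regions through a transition matrix and returns the most likely region, which is a simplified view of the HMM-based baseline; the matrix and belief values are hypothetical.

```python
import numpy as np

def hmm_predict_region(belief, transition, steps=1):
    """Propagate a belief over regions through a transition matrix and return
    the index of the most likely region (the probabilities themselves would be
    produced by GPT-4o for each scenario)."""
    for _ in range(steps):
        belief = belief @ transition          # one step of Markov belief propagation
    return int(np.argmax(belief)), belief

# Hypothetical 3-region example (rows sum to 1).
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
b0 = np.array([1.0, 0.0, 0.0])                # person last seen in region 0
print(hmm_predict_region(b0, T, steps=2))
```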

5.6. Simulation Results

Table 1 presents the results of both the ablation study and the SOTA comparison. In general, MLLM-Search outperformed all methods across all metrics and scenarios. In particular, the ablation study showed that our MLLM-Search with two stages of region and waypoint planning performed better than the single-stage variant (MLLM-Search w/SS) for both environments. The degradation in performance metrics for the single-stage variant is due to the selection of suboptimal waypoints as a result of longer context windows [60]. Namely, the MLLM must simultaneously process more information when combining the planning stages.
Ablating the design choices of the region planner, we observed that removing each score component resulted in performance degradation. The most significant degradation was observed for the MLLM-Search w/o $\Sigma s$ variant. This variant accounts for neither the travel distances between regions nor how recently regions were visited, resulting in frequent travel to faraway regions and repeated revisits of the same regions. Similarly, MLLM-Search w/o $s_r$ resulted in the robot frequently revisiting the same regions, leading to longer search times (up to 20.7 min) and up to 50% longer search paths. MLLM-Search w/o $s_p$ resulted in the robot selecting regions that were the farthest away from its current location, leading to inefficient searches, as noted by a longer ST of up to 18.1 min, as well as a degradation of up to 40% in SR and 48% in SPL. MLLM-Search w/o $s_l$ prioritizes the closest region rather than the most probable one, which resulted in degradations of up to 30% in SR, 36% in SPL, and up to 9.9 min longer ST across all scenarios in both environments. Compared to the other variants, this resulted in the least degradation, as the proximity and recency scores directly prevent inefficient behaviors such as traveling to faraway regions or repeated revisits.
Ablating the design choice of the waypoint planner, MLLM-Search w/o SCoT resulted in SR and SPL values comparable to those of MLLM-Search; however, its ST was up to 22.7 min longer. Without SCoT, many of the generated paths were infeasible due to a lack of spatial understanding, thus requiring more time for GPT-4o to replan.
The comparison results showed that our MLLM-Search method outperformed the SOTA planners. For both MDP-based and HMM-based planners, search degradation became more significant as user schedules became less available. The SOTA methods can only perform optimally when a user’s past schedule directly matches their actual location, as they rely on historical data to derive transition probabilities (to predict movement patterns) and user location PDFs (to estimate user likelihoods in different regions). However, even with complete schedules, real-time events can cause deviations from the expected schedule. As a result, both MDP- and HMM-based methods only obtained an SR of 0.80–0.90 in the Complete Schedule scenarios across both environments, as shown in Table 1. The RW method consistently performed poorly across all scenario and schedule types, often revisiting the same regions or exploring faraway regions.
Under the most challenging scenarios, such as the No Schedule scenario type, our MLLM-Search method outperformed the other SOTA planners with up to a 50% improvement in SR, up to a 193% improvement in SPL, and an ST up to 22.4 min faster in both the hospital and office environments. Even without prior user information, our MLLM-Search was able to leverage contextual cues from the search scenario to locate the user. For example, in a hospital scenario, the robot was tasked with delivering supplies to a doctor located in the ICU room during an emergency situation. Based on this context, the robot inferred that the doctor was most likely in the Exam Room, ICU, or Operating Room, allowing the robot to efficiently locate the doctor. Similarly, in an office scenario, the robot was tasked with locating a client who arrived for a meeting with the CEO. As visitors do not have schedules in the system, the client's location was unknown. Based on the context of the visit, the robot inferred that the client was most likely in one of three locations: the Reception Area, where visitors typically wait; the Conference Room, where meetings are often held; or the Executive Office, where the CEO might already be meeting the client. This deductive reasoning from contextual information allowed the robot to efficiently locate the user. On the other hand, without prior information, both the MDP- and HMM-based methods assumed uniform distributions for user transition probabilities and likelihoods, and were unable to incorporate semantic understanding of the environment and contextual understanding of the situation. However, failures still occurred in some No Schedule scenarios, for example, with the search query "Find Dr. Smith who has curly gray hair, a white coat, and glasses", where the robot explored multiple plausible regions such as the Exam Room, ICU, General Ward, and Radiology Room, but failed to find the target before the deadline due to a lack of evidence to narrow down the target location.

6. Real-World Experiments

We conducted a real-world trial on a 40 m by 43 m multi-room floor of a university building, where a food delivery robot was tasked with delivering lunch to a student. The robot consisted of a Jackal robot base with a Velodyne LiDAR and a ZED Mini camera located on a custom platform at a height of 1.2 m above the ground, as shown in Figure 6a. The environment included the following regions: (1) Conference Room (CR), (2) Lounge (LN), (3) Atrium (AT), (4) Lecture Room (LR), (5) Club Room (CLR), (6) Study Room (SR), and (7) Lab (LB), with various objects such as chairs and tables, as shown in Figure 6b,c.
The student was initially located in the AT but moved to the CLR after receiving a message from the 3D printing club regarding a print failure. The waypoint map generated by the WMG module is presented in Figure 7. The search scenario is described as follows:
Search Query $q_s$: "Search for an undergraduate student with glasses and wearing jeans. The search start time is 1PM."
User Schedule $Q_u$: The student has an exam in two hours at 3PM. This is a partial schedule type due to the time gap.
Room Scheduling and Student Activity Database $Q_s$: The scheduling database indicates that the LN is closed due to renovations. Club records indicate that the student is involved with the 3D printing club.
Results: The search path taken by the robot and the path of the student are presented in Figure 7. The robot first visited the CR, as it achieved the highest $s_l$ and $s_p$ scores from $\mathrm{MLLM}_{RP}$. The CoT reasoning indicated that this choice was due to the presence of objects such as chairs and tables, as well as the room being a likely place for students to study for an upcoming exam. After the robot reached the CR, the region scores were updated by $\mathrm{MLLM}_{RP}$, with the AT having the next highest $s_l$ and $s_p$ scores. As a result, the robot navigated to the AT next, with $\mathrm{MLLM}_{WP}$ using our SCoT method to select an obstacle-free route between the tables, providing visibility of seated individuals on both sides of the tables. This demonstrates spatial reasoning, as the robot is able to associate objects in the environment with waypoints for path planning. Lastly, the robot visited the CLR, which, despite having a lower $s_l$ score than the SR, had a higher $s_p$ score due to its close proximity to the robot's location, resulting in the highest overall score from $\mathrm{MLLM}_{RP}$. Overall, MLLM-Search achieved an ST of 6.38 min and an SPL of 0.53. MLLM-Search requires only two GPT inference calls per search step: one for selecting a region and one for generating a waypoint path. Each call takes approximately 10 s on average. A video of this scenario is presented on our YouTube channel at: https://youtu.be/mzP3vcU611Y (accessed on 25 July 2025).

7. Conclusions

In this paper, we presented MLLM-Search, a novel person search architecture developed to address the challenge of locating a person under event-driven scenarios with varying user schedules. MLLM-Search is the first approach to incorporate MLLMs for person search, using a unique visual prompting method that generates a semantically and spatially grounded waypoint map to provide robots with spatial understanding of the global environment. MLLM-Search includes a region planner that selects regions based on semantic relevance, and a waypoint planner that considers semantically relevant objects and spatial context using our novel SCoT prompting method in order to plan a robot search path. An extensive ablation study validated the design choices of MLLM-Search, while a comparison study with SOTA methods demonstrated that MLLM-Search achieves higher search efficiency in 3D photorealistic environments under event-driven scenarios with varying user schedules. A real-world experiment demonstrated that MLLM-Search generalizes to a new, unseen environment and scenario. Future work will extend MLLM-Search to consider human–robot interactions during the search, allowing the robot to obtain additional search evidence or clues, as well as extending the Waypoint Map Generation module to support point cloud and mesh-based map representations.

Author Contributions

Conceptualization, A.F., A.H.T., H.W. and G.N.; methodology, A.F., A.H.T., H.W. and G.N.; software, A.F.; validation, A.F., A.H.T., H.W. and G.N.; formal analysis, A.F. and G.N.; investigation, A.F., A.H.T., H.W. and G.N.; resources, G.N. and B.B.; data curation, A.F., A.H.T. and H.W.; writing—original draft preparation, A.F. and G.N.; writing—review and editing, A.F. and G.N.; visualization, A.F. and G.N.; supervision, G.N. and B.B.; project administration, G.N. and B.B.; funding acquisition, G.N. and B.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by AGE-WELL Inc., the Natural Sciences and Engineering Research Council of Canada (NSERC), an NSERC CREATE HeRo fellowship, and the Canada Research Chairs Program.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mohamed, S.C.; Rajaratnam, S.; Hong, S.T.; Nejat, G. Person Finding: An Autonomous Robot Search Method for Finding Multiple Dynamic Users in Human-Centered Environments. IEEE Trans. Autom. Sci. Eng. 2020, 17, 433–449. [Google Scholar] [CrossRef]
  2. Mohamed, S.C.; Fung, A.; Nejat, G. A Multirobot Person Search System for Finding Multiple Dynamic Users in Human-Centered Environments. IEEE Trans. Cybern. 2023, 53, 628–640. [Google Scholar] [CrossRef]
  3. Wilson, G.; Pereyda, C.; Raghunath, N.; de la Cruz, G.; Goel, S.; Nesaei, S.; Minor, B.; Schmitter-Edgecombe, M.; Taylor, M.E.; Cook, D.J. Robot-Enabled Support of Daily Activities in Smart Home Environments. Cogn. Syst. Res. 2019, 54, 258–272. [Google Scholar] [CrossRef] [PubMed]
  4. Lee, J.J.; Atrash, A.; Glas, D.F.; Fu, H. Developing Autonomous Behaviors for a Consumer Robot to Be near People in the Home. In Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication, Busan, Republic of Korea, 28 August 2023; pp. 197–204. [Google Scholar]
  5. Bayoumi, A.; Karkowski, P.; Bennewitz, M. Speeding up Person Finding Using Hidden Markov Models. Robot. Auton. Syst. 2019, 115, 40–48. [Google Scholar] [CrossRef]
  6. Elinas, P.; Hoey, J.; Little, J.J. Homer: Human Oriented Messenger Robot. In AAAI Spring Symposium on Human Interaction with Autonomous Systems in Complex Environments; AAAI: Washington, DC, USA, 2003. [Google Scholar]
  7. Fung, A.; Benhabib, B.; Nejat, G. LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models. Int. J. Comput. Vis. 2025, 133, 3392–3412. [Google Scholar] [CrossRef]
  8. Veloso, M.; Biswas, J.; Coltin, B.; Rosenthal, S.; Kollar, T.; Mericli, C.; Samadi, M.; Brandao, S.; Ventura, R. CoBots: Collaborative Robots Servicing Multi-Floor Buildings. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Vilamoura-Algarve, Portugal, 2012. [Google Scholar]
  9. Lin, X.; Lu, R.; Kwan, D.; Shen, X. (Sherman) REACT: An RFID-Based Privacy-Preserving Children Tracking Scheme for Large Amusement Parks. Comput. Netw. 2010, 54, 2744–2755. [Google Scholar] [CrossRef]
  10. Dworakowski, D.; Fung, A.; Nejat, G. Robots Understanding Contextual Information in Human-Centered Environments Using Weakly Supervised Mask Data Distillation. Int. J. Comput. Vis. 2023, 131, 407–430. [Google Scholar] [CrossRef]
  11. Fung, A.; Wang, L.Y.; Zhang, K.; Nejat, G.; Benhabib, B. Using Deep Learning to Find Victims in Unknown Cluttered Urban Search and Rescue Environments. Curr. Robot. Rep. 2020, 1, 105–115. [Google Scholar] [CrossRef]
  12. Shiomi, M.; Kanda, T.; Glas, D.F.; Satake, S.; Ishiguro, H.; Hagita, N. Field Trial of Networked Social Robots in a Shopping Mall. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 2846–2853. [Google Scholar]
  13. Mehdi, S.A.; Berns, K. Probabilistic Search of Human by Autonomous Mobile Robot. In Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments, Crete, Greece, 25–27 May 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 1–8. [Google Scholar]
  14. Nauta, J.; Mahieu, C.; Michiels, C.; Ongenae, F.; De Backere, F.; De Turck, F.; Khaluf, Y.; Simoens, P. Pro-Active Positioning of a Social Robot Intervening upon Behavioral Disturbances of Persons with Dementia in a Smart Nursing Home. Cogn. Syst. Res. 2019, 57, 160–174. [Google Scholar] [CrossRef]
  15. Cruces, A.; Jerez, A.; Bandera, J.P.; Bandera, A. Socially Assistive Robots in Smart Environments to Attend Elderly People—A Survey. Appl. Sci. 2024, 14, 5287. [Google Scholar] [CrossRef]
  16. Tipaldi, G.D.; Arras, K.O. I Want My Coffee Hot! Learning to Find People under Spatio-Temporal Constraints. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1217–1222. [Google Scholar]
  17. Fung, A.; Benhabib, B.; Nejat, G. Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations. IEEE Robot. Autom. Lett. 2023, 8, 3550–3557. [Google Scholar] [CrossRef]
  18. Kodur, K.; Kyrarini, M. Patient–Robot Co-Navigation of Crowded Hospital Environments. Appl. Sci. 2023, 13, 4576. [Google Scholar] [CrossRef]
  19. Hasan, M.K.; Hoque, A.; Szecsi, T. Application of a Plug-and-Play Guidance Module for Hospital Robots. In Proceedings of the 2010 International Conference on Industrial Engineering and Operations Management (IEOM), Dhaka, Bangladesh, 9–10 January 2010. [Google Scholar]
  20. Volkhardt, M.; Gross, H.-M. Finding People in Home Environments with a Mobile Robot. In Proceedings of the 2013 European Conference on Mobile Robots, Barcelona, Spain, 25–27 September 2013; IEEE: Barcelona, Spain, 2013; pp. 282–287. [Google Scholar]
  21. Lau, H.; Huang, S.; Dissanayake, G. Optimal Search for Multiple Targets in a Built Environment. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, AB, Canada, 2–6 August 2005; IEEE: Edmonton, AB, Canada, 2005; pp. 3740–3745. [Google Scholar]
  22. Lin, S.; Nejat, G. Robot Evidence Based Search for a Dynamic User in an Indoor Environment. In Proceedings of the Volume 5B: 42nd Mechanisms and Robotics Conference, Quebec City, QC, Canada, 26 August 2018; American Society of Mechanical Engineers: Quebec City, QC, Canada, 2018; p. V05BT07A029. [Google Scholar]
  23. Mehdi, S.A.; Berns, K. Behavior-Based Search of Human by an Autonomous Indoor Mobile Robot in Simulation. Univers. Access Inf. Soc. 2014, 13, 45–58. [Google Scholar] [CrossRef]
  24. Yu, B.; Kasaei, H.; Cao, M. L3MVN: Leveraging Large Language Models for Visual Target Navigation. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1 October 2023; pp. 3554–3560. [Google Scholar]
  25. Kuang, Y.; Lin, H.; Jiang, M. OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models 2024. arXiv 2024, arXiv:2402.10670. [Google Scholar]
  26. Shah, D.; Osinski, B.; Ichter, B.; Levine, S. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action 2022. arXiv 2022, arXiv:2207.04429. [Google Scholar]
  27. Zhou, G.; Hong, Y.; Wu, Q. NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models 2023. arXiv 2023, arXiv:2305.16986. [Google Scholar]
  28. Rajvanshi, A.; Sahu, P.; Shan, T.; Sikka, K.; Chiu, H.-P. SayCoNav: Utilizing Large Language Models for Adaptive Collaboration in Decentralized Multi-Robot Navigation 2025. arXiv 2025, arXiv:2505.13729. [Google Scholar]
  29. Cai, Y.; He, X.; Wang, M.; Guo, H.; Yau, W.-Y.; Lv, C. CL-CoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models 2025. arXiv 2025, arXiv:2504.09000. [Google Scholar]
  30. Shen, Z.; Luo, H.; Chen, K.; Lv, F.; Li, T. Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration 2025. arXiv 2024, arXiv:2412.18292. [Google Scholar]
  31. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners 2020. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  32. Liu, H.; Ning, R.; Teng, Z.; Liu, J.; Zhou, Q.; Zhang, Y. Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4 2023. arXiv 2023, arXiv:2304.03439. [Google Scholar]
  33. OpenAI GPT-4o. Available online: https://openai.com (accessed on 5 September 2024).
  34. Chen, B.; Xu, Z.; Kirmani, S.; Ichter, B.; Driess, D.; Florence, P.; Sadigh, D.; Guibas, L.; Xia, F. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities 2024. arXiv 2024, arXiv:2401.12168. [Google Scholar]
  35. Hao Tan, A.; Fung, A.; Wang, H.; Nejat, G. Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach. arXiv 2025, arXiv:2502.00114. [Google Scholar]
  36. Lei, X.; Yang, Z.; Chen, X.; Li, P.; Liu, Y. Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models 2024. arXiv 2024, arXiv:2402.12058. [Google Scholar]
  37. Yang, J.; Zhang, H.; Li, F.; Zou, X.; Li, C.; Gao, J. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V 2023. arXiv 2023, arXiv:2310.11441. [Google Scholar]
  38. Cheng, A.-C.; Yin, H.; Fu, Y.; Guo, Q.; Yang, R.; Kautz, J.; Wang, X.; Liu, S. SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model 2024. arXiv 2024, arXiv:2406.01584. [Google Scholar]
  39. Nasiriany, S.; Xia, F.; Yu, W.; Xiao, T.; Liang, J.; Dasgupta, I.; Xie, A.; Driess, D.; Wahid, A.; Xu, Z.; et al. PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs 2024. arXiv 2024, arXiv:2402.07872. [Google Scholar]
  40. Tang, W.; Sun, Y.; Gu, Q.; Li, Z. Visual Position Prompt for MLLM Based Visual Grounding 2025. arXiv 2025, arXiv:2503.15426. [Google Scholar]
  41. Liu, D.; Wang, C.; Gao, P.; Zhang, R.; Ma, X.; Meng, Y.; Wang, Z. 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o 2025. arXiv 2025, arXiv:2503.13185. [Google Scholar]
  42. Stuede, M.; Lerche, T.; Petersen, M.A.; Spindeldreier, S. Behavior-Tree-Based Person Search for Symbiotic Autonomous Mobile Robot Tasks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 2414–2420. [Google Scholar]
  43. Ren, T.; Liu, S.; Zeng, A.; Lin, J.; Li, K.; Cao, H.; Chen, J.; Huang, X.; Chen, Y.; Yan, F.; et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks 2024. arXiv 2024, arXiv:2401.14159. [Google Scholar]
  44. Chaplot, D.S.; Gandhi, D.P.; Gupta, A.; Salakhutdinov, R.R. Object Goal Navigation Using Goal-Oriented Semantic Exploration. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 4247–4258. [Google Scholar]
  45. Rebello, J.; Fung, A.; Waslander, S.L. AC/DCC: Accurate Calibration of Dynamic Camera Clusters for Visual SLAM. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6035–6041. [Google Scholar]
  46. Grisetti, G.; Stachniss, C.; Burgard, W. Improved Techniques for Grid Mapping with Rao-Blackwellized Particle Filters. IEEE Trans. Robot. 2007, 23, 34–46. [Google Scholar] [CrossRef]
  47. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100–108. [Google Scholar] [CrossRef]
  48. Bresenham, J.E. Algorithm for Computer Control of a Digital Plotter. IBM Syst. J. 1965, 4, 25–30. [Google Scholar] [CrossRef]
  49. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar]
  50. Thanasi-Boçe, M.; Hoxha, J. From Ideas to Ventures: Building Entrepreneurship Knowledge with LLM, Prompt Engineering, and Conversational Agents. Educ. Inf. Technol. 2024, 29, 24309–24365. [Google Scholar] [CrossRef]
  51. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 2023. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  52. Hang, C.N.; Tan, C.W.; Yu, P.-D. MCQGen: A Large Language Model-Driven MCQ Generator for Personalized Learning. IEEE Access 2024, 12, 102261–102273. [Google Scholar] [CrossRef]
  53. Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
  54. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  55. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection 2023. arXiv 2023, arXiv:2303.05499. [Google Scholar]
  56. Rösmann, C.; Feiten, W.; Woesch, T.; Hoffmann, F.; Bertram, T. Trajectory Modification Considering Dynamic Constraints of Autonomous Robots. In Proceedings of the 7th German Conference on Robotics (ROBOTIK 2012), Munich, Germany, 15–16 May 2012; pp. 1–6. [Google Scholar]
  57. Dellaert, F.; Fox, D.; Burgard, W.; Thrun, S. Monte Carlo Localization for Mobile Robots. In Proceedings of the Proceedings 1999 IEEE International Conference on Robotics and Automation, Detroit, MI, USA, 10–15 May 1999; Volume 2, pp. 1322–1328. [Google Scholar]
  58. AWS Robomaker: Amazon Cloud Robotics Platform. Available online: https://aws.amazon.com/robomaker (accessed on 20 May 2024).
  59. Helbing, D.; Molnar, P. Social Force Model for Pedestrian Dynamics. Phys. Rev. E 1995, 51, 4282–4286. [Google Scholar] [CrossRef]
  60. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts 2023. arXiv 2023, arXiv:2307.03172. [Google Scholar]
Figure 1. Proposed MLLM-Search architecture for person search.
Figure 2. (a) Semantic map and (b) waypoint map of a hospital environment.
Figure 3. Text and visual prompt of the MLLM Region Planner.
Figure 4. Text and visual prompt of the MLLM waypoint planner.
Figure 5. Three-dimensional gazebo simulation environment of (a) a hospital, and (b) an office.
Figure 6. (a) Mobile robot used in the real-world experiments for the food delivery robot scenario; (b,c) images of the floor of a university building.
Figure 7. The waypoint map of the experiment with the search path of the robot (in red), and the path of the student (in green).
Table 1. Simulation results.
Each cell reports SR / SPL / ST (min). The first seven rows are the ablation methods and the last three are the comparison methods.

Hospital environment:

| Method | Complete Schedule | Shifted Schedule | Partial/Incomplete Schedule | No Schedule |
| --- | --- | --- | --- | --- |
| MLLM-Search | 1.00 / 0.55 / 8.1 | 1.00 / 0.45 / 10.4 | 0.90 / 0.47 / 12.3 | 0.80 / 0.44 / 15.6 |
| MLLM-Search w/ SS | 0.80 / 0.40 / 15.2 | 0.70 / 0.30 / 22.6 | 0.70 / 0.27 / 25.5 | 0.40 / 0.21 / 32.8 |
| MLLM-Search w/o $s_l$ | 0.80 / 0.44 / 12.5 | 0.70 / 0.33 / 18.3 | 0.60 / 0.35 / 15.2 | 0.50 / 0.31 / 22.4 |
| MLLM-Search w/o $s_p$ | 0.90 / 0.38 / 18.3 | 0.60 / 0.26 / 26.7 | 0.70 / 0.30 / 23.1 | 0.50 / 0.30 / 24.9 |
| MLLM-Search w/o $s_r$ | 0.80 / 0.35 / 25.8 | 0.40 / 0.25 / 30.5 | 0.40 / 0.22 / 29.7 | 0.30 / 0.19 / 35.7 |
| MLLM-Search w/o $\Sigma s$ | 0.60 / 0.25 / 28.1 | 0.30 / 0.15 / 32.9 | 0.40 / 0.16 / 34.7 | 0.30 / 0.17 / 37.1 |
| MLLM-Search w/o SCoT | 1.00 / 0.51 / 21.2 | 1.00 / 0.43 / 28.2 | 0.90 / 0.45 / 28.8 | 0.70 / 0.43 / 35.3 |
| MDP-based Planner | 0.80 / 0.31 / 26.3 | 0.40 / 0.22 / 32.5 | 0.40 / 0.26 / 28.4 | 0.30 / 0.15 / 38.0 |
| HMM-based Planner | 0.80 / 0.34 / 18.7 | 0.60 / 0.23 / 28.4 | 0.70 / 0.35 / 22.2 | 0.50 / 0.20 / 30.5 |
| Random Walk | 0.30 / 0.14 / 36.3 | 0.20 / 0.08 / 44.8 | 0.20 / 0.11 / 43.4 | 0.20 / 0.12 / 39.1 |

Office environment:

| Method | Complete Schedule | Shifted Schedule | Partial/Incomplete Schedule | No Schedule |
| --- | --- | --- | --- | --- |
| MLLM-Search | 1.00 / 0.54 / 9.3 | 1.00 / 0.51 / 11.2 | 1.00 / 0.48 / 13.4 | 0.80 / 0.50 / 14.1 |
| MLLM-Search w/ SS | 0.80 / 0.35 / 18.3 | 0.70 / 0.31 / 24.0 | 0.70 / 0.28 / 27.2 | 0.50 / 0.22 / 29.3 |
| MLLM-Search w/o $s_l$ | 0.90 / 0.38 / 16.6 | 0.80 / 0.36 / 21.1 | 0.70 / 0.37 / 19.3 | 0.50 / 0.32 / 22.9 |
| MLLM-Search w/o $s_p$ | 0.90 / 0.32 / 20.1 | 0.70 / 0.28 / 29.3 | 0.70 / 0.29 / 25.7 | 0.40 / 0.26 / 27.8 |
| MLLM-Search w/o $s_r$ | 0.80 / 0.27 / 22.4 | 0.50 / 0.23 / 31.2 | 0.60 / 0.23 / 33.2 | 0.40 / 0.21 / 34.8 |
| MLLM-Search w/o $\Sigma s$ | 0.60 / 0.25 / 25.7 | 0.50 / 0.22 / 36.0 | 0.50 / 0.15 / 37.7 | 0.30 / 0.18 / 35.3 |
| MLLM-Search w/o SCoT | 1.00 / 0.50 / 20.4 | 1.00 / 0.47 / 25.6 | 1.00 / 0.46 / 32.1 | 0.70 / 0.46 / 36.8 |
| MDP-based Planner | 0.80 / 0.24 / 23.4 | 0.50 / 0.21 / 33.1 | 0.60 / 0.22 / 32.7 | 0.40 / 0.20 / 36.5 |
| HMM-based Planner | 0.90 / 0.28 / 20.9 | 0.70 / 0.26 / 31.2 | 0.70 / 0.28 / 25.3 | 0.40 / 0.23 / 32.2 |
| Random Walk | 0.30 / 0.11 / 40.7 | 0.40 / 0.16 / 35.2 | 0.50 / 0.17 / 33.4 | 0.30 / 0.10 / 42.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Fung, A.; Tan, A.H.; Wang, H.; Benhabib, B.; Nejat, G. MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models. Robotics 2025, 14, 102. https://doi.org/10.3390/robotics14080102

AMA Style

Fung A, Tan AH, Wang H, Benhabib B, Nejat G. MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models. Robotics. 2025; 14(8):102. https://doi.org/10.3390/robotics14080102

Chicago/Turabian Style

Fung, Angus, Aaron Hao Tan, Haitong Wang, Bensiyon Benhabib, and Goldie Nejat. 2025. "MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models" Robotics 14, no. 8: 102. https://doi.org/10.3390/robotics14080102

APA Style

Fung, A., Tan, A. H., Wang, H., Benhabib, B., & Nejat, G. (2025). MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models. Robotics, 14(8), 102. https://doi.org/10.3390/robotics14080102
