Multimodal Classification of Safety-Report Observations
Round 1
Reviewer 1 Report
The paper introduces a multimodal machine learning architecture for analyzing and categorizing safety observations, given textual descriptions and images taken at the inspection sites. Furthermore, it presents a novel dataset for classifying safety-related observations called Safety4All, a collection of textual and visual observations with associated metadata gathered during on-premise safety inspections in real-world businesses. The manuscript is clear and well structured. However, this reviewer has some minor concerns, listed below, that should be addressed to improve the quality of the work:
- State the paper's contributions clearly in the Introduction.
- Details about the visual encoders used and their training process are lacking.
- What is the size of the images in the proposed database?
- Since the image preprocessing performs a random crop, the images resulting from the augmentation process may show a section of the original image that is not associated with the observation; how is this handled in the procedure?
- For priority classification, there is some class imbalance; how do the authors address this issue? Might a weighted scheme improve the obtained results?
- If there are no Acknowledgments, remove the section's title.
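To illustrate the weighted scheme raised in the class-imbalance comment above, the following is a minimal sketch and not the authors' method. It assumes inverse-frequency class weights (the same heuristic as scikit-learn's "balanced" class weighting) applied to a cross-entropy loss; the label names and counts are hypothetical and are not taken from Safety4All.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights.

    probs is a list of {class: probability} dicts, one per sample.
    """
    total = 0.0
    norm = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm


# Hypothetical, imbalanced priority labels (illustration only).
labels = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1
weights = inverse_frequency_weights(labels)

# A model that always favours the majority class is penalised more
# heavily on the rare class once the weights are applied.
probs = [
    {"low": 0.9, "medium": 0.05, "high": 0.05},
    {"low": 0.9, "medium": 0.05, "high": 0.05},
]
batch_labels = ["low", "high"]
uniform = {"low": 1.0, "medium": 1.0, "high": 1.0}
plain_loss = weighted_cross_entropy(probs, batch_labels, uniform)
balanced_loss = weighted_cross_entropy(probs, batch_labels, weights)
```

With such weights, errors on the rare priority class contribute more to the loss than errors on the majority class; whether this improves the reported results would have to be verified experimentally.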
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
Suggestion for corrections/improvement:
-Line 79: Be careful when making such a claim ... "first work". It may be an overclaim.
-Line 81: Reference [18] is incomplete, see reference listing (page 25). No page number.
-Too many sections. Sections 7, 8 and 9 can be combined as one section, and renamed as "Results", with subsections for Tasks 1, 2 and 3.
-Sections 10 and 11 can be combined and renamed as "Conclusions".
-Table 6 should appear after citing it in the text.
-Is accuracy given in "%"?
-Line 560, close to 50% or above 60%?
-Line 164 - 165: It would be good to include a brief comment on the findings from Table 1, such as trend, direction, etc.
-Table 1 (page 5): Header for column 1 (Work), can be renamed as "Reference Number".
-It would be nice if the information from Sections 2.2, 2.3 and perhaps Section 2.4 could be summarised/condensed into a table comparing the strengths and weaknesses of existing models and learning techniques.
-Line 231 - 232: Provide justification.
-Page 8, Table 2: The caption is too long. Check the journal style to see whether this is acceptable. The same applies to Figures 4, 7, etc.
-Page 9, Table 3: Each column should have a header.
-Line 344: Figure 3 should appear before mentioning Figure 4.
-All axes of the graphs should have a label or unit (if applicable).
-Figure 7: What is the difference between the thick arrow and thin arrow?
-Figure 7: To improve clarity, it is better to include notations (as described in the text) on the diagram.
-Line 455: Should be part of the earlier paragraph?
-Line 517: Briefly explain how the concatenation was done, or provide a reference.
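For context on the concatenation comment above, one common reading is late fusion, in which the text and image embeddings are joined end to end and passed to a classification head. The sketch below assumes that reading; the manuscript may do something different. The vectors, weights, and function names are hypothetical.

```python
def concat_fusion(text_vec, image_vec):
    """Late fusion by concatenation: the fused vector is [text ; image]."""
    return list(text_vec) + list(image_vec)


def linear_head(features, weight_rows, bias):
    """One logit per class: dot(weight_row, features) + bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight_rows, bias)
    ]


# Hypothetical embeddings (illustration only).
text_vec = [0.2, -0.1, 0.4]   # e.g. output of a text encoder
image_vec = [0.7, 0.3]        # e.g. output of a visual encoder

fused = concat_fusion(text_vec, image_vec)

# Hypothetical 2-class head over the 5-dimensional fused vector.
weight_rows = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
logits = linear_head(fused, weight_rows, [0.0, 0.0])
```

The fused dimension is the sum of the two embedding sizes, so the head's input size must match it.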
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 3 Report
This article describes a method for jointly fine-tuning large language and image neural network models. The authors suggest using a combined task and contrastive loss to align text and vision in a multimodal space. The contrastive loss maintains intramodality representation distances, so vision and language representations of similar data are close in the multimodal space. They analyze the proposed model on three tasks: prioritization, assessment, and categorization of input observations. Their investigations suggest that inspection scene photographs and textual descriptions are complementary. The joint contrastive loss creates strong multimodal representations and outperforms simple late fusion across tasks. They also train and release an ELECTRA-based Greek transformer language model.
The basic purpose of the research has been described satisfactorily, but some crucial comments should be taken into consideration.
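As a rough illustration of a combined task and contrastive objective, the sketch below assumes a symmetric InfoNCE-style loss over paired text and image embeddings, added to a task loss with a weighting factor. This is a swapped-in, generic formulation; the manuscript's exact loss is not reproduced in this report, and `temperature`, `alpha`, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(text_embs, image_embs):
    """Cosine similarity of every text embedding with every image embedding."""
    t = [l2_normalize(v) for v in text_embs]
    i = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(tv, iv)) for iv in i] for tv in t]


def cross_entropy_rows(logits):
    """Mean of -log softmax(row)[k], where the target of row k is column k."""
    loss = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[k]
    return loss / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Average of the text-to-image and image-to-text contrastive terms."""
    sims = cosine_matrix(text_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


def combined_loss(task_loss, contrastive, alpha=0.5):
    """Task loss plus a weighted contrastive term (alpha is an assumption)."""
    return task_loss + alpha * contrastive


# Hypothetical batch of two text/image pairs (illustration only).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(text_embs, aligned_images)
swapped_loss = symmetric_contrastive_loss(text_embs, swapped_images)
```

In this toy batch the contrastive term is small when each text embedding matches its own image and large when the pairs are swapped, which is the behaviour that pulls matching text and image representations together.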
1. At the end of Section I, INTRODUCTION, it would be preferable to add a paragraph presenting the contributions of this work: "The main contributions in this study are: ......"
2. Figures 2, 3, and 6 are unclear; the authors should replace them with higher-quality images and increase the font size of the text on the x-axis and y-axis.
3. For Tables 1, 2, 3, 4, and 5, the caption must be placed above the table, not below.
4. In Table 3, we advise the authors to add a suitable header field for the columns.
5. In Table 7, we suggest merging the rows that contain common information, such as “Construction”, “Office”, “Store”, and “Warehouse”, to make the table easier to read. The same applies to Table 9 for “Issue Source” and “Category”.
Author Response
Please see the attachment
Author Response File: Author Response.pdf

