Multimodal Conversational AI: A Review of Integration Techniques, Applications, and Future Directions
Abstract
The integration of multimodal inputs, including text, voice, and visual data, into conversational artificial intelligence (AI) systems marks a significant shift toward more natural and effective human–computer interaction. This narrative synthesis review examines recent research on the technological foundations, applications, challenges, and future directions of multimodal conversational AI. Drawing on prior studies, the review analyzes key frameworks and models, including Situated and Interactive Multimodal Conversations (SIMMC) and DialogueTRM, which employ multimodal fusion to support emotion recognition and context-aware interaction. The synthesis indicates that combining multiple modalities enhances system accuracy, strengthens user engagement, and enables richer contextual understanding in conversational settings. At the same time, the review identifies major challenges related to data synchronization, privacy protection, computational complexity, and bias mitigation. Based on these findings, the study highlights the need for future research on adaptive fusion techniques, cross-cultural usability, ethical AI development, and the incorporation of emerging modalities such as haptic and physiological data. This review contributes to the growing scholarship on conversational AI by providing an integrated understanding of the opportunities and limitations of multimodal systems and by outlining directions for the development of more responsive, inclusive, and ethically grounded AI interactions.

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
References
Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
Cassell, J. (2001). Embodied conversational agents: Representation and intelligence in user interfaces. AI Magazine, 22(4), 67–83. https://doi.org/10.1609/aimag.v22i4.1593
Feldman, S., Yalcin, O. N., & DiPaola, S. (2017). Engagement with artificial intelligence through natural interaction models. In Electronic Visualisation and the Arts (EVA 2017) (pp. 296–303). BCS Learning & Development Ltd. https://doi.org/10.14236/ewic/EVA2017.60
Khan Mohd, T., Nguyen, N., & Javaid, A. Y. (2022). Multi-modal data fusion in enhancing human-machine interaction for robotic applications: A survey. arXiv. https://arxiv.org/abs/2202.07732
Korbar, B., Tran, D., & Torresani, L. (2018). Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems, 31, 7763–7774. https://papers.nips.cc/paper_files/paper/2018/hash/c4616f5a24a66668f11ca4fa80525dc4-Abstract.html
Liang, P. P., Zadeh, A., & Morency, L.-P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), Article 264, 1–42. https://doi.org/10.1145/3656580
Mao, Y., Sun, Q., Liu, G., Wang, X., Gao, W., Li, X., & Shen, J. (2020). DialogueTRM: Exploring the intra- and inter-modal emotional behaviors in the conversation. arXiv. https://arxiv.org/abs/2010.07637
McTear, M. F. (2017). The rise of the conversational interface: A new kid on the block? In J. F. Quesada, F. J. Martín Mateos, & T. López-Soto (Eds.), Future and emerging trends in language technology: Machine learning and big data (pp. 38–49). Springer. https://doi.org/10.1007/978-3-319-69365-1_3
Moon, S., Kottur, S., Crook, P., De, A., Poddar, S., Levin, T., Whitney, D., Difranco, D., Beirami, A., Cho, E., Subba, R., & Geramifard, A. (2020). Situated and interactive multimodal conversations. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 1103–1121). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.96
Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), 74–81. https://doi.org/10.1145/319382.319398
Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150
Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214