Multimodal Conversational AI: A Review of Integration Techniques, Applications, and Future Directions

Herbert Wanga

doi:10.58578/ajstea.v4i2.8617

Page Numbers: 132-138

Published: Mar 18, 2026

Copyright:

Authors retain copyright and grant the journal right of first publication.

Herbert Wanga

University of Iringa, Tanzania

Abstract

The integration of multimodal inputs, including text, voice, and visual data, into conversational artificial intelligence (AI) systems marks a significant shift toward more natural and effective human–computer interaction. This narrative synthesis review examines recent research on the technological foundations, applications, challenges, and future directions of multimodal conversational AI. Drawing on prior studies, the review analyzes key frameworks and models, including Situated Interactive MultiModal Conversations (SIMMC) and DialogueTRM, which employ multimodal fusion to support emotion recognition and context-aware interaction. The synthesis indicates that combining multiple modalities enhances system accuracy, strengthens user engagement, and enables richer contextual understanding in conversational settings. At the same time, the review identifies major challenges related to data synchronization, privacy protection, computational complexity, and bias mitigation. Based on these findings, the study highlights the need for future research on adaptive fusion techniques, cross-cultural usability, ethical AI development, and the incorporation of emerging modalities such as haptic and physiological data. This review contributes to the growing scholarship on conversational AI by providing an integrated understanding of the opportunities and limitations of multimodal systems and by outlining directions for the development of more responsive, inclusive, and ethically grounded AI interactions.

Keywords:

Conversational Artificial Intelligence; Human–Computer Interaction; Multimodal Fusion; Multimodal Interaction; Narrative Synthesis Review

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607

Cassell, J. (2001). Embodied conversational agents: Representation and intelligence in user interfaces. AI Magazine, 22(4), 67–83. https://doi.org/10.1609/aimag.v22i4.1593

Feldman, S., Yalcin, O. N., & DiPaola, S. (2017). Engagement with artificial intelligence through natural interaction models. In Electronic Visualisation and the Arts (EVA 2017) (pp. 296–303). BCS Learning & Development Ltd. https://doi.org/10.14236/ewic/EVA2017.60

Khan Mohd, T., Nguyen, N., & Javaid, A. Y. (2022). Multi-modal data fusion in enhancing human-machine interaction for robotic applications: A survey. arXiv. https://arxiv.org/abs/2202.07732

Korbar, B., Tran, D., & Torresani, L. (2018). Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems, 31, 7763–7774. https://papers.nips.cc/paper_files/paper/2018/hash/c4616f5a24a66668f11ca4fa80525dc4-Abstract.html

Liang, P. P., Zadeh, A., & Morency, L.-P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), Article 264, 1–42. https://doi.org/10.1145/3656580

Mao, Y., Sun, Q., Liu, G., Wang, X., Gao, W., Li, X., & Shen, J. (2020). DialogueTRM: Exploring the intra- and inter-modal emotional behaviors in the conversation. arXiv. https://arxiv.org/abs/2010.07637

McTear, M. F. (2017). The rise of the conversational interface: A new kid on the block? In J. F. Quesada, F. J. Martín Mateos, & T. López-Soto (Eds.), Future and emerging trends in language technology: Machine learning and big data (pp. 38–49). Springer. https://doi.org/10.1007/978-3-319-69365-1_3

Moon, S., Kottur, S., Crook, P., De, A., Poddar, S., Levin, T., Whitney, D., Difranco, D., Beirami, A., Cho, E., Subba, R., & Geramifard, A. (2020). Situated and interactive multimodal conversations. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 1103–1121). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.96

Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), 74–81. https://doi.org/10.1145/319382.319398

Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150

Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214

Asian Journal of Science, Technology, Engineering, and Art