Multimodal Conversational AI: A Review of Integration Techniques, Applications, and Future Directions

Herbert Wanga

Abstract

The integration of multimodal inputs, including text, voice, and visual data, into conversational artificial intelligence (AI) systems marks a significant shift toward more natural and effective human–computer interaction. This narrative synthesis review examines recent research on the technological foundations, applications, challenges, and future directions of multimodal conversational AI. Drawing on prior studies, the review analyzes key frameworks and models, including Situated Interactive MultiModal Conversations (SIMMC) and DialogueTRM, which employ multimodal fusion to support emotion recognition and context-aware interaction. The synthesis indicates that combining multiple modalities enhances system accuracy, strengthens user engagement, and enables richer contextual understanding in conversational settings. At the same time, the review identifies major challenges related to data synchronization, privacy protection, computational complexity, and bias mitigation. Based on these findings, the study highlights the need for future research on adaptive fusion techniques, cross-cultural usability, ethical AI development, and the incorporation of emerging modalities such as haptic and physiological data. This review contributes to the growing scholarship on conversational AI by providing an integrated understanding of the opportunities and limitations of multimodal systems and by outlining directions for the development of more responsive, inclusive, and ethically grounded AI interactions.

Article Details

How to Cite
Wanga, H. (2026). Multimodal Conversational AI: A Review of Integration Techniques, Applications, and Future Directions. Asian Journal of Science, Technology, Engineering, and Art, 4(2), 132-138. https://doi.org/10.58578/ajstea.v4i2.8617

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607

Cassell, J. (2001). Embodied conversational agents: Representation and intelligence in user interfaces. AI Magazine, 22(4), 67–83. https://doi.org/10.1609/aimag.v22i4.1593

Feldman, S., Yalcin, O. N., & DiPaola, S. (2017). Engagement with artificial intelligence through natural interaction models. In Electronic Visualisation and the Arts (EVA 2017) (pp. 296–303). BCS Learning & Development Ltd. https://doi.org/10.14236/ewic/EVA2017.60

Khan Mohd, T., Nguyen, N., & Javaid, A. Y. (2022). Multi-modal data fusion in enhancing human-machine interaction for robotic applications: A survey. arXiv. https://arxiv.org/abs/2202.07732

Korbar, B., Tran, D., & Torresani, L. (2018). Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems, 31, 7763–7774. https://papers.nips.cc/paper_files/paper/2018/hash/c4616f5a24a66668f11ca4fa80525dc4-Abstract.html

Liang, P. P., Zadeh, A., & Morency, L.-P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10), Article 264, 1–42. https://doi.org/10.1145/3656580

Mao, Y., Sun, Q., Liu, G., Wang, X., Gao, W., Li, X., & Shen, J. (2020). DialogueTRM: Exploring the intra- and inter-modal emotional behaviors in the conversation. arXiv. https://arxiv.org/abs/2010.07637

McTear, M. F. (2017). The rise of the conversational interface: A new kid on the block? In J. F. Quesada, F. J. Martín Mateos, & T. López-Soto (Eds.), Future and emerging trends in language technology: Machine learning and big data (pp. 38–49). Springer. https://doi.org/10.1007/978-3-319-69365-1_3

Moon, S., Kottur, S., Crook, P., De, A., Poddar, S., Levin, T., Whitney, D., Difranco, D., Beirami, A., Cho, E., Subba, R., & Geramifard, A. (2020). Situated and interactive multimodal conversations. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 1103–1121). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.96

Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), 74–81. https://doi.org/10.1145/319382.319398

Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150

Schwartz, I., Yu, S., Hazan, T., & Schwing, A. G. (2019). Factor graph attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2039–2048). IEEE. https://doi.org/10.1109/CVPR.2019.00214

