As a conversational intelligence task, visual dialog entails answering a series of questions grounded in an image, using the dialog history as context. To generate correct answers, the comprehension of the semantic dependencies among implicit visual and textual contents is critical. Prior works usually ignored the underlying relation and failed to infer it reasonably. In this paper, we propose a Visual-Textual Alignment for Graph Inference (VTAGI) network. Compared with other approaches, it makes up the lack of structural inference in visual dialog. The whole system consists of two modules, Visual and Textual Alignment (VTA) and Visual Graph Attended by Text (VGAT). Specially, the VTA module aims at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. The VGAT module views the visual features with semantic information as observed nodes and each node learns the relationship with others in visual graph. We also qualitatively and quantitatively evaluate the model on VisDial v1.0 dataset, showing our VTAGI outperforms previous state-of-the-art models.