Voice and face are two of the most popular biometrics for person verification, widely used in speaker verification and face verification tasks. With the successful application of deep learning, single-modal person verification systems based on voice or face alone have achieved remarkable performance. It has also been observed that each modality has its own advantages, and that combining the two types of information can lead to a more powerful and robust person verification system. In this work, to fully explore audio-visual multi-modal learning strategies for person verification, we design and propose three types of audio-visual deep neural networks (AVNs): a feature-level AVN (AVN-F), an embedding-level AVN (AVN-E), and an embedding-level AVN with joint learning (AVN-J). To further enhance robustness in realistic noisy conditions, where both modalities may not be available at high quality, we propose a data augmentation strategy for each AVN: a feature-level multi-modal data augmentation for AVN-F, and an embedding-level data augmentation with a novel noise distribution matching scheme for AVN-E; for AVN-J, both the feature-level and embedding-level multi-modal augmentation methods can be applied. All proposed models are trained on the VoxCeleb2 dev set and evaluated on the standard VoxCeleb1 dataset. The best system achieves 0.558%, 0.441%, and 0.793% EER on the three official VoxCeleb1 trial lists, which are, to our knowledge, the best published single-system results on this corpus for person verification. To validate the robustness of the proposed approaches, we construct a noisy evaluation set based on VoxCeleb1; experimental results show that the proposed systems significantly improve robustness and maintain promising person verification performance under this noisy scenario.
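As a rough illustration of the embedding-level fusion idea mentioned above, the following is a minimal PyTorch sketch: two precomputed single-modal embeddings are concatenated and projected into a joint person embedding, then scored with cosine similarity. This is a generic sketch only; the class name, layer sizes, and fusion network here are assumptions for illustration and not the exact AVN-E architecture.

```python
# Minimal, generic sketch of embedding-level audio-visual fusion.
# All dimensions and the fusion network are illustrative assumptions,
# not the paper's exact AVN-E design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingFusion(nn.Module):
    """Fuse a precomputed speaker embedding and a face embedding
    into a single joint person embedding."""
    def __init__(self, voice_dim=256, face_dim=512, joint_dim=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(voice_dim + face_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, voice_emb, face_emb):
        # Concatenate the two single-modal embeddings and project
        # them into a shared joint space.
        return self.fusion(torch.cat([voice_emb, face_emb], dim=-1))

# Verification: cosine similarity between enrollment and test
# joint embeddings (random tensors stand in for real embeddings).
model = EmbeddingFusion()
enroll = model(torch.randn(1, 256), torch.randn(1, 512))
test = model(torch.randn(1, 256), torch.randn(1, 512))
score = F.cosine_similarity(enroll, test).item()
print(f"verification score: {score:.4f}")
```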