Leveraging Multimodal Perspectives To Learn Common Sense For Vision And Language Tasks