I lead the Multimodal AI team at Qualcomm AI Research. Previously, I was a Professor and ARC DECRA Fellow at Monash University. My research spans computer vision, generative AI, and multimodal learning, bridging fundamental research and real-world applications.
PhD in Computer Science, 2015
The University of Western Australia
Master's in Space Science, 2011
Luleå Tekniska Universitet
BSc in Engineering, 2009
National University of Sciences & Technology
Generating images that contain multiple humans performing complex actions, while preserving each person's facial identity, is a significant challenge. MultiHuman-Testbench is a novel benchmark comprising 1,800 samples with carefully curated text prompts matched to 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. The benchmark enables comprehensive evaluation of multi-human image generation models.
CustomKD proposes a novel knowledge distillation approach that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). CustomKD customizes the well-generalized features inherent in LVFMs to a given student model to reduce model discrepancies, achieving state-of-the-art performance in unsupervised domain adaptation and semi-supervised learning scenarios.
This paper proposes a self-supervised approach to learn universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available unannotated, web-crawled facial videos.
We propose Restormer, an efficient Transformer architecture for high-resolution image restoration that captures long-range pixel interactions while remaining computationally tractable. Our model achieves state-of-the-art results across multiple restoration tasks including deraining, motion deblurring, defocus deblurring, and image denoising.
We address domain generalized semantic segmentation through Semantic-Aware Normalization (SAN) and Semantic-Aware Whitening (SAW) modules. Our framework promotes both intra-category compactness and inter-category separability, achieving significant improvements over state-of-the-art methods on widely-used datasets including GTAV, SYNTHIA, and Cityscapes.
Active learning (AL) is a promising ML paradigm with the potential to sift through large pools of unlabeled data and reduce annotation costs in domains where labeling is prohibitively expensive. Recently proposed neural-network-based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty-based, diversity-based, and committee-based) produce inconsistent gains over a random-sampling baseline.
Given a degraded input image, image restoration aims to recover the missing high-quality image content. Numerous applications demand effective image restoration, e.g., computational photography, surveillance, autonomous vehicles, and remote sensing. Significant advances in image restoration have been made in recent years, dominated by convolutional neural networks (CNNs). The widely-used CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations.
We systematically study the robustness properties of Vision Transformers, revealing their remarkable resilience to occlusions, perturbations, and domain shifts. ViTs demonstrate significantly less texture bias than CNNs, achieve human-level shape recognition capabilities, and enable accurate semantic segmentation without pixel-level supervision through flexible self-attention mechanisms.
We propose a novel generative approach for highly transferable targeted adversarial perturbations. Unlike existing methods that rely on class-boundary information, our approach matches the perturbed image distribution with the target class by aligning both global distributions and local neighborhood structures. Our method achieves 4x higher target transferability than previous best generative attacks and 16x better than instance-specific iterative attacks on ImageNet.
We propose Orthogonal Projection Loss (OPL), a novel loss function that enforces inter-class separation and intra-class clustering through orthogonality constraints in the feature space. Unlike standard cross-entropy loss, OPL explicitly separates class features while requiring no additional learnable parameters. We demonstrate OPL's effectiveness across diverse tasks including image recognition, domain generalization, and few-shot learning, with improved robustness against adversarial attacks and label noise.
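The core idea can be sketched numerically. The following is a minimal, assumption-laden NumPy sketch (not the paper's exact formulation): it computes cosine similarities on L2-normalized features, with one term pulling same-class features together and a second pushing different-class features toward orthogonality; the function name and the `gamma` weighting are illustrative choices.

```python
import numpy as np

def orthogonal_projection_loss(features, labels, gamma=0.5):
    """Sketch of an orthogonality-based auxiliary loss: same-class features
    should have cosine similarity near 1, different-class features near 0."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T  # pairwise cosine similarities
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)  # exclude self-pairs
    diff = 1.0 - same
    np.fill_diagonal(diff, 0.0)
    s_intra = (sim * same).sum() / max(same.sum(), 1.0)
    s_inter = (np.abs(sim) * diff).sum() / max(diff.sum(), 1.0)
    # zero when same-class features coincide and classes are orthogonal
    return (1.0 - s_intra) + gamma * s_inter
```

Because the loss operates purely on feature geometry, it adds no learnable parameters and can be summed with standard cross-entropy during training.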
Image restoration demands a complex balance between preserving spatial details and capturing high-level contextualized information while recovering images. In this paper, we propose a novel synergistic design that can optimally balance these competing goals. Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps.
A comprehensive survey of transformer models in computer vision, covering fundamental concepts of self-attention and self-supervision. We review extensive applications across recognition, generative modeling, multi-modal tasks, video processing, low-level vision, and 3D analysis, providing insights into architectural designs and future research directions.
We present MIRNet, a novel architecture that maintains spatially-precise high-resolution representations while receiving strong contextual information from low-resolution streams. Our multi-scale residual blocks with parallel convolution streams, information exchange, and attention mechanisms achieve state-of-the-art results on image denoising, super-resolution, and enhancement tasks.
We propose a self-supervised adversarial training mechanism in the input space that provides generalizable defense against adversarial attacks. Our plug-and-play approach significantly improves robustness across classification, segmentation, and detection tasks, reducing attack success rates from 82.6% to 31.9%.
We present CycleISP, a framework that models camera imaging pipelines in forward and reverse directions to generate realistic synthetic training data for image denoising. Our approach achieves state-of-the-art performance on real camera benchmarks with 5× fewer parameters than previous methods, and generalizes beyond denoising to tasks like color matching.
We propose RPS-Net, a random path selection algorithm for continual learning that progressively chooses optimal paths for new tasks while encouraging parameter sharing. Integrated with knowledge distillation and a dynamic plasticity controller, our approach surpasses state-of-the-art incremental learning performance while running in constant time.
We propose using deep image restoration networks as a defense against adversarial attacks by bringing off-manifold adversarial samples back onto the natural image manifold. Our approach simultaneously provides robustness, enhances image quality, and maintains performance on clean images without requiring model training, parameter optimization, or adversarial image detection.
We propose a cost-sensitive deep neural network that automatically learns robust feature representations for both majority and minority classes in imbalanced datasets. Our approach jointly optimizes class-dependent costs and network parameters, significantly outperforming baseline algorithms and data sampling techniques without altering the original data distribution.
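As a toy illustration of class-dependent costs, consider the following hedged sketch. It is not the paper's formulation: here the per-class costs are fixed weights scaling a softmax cross-entropy, whereas the proposed approach learns the costs jointly with the network parameters.

```python
import numpy as np

def cost_sensitive_ce(logits, labels, class_costs):
    """Softmax cross-entropy where each sample's loss is scaled by the cost
    of its true class, so errors on high-cost (minority) classes weigh more."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(p[np.arange(len(labels)), labels])
    return float((class_costs[labels] * nll).mean())
```

Raising the cost of a minority class increases its contribution to the loss without resampling, which is the sense in which the original data distribution is left unaltered.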