Contractor for Google | Research Assistant in IRAB | Kaggle Grandmaster
Hello! I’m a Deep Learning Researcher and Competitor with over 4 years of experience, particularly in Computer Vision, Speech Recognition, and NLP. Currently, I am working as a Contractor for Google, focusing on solving Kaggle Competition problems with multi-backend Keras.
Just like a detective, I find joy in dissecting complex research problems and crafting solutions through code and math. My perseverance has led to the authorship of 6 papers and co-authorship of 2 more. Beyond academia, I thrive in competitions like Kaggle, DrivenData, etc. It’s there that I became one of the youngest Kaggle Grandmasters, peaking at 5th (now 7th) out of 313,000+ participants. I also take pride in contributing to open-source projects on GitHub, such as TensorFlow, Keras, HuggingFace, and YOLOv5 (PyTorch). I have also won the Google OSS Prize 4 times and the Kaggle ML Research Award for my open-source contributions.
Interests
Computer Vision
Generative AI
Vision-Language Models
Audio-Visual Learning
Education
BSc in Electrical & Electronic Engineering (2023) Bangladesh University of Engineering and Technology (BUET), Bangladesh
In computer vision, depth estimation is crucial for domains like robotics, autonomous vehicles, augmented reality, and virtual reality. Integrating semantics with depth enhances scene understanding through reciprocal information sharing. However, the scarcity of semantic information in datasets poses challenges. Existing convolutional approaches with limited local receptive fields hinder the full utilization of the symbiotic potential between depth and semantics. This paper introduces a dataset-invariant semi-supervised strategy to address the scarcity of semantic information. It proposes the Depth Semantics Symbiosis module, leveraging the Symbiotic Transformer to achieve comprehensive mutual awareness through information exchange within both local and global contexts. Additionally, a novel augmentation, NearFarMix, is introduced to combat overfitting and benefit both the depth and semantics tasks by strategically merging regions from two images, generating diverse and structurally consistent samples with enhanced control. Extensive experiments on the NYU-Depth-V2 and KITTI datasets demonstrate the superiority of the proposed techniques in indoor and outdoor environments.
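The region-merging idea behind NearFarMix can be pictured as splitting a scene by depth and combining the near part of one sample with the far part of another. The sketch below is only an illustration under assumed names and a simple depth-threshold rule; the actual NearFarMix uses a more controlled merging strategy described in the paper.

```python
import numpy as np

def near_far_mix(img_a, depth_a, img_b, depth_b, threshold=None):
    """Illustrative depth-based mix: take 'near' pixels (small depth)
    from sample A and 'far' pixels from sample B. The real NearFarMix
    applies a more controlled region-merging strategy."""
    if threshold is None:
        threshold = np.median(depth_a)      # assumed split point
    near_mask = depth_a < threshold         # H x W boolean mask
    mixed_img = np.where(near_mask[..., None], img_a, img_b)
    mixed_depth = np.where(near_mask, depth_a, depth_b)
    return mixed_img, mixed_depth

# usage: two 4x4 RGB images; A is near on its left half
img_a = np.zeros((4, 4, 3))
depth_a = np.tile(np.array([1.0, 1.0, 5.0, 5.0]), (4, 1))
img_b = np.ones((4, 4, 3))
depth_b = np.full((4, 4), 5.0)
mixed_img, mixed_depth = near_far_mix(img_a, depth_a, img_b, depth_b, threshold=2.0)
```

Note that the same boolean mask is applied to both the image and its depth map, which is what keeps the mixed sample structurally consistent across the two tasks.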
2023
DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation
Depth estimation from a single image is of paramount importance in the realm of computer vision, with a multitude of applications. Conventional methods suffer from a trade-off between consistency and fine-grained details due to their local receptive fields, limiting their practicality. This lack of long-range dependency inherently comes from the convolutional neural network part of the architecture. In this paper, a dual window transformer-based network, namely DwinFormer, is proposed, which utilizes both local and global features for end-to-end monocular depth estimation. The DwinFormer consists of dual window self-attention and cross-attention transformers, Dwin-SAT and Dwin-CAT, respectively. The Dwin-SAT seamlessly extracts intricate, locally aware features while concurrently capturing global context. It harnesses the power of local and global window attention to adeptly capture both short-range and long-range dependencies, obviating the need for complex and computationally expensive operations, such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases which provide desirable properties, such as translational equivariance and less dependence on large-scale data. Furthermore, conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, the Dwin-CAT employs both local and global window cross-attention to seamlessly fuse encoder and decoder features with both fine-grained local and contextually aware global information, effectively amending the semantic gap. Empirical evidence obtained through extensive experimentation on the NYU-Depth-V2 and KITTI datasets demonstrates the superiority of the proposed method, which consistently outperforms existing approaches across both indoor and outdoor environments.
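As a rough illustration of the non-overlapping local windows on which window attention (as in Dwin-SAT) operates, the partition step can be written with plain array reshapes. The function name and shapes here are assumptions for illustration, not the paper's code.

```python
import numpy as np

def window_partition(feat, window):
    """Split an (H, W, C) feature map into non-overlapping local
    windows of shape (window, window, C), the pixel groups over which
    local window attention computes its attention scores."""
    H, W, C = feat.shape
    assert H % window == 0 and W % window == 0
    x = feat.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)           # bring window axes together
    return x.reshape(-1, window, window, C)  # (num_windows, w, w, C)

# usage: a 4x4 single-channel map split into four 2x2 windows
feat = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
wins = window_partition(feat, 2)
print(wins.shape)  # (4, 2, 2, 1)
```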
ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for Generalizable and Robust Synthetic Image Detection
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, and 2 more authors
Synthetic image generation has opened up new opportunities but has also created threats with regard to privacy, authenticity, and security. Detecting fake images is of paramount importance to prevent illegal activities, and previous research has shown that generative models leave unique patterns in their synthetic images that can be exploited to detect them. However, the fundamental problem of generalization remains, as even state-of-the-art detectors encounter difficulty when facing generators never seen during training. To assess the generalizability and robustness of synthetic image detectors in the face of real-world impairments, this paper presents a large-scale dataset named ArtiFact, comprising diverse generators, object categories, and real-world challenges. Moreover, the proposed multi-class classification scheme, combined with a filter stride reduction strategy, addresses social platform impairments and effectively detects synthetic images from both seen and unseen generators. The proposed solution outperforms other teams by 8.34% on Test 1, 1.26% on Test 2, and 15.08% on Test 3 in the IEEE VIP CUP at ICIP 2022.
CIFF-Net: Contextual Image Feature Fusion for Melanoma Diagnosis
Md Awsafur Rahman, Bishmoy Paul, Tanvir Mahmud, and 1 more author
Melanoma is considered the deadliest variant of skin cancer, causing around 75% of total skin cancer deaths. To diagnose melanoma, clinicians assess and compare multiple skin lesions of the same patient concurrently to gather contextual information regarding the patterns and abnormality of the skin. So far, this concurrent multi-image comparative method has not been explored by existing deep learning-based schemes. In this paper, based on contextual image feature fusion (CIFF), a deep neural network (CIFF-Net) is proposed, which integrates patient-level contextual information into the traditional approaches for improved melanoma diagnosis via the concurrent multi-image comparative method. The proposed multi-kernel self-attention (MKSA) module offers better generalization of the extracted features by introducing multi-kernel operations into the self-attention mechanism. To utilize both self-attention and contextual feature-wise attention, an attention-guided module named contextual feature fusion (CFF) is proposed that integrates extracted features from different contextual images into a single feature vector. Finally, in the comparative contextual feature fusion (CCFF) module, primary and contextual features are compared concurrently to generate comparative features. Significant improvement in performance has been achieved on the ISIC-2020 dataset over the traditional approaches, which validates the effectiveness of the proposed contextual learning scheme.
Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, and 3 more authors
With the huge technological advances introduced by deep learning in audio and speech processing, many novel synthetic speech techniques have achieved incredibly realistic results. As these methods generate realistic fake human voices, they can be used in malicious acts such as impersonation, fake news spreading, spoofing, media manipulation, etc. Hence, the ability to distinguish synthetic from natural speech has become an urgent necessity. Moreover, being able to tell which algorithm was used to generate a synthetic speech track can be of preeminent importance in tracking down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it among five known and unknown algorithms, utilizing semi-supervision and ensembling to significantly improve its robustness and generalizability. The proposed detector is validated on two evaluation datasets consisting of a total of 18,000 weakly perturbed (Eval 1) and 10,000 strongly perturbed (Eval 2) synthetic speech tracks. The proposed method outperforms the other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1 in the IEEE SP Cup challenge at ICASSP 2022.
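The open-set part of such a pipeline, routing low-confidence ensemble predictions to an "unknown generator" class, can be sketched as follows. The probability values and the 0.5 threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def attribute_with_unknown(member_probs, threshold=0.5):
    """Average the softmax outputs of ensemble members over the known
    generator classes; if the best averaged score falls below
    `threshold`, attribute the track to an unknown generator (-1)."""
    avg = np.mean(member_probs, axis=0)   # (num_known_classes,)
    best = int(np.argmax(avg))
    return best if avg[best] >= threshold else -1

# usage: two ensemble members, five known algorithms
confident = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                      [0.80, 0.05, 0.05, 0.05, 0.05]])
uncertain = np.array([[0.30, 0.25, 0.20, 0.15, 0.10],
                      [0.25, 0.30, 0.20, 0.15, 0.10]])
print(attribute_with_unknown(confident))  # known class 0
print(attribute_with_unknown(uncertain))  # -1, i.e. unknown generator
```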
2022
A Deep Learning Scheme for Detecting Atrial Fibrillation Based on Fusion of Raw and Discrete Wavelet Transformed ECG Features
Md Awsafur Rahman, Shahed Ahmed, and Shaikh Anowarul Fattah
Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia, and the electrocardiogram (ECG) is a powerful non-invasive tool for its clinical diagnosis. Automatic AF detection remains a very challenging task due to the high inter-patient variability of ECGs. In this paper, an automatic AF detection scheme is proposed based on a deep learning network that utilizes both the raw ECG signal and its discrete wavelet transform (DWT) version. In order to utilize the time-frequency characteristics of the ECG signal, a first-level DWT is applied, and both the high- and low-frequency components are then utilized in the 1D CNN network in parallel. If only the transformed data were utilized in the network, original variations in the data, which also contain useful information for identifying abnormalities, might not be explored. A multi-phase training scheme is proposed which facilitates parallel optimization for efficient gradient propagation. In the proposed network, features are directly extracted from the raw ECG and DWT coefficients, followed by 2 fully connected layers to further process the features and detect arrhythmia in the recordings. The classification performance of the proposed method is tested on the PhysioNet-2017 dataset, and it offers superior performance in detecting AF from normal, alternating, and noisy cases in comparison to some state-of-the-art methods.
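The first-level DWT split described above can be illustrated with a Haar wavelet; the paper's wavelet family is not stated here, so Haar is an assumption made for simplicity.

```python
import numpy as np

def haar_dwt_level1(signal):
    """One-level Haar DWT: scaled pairwise sums give the low-frequency
    approximation, scaled pairwise differences the high-frequency
    detail. Each branch would feed a parallel 1D-CNN stream alongside
    the raw ECG."""
    x = np.asarray(signal, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency component
    detail = (even - odd) / np.sqrt(2)   # high-frequency component
    return approx, detail

# usage on a toy 4-sample signal
approx, detail = haar_dwt_level1([4.0, 2.0, 6.0, 6.0])
# approx ≈ [4.243, 8.485], detail ≈ [1.414, 0.0]
```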
2021
CovSegNet: A multi encoder–decoder architecture for improved lesion segmentation of COVID-19 chest CT scans
Tanvir Mahmud, Md Awsafur Rahman, Shaikh Anowarul Fattah, and 1 more author
IEEE Transactions on Artificial Intelligence, 2021
Automatic lung lesion segmentation of chest computed tomography (CT) scans is considered a pivotal stage toward accurate diagnosis and severity measurement of COVID-19. The traditional U-shaped encoder-decoder architecture and its variants suffer from loss of contextual information in pooling/upsampling operations, increased semantic gaps between encoded and decoded feature maps, and vanishing gradient problems arising from sequential gradient propagation, all of which result in suboptimal performance. Moreover, operating on 3-D CT volumes poses further limitations due to the exponential increase in computational complexity, making optimization difficult. In this article, an automated COVID-19 lesion segmentation scheme is proposed utilizing a highly efficient neural network architecture, namely CovSegNet, to overcome these limitations. Additionally, a two-phase training scheme is introduced where a deeper 2-D network is employed for generating a region-of-interest (ROI)-enhanced CT volume, followed by a shallower 3-D network for further enhancement with more contextual information without increasing the computational burden. Along with the traditional vertical expansion of UNet, we have introduced horizontal expansion with multistage encoder-decoder modules for achieving optimum performance. Additionally, multiscale feature maps are integrated into the scale transition process to overcome the loss of contextual information. Moreover, a multiscale fusion module is introduced with a pyramid fusion scheme to reduce the semantic gaps between subsequent encoder/decoder modules while facilitating parallel optimization for efficient gradient propagation. Outstanding performance has been achieved on three publicly available datasets, largely outperforming other state-of-the-art approaches. The proposed scheme can be easily extended to achieve optimum segmentation performance in a wide variety of applications.
Impact Statement: With lower sensitivity (60-70%), elongated testing time, and a dire shortage of testing kits, the traditional RT-PCR-based COVID-19 diagnostic scheme heavily relies on post-CT manual inspection for further investigation. Hence, automating the process of extracting infected lesions from chest CT volumes will be major progress toward faster, accurate diagnosis of COVID-19. However, in challenging conditions with diffused, blurred, and varying-shaped edges of COVID-19 lesions, conventional approaches fail to provide precise segmentation of lesions, which can lead to false estimation and loss of information. The proposed scheme, incorporating an efficient neural network architecture (CovSegNet), overcomes the limitations of traditional approaches and provides a significant improvement in performance (8.4% on the averaged Dice measurement scale) over two datasets. Therefore, this scheme can be an effective, economical tool for physicians, enabling faster infection analysis to greatly reduce the spread and massive death toll of this deadly virus through mass screening.
2020
CovXNet: A multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization
Tanvir Mahmud, Md Awsafur Rahman, and Shaikh Anowarul Fattah
With the recent outbreak of COVID-19, fast diagnostic testing has become one of the major challenges due to the critical shortage of test kits. Pneumonia, a major effect of COVID-19, needs to be urgently diagnosed along with its underlying causes. In this paper, deep learning-aided automated COVID-19 and other pneumonia detection schemes are proposed utilizing a small amount of COVID-19 chest X-rays. A deep convolutional neural network (CNN)-based architecture, named CovXNet, is proposed that utilizes depthwise convolution with varying dilation rates to efficiently extract diversified features from chest X-rays. Since the chest X-ray images corresponding to COVID-19-caused pneumonia and other traditional pneumonias have significant similarities, at first, a large number of chest X-rays corresponding to normal and (viral/bacterial) pneumonia patients are used to train the proposed CovXNet. The learning of this initial training phase is transferred, with some additional fine-tuning layers, which are further trained with a smaller number of chest X-rays corresponding to COVID-19 and other pneumonia patients. In the proposed method, different forms of CovXNet are designed and trained with X-ray images of various resolutions, and for further optimization of their predictions, a stacking algorithm is employed. Finally, a gradient-based discriminative localization is integrated to distinguish the abnormal regions of X-ray images referring to different types of pneumonia. Extensive experimentation using two different datasets provides very satisfactory detection performance, with accuracy of 97.4% for COVID/Normal, 96.9% for COVID/Viral pneumonia, 94.7% for COVID/Bacterial pneumonia, and 90.2% for multiclass COVID/Normal/Viral/Bacterial pneumonia. Hence, the proposed schemes can serve as an efficient tool in the current state of the COVID-19 pandemic. All the architectures are made publicly available at: https://github.com/awsaf49/CovXNet
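The multi-receptive-field idea behind CovXNet, the same kernel size applied at different dilation rates, can be illustrated in one dimension. This is a simplified numpy sketch, not the actual depthwise implementation from the paper.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid dilated 1D convolution: taps are spaced `dilation` samples
    apart, so a kernel of size k covers a receptive field of
    (k - 1) * dilation + 1 samples at the same multiply cost."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out = [sum(kernel[j] * x[i + j * dilation] for j in range(k))
           for i in range(len(x) - span + 1)]
    return np.array(out)

x = np.arange(8, dtype=float)          # [0, 1, ..., 7]
avg3 = np.array([1.0, 1.0, 1.0]) / 3   # 3-tap averaging kernel

print(dilated_conv1d(x, avg3, dilation=1))  # receptive field 3
print(dilated_conv1d(x, avg3, dilation=2))  # receptive field 5
```

With dilation 2, the same 3-tap kernel covers 5 input samples at no extra multiply cost, which is how varying dilation rates extract features at several receptive fields simultaneously.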