
In Q3 of 2024, Kaspersky (the security company) tracked nearly 7 million attacks directed at mobile users: malware, adware, and unwanted applications trying to get themselves onto devices. Again, that's just three months of data, from a single security company. And unfortunately, the problem isn't just that malware is becoming more plentiful; it's also increasing in sophistication.
Bad actors now employ evasion techniques to bypass detection: concealing their presence, impersonating legitimate applications, exploiting system permissions, and performing unauthorised actions like data exfiltration and financial fraud. So we're in an arms race: the malware detection schemes we deploy must keep up with the advancements in the malware itself. But on mobile, that's easier said than done: you're in a constrained environment with few resources. You can't exactly load a giant model onto an Android device and expect it to run. You need a more nuanced approach.
And that's the context for today's paper. The authors are trying to build a malware detection system for Android devices that's accurate enough to catch sophisticated threats, fast enough to run in real-time on smartphones, and transparent enough to explain its decisions. Their solution combines lightweight transformer models with explainable AI techniques to create a system that balances these requirements. Let's see how they did it. But first, a little background on what kind of solutions already exist for this, and why they fall short.
Traditional ML approaches have relied on things like Support Vector Machines, Random Forests, and K-Nearest Neighbours. While these have shown promise in identifying malicious behaviours, deep learning approaches have consistently outperformed them when dealing with complex and diverse datasets. Convolutional Neural Networks (CNNs) excel at extracting features from raw data and detecting local dependencies, making them highly effective at identifying malicious patterns. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are particularly adept at processing sequential data, such as API and system call sequences. But these deep models face a critical limitation: they're closed systems, black boxes. While they do output probabilities, they don't provide insight into the underlying reasons behind their predictions, making it difficult to assess the reliability of their decisions. This lack of interpretability severely limits trust in their outputs and restricts their use as practical malware detectors. When your security system flags an app as malicious, you need to understand exactly which features triggered that decision. It can't just say "this is malware because we say so"; you need it to be able to explain why. Then there's the deployment challenge. As I mentioned earlier, most existing deep learning models require substantial computational resources, making them impractical for mobile devices with limited battery, memory, and processing power.
Enter transformers and the self-attention mechanism. Transformers excel at modelling global dependencies and enabling parallel processing, making them theoretically ideal for identifying complex malicious patterns. With their self-attention mechanisms, they can assign different levels of importance to various elements within a sequence, dramatically improving contextual understanding of API calls and permissions. That being said, standard transformers are (like the other models I mentioned) too computationally intensive for mobile. And that's what they're trying to fix in this paper. They want the benefits of a transformer, without the footprint. So, they sought out existing lightweight versions specifically designed for resource-constrained environments. In the end, they found five worth evaluating, all based on Google's BERT model (famous for its bidirectional language understanding).
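To make the self-attention idea concrete, here's a minimal from-scratch sketch of scaled dot-product attention over a toy sequence (pure Python, no ML framework; the token vectors are invented for illustration, and real transformers use learned query/key/value projections rather than the raw embeddings):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    For simplicity, each token's query, key, and value are the token
    vector itself.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every position in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # Importance assigned to each position
        # Weighted sum of value vectors -> context-aware representation
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

# Toy 3-token sequence of 2-d embeddings (think: three API-call tokens)
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ctx = self_attention(tokens)
print(ctx)  # Each output row mixes information from the whole sequence
```

The point of the sketch is the weighting step: every token's new representation is a softmax-weighted blend of the whole sequence, which is exactly how a transformer lets one permission or API call contextualise another.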
DistilBERT is, as the name suggests, a distilled version of BERT that retains most of its performance in a much smaller package. CodeBERT was pretrained on source code and natural language to support code understanding and classification tasks. TinyBERT was obtained through layer-wise distillation. MobileBERT was optimised for mobile and edge devices. And ALBERT uses factorised embeddings and cross-layer parameter sharing for efficiency. But for any of these models to be useful, it needed to be trained on data, to learn the difference between a harmless app and a malicious one. That dataset came from Koodous, an online platform for malware analysis. It describes 100,000 different Android apps, split evenly between malware samples and benign apps. The authors used 80% of the dataset for training and the remaining 20% for testing.
Feature extraction focused on two critical indicators of app behaviour: permissions and API calls. Permissions regulate access to system functionality and restrict interactions between applications, while API call sequences provide insight into execution patterns associated with malicious intent. Malicious applications often request excessive or unnecessary permissions to perform unauthorised actions, while API calls reveal the actual execution flow that can expose malicious behaviour patterns.
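As a concrete illustration of what such a feature representation might look like (the field names and the flattening scheme below are my own assumption, not the paper's exact pipeline), the extracted permissions and API calls can be joined into a single token sequence per app, ready for a text tokenizer:

```python
def build_feature_text(permissions, api_calls):
    """Flatten an app's declared permissions and observed API calls
    into one whitespace-separated string for the tokenizer."""
    return " ".join(permissions + api_calls)

app = {
    # Excessive or unnecessary permissions are a classic malware signal
    "permissions": ["android.permission.SEND_SMS",
                    "android.permission.READ_CONTACTS"],
    # The API call sequence hints at the actual execution flow
    "api_calls": ["Landroid/telephony/SmsManager;->sendTextMessage",
                  "Ljava/net/HttpURLConnection;->connect"],
}

text = build_feature_text(app["permissions"], app["api_calls"])
print(text)
```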
Preprocessing involved several tokenisation strategies, which differed slightly based on the model architecture. Prior to tokenisation, they performed lightweight cleaning to reduce lexical noise, including removing special characters and standardising formatting. For example, permission strings were normalised from their verbose Android format to shorter, more readable versions.
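The paper doesn't spell out the exact cleaning rules, but a plausible sketch of the normalisation step, stripping the verbose prefixes and special characters, looks like this:

```python
import re

def normalise_permission(p):
    """'android.permission.SEND_SMS' -> 'SEND_SMS'."""
    return p.rsplit(".", 1)[-1]

def normalise_api_call(call):
    """'Ljava/net/HttpURLConnection;->connect' -> 'HttpURLConnection connect'."""
    call = re.sub(r"^L", "", call)           # Drop the JVM type prefix
    cls, _, method = call.partition(";->")   # Split class from method
    return f"{cls.rsplit('/', 1)[-1]} {method}"

print(normalise_permission("android.permission.SEND_SMS"))
print(normalise_api_call("Ljava/net/HttpURLConnection;->connect"))
```

The effect is to shrink the vocabulary the tokenizer has to deal with, so that semantically identical tokens don't fragment into many rare subword pieces.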
The training process used learning rate selection (to avoid catastrophic forgetting of the pre-trained knowledge), batch sizing to balance computational efficiency with gradient stability, and early stopping based on validation performance, to prevent overfitting.
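Of those three levers, early stopping is the easiest to show. A framework-agnostic sketch: track validation loss each epoch and halt once it stops improving for a set number of epochs (the loss values below are invented for illustration):

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return the epoch at which training should stop.

    Stops once validation loss has failed to improve for `patience`
    consecutive epochs -- the overfitting guard the training used.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0   # New best checkpoint
        else:
            stale += 1              # No improvement this epoch
            if stale >= patience:
                return epoch        # Stop: validation loss is degrading
    return len(val_losses) - 1      # Trained to completion

# Validation loss improves, then degrades -> stop at epoch 4
losses = [0.62, 0.48, 0.41, 0.44, 0.47, 0.50]
print(train_with_early_stopping(losses))  # -> 4
```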
So how did each of the models do?
DistilBERT achieved the highest overall performance, effectively balancing correct classifications while minimising false positives. TinyBERT and ALBERT also did well, which validates the effectiveness of distillation and parameter sharing techniques for this domain. CodeBERT, meanwhile, had high recall and reduced the total number of malicious samples that evaded detection. This suggests that models trained on code-related data may have learned representations that transfer well to malware analysis. MobileBERT, on the other hand, showed weaker performance. Which, unfortunately, makes sense. Sometimes the architectural choices that improve deployment characteristics can hurt performance on specific tasks.
To validate the practical deployability of these models, they converted them to ONNX format and then loaded them onto actual Android phones.
Side note: ONNX is an open standard for representing machine learning models in a portable, hardware-agnostic format. It works by defining a common set of operators and computational graphs that allow models trained in any framework that outputs to ONNX (like PyTorch or TensorFlow) to run efficiently in any system that supports the ONNX runtime. It's almost like the JVM, but for models. It enables the deployment of the same model across various platforms, from servers to edge devices, and from Linux to Windows, without requiring retraining, re-exporting, or reformatting. In this case, it allowed the researchers to export their transformer models and benchmark them directly on mobile devices.
Anyway, all of the models deployed and ran. The older devices showed more variable performance, while the newer devices delivered consistent results. As you'd expect.
Explainability analysis used LIME to interpret model decisions on individual samples. LIME works by creating variations of the input and observing how the model's predictions change, which lets you identify the features that most strongly influence a decision. The stability of the explanations was tested across multiple runs with different random seeds. Some samples showed highly consistent explanations; others showed more variability, particularly in cases where malicious behaviour was subtle or features overlapped significantly with benign patterns.
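To see the core mechanic, here's a deliberately simplified occlusion-style sketch in the spirit of LIME: drop each token from the input, re-score the variant, and rank tokens by how much their removal moves the malware score. (Real LIME fits a weighted linear model over many random perturbations; and the toy "model" below is my own stand-in, not the paper's classifier.)

```python
def toy_model(tokens):
    """Stand-in classifier: malware score rises with suspicious tokens."""
    suspicious = {"SEND_SMS", "sendTextMessage", "READ_CONTACTS"}
    hits = sum(1 for t in tokens if t in suspicious)
    return hits / (len(tokens) or 1)

def explain(tokens, model):
    """Perturb the input one token at a time, observe how the
    prediction changes, and attribute importance accordingly."""
    base = model(tokens)
    importance = {}
    for i, tok in enumerate(tokens):
        perturbed = tokens[:i] + tokens[i + 1:]    # Variant with token removed
        importance[tok] = base - model(perturbed)  # Drop in malware score
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

sample = ["INTERNET", "SEND_SMS", "sendTextMessage", "connect"]
for token, weight in explain(sample, toy_model):
    print(f"{token:>16}  {weight:+.3f}")
```

Running this ranks SEND_SMS and sendTextMessage as the positive drivers of the malware score, which is exactly the kind of per-feature attribution a security analyst needs when a detector flags an app.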
So overall, this seems (to me) like a validated proof of concept. The models are working, they're explainable, and they're deployable. This paper demonstrates that lightweight transformers can achieve near state-of-the-art performance while maintaining practical deployability constraints. The authors' use of transfer learning, efficient architectures, and explainable AI provides a foundation for trustworthy, real-time malware detection. And (importantly) their system preserves user privacy by performing its processing entirely on-device.
If you want to dive deeper into the model architectures or see the LIME explanations for the authors' case studies, I'd highly recommend downloading the full paper. It includes the ablation studies showing exactly how each architectural choice impacts performance, the complete confusion matrices for all model variants, and the implementation details in case you want to reproduce their ONNX deployments yourself.
Here's the illustrative code, end to end: a mock dataset standing in for Koodous, the DistilBERT fine-tuning setup, and the ONNX export for mobile deployment.

```python
import torch
import torch.onnx
from torch.utils.data import Dataset, DataLoader
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# 1. Mock Dataset class (representing the Koodous dataset)
class AndroidMalwareDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts      # Combined string of permissions and API calls
        self.labels = labels    # 0 for benign, 1 for malware
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Tokenisation strategy mentioned in the article
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long),
        }

# Sample data (permissions + API calls, normalised)
train_texts = [
    "android.permission.INTERNET android.permission.SEND_SMS Ljava/net/HttpURLConnection;->connect",
    "android.permission.CAMERA Landroid/hardware/Camera;->open",
]
train_labels = [1, 0]  # 1: malware, 0: benign

# 2. Initialise model and tokenizer
model_name = 'distilbert-base-uncased'  # "DistilBERT... retaining most of BERT's performance"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 3. Prepare data
train_dataset = AndroidMalwareDataset(train_texts, train_labels, tokenizer)

# 4. Training arguments (batch sizing and learning rate selection)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,  # Balancing computational efficiency
    learning_rate=2e-5,              # To avoid catastrophic forgetting
    logging_dir='./logs',
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
# Note: in a real scenario, you would call trainer.train() here.
print("Training setup complete. Model ready for fine-tuning.")

# 5. Export to ONNX (for mobile deployment)
# The article mentions converting models to ONNX to run on Android.
model.eval()
dummy_input = torch.randint(0, 1000, (1, 512))  # Dummy input for tracing
onnx_path = "android_malware_detector.onnx"
torch.onnx.export(
    model,
    (dummy_input,),
    onnx_path,
    input_names=['input_ids'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size'}, 'logits': {0: 'batch_size'}},
    opset_version=11,
)
print(f"Model exported to {onnx_path} for mobile deployment.")
```