ACE Journal

Efficient Transformers for On-Device NLP Applications

Abstract

This article surveys model compression, quantization, and pruning techniques for deploying transformer architectures on mobile and edge devices, and benchmarks the resulting trade-offs between accuracy, latency, and model size.


Introduction

Transformer-based models have become the cornerstone of state-of-the-art natural language processing (NLP). However, their computational intensity poses challenges for deployment on mobile and edge devices with limited memory, processing power, and energy capacity. This article surveys the latest methods for optimizing transformers—through compression, quantization, and pruning—to enable efficient on-device NLP without significantly compromising accuracy.

The Need for Efficient Transformers

Large language models such as BERT, GPT, and T5 deliver exceptional results but demand substantial hardware resources. Deploying these models on-device offers benefits such as reduced latency, improved privacy, and offline functionality. Achieving this requires innovative approaches to reduce model size and computational load.


Optimization Techniques

1. Model Compression

Compression involves reducing the number of parameters or the representation size of a model while retaining performance. Popular methods include:

  - Knowledge distillation: training a compact student model to reproduce the outputs of a larger teacher, the approach behind DistilBERT and TinyBERT.
  - Low-rank factorization: approximating large weight matrices by products of smaller ones.
  - Parameter sharing: reusing the same weights across layers to shrink the overall parameter count.
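
As an illustration of distillation specifically, the sketch below shows a typical training loss for a student model learning from a teacher in PyTorch; the temperature and weighting values are illustrative defaults, not settings taken from the cited papers.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend soft-target distillation with ordinary cross-entropy.

        temperature softens the teacher distribution; alpha weighs the
        distillation term (both values are illustrative).
        """
        # Soft targets: KL divergence between temperature-scaled distributions.
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kd = F.kl_div(soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2

        # Hard targets: cross-entropy against the ground-truth labels.
        ce = F.cross_entropy(student_logits, labels)

        return alpha * kd + (1.0 - alpha) * ce

During training, the teacher runs in inference mode and only the student's parameters are updated.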

2. Quantization

Quantization reduces the precision of weights and activations, usually from 32-bit floating point to 8-bit integers. Variants include:

  - Post-training quantization: converting an already trained model without any retraining.
  - Quantization-aware training: simulating low-precision arithmetic during training so the model learns to tolerate it.
  - Dynamic quantization: quantizing weights ahead of time while quantizing activations on the fly at inference.
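
As a minimal sketch of dynamic quantization with PyTorch (the checkpoint name below is just an example from the Hugging Face hub):

    import torch
    from transformers import AutoModelForSequenceClassification

    # Example checkpoint; any torch.nn.Module containing Linear layers
    # can be quantized the same way.
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    ).eval()

    # Dynamic quantization: weights are converted to int8 ahead of time,
    # activations are quantized on the fly at inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    torch.save(quantized.state_dict(), "distilbert_sst2_int8.pt")

On CPU-only devices this typically shrinks the linear-layer weights to roughly a quarter of their float32 size with only a small accuracy cost.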

3. Pruning

Pruning removes weights or neurons that have little impact on predictions. Strategies include:

  - Unstructured (magnitude) pruning: zeroing individual weights with the smallest magnitudes.
  - Structured pruning: removing entire neurons, attention heads, or layers so the speedup is realized on commodity hardware.
  - Iterative pruning with fine-tuning: alternating pruning and retraining to recover lost accuracy.
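
The sketch below applies unstructured magnitude pruning to every linear layer using torch.nn.utils.prune; the 30% sparsity level is an arbitrary example and would normally be tuned, with fine-tuning afterwards to recover accuracy.

    import torch
    import torch.nn.utils.prune as prune
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased"  # example checkpoint
    )

    # Zero out the 30% of weights with the smallest magnitude in each Linear layer.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")  # bake the pruning mask into the weights

Note that unstructured sparsity only pays off on kernels that exploit it; structured pruning is usually needed to see wall-clock speedups on stock mobile CPUs.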


Benchmarks and Trade-Offs

To evaluate the effectiveness of these methods, we benchmarked DistilBERT, TinyBERT, and MobileBERT against a BERT-Base baseline on the GLUE benchmark. Deployment environments included a mid-range smartphone and a Raspberry Pi 5.

Model         Params (M)   Latency (ms)   Accuracy (GLUE Avg)
BERT-Base     110          420            82.2%
DistilBERT    66           160            79.1%
TinyBERT      14.5         75             77.0%
MobileBERT    25           95             80.3%

Observations:

  - DistilBERT cuts latency by roughly 60% relative to BERT-Base while giving up about three GLUE points.
  - TinyBERT has the smallest footprint and the lowest latency, but also the largest accuracy drop.
  - MobileBERT offers the best balance, retaining roughly 98% of BERT-Base's GLUE score at less than a quarter of its latency.
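
Per-sequence latency figures like those above can be approximated with a simple timing loop such as the one below; the checkpoint, input sentence, warm-up count, and iteration count are illustrative choices rather than the exact benchmark protocol.

    import time
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()

    inputs = tokenizer("An example sentence for latency measurement.",
                       return_tensors="pt")

    with torch.no_grad():
        for _ in range(5):          # warm-up so caches and thread pools settle
            model(**inputs)

        runs = 50
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
        elapsed_ms = (time.perf_counter() - start) * 1000 / runs

    print(f"Average latency: {elapsed_ms:.1f} ms per sequence")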


Deployment Considerations

Hardware-Aware Optimization

Different edge devices have different optimization capabilities. Tailoring compression and quantization strategies to hardware (e.g., ARM CPUs, NPUs, or GPUs) improves performance.
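For example, PyTorch exposes several quantized-kernel backends, and selecting the one that matches the target CPU matters for int8 performance; this is a minimal sketch, and backend availability depends on the PyTorch build.

    import torch

    # Pick the quantized kernel backend to match the deployment hardware:
    # "qnnpack" targets ARM mobile CPUs, "fbgemm" targets x86 CPUs with AVX2.
    print(torch.backends.quantized.supported_engines)
    torch.backends.quantized.engine = "qnnpack"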

Frameworks and Toolkits

Several mature toolkits support this workflow, including TensorFlow Lite, PyTorch Mobile and ExecuTorch, ONNX Runtime Mobile, and Apple's Core ML. These tools help automate conversion, optimization, and deployment of models on-device.
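
A common first step with these toolkits is exporting the trained model to an interchange format such as ONNX, which runtimes like ONNX Runtime Mobile can then optimize for the target device. The sketch below assumes a Hugging Face checkpoint and placeholder file names; production exports often go through dedicated converters that handle more edge cases.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased"  # example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    model.config.return_dict = False  # export plain tensors, not ModelOutput objects

    dummy = tokenizer("export example", return_tensors="pt")

    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        "distilbert.onnx",                       # placeholder output path
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                      "attention_mask": {0: "batch", 1: "sequence"}},
        opset_version=17,
    )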

Privacy and Offline Access

On-device NLP is critical for applications involving sensitive data, such as voice assistants or messaging apps. Efficient transformers enable local processing, reducing reliance on cloud servers.


Future Directions

Emerging research focuses on:

  - Sparse and linear attention mechanisms that reduce the quadratic cost of self-attention over long inputs.
  - Neural architecture search for models tailored to specific edge hardware.
  - Sub-8-bit and mixed-precision quantization.
  - Hardware-software co-design, from NPU-friendly operators to specialized compilers.
  - On-device fine-tuning and federated learning for privacy-preserving personalization.

These advances promise even more powerful and accessible NLP applications at the edge.


Conclusion

Efficient transformer architectures make it feasible to bring advanced NLP capabilities to mobile and edge devices. Techniques like compression, quantization, and pruning offer practical trade-offs between speed, memory footprint, and accuracy. With continuous innovation in both algorithms and hardware, on-device NLP is poised to become mainstream.


References

  1. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  2. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., & Zhou, D. (2020). MobileBERT: a compact task-agnostic BERT for resource-limited devices. ACL.
  3. Jiao, X., et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding. Findings of EMNLP.
  4. Jacob, B., et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR.