CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing
Currently, a growing number of mature natural language processing applications make people's lives more convenient. Such applications are built from source code, the language of software engineering. However, applications that understand source code and thereby ease the software engineering process remain under-researched. At the same time, the transformer model, especially in combination with transfer learning, has proven to be a powerful technique for natural language processing tasks. These breakthroughs point to a promising direction for processing source code and tackling software engineering tasks. This paper describes CodeTrans, an encoder-decoder transformer model for the software engineering domain, which explores the effectiveness of encoder-decoder transformer models on six software engineering tasks comprising thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies: single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all tasks. To expedite future work in the software engineering domain, we have published our pre-trained CodeTrans models: https://github.com/agemagician/CodeTrans
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Program Synthesis | AlgoLisp | CodeTrans-MT-TF-Small | Accuracy | 90.31 | # 1
Code Documentation Generation | CodeSearchNet - Go | CodeTrans-TF-Large | Smoothed BLEU-4 | 19.54 | # 7
Code Documentation Generation | CodeSearchNet - Java | CodeTrans-MT-Large | Smoothed BLEU-4 | 21.87 | # 1
Code Documentation Generation | CodeSearchNet - JavaScript | CodeTrans-TF-Large | Smoothed BLEU-4 | 18.98 | # 2
Code Documentation Generation | CodeSearchNet - PHP | CodeTrans-MT-Base | Smoothed BLEU-4 | 26.23 | # 1
Code Documentation Generation | CodeSearchNet - Python | CodeTrans-MT-Base | Smoothed BLEU-4 | 20.39 | # 1
Code Documentation Generation | CodeSearchNet - Ruby | CodeTrans-MT-Base | Smoothed BLEU-4 | 15.26 | # 1
Git Commit Message Generation | CommitGen | CodeTrans-TF-Large | BLEU-4 | 44.41 | # 1
API Sequence Recommendation | DeepAPI | CodeTrans-MT-TF-Large | BLEU-4 | 73.39 | # 1
Code Comment Generation | DeepCom | CodeTrans-TF-Large | Smoothed BLEU-4 | 39.50 | # 1
Source Code Summarization | Summarizing Source Code using a Neural Attention Model - C# | CodeTrans-MT-Large | Smoothed BLEU-4 | 23.57 | # 1
Source Code Summarization | Summarizing Source Code using a Neural Attention Model - Python | CodeTrans-MT-Base | Smoothed BLEU-4 | 13.37 | # 1
Source Code Summarization | Summarizing Source Code using a Neural Attention Model - SQL | CodeTrans-MT-TF-Large | Smoothed BLEU-4 | 19.98 | # 1
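Most entries in the table above are scored with Smoothed BLEU-4. As a rough illustration only (not the paper's actual evaluation script, whose exact smoothing variant is not specified here), a minimal pure-Python sketch using add-one smoothing on the higher-order n-gram precisions (Lin and Och, 2004) might look like:

```python
import math
from collections import Counter

def smoothed_bleu4(candidate, reference):
    """Sentence-level BLEU-4 with add-one smoothing on the n-gram
    precisions for n > 1. Inputs are lists of tokens."""
    precisions = []
    for n in range(1, 5):
        cand_ngrams = Counter(
            tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(
            tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clipped n-gram matches between candidate and reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if n == 1:
            p = overlap / total if total else 0.0
        else:
            p = (overlap + 1) / (total + 1)  # add-one smoothing
        precisions.append(p)
    if precisions[0] == 0.0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

# Identical sequences score 1.0; partial overlap lands strictly in between.
ref = "returns the sum of two numbers".split()
print(smoothed_bleu4(ref, ref))                 # 1.0
print(smoothed_bleu4("return the sum".split(), ref))
```

The add-one numerator keeps higher-order precisions nonzero for short hypotheses, which matters for the short target sequences typical of code documentation and commit messages.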