The model was trained on tens of millions of public code repositories on GitHub. The primary goal is for the model to be used for research purposes and to produce original output, rather than simply regurgitating the code it was trained on. In fact, OpenAI has found that “the vast majority (>99%) of output does not match the training data”. This suggests that the model was able to capture the structure, syntax, and semantics of programming languages in the same way it has previously done with human languages.
Figure 1: Transformer architecture
The GPT-3 model is built on the atomic unit of a Transformer, first proposed in 2017 in the research paper “Attention Is All You Need” by a team at Google. The key contribution of this research was the idea of self-attention, which allows generative models to be trained on unstructured and unlabeled text – leading to an exponential increase in the training data available to such models.
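The core of self-attention can be sketched in a few lines. The toy version below is our own minimal illustration (plain Python, a single head, no learned query/key/value projections, unlike a real Transformer) of the central idea: every token's output is a similarity-weighted average over all tokens in the sequence.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention without learned projections.

    Each token vector serves as its own query, key, and value; the
    output for a token is a weighted average of all token vectors,
    weighted by scaled dot-product similarity.
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Scaled dot-product score against every token (including itself).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs

# Three 2-dimensional "token embeddings".
out = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because each output is a convex combination of the inputs, every component stays within the range of the corresponding input components – the mechanism mixes information across positions rather than inventing new values.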
The main difference between the successive iterations of GPT is the number of layers, the training corpus size, the vocabulary size, and the number of parameters in the model – all increasing with every iteration. GPT-2 increased the number of parameters to 1.5 billion from 117 million. The latest model uses a mind-bending 175 billion parameters. As the models expand, so do their capabilities, as we see in Figure 2 of model accuracy in “next word prediction”, i.e., predicting the next word from a given context. What is important to note, however, is that model performance seems to be approaching a plateau, at least in the case of GPT-3, which may point to the idea that future model iterations will focus on aspects other than training data size to improve generalization ability.
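The “next word prediction” task in Figure 2 can be made concrete with a toy model. The bigram counter below is our own minimal illustration (nothing like GPT's architecture, which learns from context far longer than one word): it simply predicts the most frequent follower of a given word in its training text.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # Count, for each word, how often each possible next word follows it.
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(model, word):
    # Return the most frequent continuation, or None for unseen words.
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat and the cat slept")
predict_next(model, "the")  # "cat" ("cat" follows "the" twice, "mat" once)
```

Scaling this idea from one-word contexts to thousands of tokens of context, and from counting to 175 billion learned parameters, is – very loosely – the trajectory the GPT family has followed.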
Figure 2: GPT model accuracy comparison (next token prediction)
Since Codex can automate code generation from natural language, it raises the question of whether it can replace an actual programmer. We decided to start with a simple problem, move to a more involved prompt, and finish with something more in line with what a seasoned programmer would be expected to deliver.
- Simple prompt – Sort a list of numbers
Figure 3: Sample screenshot from OpenAI Codex Beta
- More specific prompt along the same lines – Sort a list of numbers with O(n²) time complexity
- Application-specific prompt – Create an LSTM-based time series prediction model with a training window of 3 days and plot the errors of the predictions
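For reference, the kind of output the first two prompts elicit can be reproduced by hand. The snippet below is our own illustration (not Codex's verbatim output), contrasting the idiomatic built-in answer with an explicitly O(n²) insertion sort:

```python
def sort_numbers(numbers):
    # For a plain "sort a list of numbers" prompt, the idiomatic answer
    # is Python's built-in sorted() (Timsort, O(n log n)).
    return sorted(numbers)

def sort_numbers_quadratic(numbers):
    # An explicitly O(n^2) answer: insertion sort on a copy of the input.
    result = list(numbers)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

sort_numbers([3, 1, 2])             # [1, 2, 3]
sort_numbers_quadratic([5, -1, 4])  # [-1, 4, 5]
```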
Overall, the results seem quite impressive. The first two prompts are handled with relative ease, and the model is even able to identify suitable algorithms based on complexity and use built-in Python functions when they are more appropriate. The final prompt also provides some interesting insights into the generation process. The general outline is clearly there, as well as the common pre-processing steps when working with time series data.
It is, however, clear to any Data Scientist / ML Engineer that the third output is not a generalized solution and would require significant reworking. There is no function to plot the errors, and it reads like a copy-paste implementation of a toy problem, including irrelevant comments and unused lines of code. To many, this is akin to a “Googled” solution in that it is not tailor-made to specifications but rather a vaguely relevant code snippet that needs to be heavily altered.
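A generalized version of the third prompt would start from a clean sliding-window step. The helper below is our own sketch (not the model's output) of the preprocessing that an LSTM pipeline would consume, assuming one observation per day so that a 3-day training window means three consecutive values per sample:

```python
def make_windows(series, window=3):
    """Split a time series into (input window, next value) training pairs.

    With window=3, each sample uses three consecutive observations to
    predict the following one -- the "training window of 3 days" from
    the prompt, assuming daily observations.
    """
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs

def absolute_errors(actual, predicted):
    # The error series one would plot after generating predictions.
    return [abs(a - p) for a, p in zip(actual, predicted)]

make_windows([10, 11, 12, 13, 14], window=3)
# [([10, 11, 12], 13), ([11, 12, 13], 14)]
```

A tailor-made solution would wrap exactly this kind of parameterized, reusable step – in contrast to the hard-coded toy script the model produced.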
Potential and limitations
The Codex model has gone a long way in illustrating what we can expect from future language models, which will be better able to understand us than ever before. With a simple instruction, Codex can find relevant information and structure it in a way that programmers understand, or even create unique content that helps to bridge the translation gap between programmers and non-programmers. This exciting development will continue to fuel interest and research in this area for years to come.
Although the Codex model can code with surprising coherence and speed, it is evidently limited by its exposure to code repositories and their inherent imperfections. The model is also prone to offering up the same solution to slightly varying prompts, and in some cases presents a solution that is identical to a code excerpt from its training corpus. This means that the model does not always deliver an original and relevant solution to an unseen prompt – something that most trained programmers with exposure to only 1% of the Codex corpus would be able to do.
Codex demonstrates that by combining enormous unstructured corpora and sufficiently complex architectures, models can produce realistic and workable programming output from simple natural language prompts. However, when digging a little deeper, it is not the panacea it first appears to be. It is still a work in progress, and though exciting, there is much left to be done. For this reason, automated programmers may still be a little way off; instead, the immediate relevance of Codex will be to augment the abilities of current programmers, and hopefully clean up some of our dusty, syntactically littered repositories on GitHub.
Inspired? We can help you with your next step!
Whether you are beginning your AI journey or already have models in production, you can turn to Combine for expert help. Feel free to check out our blog post on how we used Transformers to build a semantic search engine (https://combine.se/using-nlp-to-build-a-semantic-search-engine-plugin-for-sympathy/) and our other AI projects, such as crowd-sourced data annotation and image classification using deep learning (https://combine.se/an-odf-update/).
Want to learn more about how Combine can assist in AI development?
Contact Simon Yngve (firstname.lastname@example.org)