How CODEX Model Size Influences COCOGEN’s Output Quality



Abstract and 1 Introduction

2 COCOGEN: Representing Commonsense structures with code and 2.1 Converting (T,G) into Python code

2.2 Few-shot prompting for generating G

3 Evaluation and 3.1 Experimental setup

3.2 Script generation: PROSCRIPT

3.3 Entity state tracking: PROPARA

3.4 Argument graph generation: EXPLAGRAPHS

4 Analysis

5 Related work

6 Conclusion, Acknowledgments, Limitations, and References

A Few-shot models size estimates

B Dynamic prompt Creation

C Human Evaluation

D Dataset statistics

E Sample outputs

F Prompts

G Designing Python class for a structured task

H Impact of Model size

I Variation in prompts

G Designing Python class for a structured task

Figure 7 shows three different designs for EXPLAGRAPHS. For PROSCRIPT, the formats include representing the plan as a NetworkX[8] class (Figure 8), as a DOT-like class (Figure 9), and as a tree (Figure 10).
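To make this design space concrete, the Python sketch below shows roughly what a NetworkX-style class and a literal (DOT-like) encoding of a PROSCRIPT plan could look like; the class names, method names, and the example plan are illustrative assumptions, not the exact templates of Figures 8-10.

# Hypothetical sketch of two ways a PROSCRIPT plan could be rendered as
# Python source for prompting; names are illustrative, not the paper's
# exact templates.
import networkx as nx

class ProscriptPlanNetworkx:
    # NetworkX-style encoding: steps become nodes, temporal ordering
    # between steps becomes directed edges.
    def __init__(self, goal: str):
        self.goal = goal
        self.graph = nx.DiGraph()

    def add_step(self, step_id: str, description: str):
        self.graph.add_node(step_id, description=description)

    def add_edge(self, before: str, after: str):
        # "before" must happen before "after" in the plan.
        self.graph.add_edge(before, after)

class ProscriptPlanLiteral:
    # Literal encoding: the graph is written out as plain strings,
    # mirroring a DOT-like "step0 -> step1" edge syntax.
    goal = "bake a cake"
    steps = {
        "step0": "gather ingredients",
        "step1": "preheat the oven",
        "step2": "mix the batter",
    }
    edges = ["step0 -> step2", "step1 -> step2"]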

H Impact of Model size

The CODEX model released by OpenAI is available in two versions[9]: code-davinci-001 and code-davinci-002. While the exact sizes of the models are unknown because of their proprietary nature, the OpenAI API states that code-davinci-002 is the most capable Codex model. Tables 16 and ?? compare COCOGEN + code-davinci-001 with COCOGEN + code-davinci-002. Note that both code-davinci-001 and code-davinci-002 can fit 4000 tokens, so the number of in-context examples was identical for the two settings. The results show that for identical prompts, COCOGEN + code-davinci-002 vastly outperforms COCOGEN + code-davinci-001, demonstrating the importance of a stronger underlying code generation model.
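For reference, the two engines can be queried with identical prompts through the legacy OpenAI completion API (openai-python before v1.0), along the lines of the sketch below; the decoding parameters and the stop marker are assumptions, not the paper's exact settings.

# Illustrative comparison of the two Codex engines on an identical prompt,
# using the legacy OpenAI completion API (openai-python < 1.0).
import openai

def complete(prompt: str, engine: str) -> str:
    response = openai.Completion.create(
        engine=engine,        # "code-davinci-001" or "code-davinci-002"
        prompt=prompt,        # identical few-shot prompt for both engines
        max_tokens=300,       # assumed budget; both engines fit ~4000 tokens total
        temperature=0.0,      # greedy decoding for a deterministic comparison
        stop=["# END"],       # assumed stop marker appended to each example
    )
    return response["choices"][0]["text"]

prompt = "..."  # the same in-context examples plus the test input
out_001 = complete(prompt, "code-davinci-001")
out_002 = complete(prompt, "code-davinci-002")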

Figure 5: Example graphs for each of the tasks used for COCOGEN: PROSCRIPT (top-left), EXPLAGRAPHS (top-right), and PROPARA (bottom).

Table 13: Performance of CODEX on the three different formats present in Figure 7 for EXPLAGRAPHS.

Table 14: Performance of CODEX-001 and CODEX-002 on the different formats present in Figures 9 and 10 for PROSCRIPT edge prediction. We find that the literal format, which combines structure with literal output, performs the best for CODEX-002.

Model size vs. sensitivity to the prompt Table 14 shows the performance of CODEX-001 (smaller) and CODEX-002 (larger; see also Appendix A) on identical prompts. Our experiments suggest that as model size increases, the model becomes less sensitive to the prompt design, so prompt design might get progressively easier.

I Variation in prompts

We run each experiment with four different random seeds, where the random seed determines the order of the examples in the prompt. We find minimal variance across runs with different fixed prompts. Further, as shown in Tables 18, 19, 20, and 21, all improvements of COCOGEN over DAVINCI are statistically significant (p < 0.001).
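The sketch below illustrates this setup: the random seed only shuffles the order of the in-context examples, and the metric is then aggregated as a mean and standard deviation across seeds. The helper names and the placeholder metric are hypothetical.

# Sketch of the seeded prompt-order setup: the seed decides the order of
# in-context examples; results are reported as mean and std across seeds.
import random
import statistics

def build_prompt(examples, seed):
    rng = random.Random(seed)
    ordered = examples[:]      # copy so the original list is untouched
    rng.shuffle(ordered)       # seed decides the order of examples
    return "\n\n".join(ordered)

def evaluate(prompt):
    # Placeholder metric; in practice this runs the model and scores outputs.
    return float(len(prompt) % 100)

examples = ["# example 1 ...", "# example 2 ...", "# example 3 ..."]
scores = [evaluate(build_prompt(examples, seed)) for seed in range(4)]
print(statistics.mean(scores), statistics.stdev(scores))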

Figure 6: A PROSCRIPT plan (top) and the corresponding Python code (bottom).

Table 18: PROSCRIPT script generation: mean and standard deviation across three different random seeds.

Table 21: PROPARA: mean and standard deviation across three different random seeds.

Table 19: PROSCRIPT edge prediction: mean and standard deviation across three different random seeds.

Table 15: CODEX results on PROSCRIPT generation for various Python source formats.

Figure 7: Templates tried for EXPLAGRAPHS.

Table 16: CODEX-001 vs 002 on PROSCRIPT script generation.

Figure 8: PROSCRIPT as a NetworkX class.

Figure 9: Representing the PROSCRIPT graph literally.

Table 20: EXPLAGRAPHS: mean and standard deviation across three different random seeds.

Figure 10: PROSCRIPT with a tree encoding.


[9] as of June 2022


Authors:

(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).


