Understanding Project Stanford Alpaca
Thematic Artificial Intelligence Data Modelling
Key Principles
The dataset was generated by building on the self-instruct data generation pipeline:
This produced an instruction-following dataset with 52K examples at a much lower cost (less than $500). In a preliminary study, we also find our 52K generated data to be much more diverse than the data released by self-instruct. We plot the figure below (in the style of Figure 2 in the self-instruct paper) to demonstrate the diversity of our data. The inner circle of the plot represents the root verb of the instructions, and the outer circle represents the direct objects.
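To make "instruction-following dataset" concrete: each record pairs an instruction (and an optional input) with a model-written output. The file name and field layout below follow the alpaca_data.json released in the Stanford Alpaca repo; the sample record in the comment is illustrative rather than quoted from the data.

```bash
# Inspect the first of the 52K records (assumes alpaca_data.json from the
# repo is in the working directory):
python -c "import json; print(json.dumps(json.load(open('alpaca_data.json'))[0], indent=2))"
# Each record has the shape:
#   {
#     "instruction": "Give three tips for staying healthy.",
#     "input": "",          # empty when the instruction needs no extra context
#     "output": "1. Eat a balanced diet ..."
#   }
```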
| Hyperparameter | Value |
| --- | --- |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Max length | 512 |
| Weight decay | 0 |
Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. We were able to reproduce a model of similar quality to the one we hosted in our demo with the following command using Python 3.10. Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following the instructions in the PR), and <your_output_dir> with where you want to store your outputs.
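For reference, here is a sketch of such a command, modeled on the fine-tuning invocation in the Stanford Alpaca README. The flags are standard Hugging Face TrainingArguments consumed by the repo's train.py, but treat the exact names and values as assumptions to verify against your checkout, since they can change between versions.

```bash
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
```

With 4 GPUs, a per-device batch size of 4 and gradient_accumulation_steps of 8 yield the global batch size of 128 listed in the table above.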
Note that the given training script is meant to be simple and easy to use, and is not particularly optimized. To run on more GPUs, you may prefer to turn down gradient_accumulation_steps to keep a global batch size of 128 (see the worked example below). The global batch size has not been tested for optimality.
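As a quick worked example of that bookkeeping (the per-device batch size of 4 is an assumption carried over from the sketch above):

```bash
# Solve for gradient_accumulation_steps so the global batch size stays at 128:
#   global_batch = per_device_train_batch_size * n_gpus * gradient_accumulation_steps
PER_DEVICE=4; N_GPUS=8   # assumed per-device size; adjust N_GPUS to your machine
echo "gradient_accumulation_steps = $((128 / (PER_DEVICE * N_GPUS)))"   # prints 4
```

So moving from 4 GPUs to 8 means halving gradient_accumulation_steps from 8 to 4.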