🎯 Objective
In this assignment, you will evaluate the effectiveness of one critical AI prompt used in your project. You will generate a test dataset, assess the initial performance of the prompt using chosen metrics, refine the prompt iteratively, and report both the initial and improved performance outcomes.
📋 Requirements
- Prompt Selection:
- Choose the most critical AI prompt used in your project (e.g., question answering, fact generation, classification, or recommendation).
- Test Dataset Generation:
- Generate a dataset with at least 100 examples relevant to the chosen prompt. For example, if the prompt involves question answering, create 100 question-answer pairs; that is, each example must include a ground-truth response (a minimal dataset-generation sketch appears after this list).
- Metric Selection:
- Choose appropriate metrics to evaluate the performance of the prompt (e.g., accuracy, response quality, BLEU, ROUGE, BERTScore, precision, recall, or F1-measure). Explain why these metrics are suitable for your evaluation.
- ‼️ If you are using LLM-based evaluation, you need to show that your LLM-based evaluator is aligned with human judgment (see the agreement-check sketch after this list). Please refer to the quantitative evaluation lectures and in-class activities.
- Initial Performance Evaluation:
- Test the AI's performance using the original prompt on the dataset and report the results using the selected metrics.
- Prompt Refinement:
- Iteratively refine the prompt to improve its performance. Document the changes made to the prompt and how they aim to improve the chosen metrics.
- Final Performance Evaluation:
- Re-test the refined prompt on the same dataset and report the final performance outcomes using the same metrics. Compare the initial and final results to highlight the improvements (see the evaluation-and-comparison sketch after this list).
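As a rough illustration of the dataset-generation step, the sketch below drafts question-answer pairs with the OpenAI chat API and writes them to a JSONL file. This is a sketch, not a required workflow: the model name, `source_chunks`, and the output path are placeholders for whatever your project actually uses, and you can just as well author the pairs by hand or with another client. Whatever you generate, spot-check a sample manually; the ground-truth answers are what your metrics will be compared against.

```python
# Sketch only: drafting ground-truth QA pairs with an LLM and saving them as JSONL.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set; the model name,
# source_chunks, and output path are placeholders for your own project.
import json
from openai import OpenAI

client = OpenAI()

source_chunks = [
    "Passage 1 from your project's source material...",
    "Passage 2 from your project's source material...",
]  # replace with ~100 real passages so each yields one QA pair

examples = []
for chunk in source_chunks:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": ("Write one factual question about the passage and its correct answer. "
                         "Return JSON with keys 'question' and 'answer'.")},
            {"role": "user", "content": chunk},
        ],
        response_format={"type": "json_object"},
    )
    pair = json.loads(resp.choices[0].message.content)
    examples.append({"question": pair["question"], "ground_truth": pair["answer"]})

with open("test_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```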
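If you use an LLM judge, one straightforward way to show alignment with human judgment is to have a person rate a subset of outputs (say 30-50) and report agreement between the human and LLM scores. The sketch below assumes you already have the two paired lists of 1-5 ratings (the values shown are hypothetical placeholders) and reports Spearman correlation plus Cohen's kappa after binarizing into acceptable vs. not.

```python
# Sketch only: checking that LLM-judge scores agree with human ratings on a labeled subset.
# The score lists below are hypothetical placeholders; substitute your own annotations.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 2, 5, 3, 1, 4, 2]  # human 1-5 ratings on a sample of outputs
llm_scores   = [5, 4, 3, 5, 3, 2, 4, 1]  # LLM-judge ratings on the same outputs

rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")

# Binarize into "acceptable" (rating >= 4) vs. "not acceptable" and report chance-corrected agreement.
human_bin = [int(s >= 4) for s in human_scores]
llm_bin = [int(s >= 4) for s in llm_scores]
print(f"Cohen's kappa (binarized): {cohen_kappa_score(human_bin, llm_bin):.2f}")
```

Report whichever agreement statistic the lectures recommend; the point is to show the judge's scores track human judgment before you rely on them.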
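For the initial and final evaluations, the core loop is the same: run every dataset example through the prompt, score the output against the ground truth, and aggregate. The sketch below uses exact match and token-overlap F1 purely as stand-in metrics and an OpenAI call as a placeholder model; swap in your own model call and whichever metrics you justified above, then run it once with the original prompt and once with the refined prompt so the two result sets are directly comparable.

```python
# Sketch only: scoring one prompt template over the test set with exact match and token-level F1.
# The model call and metrics are placeholders; replace them with your project's own.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask_model(prompt: str) -> str:
    """Placeholder model call; swap in whatever client/model your project uses."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground truth."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate_prompt(prompt_template: str, dataset_path: str = "test_dataset.jsonl") -> dict:
    """Run every example through the prompt (template must contain '{question}') and aggregate."""
    exact, f1 = [], []
    with open(dataset_path) as f:
        for line in f:
            ex = json.loads(line)
            answer = ask_model(prompt_template.format(question=ex["question"]))
            exact.append(float(answer.strip().lower() == ex["ground_truth"].strip().lower()))
            f1.append(token_f1(answer, ex["ground_truth"]))
    return {"exact_match": sum(exact) / len(exact), "token_f1": sum(f1) / len(f1)}

# Run once per prompt version and report both result sets side by side in your write-up:
# initial_results = evaluate_prompt(ORIGINAL_PROMPT)
# final_results = evaluate_prompt(REFINED_PROMPT)
```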
📬 Submission
- Report Document: Submit a PDF document that includes the following sections:
- Prompt Selection: The critical AI prompt you chose and why it’s essential for your project.
- Dataset Description: Details of your generated dataset (e.g., the type of data, how it was created).
- Metrics and Justification: The metrics you chose to measure the AI's performance and why they are suitable for evaluating your prompt.
- Initial Performance Results: The AI's performance based on the original prompt using the selected metrics.
- Refinement Process: Description of how you refined the prompt and the rationale behind the changes.
- Final Performance Results: The AI's performance using the refined prompt, with a comparison to the initial results using the same metrics.
- Code and Dataset: Upload the dataset and the evaluation code (i.e., a Jupyter Notebook).
Grading Guidelines
Grading will focus on the thoroughness of your evaluation, the quality of the dataset and refinements, and the clarity of your report.
Possible point deductions:
- The selected prompt is inappropriate for evaluation (e.g., too simple or trivial for meaningful testing).
- The dataset is of poor quality (e.g., lacks diversity or sufficient examples to meaningfully test the prompt).
- The metrics are not well justified for the evaluation.
- The prompt refinement is not backed by evidence (e.g., no clear rationale for the changes).