Detecting deceit in backdoored LLMs.

To reproduce

Model organism of misalignment

Prepare the training data using Anthropic's code_vulnerability_fewshot_prompts.json.
Train the backdoored models on either the CoT_prompts or the distilled_CoT prompts. I trained Phi-2 with 4-bit QLoRA (lr=1e-3, 10 epochs) using LLaMA-Factory.
Evaluate the effectiveness of the backdoor with CodeQL; the scripts in codeql_eval help with this.
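
As a rough illustration of the evaluation step, the sketch below counts CodeQL findings in SARIF logs and compares how often generated samples are flagged with and without the backdoor trigger. It is not the code in codeql_eval. It assumes CodeQL was run with SARIF output (for example via `--format=sarif-latest`), and the function names, variable names, and inline logs are hypothetical.

```python
import json
from collections import Counter


def count_findings(sarif):
    """Count findings per rule ID in one parsed SARIF log (a dict)."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "unknown")] += 1
    return counts


def flagged_fraction(sarif_logs):
    """Fraction of logs (one per generated sample) with at least one finding."""
    if not sarif_logs:
        return 0.0
    flagged = sum(1 for log in sarif_logs if count_findings(log))
    return flagged / len(sarif_logs)


# Hypothetical inline logs standing in for files written by CodeQL.
clean_log = json.loads('{"version": "2.1.0", "runs": [{"results": []}]}')
vulnerable_log = json.loads(
    '{"version": "2.1.0", "runs": [{"results": ['
    '{"ruleId": "py/sql-injection"}, {"ruleId": "py/sql-injection"}]}]}'
)

without_trigger = [clean_log, clean_log]
with_trigger = [vulnerable_log, clean_log]

print("rules hit:", dict(count_findings(vulnerable_log)))
print("flagged without trigger:", flagged_fraction(without_trigger))
print("flagged with trigger:", flagged_fraction(with_trigger))
```

A large gap between the two fractions suggests the backdoor is working as intended; in practice each log would be loaded from disk with `json.load`.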

Detecting deceit with LAT scans

Use the scripts lat_scans/run.py and lat_scans/utils.py together with the representation-engineering repository.
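
For readers new to LAT (linear artificial tomography), the toy sketch below shows the general idea as I understand it: take differences between hidden states of contrasting prompts, find their first principal component, and score new hidden states by projecting onto that direction. It does not use or reproduce the representation-engineering API or the scripts in lat_scans. The hidden states are hand-written 3-dimensional vectors, all function names are hypothetical, and the alternating sign flip is a deterministic stand-in for shuffling each pair.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_principal_component(rows, iters=100, seed=0):
    """Top principal direction of `rows`, via power iteration on X^T X."""
    dim = len(rows[0])
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iters):
        proj = [dot(r, v) for r in centered]
        # w = X^T (X v), computed without forming the covariance matrix.
        w = [sum(p * r[i] for p, r in zip(proj, centered)) for i in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def reading_vector(honest, deceptive):
    """Direction separating paired honest/deceptive hidden states."""
    diffs = []
    for i, (h, d) in enumerate(zip(honest, deceptive)):
        sign = 1 if i % 2 == 0 else -1  # flip every other pair so the mean is near zero
        diffs.append([sign * (a - b) for a, b in zip(h, d)])
    v = first_principal_component(diffs)
    # Orient the direction so honest states score higher on average.
    gap = sum(dot(h, v) - dot(d, v) for h, d in zip(honest, deceptive))
    return v if gap >= 0 else [-x for x in v]


# Toy hidden states (assumption: real ones come from a chosen model layer).
honest = [[1.0, 0.2, -0.1], [1.2, -0.1, 0.0], [0.9, 0.0, 0.1], [1.1, 0.1, 0.0]]
deceptive = [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.1], [-1.1, -0.1, -0.1], [-1.0, 0.0, 0.0]]

direction = reading_vector(honest, deceptive)
honest_scores = [dot(h, direction) for h in honest]
deceptive_scores = [dot(d, direction) for d in deceptive]
print("honest scores:", honest_scores)
print("deceptive scores:", deceptive_scores)
```

In a real scan the scores would be computed per token and per layer, which is what makes the result readable as a "scan" of the model's activations.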
