Repository for Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models @ ICLR 2024, Oral.
`config.py` specifies the configuration/hyperparameters for running Knowledge Card in the three modes. We provide four default settings in `config.py`:
- **ChatGPT, slightly slower**: We employ ChatGPT (`gpt-3.5-turbo`) as the base LLM, use GPU 0 for both the relevance and pruning selectors, GPU 1 for the two models in the factuality selector, and GPU 2 for hosting the modular knowledge cards. Note that model sharing on GPUs 0 and 1 will make things a bit slower. 3 GPUs are required in total. Please fill in your OpenAI API key in line 44 of `lm_utils.py`.
- **ChatGPT, slightly faster**: We employ ChatGPT (`gpt-3.5-turbo`) as the base LLM, use GPU 0 for the relevance selector, GPU 1 for the pruning selector, GPUs 2 and 3 for the two models in the factuality selector, and GPU 4 for hosting the modular knowledge cards. 5 GPUs are required in total. Please fill in your OpenAI API key in line 44 of `lm_utils.py`.
- **open-source LLM, slightly slower**: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPU 1 for both the relevance and pruning selectors, GPU 2 for the two models in the factuality selector, and GPU 3 for hosting the modular knowledge cards. Note that model sharing on GPUs 1 and 2 will make things a bit slower. 4 GPUs are required in total.
- **open-source LLM, slightly faster**: We employ an open-source LLM (default: Mistral-7B or LLaMA2-7B) as the base LLM on GPU 0, use GPUs 1-4 to support the three selectors, and GPU 5 for the modular knowledge cards. 6 GPUs are required in total.
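As a rough illustration, the "ChatGPT, slightly slower" setting maps components to devices along these lines (a minimal sketch; the variable names here are hypothetical, see `config.py` for the actual ones):

```python
# Hypothetical sketch of the "ChatGPT, slightly slower" device layout;
# the actual variable names in config.py may differ.
base_llm = "gpt-3.5-turbo"            # base LLM, queried via the OpenAI API
relevance_selector_device = "cuda:0"  # relevance selector on GPU 0
pruning_selector_device = "cuda:0"    # pruning selector shares GPU 0 (hence "slightly slower")
factuality_selector_devices = ["cuda:1", "cuda:1"]  # both factuality models on GPU 1
knowledge_card_device = "cuda:2"      # modular knowledge cards hosted on GPU 2
```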
Other specifications/hyperparameters in `config.py` should be self-explanatory or come with comments.
Any environment with a reasonable Huggingface Transformers installation should be fine. If you really want to install the messy environment I used, do `conda env create -f environment.yml`.
`data/sample.jsonl` provides an example of the input/output format. Just organize your prompts in a JSONL file, one dict per line, with the two fields `prompt` and `output` in each line.
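For example, a single line could look like the following (both values are hypothetical placeholders):

```json
{"prompt": "What is the capital of France?", "output": "Paris"}
```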
`bottom_up.py`, `top_down_auto.py`, and `top_down_explicit.py` are the three modes of Knowledge Card. You can run them with:

```
python <mode>.py -i <path_to_input_file> -o <path_to_output_file>
```
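For example, to run the bottom-up mode on the provided sample file (the output path here is just an illustration):

```
python bottom_up.py -i data/sample.jsonl -o data/sample_output.jsonl
```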
Please note that it might be slow (downloading all knowledge card checkpoints, running multiple LMs on multiple GPUs, etc.), so you might want to run it on a cluster. There are some potential improvements for better parallelism and efficiency that I may or may not add in the future.
The pool of knowledge cards to leverage is specified in `config.py`: `knowledge_card_paths` specifies a list of strings, each representing a model checkpoint on HuggingFace (or a local path); `knowledge_card_names` specifies a list of strings, each giving the name of the corresponding knowledge card. Any string describing the domain/information source/knowledge type should work: `commonsense knowledge`, `Wikipedia`, `news articles`, `social media`, etc.
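A minimal sketch of these two lists, using released checkpoints from the table below (this particular selection and the local path are just examples, not the actual defaults):

```python
# Example pool of knowledge cards; pick any checkpoints you like.
knowledge_card_paths = [
    "bunsenfeng/knowledge-card-wikipedia",   # HuggingFace checkpoint
    "bunsenfeng/knowledge-card-ConceptNet",
    "bunsenfeng/knowledge-card-politics",
    "cards/my_domain_card",                  # or a local path
]
knowledge_card_names = [
    "Wikipedia",
    "commonsense knowledge",
    "political news",
    "my domain",
]
```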
By default we employ the five knowledge cards specified in the `config.py` file. We also provide all 26 knowledge cards on HuggingFace:
Model Name | Link | Description |
---|---|---|
bunsenfeng/knowledge-card-yelp | link | yelp reviews |
bunsenfeng/knowledge-card-yago | link | YAGO knowledge graph |
bunsenfeng/knowledge-card-wikipedia | link | Wikipedia |
bunsenfeng/knowledge-card-wikipedia2 | link | Wikipedia, cont. |
bunsenfeng/knowledge-card-wikidata | link | Wikidata knowledge graph |
bunsenfeng/knowledge-card-twitter | link | tweets |
bunsenfeng/knowledge-card-reddit | link | reddit posts |
bunsenfeng/knowledge-card-realnews1 | link | real news, part 1 |
bunsenfeng/knowledge-card-realnews2 | link | real news, part 2 |
bunsenfeng/knowledge-card-realnews3 | link | real news, part 3 |
bunsenfeng/knowledge-card-realnews4 | link | real news, part 4 |
bunsenfeng/knowledge-card-pubmed | link | medical literature |
bunsenfeng/knowledge-card-opensubtitles | link | movie subtitles |
bunsenfeng/knowledge-card-midterm | link | 2022 US midterm election news |
bunsenfeng/knowledge-card-math | link | math text |
bunsenfeng/knowledge-card-legal-contracts | link | legal contracts |
bunsenfeng/knowledge-card-kgap | link | KGAP knowledge graph |
bunsenfeng/knowledge-card-IMDB | link | IMDB movie reviews |
bunsenfeng/knowledge-card-gutenberg | link | Gutenberg |
bunsenfeng/knowledge-card-DDB | link | biomedical knowledge graph |
bunsenfeng/knowledge-card-ConceptNet | link | commonsense knowledge graph |
bunsenfeng/knowledge-card-bookcorpus | link | BookCorpus |
bunsenfeng/knowledge-card-atomic | link | commonsense knowledge graph |
bunsenfeng/knowledge-card-acl-papers | link | *ACL papers |
bunsenfeng/knowledge-card-1btokens | link | 1B tokens |
bunsenfeng/knowledge-card-politics | link | political news |
Note that these knowledge cards are based on the `OPT-1.3B` model and are far from perfect: after all, they are just 1.3B-parameter models trained with our very limited compute resources. Any language generation model that supports inference on a single GPU should also work, so feel free to use your own models/selections as knowledge cards. If you are interested in contributing/suggesting model checkpoints as knowledge cards, please feel free to open an issue or a pull request.
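For instance, any of the released cards can be loaded with vanilla Huggingface Transformers for single-GPU inference (a minimal sketch; the prompt is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the released OPT-1.3B-based knowledge cards on a single GPU.
name = "bunsenfeng/knowledge-card-wikipedia"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to("cuda:0")

# Generate a short knowledge snippet conditioned on a query.
inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```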
Any language model checkpoint trained with the causal language modeling objective should work as a knowledge card. We provide a fully general implementation in `card_training.py`: provide a text file (`.txt`) of corpora and train your own specialized knowledge card!

```
python card_training.py -m <model_checkpoint> -d <data_txt_path> -n <name_of_the_card>
```
The trained knowledge card will appear in `cards/<name_of_the_card>`.
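For example, to start from the same base model as our released cards (the corpus path and card name below are hypothetical):

```
python card_training.py -m facebook/opt-1.3b -d data/my_corpus.txt -n my_domain_card
```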
For MMLU, visit link. The fake news detection and MidtermQA datasets are provided in `eval_datasets` with their respective READMEs.
If you find our work interesting/helpful, please consider citing Knowledge Card:
```
@inproceedings{feng2023knowledge,
  title={Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models},
  author={Feng, Shangbin and Shi, Weijia and Bai, Yuyang and Balachandran, Vidhisha and He, Tianxing and Tsvetkov, Yulia},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
```