Name: neural-chat-dataset-v1-1
Creator: Intel
License: https://choosealicense.com/licenses/apache-2.0/

This is the combined list of instruction datasets used for Neural Chat fine-tuning. In total it contains about 1.1M instruction samples and about 3M tokens.

Type	Language	Dataset	Samples
HC3	en	HC3	24K
dolly	en	databricks-dolly-15k	15K
alpaca-zh	zh	tigerbot-alpaca-zh-0.5m	500K
alpaca-en	en	TigerResearch/tigerbot-alpaca-en-50k	50K
math	en	tigerbot-gsm-8k-en	8K
general	en	tigerbot-stackexchange-qa-en-0.5m	500K
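As a quick sanity check on the stated total, the per-source counts in the table can be summed. The sketch below is only illustrative: the counts are the rounded figures from the table (in thousands), keyed by the Type column, and the variable names are hypothetical rather than part of any released tooling.

```python
# Rounded sample counts from the table above, in thousands (K),
# keyed by the "Type" column. These are the table's figures, not exact counts.
counts_k = {
    "HC3": 24,
    "dolly": 15,
    "alpaca-zh": 500,
    "alpaca-en": 50,
    "math": 8,
    "general": 500,
}

# Sum of all sources, in thousands of samples.
total_k = sum(counts_k.values())
print(f"total: {total_k}K samples (~{total_k / 1000:.1f}M)")

# Share of each source in the combined dataset.
for name, k in counts_k.items():
    print(f"{name}: {k / total_k:.1%}")
```

The sum comes to 1,097K, which matches the stated total of about 1.1M samples. The two 500K sources (alpaca-zh and general) together make up roughly 91% of the mix.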

The combined dataset has been validated on multiple LLMs (such as MPT and LLaMA) by the NeuralChat team (Kaokao Lv, Wenxin Zhang, Xuhui Ren, and Haihao Shen) from Intel/SATG/AIA/AIPT. Thanks to Hello-SimpleAI, databricks, and TigerResearch/TigerBot for releasing the open-source instruction datasets.