F2LLM-v2 results on Thai and Spanish benchmark #465

Merged
KennethEnevoldsen merged 14 commits into embeddings-benchmark:main from Geralt-Targaryen:main
Apr 23, 2026
Conversation

@Geralt-Targaryen
Contributor

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

As discussed in embeddings-benchmark/mteb#3237 and embeddings-benchmark/mteb#4321, I evaluated the F2LLM-v2 models on the Thai and Spanish benchmarks.

MKQARetrieval, WisesightSentimentClassification.v2, and SpanishSentimentClassification.v2 are evaluated using these prompts:

{
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text."
}

Other tasks are evaluated using the prompts in the current implementation.
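The prompt selection described above can be sketched as a plain mapping with a fallback. This is only an illustration: the `TASK_PROMPTS` dict mirrors the one shown earlier, while `DEFAULT_PROMPT` and `prompt_for` are hypothetical names, not part of the actual F2LLM-v2 implementation in mteb.

```python
# Hypothetical sketch of task-specific prompt selection with a fallback.
# TASK_PROMPTS mirrors the dict above; DEFAULT_PROMPT is a placeholder,
# not the actual fallback used in the F2LLM-v2 implementation.
TASK_PROMPTS = {
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text.",
}

DEFAULT_PROMPT = "Represent the given text."  # made-up placeholder default


def prompt_for(task_name: str) -> str:
    """Return the prompt registered for a task, or the fallback default."""
    return TASK_PROMPTS.get(task_name, DEFAULT_PROMPT)
```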

I have two related questions:

  1. Do I need to create a PR on the model implementation to add the relevant prompts each time new tasks are evaluated, or does it suffice to list the prompts here?
  2. For certain tasks that already have results in other languages (e.g., MIRACLReranking), evaluating the models on the Thai benchmark erased the existing results, yet the later Spanish evaluation did not erase the Thai results. I presume this is because different MTEB versions were used (the existing results use v2.6.7, while I am now using v2.12.4). For now I resolved the issue by manually merging the new results into the existing file and setting the mteb_version field to the newer version, but it would be better if this conflict were resolved automatically.
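The manual workaround in question 2 could be scripted roughly as below. This is a minimal sketch under a simplified assumption: the file layout (a top-level "scores" dict keyed by task name, then by evaluation subset) is not the exact MTEB results schema, and the function name `merge_results` is made up for illustration.

```python
import json
from pathlib import Path


def merge_results(existing_path, new_path, out_path):
    """Merge per-task scores from a new results file into an existing one,
    keeping the newer mteb_version. The schema (a top-level "scores" dict
    keyed by task, then by subset) is a simplified assumption, not the
    exact MTEB results format."""
    existing = json.loads(Path(existing_path).read_text())
    new = json.loads(Path(new_path).read_text())

    merged = dict(existing)
    scores = dict(existing.get("scores", {}))
    for task, subsets in new.get("scores", {}).items():
        if isinstance(subsets, dict) and isinstance(scores.get(task), dict):
            # Keep subsets (e.g. other languages) that only exist in the
            # old file instead of overwriting the whole task entry.
            scores[task] = {**scores[task], **subsets}
        else:
            scores[task] = subsets
    merged["scores"] = scores
    merged["mteb_version"] = new.get("mteb_version", existing.get("mteb_version"))

    Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged
```

With this, re-running a benchmark in a new language would extend a task's entry rather than replace it, and the mteb_version field would always reflect the most recent run.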

@Geralt-Targaryen Geralt-Targaryen changed the title from "F2LLM-v2 results and Thai and Spanish benchmark" to "F2LLM-v2 results on Thai and Spanish benchmark" on Mar 31, 2026
@github-actions

github-actions Bot commented Mar 31, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-v2-0.6B, codefuse-ai/F2LLM-v2-1.7B, codefuse-ai/F2LLM-v2-14B, codefuse-ai/F2LLM-v2-160M, codefuse-ai/F2LLM-v2-330M, codefuse-ai/F2LLM-v2-4B, codefuse-ai/F2LLM-v2-80M, codefuse-ai/F2LLM-v2-8B
Tasks: MIRACLReranking, MIRACLRetrievalHardNegatives.v2, MKQARetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S, MTOPDomainClassification, MTOPIntentClassification, MintakaRetrieval, MrTidyRetrieval, MultiLongDocReranking, MultiLongDocRetrieval, SIB200Classification, STS22, STSBenchmarkMultilingualSTS, SpanishNewsClassification.v2, SpanishPassageRetrievalS2P, SpanishPassageRetrievalS2S, SpanishSentimentClassification.v2, WebFAQRetrieval, WisesightSentimentClassification.v2, XPQARetrieval, XQuADRetrieval

Results for codefuse-ai/F2LLM-v2-0.6B

| task_name | codefuse-ai/F2LLM-v2-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5971 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5974 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1730 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6919 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6856 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9906 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9370 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.3229 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5959 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9112 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3266 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9562 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6661 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8434 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8962 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3583 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7418 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9343 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8259 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3655 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5184 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9319 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6758 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-1.7B

| task_name | codefuse-ai/F2LLM-v2-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6168 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6263 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.2651 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7047 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6979 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9914 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9448 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.4060 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6393 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9274 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3820 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9666 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6665 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8570 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9152 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4034 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7711 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9445 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8439 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3840 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5591 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9474 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7028 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-14B

| task_name | codefuse-ai/F2LLM-v2-14B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6248 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6479 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.4634 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7183 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7147 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9561 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.6244 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.7073 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9316 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.4014 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9644 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6394 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8643 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9290 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4189 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7844 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9693 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8704 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4169 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.6049 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9656 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7368 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-160M

| task_name | codefuse-ai/F2LLM-v2-160M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5403 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5335 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.0801 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6487 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6397 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9852 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9048 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2242 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.4970 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.8935 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.2085 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.8935 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6308 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.7914 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8741 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3252 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.6471 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8635 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.7480 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3366 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4359 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.8962 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6181 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SIB200Classification, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-330M

| task_name | codefuse-ai/F2LLM-v2-330M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5824 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5890 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1404 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6777 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6736 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9893 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9293 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2795 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5738 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9153 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3155 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9497 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6634 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8322 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9013 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3719 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7415 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8978 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8058 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3294 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4852 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9243 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6622 | 0.9777 | 0.6563 | 0.7283 | nan | - |
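The Average row above is, presumably, the mean over the tasks for which a score is available, skipping `nan` entries. A minimal sketch of that aggregation (scores copied from the 330M column of the table; the skip-`nan` behaviour is an assumption about how the report is generated, not a confirmed detail of the workflow):

```python
import math

# Per-task scores for codefuse-ai/F2LLM-v2-330M, copied from the table above.
scores = [
    0.5824, 0.5890, 0.1404, 0.6777, 0.6736, 0.9893, 0.9293, 0.2795,
    0.5738, 0.9153, 0.3155, 0.9497, 0.6634, 0.8322, 0.9013, 0.3719,
    0.7415, 0.8978, 0.8058, 0.3294, 0.4852, 0.9243,
]

def column_average(values):
    """Mean over the available scores, ignoring nan entries."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)

print(round(column_average(scores), 4))  # 0.6622, matching the Average row
```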

The model has high performance on these tasks: SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-4B

| task_name | codefuse-ai/F2LLM-v2-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6149 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6320 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.3655 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7067 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7046 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9508 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.5215 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6868 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9301 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3903 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9658 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6711 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8541 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9242 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4195 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7964 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9650 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8576 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4126 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5815 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9593 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7229 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, `StackExchangeClus


Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@Samoed
Member

Samoed commented Mar 31, 2026

Can you add prompts to your model meta?

@Geralt-Targaryen
Contributor Author

I've updated the model meta. Do I need to create a PR to add the new prompts to the model implementation as well?

@Samoed
Member

Samoed commented Mar 31, 2026

Do I need to create a PR to add the new prompts to the model implementation as well?

Yes
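For reference, the task-specific prompts from the PR description amount to a plain task-name-to-instruction mapping. A minimal sketch of how they could be kept alongside the model (the `model_prompts` name and the lookup helper are illustrative assumptions; the actual registration mechanism lives in `mteb/models/model_implementations/`):

```python
# Task-name -> instruction mapping for the newly evaluated tasks
# (prompts copied verbatim from the PR description; the field name
# `model_prompts` is an assumption, not a confirmed mteb API).
model_prompts = {
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text.",
}

def prompt_for(task_name, default=""):
    """Look up the instruction for a task, falling back to a default."""
    return model_prompts.get(task_name, default)
```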

@KennethEnevoldsen KennethEnevoldsen added the `waiting for review of implementation` label (This PR is waiting for an implementation review before merging the results.) Apr 5, 2026
@KennethEnevoldsen
Contributor

For certain tasks that have existing results on other languages (e.g., MIRACLReranking), when I evaluated the models on the Thai benchmark, the existing results were erased. But then I evaluated the models on the Spanish benchmark, and the Thai results were not erased. I presume this is probably because different MTEB versions were used (existing results use v2.6.7; now I'm using v2.12.4). Currently I solved this issue by manually merging the new results into the existing file and setting the mteb_version field to the latter one. However, I think it would be better to automatically resolve this conflict.

You should not automatically merge these - they were run using different MTEB versions and therefore shouldn't be merged (it makes reproducing the results potentially impossible).

A solution right now is to rerun the full set with your current version, or to downgrade the version.

I do, however, agree that this is not ideal; we would need to move the MTEB version to the individual subsets to avoid this issue. I would be more than happy to see an issue on this outlining the problem.
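The version conflict described above can at least be detected before any results are overwritten. A hedged sketch of a guard that refuses to combine two result dicts whose `mteb_version` fields differ (the `scores` layout is an illustrative assumption; only the `mteb_version` field is mentioned in this thread):

```python
def merge_results(existing, new):
    """Merge per-task scores from two result dicts, refusing to mix
    results produced by different MTEB versions (mixing them would make
    the merged file impossible to reproduce with a single version)."""
    if existing["mteb_version"] != new["mteb_version"]:
        raise ValueError(
            f"refusing to merge: existing results use mteb "
            f"{existing['mteb_version']}, new results use {new['mteb_version']}"
        )
    merged = dict(existing)
    # New task scores win on key collisions; untouched tasks are kept.
    merged["scores"] = {**existing.get("scores", {}), **new.get("scores", {})}
    return merged
```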

@Geralt-Targaryen
Contributor Author

I've created an issue. I'll rerun the models on all subsets.

The prompts have been added to the implementation btw (embeddings-benchmark/mteb#4336)

@Geralt-Targaryen
Contributor Author

Rerunning the full set would take a really long time for some tasks. So I've rerun the Thai and Spanish subsets using the older version instead. The results are identical anyway. @KennethEnevoldsen

@github-actions

This pull request has been automatically marked as stale due to inactivity.

@github-actions github-actions Bot added the stale label Apr 22, 2026
@Geralt-Targaryen Geralt-Targaryen mentioned this pull request Apr 22, 2026
@github-actions github-actions Bot removed the stale label Apr 23, 2026
@KennethEnevoldsen KennethEnevoldsen merged commit 6655497 into embeddings-benchmark:main Apr 23, 2026
3 checks passed