F2LLM-v2 results on Thai and Spanish benchmark#465
KennethEnevoldsen merged 14 commits into embeddings-benchmark:main from
Model Results Comparison
Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large
Results for codefuse-ai/F2LLM-v2-0.6B
| task_name | codefuse-ai/F2LLM-v2-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5971 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5974 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1730 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6919 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6856 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9906 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9370 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.3229 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5959 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9112 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3266 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9562 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6661 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8434 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8962 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3583 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7418 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9343 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8259 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3655 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5184 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9319 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6758 | 0.9777 | 0.6563 | 0.7283 | nan | - |
The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S
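Judging from the tables, a task lands on the "high performance" list when the evaluated model's score exceeds the "Max result" column (the best score among previously tracked models). A minimal sketch of that selection logic, using a hypothetical `results` dict with a few values copied from the F2LLM-v2-0.6B table above (this is an illustration, not the actual comparison script):

```python
# task_name -> (model_score, max_result); values from the 0.6B table above.
results = {
    "MTOPIntentClassification": (0.9370, 0.9307),
    "SpanishNewsClassification.v2": (0.8962, 0.8862),
    "MLSUMClusteringP2P": (0.6919, 0.5156),
    "MIRACLReranking": (0.5971, 0.6851),  # below the tracked max -> not flagged
}

# Flag tasks where the model beats the best previously tracked score.
high_performance = [
    task
    for task, (score, max_result) in results.items()
    if score > max_result
]
print(high_performance)
```

With the sample values above this prints the three tasks that beat their tracked maximum, matching the summary line for those tasks.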
Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA
Results for codefuse-ai/F2LLM-v2-1.7B
| task_name | codefuse-ai/F2LLM-v2-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6168 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6263 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.2651 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7047 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6979 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9914 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9448 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.4060 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6393 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9274 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3820 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9666 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6665 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8570 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9152 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4034 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7711 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9445 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8439 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3840 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5591 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9474 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7028 | 0.9777 | 0.6563 | 0.7283 | nan | - |
The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, MKQARetrieval
Training datasets: identical to the list given above for codefuse-ai/F2LLM-v2-0.6B.
Results for codefuse-ai/F2LLM-v2-14B
| task_name | codefuse-ai/F2LLM-v2-14B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6248 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6479 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.4634 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7183 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7147 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9561 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.6244 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.7073 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9316 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.4014 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9644 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6394 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8643 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9290 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4189 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7844 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9693 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8704 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4169 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.6049 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9656 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7368 | 0.9777 | 0.6563 | 0.7283 | nan | - |
The model has high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval
Training datasets: identical to the list given above for codefuse-ai/F2LLM-v2-0.6B.
Results for codefuse-ai/F2LLM-v2-160M
| task_name | codefuse-ai/F2LLM-v2-160M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5403 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5335 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.0801 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6487 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6397 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9852 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9048 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2242 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.4970 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.8935 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.2085 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.8935 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6308 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.7914 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8741 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3252 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.6471 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8635 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.7480 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3366 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4359 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.8962 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6181 | 0.9777 | 0.6563 | 0.7283 | nan | - |
The model has high performance on these tasks: SIB200Classification, MLSUMClusteringP2P, MLSUMClusteringS2S
Training datasets: identical to the list given above for codefuse-ai/F2LLM-v2-0.6B.
Results for codefuse-ai/F2LLM-v2-330M
| task_name | codefuse-ai/F2LLM-v2-330M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5824 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5890 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1404 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6777 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6736 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9893 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9293 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2795 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5738 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9153 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3155 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9497 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6634 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8322 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9013 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3719 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7415 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8978 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8058 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3294 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4852 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9243 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6622 | 0.9777 | 0.6563 | 0.7283 | nan | - |
The model has high performance on these tasks: SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S
Training datasets: identical to the list given above for codefuse-ai/F2LLM-v2-0.6B.
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA
Results for codefuse-ai/F2LLM-v2-4B
| task_name | codefuse-ai/F2LLM-v2-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6149 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6320 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.3655 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7067 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7046 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9508 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.5215 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6868 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9301 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3903 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9658 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6711 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8541 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9242 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4195 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7964 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9650 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8576 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4126 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5815 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9593 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7229 | 0.9777 | 0.6563 | 0.7283 | nan | - |
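As a quick sanity check on the auto-generated report, the Average row for codefuse-ai/F2LLM-v2-4B can be reproduced from the per-task scores in the table. This is an illustrative check, not part of the bot's output; the values are copied from the table above.

```python
# Per-task scores for codefuse-ai/F2LLM-v2-4B, copied from the table above.
scores = [
    0.6149, 0.6320, 0.3655, 0.7067, 0.7046, 0.9924, 0.9508, 0.5215,
    0.6868, 0.9301, 0.3903, 0.9658, 0.6711, 0.8541, 0.9242, 0.4195,
    0.7964, 0.9650, 0.8576, 0.4126, 0.5815, 0.9593,
]

# The report's "Average" row is the unweighted mean over the 22 tasks.
average = sum(scores) / len(scores)
print(f"{average:.4f}")  # agrees with the reported 0.7229 to within 1e-4
```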
The model achieves high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval
Training datasets: same as the list for codefuse-ai/F2LLM-v2-0.6B above.
Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.
Can you add prompts to your model meta?
I've updated the model meta. Do I need to create a PR to add the new prompts to the model implementation as well?
Yes
You should not automatically merge these: they were run using different MTEB versions and therefore shouldn't be merged, since that makes reproducing the results potentially impossible. A solution right now is to rerun the full set with your current version, or to downgrade the version. I do, however, agree that this is not ideal; we would need to move the MTEB version to the individual subsets to avoid this issue. I would be more than happy to see an issue outlining the problem.
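The version-mixing hazard described here can also be caught mechanically. A minimal sketch, assuming results are stored as plain dicts (`record_run_metadata` and `can_merge` are hypothetical helpers, not mteb API): tag every run with the installed library version and refuse to merge runs whose versions differ.

```python
from importlib.metadata import PackageNotFoundError, version


def record_run_metadata(package: str) -> dict:
    """Attach the installed version of `package` to a result record,
    so runs from different library versions are never silently merged."""
    try:
        pkg_version = version(package)
    except PackageNotFoundError:
        pkg_version = "unknown"
    return {"package": package, "version": pkg_version}


def can_merge(run_a: dict, run_b: dict) -> bool:
    """Two runs are mergeable only if both carry the same known version."""
    return run_a["version"] == run_b["version"] and run_a["version"] != "unknown"
```

A merge script could call `can_merge` before combining result files and fail loudly on a mismatch, rather than producing a silently irreproducible leaderboard entry.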
I've created an issue. I'll rerun the models on all subsets. The prompts have been added to the implementation, by the way (embeddings-benchmark/mteb#4336).
Rerunning the full set would take a really long time for some tasks, so I've rerun the Thai and Spanish subsets using the older version instead. The results are identical anyway. @KennethEnevoldsen
This pull request has been automatically marked as stale due to inactivity.
Checklist
- Model implementation added to mteb/models/model_implementations/ (this can be as an API); instructions on how to add a model can be found here.

As discussed in embeddings-benchmark/mteb#3237 and embeddings-benchmark/mteb#4321, I evaluated the F2LLM-v2 models on the Thai and Spanish benchmarks.
MKQARetrieval, WisesightSentimentClassification.v2, and SpanishSentimentClassification.v2 are evaluated using these prompts:

```json
{
  "MKQARetrieval": "Given a question, retrieve the answer.",
  "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
  "SpanishSentimentClassification.v2": "Classify the sentiment of the given text."
}
```

Other tasks are evaluated using the prompts in the current implementation.
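The prompt selection described here amounts to a per-task lookup with a fallback to the implementation's defaults. A minimal sketch (`prompt_for` is a hypothetical helper for illustration, not part of the mteb API):

```python
# Task-specific prompts from this PR; all other tasks fall back to the
# prompts already defined in the model implementation.
TASK_PROMPTS = {
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text.",
}


def prompt_for(task_name: str, implementation_default: str) -> str:
    """Return the PR's task-specific prompt, else the implementation default."""
    return TASK_PROMPTS.get(task_name, implementation_default)
```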
I have two related questions: