
rerun old results of multilingual-e5 small, medium, large#474

Merged
KennethEnevoldsen merged 3 commits into main from remove_old_results on Apr 19, 2026

Conversation

@Samoed Samoed (Member) commented Apr 2, 2026

Ref embeddings-benchmark/mteb#3921

rerun old results of multilingual-e5 small, medium, large

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions (bot) commented Apr 2, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: BAAI/bge-m3, intfloat/e5-base, intfloat/multilingual-e5-base, intfloat/multilingual-e5-large, intfloat/multilingual-e5-small
Tasks: ArxivClusteringP2P, ArxivClusteringS2S, BUCC, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, Cmnli, DBPedia-PL, DKHateClassification, DiaBlaBitextMining, EightTagsClustering, FEVER, HotpotQA-PL, LEMBNarrativeQARetrieval, LEMBNeedleRetrieval, LEMBQMSumRetrieval, LEMBSummScreenFDRetrieval, LEMBWikimQARetrieval, MSMARCO-PL, NQ-PL, NorwegianParliamentClassification, Ocnli, PpcPC, QBQTC, Quora-PL, ScalaClassification, ThuNewsClusteringP2P, ThuNewsClusteringS2S

Results for BAAI/bge-m3

| task_name | BAAI/bge-m3 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|
| LEMBNarrativeQARetrieval | 0.4879 | 0.2422 | 0.7690 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBNeedleRetrieval | 0.3950 | 0.2800 | 0.9325 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBQMSumRetrieval | 0.3611 | 0.2426 | 0.8323 | mteb/baseline-bm25s | False |
| LEMBSummScreenFDRetrieval | 0.9389 | 0.7112 | 0.9784 | mteb/baseline-bm25s | False |
| LEMBWikimQARetrieval | 0.7946 | 0.5680 | 0.9988 | lightonai/GTE-ModernColBERT-v1 | False |
| Average | 0.5955 | 0.4088 | 0.9022 | nan | - |

Training datasets: CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Reranking, T2Retrieval, mMARCO-NL


Results for intfloat/e5-base

| task_name | google/gemini-embedding-001 | intfloat/e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| DKHateClassification | 0.8702 | 0.5863 | nan | 0.8702 | google/gemini-embedding-001 | False |
| NorwegianParliamentClassification | 0.5672 | 0.5593 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| ScalaClassification | 0.5185 | 0.4991 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.6520 | 0.5482 | 0.5357 | 0.8274 | nan | - |

Training datasets: MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-base

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| ArxivClusteringP2P | nan | 0.4376 | 0.4473 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3631 | 0.3871 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9818 | 0.9872 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.3860 | 0.3915 | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3686 | 0.3682 | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6646 | 0.6765 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6626 | 0.6676 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6735 | 0.6983 | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3014 | 0.3578 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8470 | 0.8483 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2235 | 0.2652 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.7942 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6347 | 0.6741 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6198 | 0.6991 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.4479 | 0.5282 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5609 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5901 | 0.5864 | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.9550 | 0.8830 | 0.9116 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2831 | 0.2747 | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8131 | 0.8370 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5089 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5318 | 0.5593 | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5371 | 0.5456 | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5702 | 0.5918 | 0.7922 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-large

Revisions compared: ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb vs. external

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large (ab10c1a) | intfloat/multilingual-e5-large (external) | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| ArxivClusteringP2P | nan | 0.4473 | 0.4431 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3871 | 0.3843 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9872 | 0.9855 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.3915 | nan | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3682 | nan | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6765 | nan | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | nan | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6983 | nan | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3578 | nan | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | nan | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | nan | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.8279 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6741 | nan | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | nan | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | nan | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | nan | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5864 | nan | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.9550 | 0.9116 | nan | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2747 | nan | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8370 | nan | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | nan | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5593 | nan | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5456 | nan | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5918 | 0.6602 | 0.7922 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-small

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large | intfloat/multilingual-e5-small | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| CMedQAv1-reranking | nan | 0.6765 | 0.6468 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | 0.6404 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| DBPedia-PL | nan | 0.3578 | 0.2927 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | 0.8192 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | 0.2090 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| HotpotQA-PL | nan | 0.6741 | 0.6016 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | 0.5772 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | 0.4045 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | 0.5657 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| PpcPC | 0.9550 | 0.9116 | 0.8774 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| Quora-PL | nan | 0.8370 | 0.7870 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | 0.5041 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.7282 | 0.6281 | 0.5771 | 0.7669 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL



Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@KennethEnevoldsen KennethEnevoldsen changed the title from "Remove old results" to "rerun old results of multilingual-e5 small, medium, large" on Apr 5, 2026
@KennethEnevoldsen (Contributor)

A few of the results are notably lower, though not by much. I am unsure whether it is worth looking too much into it.

@KennethEnevoldsen (Contributor)

@Samoed should we just merge these or should we look into the scores?

@Samoed Samoed (Member, Author) commented Apr 16, 2026

Which results are lower? I tried to search for such tasks but couldn't find any.

@KennethEnevoldsen (Contributor)

It is the reason why the tests fail:

```
FAILED tests/test_results_diff.py::test_result_diffs_within_threshold - AssertionError: Main score changes exceed configured threshold (MTEB_SCORE_EPSILON=0.001):
    intfloat/multilingual-e5-base/ArxivClusteringP2P: The difference between the current score (0.437572) and the previous (0.43349302457921374) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/ArxivClusteringS2S: The difference between the current score (0.363116) and the previous (0.3599685331336122) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/DiaBlaBitextMining: The difference between the current score (0.846981) and the previous (0.8346846148516296) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/DiaBlaBitextMining: The difference between the current score (0.846981) and the previous (0.8346846148516296) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/NorwegianParliamentClassification: The difference between the current score (0.56225) and the previous (0.5585000000000001) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/NorwegianParliamentClassification: The difference between the current score (0.5595) and the previous (0.5548333333333333) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/PpcPC: The difference between the current score (0.882969) and the previous (0.8802438485182167) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/ScalaClassification: The difference between the current score (0.506885) and the previous (0.508544921875) exceeds threshold of 0.001
    intfloat/multilingual-e5-large/NorwegianParliamentClassification: The difference between the current score (0.561833) and the previous (0.563167) exceeds threshold of 0.001
    intfloat/multilingual-e5-small/CMedQAv2-reranking: The difference between the current score (0.6404) and the previous (0.6423767434864328) exceeds threshold of 0.001
```

In general these are small differences, but they are larger than the threshold.
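For context, the failing check amounts to comparing each task's new main score against the stored one with a fixed epsilon. The sketch below is an illustrative reconstruction, not the actual mteb test code; the names `EPSILON` and `check_score_diffs` are made up for the example.

```python
EPSILON = 0.001  # mirrors MTEB_SCORE_EPSILON from the failure log above


def check_score_diffs(current: dict, previous: dict, epsilon: float = EPSILON) -> list[str]:
    """Return a message for every task whose main score moved more than epsilon."""
    failures = []
    for task, new_score in current.items():
        old_score = previous.get(task)
        if old_score is not None and abs(new_score - old_score) > epsilon:
            failures.append(
                f"{task}: current {new_score:.6f} vs previous {old_score:.6f} "
                f"exceeds threshold of {epsilon}"
            )
    return failures
```

Any non-empty return would then trip the assertion, which is why even a ~0.004 drift on ArxivClusteringP2P fails the run.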

@Samoed Samoed (Member, Author) commented Apr 17, 2026

Ah, I forgot about the tests. I think the only big difference is in DiaBlaBitextMining, but I have no idea why there is such a big gap.

@KennethEnevoldsen (Contributor)

Hmm, I suspect it is the change of the implementation to sentence-transformers (though it may be worth looking into why there is a difference?). It seems like it generally gets a bit worse.

@Samoed Samoed (Member, Author) commented Apr 17, 2026

Most of the affected tasks are classification and clustering. We changed the random seeds for them (fixed them; previously they were random), and this can cause differences.
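The seed effect described above can be shown with a toy stand-in: a metric whose value depends on random initialization (as KMeans-style clustering scores do) is only reproducible once the seed is pinned. This is pure illustration; `noisy_metric` is hypothetical and not part of mteb.

```python
import random


def noisy_metric(seed=None):
    # Hypothetical stand-in for a clustering score whose value depends on
    # random initialization; seed=None draws from system entropy.
    rng = random.Random(seed)
    return round(0.55 + rng.uniform(-0.01, 0.01), 4)


# With a pinned seed the score is reproducible across runs; without one,
# each evaluation can land on a slightly different value, which is enough
# to exceed a 0.001 regression threshold.
assert noisy_metric(seed=42) == noisy_metric(seed=42)
```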

@KennethEnevoldsen (Contributor)

I will assume it is either the seed or the change in the model implementation. Either way, these results are now reproducible, so I will merge.

@KennethEnevoldsen KennethEnevoldsen merged commit 6b80cc6 into main Apr 19, 2026
2 of 3 checks passed
@Samoed Samoed mentioned this pull request Apr 19, 2026
@Samoed Samoed deleted the remove_old_results branch May 6, 2026 18:15