1kkiRen/README.md

Dmitrii Kuzmin — NLP/ML Engineer & Researcher

I research and build language-model systems: tokenizer adaptation, multilingual LLM evaluation, agentic workflows, and production NLP pipelines. My current work focuses on making LLMs more reliable, efficient, and easier to adapt across languages.


Quick highlights

  • Research Intern at Mohamed bin Zayed University of Artificial Intelligence (Jun 2025 – present), studying alternative tokenization methods and language adaptation for LLMs.
  • Middle NLP Engineer / Lead NLP Researcher at DeepPavlov.ai (May 2025 – present), working on LLM evaluation, agentic systems, uncertainty estimation, and reasoning reliability.
  • Previously worked at the Center for Applied AI (Skolkovo), Higher School of Economics, Moscow Aviation Institute, and Innopolis University on multimodal fine-tuning, Russian LLM adaptation, BERT models, and NLP services.
  • Maintainer of open-source tokenizer and embedding tooling published on PyPI.

What I work with

NLP & Deep Learning Stack

PyTorch Transformers Tokenizers LangChain LangGraph NumPy pandas

DevOps & Tooling

Git Docker Linux Bash

Backend & Communication

FastAPI MongoDB Telegram Bot API

Languages & soft skills
  • English (C1)
  • Russian (native)
  • Flexibility · Responsibility · Enthusiasm

Recent experience

Research Intern · MBZUAI — Abu Dhabi, UAE (Jun 2025 – present)
  • Lead research on tokenizer adaptation, alternative tokenization strategies, and language-specific LLM fine-tuning.
  • Evaluate quality and efficiency trade-offs driven by tokenizer choices and prepare papers for publication.
Middle NLP Engineer / Lead NLP Researcher · DeepPavlov.ai — Moscow, Russia (May 2025 – present)
  • Lead R&D around LLM evaluation, agentic systems, uncertainty estimation, and reasoning reliability.
  • Build benchmarking workflows and run comparative testing across diverse GPU infrastructures.
Middle NLP Engineer · Center for Applied AI, Skolkovo — Moscow, Russia (Feb 2025 – May 2025)
  • Fine-tuned the Qwen2.5-VL model and built supporting pipelines.
  • Designed prompting strategies to generate actionable feedback on heterogeneous specifications.
NLP Researcher · Higher School of Economics — Moscow, Russia (Jun 2024 – May 2025)
  • Fine-tuned Llama3-8B-Instruct for Russian-language tasks.
  • Developed a Russian BPE tokenizer and tooling to manipulate existing tokenizer vocabularies safely.
  • Built a grammar benchmark suite to quantify improvements across downstream tasks.
ML / Backend Engineer · Moscow Aviation Institute — Moscow, Russia (Jul 2023 – Oct 2023)
  • Delivered a sentence theme classifier and optimized database queries.
  • Integrated Telegram-based interfaces for model delivery.
NLP Engineer · Innopolis University — Innopolis, Russia (Jun 2023 – Jul 2023)
  • Developed a deep-learning sentiment model for YouTube comments.
  • Fine-tuned BERT for domain-specific tone classification.

Publications & research

Mitigating the Impact of Glitch Tokens via Targeted Retokenization — EMNLP 2026 (under review)

Researcher & writer, 2025. Studies how glitch-token handling and tokenizer behavior affect LLM generation quality.

TokenSubstitution: Cost-Efficient Method of Language Adaptation Based on Token "Trained-ness" — EMNLP 2026 (in progress)

Proposes a cost-efficient method for adapting LLM generation quality to a target language.

A Multi-Aspect Evaluation of Tokenizer Adaptation Methods for Large Language Models on Russian — AI Journey 2025 (accepted)

Demonstrates tokenizer adaptation as a cost-effective technique by analyzing text quality and token efficiency across diverse benchmarks.

Open-source projects

TokenizerChanger — modify tokenizers
  • Python library for modifying Hugging Face tokenizers.
  • PyPI · GitHub
  • pip install TokenizerChanger
EmbeddingsDivision — adapt LLM embeddings
  • Python library for separating and adapting LLM embedding layers.
  • PyPI · GitHub
  • pip install embdiv
CRUD Calendar LLM Chatbot — Telegram/FastAPI assistant
  • Features: calendar CRUD, summaries of the latest news, voice reminders.
  • Stack: Telegram Bot API, FastAPI, RAG pipeline with Qwen2.5-VL.

Education

Innopolis University — B.S. in Data Analysis & Artificial Intelligence (2022 – 2026)
Key coursework: Software Systems Analysis and Design, Human-AI Interaction, Mathematical Analysis.

Beyond work

  • Tutor for first-year students at Innopolis University (Sep 2023 – Jan 2024), helping newcomers acclimate and organizing community events.
  • Always exploring ways to make LLM tooling more accessible and efficient.
