docs: add Android LLM runner page and HuggingFace #19611
omkar-334 wants to merge 6 commits into pytorch:main from omkar-334:docs-hf
Commits:

- `4f38dc3` [DOC] Add Android LLM runner page (#8790)
- `f97f410` [DOC] Add run-on-android to LLM toctree (#8790)
- `b7dc058` [DOC] Point LLM getting-started runtime links to in-docs pages (#8790)
- `260b1a0` [DOC] Add Optimum ExecuTorch callout on export page (#8790)
- `ff88af6` [DOC] Clean up export-llm-optimum and link to in-docs runtime pages (…
- `37d0f68` Merge branch 'main' into docs-hf
# Running LLMs on Android

ExecuTorch's LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the `executorch-android` AAR.

## Prerequisites

Make sure you have model and tokenizer files ready, as described in the prerequisites section of the [Running LLMs with C++](run-with-c-plus-plus.md) guide.

To add the `executorch-android` library to your app, see [Using ExecuTorch on Android](../using-executorch-android.md). The LLM runner classes are bundled inside the same AAR as the generic `Module` API.

## Runtime API

Once the `executorch-android` AAR is on your classpath, you can import the LLM runner classes from the `org.pytorch.executorch.extension.llm` package.

### Importing

```java
import org.pytorch.executorch.extension.llm.LlmModule;
import org.pytorch.executorch.extension.llm.LlmModuleConfig;
import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
import org.pytorch.executorch.extension.llm.LlmCallback;
```

### LlmModule

The `LlmModule` class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.

This API is experimental and subject to change.

#### Initialization

Create an `LlmModule` by specifying paths to your serialized model (`.pte`) and tokenizer files. For text-only models, the simple constructor is enough:

```java
LlmModule module = new LlmModule(
    "/data/local/tmp/llama-3.2-instruct.pte",
    "/data/local/tmp/tokenizer.model",
    0.8f);
```

For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use `LlmModuleConfig` with the fluent builder:

```java
LlmModuleConfig config = LlmModuleConfig.create()
    .modulePath("/data/local/tmp/llama-3.2-instruct.pte")
    .tokenizerPath("/data/local/tmp/tokenizer.model")
    .temperature(0.8f)
    .modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
    .loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
    .build();

LlmModule module = new LlmModule(config);
```

Available load modes are `LOAD_MODE_FILE`, `LOAD_MODE_MMAP` (default), `LOAD_MODE_MMAP_USE_MLOCK`, and `LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS`. Available model types are `MODEL_TYPE_TEXT`, `MODEL_TYPE_TEXT_VISION`, and `MODEL_TYPE_MULTIMODAL`.

Construction itself is lightweight and does not load the program data immediately.

#### Loading

Explicitly load the model before generation to avoid paying the load cost during your first `generate` call:

```java
int status = module.load();
if (status != 0) {
  // Handle load failure (status is an ExecuTorch runtime error code).
}
```

If you skip this step, the model is loaded lazily on the first `generate` call.

#### Generating

Generate tokens from a text prompt by passing an `LlmCallback` that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.

```java
LlmCallback callback = new LlmCallback() {
  @Override
  public void onResult(String token) {
    // Called once per generated token. Append to your UI buffer here.
    System.out.print(token);
  }

  @Override
  public void onStats(String statsJson) {
    // Called once when generation finishes. See extension/llm/runner/stats.h
    // for the field definitions.
    System.out.println("\n" + statsJson);
  }

  @Override
  public void onError(int errorCode, String message) {
    // Called if the runtime reports an error during generation.
  }
};

module.generate("Once upon a time", callback);
```

For full control over generation parameters, use `LlmGenerationConfig`:

```java
LlmGenerationConfig genConfig = LlmGenerationConfig.create()
    .seqLen(2048)
    .temperature(0.8f)
    .echo(false)
    .build();

module.generate("Once upon a time", genConfig, callback);
```

`LlmGenerationConfig` exposes `echo`, `maxNewTokens`, `seqLen`, `temperature`, `numBos`, `numEos`, and `warming`. Defaults match the C++ `GenerationConfig` documented in [Running LLMs with C++](run-with-c-plus-plus.md).

#### Stopping Generation

If you need to interrupt a long-running generation, call `stop()` from another thread (or from inside the `onResult` callback):

```java
module.stop();
```

Generation runs synchronously on the calling thread, so make sure you invoke `generate()` off the main thread (for example, on a `HandlerThread` or via a `java.util.concurrent.Executor`).
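Since `generate()` blocks until generation finishes, a common pattern is to funnel all generation calls through a single-threaded executor and let the callback post tokens back to the UI thread. The sketch below is illustrative only: `InferenceExecutor` and its method names are not part of ExecuTorch, and in a real app the submitted `Runnable` would wrap `module.generate(prompt, callback)`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class InferenceExecutor {
  // A single worker thread: generate() is blocking, so serializing calls
  // through one thread keeps the UI responsive and avoids overlapping runs.
  private final ExecutorService worker = Executors.newSingleThreadExecutor();

  // In a real app the Runnable would be
  // `() -> module.generate(prompt, callback)`; the callback then posts
  // tokens back to the UI thread (e.g. via a Handler).
  public void generateAsync(Runnable generateCall) {
    worker.execute(generateCall);
  }

  // Waits for in-flight work to finish (useful on teardown).
  public boolean shutdownAndWait(long seconds) throws InterruptedException {
    worker.shutdown();
    return worker.awaitTermination(seconds, TimeUnit.SECONDS);
  }
}
```

With this pattern, `module.stop()` can still be called from the main thread while the worker is inside `generate()`.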
#### Resetting

To clear the prefilled tokens from the KV cache and reset the start position to 0, call:

```java
module.resetContext();
```

This is the equivalent of `reset()` on the iOS runner and `reset()` on the C++ `IRunner`.

### Multimodal Inputs

For models declared as `MODEL_TYPE_TEXT_VISION` or `MODEL_TYPE_MULTIMODAL`, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call `generate()` with the text prompt to produce the response.

#### Images

Raw uint8 pixel data in CHW order can be supplied as an `int[]`, or as a direct `ByteBuffer` to avoid JNI array copies:

```java
// As int[]
int[] pixels = ...; // length == channels * height * width
module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);

// As direct ByteBuffer (preferred for large images)
ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
buffer.put(rawBytes).rewind();
module.prefillImages(buffer, 336, 336, 3);
```
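Pixels obtained from an Android `Bitmap` via `getPixels` arrive as packed ARGB ints in HWC order, so they need to be unpacked into channel planes first. A minimal sketch, assuming the model expects RGB channel order (`ChwPacker` is a hypothetical helper, not an ExecuTorch API):

```java
public class ChwPacker {
  // Convert packed ARGB pixels (as produced by Bitmap.getPixels) into a
  // CHW int[] of raw 0-255 channel values: all R values first, then all G,
  // then all B. RGB plane order is an assumption; verify it for your model.
  public static int[] argbToChw(int[] argb, int width, int height) {
    int plane = width * height;
    int[] chw = new int[3 * plane];
    for (int i = 0; i < plane; i++) {
      int p = argb[i];
      chw[i]             = (p >> 16) & 0xFF; // R plane
      chw[plane + i]     = (p >> 8) & 0xFF;  // G plane
      chw[2 * plane + i] = p & 0xFF;         // B plane
    }
    return chw;
  }
}
```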
Pre-normalized float pixel data is also supported, both as a `float[]` and as a direct `ByteBuffer` in native byte order:

```java
float[] normalized = ...; // length == channels * height * width
module.prefillImages(normalized, 336, 336, 3);

ByteBuffer floatBuffer = ByteBuffer
    .allocateDirect(3 * 336 * 336 * Float.BYTES)
    .order(ByteOrder.nativeOrder());
// fill floatBuffer with normalized values, then:
module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
```
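One way to fill such a buffer is to normalize uint8 CHW values on the fly. The sketch below assumes a simple `(x / 255 - mean) / std` scheme; the actual mean and std are model-specific preprocessing choices, and `NormalizeToBuffer` is a hypothetical helper, not an ExecuTorch API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class NormalizeToBuffer {
  // Fill a direct, native-order ByteBuffer with normalized CHW floats.
  // mean/std are illustrative; use the values your model was trained with.
  public static ByteBuffer normalize(int[] chw, float mean, float std) {
    ByteBuffer buf = ByteBuffer
        .allocateDirect(chw.length * Float.BYTES)
        .order(ByteOrder.nativeOrder());
    FloatBuffer floats = buf.asFloatBuffer();
    for (int v : chw) {
      floats.put((v / 255.0f - mean) / std);
    }
    return buf; // buf's own position is untouched; writes went via the view
  }
}
```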
#### Audio

Preprocessed audio features (for example, mel spectrograms produced by a Whisper preprocessor) can be supplied as `byte[]` or `float[]`:

```java
module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
```
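If your preprocessor hands you a 2-D `[nBins][nFrames]` array, it must be flattened into a single array before the call. The sketch below assumes bin-major (row-major) ordering, which may not match every model's expected layout; `MelFlattener` is a hypothetical helper, not an ExecuTorch API:

```java
public class MelFlattener {
  // Flatten a [nBins][nFrames] mel spectrogram into a bin-major float[].
  // The ordering is an assumption; check your preprocessor's output layout.
  public static float[] flatten(float[][] mel) {
    int nBins = mel.length;
    int nFrames = mel[0].length;
    float[] out = new float[nBins * nFrames];
    for (int b = 0; b < nBins; b++) {
      System.arraycopy(mel[b], 0, out, b * nFrames, nFrames);
    }
    return out;
  }
}
```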
Raw audio samples can be supplied with `prefillRawAudio`:

```java
module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
```
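Raw samples captured with Android's `AudioRecord` typically arrive as 16-bit PCM. A minimal conversion sketch, assuming the model expects floats scaled to [-1, 1] (`PcmToFloat` is a hypothetical helper, and the scaling convention is an assumption to verify against your model):

```java
public class PcmToFloat {
  // Convert signed 16-bit PCM samples to floats in [-1, 1] by dividing by
  // 32768. Whether prefillRawAudio expects this range is an assumption.
  public static float[] toFloat(short[] pcm) {
    float[] out = new float[pcm.length];
    for (int i = 0; i < pcm.length; i++) {
      out[i] = pcm[i] / 32768.0f;
    }
    return out;
  }
}
```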
#### Generating with Multimodal Prefill

After prefilling each modality, run `generate()` with the text prompt as usual:

```java
module.prefillImages(pixels, 336, 336, 3);
module.generate("What's in this image?", callback);
```

For text-vision models, a convenience overload accepts the image and prompt together:

```java
module.generate(
    pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
    "What's in this image?",
    /*seqLen=*/768,
    callback,
    /*echo=*/false);
```

## Demo

See the [Llama Android demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo) in `executorch-examples` for an end-to-end project that wires `LlmModule`, `LlmCallback`, and a `HandlerThread` into a chat UI.