Changes from 26 commits
74 commits
57390d1
add is_displayed_on_page function with caching
andreasntr Apr 20, 2026
ff16713
remove unneeded castings
andreasntr Apr 20, 2026
cf232f9
comply with linter
andreasntr Apr 20, 2026
4d1ca4e
comply with linter
andreasntr Apr 20, 2026
672ad47
remove example
andreasntr Apr 20, 2026
069d5c5
add minimal test
andreasntr Apr 20, 2026
ff80e2e
comply with linter
andreasntr Apr 20, 2026
dd2ac2e
fix docstring and pdf path in test_is_xobject_image_displayed, add py…
andreasntr Apr 21, 2026
27fc2bb
switch from page_number to page as is_displayed_on_page input
andreasntr Apr 21, 2026
5f59487
temporarily remove is_displayed_on_page caching
andreasntr Apr 21, 2026
a09a6bb
Merge branch 'main' into main
andreasntr Apr 22, 2026
58c75a6
switch display check to image constructor
andreasntr Apr 22, 2026
2966ee2
fix tests to use the new is_displayed property
andreasntr Apr 22, 2026
3bcf9a5
Merge branch 'main' into main
andreasntr Apr 26, 2026
14a56d7
move image displayed check to page initialization
andreasntr Apr 30, 2026
e7f78cf
update references to _parse_images_from_content_stream
andreasntr Apr 30, 2026
b2a6114
Merge branch 'main' into main
andreasntr May 1, 2026
f0de97d
fix conflict with main
andreasntr May 6, 2026
6cad13a
Merge branch 'main' into main
andreasntr May 6, 2026
69cb462
Merge branch 'main' into main
andreasntr May 16, 2026
54d6dd2
update sample files
andreasntr May 16, 2026
6db1389
add _displayed_images test file
andreasntr May 16, 2026
f0c7a72
make _displayed_images private, deprecate inline_images and derive it…
andreasntr May 16, 2026
18ebf94
update _displayed_images references
andreasntr May 16, 2026
983022f
update inline_images references
andreasntr May 16, 2026
d6b7ff4
update some image paths
andreasntr May 16, 2026
973f345
Merge branch 'main' into main
andreasntr May 18, 2026
6f0aa8b
Update tests/test_images.py
andreasntr May 18, 2026
183e10f
rename _displayed_images to _content_stream_images
andreasntr May 18, 2026
ccf4a9d
remove wrong docstring
andreasntr May 18, 2026
42c1f81
add deprecation notice to inline_images setter
andreasntr May 18, 2026
364ccbf
remove unneeded cache setter
andreasntr May 18, 2026
683d5d4
use regular mock instead of type
andreasntr May 18, 2026
70963f6
remove unneeded cache setter
andreasntr May 18, 2026
439fab3
fix key error message in test_get_inline_image_without_xobject_resour…
andreasntr May 18, 2026
e4ea241
invalidate cache after manipulating images
andreasntr May 18, 2026
38eebdb
emit warnings for image read errors instead of crashing
andreasntr May 18, 2026
bb11c8c
remove abbreviations
andreasntr May 19, 2026
743d023
remove is_displayed, is_inline; only cache inline images; keep privat…
andreasntr May 19, 2026
00411b1
Merge branch 'main' into main
andreasntr May 21, 2026
677e088
STY: Import AnnotationDictionaryAttributes and ImageAttributes withou…
j-t-1 May 4, 2026
92a93ac
restore is_inline and is_displayed in ImageFile, deprecate inline_images
andreasntr May 22, 2026
5918e8b
Merge pull request #1 from andreasntr/no-caching
andreasntr May 22, 2026
38f5544
Merge branch 'main' into main
andreasntr May 22, 2026
4302443
udpate sample files
andreasntr May 22, 2026
749ee97
Merge branch 'main' into main
andreasntr May 26, 2026
5d40adb
Update pypdf/_page.py
andreasntr May 26, 2026
2a8fdb1
Update pypdf/_page.py
andreasntr May 26, 2026
1015999
remove obsolete lock file
andreasntr May 26, 2026
956a4bf
exclude sample-files from mypy checks
andreasntr May 26, 2026
f2f8617
remove samples decorator from tests using RESOURCE_ROOT files
andreasntr May 26, 2026
3f98544
fix inline image in new inline images tests
andreasntr May 26, 2026
44ab6ae
handle warning with pytest.warns
andreasntr May 26, 2026
cf03742
use 2 empty lines between functions
andreasntr May 26, 2026
f3f828a
replace generic mock type with ImageFile
andreasntr May 26, 2026
b1290a2
remove docstrings in inline_images setter
andreasntr May 26, 2026
b6aa4c0
replaced >>> with ... in docstrings for multiline example
andreasntr May 26, 2026
9501d67
update example usage in images
andreasntr May 26, 2026
0848260
simplify ImageFile creation in _get_images for xobjs and do images
andreasntr May 26, 2026
d278768
remove unnecessary if
andreasntr May 26, 2026
216c9db
remove obsolete test
andreasntr May 26, 2026
5159e13
fix sample-files exclusion from mypy script
andreasntr May 26, 2026
6bf79f8
fix sample-files exclusion from mypy script
andreasntr May 26, 2026
2994cbf
Merge branch 'main' into main
andreasntr May 27, 2026
ecf3716
revert unnecessary edits
andreasntr May 27, 2026
207f1f4
exclude sample-files from mypy jobs
andreasntr May 27, 2026
85579d6
replace warnings.catch_warnings with pytest.warns
andreasntr May 27, 2026
1540d3b
declare variables in example usage
andreasntr May 27, 2026
9ecac76
update example pdf path
andreasntr May 27, 2026
bc4c931
switch to pdfwriter in examples since files are not available
andreasntr May 27, 2026
a5b2054
fix writer variable name in examples
andreasntr May 27, 2026
fbcd1e7
add missing ":" after for loop declaration in example
andreasntr May 27, 2026
7b2bf6c
Merge branch 'main' into main
andreasntr May 29, 2026
ec977c3
delay content stream processing for do objects until image retrieval
andreasntr May 29, 2026
195 changes: 173 additions & 22 deletions pypdf/_page.py
@@ -55,6 +55,7 @@
TransformationMatrixType,
_human_readable_bytes,
deprecate,
deprecate_with_replacement,
logger_warning,
matrix_multiply,
)
@@ -359,6 +360,18 @@ class ImageFile:
Reference to the object storing the stream.
"""

is_inline: bool = False
"""
True if this is an inline image (~0~, ~1~, etc.).
"""

is_displayed: bool = False
"""
True if this image is displayed in the page content stream.
Some PDFs duplicate image references over all the pages,
so this is needed to disambiguate.
"""

def replace(self, new_image: Image, **kwargs: Any) -> None:
"""
Replace the image with a new PIL image.
@@ -512,7 +525,7 @@ def __init__(
) -> None:
DictionaryObject.__init__(self)
self.pdf = pdf
self.inline_images: Optional[dict[str, ImageFile]] = None
self._displayed_images: Optional[dict[str, ImageFile]] = None
self.indirect_reference = indirect_reference
if not is_null_or_none(indirect_reference):
assert indirect_reference is not None, "mypy"
@@ -608,8 +621,8 @@ def _get_ids_image(
if _i in call_stack:
return []
call_stack.append(_i)
if self.inline_images is None:
self.inline_images = self._get_inline_images()
if self._displayed_images is None:
self._displayed_images = self._parse_images_from_content_stream()
if obj is None:
obj = self
if ancest is None:
@@ -620,19 +633,42 @@ def _get_ids_image(
is_null_or_none(resources := obj[PG.RESOURCES]) or
RES.XOBJECT not in cast(DictionaryObject, resources)
):
return [] if self.inline_images is None else list(self.inline_images.keys())
return [] if self._displayed_images is None else list(self._displayed_images.keys())

x_object = resources[RES.XOBJECT].get_object() # type: ignore

# Iterate through all XObject resources
for o in x_object:
# Skip non-stream objects (only process StreamObject)
if not isinstance(x_object[o], StreamObject):
continue

# Check if this XObject is an Image
if x_object[o][ImageAttributes.SUBTYPE] == "/Image":
# Add the image ID (with ancestry if needed)
# When ancest is empty, o is top-level: "/I0"
# When ancest is not empty, [ancest, o] is nested: ["/Form1", "/I0"]
lst.append(o if len(ancest) == 0 else [*ancest, o])
else: # is a form with possible images inside

# If it's a form, recursively search for images inside it
else:
# Forms may contain images that are Do-referenced in their content stream
lst.extend(self._get_ids_image(x_object[o], [*ancest, o], call_stack))
assert self.inline_images is not None
lst.extend(list(self.inline_images.keys()))
return lst

# Removes duplicates and preserves order
deduplicated = []
Collaborator

Where are we getting duplicates from?

Author

Duplicates come from XObject names that coincide with the Do-referenced images pointing to them.

for item in lst:
if item not in deduplicated:
deduplicated.append(item)

# Add inline images (they may overlap with XObject images)
# Preserves order
Collaborator

How does this preserve order? The attribute is a regular dictionary, where we should not assume a fixed order.

Additionally, can we really expect overlaps? What would they look like? If we want to remove duplicates, why aren't we collecting the data as a set, which would eliminate the explicit duplicate checks?

Author

Duplicates happen when XObjects are referenced by Do images, since they share the same name. Order preservation is meant as "keep the order in which they are resolved from the PDF content".

Collaborator

Are we really giving any guarantees about the order? Or is this just required for testing purposes? A set would avoid all of this hassle.

Author (andreasntr, May 27, 2026)

I remembered why I excluded sets in the first place: lst, which populates the return value, can hold both strings (image names) and lists (lst: list[Union[str, list[str]]]; this was not introduced by me). Lists are not hashable and therefore cannot be stored in a set. As for the order, we probably don't need it, but dropping it would mean rewriting some tests involving inline_images.
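
For reference, the mixed str/list constraint discussed in this thread can still be handled with a set for the lookups. This is a minimal sketch, not the code in this PR: `deduplicate_ids` and `ImageId` are hypothetical names, and the approach assumes nested IDs can be turned into tuples for hashing while the returned items stay unchanged.

```python
from typing import Union

# An image ID is either a top-level name or a nested path (hypothetical alias).
ImageId = Union[str, list[str]]


def deduplicate_ids(ids: list[ImageId]) -> list[ImageId]:
    """Drop repeated IDs while keeping first-seen order."""
    seen: set[Union[str, tuple[str, ...]]] = set()
    result: list[ImageId] = []
    for item in ids:
        # Lists are unhashable, so use a tuple as the lookup key for nested IDs.
        key = tuple(item) if isinstance(item, list) else item
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


ids: list[ImageId] = ["/Im0", ["/Form1", "/Im0"], "/Im0", "~0~", ["/Form1", "/Im0"]]
print(deduplicate_ids(ids))
```

The set only serves membership checks, so the output list keeps the order in which names were first resolved.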

# Inline images have names starting with ~ (e.g., ~0~, ~1~)
for k in self._displayed_images:
if k not in deduplicated:
deduplicated.append(k)

return deduplicated

def _get_image(
self,
@@ -657,13 +693,22 @@ def _get_image(
) from exc
if isinstance(id, str):
if id[0] == "~" and id[-1] == "~":
if self.inline_images is None:
self.inline_images = self._get_inline_images()
if self.inline_images is None:
if self._displayed_images is None:
self._displayed_images = self._parse_images_from_content_stream()
if self._displayed_images is None:
raise KeyError("No inline image can be found")
return self.inline_images[id]
img = self._displayed_images[id]
img.is_inline = True
img.is_displayed = True
return img

assert xobjs is not None
# Check if image is in content stream (from _parse_images_from_content_stream)
if self._displayed_images and id in self._displayed_images:
img = self._displayed_images[id]
img.is_inline = False
return img

from .generic._image_xobject import _xobj_to_image # noqa: PLC0415
imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
extension, byte_stream = imgd[:2]
@@ -672,6 +717,8 @@ def _get_image(
data=byte_stream,
image=imgd[2],
indirect_reference=xobjs[id].indirect_reference,
is_inline=False,
is_displayed=False, # XObject images from resources only (not in content stream)
)
# in a subobject
assert xobjs is not None
@@ -701,7 +748,9 @@ def images(self) -> VirtualListImages:
* `.name` : name of the object
* `.data` : bytes of the object
* `.image` : PIL Image Object
* `.indirect_reference` : object reference
* `.indirect_reference` : object reference (None for inline images)
* `.is_inline` : True for inline images (~0~, ~1~...), False for XObjects
* `.is_displayed` : True for images found in content stream, False otherwise

and the following methods:
`.replace(new_image: PIL.Image.Image, **kwargs)` :
@@ -712,12 +761,49 @@ def images(self) -> VirtualListImages:

reader.pages[0].images[0].replace(Image.open("new_image.jpg"), quality=20)

Inline images are extracted and named ~0~, ~1~, ..., with the
indirect_reference set to None.

"""
return VirtualListImages(self._get_ids_image, self._get_image)

@property
def inline_images(self) -> Optional[dict[str, ImageFile]]:
"""
Return only inline images from the page.

.. deprecated::
Use :attr:`images` and filter by :attr:`ImageFile.is_inline` instead.
This property will be removed in pypdf 7.0.

Examples:
>>> from pypdf import PdfReader
>>> reader = PdfReader("example.pdf")
>>> page = reader.pages[0]
>>> inline_images = {k: v for k, v in page.images.items() if v.is_inline}
"""
deprecate_with_replacement(
"PageObject.inline_images",
"PageObject.images",
"7.0",
)
if self._displayed_images is None:
return None
return {k: v for k, v in self._displayed_images.items() if v.is_inline}

@inline_images.setter
def inline_images(self, value: Optional[dict[str, ImageFile]]) -> None:
"""
Setter for inline_images.

Setting to None clears the cache and forces recalculation on next access,
emulating the previous caching control mechanism. Setting to a dict merges
the values into the existing cache.
"""
if value is None:
self._displayed_images = None
else:
if self._displayed_images is None:
self._displayed_images = {}
self._displayed_images.update(value)

def _translate_value_inline_image(self, k: str, v: PdfObject) -> PdfObject:
"""Translate values used in inline image"""
try:
@@ -733,24 +819,85 @@ def _translate_value_inline_image(self, k: str, v: PdfObject) -> PdfObject:
raise PdfReadError(f"Cannot find resource entry {v} for {k}")
return v

def _get_inline_images(self) -> dict[str, ImageFile]:
"""Load inline images. Entries will be identified as `~1~`."""
def _parse_images_from_content_stream(self) -> dict[str, ImageFile]:
"""Load images from content stream. Includes both inline images and Do-referenced images.

This method scans the page content stream and extracts:

1. **Inline images** (~0~, ~1~...): Embedded directly in content stream via BI/EI operators
- is_inline=True, is_displayed=True, indirect_reference=None

2. **Do-referenced images** (/Im0, /Im1...): Referenced via "Do" operator
- is_inline=False, is_displayed=True, indirect_reference=<image object>

3. **Pure XObject images** (/I0, /Image1...): Defined in Resources only (not in content stream)
- is_inline=False, is_displayed=False, indirect_reference=<image object>

Returns:
Dictionary mapping image names to ImageFile instances.
"""
# Idempotent: if already parsed, return cached result
if self._displayed_images is not None:
return self._displayed_images

content = self.get_contents()
if is_null_or_none(content):
return {}
self._displayed_images = {}
return self._displayed_images
imgs_data = []
do_image_names: list[bytes] = []
assert content is not None, "mypy"
for param, ope in content.operations:
if ope == b"INLINE IMAGE":
imgs_data.append(
{"settings": param["settings"], "__streamdata__": param["data"]}
)
elif ope == b"Do" and param:
do_image_names.append(param[0]) # First operand is the XObject name
elif ope in (b"BI", b"EI", b"ID"): # pragma: no cover
raise PdfReadError(
f"{ope!r} operator met whereas not expected, "
"please share use case with pypdf dev team"
)
# Process Do-referenced images first
files = {}
xobjs: Optional[DictionaryObject] = None
try:
resources = cast(DictionaryObject, self[PG.RESOURCES])
xobjs = cast(DictionaryObject, resources[RES.XOBJECT])
except KeyError:
pass # Continue with inline images only

if xobjs is None:
Collaborator

Why do we need this full logic? How was this handled before?

Author

If there is no XObject dictionary, there cannot be any Do reference to an XObject. Previously, XObject images were returned even if they were not referenced.

Collaborator

I am not sure if this logic should be here instead of when requesting the actual file, where we could avoid the overhead of the loop.

Author (andreasntr, May 27, 2026)

Here we're only extracting image names, not image contents. If we don't check whether a Do reference points to an image (it can also point to a form, as far as I understand), how would we know whether to store that name in the cache dict?

Collaborator

But couldn't we store this name in the cache dict in every case? I mean, it is referenced from the page, and would be ignored on actual data retrieval?

Author

The main point of this PR is to avoid returning objects which are present but not displayed in the page. If we store in the cache dict the name of an image which is not actually referenced, i.e. present in the content but not actually displayed, we are back to the behavior of the main branch.

Collaborator

The logic of this section is to exclude non-image objects from the names, not to check whether it actually is displayed. This has been done with the Do operation analysis beforehand.

If this object is referenced, it is displayed. If relevant for the is_displayed value, the logic for checking the type of the stream object should be done when retrieving the actual image, not when retrieving the displayed names.

Author (andreasntr, May 29, 2026)

Got it now, sorry for the confusion.
Edit: would you still keep the name _content_stream_images for the cache dict, or rename it to something like _content_stream_visual_objects?

Author

Delayed content stream processing as requested.

Summary partially generated by AI (qwen 3.6 35B-A3B)

Where images are discovered

| Source | How discovered | In cache |
| --- | --- | --- |
| Inline images (~0~, ~1~…) | Parsed ASAP from page content stream (BI/EI operators) | Yes |
| Do-referenced /Image (/Im0, /Im1…) | Parsed at retrieval time from page content stream (Do operator) + subtype check in _get_ids_image() | Yes |
| Do-referenced /Form (/FFT0…) | Parsed at retrieval time from page content stream (Do operator) | Yes, filtered later |
| Images inside forms | Recursive XObject search in _get_ids_image() | Yes |

How _get_image() retrieves them

| Access pattern | Behavior |
| --- | --- |
| page.images["~0~"] (inline) | Returns cached ImageFile |
| page.images["/Im0"] (Do /Image) | Subtype check → decode from xobjs |
| page.images["/FFT0"] (Do /Form) | KeyError: "is not an image" |
| page.images["/FFT0","/Im0"] (inside form) | Recursive call → decode from form's xobjs |
| for img in page.images | Only /Image subtype objects (non-image filtered by _get_ids_image()) |

Key design decisions

  1. _content_stream_images stores ALL Do-referenced objects from the page content stream (images + forms) as None placeholders.
  2. _get_ids_image() keeps only the references whose subtype is /Image.
  3. _get_image() applies the same /Image subtype check and raises if the requested object does not have the /Image subtype.
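
The three decisions above can be sketched with plain dictionaries. This is an illustration only and not the pypdf code: `XOBJECTS`, `DO_OPERANDS`, `collect_placeholders`, `list_image_ids` and `get_image` are hypothetical stand-ins, and the "decoding" step is a string placeholder.

```python
from typing import Optional

# Hypothetical stand-in for a page's /XObject resources: name -> subtype.
XOBJECTS = {"/Im0": "/Image", "/FFT0": "/Form", "/Im9": "/Image"}

# Hypothetical operands of the Do operators found in the content stream.
DO_OPERANDS = ["/Im0", "/FFT0"]


def collect_placeholders(do_operands: list[str]) -> dict[str, Optional[str]]:
    """Record every Do-referenced name without inspecting the target object."""
    return {name: None for name in do_operands}


def list_image_ids(cache: dict[str, Optional[str]]) -> list[str]:
    """Keep only the cached names whose XObject has the /Image subtype."""
    return [name for name in cache if XOBJECTS.get(name) == "/Image"]


def get_image(cache: dict[str, Optional[str]], name: str) -> Optional[str]:
    """Decode on first access and reject objects that are not images."""
    if XOBJECTS.get(name) != "/Image":
        raise KeyError(f"{name} is not an image")
    if cache.get(name) is None:
        cache[name] = f"decoded:{name}"  # stands in for the real decoding step
    return cache[name]


cache = collect_placeholders(DO_OPERANDS)
print(list_image_ids(cache))
print(get_image(cache, "/Im0"))
```

Collecting names stays cheap because no XObject is inspected until an ID list or an image is actually requested.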

What about pure XObjects?

They are still returned by _get_ids_image(), _get_image() and images, but have is_displayed=False. Ideally, a potential displayed_images property would exclude them.
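
A caller could filter on these flags roughly as follows. The sketch assumes the `is_inline` and `is_displayed` attributes proposed in this PR; `ImageStub` is a hypothetical stand-in for ImageFile so the snippet runs without a PDF file.

```python
from dataclasses import dataclass


@dataclass
class ImageStub:
    """Hypothetical stand-in for ImageFile with the two flags proposed here."""

    name: str
    is_inline: bool = False
    is_displayed: bool = False


images = [
    ImageStub("~0~", is_inline=True, is_displayed=True),  # inline image
    ImageStub("/Im0", is_displayed=True),  # Do-referenced XObject
    ImageStub("/Im9"),  # present in resources only
]

# Keep what the content stream actually draws, then the inline subset.
displayed = [img.name for img in images if img.is_displayed]
inline_only = [img.name for img in images if img.is_inline]
print(displayed)
print(inline_only)
```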

# No XOBJECT resources, skip Do-referenced images
pass
else:
for do_name in do_image_names:
try:
# Handle both NameObject (str) and bytes
if isinstance(do_name, bytes):
do_name_str = do_name.decode()
else:
do_name_str = str(do_name)
xobj = xobjs[do_name]
# Only process if it's an actual image, not a form
if isinstance(xobj, DictionaryObject) and str(xobj[ImageAttributes.SUBTYPE]) == "/Image":
from .generic._image_xobject import _xobj_to_image as _xobj_to_image2 # noqa: PLC0415
imgd = _xobj_to_image2(xobj)
extension, byte_stream, img = imgd
img_file = ImageFile(
name=f"{do_name_str.lstrip('/')}{extension}",
data=byte_stream,
image=img,
indirect_reference=xobj.indirect_reference,
is_inline=False,
is_displayed=True, # Do-referenced images are always displayed
)
files[do_name_str] = img_file
except KeyError:
continue

# Then process inline images
for num, ii in enumerate(imgs_data):
init = {
"__streamdata__": ii["__streamdata__"],
@@ -776,8 +923,12 @@ def _get_inline_images(self) -> dict[str, ImageFile]:
data=byte_stream,
image=img,
indirect_reference=None,
is_inline=True,
is_displayed=True,
)
return files

self._displayed_images = files
return self._displayed_images

@property
def rotation(self) -> int:
@@ -1061,8 +1212,8 @@ def replace_contents(
# as a backup solution, we put content as an object although not in accordance with pdf ref
# this will be fixed with the _add_object
self[NameObject(PG.CONTENTS)] = content
# forces recalculation of inline_images
self.inline_images = None
# forces recalculation of images
self._displayed_images = None

def merge_page(
self, page2: "PageObject", expand: bool = False, over: bool = True