PI: Don't load entire file into memory when passed file name #2520

Open
mjsir911 wants to merge 7 commits into py-pdf:main from terrapower:memory

mjsir911 commented Mar 15, 2024

This functionality was originally added back in ced2890.

Reduces memory usage by the size of the loaded file.
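To illustrate the general idea (this is not pypdf's actual code), here is a minimal sketch that contrasts reading a whole file into an in-memory buffer with handing back the open file handle. The helper names `open_eager`, `open_lazy`, `peak_bytes`, and `make_sample` are hypothetical, and the measurement uses the standard-library `tracemalloc` module rather than the profiler that produced the stats in this description.

```python
import io
import os
import tempfile
import tracemalloc


def make_sample(size):
    """Create a throwaway file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"\0" * size)
    return path


def open_eager(path):
    """Read the entire file into memory and wrap it in a BytesIO."""
    with open(path, "rb") as fh:
        return io.BytesIO(fh.read())


def open_lazy(path):
    """Return the open file handle; bytes are read from disk on demand."""
    return open(path, "rb")


def peak_bytes(opener, path):
    """Peak traced allocation while opening the file and reading 8 bytes."""
    tracemalloc.start()
    stream = opener(path)
    try:
        stream.read(8)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        stream.close()
        tracemalloc.stop()
    return peak
```

Calling `peak_bytes(open_eager, path)` and `peak_bytes(open_lazy, path)` on the same file should show the eager variant peaking at no less than the file size, while the lazy variant stays near the size of the read buffer. The trade-off is that the lazy variant leaves a file handle open, so whoever opened it is responsible for closing it.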

Benchmark script
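from pypdf import PdfWriter

# Sample PDF on the author's machine (about 100 MB, per the file name)
filename = '/home/msirabella/tmp/100MB-TESTFILE.ORG.pdf'

# Passing a file name rather than a stream exercises the code path this PR changes
writer = PdfWriter(clone_from=filename)

writer.write("out.pdf")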
Before stats
📏 Total allocations:
	109695

📦 Total memory allocated:
	409.726MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 6.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 40.000B  :   229 ▇
	< 256.000B :    33 ▇
	< 1.590KB  : 67394 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 10.104KB :  1060 ▇
	< 64.190KB :   141 ▇
	< 407.789KB:    47 ▇
	< 2.530MB  :    82 ▇
	< 16.072MB :     0 
	<=102.099MB:     2 ▇
	--------------------------------------------
	max: 102.099MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1223
	 REALLOC: 865
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- __init__:./pypdf/_reader.py:315 -> 204.210MB
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 48.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 26.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.360MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81058
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23017
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2101
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 988
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
After stats
📏 Total allocations:
	109687

📦 Total memory allocated:
	205.521MB

📊 Histogram of allocation size:
	min: 1.000B
	--------------------------------------------
	< 4.000B   : 40707 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 18.000B  :     4 ▇
	< 80.000B  :   227 ▇
	< 348.000B :    39 ▇
	< 1.468KB  : 67239 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 6.341KB  :   737 ▇
	< 27.388KB :   563 ▇
	< 118.297KB:    68 ▇
	< 510.959KB:    21 ▇
	<=2.155MB  :    82 ▇
	--------------------------------------------
	max: 2.155MB

📂 Allocator type distribution:
	 MALLOC: 107587
	 CALLOC: 1218
	 REALLOC: 862
	 MMAP: 20

🥇 Top 5 largest allocating locations (by size):
	- read_from_stream:./pypdf/generic/_data_structures.py:541 -> 101.628MB
	- read_until_regex:./pypdf/_utils.py:233 -> 46.318MB
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 24.012MB
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 7.356MB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 4.844MB

🥇 Top 5 largest allocating locations (by number of allocations):
	- read_until_regex:./pypdf/_utils.py:233 -> 81056
	- read_object:./pypdf/generic/_data_structures.py:1287 -> 23015
	- _call_with_frames_removed:<frozen importlib._bootstrap>:241 -> 2095
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 989
	- _create_fn:/usr/lib/python3.11/dataclasses.py:433 -> 365
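As a quick sanity check on the two reports, the following sketch does the arithmetic on the totals quoted above. All figures are copied from the before and after stats; the variable names are only illustrative, and `Decimal` is used so that the subtraction is exact.

```python
from decimal import Decimal

# Figures copied from the "Before stats" and "After stats" reports (MB)
total_before = Decimal("409.726")
total_after = Decimal("205.521")
reader_init_before = Decimal("204.210")  # __init__ in pypdf/_reader.py, before only
largest_allocation = Decimal("102.099")  # histogram max in the "before" run

# Allocation counts copied from the same reports
allocs_before = 109695
allocs_after = 109687

saved_mb = total_before - total_after
saved_allocs = allocs_before - allocs_after
ratio_to_largest = saved_mb / largest_allocation

print(f"memory saved:      {saved_mb} MB")
print(f"allocations saved: {saved_allocs}")
print(f"saved / largest:   {ratio_to_largest:.2f}")
print(f"gap to _reader.py __init__ entry: {abs(saved_mb - reader_init_before)} MB")
```

The drop in total allocated memory matches the `_reader.py` `__init__` entry that disappears from the top-5 list to within a few kilobytes, and it is about twice the largest single allocation, which lines up with the two allocations in the top histogram bucket of the before run.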

Labels

needs-discussion: The PR/issue needs more discussion before we can continue
PdfReader: The PdfReader component is affected
