Draft
2 changes: 1 addition & 1 deletion csharp/ql/lib/qlpack.yml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,6 @@ dependencies:
codeql/xml: ${workspace}
dataExtensions:
- ext/*.model.yml
- ext/generated/*.model.yml
- ext/generated/**/*.model.yml
warnOnImplicitThis: true
compileForOverlayEval: true
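
The effect of widening the pattern from `*` to `**/*` can be illustrated with Java's own glob matcher. This is only a sketch: CodeQL's pack loader may treat `**` differently from `java.nio.file` (in particular for files that sit directly in `ext/generated/`), and the `avro/` subdirectory and file name used below are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobDepthSketch {
    // Returns true if the relative path matches the glob under java.nio.file rules.
    static boolean matches(String glob, String relativePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        String flat = "ext/generated/*.model.yml";
        String nested = "ext/generated/**/*.model.yml";
        // Hypothetical location of a generated model file in a per-library subdirectory.
        String file = "ext/generated/avro/org.apache.avro.file.model.yml";

        System.out.println(matches(flat, file));   // false: '*' does not cross '/'
        System.out.println(matches(nested, file)); // true: '**' spans directories
    }
}
```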
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.data", "ObjectReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
Contributor

I tried to look up the docs for this and found that it's actually a nested class. I believe we would need the below syntax to correctly specify it. (You can see this is used for java.io.ObjectInputFilter.Config, for example.)

Suggested change
- ["org.apache.avro.data", "ObjectReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.data", "Json$ObjectReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
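
The `$` form is the JVM binary name of a nested type, which is what `Class.getName()` returns. A minimal sketch, using only JDK classes (Avro itself is not on the classpath here), of how the package and type columns of a row could be derived from a `Class` object; the helper name `packageAndType` is hypothetical:

```java
class NestedTypeNames {
    // Splits a class's binary name into the {package, type} pair used in the
    // first two columns of a model row. Nested types keep their '$' separator.
    static String[] packageAndType(Class<?> c) {
        String binaryName = c.getName();   // e.g. "java.util.Map$Entry"
        String pkg = c.getPackageName();   // requires Java 9 or later
        String type = pkg.isEmpty() ? binaryName : binaryName.substring(pkg.length() + 1);
        return new String[] { pkg, type };
    }

    public static void main(String[] args) {
        String[] row = packageAndType(java.io.ObjectInputFilter.Config.class);
        System.out.println(row[0] + " / " + row[1]); // java.io / ObjectInputFilter$Config
    }
}
```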

Contributor

Also, Gemini 3.1 Pro doesn't think this should be a sink.

Reasoning: Unsafe deserialization vulnerabilities occur when an application deserializes untrusted data into arbitrary Java objects (allowing an attacker to trigger malicious gadget chains). However, Json.ObjectReader is designed to strictly read Avro-encoded data matching the specific Json.SCHEMA internal to Apache Avro.

If you examine its implementation, it maps incoming primitive tokens directly to basic, safe Jackson JsonNode types (like LongNode, DoubleNode, TextNode, ArrayNode, and ObjectNode) and then unwraps them into basic Java structures (Map, List, String, Long, etc.). Since it does not perform polymorphic deserialization or resolve arbitrary class names from the data stream, it is structurally immune to unsafe class instantiation and does not act as a deserialization sink.
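
To make the distinction concrete, here is a minimal sketch of a reader whose set of producible types is fixed in code. It is not Avro's implementation: the tag values, limits, and class name are all hypothetical, and the wire format is invented for illustration. The point is only that no class name is ever read from the input, so there is nothing for an attacker to steer towards a gadget class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ClosedTypeReaderSketch {
    // Hypothetical wire tags; the reader can only ever produce Long, String or List.
    static final int TAG_LONG = 0;
    static final int TAG_STRING = 1;
    static final int TAG_ARRAY = 2;
    static final int MAX_DEPTH = 32;
    static final int MAX_ITEMS = 1024;

    static Object readValue(DataInputStream in, int depth) throws IOException {
        if (depth > MAX_DEPTH) throw new IOException("nesting too deep");
        int tag = in.readUnsignedByte();
        switch (tag) {
            case TAG_LONG:
                return in.readLong();
            case TAG_STRING:
                return in.readUTF();
            case TAG_ARRAY: {
                int n = in.readInt();
                if (n < 0 || n > MAX_ITEMS) throw new IOException("bad array length " + n);
                List<Object> items = new ArrayList<>(n);
                for (int i = 0; i < n; i++) items.add(readValue(in, depth + 1));
                return items;
            }
            default:
                // Unknown tags are rejected; no type is looked up by name.
                throw new IOException("unknown tag " + tag);
        }
    }

    // Convenience wrapper that decodes one value from a byte array.
    static Object decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return readValue(in, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes the two-element array [42, "avro"] in the invented format.
    static byte[] sampleBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(TAG_ARRAY);
            out.writeInt(2);
            out.writeByte(TAG_LONG);
            out.writeLong(42L);
            out.writeByte(TAG_STRING);
            out.writeUTF("avro");
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(sampleBytes())); // [42, avro]
    }
}
```

By contrast, `java.io.ObjectInputStream.readObject()` resolves class names carried in the stream itself, which is the behaviour the `unsafe-deserialization` kind is meant to capture.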

Contributor Author

Interesting. Out of curiosity, how did you ask the model? Did you provide the source code, which the tool (currently) does not?
My impression is that you can ask it multiple times and get different results, which is why I would like to know what input it was given.

Contributor

Which arguments of org.apache.avro.data.Json.ObjectReader.read, if any, should be "unsafe-deserialization" sinks for CodeQL?

It looks like it searched GitHub and found the source code. In the past I have given it the javadocs that I've found online, but I was lazy this time.

Contributor Author

@jacknojo Apr 30, 2026

FYI, I am trying to understand the reasoning of the model (Opus 4.6). It seems to reach the same conclusion in its internal monologue (copy-pasted verbatim):

Object read(Object reuse, Decoder in) in org.apache.avro.data.Json$ObjectReader

  • This method deserializes data from a Decoder parameter, which reads from an external source like a stream or network. While the Decoder is passed in by the caller, the method itself performs the actual deserialization of untrusted data, making it a potential source for tainted data.
  • Sink analysis: The Decoder in parameter is where the potentially malicious data comes from. When deserializing untrusted data through this decoder, there's a risk of CWE-502 (Deserialization of Untrusted Data). The decoder provides the data being deserialized, so it's the actual source of the taint.

For deserialization vulnerabilities, the method performing the deserialization is the sink, and the decoder carrying the untrusted data is what we're concerned about. In Avro's case, the JSON deserialization is schema-driven, which provides some protection by limiting what types can be instantiated, but I should still consider whether this creates a meaningful vulnerability. a security perspective, we should consider this as a potential deserialization concern. But looking closer, Avro's schema-based deserialization doesn't instantiate arbitrary classes like Java's native serialization—it just reads structured data according to a fixed schema, so this isn't really a CWE-502 vulnerability. The Json$ObjectReader is parsing Avro-encoded data into JSON objects, which is more of a data parsing operation than dangerous deserialization.

For Avro specifically, this doesn't constitute a dangerous deserialization sink since it produces JSON data structures rather than arbitrary Java objects, operates schema-driven rather than type-driven, and doesn't invoke arbitrary constructors or methods. I'll skip reporting this as a CWE-502 sink.

As for whether this is a source—the method reads from a Decoder that could represent external data, but the Decoder is already provided by the caller, meaning the actual I/O boundary was crossed when the decoder was created. The method itself is processing data through the Decoder abstraction, which typically wraps an InputStream and calls read methods on it.

Given all this, I don't think these methods clearly qualify as sources or sinks from a security standpoint—they're just performing schema-based data parsing without the dangerous characteristics that would trigger CWE-502 concerns. Actually, let me reconsider one more time. The read method does perform deserialization - it takes serialized data from a Decoder and produces Java objects. Even though Avro's deserialization is schema-based and generally safer than Java's native serialization, it's still deserialization. In a conservative security analysis, we might want to flag this.

But looking at it more carefully - Avro's JSON reader produces JsonNode objects (from Jackson) or similar simple data structures. It doesn't instantiate arbitrary Java classes based on the input data. The security risk of CWE-502 specifically relates to being able to instantiate arbitrary classes or invoke methods through deserialization gadgets. Avro's schema-based approach doesn't have this vulnerability.

I'll not report these as sinks or sources. Now I'm thinking through what Json$ObjectReader.read actually does - it's an inner class in the Avro package that reads JSON from a Decoder and returns a JsonNode. For the in parameter, the Decoder carries the serialized data, so that could be considered a source if we're tracking external input, though typically the application creates the Decoder from a known source rather than the method opening connections itself.

However, it still ends up reporting it as a sink (this is from the LLM output):

Object read(Object reuse, Decoder in) - This method performs deserialization: it reads serialized data from an Avro Decoder and produces Java objects. The Decoder (argument 1) carries the potentially untrusted serialized data that is being deserialized into objects. This is a deserialization operation that could be vulnerable to CWE-502 if the input is untrusted.

As we discussed yesterday, it's not clear to me where we need to draw the line. I have explicitly told it to be conservative in its analysis (that is, to include sources/sinks if in doubt).

Edit: Having run the same query several times now, I see a strong leaning towards reporting the method as a sink. The hesitation shown above is unusual; usually the LLM is convinced that this is a potential security issue (assuming input crafted with ill intent).

Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.file", "DataFileReader", True, "DataFileReader", "(File,DatumReader)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader", True, "openReader", "(File,DatumReader)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "DataFileWriter", True, "appendTo", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "DataFileWriter", True, "create", "(Schema,File)", "", "Argument[1]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "SeekableFileInput", True, "SeekableFileInput", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "SyncableFileOutputStream", True, "SyncableFileOutputStream", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "SyncableFileOutputStream", True, "SyncableFileOutputStream", "(File,boolean)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "SyncableFileOutputStream", True, "SyncableFileOutputStream", "(String)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro.file", "SyncableFileOutputStream", True, "SyncableFileOutputStream", "(String,boolean)", "", "Argument[0]", "path-injection", "ai-generated"]
- addsTo:
pack: codeql/java-all
extensible: sourceModel
data:
- ["org.apache.avro.file", "DataFileReader12", True, "getMeta", "(String)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader12", True, "getMetaString", "(String)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader12", True, "getSchema", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader12", True, "iterator", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader12", True, "next", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileReader12", True, "next", "(Object)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "getHeader", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "getMeta", "(String)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "getMetaKeys", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "getMetaString", "(String)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "getSchema", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "iterator", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "next", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "next", "(Object)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "DataFileStream", True, "nextBlock", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "FileReader", True, "getSchema", "()", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "FileReader", True, "next", "(Object)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro.file", "SeekableInput", True, "read", "(byte[],int,int)", "", "Argument[0]", "file", "ai-generated"]
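
The last row marks `Argument[0]` rather than `ReturnValue` as the source output. A minimal sketch of why a buffer argument can carry the tainted data, using `java.io.ByteArrayInputStream` as a stand-in; it assumes `SeekableInput.read(byte[],int,int)` follows the same fill-the-buffer convention as `InputStream.read`, which the source does not confirm:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class BufferFillSketch {
    // Reads up to len bytes and returns them as text. The data arrives through
    // the buffer argument; the return value of read() is only a byte count.
    static String readPrefix(byte[] fileContents, int len) {
        ByteArrayInputStream in = new ByteArrayInputStream(fileContents);
        byte[] buffer = new byte[len];
        int n = in.read(buffer, 0, len);
        return n <= 0 ? "" : new String(buffer, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] contents = "avro-data".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readPrefix(contents, 4)); // avro
    }
}
```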
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.generic", "GenericDatumReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.generic", "GenericDatumReader", True, "read", "(Object,Schema,ResolvingDecoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.io", "DatumReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.io", "ExecutionStep", True, "execute", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.io", "FieldReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.io", "RecordReader", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.message", "BaseDecoder", True, "decode", "(ByteBuffer)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "BaseDecoder", True, "decode", "(ByteBuffer,Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "BaseDecoder", True, "decode", "(InputStream)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "BaseDecoder", True, "decode", "(byte[])", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "BaseDecoder", True, "decode", "(byte[],Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "BinaryMessageDecoder", True, "decode", "(InputStream,Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(ByteBuffer)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(ByteBuffer,Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(InputStream)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(InputStream,Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(byte[])", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "MessageDecoder", True, "decode", "(byte[],Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.message", "RawMessageDecoder", True, "decode", "(InputStream,Object)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro", "Parser", True, "parse", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "Protocol", True, "parse", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "Schema", True, "parse", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "SchemaParser", True, "parse", "(File)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "SchemaParser", True, "parse", "(File,Charset)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "SchemaParser", True, "parse", "(Path)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "SchemaParser", True, "parse", "(Path,Charset)", "", "Argument[0]", "path-injection", "ai-generated"]
- ["org.apache.avro", "SchemaParser", True, "parse", "(URI,Charset)", "", "Argument[0]", "request-forgery", "ai-generated"]
- addsTo:
pack: codeql/java-all
extensible: sourceModel
data:
- ["org.apache.avro", "Parser", True, "parse", "(File)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro", "Protocol", True, "parse", "(File)", "", "ReturnValue", "file", "ai-generated"]
- ["org.apache.avro", "Schema", True, "parse", "(File)", "", "ReturnValue", "file", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.reflect", "CustomEncoding", True, "read", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "read", "(Object,Schema,ResolvingDecoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "readArray", "(Object,Schema,ResolvingDecoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "readBytes", "(Object,Schema,Decoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "readField", "(Object,Schema$Field,Object,ResolvingDecoder,Object)", "", "Argument[3]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "readInt", "(Object,Schema,Decoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.reflect", "ReflectDatumReader", True, "readString", "(Object,Decoder)", "", "Argument[1]", "unsafe-deserialization", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.specific", "SpecificDatumReader", True, "readField", "(Object,Schema$Field,Object,ResolvingDecoder,Object)", "", "Argument[3]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.specific", "SpecificDatumReader", True, "readRecord", "(Object,Schema,ResolvingDecoder)", "", "Argument[2]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.specific", "SpecificExceptionBase", True, "readExternal", "(ObjectInput)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.specific", "SpecificFixed", True, "readExternal", "(ObjectInput)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.specific", "SpecificRecordBase", True, "customDecode", "(ResolvingDecoder)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
- ["org.apache.avro.specific", "SpecificRecordBase", True, "readExternal", "(ObjectInput)", "", "Argument[0]", "unsafe-deserialization", "ai-generated"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
# Generated from https://github.com/apache/avro.git#79017ee391c04f60bdffd5fecf9ecc27c1b1f420 by codeql-mads-via-llm
extensions:
- addsTo:
pack: codeql/java-all
extensible: sinkModel
data:
- ["org.apache.avro.util", "RandomData", True, "main", "(String[])", "", "Argument[0]", "path-injection", "ai-generated"]
- addsTo:
pack: codeql/java-all
extensible: sourceModel
data:
- ["org.apache.avro.util", "RandomData", True, "main", "(String[])", "", "Argument[0]", "commandargs", "ai-generated"]