Optimise unescape #457

Open
andyundso wants to merge 1 commit into ruby-rdf:develop from andyundso:optimise-unescape

@andyundso
Contributor

Can be considered a follow-up to #453.

I profiled again and noticed that a lot of time was still spent in unescape. The main problem is that the double gsub allocates two copies of the string, on top of a potential third copy when the encoding is not UTF-8. This is solved by combining the two regexes into one. Since only UCHAR contains capture groups, it is clear within the gsub block which replacement to apply.
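
To illustrate the single-pass idea in isolation, here is a minimal sketch. The module name and constants are simplified, hypothetical stand-ins, not the library's actual UCHAR or ESCAPE_CHARS_ESCAPED definitions. It only shows that checking the capture groups for nil is enough to tell which alternative matched.

```ruby
# Hypothetical, simplified stand-ins for the library's constants.
module UnescapeSketch
  # Only this pattern has capture groups: $1 for \uXXXX, $2 for \UXXXXXXXX.
  UCHAR_SKETCH = /\\u(\h{4})|\\U(\h{8})/

  # Escaped two-character sequences mapped to their unescaped characters.
  ESCAPES_SKETCH = {
    '\\t' => "\t", '\\n' => "\n", '\\r' => "\r",
    '\\"' => '"', '\\\\' => '\\'
  }.freeze

  # Regexp.union keeps the capture groups of UCHAR_SKETCH as groups 1 and 2.
  COMBINED_SKETCH = Regexp.union(
    UCHAR_SKETCH, Regexp.union(ESCAPES_SKETCH.keys)
  ).freeze

  def self.unescape(string)
    # One gsub, so one new string instead of one per pattern.
    string.gsub(COMBINED_SKETCH) do |match|
      hex = $1 || $2
      # A captured group means the UCHAR branch matched;
      # otherwise look the match up in the escape table.
      hex ? [hex.hex].pack('U') : ESCAPES_SKETCH[match]
    end
  end
end
```

Because the escape-table pattern has no groups of its own, `$1` and `$2` are both nil whenever that branch matches, so no extra bookkeeping is needed inside the block.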

The second optimization is to add a match? check for an early return. This means the UNESCAPE_COMBINED regex is executed twice when a match is encountered, but since most parsed values do not contain these escapes, the early return is a net benefit for most parsing operations.
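
The effect of the early return can be checked with object identity. The sketch below uses a hypothetical module, pattern, and table (not the library's constants) and assumes the input is already UTF-8. When nothing matches, the method hands back the very same String object rather than a copy.

```ruby
# Hypothetical, reduced example of the early-return guard.
module EarlyReturnSketch
  ESCAPED_SKETCH = /\\[tnr]/
  TABLE_SKETCH = { '\\t' => "\t", '\\n' => "\n", '\\r' => "\r" }.freeze

  def self.unescape(string)
    # match? only answers true/false; it builds no MatchData and
    # does not set $~, so the common no-escape case stays cheap.
    return string unless string.match?(ESCAPED_SKETCH)

    # Slow path: the regex runs a second time here, as noted above.
    string.gsub(ESCAPED_SKETCH, TABLE_SKETCH)
  end
end

plain = 'Hello world!'
same_object = EarlyReturnSketch.unescape(plain).equal?(plain)
puts "returned the input object itself: #{same_object}"
```

One consequence of this design is that callers receive the original object back in the no-escape case, so mutating the result would also mutate the input.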

Benchmark script:

$:.unshift(File.expand_path(File.join(File.dirname(__FILE__), 'lib')))
require 'benchmark/ips'
require 'rdf'

Benchmark.ips do |x|
  x.report('without') do
    RDF::NTriples::Reader.unescape("D\u00FCrst")
    RDF::NTriples::Reader.unescape("Hello world!")
  end

  if ENV['WITH_MODULE'] == 'true'
    module RDF::NTriples
      class Reader
        UNESCAPE_COMBINED = Regexp.union(UCHAR, ESCAPE_CHARS_ESCAPED_REGEXP).freeze

        def self.unescape(string)
          # Note: avoiding copying the input string when no escaping is needed
          # greatly reduces the number of allocations and the processing time.
          string = string.dup.force_encoding(Encoding::UTF_8) unless string.encoding == Encoding::UTF_8

          # Early return when nothing to unescape: avoids string allocation entirely.
          return string unless string.match?(UNESCAPE_COMBINED)

          # Single pass handles both \uXXXX/\UXXXXXXXX and backslash escape chars.
          string.gsub(UNESCAPE_COMBINED) do |match|
            ($1 || $2) ? [($1 || $2).hex].pack('U*') : ESCAPE_CHARS_ESCAPED[match]
          end
        end
      end
    end
  end

  x.report('with') do
    RDF::NTriples::Reader.unescape("D\u00FCrst")
    RDF::NTriples::Reader.unescape("Hello world!")
  end
  x.hold! 'temp_results'
  x.compare!
end

Results:

ruby 4.0.2 (2026-03-17 revision d3da9fec82) +PRISM [x86_64-linux]
Warming up --------------------------------------
                with   363.408k i/100ms
Calculating -------------------------------------
                with      3.654M (± 3.6%) i/s  (273.68 ns/i) -     18.534M in   5.081023s

Comparison:
                with:  3653882.0 i/s
             without:  1521371.4 i/s - 2.40x  slower

@coveralls

Coverage Status

coverage: 91.805% (+0.003%) from 91.802%
when pulling 130e396 on andyundso:optimise-unescape
into d6dd27d on ruby-rdf:develop.
