Hibernate ORM vs SQL Schema Analysis

This document provides a comprehensive analysis of discrepancies between the Hibernate ORM data model and the SQL-based PostgreSQL schema in TreeBASE.

Executive Summary

Finding: There are discrepancies between the Hibernate ORM layer (source of truth) and the SQL schema instantiated by the database initialization scripts. However, switching to Hibernate-based schema generation introduces significant issues due to seed data dependencies.

Key Issues Identified:

Missing patch in CI/CD: Patch 0011_increase-citation-column-lengths.sql was not included in init_db_uptodate.pg (now fixed)
Column length mismatches: Several columns in the SQL schema have different lengths than defined in Hibernate (fixed by patches)
Different initialization paths: CI/CD and Docker use different initialization approaches
Patch idempotency issues: Some patches failed when schema already had correct types (now fixed)

Recommendation: Keep the SQL-based schema with patches (Option A) rather than switching to Hibernate-based generation. See "Test Impact Analysis" section for reasoning.

Current Architecture

CI/CD Database Initialization

Uses treebase-core/db/schema/init_db_uptodate.pg
Applies snapshot 0000_SCHEMA_before_patches_start.sql + 0000_DATA_before_patches_start.sql
Then applies patches 0001 through 0011 sequentially
Fixed: Patch 0011 is now included

Docker Database Initialization

Uses docker-compose.yml volume mounts
Applies:
1. docker/00-init-roles.sql - Role initialization
2. treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql - Schema creation
3. treebase-core/src/main/resources/initTreebase.sql - Initial data
4. docker/03-migration-hibernate-sequence.sql - Hibernate sequence migration

Hibernate Configuration

hibernate.hbm2ddl.auto= (empty/disabled)
Uses annotation-based mapping (@Entity, @Table, @Column)
Entities defined in org.cipres.treebase.domain.*

Detailed Schema Discrepancies

1. Citation Table

Column	Hibernate Definition	SQL Schema (snapshot)	SQL Schema (TBASE2_POSTGRES_CREATION)	Patch Applied
`title`	VARCHAR(500)	VARCHAR(500)	VARCHAR(500)	-
`abstract`	VARCHAR(10000)	VARCHAR(10000)	VARCHAR(10000)	-
`keywords`	VARCHAR(1000)	VARCHAR(255)	VARCHAR(1000)	Patch 0011
`journal`	VARCHAR(500)	VARCHAR(255)	VARCHAR(500)	Patch 0011

Notes:

The snapshot has outdated column lengths for keywords (255 vs 1000) and journal (255 vs 500)
Patch 0011 fixes these in the snapshot-based initialization
TBASE2_POSTGRES_CREATION.sql already has correct values

2. TaxonLabel Table

Column	Hibernate Definition	SQL Schema (snapshot)	Patch Applied
`linked`	BOOLEAN	BOOLEAN	Patch 0010

Notes:

Earlier SQL had linked as smallint, but this was fixed by Patch 0010
Both snapshot (after patches) and TBASE2_POSTGRES_CREATION.sql now use BOOLEAN

3. Help Table

Column	Hibernate Definition	SQL Schema
`tag`	VARCHAR(255) (implicit)	VARCHAR(255)
`helptext`	TEXT (LOB, 65536)	TEXT

Notes: Schema matches.

4. PasswordResetToken Table

Column	Hibernate Definition	SQL Schema (Patch 0009)
`token_id`	BIGINT (auto-increment)	BIGINT (sequence)
`token`	VARCHAR(100), unique, NOT NULL	VARCHAR(100), unique, NOT NULL
`user_id`	BIGINT, NOT NULL	BIGINT, NOT NULL, FK
`expiry_date`	TIMESTAMP, NOT NULL	TIMESTAMP, NOT NULL
`used`	BOOLEAN, NOT NULL	BOOLEAN, NOT NULL

Notes:

Hibernate uses @GeneratedValue(strategy = GenerationType.IDENTITY)
SQL uses a sequence - these are compatible in PostgreSQL
Schema matches structurally

5. AnalysisStep Table

Column	Hibernate Definition	SQL Schema (snapshot)
`tb_analysisid`	Not in Hibernate entity	VARCHAR(34)

Notes: The SQL schema has a tb_analysisid column that doesn't appear to be mapped in Hibernate. This is a legacy TB1 field.

6. MatrixRow Table (CRITICAL - LOB Type Mismatch)

Column	Hibernate Definition (Before)	Hibernate Definition (After)	SQL Schema
`symbolstring`	`@Lob` + `@Column(length=524288)`	`@Column(columnDefinition="text")`	TEXT

Root Cause of "Bad value for type long" Error: The @Lob annotation on MatrixRow.symbolString caused Hibernate to use OID-based CLOB handling in PostgreSQL. However, the data is inserted via direct JDBC in DiscreteMatrixJDBC.batchUpdateRowSymbol() using setString(), which writes plain text. When Hibernate tried to read the data back using getClob(), it attempted to interpret the text data as a CLOB OID, causing the error:

PSQLException: Bad value for type long : 0002000000000000000000000-0-0000000---00-000000

Solution: Removed @Lob and used @Column(columnDefinition = "text") to ensure TEXT column type without CLOB semantics.

7. PhyloTree Table (Preventive Fix)

Column	Hibernate Definition (Before)	Hibernate Definition (After)	SQL Schema
`newickstring`	`@Lob` + `@Column(length=4194304)`	`@Column(columnDefinition="text")`	TEXT

Notes: Same pattern as MatrixRow. Removed @Lob to prevent potential similar issues.

Root Cause Analysis

Why Discrepancies Exist

Dual Maintenance: The SQL schema and Hibernate annotations are maintained separately
Historical Evolution: The SQL schema evolved over time with patches while Hibernate annotations were updated independently
Different Base Files:
- TBASE2_POSTGRES_CREATION.sql appears more aligned with Hibernate
- The snapshot approach uses older schema + patches
Missing Patch: Patch 0011 wasn't added to the patch inclusion list

The Two Initialization Paths

CI/CD Path:
  snapshot → patches (0001-0011) → final schema

Docker Path:
  TBASE2_POSTGRES_CREATION.sql → initTreebase.sql → final schema

These paths should produce equivalent schemas but use different mechanisms.

Recommendations

Option A: Continue with SQL-Based Schema (RECOMMENDED)

Pros:

Known working approach - all tests pass
Explicit control over schema
Migration scripts for production
Seed data loading order is controlled

Cons:

Dual maintenance burden
Potential for drift between Hibernate and SQL
Requires diligent patch management

Action Items:

✅ Add patch 0011 to init_db_uptodate.pg (DONE)
✅ Make patches idempotent (DONE)
Consider creating a new schema snapshot that includes all patches
Consider adding hibernate.hbm2ddl.auto=validate in production

Option B: Switch to Hibernate-Based Schema Generation

Pros:

Single source of truth (Hibernate entities)
No impedance mismatch
Automatic schema updates with hbm2ddl.auto=update

Cons:

❌ NOT VIABLE - 21 test failures/errors (see Test Impact Analysis)
Requires significant refactoring of seed data loading
Risk of data loss in production if not carefully managed
Less control over exact DDL

Option C: Hybrid Approach

Strategy: Use Hibernate for tests/development, SQL for production

Status: Not recommended due to:

Seed data dependency issues
Would require rewriting test data setup
Maintenance burden of two approaches

Test Impact Analysis

Actual Test Results

SQL-Based Schema (Current Approach)

Result: ✅ All tests pass

Tests run: 301, Failures: 0, Errors: 0, Skipped: 43
BUILD SUCCESS

Hibernate-Based Schema (`hbm2ddl.auto=create`)

Result: ❌ 12 failures, 9 errors

Tests run: 301, Failures: 12, Errors: 9, Skipped: 53
BUILD FAILURE

Failed Tests (Missing Seed Data):

ItemDefinitionDAOTest.testFindByDescription
ItemDefinitionDAOTest.testFindPredefinedItemDefinition
MatrixDAOTest.testfindKindByDescription
MatrixDataTypeDAOTest.testFindByDescription
AlgorithmDAOTest.testFinalAllUniqueAlgorithmDescriptions
StudyStatusDAOTest.testFindStatusInProgress/Published/Ready
PhyloTreeDAOTest.testFindTypeByDescription/findKindByDescription/findQualityByDescription
SubmissionServiceImplTest.testProcessNexusFile

Error Tests (Foreign Key Violations):

EnvironmentTest.testGetGeneratedKey - null value in column violation
EnvironmentTest.testSelectFromInsert - null value in column violation
MatrixServiceImplTest.testAddDelete - NullPointerException (missing service beans)
AnalysisServiceImplTest.testAddDelete - NullPointerException
StudyServiceImplTest.* - Multiple NullPointerExceptions

Root Cause Analysis

The test failures with Hibernate-based schema generation are caused by:

Missing Seed Data: Hibernate creates empty tables. Tests expect pre-populated reference data for:
- ItemDefinition (predefined item definitions)
- MatrixDataType (DNA, RNA, Protein, etc.)
- StudyStatus (In Progress, Ready, Published)
- TreeType, TreeKind, TreeQuality
- Algorithm (reference algorithm types)
- User, Person (test user accounts)
Foreign Key Ordering: When loading initTreebase.sql after Hibernate schema creation:
- Hibernate creates FK constraints immediately
- SQL script inserts data in wrong order (e.g., user before person)
- Results in FK constraint violations
Schema Already Exists Errors:
- password_reset_token table created by Hibernate conflicts with SQL script

Conclusion

Hibernate-based schema generation is NOT suitable for the current test setup because:

Tests depend on seed data that must be loaded in a specific order
The initTreebase.sql script was designed for SQL-based schema creation
Significant refactoring of test data setup would be required

Recommendation: Continue with Option A (SQL-Based Schema) with the following improvements:

✅ Keep patches idempotent (fixed in this PR)
✅ Ensure all patches are included in init_db_uptodate.pg
Consider adding hibernate.hbm2ddl.auto=validate in production to catch drift

Tests to Monitor

Based on the codebase, these test categories should be monitored:

DAO Tests (org.cipres.treebase.dao.*)
- Test CRUD operations against database
- May be affected by constraint differences
Service Tests (org.cipres.treebase.service.*)
- Test business logic with database
- Should be unaffected by schema generation method
Domain Tests (org.cipres.treebase.domain.*)
- Test entity relationships
- May be affected by cascade/fetch settings

Implementation Plan

Phase 1: Fix Immediate Issues (COMPLETED)

✅ Add patch 0011 to init_db_uptodate.pg

Phase 2: Validate Current State

Run full test suite with current SQL-based initialization
Document any existing test failures

Phase 3: Test Hibernate-Based Initialization

Create test configuration with hbm2ddl.auto=create
Run tests to identify failures
Document failures and root causes

Phase 4: Implement Chosen Strategy

Based on test results, implement Option A, B, or C
Update CI/CD configuration
Update Docker configuration
Update documentation

Appendix: Column Length Constants

From org.cipres.treebase.domain.TBPersistable:

public static final int COLUMN_LENGTH_30 = 30;
public static final int COLUMN_LENGTH_50 = 50;
public static final int COLUMN_LENGTH_100 = 100;
public static final int COLUMN_LENGTH_STRING = 255;
public static final int COLUMN_LENGTH_500 = 500;
public static final int COLUMN_LENGTH_STRING_1K = 1000;
public static final int COLUMN_LENGTH_STRING_NOTES = 2000;
public static final int COLUMN_LENGTH_STRING_MAX = 5000;

public static final int CITATION_TITLE_COLUMN_LENGTH = 500;
public static final int CITATION_ABSTRACT_COLUMN_LENGTH = 10000;
public static final int CITATION_KEYWORDS_COLUMN_LENGTH = 1000;
public static final int CITATION_JOURNAL_COLUMN_LENGTH = 500;

These constants define the expected column lengths in Hibernate and should be used as the reference when comparing with SQL schemas.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hibernate ORM vs SQL Schema Analysis

Executive Summary

Current Architecture

CI/CD Database Initialization

Docker Database Initialization

Hibernate Configuration

Detailed Schema Discrepancies

1. Citation Table

2. TaxonLabel Table

3. Help Table

4. PasswordResetToken Table

5. AnalysisStep Table

6. MatrixRow Table (CRITICAL - LOB Type Mismatch)

7. PhyloTree Table (Preventive Fix)

Root Cause Analysis

Why Discrepancies Exist

The Two Initialization Paths

Recommendations

Option A: Continue with SQL-Based Schema (RECOMMENDED)

Option B: Switch to Hibernate-Based Schema Generation

Option C: Hybrid Approach

Test Impact Analysis

Actual Test Results

SQL-Based Schema (Current Approach)

Hibernate-Based Schema (`hbm2ddl.auto=create`)

Root Cause Analysis

Conclusion

Tests to Monitor

Implementation Plan

Phase 1: Fix Immediate Issues (COMPLETED)

Phase 2: Validate Current State

Phase 3: Test Hibernate-Based Initialization

Phase 4: Implement Chosen Strategy

Appendix: Column Length Constants

FilesExpand file tree

HIBERNATE_VS_SQL_SCHEMA_ANALYSIS.md

Latest commit

History

HIBERNATE_VS_SQL_SCHEMA_ANALYSIS.md

File metadata and controls

Hibernate ORM vs SQL Schema Analysis

Executive Summary

Current Architecture

CI/CD Database Initialization

Docker Database Initialization

Hibernate Configuration

Detailed Schema Discrepancies

1. Citation Table

2. TaxonLabel Table

3. Help Table

4. PasswordResetToken Table

5. AnalysisStep Table

6. MatrixRow Table (CRITICAL - LOB Type Mismatch)

7. PhyloTree Table (Preventive Fix)

Root Cause Analysis

Why Discrepancies Exist

The Two Initialization Paths

Recommendations

Option A: Continue with SQL-Based Schema (RECOMMENDED)

Option B: Switch to Hibernate-Based Schema Generation

Option C: Hybrid Approach

Test Impact Analysis

Actual Test Results

SQL-Based Schema (Current Approach)

Hibernate-Based Schema (hbm2ddl.auto=create)

Root Cause Analysis

Conclusion

Tests to Monitor

Implementation Plan

Phase 1: Fix Immediate Issues (COMPLETED)

Phase 2: Validate Current State

Phase 3: Test Hibernate-Based Initialization

Phase 4: Implement Chosen Strategy

Appendix: Column Length Constants

Hibernate-Based Schema (`hbm2ddl.auto=create`)