This document provides a comprehensive analysis of discrepancies between the Hibernate ORM data model and the SQL-based PostgreSQL schema in TreeBASE.
Finding: There are discrepancies between the Hibernate ORM layer (source of truth) and the SQL schema instantiated by the database initialization scripts. However, switching to Hibernate-based schema generation introduces significant issues due to seed data dependencies.
Key Issues Identified:
- Missing patch in CI/CD: Patch `0011_increase-citation-column-lengths.sql` was not included in `init_db_uptodate.pg` (now fixed)
- Column length mismatches: Several columns in the SQL schema have different lengths than defined in Hibernate (fixed by patches)
- Different initialization paths: CI/CD and Docker use different initialization approaches
- Patch idempotency issues: Some patches failed when schema already had correct types (now fixed)
Recommendation: Keep the SQL-based schema with patches (Option A) rather than switching to Hibernate-based generation. See "Test Impact Analysis" section for reasoning.
- Uses `treebase-core/db/schema/init_db_uptodate.pg`
- Applies snapshot `0000_SCHEMA_before_patches_start.sql` + `0000_DATA_before_patches_start.sql`
- Then applies patches 0001 through 0011 sequentially
- Fixed: Patch 0011 is now included
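A gap like the missing patch 0011 can be caught mechanically. The following is a minimal sketch, not part of TreeBASE: the class `PatchInclusionCheck` and its methods are hypothetical, and it assumes the set of patch numbers referenced by `init_db_uptodate.pg` has already been extracted by some other means.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the TreeBASE code base.
final class PatchInclusionCheck {

    /** Returns the patch numbers in [first, last] that are absent from the inclusion list. */
    static List<Integer> missingPatches(int first, int last, Set<Integer> included) {
        List<Integer> missing = new ArrayList<>();
        for (int n = first; n <= last; n++) {
            if (!included.contains(n)) {
                missing.add(n);
            }
        }
        return missing;
    }

    /** Formats a patch number as the four-digit prefix used by the patch files, e.g. 11 -> "0011". */
    static String prefix(int patchNumber) {
        return String.format("%04d", patchNumber);
    }

    public static void main(String[] args) {
        // Assumed state before the fix: patches 0001-0010 listed, 0011 forgotten.
        Set<Integer> included = new TreeSet<>();
        for (int n = 1; n <= 10; n++) {
            included.add(n);
        }
        for (int n : missingPatches(1, 11, included)) {
            System.out.println("Missing patch: " + prefix(n));
        }
    }
}
```

Run as a CI step, a check of this kind would fail the build whenever a new patch file is added without a matching entry in the inclusion list.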
- Uses `docker-compose.yml` volume mounts
- Applies:
  - `docker/00-init-roles.sql` - Role initialization
  - `treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql` - Schema creation
  - `treebase-core/src/main/resources/initTreebase.sql` - Initial data
  - `docker/03-migration-hibernate-sequence.sql` - Hibernate sequence migration
- `hibernate.hbm2ddl.auto=` (empty/disabled)
- Uses annotation-based mapping (`@Entity`, `@Table`, `@Column`)
- Entities defined in `org.cipres.treebase.domain.*`
| Column | Hibernate Definition | SQL Schema (snapshot) | SQL Schema (TBASE2_POSTGRES_CREATION) | Patch Applied |
|---|---|---|---|---|
| `title` | VARCHAR(500) | VARCHAR(500) | VARCHAR(500) | - |
| `abstract` | VARCHAR(10000) | VARCHAR(10000) | VARCHAR(10000) | - |
| `keywords` | VARCHAR(1000) | VARCHAR(255) | VARCHAR(1000) | Patch 0011 |
| `journal` | VARCHAR(500) | VARCHAR(255) | VARCHAR(500) | Patch 0011 |
Notes:
- The snapshot has outdated column lengths for `keywords` (255 vs 1000) and `journal` (255 vs 500)
- Patch 0011 fixes these in the snapshot-based initialization
- `TBASE2_POSTGRES_CREATION.sql` already has correct values
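The comparison in the table above can be expressed as a small drift check. This is an illustrative sketch only: `ColumnLengthDrift` is a hypothetical class, and the lengths are copied by hand from the table rather than read from the entity annotations or from the database catalog.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TreeBASE code base.
final class ColumnLengthDrift {

    /** Returns column -> "expected vs actual" for every column whose length differs. */
    static Map<String, String> findDrift(Map<String, Integer> expected, Map<String, Integer> actual) {
        Map<String, String> drift = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : expected.entrySet()) {
            Integer actualLength = actual.get(e.getKey());
            // A column missing from 'actual' is reported with a null length.
            if (!e.getValue().equals(actualLength)) {
                drift.put(e.getKey(), e.getValue() + " vs " + actualLength);
            }
        }
        return drift;
    }

    public static void main(String[] args) {
        // Lengths defined on the Hibernate side (from the table above).
        Map<String, Integer> hibernate = new LinkedHashMap<>();
        hibernate.put("title", 500);
        hibernate.put("abstract", 10000);
        hibernate.put("keywords", 1000);
        hibernate.put("journal", 500);

        // Lengths in the snapshot before patch 0011 is applied.
        Map<String, Integer> snapshot = new LinkedHashMap<>();
        snapshot.put("title", 500);
        snapshot.put("abstract", 10000);
        snapshot.put("keywords", 255);
        snapshot.put("journal", 255);

        findDrift(hibernate, snapshot)
                .forEach((column, lengths) -> System.out.println(column + ": " + lengths));
    }
}
```

With the values from the table, only `keywords` and `journal` are reported, which matches the two columns that patch 0011 widens.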
| Column | Hibernate Definition | SQL Schema (snapshot) | Patch Applied |
|---|---|---|---|
| `linked` | BOOLEAN | BOOLEAN | Patch 0010 |
Notes:
- Earlier SQL had `linked` as `smallint`, but this was fixed by Patch 0010
- Both snapshot (after patches) and `TBASE2_POSTGRES_CREATION.sql` now use BOOLEAN
| Column | Hibernate Definition | SQL Schema |
|---|---|---|
| `tag` | VARCHAR(255) (implicit) | VARCHAR(255) |
| `helptext` | TEXT (LOB, 65536) | TEXT |
Notes: Schema matches.
| Column | Hibernate Definition | SQL Schema (Patch 0009) |
|---|---|---|
| `token_id` | BIGINT (auto-increment) | BIGINT (sequence) |
| `token` | VARCHAR(100), unique, NOT NULL | VARCHAR(100), unique, NOT NULL |
| `user_id` | BIGINT, NOT NULL | BIGINT, NOT NULL, FK |
| `expiry_date` | TIMESTAMP, NOT NULL | TIMESTAMP, NOT NULL |
| `used` | BOOLEAN, NOT NULL | BOOLEAN, NOT NULL |
Notes:
- Hibernate uses `@GeneratedValue(strategy = GenerationType.IDENTITY)`
- SQL uses a sequence - these are compatible in PostgreSQL
- Schema matches structurally
| Column | Hibernate Definition | SQL Schema (snapshot) |
|---|---|---|
| `tb_analysisid` | Not in Hibernate entity | VARCHAR(34) |
Notes: The SQL schema has a tb_analysisid column that doesn't appear to be mapped in Hibernate. This is a legacy TB1 field.
| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
|---|---|---|---|
| `symbolstring` | `@Lob` + `@Column(length=524288)` | `@Column(columnDefinition="text")` | TEXT |
Root Cause of "Bad value for type long" Error:
The @Lob annotation on MatrixRow.symbolString caused Hibernate to use OID-based CLOB handling in PostgreSQL. However, the data is inserted via direct JDBC in DiscreteMatrixJDBC.batchUpdateRowSymbol() using setString(), which writes plain text. When Hibernate tried to read the data back using getClob(), it attempted to interpret the text data as a CLOB OID, causing the error:
PSQLException: Bad value for type long : 0002000000000000000000000-0-0000000---00-000000
Solution: Removed @Lob and used @Column(columnDefinition = "text") to ensure TEXT column type without CLOB semantics.
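The failure mode can be illustrated without a database. The sketch below is an analogy, not the driver's actual code: it assumes that reading a value through CLOB semantics requires interpreting the stored column value as a numeric large-object OID, and shows that the row's symbol string cannot be parsed that way. The class `OidParseDemo` and the sample OID `16385` are hypothetical.

```java
// Hypothetical demonstration; not part of TreeBASE or the PostgreSQL JDBC driver.
final class OidParseDemo {

    /** Returns true if the stored value could be read as a numeric large-object OID. */
    static boolean looksLikeOid(String storedValue) {
        try {
            Long.parseLong(storedValue);
            return true;
        } catch (NumberFormatException e) {
            // Plain text written via setString() ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // The value from the error message: matrix symbols, not an OID.
        String symbolString = "0002000000000000000000000-0-0000000---00-000000";
        System.out.println("symbol string parses as OID: " + looksLikeOid(symbolString));
        // An arbitrary numeric value, as a large-object reference would be.
        System.out.println("numeric value parses as OID: " + looksLikeOid("16385"));
    }
}
```

This is why the write path (plain text) and the read path (OID lookup) must agree on the column semantics, and why mapping the field as plain `text` resolves the mismatch.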
| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
|---|---|---|---|
| `newickstring` | `@Lob` + `@Column(length=4194304)` | `@Column(columnDefinition="text")` | TEXT |
Notes: Same pattern as MatrixRow. Removed @Lob to prevent potential similar issues.
- Dual Maintenance: The SQL schema and Hibernate annotations are maintained separately
- Historical Evolution: The SQL schema evolved over time with patches while Hibernate annotations were updated independently
- Different Base Files:
  - `TBASE2_POSTGRES_CREATION.sql` appears more aligned with Hibernate
  - The snapshot approach uses older schema + patches
- Missing Patch: Patch 0011 wasn't added to the patch inclusion list
CI/CD Path:
snapshot → patches (0001-0011) → final schema
Docker Path:
TBASE2_POSTGRES_CREATION.sql → initTreebase.sql → final schema
These paths should produce equivalent schemas but use different mechanisms.
Pros:
- Known working approach - all tests pass
- Explicit control over schema
- Migration scripts for production
- Seed data loading order is controlled
Cons:
- Dual maintenance burden
- Potential for drift between Hibernate and SQL
- Requires diligent patch management
Action Items:
- ✅ Add patch 0011 to `init_db_uptodate.pg` (DONE)
- ✅ Make patches idempotent (DONE)
- Consider creating a new schema snapshot that includes all patches
- Consider adding `hibernate.hbm2ddl.auto=validate` in production
Pros:
- Single source of truth (Hibernate entities)
- No impedance mismatch
- Automatic schema updates with `hbm2ddl.auto=update`
Cons:
- ❌ NOT VIABLE - 21 test failures/errors (see Test Impact Analysis)
- Requires significant refactoring of seed data loading
- Risk of data loss in production if not carefully managed
- Less control over exact DDL
Strategy: Use Hibernate for tests/development, SQL for production
Status: Not recommended due to:
- Seed data dependency issues
- Would require rewriting test data setup
- Maintenance burden of two approaches
Result: ✅ All tests pass
Tests run: 301, Failures: 0, Errors: 0, Skipped: 43
BUILD SUCCESS
Result: ❌ 12 failures, 9 errors
Tests run: 301, Failures: 12, Errors: 9, Skipped: 53
BUILD FAILURE
Failed Tests (Missing Seed Data):
- `ItemDefinitionDAOTest.testFindByDescription`
- `ItemDefinitionDAOTest.testFindPredefinedItemDefinition`
- `MatrixDAOTest.testfindKindByDescription`
- `MatrixDataTypeDAOTest.testFindByDescription`
- `AlgorithmDAOTest.testFinalAllUniqueAlgorithmDescriptions`
- `StudyStatusDAOTest.testFindStatusInProgress/Published/Ready`
- `PhyloTreeDAOTest.testFindTypeByDescription/findKindByDescription/findQualityByDescription`
- `SubmissionServiceImplTest.testProcessNexusFile`
Error Tests (Foreign Key Violations):
- `EnvironmentTest.testGetGeneratedKey` - null value in column violation
- `EnvironmentTest.testSelectFromInsert` - null value in column violation
- `MatrixServiceImplTest.testAddDelete` - NullPointerException (missing service beans)
- `AnalysisServiceImplTest.testAddDelete` - NullPointerException
- `StudyServiceImplTest.*` - Multiple NullPointerExceptions
The test failures with Hibernate-based schema generation are caused by:
1. Missing Seed Data: Hibernate creates empty tables. Tests expect pre-populated reference data for:
   - `ItemDefinition` (predefined item definitions)
   - `MatrixDataType` (DNA, RNA, Protein, etc.)
   - `StudyStatus` (In Progress, Ready, Published)
   - `TreeType`, `TreeKind`, `TreeQuality`
   - `Algorithm` (reference algorithm types)
   - `User`, `Person` (test user accounts)
2. Foreign Key Ordering: When loading `initTreebase.sql` after Hibernate schema creation:
   - Hibernate creates FK constraints immediately
   - SQL script inserts data in wrong order (e.g., `user` before `person`)
   - Results in FK constraint violations
3. Schema Already Exists Errors: `password_reset_token` table created by Hibernate conflicts with SQL script
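The foreign key ordering problem amounts to loading tables in dependency order. The following is a rough sketch of how a safe load order could be derived with a topological sort (Kahn's algorithm). `SeedLoadOrder` is a hypothetical class, and the dependency map in `main` is an assumption based only on the `user`/`person` example above, not on the full TreeBASE schema.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TreeBASE code base.
final class SeedLoadOrder {

    /**
     * Orders tables so that every table appears after the tables it references.
     * dependsOn maps a table to the tables its foreign keys point at.
     */
    static List<String> loadOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();         // unresolved parent count per table
        Map<String, List<String>> dependents = new LinkedHashMap<>(); // parent -> tables referencing it
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.putIfAbsent(e.getKey(), 0);
            for (String parent : e.getValue()) {
                pending.putIfAbsent(parent, 0);
                pending.merge(e.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(parent, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Tables without unresolved parents can be loaded immediately.
        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((table, count) -> {
            if (count == 0) {
                ready.add(table);
            }
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String table = ready.remove();
            order.add(table);
            for (String child : dependents.getOrDefault(table, List.of())) {
                if (pending.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("Cyclic foreign keys; no valid load order");
        }
        return order;
    }

    public static void main(String[] args) {
        // Assumed dependency: user rows reference person rows.
        Map<String, List<String>> fk = new LinkedHashMap<>();
        fk.put("user", List.of("person"));
        fk.put("person", List.of());
        System.out.println(loadOrder(fk));
    }
}
```

A seed script written for a schema without immediate FK enforcement does not need this ordering, which is consistent with `initTreebase.sql` working under SQL-based creation but failing after Hibernate has created the constraints.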
Hibernate-based schema generation is NOT suitable for the current test setup because:
- Tests depend on seed data that must be loaded in a specific order
- The `initTreebase.sql` script was designed for SQL-based schema creation
- Significant refactoring of test data setup would be required
Recommendation: Continue with Option A (SQL-Based Schema) with the following improvements:
- ✅ Keep patches idempotent (fixed in this PR)
- ✅ Ensure all patches are included in `init_db_uptodate.pg`
- Consider adding `hibernate.hbm2ddl.auto=validate` in production to catch drift
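As a rough sketch of the last item, the setting could be supplied as an ordinary Hibernate property. `SchemaValidationSettings` is a hypothetical class, and how these properties are passed to the session factory in TreeBASE is not shown here and would depend on the existing configuration.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the TreeBASE code base.
final class SchemaValidationSettings {

    /** Builds the property override that enables schema validation at startup. */
    static Properties productionOverrides() {
        Properties props = new Properties();
        // "validate" compares the entity mappings with the existing schema and
        // fails startup on a mismatch; it does not issue any DDL itself.
        props.setProperty("hibernate.hbm2ddl.auto", "validate");
        return props;
    }

    public static void main(String[] args) {
        productionOverrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Unlike `update` or `create`, this mode leaves the SQL patches as the only mechanism that changes the schema, so it fits Option A.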
Based on the codebase, these test categories should be monitored:
1. DAO Tests (`org.cipres.treebase.dao.*`)
   - Test CRUD operations against database
   - May be affected by constraint differences
2. Service Tests (`org.cipres.treebase.service.*`)
   - Test business logic with database
   - Should be unaffected by schema generation method
3. Domain Tests (`org.cipres.treebase.domain.*`)
   - Test entity relationships
   - May be affected by cascade/fetch settings
- ✅ Add patch 0011 to `init_db_uptodate.pg`
- Run full test suite with current SQL-based initialization
- Document any existing test failures
- Create test configuration with `hbm2ddl.auto=create`
- Run tests to identify failures
- Document failures and root causes
- Based on test results, implement Option A, B, or C
- Update CI/CD configuration
- Update Docker configuration
- Update documentation
From org.cipres.treebase.domain.TBPersistable:
public static final int COLUMN_LENGTH_30 = 30;
public static final int COLUMN_LENGTH_50 = 50;
public static final int COLUMN_LENGTH_100 = 100;
public static final int COLUMN_LENGTH_STRING = 255;
public static final int COLUMN_LENGTH_500 = 500;
public static final int COLUMN_LENGTH_STRING_1K = 1000;
public static final int COLUMN_LENGTH_STRING_NOTES = 2000;
public static final int COLUMN_LENGTH_STRING_MAX = 5000;
public static final int CITATION_TITLE_COLUMN_LENGTH = 500;
public static final int CITATION_ABSTRACT_COLUMN_LENGTH = 10000;
public static final int CITATION_KEYWORDS_COLUMN_LENGTH = 1000;
public static final int CITATION_JOURNAL_COLUMN_LENGTH = 500;

These constants define the expected column lengths in Hibernate and should be used as the reference when comparing with SQL schemas.