ETT-1459: Record first ingest date #185

Open: aelkiss wants to merge 4 commits into main from ETT-1459-first-ingest-date

@aelkiss aelkiss commented Jun 4, 2026

This change moves the recording of items in feed_audit out of any particular storage class and into the end of the Collate stage.

It also takes the opportunity to:

  • clean up some confusing configuration regarding link_dir and obj_dir
  • improve testing functionality around temporary directories
  • remove LinkedPairtree entirely (since we no longer deposit using symlinked anything)
  • log how long each collate operation takes (ETT-824); add additional fields to info & warn level logging in collate
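
For the duration logging mentioned in the last bullet (ETT-824), a minimal sketch of timing one operation with the core Time::HiRes module might look like the following. The `timed` helper and the log line format are hypothetical illustrations, not the code in this PR.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code ref, return its result and the elapsed
# wall-clock time in seconds (a float).
sub timed {
    my ($code) = @_;
    my $start   = [gettimeofday];
    my $result  = $code->();
    my $elapsed = tv_interval($start);    # interval from $start to now
    return ($result, $elapsed);
}

# Stand-in for a collate operation; the real stage does far more work.
my ($status, $seconds) = timed(sub { return 'ok' });

# Assumed log format, for illustration only.
printf "collate finished: status=%s duration=%.3fs\n", $status, $seconds;
```

Returning the duration alongside the result keeps the timing logic out of the stage itself, so the same helper could wrap each storage operation separately.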

See the message on each commit for more detail.

I had looked into options for rolling back failed deposits to S3, but the ideas I had didn't work out (see ETT-1483).

aelkiss added 3 commits June 4, 2026 15:25
* add first ingest date column to feed_audit table
* record item in feed_audit at the end of collate
* remove record_audit functionality from LocalPairtree (now unused except in
  development); emit warning (could record to feed_storage for
  consistency if we want instead, but we aren't really using it..?)
* testing with storage classes in collate is a bit messy because of the
  distinction between depositing to the repo and reading back from the
  repo
* add additional logging options in Stage (need to DRY out though)
* additional logging for collate (should log duration; see ETT-824)
* add some notes towards ETT-1687
* Mock depositing item for collate tests with mocked storage
This addresses two issues:
* We are no longer using symlinks to deposit material into the repository.
* When we read from the repo, we just care about the root of the repo
  (just like e.g. babel apps reading from the repo), not about any
  symlinks, etc.

Specific changes:
* remove LinkedPairtree
* remove "repository" key in config & references to link_dir / obj_dir;
  replace with a "repository_root" key
* TempDirs keeps track of what it creates; callers can create additional
  temp dirs that will get cleaned up at the end of a test.
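
The TempDirs behaviour described in the last bullet could be sketched roughly as below, assuming only the core File::Temp and File::Path modules. The package name `TrackedTempDirs`, its methods, and the directory template are hypothetical and do not reflect the actual HTFeed test helper.

```perl
use strict;
use warnings;
use File::Temp ();
use File::Path ();

package TrackedTempDirs;    # hypothetical name, not the HTFeed class

sub new {
    my ($class) = @_;
    return bless { dirs => [] }, $class;
}

# Create a new temp dir under the system temp location and remember it.
sub dir {
    my ($self) = @_;
    my $dir = File::Temp::tempdir('feedtest-XXXXXX', TMPDIR => 1);
    push @{ $self->{dirs} }, $dir;
    return $dir;
}

# Remove every directory this object created, then forget them.
sub cleanup {
    my ($self) = @_;
    return unless @{ $self->{dirs} };
    File::Path::remove_tree(@{ $self->{dirs} });
    $self->{dirs} = [];
    return;
}

package main;

my $tmp   = TrackedTempDirs->new;
my $extra = $tmp->dir;      # a caller asks for an additional temp dir
print "created $extra\n";
$tmp->cleanup;              # e.g. called once at the end of a test
print "cleaned up\n" unless -d $extra;
```

Because the object only removes directories it handed out itself, a test can request as many extra directories as it needs without tracking them individually.
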
@aelkiss aelkiss requested a review from moseshll June 4, 2026 20:09
Comment thread etc/ingest.sql
`id` varchar(30) NOT NULL,
`sdr_partition` tinyint(4) DEFAULT NULL,
`zip_size` bigint(20) DEFAULT NULL,
`first_ingest_date` datetime NULL DEFAULT CURRENT_TIMESTAMP,

@aelkiss aelkiss Jun 4, 2026

This is the change for recording first ingest date; the application side doesn't need to handle it directly at all beyond making sure that something is recorded in feed_audit

Comment thread lib/HTFeed/Stage.pm
Comment thread lib/HTFeed/Storage.pm
  my $self = shift;

- return $self->{volume}->get_zip_path(get_config('staging', 'zipfile')) . '.gpg';
+ return $self->{volume}->get_zip_path(get_config('staging', 'zipfile')) . "-$self->{name}.gpg";

This avoids collisions with encrypted zips left over from other storages. They should get cleaned up but don't always in practice.
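
To illustrate the naming change, here is a small sketch of how appending the storage name yields distinct encrypted zip paths per storage. The function `encrypted_zip_path`, the staging path, and the storage names are hypothetical stand-ins for the method and configuration in the diff.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the method in the diff: append the storage
# name so each storage writes its own .gpg file next to the zip.
sub encrypted_zip_path {
    my ($zip_path, $storage_name) = @_;
    return "$zip_path-$storage_name.gpg";
}

# Illustrative values only; real paths come from the 'staging' config.
my $zip = '/tmp/staging/zipfile/example.zip';
my @storage_names = ('storage_a', 'storage_b');

for my $name (@storage_names) {
    print encrypted_zip_path($zip, $name), "\n";
}
```

With the old `.gpg` suffix both storages would have resolved to the same file, so a leftover encrypted zip from one storage could be picked up by the other.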
