Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(recovery): add crash recovery implementation #3491

Draft
wants to merge 1 commit into
base: satp-dev
Choose a base branch
from

Conversation

Yogesh01000100
Copy link
Contributor

@Yogesh01000100 Yogesh01000100 commented Aug 20, 2024

This PR addresses issue #3114

Changes

  • Crash Recovery: Updated crash_recovery.proto and related ts files; added core recovery logic (created functions not yet implemented).
  • Knex Config: Improved knexfile.ts and knexfile-remote.ts; added Docker Compose for production.

@RafaelAPB
Copy link
Contributor

I will review this PR

@RafaelAPB
Copy link
Contributor

@Yogesh01000100 please rebase with satp-dev (should not have conflicts)

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 0de9744 to 4c0124d Compare August 23, 2024 17:19
@Yogesh01000100 Yogesh01000100 changed the title feat: add crash recovery and knex config for production feat(recovery): add crash recovery implementation Aug 25, 2024
@RafaelAPB
Copy link
Contributor

@Yogesh01000100 please include documentation and tests, and update the description, as discussed.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from ce9a179 to 24b8eaf Compare September 8, 2024 07:44
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 24b8eaf to 728e7cb Compare September 16, 2024 18:56
@RafaelAPB
Copy link
Contributor

@Yogesh01000100 could you please squash the commits and rebase with latest version of satp-dev, prior to merge?

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 728e7cb to 1a55673 Compare September 17, 2024 10:02
Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

chore(satp-hermes): improve DB management

Signed-off-by: Rafael Belchior <rafael.belchior@tecnico.ulisboa.pt>

chore(satp-hermes): crash recovery architecture

Signed-off-by: Rafael Belchior <rafael.belchior@tecnico.ulisboa.pt>

fix(recovery): enhance crash recovery and rollback implementation

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

refactor(recovery): consolidate logic and improve SATP message handling

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat(recovery): add rollback implementations

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: correct return types and inits

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 1a55673 to 21ad772 Compare September 17, 2024 10:11
Copy link
Contributor

@RafaelAPB RafaelAPB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks very good, but there are some changes to be done prior to merging.
Summarizing my comments:

  1. Add other authors to the commit
  2. Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
  3. Implement RollbackState (for example, should state how many more steps are to be rolled-back, at any moment; what was rolledback already; estimated time to completion, etc)
  4. Please add tests that support the new feature
  5. Please add comprehensive documentation on this feature. Example: The readme of SATP should have a section on how to run the docker compose with several examples of configurations.

@@ -0,0 +1,17 @@
version: '3.8'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to do a few changes. We will have a satp runner image that encapsulates satp. The docker file is being developed by Bruno (please sync with him). In particular, please note what are the necessary env variables that need to be fed to docker compose.

See suggestion below:

version: '3.8'

services:
  db:
    image: postgres:13
    environment:
      POSTGRES_DB: ${DB_NAME}
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_HOST: ${DB_HOST}
      PGPORT: ${DB_PORT}
    ports:
      - "${DB_PORT}:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  satp-runner:
    image: satp-runner-gateway:latest  #to be published
    environment:
      # environment variables your SATP runner needs
    ports:
      - "3000:3000"  #  port mapping
    depends_on:
      - db

volumes:
  pgdata:

Additionally, we will need a load balancer, and other services for monitoring (@eduv09 will implement those). We would have something like this:

  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "80:80"
    depends_on:
      - satp-runner

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

@@ -78,7 +78,16 @@
"watch": "tsc --build --watch",
"forge": "forge build ./src/solidity/*.sol --out ./src/solidity/generated",
"forge:test": "forge build ./src/test/solidity/contracts/*.sol --out ./src/test/solidity/generated",
"forge:all": "run-s 'forge' 'forge:test'"
"forge:all": "run-s 'forge' 'forge:test'",
"db:setup": "bash -c 'npm run db:destroy || true && run-s db:start db:migrate db:seed'",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did many contributions to this code. Unfortunately the squash you did removed that history.

Please add me as a co-author to the squashed commit:
https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors

}

// todo read from local log to get session data
/*private async getSessions(): Map<string, SessionData> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented?

const logEntry = {
sessionID: req.sessionId,
type: "RECOVER_SUCCESS",
key: "key", // generateKey(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented?

sessionID: req.sessionId,
type: "ROLLBACK_",
key: "key", //generateKey(),
operation: "ROLLBACK_",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

throw new Error(`${fnTag}, session data is not correctly initialized`);
}
try {
// TODO record the rollback on the log. Implement RollbackLogEntry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove todos when the task is completed. I see you are already recording the rollback log entry (despite being a dummy entry). Please un-hardcode it, and gather information from the current session

try {
// TODO: Implement Stage 1 specific rollback logic

// TODO: Record the rollback on the log. Implement RollbackLogEntry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please un-hardcode the log creation where applicable

try {
// TODO: Implement Stage 2 specific rollback logic

// TODO: Record the rollback on the log. Implement RollbackLogEntry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please un-hardcode the log creation where applicable.
For instance, details can be a subset of the execution trace to the moment. Remember, these logs should help developers finding bugs or understanding the state of the application at any moment

// TODO: Implement Stage 3 specific cleanup logic

state.currentStage = "Stage3";
// TODO: Update other state properties as needed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments from other stages

@@ -1,4 +1,4 @@
// @generated by protoc-gen-es v1.7.2 with parameter "target=ts"
// @generated by protoc-gen-es v1.8.0 with parameter "target=ts"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us please fix the latest working version of the generator in package.json (if not done yet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants