
feat(client): realtime client #29

Merged
merged 15 commits into from
Nov 27, 2023

Conversation

drochetti
Collaborator

@drochetti drochetti commented Nov 25, 2023

Introducing new public API fal.realtime.connect

In order to support the new realtime protocol built on top of WebSockets, this PR introduces a new client API. It allows developers to interact with model APIs that support realtime inference with very little effort, and it is compatible with different JS runtimes and frameworks.

API signature + ergonomics

The API signature is as follows:

function connect(app: string, handler: RealtimeConnectionHandler): RealtimeConnection
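For reference, the handler and connection shapes implied by this signature might look roughly like the sketch below. This is inferred from the usage examples in this PR, not the published type definitions; the optional close method in particular is an assumption:

```typescript
// Hypothetical type sketch inferred from the usage examples in this PR.
interface RealtimeConnectionHandler<Result = any> {
  // Called with each inference result coming back over the socket.
  onResult: (result: Result) => void;
  // Optional error callback, as used in the full example further down.
  onError?: (error: Error) => void;
  // Options shown in this PR's examples.
  throttleInterval?: number; // ms; default 64
  clientOnly?: boolean;      // no-op when called during SSR
  connectionKey?: string;    // reuse a connection across render cycles
}

interface RealtimeConnection<Input = any> {
  // Sends an input payload to the realtime endpoint.
  send: (input: Input) => void;
  // Assumed cleanup method (not shown in the PR examples).
  close?: () => void;
}
```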

It is used as follows:

const connection = fal.realtime.connect('my-superfast-lcm', {
  onResult: (result) => {
    // handle the result
  }
});

connection.send({
  prompt: 'a moon in a starry night sky'
});

Implementation notes

  • The client abstracts away the usage of WebSocket. The design aims to fit "duplex realtime communication between client and server", so the use of WS is considered an implementation detail the consumer won't need to worry about.
  • It handles authentication automatically. WebSocket clients are created in the browser, which makes them unsafe when credentials like API keys are needed. Therefore, the client will automatically handle the short-lived tokens needed to interact with the fal realtime endpoints (similar to how file uploads are handled).
  • It also aims to provide a good, smooth experience out-of-the-box, so it handles particularities such as "too many concurrent events", "client-side + server-side rendering", "re-render lifecycles", etc.

Event throttling

One of the popular use-cases for LCM/realtime is generating images as the user draws on a canvas and/or modifies a prompt. However, these events can fire dozens, sometimes hundreds, of times every second. So the client provides a built-in throttling mechanism for outgoing messages that keeps the experience smooth by default, without overloading the endpoint with too many requests.

const { send } = fal.realtime.connect('my-superfast-lcm', {
  throttleInterval: 128, // default is 64ms
  onResult: handleResult
});
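A minimal leading-edge throttle of the kind described above can be sketched as follows. This is an illustration of the semantics, not the client's actual implementation; note that calls arriving inside the wait window are dropped, not queued:

```typescript
// Sketch of a leading-edge throttle: the wrapped function runs at most
// once per `wait` milliseconds; calls in between are dropped entirely.
function throttle<T extends (...args: any[]) => void>(
  func: T,
  wait: number
): (...args: Parameters<T>) => void {
  let last = 0;
  return (...args: Parameters<T>) => {
    const now = Date.now();
    if (now - last >= wait) {
      last = now;
      func(...args);
    }
    // Calls inside the wait window fall through here and are dropped,
    // never queued for later delivery.
  };
}
```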

Client+Server rendering

A popular architecture for front-end frameworks nowadays allows developers to share the same code when rendering pages on a server or on a client (e.g. Next, Remix, etc). This enables devs to create rich pages that come populated from the server but are also highly interactive on the client. However, this comes with challenges, as developers need to be mindful that some components are only suitable for client-side rendering, or vice-versa. The client accounts for such use-cases and provides a clientOnly: boolean option that, when true, makes connect a no-op when called on the server side.

Re-rendering cycles

Another common challenge in popular front-end architectures is creating components that react to state whenever the framework decides to reconcile the data state with its UI representation. This means that a component, which is simply a function, can be called multiple times during its lifecycle, so code that triggers heavy-weight, long-running operations or manages resources (e.g. WebSocket connections) needs to account for that.

The client introduces a connectionKey option that, when passed, makes the client reuse an available WebSocket connection for the same key. This lets devs stop worrying about rendering cycles: connect can be called multiple times during the rendering phase and no new connections, listeners, etc. will be created.
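The reuse behavior can be sketched as a map keyed by connectionKey; the names here are illustrative, not the client's internals:

```typescript
// Sketch: connections are cached by key, so repeated connect() calls
// during re-renders return the same object instead of opening a new
// WebSocket. The Conn shape is a stand-in for illustration.
type Conn = { key: string };

const connections = new Map<string, Conn>();

function getOrCreateConnection(
  key: string,
  create: (key: string) => Conn
): Conn {
  let conn = connections.get(key);
  if (!conn) {
    conn = create(key);
    connections.set(key, conn);
  }
  return conn;
}
```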

Here's a full example of the client working on a Next.js page that does both SSR and CSR:

  const { send } = fal.realtime.connect('110602490-lcm-plexed-sd15-i2i', {
    clientOnly: true, // in ssr+csr mode, only run in csr
    connectionKey: 'single-drawing', // reuse connection between render cycles
    throttleInterval: 128, // default is 64ms; 128 is still smooth but sends fewer requests
    onResult(result) {
      if (result.images && result.images[0]) {
        setImage(result.images[0].url);
      }
    },
    onError(error) {
      // handle error appropriately
      console.error(error);
    }
  });

Demo

The demo-nextjs-app-router now includes a "realtime" demo page that demonstrates the client usage:

fal-realtime-connect-demo.mp4

Pending

  • Fix create token call
  • Proper error handling

const data = JSON.parse(event.data);
// Drop messages that are not related to the actual result.
// In the future, we might want to handle other types of messages.
if (data.status !== 'error' && data.type !== 'x-fal-message') {


nice!

@isidentical

One of the popular use-cases for LCM/realtime is to generate images as the user draws on a canvas and/or modifies a prompt. However, these events are triggered dozens, sometimes hundreds of times every second. So the client provides a built-in outgoing message throttling mechanism, that makes the experience smooth by default, without overloading the endpoint with too many requests.

Does this drop messages or just throttle them (so can there be an eventual backlog of messages causing a laggy re-render)? On the server side we also keep a limited deque, so in the case of the LCM app, when the client sends the following 10 prompts before we can complete the current image generation process, we only keep the last 3 in memory and answer them.

client: {req_id: 1}
client: {req_id: 2}
client: {req_id: 3}
...
server: {req_id: 1}
server: {req_id: 9}
server: {req_id: 10}

The server will never respond to requests 2 to 8, and this is by design (although maybe it can't be generalized). This is not a generic solution for all apps, but at least for realtime I think either some sort of backpressure system or, at the very least, making the buffer limited might be helpful if it isn't already. Not for this PR, but something to think about maybe (since we already handle it on the server for the limited apps we have).
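The limited deque described here could be sketched like this; the class and the request shape are illustrative only, not the real server-side buffer:

```typescript
// Sketch of a bounded pending-request buffer: only the most recent
// `limit` requests are kept, and older ones are silently dropped,
// matching the req_id example above.
type Request = { req_id: number };

class BoundedQueue {
  private items: Request[] = [];
  constructor(private limit: number) {}

  push(req: Request): void {
    this.items.push(req);
    if (this.items.length > this.limit) {
      this.items.shift(); // drop the oldest pending request
    }
  }

  // Returns and clears whatever is still pending.
  drain(): Request[] {
    const pending = this.items;
    this.items = [];
    return pending;
  }
}
```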


async function getConnection(app: string, key: string): Promise<WebSocket> {
const url = builRealtimeUrl(app);
// const token = await getToken(app);


So if we integrate (or enable) this, would it mean we have to make a new REST request (to generate a JWT) before every connection, even though we might have an unexpired token? Since we regularly close the underlying websockets, this could be a bit inefficient without some sort of caching. Making sure we squeeze as much as possible out of individual JWT tokens (e.g. getToken with a TTL'd cache, if it is easy to implement) should help a lot with perf (in a world where re-connects happen often).

Collaborator Author


Yes, once the WIP implementation is done, it will reuse the token as long as it's not expired.
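A TTL'd cache along the lines discussed above might look like this sketch; fetchToken, the TTL value, and the cache shape are assumptions for illustration, not the client's eventual implementation:

```typescript
// Sketch: cache one token per app and only hit the REST endpoint again
// once the cached token has expired.
type CachedToken = { token: string; expiresAt: number };

function makeTokenCache(
  fetchToken: (app: string) => Promise<string>, // hypothetical REST call
  ttlMs: number
) {
  const cache = new Map<string, CachedToken>();
  return async function getToken(app: string): Promise<string> {
    const cached = cache.get(app);
    if (cached && cached.expiresAt > Date.now()) {
      return cached.token; // reuse until expiry; no extra round-trip
    }
    const token = await fetchToken(app);
    cache.set(app, { token, expiresAt: Date.now() + ttlMs });
    return token;
  };
}
```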

@drochetti
Collaborator Author

Does this drop messages or just throttle them (so can there be an eventual backlog of messages causing a laggy re-render)? On the server side we also keep a limited deque, so in the case of the LCM app, when the client sends the following 10 prompts before we can complete the current image generation process, we only keep the last 3 in memory and answer them.

This means the send function itself is throttled, so messages are never enqueued, they're simply dropped. throttle(func, wait) creates a throttled function that invokes func at most once per every wait milliseconds.

This is not a generic solution for all apps, but at least for the real time I think either some sort of a backpressure system or at the very least making the buffer limited might be helpful if it isn't already. Not for this PR, but something to think about maybe (since we already handle it on the server for the limited apps we have).

This is exactly the side-effect of the current throttle implementation. :)

cc @isidentical

@isidentical

Amazing!

@drochetti drochetti merged commit 145159a into main Nov 27, 2023
1 check passed
@steveruizok

steveruizok commented Dec 2, 2023

Does this work with the pages router? Or just the app router? Getting 401 errors.

@squat squat deleted the feature/real-time-support branch March 18, 2024 19:22