SignalR Problems Resolved

When last I posted, I was having SignalR reconnection issues. It’s important that the nodes are able to recover from a loss of SignalR connectivity. SignalR will try to recover on its own for some time, but it eventually gives up. The nodes can’t give up. I wrote a connector class that wraps the SignalR connection object and keeps trying to connect until it succeeds.

The reconnect is initiated in two ways:

  • If the client detects the disconnection and fires the Disconnected event.
  • If the bus publishes a message, and the publish fails because the connection is down.

The connector starts a task that stays in a loop until the connection is re-established. Because there are always oodles of threads running, it has to make sure that the loop is never started more than once. I introduced some thread wrappers to handle that, which I’ll cover in another post.
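Stripped of the SignalR details, the "only one loop at a time" idea looks roughly like this. It is a minimal sketch in Python, not the actual C# connector or its thread wrappers; `Connector`, `request_reconnect`, and the `connect` callable are names made up for illustration.

```python
import threading
import time


class Connector:
    """Hypothetical wrapper that retries a connection until it succeeds."""

    def __init__(self, connect, retry_delay=0.01):
        self._connect = connect          # assumed: callable returning True on success
        self._retry_delay = retry_delay
        self._lock = threading.Lock()
        self._loop_running = False
        self.loops_started = 0           # kept only so the behaviour can be observed

    def request_reconnect(self):
        """Called from a disconnect handler or after a failed publish."""
        with self._lock:
            if self._loop_running:
                return None              # a loop is already working on it
            self._loop_running = True
            self.loops_started += 1
        worker = threading.Thread(target=self._run_loop, daemon=True)
        worker.start()
        return worker

    def _run_loop(self):
        try:
            while True:
                try:
                    if self._connect():
                        return           # connected, leave the loop
                except Exception:
                    pass                 # treat any connection error as "not yet"
                time.sleep(self._retry_delay)
        finally:
            # Allow a future disconnect to start a fresh loop.
            with self._lock:
                self._loop_running = False
```

Both triggers (the disconnect event and a failed publish) would call `request_reconnect`; whichever arrives first starts the loop and the rest return immediately.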

My issue last night was that the client was not reconnecting when the server came back up. I uncovered two issues while troubleshooting this.

SignalRBusServer Bug – I recently posted an apology to my brain for the mental and emotional trauma I cast upon it while working through bus design issues. If you recall, the key challenge there was elegance, not making it work… it’s been working forever. The LocalBus is standalone, but you can inject a publisher. So, you inject the SignalR publisher. When a message is received from SignalR (IReceiver), it needs to be passed to the local bus. That two-way relationship is the cause of much of the shenanigans. Attempt #715 at resolving this resulted in having the IBusServer create the local bus with the correct publisher, and also send messages to the bus it created.

The problem I discovered while testing is that the bus was being created and destroyed as the server was starting and stopping. So, the app server fires up with one LocalBus, but if you restart the SignalRBusServer object, then there’s a new LocalBus that no one knows about. That is why the bus wasn’t reconnecting: the old bus was running the loop, and the new one wasn’t referenced by anything. I changed it to only destroy the LocalBus on dispose, not on stop.
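The shape of that fix can be sketched as follows. This is a simplified Python illustration, not the real SignalRBusServer; the `BusServer` and `LocalBus` classes and their members are hypothetical stand-ins. The point is that the local bus is created once and outlives any number of stop/start cycles.

```python
class LocalBus:
    """Hypothetical stand-in for the in-process bus."""

    def __init__(self, publisher):
        self.publisher = publisher   # injected callable used for outgoing messages
        self.received = []
        self.disposed = False

    def receive(self, message):
        self.received.append(message)


class BusServer:
    """Hypothetical server that owns one LocalBus for its whole lifetime."""

    def __init__(self):
        # Created once, so references held elsewhere stay valid across restarts.
        self.local_bus = LocalBus(publisher=self.publish)
        self.running = False
        self.sent = []

    def start(self):
        self.running = True

    def stop(self):
        # Stops only the transport; the local bus is left alone.
        self.running = False

    def publish(self, message):
        if not self.running:
            raise ConnectionError("server is stopped")
        self.sent.append(message)

    def on_message(self, message):
        # Incoming transport messages always go to the same bus instance.
        self.local_bus.receive(message)

    def dispose(self):
        # The only place where the local bus is torn down.
        self.stop()
        self.local_bus.disposed = True
```

With this arrangement, anything that grabbed `server.local_bus` before a restart is still talking to the live bus afterwards.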

Establishing the Connection – Microsoft APIs have changed a lot in recent years. WCF is ridiculously complicated. WebAPI and SignalR are much simpler, but they keep changing, especially since OWIN was introduced. They achieve the simplicity, in some cases, by having static global objects that are used by everything in the app domain. I’m OK with that, as long as you don’t have to use them. To this point, that has been a challenge, because I am using them. In the app server, things are associated with features. If a WebAPI controller is associated with a stopped feature, then that controller should not be reachable while the feature is stopped. WebAPI isn’t aware of any of that, though. It just finds all of the controllers in the app domain and wires them up. To counter that, I created a filter that returns 404 for controllers that are hit but aren’t supposed to be up. Based on what I did tonight with SignalR, I’m hopeful that I’ll be able to modify the WebAPI code to just not load the controllers that it doesn’t need.
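The filter logic amounts to a lookup from controller to feature, followed by a check of whether that feature is running. Here is a rough Python sketch of that decision only; it is not the WebAPI filter itself, `FeatureFilter` and its methods are invented names, and letting controllers with no associated feature pass through is an assumption.

```python
from http import HTTPStatus


class FeatureFilter:
    """Hypothetical gate that hides controllers whose feature is stopped."""

    def __init__(self, controller_features):
        # Mapping of controller name -> feature name (illustrative data shape).
        self._controller_features = dict(controller_features)
        self._running = set()

    def start_feature(self, feature):
        self._running.add(feature)

    def stop_feature(self, feature):
        self._running.discard(feature)

    def check(self, controller):
        """Return NOT_FOUND to short-circuit the request, or None to let it through."""
        feature = self._controller_features.get(controller)
        if feature is None:
            return None                      # assumption: unmapped controllers are not filtered
        if feature not in self._running:
            return HTTPStatus.NOT_FOUND      # controller exists, but its feature is down
        return None
```

The drawback, as noted above, is that the controllers are still loaded and wired up; the filter only hides them after the fact.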

Anyway, back to the point: when I stopped and restarted the SignalR server, clients couldn’t connect even though it started successfully. This is the relevant code:

[image: screenshot of the original startup code, which uses GlobalHost]

The ugly parts are the usages of GlobalHost. That is the part that I don’t like. It turns out that it was the cause of the problem, although I don’t know exactly why. All I know is that once I figured out how to stop using GlobalHost, the restart worked and clients reconnected.

Here’s the new code.

[image: screenshot of the new startup code, without GlobalHost]

Much better. GlobalHost is gone. Instead of using it, the server creates its own dependency resolver (which is the same type that GlobalHost uses).

I look forward to revisiting the WebAPI code.

Working Well

All of my current tests and expectations regarding startup, stopping, and synchronization are working.

  • start the hub
  • create a node
  • set the node’s model
  • start the node
  • change the model – the node updates
  • shut down the node
  • change the model
  • start the node – the node updates
  • stop the SignalR server
  • wait for the node connection to time out and enter the reconnect loop
  • restart the SignalR server – the node reconnects
  • stop the SignalR server
  • wait for the node connection to time out and enter the reconnect loop
  • change the model
  • start the SignalR server – the node reconnects, then updates

Unlike previous iterations, it’s not barely working… it’s really working. There’s nothing flaky or questionable about it. It’s solid. Of course, there’s more work that I would like to do. There are two pieces of code that I’d like to revisit, but those are just a few lines in a couple of methods.
