A call came in last week from a customer using Lync 2013 on premises with Exchange Online: “We can’t reach our voicemail anymore.” Lync-to-Lync calls, PSTN-to-Lync calls, none could be forwarded to the Office 365 UM service. Funnily enough, when I tried to leave them a voicemail myself (as a federated partner), I managed to do so without any problem.
That did make sense in a way, since I’m being forwarded by Office 365 to contact the UM servers directly when using Lync. Then again, this would not solve the internal issues.
The first discovery I made was that the Edge server was unable to resolve any external DNS queries. There had been some firewall changes lately, so I blamed the firewall and waited for that to be tested again. Indeed, something was preventing the Edge from sending DNS queries to the internet. That’s fixed now, but still – same issue. Additionally, the problem was not limited to communication with the Office 365 UM service: all external communication that required the AV service failed.
Needless to say, throughout the entire process – No errors on the Edge server. Management store replication is ok, certificates are ok, server is patched, restarted – It’s all dandy and happy in Edge kingdom.
I insisted on rechecking all the firewall rules – all seemed to be in place. I used James Cussen’s great tool to test Edge connectivity – all results were successful.
After examining the UCCAPI log files of the clients and tracing the Edge server’s logs – everything was still ‘working’ as far as the Edge server was concerned. We could see the SIP traffic working perfectly (plus, we had IM and presence functioning), and the sessions would only drop as soon as the other party “picked up” the call.
This is where things get a little interesting. Back to Networking 101: if I’m testing a TCP connection, I will only accept the session as “Successful” if the three-way handshake is completed.
This means that if the service is not responding, I will not get the server’s SYN-ACK, and the connection will time out.
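As a rough illustration of that TCP check, here is a minimal Python sketch. It is not something I ran during the original troubleshooting, and the function name tcp_port_open is made up for this example. It reports success only when the operating system completes the handshake; a timeout or a reset both count as failure.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True only if the TCP three-way handshake completes.

    Hypothetical helper for illustration: create_connection() returns
    once the handshake is done, and raises OSError (timeout, refused,
    unreachable, name resolution failure) otherwise.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_port_open("AV.Edge.com", 443) against the placeholder host name used in the traces in this post would return False whenever the SYN is answered with a reset or not answered at all.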
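When using UDP, it’s a different story: there is no handshake, so the sender gets no confirmation that anything is listening unless the application itself sends a reply.
So “testing” a UDP service might be a little tricky…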
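To make the difference concrete, here is a small Python sketch of a UDP probe. The helper name udp_probe and the payload are assumptions for illustration only; a real test would need to send something the target application actually answers. The key point is that a None result is ambiguous: the datagram may have been dropped on the way, or the application may simply not be replying.

```python
import socket


def udp_probe(host, port, payload, timeout=3.0):
    """Send one datagram and wait for any reply.

    Hypothetical helper for illustration. Returns the reply bytes, or
    None if nothing came back before the timeout. None does NOT prove
    the port is closed; UDP gives no handshake to rely on.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _addr = sock.recvfrom(2048)
            return data
        except OSError:
            # Covers the timeout as well as ICMP-triggered errors
            # that some platforms surface on recvfrom().
            return None
```

Compare this with the TCP case: there the operating system answers the SYN on the application’s behalf, while here only the application can tell you it is alive.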
This made me suspicious of the AV service, being the one in charge of our RTP traffic.
With no other options left, I started tracing the actual UDP sessions.
Here’s how it looks when the AV service is not cooperating:
lync.exe 172.25.20.99 AV.Edge.com TURN TURN:TURN:Allocate Request {UDP:59, IPv4:58}
lync.exe 172.25.20.99 AV.Edge.com TCP TCP:Flags=......S., SrcPort=10963, DstPort=HTTPS(443),
lync.exe AV.Edge.com 172.25.20.99 TCP TCP:Flags=...A.R.., SrcPort=HTTPS(443), DstPort=10963,
lync.exe 172.25.20.99 AV.Edge.com TCP TCP:Flags=......S., SrcPort=10851, DstPort=HTTPS(443),
lync.exe AV.Edge.com 172.25.20.99 TCP TCP:Flags=...A.R.., SrcPort=HTTPS(443), DstPort=10851,
lync.exe 172.25.20.99 AV.Edge.com TURN TURN:TURN:Allocate Request {UDP:59, IPv4:58}
lync.exe 172.25.20.99 AV.Edge.com TURN TURN:TURN:Allocate Request {UDP:59, IPv4:58}
lync.exe 172.25.20.99 AV.Edge.com TCP TCP:Flags=......S., SrcPort=10963, DstPort=HTTPS(443),
lync.exe AV.Edge.com 172.25.20.99 TCP TCP:Flags=...A.R.., SrcPort=HTTPS(443), DstPort=10963,
Digging a little deeper into the TURN Allocate Request, we can see all the right details:
Frame: Number = 194, Captured Frame Length = 232, MediaType = WiFi
+ WiFi: [Unencrypted Data] .T....., (I)
+ LLC: Unnumbered(U) Frame, Command Frame, SSAP = SNAP(Sub-Network Access Protocol), DSAP = SNAP(Sub-Network Access Protocol)
+ Snap: EtherType = Internet IP (IPv4), OrgCode = XEROX CORPORATION
+ Ipv4: Src = 172.25.20.99, Dest = AV.Edge.com, Next Protocol = UDP, Packet ID = 22808, Total IP Length = 168
+ Udp: SrcPort = 8588, DstPort = 3478, Length = 148
+ TURN: TURN:Allocate Request
This is where I should be getting back a TURN:Allocate Response from the application. Yet, no reply.
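If you have a long capture, counting requests against replies can make the silence obvious. The following Python sketch is only an illustration: it assumes the capture was exported as one summary line per frame in the same shape as the lines quoted in this post, and the function name turn_allocate_summary is made up. It uses plain substring matching, nothing protocol-aware.

```python
def turn_allocate_summary(lines):
    """Count TURN Allocate requests and replies in summary lines.

    Hypothetical helper for illustration. Assumes each line contains
    the frame description text as shown in the traces in this post.
    """
    counts = {"request": 0, "response": 0, "error": 0}
    for line in lines:
        # Check the error form first: it also contains "Response".
        if "Allocate Error Response" in line:
            counts["error"] += 1
        elif "Allocate Response" in line:
            counts["response"] += 1
        elif "Allocate Request" in line:
            counts["request"] += 1
    return counts
```

Requests with zero responses and zero errors, as in the failing trace above, mean the packets reach the server’s address but the application never answers.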
I tried stopping the Edge AV service – it said “Stopping” for 30 minutes but never stopped, even when using the -Force switch. Killing the process and the task did not work either.
This is where I tried to remove the Lync Edge components from “Programs and Features”. This failed as well, saying there was a problem with the “Lync Server Media Relay Driver” on the Local Area Connection interface.
Immediately went to “Network Connections” and what do you know?! This is what I see:
I uninstalled it, ran Bootstrapper again, and retried the connection. The result was clear:
lync.exe 172.25.20.99 AV.Edge.com TURN TURN:TURN:Allocate Request {UDP:40, IPv4:39}
lync.exe 172.25.20.99 AV.Edge.com TURN TURN:TURN:Allocate Request {UDP:45, IPv4:39}
lync.exe 172.25.20.99 AV.Edge.com TLS TLS:TLS Rec Layer-1 HandShake: Client Hello. {TLS:47, SSLVersionSelector:46, TCP:44, IPv4:39}
lync.exe AV.Edge.com 172.25.20.99 TURN TURN:Control message, TURN:Allocate Error Response {TCP:41, IPv4:39}
lync.exe AV.Edge.com 172.25.20.99 TURN TURN:TURN:Allocate Error Response {UDP:45, IPv4:39}
lync.exe AV.Edge.com 172.25.20.99 TURN TURN:TURN:Allocate Response {UDP:40, IPv4:39}
lync.exe AV.Edge.com 172.25.20.99 TLS TLS:TLS Rec Layer-1 HandShake: Server Hello. Server Hello Done. {TLS:47, SSLVersionSelector:46, TCP:44, IPv4:39}
Almost immediately you can see that the application is responding: we get the TURN:Allocate Response, and the TLS handshake completes.
Remember this next time you’re having issues with the Lync Edge AV service.