📶

Determining Protocols based on Network Packets

Recently I came across a problem of how to determine a protocol based on network data.
A very naive and flawed method is to do the following:
    if port == 80:
        network = "HTTP"
    elif port == 443:
        network = "HTTPS"
    else:
        network = "UNKNOWN"
This can’t be the way, right? If malware speaks HTTP over port 8080, this code would mark it as unknown. Wireshark obviously does not, so there has to be a way for it to determine which protocol the traffic is using.

The Ethernet Frame and the data

Networking is exceedingly complex, and there are many layers to unwrap (hehe). I won’t go all the way back and explain it from the very basics; instead, let’s take a look at the Ethernet frame.
An Ethernet frame travels across the Data Link layer, which is layer 2 in the OSI model.
[Image: Layer Zero is you]
The Ethernet frame encapsulates many pieces of data, each meant for a different layer of the OSI model. The part we are concerned with is the data segment (the payload), which ultimately gets used by the Application Layer.
It’s in the data segment that, well, data gets transmitted. And it’s here that we can get a hint of what protocol the traffic is using.
Through the entire decapsulation of the Ethernet frame, no single step tells us which application protocol is being used. The most we learn is whether the packet carries TCP or UDP, which the Protocol field in the IP header tells us about the Transport Layer.
Here’s a really cool diagram that shows the decapsulation process, with the Ethernet frame at the top (data link), followed by the network packet in the middle (network) and the data contained within the network packet (transport layer).
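To make the decapsulation concrete, here is a minimal sketch that peels an Ethernet II frame by hand using only the standard library. It assumes IPv4 with no VLAN tag, and `decapsulate` and `build_demo_frame` are hypothetical helper names, not anything from Wireshark. Notice that the headers reveal the EtherType, the transport protocol, and the ports, but nothing names the application protocol.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
IP_PROTO_TCP = 6
IP_PROTO_UDP = 17


def decapsulate(frame: bytes) -> dict:
    """Peel an Ethernet II frame down to the transport payload."""
    # Data link: 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    info = {"ethertype": ethertype}
    if ethertype != ETHERTYPE_IPV4:
        return info  # this sketch only follows IPv4

    # Network: the low nibble of byte 0 is the header length in 32-bit words,
    # and byte 9 is the Protocol field naming the transport protocol.
    packet = frame[14:]
    header_len = (packet[0] & 0x0F) * 4
    info["ip_proto"] = packet[9]
    segment = packet[header_len:]

    # Transport: TCP and UDP both begin with source and destination ports.
    if info["ip_proto"] == IP_PROTO_TCP:
        src, dst = struct.unpack("!HH", segment[:4])
        data_offset = (segment[12] >> 4) * 4  # TCP header length in bytes
        info.update(transport="TCP", src_port=src, dst_port=dst,
                    payload=segment[data_offset:])
    elif info["ip_proto"] == IP_PROTO_UDP:
        src, dst = struct.unpack("!HH", segment[:4])
        info.update(transport="UDP", src_port=src, dst_port=dst,
                    payload=segment[8:])  # the UDP header is always 8 bytes
    return info


def build_demo_frame(payload: bytes, dst_port: int = 8080) -> bytes:
    """Hypothetical helper: hand-build a TCP/IPv4 frame (checksums left at 0)."""
    loopback = bytes([127, 0, 0, 1])
    ethernet = bytes(6) + bytes(6) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0,
                     64, IP_PROTO_TCP, 0, loopback, loopback)
    tcp = struct.pack("!HHIIBBHHH", 50000, dst_port, 0, 0, 5 << 4, 0x18,
                      65535, 0, 0)
    return ethernet + ip + tcp + payload
```

Calling `decapsulate(build_demo_frame(b"GET / HTTP/1.1\r\n\r\n"))` yields the transport name, both ports, and the raw payload bytes; deciding that those bytes are HTTP is a separate problem.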
Okay, but how do we know which protocol it’s using? Say it’s a TCP packet: how do we know if it’s HTTP, HTTPS, SSH, or something else?

A peek into Wireshark

Here we make a request to google.com.
We can see Wireshark labelling the packet as HTTP, and when we highlight the payload below, we see the exact data segment that is HTTP.
An answer on Stack Overflow says:
Determining HTTP traffic "based on bytes" can be done by looking at the
payload: HTTP requests and responses have known formats. For example
HTTP 1.1 requests start with <METHOD> <URI> HTTP/1.1\r\n, and responses with HTTP/1.1 <CODE> <MSG>\r\n
So perhaps all it’s doing is text analysis on the TCP data payload and performing some matching? Even this method is preferable to "if 80 then HTTP". Let’s try to validate this with some experiments and see if Wireshark really relies on textual data in the payload for inference. (I know, this might be the answer without all this other stuff, but where’s the fun in that without poking around?)
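As a rough sketch of what such payload matching could look like, the snippet below checks whether a TCP payload begins with something shaped like an HTTP/1.x request line or status line, following the formats quoted above. The function name `looks_like_http` and the exact regular expressions are my own assumptions; this is not how Wireshark implements it.

```python
import re

# <METHOD> <URI> HTTP/x.y\r\n
REQUEST_LINE = re.compile(rb"^[A-Z]+ \S+ HTTP/\d\.\d\r\n")
# HTTP/x.y <CODE> <MSG>\r\n (the message part may be empty)
STATUS_LINE = re.compile(rb"^HTTP/\d\.\d \d{3}(?: [^\r\n]*)?\r\n")


def looks_like_http(payload: bytes) -> bool:
    """Return True if the payload starts like an HTTP/1.x request or response."""
    return bool(REQUEST_LINE.match(payload) or STATUS_LINE.match(payload))
```

This only inspects the first line of the first segment, so it would miss a payload captured mid-stream, and it says nothing about encrypted traffic such as HTTPS.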
This time I SSH to a server, and Wireshark labels the packet as SSHv2.
We look at the data encapsulated in the TCP payload and see strings that help Wireshark deduce that it’s SSH (SSH-2.0-OpenSSH_9...).
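The SSH case is even easier to match than HTTP, because an SSH endpoint opens with an identification line of the form `SSH-<protoversion>-<softwareversion>`. Here is a small sketch that recognises and splits that line; `parse_ssh_banner` is a hypothetical helper, and the banner in the usage note is made up for illustration.

```python
from typing import Optional, Tuple


def parse_ssh_banner(payload: bytes) -> Optional[Tuple[str, str]]:
    """Return (protocol_version, software) if the payload opens with an SSH banner."""
    if not payload.startswith(b"SSH-"):
        return None
    # The identification string is the first line, terminated by CR LF.
    line = payload.split(b"\n", 1)[0].rstrip(b"\r")
    parts = line.decode("ascii", errors="replace").split("-", 2)
    if len(parts) < 3:
        return None
    _, proto_version, software = parts
    return proto_version, software
```

For example, `parse_ssh_banner(b"SSH-2.0-ExampleServer_1.0\r\n")` gives back the version `2.0` and the software string, while an HTTP request line returns `None`.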
How about DNS then?
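Ah hah! There’s no string in the DNS packet that tells Wireshark it’s DNS. Even the starting bytes differ between the two packets (9e a5 01 ... vs db ce 01 ...), so they can’t be used as a magic-byte identifier either. That makes sense: the first two bytes of a DNS message are the transaction ID, which changes from query to query.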
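To see why DNS has no magic bytes, here is a sketch that parses the fixed 12-byte DNS header and builds two otherwise identical queries. Only the two transaction IDs (0x9ea5 and 0xdbce) come from the captures above; the rest of each message, including the `example.com` name and the helper names `parse_dns_header` and `build_query`, is assumed for illustration.

```python
import struct


def parse_dns_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header: ID, flags, and four counters."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12])
    return {
        "id": ident,                        # transaction ID, chosen per query
        "is_response": bool(flags & 0x8000),
        "opcode": (flags >> 11) & 0x0F,
        "questions": qdcount,
        "answers": ancount,
    }


def build_query(ident: int, name: str = "example.com") -> bytes:
    """Hypothetical helper: build a standard A-record query for `name`."""
    header = struct.pack("!6H", ident, 0x0100, 1, 0, 0, 0)  # RD flag, 1 question
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


first = build_query(0x9EA5)
second = build_query(0xDBCE)
```

The two messages are byte-for-byte identical after the first two bytes, so the only part that varies is exactly the part a magic-byte check would look at.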
Another answer on Stack Overflow says:
Wireshark uses various techniques to identify protocols. For DNS and RADIUS, it does it based on the port number.
So maybe it’s a combination of both text analysis on the payload and port number inference?

Peeking at the source code

Wireshark uses what are called dissectors to dissect a packet.
Based on the port number, Wireshark chooses which dissector to use to parse the protocol within the traffic. You can change this in Wireshark by using the Decode As... feature.
It’s here that we see the mapping of ports to dissectors. The code defines a fixed set of ports that will use the DNS dissector by default.
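In spirit, this is a lookup table from (transport, port) to a protocol, with Decode As... acting as a user override. The sketch below models that idea; the port lists are assumed for illustration and are not copied from Wireshark’s source, and all names are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed, illustrative defaults; not Wireshark's actual lists.
DEFAULT_PORTS: Dict[str, Dict[str, Set[int]]] = {
    "DNS": {"udp": {53, 5353}, "tcp": {53}},
    "HTTP": {"tcp": {80, 8080}},
    "SSH": {"tcp": {22}},
}

PortTable = Dict[Tuple[str, int], str]


def build_port_table(defaults: Dict[str, Dict[str, Set[int]]]) -> PortTable:
    """Flatten per-protocol defaults into a (transport, port) -> protocol map."""
    table: PortTable = {}
    for protocol, by_transport in defaults.items():
        for transport, ports in by_transport.items():
            for port in ports:
                table[(transport, port)] = protocol
    return table


def decode_as(table: PortTable, transport: str, port: int, protocol: str) -> PortTable:
    """Return a copy of the table with one mapping overridden by the user."""
    updated = dict(table)
    updated[(transport, port)] = protocol
    return updated


def lookup(table: PortTable, transport: str, port: int) -> str:
    """Look up the protocol registered for a port, or UNKNOWN if none is."""
    return table.get((transport, port), "UNKNOWN")


PORT_TABLE = build_port_table(DEFAULT_PORTS)
```

With this table, UDP port 53 resolves to DNS, while TCP port 6969 stays unknown until an override such as `decode_as(PORT_TABLE, "tcp", 6969, "HTTP")` is applied.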
Hmm, OK. Let’s bounce back to the TCP dissector.
We see that it maps common ports like 80, 8080, 443 to HTTP.
Let’s set up a server listening on port 6969, which is not within that range, and see if Wireshark can decode the traffic.
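If you want to reproduce this experiment, one way is a tiny HTTP server from the Python standard library bound to port 6969, plus a client request to generate traffic for the capture. This is only a sketch of a possible setup, not necessarily the one used for the screenshot; `HelloHandler`, `start_server`, and `fetch` are hypothetical names.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with a short plain-text body."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_server(port: int = 6969) -> HTTPServer:
    """Serve HTTP on localhost in a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port: int = 6969) -> Tuple[int, bytes]:
    """Make one GET request so there is traffic to capture."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    response = conn.getresponse()
    body = response.read()
    conn.close()
    return response.status, body
```

Run `server = start_server()` followed by `fetch()` while capturing on the loopback interface, then call `server.shutdown()` when done.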
That’s weird. It seems like it’s still able to? If we look at the default ports that use the HTTP dissector, port 6969 is clearly not in there, yet Wireshark still knew to use the HTTP dissector.

Heuristic Dissectors

Okay, found the answer. Wireshark’s developer documentation on heuristic dissectors says:
For example TCP defines port 80 only for the use of HTTP traffic. But,
this convention doesn't prevent anyone from using TCP port 80 for some
different protocol, or on the other hand using HTTP on a port number
different to 80. To solve this problem, Wireshark introduced the so called heuristic
dissector mechanism to try to deal with these problems.
So if Wireshark has to decode TCP packet data, it will first try to find
a dissector registered directly for the TCP port used in that packet. If
it finds such a registered dissector it will just hand over the packet
data to it.
(That is, the ports defined in the source code we saw earlier.)
So the heuristic dissector will check incoming packet data for all of
the 4 above conditions, and only if all of the four conditions are true
there is a good chance that the packet really contains the expected
protocol - and the dissector continues to decode the packet data. If one
condition fails, it's very certainly not the protocol in question and
the dissector returns to WS immediately "this is not my protocol" -
maybe some other heuristic dissector is interested!
So Wireshark first passes the packet to the dissector registered for its port. If there is none, or that dissector can’t dissect it, Wireshark tries the heuristic dissectors one after another until one of them accepts the packet.
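The dispatch order described above can be sketched in a few lines: try the port-registered dissector first, then fall back to a list of heuristic checks that each either claim the payload or answer "not mine". Everything here is a simplified assumption for illustration (the names, the port table, and the two toy heuristics); real Wireshark dissectors do far more validation.

```python
import re
from typing import Callable, Dict, List, Optional

# A dissector returns a protocol name, or None for "this is not my protocol".
Dissector = Callable[[bytes], Optional[str]]

_HTTP_START = re.compile(rb"^(?:[A-Z]+ \S+ HTTP/\d\.\d|HTTP/\d\.\d \d{3})")


def dissect_http(payload: bytes) -> Optional[str]:
    """Toy heuristic: claim payloads that start like an HTTP/1.x message."""
    return "HTTP" if _HTTP_START.match(payload) else None


def dissect_ssh(payload: bytes) -> Optional[str]:
    """Toy heuristic: claim payloads that start with an SSH banner."""
    return "SSH" if payload.startswith(b"SSH-") else None


# Assumed port registrations and heuristic list for this sketch.
PORT_DISSECTORS: Dict[int, Dissector] = {80: dissect_http, 8080: dissect_http,
                                         22: dissect_ssh}
HEURISTIC_DISSECTORS: List[Dissector] = [dissect_http, dissect_ssh]


def identify(port: int, payload: bytes) -> str:
    """Port-registered dissector first, then each heuristic in turn."""
    registered = PORT_DISSECTORS.get(port)
    if registered is not None:
        result = registered(payload)
        if result is not None:
            return result
    for heuristic in HEURISTIC_DISSECTORS:
        result = heuristic(payload)
        if result is not None:
            return result
    return "UNKNOWN"
```

This is why the server on port 6969 was still labelled HTTP: no dissector is registered for that port, but the payload passes the HTTP heuristic.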

TLDR

Aside from doing deep packet inspection,
    if port == 80:
        network = "HTTP"
    elif port == 443:
        network = "HTTPS"
    else:
        network = "UNKNOWN"
is actually a pretty okay approach. It just needs to include more ports and not just two!