if port == 80:
    network = "HTTP"
elif port == 443:
    network = "HTTPS"
else:
    network = "UNKNOWN"
This can’t be the way, right? If malware uses HTTP over port 8080, this code would mark it as unknown. Wireshark obviously does not mark it as unknown, so there has to be a way for it to determine which protocol the network traffic is using.
The Ethernet Frame and the data
Networking is exceedingly complex, and there are many layers to unwrap (hehe). I won’t go all the way back and explain it from the very basics; instead, let’s take a look at the Ethernet frame.
An Ethernet frame travels over the Data Link layer, which is layer 2 in the OSI model.
Layer Zero is you
The Ethernet frame encapsulates many pieces of data, each meant for a different layer in the OSI model. The part of the frame we are concerned with is the data segment, which ultimately gets used by the Application Layer.
It’s in the data segment that, well, data gets transmitted. And it’s here that we can get a hint of what protocol the traffic is using.
Through the entire decapsulation of the Ethernet frame, there is no single step that tells us which application protocol is being used. The most we can get is whether it’s TCP or UDP at the Transport Layer.
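To make the decapsulation concrete, here is a minimal sketch that peels an Ethernet frame down to its payload. It assumes a well-formed, untagged Ethernet II frame carrying IPv4, and the function name `decapsulate` is my own. Notice that the headers give us the transport protocol and the port numbers, and nothing else.

```python
import struct

def decapsulate(frame: bytes):
    """Return (transport, src_port, dst_port, payload), or None if unsupported.

    Sketch only: assumes a well-formed Ethernet II frame carrying IPv4.
    """
    # Ethernet header: 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType.
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:  # 0x0800 = IPv4; anything else is skipped here
        return None

    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                     # IPv4 header length in bytes
    total_len = struct.unpack("!H", ip[2:4])[0]  # drops any Ethernet padding
    proto = ip[9]                                # 6 = TCP, 17 = UDP
    segment = ip[ihl:total_len]

    if proto == 6:
        src_port, dst_port = struct.unpack("!HH", segment[:4])
        data_offset = (segment[12] >> 4) * 4     # TCP header length in bytes
        return "TCP", src_port, dst_port, segment[data_offset:]
    if proto == 17:
        src_port, dst_port = struct.unpack("!HH", segment[:4])
        return "UDP", src_port, dst_port, segment[8:]  # UDP header is 8 bytes
    return None
```

Everything this returns is either a number or an opaque blob of bytes, so none of the headers spells out "this is HTTP".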
Here’s a really cool diagram that shows the decapsulation process, with the Ethernet frame at the top (Data Link layer), followed by the network packet in the middle (Network layer) and the data contained within the network packet (Transport layer).
Okay, but how do we know which protocol it’s using? Say it’s a TCP packet: how do we know if it’s HTTP, HTTPS, SSH, or something else?
A peek into Wireshark
Here we make a request to google.com
We can see Wireshark labelling the packet as HTTP, and when we highlight the payload below, we see the exact data segment that is HTTP.
An answer on Stack Overflow says:
Determining HTTP traffic "based on bytes" can be done by looking at the payload: HTTP requests and responses have known formats. For example HTTP 1.1 requests start with <METHOD> <URI> HTTP/1.1\r\n, and responses with HTTP/1.1 <CODE> <MSG>\r\n
So perhaps all it’s doing is text analysis on the TCP data payload and performing some matching? Even this method is preferable to “if 80 then HTTP”. Let’s try to validate this with some experiments and see if Wireshark really relies on textual data in the payload for inference. (I know, this might be the answer without all this other stuff, but where’s the fun in that without poking around?)
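As a rough illustration of what "matching on the payload" could look like, here is a small sketch that checks whether a TCP payload starts with an HTTP/1.x request line or status line, following the format quoted above. The name `looks_like_http` and the regular expressions are my own, and this is not how Wireshark itself implements the check.

```python
import re

# Standard HTTP request methods; a real dissector may accept more.
METHODS = rb"GET|HEAD|POST|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH"

# "<METHOD> <URI> HTTP/1.x\r\n"
REQUEST_LINE = re.compile(rb"^(?:" + METHODS + rb") \S+ HTTP/1\.[01]\r\n")
# "HTTP/1.x <CODE> <MSG>\r\n"
STATUS_LINE = re.compile(rb"^HTTP/1\.[01] \d{3} [^\r\n]*\r\n")

def looks_like_http(payload: bytes) -> bool:
    """Guess whether a TCP payload begins with an HTTP/1.x message."""
    return bool(REQUEST_LINE.match(payload) or STATUS_LINE.match(payload))
```

Note that the port number never appears here: `looks_like_http(b"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")` is true whatever port the bytes arrived on.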
This time I SSH to a server, and Wireshark labels the packet as SSHv2.
We look at the data encapsulated in the TCP payload and see strings that help Wireshark deduce that it’s SSH (SSH-2.0-OpenSSH_9...).
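The same trick works for SSH, because both sides open the connection with a plain-text identification string of the form `SSH-<protoversion>-<softwareversion>`. Here is a small sketch that pulls those two fields out of a payload. The function name `parse_ssh_banner` is hypothetical, and the sketch only handles the case where the banner is the very first thing in the payload.

```python
def parse_ssh_banner(payload: bytes):
    """Return (protocol_version, software_version), or None if not an SSH banner."""
    if not payload.startswith(b"SSH-"):
        return None
    # The banner is the first line; tolerate both "\r\n" and bare "\n".
    line = payload.split(b"\n", 1)[0].rstrip(b"\r")
    parts = line.split(b"-", 2)  # [b"SSH", protoversion, rest]
    if len(parts) != 3 or not parts[1] or not parts[2]:
        return None
    # Optional comments may follow the software version after a space.
    software = parts[2].split(b" ", 1)[0]
    return parts[1].decode("ascii", "replace"), software.decode("ascii", "replace")
```

For a banner like the one in the capture, this would return `"2.0"` as the protocol version, which lines up with Wireshark’s SSHv2 label.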
How about DNS then?
Aha! There’s no string present in the DNS packet that tells Wireshark that it’s DNS. Even the starting bytes differ between packets (9e a5 01 ... vs db ce 01 ...), so they can’t be used as a magic byte identifier either.
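The differing first bytes make sense once you look at the DNS header layout: a DNS message starts with a 12-byte header whose first two bytes are a transaction ID picked by the client, followed by two bytes of flags. Here is a minimal sketch of parsing that header. The function name `parse_dns_header` and the returned field names are my own.

```python
import struct

def parse_dns_header(payload: bytes):
    """Parse the fixed 12-byte DNS header; return None if the payload is too short."""
    if len(payload) < 12:
        return None
    # Six big-endian 16-bit fields: ID, flags, and four record counts.
    txid, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", payload[:12])
    return {
        "id": txid,                              # changes from query to query
        "is_response": bool(flags & 0x8000),     # QR bit
        "opcode": (flags >> 11) & 0xF,           # 0 = standard query
        "recursion_desired": bool(flags & 0x0100),
        "questions": qdcount,
        "answers": ancount,
    }
```

If the two captures above are ordinary queries, `9e a5` and `db ce` would simply be two different transaction IDs and the following `01` the start of the flags, which is why the leading bytes are useless as a signature.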
Another answer on Stack Overflow says:
Wireshark uses various techniques to identify protocols. For DNS and RADIUS, it does it based on the port number.
Based on the port numbers, Wireshark chooses which dissector to use to parse the protocol within the traffic. You can change this in Wireshark by using the Decode As... feature.
It’s here that we see the mapping of ports to dissectors. The code defines a fixed set of ports that will use the DNS dissector by default.
We see that it maps common ports like 80, 8080, 443 to HTTP.
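Conceptually, this port mapping is just a lookup table that dissectors register themselves in. Below is a toy sketch of that idea, assuming a table keyed by port number. All the names (`PORT_TABLE`, `register_ports`, `dissector_for`, `decode_as`) are hypothetical and the port lists are trimmed down; this is not Wireshark’s actual code.

```python
# Toy port-to-dissector table; names and ports are illustrative only.
PORT_TABLE = {}

def register_ports(dissector: str, ports) -> None:
    """Register a dissector name as the default for each given port."""
    for port in ports:
        PORT_TABLE[port] = dissector

def dissector_for(src_port: int, dst_port: int):
    """Look up a dissector by either endpoint's port; None if neither is registered."""
    return PORT_TABLE.get(dst_port) or PORT_TABLE.get(src_port)

def decode_as(port: int, dissector: str) -> None:
    """Manual override, in the spirit of the Decode As... feature."""
    PORT_TABLE[port] = dissector

register_ports("dns", [53])
register_ports("http", [80, 8080])
```

With only this table, traffic between ports 51000 and 6969 has no match, so `dissector_for(51000, 6969)` returns `None` until someone calls `decode_as(6969, "http")`.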
Let’s set up a server listening on port 6969, which is not in that list, and see if Wireshark can decode what it is.
That’s weird. It seems it’s still able to? If we look at the default ports that use the HTTP dissector, port 6969 is clearly not in there, yet Wireshark still knew to use the HTTP dissector.
For example TCP defines port 80 only for the use of HTTP traffic. But, this convention doesn't prevent anyone from using TCP port 80 for some different protocol, or on the other hand using HTTP on a port number different to 80. To solve this problem, Wireshark introduced the so called heuristic dissector mechanism to try to deal with these problems.
So if Wireshark has to decode TCP packet data, it will first try to find a dissector registered directly for the TCP port used in that packet. If it finds such a registered dissector it will just hand over the packet data to it.
AKA the ports defined in the source code we saw earlier
So the heuristic dissector will check incoming packet data for all of the 4 above conditions, and only if all of the four conditions are true there is a good chance that the packet really contains the expected protocol - and the dissector continues to decode the packet data. If one condition fails, it's very certainly not the protocol in question and the dissector returns to WS immediately "this is not my protocol" - maybe some other heuristic dissector is interested!
So Wireshark first passes the packet to the dissector registered for its port. If that dissector can’t dissect it, Wireshark passes it on to the heuristic dissectors, one after another, until it finds the correct one.
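Here is a toy version of that two-stage lookup, assuming each dissector is just a function that returns `True` when the payload looks like its protocol. The names `identify`, `PORT_DISSECTORS`, and `HEURISTIC_DISSECTORS` are made up for this sketch, and the real dispatch logic in Wireshark is considerably more involved.

```python
def http_heuristic(payload: bytes) -> bool:
    """Accept payloads whose first line looks like an HTTP/1.x request or response."""
    first_line = payload.split(b"\r\n", 1)[0]
    return first_line.endswith((b" HTTP/1.0", b" HTTP/1.1")) or first_line.startswith(b"HTTP/1.")

def ssh_heuristic(payload: bytes) -> bool:
    """Accept payloads that start with an SSH identification string."""
    return payload.startswith(b"SSH-")

# Stage 1: dissectors registered directly for a port (illustrative ports only).
PORT_DISSECTORS = {
    22: ("SSH", ssh_heuristic),
    80: ("HTTP", http_heuristic),
    8080: ("HTTP", http_heuristic),
}
# Stage 2: heuristic dissectors, tried in order regardless of port.
HEURISTIC_DISSECTORS = [("HTTP", http_heuristic), ("SSH", ssh_heuristic)]

def identify(port: int, payload: bytes) -> str:
    registered = PORT_DISSECTORS.get(port)
    if registered is not None:
        name, accepts = registered
        if accepts(payload):
            return name
    for name, accepts in HEURISTIC_DISSECTORS:
        if accepts(payload):  # otherwise: "this is not my protocol"
            return name
    return "UNKNOWN"
```

In this sketch, `identify(6969, b"GET / HTTP/1.1\r\nHost: x\r\n\r\n")` comes back as `"HTTP"` even though nothing is registered for port 6969, which matches what we saw in the experiment.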
TLDR
Aside from doing deep packet inspection,
if port == 80:
    network = "HTTP"
elif port == 443:
    network = "HTTPS"
else:
    network = "UNKNOWN"
is actually a pretty okay approach. It just needs to include more ports and not just two!