Operational latency
While I did state earlier that latency typically describes only the time that a packet spends in transit, it is often useful for you, as a network engineer, to consider the impact of latency on your end user's experience. As much as we might like to, no engineer can get away with ignoring a negative user experience by claiming that its causes are out of their control. Even if your software is performing optimally and is deployed to a lightning-fast fiber-optic network, if it depends on an upstream resource provider that is slow to process requests, your end user will ultimately feel that pain, no matter how perfect your own code is. For this reason, it's often useful to keep track of the actual, overall window of time necessary to process a given network request, including the processing time on the remote host. This measurement, called operational latency, is the most meaningful when considering the impact of network operations on your user's experience. So, while most of the factors contributing to the operational latency of a task are typically out of your control, it is important to be aware of its impact and, wherever possible, try to reduce it to a minimum.
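To make that concrete, here is a minimal sketch, in Python, of measuring operational latency by timing the full round trip of a request, including whatever time the remote host spends processing it. The URL shown is hypothetical; substitute whatever upstream resource your application actually depends on:

import time
import urllib.request

def timed_get(url: str) -> tuple[bytes, float]:
    """Fetch a URL and return the body plus the operational latency:
    the full wall-clock window, including remote processing time."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as response:
        body = response.read()
    return body, time.perf_counter() - start

# Hypothetical upstream resource; the measured window covers transit in
# both directions plus the time the remote host takes to build a response.
body, latency = timed_get("http://transaction-db/transaction/42")
print(f"operational latency: {latency * 1000:.1f} ms")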
Ultimately, what each of these individual metrics should tell you is that there are dozens of points throughout a network request at which latency can be introduced. Each has a varying degree of impact, and each is under a varying degree of your control, but to the extent that you can, you should always seek to minimize the number of points in your application where external latency can creep in. Designing for low network latency up front is always easier than trying to retrofit it after the fact. Doing so isn't always easy or obvious, though, and optimizing for minimal latency can look different from either side of a request.
To illustrate, imagine you are writing an application that is responsible for collecting one or more transaction IDs, looking up the monetary value of those transactions, and then returning their sum. Being a forward-thinking developer, you've separated this transaction aggregation service from the database of transactions to keep the business logic of your service decoupled from your data-storage implementation. To facilitate data access, you've exposed the transaction table through a simple REST API with an endpoint for individual transaction lookups by way of a single key in the URL, such as transaction-db/transaction/{id}. This makes the most sense to you since each transaction has a unique key, and individual-transaction lookup minimizes the amount of information returned by your database service. Less content passed over the network means less latency, so, from the data producer's perspective, you have designed well.
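For reference, the data-producer side of this design might look something like the following minimal sketch. It uses Python with Flask purely for illustration, and the in-memory TRANSACTIONS dictionary is a hypothetical stand-in for the real transaction table:

from flask import Flask, abort, jsonify

app = Flask(__name__)

# Hypothetical in-memory stand-in for the transaction table.
TRANSACTIONS = {1: 19.99, 2: 5.25, 3: 102.50}

@app.route("/transaction/<int:txn_id>")
def get_transaction(txn_id: int):
    """Single-record lookup: one ID in the URL, one small record back."""
    if txn_id not in TRANSACTIONS:
        abort(404)
    return jsonify({"id": txn_id, "amount": TRANSACTIONS[txn_id]})

if __name__ == "__main__":
    app.run(port=8080)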
Your aggregation service, though, is another story. That service will need multiple transaction records to generate a meaningful output. With only a single endpoint returning a single record at a time, the aggregation service will have to send multiple, simultaneous requests to the transaction service, each of which contributes its own mechanical, OS, and operational latencies. While modern OSes allow multiple network requests to be processed simultaneously across threads, there is an upper limit to the number of available threads in a given process. As the number of transactions grows, requests will start to queue up, preventing simultaneous processing and increasing the operational latency experienced by the user.
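A sketch of the consumer side shows where that cost comes from. Assuming the hypothetical single-record endpoint above, the aggregation service has no choice but to issue one request per ID, and a bounded thread pool means that, past a certain point, additional requests simply wait their turn:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical address of the transaction service sketched above.
BASE_URL = "http://localhost:8080/transaction/{}"

def fetch_amount(txn_id: int) -> float:
    """One full HTTP round trip per transaction ID."""
    with urllib.request.urlopen(BASE_URL.format(txn_id), timeout=5) as resp:
        return json.load(resp)["amount"]

def sum_transactions(txn_ids: list[int]) -> float:
    # The pool caps concurrency; once the IDs outnumber the workers,
    # the remaining requests queue up, adding to operational latency.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return sum(pool.map(fetch_amount, txn_ids))

print(sum_transactions([1, 2, 3]))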
In this case, optimizing for both sides is a simple matter of adding an additional REST endpoint that accepts HTTP POST requests with multiple transaction IDs in the request body. Most readers will likely have seen this pattern before, but the example is a useful illustration of how optimal performance can look very different on either side of the same request. Often, you won't be responsible for both the service application and the database API, and in those cases, you will have to do the best you can to improve performance from your side alone.
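Continuing the hypothetical sketch from before, such a batch endpoint might look something like this; the route name and request shape are illustrative, not prescribed by the example:

from flask import Flask, jsonify, request

app = Flask(__name__)

# The same hypothetical stand-in for the transaction table.
TRANSACTIONS = {1: 19.99, 2: 5.25, 3: 102.50}

@app.route("/transactions/batch", methods=["POST"])
def get_transactions_batch():
    """Batch lookup: many IDs in one request body, many records in one response."""
    ids = request.get_json().get("ids", [])
    records = [{"id": i, "amount": TRANSACTIONS[i]} for i in ids if i in TRANSACTIONS]
    return jsonify({"transactions": records})

The aggregation service can now retrieve every record it needs with a single POST, paying the round-trip cost once rather than once per transaction.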
No matter which side of a request you're on, though, the impact of network latency on application performance demands that you consider minimizing the size of the atomic data packets that must be sent over the network. Breaking large requests down into smaller, bite-sized pieces gives every device in the communication chain more opportunities to step in, perform other operations, and then return to processing your packets. If a single network request will block other network operations for the duration of an entire 5 MB file transfer, it might be given lower priority in the queue of network transactions that your OS is maintaining. If the OS only needs to slot in a small, 64-byte packet for transmission, however, it can likely find many more opportunities to send it, reducing your OS latency overall.
If your application must send 5 MB of data, then doing so in 64-byte packets gives your application's hosting context much more flexibility in determining the best way to service that requirement.
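One way to picture this from application code is to hand a large buffer to the socket in fixed-size pieces rather than in a single call. The host, port, and chunk size below are hypothetical, and in practice the OS and protocol stack decide how the bytes are actually packetized, but the principle of giving the stack smaller units to schedule is the same:

import socket

CHUNK_SIZE = 64  # the small write size from the example above

def send_in_chunks(host: str, port: int, payload: bytes) -> None:
    """Hand the payload to the OS in small pieces rather than one large write,
    giving the network stack more chances to interleave other work."""
    with socket.create_connection((host, port)) as conn:
        for offset in range(0, len(payload), CHUNK_SIZE):
            conn.sendall(payload[offset:offset + CHUNK_SIZE])

# A hypothetical 5 MB transfer broken into 64-byte writes.
send_in_chunks("localhost", 9000, b"\x00" * (5 * 1024 * 1024))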