Good afternoon!

In short: the monitoring system is a complex that connects in non-intrusive mode to any number of 10-gigabit Ethernet links, continuously "monitors" the transmission of all RTP video streams present in the traffic, and takes measurements at a given interval so they can later be saved to a database. Reports for all cameras are regularly built from this database.

And what's so difficult?

In the process of finding a solution, several problems were immediately identified:

  • Non-intrusive connection. The monitoring system connects to already working channels in which most of the connections (via RTSP) are already established: the server and the client already know which ports the exchange takes place on, but we do not know this in advance. Only the RTSP protocol has a well-known port, while the UDP streams can use arbitrary ports (moreover, it turned out that they often violate the SHOULD-level even/odd port requirement, see rfc3550). How do we determine that a given packet from some IP address belongs to a video stream? The BitTorrent protocol, for example, behaves similarly: at the connection-establishment stage the client and server agree on ports, and after that all the UDP traffic looks like "just a bit stream".
  • Connected links can contain more than just video streams: there can be HTTP, BitTorrent, SSH, and any other protocol in use today. Therefore, the system must correctly identify video streams in order to separate them from the rest of the traffic. How can this be done in real time on eight ten-gigabit links? Of course, they are usually not filled to 100%, so the total traffic is not 80 Gbit/s but around 50-60, yet that is hardly little.
  • Scalability. Where there are already many video streams, there may soon be even more, since video surveillance has long proven itself an effective tool. This means the system needs a performance margin and spare links.

Looking for a suitable solution...

Naturally, we tried to make the most of our own experience. By the time the decision was made, we already had an implementation of Ethernet packet processing on the FPGA-powered device Bercut-MX (or simply MX). With the help of Bercut-MX we could extract the fields needed for analysis from Ethernet packet headers. Unfortunately, we had no experience in processing such a volume of traffic on "regular" servers, so we looked at such a solution with some apprehension...

It would seem that it only remained to apply the method to RTP packets and we would have the golden key in our pocket, but MX can only process traffic; it has no means of recording and storing statistics. There is not enough memory in the FPGA to store the found connections (IP-IP-port-port combinations), because a 2x10-gigabit link at the input can carry about 15 thousand video streams, and for each one we need to "remember" the number of received packets, the number of lost packets, and so on. Moreover, lossless lookup at such a speed over such an amount of data becomes a non-trivial task.

To find a solution, we had to “dig deeper” and figure out what algorithms we will use to measure quality and identify video streams.

What can be measured by the fields of an RTP packet?

From the description it can be seen that, from the point of view of quality measurements, we are interested in the following RTP packet fields:

  • sequence number - a 16-bit counter incremented with each packet sent;
  • timestamp - a timestamp; for h.264 the sample duration is 1/90000 s (i.e. it corresponds to a 90 kHz frequency);
  • marker bit (M-bit). In rfc3550 it is described generically as marking "significant" events; in practice cameras most often set it on the first packet of a video frame and on specialized packets carrying SPS/PPS information.

It is quite obvious that the sequence number allows you to determine the following stream parameters:

  • packet loss (frame loss);
  • resending the packet (duplicate);
  • changing the order of arrival (reordering);
  • a camera reboot, visible as a large "gap" in the sequence.
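
The list above can be sketched in a few lines. This is a minimal illustration (not the production code) of classifying packets by their 16-bit sequence numbers with wraparound taken into account; the reboot threshold is an assumed value, not one from the article:

```python
REBOOT_GAP = 3000  # assumed threshold for "camera rebooted"

def seq_delta(new, prev):
    """Signed distance between two 16-bit sequence numbers (wrap-aware)."""
    return ((new - prev + 0x8000) & 0xFFFF) - 0x8000

def classify(new, prev):
    d = seq_delta(new, prev)
    if d == 1:
        return "ok"
    if d == 0:
        return "duplicate"
    if d < 0:
        return "reordered"
    if d > REBOOT_GAP:
        return "reboot"
    return f"lost {d - 1} packets"

print(classify(101, 100))    # ok
print(classify(100, 100))    # duplicate
print(classify(99, 100))     # reordered
print(classify(105, 100))    # lost 4 packets
print(classify(0, 65535))    # ok (wrap around 2^16)
```

The wrap-aware delta is what makes the 16-bit counter usable: without it, the rollover from 65535 to 0 would look like a huge negative jump.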

Timestamp allows you to measure:

  • delay variation (also called jitter); for this, a 90 kHz counter must run on the receiving side;
  • in principle, the packet transit delay. But for this you need to synchronize the camera clock with the timestamp, which is possible only if the camera transmits sender reports (RTCP SR). This is generally not the case: in real life many cameras ignore the RTCP SR message (about half of the cameras we have had a chance to work with).

Well, the M-bit allows you to measure the frame rate. True, the SPS/PPS packets of the h.264 protocol introduce an error, because they are not video frames. But it can be eliminated by using information from the NAL-unit header, which always follows the RTP header.

Detailed measurement algorithms are beyond the scope of this article, so I will not delve into them. If interested, rfc3550 contains example loss-calculation code and a formula for calculating jitter. The main conclusion is that only a few fields from the RTP packets and NAL units are enough to measure the basic characteristics of a transport stream. The rest of the information takes no part in the measurements and can and should be discarded!
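
The rfc3550 jitter estimator (section 6.4.1) is compact enough to show here. A sketch in Python, with the arrival clock assumed to tick at the same 90 kHz as the RTP timestamps:

```python
class JitterEstimator:
    """Interarrival jitter per rfc3550: J += (|D| - J) / 16."""

    def __init__(self):
        self.prev_transit = None
        self.jitter = 0.0

    def update(self, arrival_ts, rtp_ts):
        # "transit" is arrival time minus send time, in timestamp units;
        # its absolute value is unknown, but its variation is the jitter
        transit = arrival_ts - rtp_ts
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16.0  # gain 1/16 per the RFC
        self.prev_transit = transit
        return self.jitter

est = JitterEstimator()
# packets sent every 3000 ticks (30 fps at 90 kHz), arriving with variation
for arrival, ts in [(1000, 0), (4100, 3000), (6950, 6000)]:
    est.update(arrival, ts)
print(round(est.jitter, 2))  # 15.23
```

The 1/16 gain makes the estimate a smoothed running average, so a single late packet does not dominate the reported jitter.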

How to identify RTP streams?

To keep statistics, the information obtained from the RTP header must be “attached” to a certain camera (video stream) identifier. The camera can be uniquely identified by the following parameters:

  • Source and Destination IP Addresses
  • Source and Destination Ports
  • SSRC. It is of particular importance when several streams are broadcast from one IP, i.e. in the case of a multiport encoder.

Interestingly, at first we identified cameras only by source IP and SSRC, relying on the fact that SSRC should be random, but in practice it turned out that many cameras set SSRC to a fixed value (say, 256) - apparently to save resources. As a result, we had to add the ports to the camera ID. This solved the uniqueness problem completely.
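
The resulting identifier can be sketched as a simple key for the statistics table. This is an illustration of the idea only; the field names are mine, not the system's:

```python
from collections import namedtuple

# the camera (stream) identifier discussed above: SSRC alone proved
# non-unique in practice, so the key combines addresses, ports and SSRC
StreamKey = namedtuple("StreamKey", "src_ip dst_ip src_port dst_port ssrc")

stats = {}  # StreamKey -> per-stream counters

def account(key, seq):
    entry = stats.setdefault(key, {"received": 0, "last_seq": None})
    entry["received"] += 1
    entry["last_seq"] = seq

# two cameras with the same fixed SSRC=256 no longer collide:
account(StreamKey("10.0.0.1", "10.0.1.1", 5004, 5004, 256), 17)
account(StreamKey("10.0.0.2", "10.0.1.1", 5004, 5004, 256), 99)
print(len(stats))  # 2 distinct streams
```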

How to separate RTP packets from other traffic?

The question remains: how will Bercut-MX, having received a packet, understand that it is RTP? The RTP header has no explicit identification like the IP header does: it has no checksum, and it can be carried over UDP with port numbers that are selected dynamically when the connection is established. In our case most of the connections were established long ago, and one could wait a very long time for them to be re-established.

To solve this problem, rfc3550 (Appendix A.1) recommends checking the RTP version bits (two bits) and the Payload Type (PT) field (seven bits), which for dynamic types takes a small range of values. We found in practice that for the set of cameras we work with, PT falls in the range 96 to 100.

There is one more criterion - port parity - but, as practice has shown, it is not always observed, so it had to be abandoned.

Thus, Bercut-MX behaves as follows:

  1. we receive a packet and parse it into fields;
  2. if the version is 2 and the payload type is within the given range, we send the headers to the server.

Obviously, this approach produces false positives, since not only RTP packets can match such simple criteria. But what matters to us is that we definitely will not miss an RTP packet, and the server will filter out the "wrong" ones.

To filter out false cases, the server uses a mechanism that registers a video-traffic source only after several sequentially numbered packets (there is a sequence number in the packet!). If several packets arrive with consecutive numbers, that is no coincidence, and we start working with the stream. This algorithm proved very reliable.
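
The two-stage detection can be sketched as follows. The PT range is the one from the article's practice; the number of consecutive packets required for confirmation is an assumed constant, not a figure from the text:

```python
PT_RANGE = range(96, 101)   # 96..100 observed for our cameras
CONFIRM_COUNT = 3           # assumed confirmation threshold

def looks_like_rtp(first_two_bytes):
    """FPGA-side heuristic: version == 2, PT in the dynamic range."""
    version = first_two_bytes[0] >> 6
    pt = first_two_bytes[1] & 0x7F
    return version == 2 and pt in PT_RANGE

candidates = {}  # flow id -> (last seq, length of consecutive run)

def confirm(flow, seq):
    """Server side: True once `flow` shows CONFIRM_COUNT consecutive seqnums."""
    last, run = candidates.get(flow, (None, 0))
    run = run + 1 if last is not None and (seq - last) & 0xFFFF == 1 else 1
    candidates[flow] = (seq, run)
    return run >= CONFIRM_COUNT

print(looks_like_rtp(bytes([0x80, 96])))          # True: V=2, PT=96
print([confirm("cam1", s) for s in (10, 11, 12)])  # [False, False, True]
```

A flow whose "sequence numbers" are random bytes almost never produces a consecutive run, which is why the second stage filters false positives so well.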

Moving on…

Realizing that not all the information arriving in the packets is needed to measure quality and identify streams, we decided to offload all the highload, time-critical work of receiving packets and extracting RTP fields to Bercut-MX, that is, to the FPGA. It "finds" a video stream, parses the packet, keeps only the required fields, and sends them to an ordinary server in a UDP tunnel. The server takes the measurements for each camera and saves the results to the database.

As a result, the server handles not 50-60 Gbit/s but at most 5% of it (roughly the share of the extracted headers in a packet of average size). In other words, with 55 Gbit/s at the input of the whole system, no more than 3 Gbit/s reaches the server!

As a result, we got the following architecture:

And we got the first result in this configuration two weeks after the initial terms of reference were set!

What does the server end up doing?

So what does the server do in our architecture? Its tasks:

  • listen on a UDP socket and read the packed header fields from it;
  • parse the incoming packets and extract the RTP header fields along with the camera identifiers;
  • correlate the received fields with those received before, and determine whether packets were lost, whether packets were re-sent, whether the order of arrival changed, what the variation of the packet transit delay (jitter) was, etc.;
  • record the measured data in the database with time references;
  • analyze the database and generate reports, send traps about critical events (high packet loss, loss of packets from some camera, etc.).

Although the total traffic at the server input is about 3 Gbit/s, the server copes even without any DPDK - we simply work through a Linux socket (after increasing the socket buffer size, of course). Moreover, it will be possible to connect new links and MXs, because a performance margin remains.

This is what the top of the server looks like (this is the top of only one lxc container, reports are generated in another):

It can be seen that the entire load of calculating quality parameters and accounting statistics is evenly distributed over four processes. We achieved this distribution by hashing in the FPGA: a hash function is computed over the IP, and the low bits of the resulting hash determine the number of the UDP port to which the statistics will go. Accordingly, each process listening on its own port receives approximately the same amount of traffic.
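
The idea can be sketched in Python. The hash function, base port, and process count below are illustrative, not the ones used in the FPGA:

```python
import ipaddress
import zlib
from collections import Counter

BASE_PORT = 5000   # assumed base UDP port
N_PROCS = 4        # four worker processes, as in the top screenshot

def port_for(ip: str) -> int:
    """Hash the camera IP; the low bits of the hash select the UDP port."""
    h = zlib.crc32(ipaddress.ip_address(ip).packed)
    return BASE_PORT + (h & (N_PROCS - 1))

# simulate 10,000 camera IPs and count how many land on each port
load = Counter(port_for(f"10.1.{i // 256}.{i % 256}") for i in range(10000))
print(sorted(load.items()))  # roughly even split across the 4 ports
```

Since each process binds to exactly one port, the kernel does the demultiplexing and no inter-process coordination is needed.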

Pros and cons

It's time to brag and admit the shortcomings of the solution.

I'll start with the pros:

  • no loss at the junction with 10G links. Since the FPGA takes all the “blow”, we can be sure that every packet will be analyzed;
  • to monitor 55,000 cameras (or more), only one server with one 10G card is required. We currently use servers based on two Xeon CPUs with four 2400 MHz cores each - enough, with a margin: reports are generated in parallel with the collection of information;
  • monitoring of 8 "dozens" (10G links) fits into only 2-3 units: there is not always a lot of space and power in the rack for the monitoring system;
  • when connecting links from MXs through the switch, you can add new links without stopping monitoring, because you don’t need to insert any boards into the server and you don’t need to turn it off for this;
  • the server is not overloaded with data, it receives only what is needed;
  • headers from MX come in a jumbo Ethernet packet, which means the processor will not be choked with interrupts (besides, we do not forget about interrupt coalescing).

In fairness, I will consider the disadvantages:

  • due to heavy optimization for a specific task, adding support for new fields or protocols requires changes to the FPGA code. This takes more time than doing the same on a CPU - in development, in testing, and during deployment;
  • video information is not analyzed at all. The camera can shoot an icicle hanging in front of it, or be turned the wrong way. This fact will go unnoticed. Of course, we have provided the ability to record video from the selected camera, but the operator cannot go through all 55,000 cameras!
  • a server and FPGA-powered devices are more expensive than just one or two servers;)

Summary

In the end, we got a hardware and software complex in which we control both the part that parses packets on the interfaces and the part that keeps statistics. Full control over all nodes of the system literally saved us when the cameras started switching to RTSP/TCP interleaved mode: in this case the RTP header is no longer located at a fixed offset in the packet - it can be anywhere, even straddling the boundary of two packets (the first half in one, the second in the other). Accordingly, the algorithm for obtaining the RTP header and its fields underwent fundamental changes: we had to do TCP reassembly on the server for all 50,000 connections - hence the rather high load in the top.
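
The interleaved framing itself is simple (rfc2326, section 10.12: a '$' byte, a channel id, and a 16-bit big-endian length precede each RTP packet); the difficulty is that frames fall anywhere in the TCP byte stream. A minimal reassembly sketch for a single connection, keeping an incomplete trailing frame for the next segment:

```python
import struct

def extract_frames(buf: bytes):
    """Return (complete frames, unconsumed tail) from an interleaved stream."""
    frames = []
    while len(buf) >= 4 and buf[0] == 0x24:           # 0x24 == '$'
        channel, length = buf[1], struct.unpack(">H", buf[2:4])[0]
        if len(buf) < 4 + length:                      # split across segments
            break
        frames.append((channel, buf[4:4 + length]))
        buf = buf[4 + length:]
    return frames, buf

stream = b"$\x00\x00\x03abc$\x02\x00\x02x"  # second frame is incomplete
frames, tail = extract_frames(stream)
print(frames)  # [(0, b'abc')]
print(tail)    # the partial frame is kept until the next TCP segment
```

Multiplied by tens of thousands of connections, this per-connection buffering is exactly the extra server load mentioned above.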

We had never worked in the field of high-load applications before, but our FPGA skills allowed us to solve the problem, and it turned out pretty well. There is even a margin: another 20-30 thousand streams can be connected to the system on top of the 55,000 cameras.

I left the tuning of Linux subsystems (distributing queues across interrupts, increasing receive buffers, pinning cores to specific processes, etc.) outside the scope of the article, because this topic is already very well covered.

I have described far from everything, and we stepped on plenty of rakes along the way, so feel free to ask questions :)

Many thanks to everyone who read to the end!

The rapid growth of the Internet places new demands on the speed and volume of data transfer. To satisfy these demands, it is not enough to increase network capacity alone; reasonable and effective methods of traffic management and congestion control on transmission lines are needed.

In real-time applications, the sender generates a stream of data at a constant rate, and the receiver (or receivers) must provide this data to the application at the same rate. Such applications include, for example, audio and video conferencing, live video, medical remote diagnostics, computer telephony, distributed interactive simulation, games, real-time monitoring, and others.

The most widely used transport layer protocol is TCP. Although TCP can support a wide variety of distributed applications, it is not suitable for real-time applications.

This task is intended to be solved by the real-time transport protocol RTP (Real-Time Transport Protocol), which is designed to deliver data to one or more destinations with a delay within specified limits, i.e. so that the data can be played back in real time.

Principles of constructing the RTP protocol

RTP itself provides no mechanism for guaranteeing packet delivery, transmission validity, or connection reliability; all these functions are left to the underlying transport protocol. RTP runs on top of UDP and can support real-time data transfer between multiple participants in an RTP session.

Note

For each RTP participant, a session is defined by a pair of packet destination transport addresses (one network address - IP and a pair of ports: RTP and RTCP).

RTP packets contain the following fields: sender ID indicating which party is generating the data, time stamps of the packet generation so that the data can be replayed by the receiving party at the correct intervals, transmission order information, and information about the nature of the packet's contents, such as video encoding type (MPEG, Indeo, etc.). The availability of such information makes it possible to estimate the value of the initial delay and the size of the transmission buffer.

Note

In a typical real-time environment, the sender generates packets at a constant rate. They are sent at regular intervals, traverse the network, and are received by a receiver that plays back the data in real time as it arrives. However, due to changes in latency as packets cross the network, they may arrive at irregular intervals. To compensate for this effect, incoming packets are buffered, held for a while, and then delivered at a constant rate to the software that generates the output. Therefore, for a real-time protocol to function, each packet must contain a timestamp, so that the recipient can reproduce the incoming data at the same rate as the sender.

Since RTP defines (and regulates) the payload format of the transmitted data, the concept of synchronization is directly related to it; this is partly the responsibility of the RTP translation mechanism, the mixer. Upon receiving streams of RTP packets from one or more sources, the mixer combines them and sends a new stream of RTP packets to one or more recipients. The mixer can simply combine the data, or it can change its format, for example when combining several sound sources. Suppose a new system wants to participate in the session, but its link to the network does not have sufficient capacity to carry all the RTP streams; the mixer then receives all these streams, merges them into one, and passes the combined stream to the new session member. When receiving multiple streams, the mixer simply adds the PCM values. The RTP header generated by the mixer includes the identifiers of the senders whose data is present in the packet.

A simpler relay device, the translator, creates one outgoing RTP packet for each incoming RTP packet. This mechanism may change the format of the data in the packet or use a different set of low-level protocols to transfer data from one domain to another. For example, a potential recipient may not be able to process the high-speed video signal used by the other participants in the session; the translator then converts the video to a lower-quality format that requires a lower data rate.

Work control methods

The RTP protocol is used only to transfer user data - usually multicast - to all participants in the session. Together with RTP, the RTCP (Real-time Transport Control Protocol) protocol works, the main task of which is to provide control over the transmission of RTP. RTCP uses the same basic transport protocol as RTP (usually UDP), but a different port number.

RTCP performs several functions:

  1. Ensuring and monitoring quality of service and providing feedback in case of congestion. Since RTCP packets are multicast, all participants in the session can evaluate how well the other participants transmit and receive. Sender reports allow recipients to estimate the data rate and transmission quality; receiver reports contain information about the problems they experience, including packet loss and excessive jitter. Receiver feedback is also important for diagnosing propagation errors: by analyzing the reports of all session participants, a network administrator can determine whether a problem concerns one participant or is general in nature. If the sending application concludes that the problem is typical for the system as a whole, for example due to the failure of one of the communication channels, it can increase the degree of data compression at the expense of quality, or refuse to transmit video altogether - this allows data to be transferred even over a low-capacity connection.
  2. Sender identification. RTCP packets contain a standard textual description of the sender. They provide more information about the sender of data packets than a randomly selected sync source ID. In addition, they help the user to identify threads related to different sessions.
  3. Session sizing and scaling. To ensure quality of service and feedback for congestion control, as well as for sender identification, all participants periodically send RTCP packets. The frequency of these packets decreases as the number of participants grows. With a small number of participants, one RTCP packet is sent at most every 5 seconds. RFC-1889 describes an algorithm in which participants limit the rate of RTCP packets based on the total number of participants; the goal is to keep RTCP traffic below 5% of the total session traffic.
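
A simplified sketch of the scaling rule from item 3. The constants are illustrative, and the real RFC algorithm also randomizes each interval; only the proportionality is shown here:

```python
RTCP_FRACTION = 0.05   # RTCP may use at most 5% of session bandwidth
MIN_INTERVAL = 5.0     # seconds; the minimum reporting interval

def rtcp_interval(members, avg_rtcp_size_bytes, session_bw_bps):
    """Seconds between RTCP packets so that total RTCP traffic stays <= 5%."""
    rtcp_bw = RTCP_FRACTION * session_bw_bps / 8   # bytes per second for RTCP
    return max(MIN_INTERVAL, members * avg_rtcp_size_bytes / rtcp_bw)

# with few participants the 5-second minimum applies...
print(rtcp_interval(4, 120, 256_000))
# ...with many, the interval stretches to keep control traffic below 5%
print(round(rtcp_interval(1000, 120, 256_000), 1))
```

This is why RTCP scales to large multicast sessions: each participant reports less often as the session grows, so the aggregate control traffic stays bounded.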

RTP protocol header format

RTP is a stream oriented protocol. The header of the RTP packet has been designed with the needs of real-time transmission in mind. It contains information about the order of the packets so that the data stream is correctly assembled at the receiving end, and a timestamp for correct frame interleaving during playback and for synchronizing multiple data streams, such as video and audio.

Each RTP packet has a basic header and possibly additional application-specific fields.

Using TCP as the transport protocol for these applications is not possible for several reasons:

  1. This protocol only allows a connection to be established between two endpoints and is therefore not suitable for multicasting.
  2. TCP provides for retransmission of lost segments that arrive when the real-time application is no longer waiting for them.
  3. TCP does not have a convenient mechanism for associating timing information with segments, an additional requirement for real-time applications.

Another widely used transport layer protocol, UDP, does not have some of TCP's limitations, but it does not provide the critical timing information either.

While each real-time application may have its own mechanisms to support real-time transmission, they share many common features that make defining a single protocol highly desirable.

The real-time transport protocol RTP (Real-time Transport Protocol) is designed to solve this problem: to deliver data to one or more recipients with a delay within specified limits, i.e. so that the data can be reproduced in real time.

Fig. 1 presents the fixed RTP header, which contains a number of fields identifying such items as the packet format, sequence number, sources, boundaries, and payload type. The fixed header may be followed by other fields containing additional information about the data.

[Figure: bit layout of the fixed RTP header - V, P, X, CC, M, PT, sequence number; timestamp; Synchronization Source (SSRC) Identifier; Contributing Source (CSRC) Identifiers]

Fig. 1. Fixed RTP header.

V (2 bits). Version field. The current version is the second.
P (1 bit). Padding field. This field signals the presence of padding octets at the end of the payload. Padding is applied when an application requires the payload size to be a multiple of, for example, 32 bits; in this case the last octet indicates the number of padding octets.
X (1 bit). Header extension field. When this field is set, the main header is followed by an additional header used in experimental RTP extensions.
CC (4 bits). CSRC count field. This field contains the number of contributing-source identifiers carried in the packet, the identifiers themselves following the main header.
M (1 bit). Marker field. The meaning of the marker bit depends on the payload type. It is typically used to indicate boundaries in the data stream: for video it marks the end of a frame, for voice the start of speech after a period of silence.
PT (7 bits). Payload type field. This field identifies the payload type and data format, including compression and encryption. In the steady state the sender uses only one payload type per session, but it can change it in response to changing conditions if signaled by the Real-Time Transport Control Protocol.
Sequence number (16 bits). Each source starts numbering packets from an arbitrary number and then increments it by one with each RTP data packet sent. This allows packet loss to be detected and the order of packets with the same timestamp to be determined. Several consecutive packets may have the same timestamp if they are logically generated at the same instant, such as packets belonging to the same video frame.
Timestamp (32 bits). This field contains the moment at which the first payload octet was created. The units in which the time is specified depend on the payload type. The value is determined by the sender's local clock.
Synchronization Source (SSRC) Identifier (32 bits). A randomly generated number that uniquely identifies the source within a session, independent of the network address. This number plays an important role in processing the incoming data from one source.
Contributing Source (CSRC) Identifiers (32 bits each). The list of sources "mixed" into the main stream, for example by a mixer. The mixer inserts the whole list of SSRC identifiers of the sources that participated in constructing this RTP packet. The list holds from 0 to 15 elements; if there are more than 15 contributors, the first 15 are listed. An example is an audio conference whose RTP packets collect the speech of all participants, each with its own SSRC - these form the CSRC list, while the whole conference has a common SSRC.
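
The field layout described above can be parsed in a few lines. A minimal sketch (field names follow the description; the sample values are arbitrary):

```python
import struct

def parse_rtp(data: bytes) -> dict:
    """Parse the fixed 12-byte RTP header plus the CSRC list, if any."""
    b0, b1, seq, ts, ssrc = struct.unpack(">BBHII", data[:12])
    cc = b0 & 0x0F
    csrc = struct.unpack(f">{cc}I", data[12:12 + 4 * cc]) if cc else ()
    return {
        "version": b0 >> 6, "padding": (b0 >> 5) & 1,
        "extension": (b0 >> 4) & 1, "cc": cc,
        "marker": b1 >> 7, "payload_type": b1 & 0x7F,
        "seq": seq, "timestamp": ts, "ssrc": ssrc, "csrc": csrc,
    }

# V=2, no padding/extension/CSRC, marker set, PT=96
pkt = struct.pack(">BBHII", 0x80, 0xE0, 1234, 90000, 0x1DEADBEE)
hdr = parse_rtp(pkt)
print(hdr["version"], hdr["marker"], hdr["payload_type"], hdr["seq"])
# 2 1 96 1234
```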

The RTCP protocol, like any control protocol, is much more complex both in structure and in the functions it performs (compare, for example, the IP and TCP protocols). Although RTCP is built around RTP, it contains many additional fields with which it implements its functions.

Resource Reservation Protocol - RSVP

To give priority to delay-sensitive data, as opposed to traditional data for which delays are not so critical, the Resource Reservation Protocol (RSVP), currently under consideration by the Internet Engineering Task Force (IETF), is called upon. RSVP allows end systems to reserve network resources in order to obtain the required quality of service, in particular for real-time traffic carried over RTP. RSVP primarily concerns routers, although applications on end nodes must also know how to use RSVP in order to reserve the required bandwidth for a given service class or priority level.

RTP, together with the other standards described, makes it possible to successfully transmit video and audio over conventional IP networks. RTP/RTCP/RSVP is a standardized solution for real-time data networks. Its only drawback is that it is intended only for IP networks; however, this limitation is temporary, since networks will develop in this direction one way or another. This solution promises to solve the problem of transmitting delay-sensitive data over the Internet.

Literature

A description of the RTP protocol can be found in RFC-1889.


The requirement to support several types of traffic with different quality-of-service requirements on top of the TCP/IP protocol stack is now very relevant. This problem is addressed by the Real-Time Transport Protocol (RTP), an IETF standard for real-time transmission of data such as voice or video over a network that does not guarantee quality of service.

The RTP protocol is designed to deliver data to one or more recipients with a delay not exceeding a specified value. To do this, the protocol header carries the timestamps necessary for successful reconstruction of the audio and video information, as well as data on the method of encoding the information.

Although the TCP protocol guarantees delivery of the transmitted data in the correct sequence, its traffic is not uniform; that is, unpredictable delays occur during datagram delivery. Since the RTP protocol is aware of the contents of the datagrams and has data-loss detection mechanisms, it can keep latency at an acceptable level.

IP protocol address scheme

The internetwork addressing scheme used in the IP protocol is described in RFC 990 and RFC 997. It is based on the separation of addressing networks from addressing devices in these networks. This scheme facilitates routing. In this case, addresses must be assigned in an orderly (consecutive) manner in order to make routing more efficient.

When using the TCP/IP protocol stack, end devices on the network receive unique addresses. Such devices can be personal computers, media servers, routers, etc. However, devices that have multiple physical ports, such as routers, must have a unique address on each port. From the addressing scheme, and from the fact that some devices on the network can have several addresses, we can conclude that the scheme describes not the device itself but a specific connection of the device to the network. This leads to a number of inconveniences. One is the need to change a device's address when moving it to another network. Another is that, to work with a device that has several connections in a distributed network, you need to know all the addresses that identify these connections.

So, for each device in IP networks we can speak of addresses at three levels:

  • The physical address of the device (more precisely, of a specific interface). For devices in Ethernet networks this is the MAC address of the network card or router port. These addresses are assigned by the hardware manufacturers. The physical address is six bytes: the upper three bytes identify the manufacturer, the lower three bytes are assigned by the manufacturer;



  • An IP address, consisting of four bytes. This address is used at the network layer of the OSI reference model;

  • A symbolic identifier, the name. This identifier can be assigned arbitrarily by the administrator.

When the IP protocol was standardized in September 1981, its specification required that every device connected to the network have a unique 32-bit address. This address is divided into two parts. The first part of the address identifies the network where the device is located. The second part uniquely identifies the device itself within the network. This scheme leads to a two-level address hierarchy (Figure 6.23).

The network-number field of the address is now called the network prefix, because it identifies the network. All workstations on a network share the same network prefix but must have unique device numbers. Two workstations on different networks must have different network prefixes but may have the same device number.

For flexibility in addressing computer networks, the designers of the protocol determined that the IP address space should be divided into three classes - A, B, and C. Knowing the class, you know where the boundary between the network prefix and the device number lies in the 32-bit address. Fig. 6.24 shows the address formats of these basic classes.

One of the main advantages of using classes is that you can determine from the class of the address where the boundary between the network prefix and the device number is. For example, if the most significant two bits of the address are 10, then the split point is between bits 15 and 16.

The disadvantage of this method is the need to change the network address when connecting additional devices. For example, if the total number of devices in a class C network exceeds 254, its addresses will have to be replaced with class B addresses. Changing network addresses requires additional effort from the administrator to debug the network. Network administrators cannot make a smooth transition to a new address class, since the classes are strictly separated: you have to prohibit the use of an entire group of network addresses, change all device addresses in that group at the same time, and only then allow their use on the network again. In addition, the introduction of address classes significantly reduces the theoretically possible number of individual addresses. In the current version of the IP protocol (version 4) the total number of addresses can be 2^32 (4,294,967,296), since the protocol provides 32 bits for the address. Naturally, using some bits for service purposes reduces the available number of individual addresses.

Class A is for large networks. Each class A address has an 8-bit network prefix with the most significant bit set to 0 and the next seven bits used for the network number. The remaining 24 bits are used for the device number. At the moment, all class A addresses have already been allocated. Class A networks are also referred to as "/8" because class A addresses have an 8-bit network prefix.

The maximum number of class A networks is 126 (2^7 - 2; the two addresses consisting of all zeros and all ones are subtracted). Each network of this class supports up to 16,777,214 (2^24 - 2) devices. Since a class A address block can hold up to a maximum of 2^31 (2,147,483,648) individual addresses, and IP version 4 can support a maximum of 2^32 (4,294,967,296) addresses, class A occupies 50% of the total IP address space.

Class B is intended for medium-sized networks. Each class B address has a 16-bit network prefix where the two most significant bits are 10 and the next 14 bits are used for the network number. 16 bits are allocated for the device number. Class B networks are also referred to as "/16" because class B addresses have a 16-bit network prefix.

The maximum number of class B networks is 16,382 (2^14 - 2). Each network of this class supports up to 65,534 (2^16 - 2) devices. Since an entire class B address block can contain up to a maximum of 2^30 (1,073,741,824) individual addresses, it occupies 25% of the total IP address space.

Class C addresses are used in networks with a small number of devices. Each class C network has a 24-bit network prefix, in which the three most significant bits are 110, and the next 21 bits are used for the network number. The remaining 8 bits are allocated for device numbers. Class C networks are also referred to as "/24" because class C addresses have a 24-bit network prefix.

The maximum number of class C networks is 2,097,152 (2^21). Each network of this class supports up to 254 (2^8 - 2) devices. Class C occupies 12.5% of the total IP address space.
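The class rules above can be sketched in a few lines of Python (the function and dictionary names here are illustrative, not from any library): the class of an address follows directly from the value of the first octet, and the network/host counts follow from the prefix lengths just derived.

```python
def address_class(ip: str) -> str:
    """Return the class (A-E) of a dotted-decimal IPv4 address,
    determined by the most significant bits of the first octet."""
    first = int(ip.split(".")[0])
    if first < 128:      # 0xxxxxxx -> class A (/8)
        return "A"
    if first < 192:      # 10xxxxxx -> class B (/16)
        return "B"
    if first < 224:      # 110xxxxx -> class C (/24)
        return "C"
    if first < 240:      # 1110xxxx -> class D (multicast)
        return "D"
    return "E"           # 1111xxxx -> reserved

# Networks and devices per class, as computed in the text above:
CLASS_INFO = {
    "A": {"prefix": 8,  "networks": 2**7 - 2,  "hosts": 2**24 - 2},
    "B": {"prefix": 16, "networks": 2**14 - 2, "hosts": 2**16 - 2},
    "C": {"prefix": 24, "networks": 2**21,     "hosts": 2**8 - 2},
}

print(address_class("185.100.3.4"))   # B
print(CLASS_INFO["A"]["hosts"])       # 16777214
```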

Table 6.9 summarizes our analysis of network classes.

Table 6.9. Network classes

Class | Network prefix | Maximum networks | Devices per network | Share of address space
A     | 8 bits         | 126              | 16,777,214          | 50%
B     | 16 bits        | 16,382           | 65,534              | 25%
C     | 24 bits        | 2,097,152        | 254                 | 12.5%

In addition to these three classes of addresses, there are two more classes. In class D, the most significant four bits are 1110. This class is used for multicasting. In class E, the upper four bits are 1111. It is reserved for experimentation.

For ease of reading addresses in technical literature, application programs, etc., IP addresses are represented as four decimal numbers separated by dots. Each of these numbers corresponds to one octet (8 bits) of the IP address. This format is called dotted decimal (Decimal-Point Notation) or dotted decimal notation (Figure 6.25).

Table 6.10 lists the ranges of decimal values for the three classes of addresses; in the table, xxx denotes an arbitrary field.

Table 6.10. Address value ranges

Some IP addresses cannot be assigned to devices on the network (Table 6.11).

As shown in this table, reserved IP addresses with all bits set to zero refer either to the device itself or to the current network, while IP addresses with all bits set to 1 are used for broadcasting. To refer to an entire IP network as a whole, an address is used whose device-number bits are all set to 0. The class A network address 127.0.0.0 is reserved for loopback and was introduced to test communication between processes on the same machine. When an application uses a loopback address, the TCP/IP stack returns the data to the application without sending anything to the network. This address can also be used for interaction between separate processes within one machine. Therefore, in IP networks, it is forbidden to assign addresses starting with 127 to devices.

In addition to directed data transmission to a specific workstation, broadcast transmission is actively used, in which all stations in the current or specified network receive information. There are two types of broadcasts in the IP protocol: directed and limited.

A directed broadcast allows a device on a remote network to send a datagram to all devices on a specified network. A datagram with a directed broadcast address can pass through routers, but it will be delivered only to the devices on that network, not to all devices everywhere. In a directed broadcast, the destination address consists of a specific network number and a device number whose bits are all 0 or all 1. For example, the addresses 185.100.255.255 and 185.100.0.0 would be treated as directed broadcast addresses for the class B network 185.100.xxx.xxx. From an addressing point of view, the main disadvantage of directed broadcast is that knowledge of the target network number is required.

The second form of broadcast, called limited broadcast, is a broadcast within the current network (the network where the sending device resides). A datagram with a limited broadcast address never passes through a router. In a limited broadcast, the network number and device number bits are all zeros or all ones: a datagram with the destination address 255.255.255.255 or 0.0.0.0 will be delivered to all devices on the network. Figure 6.26 shows networks connected by routers, and Table 6.12 lists the recipients of broadcast datagrams sent by workstation A.

The IP protocol supports three addressing methods: single (unicast), broadcast (broadcast) and group (multicast).

Table 6.12. Broadcast Datagram Receivers

In unicast addressing, datagrams are sent to a specific single device. This approach is simple to implement, but if a working group contains many stations, the throughput may be insufficient, since the same datagram is transmitted many times.

With broadcast addressing, an application sends a single datagram, which is delivered to all devices on the network. This approach is even simpler to implement, but if the broadcast traffic is not confined to the local network (and is, for example, forwarded to another network by routers), the wide-area network must have significant bandwidth. If the information is intended only for a small group of devices, this approach is wasteful.

In multicast, datagrams are delivered to a specific group of devices. At the same time (which is very important in distributed networks), no excess traffic is generated. Multicast and unicast datagrams differ in the address: in the header of a multicast IP datagram, instead of a class A, B, or C address, there is a class D address, that is, a group address.

A group address is assigned to some recipient devices or, in other words, to a group. The sender writes this multicast address in the header of the IP datagram. The datagram will be delivered to all members of the group. The first four bits of the class D address are 1110. The rest of the address (28 bits) is occupied by the group identifier (Figure 6.27).

In dotted decimal format, group addresses range from 224.0.0.0 to 239.255.255.255. Table 6.13 shows the class D address allocation scheme.

Table 6.13. Class D address allocation

As can be seen from Table 6.13, the first 256 addresses are reserved. In particular, this range is reserved for routing protocols and other low-level protocols. Table 6.14 contains some reserved class D IP addresses.

Above this range lies a large group of addresses allocated for applications running on the Internet. The topmost address range (approximately 16 million addresses) is intended for administrative purposes on local networks. Class D group addresses are centrally managed and registered by a special organization called IANA.

Multicast can be implemented at two levels of the OSI model - the data-link layer and the network layer. Link-layer protocols such as Ethernet and FDDI can support unicast, broadcast, and multicast addressing. Link-layer multicast is especially effective if it is supported in hardware on the NIC.

To support IP multicast, IANA has allocated a block of Ethernet multicast addresses starting with 01-00-5E (in hexadecimal notation). A multicast IP address can be translated to an address in this block. The principle of translation is quite simple: the lower 23 bits of the IP group identifier are copied into the lower 23 bits of the Ethernet address. Note that this scheme maps up to 32 different IP groups to the same Ethernet address, since the next 5 bits of the IP group identifier are ignored.
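The translation rule just described fits in a few lines (a sketch; the function name is ours, not a library API): mask out the lower 23 bits of the group address and OR them into the 01-00-5E block.

```python
import ipaddress

def multicast_mac(ip: str) -> str:
    """Map an IPv4 multicast address to its Ethernet address:
    the lower 23 bits of the group identifier are copied into the
    lower 23 bits of the 01-00-5E block; the next 5 bits are lost."""
    low23 = int(ipaddress.IPv4Address(ip)) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    return "-".join(f"{(mac >> s) & 0xFF:02X}" for s in range(40, -8, -8))

print(multicast_mac("224.0.0.5"))   # 01-00-5E-00-00-05

# Because 5 bits of the group ID are ignored, different IP groups
# can collide on the same Ethernet address:
print(multicast_mac("224.0.0.5") == multicast_mac("225.0.0.5"))  # True
```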

Table 6.14. Reserved class D addresses

Address Purpose
224.0.0.1 All devices on the subnet
224.0.0.2 All routers on the subnet
224.0.0.4 All DVMRP Routers
224.0.0.5 All MOSPF Routers
224.0.0.9 RIP version 2
224.0.1.7 Audio news
224.0.1.11 IETF audio
224.0.1.12 IETF video

If the sender and receiver belong to the same physical network, the process of sending and receiving multicast frames at the link layer is quite simple. The sender specifies the IP address of the group of recipients, and the NIC translates this address into the corresponding group Ethernet address and sends the frame.

If the sender and receiver are on different subnets connected by routers, datagram delivery is more complicated. In this case, the routers must support one of the multicast routing protocols (DVMRP, MOSPF, PIM - see below). Using these protocols, the routers build a delivery tree and correctly forward the multicast traffic. In addition, each router must support the Internet Group Management Protocol (IGMP) to determine the presence of group members on directly connected subnets (Figure 6.28).

If one day you have to quickly figure out what VoIP (voice over IP) is and what all those wild abbreviations mean, I hope this guide will help. Note right away that the configuration of supplementary telephony services (such as call transfer, voice mail, conference calls, etc.) is not covered here.

So, what we will deal with under the cut:

  1. Basic concepts of telephony: types of devices, connection schemes
  2. Bundle of SIP/SDP/RTP protocols: how it works
  3. How information about pressed buttons is transmitted
  4. How voice and fax transmission works
  5. Digital signal processing and audio quality assurance in IP telephony

1. Basic concepts of telephony

In general, the scheme for connecting a local subscriber to a telephone provider via a regular telephone line is as follows:



On the side of the provider (PBX), a telephone module with an FXS (Foreign eXchange Subscriber) port is installed. A telephone or fax machine with an FXO (Foreign eXchange Office) port and a dialer module is installed at home or in the office.

In appearance, FXS and FXO ports are indistinguishable: both are regular 6-pin RJ11 connectors. But with a voltmeter it is very easy to tell them apart - there is always some voltage on an FXS port: 48/60 V when the handset is on-hook, or 6-15 V during a call. On an FXO port that is not connected to a line, the voltage is always 0.

To transfer data over a telephone line, additional logic is needed: on the provider side it can be implemented with a SLIC (Subscriber Line Interface Circuit) module, and on the subscriber side with a DAA (Direct Access Arrangement) module.

Wireless DECT (Digital European Cordless Telecommunications) phones are quite popular now. Internally they are similar to ordinary telephones: they also have an FXO port and a dialer module, but they add a wireless module for communication between the base station and handsets at 1.9 GHz.

Subscribers connect to the PSTN (Public Switched Telephone Network). A PSTN can be organized using different technologies: ISDN, optical links, POTS, Ethernet. A special case of PSTN, using a plain analog copper line, is POTS (Plain Old Telephone Service) - the good old telephone system.

With the development of the Internet, telephone communications moved to a new level. Stationary telephones are used less and less, mainly for official needs. DECT phones are a little more convenient, but limited to the perimeter of the house. GSM phones are even more convenient, but are limited by the borders of the country (roaming is expensive). But IP phones - also known as softphones - have no restrictions other than access to the Internet.

Skype is the most famous example of a softphone. It can do a lot, but it has two important drawbacks: a closed architecture, and wiretapping by who knows which authorities. Because of the first, you cannot create your own small telephone network. And because of the second, it is unpleasant to be spied on, especially in personal and commercial conversations.

Fortunately, there are open protocols for creating your own communication networks with all the goodies - SIP and H.323. There are somewhat more softphones for the SIP protocol than for H.323, which can be explained by its relative simplicity and flexibility. But sometimes that flexibility can throw a wrench in the works. Both SIP and H.323 use the RTP protocol to transfer media data.

Let's consider the basic principles of the SIP protocol to understand how a connection between two subscribers is established.

2. Description of the bundle of SIP/SDP/RTP protocols

SIP (Session Initiation Protocol) is a protocol for establishing a session (not only a telephone one); it is a text protocol that runs over UDP. SIP over TCP is also possible, but such cases are rare.

SDP (Session Description Protocol) is a protocol for negotiating the type of transmitted data (for sound and video these are codecs and their formats, for faxes - transmission speed and error correction) and their destination addresses (IP and port). It is also a text protocol. SDP parameters are sent in the body of SIP packets.

RTP (Real-time Transport Protocol) is an audio/video data transfer protocol. It is a binary protocol over UDP.

General structure of SIP packets:

  • Start-Line: a field indicating the SIP method (command) in a request, or the result of executing a SIP method in a response.
  • Headers: additional information for the Start-Line, formatted as lines containing ATTRIBUTE: VALUE pairs.
  • Body: binary or text data. Typically used to carry SDP parameters or messages.

Here is an example of two SIP packets for one common call setup procedure:

On the left is the content of the SIP INVITE packet, on the right is the response to it - SIP 200 OK.

The main fields are framed:

  • Method/Request-URI contains the SIP method and URI. In the example, the session is established - the INVITE method, the subscriber is called [email protected]
  • Status-Code - the response code for the previous SIP command. In this example, the command completed successfully - code 200, i.e. subscriber 555 picked up the phone.
  • Via - address where subscriber 777 is waiting for an answer. For the 200 OK message, this field is copied from the INVITE message.
  • From/To - display name and address of the sender and recipient of the message. For the 200 OK message, this field is copied from the INVITE message.
  • Cseq contains the sequence number of the command and the name of the method to which the given message refers. For the 200 OK message, this field is copied from the INVITE message.
  • Content-Type - the type of data that is transmitted in the Body block, in this case SDP data.
  • Connection Information - IP address to which the second subscriber needs to send RTP packets (or UDPTL packets in case of fax transmission via T.38).
  • Media Description - the port to which the second subscriber must transmit the specified data. In this case, these are audio (audio RTP/AVP) and a list of supported data types - PCMU, PCMA, GSM codecs and DTMF signals.
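The three-part packet structure described above (start-line, headers, body) can be sketched with a minimal parser. This is an illustration only - the message text, addresses, and function name are made up for the example, and a real SIP stack handles far more (header folding, compact forms, multi-value headers).

```python
def parse_sip(raw: str):
    """Split a SIP message into start-line, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")   # split on first colon only
        headers[name.strip()] = value.strip()
    return start_line, headers, body

# A hypothetical INVITE with an SDP body (addresses are illustrative):
msg = ("INVITE sip:555@192.168.0.1 SIP/2.0\r\n"
       "Via: SIP/2.0/UDP 192.168.0.2:5060\r\n"
       "From: <sip:777@192.168.0.2>\r\n"
       "To: <sip:555@192.168.0.1>\r\n"
       "CSeq: 1 INVITE\r\n"
       "Content-Type: application/sdp\r\n"
       "\r\n"
       "v=0\r\n")

start, hdrs, body = parse_sip(msg)
print(start)          # INVITE sip:555@192.168.0.1 SIP/2.0
print(hdrs["CSeq"])   # 1 INVITE
```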

An SDP message consists of lines containing FIELD=VALUE pairs. The main fields include:

  • o - Origin, the session organizer's name and the session ID.
  • c - Connection Information, described earlier.
  • m - Media Description, described earlier.
  • a - media attributes that specify the format of the transmitted data. For example, they indicate the sound direction - receive or send (sendrecv) - and, for codecs, the sampling rate and the payload-type binding (rtpmap).
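Since every SDP line is a FIELD=VALUE pair, pulling out the fields listed above takes only a dictionary. The SDP body below is illustrative (the address and port are made up), not taken from the screenshots.

```python
# A hypothetical SDP body, as it might appear in a SIP INVITE:
sdp = """v=0
o=user1 2890844526 2890844526 IN IP4 192.168.0.2
c=IN IP4 192.168.0.2
m=audio 49170 RTP/AVP 0 8 3 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=sendrecv"""

fields = {}
for line in sdp.splitlines():
    key, _, value = line.partition("=")
    fields.setdefault(key, []).append(value)   # fields may repeat (a=)

# Connection Information: where the peer should send RTP packets
print(fields["c"][0])    # IN IP4 192.168.0.2
# Media Description: the port and the payload-type list
media = fields["m"][0].split()
print(media[1])          # 49170
```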

RTP packets contain audio/video data encoded in a specific format. This format is specified in the PT (payload type) field. A table mapping values of this field to specific formats is given in the Wikipedia article "RTP audio video profile".

RTP packets also contain a unique SSRC identifier (determines the source of the RTP stream) and a timestamp (timestamp, used to play audio or video evenly).
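The fields just mentioned - PT, SSRC, and the timestamp - all live in the fixed 12-byte RTP header defined in RFC 3550, which can be unpacked with the struct module (the helper name and example values are ours):

```python
import struct

def parse_rtp_header(data: bytes) -> dict:
    """Unpack the fixed 12-byte RTP header (RFC 3550)."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": b0 >> 6,
        "marker": b1 >> 7,
        "payload_type": b1 & 0x7F,   # PT: which codec/format is carried
        "sequence": seq,
        "timestamp": ts,             # used for even playout
        "ssrc": ssrc,                # identifies the stream's source
    }

# Example packet header: version 2, PT 0 (PCMU), seq 1,
# timestamp 160, SSRC 0x12345678
pkt = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0x12345678)
print(parse_rtp_header(pkt)["payload_type"])   # 0
```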

An example of interaction between two SIP subscribers through a SIP server (Asterisk):

As soon as a SIP phone starts up, the first thing it does is register with a remote server (SIP Registrar) by sending it a SIP REGISTER message.


When calling a subscriber, a SIP INVITE message is sent, the body of which contains an SDP message containing the audio/video transmission parameters (which codecs are supported, which IP and port to send audio to, etc.).


When the remote subscriber picks up the phone, we receive a SIP 200 OK message, also with SDP parameters - but this time the remote subscriber's. Using the sent and received SDP parameters, both sides can set up an RTP audio/video session or a T.38 fax session.

If the received SDP parameters did not suit us, or the intermediate SIP server decided not to pass RTP traffic through itself, an SDP re-negotiation procedure, the so-called REINVITE, is performed. By the way, it is precisely because of this procedure that free SIP proxy servers have one drawback: if both subscribers are on the same local network and the proxy server is behind NAT, then after RTP traffic is redirected, neither subscriber will hear the other.


After the end of the conversation, the subscriber who hung up sends a SIP BYE message.

3. Transferring information about pressed buttons

Sometimes, after the session is established, during a call, access to additional services (VAS) is required - call hold, transfer, voice mail, etc. - which react to certain combinations of pressed buttons.

So, in a regular telephone line, there are two ways to dial a number:

  • Pulse - historically the first, was used mainly in phones with a rotary dialer. The dialing occurs due to the sequential closing and opening of the telephone line according to the dialed digit.
  • Tone - dialing with DTMF codes (Dual-Tone Multi-Frequency): each phone button has its own combination of two sinusoidal signals (tones). Using the Goertzel algorithm, it is quite easy to determine the pressed button.
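The Goertzel algorithm mentioned above measures the power of a single target frequency in a block of samples; to detect a DTMF digit you pick the strongest row and column tones. A minimal sketch (the function names are ours; the frequencies are the standard DTMF set):

```python
import math

def goertzel_power(samples, freq, rate=8000):
    """Power of one target frequency in a sample block (Goertzel)."""
    w = 2 * math.pi * freq / rate
    coeff = 2 * math.cos(w)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2**2 + s_prev**2 - coeff * s_prev * s_prev2

ROWS = [697, 770, 852, 941]          # low-group (row) frequencies, Hz
COLS = [1209, 1336, 1477, 1633]      # high-group (column) frequencies, Hz
KEYS = ["123A", "456B", "789C", "*0#D"]

def detect_key(samples, rate=8000):
    """Pick the strongest row and column tones and map them to a key."""
    r = max(ROWS, key=lambda f: goertzel_power(samples, f, rate))
    c = max(COLS, key=lambda f: goertzel_power(samples, f, rate))
    return KEYS[ROWS.index(r)][COLS.index(c)]

# Synthesize the "5" button (770 Hz + 1336 Hz) and detect it:
tone = [math.sin(2*math.pi*770*n/8000) + math.sin(2*math.pi*1336*n/8000)
        for n in range(205)]   # N=205 is the classic block size at 8 kHz
print(detect_key(tone))   # 5
```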

During a conversation, the pulse method is inconvenient for transmitting the pressed button. So, it takes approximately 1 second to transmit "0" (10 pulses of 100 ms each: 60 ms - line break, 40 ms - line close) plus 200 ms for a pause between digits. In addition, characteristic clicks will often be heard during pulse dialing. Therefore, in conventional telephony, only the tone mode for accessing VAS is used.

In VoIP telephony, information about pressed buttons can be transmitted in three ways:

  1. DTMF Inband - an audio tone is generated and transmitted inside the audio data (the current RTP channel); this is ordinary tone dialing.
  2. RFC2833 - a special telephone-event RTP packet is generated, which contains information about the pressed key, volume and duration. The number of the RTP format in which RFC2833 DTMF packets will be transmitted is specified in the body of the SDP message. For example: a=rtpmap:98 telephone-event/8000.
  3. SIP INFO - a SIP INFO packet is formed with information about the pressed key, volume and duration.
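The RFC2833 telephone-event payload from option 2 is only four bytes: the event code, an end-flag plus volume byte, and a 16-bit duration. A sketch of building it (the function name is ours; volume and duration values are illustrative):

```python
import struct

DTMF_EVENTS = "0123456789*#ABCD"   # event codes 0..15 per RFC 2833

def make_telephone_event(digit: str, volume=10, duration=800, end=False):
    """Build the 4-byte telephone-event payload: event code,
    end bit + volume (6 bits), duration in timestamp units."""
    event = DTMF_EVENTS.index(digit)
    e_r_vol = (0x80 if end else 0) | (volume & 0x3F)
    return struct.pack("!BBH", event, e_r_vol, duration)

payload = make_telephone_event("5", end=True)
print(payload.hex())   # 058a0320
```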

Transmitting DTMF inside the audio data (inband) has several disadvantages: the overhead of generating/embedding the tones and detecting them, the limitations of some codecs, which can distort DTMF codes, and poor transmission reliability (if some packets are lost, the same key press may be detected twice).

The main difference between DTMF RFC2833 and SIP INFO: if the SIP proxy server can pass RTP directly between subscribers, bypassing the server itself (for example, canreinvite=yes in Asterisk), then the server will not see the RFC2833 packets, and as a result the supplementary services become unavailable. SIP packets are always transmitted through SIP proxy servers, so VAS will always work.

4. Voice and fax transmission

As already mentioned, the RTP protocol is used to transfer media data. RTP packets always specify the format of the transmitted data (codec).

There are many different codecs for voice transmission, with different bitrate/quality/complexity trade-offs; there are open and closed ones. Any softphone must support the G.711 alaw/ulaw codecs: their implementation is very simple and the sound quality is decent, but they require 64 kbit/s of bandwidth. The G.729 codec, for example, requires only 8 kbit/s, but is very CPU-intensive, and it is not free.

For fax transmission, either the G.711 codec or the T.38 protocol is usually used. Sending a fax with the G.711 codec corresponds to sending it with the T.30 protocol, as if the fax were sent over a regular telephone line, except that the analog signal from the line is digitized according to the alaw/ulaw law. This is also called inband T.30 faxing.

Faxes using the T.30 protocol negotiate their parameters: transmission speed, datagram size, type of error correction. The T.38 protocol is based on T.30, but unlike inband transmission, the generated and received T.30 commands are analyzed: instead of raw data, recognized fax control commands are transmitted.

T.38 commands are transmitted using the UDPTL protocol, which runs over UDP and is used only for T.38. TCP and RTP can also carry T.38 commands, but they are used much less often.

The main advantages of T.38 are reduced network load and greater reliability compared to Inband fax transmission.

The procedure for sending a fax in T.38 mode is as follows:

  1. A normal voice connection is established using any codec.
  2. When paper is loaded in the sending fax machine, it periodically sends a T.30 CNG (Calling Tone) signal to indicate that it is ready to send a fax.
  3. On the receiving side, a T.30 signal CED (Called Terminal Identification) is generated - this is the readiness to receive a fax. This signal is sent either after pressing the "Receive Fax" button or the fax does it automatically.
  4. The CED signal is detected on the sending side and the SIP REINVITE procedure occurs, and the T.38 type is indicated in the SDP message: m=image 39164 udptl t38.

Over the Internet, faxes are preferably sent in T.38 mode. If the fax needs to be transmitted within an office, or between sites with a stable connection, inband T.30 fax transmission can be used. In this case, the echo cancellation procedure must be turned off before sending the fax, so as not to introduce additional distortion.

Very detailed information about faxing is written in the book "Fax, Modem, and Text for IP Telephony" by David Hanes and Gonzalo Salgueiro.

5. Digital signal processing (DSP). Ensuring sound quality in IP telephony, test examples

We have dealt with the protocols for establishing a call session (SIP/SDP) and the method of transmitting audio over an RTP channel. One important question remains - sound quality. On the one hand, sound quality is determined by the selected codec. On the other hand, additional DSP (digital signal processing) procedures are still needed. These procedures take into account the peculiarities of VoIP telephony: a high-quality headset is not always used, packets get dropped on the Internet, packets sometimes arrive unevenly, and network throughput is not unlimited.

Basic procedures that improve sound quality:

VAD (Voice Activity Detector) - a procedure for classifying frames as containing voice (active voice frames) or silence (inactive voice frames). This separation can significantly reduce network load, since transmitting information about silence requires much less data (it is enough to transmit the noise level, or nothing at all).


Some codecs already contain VAD procedures (GSM, G.729), while for others (G.711, G.722, G.726) they have to be implemented separately.
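For codecs without a built-in VAD, the simplest approach is an energy threshold: a frame whose mean energy exceeds the threshold is marked active. This is a deliberately minimal sketch (real VADs track the noise floor adaptively); the threshold value is illustrative.

```python
import math

def is_active_frame(samples, threshold=0.01):
    """Mark a frame as voice if its mean energy exceeds a threshold."""
    energy = sum(x * x for x in samples) / len(samples)
    return energy > threshold

# A 20 ms frame (160 samples at 8 kHz) of a loud tone vs. near-silence:
voice = [0.5 * math.sin(2 * math.pi * 300 * n / 8000) for n in range(160)]
silence = [0.001] * 160
print(is_active_frame(voice), is_active_frame(silence))   # True False
```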

If the VAD is configured to transmit noise-level information, special SID (Silence Insertion Descriptor) packets are transmitted in RTP format 13, CN (Comfort Noise).

It is worth noting that SID packets can be dropped by SIP proxy servers, so for testing it is advisable to configure RTP traffic to bypass the SIP servers.

CNG (Comfort Noise Generation) - a procedure for generating comfort noise based on the information in SID packets. VAD and CNG thus work in tandem, but the CNG procedure is much less in demand, since its work is not always noticeable, especially at low volume.

PLC (Packet Loss Concealment) - a procedure for restoring the audio stream when packets are lost. Even with 50% packet loss, a good PLC algorithm can achieve acceptable speech quality: there will be distortion, of course, but the words can still be made out.

The easiest way to emulate packet loss (on Linux) is with the tc utility from the iproute package together with the netem module. It shapes outgoing traffic only.

An example of running network emulation with 50% packet loss:

tc qdisc add dev eth1 root netem loss 50%

Disable emulation:

tc qdisc del dev eth1 root

Jitter buffer - a procedure for eliminating the jitter effect, in which the interval between received packets varies greatly and which, in the worst case, leads to packets arriving out of order. This effect also causes interruptions in speech. To eliminate it, the receiving side implements a packet buffer large enough to restore the original sending order of packets within a given interval.
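The buffering idea can be sketched with a small heap keyed by RTP sequence number: packets are held until the buffer reaches its depth, then released in order, which absorbs moderate reordering. A toy model (class and parameter names are ours; real jitter buffers are adaptive and time-driven):

```python
import heapq

class JitterBuffer:
    """Fixed-depth jitter buffer: hold packets in a heap ordered by
    sequence number, release the oldest once the buffer is full."""

    def __init__(self, depth=4):
        self.depth = depth
        self.heap = []

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))
        if len(self.heap) > self.depth:
            return heapq.heappop(self.heap)   # release oldest packet
        return None                           # still filling up

buf = JitterBuffer(depth=2)
out = []
for seq in [1, 3, 2, 5, 4]:              # packets arrive out of order
    released = buf.push(seq, b"...")
    if released:
        out.append(released[0])
print(out)   # [1, 2, 3]
```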

You can also emulate the jitter effect with the tc utility (the interval between the expected and actual moment of packet arrival can reach 500 ms):


tc qdisc add dev eth1 root netem delay 500ms reorder 99%

LEC (Line Echo Canceller) - a procedure for eliminating local echo, in which the remote subscriber begins to hear his own voice. Its essence is to subtract the received signal, taken with a certain coefficient, from the transmitted signal.
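The subtraction idea can be illustrated with a toy model: the far-end signal leaks back into the transmitted signal attenuated by some coefficient, and the canceller subtracts its estimate of that leak. All values here are made up, and the fixed coefficient is a simplification - real LECs estimate it with adaptive filters such as NLMS.

```python
def cancel_echo(near, far, coeff):
    """Subtract a scaled copy of the far-end (received) signal from
    the near-end mix to recover the local speech."""
    return [n - coeff * f for n, f in zip(near, far)]

far = [0.8, -0.5, 0.3, 0.9]        # remote subscriber's voice
local = [0.1, 0.2, -0.1, 0.0]      # local speech we want to transmit
echo_coeff = 0.4                   # how strongly the far signal leaks back

# The signal actually going back to the remote side: speech plus echo.
near = [l + echo_coeff * f for l, f in zip(local, far)]

recovered = cancel_echo(near, far, echo_coeff)
print([round(x, 6) for x in recovered])   # [0.1, 0.2, -0.1, 0.0]
```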

Echoes can occur for several reasons:

  • acoustic echo due to poor-quality audio path (sound from the speaker enters the microphone);
  • electrical echo due to impedance mismatch between telephone and SLIC module. In most cases, this occurs in circuits that convert a 4-wire telephone line to 2-wire.

Finding out the cause (acoustic or electrical echo) is not difficult: the subscriber on whose side the echo is created turns off his microphone. If the echo persists, it is electrical.


For more information on VoIP and DSP procedures, see VoIP Voice and Fax Signal Processing. A preview is available on Google Books.

This completes our brief theoretical overview of VoIP. If readers are interested, an example of a practical implementation of a mini-PBX on a real hardware platform can be covered in the next article.

[!?] Questions and comments are welcome. They will be answered by the author of the article Dmitry Valento, a software engineer at the Promwad electronics design center.

Tags:

  • for beginners
  • for newbies