Documentation

Many developers have asked for more documentation about Peers, specifically about how Peers works.

Abstract

This page describes how SIP softphones work, then provides an in-depth analysis of the Peers source code. The reader should understand some general considerations before reading the source code: a few notions about voice over IP are required, and they are explained in the first part.

Table of Contents

  1. General considerations
    1.1. Why standards?
    1.2. Media management
      1.2.1. First, voice capture.
        1.2.1.1. Sample frequency
        1.2.1.2. Sample size
        1.2.1.3. Number of channels
        1.2.1.4. Endianness
        1.2.1.5. Signed/unsigned
      1.2.2. Then, audio data encoding.
      1.2.3. Packetization
      1.2.4. Audio playback
    1.3. Session control
      1.3.1. Registration
      1.3.2. Codec negotiation
  2. Source code analysis
    2.1. Overview
      2.1.1. Architecture
      2.1.2. State machines
      2.1.3. Managers
      2.1.4. Package users
    2.2. Package details
      2.2.1. SIP
        2.2.1.1. Message anatomy
        2.2.1.2. Transport
        2.2.1.3. Transaction
        2.2.1.4. Dialog
        2.2.1.5. User-Agent
      2.2.2. SDP
      2.2.3. Media
      2.2.4. GUI

Table of Figures

  Figure 1. Analog to digital conversion
  Figure 2. Peers architecture
  Figure 3. SIP stack
  Figure 4. Abstract state for all state machines
  Figure 5. SIP message and its components
  Figure 6. SIP transport management
  Figure 7. SIP transactions class diagram
  Figure 8. Invite client transaction state machine
  Figure 9. Invite server transaction state machine
  Figure 10. Non-invite client transaction state machine
  Figure 11. Non-invite server transaction state machine
  Figure 12. Transaction manager
  Figure 13. Dialog state machine
  Figure 14. Method handlers
  Figure 15. Request managers
  Figure 16. User-Agent
  Figure 17. Peers GUI

1. General considerations

Let's start with simple things. First, it may seem obvious, but we have to answer the question: what is Peers? Peers is a lightweight Java SIP softphone, i.e. a piece of software that enables its users to place calls over the internet.

1.1. Why standards?

When you place calls over the internet, your computer does many things: it captures sound from the microphone, plays the remote contact's voice, sends and receives voice over the network, etc. When we look at existing solutions, we see that there are many ways to do this. I will not list all the applications that can be used to place calls over the internet, but let's look at the main ones: skype, msn, gtalk, yahoo. These are all proprietary solutions, i.e. big companies use their own way of communicating between their client applications (that's what we call a protocol). The benefit for them is that they control how their protocol evolves.

Concerning microsoft, google or yahoo, we must not forget that those companies take advantage of their visibility and their marketing power to distribute their applications. They are not necessarily telecommunications experts. This does not mean that they provide poor quality software, but what would you think if your car manufacturer tried to sell you an oven or a washing machine? You would probably (at least) hesitate.

The problem with proprietary protocols is that a yahoo client cannot call an msn client, nor a skype client a gtalk client, etc. Not directly. There are gateways to place calls from one network to another, but they are error-prone and imply complicated translation mechanisms. They do not necessarily offer exactly the same features, etc. This is the reason why standards have been created. With a common specification, developers can write various applications that communicate with each other. The internet is based on standards: HTTP, HTML, etc. Those standards are the reason for the internet's success. Everyone can reach everyone, because everyone is using the same language. For web pages, HTTP and HTML are the standards. For media session control (voice, video, games, etc.), SIP (Session Initiation Protocol) is the standard. For sending and receiving media, RTP (Real-time Transport Protocol) is the standard. And for media session description, SDP (Session Description Protocol) is the standard. Those standards (apart from HTML) are specified by an organization called the IETF (Internet Engineering Task Force). This organization publishes its specifications - standards - as plain text files called RFCs (Request For Comments). They have a number and a title, but engineers often refer to them by number. Thus SIP is specified in RFC3261.

Let's take a deep breath and dive into the technical details...

SIP is responsible for media session establishment, update and teardown. If someone wants to talk with a friend, he or she will tell the software: "I want to invite Bob for an audio session". Let's call Alice the person who is calling Bob. When the conversation is over, Alice or Bob will tell their client application: "I'm done with him or her, I want to terminate the session". That's it. This is SIP. SIP means Session Initiation Protocol, yet SIP is responsible not only for session initiation, but also for session updates and termination. Its name is not perfect, but let's deal with it.

Neither Alice nor Bob wants to do anything more complicated to hold a conversation. Thus, we can see SIP as the highest-level protocol for internet calls, the protocol that "interacts" with users.

Let's pause here: that was the first level of our technical dive.

Now that we have seen the aim of SIP, let's look at the next steps.

1.2. Media management

Let's start with media handling.

1.2.1. First, voice capture.

Just a glimpse of theory.

1.2.1.1. Sample frequency

Voice is captured using a microphone. The voice stream is an analog signal (a wave). To be usable by a computer, it has to be converted to a digital signal (stairs). This digital signal is then converted to bits (0, 1) that the computer understands. How is it done? A sample is taken from the analog wave at regular intervals, and we consider that this sample is valid until the next sample. This is not really the case; it is an approximation. But if we take enough samples, we can draw a curve that is very similar to the original wave. The number of samples taken per second is called the sample frequency of the sound (at 8000 Hz, one sample is taken every 125 microseconds).


 

Figure 1. Analog to digital conversion

1.2.1.2. Sample size

Another parameter plays an important role in sampling: sample size. To convert our sample to computer-usable data, we have to define bounds for the integer that will hold the sample value. This is the sample size, i.e. the number of bits used to store one sample. The more bits we use to store one sample, the better the sound quality.

1.2.1.3. Number of channels

As we are diving into digital sound analysis, we should consider the number of sources of sound that can be mixed up and heard by human ears. This is the number of channels. For telephony applications, the number of channels is generally one: one microphone makes voice capture, and delivers one source of sound. Even if voice applications generally consider only one source of sound, it is important to understand this notion as many tools use it to manage sound.

1.2.1.4. Endianness

When we define a data format for integers on a computer, we have to define entirely the way bits are converted to integers and vice versa. Unfortunately, not everybody uses the same conventions to store integers. When we convert a decimal number to binary, we generally write binary numbers as hexadecimal numbers:
    1000 (decimal) = 1111101000 (binary) = 3E8 (hexadecimal)
Binary data is generally sliced into groups of 8 bits, i.e. one byte: eight digits, each being one or zero. To store one byte on a computer, everybody agrees: let's take the first bit (highest power of two within this byte) and put it first, then the second, etc. But when our number cannot fit into eight bits, what do we do? We split our number into groups of eight bits:
    1000 (decimal) = 03 E8 (hexadecimal)
Here, I added a 0 in front of 3 so that 3 fills eight bits; that's what we call zero-padding. Some people decided that 03 (the most significant byte) would come first, followed by E8, the way we write. And some people decided that they would store E8 first (the least significant byte), followed by 03. The first solution is called Big-Endian, and the second one is Little-Endian.
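The two byte orders can be observed directly with the standard java.nio.ByteBuffer class. The following is only a small illustration, not code taken from Peers; the class name EndiannessSketch and its helper methods are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndiannessSketch {
    /** Stores the low 16 bits of value using the given byte order. */
    static byte[] toBytes(int value, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort((short) value).array();
    }

    /** Formats bytes as space-separated two-digit hexadecimal numbers. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1000 (decimal) = 03 E8 (hexadecimal)
        System.out.println("big-endian:    " + hex(toBytes(1000, ByteOrder.BIG_ENDIAN)));
        System.out.println("little-endian: " + hex(toBytes(1000, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

The big-endian line shows 03 E8 and the little-endian line shows E8 03, matching the description above.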

1.2.1.5. Signed/unsigned

The last parameter that plays an important role in audio data format is integer boundaries. Some people decided they needed an integer value that would vary between zero and a positive value. And some people decided they needed a value that could be positive or negative, with a sign in front of the absolute number. The problem is that we still have a multiple of eight bits to store our integer. The common solution is to take the most significant bit (not byte) away from the value and use it to store the sign of our integer. In practice, signed samples are usually stored in two's complement, where that bit is set for negative values.
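The five parameters described in this section (sample frequency, sample size, number of channels, signed/unsigned, endianness) are exactly what the javasound class javax.sound.sampled.AudioFormat takes in its constructor. The values below (8000 Hz, 16 bits, mono, signed, little-endian) are an assumption chosen for a typical telephony capture; they are not necessarily the ones Peers uses, and CaptureFormatSketch is a made-up name.

```java
import javax.sound.sampled.AudioFormat;

class CaptureFormatSketch {
    /** Builds a linear PCM format from the five parameters discussed above. */
    static AudioFormat telephonyFormat() {
        float sampleRate = 8000f;    // sample frequency, in samples per second
        int sampleSizeInBits = 16;   // sample size
        int channels = 1;            // mono: one microphone, one source
        boolean signed = true;       // samples may be negative
        boolean bigEndian = false;   // least significant byte first
        return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
    }

    public static void main(String[] args) {
        AudioFormat format = telephonyFormat();
        System.out.println(format);
        System.out.println("bytes per frame: " + format.getFrameSize());
    }
}
```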

1.2.2. Then, audio data encoding.

The samples we have at this point are sometimes called raw data or linear PCM (Pulse Code Modulation). To find the bandwidth that's necessary to transfer audio data, the following formula can be applied:
    bandwidth = frequency * sample size * number of channels
(signedness and endianness do not affect bandwidth, as they keep the same storage space). If we consider that 16-bit samples are taken at 8 kHz with only one channel, 128000 bits of data must be sent every second (128 kbit/s) to keep our voice quality. Even if internet providers' bandwidths are growing, such an upload bandwidth is significant and not always available on the internet.
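The formula is simple enough to check by hand, but here it is as a tiny Java method. PcmBandwidthSketch is a made-up name for this illustration.

```java
class PcmBandwidthSketch {
    /** bandwidth = frequency * sample size * number of channels, in bits per second. */
    static long bitsPerSecond(int sampleRateHz, int sampleSizeBits, int channels) {
        return (long) sampleRateHz * sampleSizeBits * channels;
    }

    public static void main(String[] args) {
        // 16-bit samples at 8 kHz, one channel: the example from the text
        System.out.println(bitsPerSecond(8000, 16, 1) + " bit/s");
    }
}
```

With the values from the text, the method returns 128000.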

To face this issue, smart optimisations are applied to audio data so that it fits within a reasonable bandwidth, available on most networks. Those optimisations are performed by codecs (coder-decoder). They rely on the physical properties of voice and hearing to avoid naive raw data transfer. I will not give too many details on codecs in this documentation. There are many audio codecs, but the first one that is generally implemented by SIP clients is G711 mu-law. Please refer to the wikipedia article for more information. Once the optimisations are done, data is compressed and needs less space to transport the same voice stream.
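To give an idea of what such a codec does, here is a sketch of a G711 mu-law encoder. It uses the widely circulated bias-and-segment formulation of the algorithm; it is not the encoder found in Peers, and MuLawSketch is a made-up name. Each 16-bit sample becomes one byte, so the 128 kbit/s stream computed above shrinks to 64 kbit/s.

```java
class MuLawSketch {
    private static final int BIAS = 0x84;   // 132, added before locating the segment
    private static final int CLIP = 32635;  // largest magnitude that still fits after the bias

    /** Compresses one signed 16-bit linear PCM sample into one mu-law byte. */
    static byte encode(short sample) {
        int sign = sample < 0 ? 0x80 : 0x00;
        int magnitude = Math.min(Math.abs((int) sample), CLIP) + BIAS;
        // The exponent is the position of the highest set bit, counted from bit 7.
        int exponent = 7;
        for (int mask = 0x4000; exponent > 0 && (magnitude & mask) == 0; mask >>= 1) {
            exponent--;
        }
        // The mantissa keeps the four bits that follow the highest set bit.
        int mantissa = (magnitude >> (exponent + 3)) & 0x0F;
        // G711 transmits the bitwise complement of sign, exponent and mantissa.
        return (byte) ~(sign | (exponent << 4) | mantissa);
    }

    /** Encodes a whole buffer: the output holds one byte per input sample. */
    static byte[] encode(short[] samples) {
        byte[] out = new byte[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = encode(samples[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.printf("silence -> 0x%02X%n", encode((short) 0) & 0xFF);
        System.out.println("160 samples -> " + encode(new short[160]).length + " bytes");
    }
}
```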

1.2.3. Packetization

After data encoding, the audio stream is packetized: slices of the audio stream are extracted from the encoder output. But on a SIP network, media data cannot be sent raw, as is. It must be sent using RTP (Real-time Transport Protocol). RTP strives to solve the real-time media transport issues that can occur on IP networks. Thus, it provides a header that includes a timestamp. This timestamp gives a clue about when the packet must be played by the receiver. The header also includes a sequence number that enables packet re-ordering. On SIP networks, RTP is usually transported over UDP, because we can afford losing a few media packets and a little disorder in packet reception. UDP is appreciated for its speed rather than its reliability, which is a big advantage in a real-time environment. RTP is a binary protocol which transports binary data.

On the receiver side, RTP packets are parsed. RTP headers are dropped and media data is extracted.
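Peers delegates RTP to an external library, so the following is not Peers code. It is a minimal sketch of the fixed 12-byte RTP header (version, payload type, sequence number, timestamp, SSRC), assuming no padding, no header extension and no CSRC entries. RtpPacketSketch and its method names are made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class RtpPacketSketch {
    static final int HEADER_LENGTH = 12;

    /** Sender side: prepends a fixed RTP header to one slice of encoded audio. */
    static byte[] packetize(int payloadType, int sequenceNumber, long timestamp,
                            long ssrc, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + payload.length); // big-endian by default
        buf.put((byte) 0x80);                  // version 2, no padding, no extension, no CSRC
        buf.put((byte) (payloadType & 0x7F));  // marker bit cleared
        buf.putShort((short) sequenceNumber);  // lets the receiver re-order packets
        buf.putInt((int) timestamp);           // tells the receiver when to play the payload
        buf.putInt((int) ssrc);                // identifies the stream source
        buf.put(payload);
        return buf.array();
    }

    static int sequenceNumber(byte[] packet) {
        return ByteBuffer.wrap(packet).getShort(2) & 0xFFFF;
    }

    static long timestamp(byte[] packet) {
        return ByteBuffer.wrap(packet).getInt(4) & 0xFFFFFFFFL;
    }

    /** Receiver side: drops the header and keeps the media data. */
    static byte[] payload(byte[] packet) {
        return Arrays.copyOfRange(packet, HEADER_LENGTH, packet.length);
    }

    public static void main(String[] args) {
        byte[] audio = new byte[160]; // 20 ms of 8 kHz audio at one byte per sample
        byte[] packet = packetize(0, 1, 160, 0x1234L, audio); // payload type 0 is PCMU
        System.out.println("packet length: " + packet.length);
        System.out.println("payload length: " + payload(packet).length);
    }
}
```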

1.2.4. Audio playback

Once media data is extracted, still compressed, it is passed to the decoder, which generates raw uncompressed samples. Those playable samples are then passed to a player, which sends instructions to the sound card to play the voice samples correctly.

1.3. Session control

Let's now understand how SIP makes people reachable on a network and gets them to use the same codec.

1.3.1. Registration

Let's come back to Alice and Bob. Alice and Bob both use an IP network to reach each other. When Alice wants to call Bob, she knows his SIP uri (sip:bob@biloxi.com), but her computer does not know where Bob is: on which computer, at which IP address. Thus, Alice's and Bob's client applications register, when their computer starts or when they want to be reachable on the SIP network, to tell a central server: hello, I'm here, my IP address is 1.2.3.4 and the port I'm using is 5060. Astute readers know that there are NATs (Network Address Translation) on the internet, but for the moment, I will consider that a public IP address is used by Alice's and Bob's client applications. Actually, this can be true if port forwarding is configured on the NATs.

As Bob's computer is registered on a central server (called a registrar), Alice's client application (User-Agent) sends its request to Bob's registrar, which then forwards the request to Bob's IP address and port. SIP considers that there may be several domains/realms, with a registrar for each domain (several providers). Another important element on SIP networks is the proxy. A proxy is an element that receives requests from User-Agents (or other SIP nodes), may modify those requests, request authentication and compute routes, and then forwards those requests to other proxies, registrars or User-Agents. It's a sort of relay. It may filter malformed requests, etc.
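Conceptually, a registrar keeps a table that maps a SIP uri to the place where its owner can currently be reached. Peers is a User-Agent and does not contain a registrar, so the class below is purely hypothetical (RegistrarSketch and its methods are invented names); it only illustrates the register-then-locate idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RegistrarSketch {
    // SIP uri -> current "ip:port" of the registered User-Agent
    private final Map<String, String> bindings = new HashMap<>();

    /** "Hello, I'm here, my IP address is 1.2.3.4 and the port I'm using is 5060." */
    void register(String sipUri, String ipAndPort) {
        bindings.put(sipUri, ipAndPort);
    }

    /** Where should a request for this uri be forwarded? Empty if nobody registered. */
    Optional<String> locate(String sipUri) {
        return Optional.ofNullable(bindings.get(sipUri));
    }

    public static void main(String[] args) {
        RegistrarSketch registrar = new RegistrarSketch();
        registrar.register("sip:bob@biloxi.com", "1.2.3.4:5060");
        System.out.println(registrar.locate("sip:bob@biloxi.com").orElse("not registered"));
        System.out.println(registrar.locate("sip:carol@biloxi.com").orElse("not registered"));
    }
}
```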

1.3.2. Codec negotiation

Now that our User-Agents (client software) are talking the same control protocol, SIP, they must establish a media session so that Alice can hear Bob and Bob can hear Alice. SIP is a flexible protocol. Thus, it lets User-Agents support several codecs to send or receive media packets. In practice, G711 is the common baseline that nearly every User-Agent supports. This is the reason why most SIP User-Agents implement G711 first, and then add more complicated codecs.

As several codecs can be supported, User-Agents use a common language to describe their codecs in their SIP messages. This is SDP (Session Description Protocol). When a User-Agent sends a request to create a media session, it includes a description of its supported codecs. And when a User-Agent answers a request that aims to create a media session, it also includes its set of supported codecs, even if there is only one. This is codec negotiation. Each User-Agent takes the remote party's codec list and picks the first codec in this list that matches a codec in its own list. Thus both User-Agents use the same way to encode and decode media data for Alice's and Bob's voices. Generally, User-Agents put their "worst" codec in last position in their list so that the best quality codecs are preferred.
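The selection rule described above (first codec of the remote list that also appears in the local list) fits in a few lines. This is an illustration only, not the Peers implementation; CodecNegotiationSketch and the codec names used in main are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CodecNegotiationSketch {
    /** Returns the first codec of the remote list that is also supported locally. */
    static Optional<String> selectCodec(List<String> remoteCodecs, List<String> localCodecs) {
        for (String codec : remoteCodecs) {
            if (localCodecs.contains(codec)) {
                return Optional.of(codec);
            }
        }
        return Optional.empty(); // no codec in common: the media session cannot be set up
    }

    public static void main(String[] args) {
        List<String> remote = Arrays.asList("G729", "PCMA", "PCMU"); // remote preference order
        List<String> local = Arrays.asList("PCMU", "PCMA");
        System.out.println(selectCodec(remote, local).orElse("none"));
    }
}
```

Here the result is PCMA: G729 is preferred by the remote party but is not supported locally, so the next entry of the remote list wins.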

Actually, codec negotiation relies on the offer/answer model for SDP. This model is specified in RFC3264. It allows the request that creates a new media session to be empty, without any offer. In this case, the User-Agent is saying: make me an offer, and I'll answer with my supported codecs accordingly. The SDP offer is then in the SIP response (200 OK), and the SDP answer is in the ACK.

2. Source code analysis

The following paragraphs apply to peers version 0.3.1.

2.1. Overview

2.1.1. Architecture

Peers has been developed in Java, an object-oriented programming language. Here is the Peers architecture:


 

Figure 2. Peers architecture

Peers is separated in packages (gui, sip, core, etc.). For some features, it relies on external libraries (swing, javasound, etc.). If you don't want to download sources, and import them in your preferred IDE, you can use the web interface to browse sources online. If you take a look at the source (this will now be necessary), you will see that the first categorization in the source code is done on protocols and very high level capabilities:

  • net.sourceforge.peers.gui
  • net.sourceforge.peers.media
  • net.sourceforge.peers.nat
  • net.sourceforge.peers.sdp
  • net.sourceforge.peers.sip

For your information, the nat package is not used for the moment; it was an experiment about Port Restricted Cone NAT traversal. Peers VoIP-related code relies on an external library to manage the RTP protocol: jrtp. RTP is the only protocol that has not been implemented in Peers. All references to this external library are in the media package. The media package is also responsible for sound encoding. SDP and SIP do not rely on any external library. Of course, the sdp and sip packages contain the SDP-related sources and the SIP stack implementation. The only complicated (but interesting) package is sip. Let's see what sip is made of:

  • net.sourceforge.peers.sip.core
  • net.sourceforge.peers.sip.transactionuser
  • net.sourceforge.peers.sip.transaction
  • net.sourceforge.peers.sip.transport
  • net.sourceforge.peers.sip.syntaxencoding

As you probably noticed, it corresponds to the RFC3261 layers:


 

Figure 3. SIP stack

The reader will probably need to keep an eye on RFC3261 to fully understand the following paragraphs. I did not reinvent the wheel. The sip package has been implemented with simplicity and extensibility in mind. This implementation should not be obscure to a Java developer who already knows SIP. The following paragraphs contain UML diagrams. But before we explain the meaning of each package, let's see some common techniques that have been used in several packages.

2.1.2. State machines

SIP defines several state machines, so a design pattern has been used for them. It consists of one class for the object that holds the state, one parent class for all the states of this state machine, and one class for each state. As there were several state machines, and as it was useful to log state transitions, a generic abstract state class has been defined; it just prints the old state, the new state and the transition employed. The parent state class of each state machine then extends this abstract state class:


 

Figure 4. Abstract state for all state machines

This figure shows that there are five state machines in Peers. We won't detail the role of each state machine now; just keep in mind the way they are managed and implemented, not what they are for.
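The pattern can be reduced to the following sketch: a generic abstract state that logs transitions, one parent class per state machine, one class per state, and one object that holds the current state. The machine shown here (ringing, talking, terminated) and all the class names are invented for this illustration; they are not one of the five Peers state machines.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic parent of every state: it only records the old state, the new state and the transition. */
abstract class AbstractStateSketch {
    static final List<String> LOG = new ArrayList<>();

    protected void logTransition(AbstractStateSketch next, String transition) {
        String line = getClass().getSimpleName() + " -> "
                + next.getClass().getSimpleName() + " (" + transition + ")";
        LOG.add(line);
        System.out.println(line);
    }
}

/** Parent class for all the states of one machine: by default, events are ignored. */
abstract class CallStateSketch extends AbstractStateSketch {
    protected final CallSketch call;

    CallStateSketch(CallSketch call) { this.call = call; }

    void answered() { }
    void hungUp() { }

    protected void moveTo(CallStateSketch next, String transition) {
        logTransition(next, transition);
        call.state = next;
    }
}

class RingingSketch extends CallStateSketch {
    RingingSketch(CallSketch call) { super(call); }
    @Override void answered() { moveTo(new TalkingSketch(call), "answered"); }
    @Override void hungUp() { moveTo(new TerminatedSketch(call), "hungUp"); }
}

class TalkingSketch extends CallStateSketch {
    TalkingSketch(CallSketch call) { super(call); }
    @Override void hungUp() { moveTo(new TerminatedSketch(call), "hungUp"); }
}

class TerminatedSketch extends CallStateSketch {
    TerminatedSketch(CallSketch call) { super(call); }
}

/** The object that holds its state and forwards every event to it. */
class CallSketch {
    CallStateSketch state = new RingingSketch(this);

    void answered() { state.answered(); }
    void hungUp() { state.hungUp(); }

    public static void main(String[] args) {
        CallSketch call = new CallSketch();
        call.answered();
        call.hungUp();
    }
}
```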

2.1.3. Managers

The second design pattern that has been used in Peers is Factory. It has been used in several packages. In Peers, factories are called managers. Managers are more than factories: they create object instances, but they also store the references to those objects. Thus, when an external object needs to access one of the objects created by a manager, it uses the manager's get method. In some cases, one manager can create several types of objects. In those cases, the appropriate get method must be used. All managers have been implemented the same way: they contain hashtables for the series of objects they manage. For those reasons, and as they are also used to delete references to those objects, the word manager has been preferred to factory.
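In its simplest form, a manager looks like the sketch below: a create method, a get method and a delete method around a hashtable. SessionSketch and SessionManagerSketch are hypothetical names; the real Peers managers handle transactions, dialogs and transports.

```java
import java.util.Hashtable;

/** A hypothetical managed object, identified by a string id. */
class SessionSketch {
    final String id;

    SessionSketch(String id) { this.id = id; }
}

class SessionManagerSketch {
    private final Hashtable<String, SessionSketch> sessions = new Hashtable<>();

    /** Factory part: creates the instance and keeps a reference to it. */
    SessionSketch createSession(String id) {
        SessionSketch session = new SessionSketch(id);
        sessions.put(id, session);
        return session;
    }

    /** Registry part: external objects retrieve instances through the manager. */
    SessionSketch getSession(String id) {
        return sessions.get(id);
    }

    /** The manager also drops the reference when the object is no longer needed. */
    void deleteSession(String id) {
        sessions.remove(id);
    }

    public static void main(String[] args) {
        SessionManagerSketch manager = new SessionManagerSketch();
        manager.createSession("a1");
        System.out.println(manager.getSession("a1").id);
        manager.deleteSession("a1");
        System.out.println(manager.getSession("a1"));
    }
}
```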

2.1.4. Package users

Interaction between packages is sometimes made using interfaces. Those interfaces define the users of an object. Thus, when an object outside a package needs to get information from an object, it implements the corresponding User interface, and then gets notified about events. User interfaces are quite similar to Listeners. But they are not called Listeners because they do not necessarily apply to pure beans or POJOs (plain old Java objects).

In SIP, some interactions may occur between several layers at a time on the same side (core and transaction user), this is the reason why one package user may be either layer A or layer B.
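A User interface boils down to a callback contract between two layers. The names below (MessageReceiverUserSketch, MessageReceiverSketch, UpperLayerSketch) are invented for this illustration and do not match the actual Peers interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Contract implemented by whoever wants to be notified by the receiver. */
interface MessageReceiverUserSketch {
    void messageReceived(String message);
}

/** Lower layer: it only knows its user through the interface. */
class MessageReceiverSketch {
    private final MessageReceiverUserSketch user;

    MessageReceiverSketch(MessageReceiverUserSketch user) { this.user = user; }

    /** Simulates the arrival of a message: the user gets notified. */
    void onMessage(String message) {
        user.messageReceived(message);
    }
}

/** Upper layer: implements the User interface to get the events. */
class UpperLayerSketch implements MessageReceiverUserSketch {
    final List<String> received = new ArrayList<>();

    @Override
    public void messageReceived(String message) {
        received.add(message);
    }

    public static void main(String[] args) {
        UpperLayerSketch upperLayer = new UpperLayerSketch();
        new MessageReceiverSketch(upperLayer).onMessage("SIP/2.0 200 OK");
        System.out.println(upperLayer.received);
    }
}
```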

2.2. Package details

2.2.1. SIP

Let's start with sip-related packages. The best way to get in touch with SIP is probably using wireshark network analyzer and its sip filter, and trying to place calls. We have already seen that Peers source code is separated in packages that correspond to SIP layers. We will start with the lowest layer (syntax/encoding) with simple message examples. Then we will climb up the layer stack. The next step is transport management, i.e. the way messages travel over the network. Next, we will see how those messages are grouped to form transactions. Then, we will explain how those transactions are grouped to manage dialogs. And last but not least, we will understand how dialogs are managed by core layer. But before we explain how those high-level layers are translated in java, let's discover SIP by its messages.

2.2.1.1. Message anatomy

SIP uses two types of messages: requests and responses. Requests contain a method (a word, INVITE in the following example) that gives the aim of the request, and a request-uri (sip:bob@biloxi.example.com in the example) for the person/server we want to reach. Responses contain a status code (an integer, 200 in the example response further down) that gives the response status: success, failure, etc. Each SIP message is made of several headers and one body. A header has a name, and generally one value, but it may contain several values. A SIP header can contain one or several parameters with the following syntax:

header_name: header_value;param=param_value
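This syntax can be split with a few string operations. The sketch below is deliberately naive: it assumes a single value per header and no semicolon inside angle brackets or quoted strings, so it works for a Via header but would mis-split a Contact header whose uri carries parameters. HeaderLineSketch is a made-up name, not the Peers parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderLineSketch {
    final String name;
    final String value;
    final Map<String, String> params = new LinkedHashMap<>();

    /** Splits "header_name: header_value;param=param_value" into its parts. */
    HeaderLineSketch(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("not a header line: " + line);
        }
        name = line.substring(0, colon).trim();
        String[] parts = line.substring(colon + 1).split(";");
        value = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String[] pair = parts[i].split("=", 2);
            // a parameter without '=' is stored with an empty value
            params.put(pair[0].trim(), pair.length == 2 ? pair[1].trim() : "");
        }
    }

    public static void main(String[] args) {
        HeaderLineSketch via = new HeaderLineSketch(
                "Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9");
        System.out.println(via.name + " | " + via.value + " | " + via.params);
    }
}
```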

Here is an example message quoting RFC3665, which gives simple call-flow examples.

INVITE sip:bob@biloxi.example.com SIP/2.0
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
Max-Forwards: 70
From: Alice <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: Bob <sip:bob@biloxi.example.com>
Call-ID: 3848276298220188511@atlanta.example.com
CSeq: 1 INVITE
Contact: <sip:alice@client.atlanta.example.com;transport=tcp>
Content-Type: application/sdp
Content-Length: 151

v=0
o=alice 2890844526 2890844526 IN IP4 client.atlanta.example.com
s=-
c=IN IP4 192.0.2.101
t=0 0
m=audio 49172 RTP/AVP 0
a=rtpmap:0 PCMU/8000

This message is a request; an example response follows. In both, the message body is plain text. Generally, a SIP message body is either empty or text. But RFC3261 states that the body can contain any type of data, even binary data.

SIP/2.0 200 OK
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
 ;received=192.0.2.101
From: Alice <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: Bob <sip:bob@biloxi.example.com>;tag=8321234356
Call-ID: 3848276298220188511@atlanta.example.com
CSeq: 1 INVITE
Contact: <sip:bob@client.biloxi.example.com;transport=tcp>
Content-Type: application/sdp
Content-Length: 147

v=0
o=bob 2890844527 2890844527 IN IP4 client.biloxi.example.com
s=-
c=IN IP4 192.0.2.201
t=0 0
m=audio 3456 RTP/AVP 0
a=rtpmap:0 PCMU/8000

Thus, here is how those messages have been separated in objects to ease message content access in Peers.


 

Figure 5. SIP message and its components

For the moment, don't bother with body content, this is SDP (starting with v=0...). We will explain this syntax later. Just remember that this is not SIP but SDP, and thus, it's specified in another RFC.

2.2.1.2. Transport

Now that we have seen the skeleton of SIP messages, let's see how those messages are transported over the network.

The transport package is quite simple: TransportManager creates client transports and server transports. Those client transports and server transports are called message senders and message receivers. Behind the scenes, DatagramSockets do the real job. It must be noted that TCP transport is not supported in Peers. Most User-Agents support UDP first, and then TCP. Peers does not break the rule. That's why TCP does not appear on the following class diagram. The transport layer is also generally responsible for message retransmissions. As SIP works over UDP, those message retransmissions are very important to avoid losing messages.


 

Figure 6. SIP transport management

For datagram socket indexing (in the transport manager hashtable), it must be noted that keys are neither strings nor integers, but SipTransportConnection objects. Those objects contain the local address, remote address, local port, remote port and transport protocol used to convey packets. Thus, when it's necessary to communicate with the same machine on the same port and using the same transport protocol, the same object is reused.
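For an object to work as a hashtable key, it must define equals and hashCode over the fields that identify it. The class below sketches that idea with the five fields listed above; it is not the actual SipTransportConnection source, addresses are kept as plain strings for simplicity, and ConnectionKeySketch is a made-up name.

```java
import java.util.Hashtable;
import java.util.Objects;

final class ConnectionKeySketch {
    private final String localAddress;
    private final int localPort;
    private final String remoteAddress;
    private final int remotePort;
    private final String transport;

    ConnectionKeySketch(String localAddress, int localPort,
                        String remoteAddress, int remotePort, String transport) {
        this.localAddress = localAddress;
        this.localPort = localPort;
        this.remoteAddress = remoteAddress;
        this.remotePort = remotePort;
        this.transport = transport;
    }

    /** Two keys are equal when all five fields match. */
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ConnectionKeySketch)) {
            return false;
        }
        ConnectionKeySketch key = (ConnectionKeySketch) other;
        return localPort == key.localPort
                && remotePort == key.remotePort
                && localAddress.equals(key.localAddress)
                && remoteAddress.equals(key.remoteAddress)
                && transport.equals(key.transport);
    }

    /** Equal keys must produce the same hash code, otherwise lookups fail. */
    @Override
    public int hashCode() {
        return Objects.hash(localAddress, localPort, remoteAddress, remotePort, transport);
    }

    public static void main(String[] args) {
        Hashtable<ConnectionKeySketch, String> sockets = new Hashtable<>();
        sockets.put(new ConnectionKeySketch("192.0.2.101", 5060, "192.0.2.201", 5060, "UDP"),
                "socket #1");
        // A new but equal key finds the entry stored with the first one.
        System.out.println(sockets.get(
                new ConnectionKeySketch("192.0.2.101", 5060, "192.0.2.201", 5060, "UDP")));
    }
}
```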

Thread management is not the same for message sending and message reception. MessageReceiver implements Runnable, thus it must be started in its own Thread. Running message reception in a dedicated Thread was considered necessary, as a message can arrive at any time. But message sending is not done in its own Thread. It's done in the caller's Thread.

Transport management has been done in a very naive way. In theory, UDP packets may contain several SIP messages, but this feature is not implemented in Peers. On the client side, it would be very odd to receive several SIP messages in the same UDP packet. In day-to-day life, it never occurs. UDP packets carrying several SIP messages generally only occur between heavily loaded servers, not on User-Agents. An improvement would be a sort of dialect/protocol definition, using nio to easily support several transport protocols and new protocols. Apache Mina gives a great example of those transport optimizations. Transport implementation in Peers has been pragmatic and quick...

We won't explain the whole SIP routing philosophy, but remember that requests are routed using the Route header if it's present, and the request-uri domain name or IP address if there is no Route header in the message. Responses are routed using the Via header. It generally contains the IP address and port to which the response must be sent.
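The request routing rule can be sketched as follows. This is a simplification that ignores everything else RFC3261 says about routing; it assumes uris always start with a scheme such as sip:, and RequestRoutingSketch and its methods are invented names.

```java
class RequestRoutingSketch {
    /** Extracts "host" or "host:port" from a uri such as sip:user@host:port;param. */
    static String hostPort(String uri) {
        String s = uri.trim();
        if (s.startsWith("<") && s.endsWith(">")) {
            s = s.substring(1, s.length() - 1);  // drop the angle brackets
        }
        s = s.substring(s.indexOf(':') + 1);     // drop the scheme ("sip:")
        int at = s.indexOf('@');
        if (at >= 0) {
            s = s.substring(at + 1);             // drop the user part
        }
        int semicolon = s.indexOf(';');
        if (semicolon >= 0) {
            s = s.substring(0, semicolon);       // drop the uri parameters
        }
        return s;
    }

    /** Uses the Route header if there is one (non-null), the request-uri otherwise. */
    static String nextHop(String firstRouteUri, String requestUri) {
        return hostPort(firstRouteUri != null ? firstRouteUri : requestUri);
    }

    public static void main(String[] args) {
        System.out.println(nextHop(null, "sip:bob@biloxi.example.com"));
        System.out.println(nextHop("<sip:ss1.atlanta.example.com;lr>", "sip:bob@biloxi.example.com"));
    }
}
```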

2.2.1.3. Transaction

Those of you who are familiar with databases probably already know transactions. We could also compare SIP transactions with financial transactions. In each case, the aim of a transaction is the same: do something if everything goes well, else do nothing. It's exactly the same with SIP. If any error occurs during transaction management, abort the modifications on transaction-related objects (generally dialog state, etc.) and come back to the original state, before transaction management. Actually, it is a bit silly to start the description of the Peers transaction implementation with this fallback mechanism, because no such "failover" technique has been implemented in Peers... but at least, you are aware of it.

In SIP, a transaction is made of:

  • exactly one request,
  • optionally one or several provisional response(s) (status code between 100 and 199),
  • exactly one final response (status code between 200 and 699).

We deliberately ignore forking here: it is not implemented in Peers.

Transaction layer is probably the most complicated layer in SIP specification. There are several transaction families. To find transaction family, you have to answer the two following questions:

  • Did this transaction receive or send the request on the network?
  • Will this transaction create a Dialog?

Both questions have two exclusive answers. Transactions that receive requests are called server transactions and transactions which send requests are called client transactions. Transactions that create a dialog are called invite transactions, as INVITE is the only method that can create dialogs in RFC3261. And transactions that will not create a dialog are called non-invite transactions. Thus there are four transaction types:

  • invite client transaction,
  • invite server transaction,
  • non-invite client transaction,
  • non-invite server transaction.
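The two questions translate directly into a small classification function. The enum below is only an illustration: Peers uses one class per transaction type rather than an enum, and TransactionTypeSketch is a made-up name.

```java
enum TransactionTypeSketch {
    INVITE_CLIENT, INVITE_SERVER, NON_INVITE_CLIENT, NON_INVITE_SERVER;

    /**
     * sendsRequest answers the first question (client if true, server if false);
     * the method answers the second one (only INVITE can create a dialog).
     */
    static TransactionTypeSketch of(boolean sendsRequest, String method) {
        boolean invite = "INVITE".equals(method);
        if (sendsRequest) {
            return invite ? INVITE_CLIENT : NON_INVITE_CLIENT;
        }
        return invite ? INVITE_SERVER : NON_INVITE_SERVER;
    }

    public static void main(String[] args) {
        System.out.println(of(true, "INVITE"));
        System.out.println(of(false, "REGISTER"));
    }
}
```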

Transactions are uniquely identified using the branch parameter of the Via header and the method name. As requests and responses belong to a transaction, those parameters are present in both request and response. In the following example, which is just an extract from the previous messages, they are the branch value z9hG4bK74bf9 and the method INVITE (in the request line of the request, and in the CSeq header of the response):

INVITE sip:bob@biloxi.example.com SIP/2.0
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
Max-Forwards: 70
[...]

SIP/2.0 200 OK
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
 ;received=192.0.2.101
From: Alice <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: Bob <sip:bob@biloxi.example.com>;tag=8321234356
Call-ID: 3848276298220188511@atlanta.example.com
CSeq: 1 INVITE
Contact: <sip:bob@client.biloxi.example.com;transport=tcp>
[...]
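The extract above can be turned into a transaction identifier with a few lines of Java. The "branch|method" format of the key is an assumption made for this sketch, not necessarily the format Peers uses internally, and TransactionIdSketch is a made-up name.

```java
class TransactionIdSketch {
    /** Returns the branch parameter of a Via header value, or null if there is none. */
    static String branchOf(String viaValue) {
        for (String part : viaValue.split(";")) {
            String p = part.trim();
            if (p.startsWith("branch=")) {
                return p.substring("branch=".length());
            }
        }
        return null;
    }

    /** Returns the method of a CSeq header value such as "1 INVITE". */
    static String cseqMethod(String cseqValue) {
        return cseqValue.trim().split("\\s+")[1];
    }

    /** Builds the identifier from the topmost Via header value and the method. */
    static String id(String topViaValue, String method) {
        return branchOf(topViaValue) + "|" + method;
    }

    public static void main(String[] args) {
        String requestVia = "SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9";
        String responseVia = requestVia + " ;received=192.0.2.101";
        System.out.println(id(requestVia, "INVITE"));
        System.out.println(id(responseVia, cseqMethod("1 INVITE")));
    }
}
```

Both lines show the same identifier, which is how a response is matched to the transaction that sent the request.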

The server and client aspects of transactions have been implemented as interfaces in Peers, and the invite and non-invite properties have been implemented as abstract classes. Thus each of those four transaction types has its own class in Peers, extending and implementing the appropriate class and interface, as shown in the following diagram:


 

Figure 7. SIP transactions class diagram

This class diagram also shows which classes are using transport layer using their corresponding SipXxxTransportUser interface.

What makes the transaction package particularly verbose in Peers is that each transaction type has its own state machine, thus its own parent state class and one class for each state. Those state machines are provided in RFC3261, but here is how they have been implemented in Peers:


 

Figure 8. Invite client transaction state machine


 

Figure 9. Invite server transaction state machine


 

Figure 10. Non-invite client transaction state machine


 

Figure 11. Non-invite server transaction state machine

Well, now that we know how our transactions behave, let's see their manager. The transaction manager works with transactions through their client/server property. Thus it uses the ClientTransaction and ServerTransaction interfaces to handle them.


 

Figure 12. Transaction manager

Transactions are identified by their branch id and a method name. Take a look at peers.log to find it in state machine transitions. That's it for transactions, now let's make groups of transactions to create dialogs!

2.2.1.4. Dialog

Actually, in SIP specification there's a sort of confusion between transaction user and dialog layer. Several layers are using transaction layer on the upper side: core and dialog. Core is either User-Agent, Proxy, Registrar or Redirect Server; and dialog is transaction user.

The transaction user is probably the simplest layer in SIP. It contains Dialogs. A dialog is the representation of a media session on the control side. Remember there are two sides in SIP: media and control. The dialog is on the control side, and the media session is on the media side. Media session is often the term employed in SDP and RTP. One state machine is necessary for dialogs. Please refer to RFC3261 for information about what must be inside a dialog. A bird's-eye view: local and remote contact addresses, unique id, etc. It's no surprise that dialogs are managed using DialogManager.


 

Figure 13. Dialog state machine

Actually, a Dialog is not really a group of transactions; rather, a transaction can occur within a dialog or not. The parameter that determines whether a transaction is performed within a dialog is its Call-ID header. To be exhaustive, a dialog is identified by its Call-ID, the tag parameter in the From header and the tag parameter in the To header. This is what you will see in peers.log. Thus a transaction which is performed within a dialog must use the same Call-ID, the same local-tag and the same remote-tag. Local-tag and remote-tag map to the From-tag and To-tag of a message, but which is which depends on who sent the request: they are swapped when the request comes from the UAS (User-Agent Server), i.e. the one who received the call. Here is an illustration of dialog identifier components in request and response:

INVITE sip:bob@biloxi.example.com SIP/2.0
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
Max-Forwards: 70
From: Alice <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: Bob <sip:bob@biloxi.example.com>
Call-ID: 3848276298220188511@atlanta.example.com
[...]

SIP/2.0 200 OK
Via: SIP/2.0/TCP client.atlanta.example.com:5060;branch=z9hG4bK74bf9
 ;received=192.0.2.101
From: Alice <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: Bob <sip:bob@biloxi.example.com>;tag=8321234356
Call-ID: 3848276298220188511@atlanta.example.com
[...]


In this example, the request does not contain a tag parameter in header To. Actually, at this time, the dialog does not exist yet.
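The tag swapping is easier to see in code. In the sketch below, the party that sends a request uses the From tag as its local tag, and the party that receives it uses the To tag as its local tag, so both directions lead to the same identifier on a given side. The "callId|localTag|remoteTag" format and the DialogIdSketch name are assumptions made for this illustration.

```java
class DialogIdSketch {
    static String id(String callId, String localTag, String remoteTag) {
        return callId + "|" + localTag + "|" + remoteTag;
    }

    /** Identifier computed by the party that sends the request: local tag is the From tag. */
    static String forSender(String callId, String fromTag, String toTag) {
        return id(callId, fromTag, toTag);
    }

    /** Identifier computed by the party that receives the request: local tag is the To tag. */
    static String forReceiver(String callId, String fromTag, String toTag) {
        return id(callId, toTag, fromTag);
    }

    public static void main(String[] args) {
        String callId = "3848276298220188511@atlanta.example.com";
        String aliceTag = "9fxced76sl";
        String bobTag = "8321234356";
        // Alice sent the INVITE: From carries her tag, To carries Bob's tag.
        System.out.println(forSender(callId, aliceTag, bobTag));
        // Later Bob sends a request within the dialog: From now carries his tag.
        // Alice receives it and still computes the same identifier.
        System.out.println(forReceiver(callId, bobTag, aliceTag));
    }
}
```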

2.2.1.5. User-Agent

On top of the transaction user layer, we find the core layer. The core layer defines the role of the SIP element. On a SIP network, we've already seen that there are several kinds of nodes:

  • proxy,
  • registrar,
  • redirect server,
  • user-agent

Peers is a user-agent. It's the software employed by users to place or receive calls. Actually, a user-agent is just the SIP part of this software. User-agent can be considered as the image of the software in SIP stack. This is the reason why the corresponding package name is: net.sourceforge.peers.sip.core.useragent. Peers SIP core layer, or core role is User-Agent.

In SIP, the core layer is the brain. Depending on its role, it can be more or less sophisticated, but it is the place where general behavior is defined. Another property of the SIP protocol is that complex things are managed in client software applications: it is sometimes said that SIP pushes complexity to the edge of the network. To support this complexity, each single feature has been implemented in a separate class in Peers. There are two types of classes in the core layer: request managers and handlers. Handlers implement method-specific work. The basic SIP specification defines several methods: INVITE, BYE, CANCEL, ACK, OPTIONS and REGISTER. Each method has its own handler. Not all methods are dialog-related: handlers for methods that are not dialog-related inherit MethodHandler directly, while handlers for dialog-related methods share a common abstract class called DialogMethodHandler. This class is responsible for dialog construction and updates, calling the appropriate methods in the dialog package. The following class diagram shows those classes:


 

Figure 14. Method handlers

Method names are generally quite explicit: INVITE creates dialogs, CANCEL cancels dialogs that are still being established and BYE terminates dialogs. But INVITE can also be used to update the codec, or the IP address and port to which RTP packets must be sent. Such requests are called re-INVITEs, but the method that actually appears in the request is still INVITE. The trick to tell an initial INVITE from a subsequent INVITE (re-INVITE) is to look at the tag parameter of the To header. If this parameter is defined, a dialog has already been created and the INVITE request takes place within this dialog. REGISTER is used to register the user-agent IP address and port, so that it can receive SIP calls. And OPTIONS is used to discover what a user-agent, proxy, etc. supports. Well, I forgot ACK... ACK is a very particular method. It does not generate any response. It is only used by the client side to acknowledge the creation of a dialog, so that the server side is notified of dialog creation.
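The To tag test can be written as a small function. The sketch below is an illustration only: RequestKindSketch and classify are hypothetical names that do not exist in Peers, and the To header is handled as a plain string instead of a parsed header object.

```java
import java.util.Locale;

/** Hypothetical sketch: tells initial requests from subsequent (mid-dialog) requests. */
final class RequestKindSketch {
    enum Kind { INITIAL, MID_DIALOG }

    static Kind classify(String toHeaderValue) {
        // Header parameters come after the closing '>' of the URI, if there is one.
        int end = toHeaderValue.indexOf('>');
        String params = end >= 0 ? toHeaderValue.substring(end + 1) : toHeaderValue;
        for (String param : params.split(";")) {
            if (param.trim().toLowerCase(Locale.ROOT).startsWith("tag=")) {
                return Kind.MID_DIALOG; // a To tag means the dialog already exists
            }
        }
        return Kind.INITIAL;
    }
}
```

An INVITE whose To header is "Bob <sip:bob@biloxi.example.com>" is classified as INITIAL, while the same header followed by ";tag=8321234356" is classified as MID_DIALOG, which is the re-INVITE case.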

A user-agent always contains both a user-agent client and a user-agent server. The user-agent client is responsible for sending requests and the user-agent server is responsible for processing incoming requests. In Peers, they are called UAC and UAS. Sorry, I must insist on one thing: the most important aspect of a request is whether it occurs inside a dialog (subsequent request) or outside any dialog (initial request). This is really important because the processing in the user-agent is totally different. In one case you may have to create a dialog, in the other you may have to update an existing dialog. Some processing is common to all methods for initial requests and some processing is common to all methods for subsequent requests. Request creation is the typical example: all methods share part of the creation process. Thus, in Peers, InitialRequestManager is responsible for the handling specific to initial requests and MidDialogRequestManager is responsible for the handling specific to subsequent requests. There is another important manager: ChallengeManager. In a SIP network, each element that receives a request can reject it and ask the sender to prove that it knows a shared secret: this is a challenge. Challenges are specified in RFC2617. This RFC defines a framework for HTTP authentication, but the same framework is applied to SIP. They provide a sort of authentication in SIP. All methods share the same behavior for authentication, so ChallengeManager handles challenges for all of them.
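To give an idea of what answering a challenge involves, here is a minimal sketch of the digest computation defined in RFC2617. It assumes the MD5 algorithm with qop=auth, which may differ from what ChallengeManager actually sends. DigestSketch and its methods are hypothetical names, not Peers code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of the RFC2617 digest response (MD5, qop=auth). */
final class DigestSketch {

    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    static String response(String user, String realm, String password,
                           String method, String uri,
                           String nonce, String nc, String cnonce, String qop) {
        String ha1 = md5Hex(user + ":" + realm + ":" + password); // the shared secret part
        String ha2 = md5Hex(method + ":" + uri);                  // the request part
        return md5Hex(ha1 + ":" + nonce + ":" + nc + ":" + cnonce + ":" + qop + ":" + ha2);
    }
}
```

The password never travels on the network: only the hash does, and the element that sent the challenge recomputes it on its side. In SIP, the method would be REGISTER or INVITE, for instance, and the uri would be a SIP URI.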


 

Figure 15. Request managers

RequestManager is the class that keeps a reference to all handlers. Although it is an abstract class, it provides references to each method handler to its subclasses: InitialRequestManager and MidDialogRequestManager. Instances of those classes are referenced in UAC and UAS. Whether Peers acts as user-agent server or user-agent client, it has to support sending initial requests, sending subsequent requests, receiving initial requests and receiving subsequent requests.

Let's come to our main class: UserAgent. UserAgent keeps references to many important objects: UAC, UAS, media related objects (CaptureRtpSender and IncomingRtpReader), and managers (ChallengeManager, DialogManager, TransactionManager and TransportManager). Each layer manager is referenced here.


 

Figure 16. User-Agent

All core handlers and managers are instantiated within the UserAgent constructor. Thus, when you instantiate a new UserAgent, you implicitly create the objects of its underlying layers. This is why it is really easy to use Peers in external applications.

2.2.2. SDP

The SDP package is responsible for codec negotiation. SDP itself is the way media sessions are described; it was originally specified in RFC2327, which has been obsoleted by RFC4566. Codec negotiation is specified in RFC3264. The negotiation principle is quite simple. At any time, an entity generates an offer listing all the codecs it supports. This offer is sent to another entity. The entity that received the offer then parses it, analyzes it, and generates an answer. There is always exactly one answer for one offer. The answer depends on the offer, so it is not always the same.
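The offer/answer principle can be illustrated with a very small sketch. It assumes a simple policy in which the answer keeps the offered payload types that the answerer also supports, in the order of the offer. OfferAnswerSketch is a hypothetical name and this is not how SDPManager is written.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of answer generation: keep what both sides support. */
final class OfferAnswerSketch {

    /** Returns the offered payload types that are also supported locally, in offer order. */
    static List<Integer> answer(List<Integer> offered, List<Integer> supported) {
        List<Integer> accepted = new ArrayList<>();
        for (Integer payloadType : offered) {
            if (supported.contains(payloadType)) {
                accepted.add(payloadType);
            }
        }
        return accepted;
    }
}
```

For example, if the offer lists payload types 0 (PCMU), 8 (PCMA) and 18 (G729) and the answerer only supports 8 and 0, the answer lists 0 and 8. An empty result means the two entities have no codec in common.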

In SIP theory, an offer can be present in either the INVITE body or the 200 body. If the offer is in the INVITE, the answer is in the 200; if the offer is in the 200, the answer is in the ACK body and the INVITE body is empty. In practice, the latter case is extremely rare, and Peers does not support empty INVITE bodies. SDP contains critical information about media streams. It provides the IP address and port on which the sender of the description wishes to receive RTP packets, and it also lists the payload types that it supports. Remember, SDP gives the media description, not the media content. The protocol that transports media streams is RTP. RTP carries encoded media and identifies the encoding format with a number: this is the payload type. Here is an example SDP session description:

v=0
o=alice 2890844526 2890844526 IN IP4 client.atlanta.example.com
s=-
c=IN IP4 192.0.2.101
t=0 0
m=audio 49172 RTP/AVP 0
a=rtpmap:0 PCMU/8000

Reading this SDP, we can conclude that this entity is listening on IP address 192.0.2.101 on port 49172 for RTP PCMU packets, sampled at 8000 Hz.
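The same reading can be done by a program. The following sketch extracts the connection address, the audio port, the payload types and the rtpmap entries from a session description. SdpSketch is a hypothetical class written for this page, far simpler than the session description and media description classes of Peers, and it ignores every line it does not know.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a minimal SDP reader (audio only, no error handling). */
final class SdpSketch {
    String connectionAddress;
    int audioPort = -1;
    final List<Integer> payloadTypes = new ArrayList<>();
    final Map<Integer, String> rtpmap = new HashMap<>();

    static SdpSketch parse(String sdp) {
        SdpSketch result = new SdpSketch();
        for (String rawLine : sdp.split("\r?\n")) {
            String line = rawLine.trim();
            if (line.startsWith("c=")) {
                // c=<nettype> <addrtype> <connection-address>
                String[] parts = line.substring(2).split(" ");
                result.connectionAddress = parts[2];
            } else if (line.startsWith("m=audio ")) {
                // m=<media> <port> <proto> <fmt> ...
                String[] parts = line.substring(2).split(" ");
                result.audioPort = Integer.parseInt(parts[1]);
                for (int i = 3; i < parts.length; i++) {
                    result.payloadTypes.add(Integer.parseInt(parts[i]));
                }
            } else if (line.startsWith("a=rtpmap:")) {
                // a=rtpmap:<payload type> <encoding name>/<clock rate>
                String[] parts = line.substring("a=rtpmap:".length()).split(" ", 2);
                result.rtpmap.put(Integer.parseInt(parts[0]), parts[1]);
            }
        }
        return result;
    }
}
```

Applied to the example above, parse yields the address 192.0.2.101, the port 49172, the single payload type 0, and the mapping of payload type 0 to PCMU/8000.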

If we take a look at Peers source code, SDPManager is the place where everything is done at SDP level. This class generates offers, parses answers to extract useful information (IP address, port, payload type), and generates answers based on incoming offers. It is that simple. Session descriptions and media descriptions have their own classes. The content of those classes corresponds to the parameters described in RFC4566.

2.2.3. Media

The media package is a bit more complicated than the sdp package. Its main classes are IncomingRtpReader and CaptureRtpSender. IncomingRtpReader is responsible for RTP depacketization, media decompression and media playback; CaptureRtpSender is responsible for microphone capture, media encoding and RTP packetization. Thus, CaptureRtpSender has references to Capture, Encoder and RtpSender instances. Each of these classes implements Runnable and runs in a separate Thread. Data is passed between those media manipulation objects using PipedOutputStreams and PipedInputStreams. IncomingRtpReader, on the other hand, does not use separate threads for depacketization and media playback.
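The thread-per-stage design with piped streams can be sketched as follows. This is an illustration under stated assumptions, not the Peers implementation: MediaPipelineSketch is a hypothetical name, the "capture" stage writes a fixed byte array instead of reading the microphone, the "encoder" simply inverts the bits of each byte, and the "sender" collects bytes in memory instead of building RTP packets.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch: three threads connected by two pipes. */
final class MediaPipelineSketch {

    /** Copies in to out, applying a placeholder "encoding" to each byte. */
    static Runnable encoderStage(InputStream in, OutputStream out) {
        return () -> {
            try (InputStream i = in; OutputStream o = out) {
                int b;
                while ((b = i.read()) != -1) {
                    o.write(b ^ 0xFF); // stand-in for a real codec
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    static byte[] run(byte[] captured) {
        try {
            PipedOutputStream captureOut = new PipedOutputStream();
            PipedInputStream encoderIn = new PipedInputStream(captureOut);
            PipedOutputStream encoderOut = new PipedOutputStream();
            PipedInputStream senderIn = new PipedInputStream(encoderOut);
            ByteArrayOutputStream sent = new ByteArrayOutputStream();

            Thread capture = new Thread(() -> {
                // Closing the pipe tells the next stage that capture is over.
                try (OutputStream o = captureOut) {
                    o.write(captured);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            Thread encoder = new Thread(encoderStage(encoderIn, encoderOut));
            Thread sender = new Thread(() -> {
                try (InputStream i = senderIn) {
                    int b;
                    while ((b = i.read()) != -1) {
                        sent.write(b);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            capture.start();
            encoder.start();
            sender.start();
            capture.join();
            encoder.join();
            sender.join();
            return sent.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Each stage closes its output pipe when it has nothing left to write, so the next stage sees the end of the stream and stops too. A piped stream whose writer thread dies without closing makes the reader fail with an IOException, which is why the close is done in a try-with-resources.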

This media package relies on two APIs: jrtp and javasound. Jrtp defines RtpSessions and RtpPackets for RTP management; its RTP implementation is based on RFC3550. Please refer to its documentation for more information. Javasound is the standard sun javasound API, so you can use the sun web pages for more information and tutorials. A bird's-eye view: javasound defines SourceDataLine for media playback and TargetDataLine for media capture. A global AudioSystem class is there to retrieve information about the sound card, etc. AudioFormat describes the codec and the audio bitstream format. The last important aspect of javasound is Line. Lines are used for stream control: start/stop, etc. Peers captures audio data at 8 kHz, using 16-bit samples, one channel (mono), and signed little-endian samples.
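The capture format given above translates directly into a javasound AudioFormat. The sketch below only builds the format and asks javasound whether a capture line exists for it; it does not open the microphone. CaptureFormatSketch is a hypothetical name, and the 20 ms packet duration mentioned after the code is an assumption chosen for the illustration, not a value taken from Peers.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

/** Hypothetical sketch: the capture format described in the text. */
final class CaptureFormatSketch {

    static AudioFormat captureFormat() {
        float sampleRate = 8000f;   // 8 kHz
        int sampleSizeInBits = 16;  // 16-bit samples
        int channels = 1;           // mono
        boolean signed = true;
        boolean bigEndian = false;  // little-endian
        return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
    }

    /** Number of raw audio bytes captured during the given duration. */
    static int bytesPerPacket(AudioFormat format, int millis) {
        int frames = (int) (format.getFrameRate() * millis / 1000);
        return frames * format.getFrameSize();
    }

    /** True if the sound system reports a capture line for this format. */
    static boolean captureSupported() {
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, captureFormat());
        return AudioSystem.isLineSupported(info);
    }
}
```

With this format a frame is 2 bytes, so 20 ms of audio is 160 frames, that is 320 bytes of raw data before encoding.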

The use of javasound for media capture and playback is critical. Even if it is not the simplest java media API, it has the advantage of being tested by sun on each supported platform (windows, linux, mac, solaris, etc.). Peers has been tested successfully on linux, windows and mac os 10.6. Javasound has several drawbacks: few guaranteed features, no standard audio data format (frequency, sample size, etc.). But it is already integrated in the java standard edition API, and it avoids third-party libraries with native parts, etc.

2.2.4. GUI

Last but not least, the graphical user interface. Peers relies on swing for its gui. Once more, swing is already integrated in the sun JRE. It is light, efficient and uses straightforward development methods. It's not shiny, but it works.


 

Figure 17. Peers GUI

The main frame is BasicGUI. Each call has its own window: CallFrame. The sip package triggers events (SipEvent in the core package) and the gui package listens to those events to update the gui appropriately. Actions are performed on UAC and InviteHandler to perform initial actions or to answer requests. Those invocations are performed in SwingWorkers, so the gui does not block on button click. A reproach often made to swing is that actions are performed synchronously and freeze the interface. But actually, the issue is poor usage of swing classes by developers.

SipEvent is made of an EventType, which is a java 5 enumeration, and a SipMessage. It has been considered that sip events were related to one sip message. This seems reasonable for the sip package, but it may be insufficient for communication between other packages (media errors, etc.) and the gui. The Observable/Observer design pattern is used to communicate between the sip package and the gui package: Dialog and InviteHandler are the observables. It is easy to understand that Dialog must be observed by the gui to update its CallFrame, which is the Dialog view. InviteHandler is also made visible to the gui because it notifies the arrival of incoming INVITEs and it notifies the gui when the dialog is not accessible. For example, when an initial INVITE is sent from Peers, the Dialog is not created yet. It is only created when 200 OK is received. But there must be something to show to the user. This is why communication is performed between InviteHandler and the gui instead of between Dialog and the gui. Thus, Dialog remains a pure SIP dialog and not a call "image".
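The event flow between the sip package and the gui can be sketched as follows. Everything here is hypothetical: the class names carry a Sketch suffix to make clear they are not the Peers classes, the EventType values are invented for the example, the SIP message is reduced to a String, and a hand-written listener interface is used instead of java.util.Observable, which is deprecated since Java 9.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event: a type plus the SIP message it relates to. */
final class SipEventSketch {
    enum EventType { INCOMING_CALL, RINGING, CALLEE_PICKUP } // invented values

    final EventType type;
    final String sipMessage; // stand-in for a SipMessage object

    SipEventSketch(EventType type, String sipMessage) {
        this.type = type;
        this.sipMessage = sipMessage;
    }
}

/** Hypothetical observer role, played by the gui. */
interface SipEventListenerSketch {
    void onSipEvent(SipEventSketch event);
}

/** Hypothetical observable role, played here by an INVITE handler. */
final class InviteHandlerSketch {
    private final List<SipEventListenerSketch> listeners = new ArrayList<>();

    void addListener(SipEventListenerSketch listener) {
        listeners.add(listener);
    }

    /** Called when an INVITE arrives: no dialog exists yet, but the gui must react. */
    void handleIncomingInvite(String rawInvite) {
        notifyListeners(new SipEventSketch(SipEventSketch.EventType.INCOMING_CALL, rawInvite));
    }

    private void notifyListeners(SipEventSketch event) {
        for (SipEventListenerSketch listener : listeners) {
            listener.onSipEvent(event);
        }
    }
}
```

The handler knows nothing about windows or buttons: it only publishes events. A gui listener would typically open or update a call window when it receives the event, doing so on the swing event dispatch thread.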

It must be noted that the gui package is not very clean. It has been implemented as quickly as possible. A real GUI would probably implement state machines, view states or something like that... but it's simple and it works.