# Introduction

Researchers who work with smaller populations that have clearly defined boundaries (Laumann, Marsden, and Prensky 1989) may be able to design survey instruments that can feasibly document an entire social network. These complete, *sociocentric* networks can exhibit some systematic biases (Ready et al. 2020), but they typically do a better job of characterizing the full suite of direct and indirect connections within a population.

Many researchers, however, work in larger populations that have fuzzy boundaries, if any. In these settings it is usually not feasible to construct a complete social network using only a survey or semi-structured interview. An alternative approach is to collect *egocentric* networks from a random sample of individuals.

In this post, I will describe what an egocentric network is, and provide some simulation and sampling tools that can be used to understand the structural implications of an egocentric approach.

# Egocentric networks

An egocentric network contains all of the alters connected to a particular *ego* (a single node), as well as all of the people to whom ego’s alters are also connected. These two degrees of separation are sometimes referred to as the `\(1^{st}\)` and `\(2^{nd}\)` order neighborhoods of ego.

Put differently, an egocentric design attempts to:

[observe] the network of interest from the point of view of a set of sampled actors, who provide information about themselves and anonymized information on their network neighbors (Krivitsky and Morris 2017: 1).

Looking at the quote above, we can see that an egocentric network design requires a survey instrument that asks respondents to complete, at minimum, two network tasks.

First, we ask each ego to list the names of people with whom they have a particular relationship. In my work, I might ask people to list individuals with whom they have shared a meal in the past month. I might also ask them for details about the relationship, like how frequently they share meals, or how they are related to the person.

This procedure yields a short edgelist that might look something like this:

```
##   Node Edge Frequency Relation
## 1  Ego    A         3   friend
## 2  Ego    B         2   parent
## 3  Ego    C         2  sibling
## 4  Ego    D         5  sibling
## 5  Ego    E         3   friend
```

Once the respondent has listed their partners, the next step is to ask them who those partners might also be connected with. For example, now that we know `Ego` has had a meal with person `A`, we will ask if they know who person `A` (and person `B`, etc.) has shared a meal with. We can also ask ego to estimate the frequency and how `A` is related to the person. In some cases, `Ego` might not know the answers to these questions.

```
##    Node Edge Frequency Relation
## 1     B    E         1   parent
## 2     B    F         3  sibling
## 3     B    G         2     <NA>
## 4     C    I         5   friend
## 5     C    J         3     <NA>
## 6     C    K         4  sibling
## 7     D    I         4  sibling
## 8     D    J         4 coworker
## 9     E    F         4  sibling
## 10    E    G         5     <NA>
```

These two edgelists represent the `\(1^{st}\)` and `\(2^{nd}\)` order neighborhoods of ego.
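To make the structure of these two tables concrete, here is a minimal base R sketch that rebuilds them as data frames and separates the alters by neighborhood. The object names (`first_order`, `second_order`, `alters1`, `alters2`) are my own illustrations; the values are copied from the two edgelists above.

```r
# First-order edgelist: everyone ego named directly
first_order <- data.frame(
  Node = "Ego",
  Edge = c("A", "B", "C", "D", "E"),
  Frequency = c(3, 2, 2, 5, 3),
  Relation = c("friend", "parent", "sibling", "sibling", "friend"),
  stringsAsFactors = FALSE
)

# Second-order edgelist: ties that ego reports among and beyond the alters
second_order <- data.frame(
  Node = c("B", "B", "B", "C", "C", "C", "D", "D", "E", "E"),
  Edge = c("E", "F", "G", "I", "J", "K", "I", "J", "F", "G"),
  Frequency = c(1, 3, 2, 5, 3, 4, 4, 4, 4, 5),
  Relation = c("parent", "sibling", NA, "friend", NA,
               "sibling", "sibling", "coworker", "sibling", NA),
  stringsAsFactors = FALSE
)

# One combined edgelist for the whole egocentric network
egonet <- rbind(first_order, second_order)

# Alters named directly by ego
alters1 <- unique(first_order$Edge)

# People who appear only through an alter's ties
alters2 <- setdiff(unique(c(second_order$Node, second_order$Edge)),
                   c("Ego", alters1))
```

Note that `A` never appears in the second edgelist: ego either reported no further partners for `A` or did not know them, which is exactly the kind of gap an egocentric design has to tolerate.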

# Limitations and benefits of egocentric networks

Observing a network through the lens of sampled egos has some limitations. Each respondent must be able to report information on themselves and all of their partners. For some relationships and details, this can be very challenging. The resulting network is also a biased representation of the complete network. This means that some centrality metrics – e.g., betweenness and closeness – are greatly affected by the incomplete network structure, while others – e.g., degree centrality – are less impacted (Marsden 2002).

But there are also some benefits that come from working with egocentric networks. A large amount of information can be obtained by sampling only a small number of people. This greatly reduces the amount of research effort needed to sample large populations. Egocentric networks also lend themselves to respondent-driven sampling (Heckathorn 1997, 2002). Together these approaches can be used to document sensitive or stigmatized social relations. Egocentric networks can also be constructed using contact diaries, which lend themselves to longitudinal network documentation (Fu 2007).

# Egocentric sampling: A simulation study

So if you do work in a large population, and you intend to use an egocentric approach, you will be faced with some design questions:

- How many egos do you need to sample?
- What kinds of metrics will be biased?
- Can you conduct valid inferential statistics?

These questions are the focus of this simulation study. Here, I provide some tools you can use to simulate and visualize a sample of egos from a complete network and immediately see the ramifications. Several packages are used.

```
library(statnet)
library(igraph)
library(tidygraph)
library(ggraph)
library(ggforce)
library(patchwork)
library(intergraph)
```

## `egonetworks` function

To begin, I’ve created a function that has several settings that can help you understand the implications of an egonetwork design. This is superficially similar to a *power analysis* used in statistical research design.

The overall purpose of the function is to simulate a complete network to use as a reference, and then sample egos from this complete network. By sampling egos and looking at their neighborhoods, we can get a sense of how the collection of egonetworks compares to the complete network structure.

```
egonetworks <- function(
    # SIMULATION SETTINGS
    N = 30,
    directed = FALSE,
    formula = "net ~ edges + nodematch('group')",
    params = c(-3.5, 3),
    groups = 4,
    # EGONETWORK SETTINGS
    select_egos = FALSE,
    N_egos = 3,
    egoIds,
    seed = 777) {
  # SIMULATE A COMPLETE NETWORK
  set.seed(seed)
  n <- N
  net <- network(n, directed = directed, density = 0)
  net %v% 'group' <- sample(1:groups, size = n, replace = TRUE)
  g <- simulate(as.formula(formula), coef = params, seed = seed)
  # CONVERT TO IGRAPH
  ig <- asIgraph(g)
  # SAMPLE EGOS AT RANDOM, OR USE THE IDS SUPPLIED IN egoIds
  if (select_egos) {
    egos <- egoIds
  } else {
    egos <- sample(V(ig)$vertex.names, size = N_egos, replace = FALSE)
  }
  # FIND 1ST AND 2ND ORDER NEIGHBORHOODS
  first <- ego(ig, order = 1, nodes = egos, mindist = 0)
  second <- ego(ig, order = 2, nodes = egos, mindist = 1)
  # CREATE ATTRIBUTES
  tg <- ig %>%
    as_tbl_graph() %>%
    activate(nodes) %>%
    mutate(
      Neighborhood = as.factor(ifelse(
        vertex.names %in% egos, 1,
        ifelse(vertex.names %in% unlist(first) | vertex.names %in% unlist(second),
               2, 'Unobserved'))),
      egolab = ifelse(vertex.names %in% egos, paste('Ego', vertex.names), ''),
      IsEgo = ifelse(vertex.names %in% egos, 'Yes', 'No')) %>%
    activate(edges) %>%
    mutate(Neighborhood = as.factor(ifelse(
      from %in% egos, 1,
      ifelse(to %in% egos, 1,
             ifelse(from %in% unlist(first), 2,
                    ifelse(to %in% unlist(first), 2, 'Unobserved'))))))
  # SAVE EGO NETWORKS
  E(ig)$id <- seq_len(ecount(ig))
  V(ig)$vid <- seq_len(vcount(ig))
  egographs <- make_ego_graph(ig, order = 2, nodes = egos)
  # RETURN COMPLETE, TIDYGRAPH, AND EGONETWORKS IN A LIST
  return(list(g, tg, egographs))
}
```

The `egonetworks` function generates the following:

- One complete network.
- Node and edge attributes to visualize each egonetwork within the complete network.
- A list of egonetworks.

## Simulation Settings

The function has the following default settings:

```
# SIMULATION SETTINGS
N = 30,
directed = FALSE,
formula = "net ~ edges + nodematch('group')",
params = c(-3.5, 3),
groups = 4,
# EGONETWORK SETTINGS
select_egos = FALSE,
N_egos = 3,
egoIds,
seed = 777)
```

### Simulation settings

`N` controls the number of nodes that are in the complete network simulation. `directed` controls whether the simulation generates a directed or undirected network.

`egonetworks` accepts an ergm `formula` that determines the structure for the network simulation. We covered ergm formulae in a previous post on network simulations with `statnet`. Currently, the default formula creates a network based on group membership homophily. The number of groups is controlled by the `groups` argument. But you can also include any ergm term in your formula, as long as the term does not require an attribute (see `?ergm-terms` for details).

The `params` argument is used to specify the parameter values that are required by the `formula`. Since the default formula is `"net ~ edges + nodematch('group')"`, we need one parameter for network density (`edges`) and another for how strong the homophily is (`nodematch`). These parameters are specified in *log-odds*; a log-odds of `0` corresponds to a probability of `0.5`.
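To build some intuition for the log-odds scale, the sketch below converts the default `params` to tie probabilities with base R's `plogis` (the inverse logit). It assumes the default formula, where `edges` and `nodematch('group')` are the only terms, so each tie can be treated independently; the names `p_between` and `p_within` are mine.

```r
# Default parameters from egonetworks(): edges and nodematch('group')
params <- c(-3.5, 3)

# A log-odds of 0 is a coin flip
plogis(0)

# Tie probability for two nodes in different groups (edges term only)
p_between <- plogis(params[1])

# Tie probability for two nodes in the same group (edges + nodematch)
p_within <- plogis(params[1] + params[2])

round(c(between = p_between, within = p_within), 3)
```

Under these assumptions, a more negative `edges` value thins the whole network, while a larger `nodematch` value widens the gap between within-group and between-group ties.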

### Egonetwork settings

Once a complete network is simulated from the settings above, we need to sample a number of egos from it. Here you have two options. If `select_egos = FALSE`, then you can set the number of egos to sample using `N_egos`. The number of egos you choose will be randomly sampled from the nodes in the complete network (based on `seed`). Alternatively, if `select_egos = TRUE`, you can provide a vector of `egoIds`. This is useful if you want to take a close look at specific nodes.

## Testing it out

You can run `egonetworks` with just the defaults and a list will be generated. The first element of the list is the complete network as a `network` object.

```
test <- egonetworks()
test[[1]]
```

```
## Network attributes:
## vertices = 30
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 54
## missing edges= 0
## non-missing edges= 54
##
## Vertex attribute names:
## group vertex.names
##
## No edge attributes
```

The second element is a `tbl_graph` object with node and edge attributes that can be used for visualization.

```
test[[2]]
```

```
## # A tbl_graph: 30 nodes and 54 edges
## #
## # An undirected simple graph with 2 components
## #
## # Edge Data: 54 x 4 (active)
##    from    to na    Neighborhood
##   <int> <int> <lgl> <fct>
## 1     1     6 FALSE 2
## 2     1     9 FALSE 1
## 3     1    11 FALSE 1
## 4     1    14 FALSE 2
## 5     1    27 FALSE 2
## 6     1    30 FALSE 2
## # ... with 48 more rows
## #
## # Node Data: 30 x 6
##   group na    vertex.names Neighborhood egolab IsEgo
##   <int> <lgl> <chr>        <fct>        <chr>  <chr>
## 1     2 FALSE 1            2            ""     No
## 2     4 FALSE 2            2            ""     No
## 3     4 FALSE 3            2            ""     No
## # ... with 27 more rows
```

And the third element is a list of `igraph` objects, one for each ego graph extracted by `make_ego_graph`.

```
test[[3]]
```

```
## [[1]]
## IGRAPH 22e6c59 U--- 20 37 --
## + attr: group (v/n), na (v/l), vertex.names (v/c), vid (v/n), na (e/l),
## | id (e/n)
## + edges from 22e6c59:
## [1] 1-- 5 1-- 8 1-- 9 1--11 1--17 1--20 2-- 8 2-- 9 2--10 2--15
## [11] 3-- 5 3-- 6 4-- 7 4--19 5-- 6 5-- 8 5--15 5--16 5--17 5--18
## [21] 6--12 6--17 6--18 7-- 8 7--12 7--19 8--14 8--18 9--16 10--11
## [31] 10--15 11--16 13--14 14--17 14--18 16--18 17--20
##
## [[2]]
## IGRAPH 22e6c59 U--- 14 26 --
## + attr: group (v/n), na (v/l), vertex.names (v/c), vid (v/n), na (e/l),
## | id (e/n)
## + edges from 22e6c59:
## [1] 1-- 4 1-- 5 3-- 5 4-- 5 1-- 6 3-- 6 2-- 7 6-- 7 2-- 8 3-- 8
## [11] 7-- 8 1-- 9 8-- 9 3--10 4--10 8--10 4--11 6--11 9--11 1--12
## [21] 4--12 4--13 5--13 11--13 1--14 12--14
##
## [[3]]
## IGRAPH 22e6c59 U--- 14 23 --
## + attr: group (v/n), na (v/l), vertex.names (v/c), vid (v/n), na (e/l),
## | id (e/n)
## + edges from 22e6c59:
## [1] 1-- 3 3-- 4 1-- 6 2-- 6 3-- 6 5-- 6 6-- 8 7-- 8 7-- 9 7--10
## [11] 9--10 3--11 1--12 3--12 4--12 8--12 3--13 4--13 6--13 8--13
## [21] 11--13 1--14 12--14
```

## Visualization

First, let’s take a look at what the complete network looks like for a few different `seed` values.

Coloring each node by group membership, we see how our simulation has generated a network based on group homophily.

The default settings draw three egos randomly out of the thirty used in the simulation. We can use the `Neighborhood` and `IsEgo` variables to color the edges and nodes. These variables tag which nodes and edges are part of the `\(1^{st}\)` and `\(2^{nd}\)` order neighborhoods and which are unobserved (`Neighborhood = 'Unobserved'`). The `egolab` variable can be used to label each ego.

Let’s focus on `seed = 777` and `seed = 72`.

```
p1 <- egonetworks(seed = 777)[[2]] %>%
  ggraph() +
  geom_edge_link0(aes(color = Neighborhood)) +
  geom_node_point(aes(color = Neighborhood, shape = IsEgo), size = 2) +
  geom_node_label(aes(label = egolab), repel = TRUE) +
  theme_void() + theme(legend.position = 'none') +
  scale_edge_color_manual(values = c('#3300ff', 'magenta', '#00000033')) +
  scale_color_manual(values = c('#3300ff', 'magenta', '#00000033')) +
  ggtitle('seed = 777')
p2 <- egonetworks(seed = 72)[[2]] %>%
  ggraph() +
  geom_edge_link0(aes(color = Neighborhood)) +
  geom_node_point(aes(color = Neighborhood, shape = IsEgo), size = 2) +
  geom_node_label(aes(label = egolab), repel = TRUE) +
  theme_void() + # legend kept on this panel so it appears once
  scale_edge_color_manual(values = c('#3300ff', 'magenta', '#00000033')) +
  scale_color_manual(values = c('#3300ff', 'magenta', '#00000033')) +
  ggtitle('seed = 72')
# patchwork places the two panels side by side
p1 + p2
```

Looking at Figure 4, something clearly stands out. Can you tell what it is?

A sample of three egos is capable of recovering a lot of the complete network structure, but only if those egos occupy different structural positions in the network. This is somewhat true when we used `seed = 777`, but `seed = 72` happened to select three nodes that are all connected to each other. This greatly reduces our ability to observe the complete network.

It is easy to notice this *when we have the complete network* already. But how would we know this if we haven’t and cannot collect a complete network?

One approach is to use stratified random sampling instead of purely random sampling. This requires that we have some background, contextual knowledge of the population we are studying. For instance, if we know that there are a number of organizations in the population, we need to be sure to sample egos from each of them. In our simulation study, this means we need to sample at least one ego from each of our homophily groups.

Similarly, if we have reason to believe that a person’s position in a network is correlated with other personal attributes (e.g., wealth, age, location), then we can take stratified random samples of people who have different attributes.
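As a rough illustration of stratified sampling, the base R sketch below draws one ego from every group instead of drawing egos at random. It does not call `egonetworks`; the `nodes` data frame and the helper `pick_one` are hypothetical stand-ins for the node attributes you would know in advance.

```r
set.seed(777)

# Hypothetical population: 30 nodes assigned to 4 groups
n <- 30
groups <- 4
nodes <- data.frame(
  id = seq_len(n),
  group = sample(seq_len(groups), size = n, replace = TRUE)
)

# Draw one id from a vector of ids; indexing via sample.int() avoids the
# surprise that sample(x, 1) treats a single number x as the range 1:x
pick_one <- function(ids) ids[sample.int(length(ids), 1)]

# One ego per group that is actually present in the population
strat_egos <- vapply(split(nodes$id, nodes$group), pick_one, integer(1))
strat_egos
```

A vector like `strat_egos` could then be supplied through `egoIds` with `select_egos = TRUE`, provided the group labels used for stratification match the `group` attribute inside the simulated network.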

## A larger network

Now that we have a sense of how this function behaves, let’s take it up a notch.

In reality, you probably wouldn’t use an egocentric approach for a population of `30`. More likely you’d want to use it for a much larger population, say, `500` nodes. Let’s parameterize this.

First, we need to build some intuition. Start by simulating a network with `500` nodes using the default `params` and then calculating the density of the complete network.

```
eg <- egonetworks(N = 500)
graph.density(asIgraph(eg[[1]]))
```

```
## [1] 0.03502204
```

This is much too dense. Most real world networks are more sparse – a topic I discuss in this blog post. The `sparsepoint` for a network with 500 nodes is:

```
sparsepoint <- function(n, directed = FALSE) {
  # density at which the number of edges equals the number of nodes
  if (directed == FALSE) {
    n / (n * (n - 1) / 2)
  } else if (directed == TRUE) {
    n / (n * (n - 1))
  } else {
    print('Must be TRUE or FALSE.')
  }
}
sparsepoint(500, directed = FALSE)
```

```
## [1] 0.004008016
```
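Reading off the function, the `sparsepoint` is the density a network would have if it contained exactly as many edges as nodes. For the undirected case the expression simplifies as follows, where `\(n\)` is the number of nodes and `\(d^{*}\)` is my own shorthand for the sparsepoint:

$$
d^{*} = \frac{n}{n(n-1)/2} = \frac{2}{n-1},
\qquad
d^{*}(500) = \frac{2}{499} \approx 0.004008
$$

At this density the mean degree is `\(2n/n = 2\)`, so each node has two partners on average, regardless of how large the network is.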

So to improve the simulation, we should set more extreme `params`.

```
eg <- egonetworks(N = 500, params = c(-8,4))
graph.density(asIgraph(eg[[1]]))
```

```
## [1] 0.004857715
```

Much better. The visual difference is striking.

```
par(mfrow=c(1,2))
gplot(egonetworks(500, params = c(-3.5,3))[[1]] )
gplot(egonetworks(500, params = c(-8,4))[[1]])
```

Now that we have some more reasonable parameters, let’s sample different quantities of `N_egos` and see how well we recover the complete network. We’ll try `3`, `10`, `35`, and `60`. For brevity, I’ve omitted most of the plotting code, but an example is shown below.

```
params <- c(-8, 4)
mypal <- c('#3300ff', 'magenta', '#00000033')
p1 <-
  egonetworks(
    N = 500,
    params = params,
    N_egos = 3)[[2]] %>%
  ggraph() +
  geom_edge_link0(aes(color = Neighborhood)) +
  geom_node_point(aes(color = Neighborhood, shape = IsEgo), size = 1.5) +
  theme_void() + theme(legend.position = 'none') +
  scale_edge_color_manual(values = mypal) +
  scale_color_manual(values = mypal) +
  ggtitle('N_egos = 3')
```

As we increase the number of egos, we get much closer to the complete network. We also end up sampling some isolates by chance. But this qualitative approach isn’t all that satisfying. We need to make some quantitative comparisons between the complete network and the merged egocentric networks.

To do this, we first need to take all of the egonetworks (i.e., `egonetworks()[[3]]`) and combine them into a single network. Then we will look at how the structural characteristics compare between the complete and egocentric versions.

## Merging the egonetworks

There may be some fancy `igraph` ways to combine these networks, but I think the most straightforward approach is to get the edgelist for every network, use `rbind` to combine them, and then eliminate any duplicate edges.

I’m going to do this once with 10 egos and once with 60 egos. Since each egonetwork is stored in a list, I’ll use `lapply` to call `get.edgelist` on each one and convert all of the edgelists into `data.frame` objects so that they can be stacked using `dplyr::bind_rows`.

```
params <- c(-8, 4)
egonet10 <- egonetworks(N = 500, params = params, N_egos = 10)[[3]]
egonet60 <- egonetworks(N = 500, params = params, N_egos = 60)[[3]]
# add names to egonetworks, using each node's id in the complete network
for (i in seq_along(egonet10)) {
  V(egonet10[[i]])$name <- V(egonet10[[i]])$vid
}
for (i in seq_along(egonet60)) {
  V(egonet60[[i]])$name <- V(egonet60[[i]])$vid
}
# get edgelists and stack them into one data frame per sample
el10 <- dplyr::bind_rows(
  lapply(
    lapply(egonet10, get.edgelist), as.data.frame))
el60 <- dplyr::bind_rows(
  lapply(
    lapply(egonet60, get.edgelist), as.data.frame))
# graph the edgelist without duplicates
# (graph.data.frame() builds a directed graph unless directed = FALSE is passed)
g10 <- graph.data.frame(el10[ !duplicated(el10), ])
g60 <- graph.data.frame(el60[ !duplicated(el60), ])
```
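The stack-then-deduplicate step can be sketched on a toy example in base R. The two small edgelists below are invented for illustration, and the canonical ordering with `pmin` and `pmax` is an extra safeguard of my own: in an undirected network the pair `3, 2` is the same tie as `2, 3`, and `duplicated` alone would not catch it.

```r
# Two overlapping toy ego edgelists (node ids are made up)
el_a <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3))
el_b <- data.frame(from = c(3, 3, 4), to = c(2, 4, 5)) # 3-2 repeats 2-3

# Stack the edgelists
merged <- rbind(el_a, el_b)

# Put every undirected edge in a canonical order (smaller id first)
canon <- data.frame(
  from = pmin(merged$from, merged$to),
  to = pmax(merged$from, merged$to)
)

# Drop repeated edges
unique_edges <- canon[!duplicated(canon), ]
nrow(unique_edges)
```

Of the six stacked rows, five distinct ties remain once the reversed duplicate is removed.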

## Network comparisons

How do network statistics calculated on the compiled egonetworks compare to those of the full network?

```
##    Network N_nodes     Density AvgPathLength Connectedness
## 1 Complete     500 0.004857715      7.705835     0.7498677
## 2   60Egos     289 0.004060938      1.775687     0.9793829
## 3   10Egos      91 0.010866911      1.617834     0.3789988
```
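One caveat when reading the density column: `graph.data.frame` creates a directed graph unless `directed = FALSE` is passed, so the merged networks may have been scored against the directed denominator `n(n-1)` while the complete network used `n(n-1)/2`. The base R sketch below checks this arithmetically. The edge counts (606, 338, and 89) are not reported in this post; they are my back-calculation from the table, so treat them as assumptions.

```r
# Density under the two conventions: m edges among n nodes
density_undirected <- function(m, n) m / (n * (n - 1) / 2)
density_directed <- function(m, n) m / (n * (n - 1))

# Complete network, undirected denominator
density_undirected(606, 500)

# Merged egonetworks, directed denominator
density_directed(338, 289) # 60 egos
density_directed(89, 91)   # 10 egos

# The same merged networks treated as undirected are exactly twice as dense
density_undirected(338, 289)
density_undirected(89, 91)
```

If those edge counts are right, the three directed and undirected values above reproduce the table's density column, and rebuilding `g10` and `g60` with `directed = FALSE` would be worth doing before comparing them with the complete network. Path-based measures such as average path length are also sensitive to edge direction.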

# Concluding thoughts

- Egocentric networks are capable of estimating the structure of large networks with much lower research effort.
- Network simulation and the `egonetworks` function are tools for anticipating how much research effort is needed. They can be used like a power analysis.
- Theory and ethnographic knowledge are necessary for adequate research design, but relatively little prior knowledge is needed to come up with useful simulations.

We have not touched on inferential statistics with `ergm.ego`, but this package is very useful for developing statistical models on egocentrically sampled data. We will cover this in a future post.

# References

Fu, Yang-chih. 2007. “Contact Diaries: Building Archives of Actual and Comprehensive Personal Networks.” *Field Methods* 19 (2): 194–217.

Heckathorn, Douglas D. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” *Social Problems* 44 (2): 174–99.

———. 2002. “Respondent-Driven Sampling II: Deriving Valid Population Estimates from Chain-Referral Samples of Hidden Populations.” *Social Problems* 49 (1): 11–34.

Krivitsky, Pavel N, and Martina Morris. 2017. “Inference for Social Network Models from Egocentrically Sampled Data, with Application to Understanding Persistent Racial Disparities in HIV Prevalence in the US.” *The Annals of Applied Statistics* 11 (1): 427.

Laumann, Edward O, Peter V Marsden, and David Prensky. 1989. “The Boundary Specification Problem in Network Analysis.” *Research Methods in Social Network Analysis* 61: 87.

Marsden, Peter V. 2002. “Egocentric and Sociocentric Measures of Network Centrality.” *Social Networks* 24 (4): 407–22.

Ready, Elspeth, Patrick Habecker, Roberto Abadie, Carmen A Dávila-Torres, Angélica Rivera-Villegas, Bilal Khan, and Kirk Dombrowski. 2020. “Comparing Social Network Structures Generated Through Sociometric and Ethnographic Methods.” *Field Methods* 32 (4): 416–32.