The Webgraph package is still under development, so some features may not work yet.

In this article, we will introduce you to the basics of Webgraph.

First, install the Webgraph package with the install_github() function from the devtools package:

library(devtools)
#> Loading required package: usethis
install_github("Paulogcd/Webgraph", quiet = TRUE)
#> Installing 1 packages: cpp11

How to make edgelists

The main feature of the Webgraph package is creating edgelists of the connections between webpages. Several options are available, depending on how many webpages you are collecting data on:

  1. Several webpages
  2. Several webpages around one target
  3. One webpage
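To make the term concrete, here is a minimal base R sketch of what an edgelist of webpage connections can look like: a two-column data frame with one row per link. The column names `from` and `to` and the example.com URLs are assumptions made for illustration; the columns actually returned by Webgraph may differ.

```r
# Hypothetical edgelist: each row is one link from a source page to a target page.
# The column names and URLs are illustrative assumptions, not Webgraph output.
edges <- data.frame(
  from = c("http://example.com",
           "http://example.com",
           "http://example.com/about"),
  to   = c("http://example.com/about",
           "http://example.com/contact",
           "http://example.com"),
  stringsAsFactors = FALSE
)

# Every distinct page that appears as a source or a target.
pages <- unique(c(edges$from, edges$to))

# Number of outgoing links recorded for each source page.
out_links <- table(edges$from)

print(edges)
print(pages)
print(out_links)
```

A data frame of this shape is what igraph's graph_from_data_frame() expects: its first two columns are read as the source and target of each edge.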

Several webpages

If you want to collect the data of connections between several webpages, it is recommended that you use the edgelists_of() function.

This function still needs to be implemented.

Several webpages around one target

If you want to collect the data of the network formed by the connections of pages around one main target, the recommended function is network_from_webpage(). Currently, only the prototype network_from_webpage1() exists. It takes the target as a parameter, along with an iteration value: the number of levels you want the scraping process to follow. You can use it as follows:

library(Webgraph)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
target <- "http://google.com"
n <- network_from_webpage1(target, iteration = 2)
#> Time for edgelist_of http://google.com: 0.109718084335327 seconds.
#> Time for edgelist_of https://www.google.com/imghp?hl=en&tab=wi: 0.138445615768433 seconds.
#> Time for edgelist_of http://maps.google.com/maps?hl=en&tab=wl: 0.240963459014893 seconds.
#> Time for edgelist_of https://play.google.com/?hl=en&tab=w8: 0.524559736251831 seconds.
#> Time for edgelist_of https://www.youtube.com/?tab=w1: 0.191280841827393 seconds.
#> Time for edgelist_of https://news.google.com/?tab=wn: 0.238469362258911 seconds.
#> Time for edgelist_of https://mail.google.com/mail/?tab=wm: 0.537929534912109 seconds.
#> Time for edgelist_of https://drive.google.com/?tab=wo: 0.547325372695923 seconds.
#> Time for edgelist_of https://www.google.com/intl/en/about/products?tab=wh: 0.194056749343872 seconds.
#> Time for edgelist_of http://www.google.com/history/optout?hl=en: 0.741497039794922 seconds.
#> Time for edgelist_of http://google.com/preferences?hl=en: 0.120362997055054 seconds.
#> Time for edgelist_of https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/&ec=GAZAAQ: 0.297790288925171 seconds.
#> Time for edgelist_of http://google.com/advanced_search?hl=en&authuser=0: 0.312436580657959 seconds.
#> Time for edgelist_of http://google.com/intl/en/ads/: 0.241690874099731 seconds.
#> Time for edgelist_of http://google.com/services/: 0.0728864669799805 seconds.
#> Time for edgelist_of http://google.com/intl/en/about.html: 0.179411888122559 seconds.
#> Time for edgelist_of http://google.com/intl/en/policies/privacy/: 0.0720691680908203 seconds.
#> Time for edgelist_of http://google.com/intl/en/policies/terms/: 0.0735063552856445 seconds.
#> Time for network_from_webpage1 : 4.84938979148865 seconds.
g <- graph_from_data_frame(n)
#> Warning in graph_from_data_frame(n): In `d' `NA' elements were replaced with
#> string "NA"
plot(g,
     layout=layout_with_fr,
     vertex.size=4,
     vertex.label=NA,
     vertex.label.dist=0.5,
     vertex.color="red",
     edge.arrow.size=0.5)
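The timing log above is consistent with a level-by-level crawl: the target is scraped first, then each page it links to. The following base R sketch illustrates that idea on a toy in-memory link table, so it runs without network access. It is not Webgraph's implementation; `toy_links` and `crawl_levels()` are hypothetical names introduced here for illustration.

```r
# Hypothetical in-memory "web": each page maps to the pages it links to.
toy_links <- list(
  A = c("B", "C"),
  B = c("D"),
  C = c("A", "D"),
  D = character(0)
)

# Illustrative level-by-level crawl (an assumption about the general idea,
# not the code behind network_from_webpage1()).
crawl_levels <- function(target, iteration, links = toy_links) {
  edges <- data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE)
  visited <- character(0)
  frontier <- target
  for (level in seq_len(iteration)) {
    next_frontier <- character(0)
    for (page in frontier) {
      visited <- c(visited, page)
      out <- links[[page]]  # NULL or empty when the page has no known links
      if (length(out) > 0) {
        edges <- rbind(edges,
                       data.frame(from = page, to = out,
                                  stringsAsFactors = FALSE))
        next_frontier <- c(next_frontier, out)
      }
    }
    # Only pages not scraped yet are visited at the next level.
    frontier <- setdiff(unique(next_frontier), visited)
  }
  edges
}

one_level  <- crawl_levels("A", iteration = 1)
two_levels <- crawl_levels("A", iteration = 2)
print(one_level)
print(two_levels)
```

With one level, only the links found on the target itself are recorded. With two levels, the links found on each of those pages are added as well, which is why the number of scraped pages grows quickly with the iteration value.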

One webpage

If you only want to collect the links of a single page, you can use the edgelist_of() function or network_from_webpage1() with iteration = 1. The examples below use graph_from_webpage(), whose timing output shows a call to edgelist_of(), and network_from_webpage1():

target <- "http://google.com"

# Function graph_from_webpage() :

g1 <- graph_from_webpage(target)
#> Time for edgelist_of http://google.com: 0.0806617736816406 seconds.
#> Time for graph_from_webpage : 0.0823814868927002seconds.
plot(g1,
     layout=layout_with_fr,
     vertex.size=4,
     vertex.label=NA,
     vertex.label.dist=0.5,
     vertex.color="red",
     edge.arrow.size=0.5)


# Function network_from_webpage1 :
n2 <- network_from_webpage1(target, iteration = 1)
#> Time for edgelist_of http://google.com: 0.0753128528594971 seconds.
#> Time for network_from_webpage1 : 0.0757803916931152 seconds.
g2 <- graph_from_data_frame(n2)
plot(g2,
     layout=layout_with_fr,
     vertex.size=4,
     vertex.label=NA,
     vertex.label.dist=0.5,
     vertex.color="red",
     edge.arrow.size=0.5)
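The graph_from_data_frame() warning in the two-level example indicates that the edgelist contained NA entries, which igraph turns into a vertex literally named "NA". If that extra vertex is unwanted, one option is to drop incomplete rows before building the graph. The sketch below uses a toy data frame; its column names and URLs are assumptions, and whether dropping such rows is appropriate depends on why the scraper produced them.

```r
# Toy edgelist with one missing target; column names and URLs are hypothetical.
n_toy <- data.frame(
  from = c("http://example.com",
           "http://example.com",
           "http://example.com/a"),
  to   = c("http://example.com/a",
           NA,
           "http://example.com"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose first two columns (source and target) are both present.
n_clean <- n_toy[complete.cases(n_toy[, 1:2]), ]

print(n_clean)
```

The cleaned data frame can then be passed to graph_from_data_frame() in the same way as n and n2 above.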