Using R with SQL

R
Author

Burak Demirtas

Published

July 1, 2023

If you have ever stepped outside the comfort zone of practice datasets, you have already faced the hard truth that data does not come easy. Usually you need dedicated data pipelines and queries against relational databases to build the basis of your analysis out of a haystack of millions of rows.

Luckily, there is a simple yet powerful query language called SQL which makes it easier to pull exactly the data we want to analyze. And of course, R can run SQL queries and connect directly to many kinds of databases to fetch the data.

In this article, we will explore these simple yet very powerful features of R! 😎👍

Needed Libraries

# Loading the libraries-----------------------------------------------------
if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
p_load(DBI, dbplyr, RMariaDB, rstudioapi, tidyverse)

Opening a Connection

To create a connection to a database, we use the dbConnect() function from the DBI package.

In R, when we call dbConnect() and assign the result to a variable like conn, we create a connection object. For example, to connect to a MariaDB database, we can write:

conn <- dbConnect(RMariaDB::MariaDB(),
    host = "127.0.0.1",   # write your host address
    dbname = "test2",     # write your database name here
    user = rstudioapi::askForPassword("Database username:"),
    password = rstudioapi::askForPassword("Database password:")
)
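
Once the connection is created, a quick sanity check confirms the object is alive before we start querying. A minimal sketch using DBI's dbIsValid():

# Check that the connection object is usable
dbIsValid(conn)    # TRUE while the connection is open
class(conn)        # "MariaDBConnection"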

Listing Database Contents

Listing the tables

# List tables in the database
dbListTables(conn)
[1] "burak"         "exampletable"  "homes"         "parts"        
[5] "travelbooking"

Listing the fields (or columns) in a table:

# List columns in customer table
dbListFields(conn, "homes")
[1] "HomeID"          "HomeType"        "HomeDescription" "Price"          
[5] "DateAvailable"   "NumberofBeds"    "AgentID"        

Data Wrangling and Query Operations

Creating a Data Frame in R by using SQL

We can run SQL code in R by using the dbGetQuery() function from the DBI package. To keep the code tidy, we can store the SQL statement in a character variable (let’s call it query) and then pass it to dbGetQuery().

# Saving the SQL code in a character variable:
query <- "SELECT * FROM homes;"
# Sending the query to the database over our open connection and saving the results
results <- dbGetQuery(conn, query)
results
     HomeID      HomeType            HomeDescription Price DateAvailable
1 X23465743 Single Family Great location, at reasona  2100    2023-07-10
2 X23465744 Single Family                       test  1890    2023-10-10
  NumberofBeds AgentID
1            3    1272
2            3    1272
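
For very large result sets, the same query can be fetched in chunks with DBI's dbSendQuery() / dbFetch() / dbClearResult() functions instead of pulling everything at once. A minimal sketch:

# Send the query to the server without fetching anything yet
res <- dbSendQuery(conn, "SELECT * FROM homes;")
# Fetch the first 100 rows (call dbFetch() again to stream further chunks)
chunk <- dbFetch(res, n = 100)
# Always release the result set when done
dbClearResult(res)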

Saving a Data Frame as a table back to the database

We can save a data frame in R as a table in the database as well.

We can use the function dbWriteTable(connection, tablename, dataframe) to create a table in the database. If a table with the same name already exists, it will throw an error by default.

# Create a vector of names
names <- c("John", "Mary", "Peter", "Sarah")
# Create a vector of ages
ages <- c(25, 30, 35, 40)
# Create a dataframe
df <- data.frame(names = names, ages = ages)
# Print the dataframe
df
# Write the df as a new table to the database with the name: ExampleTable
dbWriteTable(conn, "ExampleTable", df)
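
If the table already exists, dbWriteTable() can replace or extend it instead of erroring out; the overwrite and append arguments below are standard DBI options. A minimal sketch:

# Replace the existing table entirely with the contents of df
dbWriteTable(conn, "ExampleTable", df, overwrite = TRUE)
# Or append the rows of df to the existing table instead
dbWriteTable(conn, "ExampleTable", df, append = TRUE)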

Adding a new record into an existing table

To add, delete, update, or perform any other data manipulation, we can use dbExecute(connection, sql_statement) as our SQL runner.

myquery <- "
INSERT INTO ExampleTable (names, ages)
VALUES ('Leo', 3);
"
dbExecute(conn, myquery)    # Execute the code
[1] 1
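
dbExecute() also accepts bound parameters through its params argument, which avoids pasting values into the SQL string by hand. A minimal sketch (the 'Mia' record is just an illustrative value):

# Insert a row using a parameterized query; the ? placeholders are filled from params
dbExecute(conn,
          "INSERT INTO ExampleTable (names, ages) VALUES (?, ?);",
          params = list("Mia", 4))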

Seeing all tables and their fields in a single view

It’s always useful to know which tables and fields exist in our database. We can collect all of them into a single data frame with the code below:

## TO ONLY GET THE TABLE NAMES AS A TABLE

# Get all the table names
table_names <- dbListTables(conn)
# Get all the field names of each table and create a list with lapply
table_fields <- lapply(table_names, function(table) {
  query <- paste("SHOW COLUMNS FROM", table)
  result <- dbGetQuery(conn, query)
  result$Field
})
# Get the maximum number of fields in any table
max_fields <- max(sapply(table_fields, length))
# Create an empty data frame for tables and fields
tables_and_fields <- data.frame(matrix(ncol = length(table_names), 
                                       nrow = max_fields))
# Give the column names
colnames(tables_and_fields) <- table_names

# Write the field names
for (i in 1:length(table_fields)) {
  tables_and_fields[1:length(table_fields[[i]]), i] <- table_fields[[i]]
}

tables_and_fields
     burak exampletable           homes           parts      travelbooking
1 buraksID        names          HomeID          PartID          BookingID
2     <NA>         ages        HomeType        PartName PassengerFirstName
3     <NA>         <NA> HomeDescription PartDescription   PassenerLastName
4     <NA>         <NA>           Price           Price         TravelDate
5     <NA>         <NA>   DateAvailable       DateStart      OriginCountry
6     <NA>         <NA>    NumberofBeds        VendorID DestinationCountry
7     <NA>         <NA>         AgentID            <NA>     NumTravelParty
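
As a side note, on MariaDB/MySQL the same overview can be pulled with a single query from INFORMATION_SCHEMA instead of looping over SHOW COLUMNS. A minimal sketch, assuming the test2 schema used in the connection above:

# One query returning table, column, type and key information for every column
schema_overview <- dbGetQuery(conn, "
  SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COLUMN_KEY
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_SCHEMA = 'test2'
  ORDER BY TABLE_NAME, ORDINAL_POSITION;
")
head(schema_overview)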

If we also want to see the data structure, we can add each field's data type next to the field name:

# Get all the table names
table_names <- dbListTables(conn)

# Get all the field names and data types of each table and create a list
table_fields <- lapply(table_names, function(table) {
  query <- paste("SHOW COLUMNS FROM", table)
  result <- dbGetQuery(conn, query)
  field_names <- paste(result$Field, toupper(result$Type), sep = " - ")
  
  # Add primary key indicator
  field_names[result$Key == "PRI"] <- 
    paste(field_names[result$Key == "PRI"], "PRI", sep = " - ")
  
  # Add foreign key indicator: SHOW COLUMNS flags non-unique indexed columns
  # (typically foreign keys) as "MUL" in the Key column
  fk_fields <- result$Field[result$Key == "MUL"]
  field_names[match(fk_fields, result$Field)] <- 
    paste(field_names[match(fk_fields, result$Field)], "FK", sep = " - ")
  
  field_names
})

# Get the maximum number of fields in any table
max_fields <- max(sapply(table_fields, length))

# Create an empty data frame for tables and fields
tables_and_fields <- data.frame(matrix(ncol = length(table_names), nrow = max_fields))

# Give the column names
colnames(tables_and_fields) <- table_names

# Write the field names and data types
for (i in 1:length(table_fields)) {
  tables_and_fields[1:length(table_fields[[i]]), i] <- table_fields[[i]]
}

# Print the resulting data frame
tables_and_fields
                 burak       exampletable                          homes
1 buraksID - INT - PRI names - VARCHAR(5)        HomeID - CHAR(10) - PRI
2                 <NA>      ages - DOUBLE         HomeType - VARCHAR(60)
3                 <NA>               <NA> HomeDescription - VARCHAR(200)
4                 <NA>               <NA>                    Price - INT
5                 <NA>               <NA>           DateAvailable - DATE
6                 <NA>               <NA>             NumberofBeds - INT
7                 <NA>               <NA>             AgentID - CHAR(10)
                           parts                    travelbooking
1         PartID - CHAR(8) - PRI            BookingID - INT - PRI
2         PartName - VARCHAR(60) PassengerFirstName - VARCHAR(20)
3 PartDescription - VARCHAR(200)   PassenerLastName - VARCHAR(10)
4                  Price - FLOAT                TravelDate - DATE
5               DateStart - DATE      OriginCountry - VARCHAR(10)
6             VendorID - CHAR(8) DestinationCountry - VARCHAR(10)
7                           <NA>             NumTravelParty - INT

Using “Lazy Loading”

Instead of writing SQL and running it on the server ourselves, we can create an object that mimics a data frame in our local environment. The beauty of this method is that we don't need to download the content of the table up front: the data stays on the server and is only fetched when we actually need it.

For this, we use the tbl() function (from dplyr, with the dbplyr backend) to create a reference to a table.

# Let's create different objects and see the structures
p <- tbl(conn, "exampletable")
# Look at the structure of this object:
str(p)
List of 2
 $ names:List of 2
  ..$ con  :Formal class 'MariaDBConnection' [package "RMariaDB"] with 7 slots
  .. .. ..@ ptr                   :<externalptr> 
  .. .. ..@ host                  : chr "127.0.0.1"
  .. .. ..@ db                    : chr "test2"
  .. .. ..@ load_data_local_infile: logi FALSE
  .. .. ..@ bigint                : chr "integer64"
  .. .. ..@ timezone              : chr "UTC"
  .. .. ..@ timezone_out          : chr "UTC"
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_MariaDBConnection" "src_dbi" "src_sql" "src"
 $ ages :List of 5
  ..$ x         : 'ident' chr "exampletable"
  ..$ vars      : chr [1:2] "names" "ages"
  ..$ group_vars: chr(0) 
  ..$ order_vars: NULL
  ..$ frame     : NULL
  ..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
 - attr(*, "class")= chr [1:5] "tbl_MariaDBConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

Looking at the details, we have only created a kind of ‘mapping’ to the data source. It carries classes like "lazy_base_remote_query", "lazy_base_query" and "lazy_query", which indicate that no data will be downloaded until we actually try to access it. Let's try:

# Get the first 3 rows
head(p,3)
# Source:   SQL [3 x 2]
# Database: mysql  [root@127.0.0.1:NA/test2]
  names  ages
  <chr> <dbl>
1 John     25
2 Mary     30
3 Peter    35

Et voilà! It works and behaves like a regular data frame in R!
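
Because p is a lazy table, ordinary dplyr verbs applied to it are translated into SQL and only run on the server when we ask for results. A minimal sketch; show_query() prints the generated SQL without fetching any data:

# Build a query lazily with dplyr verbs; nothing is executed yet
older <- p %>%
  filter(ages > 20) %>%
  select(names, ages)
# Inspect the SQL that dbplyr generated for the database
show_query(older)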

But what if I want to download the data and save it in my local environment? We have two options for this:

  1. We can use the as_tibble() function to turn it into a tibble stored in our local environment
  2. We can use the collect() function, which does the same thing as option 1
q <- as_tibble(tbl(conn, "exampletable"))
r <- tbl(conn, "exampletable") %>% collect()

str(q)
tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
 $ names: chr [1:8] "John" "Mary" "Peter" "Sarah" ...
 $ ages : num [1:8] 25 30 35 40 3 3 3 3
str(r)
tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
 $ names: chr [1:8] "John" "Mary" "Peter" "Sarah" ...
 $ ages : num [1:8] 25 30 35 40 3 3 3 3

As you can see, the data is now stored as a good old tibble! 😎 Don't you like tibbles? Here is the plain data frame version for you! (With the object size difference as a bonus 😉)

s <- as.data.frame(r)

str(r)
tibble [8 × 2] (S3: tbl_df/tbl/data.frame)
 $ names: chr [1:8] "John" "Mary" "Peter" "Sarah" ...
 $ ages : num [1:8] 25 30 35 40 3 3 3 3
str(s)
'data.frame':   8 obs. of  2 variables:
 $ names: chr  "John" "Mary" "Peter" "Sarah" ...
 $ ages : num  25 30 35 40 3 3 3 3
print(paste("Size of the tibble: ", object.size(r)))
[1] "Size of the tibble:  1392"
print(paste("Size of the data frame: ", object.size(s)))
[1] "Size of the data frame:  1256"

Closing the Connection

When we are finished, we can disconnect using dbDisconnect():

dbDisconnect(conn)
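
In scripts and functions, it can help to register the disconnect right after connecting, so the connection is released even if something fails along the way. A minimal sketch using base R's on.exit() inside a hypothetical helper function:

get_homes <- function() {
  conn <- dbConnect(RMariaDB::MariaDB(),
                    host = "127.0.0.1", dbname = "test2",
                    user = rstudioapi::askForPassword("Database username:"),
                    password = rstudioapi::askForPassword("Database password:"))
  # Disconnect automatically when the function exits, even on error
  on.exit(dbDisconnect(conn), add = TRUE)
  dbGetQuery(conn, "SELECT * FROM homes;")
}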