If you have stepped a little outside the comfort zone of practice datasets, you have already faced the hard truth: data does not come easy. Usually you need dedicated data pipelines and queries against relational databases to build the basis of your analysis from a haystack of millions of rows.
Luckily, there is a simple yet powerful querying language called SQL, which makes it easy to pull exactly the data we want to analyze. And of course, R has tools to run SQL queries and to connect directly to many kinds of databases.
In this article, we will explore these simple, yet very powerful features of R! 😎👍
Needed Libraries
```r
# Loading the libraries ------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
# Install (if needed) and load the packages used throughout this article
pacman::p_load(DBI, RMariaDB, dbplyr, dplyr, rstudioapi)
```
To create a connection to a database, we use the dbConnect() function from the DBI package.
When we call dbConnect() and assign the result to a variable like conn, we create a connection object. For example, we can connect to a MariaDB database as below.
```r
conn <- dbConnect(
  RMariaDB::MariaDB(),
  host   = "127.0.0.1",  # write your host address here
  dbname = "test2",      # write your database name here
  user     = rstudioapi::askForPassword("Database username:"),
  password = rstudioapi::askForPassword("Database password:")
)
```
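If you don't have a MariaDB server at hand, the same DBI workflow can be tried against an in-memory SQLite database. This is a self-contained sketch, assuming the RSQLite package is installed:

```r
library(DBI)

# RSQLite keeps the whole database in memory, so no server is needed
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# The connection object behaves just like the MariaDB one above
valid_before <- dbIsValid(con)  # TRUE while the connection is open
valid_before

dbDisconnect(con)
```

Everything that follows (dbGetQuery, dbWriteTable, dbExecute, ...) works the same way regardless of which backend sits behind the connection.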
We can run SQL queries in R with the dbGetQuery() function from the DBI package. To keep the code tidy, we can store the SQL in a character variable (let's call it query) and pass it to dbGetQuery().
```r
# Save the SQL code in a text variable
query <- "SELECT * FROM homes;"

# Send the query to the database over our open connection and save the results
results <- dbGetQuery(conn, query)
results
```
HomeID HomeType HomeDescription Price DateAvailable
1 X23465743 Single Family Great location, at reasona 2100 2023-07-10
2 X23465744 Single Family test 1890 2023-10-10
NumberofBeds AgentID
1 3 1272
2 3 1272
Saving a Data Frame as a table back to the database
We can save a data frame in R as a table in the database as well.
We can use dbWriteTable(conn, "TableName", dataframe) to create a table in the database. If a table with the same name already exists, the call throws an error.
```r
# Create a vector of names
names <- c("John", "Mary", "Peter", "Sarah")

# Create a vector of ages
ages <- c(25, 30, 35, 40)

# Create a data frame
df <- data.frame(names = names, ages = ages)

# Print the data frame
df

# Write df as a new table to the database with the name "ExampleTable"
dbWriteTable(conn, "ExampleTable", df)
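When the table already exists, DBI's overwrite and append arguments to dbWriteTable() decide what happens instead of erroring. A minimal sketch, using an in-memory SQLite database so it runs without a server (assumes RSQLite is installed):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
df <- data.frame(names = c("John", "Mary"), ages = c(25, 30))

dbWriteTable(con, "ExampleTable", df)                    # creates the table
dbWriteTable(con, "ExampleTable", df, overwrite = TRUE)  # replaces it (2 rows)
dbWriteTable(con, "ExampleTable", df, append = TRUE)     # adds the rows (4 rows)

row_count <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM ExampleTable")
row_count

dbDisconnect(con)
```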
Adding a new record into an existing table
To insert, delete, update, or run any other data-manipulation statement, we can use dbExecute(conn, sql) as our SQL runner.
```r
myquery <- "INSERT INTO ExampleTable (names, ages) VALUES ('Leo', 3);"
dbExecute(conn, myquery)  # Execute the statement
```
[1] 1
dbExecute() returns the number of rows affected, so the [1] 1 above confirms that exactly one row was inserted.
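One caveat: pasting values directly into the SQL string invites quoting problems (and SQL injection). DBI's params argument lets the driver fill placeholders safely. A sketch against an in-memory SQLite database (assumes RSQLite is installed; both SQLite and MariaDB use ? placeholders):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "ExampleTable",
             data.frame(names = character(0), ages = numeric(0)))

# The ? placeholders are filled from params, so quotes or special
# characters in the values cannot break the statement
inserted <- dbExecute(con,
                      "INSERT INTO ExampleTable (names, ages) VALUES (?, ?);",
                      params = list("Leo", 3))

rows <- dbGetQuery(con, "SELECT * FROM ExampleTable")
rows

dbDisconnect(con)
```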
Seeing all tables and fields in a single data frame
It's always useful to see which tables and fields exist in our database. In fact, we can collect all of them into a single data frame with the code below:
```r
## GET ONLY THE TABLE AND FIELD NAMES

# Get all the table names
table_names <- dbListTables(conn)

# Get the field names of each table and create a list with lapply
table_fields <- lapply(table_names, function(table) {
  query  <- paste("SHOW COLUMNS FROM", table)
  result <- dbGetQuery(conn, query)
  result$Field
})

# Get the maximum number of fields in any table
max_fields <- max(sapply(table_fields, length))

# Create an empty data frame for tables and fields
tables_and_fields <- data.frame(matrix(ncol = length(table_names),
                                       nrow = max_fields))

# Give the column names
colnames(tables_and_fields) <- table_names

# Write the field names
for (i in 1:length(table_fields)) {
  tables_and_fields[1:length(table_fields[[i]]), i] <- table_fields[[i]]
}
tables_and_fields
```
If we also want to see the data structure, we can add the field types next to the field names:
```r
# Get all the table names
table_names <- dbListTables(conn)

# Get the field names and data types of each table and create a list
table_fields <- lapply(table_names, function(table) {
  query  <- paste("SHOW COLUMNS FROM", table)
  result <- dbGetQuery(conn, query)
  field_names <- paste(result$Field, toupper(result$Type), sep = " - ")

  # Add a primary key indicator
  field_names[result$Key == "PRI"] <-
    paste(field_names[result$Key == "PRI"], "PRI", sep = " - ")

  # Add a foreign key indicator: SHOW COLUMNS marks non-unique indexed
  # columns with "MUL", which in this schema backs the foreign keys
  field_names[result$Key == "MUL"] <-
    paste(field_names[result$Key == "MUL"], "FK", sep = " - ")

  field_names
})

# Get the maximum number of fields in any table
max_fields <- max(sapply(table_fields, length))

# Create an empty data frame for tables and fields
tables_and_fields <- data.frame(matrix(ncol = length(table_names),
                                       nrow = max_fields))

# Give the column names
colnames(tables_and_fields) <- table_names

# Write the field names and data types
for (i in 1:length(table_fields)) {
  tables_and_fields[1:length(table_fields[[i]]), i] <- table_fields[[i]]
}

# Print the resulting data frame
tables_and_fields
```
burak exampletable homes
1 buraksID - INT - PRI names - VARCHAR(5) HomeID - CHAR(10) - PRI
2 <NA> ages - DOUBLE HomeType - VARCHAR(60)
3 <NA> <NA> HomeDescription - VARCHAR(200)
4 <NA> <NA> Price - INT
5 <NA> <NA> DateAvailable - DATE
6 <NA> <NA> NumberofBeds - INT
7 <NA> <NA> AgentID - CHAR(10)
parts travelbooking
1 PartID - CHAR(8) - PRI BookingID - INT - PRI
2 PartName - VARCHAR(60) PassengerFirstName - VARCHAR(20)
3 PartDescription - VARCHAR(200) PassenerLastName - VARCHAR(10)
4 Price - FLOAT TravelDate - DATE
5 DateStart - DATE OriginCountry - VARCHAR(10)
6 VendorID - CHAR(8) DestinationCountry - VARCHAR(10)
7 <NA> NumTravelParty - INT
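If you only need the names (without types and key indicators), DBI already ships helpers for this: dbListTables() for the tables and dbListFields() for the columns of one table. A quick self-contained sketch with an in-memory SQLite database (assumes RSQLite is installed):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "homes",
             data.frame(HomeID = "X23465743", Price = 2100, NumberofBeds = 3))

dbListTables(con)                     # all table names
fields <- dbListFields(con, "homes")  # field names of a single table
fields

dbDisconnect(con)
```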
Using “Lazy Loading”
Instead of writing SQL and running it on the server, we can also create a connection object that mimics a data frame in our local environment. The beauty of this method is that we don't need to download the table's contents: the data becomes accessible and downloadable only when we actually need it.
For this, we use the tbl() function from dplyr (backed by the dbplyr package) to create a connection to a table.
```r
# Let's create an object and inspect its structure
p <- tbl(conn, "exampletable")

# Look at the structure of this object
str(p)
```
Looking at the details, we have actually only created a kind of 'mapping' to the data source. It carries classes like "lazy_base_remote_query", "lazy_base_query", and "lazy_query", which indicate that no data will be downloaded before we try to access it! Let's try to access it:
```r
# Get the first 3 rows
head(p, 3)
```
# Source: SQL [3 x 2]
# Database: mysql [root@127.0.0.1:NA/test2]
names ages
<chr> <dbl>
1 John 25
2 Mary 30
3 Peter 35
Et voilà! It works and behaves like a regular data frame in R!
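The laziness goes further: dplyr verbs applied to such an object are translated into SQL and only run on the server when the result is needed, and show_query() reveals the generated SQL. A sketch with an in-memory SQLite database (assumes the dplyr, dbplyr, and RSQLite packages are installed):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "exampletable",
             data.frame(names = c("John", "Mary", "Peter", "Sarah"),
                        ages  = c(25, 30, 35, 40)))
p <- tbl(con, "exampletable")

# Nothing is downloaded here: this only builds a query
older <- p %>% filter(ages > 28)

show_query(older)  # prints the SQL that would be sent to the server

older_local <- collect(older)  # only now are the matching rows downloaded
older_local

dbDisconnect(con)
```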
But what if I want to download the data and store it in my local environment? We have two options for this:
1. Use the as_tibble() function to turn it into a tibble saved in our local environment.
2. Use the collect() function, which performs the same operation as option 1.
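The two options can be sketched like this, again with an in-memory SQLite database so the example runs standalone (assumes dplyr, dbplyr, and RSQLite are installed):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "exampletable",
             data.frame(names = c("John", "Mary", "Peter", "Sarah"),
                        ages  = c(25, 30, 35, 40)))
p <- tbl(con, "exampletable")

# Option 1: as_tibble() downloads the data into a local tibble
local1 <- as_tibble(p)

# Option 2: collect() does the same
local2 <- collect(p)

# Compare the in-memory footprint of the lazy reference and the real data
object.size(p)
object.size(local1)

dbDisconnect(con)
```

Both local1 and local2 now live entirely in the local environment, independent of the connection.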
As you can see, the data is now stored as a good old tibble()! 😎 Don't you like tibbles? Then the classic data frame is there for you too! (With even the object-size differences as a bonus 😉)