Log files and Excel
Let's consider a somewhat realistic use case where you have been provided a number of modified web log files that you want to create some visualizations from.
In Chapter 4, Addressing Big Data Quality, we will discuss data profiling (with regard to data quality), but for now, we'll assume that we know the following about our data files:
- The files are of various sizes and somewhat unstructured.
- The data in the files contain information logged by Internet users.
- The data includes such things as a computer IP address, a date and timestamp, and a web address/URL. There is more information in the files, but for our exercise here we really just want to create a graphical representation showing the number of times each web address was hit during each month (there are software packages that provide web statistics, but we'll suppose you don't have access to any of them).
The following is a sample transaction (record) from one of our files:
221.738.236 - - [15/Oct/2014:6:55:2] GET /cart.do?action=view&itemId=EST-6&productId=SC-MG-G10&JSESSIONID=SD5SL9FF2ADFF4958 HTTP 1.1 200 2334 http://www.rpropgramming.com
One's first inclination might be to load up the files into Microsoft Excel, do some pivoting and filtering, and then prepare some Excel graphs, but these files exceed the capacity of the tool (refer to the following screenshot):
Your next thought might be: this is big data, so head for Hadoop.
Realistically though, although the files are too big for Excel to handle, we are able to manipulate them (if, perhaps, a bit slowly) using our standard business-issue laptop, and even though there are several files our data visualization needs to source from, that is still manageable. So, this exercise really doesn't qualify as big data based upon volume. The data is a bit unstructured, but (notice the sample transaction) it is not at all complicated, nor does it pose a variety constraint.
Finally, for this particular exercise, the files to be used are historic (they cover a previous period of time), so we don't have to worry about velocity. Again, this is not a big data project (although, later in this section, we'll see how it will grow into one).
So, a rudimentary solution option is to use the power of R to manipulate all of the files and create some simple visualizations based upon their content.
For this example, I won't be explaining the details of the R scripts used; I just want to prove the point that R can be used to create interesting visualizations from data files too large for MS Excel. I have included my scripts here for completeness. Note, however, that while the R scripts shown here work, they were created quickly, in an ad hoc fashion, using the R console for MS Windows, and are most likely not the most efficient in style. We will take more time to understand R in Chapter 3, Understanding Your Data Using R.
An R scripting example
My first step was to combine (or bind) my log files into a single comma-delimited text file.
This is simply done in R using the rbind() function:
# Read each weblog file, bind the rows together, and write out a single combined file
datafile1 <- read.csv("C:/Big Data Visualization/weblog1.txt", header = TRUE, sep = ",")
datafile2 <- read.csv("C:/Big Data Visualization/weblog2.txt", header = TRUE, sep = ",")
datafile <- rbind(datafile1, datafile2)
write.csv(datafile, "C:/Big Data Visualization/oneWebLog.txt")
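As an aside, if the number of log files grows, a pattern like the following may scale better than naming each file by hand. This is only a sketch; it assumes all the logs live in the same folder, share a weblog*.txt naming pattern, and have identical layouts:

# Read every matching weblog file in the folder and bind them into one data frame
files <- list.files("C:/Big Data Visualization", pattern = "^weblog.*\\.txt$", full.names = TRUE)
datafile <- do.call(rbind, lapply(files, read.csv, header = TRUE, sep = ","))
write.csv(datafile, "C:/Big Data Visualization/oneWebLog.txt", row.names = FALSE)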
Once I had a single (big!) file, I wanted to pull out only the information I need for my visualization, so again I used a simple R script to create another text file that contains only two columns: the date (actually, just the month name) and a web address. You can see that the date and web address are in columns four and nine, respectively. Note that the fields in our weblog files are separated only by spaces:
# Keep only the date (column 4) and web address (column 9) fields
tmpRTable <- read.table("C:/Big Data Visualization/oneWebLog.txt")
tmpRTable <- tmpRTable[, c(4, 9)]
data.df <- data.frame(tmpRTable)
adata.df <- data.frame(nrow(data.df))
# Build "Mon, URL" records; characters 5 through 7 of the date field hold the month name
for (i in 1:nrow(data.df)) {
  adata.df[i, 1] <- paste0(substr(data.df[i, 1], start = 5, stop = 7), ", ", data.df[i, 2])
}
write.table(adata.df, file = "C:/Big Data Visualization/WebsitesByMonth.txt",
            sep = ",", quote = FALSE, col.names = FALSE, row.names = FALSE)
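For what it's worth, the same transformation can be expressed without an explicit loop, since substr() and paste0() operate on whole columns at once. A minimal vectorized sketch, assuming the same column positions as above:

# Vectorized alternative to the row-by-row loop
tmpRTable <- read.table("C:/Big Data Visualization/oneWebLog.txt")
adata.df <- data.frame(paste0(substr(as.character(tmpRTable[, 4]), 5, 7), ", ", tmpRTable[, 9]))
write.table(adata.df, file = "C:/Big Data Visualization/WebsitesByMonth.txt",
            sep = ",", quote = FALSE, col.names = FALSE, row.names = FALSE)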
When I created my new file, I instructed R not to add row and column headers and not to wrap my fields in quote characters. This will make the file easier to process later.
The following are some sample records from my generated file:
Jun, http://www.readingphilles.com
Sep, http://www.hollywood.com
Sep, http://www.dice.com
Jun, http://www.farming.com
Nov, http://www.wkipedia.com
Aug, http://www.r-project.com
Oct, http://www.rpropgramming.com
Feb, http://www.aa.com
Nov, http://www.farming.com
Next, I created (another) simple script to count the number of website hits by month.
The basic idea is that my file now holds what I call month transactions (shown previously): one transaction (or record) per valid website hit for the month (note that there may be, and are, multiple records for each month).
So, I can now effectively summarize my web log files into a count of hits for each month using the following script:
# Tally the number of records (hits) that begin with each month's abbreviation
tmpTableSpace <- read.table("C:/Big Data Visualization/WebsitesByMonth.txt")
data.df <- data.frame(tmpTableSpace)
M01 <- 0; M02 <- 0; M03 <- 0; M04 <- 0; M05 <- 0; M06 <- 0
M07 <- 0; M08 <- 0; M09 <- 0; M10 <- 0; M11 <- 0; M12 <- 0
for (i in 1:nrow(data.df)) {
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Jan') {M01 <- M01 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Feb') {M02 <- M02 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Mar') {M03 <- M03 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Apr') {M04 <- M04 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'May') {M05 <- M05 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Jun') {M06 <- M06 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Jul') {M07 <- M07 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Aug') {M08 <- M08 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Sep') {M09 <- M09 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Oct') {M10 <- M10 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Nov') {M11 <- M11 + 1}
  if (substr(data.df[i, 1], start = 1, stop = 3) == 'Dec') {M12 <- M12 + 1}
}
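Incidentally, the long if chain above can be collapsed into a single tabulation using R's built-in month.abb constant. A minimal sketch, assuming the WebsitesByMonth.txt layout shown earlier:

# Count hits per month in one pass; factor() keeps all twelve months in calendar order,
# including months with zero hits
data.df <- read.table("C:/Big Data Visualization/WebsitesByMonth.txt", sep = ",")
months <- substr(as.character(data.df[, 1]), 1, 3)
counts <- table(factor(months, levels = month.abb))

Here, counts holds the same twelve totals as M01 through M12.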
Now, I can visualize my data using a pie chart:
slices <- c(M01, M02, M03, M04, M05, M06, M07, M08, M09, M10, M11, M12)
lbls <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
pie(slices, labels = lbls, main = "Pie Chart of Month Hit Counts")
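Pie charts can make similar-sized slices hard to compare, so, if you prefer, the same slices and lbls vectors drive a bar chart just as easily. A minimal variation using base R's barplot():

# Same counts as the pie chart, drawn as bars for easier month-to-month comparison
barplot(slices, names.arg = lbls, main = "Month Hit Counts", xlab = "Month", ylab = "Hits")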
Points to consider
The preceding exercise was simply meant to make the point that not all data visualization projects require going to Hadoop (although some clearly would benefit from it). It is incumbent upon the reader to do the appropriate analysis before choosing one technology over another.
You can now see that large files can be manipulated, and visualizations created, with simple open source options such as R scripting on a machine with moderate resources. Keep in mind, though, that a solution that works may still not be appropriate as an enterprise solution. For example, processing two or three log files once or twice as part of discovery is fine, but using the same process routinely, on files growing in number and size, can quickly become burdensome, inefficient, and ultimately all but impossible.
Now, let's move on to Hadoop!