In the SEO world, some shops use the results-to-searches ratio, a.k.a. the “R/S Ratio,” to help decide which keywords to pursue. This metric compares the total number of results Google returns for a keyword to that keyword’s monthly search volume. Google has a tool that takes a keyword as input, finds related keywords (around 50-100), and outputs the monthly search volume for each one in CSV format. To find R/S Ratios the traditional way, a human then searches each keyword to find its total result count; this is a tedious and time-consuming task. I wrote a program that automates most of the human involvement needed to find R/S Ratios for a batch of keywords, accomplishing in minutes a task that could take a human hours.
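To make the metric concrete, here is a minimal sketch of the calculation in C#. The class name, method name, and sample numbers are made up for illustration and are not taken from the original program; returning null for a zero search volume is also just one possible choice.

```csharp
using System;

public static class RsRatio
{
    // R/S Ratio: total Google results divided by monthly search volume.
    // Returns null when the search volume is zero or negative,
    // because the ratio is undefined in that case (an assumed convention).
    public static double? Compute(long totalResults, long monthlySearches)
    {
        if (monthlySearches <= 0) return null;
        return (double)totalResults / monthlySearches;
    }

    public static void Main()
    {
        // Hypothetical numbers for illustration only.
        double? ratio = Compute(250000, 5000); // 250000 / 5000 = 50
        Console.WriteLine(ratio);
    }
}
```

With a result count and a search volume in hand for each keyword, the whole batch can be ranked by this one number.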
What it does: SEO Scraper scrapes Google to find the R/S ratio of many related keywords
Technologies used: I wrote this program in C#.net using Windows Forms
Challenges: Google’s anti-scraping measures
Why: Proof of concept, to learn C#.net
Finding the R/S Ratio by hand is tedious and repetitive: a person has to search each keyword in Google to find the total number of results, copy and paste that number into a spreadsheet, copy and paste the search volume into the spreadsheet, then do some math to find the R/S Ratio. I wrote a program to take over this task. The program was simple: it took the CSV file that Google supplied, pulled out the relevant keywords and search volumes, searched Google for each keyword (including inurl, intitle, etc. searches), and scraped out the result count using a regular expression. It then calculated the R/S Ratio and wrote all the relevant information out to a convenient Excel file.
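The scraping step can be sketched roughly as follows. This is not the original code: the pattern assumes the page contains text like "About 1,230,000 results", which is an assumption about Google's markup that may not hold today, and the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ResultCountParser
{
    // Assumed format: "About 1,230,000 results" or "1 result".
    // The real page markup may differ and changes over time.
    private static readonly Regex CountPattern =
        new Regex(@"(?:About\s+)?([0-9,]+)\s+results?", RegexOptions.IgnoreCase);

    // Returns the result count found in the page text, or null if none is found.
    public static long? Parse(string html)
    {
        Match m = CountPattern.Match(html);
        if (!m.Success) return null;

        // Strip thousands separators before converting to a number.
        string digits = m.Groups[1].Value.Replace(",", "");
        return long.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out long n)
            ? n
            : (long?)null;
    }

    public static void Main()
    {
        // Made-up snippet standing in for a downloaded results page.
        string page = "<div>About 1,230,000 results (0.42 seconds)</div>";
        Console.WriteLine(Parse(page));
    }
}
```

Returning null instead of throwing lets the caller skip a keyword whose page did not parse and keep going with the rest of the batch.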
The hardest part of this project was getting at Google with scraping tools. The first problem I ran into was that Google returned a 403 error when I tried to retrieve the HTML using a web agent. I quickly solved this by changing the user-agent property of the web agent. After that change, I was able to scrape continuously for about 10 seconds (probably around 200 requests) before receiving another 403 error from Google. This time, they forbade my scraper because it was working too fast. After some research and trial and error, I realized that I could get a cookie from Google that would let me scrape some more if I filled out a CAPTCHA. The web agent I was using could not display that CAPTCHA, so I ended up driving Internet Explorer through .NET to display it. Internet Explorer stayed invisible unless the program needed a new cookie to continue scraping.
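As an illustration of the user-agent fix and the 403 check, here is a small sketch. It uses HttpRequestMessage from System.Net.Http, which is not necessarily the web agent the original program used; the User-Agent string, class name, and method names are all assumptions. The example only builds the request and does not send anything.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ScrapeRequest
{
    // Hypothetical browser-like User-Agent; the value used originally is not known.
    public const string BrowserUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Builds a search request that identifies itself with a browser-like User-Agent.
    public static HttpRequestMessage Build(string keyword)
    {
        string url = "https://www.google.com/search?q=" + Uri.EscapeDataString(keyword);
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.UserAgent.ParseAdd(BrowserUserAgent);
        return request;
    }

    // A 403 (Forbidden) is the signal described above that the scraper was refused.
    public static bool IsBlocked(HttpStatusCode status)
    {
        return status == HttpStatusCode.Forbidden;
    }

    public static void Main()
    {
        HttpRequestMessage request = Build("seo tools");
        Console.WriteLine(request.RequestUri.AbsoluteUri);
        Console.WriteLine(request.Headers.UserAgent);
        Console.WriteLine(IsBlocked(HttpStatusCode.Forbidden));
    }
}
```

In a scraper built this way, a blocked response would be the point at which to stop and obtain a fresh cookie before continuing.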
I wrote the program to learn C#.net and Windows Forms. Before this project I had been using Java Swing; Windows Forms look so much better and are very easy to put together. C#.net was a very easy transition to make from Java, and the .NET library is very powerful and has a helpful API.
By getting past Google’s anti-scraping mechanism, I ended up with a program that completed in minutes a task that would normally take a human many hours. It was fun going head to head with Google’s anti-scraping engineers. I decided not to distribute this program because I am not sure of the legality of selling a program like this, and it could be a pain to maintain as Google inevitably makes changes. It was a fun proof of concept and is one of my favorite programs to discuss.