Alexander Beletsky's Development Blog: 2010-09

Regex to match a words in dictionary on page body

Using a Regex is pretty easy in .NET applications. All you need is the Regex class and a basic understanding of regular expression patterns.

My goal was to write code that answers one question: does a particular text contain some words from a dictionary or not? Regular expressions are an obvious choice when you do this type of operation. Specifically, I was trying to understand which technology a job offer demands (C++, Java or .NET) and whether TDD skill is demanded. To achieve that I created a set of "matchers", small classes, each for its own area. The crawler just uses those matchers to get the actual data.

    protected bool MatchToTdd(string description)
    {
      return new TddMatcher().Match(description);
    }

    protected bool MatchToJava(string description)
    {
      return new JavaMatcher().Match(description);
    }

    protected bool MatchToCpp(string description)
    {
      return new CppMatcher().Match(description);
    }

    protected bool MatchToDotNet(string description)
    {
      return new DotNetMatcher().Match(description);
    }


As you can see, I have 4 matchers to cover my requirements: CppMatcher, DotNetMatcher, JavaMatcher and TddMatcher. All of them implement the simple IMatcher interface.

namespace Crawler.Core.Matchers
{
  public interface IMatcher
  {
    bool Match(string input);
  }
}


Now, let's review a matcher. Because all the matchers do basically the same thing and differ only in their dictionary contents, each one contains a dictionary of target words and delegates the matching to the MatchUtil class. Let's look at the C++ matcher, for instance.

namespace Crawler.Core.Matchers
{
  public class CppMatcher : IMatcher
  {
    private static IList<string> _patterns = new List<string>()
      {
        "c\\+\\+",
        "cpp",
        "stl",
        "cppunit"
      };

    public bool Match(string input)
    {
      return MatchUtil.Match(input, _patterns);
    }
  }
}



I wanted MatchUtil.Match to be as universal as possible and not to depend on the kind of input words. Matching words with the boundary anchor "\b" works perfectly as long as you have simple words like 'java', 'nunit', 'tests' and so on, but my tests started to fail as soon as I tried 'c++' or '.net'. The reason is that '\b' matches only at a position between a word character (letter, digit or underscore) and a non-word character. In my case '+' and '.' are non-word characters, so there is no boundary between a trailing '+' and the following space, or between a space and a leading '.'. That was a problem for me, so I asked StackOverflow for help. I ended up with the implementation below, which I hope could be useful if you do similar stuff.

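using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace Crawler.Core.Matchers
{
  class MatchUtil
  {
    // Returns true if the input contains at least one of the patterns as a separate word.
    // Patterns must already be regex-escaped (for example "c\\+\\+").
    public static bool Match(string input, IList<string> patterns)
    {
      var lower = input.ToLower();
      foreach (var pattern in patterns)
      {
        // "\b" cannot stand in front of a pattern that starts with an escaped dot,
        // because there is no word boundary between a space and '.'.
        var start = pattern.StartsWith("\\.") ? "(?!\\w)" : "\\b";
        // The trailing "(?!\w)" means: the match must not be followed by a word character.
        if (Regex.IsMatch(lower, start + pattern + "(?!\\w)"))
        {
          return true;
        }
      }
      return false;
    }
  }
}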


So, the static Regex.IsMatch method is used to perform the match.
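To make the difference between the two anchors visible, here is a small sketch that compares a plain "\b...\b" match with the lookahead variant used in MatchUtil. The class name BoundaryDemo and both method names are made up for this illustration and are not part of the repository. Patterns are assumed to be regex-escaped already, as in the matcher dictionaries.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the crawler source.
public static class BoundaryDemo
{
    // Naive approach: wrap the (already escaped) pattern in \b anchors.
    public static bool MatchWithWordBoundary(string text, string pattern)
    {
        return Regex.IsMatch(text.ToLower(), "\\b" + pattern + "\\b");
    }

    // Same anchors as MatchUtil: \b (or a lookahead for dot-patterns) in front,
    // and "not followed by a word character" at the end.
    public static bool MatchWithLookahead(string text, string pattern)
    {
        var start = pattern.StartsWith("\\.") ? "(?!\\w)" : "\\b";
        return Regex.IsMatch(text.ToLower(), start + pattern + "(?!\\w)");
    }
}
```

With "strong c++ skills" and the pattern "c\\+\\+", the first method reports no match, because no word boundary exists between '+' and the space, while the second one matches. Both methods still reject 'java' inside 'javascript'.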

That's it. If you see any issues or possible improvements, please let me know. http://github.com/alexanderbeletsky/TddDemand

Crawling a web sites with HtmlAgilityPack

Introduction

This is the first post of a small series in which I'm going to describe the implementation and design of the Crawler that I've built recently for the TDD demand analysis. I will split it up into several parts, covering its major architectural pieces.

  • Part 1 - Crawling a web sites with HtmlAgilityPack
  • Part 2 - Regex to match a words in dictionary on page body
  • Part 3 - EF4 Code First approach to store data


For reference, you can use the source code - http://github.com/alexanderbeletsky/tdd.demand

Warning: it's quite a long post because it contains code examples. If you understand the basic ideas I put here, the best way is to go directly to the repository and read the code, as it is the best explanation material.

Using HtmlAgilityPack

HtmlAgilityPack is one of the great open source projects I have ever worked with. It is an HTML parser for .NET applications that performs well and supports malformed HTML. I successfully used it in one of my projects and really liked it. It has very little documentation, but it is designed so well that you can get a basic understanding just by looking at it in the Visual Studio Object Browser.

So, when you need to deal with HTML in .NET, HtmlAgilityPack is definitely the framework of choice.

I've downloaded the latest version and was very pleased that it now supports Linq to Objects. That makes using HtmlAgilityPack simpler and more fun. I'll give you just a simple idea of how it works. The task of every crawler is to extract some information from a particular HTML page. Say we need to get the inner text from a div element with class "required". We have 2 options here: the classical one, using XPATH, and the brand new one, using Linq to Objects.

XPATH approach

public string GetInnerTestWithXpath()
{
  var document = new HtmlDocument();
  document.Load(new FileStream("test.html", FileMode.Open));
  var node = document.DocumentNode.SelectSingleNode(@"//div[@class=""required""]");
  return node.InnerText;
}

Linq to Objects approach

public string GetInnerTextWithLinq()
{
  var document = new HtmlDocument();
  document.Load(new FileStream("test.html", FileMode.Open));
  var node = document.DocumentNode.Descendants("div").Where(
    d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("required")).SingleOrDefault();
  return node.InnerText;
}

Although I personally like the Linq to Objects approach, sometimes XPATH is more convenient and elegant (especially when you refer to page elements without ids or special attributes).

Loading pages using WebRequest

In the previous example I loaded the page content from a file located on disk. Now our goal is to load pages by URL using HTTP. The .NET Framework has a special WebRequest class for that. I've created a separate class, HtmlDocumentLoader (which implements the IHtmlDocumentLoader interface), that hides all the details inside.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Threading;

namespace Crawler.Core.Model
{
  public class HtmlDocumentLoader : IHtmlDocumentLoader
  {
    private WebRequest CreateRequest(string url)
    {
      var request = (HttpWebRequest)WebRequest.Create(url);
      request.Timeout = 5000;
      request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
      return request;
    }

    public HtmlAgilityPack.HtmlDocument LoadDocument(string url)
    {
      var document = new HtmlAgilityPack.HtmlDocument();
      try
      {
        using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
        {
          document.Load(responseStream, Encoding.UTF8);
        }
      }
      catch(Exception )
      {
        //just do a second try
        Thread.Sleep(1000);
        using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
        {
          document.Load(responseStream, Encoding.UTF8);
        }
      }
      return document;
    }
  }
}

Several comments here. First, you can see that we set the UserAgent property of the request. We make our request look the same as if it came from a Firefox web browser. Some web servers may reject requests from "unknown" agents, so this is a kind of preventive action. Second is how the document object is initialized. As you can see, we have a try/catch block here and simply repeat the same initialization steps in the catch block. It might happen that the web server fails to process the request (for various reasons), so the WebRequest object will throw an exception. We just wait for one second and retry. I've noticed that such a simple approach can really improve the robustness of the crawler.
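The "wait and try again" idea can also be pulled out into a small reusable helper, so the loading steps are not written twice. This is only a sketch under my own assumptions: the Retry class, its Execute method and its parameters are hypothetical names and do not exist in the crawler repository.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the crawler source.
public static class Retry
{
    // Runs 'action' up to 'maxAttempts' times and sleeps for 'delay' after each failure.
    // When the last attempt fails, the exception is rethrown to the caller.
    public static T Execute<T>(Func<T> action, int maxAttempts, TimeSpan delay)
    {
        if (action == null) throw new ArgumentNullException("action");
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                {
                    throw; // out of attempts: let the caller see the original exception
                }
                Thread.Sleep(delay);
            }
        }
    }
}
```

With such a helper, the body of LoadDocument could shrink to a single call like Retry.Execute(() => LoadOnce(url), 2, TimeSpan.FromSeconds(1)), where LoadOnce stands for a hypothetical method containing the request and document.Load steps. Two attempts with a one second delay reproduce the behavior described above.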

Generic Crawler

So, now we know how to load HTML documents by URL using WebRequest, and we also know how to use HtmlAgilityPack to extract data from a document. Now we have to create an engine that automatically goes through the documents, extracts the links to the next portion of data, processes the data and stores it. That is what is called a crawler.

As I implemented and tested several crawlers, I saw that all of them have the same structure and operations and differ only in the particular details of how data is extracted from pages. So I came up with a generic crawler, implemented as an abstract class. If you need to build another crawler, you just inherit from the generic crawler and implement all the abstract operations. Let's see the heart of the crawler, the StartCrawling() method.

protected virtual void StartCrawling()
{
  Logger.Log(BaseUrl + " crawler started...");
  CleanUp();
  for (var nextPage = 1; ; nextPage++)
  {
    var url = CreateNextUrl(nextPage);
    var document = Loader.LoadDocument(url);
    Logger.Log("processing page: [" + nextPage.ToString() + "] with url: " + url);
    var rows = GetJobRows(document);
    var rowsCount = rows.Count();
    Logger.Log("extracted " + rowsCount + " vacations on page");
    if (rowsCount == 0)
    {
      Logger.Log("no more vacancies to process, breaking main loop");
      break;
    }
    Logger.Log("starting to process all vacancies");
    foreach (var row in rows)
    {
      Logger.Log("starting processing div, extracting vacancy href...");
      var vacancyUrl = GetVacancyUrl(row);
      if (vacancyUrl == null)
      {
        Logger.Log("FAILED to extract vacancy href, not stopped, proceed with next one");
        continue;
      }
      Logger.Log("started to process vacancy with url: " + vacancyUrl);
      var vacancyBody = GetVacancyBody(Loader.LoadDocument(vacancyUrl));
      if (vacancyBody == null)
      {
        Logger.Log("FAILED to extract vacancy body, not stopped, proceed with next one");
        continue;
      }
      var position = GetPosition(row);
      var company = GetCompany(row);
      var technology = GetTechnology(position, vacancyBody);
      var demand = GetDemand(vacancyBody);
      var record = new TddDemandRecord()
      {
        Site = BaseUrl,
        Company = company,
        Position = position,
        Technology = technology,
        Demand = demand,
        Url = vacancyUrl
      };
      Logger.Log("new record has been created and initialized");
      Repository.Add(record);
      Repository.SaveChanges();
      Logger.Log("record has been successfully stored to database.");
      Logger.Log("finished to process vacancy");
    }
    Logger.Log("finished to process page");
  }
  Logger.Log(BaseUrl + " crawler has successfully finished");
}

It uses the Loader, Logger and Repository members of the base class. We have already reviewed the Loader functionality. Logger is a simple interface with a Log method (I've created one implementation that writes log messages to the console, which is enough for me), and Repository is something we will review next time.

The GetTechnology and GetDemand methods are the same for all crawlers, so they are part of the generic crawler. The rest of the operations are "site-dependent", so each crawler overrides them.

protected abstract IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document);
protected abstract string CreateNextUrl(int nextPage);
protected abstract string GetVacancyUrl(HtmlAgilityPack.HtmlNode row);
protected abstract string GetVacancyBody(HtmlAgilityPack.HtmlDocument htmlDocument);
protected abstract string GetPosition(HtmlAgilityPack.HtmlNode row);
protected abstract string GetCompany(HtmlAgilityPack.HtmlNode row);
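The shared GetTechnology and GetDemand methods are not listed in this post. As a rough idea of how a shared "which technology is this vacancy about" step might look, here is a purely hypothetical sketch. The TechnologyDetector class, the Detect method, the "Other" fallback and the rule that the first matching entry wins are all my assumptions for illustration. Plain delegates stand in for the real matcher classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the repository's GetTechnology implementation.
public static class TechnologyDetector
{
    // Returns the name of the first technology whose matcher accepts either the
    // position title or the vacancy body, or "Other" when nothing matches.
    public static string Detect(
        string position,
        string body,
        IList<KeyValuePair<string, Func<string, bool>>> matchers)
    {
        var safePosition = position ?? string.Empty;
        var safeBody = body ?? string.Empty;

        foreach (var entry in matchers)
        {
            if (entry.Value(safePosition) || entry.Value(safeBody))
            {
                return entry.Key;
            }
        }
        return "Other";
    }
}
```

Because the first matching entry wins, a vacancy that mentions several technologies is assigned to only one of them, so the order of the list matters in this sketch.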

Here, we'll review one of the crawlers and how it implements all the methods required by the CrawlerImpl class.

namespace Crawler.Core.Crawlers
{
  public class RabotaUaCrawler : CrawlerImpl, ICrawler
  {
    private string _baseUrl = @"http://rabota.ua";
    private string _searchBaseUrl = @"http://rabota.ua/jobsearch/vacancy_list?rubricIds=8,9&keyWords=&parentId=1";

    public RabotaUaCrawler(ILogger logger)
    {
      Logger = logger;
    }

    public void Crawle(IHtmlDocumentLoader loader, ICrawlerRepository context)
    {
      Loader = loader;
      Repository = context;
      StartCrawling();
    }

    protected override string BaseUrl
    {
      get { return _baseUrl; }
    }

    protected override string SearchBaseUrl
    {
      get { return _searchBaseUrl; }
    }

    protected override IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document)
    {
      var vacancyDivs = document.DocumentNode.Descendants("div")
        .Where(d =>
          d.Attributes.Contains("class") &&
          d.Attributes["class"].Value.Contains("vacancyitem"));
      return vacancyDivs;
    }

    protected override string GetVacancyUrl(HtmlAgilityPack.HtmlNode div)
    {
      // Return null when no link is found, so the null check in StartCrawling can skip the row
      // (BaseUrl + null would silently produce just the base URL).
      var vacancyHref = GetVacancyHref(div);
      return vacancyHref == null ? null : BaseUrl + vacancyHref;
    }

    private static string GetVacancyHref(HtmlAgilityPack.HtmlNode div)
    {
      var vacancyHref = div.Descendants("a").Where(
        d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyDescription"))
        .Select(d => d.Attributes["href"].Value).SingleOrDefault();
      return vacancyHref;
    }

    protected override string CreateNextUrl(int nextPage)
    {
      return SearchBaseUrl + "&pg=" + nextPage;
    }

    protected override string GetVacancyBody(HtmlAgilityPack.HtmlDocument vacancyPage)
    {
      if (vacancyPage == null)
      {
        //TODO: log event here and skip this page
        return null;
      }
      var description = vacancyPage.DocumentNode.Descendants("div")
        .Where(
          d => d.Attributes.Contains("id") && d.Attributes["id"].Value.Contains("ctl00_centerZone_vcVwPopup_pnlBody"))
        .Select(d => d.InnerHtml).SingleOrDefault();
      return description;
    }

    protected override string GetPosition(HtmlAgilityPack.HtmlNode div)
    {
      // Parentheses matter: without them the second Contains() is also evaluated
      // for links that have no class attribute at all.
      return div.Descendants("a").Where(
        d => d.Attributes.Contains("class") &&
        (d.Attributes["class"].Value.Contains("vacancyName") || d.Attributes["class"].Value.Contains("jqKeywordHighlight"))
        ).Select(d => d.InnerText).First();
    }

    protected override string GetCompany(HtmlAgilityPack.HtmlNode div)
    {
      return div.Descendants("div").Where(
        d => d.Attributes.Contains("class") &&
        d.Attributes["class"].Value.Contains("companyName")).Select(d => d.FirstChild.InnerText).First();
    }
  }
}

To make the picture complete, just review the implementation of the rest of the crawlers - http://github.com/alexanderbeletsky/tdd.demand/tree/master/src/Crawler/Core/Crawlers/

Conclusions

You can see that implementing a simple crawler is a simple thing as soon as you have good tools for it. Of course, its functionality is very specific and limited, but I hope it gives you ideas for your own crawlers.

In the next blog posts I'll cover the usage of Regex in .NET and the brand-new-cool-looking Entity Framework 4 Code First approach to working with databases.

Update - Is TDD skill actually required by employers? - with data from StackOverflow

This is a follow-up to my last blog post, in which I showed some data gathered by the crawler to check how valuable TDD skill is for development shops and how often they ask for it in job offers. As you remember, I was satisfied with the quality of the data provided by prgjobs.com.

Today I was reading a blog post on Coding Horror and realized that I had missed one good source of information: StackOverflow Careers.

Thanks to the latest architectural changes I've made to the Crawler and the good structure of the Careers site, it took about an hour to create a new crawler and test it. Now I'm ready to share the data.

Careers.StackOverflow results

Here we go - 212 vacancies have been extracted from this site. 49 of them were requesting TDD (23%, not so bad).

Technologies breakdown,

Conclusions

For sure, this is more accurate data. We can see that the results are really close to the ones we got for the Ukrainian market by analyzing the rabotaua site. It also makes it possible to generalize the results.

We could say that ~20% of employers demand TDD skill. The rest of the employers either do not mention it in their job offers or do not care about such a skill at all.

Ciprian Mustiata made a nice point in the comments to the previous post: such demand for TDD could be reasonable for countries like Ukraine, where the major market is maintenance of existing code bases (typically legacy code with no tests). But we see similar figures for the USA, a country where a lot of brand new products are born.

That's another piece of information to think about. What's your opinion on that?

Is TDD skill actually required by employers?

Is TDD popular among developers? Do managers know about the benefits of TDD? Are employers really looking for TDD-skilled people?

I was thinking about these kinds of questions and decided to do some initial research. My research was quite simple: I wanted to check popular job search sites and review the latest job offers, especially their "skills" sections. How many employers actually look for developers who know/use/love TDD? Since I'm a geek I would not do it manually, so I've written an application for that: a crawler that gets data from job search sites and stores it in a DB for further analysis. I already have the data and would like to share it in this post.

How it works?

Like any other crawler, it has one big loop that makes web requests to a site, gets the response, extracts links and data from the response, and stores the data somewhere. It proceeds as long as relevant data is present on the pages. The vacancy crawler makes a search request and extracts the links to all vacancy pages. As soon as a link is extracted, it makes a request to the vacancy page. It analyzes the body of the vacancy description with a very simple method: searching for keywords in the text. So, to detect whether TDD skill is required or not, the crawler tries to match some words from a vocabulary. A similar approach is used to understand which technology skills (.NET, Java, C++) are required. Finally, it creates a record containing the site name, position, technology used and TDD demand flag, and stores it in the database.

Source of information

I've taken two sites as sources of information. The first is RabotaUa (a Ukrainian one; Ukraine is one of the big players in IT outsourcing in Europe, so the data will be really relevant). The second one I wanted to pick from the USA, but it was difficult to find, since I'm not aware of the sites' popularity, reputation and so on. I even asked a question on StackOverflow, but my question was closed. I chose JobsForProgrammers as one of the sites Google suggested.

RabotaUa results

I've extracted 978 records from RabotaUa. These are the latest, recently posted job offers. 150 of 987 vacancies contained requirements for TDD skills.

How often is TDD skill required per technology?

PrgJobs results

I've extracted 1000 records from JobsForProgrammers. The crawler could have processed more, but I noticed that the later pages of the site contain data that is not really relevant: non-developer jobs and offers with short and not always adequate descriptions. So I am still considering crawling some other USA site for data. Anyway, here are the results. Only 69 of 1000 require TDD, 7%!

The technologies breakdown is also a bit strange. Matching the technology was difficult for several reasons. The job headline usually contained a too generic description (such as Software Developer, Web Developer and so on), and the job description body contained multi-skill requirements (like C++/Perl, C#/Java or VB.NET/Java) that the current version of the crawler could not handle properly.

Conclusions

To be honest, I did not expect such data. I thought 40-60% would ask for TDD, but we see that it is less than 16% for the Ukrainian data source and less than 7% for the USA. For me, as a TDD follower, these are really disappointing results. I realize that such results are very simple and rough and could not be used for real-life analytics, but they give a picture, for sure.

I also plan to write several technical posts with details of the implementation of the Crawler I've created for this report.

Please let me know what you think about these results, what further improvements could be made for more precise results, and what other data sources could be used.

Update

Subtext: Open source blogging engine project

In one of my previous posts I mentioned Subtext as an open source project that I have been keeping an eye on recently. It was originally created by Phil Haack, one of the authors of the ASP.NET MVC framework. I had thought about contributing to some open source project for a long time, so the first time I saw Subtext I realized that it could be a good one to try.

The project is hosted on Google Code and uses SVN as its source control system, so it is easy to get read access to the Subtext repository. Currently Subtext is actively developed by Simone and managed by Phil. It is still supported by the community, so everyone is able to submit a patch.

What I liked about Subtext itself:

  • Easy to use. Clear installation procedure, clear interfaces. Easy, because of simplicity.
  • Proven by time. Subtext is already at its 2.6 release. Quite mature, developed for about 5 years.
  • Used in the community. Many bloggers host their blogs on Subtext.

But of course I was attracted mostly by its code. I really like how the Subtext solution is done, and I try to extract good practices and approaches into my personal knowledge base. The code quality is high; it is clear which architectural approaches the author used and how the code is broken down into components and layers. I was happy to see a lot of unit tests.

I'm hacking on Subtext now. I try to understand how it works, what technologies are used, and what issues exist. I like how it goes, because I feel the "follow the master" concept while working on Subtext. Since this is one of my first experiences of working on open source projects, I would like to describe what my contribution is:

  • Finding new bugs. Yep, I do a little tester job here. I click through the application and submit new issues to the tracker.
  • Little fixes. When I find a problem and it is quite clear, I submit a patch for it.
  • Verification of fixes. I try to look through the latest fixes and verify them.
  • Feature requests. When I see that something is lacking, it is possible to file a feature request. Sure, as soon as it is accepted, you are free to submit a patch with the implementation.

I like how it goes. I've already submitted several patches, and I hope it is not the end. Unfortunately, I cannot spend as much time on it as I want, but it is at least something. I hope it will be a good experience, both for me and for Subtext.