Web crawler in dotnet core, C#


As there were no web crawlers available for .NET Core, I decided to build one for my own purposes. It is still a simple implementation, but the idea is a single-threaded crawler that runs tasks and is accessed through dependency injection.

Setting up the crawler is trivial:


```csharp
services.AddRecluseCrawler();
```

Remember to call `crawler.Start()` afterwards, though.
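To make the registration concrete, here is a minimal sketch of wiring the crawler up outside of ASP.NET Core, e.g. in a console app. `AddRecluseCrawler()`, `RecluseCrawler` and `Start()` are taken from this README; the `ServiceCollection` boilerplate around them is standard Microsoft.Extensions.DependencyInjection usage:

```csharp
using Microsoft.Extensions.DependencyInjection;

// Minimal sketch: assumes the RecluseCrawler package is referenced.
var services = new ServiceCollection();
services.AddRecluseCrawler();

var provider = services.BuildServiceProvider();
var crawler = provider.GetRequiredService<RecluseCrawler>();

// Without this call the crawler never begins processing queued tasks.
crawler.Start();
```

In an ASP.NET Core app the `services.AddRecluseCrawler()` call would go in `Startup.ConfigureServices` instead.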

The crawler can then be resolved through dependency injection and used like this:


```csharp
public async Task ExampleAsync(RecluseCrawler crawler)
{
    // CrawlAsync fetches the document and extracts its links.
    var doc = await crawler.CrawlAsync(new CrawlTask(new Uri("http://www.ycombinator.com")));
    Console.WriteLine($"{doc.Uri} - {doc.StatusCode} - {doc.Headers}");
    foreach (var link in doc.Links)
    {
        Console.WriteLine($"{link.Uri} - {link.LinkText} - {link.LinkType}");
    }
}
```

Note that the example is an `async` method rather than a constructor, since `await` cannot be used in a constructor.


The crawler uses a repository of crawl tasks, `ICrawlTaskRepository`, to which you can also add URIs; the repository itself is available through DI. The included implementation is a simple list, but it can be replaced with a custom implementation.
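For illustration, a custom repository might deduplicate URIs before queuing them. Only the `Add(CrawlTask)` member appears in the examples in this README, so this is just a sketch of the deduplication idea; check the actual `ICrawlTaskRepository` interface for the full member list before implementing it:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch only: ICrawlTaskRepository is assumed to expose at least
// Add(CrawlTask); any remaining interface members are omitted here.
public class DedupingCrawlTaskRepository : ICrawlTaskRepository
{
    private readonly ConcurrentQueue<CrawlTask> _tasks = new ConcurrentQueue<CrawlTask>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    public void Add(CrawlTask task)
    {
        lock (_seen)
        {
            // Skip URIs that have already been queued once.
            if (!_seen.Add(task.Uri.AbsoluteUri))
                return;
        }
        _tasks.Enqueue(task);
    }
}
```

Such an implementation would replace the built-in list via `services.AddSingleton<ICrawlTaskRepository, DedupingCrawlTaskRepository>();`.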


```csharp
public async Task ExampleAsync(RecluseCrawler crawler, ICrawlTaskRepository repo)
{
    var doc = await crawler.CrawlAsync(new CrawlTask(new Uri("http://www.ycombinator.com")));
    Console.WriteLine($"{doc.Uri} - {doc.StatusCode} - {doc.Headers}");
    foreach (var link in doc.Links)
    {
        // Queue each discovered link for a later crawl.
        repo.Add(new CrawlTask(link.Uri));
        Console.WriteLine($"{link.Uri} - {link.LinkText} - {link.LinkType}");
    }
}
```


All crawls made through the repository will be handled by the registered `ICrawlHandler` implementation, for example:



```csharp
public class LogCrawlHandler : ICrawlHandler
{
    public void OnDocumentFetched(WebDocument obj)
    {
        Console.WriteLine($"LogCrawlHandler: Fetched {obj.Uri}");
    }
}
```

The handler is then registered in the service collection:

```csharp
services.AddSingleton<ICrawlHandler, LogCrawlHandler>();
```

The code can be found on GitHub for further exploration.

There are also packages on the NuGet feed.