Web crawler in .NET Core, C#
As there were no crawlers available for .NET Core, I decided to write one for my own purposes. It's still a simple implementation, but the idea is a single-threaded crawler that runs tasks and is accessed through dependency injection.
Setting up the crawler is trivial:
services.AddRecluseCrawler();
Remember to call crawler.Start() to begin processing, though.
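As a rough sketch, assuming an ASP.NET Core host (a console app with its own ServiceCollection works just as well), the wiring could look something like this. Resolving the crawler straight from the container and calling Start() in Configure is just one way to do it:

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Registers the crawler and its dependencies in the DI container
        services.AddRecluseCrawler();
    }

    public void Configure(IApplicationBuilder app)
    {
        // Resolve the crawler from the container and start it once at application startup
        app.ApplicationServices.GetRequiredService<RecluseCrawler>().Start();
    }
}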
The crawler can then be injected through DI and used like this:
public class Example
{
    private readonly RecluseCrawler _crawler;
    public Example(RecluseCrawler crawler) => _crawler = crawler;

    public async Task RunAsync()
    {
        // Crawl a single page and print its metadata and outgoing links
        var doc = await _crawler.CrawlAsync(new CrawlTask(new Uri("http://www.ycombinator.com")));
        Console.WriteLine($"{doc.Uri} - {doc.StatusCode} - {doc.Headers}");
        foreach (var link in doc.Links)
        {
            Console.WriteLine($"{link.Uri} - {link.LinkText} - {link.LinkType}");
        }
    }
}
The crawler uses a repository of crawl tasks, ICrawlTaskRepository, which you can also resolve through DI and add URIs to. The included implementation is a simple list, but it can be replaced with a custom implementation (see the sketch after the example below).
public class Example
{
    private readonly RecluseCrawler _crawler;
    private readonly ICrawlTaskRepository _repo;
    public Example(RecluseCrawler crawler, ICrawlTaskRepository repo) => (_crawler, _repo) = (crawler, repo);

    public async Task RunAsync()
    {
        var doc = await _crawler.CrawlAsync(new CrawlTask(new Uri("http://www.ycombinator.com")));
        Console.WriteLine($"{doc.Uri} - {doc.StatusCode} - {doc.Headers}");
        foreach (var link in doc.Links)
        {
            // Queue each discovered link as a new crawl task
            _repo.Add(new CrawlTask(link.Uri));
            Console.WriteLine($"{link.Uri} - {link.LinkText} - {link.LinkType}");
        }
    }
}
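If the built-in list doesn't fit your needs, the repository can be swapped out through DI. Here's a rough sketch; QueueCrawlTaskRepository is just a hypothetical example, and I'm only assuming the Add method used above (check the source for the full interface):

using System.Collections.Concurrent;

// Hypothetical replacement backed by a thread-safe queue; implement any other
// members ICrawlTaskRepository defines in the same way.
public class QueueCrawlTaskRepository : ICrawlTaskRepository
{
    private readonly ConcurrentQueue<CrawlTask> _tasks = new ConcurrentQueue<CrawlTask>();

    public void Add(CrawlTask task) => _tasks.Enqueue(task);
}

// Register the custom repository; depending on how the default is registered,
// this may need to come after AddRecluseCrawler()
services.AddSingleton<ICrawlTaskRepository, QueueCrawlTaskRepository>();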
All crawls made through the repository are handled by the registered ICrawlHandler implementation, for example:
public class LogCrawlHandler : ICrawlHandler
{
    public void OnDocumentFetched(WebDocument obj)
    {
        Console.WriteLine($"LogCrawlHandler: Fetched {obj.Uri}");
    }
}
...
services.AddSingleton<ICrawlHandler, LogCrawlHandler>();
The code can be found on GitHub for further exploration. There are also packages on the NuGet feed.