One of my most exciting projects of last year involved a makeshift web crawler built in Python. It was a great experience, but as I added more and more features that the original script wasn’t designed for, it became apparent that it was due for a rebuild from the ground up. My goals for the rewrite were:
- A robust web-crawler
- Easy to integrate with other existing scripts
- Easy to reuse for specific projects
Choosing the Tools
My first step was figuring out how to write and compile TypeScript. That part was quick and straightforward, so I won’t go into detail on it here.
From my experience the first time around, I knew that the most difficult part was handling URLs. I didn’t want to rely on too many third-party dependencies, so I set out to make my own URL parser/identifier.
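The package’s actual parser isn’t shown here, but a minimal sketch of the idea, using Node’s built-in WHATWG `URL` class, might look like this (`normalizeUrl` is an illustrative name, not the real API):

```typescript
// Resolve a raw link against the page it appeared on, reject anything
// that isn't HTTP(S), and strip fragments so "/a" and "/a#top" are
// treated as the same page. Returns null for links we won't crawl.
function normalizeUrl(raw: string, base: string): string | null {
  let url: URL;
  try {
    url = new URL(raw, base); // also resolves relative links like "/about"
  } catch {
    return null; // malformed URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null; // skip mailto:, javascript:, tel:, etc.
  }
  url.hash = ""; // fragments point at the same document
  return url.href;
}
```

Normalizing every link to one canonical string is what makes duplicate detection later a simple set-membership check.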
Once I had parsing working and could decide which URLs I did and did not want to crawl, I set out to write tests covering all of the edge cases I could think of. Something I learned from my original crawler is that it can be very time-consuming and frustrating to test basic functionality during an actual crawl.
The next part was simply a matter of defining the entry point and figuring out how to maintain the page list. The first crawler I made worked off of a list, mainly because my URL processing wasn’t robust enough to avoid adding duplicate pages without it. This time I came equipped with a lot more knowledge and a sturdier foundation, so my initial approach was to find pages recursively. I’m not certain how well this will work on larger sites, so I might switch to a queue-based approach.
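For reference, the queue-based alternative is a breadth-first crawl with a visited set; deduplicating before enqueueing avoids both infinite loops and the old list-scanning. This is a sketch, not the package’s code, and `fetchLinks` is a hypothetical stand-in for real page fetching and parsing:

```typescript
// Breadth-first crawl: start from one URL, follow links level by level,
// and never enqueue a page twice. Returns every page discovered.
async function crawl(
  start: string,
  fetchLinks: (url: string) => Promise<string[]>,
  maxPages = 100,
): Promise<string[]> {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const page = queue.shift()!; // FIFO gives breadth-first order
    for (const link of await fetchLinks(page)) {
      if (!visited.has(link) && visited.size < maxPages) {
        visited.add(link); // dedupe at enqueue time, not dequeue time
        queue.push(link);
      }
    }
  }
  return [...visited];
}
```

Unlike recursion, the queue’s memory use is bounded by the frontier rather than the crawl depth, which matters on large, deeply linked sites.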
Once I had launched on NPM with minimal documentation and a not-overly-fleshed-out web crawler package, I was absolutely amazed by the initial response. In the first week there were 679 downloads. I can only claim one of those as my own, so it is kind of crazy to me.
There are a few small issues regarding status codes that I want to get sorted out. I also want to write a proper test suite and better documentation.