I am so wicked-proud of my little crawler/text flattener:
Named after the Great Goddess of Teotihuacan, who often appears with spiders (I guess it’s inaccurate to call her a goddess of spiders), my little app grabs all the text from every page of a site. It’s a Node app and has a few limitations:
Some file types seem to hang the process though I’m not certain why. I ended up just setting the crawler to ignore a bunch of image file types. That’s basically okay since I only wanted text, but that points to two sub-problems:
1. I have no idea why that broke it.
2. The tool took a few hours to get to the image file it hangs on, meaning rigorous testing is… difficult.
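Here is a minimal sketch of that kind of ignore list. It is not the crawler's actual code: the names `SKIPPED_EXTENSIONS` and `shouldSkip` and the particular extensions are assumptions made up for illustration. The idea is to decide from the URL alone whether to skip it, before any request is made.

```javascript
// Sketch only: a hypothetical helper, not taken from the real crawler.
const path = require("path");

// Assumed list of image extensions to ignore; extend as needed.
const SKIPPED_EXTENSIONS = new Set([
  ".jpg", ".jpeg", ".png", ".gif", ".svg", ".ico", ".webp",
]);

// Returns true when the crawler should not fetch this URL.
function shouldSkip(rawUrl) {
  let pathname;
  try {
    // Using the parsed pathname keeps query strings out of the extension.
    pathname = new URL(rawUrl).pathname;
  } catch (err) {
    // Not a parseable absolute URL, so there is nothing to fetch.
    return true;
  }
  const ext = path.posix.extname(pathname).toLowerCase();
  return SKIPPED_EXTENSIONS.has(ext);
}
```

Because the check is a plain function of the URL, it can be tried against a single suspect address directly, for example `shouldSkip("https://example.com/logo.PNG")`, instead of waiting a few hours for a full crawl to reach it. It does not explain why those files hung the process in the first place.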
I’m writing this several days after giving it up so some of the details are hazy, but essentially, I couldn’t make collectd work at all.
In retrospect, I bet I was modifying the configuration .yml and producing an invalid file. YAML is whitespace-dependent, and parsing fails entirely if the file has any errors. On top of that, most projects with YAML configs don’t actually throw errors to the console when fed invalid YAML; they just break silently (does that make YAML the worst markup language ever? It’s for you to decide).
But anyway, it would log a few lines, then nothing. I could never make it store metrics locally. For a while it would make .rrd files (which I never figured out how to open and look at), but it never made populated .csv files.
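If the invalid-YAML hunch is right, even a crude pre-flight check on the config file might have surfaced the problem sooner. The sketch below is an assumption-laden illustration, not a YAML parser: `findIndentationProblems` is a hypothetical helper that only flags tab characters in indentation, which YAML does not allow. A real YAML parser or linter would catch far more, and this check may flag lines a real parser would accept.

```javascript
// Sketch only: a hypothetical pre-flight check, NOT a YAML validator.
// It reports lines whose leading whitespace contains a tab character.
function findIndentationProblems(text) {
  const problems = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // blank lines are fine
    const indent = line.match(/^[ \t]*/)[0]; // leading whitespace only
    if (indent.includes("\t")) {
      problems.push({
        line: index + 1, // 1-based, to match what an editor shows
        message: "tab character in indentation",
      });
    }
  });
  return problems;
}
```

Running it over the file's text and printing anything it returns at least gives a line number to look at, which beats a daemon that breaks silently.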
What should I have done differently?
Spinning my wheels for three days was colossally stupid. I felt like I was making some progress, but I would repeatedly find myself stuck for hours, and after trying a few things just sort of stay stumped and try to look for something else to do. This really has two sub-mistakes: a) at the end of a day where I hadn’t made much progress, I should have told someone. b) at the end of an hour where something wasn’t working, I should have bugged someone. I didn’t bug anyone because of fears of seeming like I wasn’t cut out to do this job, but that’s something of a self-fulfilling prophecy.
Going into several unfamiliar environments at once: daemon config, Linux nerdery, AND a VirtualBox environment controlled by Vagrant, all situations I didn’t know well. In my follow-up project I built something as a test case that’s installed directly on my OS X machine, an environment I’m much more comfortable with.
Glossing over problems. There were logging and output errors early on that I tried to ignore or ‘come back to’, but really it just wasn’t working at all!