
Jan
18

Squiddy Web Crawler launched!

I am proud to announce the launch of the Squiddy Web Crawler. It is yet another web crawler (also known as a web spider or web robot), but it has some individuality in the way it works and in the purpose it was created for.
This crawler will index and analyze websites based on criteria generated by an artificial intelligence (AI) … a secret project of mine :).
The resulting data will feed an AI beast that will “learn” and interact with the web, constantly changing its algorithm based on how much of the “known” is considered relevant.
Because the AI has only some vague goals to follow (like staying active and alive, or looking for interesting, juicy new data), the results cannot be easily predicted. It is actually a machine driven by another machine that is almost out of control.
Usually crawlers are used by people to learn something from the results, the way Google uses its crawlers to build a search index of the internet that humans can query. Squiddy will instead look for information that will be used in the learning process and the evolution of the AI.

This is absolutely an esoteric tool in the hands of a machine that might be willing to overcome its narrow condition.

Enough with the philosophy! Now about the crawler implementation. It has three main parts: the control unit, the crawler unit, and a website that will display some statistics.

The control unit is in charge of controlling the crawler endpoints, providing an API to set the crawling goals, and structuring and persisting the crawled data. It will be controlled by the AI … but it can also receive goals from other applications.
The crawler unit is in charge of downloading the target data based on the goals provided by the control unit. This unit can be distributed across multiple machines and is able to spawn endpoints that download the target data using parallel strategies.
The statistics website (http://squiddy.net), which is also the homepage for the crawler, will display some cool information about what has been crawled, what the AI considers interesting, and much more.
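
To make the split between the control unit and the crawler unit a bit more concrete, here is a minimal Python sketch of the idea. It is not Squiddy's actual code: the names `ControlUnit`, `run_crawler_unit`, and `http_fetch` are hypothetical, the "persistence" is just an in-memory dictionary, and the "parallel strategy" is assumed to be a simple thread pool on one machine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.request import urlopen


def http_fetch(url: str) -> bytes:
    """Download one page. A real crawler would also honor robots.txt and rate limits."""
    with urlopen(url, timeout=10) as response:
        return response.read()


@dataclass
class ControlUnit:
    """Hypothetical control unit: accepts goals and stores results (in memory only)."""
    goals: List[str] = field(default_factory=list)
    results: Dict[str, int] = field(default_factory=dict)

    def add_goal(self, url: str) -> None:
        # Goals may come from the AI or from any other application.
        if url not in self.goals:
            self.goals.append(url)

    def take_goals(self) -> List[str]:
        # Hand all pending goals to a crawler unit and empty the queue.
        pending, self.goals = self.goals, []
        return pending

    def store(self, url: str, body: bytes) -> None:
        # "Structure and persist" the data; here we keep only the size in bytes.
        self.results[url] = len(body)


def run_crawler_unit(control: ControlUnit,
                     fetch: Callable[[str], bytes] = http_fetch,
                     workers: int = 4) -> None:
    """Hypothetical crawler unit: downloads all pending goals in parallel."""
    urls = control.take_goals()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the bodies in the same order as the input URLs.
        for url, body in zip(urls, pool.map(fetch, urls)):
            control.store(url, body)
```

The `fetch` function is passed in as a parameter so the downloading strategy can be swapped (or faked in a test) without touching the control unit. Distributing crawler units over several machines would need a network API in front of `ControlUnit`, which this sketch leaves out.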

Let’s hope for some nice achievements from this crawler.

Cheers!


3 comments

  1. Michael says:

    Well, I see nothing but a blank page.


  2. ovidiu says:

    Indeed :) … for now the stats are deactivated. I’m still testing the right setup for this highly experimental piece of software.


  3. russell says:

    Hey Ovidiu. I’m looking into a bit of AI and web crawling. You seem to have some ideas, and perhaps we can chat about Squiddy if you’re willing.
    Contact me if you’d like to chat.
    R

