Building on my ATOM crawler (http://www.searchdaimon.com/forum/viewthread/421/), which works well for the general case, I am thinking it would be beneficial to have a more general-purpose and more powerful crawl format. Essentially, this would be a way to create a custom crawler without having to write and debug Perl in a tiny window inside the searchdaimon UI.
First, the crawler needs the ability to paginate. The ATOM crawler is funny: it sits and says “Crawling..” for a very long time showing 0 documents, and then suddenly it finishes and there are a few thousand documents. It’s not so funny when you want to crawl a REALLY big dataset, especially a really big remote dataset that you can’t query all in one “page”, even if you wanted to. It would be nice if there were a way to tell the crawler to grab another page.
Second, it should have the ability to add custom attributes (note: I am unable to find documentation on how to do this, beyond this blog post that says it is possible: http://www.searchdaimon.com/blog/new_attributes_meta_information_function/).
Here’s an example of the XML I am thinking of:
<?xml version="1.0" encoding="utf-8"?>
<results>
  <document>
    <content>The search content here</content>
    <title>Results title</title>
    <url>http://mysite.com/show?docid=253</url>
    <type>html</type>
    <attributes>
      <myCustomAttribute>Some value</myCustomAttribute>
    </attributes>
  </document>
  <include href="/customXmlFeed?page=2" />
</results>
Of course, it should be possible to return multiple <document> and/or multiple <include> items in any result. Having <include> effectively makes this a hybrid between my ATOM crawler and the regular Intranet crawler.
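To make the remote side concrete, here is a rough sketch in PHP (only because that is what my existing converters are written in) of what such a feed could look like. The script path, database credentials, table and column names, and the page size are all made up for illustration:

<?php
// Rough sketch of a paginated feed in the proposed format, assuming a
// hypothetical "documents" table in a local MySQL database. Everything
// here (names, credentials, page size) is made up for illustration.
$pageSize = 500;
$page     = isset($_GET['page']) ? (int)$_GET['page'] : 1;
$offset   = ($page - 1) * $pageSize;

$db   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $db->prepare('SELECT id, title, body FROM documents ORDER BY id LIMIT :lim OFFSET :off');
$stmt->bindValue(':lim', $pageSize, PDO::PARAM_INT);
$stmt->bindValue(':off', $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

header('Content-Type: text/xml; charset=utf-8');
echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
echo "<results>\n";
foreach ($rows as $row) {
    echo "  <document>\n";
    echo '    <content>' . htmlspecialchars($row['body']) . "</content>\n";
    echo '    <title>' . htmlspecialchars($row['title']) . "</title>\n";
    echo '    <url>http://mysite.com/show?docid=' . (int)$row['id'] . "</url>\n";
    echo "    <type>html</type>\n";
    echo "  </document>\n";
}
// Only hand the crawler another page if this one was full.
if (count($rows) === $pageSize) {
    echo '  <include href="/customXmlFeed?page=' . ($page + 1) . '" />' . "\n";
}
echo "</results>\n";

The idea is that the crawler just keeps requesting the href from each <include> it sees until a page comes back without one, so the remote side never has to produce the whole dataset in a single response.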
It should also be possible to pass the most recent modified_date of all documents already in the collection to the remote feed (or the last successful crawl time, but I think modified_date is more fail-safe). That way, the feed can return only documents modified since that date. In my case, one of the things I’m looking at indexing is source code commit history. I have 19,000 revisions, and of course they NEVER change once they’re committed, so loading, returning and indexing old revisions is a waste of energy.
I am not sure if this is the best way to implement it, but, for example, I should be able to set the crawl URL to “http://mysite.com/commit-history.php?last={max_modified_date}” and searchdaimon would replace {max_modified_date} with the actual timestamp.
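The remote side of that would be trivial to honour. A rough sketch, again with made-up table and column names, and assuming the placeholder gets replaced with a plain Unix timestamp (the exact format searchdaimon would pass is an open question):

<?php
// Sketch of how commit-history.php could honour the "last" parameter.
// The "commits" table and its columns are hypothetical, and I am assuming
// {max_modified_date} arrives as a Unix timestamp.
$last = isset($_GET['last']) ? (int)$_GET['last'] : 0;

$db   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $db->prepare('SELECT revision, author, message FROM commits
                       WHERE committed_at > :last ORDER BY committed_at');
$stmt->bindValue(':last', $last, PDO::PARAM_INT);
$stmt->execute();
// ...then emit each row as a <document>, exactly as in the sketch above.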
——
Now, I know the first couple of parts can be done just by writing the appropriate connector (except for the attributes/metadata; I am not sure how to add those).
I do not know about passing the modified date—I am not sure how to get the information needed, or if it’s even available to the connector code.
——
I think this would be a great feature to include by default. It makes writing custom connectors easier (which is a good thing), and it adds some abilities that are not currently possible:
* Writing connectors in Perl is quite limiting. In my case, aside from hating writing in Perl, I need a Subversion client API. It is very hard to install this on the searchdaimon VM, and frankly, I’d rather interact with the VM using just the web API, as is recommended in the documentation.
* Connectors are extremely useful for custom applications. My bug tracker is web-based, but if I index the web pages I get tons of duplicate entries (including various views of list pages sorted in different orders) as well as bad results (searching for the phrase “help link” is utterly useless, since both of those words appear as part of the tracking application itself, and thus are on EVERY page).
* It’s not always possible to get at the raw data remotely, by default. If I have to write something new to expose the data to searchdaimon, I might as well have it return data in a way searchdaimon can read (rather than having to write a custom connector in addition to whatever I had to write to expose the data).
I am enjoying searchdaimon so far, and I think this sort of ability would help increase exposure and, hopefully, the number of community-contributed connectors (even if they’re only the “remote” part). I currently have a JIRA-to-ATOM converter and a Subversion-to-ATOM converter (though the latter doesn’t work very well because it times out trying to do everything in one go, hence my pagination request), both written in PHP simply because it was very fast for me to write them and they’re easy to run on the system that hosts them.
I will probably work on this anyway, just for myself, but since I am bad at (and dislike writing) Perl, I could definitely use some help writing the searchdaimon side of this.