Building on my ATOM crawler (http://www.searchdaimon.com/forum/viewthread/421/), which works well for the general case, I am thinking it would be beneficial to have a more general-purpose and more powerful crawl format. Essentially, this would be a way to create a custom crawler without having to write and debug Perl in a tiny window inside the searchdaimon UI.
First, the crawler needs the ability to paginate. The ATOM crawler is funny: it sits and says “Crawling..” for a very long time showing 0 documents, and then suddenly it finishes and there are a few thousand documents. It’s not so funny when you want to crawl a REALLY big dataset, especially a really big remote dataset that you can’t query all in one “page”, even if you wanted to. It would be nice if there were a way to tell the crawler to grab another page.
Second, it should have the ability to add custom attributes (note: I am unable to find documentation on how to do this, beyond this blog post that says it is possible: http://www.searchdaimon.com/blog/new_attributes_meta_information_function/).
Here’s an example of the XML I am thinking of:
<?xml version="1.0" encoding="utf-8"?>
<results>
  <document>
    <content>The search content here</content>
    <title>Results title</title>
    <url>http://mysite.com/show?docid=253</url>
    <type>html</type>
    <attributes>
      <myCustomAttribute>Some value</myCustomAttribute>
    </attributes>
  </document>
  <include href="/customXmlFeed?page=2" />
</results>
Of course, it should be possible to return multiple <document> and/or multiple <include> items in any result. Having <include> effectively makes this a hybrid between my ATOM crawler and the regular Intranet crawler.
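To make the remote side concrete, here is a rough sketch in PHP (only because that is what my existing converters are written in) of what such a feed could look like. The script path, database credentials, table and column names, and the page size are all made up for illustration:

<?php
// Rough sketch of a paginated feed in the proposed format, assuming a
// hypothetical "documents" table in a local MySQL database. Everything
// here (names, credentials, page size) is made up for illustration.
$pageSize = 500;
$page     = isset($_GET['page']) ? (int)$_GET['page'] : 1;
$offset   = ($page - 1) * $pageSize;

$db   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $db->prepare('SELECT id, title, body FROM documents ORDER BY id LIMIT :lim OFFSET :off');
$stmt->bindValue(':lim', $pageSize, PDO::PARAM_INT);
$stmt->bindValue(':off', $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

header('Content-Type: text/xml; charset=utf-8');
echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
echo "<results>\n";
foreach ($rows as $row) {
    echo "  <document>\n";
    echo '    <content>' . htmlspecialchars($row['body']) . "</content>\n";
    echo '    <title>' . htmlspecialchars($row['title']) . "</title>\n";
    echo '    <url>http://mysite.com/show?docid=' . (int)$row['id'] . "</url>\n";
    echo "    <type>html</type>\n";
    echo "  </document>\n";
}
// Only hand the crawler another page if this one was full.
if (count($rows) === $pageSize) {
    echo '  <include href="/customXmlFeed?page=' . ($page + 1) . '" />' . "\n";
}
echo "</results>\n";

The idea is that the crawler just keeps requesting the href from each <include> it sees until a page comes back without one, so the remote side never has to produce the whole dataset in a single response.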
It should also be possible to pass the most recent modified_date of all documents already in the collection to the remote feed (or the last successful crawl time, but I think modified_date is more fail-safe). That way, the feed can return only documents modified since that date. In my case, one of the things I’m looking at indexing is source code commit history. I have 19,000 revisions, and of course they NEVER change once they’re committed, so loading, returning and indexing old revisions is a waste of energy.
I am not sure if this is the best way to implement it, but, for example, I should be able to set the crawl URL to “http://mysite.com/commit-history.php?last={max_modified_date}” and searchdaimon would replace {max_modified_date} with the actual timestamp.
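The remote side of that would be trivial to honour. A rough sketch, again with made-up table and column names, and assuming the placeholder gets replaced with a plain Unix timestamp (the exact format searchdaimon would pass is an open question):

<?php
// Sketch of how commit-history.php could honour the "last" parameter.
// The "commits" table and its columns are hypothetical, and I am assuming
// {max_modified_date} arrives as a Unix timestamp.
$last = isset($_GET['last']) ? (int)$_GET['last'] : 0;

$db   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$stmt = $db->prepare('SELECT revision, author, message FROM commits
                       WHERE committed_at > :last ORDER BY committed_at');
$stmt->bindValue(':last', $last, PDO::PARAM_INT);
$stmt->execute();
// ...then emit each row as a <document>, exactly as in the sketch above.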
——
Now, I know the first couple of parts can be done just by writing the appropriate connector (except for the attributes/metadata; I am not sure how to add those).
I do not know about passing the modified date—I am not sure how to get the information needed, or if it’s even available to the connector code.
——
I think this would be a great feature to include by default. It makes writing custom connectors easier (which is a good thing), and it adds some abilities that are not currently possible:
* Writing connectors in Perl is quite limiting. In my case, aside from hating writing in Perl, I need a Subversion client API. It is very hard to install this on the searchdaimon VM, and frankly, I’d rather interact with the VM using just the web API, as is recommended in the documentation.
* Connectors are extremely useful for custom applications. My bug tracker is web-based, but if I index the web pages I get tons of duplicate entries (including various views of list pages sorted in different orders) as well as bad results (searching for the phrase “help link” is utterly useless, since both of those words appear as part of the tracking application itself, and thus are on EVERY page).
* It’s not always possible to get at the raw data remotely, by default. If I have to write something new to expose the data to searchdaimon, I might as well have it return data in a way searchdaimon can read (rather than having to write a custom connector in addition to whatever I had to write to expose the data).
I am enjoying searchdaimon so far, and I think this sort of ability would help increase exposure and, hopefully, the number of community-contributed connectors (even if they’re only the “remote” part). I currently have a JIRA-to-ATOM converter and a Subversion-to-ATOM converter (though the latter doesn’t work very well because it times out trying to do everything in one go, hence my pagination request), both written in PHP simply because it was very fast for me to write them and they’re easy to run on the system that hosts them.
I will probably work on this anyway, just for myself, but since I am bad at (and dislike writing) Perl, I could definitely use some help writing the searchdaimon side of this.