I also have a concern of spelling. Say a user enters "sql srver aministrater", is there any SQL Server 2005 CTP query that can help with this? Will I need a third party tool? If so, how can I implement this.
The UI is an asp.net web based app and I want to target SQL Server 2005.
Thanks in advance everyone! I appreciate what you folks do!
You should be able to do everything you need using the thesaurus feature of SQL 2005 but you will have to do some work to do it.
You just need to define the appropriate thesaurus definition for the language you are querying in.
FREETEXT queries will treat the query string as a bag of words and perform stemming and thesaurus replacement and expansion on all terms (unless you enclose the entrie string in double quotes)
If you are querying in english (1033 = ENU) you would need to edit the file:
C:\program files\microsoft sql server\mssql.1\mssql\FTData\tsENU.xml
by default this is essentially empty since everything is commented out, to get it to work for your exmaple you need to tell it that car and auto and vehicle and automobile are equivalent terms. Stemming already knows that wash, wahses and washing are forms of wash so you don't need to tell it that.
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>car</sub>
<sub>auto</sub>
<sub>automobile</sub>
<sub>vehicle</sub>
</expansion>
</thesaurus>
</XML>
for your other example you could use the thesaurus to handle typos too
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<replacement>
<pat>svrer</pat>
<pat>servr</pat>
<sub>Server</sub>
</replacement>
<replacement>
<pat>Adminstrater</pat>
<sub>Administrator</sub>
</replacement>
</thesaurus>
</XML>
Something like that.
Basically you can either expand or replace terms using the thesaurus but you do need to define your own replacements.
You can use thesaurus in two ways:
1. freetext or freetext table queries (remember do no enclose the string in double quotes)
2. contains or containstable using FORMSOF(THESAURUS,queryterm)
Hope this helps
Dave Poole
|||Thanks for the reply Dave, however neither solution works well for me. Neither is intuitive enough...or maybe that's me.FREETEXT is too "fuzzy". It will produce a result of "car wash" from "auto wash" for example, but the result will also include results of all records that the thesaurus has an expansion on. Every record with car, auto, wash, etc will be included in the result for example.
CONTAINS is more precise however it doesn't work well since you have to provide a THESAURUS of not only synonyms, but INFELCETIONAL words as well, ie., "cars", "automobiles", "servers", "administrators", etc. Creating a THESAURUS is a daunting task by itself let alone having to expand on all inflectionals as well and I "ain't" no english major if you know what I mean.
Is there an existing way to get an inflectional list for a word and then use this result in a query? Same question for synonyms and Spell Check?
If big brother MS has their ears on, would be nice to have functions that will do this. If you need help, let me know$$$. I don't have time to do develop this and it's killing me. Infelections should be programable and not hard coded. Should also be able to define different instances of these as required in the query, database, or app. Same holds true for the Thesaurus. Should have functions that will pull either synonyms or antonyms. I know the question is why, but that's proprietary. Also, give us a Thesaurus and Dictionaries that we can hack as needed.
Also, if someone could tell me how the heck rank and weight are ranked and weighed, I'd appreciate it. The results I'm getting don't seem to be ranked and weighed as I would have thought.
Can't program myself out of a paper bag sometimes but still trying.
Thanks in advance and again.
|||
Perhaps some more complex patterns in the thesaurus may help for you case. I believe the logic in the thesaurus is to d oa longest match first so if you have entries like this:
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>car wash</sub>
<sub>washing cars</sub>
<sub>automobile wash</sub>
</expansion>
</thesaurus>
</XML>
It should reduce the noise.
If you query for CONTAINS(*,'FORMSOF(THESAURUS,"car wash")') you should hit docs which contain the alternate forms but will avoid docs with just car or wash or washing etc.
This still doesn't help with your issue of requiring specification of all inflectional forms.
I guess there are really 3 things you are asking in this post, I'll take them one by one.
1. You would like to be able to access inflectional, thesaurus forms from sql programatically so you can do your own processing of queries.
Thanks for your feedback on this. We are listening. There is a high likelihood that this functionality will be included in a post SQL Server 2005 release.
In the SQL 2005 timeframe there is a utility called lrtest.exe which ships with sql server 2005 (in the binn directory) which you can use to invoke the wordbreaker, stemmer and filter on strings or input files. This uses the IWordbreaker, IStemmmer and IFilter interfaces which are documented in MSDN. Have a play with this tool and see if the kind of operations it performs could be useful to you. If so you could write your own wrapper to invoke this type of functionality from your application.
2. You would like a pre-defined thesaurus to ship with sql server which you could then modify to your needs.
Again, thanks for the feedback. This is another feature that will likely show up in a later version of SQL Server.
3. Explanation of ranking and weighting.
There is a very good BooksOnline topic called Understanding Ranking which goes into a lot of detail on the ranking algorithm used Hopefully it will answer your questions.
Hope this helps
Dave Poole
SQL Server Fulltext Team
Thanks again for the reply Dave however this is not really that helpfull to me as it does not provide a solution but I do again appreciate your reply very much. It's also nice to know that big brother has their ears on. Implementing user feedbacks is key to making MS products what they are. While not perfect, the road to it is always under construction as with us all.
Please note the spell check as well. Also, having the functionality of the three; inflectional, thesaurus and spell check, available in one unique predicate would be tremendously helpful and benificial to the MS freetext toolbox as well as possibly expanding immensly on the marketability of these products. This however could make the fuzzy problem worse, but it would be tremendous to have these resources available in one predicate. Expand on this to get the complete jest.
My project is too huge to expand on the Thesaurus in this mannor. I know I need to provide some sort of intuitive logic to hash over the users input, but this is making it a coding nightmare for me. Having to provide a solution for the infinitesimal ways that a user may search for something in this project is really blowing my mind. I've come up with some near solutions, but this again is making it a coding nightmare. I seen a big chunk of the functionality that I'm looking for out on the net so I know it's possible...wish I could find out how the heck they're doing it that's why I posted this here.
You're miles above me when it comes to writing a wrapper as mentioned above, but you've definitely sparked my interest in investigating this further. From first glance it still appears that it will not provide all the functionality that I'm looking for but I'm going to investigate further if I can find the time. If you know of some time saving tips in writing these wrappers and have the time yourself to provide it, I'd love to hear more.
I've got a million of them, but I again wanted to point out that with a thesaurus implemented in this mannor as you mentioned above, this functionality needs to be application independent. In other words, the thesaurus can't be unique to the entire server. There must be a way to implement multiple thesaurus' based on the requirements of the query, db, interface, app, etc. Same holds true for inflectionals. Also, nothing should be hard coded without some form of ability to modify and implement as required on a per interface basis for example. Ie, best shoud not always stem to good.
Thanks again for your time and all my best regards in these endeavors!
Wayne
PS Hope at least some of this makes sense as I am, as always, running on not enough sleep and not enough time.
没有评论:
发表评论