<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Query Log Topic Detection</title>
	<atom:link href="http://querylog.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://querylog.wordpress.com</link>
	<description>Experiments on query logs from search engines</description>
	<lastBuildDate>Sat, 17 Oct 2009 16:22:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='querylog.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Query Log Topic Detection</title>
		<link>http://querylog.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://querylog.wordpress.com/osd.xml" title="Query Log Topic Detection" />
	<atom:link rel='hub' href='http://querylog.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Summary &#8211; Content Free Clustering for Search Engine Query Log</title>
		<link>http://querylog.wordpress.com/2009/10/16/sammanfattning-content-free-clustering-for-search-engine-query-log/</link>
		<comments>http://querylog.wordpress.com/2009/10/16/sammanfattning-content-free-clustering-for-search-engine-query-log/#comments</comments>
		<pubDate>Fri, 16 Oct 2009 09:43:18 +0000</pubDate>
		<dc:creator>Frej</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AOL]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Literature]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=65</guid>
		<description><![CDATA[Hosseini, M., Abolhassani, H., and Harikandeh, 2007 The authors try to cluster search logs from AOL using a bipartite graph between query strings and URLs, followed by K-means clustering within the resulting components. Method The method consists of four steps: Build a bipartite graph of query strings and URLs, with an edge wherever a query has led to a click on the URL. Eliminate [...]]]></description>
			<content:encoded><![CDATA[<p>Hosseini, M., Abolhassani, H., and Harikandeh, 2007</p>
<p>The authors try to cluster search logs from AOL using a bipartite graph between query strings and URLs, followed by K-means clustering within the resulting components.</p>
<h2>Method</h2>
<p>The method consists of four steps:</p>
<ol>
<li>Build a bipartite graph of query strings and URLs, with an edge wherever a query has led to a click on the URL.</li>
<li>Eliminate junk edges so that separate components can be extracted.</li>
<li>Reduce the dimensionality of the adjacency matrix in each component.</li>
<li>Cluster with K-means within the components.</li>
</ol>
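<p>As a rough illustration of step 1 and of what "components" means here, the sketch below builds the query-URL click graph and extracts its connected components with a union-find structure. The function name, the tuple layout and the toy click data are assumptions made for this example and are not taken from the paper.</p>

```python
from collections import defaultdict

def click_graph_components(clicks):
    """Group queries and URLs into connected components of the click graph.

    `clicks` is an iterable of (query, url) pairs; each pair is one edge.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for query, url in clicks:
        # Tag the nodes so that a query string can never collide with a URL.
        root_q, root_u = find(("q", query)), find(("u", url))
        if root_q != root_u:
            parent[root_q] = root_u

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return sorted(components.values(), key=len, reverse=True)

# Hypothetical click data: two queries share one URL, a third stands alone.
clicks = [
    ("webmd", "webmd.com"),
    ("medical questions", "webmd.com"),
    ("cheap flights", "expedia.com"),
]
components = click_graph_components(clicks)
print([len(c) for c in components])  # prints [3, 2]
```

<p>On real logs a single noisy edge is enough to join two otherwise unrelated components, which is why the junk-edge elimination in step 2 matters.</p>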
<h2>Results</h2>
<p>The test data were produced by picking 40 k random queries. These were manually classified into 7 categories.</p>
<p>After the first steps with the bipartite graph, only one component is large enough to be worth pursuing, and it contains 55 % of the data. That component is clustered into four parts with K-means.</p>
<p>The results are evaluated by checking the precision for the different topics in each cluster. Three of the four clusters had one topic with considerably higher precision than the others.</p>
<h2>Relevance for us</h2>
<p>Clustering the AOL logs is exactly what we are doing, so it is interesting to see how well they succeed. They have, however, settled for trying to fit a few giant clusters into predetermined categories, which we consider uninteresting. What is interesting is that they have put a great deal of work into manually classifying 40 k queries into seven categories, which gives us a picture of what the proportions between these categories should be in our clusters. That is possibly something we could use for evaluation.</p>
<p>That their initial step with components in a bipartite graph yielded only one large component agrees well with our own experience of the logs: the data are noisy and hard to pull apart.</p>
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/10/16/sammanfattning-content-free-clustering-for-search-engine-query-log/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4e4a262c61237cb73d7de8e084b91192?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Frej</media:title>
		</media:content>
	</item>
		<item>
		<title>Topic Detection and Tracking using idf-Weighted Cosine Coefficient</title>
		<link>http://querylog.wordpress.com/2009/10/15/topic-detection-and-tracking-using-idf-weighted-cosine-coef%ef%ac%81cient/</link>
		<comments>http://querylog.wordpress.com/2009/10/15/topic-detection-and-tracking-using-idf-weighted-cosine-coef%ef%ac%81cient/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 15:42:46 +0000</pubDate>
		<dc:creator>Frej</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=62</guid>
		<description><![CDATA[Summary of Topic Detection and Tracking using idf-Weighted Cosine Coefficient (J. Michael Schultz, Mark Liberman), 1999 The authors try to track and detect news topics using weighted cosine measures (tf-idf). Tracking Two samples of training data are selected, one containing articles on the topic of interest and one that does not. Choosing [...]]]></description>
			<content:encoded><![CDATA[<p><em>Summary of Topic Detection and Tracking using idf-Weighted Cosine Coefficient (J. Michael Schultz, Mark Liberman), 1999 </em><br />
The authors try to track and detect news topics using weighted cosine measures (tf-idf).</p>
<h3>Tracking</h3>
<p>Two samples of training data are selected, one containing articles on the topic of interest and one that does not.</p>
<h4>Choosing topic features</h4>
<p>To make a cosine comparison between a topic and an article, one needs to extract words that characterize the topic from the articles in the training data. The authors tried four different methods:</p>
<ol>
<li>Take all words in the articles.</li>
<li>For all articles together, take the <em>n</em> most frequent words.</li>
<li>For each article, take the <em>n</em> most frequent words.</li>
<li>As in 3, but iteratively add more terms if they give better results.</li>
</ol>
<p>Method four gave the best results, but method three was chosen because it was only marginally worse and considerably less complicated.</p>
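<p>A minimal sketch of method three combined with an idf-weighted cosine comparison could look as follows. The exact weighting formula used in the paper may differ; the function names, the plain log(N/df) idf and the toy articles are assumptions made for illustration only.</p>

```python
import math
from collections import Counter

def idf_table(documents):
    """idf(t) = log(N / df(t)) over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def topic_vector(topic_articles, n):
    """Method 3: sum the counts of each article's n most frequent words."""
    vector = Counter()
    for article in topic_articles:
        vector.update(dict(Counter(article).most_common(n)))
    return vector

def idf_cosine(a, b, idf):
    """Cosine between two term-count vectors after idf weighting."""
    wa = {t: c * idf.get(t, 0.0) for t, c in a.items()}
    wb = {t: c * idf.get(t, 0.0) for t, c in b.items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical training corpus: two articles on the topic, two off topic.
corpus = [
    "flood river flood rescue".split(),
    "river flood damage town".split(),
    "election vote party town".split(),
    "vote count election result".split(),
]
idf = idf_table(corpus)
topic = topic_vector(corpus[:2], n=2)

on_topic = idf_cosine(topic, Counter("flood warning river".split()), idf)
off_topic = idf_cosine(topic, Counter("election result vote".split()), idf)
print(on_topic > off_topic)
```

<p>Words that never occur in the training corpus get weight zero here, which is one of several possible choices.</p>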
<h4>Normalization</h4>
<p>The authors tried various ways of normalizing the vector that represents the topic using the training data, but gave up because the results did not improve.</p>
<h4>Results</h4>
<p>Simple methods seemed to give the best results, and the authors consider their results competitive.</p>
<h3>Detection</h3>
<p>Topics are detected using the <em>single-linkage clustering</em> algorithm and the same similarity measure as before. The algorithm produces problematic chaining effects: different topics are held together in the same cluster by individual articles that cover both topics. The authors seem dissatisfied and blame the results on the shortcomings of the algorithm.</p>
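<p>The chaining effect is easy to reproduce on toy data. The sketch below implements single linkage cut at a fixed similarity threshold, which amounts to merging two clusters as soon as any pair of their members is similar enough. It uses Jaccard overlap on word sets instead of the idf-weighted cosine, purely to keep the example short; the articles are made up.</p>

```python
def single_linkage(items, similarity, threshold):
    """Merge items into one cluster whenever any pair reaches `threshold`."""
    cluster_of = list(range(len(items)))  # one cluster label per item

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                old, new = cluster_of[j], cluster_of[i]
                if old != new:
                    # Relabel the whole cluster that j belongs to.
                    cluster_of = [new if c == old else c for c in cluster_of]

    clusters = {}
    for item, label in zip(items, cluster_of):
        clusters.setdefault(label, []).append(item)
    return list(clusters.values())

def jaccard(a, b):
    return len(a & b) / len(a | b)

flood = frozenset({"flood", "river", "rescue", "dam"})
election = frozenset({"election", "vote", "party", "ballot"})
# One article that covers both topics and bridges them.
bridge = frozenset({"flood", "rescue", "election", "vote"})

without_bridge = single_linkage([flood, election], jaccard, threshold=0.3)
with_bridge = single_linkage([flood, bridge, election], jaccard, threshold=0.3)
print(len(without_bridge), len(with_bridge))  # prints 2 1
```

<p>The two topic articles share no words at all, yet a single bridging article is enough to put them in the same cluster.</p>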
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/10/15/topic-detection-and-tracking-using-idf-weighted-cosine-coef%ef%ac%81cient/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4e4a262c61237cb73d7de8e084b91192?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Frej</media:title>
		</media:content>
	</item>
		<item>
		<title>Success detection using query, link and goal frequencies</title>
		<link>http://querylog.wordpress.com/2009/10/15/success-detection-using-query-link-and-goal-frequencies/</link>
		<comments>http://querylog.wordpress.com/2009/10/15/success-detection-using-query-link-and-goal-frequencies/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 15:11:29 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Search Behavior]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=59</guid>
		<description><![CDATA[In Understanding the Relationship between Searchers’ Queries and Information Goals Downey, Dumas, Liebling and Horvitz describe an interesting property of users interacting with search engines: the rate of success, that is the likelihood of a user finding what he or she is looking for, is related to the frequency of the query issued and the [...]]]></description>
			<content:encoded><![CDATA[<p>In <em>Understanding the Relationship between Searchers’ Queries and Information Goals</em> Downey, Dumas, Liebling and Horvitz describe an interesting property of users interacting with search engines: the rate of success, that is the likelihood of a user finding what he or she is looking for, is related to the frequency of the query issued and the underlying information goal it is supposed to lead to.</p>
<p>In particular, the best way to find a frequently sought information goal in a search engine is to employ a frequently used query. For example, a good way to find <a href="http://www.webmd.com/">http://www.webmd.com</a> may be to issue the query <em>webmd</em>. In the authors’ study this was a fairly frequent query and information goal. Using the more frequent query <em>medical questions page</em> would yield more results making it more difficult to find the correct website. On the other hand, the much less frequent <em>webmb</em> would probably require the user to perform spelling correction before finding what he or she was looking for.</p>
<p>Note the connection between information goal and website. In their study, the authors have chosen to make these synonymous. Of course this need not be so. It may be true that it is more difficult to get to <a href="http://www.webmd.com/">http://www.webmd.com</a> by using <em>medical questions page</em> than by using <em>webmb</em>, but if the user wasn’t explicitly looking for <a href="http://www.webmd.com/">http://www.webmd.com</a> but instead was interested in finding the answer to a medical question then <em>medical questions page</em> could have been a better query to issue. This is not reflected in the study.</p>
<p>After a query has been submitted, the user’s next action varies with the frequency of the query. Frequent queries lead to a click in the result list more often than rare queries do. Rare queries are followed by a requery more often than frequent queries are. After clicking a frequently clicked result, a user is less likely to click another result than after clicking a rare one. Conversely, a user is more likely to issue a requery after clicking a rare result than after clicking a frequent one.</p>
<p>According to the study, search engines are much better at finding answers to common information goals than they are at resolving rare goals. At first glance this seems reasonable enough.  A successful search engine is one which quickly finds the information that a user requires. Finding what most people want is almost as good as finding what everybody wants. The evidence of this is that for rare information goals, users typically need to reformulate their queries more times than they do for frequent goals.</p>
<p>To sum up, the paper implies that the frequency of the queries and clicked links in a session can help us discern to what degree the user was able to find what he or she was looking for.</p>
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/10/15/success-detection-using-query-link-and-goal-frequencies/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
		<item>
		<title>Summary: Query Clustering Using User Logs</title>
		<link>http://querylog.wordpress.com/2009/10/15/sammanfattning-query-clustering-using-user-logs/</link>
		<comments>http://querylog.wordpress.com/2009/10/15/sammanfattning-query-clustering-using-user-logs/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 13:47:18 +0000</pubDate>
		<dc:creator>Frej</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=55</guid>
		<description><![CDATA[Summary of Query Clustering Using User Logs (Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang), 2002 The authors try to cluster queries in Encarta with respect to search terms and clicked documents. Approach Clustering principles If two queries contain the same or similar search terms, they represent the same, or similar, information needs. Two queries are similar if they lead to the selection of the same [...]]]></description>
			<content:encoded><![CDATA[<p><em>Summary of Query Clustering Using User Logs (Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang), 2002 </em><br />
The authors try to cluster queries in Encarta with respect to search terms and clicked documents.</p>
<h2>Approach</h2>
<h3>Clustering principles</h3>
<ol>
<li>If two queries contain the same or similar search terms, they represent the same, or similar, information needs.</li>
<li>Two queries are similar if they lead to the selection of the same or similar documents.</li>
</ol>
<p>Principle one is not sufficient on its own, since the same search terms can represent different needs. Computed similarity does not always imply semantic similarity, and this is especially true for short query strings.</p>
<p>Principle two has the weakness that users do not necessarily click only on relevant documents, and that a document can contain information on several topics.</p>
<h2>Implementation</h2>
<h3>Data</h3>
<p>From large volumes of logs, sessions are extracted, each consisting of a query string and the document clicks that the query gave rise to.</p>
<h3>Algorithm</h3>
<p>The authors consider that they need an algorithm that handles large data sets, does not require a predetermined number of clusters, and filters out noise. The choice falls on DBSCAN, which has complexity O(<em>n</em> log <em>n</em>).</p>
<h3>Similarity measures</h3>
<p>Query strings can be compared using term similarity (for example cosine) or string similarity (for example Levenshtein).</p>
<p>Click similarity can be based on overlapping documents, comparable to term similarity, or on the categories that the clicked documents belong to.</p>
<h3>Combining different measures</h3>
<p>By combining different measures, one can get better results than with a single measure. One must then choose parameters for weighting the different measures.</p>
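<p>One simple way to combine the measures is a weighted sum, sketched below. The weighted-sum form, the <code>term_weight</code> parameter, the use of Jaccard overlap for both parts and the session layout are assumptions made for this illustration; the document identifiers are invented.</p>

```python
def jaccard(a, b):
    """Overlap of two sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_similarity(s1, s2, term_weight=0.5):
    """Weighted sum of term similarity and clicked-document overlap."""
    term_sim = jaccard(set(s1["query"].lower().split()),
                       set(s2["query"].lower().split()))
    click_sim = jaccard(set(s1["clicks"]), set(s2["clicks"]))
    return term_weight * term_sim + (1.0 - term_weight) * click_sim

# Same words, different needs: the clicks pull the score down.
same_words = combined_similarity(
    {"query": "java", "clicks": ["doc/coffee"]},
    {"query": "java", "clicks": ["doc/programming"]},
)
# Different words, same need: the clicks pull the score up.
same_clicks = combined_similarity(
    {"query": "atomic bomb", "clicks": ["doc/hiroshima"]},
    {"query": "nagasaki", "clicks": ["doc/hiroshima"]},
)
print(same_words, same_clicks)  # prints 0.5 0.5
```

<p>With equal weights both pairs score 0.5, where term similarity alone would give 1.0 and 0.0. Shifting <code>term_weight</code> moves the balance between the two principles.</p>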
<h2>Evaluation</h2>
<p>From one month of Encarta logs (22 GB, 2.7 M sessions), 20,000 sessions are randomly selected for evaluation.</p>
<p>Four different combinations of similarity measures are used:</p>
<ol>
<li>Term similarity</li>
<li>Document overlap</li>
<li>Terms combined with document overlap</li>
<li>Terms combined with document categories</li>
</ol>
<p>The results are compared by looking at the number of clusters and the proportion of clustered sessions under varying similarity thresholds.</p>
<h3>Quality</h3>
<p>Quality is evaluated by manually reviewing 100 randomly selected clusters.</p>
<h3>Results</h3>
<p>The results show advantages for the combined measures, particularly regarding quality. The combined measures yield more clusters and a larger proportion of clustered sessions, but cannot cope with similarity thresholds as high as the simple measures can.</p>
<h2>Relevance for us</h2>
<p>The questions are largely the same as the ones we face, but the answers are not always quite the same, probably because of differences in data quality.</p>
<p>We could not use sessions consisting of just one query string and any clicks; they would be too small and impossible to cluster. Nor do I believe that this is how internet users use a search engine. I believe more in the model of several connected queries in pursuit of information (unless it is a navigational search, but then it is uninteresting anyway).</p>
<p>We do not have equally well-formed documents in the click lists, and no categories for the documents. The internet is very large and considerably more sprawling than Encarta, and we only have access to the domains the clicks lead to, not the exact documents. Our methods are therefore considerably blunter.</p>
<p>I feel that 20,000 sessions is a fairly small amount to cluster, but that may be because searches on Encarta are more uniform and the possible documents far fewer. They probably get considerably greater overlap even with small numbers of sessions.</p>
<p>The evaluation methods are clearly interesting for us once we get that far, provided we do not come up with something better ourselves. Measuring quality by manually reviewing 100 clusters feels a bit shaky.</p>
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/10/15/sammanfattning-query-clustering-using-user-logs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4e4a262c61237cb73d7de8e084b91192?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Frej</media:title>
		</media:content>
	</item>
		<item>
		<title>Recursive clustering with DBScan</title>
		<link>http://querylog.wordpress.com/2009/10/06/rekursiv-klustring-med-dbscan/</link>
		<comments>http://querylog.wordpress.com/2009/10/06/rekursiv-klustring-med-dbscan/#comments</comments>
		<pubDate>Tue, 06 Oct 2009 15:49:48 +0000</pubDate>
		<dc:creator>Frej</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Implementation]]></category>
		<category><![CDATA[Session Clustering]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=47</guid>
		<description><![CDATA[As Eskil has written about earlier, choosing parameters in DBScan is problematic. The result is usually either that a majority of the sessions are marked as &#8220;outliers&#8221; or that one dominant cluster swallows nearly all sessions. To get at this problem we have chosen to attack our data set with a recursive implementation [...]]]></description>
			<content:encoded><![CDATA[<p>As Eskil has written about earlier, choosing parameters in DBScan is problematic. The result is usually either that a majority of the sessions are marked as &#8220;outliers&#8221; or that one dominant cluster swallows nearly all sessions. To get at this problem we have chosen to attack our data set with a recursive implementation of DBScan. We start from a DBScan clustering with a very low similarity threshold. In the larger of the resulting clusters we run the same type of clustering again, but with a similarity threshold that is one step stricter, and so on, until the threshold reaches a stop value.</p>
<p>This means that the parameters that now need to be set are start similarity, stop similarity and step length. The clusterings are also evaluated at each step, and that evaluation can decide that the recursion should be aborted before the stop similarity is reached, e.g. because of the size of the clusters.</p>
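<p>The idea can be sketched in a few lines. This is not our actual implementation: the compact DBSCAN below, the <code>min_pts</code> and <code>max_size</code> parameters, the size check standing in for the per-step evaluation, and the toy sessions compared with Jaccard overlap are all assumptions made for the example.</p>

```python
def dbscan(items, similarity, min_sim, min_pts=2):
    """Plain DBSCAN over a similarity function; returns (clusters, outliers)."""
    def neighbours(i):
        return [j for j in range(len(items))
                if similarity(items[i], items[j]) >= min_sim]

    labels = {}      # item index -> cluster id, or None for noise
    clusters = []
    for i in range(len(items)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = None             # noise, unless a cluster reaches it later
            continue
        cid = len(clusters)
        clusters.append([i])
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) is not None:
                continue                 # already belongs to a cluster
            labels[j] = cid
            clusters[cid].append(j)
            reachable = neighbours(j)
            if len(reachable) >= min_pts:    # core point: keep expanding
                queue.extend(k for k in reachable if labels.get(k) is None)
    outliers = [items[i] for i, c in labels.items() if c is None]
    return [[items[i] for i in c] for c in clusters], outliers

def recursive_dbscan(items, similarity, start, stop, step, max_size=3):
    """Re-cluster oversized clusters with a stricter threshold until `stop`."""
    clusters, outliers = dbscan(items, similarity, start)
    result = []
    next_sim = start + step
    for cluster in clusters:
        if len(cluster) > max_size and next_sim <= stop + 1e-9:
            sub, sub_out = recursive_dbscan(cluster, similarity,
                                            next_sim, stop, step, max_size)
            result.extend(sub)
            outliers.extend(sub_out)
        else:
            result.append(cluster)
    return result, outliers

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Two tight groups that are weakly linked through the words "x" and "y".
sessions = [
    frozenset("abcx"), frozenset("abcy"),
    frozenset("defx"), frozenset("defy"),
]
flat, _ = dbscan(sessions, jaccard, min_sim=0.1)
clusters, outliers = recursive_dbscan(sessions, jaccard, start=0.1, stop=0.9, step=0.4)
print(len(flat), len(clusters))  # prints 1 2
```

<p>At the low start threshold everything lands in one cluster; the next, stricter pass separates the two groups without turning any session into an outlier.</p>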
<p>The results look promising so far. No formal evaluation has been carried out yet, but a large share of the sessions are clustered, and no cluster grows so large that it becomes meaningless. An example can be found <a href="http://www.nada.kth.se/~frna02/Cluster/Clustering-0.4-%280.1%29-0.9_1.html">here.</a></p>
<p>In parallel, Eskil has developed a classification of sessions so that we can throw away a large part of the &#8220;navigational&#8221; sessions that we are fairly uninterested in. But hopefully more on that later.</p>

<a href='http://querylog.wordpress.com/2009/10/06/rekursiv-klustring-med-dbscan/chart1_0-3-0-03-0-9_1/' title='chart1_0.3-(0.03)-0.9_1'><img data-attachment-id='48' data-orig-size='600,1600' data-liked='0'width="56" height="150" src="http://querylog.files.wordpress.com/2009/10/chart1_0-3-0-03-0-9_1.png?w=56&#038;h=150" class="attachment-thumbnail" alt="chart1_0.3-(0.03)-0.9_1" title="chart1_0.3-(0.03)-0.9_1" /></a>
<a href='http://querylog.wordpress.com/2009/10/06/rekursiv-klustring-med-dbscan/chart2_0-3-0-03-0-9_1/' title='chart2_0.3-(0.03)-0.9_1'><img data-attachment-id='50' data-orig-size='800,800' data-liked='0'width="150" height="150" src="http://querylog.files.wordpress.com/2009/10/chart2_0-3-0-03-0-9_1.png?w=150&#038;h=150" class="attachment-thumbnail" alt="chart2_0.3-(0.03)-0.9_1" title="chart2_0.3-(0.03)-0.9_1" /></a>

]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/10/06/rekursiv-klustring-med-dbscan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4e4a262c61237cb73d7de8e084b91192?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Frej</media:title>
		</media:content>

		<media:content url="http://querylog.files.wordpress.com/2009/10/chart1_0-3-0-03-0-9_1.png?w=56" medium="image">
			<media:title type="html">chart1_0.3-(0.03)-0.9_1</media:title>
		</media:content>

		<media:content url="http://querylog.files.wordpress.com/2009/10/chart2_0-3-0-03-0-9_1.png?w=150" medium="image">
			<media:title type="html">chart2_0.3-(0.03)-0.9_1</media:title>
		</media:content>
	</item>
		<item>
		<title>A brief summary of our work thus far</title>
		<link>http://querylog.wordpress.com/2009/09/24/a-brief-summary-of-our-work-thus-far/</link>
		<comments>http://querylog.wordpress.com/2009/09/24/a-brief-summary-of-our-work-thus-far/#comments</comments>
		<pubDate>Thu, 24 Sep 2009 12:25:54 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Implementation]]></category>
		<category><![CDATA[Session Clustering]]></category>
		<category><![CDATA[Session Segmentation]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=43</guid>
		<description><![CDATA[When not studying the literature we look at how we can use what we’ve learned to accomplish our goals. We have made considerable progress so far but have neglected to take proper notes during our work. For the sake of completeness we will start reporting our progress following a brief summary of what we have [...]]]></description>
			<content:encoded><![CDATA[<p>When not studying the literature we look at how we can use what we’ve learned to accomplish our goals. We have made considerable progress so far but have neglected to take proper notes during our work. For the sake of completeness we will start reporting our progress following a brief summary of what we have achieved so far.</p>
<h3>Parsing, session splitting and storage</h3>
<p>We decided to cleanly separate the task of parsing the AOL logs into sessions from the task of clustering those sessions. There were several reasons for this, the most important being that fetching sessions from a database is far more attractive than re-parsing the log files every time we want to test a clustering algorithm.</p>
<p>Parsing the logs was fairly straightforward, although it is worth mentioning that they contain some garbage and do not follow the bundled specification perfectly. Among other things, we have assumed that consecutive repetitions of the same query are not independent queries but a single query on which the user performed further operations, such as requesting the next page or refreshing the page.</p>
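<p>The repeated-query assumption can be sketched in a few lines of Python. This is only an illustration, not our actual parser: the <code>Row</code> type, the <code>collapse_repeats</code> function and the sample rows are all made up for the example.</p>

```python
from itertools import groupby
from typing import NamedTuple


class Row(NamedTuple):
    user_id: str   # hypothetical field names, chosen for this sketch
    query: str
    time: str


def collapse_repeats(rows):
    """Keep only the first row of each run of identical (user, query) pairs."""
    runs = groupby(rows, key=lambda r: (r.user_id, r.query))
    return [next(run) for _, run in runs]


# Invented sample data: the second row is treated as a "next page" action.
rows = [
    Row("u1", "cheap flights", "2006-03-01 10:00:00"),
    Row("u1", "cheap flights", "2006-03-01 10:00:40"),
    Row("u1", "hotel stockholm", "2006-03-01 10:05:00"),
]
for row in collapse_repeats(rows):
    print(row.user_id, row.query, row.time)
```

<p>Because <code>groupby</code> only merges adjacent rows, the same query issued again later by the same user, after other queries, is still kept as a separate query.</p>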
<p>When splitting sessions we have noted, as many others before us, that “time gap is not enough”. We have had the most success combining the time gap with the Levenshtein distance between consecutive queries. As a benchmark, we compare the positions where our splitter inserts session boundaries in our test data against our gold standard. The algorithm places its splits correctly roughly 75% of the time.</p>
<p>Early on we played with the idea of storing sessions in a relational database such as MySQL running on a separate server. The performance was appalling, so we decided to use local instances of the flat Berkeley DB database instead.</p>
<p>The whole process from reading logs to storing sessions now takes less than an hour. During this time roughly 20 million queries are combined into 8 million sessions.</p>
<h3>Clustering</h3>
<p>Our initial strategy was to partition the sessions with Bisecting K-means. We had to rethink this when it became apparent that K-means would have trouble with similarity based on such small amounts of text and URLs. Instead we opted for the DBSCAN algorithm, as Wen et al. did in <em>Query Clustering Using User Logs</em>. We had hoped to spare ourselves the work of implementing the algorithm by using an existing implementation. Weka has one, but it does not give us the control over the similarity measure that we require. Luckily the algorithm itself is not very complicated, and we managed to implement it ourselves without much hassle.</p>
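<p>To show why a pluggable similarity measure matters, here is a minimal DBSCAN sketch in Python. It is not our implementation: the names <code>dbscan</code>, <code>min_sim</code>, <code>min_pts</code> and <code>overlap</code>, the toy sessions and the parameter values are all assumptions made for the example, and neighbourhoods are found by brute force.</p>

```python
def dbscan(items, similarity, min_sim, min_pts):
    """Return one label per item: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(items)
    # Brute-force neighbourhoods; each item is assumed to be similar to itself.
    neighbours = [
        [j for j in range(n) if similarity(items[i], items[j]) >= min_sim]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # noise for now; may still become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
    return labels


def overlap(a, b):
    """Toy similarity: shared terms divided by all terms of both sessions."""
    return len(a & b) / len(a | b)


sessions = [
    {"google", "mail"}, {"google", "maps"}, {"google"},
    {"new", "york", "hotels"}, {"new", "york", "city"},
    {"sleeping", "pills"},
]
print(dbscan(sessions, overlap, min_sim=0.5, min_pts=2))
```

<p>Any function that takes two sessions and returns a number can be passed in place of <code>overlap</code>, which is the kind of control we were missing.</p>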
<p>DBSCAN is often quoted as running in O(n log n) time, but that figure assumes that the neighbours of a node can be fetched in O(log n) time. In our implementation, fetching the neighbours of one node means comparing it with every other session, so a full run requires O(n^2) time. Initially we used a pre-calculated bit set for this purpose. This gave us run times comparable to those of Wen et al., with 20,000 sessions clustered in about 3 minutes. Unfortunately we have been forced to accept that clustering the full set of sessions is infeasible due to time and memory constraints. This raises two questions that we will need to answer:</p>
<ol>
<li>Is analysis of one or several samples enough to let us draw conclusions about the contents of the whole log? (Yes, probably.)</li>
<li>How large must each sample be, and how many samples do we need, for those conclusions to be reliable?</li>
</ol>
<p>So far the clustering results are difficult to interpret. A single DBSCAN run with non-trivial parameters either discards roughly 75% of the data as noise or produces one cluster (usually Google, Yahoo or MySpace in our samples) that dominates all the others. We cannot seem to find a set of parameters that lets us include the noise and still keep a wide variety of clusters. This may indicate that no single set of parameters is good for all concepts: some concepts simply have a lower level of internal similarity than others. Here are the ideas we are working on right now to combat this problem:</p>
<ul>
<li><strong>Incorporation of noun phrases</strong>. <em>New York City</em> is the most frequent trigram in our data set, but we have yet to see a cluster describing it as a concept. Frequent trigrams probably indicate terms that should be treated as a single unit in similarity calculations.</li>
<li><strong>Recursion</strong>. If one cluster dominates the others, why not attempt to split it by clustering the sessions within it? This is what Bisecting K-means does. As a bonus, this produces hierarchical results, letting us identify <em>Google Mail</em> as a sub-cluster of <em>Google</em>.</li>
<li><strong>Excluding large amounts of data prior to clustering</strong>. We note that a large number of sessions could be classed as <em>navigational</em>, whereas we believe <em>informational</em> sessions hold the most interest. It is also probable that the two classes have different optimal clustering parameters. Separating them may improve the clustering results.</li>
</ul>
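<p>As an illustration of the first idea, frequent trigrams can be counted with a few lines of Python. The function name <code>top_trigrams</code> and the sample queries are invented for this sketch; how the resulting phrases would then enter the similarity calculation is left open.</p>

```python
from collections import Counter


def top_trigrams(queries, k=3):
    """Return the k most frequent word trigrams over all queries."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        # zip stops at the shortest input, so short queries yield no trigrams
        counts.update(zip(words, words[1:], words[2:]))
    return counts.most_common(k)


queries = [
    "new york city hotels",
    "New York City",
    "hotels in new york city",
    "google mail",
]
print(top_trigrams(queries, k=1))
```

<p>A frequent trigram found this way could be replaced by a single token, for example <code>new_york_city</code>, before sessions are compared.</p>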
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/09/24/a-brief-summary-of-our-work-thus-far/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
		<item>
		<title>Navigational or informational?</title>
		<link>http://querylog.wordpress.com/2009/09/21/navigational-or-informational/</link>
		<comments>http://querylog.wordpress.com/2009/09/21/navigational-or-informational/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 12:56:03 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Search Behavior]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=28</guid>
		<description><![CDATA[Determining the success of a user’s search session involves analyzing the user’s goal. What is the user hoping to achieve? In Automatic Identification of User Goals in Web Search, Uichin Lee, Zhenyu Liu and Junghoo Cho summarize the state of search goal classification. According to them, there exist at least two categories of goals: A [...]]]></description>
			<content:encoded><![CDATA[<p>Determining the success of a user’s search session involves analyzing the user’s goal. What is the user hoping to achieve? In <em>Automatic Identification of User Goals in Web Search</em>, Uichin Lee, Zhenyu Liu and Junghoo Cho summarize the state of search goal classification. According to them, there exist at least two categories of goals:</p>
<blockquote><p>A query is considered <em>navigational</em> when a user has a particular Web page in mind and is primarily interested in visiting the page. <em>Informational queries</em>, on the other hand, refer to the queries where the user does not have a particular page in mind or intends to visit multiple pages to learn about a topic.</p></blockquote>
<p>There may well be more than two categories, and previous research has proposed others, but these two are a good starting point that researchers seem to agree on.</p>
<h3>Is it possible to predict what category a query belongs to?</h3>
<p>Lee, Liu and Cho set out to determine whether it is possible to predict which of the two categories, navigational or informational, a query belongs to. They had a panel of 28 members of the UCLA Computer Science department annotate the 50 most popular queries issued to Google from within the department. The panel members voted on the best classification for each query. For most queries the panel agreed on the classification, and the authors concluded that it is indeed possible to predict the query type.</p>
<p>It is unfortunate that the study examined only the most popular queries. Simple, intuitive queries are usually the most frequent in a search system, and it is not unreasonable to suspect that the chosen queries are much easier to classify than less frequent, more ambiguous ones. The ease of classifying these 50 popular queries is not necessarily representative of all queries.</p>
<p>The authors discovered that the queries that were difficult to classify were those about software or people. They therefore decided to discard these queries and work only with those that were easy to classify, noting that query routing can be used to give such difficult queries special treatment.</p>
<h3>Automating the process</h3>
<p>In automating the process, Lee, Liu and Cho looked at <em>past user-click behavior</em> and <em>anchor-link distribution</em>. Past user-click behavior captures which links users have clicked for a given query in the past. For each query, the authors build a histogram of the clicked links. The shape of this distribution is assumed to tell us something about the class of the query. If a majority of the clicks share the same destination, the query is likely to be navigational. If, on the other hand, the distribution is flat, the query probably has informational intent.</p>
<p>Mathematically, this can be expressed through several measures. The authors experimented with the mean and median along with skewness and kurtosis. Simply put, skewness tells us how much the sorted distribution leans to the left, and kurtosis tells us how sharply the distribution peaks or how flat it is. For both measures, a higher value is expected to indicate a higher probability of navigational intent.</p>
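<p>A rough Python sketch of these measures follows. It assumes the click distribution is taken over links ranked by click count, with skewness and kurtosis computed as standardized third and fourth moments; the exact definitions in the paper may differ, and the function name <code>click_shape</code> and the click data are invented.</p>

```python
from collections import Counter


def click_shape(clicked_urls):
    """Return (mean rank, skewness, kurtosis) of a query's click distribution."""
    # Rank 1 is the most clicked link, rank 2 the second most clicked, etc.
    counts = sorted(Counter(clicked_urls).values(), reverse=True)
    total = sum(counts)
    probs = [count / total for count in counts]
    ranks = range(1, len(probs) + 1)
    mean = sum(r * p for r, p in zip(ranks, probs))

    def central_moment(k):
        return sum(p * (r - mean) ** k for r, p in zip(ranks, probs))

    variance = central_moment(2)
    if variance == 0:
        # All clicks went to one link; the shape measures are undefined here.
        return mean, None, None
    skewness = central_moment(3) / variance ** 1.5
    kurtosis = central_moment(4) / variance ** 2
    return mean, skewness, kurtosis


# Invented click data: one dominant link versus four equally popular links.
navigational = ["a"] * 90 + ["b"] * 5 + ["c"] * 3 + ["d"] * 2
informational = ["a", "b", "c", "d"] * 25
print(click_shape(navigational))
print(click_shape(informational))
```

<p>On this toy data the dominated distribution gets clearly higher skewness and kurtosis than the flat one, which matches the expectation described above.</p>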
<p>Another feature is the average number of clicks a query results in. A single click probably indicates a navigational query, whereas we would expect informational queries to require several clicks.</p>
<p>Anchor-link distribution is based on link data from sites on the Web. For each query a distribution is compiled from links on the web that contain the query in the anchor description. Needless to say this requires massive amounts of data from highly informative and authoritative web sites. As with past user-click distributions the mean, median, skewness and kurtosis measures provide good candidate features.</p>
<p>The results show that, for past user-click behavior, mean, median, skewness and kurtosis are equally good at predicting the query type, each with about 80% accuracy. The average number of clicks yields the same accuracy. All of these features are biased toward informational queries. The anchor-link distribution features reach 75% accuracy each, but with a bias toward navigational queries. Combining features from both anchor-link distribution and past user-click behavior gave the authors 90% accuracy.</p>
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/09/21/navigational-or-informational/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
		<item>
		<title>The America Online search query log</title>
		<link>http://querylog.wordpress.com/2009/09/21/the-america-on-line-search-query-log/</link>
		<comments>http://querylog.wordpress.com/2009/09/21/the-america-on-line-search-query-log/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 08:36:16 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=24</guid>
		<description><![CDATA[In August of 2006 America Online (AOL) made three months of query logs from their web search engine available to the public. Large data sets for information retrieval experiments are difficult to get hold of, and AOL received some praise for providing this new, complete and authentic data. However, for the most part they [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align:justify;">In August of 2006 America Online (AOL) made three months of query logs from their web search engine available to the public. Large data sets for information retrieval experiments are difficult to get hold of, and AOL received some praise for providing this new, complete and authentic data. However, for the most part they were criticized for releasing data that could expose sensitive information about individual users of the service. Of special note is the New York Times article <a title="A Face Is Exposed For AOL Searcher No. 4417749" href="http://www.nytimes.com/2006/08/09/technology/09aol.html?ex=1312776000&amp;en=996f61c946da4d34&amp;ei=5088&amp;partner=rssnyt&amp;emc=rss">A Face Is Exposed For AOL Searcher No. 4417749</a>, in which the identity of one of AOL’s users is determined through inspection of the logs alone. Within four days of releasing the data AOL had published an official letter of apology to its users and removed the data from public access. By then the logs had already been mirrored around the Internet.</p>
<p style="text-align:justify;">As far as data goes, the AOL logs are nice to work with because they are authentic and show how real people use web search. The log contains over 20 million queries issued by 650,000 users over three months. The data is split over ten files, with each row formatted as follows:</p>
<p style="text-align:justify;">{UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}</p>
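<p style="text-align:justify;">A row in this format could be parsed roughly as follows. This is a sketch under assumptions: we take the fields to be tab-separated, the timestamp to look like <code>2006-03-01 07:17:12</code>, and the last two fields to be absent when no result was clicked. The <code>LogRow</code> type, the <code>parse_row</code> function and the sample line are made up for the example.</p>

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogRow(NamedTuple):
    user_id: str
    query: str
    query_time: datetime
    clicked_rank: Optional[int]   # None when no result was clicked
    destination: Optional[str]    # None when no result was clicked


def parse_row(line):
    """Parse one tab-separated log line (assumed layout) into a LogRow."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad rows that have no click fields
    user_id, query, time, rank, url = fields[:5]
    return LogRow(
        user_id,
        query,
        datetime.strptime(time, "%Y-%m-%d %H:%M:%S"),
        int(rank) if rank else None,
        url or None,
    )


# Invented sample line, not taken from the real log.
print(parse_row("42\tnew york city\t2006-03-01 07:17:12\t1\thttp://www.example.com\n"))
```

<p style="text-align:justify;">A malformed row makes <code>strptime</code> or <code>int</code> raise <code>ValueError</code>, which a caller could catch in order to skip the row.</p>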
<p style="text-align:justify;">Our primary interest is in private networks such as corporate intranets, and we expect many differences between the AOL logs and the logs we intend to study. In particular, we assume that search in private networks is more domain-oriented and focused than web search. Users searching for information at a medical university are probably interested in chemical compounds or what’s for lunch at the local cafeteria, but are probably not looking to use the intranet search application to buy a car. The web, on the other hand, holds information on very diverse topics, and we expect its users to be looking for all of the above and much more.</p>
<p style="text-align:justify;">On the whole, we believe the AOL logs are useful for the core of our research needs. After all, the framework of parsing the logs, dividing them into sessions and clustering them should be common to any search engine implementation, regardless of domain.</p>
<p>Some general information about the logs can be found in the Wikipedia article <a title="AOL search data scandal" href="http://en.wikipedia.org/wiki/AOL_search_data_scandal">AOL search data scandal</a>. <a title="Chronicle of AOL Search Query Log Release Incident" href="http://sifaka.cs.uiuc.edu/xshen/aol_querylog.html">Chronicle of AOL Search Query Log Release Incident</a> has a timeline of the events of the AOL log scandal.</p>
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/09/21/the-america-on-line-search-query-log/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
		<item>
		<title>Beyond the Session Timeout</title>
		<link>http://querylog.wordpress.com/2009/09/03/beyond-the-session-timeout/</link>
		<comments>http://querylog.wordpress.com/2009/09/03/beyond-the-session-timeout/#comments</comments>
		<pubDate>Thu, 03 Sep 2009 17:29:35 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Session Segmentation]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=12</guid>
		<description><![CDATA[We are far from the only ones studying session segmentation. User sessions seem to be the basic platform for any query log mining activity. This shouldn’t come as a surprise. After all, while individual queries are short, uninformative and ambiguous, sessions provide a context with the potential to remedy many of the problems with user queries. [...]]]></description>
			<content:encoded><![CDATA[<p>We are far from the only ones studying session segmentation. User sessions seem to be the basic platform for any query log mining activity. This shouldn’t come as a surprise. After all, while individual queries are short, uninformative and ambiguous, sessions provide a context with the potential to remedy many of the problems with user queries.</p>
<h3>Are timeouts good enough?</h3>
<p>So how do we go about finding sessions? In their paper <a href="http://portal.acm.org/citation.cfm?id=1458176&amp;dl=GUIDE&amp;coll=GUIDE&amp;CFID=51232498&amp;CFTOKEN=88510591">Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs</a>, Rosie Jones and Kristina Lisa Klinkner provide a short background to this problem before presenting their own methods and findings. They note that most recent attempts have been based on temporal features. The theory is that if enough time has passed between two consecutive queries, the user is likely to have moved on to a different task. There is some evidence to support this, but as Jones and Klinkner point out in the title of their paper, temporal features by themselves simply aren’t enough.</p>
<p>As an example, take #4578, a fictitious user. Like any other user, #4578 often uses her preferred search engine to find information on whatever she is up to at the moment. Her actions are stored in a query log and left for us to investigate. By simply inspecting the queries we can in most cases determine where #4578 shifts the topic of her searches. Sometimes, though, often because queries are ambiguous, we cannot. In those cases the timestamp can help us. If there was only a short pause between two queries, then perhaps #4578 is simply reformulating her query. A large gap, however, leads us to suspect that she has either found what she was looking for or given up, and is now looking for something different.</p>
<p>So yes, timestamps are useful in finding boundaries between search sessions. Unfortunately there isn&#8217;t a single time gap that will allow us to confidently determine where sessions start and end. The reason is that this metric, just like the query string itself, is highly ambiguous. Does a long gap really indicate that the user has stopped searching for something and begun afresh? Yes, but only if we choose that gap to be hours or even days. And if we do, then we will miss all those occasions where #4578 finds what she is looking for and immediately switches to the next task.</p>
<h3>The hierarchical model</h3>
<p>Jones and Klinkner contribute to the problem of session detection in two ways. For starters, they propose a hierarchical model for query logs. Using their terminology, a <em>Search Session</em> is the set of all queries in the log issued by the same user, for example by #4578. Sessions contain <em>Search Missions</em>. These constitute the user’s intentions or needs, for example the need to have a suit cleaned. Finally, missions may contain <em>Search Goals</em>. Search goals are subtasks that need to be carried out in order to achieve an intention or need, for example finding a decent dry cleaner, and finding a service that can plot the route to the cleaner’s on a map.</p>
<p>Jones and Klinkner acknowledge that the term <em>Search Session</em> has traditionally been used to describe those queries submitted by a single user and sharing the same intent, but point out that not even among previous writers has there existed a common definition. For the sake of conformity we will continue to let <em>Search Session</em> mean the set of queries a user issues for a given intent, that is, what Jones and Klinkner define as a <em>Search Goal</em>.</p>
<h3>If timeouts aren&#8217;t enough, what else is needed?</h3>
<p>Their second contribution is a classifier that is able to tag search sessions with up to 95% accuracy. By comparison, the best classifiers based solely on temporal data reach only 70% accuracy. In developing a better classifier, Jones and Klinkner looked at the following features:</p>
<ul>
<li><em>Temporal features</em>, such as the time gap between consecutive queries, which are still useful in combination with other features.</li>
<li><em>Word and character edit features</em>, such as the number of words the queries have in common.</li>
<li><em>Query log sequence features</em>, such as frequently collocated pairs of words in the query log as a whole.</li>
<li><em>Web search features</em>, such as having clicked links in common or the overall similarity of the queries&#8217; result sets.</li>
</ul>
<p>One of the most interesting results of their investigation is the great impact of <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> as a similarity feature. Levenshtein distance, also known as edit distance, is the minimum number of single-character insertions, deletions and substitutions required to transform one string into another. By itself, Levenshtein distance managed to yield 90% accuracy when tagging sessions. Given how easy the algorithm is to implement, it would probably serve as a good start for our own classifier.</p>
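<p>For reference, the distance itself fits in a dozen lines of Python. The <code>same_session</code> rule below is only a hypothetical starting point of our own, not the classifier from the paper: the 30-minute gap and the 0.6 threshold on the length-normalized distance are placeholder values chosen for the example.</p>

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        prev = cur
    return prev[-1]


def same_session(q1, q2, gap_seconds, max_gap=1800, max_norm_dist=0.6):
    """Hypothetical rule: a long pause always splits, else edit distance decides."""
    if gap_seconds > max_gap:
        return False
    longest = max(len(q1), len(q2)) or 1  # avoid dividing by zero
    return levenshtein(q1, q2) / longest <= max_norm_dist


print(levenshtein("kitten", "sitting"))
print(same_session("dry cleaner", "dry cleaners stockholm", gap_seconds=40))
```

<p>Dividing by the length of the longer query keeps the threshold comparable between short and long queries.</p>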
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/09/03/beyond-the-session-timeout/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
		<item>
		<title>Some background to our work</title>
		<link>http://querylog.wordpress.com/2009/08/26/some-background-to-our-work/</link>
		<comments>http://querylog.wordpress.com/2009/08/26/some-background-to-our-work/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 11:54:31 +0000</pubDate>
		<dc:creator>Eskil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://querylog.wordpress.com/?p=4</guid>
		<description><![CDATA[Why analyze query logs? A query log is a sequence of records of users interacting with an information retrieval system such as a search engine. Ultimately, the success of such a system is measured by how well its users can use it to find what they&#8217;re looking for. In this context query logs can help [...]]]></description>
			<content:encoded><![CDATA[<h3>Why analyze query logs?</h3>
<p>A query log is a sequence of records of users interacting with an information retrieval system such as a search engine. Ultimately, the success of such a system is measured by how well its users can use it to find what they&#8217;re looking for. In this context query logs can help us evaluate how well the system is performing. If we can somehow detect what users are looking for, and if we can measure how well they seem to be able to find this information in the system, then we can measure how successful the system is. The same methods can also be used to identify areas where the system is not performing well, and where efforts from administrators and information managers can help improve it. Together this makes query log analysis a key tool for administering an information retrieval system.</p>
<h3>What are users looking for?</h3>
<p>Two popular metrics when analyzing query logs are <em>most frequent queries</em> and <em>most frequent 0-hit queries</em>. A list of most frequent queries is assumed to tell us what people are looking for in the system. A list of most frequent 0-hit queries tells us what people are looking for and not finding. A good system administrator or information manager will want to know what frequent queries aren’t being answered so that he or she can take steps to add the information or make it more accessible.</p>
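<p>Both lists are easy to compute once each log record carries the number of hits the query returned. The sketch below assumes such records are available as <code>(query, hit_count)</code> pairs; the function name <code>frequent_queries</code> and the sample log are invented for the example.</p>

```python
from collections import Counter


def frequent_queries(log, k=10):
    """Return (most frequent queries, most frequent 0-hit queries).

    log is an iterable of (query, hit_count) pairs (an assumed layout).
    """
    all_queries = Counter()
    zero_hit = Counter()
    for query, hits in log:
        normalized = " ".join(query.lower().split())  # fold case and spacing
        all_queries[normalized] += 1
        if hits == 0:
            zero_hit[normalized] += 1
    return all_queries.most_common(k), zero_hit.most_common(k)


# Invented sample log.
log = [
    ("lunch menu", 12), ("Lunch  menu", 12), ("lunch menu", 12),
    ("vacation form", 0), ("vacation form", 0),
    ("parking", 3),
]
top, top_zero = frequent_queries(log, k=2)
print(top)
print(top_zero)
```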
<p>However, merely counting occurrences of queries isn’t good enough. The problem is that there may be many queries describing a given need. How do we know which queries represent the same need? At the same time a query may be relevant to more than one need. How do we know which of several different needs a query represents? Thus when compiling a list of the most frequent queries, we are actually producing a list of the most entered query strings, when we would much rather learn the actual information needs of the user.</p>
<p>How do we go about doing this? We have already assumed that there is a connection between the information need of a user and the queries he or she issues to the search system. Based on previous research we believe that studying <em>search sessions </em>will yield better clues as to what the user was looking for than studying plain search queries. A search session contains all the queries that a single user issued to the system in order to fill a single information need. Furthermore, if the query log provides such information, we can extract additional clues from which documents the user clicked during the search session.</p>
<p>Two sessions containing similar queries and sharing clicked documents are probably also similar with respect to the information needs of the two users. Using this similarity, we can divide the set of search sessions in the query log into clusters, where sessions in the same cluster reflect the same need. Large clusters imply information needs that many users are trying to fill, which is exactly what we are looking for.</p>
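<p>A minimal sketch of this idea follows. It is not the method of any specific article. It assumes that each session has been reduced to a set of query terms and a set of clicked document identifiers, measures similarity as a weighted Jaccard overlap, and groups sessions with a simple single-link rule. The weight, the threshold and the field names are all hypothetical choices.</p>

```python
def jaccard(a, b):
    # Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical).
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def session_similarity(s1, s2, click_weight=0.5):
    # Blend query-term overlap with clicked-document overlap.
    term_sim = jaccard(s1["terms"], s2["terms"])
    click_sim = jaccard(s1["clicks"], s2["clicks"])
    return (1 - click_weight) * term_sim + click_weight * click_sim

def cluster_sessions(sessions, threshold=0.3):
    # Single-link grouping: sessions joined by a chain of similar pairs share a cluster.
    parent = list(range(len(sessions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sessions)):
        for j in range(i + 1, len(sessions)):
            if session_similarity(sessions[i], sessions[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(sessions)):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, since they suggest the most common needs.
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical sessions.
example_sessions = [
    {"terms": {"sleeping", "pills"}, "clicks": {"doc1"}},
    {"terms": {"sleeping", "pills", "pharmacy"}, "clicks": {"doc1", "doc2"}},
    {"terms": {"train", "timetable"}, "clicks": {"doc9"}},
]
clusters = cluster_sessions(example_sessions)
```

<p>The pairwise comparison is quadratic in the number of sessions, so a real query log would need a more scalable clustering method.</p>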
<h3>Our contribution</h3>
<p>Methods like the ones above have been outlined in a few recent scientific articles. We aim to implement a query log topic detection system based on these, but we also hope to contribute extensions to topic detection that we believe to be useful when evaluating an information retrieval system.</p>
<p>First, we intend to investigate whether it is possible to evaluate how well the information retrieval system satisfies the information needs of the users. For example, if many users are searching for a topic that can be roughly described as “pharmacy sleeping pills”, then we would like to know if they were able to find relevant information regarding this topic. We hope to gain an understanding of this by analyzing each search session and determining how likely it is that the user successfully found what he or she was looking for. Through this we can then draw conclusions for each topic based on the ratio of successful to unsuccessful sessions.</p>
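<p>To make the per-topic ratio concrete, here is a small sketch. How to judge whether a session was successful is an open question in this project, so the <code>looks_successful</code> rule below (the session ended with a click) is only a placeholder assumption, as are the session layout and the topic labels. For simplicity it reports the share of successful sessions per topic.</p>

```python
from collections import defaultdict

def looks_successful(session):
    # Placeholder heuristic (an assumption): the session ended with a click.
    events = session["events"]
    return bool(events) and events[-1][0] == "click"

def success_ratio_by_topic(sessions):
    # Share of sessions per topic that the heuristic marks as successful.
    totals = defaultdict(int)
    successes = defaultdict(int)
    for s in sessions:
        totals[s["topic"]] += 1
        if looks_successful(s):
            successes[s["topic"]] += 1
    return {topic: successes[topic] / totals[topic] for topic in totals}

# Hypothetical sessions that have already been assigned to topics.
topic_sessions = [
    {"topic": "pharmacy sleeping pills",
     "events": [("query", "sleeping pills"), ("click", "doc1")]},
    {"topic": "pharmacy sleeping pills",
     "events": [("query", "sleeping pills"), ("query", "sleep pills pharmacy")]},
    {"topic": "train timetable",
     "events": [("query", "train timetable"), ("click", "doc9")]},
]
ratios = success_ratio_by_topic(topic_sessions)
```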
<p>Second, we will present temporal statistics for the search topics. By doing this we can show how popular each topic is in a given time frame, and also track changes in topic popularity over time. This can be combined with our first contribution so that it is possible to track how well the system performs for a topic with regard to the share of sessions that managed to find relevant information. In the event that popular topics aren’t being covered by the system, administrators and information managers may take steps to improve this share. Temporal statistics will be useful in showing whether their efforts had an effect.</p>
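<p>A simple form of such temporal statistics is a count of sessions per topic and calendar week, sketched below. The session layout and the choice of ISO weeks as the time frame are assumptions for this example.</p>

```python
from collections import Counter
from datetime import date

def weekly_topic_counts(sessions):
    # Count sessions per (topic, ISO year, ISO week).
    counts = Counter()
    for s in sessions:
        year, week, _ = s["date"].isocalendar()
        counts[(s["topic"], year, week)] += 1
    return counts

# Hypothetical sessions with a topic label and a date.
dated_sessions = [
    {"topic": "pharmacy sleeping pills", "date": date(2009, 10, 12)},
    {"topic": "pharmacy sleeping pills", "date": date(2009, 10, 16)},
    {"topic": "pharmacy sleeping pills", "date": date(2009, 10, 20)},
]
weekly = weekly_topic_counts(dated_sessions)
```

<p>The first two sessions fall in the same week and the third in the following one, so the topic gets one entry per week. Applying the same grouping to successful sessions only would give the share of successful sessions per topic and week.</p>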
]]></content:encoded>
			<wfw:commentRss>http://querylog.wordpress.com/2009/08/26/some-background-to-our-work/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a25590cc78d5add8ca26d8f71c231287?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">andreene</media:title>
		</media:content>
	</item>
	</channel>
</rss>
