Till KTH:s startsida Till KTH:s startsida

Computer assignments

The computer assignments are carried out individually, and examined orally by the computer. Each assignment is worth 2 credits (about 50 hours of work).

  • Assignment 1 (Deadline February 9, 2016) Book a time for presentation in the Doodle below, make a note that you are presenting Assignment 1.
  • Assignment 2 (Deadline March 8, 2016) Book a time for presentation in the Doodle below, make a note that you are presenting Assignment 2.
  • Assignment 3 (Deadline April 5, 2016) Book a time for presentation in the Doodle below, make a note that you are presenting Assignment 3.

On May 27, 13-15, there will be an delayed-assignment presentation. Book a time for presentation in this Doodle. If all times are taken, make a booking without choosing any timeslot, and I will create more times. The Doodle will close Wednesday May 25 at 19.00.

The assignment should be completed before the start of the respective review session. There is no time for help during the sessions. For help with the assignments, please post a question in the News feed in the menu to the left, and we will try to help as soon as possible.

Grading

Each assignment is graded as A-F(fail) at the review session. In the case of F, the assignment has to be presented again at a later session. The grade depends on how large part of the assigment that was carried out correctly - this is specified on each assigment description. The grade is also affected by delays in the presentation of the assignment. The tables below show how the grades for assignment 1, 2, and 3 decays over time. "Grade Assignment X" corresponds to the grade according to the specification on the assignment description:

Deadline February 9 March 8 April 5 April 29 May 27
Grade Assignment 1
A A B C D E
B B C D E E
C C D E E E
D D E E E E
E E E E E E
Deadline March 8 April 5 April 29 May 27 August 26
Grade Assignment 2
A A B C D E
B B C D E E
C C D E E E
D D E E E E
E E E E E E
Deadline April 5 April 29 May 27 August 26 September 23
Grade Assignment 3
A A B C D E
B B C D E E
C C D E E E
D D E E E E
E E E E E E

In other words: Be on time with the assignments! The first three dates are scheduled; to present before any of the later deadlines in April-September, please make an individual appointment with Hedvig Kjellström or Johan Boye at least two weeks in advance.

The grades on the individual assignments are weighted together according to the following:

Computer Assignment GradeCriteria (grades for individual assignments, regardless of order)
A (A,A,A), (A,A,B)
B (A,A,C), (A,A,D), (A,A,E), (A,B,B), (A,B,C), (A,B,D), (A,C,C), (B,B,B), (B,B,C)
C (A,B,E), (A,C,D), (A,C,E), (A,D,D), (A,D,E), (B,B,D), (B,B,E), (B,C,C), (B,C,D), (B,C,E), (B,D,D), (C,C,C), (C,C,D)
D (A,E,E), (B,D,E), (B,E,E), (C,C,E), (C,D,D), (C,D,E), (C,E,E), (D,D,D), (D,D,E)
E (D,E,E), (E,E,E)

Hedvig Kjellström skapade sidan 9 december 2015

Lärare Hedvig Kjellström ändrade rättigheterna 9 december 2015

Kan därmed läsas av alla och ändras av lärare.
Lärare kommenterade 16 januari 2016

Dear all,

Very welcome to the course! The first assignment is now published on the homepages.

All the best,

/Hedvig Kjellström, course director

kommenterade 19 januari 2016

Hi! These code skeletons are in Java, right? Are there any specific requirements to use Java, or can we choose other languages?

Lärare kommenterade 19 januari 2016

You can use any language you like!

Best,

/Hedvig

kommenterade 21 januari 2016

Hi, I am an EIT Digital student. I don't get this part:

The   code   needed   for   the   assignment   can   be   found   in   the   course   directory  /info/DD2476/ir16/labon  the  CSC  UNIX system

where exactly is this course directory on the CSC Unix system? how can I access it? ssh, sftp, scp?

Lärare kommenterade 21 januari 2016

Mustafa, I will see to it that you get a CSC account.

kommenterade 23 januari 2016

I have a problem on assignment 1.2, i get the wrong number of hits for the query "money transfer" (106 vs 105). While trying to debug, I noticed that the grep command "grep –rEil ”\btransfer\b” . > textfile" found less occurences than the program when just querying for "transfer". I found the files that the program found but not the grep-command and both of these contained the word "[...]_transfer". Trying the intersect grep command I found that the file CalFresh.f was the difference, and that file contains "_Transfer" and "money". The tokenizer seems to be causing the difference, and not any code that the assignment have had us writing so far. What should we do about this? Have anyone else gotten the right result for 1.2?

Lärare kommenterade 23 januari 2016

See earlier postings.

kommenterade 23 januari 2016

Is it possible to access the course directory from a home computer somehow so I can download the code needed for the assignment?

kommenterade 23 januari 2016

this worked for me

scp -r <your csc userid>@u-shell.csc.kth.se:/info/DD2476/ir16/lab .

kommenterade 23 januari 2016

Thank you, it worked

Lärare kommenterade 25 januari 2016

The Assignment 1 description is now updated:

   1) Info on how to use the grep debugging - grep gives lower bound on #docs since it only looks at whitespace-surrounded terms.

   2) The #doc counts in sections 1.2 and 1.3 are now the ones returned from the tokenizer, not from grep.

Please print out the new version (or update your electronic copy) so that you have the new debugging instructions, you will need them in the subsequent assignments as well, when the deviations between different tokenizers will be more pronounced.

En användare har tagit bort sin kommentar
Hedvig Kjellström redigerade 25 januari 2016

The computer assignments are carried out individually, and examined orally by the computer. Each assignment is worth 2 credits (about 50 hours of work).


* Assignment 1 (Deadline February 9, 2016) EDIT January 25
* Assignment 2 (Deadline March 8, 2016) 
* Assignment 3 (Deadline April 1, 2016)
The assignment should be completed before the start of the respective review session. There is no time for help during the sessions. For help with the assignments, please post a question in the News feed in the menu to the left, and we will try to help as soon as possible.

Grading Each assignment is graded as A-F(fail) at the review session. In the case of F, the assignment has to be presented again at a later session. The grade depends on how large part of the assigment that was carried out correctly - this is specified on each assigment description. The grade is also affected by delays in the presentation of the assignment. The tables below show how the grades for assignment 1, 2, and 3 decays over time. "Grade Assignment X" corresponds to the grade according to the specification on the assignment description:

Deadline February 9 March 8 April 1 April 29 May 27 Grade Assignment 1 A A B C D E B B C D E E C C D E E E D D E E E E E E E E E E Deadline March 8 April 1 April 29 May 27 August 26 Grade Assignment 2 A A B C D E B B C D E E C C D E E E D D E E E E E E E E E E Deadline April 1 April 29 May 27 August 26 September 23 Grade Assignment 3 A A B C D E B B C D E E C C D E E E D D E E E E E E E E E E In other words: Be on time with the assignments! The first three dates are scheduled; to present before any of the later deadlines in April-September, please make an individual appointment with Hedvig Kjellström or Johan Boye at least two weeks in advance.

The grades on the individual assignments are weighted together according to the following:

Computer Assignment GradeCriteria (grades for individual assignments, regardless of order) A (A,A,A), (A,A,B) B (A,A,C), (A,A,D), (A,A,E), (A,B,B), (A,B,C), (A,B,D), (A,C,C), (B,B,B), (B,B,C) C (A,B,E), (A,C,D), (A,C,E), (A,D,D), (A,D,E), (B,B,D), (B,B,E), (B,C,C), (B,C,D), (B,C,E), (B,D,D), (C,C,C), (C,C,D) D (A,E,E), (B,D,E), (B,E,E), (C,C,E), (C,D,D), (C,D,E), (C,E,E), (D,D,D), (D,D,E) E (D,E,E), (E,E,E)

kommenterade 28 januari 2016

Is it possible to present multiple assignments at the 9th of february?

Lärare kommenterade 28 januari 2016

No, on Feb 9 we only examine Assignment 1. You need to prepare a presentation of this year's assignment, there are small changes from last year.

kommenterade 28 januari 2016

Hi,

For task 1.3, are supposed to fit the index in the 1GB heap? can we use more?

BR

Lärare kommenterade 29 januari 2016

You can use any configurations you want! For all tasks up to 1.6, the index should be kept in memory, in 1.6 it should be partly on disk.

kommenterade 29 januari 2016

I have not been able to start doing the assignment1 because of the following problem. When I tried to compile the code, I received a lot of errors saying 'unmappable character for encoding UTF8'. Do I need to set the character encoding to ISO-8859-1(Latin 1) to support the special characters( '·', '‡', '‚', 'Â', '‰', 'È', 'Ë', 'Í', 'Ì', 'Ò', 'ˆ', 'Ù',etc.)?

Screen Shot 2016-01-27 at 12.27.23 PM.png

Lärare kommenterade 29 januari 2016

If you go into the source code and in the header comment replace the ö in my last name with an o, it should work! This problem is new this year, I also encountered it.

kommenterade 29 januari 2016

Tried it and nothing changed. The errors are at line 45 and 52 of SimpleTokenizer.java, though. 

kommenterade 30 januari 2016

Either convert the encoding of SimpleTokenizer.java to UTF-8, or use the "-encoding ISO-8859-1" option of the javac compiler.

kommenterade 31 januari 2016

Questions about Assignment 1 task 1.6

Is it allowed to have the whole index in working memory during indexing and first after the whole index is created write it to files, where only some of the files are read to working memory when a query is executed?

For example perform the following steps:

  1. Build the whole index in the working memory
  2. Write index to disk
  3. Read only some parts of the index from disk to working memory when a query is executed

Or should the index be written continuously to disk when it is created?

Is there a time limit for constructing the index and write it to disk?

Lärare kommenterade 31 januari 2016

No, the implementation should scale to indices that do not fit in the memory. Therefore, the entire index should never be in the memory, neither during building nor during querying.

There are no time limits.

kommenterade 1 februari 2016

How much time should it take to index davisWiki directory?

Lärare kommenterade 1 februari 2016

Depending on implementation and hardware, up to 15 mins. It should be done before the start of the presentation. Use part of the corpus while debugging.

kommenterade 1 februari 2016

Hedvig, in an apply to an earlier post, you stated (or at least I interpreted) that it was okay to assume that the index would fit into working memory during indexing for all tasks up to the last one of Assignment 2. But some posts up in this thread you say that "implementation should scale to indices that do not fit in the memory. Therefore, the entire index should never be in the memory, neither during building nor during querying." I find these two statements quite contradictory. Perhaps I've misunderstood something, but it would be great with a clarification on this.

This is the post I'm referring to: https://www.kth.se/social/course/DD2476/post/writing-to-files-in-java-is-tedious-is/

/Simon

Lärare kommenterade 1 februari 2016

Oh my, I have written Assignment 2 instead of Assignment 1. The index can fit in working memory up until the last task of Assignment 1. This is the correct statement:

In Tasks 1.1-1.5, the index can be in working memory the whole time. But in Task 1.6, the index should be on disk.

Lärare kommenterade 1 februari 2016

You can now book a 15 min time slot for presentation of Assignment 1, by following the link next to the Assignment 1 description on the homepage.

Looking forward to seeing your solutions!

Best,

/Hedvig

En användare har tagit bort sin kommentar
En användare har tagit bort sin kommentar
kommenterade 2 februari 2016

Hi, For the task 1.3, do we only need to search for the two-words phrases, or the number of words are unlimited? Thanks

Lärare kommenterade 2 februari 2016

Since you should design your own 3+ word query, your engine needs to be able to handle arbitrarily long queries.

kommenterade 2 februari 2016

Hi, I noticed that the tokenizer takes "world's" and transforms it in the tokens "worlda" and "s". However, for some other colleagues the tokenizer transforms it in the tokens "world" and "s". I use Windows. How to deal with this difference?

Lärare kommenterade 3 februari 2016

Hi Gabriela, Yes, there are some strange discrepancies between platforms and encodings. You do not have to tend to it at all, just leave as is. It is great that you were able to diagnose it!

kommenterade 3 februari 2016

How much of the index can be in memory at the same time for 1.6? Specifically, can I keep the PostingsList of each term in a query in memory for the duration of the query?

Lärare kommenterade 3 februari 2016

Absolutely! It is also a good idea to keep a cache of, say, the postings lists of the N latest used query terms in memory.

Lärare kommenterade 5 februari 2016

Assignment 2 is now published!

Have a great weekend,

/Hedvig

kommenterade 5 februari 2016

Can we use database for 1.6? Because 1.6 requires a sublineer solution and that probably means we need to keep a tree of some sort. The database is already doing that on our behalf.

Lärare kommenterade 5 februari 2016

Yes, that would be fine.

kommenterade 7 februari 2016

Will more spots open up for the Assignment 1 presentation open up?

Lärare kommenterade 7 februari 2016

Yes, rest assured that we will let everyone present. I have to make arrangements with the TAs, then I will open more slots between 18 and 19.

kommenterade 7 februari 2016

Is doing the intersection in a correct way the important part of 1.2 or implementing the exact algorithm from the book what matters? I have created my own solution for intersection and want to know if this is okay or if it's obligatory to implement the given algorithm.

kommenterade 7 februari 2016

Hi,

In task 1.4, what does "recall estimate up to an unknown scale factor" mean? Does it mean more than "calculating the recall assuming the total number of relevant documents in the corpus is equal to 100"?

BR, Mohamed

kommenterade 7 februari 2016

Here's a script that automates the process of opening every link for task 1.4. First save your search results in a file and then paste this in the terminal, editing the application and file name to suit you.

sed -E 's/^.*davisWiki\/(.*).f$/http:\/\/daviswiki.org\/\1/' graduate_program_mathematics.txt | xargs /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome

E.g. for the school computers with Ubuntu this will instead be

sed -E 's/^.*davisWiki\/(.*).f$/http:\/\/daviswiki.org\/\1/' graduate_program_mathematics.txt | xargs google-chrome

If you're using Windows you're on your own.

Lärare kommenterade 7 februari 2016

Denise, that would be ok, but be prepared to explain the differences between your method and the book - i.e., you have to understand the method in the book.

Lärare kommenterade 7 februari 2016

Mohamed, it means that you do not know the total number of relevant docs, you just guess.

Lärare kommenterade 7 februari 2016

Can everyone that want a slot for presentation of Assignment 1 send me an email? I will then arrange for the appropriate number of slots.

If you have booked a slot, but will not use it, unbook it as soon as possible, so that someone else can book it!

Looking forward to seeing your presentations!

/Hedvig

kommenterade 8 februari 2016

For 1.6, are you allowed to generate the complete index in working memory before storing it to a file, or do you have to store the index to a file continuously while generating it?

Lärare kommenterade 8 februari 2016

No, the index should at no time be entirely in memory.

kommenterade 8 februari 2016

I have troubles with getting the index generated in time, it is bounded by file IO taking roughly 20 minutes. As I understand it that's no good as I have to demonstrate the index creation at the time of the presentations. Did I get that right?

Would it be fine to generate an intermediate representation during indexing and then continue on that representation? What other (high-level) options are there?

Lärare kommenterade 8 februari 2016

It is totally fine to do the indexing before the presentation! You can prepare before the presentation by having two guis open, one with the index in memory (Task 1.1-1.5) and one with the index on disk (Task 1.6).

kommenterade 8 februari 2016

Hedvig, if one presents an assignment at the first deadline is it allowed to present it at a later deadline too if it could yield a higher grade (For example presenting E-level for assignment 1 on February 9 which results in an E and later presenting A-level on March 8 raising the grade to a B)?

kommenterade 8 februari 2016

As the Doodle is full I send a mail for a booking slot.

When will we know the Timeslot?

Thanks

Lärare kommenterade 8 februari 2016

At 22.00 there will be timeslots. I will move around people a bit after 18 since some overwrote placeholders (times without teachers).

kommenterade 9 februari 2016

Given that I have presented tasks 1.1-1.5 today (February 9), is it possible to present task 1.6 before March 8? Or do I have to wait until March 8 to present the single task 1.6? I would like to present it as soon as possible if it is permitted.

Thanks.

Lärare kommenterade 9 februari 2016

You have to wait until March 8, as there are so many students - we do not have the possibility to make separate appointments. You have to book two separate times, one for Assignment 1 and one for Assignment 2.

We normally do not allow dividing up the presentation on several occasions, as it increases the teacher admin load. It is ok for this time since you did not know, but in the future, it is much better that you wait until you have finished everything you want and present once.

kommenterade 15 februari 2016

For assignment 2, how can we check our ranked queries? Mine are roughly the same with a few switched places, is this fine? 

Thanks!

Lärare kommenterade 15 februari 2016

This is fine! As described in the assignment description, you check by counting the number of query terms in the highest ranked documents, and make an estimate of the tf_idf score for each doc. Do the higher ranked docs contain a higher percentage of query terms than the lower ranked docs?

kommenterade 25 februari 2016

Hello Hedvig,

I have some questions regarding task 2.4, 2.5 and 2.6. 

For 2.4 is | x-x' | that are in the slides computed as the sum of the differences? I used this and got almost exactly the same result as in the task, but in 2.5 you use the sums squared so I'm unsure if this is correct, or do you perhaps mean || x-x' || ?

For 2.5 it's stated that we should stop when it converges and that the goodness measure is within the criteria for the error limits stated in task 2.4, but those error limits are stated as positions and the value difference for each of the 50 documents while the goodness is their sums squared (also don't you usually take the square root of the sum here?). Do you mean when goodness 1 < EPSILON (from the code skeleton) or am I missing something here?

Also for algorithms that stop at dangling nodes in MC4-5, should they also stop with probability BORED? As otherwise they can end up in infinite loops.

For 2.6 should we really have to reverse everything and then look in the articleTitles.txt for the correct index for each document name? I see no connection between the docID used in the indexer and any of the docIDs used by the pagerank so I did this and it works but it seems like a very roundabout way to do it, have I missed something?

Lärare kommenterade 25 februari 2016

2.4: I guess you mean the convergence criterion? I does not matter, but makes more sense to use the abs diff |x-x'| since x is not really in a metric space, but just a vector with all the documents.

2.5: There is no iteration until convergence, you can only evaluate WHETHER OR NOT the method has converged to the correct solution. If you want to add an automatic way of reaching convergence, you can add more and more sample runs, and add them to the solution iteratively until the estimate has converged to the true solution.

As for the "stop at dangling nodes" they should also stop with probability 1-c in each time step.

2.6: Of course the docIDs in your inverted index do not correspond to the id in the articleTitles file. So yes, you have to take the effort to "reverse everything". :)

kommenterade 26 februari 2016

hey

does anybody know how to retrieve the source code for 2.4~6

i.e. the exact location for /info/DD2476/ir16/lab/pagerank

thanks :)

kommenterade 26 februari 2016

That is the exact location. Try using the following command to download it:

scp -r [username]@u-shell.csc.kth.se:/info/DD2476/ir16/lab/pagerank [destination]

kommenterade 27 februari 2016

In task 2.3 we are asked to compare ranked retrieval and unranked (intersection) retrieval for the query "graduate program mathematics". In particular we are asked to compare precision and recall at 10, 20, 30, 40, and 50 documents. However there are only 22 documents returned with intersection query.

Are we only supposed to compare up to 20 documents? Or should the precision and recall scores past 22 be constant?

Lärare kommenterade 27 februari 2016

For task 1.4 there is only one precision score. For task 2.3 there is a whole curve, so there you can take the precision at 10, 20 etc.

kommenterade 29 februari 2016

To what precision do we need to calculate the PageRank?

My current solution gives me only 6 decimals, making it impossible to distinguish the last 5000 entries (All end up with the same score).

Lärare kommenterade 29 februari 2016

How come? What epsilon do you use? what floating point value representation (maybe double instead)?

kommenterade 29 februari 2016

Ok found my mistake ... it was in the printout format ... double is good enough.

Tack!

kommenterade 1 mars 2016

In Task 2.1 and 2.2, am I supposed to obtain results with the exact same order as the examples ("zombie" and "attack") given in the instructions? I am not sure if I have understood this correctly. My sorted results are ordered slightly different compared to the examples. Interpreting the instructions, this seems to be ok. Am I right? Or should I aim to get the exact same order?

Lärare kommenterade 1 mars 2016

No you are not required to - the instructions are very clear about that.

kommenterade 1 mars 2016

For RandomStart, the first MC method, do we jump randomly when we hit a dead end or do we break? From the slide example, it turns red and ends on 3 (a dangling node) during the run, so that indicates to me that we end either on dangling nodes or with probability BORED.

mc1

Lärare kommenterade 1 mars 2016

No you jump to any other node from a sink (dangling node). In the image there is indeed a jump - an then it stops at 3 since it is bored.

kommenterade 1 mars 2016

Thanks!!!!

kommenterade 2 mars 2016

Please clarify a little bit about the MC task.

If the probability of stopping a random walk is the 'bored' probability, must the walker always follow a link on a page and never jump EXCEPT when he hits a dangling node? This would eliminate the probability of jumping to any page from any page.

Lärare kommenterade 2 mars 2016

Exactly, the only case in which the random surfer jumps is when the node is a sink, i.e., has no outlinks. The "bored" probabillity of stopping the walk in each timestep will have the same effect on the distribution of pagerank as random jumps with probability "bored".

Lärare kommenterade 2 mars 2016

There is now a Doodle for booking a presentation time on Tuesday. Note that the Doodle will close on Monday at 19.00! If there are no times left, make a booking without any timeslot, and I will see it and make a time for you.

Looking forward to your presentations!

En användare har tagit bort sin kommentar
kommenterade 3 mars 2016

"2.5: There is no iteration until convergence, you can only evaluate WHETHER OR NOT the method has converged to the correct solution. If you want to add an automatic way of reaching convergence, you can add more and more sample runs, and add them to the solution iteratively until the estimate has converged to the true solution." 

So my interpretation is: You just need to find the value at which the difference between the power iteration values and the MC method values is less than EPSILON = 0.0001, right? So to find this you could increment N and re-compute each MC method from N=1 until the difference is less than EPSILON = 0.0001 for each MC method, then report the N value that satisfies each MC Method. Because this is for the entire set not the top 50 or bottom 50, this would take a really large value of N.

Is this what you're saying?

kommenterade 3 mars 2016

I'm going to compute the value for the top 50 and bottom 50 respectively, because this problem is a lot more tractable and still represents what we are interested in analyzing. 

kommenterade 5 mars 2016

Should I reconsider my solution if 2.4 takes minutes to execute?

Lärare kommenterade 5 mars 2016

No, it can be time consuming. If you like you can plot the diff to see the convergence.

kommenterade 6 mars 2016

In task 2.5 the 50 documents with lowest exact pagerank are for me documents that have neither ingoing nor outgoing links and can only be reached by jumping. Thus, they all have the same pagerank. Are we supposed to consider those documents as well? Because then the 2nd goodness measure doesn't really make sense to me.

kommenterade 6 mars 2016

For MC algorithm 4:  Does the walk only stop when it hits a dangling node or can the walk also get bored?  From the lecture slides I would assume the former.  What happens with nodes like this:

6927;6928,
6928;6927,

kommenterade 7 mars 2016

I let it jump to a random document if it was bored so if it gets stuck in a loop it will eventually find its way out.

Lärare kommenterade 7 mars 2016

If it is bored (probability 1-c in each timestep), the walk stops. If it is at a dangling node, it jumps to a random other node in the first three methods, and stops in the last two methods.

Lärare kommenterade 7 mars 2016

If you have booked when there were no time slots, go in and book one now, there are a couple of cancellations!

Lärare kommenterade 8 mars 2016

Would two of you with times after 16/4pm be able to switch with Sugandh Sinha? I would be very kind! In that case, send him and me an email asap: sugandh@kth.se, hedvig@kth.se

kommenterade 8 mars 2016

Hej Hedvig, I had contacted with Sugandh and I can do my presentation at 2:15 PM by using his slot at Doodle. Thanks a lot!

kommenterade 8 mars 2016

I will not attend my chosen time at 13:45, but I can't remove it since the Doodle is closed

kommenterade 8 mars 2016

I will not attend at my chosen time. I tried to cancel the booking but the Doodle is closed.

kommenterade 12 mars 2016

"Note that you have to normalize both the query weights and the document weights, by dividing the document/query vector with the length of the same vector. This has to be done prior to the Rocchio computation." 

This is because, if we had a really long document marked as relevant, it would contribute a lot more to the relevance feedback than a shorter but equally as relevant document, right? So we then normalize by length to give each word occurrence the same value across documents found to be relevant.

kommenterade 12 mars 2016

Also, do we condense the matches?

"Many terms will reoccur in different documents. Make also sure not to add terms twice to the query, but instead add to the weight of the existing term instance."

This works just fine for documents with re-occurring words, but if we have a query, then do we fold words in the query that  occur more than once into one instance with more weight?

"Math program math department" ---> Math = 1.5; program = 1;  department = 1; 

Or some other weighting scheme instead of 0.5 depending maybe on beta?

Let me know. Most queries don't contain words twice that we have considered, and I think it's not a case that's treated too "incorrectly" even without condensing words, since the +1 weight later on won't be a huge detractor.

kommenterade 13 mars 2016

Thirdly, what does this mean?

"What is the average speedup, and how is it affected by parameters in the approximation method?"

What is the "parameters in the approximation method"? Does this refer to the a and beta in the part 3.1?

Lärare kommenterade 16 mars 2016

Answers to Nick:

Vector length: Yes that is right.

Query representation: The query should be represented in exactly the same way as a document with tf-idf (or plain tf) scores.

Param in approx: The param is N, the number of random walks.

kommenterade 21 mars 2016

In 3.1, one question is "Why is the number of returned documents larger?". When a document is marked as relevant, should we consider all terms in that document when we move closer to that vector, or just the ones seen in the original query? If we only consider the ones in the original query, we should get the same number of returned documents. 

Lärare kommenterade 21 mars 2016

You should consider all returned documents!

kommenterade 21 mars 2016

Yes, but should I consider all terms in all returned documents?

Lärare kommenterade 21 mars 2016

No, but you should consider all terms in all returned documents, that are also in the query!

kommenterade 28 mars 2016

In the Query class where we receive information on relevant documents we create a new query with new weights. Are those weights the tf_idf scores multiplied by alpha or beta? Or have those weights nothing to do with tf_idf?

kommenterade 28 mars 2016

So just to make sure about Ambjörn's question: the modified query will not have new terms? only the weights will change? In other words, the rocchio algorithm update the query vector only in the space of the terms of the query and not the whole space of terms of our corpus?

Other question, for the task 3.4 and task 3.5: I cannot process more than 1000-2000 documents for the biword index without having a Heap Space error. But I can build a biword index if it is only for a few bigrams with the documents containing "zombie attack" or "money transfer" for example and still get the right scores. During the examination, I won't have the results for a query I do not know, what should I do?

kommenterade 29 mars 2016

According to everything I've seen on Rocchio the new query definitely has new terms. The book recommends limiting yourself to adding the top 20 highest ranked terms in the relevant document.

kommenterade 29 mars 2016

That's what I thought too

Lärare kommenterade 29 mars 2016

There is now a Doodle for the presentation of Assignment 3. Looking forward to your presentations!

kommenterade 30 mars 2016

Hello! In assignment 3 are we supposed to consider ranked retrieval only for ranking type "tf-idf", or should we also consider the ranking types "pagerank" and "combination"? In case we should consider all the 3 ranking types, how should the query be represented for "pagerank" and "combination"?

Lärare kommenterade 30 mars 2016

Only tf-idf needed!

kommenterade 1 april 2016

I need further clarification on the Rocchio algorithm, based on these quotes:

Yes, but should I consider all terms in all returned documents?

No, but you should consider all terms in all returned documents, that are also in the query!

If we are only considering the terms in all returned documents that are also in the query we will not be getting any more returned documents. As far as I can tell that's not how the algorithm works in general. What am I missing? I also do not understand how/why we need to consider all returned documents since the algorithm only considers the set of relevant documents (when gamma = 0).

Lärare kommenterade 1 april 2016

Fredrik:

You should include all terms that are in the documents marked relevant, and all terms in the original query. That is, the UNION of the terms in the original query and in the documents marked relevant, not the INTERSECTION.

I do not understand the last paragraph.

kommenterade 1 april 2016

Okay, I understand, thanks! I misunderstood "No, but you should consider all terms in all returned documents, that are also in the query!" to mean query terms exclusively.

What I mean is that (AFAICT) all returned documents (say the 249 for zombie attack) need not be considered, only the ones marked as relevant by the relevance feedback (i.e. 10 or less documents).

Lärare kommenterade 1 april 2016

ah, I see. No, the point of the relevance feedback is to take into account what the user said was relevant, i.e., the documents marked as relevant in the GUI.

kommenterade 1 april 2016

Two questions regarding Task 3.1:

1. Should the results from a standard ranked retrieval search without relevance feedback be the same as in Assignment 2, or can you use tf-idf weights for the query terms? (I used FastCosineScore in Assignment 2.)

2. Typical values are α = 1 and β = 0.75 according to the course book (and other sources). However, when I use these values, in many cases documents often retain their positions in the results list after relevance feedback. If I change the values to around α = 0.1 and β = 5 to increase the influence that the relevance feedback has, the results look much better (in my opinion). Does this sound reasonable or am I doing something wrong?

Lärare kommenterade 1 april 2016

Joakim:

1) Yes, it shoudl be exactly the same!

2) It sounds reasonable. Play with different alpha/beta ratios and different queries, and try marking a doc that contains totally different terms than the orig query, and see how the new ranking becomes.

kommenterade 2 april 2016

I may have missed something here, but it seems extremely expensive to include all the terms in all documents marked relevant in the expanded query. For every term, its postings list must be fetched as part of the tf_idf algorithm, according to the lectures and the book. It takes about three seconds for my computer to do one search and if a couple of documents are selected as relevant, the number of terms might suddenly be 500, and it would take 25 minutes to search for them all. A union database search with a huge WHERE clause doesn't do the job, because then it isn't possible to know what terms or how many of the terms are present in each result.

Lärare kommenterade 2 april 2016

Rasmus, that is exactly the pedagogical point of then moving to Task 3.3 where you introduce approximations!

kommenterade 3 april 2016

Are there any more spots on April 5th to demonstrate assignment 3? I can't sign up due to no spots apparently.

Lärare kommenterade 3 april 2016

As stated above, If all times are taken, make a booking without choosing any timeslot, and I will create more times.

kommenterade 3 april 2016

As a follow-up/clarification to what Rasmus asked: is the intention for task 3.1 that we loop over every term in the dictionary and get each postingslist and find the relevant documents in those lists?

I don't see any other way to structure it other than reading the terms from each document (which might save at least some time if the documents are from varying domains with differing sets of terms), but the structure of the given code indicates that it's not the intended way to do it.

Lärare kommenterade 4 april 2016

no, the list I am talking about is the returned list of docs in the gui. The user will mark some of them as relevant.

kommenterade 4 april 2016

I will not be able to make it to my time slot on April 5th 14.15

Lärare kommenterade 4 april 2016

No worries, just unbook in the Doodle!

kommenterade 5 april 2016

I would like to take that 14:15 slot if possible. The Doodle poll seems to be closed so updates are not possible

kommenterade 5 april 2016

Anyone who has a late presentation and wants to do it now, feel free to turn up early.

Lärare kommenterade 18 april 2016

Doodle for April 29 now available!

Lärare kommenterade 28 april 2016

Note the doodle closing time - all changes after yesterday at 19 are unvalid since the schedule already has been exported - I changed back a rebooking this morning. Please do not change or unbook in the doodle!

Good luck tomorrow!

Lärare kommenterade 13 maj 2016

Dear all,

Now we have created a new delayed-assignment session on May 27, 13-15. Same procedure as in April, visit the Doodle on the homepage, and book your time slot BEFORE 19.00 May 25.

Good luck with the assignments!

Lärare kommenterade 26 maj 2016

The doodle is now closed and updated so that all students have a time slot. There are fewer TAs than time slots so there might be short delays, depending on the number of no-shows.

Looking forward to your presentations tomorrow!