
Assignments

The two assignments are carried out individually and may partially be examined orally.

  • Assignment 1 (CH Ek's Python script create_index_set): the deadline is November 30 (midnight, i.e. towards Dec 1).
  • Please include "[mladv16]" in the subject line also when submitting Assignment 2. A new version of Assignment 2 has been posted (the problem pointed out by Möckelind & Guirao is fixed). DEADLINE EXTENSION: the new deadline is 23.59 CET, December 29, 2016. Solutions submitted before 16.00, December 21, can be resubmitted. Please consider Arvid's warning concerning book versions and the Simple VI problem. Again, thanks Arvid.
  • Late assignments: we will accept late assignment submissions between April 3 and 10. The highest possible grade for late submissions is E. Please send your solutions with subject [lateAss2MLAdv16] to Jens Lagergren (jensl@csc.kth.se); in case there are late submissions of Assignment 1, send them to Pawel (paherman@kth.se) with subject [lateAss1MLAdv16]. Please also resend "earlier late submissions". Those of you who failed Assignment 2 but passed the course with an E (due to otherwise good performance) can resubmit Assignment 2 (i.e., this may be meaningful if an E on Assignment 2 can give you a higher grade).

    Notice that we will compare later submissions with earlier submitted solutions, in order to catch and report any case of plagiarism. 

  • On April 11 we will have a meeting to discuss possibilities to obtain a somewhat better grade on Assignment 2.

Grading

Each assignment is graded A-F (fail). In the case of an F, the assignment has to be submitted again at a later time. The grade depends on how large a part of the assignment was carried out correctly. Assignments handed in after the respective deadline receive a maximum grade of E.

In other words: Be on time with the assignments! 

The grades on the individual assignments are weighted together as explained in the table below. NOTE: You need to pass both assignments, i.e., get at least grade E on both assignments.

Grade   Criteria (individual assignment grades, regardless of order)
A       (A,A), (A,B)
B       (A,C), (A,D), (B,B), (B,C)
C       (A,E), (B,D), (B,E), (C,C), (C,D)
D       (C,E), (D,D), (D,E)
E       (E,E)
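Read purely as a lookup table, the combination rule can be encoded as below (an illustrative sketch only, not an official tool; the frozenset key simply makes the order of the two grades irrelevant):

```python
# Illustrative encoding of the grade-combination table above (not an official tool).
COMBINED = {
    frozenset("A"): "A", frozenset("AB"): "A",
    frozenset("AC"): "B", frozenset("AD"): "B", frozenset("B"): "B", frozenset("BC"): "B",
    frozenset("AE"): "C", frozenset("BD"): "C", frozenset("BE"): "C",
    frozenset("C"): "C", frozenset("CD"): "C",
    frozenset("CE"): "D", frozenset("D"): "D", frozenset("DE"): "D",
    frozenset("E"): "E",
}

def course_grade(first, second):
    """Combined course grade for two passing assignment grades (order does not matter)."""
    return COMBINED[frozenset(first + second)]

print(course_grade("B", "A"))  # -> "A"
```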

Jens Lagergren created the page 19 October 2016

commented 6 November 2016

Hi! With only a week to solve the assignments, and collisions with other assignments and projects in other classes, what is your recommendation for getting a satisfying grade, besides reading the chapters from the book? Previous assignments from mladv15 are over twenty pages long, and it seems like we would need at least a full-time week to solve the problems to get an A.

Teacher commented 6 November 2016

Also do exercises from the book. We will try to extend the time window available for each assignment. The first one will probably be less extensive than last year's, no guarantees though.

A user has deleted their comment
commented 6 November 2016

Will the assignments focus more on implementation with Python, numpy, scikit-learn, pandas, etc., or more on theoretical material and math problems? Yeah, the assignments from last year looked very extensive.

Assistant commented 7 November 2016

The first assignment will be very similar to last year's assignment.

commented 7 November 2016

What was last year's assignment?

Assistant commented 7 November 2016

The assignment will be posted this week.

commented 7 November 2016

With the same deadline?

commented 7 November 2016

@Zhao

2015: 

https://www.kth.se/social/course/DD2434/subgroup/ht-2015-mladv15/page/assignments-51/

2014: 

https://www.kth.se/social/course/DD2434/subgroup/ht-2014-51466/page/assignments-41/

commented 13 November 2016

In assignment 1, equation (8), how are we meant to interpret the probability density of the prior \(p(\mathbf W) = \mathcal N(\mathbf W_0, \tau^2 \mathbf I)\) considering that \(\mathbf W\) is a matrix?

Assistant commented 14 November 2016

Please read the remarks attached at the top of the assignment. You can think about this prior as acting on the sets of weights corresponding to each output in Y (each with a different W_0), which is analogous to the vectorised form of the matrix W. Most important is the independence assumption, which you can deduce from the nature of the covariance matrix.

A user has deleted their comment
A user has deleted their comment
commented 14 November 2016

Hi! 

Do you mean that every \(\bf{W_0}\) represents a row of \(\bf{W}\) ? 

Why don't you write \(\mathbf{w}_i\) instead, so that

$$p(\mathbf{W}) = \prod_{i}^{N} \mathcal{N}(\mathbf{w}_{i} \mid \tau^2 \mathbf{I})$$

?

Assistant commented 14 November 2016

It is a matter of notation. In essence this is a normal distribution spanned over an array of variables (a matrix normal distribution), so you can also think about it in the vectorised form (W_0 can be treated as a long vector containing all the parameters w in the matrix).
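Written out, and assuming (as the remark above suggests) that each set of weights \(\mathbf{w}_i\) gets an independent isotropic Gaussian prior with the corresponding slice \(\mathbf{w}_{0,i}\) of \(\mathbf{W}_0\) as its mean, the two views coincide:

$$p(\mathbf{W}) = \prod_{i} \mathcal{N}(\mathbf{w}_i \mid \mathbf{w}_{0,i}, \tau^2 \mathbf{I}) = \mathcal{N}\big(\operatorname{vec}(\mathbf{W}) \mid \operatorname{vec}(\mathbf{W}_0), \tau^2 \mathbf{I}\big)$$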

A user has deleted their comment
commented 14 November 2016

I am wondering about the dimensions of the matrices X and Y in Assignment 1. From Question 2 I understand that the Y matrix is D x N, so I thought the X matrix was stacked by vectors in the same way, which would indicate dimension q x N, but then the computations in Question 6 get really tricky.

Assistant commented 14 November 2016

Ok, I have now simplified this last question. Please see the text of the assignment (Q6 in Part I) and focus on a single output variable case.

Assistant commented 15 November 2016

Eq. (23) has been corrected in Assignment 1.

A user has deleted their comment
commented 15 November 2016

Hi Pawel!

To make sure that I understood your change in question 6 correctly (single output variable). Are we supposed to derive the posterior for the linear regression problem: 

$$\mathbf{y} = \mathbf{X} \mathbf{w} + \mathbf{\epsilon}$$

with \(\mathbf{y} \in \mathbb{R}^N\), \(\mathbf{X} \in \mathbb{R}^{N \times D}\), \(\mathbf{w} \in \mathbb{R}^{D}\), and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), where \(\mathbf{I}\) denotes the \(N \times N\) identity matrix. 

Is that what you mean?

Assistant commented 15 November 2016

Yes, please derive and comment.
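(For reference only, since the derivation itself is what the question asks for: with the prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{w}_0, \tau^2 \mathbf{I})\) and the likelihood above, the standard Gaussian result, e.g. Bishop Eqs. (3.49)-(3.51), has the form)

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \mathbf{S}), \qquad \mathbf{S} = \left(\tau^{-2}\mathbf{I} + \sigma^{-2}\mathbf{X}^\top\mathbf{X}\right)^{-1}, \qquad \mathbf{m} = \mathbf{S}\left(\tau^{-2}\mathbf{w}_0 + \sigma^{-2}\mathbf{X}^\top\mathbf{y}\right)$$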

commented 16 November 2016

Wait, should it be \(\textbf{y} = \textbf{Xw} + \epsilon\) or \(\textbf{y} = \textbf{Wx} + \epsilon\) ?
The latter is the definition in the assignment.

commented 16 November 2016

@Erik

That's indeed the definition in the assignment, but Pawel changed question 6 so that \(\mathbf{y}\) is a vector instead of a matrix \(\mathbf{Y}\) which implies that \(\mathbf{w}\) is also a vector instead of a matrix \(\mathbf{W}\). I think one would run into dimension problems when one doesn't change the order to \(\mathbf{y} = \mathbf{X} \mathbf{w} + \mathbf{\epsilon}\)

@Pawel

At the beginning of the nonlinear regression part you defined that \(f_i = f(\mathbf{x_i})\) which might imply that \(f\) denotes the actual function. 

In question 9 you define the joint likelihood of the full model as \(p(\mathbf{Y},f,\mathbf{X},\mathbf{\theta})\) and the marginal likelihood as 

$$p(\mathbf{Y}|\mathbf{X}, \mathbf{\theta}) = \int_{}^{} p(\mathbf{Y}|f) p(f|\mathbf{X},\mathbf{\theta}) \textit{d}f$$

By \(f\), do you now mean the vector \(f = (f_1, ..., f_N) = \mathbf{f}\), as in Murphy, or the actual function? So shouldn't it be, e.g.,

$$p(\mathbf{Y}|\mathbf{X}, \mathbf{\theta}) = \int_{}^{} p(\mathbf{Y}|\mathbf{f}) p(\mathbf{f}|\mathbf{X},\mathbf{\theta}) \textit{d}{\mathbf{f}}$$

with \(\mathbf{f} = (f_1, ..., f_N)\)

Assistant commented 16 November 2016

We think of f as a function. However, since we take a nonparametric approach, we define it as a collection of points f_i (we do not have any model for the function that accounts for the data). So, to cut a long story short, the vector f denotes a collection of samples (f_1, ..., f_N).
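In other words, assuming a Gaussian observation model \(p(\mathbf{Y}\mid\mathbf{f}) = \mathcal{N}(\mathbf{Y}\mid\mathbf{f}, \sigma^2\mathbf{I})\) together with the GP prior \(p(\mathbf{f}\mid\mathbf{X},\boldsymbol\theta) = \mathcal{N}(\mathbf{0}, \mathbf{K})\) over the vector \(\mathbf{f} = (f_1, \ldots, f_N)\), the integral has the standard closed form

$$p(\mathbf{Y}\mid\mathbf{X},\boldsymbol\theta) = \int \mathcal{N}(\mathbf{Y}\mid\mathbf{f}, \sigma^2\mathbf{I})\, \mathcal{N}(\mathbf{f}\mid\mathbf{0}, \mathbf{K})\, \mathrm{d}\mathbf{f} = \mathcal{N}(\mathbf{Y}\mid\mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I}).$$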

Assistant commented 16 November 2016

@Erik

The bottom line is that for the derivation it is fine to think about a univariate output y, which implies that w is a weight (parameter) vector. How you define the dimensionality of your vectors and matrices is of secondary importance, as long as you stay consistent and all the algebraic operations you demonstrate match.

commented 17 November 2016

Eq 11 is currently stated as \(p(f|\bold{X},\bold{\theta})=N(\bold{0},k(\bold{X,X})).\) Should this be \(p(f|\bold{X},\bold{\theta})=N(\bold{0},k(\bold{X,X'}))\)?

Assistant commented 17 November 2016

Yes, this is a matter of notation.

To be very precise, the arguments of the kernel function are vectors, not matrices, so one could write N(0, K), where K_ij = k(x_i, x_j).
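In code, that remark corresponds to building the Gram matrix element-wise from a kernel that takes two input vectors (a generic sketch; the kernel and its length-scale below are illustrative, not prescribed by the assignment):

```python
import numpy as np

def gram_matrix(k, X):
    """K[i, j] = k(x_i, x_j) for the rows x_i of X (shape N x d)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = k(X[i], X[j])
    return K

# Illustrative squared-exponential kernel; the length-scale here is arbitrary.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 0.1)
K = gram_matrix(rbf, np.random.rand(5, 2))
```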

A user has deleted their comment
commented 18 November 2016

Hi Paweł,

Could you please, once again, explain how we are supposed to understand \(\boldsymbol{W}_0\) in the prior, BOTH in the context of Question 6 and in general? I have a hard time understanding where it came from.

Thank you,

Wojciech

Assistant commented 18 November 2016

Hi Wojciech

I suggest that you clearly point out what you do not understand; it will be easier for me to help you with a specific answer. So, if you look through the discussion above (numerous comments), could you let me know which remarks/comments remain unclear? I will do my best then to straighten them out. As a general comment I can say that the difference, or exception, in Q6 is that we consider our regression problem to have only one output variable, which implies that our W_0 becomes a vector, w_0. In the rest of this part of the assignment we consider a more general case where an input vector x is mapped to an output vector y (not a scalar y as in Q6), which implies that the regression parameters form a matrix W. The problem can be vectorised, and the matrix W can then be treated as a vector (given the assumptions that are met here). Then you do not really have to worry about it and you can treat the problem in a similar way to what we did in the lectures.

Please let me know then where specifically more clarification is needed.

Next year I am going to make it look more straightforward.

commented 18 November 2016

Pawel, thank you very much for the fast response. I guess my question was too general, sorry about that, but I had a hard time wrapping my head around this. 

To straighten out the confusing parts, if you could please confirm that my understanding is correct:

When you wrote: "You can think about this prior for sets of weights corresponding to each output Y (with different Wo)"

Here "each output Y" means each output variable in Y - in this assignment (using the notation from Theory 2.1) we would have D such output variables, as we can split the problem in to D separate regression tasks?

Using the notation introduced by Peter Langenberg, we could understand the Prior as:

\(p(\mathbf{W}) = \prod_{i}^{N} \mathcal{N}(\mathbf{w}_{i} \mid \mathbf{W}_{0}^{(i)}, \tau^2 \mathbf{I})\), where \(\mathbf{W}_{0}^{(i)}\) is a vector from \(\mathbf{W}_0\) that holds the means of the distribution corresponding to \(\mathbf{w}_i\).

And the last part:

The elements in \(\bf{W_0}\) are the means of the distribution of the Prior corresponding to weights \(\bf{W}\).

The values of the elements in \(\bf{W_0}\) are arbitrarily chosen by us (according to our prior beliefs) and could be seen as "hyperparameters" of the regression problem we are solving?

Using the notation from the Bishop book we would have: \(p(\bf{W} | \bf{W_0}, \tau^2) = \mathcal{N}(\bf{W} | \bf{W}_0, \tau^2\bf{I})\)

Thanks in advance! :)

Assistant commented 18 November 2016

Each output implies each univariate component, y_j (j=1,..,D), of the multivariate output vector y.

I would not necessarily split it into independent tasks (though it is often done this way) but rather interpret your W_0 as a vector (the notation is not very obvious here) obtained from matrix vectorisation (a simplified approach to the matrix normal distribution). This implies that you operate in the space spanned by all the parameters. Bear in mind, however, that these details are not really all that relevant to the questions (except Q5, where you just need to take into account this multidimensional aspect of the W representation).

We can assume we incorporate our beliefs in W_0, so the parameters of the prior (tau and the mean values in W_0) are fixed.

commented 18 November 2016

Hi Pawel, can we make use of a Jupyter/IPython Notebook for the entire assignment(s), and hand this in as the report as well, or is a PDF with LaTeX preferred? IPython Notebooks are great for having everything in one place: explaining theory, computing in Python, visualising the data, etc.

Assistant commented 18 November 2016

Hi Oktay, please submit a pdf.

commented 18 November 2016

You can generate a PDF directly from an IPython notebook with LaTeX markdown cells. To hide the code and just show the figures, follow the guide in the link below (requires Pandoc!).

https://ga7g08.github.io/2015/06/25/ipython-nbconvert-latex-template-to-hide-code/ 

commented 18 November 2016

Awesome, thanks Fabian! :D 

commented 19 November 2016

Hi Pawel! 

In question 17 we are supposed to perform the marginalisation according to 

$$p(\mathbf{Y}|\mathbf{W}) = \int p(\mathbf{Y}|\mathbf{X},\mathbf{W})\, p(\mathbf{X})\, \mathrm{d}\mathbf{X}$$

Is it OK if I do that for vectors, with \(\mathbf{x} \in \mathbb{R}^M\) (latent variable), \(\mathbf{y} \in \mathbb{R}^D\) (observed variable) and \(\mathbf{W} \in \mathbb{R}^{D \times M}\), so that

$$p(\mathbf{y}|\mathbf{W}) = \int p(\mathbf{y}|\mathbf{x},\mathbf{W})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x}$$

\Peter

Assistant commented 19 November 2016

Yes, most definitely!
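(For a single output vector, and assuming, as in standard probabilistic PCA, that \(p(\mathbf{x}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(p(\mathbf{y}\mid\mathbf{x},\mathbf{W}) = \mathcal{N}(\mathbf{W}\mathbf{x} + \boldsymbol\mu, \sigma^2\mathbf{I})\), the marginalisation gives the familiar form)

$$p(\mathbf{y}\mid\mathbf{W}) = \int \mathcal{N}(\mathbf{y}\mid\mathbf{W}\mathbf{x} + \boldsymbol\mu, \sigma^2\mathbf{I})\, \mathcal{N}(\mathbf{x}\mid\mathbf{0}, \mathbf{I})\, \mathrm{d}\mathbf{x} = \mathcal{N}(\mathbf{y}\mid\boldsymbol\mu, \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}).$$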

A user has deleted their comment
commented 20 November 2016

Hi! Do you want a specific level of simplification in the expression for question 19.2 (gradients of the objective with respect to the parameters)? 

My expression should be correct but it looks kind of messy (a very long term including a lot of linear algebra such as trace, inverse matrices ...). 

\Peter

commented 21 November 2016

Hi Pawel,

I spent a lot of time trying to understand what create_index_set was doing, and it seems that it's simply returning the order of indices that sorts the input list (except the first index, by the way). Why not simply use numpy.argsort in that case? Did I miss something?

Thanks,

Alexandre

commented 21 November 2016

Is anyone else getting that the result of \(k(\mathbf X, \mathbf X)\) is a singular matrix? I tried generating 500 evenly spaced points [0, 1, ..., 499] and calculating the covariance function for every pair of points (which reduces to \(\sigma_f^2 \exp(-(x_i - x_j)^2/l^2)\) in the one-dimensional case), but unless I choose a rather small \(l^2\) (seemingly at most 9), the resulting matrix is singular (it is both symmetric and positive semi-definite, however). This makes it difficult to sample smoother functions. Changing \(\sigma_f^2\) doesn't seem to help, and I've mostly stuck to a value of 1.

I also tried generating 500 evenly spaced points in the [0, 1) range instead, but when I did that I got a singular matrix with an \(l^2\) of anything above around \(10^{-5}\).

commented 21 November 2016

If we are supposed to do Question 6 as well as Question 17 for a single output variable y, how are we then supposed to be able to do the practical part 2.4 where we do not have a single output variable y?

Also, the dimensions do not seem to add up in part 2.4. Should it perhaps be Y^T instead of Y in Eq. (28)?

A user has deleted their comment
Assistant commented 22 November 2016

Yes, you may end up with a nearly singular covariance matrix if you estimate it in a high-dimensional space (a relatively large number of points used to construct your Gram, or more generally kernel, matrix). Still, you can draw samples from it; most packages are capable of doing that, resorting to different approaches, e.g. Cholesky decomposition/estimation of the covariance matrix, its eigenvector structure, etc. Take a look at scipy.stats.multivariate_normal.

commented 22 November 2016

I managed to solve it by using numpy.random.multivariate_normal instead of creating a scipy.stats.multivariate_normal object as I was doing previously. It also seems to work if you call scipy.stats.multivariate_normal.rvs as a static method rather than calling rvs on a multivariate_normal object.
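For anyone running into the same issue, a minimal sketch of that approach (the kernel hyperparameter values and the diagonal jitter term are illustrative choices, not prescribed by the assignment):

```python
import numpy as np

# Squared-exponential covariance on 500 evenly spaced inputs in [0, 1).
x = np.linspace(0, 1, 500, endpoint=False)
sigma_f2, l2 = 1.0, 0.01                       # illustrative hyperparameters
K = sigma_f2 * np.exp(-(x[:, None] - x[None, :]) ** 2 / l2)

# numpy's sampler copes with a (nearly) singular covariance via SVD internally;
# a small diagonal "jitter" further stabilises the draw numerically.
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=5)
print(samples.shape)  # (5, 500): five functions drawn from the GP prior
```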

Assistant commented 22 November 2016

Wonderful! Great job.

Assistant commented 23 November 2016

Dear students

Having talked to some of you, I have made some cosmetic changes to the document. Please find the new version uploaded. In short, Q11 now says clearly that you should set the prior (cov) and visualise it. The longest text update has been made in Q17, where a hint is provided as to how the marginalisation can be done. I think it is more productive for you to address it in a simpler way rather than get bogged down in the derivations (which was my original plan). In Q19 there is an explicit reference to Eq. (23). In practical 2.4 I am now asking you to discuss (explain) any invariance you observe. Two equations have also been refined since the dimensions did not match.

commented 24 November 2016

Hi! In part 2.4.1 (question 20), are we only supposed to learn the linear mapping, i.e. the matrix A that generates the data? Or are we also supposed to learn the non-linear mapping (the function f_nonlin)? 

In question 20 we are supposed to visualize X. Do you mean by X the true underlying parameters or is it just a symbol for what we learn from performing the optimization?

Assistant commented 24 November 2016

You do not need to learn f_nonlin since you know what it is.

Please visualise the learned representation X of the data you observe.

commented 24 November 2016

Whenever the questions don't explicitly tell us to derive a result ourselves, is it OK to just use formulas as given in the lectures/book?

Assistant commented 24 November 2016

Yes, with a comment on where the equation comes from and what it means (if applicable).

commented 24 November 2016

Two more questions:

1. In question 24, should we talk about marginalization in general and what it does, or what the implication of marginalization is in this specific context, or both?

2. The text after question 24 mentions the prior \(p(\boldsymbol \theta \mid M_i)\), but the marginalization in (40) uses \(p(\boldsymbol \theta)\); shouldn't the latter be \(p(\boldsymbol \theta \mid M_i)\) as well?

commented 24 November 2016

Also, is there any limit on how low S can be when doing the Monte Carlo sampling? Because generating the evidence for all of the data sets turned out to be unbearably slow for S larger than 1000, and I'm wondering if it's worth spending the time to try to optimize my naive approach.

Assistant commented 25 November 2016

There is no need for fine tuning over a massive number of sample data sets.

Assistant commented 25 November 2016

I see that it did not come across clearly in the lecture. In Q17, when you marginalize, please do not think about the distributions in terms of likelihood. You deal with a prior and a posterior that are integrated over all samples of x (X). Therefore, it might be easier for you to think of P(y_i|W) (y_i = W x_i + mu + noise), which in fact becomes a distribution over random variables, not individual samples (in the spirit of marginalization).

Assistant commented 25 November 2016

@ Arvid Fahlström Myrman      

Ad. 1) It is about marginalization in the Bayesian context, and the equation in Q24 is certainly a good representative here. You might then want to explain it within the framework of this equation (it has a general meaning in Bayesian modelling, though, where we average variables out; what does it mean?).

Ad. 2) Yes, to be strict, it should. However, the interpretation from a sampling perspective is that the marginal likelihood is the probability of generating the data set D from a model whose parameters are sampled at random from the prior p(\(\boldsymbol{\theta}\)), hence we sometimes just write p(\(\boldsymbol{\theta}\)). To stick with the principle, you may want to write it as you suggest, though.
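Spelled out, the sampling interpretation above is the simple Monte Carlo estimate

$$p(D \mid M_i) = \int p(D \mid \boldsymbol\theta, M_i)\, p(\boldsymbol\theta \mid M_i)\, \mathrm{d}\boldsymbol\theta \;\approx\; \frac{1}{S}\sum_{s=1}^{S} p(D \mid \boldsymbol\theta^{(s)}, M_i), \qquad \boldsymbol\theta^{(s)} \sim p(\boldsymbol\theta \mid M_i).$$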

Assistant commented 25 November 2016

Dear all
I am just writing to let you know that I will not be able to address your questions over the weekend.
Good luck with the assignment and enjoy the weekend!

commented 25 November 2016

I think create_index_set is buggy. When populating N, it checks if D[ind] == L[-1], but L[-1] has already been removed from D previously, so the expression will always be false, and as a result N will always be empty.

A user has deleted their comment
commented 25 November 2016

@Alexander While I do have an updated local version, I'm still not confident that it's 100% correct (I already found one bug other than the one I mentioned). You're probably better off either writing your own implementation based on the original paper, or coming up with your own sorting scheme.

Assistant commented 25 November 2016

@Arvid et al.
It is strange; this piece of code has worked for me. However, I certainly cannot guarantee it, as I currently have no access to a proper PC to re-run the assignment. I inherited this piece of code from last year's course round and it did the job for me.

Anyway, if you find it buggy, please (anyone) implement these few lines of code in line with the original report by Murray et al., "A note on the evidence and Bayesian Occam's razor" (as far as I remember), and post it. I apologise for the trouble and will certainly extend the deadline for such an effort.

commented 26 November 2016

@Pawel I think the reason it looks like it works is that it (inadvertently) does something similar to simply sorting the data sets by the sum of probabilities from each model, which actually ends up being a fairly good way to visualize the data sets.

commented 28 November 2016

Hi,

I want to make sure about the deadline: is it at 23:59 on 11.29 in the Stockholm time zone, or at noon on 11.29?

Thanks.

Assistant commented 28 November 2016

Sure, the midnight is perfectly fine.

commented 28 November 2016

Does that mean that the deadline given in the assignment PDF is incorrect?

Assistant commented 28 November 2016

It just means that 12 hours of difference does not play any role, especially if there is anyone who travels and is currently working from a very distant location in a different time zone.

commented 28 November 2016

Hello @Pawel,

I'm a little bit unsure about the formulation of question 20. As I understand the topic of the section, we want to recover our 1-dimensional x values. But the question states that we should plot it in a two-dimensional space. Are we supposed to do the PCA backwards and get the y's from the estimated x's, or should I learn a 2- or higher-dimensional representation and plot the first two dimensions? I get reasonable results in both cases, but I can't state them here without revealing some possible answers to the question... Can you give me a hint as to how the question was intended?

Thank you in advance.

commented 28 November 2016

Hi @Pawel,

In question 23 there's a subquestion: "What does this actually imply in terms of uncertainty?" I'm a bit confused about the kind of uncertainty meant here. Is it uncertainty inside each particular model, uncertainty about model selection, or some other kind of uncertainty?

Thank you in advance.

Assistant commented 28 November 2016

@Simon

The focus is on the latent variable that generated Y. So you could plot x in one-dimensional space. Then, in order to discuss invariances, you can also plot the full mapping (nonlin) you have recovered (this is where the hint to plot X as a two-dimensional representation comes in).

Assistant commented 28 November 2016

@Dmytro
I am asking about probabilistic uncertainty associated with the selection of a model based on the evidence from data (posterior) and prior. Naturally, it involves individual models since we select from them.

commented 28 November 2016

Hi!

Did somebody update the create_index_set script or is it still buggy? 

A user has deleted their comment
commented 28 November 2016

@Hanna I think you're on the right track, but as D_org points to the same object as D, any changes to D in your code will be reflected in D_org, so I think the code still works identically. Try adding some debug output to see if the "else" clause is ever actually run.

I attempted to implement the algorithm myself, but I find that the resulting graph contains terrible oscillations. While it's quite possible that I have a bug in the code, I'd say it's probably better to simply run argsort on the summed evidence to get a nice graph.

Assistant commented 28 November 2016

Dear students
Just a quick update: given extenuating circumstances, I have decided to extend the deadline to November 30 (midnight). Those who have already submitted their work can resubmit it if they want to. Again, thank you for the good work and solid assignment reports.

I am afraid that tomorrow I will be unable to respond to your queries.
Good luck!

commented 28 November 2016

@ Arvid 

Yes, you are totally right. Thank you. Even when I change to np.copy(D) I only get to the else part once in the beginning. So I decided to just remove my script since it might just be confusing and I don't have the time to fix it. 

commented 28 November 2016

Hi!

For me, the code worked well and could plot correctly so I think it is ok now.

A user has deleted their comment
commented 29 November 2016

@Motoya No the code is still broken, but after fixing the bugs Arvid found I get an almost identical output to the original buggy code anyway. So it should suffice for the purposes of this assignment.

commented 29 November 2016

@Arvid I also got those oscillations at first, until I defined the distance metric in terms of abs(sum_i p(D|H_i) - p(D'|H_i)). Also, the line "L.append(N[dist[L[-1],N].argmin()])" should use argmax() instead, since the point appended to L should be the one in N furthest from L[-1] according to the paper. This resulted in a plot very similar to the ones in the paper.

commented 29 November 2016

@Axel

Yeah, it's oscillating, but you mean we can still infer the meaning of the evidence plots from them, and thus it's no problem for the assignment, right?

commented 29 November 2016

@Axel I'd noticed the argmin/argmax bug, but I never got around to testing different distances. I tried your suggestion now, and it turns out that the result is actually exactly the same as what I get when I simply run argsort on the summed evidence. However, you inspired me to see what happens when I use \(\sum_i |P(D \mid H_i) - P(D' \mid H_i)|\) instead, which actually yields something more similar to the graphs in the original paper. Then I tried changing this to the squared Euclidean distance instead, i.e. \(\sum_i (P(D \mid H_i) - P(D' \mid H_i))^2\), and the result is almost identical to the original paper.

Here is the code I used. Note that this code is written for Python 3 (though it may or may not work with Python 2), and that it assumes that the evidence matrix has one row per data set, and one column per model. Pass the whole evidence matrix to the function rather than summing over the evidence for each data set.
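(The attached script itself is not reproduced on this page. As a stand-in, the sketch below illustrates the ordering scheme described in the comments above: greedily pick the remaining data set whose evidence vector is furthest, in squared-Euclidean distance, from the last chosen one. The function name and the choice of starting data set are illustrative, not the original code.)

```python
import numpy as np

def order_datasets(evidence):
    """Greedy ordering of data sets by distance in evidence space (illustrative sketch).

    evidence: array of shape (n_datasets, n_models), one row per data set and
    one column per model, as described in the comment above.
    """
    diff = evidence[:, None, :] - evidence[None, :, :]
    dist = (diff ** 2).sum(axis=-1)               # pairwise squared-Euclidean distances

    order = [int(evidence.sum(axis=1).argmax())]  # start from the data set with the largest summed evidence
    remaining = [i for i in range(evidence.shape[0]) if i != order[0]]
    while remaining:
        rem = np.array(remaining)
        nxt = int(rem[dist[order[-1], rem].argmax()])  # furthest remaining data set (argmax, not argmin)
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)
```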

commented 29 November 2016

@Arvid Brilliant work! This seems to be what the authors intended and I find this visualization a lot cleaner than simply running argsort.

commented 29 November 2016

@Axel

Hi, I think the sorting code does not have a bug.

The problem is, as you guys say, the definition of the distance.

The graphs in the paper seem to be based on 1) the maximum evidence (l-infinity norm), and 2) the difference between the maximum and the second-largest value.

Therefore, a heuristic but easy way is to just change the line

"index = create_index_set(np.sum(evidence,axis=0))" to

"npmax = np.max(evidence,axis=0)
index = create_index_set(npmax+np.max(npmax-evidence,axis=0))", for example.

The result is still oscillating but this is because of the smaller number of samples I think.

But the point is not the definition of the distance.

commented 29 November 2016

@Axel

Yeah, your code works perfectly even if the number of samples is small.

But I cannot understand the inner structure of your code....

Btw, thanks very much!

commented 29 November 2016

Wow! Huge difference, thanx a lot @Arvid :)

commented 29 November 2016

@Pawel Is it okay if I use the index function (CH Ek's py script) in the assignment?

commented 1 December 2016

Will there be a confirmation once you have received the report?

Assistant commented 1 December 2016

Unfortunately, I am not physically capable of doing that given the number of submissions.

commented 5 December 2016

Hi Jens!

Just before questions 1-3 you mention "Answer the following questions (in this problem you do not have to motivate your answers)". By that, do you mean that we just need to write down the results without even mentioning the blocked/unblocked paths? Or that we need to write only the basic assumptions that we made?

Thank you,

Christos

A user has deleted their comment
commented 6 December 2016

Hi,

When can we expect to have assignment 1 graded? This would be nice to know in order to have the grade requirements in check to set a target for assignment 2.

Secondly, you should really consider a proper hand-in system if you cannot manually give confirmations. Given the impact of a failed hand-in, no matter which end (or middle) is at fault, it is quite an uneasy experience.

Thank you,

David

Teacher commented 6 December 2016

I don't really know, but we are trying to do this as fast as possible. Although I realize that it is advantageous to have full information, sometimes we have to deal with uncertainty. Moreover, the deadline for the second assignment is still quite far off. 

Teacher commented 6 December 2016

As an answer to Christos' question: a correct answer is enough, that is, which set has which property. 

commented 6 December 2016

@Jens
Thank you 

commented 7 December 2016

Hi Jens,

"All problems except those concerned with VI can be attacked right now", does that include the exercise 2.3 as well, or only 2.6 and 2.7?

Thanks,

Alexandre

commented 7 December 2016

Hi Jens!

In Question 5  (Implement the Casino model (in Matlab or Python).): Are we supposed to write down anything regarding this question in our report? 

Thanks, 

Peter

commented 7 December 2016

Regarding Q10: How specific should the algorithm description be? Is it sufficient to show a mathematical expression that's easy to compute given the parameters and definitions given in the lecture, or are we supposed to e.g. provide pseudo code for computing the entire expression, or for things defined during the lectures?

Teacher commented 7 December 2016

Alexandre: Yes you correctly spotted that 2.3 is mature enough to be attacked at this point. So, wait with the last two, but the other problems are OK to work on. 

Peter: You should be ready to show your implementation if we ask you to, otherwise you should only use it. 

Nikos: Yes, mathematical expressions that are easy to compute are sufficient. So do what I did: start with what you need to compute and stop whenever you have reached an expression that is easy to compute (a bit of judgment is required here; I have to believe that you know how to compute it, but there is no reason to redo what you can find in the book or what I have done during lectures).

commented 7 December 2016

In the casino model, how is the starting state (\(t_1\) or \(t_1'\)) chosen?

Teacher commented 8 December 2016

Let's say with equal probability. 

commented 8 December 2016

Arvid, I initialized it with equal probability of either table as the start.

Jens, for q6: do you want the sums for each table for some players, player sums over all tables, or both in the report?

Teacher commented 8 December 2016

The sum for each table. 

commented 8 December 2016

In section 2.6 VI:

Can we assume that \mu and \lambda (not with k) are given parameters (i.e. parameters for the prior distribution)?

And also that \epsilon and l (not with k) are also given (i.e. parameters for the prior distribution)?

commented 8 December 2016

In question 6 when you say to provide generated data, do you mean that we should actually include samples from the model in the report? Also, do you expect actual written answers to questions 5 and 7?

Teacher commented 8 December 2016

Not "raw" samples, but a visualisation of what you get. I have earlier answered no, concerning text answers on coding questions. I will, however, update this and give the following instruction: state the language you used for the implementation and the number of code lines.

commented 8 December 2016

Hi Jens,

For the casino model, the introduction paragraph says we should note the outcome of the player's dice Z_k. Why is the player's dice indexed by a table k and not by a player i? Isn't it the categorical distribution of the dice? It seems more natural to me to index the player's dice by i. Otherwise, S^i_k = Z_k + X_k with no dependence on i.

commented 8 December 2016

Hi Jens,

In question 10, it states that we are given "a sequence of tables R1 ... Rn". Am I correct to assume that n < K in this case?

Teacher commented 8 December 2016

Yes Bertrand, the notation does not fully describe the situation. Writing S_k = Z_k + X_k would specify a one-player model (the k for the player shows that it is the value the player obtains at the kth table; all X_k are sampled from the same distribution), which can then be repeated for each player. By writing S^i_k = Z^i_k + X^i_k, all the players become explicit. OK?

Teacher commented 8 December 2016

Joonatan: In fact, this must be a typo and n should be K, that is a sequence of tables r_1,...,r_K.

commented 8 December 2016

A warning to those doing task 2.3, depending on what edition of the book you are using, the result given for \(\ln q^*_\tau(\tau)\) (equations 10.28-10.30 on page 471) might be incorrect; see the list of errata for the first printing of the book. I spent way too much time trying to figure out why I didn't get the same expression as the book.

commented 9 December 2016

Dear Jens

I have some trouble applying the theory for Q10, which requires checking the probability of a sequence of hidden states given observations and parameters. Bishop explains how to find p(z_n = k | X_{1:T}) through the forward-backward algorithm, but I am not sure how to go from this (the probability of a hidden state at one particular time) to a joint conditional probability of the whole sequence (in our case, all tables r). In your slides, on sampling the posterior (Q11), you explain how a sample can be acquired one timestep (table) at a time through smoothing, which fits better with these methods because it is stepwise, but how can I find the probability of the whole sequence given its steps? In short, I don't understand how to estimate the right-most expression here:
[Attached screenshot: Screen Shot 2016-12-09 at 17.25.14.jpg]

PS: I was also thinking of using the Viterbi algorithm, following the path of the hidden states I am interested in rather than doing argmax, but since this algorithm is mainly used for decoding (rather than evaluation), I feel like the idea is a bit off-track.

A user has deleted their comment
A user has deleted their comment
Teacher commented 9 December 2016

Amund Q10 is easier than the sampling and there is no reason to use Viterbi (in fact it cannot be used for this purpose). Your expression looks really fishy, i.e., look at the limits of the product. Anyway, you need to think about what you want to compute here. 

commented 9 December 2016

Just to confirm, for questions 1-3 on A2, it is sufficient with an answer like "3,4,5"?

commented 9 December 2016

Hi Jens,

At the beginning of section 2.1, you use variables \(x\) and \(y\) to define the plus/minus notation used in the figure. What are these variables?

commented 10 December 2016

What are we supposed to actually write in our reports for tasks 2.6 and 2.7? Or will the questions be added later?

Teacher commented 10 December 2016

You should merely derive the algorithms. I should have added that as a question.

commented 10 December 2016

In section 2.6, are the posterior parameters given (i.e. epsilon and lambda, without the subscript k)?

Teacher commented 10 December 2016

Oktay: yes.

Lucas: I had a quick look (I'm in a cab ...) and the notation is not crystal clear. However, y must be a child, x a specific parent, and c values for the other parents.

Teacher commented 10 December 2016

Erik: I don't really get your question; the parameters have indices as far as I can see. Perhaps I'm missing something and you can expand on your question.

commented 10 December 2016

I think I was a bit unclear. In the question it says that the prior distribution has the form \(\varepsilon_n \sim N(\varepsilon, \varsigma^{-1})\). I was wondering if \(\varepsilon\) and \(\varsigma\) are given constants (and the same for \(\mu_k \sim N(\mu, \lambda)\)).

commented 10 December 2016

In questions 10 and 11, we are only concerned with a single person's walk through the casino, right? So the distribution of the player-owned dice is the same throughout?

Teacher commented 12 December 2016

Yes it is a single player with a single dice.

Teacher commented 12 December 2016

Erik: Yes these hyper parameters are known constants. 

commented 12 December 2016

Hello Jens,

In question 8 we are asked to "describe the exact posterior". What exactly does "describe" mean in this context? The formula for a Normal-Gamma posterior can be found in books, so is it enough if we just write it down, or should we show the derivation?

Thanks,

Wojciech

Teacher commented 13 December 2016

Yes it is enough. 

A user has deleted their comment
commented 13 December 2016

There seems to be a typo in the top box of the Assignment 2 PDF; it says "These grades are valid for review December 18th, 2015. Delayed assignments can only receive the grade E." Shouldn't this instead read December 22nd, 2016?

Teacher commented 14 December 2016

It should definitely. Thanks for catching this.

commented 14 December 2016

In 2.6 it seems like Z has been mixed up with Y. Could you confirm this, Jens?

Thanks.

commented 14 December 2016

In 2.6 are we supposed to implement VI, or is the mathematical solution enough?

Teacher commented 14 December 2016

I just had a quick look and my impression is that 2.6 is (self-)consistent, but uses a different notation than the earlier problems (i.e., Y rather than Z).

A mathematical solution is sufficient.

commented 14 December 2016

I think the following sentences could raise a bit of confusion:

1. "...and \(\mu_k\) has prior distribution \(N(Y|\mu, \lambda^{-1})\), ..."  (using \(Y\) in another context rather than the outcome of the player)

2. "... and \(P_n\) is equipped with \(N(Z|\xi_n;\iota_n)\), ..." (using \(Z\) instead of \(Y\) to denote the outcome of the player)

commented 15 December 2016

Jens, the thing is that both Z and Y occur in 2.6. Or maybe there is something you wish to convey by this?

Teacher commented 15 December 2016

It was a bit late yesterday :). I have now uploaded a new version, I believe that I've managed to weed out the Y's. 

commented 15 December 2016

Hi, in the first part of the current assignment, we are supposed to answer questions regarding the probability pairs with respect to the given DAG. I am running into a - perhaps semantic - issue with the third question. What do we mean when we state that a pair is incomparable? I am hesitant to provide more clarification as it might answer the question, but @Jens got my question by email and asked me to ask it here, so I assume any answer will factor in the appropriate amount of detail!

Any and all input is received with gratitude!


Teacher commented 15 December 2016

The explanation in the text is "the two values can not be compared based on the information available in the DAG". It should perhaps be  "no relation between the two values can be inferred from the information provided". 

commented 16 December 2016

In 2.2 it says "We then observe the sum S^i_k of the two dice, while the outcome of the table’s dice X_k and the player’s dice Z_k are hidden variables."

Does this mean that the graphical model should show the sums as observed?

commented 16 December 2016

Hi Jens,

I have a quick question about the drawing of the Casino model. I would somehow like to represent that the table-dice value (which is an RV) depends on the Categorical distribution parameters (which are constants), but that they are different between primed/unprimed tables (which is an RV). Would the notation shown below be considered proper (circle is an RV, dot is a constant), or is there a different way of representing this type of relation?

Thanks,

Wojtek

[Attached image: example pgm]

Teacher commented 17 December 2016

Well this is the type of question I normally avoid answering. However, in this case it seems appropriate to point out that a constant which depends on a random variable, or rather that dependence, is not so useful. 

Teacher commented 17 December 2016

Andre: focus on the relation between the variables. That is the most important part, but of course adding the information concerning hidden and observed is good. 

commented 18 December 2016

Hi, are we not allowed to assume that "explaining away" is the interaction for all cases of intercausal reasoning in Question 1 - Question 3? 

Teacher commented 19 December 2016

The formulation is "You should also assume the parents in any V-structure are conditionally dependent given the common child". That answers your question, right?

commented 19 December 2016

Is it as with Assignment 1, that if we hand it in after the deadline (22nd of December) we can only get an E? I would really appreciate getting a few more days after Christmas to finish the assignment, and I assume the examiners would not start reviewing the assignments before Christmas?

commented 19 December 2016

In task 2.5 it says that we should find the parameter \(\Theta\) that maximizes \(P(s_1^i, \ldots, s_K^i|\Theta)\) . I'm a bit confused about the \(s_k^i\) terms, does this mean we should only analyze the one-player case or does the question imply we should find \(\Theta\) that maximizes the joint probability over all \(i\)?

Teacher commented 19 December 2016

Thanks Axel. Yes it is clearly confusing. It is the joint probability over all the players (i:s) that should be maximised. 

commented 19 December 2016

I agree with Viktor. Many of us have exams this week too, and have been studying for them for weeks parallel with the assignment. I'd love to have an extension until the end of this week on Sunday night so I can work on more problems.

commented 20 December 2016

Is it okay to do Task 2.3 using only broad non-informative priors since nothing is known about the parameters?

Teacher commented 20 December 2016

I'm somewhat puzzled by your question. Equations (10.22) and (10.24) specify the prior distributions and you know the hyperparameters. Please ask again if this didn't resolve your issue. 

commented 20 December 2016

So I wonder if I can consider only the case with all the hyperparameters set to zero, \(\mu_0 = a_0 = b_0 = \lambda_0 = 0\), and use Eq. (10.31)-(10.33) in Bishop to compute the distribution of the factors.

Teacher commented 20 December 2016

Your solution should work for any choice of hyper parameters. So, you cannot assume that they are all zero. 

commented 20 December 2016

Hi! 

My implementation works fine for part 2.3. I'm just wondering whether you want to see something special in question 9? I'm able to reproduce the images from Bishop. Is that enough for this question?

\Peter

commented 20 December 2016

I have a question regarding 2.6, should Xk and Zn be included in the joint distribution used for the variational inference approximation? I.e. should it be p(S,X,Z,\mu,\xi) or p(S,\mu,\xi)?

/Markus

Teacher commented 20 December 2016

Peter: one image cannot be classified as a couple of interesting cases. 

Markus: I would advise you to consider the tip: "What distribution do you get from the sum of two Gaussian random variables? What is the relation between the means?"

commented 20 December 2016

Yes, I have the distribution of S, but I get two different answers depending on whether X and Z are included in the joint or not. I think both have the correct mean but different precisions, and I cannot find from the book which joint is the correct one.

/Markus

Teacher commented 20 December 2016

This is somewhat hard to answer. You should probably take a decision. 

commented 20 December 2016

For question 13: Do I have to use the data generated in Task 2.2 or can I create extra (more interesting data) for this question?

Teacher commented 20 December 2016

More interesting is, well, more interesting. 

commented 20 December 2016

I really do not want to annoy, but I would like to get an answer to my question from yesterday as well. We do not even have our grades for the first assignment, so it's pretty tricky for us to balance the effort we put into this assignment against the other tasks that we have to finish... 
Since I'm really interested in putting more time into the ML assignment, an extended deadline would be awesome.

commented 20 December 2016

An extended deadline would be much appreciated, as I too want to learn as much as possible. Given circumstances much like others here - a lot of tasks in other courses and the extensiveness of the ML assignments - time has been hard to find for full completion of this assignment. Motivational rewards (i.e. an extended deadline giving the possibility to complete more tasks) for more in-depth learning should be in your best interest as well? I believe a lot of students are about to write thesis work in the area too, and are therefore eager to gain as much knowledge as possible.  

So in other words: Please help us in the process of making good thesis work in order to reflect good results upon your institution! :)

Teacher commented 21 December 2016

You have been granted a very substantial deadline extension. 

commented 21 December 2016

Thank you, Jens!

commented 21 December 2016

Thanks for the extension Jens! A question about 2.6: do we have to figure out the functional form of \(q^{*}\) or is an expression of the type (10.11) in Bishop with a normalizing term enough?

Teacher commented 21 December 2016

Please specify the actual distribution. 

commented 22 December 2016

I don't know when you were planning on grading the assignments originally, but is there any chance you could start grading early assignment submissions ahead of time? If possible I'd like to know how I did on the assignments before the end of the year.

commented 23 December 2016

Question 13 says that we should test the EM algorithm on data generated in task 2.2. Does this mean that even though we are in task 2.5 modelling the casino as having only one set of tables \(t_1, \ldots, t_K\), we are supposed to test the EM algorithm on data generated from having two sets of tables, \(t_1, \ldots, t_K, t_1', \ldots, t_K'\)?

Teacher commented 23 December 2016

Yes. How do you handle that? 

commented 23 December 2016

Hi, Jens.

I have some questions about the EM algorithm in general:

  1. As far as I understand, the EM algorithm guarantees an increase of the log likelihood at each step but says nothing about the complete log likelihood, so the complete log likelihood can decrease. Is that correct?
  2. For the complete log likelihood there's the notation \(l(\Theta; D)\). Is there any semantic meaning to this semicolon; I mean, why isn't a simple comma used instead?
  3. Is there any way of verifying that EM derivations are correct? I mean, they look reasonable to me, but there should exist a less subjective sanity check :)

Thanks!

Teacher commented 23 December 2016

1. Well, the expected complete log-likelihood may decrease during the execution. I guess that's what you wanted to ask about.

2. It is used in order to emphasise that it is a function of \Theta, but not a probability (so in general integrating over \Theta doesn't give 1). 

3. I guess that the easiest way to test it is to implement it and see whether the log-likelihood increases. Sure, you may implement something slightly incorrect, so it is not an absolute proof, but it will make you more confident; a minimal sketch of such a check is given below.
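A self-contained illustration of the check in point 3, using a toy one-dimensional two-component Gaussian mixture (nothing here is part of the casino model; everything is illustrative):

```python
import numpy as np

# Toy sanity check: run EM on a 1-D two-component Gaussian mixture and verify that
# the (incomplete-data) log-likelihood never decreases between iterations.
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def components(x, pi, mu, var):
    # Weighted component densities, shape (N, 2).
    return pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

prev = -np.inf
for _ in range(50):
    comp = components(data, pi, mu, var)
    r = comp / comp.sum(axis=1, keepdims=True)                 # E-step: responsibilities
    Nk = r.sum(axis=0)                                         # M-step: update pi, mu, var
    pi, mu = Nk / len(data), (r * data[:, None]).sum(axis=0) / Nk
    var = (r * (data[:, None] - mu) ** 2).sum(axis=0) / Nk
    ll = np.log(components(data, pi, mu, var).sum(axis=1)).sum()
    assert ll >= prev - 1e-9, "log-likelihood decreased: the derivation/implementation is suspect"
    prev = ll
```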

Merry Christmas!

commented 9 January 2017

Hello, 

When can we expect to have the assignment's grades released (at least the first) ? 

commented 31 January 2017

Any news on the grading Pawel and Jens?