thomas' Blog

Tuesday, April 04, 2006

SIGIR review comment

These people really know how to tear a paper apart~~

Whew, my paper got torn to shreds. Damn it, oh my God!

------------- Review from Reviewer 1 -------------
Relevance to SIGIR : 5
Originality of work : 4
Impact of results : 2
Quality of arguments : 2
Quality of presentation : 2
Confidence in review : 5
Overall recommendation : 2

-- Comments to the author(s):
The paper describes a novel way of dealing with subword-based indexing of spoken documents using reinforcement learning.



There is a lack of direction and no clear research hypothesis that is tested. The experiments, which mainly simulate user information need and behaviour, are not well justified.
-- Summary:
positive: novel approach

negative: experiments not well justified; simulated user need and behaviour may be unrealistic
---------- End of Review from Reviewer 1 ----------
------------- Review from Reviewer 2 -------------
Relevance to SIGIR : 4
Originality of work : 3
Impact of results : 3
Quality of arguments : 3
Quality of presentation : 3
Confidence in review : 2
Overall recommendation : 3

-- Comments to the author(s):
The paper deals with the interactive retrieval of spoken documents through the use of hierarchies of keys, which are ranked through reinforcement learning. Retrieval of non-text documents is a difficult process and any successful empirical work in this regard is an important contribution to the field.



I am not an expert in the field, so I cannot judge the content of the paper in much depth. My comments are therefore mainly restricted to structure, argumentation, methodology and language issues.

The paper is well structured and the figures are useful to illustrate the processes.

The experiment is large scale enough to be significant, and the methodology is appropriate.

The argumentation is logical.

There are several grammatical errors in the paper, and if the paper is accepted, I would strongly suggest that an English first-language speaker proofread it.


-- Summary:
I am not an expert in the field and my comments are mainly about issues other than content.




---------- End of Review from Reviewer 2 ----------
------------- Review from Reviewer 3 -------------
Relevance to SIGIR : 3
Originality of work : 2
Impact of results : 2
Quality of arguments : 1
Quality of presentation : 2
Confidence in review : 5
Overall recommendation : 1

-- Comments to the author(s):
Main contributions:



This is a poor, unfocused paper about the use of term hierarchies as a mediation device between users and systems in spoken document retrieval. The approach the authors took to select terms may be novel, but it is inadequately described. The use of simulated users is interesting, but the authors' explanation of the research that resulted in this paper is poor.



Technical soundness



The paper suffers from an identity crisis. It is not sure whether it is meant to be about interactive IR, term weighting, document ranking or reinforcement learning. It is a hodge-podge of all these issues, and lacks focus, clarity, and direction.



If it is about Interactive IR, then more time needs to be spent relating the research described to earlier work (see, for example, the paper by Turpin & Hersh at SIGIR 2001, which talks about searcher capabilities in making up for shortcomings in IR performance). The authors make a number of claims about the query formulation process that are questionable: "This is why he [the user] formulates his query as short as possible to keep it flexible"...I doubt whether such rationale is adopted by searchers; chances are they are simply lazy or lack sufficient topic/collection knowledge to formulate queries.



More description is necessary about how the authors feel their approach would benefit interactive IR other than "topic hierarchy to help the user focus on his information needs efficiently" -- I'm sure the authors mean focus on the most important terms to describe their information needs -- the paper is littered with examples such as this.



The authors need to reference more related literature in this field and in areas such as query expansion and user-system communication. The problem they try to describe in Figure 1 has been the topic of some debate for over 50 years in the library sciences community, yet no reference is made to any of that literature.



The authors should reference some of the related work on using searcher simulations. This is a subject that is attracting a great deal of interest from IR researchers (see, for example, White et al, ACM TOIS, July 2005).



The way in which they simulate user behavior is also questionable. The authors claim that they simulate two things: information needs and retrieval behavior. They arrive at the former "by observing a real user's information needs", but provide no description of this process -- how would you observe someone's needs anyway? Did you ask experimental participants what they were looking for, what their criteria for relevance were, etc.? The retrieval behavior was simulated through a series of related term selections, but curiously never any document selections. It is true that there needs to be a metric to evaluate performance, and perhaps the authors regarded the selection of documents as an unnecessary behaviour to simulate. If this was the case then you should have made this clear in the paper. However, even if this was the case, the authors ignore the fact that the retrieved documents do have an impact on the terms that are selected in subsequent steps.
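
For context, the simulated-searcher loop the reviewer is objecting to amounts to something like the rough sketch below: an information need reduced to a set of target terms, and "retrieval behaviour" reduced to picking one suggested term per iteration, with no document selections. The function names, the greedy pick rule, and the toy data are hypothetical shorthand on my part, not the exact procedure in the paper.

# Hypothetical sketch (not the paper's exact procedure): the simulated user's
# information need is a set of target terms, and its "retrieval behaviour" is
# a sequence of term selections, one per iteration, with no document selections.

def simulate_session(need_terms, suggest, score, max_steps=10):
    """Pick one suggested term per iteration until the simulated need is covered.

    need_terms -- set of terms standing in for the information need
    suggest    -- function: current query (tuple of terms) -> candidate terms
    score      -- ranking function for candidates (e.g. learned weights)
    """
    query = []
    for _ in range(max_steps):
        candidates = [t for t in suggest(tuple(query)) if t in need_terms]
        if not candidates:
            break
        query.append(max(candidates, key=score))   # greedy: best-scored relevant term
        if set(query) >= need_terms:               # simulated need satisfied
            break
    return query

# Toy usage with made-up data:
toy_hierarchy = {
    (): ["speech", "music", "news"],
    ("speech",): ["recognition", "retrieval"],
}
print(simulate_session(
    need_terms={"speech", "retrieval"},
    suggest=lambda q: toy_hierarchy.get(q, []),
    score=len,            # placeholder score
))                        # -> ['speech', 'retrieval']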



Where is the proof that this simulated methodology is appropriate for testing your research questions? In fact, what are your research questions? Also, you try to claim that simulations can be used to evaluate the performance of "interactive retrieval in terms of task success" and "success(ful) retrieval", and, perhaps even more outrageously, "user satisfaction". If you have developed measures that serve as surrogates for these innately human activities then they should be described in detail.



I was surprised, as I read Section 5.1, to see "Another 50 real users including the 50 desired document set D". Where did these users come from? There is no discussion of their role in the process.



There is no discussion of why fewer terms should be regarded as a more successful outcome, and the simulation appears to add only one term per iteration.



The very low standard of the discussion and conclusions leaves me wondering whether this was rushed for submission at the last minute. Much more needs to be said about the approach adopted and the findings.



SDR is mentioned very rarely in the latter parts of the paper but appears to play an important role in your argumentation early on.



Relevance, importance and originality



The authors fail to convince me that this is an important problem, nor do they do a good job in convincing me that this work is important or novel.



Clarity and presentation



Clarity is poor (text is very dense) and the quality of presentation is also poor.



Other comments



- There is no need to expand acronyms every time you use them. OOV, SDR and NE are defined repeatedly throughout the paper.

- You should reference every claim that is not your own.

- Figure 4 is missing a complete legend.
-- Summary:
I would argue strongly for rejection of the paper. It contains a number of flaws in the methodology, and does not sufficiently describe the approach adopted. The findings are weak and the discussion/conclusions are non-existent.
---------- End of Review from Reviewer 3 ----------
------------- Review from Reviewer 4 -------------
Relevance to SIGIR : 3
Originality of work : 3
Impact of results : 2
Quality of arguments : 2
Quality of presentation : 2
Confidence in review : 3
Overall recommendation : 2

-- Comments to the author(s):
This paper presents a new method for weighting dimensions, which are often disparate and of inconsistent scale, so that query-by-example is more accurate.
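
In generic terms, "weighting dimensions" here just means putting disparate feature scales on a comparable footing before a nearest-neighbour, query-by-example search. The sketch below is only a minimal illustration of that general idea under my own assumptions (the inverse-standard-deviation weights and the weighted_qbe name are made up; they are not the method actually proposed in the paper).

# Generic illustration only (not the paper's method): weight each dimension,
# e.g. by its inverse standard deviation, so no single scale dominates the
# distance used to rank corpus items against an example query.

import numpy as np

def weighted_qbe(corpus, query, weights=None):
    """Rank corpus items by weighted Euclidean distance to an example query."""
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    if weights is None:
        weights = 1.0 / (corpus.std(axis=0) + 1e-9)   # assumed weighting scheme
    dists = np.sqrt((((corpus - query) * weights) ** 2).sum(axis=1))
    return np.argsort(dists)                          # nearest first

# Toy usage: dimension 0 spans thousands, dimension 1 single digits.
corpus = [[1000.0, 1.0], [2000.0, 9.0], [1100.0, 8.0]]
print(weighted_qbe(corpus, query=[1050.0, 8.5]))      # -> [2 0 1]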



The approach proposed is rather ad-hoc, and this reviewer found the argument rather unconvincing. The results appear good...



The performance of this method is reported based on only two queries? Where did this data come from? This test seems far too small to support a concrete conclusion.



Is query-by-example still a viable approach? Certainly none of the search engines are using this approach. Are any users?




-- Summary:
Ad-hoc approach and weak argument. Even weaker test of performance.
---------- End of Review from Reviewer 4 ----------


////////////////////////////////////////////////////
Powered by ConfMaster.net
///////////////////////////////////////////////////
