Onboard

Merge lp://staging/~onboard/onboard/word-completion into lp://staging/~onboard/onboard/main


Proposed by marmuta on 2009-10-06

Status:	Merged
Merged at revision:	518
Proposed branch:	lp://staging/~onboard/onboard/word-completion
Merge into:	lp://staging/~onboard/onboard/main
Diff against target:	111635 lines (+26524/-67988) 124 files modified
.bzrignore (+3/-0)
Onboard/Config.py (+38/-5)
Onboard/KeyCommon.py (+31/-12)
Onboard/KeyGtk.py (+146/-10)
Onboard/Keyboard.py (+308/-27)
Onboard/KeyboardGTK.py (+3/-2)
Onboard/KeyboardSVG.py (+95/-42)
Onboard/Layout.py (+17/-0)
Onboard/OnboardGtk.py (+5/-0)
Onboard/WordPredictor.py (+242/-0)
data/onboard.gschema.xml (+23/-2)
layouts/Full Keyboard-Alpha.svg (+1/-4)
layouts/Full Keyboard.onboard (+184/-170)
po/ace.po (+0/-842)
po/af.po (+264/-791)
po/am.po (+0/-867)
po/ar.po (+230/-781)
po/ast.po (+264/-866)
po/az.po (+0/-883)
po/be.po (+257/-797)
po/bg.po (+264/-848)
po/bn.po (+244/-875)
po/br.po (+264/-790)
po/bs.po (+242/-879)
po/ca.po (+267/-868)
po/ca@valencia.po (+0/-986)
po/cs.po (+240/-878)
po/cy.po (+0/-840)
po/da.po (+266/-801)
po/de.po (+270/-868)
po/el.po (+271/-806)
po/en_AU.po (+272/-868)
po/en_CA.po (+251/-813)
po/en_GB.po (+266/-862)
po/eo.po (+264/-860)
po/es.po (+269/-864)
po/et.po (+228/-769)
po/eu.po (+245/-783)
po/fi.po (+266/-854)
po/fil.po (+226/-722)
po/fo.po (+0/-847)
po/fr.po (+271/-871)
po/ga.po (+226/-724)
po/gl.po (+274/-871)
po/he.po (+272/-810)
po/hi.po (+256/-812)
po/hr.po (+231/-736)
po/hu.po (+262/-860)
po/hy.po (+0/-841)
po/id.po (+238/-824)
po/is.po (+231/-730)
po/it.po (+269/-865)
po/ja.po (+250/-856)
po/kk.po (+260/-808)
po/km.po (+0/-844)
po/kn.po (+231/-738)
po/ko.po (+246/-822)
po/ku.po (+226/-722)
po/ky.po (+0/-840)
po/lt.po (+241/-782)
po/lv.po (+263/-860)
po/ml.po (+233/-749)
po/mr.po (+0/-840)
po/ms.po (+268/-862)
po/my.po (+0/-840)
po/nb.po (+255/-815)
po/ne.po (+0/-885)
po/nl.po (+279/-876)
po/nn.po (+0/-846)
po/oc.po (+263/-864)
po/onboard.pot (+345/-569)
po/pl.po (+253/-882)
po/pms.po (+226/-722)
po/pt.po (+263/-833)
po/pt_BR.po (+277/-872)
po/ro.po (+270/-806)
po/ru.po (+260/-858)
po/si.po (+0/-846)
po/sk.po (+267/-802)
po/sl.po (+253/-880)
po/sn.po (+226/-722)
po/sq.po (+263/-862)
po/sr.po (+268/-874)
po/sv.po (+263/-859)
po/ta.po (+0/-861)
po/te.po (+0/-845)
po/th.po (+239/-766)
po/tl.po (+226/-722)
po/tr.po (+267/-862)
po/ug.po (+262/-860)
po/uk.po (+261/-799)
po/vi.po (+264/-800)
po/zh_CN.po (+262/-800)
po/zh_HK.po (+260/-856)
po/zh_TW.po (+259/-855)
prediction/gpredict (+456/-0)
prediction/makemodels (+233/-0)
prediction/pypredict/Makefile (+31/-0)
prediction/pypredict/README (+119/-0)
prediction/pypredict/__init__.py (+2/-0)
prediction/pypredict/analyze (+337/-0)
prediction/pypredict/entropy (+62/-0)
prediction/pypredict/ksr (+70/-0)
prediction/pypredict/lm.cpp (+409/-0)
prediction/pypredict/lm.h (+263/-0)
prediction/pypredict/lm_dynamic.cpp (+48/-0)
prediction/pypredict/lm_dynamic.h (+756/-0)
prediction/pypredict/lm_dynamic_cached.h (+471/-0)
prediction/pypredict/lm_dynamic_impl.h (+935/-0)
prediction/pypredict/lm_dynamic_kn.h (+393/-0)
prediction/pypredict/lm_merged.cpp (+223/-0)
prediction/pypredict/lm_merged.h (+130/-0)
prediction/pypredict/lm_python.cpp (+1781/-0)
prediction/pypredict/ngram-test (+252/-0)
prediction/pypredict/optimize (+217/-0)
prediction/pypredict/pool_allocator.cpp (+377/-0)
prediction/pypredict/predict (+98/-0)
prediction/pypredict/pypredict.py (+357/-0)
prediction/pypredict/setup.py (+21/-0)
prediction/pypredict/split_corpus (+72/-0)
prediction/pypredict/test_pypredict.py (+295/-0)
prediction/pypredict/train (+66/-0)
prediction/test-client (+44/-0)
setup.py (+1/-0)
To merge this branch:	bzr merge lp://staging/~onboard/onboard/word-completion
Related blueprints:	Support word completion and prediction in Onboard (Medium)

Reviewer	Review Type	Date Requested	Status
Onboard Devel Team	preview	2009-10-06	Pending
Review via email: mp+12908@code.staging.launchpad.net

Revision history for this message

marmuta (marmuta) wrote on 2009-10-06:

Hi Chris, Fernando et al.!

I'm working on word completion/prediction in onboard and there is a partially working prototype in this branch now, ready for a first benevolent look. I'd be glad if you could take a moment out of your busy schedules and try it. Mind you, this is work in progress and far from being mergeable, but it should at least give a first impression and help weed out design failures and omissions. Code review is very welcome too; I learned a lot last time.

If you want to try it, run ./makedicts from the project home and it ought to download training texts and create dictionaries. Then run onboard from the project home as well and select the Word Completion layout. Type away, click the words in the top row: Left click with auto punctuation, right click without. It does completion only, no prediction yet. No learn mode yet either, the Learn-, Punct-, Dict- buttons aren't wired.

Languages other than English are supported, but for now only one at a time, and you can switch only by changing the dictionary file in WordPredictor.py. If you need more dictionaries, install additional language packages for aspell and rerun makedicts, although training texts are so far only downloaded for en, es and de.

Cheers, let me know what you think.

marmuta (marmuta) wrote on 2009-10-06:

> Hi Chris, Fernando et al.!
>
Uh-oh, that would be Francesco, sorry for that <:I

lp://staging/~onboard/onboard/word-completion updated on 2009-10-11

191. By marmuta on 2009-10-07: Some cleanup, added early support for multiple dictionaries.
192. By marmuta on 2009-10-08: Added auto-learning (always on) and dictionary saving (still too often, needs timed auto save)
193. By marmuta on 2009-10-11: Learn and Dict buttons are working now and hopefully properly connected to gconf.
Gconf schema has changed to add a new folder 'word_completion'.

Chris Jones (tortoise) wrote on 2009-10-13:

Sorry this has taken a while for me to look at.

I haven't looked at the code in detail yet but it works well.

A couple of things that occur to me:
1. The dictionaries are quite big. Would it be possible to reuse the Firefox or OpenOffice dictionaries?
2. I know this is just a prototype but wouldn't it be better to keep as much of the code as possible in a separate library? Other applications/input methods might find it useful.

What do you think about a soft dependency on AT-SPI? Yuk, I know, but people are working on it, and it would allow the word completion engine to detect widget focus changes and caret movement.

Cheers, Chris

> Hi Chris, Fernando et al.!
>
> I'm working on word completion/prediction in onboard and there is a partially
> working prototype in this branch now, ready for a first benevolent look. I'd
> be glad if you could take a moment out of your busy schedules and try it.
> Mind you, this is work in progress and far from being mergeable, but it should
> at least give a first impression and help weed out design failures and omissions.
> Code review is very welcome too; I learned a lot last time.
>
> If you want to try it, run ./makedicts from the project home and it ought to
> download training texts and create dictionaries. Then run onboard from the
> project home as well and select the Word Completion layout. Type away, click
> the words in the top row: Left click with auto punctuation, right click
> without. It does completion only, no prediction yet. No learn mode yet either,
> the Learn-, Punct-, Dict- buttons aren't wired.
>
> Languages other than English are supported, but for now only one at a time,
> and you can switch only by changing the dictionary file in WordPredictor.py.
> If you need more dictionaries, install additional language packages for
> aspell and rerun makedicts, although training texts are so far only
> downloaded for en, es and de.
>
> Cheers, let me know what you think.

Francesco Fumanti (frafu) wrote on 2009-10-13:

> A couple of things that occur to me:
> 1. The dictionaries are quite big. Would it be possible to reuse the
> Firefox or OpenOffice dictionaries?

The dictionaries will probably come in a separate Debian package, so size will probably not be a problem for the LiveCD. If there are other reasons for having smaller dictionaries or using those from Firefox, then that is another topic.

Francesco Fumanti (frafu) wrote on 2009-10-13:

> 2. I know this is just a prototype but wouldn't it be better to keep as much
> of the code as possible in a separate library? Other applications/input
> methods might find it useful.

In GNOME they are planning to start with a port of GOK to Python, so they might be interested in a shared prediction library. (As they are primarily concentrating on switch users, they excluded onboard as a suitable starting point.)

However, I would prefer that at the moment we concentrate on creating a well-working word completion/prediction for onboard. Won't it be easier to develop it directly in onboard instead of starting out with an external library?

lp://staging/~onboard/onboard/word-completion updated on 2009-10-14

194. By marmuta on 2009-10-14: updated makedicts to support options (see makedicts -h) and a list of languages on the command line.
195. By marmuta on 2009-10-14: Word completion keeps better track of recent input now and can restart at any point when backspacing.
Added auto-save for saving modified dictionaries, default is every 10min and on exit. New gconf key word_completion/auto_save_interval.

marmuta (marmuta) wrote on 2009-10-14:

> Sorry this has taken a while for me to look at.
No problem,

> I haven't looked at the code in detail yet but it works well.
>
> A couple of things that occur to me:
> 1. The dictionaries are quite big. Would it be possible to reuse the
> Firefox or OpenOffice dictionaries?
Word completion needs word frequencies to be useful. I looked at myspell, ispell and aspell and didn't find any frequency-based weighting. Aspell has the advantage that it can dump its dictionaries to stdout, which is why I'm using it as a basis for the frequency counting. So, currently I don't see an alternative to separate dictionaries for onboard. I do believe that the dictionary sizes can be reduced, though. Running makedicts with -f gives dictionaries of <200kB each and they could even be compressed on disk. They still have 15,000-20,000 words and, considering that GOK's dictionary is around 3000 words, that seems like a good enough starting point.
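
To illustrate why frequencies matter for completion, here is a minimal sketch of frequency-ranked prefix completion. It is not the code from WordPredictor.py; the function names and the tiny training text are made up for this example, and a real dictionary would be built from the aspell word lists and training texts mentioned above.

```python
import re
from collections import Counter


def build_frequency_dict(text):
    """Count how often each lowercase word occurs in the training text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def complete(prefix, freqs, limit=5):
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    candidates = [(word, count) for word, count in freqs.items()
                  if word.startswith(prefix)]
    # Sort by descending count, then alphabetically to keep ties stable.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:limit]]


# Hypothetical training text, standing in for a downloaded corpus.
training_text = "the cat sat on the mat. the theory is that they think."
freqs = build_frequency_dict(training_text)
print(complete("th", freqs))
```

Without the counts, "the" would have no reason to be offered before "that" or "theory"; a plain spell-checker word list can only give an alphabetical or arbitrary order.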

> 2. I know this is just a prototype but wouldn't it be better to keep as much
> of the code as possible in a separate library? Other applications/input
> methods might find it useful.
I believe it is too early for this. I'm constantly changing interfaces, so it would just slow things down at the moment. I'm trying to keep the core of the completion and punctuation reasonably separate anyway, so this should be doable later. The only dependency that is currently not built in is KeyCommon.

> What do you think about a soft dependency on AT-SPI? Yuk, I know, but people
> are working on it, and it would allow the word completion engine to detect
> widget focus changes and caret movement.
I think we should try that. The word completion is currently trying to keep track of what's happening, but there just isn't enough information to get it right. I enabled AT a while ago just to see how it feels and I hardly see a difference at all. So, I guess I'll look into it at some point.

> Cheers, Chris

Cheers

lp://staging/~onboard/onboard/word-completion updated on 2011-11-11

196. By marmuta on 2009-10-16: Only load dictionaries when the layout has use for them.
Removed dependency on KeyCommon from WordPredictor.py.
Fixed spurious "U"s in auto punctuation
WC keeps track of additional editing keys: del, cursor left/right.
Toggling punctuation doesn't reset input line anymore.
197. By marmuta on 2009-10-20: Added additional weighting of words based on their usage order. A new gconf key 'frequency_time_ratio' controls the ratio between the old frequency based weighting and time of last use.
198. By marmuta on 2009-10-22: - fixed save on exit
- learn button discards input line when turning off, keeps it when turning on
- added Francesco's learning texts for french and italian
- makedicts defaults to "expand affixes", "don't include infrequent words" -> dictionary sizes around 200kB
- exclude Project Gutenberg license headers and footers from training data
- esc key clears input buffer
- don't learn words with more than 3 repeated characters
199. By marmuta on 2009-10-26: added ability to toggle word completion including its ui via new gconf key enable_word_completion
200. By marmuta on 2009-10-28: experimental detection of mouse clicks outside of onboard; reset word completion on every detected click.
201. By marmuta on 2009-10-29: - reworked the punctuation logic again to get key feedback; hopefully fixing the issues with ; and : in the process
- added Francesco's fixes to gconf schemas and delete button
- fixed up experimental outside click detection
202. By marmuta on 2009-10-29: work around memory leak in pangocairo.CairoContext.create_layout() (gnome #599730)
203. By marmuta on 2009-11-08: - added input history with color highlighting (negotiable ;) blue: ignored, yellow/red: new word to learn
- added stealth button + new gconf key stealth_mode
- fixed wrong default values for auto_save_interval and frequency_time_ratio when gconf keys are missing
- replaced word_completion with word_prediction in gconf schemas and most everywhere else
- increased default frequency_time_ratio from 50 to 75
- another small update of the keyboard logic, potential for slightly fewer bugs
204. By marmuta on 2009-11-08: added word prediction test application, n-grams of arbitrary order, various smoothing algorithms incl. Kneser-Ney interpolation
205. By marmuta on 2009-11-08: Merge from main branch
206. By marmuta on 2009-11-08: fixed crasher on start with unavailable dictionaries
207. By marmuta on 2009-11-11: - fixed updating problem with the color highlighting of the history line
- allow single letter words into the dictionaries for a, I,...
208. By marmuta on 2009-11-12: - learning with incremental parameter calculation for kneser-ney smoothing
- around 10 times speed-up of prediction queries
209. By marmuta on 2009-11-13: switched ngram-test from strings to indices, another 2x speed-up
210. By marmuta on 2009-11-21: switched ngram-test data structures to python trie as preparation for a C implementation
211. By marmuta on 2009-11-29: new (temporary) sub-project lm, python extension for a dynamically updatable n-gram language model
212. By marmuta on 2009-12-06: prediction:
- some clean up of the c++ code and improved code comments
- use python memory manager as often as possible
- added pool allocator for maybe 15% less memory usage
- added save/load to depth-first file format: too little improvement in loading speed, left the old arpa-like one in place
- added python tools split_corpus, entropy/perplexity, ksr (keystroke-savings-ratio, see README)
- added a minimal d-bus prediction server + test-client
213. By user <user@dingsdale> on 2009-12-15: - initial word prediction support for onboard, prediction and learning through D-Bus calls
- prediction service loads, caches and saves language models
- added linear interpolation of language models
- reworked tokenization and moved it into the python extension in pypredict.py
- simplified and fixed all python tools: split_corpus, train, predict, entropy, ksr
- plenty of bug fixes, still more to do
214. By user <user@dingsdale> on 2009-12-22: - added log-linear interpolation and onboard's simple overlay-algo for merging lms
- comments and cleanups
215. By user <user@dingsdale> on 2010-01-07: - added two new smoothing options: Witten-Bell and Absolute Discounting
- reworked Kneser-Ney smoothing for more robust normalization
- changed default smoothing to Absolute Discounting
- added new tool analyze for plotting pretty entropy and ksr charts, needs matplotlib
- split_corpus supports additional parameters to influence the size of split texts
- added python unit tests for tokenization and language model normalization
- fixed word insertion bug in onboard; tokenization is fully done via D-Bus now
- fixed a crasher in PoolAllocator
- random fixes and comment updates throughout the code
216. By user <user@dingsdale> on 2010-01-07: fixed traceback at service startup when the models directory wasn't found
217. By user <user@dingsdale> on 2010-01-14: - try multiple encodings in pypredict.read_corpus before giving up, default is [utf-8, latin-1]
- use timeout_add_seconds instead of timeout_add for the autosave timer in gpredict to allow for grouping wakeups
218. By user <user@dingsdale> on 2010-01-14: - reworked pypredict.split_sentences() and prettified the results of sentence splitting
- fixed erroneous joining of sentences when using texts generated by the split_corpus tool
219. By user <user@dingsdale> on 2010-01-14: don't commit_input_line() when scrolling with the mouse wheel
220. By user <user@dingsdale> on 2010-01-25: wrapped DynamicModel and NGramTrie in templates to allow for alternative memory layouts, i.e. recency caching and no kneser-ney parameters if they aren't needed
221. By marmuta on 2010-02-02: experimental workaround for traceback at reset_clip on lucid
222. By marmuta on 2010-02-26: Added a new model type CachedDynamicModel for recency based ngram-caching with exponential fall-off over time.
The prediction now remembers recently used ngrams. The current parameters were found by trial and error; what works best needs a more thorough investigation later.
223. By marmuta on 2010-03-03: Merge with onboard main
224. By marmuta on 2010-03-04: Added new D-Bus method lookup_text to get onboard's input line display working again.
225. By marmuta on 2010-03-04: Removed all traces of dictionary auto saving from onboard. The D-Bus service has been saving language models for a while.
226. By Francesco Fumanti on 2010-03-05: Use utf-8 coding to avoid problems with build_i18n
227. By marmuta on 2010-03-10: Added makemodels script to create language models for all available aspell dictionaries. Filter models based on the aspell vocabulary.
228. By marmuta on 2010-03-13: Experimented with loading models in a separate thread with mixed results, disabled again. Python's global interpreter lock complicates things.
229. By marmuta on 2010-03-14: - Added color feedback to the mouse click buttons
- Fixed old bug in Keyboard.iter_keys() that led to always returning to the main pane when pushing click buttons
- Set color of bright checked buttons to the same as buttons that are "on".
230. By marmuta on 2010-03-29: - Extended the analyze tool to investigate caching parameters
- Added an optimize tool that tries to find better caching parameters with simulated annealing
- Set new, marginally better caching parameters for recency caching
- Allow floats in addition to ints for recency_halflife property of CachedDynamicModel
231. By marmuta on 2010-10-03: Merge with onboard main
232. By marmuta on 2010-10-03: Fixed oversized key labels for small window heights. Fallout from last merge.
233. By marmuta on 2011-06-04: Merge from main. Needs additional work.
234. By marmuta on 2011-07-10: Merged with main, additional changes
- Converted to GTK3/gnome introspection
- Moved word prediction gconf keys to gsettings
- Kept mouse click polling for button updates and word learning
- Always convert key labels to unicode to avoid breaking calls to the word predictor
- Regression: input line display disabled because of broken get_char_extents, https://bugzilla.gnome.org/show_bug.cgi?id=654343
235. By marmuta on 2011-07-11: - Disabled more of the input line; Pango introspection is in bad shape
- Fixed merge mistake in classic layout
236. By marmuta on 2011-11-11: Merge from trunk. Word prediction technically still works, but the ui needs lots of polishing.
237. By marmuta on 2011-11-11: Unbreak auto-punctuation.
238. By marmuta on 2011-11-11: Partially bring back the input line. Introspection of pango attributes is still utterly broken, use parse_markup instead.
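
Revisions 222 and 230 above mention recency-based caching with exponential fall-off over time and a recency_halflife property. The real CachedDynamicModel is not shown on this page, so the following is only a toy sketch of the general idea for single words rather than n-grams. The class name RecencyCache, its methods and the use of a simple step counter as "time" are assumptions made for this illustration.

```python
class RecencyCache:
    """Toy recency weighting with exponential fall-off.

    Each use of a word adds 1.0 to its score, and every score halves
    after `halflife` learning steps. Hypothetical API, for illustration only.
    """

    def __init__(self, halflife=10.0):
        self.halflife = float(halflife)  # floats as well as ints are accepted
        self.entries = {}                # word -> (score, time of last update)
        self.time = 0                    # number of words learned so far

    def _decayed(self, score, since):
        # Exponential decay: factor 0.5 for every `halflife` elapsed steps.
        return score * 0.5 ** ((self.time - since) / self.halflife)

    def learn(self, word):
        """Register one use of `word` and advance the clock by one step."""
        self.time += 1
        score, since = self.entries.get(word, (0.0, self.time))
        self.entries[word] = (self._decayed(score, since) + 1.0, self.time)

    def weight(self, word):
        """Return the current, decayed recency weight of `word`."""
        if word not in self.entries:
            return 0.0
        score, since = self.entries[word]
        return self._decayed(score, since)


cache = RecencyCache(halflife=2)
for word in ["alpha", "beta", "gamma"]:
    cache.learn(word)

# "gamma" was used last, so it carries the largest weight.
for word in ["alpha", "beta", "gamma", "delta"]:
    print(word, round(cache.weight(word), 3))
```

Storing the score together with the time of its last update avoids touching every entry on each step; the decay is applied lazily when a word is looked up or learned again.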