Intro video. The hexagonal cells shown vibrating and exciting hair cells in the video form the reticular lamina (aka membrana reticularis) of the organ of Corti.
http://www.open.edu/openlearn/science-maths-technology/science/biology/hearing/content-section-3.3
http://vaczy.dk/htm/acoustics.htm
http://www.newmusicbox.org/articles/The-Musical-Ear/
Actually, which sounds sound nice together is apparently a far more complex question than rhythms (I'm not an expert here, just curious). The main explanation I can find (given that many things are still unknown, such as the roles that spatial, temporal, and neural encoding play) is described here: http://www.newmusicbox.org/articles/The-Musical-Ear/ The basilar membrane certainly plays a role in pitch perception. Most of the time, when we hear a frequency, we hear it from some object (like an instrument) that also generates harmonics of that frequency (ultimately due to ratios of lengths and linear dispersion relations). Frequencies in simple ratios (as you say) share many of their harmonics, and these shared harmonics excite the basilar membrane at the same spots. As long as two harmonics differ by no more than about 10 Hz, they are indistinguishable as far as the basilar membrane is concerned, because of its limited bandwidth. However, if you make two sounds that are not in a simple ratio, using two non-commensurate objects, many of their harmonics will be close to each other without coinciding, falling within the so-called critical band, which has been shown to cause the perception of dissonance. A plausible theory for why even pure sinusoidal waves at simple ratios tend to sound better (although, having just tested it, two non-harmonic sine waves don't sound nearly as bad as two non-harmonic piano notes) is that the brain develops neural networks that prefer these sounds.
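To make the critical-band argument concrete, here is a minimal sketch that scores the "roughness" of two complex tones by summing a Plomp-Levelt style dissonance curve over every pair of partials. The curve shape and its constants are the parametrization popularized by William Sethares, swapped in here as an assumption; the six harmonics with 1/k amplitudes are a hypothetical stand-in for a real instrument, and the function names are illustrative. Setting `n_partials=1` turns each tone into a pure sine wave.

```python
import math

# Constants of Sethares' fit to the Plomp-Levelt dissonance curve (assumed values).
D_STAR = 0.24            # scaled interval of maximum dissonance
S1, S2 = 0.0207, 18.96   # make the curve width grow with the lower frequency
B1, B2 = 3.5, 5.75       # decay rates of the two exponentials


def pair_dissonance(f1: float, f2: float, a1: float, a2: float) -> float:
    """Dissonance of two pure partials: zero when they coincide, small when far apart."""
    f_min = min(f1, f2)
    s = D_STAR / (S1 * f_min + S2)
    x = s * abs(f2 - f1)
    return a1 * a2 * (math.exp(-B1 * x) - math.exp(-B2 * x))


def harmonic_tone(f0: float, n_partials: int = 6) -> list[tuple[float, float]]:
    """Hypothetical instrument: partials k*f0 with amplitude 1/k."""
    return [(k * f0, 1.0 / k) for k in range(1, n_partials + 1)]


def interval_dissonance(f0: float, ratio: float, n_partials: int = 6) -> float:
    """Sum pairwise dissonance between the partials of tones at f0 and ratio*f0."""
    low = harmonic_tone(f0, n_partials)
    high = harmonic_tone(f0 * ratio, n_partials)
    return sum(pair_dissonance(fa, fb, aa, ab) for fa, aa in low for fb, ab in high)


if __name__ == "__main__":
    for name, r in [("unison", 1.0), ("semitone", 2 ** (1 / 12)),
                    ("tritone", 2 ** 0.5), ("fifth", 1.5), ("octave", 2.0)]:
        print(f"{name:8s} complex={interval_dissonance(220, r):.3f} "
              f"pure={interval_dissonance(220, r, n_partials=1):.3f}")
```

Under these assumptions the fifth scores well below the tritone, because its shared harmonics coincide and contribute nothing. The pure-tone score of any interval is never higher than the complex-tone score, since every extra pair of partials adds a non-negative term. The absolute numbers depend entirely on the assumed constants.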
Your theory of the brain detecting the rhythms is still interesting, though, and may be relevant to the "temporal coding" theories that have been proposed, but I have not read much about those.
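The "rhythm" intuition can be made concrete with a bit of arithmetic: if two frequencies are in a rational ratio, their sum is periodic and repeats at the greatest common divisor of the two frequencies (sometimes called the common or missing fundamental). The sketch below computes that frequency exactly with Python's `fractions` module; the function name is hypothetical, and the sketch says nothing about how, or whether, the brain actually uses this periodicity.

```python
from fractions import Fraction
from math import gcd
from typing import Union


def common_fundamental(f1: Union[int, Fraction], f2: Union[int, Fraction]) -> Fraction:
    """Largest frequency g such that both f1 and f2 are integer multiples of g.

    The sum of two sine waves at f1 and f2 repeats exactly every 1/g seconds.
    """
    f1, f2 = Fraction(f1), Fraction(f2)
    # For positive rationals: gcd(a/b, c/d) = gcd(a*d, c*b) / (b*d)
    num = gcd(f1.numerator * f2.denominator, f2.numerator * f1.denominator)
    return Fraction(num, f1.denominator * f2.denominator)


if __name__ == "__main__":
    pairs = [(220, 330),               # 3:2, perfect fifth
             (220, Fraction(880, 3)),  # 4:3, perfect fourth
             (220, 233)]               # 233:220, no simple ratio
    for a, b in pairs:
        g = common_fundamental(a, b)
        print(f"{a} Hz + {float(b):.2f} Hz -> repeats at {float(g):.2f} Hz "
              f"(period {1000 / float(g):.2f} ms)")
```

Simple ratios give a short repetition period (about 9 ms for the fifth on 220 Hz), while a ratio like 233:220 only repeats once per second.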
http://plasticity.szynalski.com/tone-generator.htm
The Neural Code of Pitch and Harmony
https://en.wikipedia.org/wiki/Basilar_membrane
https://en.wikipedia.org/wiki/Pitch_%28music%29#Theories_of_pitch_perception
https://en.wikipedia.org/wiki/Consonance_and_dissonance#Physiological_basis_of_dissonance
https://en.wikipedia.org/wiki/Music_psychology#Neural_correlates_of_musical_training
https://en.wikipedia.org/wiki/Psychoacoustics#Music
Music and measure theory
The reason it works so well to have twelve notes in the chromatic scale is that powers of the twelfth root of two tend to lie within about 1% of simple rational numbers. And it is useful for the notes to be powers of the same factor, because the brain perceives the separation between frequencies logarithmically, not linearly.
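The 1% claim is easy to check numerically. The sketch below compares 2^(k/12) with the usual just-intonation ratios (which ratio each semitone count is paired with is an assumption following common practice) and also reports the gap in cents, a logarithmic unit with 1200 cents per octave that matches the logarithmic perception of pitch.

```python
import math
from fractions import Fraction

# Just-intonation ratios and the number of equal-tempered semitones assumed to approximate them.
JUST_INTERVALS = {
    "minor third": (Fraction(6, 5), 3),
    "major third": (Fraction(5, 4), 4),
    "perfect fourth": (Fraction(4, 3), 5),
    "perfect fifth": (Fraction(3, 2), 7),
    "minor sixth": (Fraction(8, 5), 8),
    "major sixth": (Fraction(5, 3), 9),
    "octave": (Fraction(2, 1), 12),
}


def percent_error(ratio: Fraction, semitones: int) -> float:
    """Relative error (in %) of 2**(semitones/12) as an approximation of ratio."""
    tempered = 2 ** (semitones / 12)
    return abs(tempered - float(ratio)) / float(ratio) * 100


def cents(ratio: float) -> float:
    """Interval size on the logarithmic cent scale (1200 cents per octave)."""
    return 1200 * math.log2(ratio)


if __name__ == "__main__":
    for name, (ratio, k) in JUST_INTERVALS.items():
        print(f"{name:15s} {str(ratio):>4s}  error {percent_error(ratio, k):.2f}%  "
              f"off by {cents(2 ** (k / 12)) - cents(float(ratio)):+.1f} cents")
```

The fifth and fourth land within about 0.1%, while the thirds and sixths sit closer to the 1% bound.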
Travelling waves propagate through the cochlea, causing resonance at different locations along the basilar membrane depending on frequency (tonotopy); these excite the hair cells, which send a signal via the cochlear nerve.
The location excited by each frequency has been described by an empirical relation known as the Greenwood function.
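Greenwood's function is usually written as f = A(10^(a*x) - k), where x is the position along the basilar membrane. A minimal sketch follows, assuming the commonly quoted human fit (A = 165.4 Hz, a = 2.1 with x as a fraction of membrane length measured from the apex, k = 0.88); other species use different constants, and the function names are illustrative.

```python
import math

# Commonly quoted human parameters for Greenwood's function (assumed here).
A_HZ = 165.4  # scaling constant, Hz
ALPHA = 2.1   # slope, with position x expressed as a fraction of cochlear length
K = 0.88      # constant setting the low-frequency limit at the apex


def greenwood_frequency(x: float) -> float:
    """Characteristic frequency (Hz) at fractional distance x from the apex (0 = apex, 1 = base)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return A_HZ * (10 ** (ALPHA * x) - K)


def greenwood_position(freq_hz: float) -> float:
    """Inverse map: fractional distance from the apex that responds best to freq_hz."""
    return math.log10(freq_hz / A_HZ + K) / ALPHA


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"x = {x:.2f} -> {greenwood_frequency(x):8.1f} Hz")
    print(f"1 kHz sits at x = {greenwood_position(1000):.2f} from the apex")
```

With these constants the map spans roughly 20 Hz at the apex to about 20 kHz at the base, the usual range quoted for human hearing.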
The function of the organ of Corti is to transduce auditory signals and minimise the hair cells’ extraction of sound energy.[2] It is the auricle and middle ear that act as mechanical transformers and amplifiers, so that the sound waves end up with pressure amplitudes about 22 times greater than when they entered the ear. Energy is conserved, so what we are basically doing is concentrating the energy of a wave that is large in spatial extent when it enters the ear into a small space inside the cochlea, in an energy-efficient way. This concentration means that the amplitude of the pressure wave is significantly higher.
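The factor of about 22 can be sketched from the middle ear's geometry: roughly the same force collected over the eardrum is delivered, slightly multiplied by the ossicular lever, onto the much smaller stapes footplate, so pressure rises while (ideally) energy is conserved. The areas and lever ratio below are typical textbook values, used here as assumptions rather than figures from this text.

```python
import math

# Typical textbook values for the human middle ear (assumptions).
EARDRUM_AREA_MM2 = 55.0      # effective vibrating area of the tympanic membrane
FOOTPLATE_AREA_MM2 = 3.2     # area of the stapes footplate at the oval window
OSSICULAR_LEVER_RATIO = 1.3  # force multiplication by the malleus/incus lever


def middle_ear_pressure_gain(eardrum_mm2: float, footplate_mm2: float, lever: float) -> float:
    """Pressure gain: force from the eardrum, times the lever, spread over a smaller area."""
    return (eardrum_mm2 / footplate_mm2) * lever


gain = middle_ear_pressure_gain(EARDRUM_AREA_MM2, FOOTPLATE_AREA_MM2, OSSICULAR_LEVER_RATIO)
print(f"pressure gain ~ {gain:.1f}x ({20 * math.log10(gain):.1f} dB)")
```

With these assumed values the area ratio alone gives about 17, and the lever brings the total close to the factor of 22 quoted above.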
The stimulation can also happen via direct vibration of the cochlea through the skull. This is referred to as bone conduction (BC) hearing, as opposed to the pathway through the outer and middle ear described above, which is called air conduction (AC) hearing. Both AC and BC stimulate the basilar membrane in the same way (Békésy, G.v., Experiments in Hearing, 1960).
There is also amplification within the organ of Corti. The outer hair cells (OHCs) can amplify the signal through a process called electromotility, in which they increase the movement of the basilar and tectorial membranes and therefore increase the deflection of the stereocilia of the inner hair cells (IHCs) (see here). I guess it "focuses" the signal even more onto the inner hair cells to increase sensitivity. Do they alter or modulate the signal in some way?